Skip to content

Conversation

@hemidactylus
Copy link
Collaborator

This PR factors the translations to an from the document stored in the Astra DB collection.
To do so, a new VSDocumentEncoder interface is proposed, providing operations such as Document-to-Astra and vice-versa, knowledge of the proper projection clauses when querying, and also the knowledge of whether embeddings are server-side ($vectorize) or not.

The main advantage of this refactoring lie in the next two planned phases after this PR:

  1. a rethinking of the many similarity_search* methods, aimed at sharing as much as possible between the vectorize- and the nonvectorize- paths;
  2. most important, an "auto-detect" mode for creating the Vector Store that creates a suitable VSDocumentEncoder based on the nature of the (already-existing) table.

Copy link
Collaborator

@jordanrfrazier jordanrfrazier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks awesome

@hemidactylus
Copy link
Collaborator Author

I want to draw your attention to the fact that this refactor is "imperfect", in that it leaks usage of _id partially in- and partially outside the classes.
Indeed, the Encoder classes ingest the doc id and are responsible for packaging it into the _id field of Astra document. They also return the _id within the projections they define.
But _id is also used directly in the vectorstore class.

Indeed an invariant for implementers should be to map the doc id to _id (descended from and Astra DB requirement).

Now, if you're happy with this, I'm fine with going on and perhaps refine this later (a slightly annoying task because it would mean kicking _id out of the class entirely ==> dict merging in the vectorstore for projections in a few places, subtleties with possible projections being {"*": 1} and so on).

@cbornet
Copy link
Collaborator

cbornet commented Aug 23, 2024

It's very nice to replace some of the if-else with OOP.
I wonder if we couldn't have _AbstractAstraDBVectorStore, _AstraDBWithVectorizeVectorStore and _AstraDBEmbeddingVectorStore to simplify even more. (with AstraDBVectorStore using one or the other as a delegate to keep backward compat).

@hemidactylus hemidactylus merged commit 7fa7697 into main Aug 23, 2024
@hemidactylus hemidactylus deleted the SL-anycollection branch August 23, 2024 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants