generated from langchain-ai/integration-repo-template
Vector store, refactor encoding to an Astra document #52
Merged
12 commits
209bf64
wip on factoring encoders as class, untested
dbfcc16
vectorstore: encoding onto Astra fully refactored into encoder class
71bb12e
shortened call stack and removed most assertions
0187f56
Merge branch 'main' into SL-anycollection
2e50e83
style
1a28ffd
docstrings in encoder and error messages instead of asserts
fdfd27a
extra ruff style improvements
4332673
rename id->document_id param to encode
f06995a
Merge branch 'main' into SL-anycollection
943800d
appease TRY003/EM101
63814d6
Style, addressed PR comments
25b4f1d
made another embedding call safe
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,197 @@ | ||
| from __future__ import annotations | ||
|
|
||
| from abc import ABC, abstractmethod | ||
| from typing import Any | ||
|
|
||
| from langchain_core.documents import Document | ||
| from typing_extensions import override | ||
|
|
||
|
|
||
| def _default_filter_encoder(filter_dict: dict[str, Any]) -> dict[str, Any]: | ||
| metadata_filter = {} | ||
| for k, v in filter_dict.items(): | ||
| # Keys in this dict starting with $ are assumed to be operators and as such | ||
| # should not be nested within the `metadata.` prefix. For instance, | ||
| # >>> _default_filter_encoder({'a':1, '$or': [{'b':2}, {'c': 3}]}) | ||
| # {'metadata.a': 1, '$or': [{'metadata.b': 2}, {'metadata.c': 3}]} | ||
| if k and k[0] == "$": | ||
| if isinstance(v, list): | ||
| metadata_filter[k] = [_default_filter_encoder(f) for f in v] | ||
| else: | ||
| # assume the value is a dict that can be fed back to this function | ||
| metadata_filter[k] = _default_filter_encoder(v) # type: ignore[assignment] | ||
| else: | ||
| metadata_filter[f"metadata.{k}"] = v | ||
|
|
||
| return metadata_filter | ||
|
|
||
|
|
||
| class VSDocumentEncoder(ABC): | ||
| """A document encoder for the Astra DB vector store. | ||
|
|
||
| The document encoder contains the information for consistent interaction | ||
| with documents as stored on the Astra DB collection. | ||
|
|
||
| Implementations of this class must: | ||
| - define how to encode/decode documents consistently to and from | ||
| Astra DB collections. The two operations must combine to the identity | ||
| on both sides. | ||
| - provide the adequate projection dictionaries for running find | ||
| operations on Astra DB, with and without the field containing the vector. | ||
| - encode IDs to the `_id` field on Astra DB. | ||
| - define the name of the field storing the textual content of the Document. | ||
| - define whether embeddings are computed server-side (with $vectorize) or not. | ||
| """ | ||
|
|
||
| server_side_embeddings: bool | ||
| content_field: str | ||
| base_projection: dict[str, bool] | ||
| full_projection: dict[str, bool] | ||
|
|
||
| @abstractmethod | ||
| def encode( | ||
| self, | ||
| content: str, | ||
| document_id: str, | ||
| vector: list[float] | None, | ||
| metadata: dict | None, | ||
| ) -> dict[str, Any]: | ||
| """Create a document for storage on Astra DB. | ||
|
|
||
| Args: | ||
| content: textual content for the (LangChain) `Document`. | ||
| document_id: unique ID for the (LangChain) `Document`. | ||
| vector: a vector associated with the (LangChain) `Document`. This | ||
| parameter must be None if and only if embeddings are computed server-side. | ||
| metadata: a metadata dictionary for the (LangChain) `Document`. | ||
|
|
||
| Returns: | ||
| a dictionary ready for storage onto Astra DB. | ||
| """ | ||
| ... | ||
|
|
||
| @abstractmethod | ||
| def decode(self, astra_document: dict[str, Any]) -> Document: | ||
| """Create a LangChain Document instance from a document retrieved from Astra DB. | ||
|
|
||
| Args: | ||
| astra_document: a dictionary as retrieved from Astra DB. | ||
|
|
||
| Returns: | ||
| a (LangChain) Document corresponding to the input. | ||
| """ | ||
| ... | ||
|
|
||
| @abstractmethod | ||
| def encode_filter(self, filter_dict: dict[str, Any]) -> dict[str, Any]: | ||
| """Encode a LangChain filter for use in Astra DB queries. | ||
|
|
||
| Make a LangChain filter into a filter clause suitable for operations | ||
| against the Astra DB collection, consistently with the encoding scheme. | ||
|
|
||
| Args: | ||
| filter_dict: a filter in the standardized metadata-filtering form | ||
| used throughout LangChain. | ||
|
|
||
| Returns: | ||
| an equivalent filter clause for use in Astra DB's find queries. | ||
| """ | ||
| ... | ||
|
|
||
|
|
||
| class DefaultVSDocumentEncoder(VSDocumentEncoder): | ||
| """Encoder for the default vector store usage with client-side embeddings. | ||
|
|
||
| This encoder expresses how documents are stored for collections created | ||
| and entirely managed by the AstraDBVectorStore class. | ||
| """ | ||
|
|
||
| server_side_embeddings = False | ||
| content_field = "content" | ||
|
|
||
| def __init__(self) -> None: | ||
| self.base_projection = {"_id": True, "content": True, "metadata": True} | ||
| self.full_projection = { | ||
| "_id": True, | ||
| "content": True, | ||
| "metadata": True, | ||
| "$vector": True, | ||
| } | ||
|
|
||
| @override | ||
| def encode( | ||
| self, | ||
| content: str, | ||
| document_id: str, | ||
| vector: list[float] | None, | ||
| metadata: dict | None, | ||
| ) -> dict[str, Any]: | ||
| if vector is None: | ||
| msg = "Default encoder cannot receive null vector" | ||
| raise ValueError(msg) | ||
| return { | ||
| "content": content, | ||
| "_id": document_id, | ||
| "$vector": vector, | ||
| "metadata": metadata, | ||
| } | ||
|
|
||
| @override | ||
| def decode(self, astra_document: dict[str, Any]) -> Document: | ||
| return Document( | ||
| page_content=astra_document["content"], | ||
| metadata=astra_document["metadata"], | ||
| ) | ||
|
|
||
| @override | ||
| def encode_filter(self, filter_dict: dict[str, Any]) -> dict[str, Any]: | ||
| return _default_filter_encoder(filter_dict) | ||
|
|
||
|
|
||
| class DefaultVectorizeVSDocumentEncoder(VSDocumentEncoder): | ||
| """Encoder for the default vector store usage with server-side embeddings. | ||
|
|
||
| This encoder expresses how documents are stored for collections created | ||
| and entirely managed by the AstraDBVectorStore class, for the case of | ||
| server-side embeddings (aka $vectorize). | ||
| """ | ||
|
|
||
| server_side_embeddings = True | ||
| content_field = "$vectorize" | ||
|
|
||
| def __init__(self) -> None: | ||
| self.base_projection = {"_id": True, "$vectorize": True, "metadata": True} | ||
| self.full_projection = { | ||
| "_id": True, | ||
| "$vectorize": True, | ||
| "metadata": True, | ||
| "$vector": True, | ||
| } | ||
|
|
||
| @override | ||
| def encode( | ||
| self, | ||
| content: str, | ||
| document_id: str, | ||
| vector: list[float] | None, | ||
| metadata: dict | None, | ||
| ) -> dict[str, Any]: | ||
| if vector is not None: | ||
| msg = "DefaultVectorize encoder cannot receive non-null vector" | ||
| raise ValueError(msg) | ||
| return { | ||
| "$vectorize": content, | ||
| "_id": document_id, | ||
| "metadata": metadata, | ||
| } | ||
|
|
||
| @override | ||
| def decode(self, astra_document: dict[str, Any]) -> Document: | ||
| return Document( | ||
| page_content=astra_document["$vectorize"], | ||
| metadata=astra_document["metadata"], | ||
| ) | ||
|
|
||
| @override | ||
| def encode_filter(self, filter_dict: dict[str, Any]) -> dict[str, Any]: | ||
| return _default_filter_encoder(filter_dict) | ||
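
To make the behaviour in the diff above easier to check by hand, here is a minimal standalone sketch of the filter encoding and of the client-side encode/decode round trip. It mirrors the logic shown in the diff but is not the package's code: `default_filter_encoder`, `SketchDocument`, `encode_default` and `decode_default` are hypothetical names introduced for illustration, and `SketchDocument` is a plain dataclass standing in for the LangChain `Document` so that the snippet runs without any dependency.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


def default_filter_encoder(filter_dict: dict[str, Any]) -> dict[str, Any]:
    """Standalone copy of the `_default_filter_encoder` logic from the diff."""
    encoded: dict[str, Any] = {}
    for key, value in filter_dict.items():
        if key and key[0] == "$":
            # Operator keys stay at the top level; their operands are re-encoded.
            if isinstance(value, list):
                encoded[key] = [default_filter_encoder(item) for item in value]
            else:
                encoded[key] = default_filter_encoder(value)
        else:
            # Plain metadata keys are nested under the `metadata.` prefix.
            encoded[f"metadata.{key}"] = value
    return encoded


@dataclass
class SketchDocument:
    """Hypothetical stand-in for the LangChain Document (not the real class)."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def encode_default(
    content: str,
    document_id: str,
    vector: list[float] | None,
    metadata: dict | None,
) -> dict[str, Any]:
    """Mirror of the client-side encoder: a vector is mandatory."""
    if vector is None:
        msg = "Default encoder cannot receive null vector"
        raise ValueError(msg)
    return {
        "content": content,
        "_id": document_id,
        "$vector": vector,
        "metadata": metadata,
    }


def decode_default(astra_document: dict[str, Any]) -> SketchDocument:
    """Mirror of the client-side decoder: only content and metadata are read."""
    return SketchDocument(
        page_content=astra_document["content"],
        metadata=astra_document["metadata"],
    )


# Filter encoding: metadata keys get the prefix, `$or` is left in place.
encoded_filter = default_filter_encoder({"a": 1, "$or": [{"b": 2}, {"c": 3}]})

# Round trip of content and metadata through encode and decode.
stored = encode_default("hello", "doc-1", [0.1, 0.2], {"lang": "en"})
restored = decode_default(stored)
```

In this sketch, as in the diff, `decode` reads only the content and metadata fields, so the round trip recovers those two; the `_id` and `$vector` values written by `encode` are not carried back into the decoded document.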