-
Notifications
You must be signed in to change notification settings - Fork 12
Bump to astrapy 2.0 + Hybrid-Search in vector store (candidate for 0.6.0) #120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…fic private methods
…_EMBEDDING_API_KEY_OPENAI
…unction; README on hybrid-mismatch errors
|
|
||
| class _AstraDBCollectionEnvironment(_AstraDBEnvironment): | ||
| def __init__( | ||
| def __init__( # noqa: C901 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could maybe exclude C90 if we don't intend to follow it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather leave it like this, optimistically hoping for a fast follow-up rewrite of this whole constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: using the ternary operator as you suggested has reduced the complexity down to "acceptable". So this noqa is (serendipitously) going away.
| UserWarning, | ||
| stacklevel=2, | ||
| ) | ||
| raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we raise immediately, I wonder if it wouldn't be better to wrap the DataAPIException in a "langchain-datastax" exception with the explanation message instead of sending a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will raise ... from <underlying exception> and add a note to the message, instead of issuing the warning. Very good suggestion, and apt timing at that (with 0.6 going out).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name would be AstraDBError (the parent package making it clear that it's a langchain-datastax thing).
Trying to be consistent with the vectorstores.AstraDBVectorStoreError class name already introduced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, not bad indeed. With a pre-existing (and different) collection,
>>> try:
... store1 = AstraDBVectorStore(
... collection_name="man_tst",
... token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
... api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
... collection_vector_service_options=VectorServiceOptions(provider="nvidia", model_name="NV-Embed-QA"),
... collection_embedding_api_key=os.environ["HEADER_EMBEDDING_API_KEY_OPENAI"],
... )
... except Exception as e:
... print("ERR!")
... err = e
...
APICommander about to raise from: [{'message': "Collection already exists: trying to create Collection ('man_tst') with different settings", 'errorCode': 'EXISTING_COLLECTION_DIFFERENT_SETTINGS', 'id': 'd2146774-1bd2-4e2f-a06c-bedc0f48d6a1', 'title': 'Collection already exists', 'family': 'REQUEST', 'scope': 'EMPTY'}]
ERR!
>>> err
AstraDBError("Astra DB collection 'man_tst' was found to be configured differently than requested by the vector store creation. This is resulting in a hard exception from the Data API (accessible as `<this-exception>.__cause__`). Please see https://github.com/langchain-ai/langchain-datastax/blob/main/libs/astradb/README.md#collection-defaults-mismatch for more context about this issue and possible mitigations.")
>>> err.__cause__
DataAPIResponseException(text="Collection already exists: Collection already exists: trying to create Collection ('man_tst') with different settings (EXISTING_COLLECTION_DIFFERENT_SETTINGS)", command={'createCollection': {'name': 'man_tst', 'options': {'vector': {'service': {'provider': 'nvidia', 'modelName': 'NV-Embed-QA'}}, 'indexing': {'allow': ['metadata']}}}}, raw_response={'errors': [{'message': "Collection already exists: trying to create Collection ('man_tst') with different settings", 'errorCode': 'EXISTING_COLLECTION_DIFFERENT_SETTINGS', 'id': 'd2146774-1bd2-4e2f-a06c-bedc0f48d6a1', 'title': 'Collection already exists', 'family': 'REQUEST', 'scope': 'EMPTY'}]}, error_descriptors=[DataAPIErrorDescriptor('Collection already exists', error_code='EXISTING_COLLECTION_DIFFERENT_SETTINGS', message="Collection already exists: trying to create Collection ('man_tst') with different settings", family='REQUEST', scope='EMPTY', id='d2146774-1bd2-4e2f-a06c-bedc0f48d6a1')], warning_descriptors=[])
| k: int = 4, | ||
| filter: dict[str, Any] | None = None, # noqa: A002 | ||
| k: int, | ||
| filter: dict[str, Any] | None, # noqa: A002 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could rename filter arg since it's a private method and remove the noqa
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (though I'm not sure if the name change filter => filter_dict => filter in some call patterns is worth the hassle. No strong opinion anyway.)
| s_matches_z = sorted(matches_z, key=doc_sorter) | ||
| assert s_matches_z[0].metadata == {"m1": "A", "m2": "x", "mZ": "Z"} | ||
| assert s_matches_z[1].metadata == {"m1": "B", "m2": "y", "mZ": "Z"} | ||
| # TODO: restore this test once the issue with deleting rows is fixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we open an issue and link it here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here (and in the other 3 similar places): the issue, as a datastax/cassandra bug, is merged to main.
What counts is whether it is rolled out consistently to the test kubes, which last time I checked was only on one kube. I have marked the comments accordingly (linking the issue), pending further roll-outs.
libs/astradb/tests/integration_tests/test_vectorstore_autodetect.py
Outdated
Show resolved
Hide resolved
libs/astradb/tests/integration_tests/test_vectorstore_autodetect.py
Outdated
Show resolved
Hide resolved
libs/astradb/tests/integration_tests/test_vectorstore_autodetect.py
Outdated
Show resolved
Hide resolved
libs/astradb/tests/integration_tests/test_vectorstore_autodetect.py
Outdated
Show resolved
Hide resolved
libs/astradb/tests/integration_tests/test_vectorstore_autodetect.py
Outdated
Show resolved
Hide resolved
cbornet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work !
A few comments, nothing blocking.
This (admttedly huge) PR brings the package to the new major release of
astrapy v2, and introduces Hybrid search (a.k.a. findAndRerank-based) to the vector store.This is proposed as version 0.6.0 of this package.
Along the way, several changes and improvements are bundled with this.
Summary
AstraDBVectorStore (+codecs, +autodetect logic)
sortparameter, detecting hybrid etc)"astradb.py" util:
HybridSearchModeenum to control vectorstore strategy for running searchesOther changes:
Some explanations on the "hybrid search" strategy
Intro
Hybrid-capable collections will be created by default, no need to specify anything (but explicit is still possible).
The LC vectorstore adapts to this philosophy, while exposing hybrid-related manual controls; but when the Data API supports hybrid, new stores will get it automatically.
There is a distinction between
"$lexical")<collection>.find_and_rerank), with different syntax and return type.In other words, one can decide that the store does not run hybrid searches: nevertheless, if the collection is detected as hybrid-capable, the codec will still honour the hybrid-compatible format (otherwise inconsistencies on the saved documents will arise).
Defaults
Whether to run hybrid searches, by default, is decided on the basis of the collection config (for autodetect and regular creation likewise); but it can be overridden.
In almost all cases, backward compatibility is guaranteed by the sequence of defaults and the store behaviour. In one case, a user might encounter a "collection config mismatch" error if hybrid is rolled out to the server after the collection is created. By a helpful warning, and a whole section in the README, ample guidance is given to the user about the possible roads to resolving the error that may arise in such case. (This was deemed the least disruptive action).
Hybrid search
The vector store obeys the same contract whether hybrid or not -- namely, search methods accept one
query: strparameter.This parameter, in case of hybrid, is used for both parts of the composite search. One can optionally specify a separate search query string,
lexical_query, for the lexical part of the search, different from that of the ANN search.Note that this applies to only some methods:
query: strWhere applicable, the following holds: regardless of whether the search is hybrid or regular-ANN, the return value has the same shape (as required by the VectorStore abstract class).
Caveats and notes
LANGCHAIN_TEST_HYBRID=yHEADER_RERANKING_API_KEY_NVIDIAshould contain itLANGCHAIN_TEST_ASTRADBGRAPHVECTORSTORE=yto the test invocationLANGCHAIN_TEST_NO_CNDB13480=yif hybrid tests are included.I have extensively run all tests, including those excluded by the default CI process, on various environment and have carefully verified everything is passing with the code of this PR.
Also ...
[0:1]interval. Since these scores are returned as "similarity" by certain search methods, this may require special handling by the caller.Follow-up work
A close companion to this PR will be documentation work on the main LangChain repo (sample notebook on AstraDBVectorStore + another couple of related pages).
Also, some (minor) refinements were mentioned earlier (possibly hybrid support in MMR search, for example).
In two or three cases, optimizations (reduction of code duplication, improvement in internal implementations) are left for future work.