
Conversation


@hemidactylus hemidactylus commented Mar 28, 2025

This (admittedly huge) PR ports the package to astrapy v2, the new major release, and introduces hybrid search (a.k.a. findAndRerank-based search) in the vector store.

This is proposed as version 0.6.0 of this package.

Several further changes and improvements are bundled along the way.

Summary

AstraDBVectorStore (+codecs, +autodetect logic)

  • new vector store constructor params: hybrid+reranker collection settings, reranker API key, hybrid search on/off/default, hybrid (sub)limits
  • a new named tuple for advanced configuration of hybrid-search (sub)limits
  • codecs and autodetect can speak the hybrid and reranking specs (creating the sort parameter, detecting hybrid, etc.)
  • better constructor docstring detailing hybrid
  • the relevant similarity-search methods now accommodate a hybrid-search flow, depending on collection config and store setup (this prompted a slight rename of some internal methods: "search" is generic, while "find" denotes a non-hybrid find)
  • added extensive testing of the hybrid capabilities: store "lifecycle" usage, autodetect, vectorize/nonvectorize, force hybrid/nonhybrid
  • and specific hybrid-related doc codecs unit testing
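To make the new constructor surface concrete, here is a minimal sketch of a hybrid-aware store setup. The hybrid-related parameter names below are hypothetical placeholders: the PR lists the concepts (hybrid/reranker collection settings, reranker API key, on/off/default switch, sub-limits) but this sketch does not reproduce the exact signature. The non-hybrid parameters match those appearing elsewhere in this PR's own examples.

```python
import os

# Hypothetical parameter names for the hybrid-related settings; the
# pre-existing parameters (collection_name, token, api_endpoint) are real.
store_kwargs = {
    "collection_name": "my_collection",
    "token": os.environ.get("ASTRA_DB_APPLICATION_TOKEN", ""),
    "api_endpoint": os.environ.get("ASTRA_DB_API_ENDPOINT", ""),
    # hypothetical hybrid-related parameters:
    "hybrid_search": "default",  # on / off / default (HybridSearchMode)
    "reranker_api_key": os.environ.get("HEADER_RERANKING_API_KEY_NVIDIA", ""),
}
# store = AstraDBVectorStore(**store_kwargs)  # requires a live Astra DB
```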

"astradb.py" util:

  • introduced HybridSearchMode enum to control vectorstore strategy for running searches
  • introduced handling of hybrid-related collection configuration (lexical + reranker)
  • added a warning about collection-config-mismatches to better guide users to troubleshooting
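The enum and the strategy it controls can be sketched as follows. This is a minimal illustration of the on/off/default semantics described above; the actual enum in the package may differ in member names and values, and `runs_hybrid` is a hypothetical helper, not the package's internal function.

```python
from enum import Enum

class HybridSearchMode(Enum):
    """Sketch of the vector store's hybrid-search strategy switch."""
    DEFAULT = "default"  # defer to what the collection configuration supports
    ON = "on"            # force hybrid (findAndRerank-based) searches
    OFF = "off"          # force plain ANN searches

def runs_hybrid(mode: HybridSearchMode, collection_is_hybrid: bool) -> bool:
    # DEFAULT follows the collection config; ON/OFF override it either way.
    if mode is HybridSearchMode.DEFAULT:
        return collection_is_hybrid
    return mode is HybridSearchMode.ON
```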

Other changes:

  • propagated AstraDBVectorStore changes to AstraDBGraphVectorStore
  • better vector store constructor docstring with more class usage patterns
  • all changes required by porting from astrapy v1 to v2 (esp. collection creation and detection of overwrites in doc insertions)
  • removal of the constructor pattern of passing a ready-made "astrapy 1 client" (deprecated in past versions and slated for removal) in all classes
  • update the subdir README, in particular with a new section to help with collection-config mismatches (also added a helpful diagram)
  • expand on the root README for clarity
  • adapted blockbuster exemptions to fully cover running on a local Data API (i.e. http as opposed to https)
  • suppressed all graph vector store tests by default to save on the number of collections during IT (this is unavoidable at this point)

Some explanations on the "hybrid search" strategy

Intro

Hybrid-capable collections will be created by default; there is no need to specify anything (though being explicit is still possible).
The LC vector store adapts to this philosophy while exposing manual hybrid-related controls: once the Data API supports hybrid, new stores get it automatically.

There is a distinction between

  • a collection being hybrid-capable: in which case, documents need to be saved to the collection with certain fields (viz. "$lexical")
  • a store running searches using hybrid: if so, a different astrapy primitive is used (<collection>.find_and_rerank), with different syntax and return type.

In other words, one can decide that the store does not run hybrid searches: nevertheless, if the collection is detected as hybrid-capable, the codec will still honour the hybrid-compatible format (otherwise inconsistencies on the saved documents will arise).
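The two notions above can be illustrated concretely. The field names follow this PR's description (the "$lexical" field; the find_and_rerank primitive with its composite sort), but the exact Data API payloads may differ slightly, so treat this as a sketch.

```python
# 1. A document written to a *hybrid-capable* collection carries a "$lexical"
#    field alongside its content, regardless of how searches are later run:
doc_for_hybrid_collection = {
    "_id": "doc-1",
    "content": "LangChain + Astra DB hybrid search",
    "$vectorize": "LangChain + Astra DB hybrid search",
    "$lexical": "LangChain + Astra DB hybrid search",
}

# 2. A store *running hybrid searches* issues a find_and_rerank call, whose
#    sort clause carries the query for both legs of the composite search:
hybrid_sort = {"$hybrid": "what is hybrid search?"}
# results = collection.find_and_rerank(sort=hybrid_sort, limit=4)  # astrapy v2, live DB
```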

Defaults

Whether to run hybrid searches, by default, is decided on the basis of the collection config (for autodetect and regular creation likewise); but it can be overridden.

In almost all cases, backward compatibility is guaranteed by the sequence of defaults and the store behaviour. In one case a user might encounter a "collection config mismatch" error: if hybrid is rolled out to the server after the collection is created. A helpful warning, plus a dedicated section in the README, gives the user ample guidance on the possible roads to resolving the error in such a case. (This was deemed the least disruptive approach.)

Hybrid search

The vector store obeys the same contract whether hybrid or not -- namely, search methods accept one query: str parameter.

This parameter, in case of hybrid, is used for both parts of the composite search. One can optionally specify a separate search query string, lexical_query, for the lexical part of the search, different from that of the ANN search.

Note that this applies to only some methods:

  • those that accept a query: str
  • NOT those that accept an embedding vector directly (which by their nature assume just ANN search is required)
  • NOT the "_with_embedding" methods, used by the GraphRAG project graph store (as well as by the now-deprecated AstraDBGraphVectorStore class). Using hybrid capabilities within GraphRAG is not a goal for this PR (possibly a future development) [Thanks to @epinzur for clarifying this item]
  • NOT the MMR search mode (which falls back to regular ANN search). If necessary, future work may lift this limitation - however, MMR is a form of reranking in itself, hence combining hybrid and MMR was not deemed an immediate priority.

Where applicable, the following holds: regardless of whether the search is hybrid or regular-ANN, the return value has the same shape (as required by the VectorStore abstract class).
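For the methods where hybrid applies, the call shape looks like the following sketch (not executed here, since it needs a live Astra DB store). `lexical_query` is the optional parameter described above for a separate lexical query; everything else keeps the usual VectorStore contract.

```python
# Kwargs for a hybrid-enabled similarity search; `lexical_query` is optional
# and, when given, replaces `query` for the lexical leg only.
search_kwargs = {
    "query": "how do I enable hybrid search?",  # ANN leg (and lexical, by default)
    "k": 4,
    "lexical_query": "hybrid findAndRerank",    # optional: overrides the lexical leg
}
# docs = store.similarity_search(**search_kwargs)  # requires a live store
```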

Caveats and notes

  • Since the hybrid capabilities are not rolled out to all of Astra DB at the time of writing, all hybrid tests are suppressed. In order to include them, prepend the test invocation with LANGCHAIN_TEST_HYBRID=y
  • If an authentication key is required for the hybrid-related integration tests, environment variable HEADER_RERANKING_API_KEY_NVIDIA should contain it
  • The AstraDBGraphVectorStore integration tests can be run manually, if necessary, by prepending LANGCHAIN_TEST_ASTRADBGRAPHVECTORSTORE=y to the test invocation
  • For the temporary situation where the local Data API (HCD) lacks the fix for CNDB-13480 (unresponsive database for findAndRerank with deleted rows), run the tests with LANGCHAIN_TEST_NO_CNDB13480=y if hybrid tests are included.
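Putting the flags above together, a full opt-in test run might look like this. The environment variable names are the ones listed above; the pytest target is illustrative, so adapt it to the repo's usual test invocation.

```shell
export LANGCHAIN_TEST_HYBRID=y                      # include the hybrid tests
export HEADER_RERANKING_API_KEY_NVIDIA="<api-key>"  # if reranker auth is needed
export LANGCHAIN_TEST_ASTRADBGRAPHVECTORSTORE=y     # include graph store tests
export LANGCHAIN_TEST_NO_CNDB13480=y                # only if local HCD lacks the fix
# then run the usual test command, e.g.:
# pytest tests/integration_tests
```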

I have extensively run all tests, including those excluded by the default CI process, on various environments and have carefully verified that everything passes with the code of this PR.

Also ...

  • at the moment, the one available reranker model makes no guarantees on the range of the associated scores: rerank scores are not confined to the [0, 1] interval. Since these scores are returned as "similarity" by certain search methods, this may require special handling by the caller.
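One possible caller-side handling, sketched below: a caller that needs bounded "similarities" can squash the unbounded rerank scores, e.g. with a logistic function. The choice of mapping is up to the application; this is not something the package does for you.

```python
import math

def squash_score(score: float) -> float:
    """Map an unbounded rerank score into (0, 1) via the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))
```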

Follow-up work

A close companion to this PR will be documentation work on the main LangChain repo (sample notebook on AstraDBVectorStore + another couple of related pages).

Also, some (minor) refinements were mentioned earlier (possibly hybrid support in MMR search, for example).

In two or three cases, optimizations (reduction of code duplication, improvement in internal implementations) are left for future work.

Stefano Lottini added 30 commits February 26, 2025 09:47

class _AstraDBCollectionEnvironment(_AstraDBEnvironment):
def __init__(
def __init__( # noqa: C901
Collaborator

We could maybe exclude C90 if we don't intend to follow it ?

Collaborator Author

I would rather leave it like this, optimistically hoping for a fast follow-up rewrite of this whole constructor.

Collaborator Author

Update: using the ternary operator as you suggested has reduced the complexity down to "acceptable". So this noqa is (serendipitously) going away.

UserWarning,
stacklevel=2,
)
raise
@cbornet cbornet (Collaborator) commented Mar 29, 2025

Since we raise immediately, I wonder if it wouldn't be better to wrap the DataAPIException in a "langchain-datastax" exception with the explanation message instead of sending a warning.

Collaborator Author

Yes, I will raise ... from <underlying exception> and add a note to the message, instead of issuing the warning. Very good suggestion, and apt timing at that (with 0.6 going out).

Collaborator Author

The name would be AstraDBError (the parent package making it clear that it's a langchain-datastax thing).

Trying to be consistent with the vectorstores.AstraDBVectorStoreError class name already introduced.

@hemidactylus hemidactylus (Collaborator Author) commented Mar 31, 2025

Done, not bad indeed. With a pre-existing (and different) collection,

>>> try:
...     store1 = AstraDBVectorStore(
...         collection_name="man_tst",
...         token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
...         api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
...         collection_vector_service_options=VectorServiceOptions(provider="nvidia", model_name="NV-Embed-QA"),
...         collection_embedding_api_key=os.environ["HEADER_EMBEDDING_API_KEY_OPENAI"],
...     )
... except Exception as e:
...     print("ERR!")
...     err = e
... 
APICommander about to raise from: [{'message': "Collection already exists: trying to create Collection ('man_tst') with different settings", 'errorCode': 'EXISTING_COLLECTION_DIFFERENT_SETTINGS', 'id': 'd2146774-1bd2-4e2f-a06c-bedc0f48d6a1', 'title': 'Collection already exists', 'family': 'REQUEST', 'scope': 'EMPTY'}]
ERR!
>>> err
AstraDBError("Astra DB collection 'man_tst' was found to be configured differently than requested by the vector store creation. This is resulting in a hard exception from the Data API (accessible as `<this-exception>.__cause__`). Please see https://github.com/langchain-ai/langchain-datastax/blob/main/libs/astradb/README.md#collection-defaults-mismatch for more context about this issue and possible mitigations.")
>>> err.__cause__
DataAPIResponseException(text="Collection already exists: Collection already exists: trying to create Collection ('man_tst') with different settings (EXISTING_COLLECTION_DIFFERENT_SETTINGS)", command={'createCollection': {'name': 'man_tst', 'options': {'vector': {'service': {'provider': 'nvidia', 'modelName': 'NV-Embed-QA'}}, 'indexing': {'allow': ['metadata']}}}}, raw_response={'errors': [{'message': "Collection already exists: trying to create Collection ('man_tst') with different settings", 'errorCode': 'EXISTING_COLLECTION_DIFFERENT_SETTINGS', 'id': 'd2146774-1bd2-4e2f-a06c-bedc0f48d6a1', 'title': 'Collection already exists', 'family': 'REQUEST', 'scope': 'EMPTY'}]}, error_descriptors=[DataAPIErrorDescriptor('Collection already exists', error_code='EXISTING_COLLECTION_DIFFERENT_SETTINGS', message="Collection already exists: trying to create Collection ('man_tst') with different settings", family='REQUEST', scope='EMPTY', id='d2146774-1bd2-4e2f-a06c-bedc0f48d6a1')], warning_descriptors=[])

k: int = 4,
filter: dict[str, Any] | None = None, # noqa: A002
k: int,
filter: dict[str, Any] | None, # noqa: A002
Collaborator

Could rename filter arg since it's a private method and remove the noqa

@hemidactylus hemidactylus (Collaborator Author) commented Mar 31, 2025

Done (though I'm not sure the name change filter => filter_dict => filter in some call patterns is worth the hassle; no strong opinion anyway).

s_matches_z = sorted(matches_z, key=doc_sorter)
assert s_matches_z[0].metadata == {"m1": "A", "m2": "x", "mZ": "Z"}
assert s_matches_z[1].metadata == {"m1": "B", "m2": "y", "mZ": "Z"}
# TODO: restore this test once the issue with deleting rows is fixed
Collaborator

Should we open an issue and link it here ?

Collaborator Author

Here (and in the other 3 similar places): the fix for the issue, a datastax/cassandra bug, is merged to main.
What counts is whether it is rolled out consistently to the test kubes, which last time I checked was the case on only one kube. I have marked the comments accordingly (linking the issue), pending further roll-outs.

@cbornet cbornet (Collaborator) left a comment

Awesome work !
A few comments, nothing blocking.

@hemidactylus hemidactylus merged commit b840c15 into main Mar 31, 2025
13 checks passed
@hemidactylus hemidactylus deleted the SL-vectorstore-hybrid branch March 31, 2025 19:19