
Conversation


@hemidactylus hemidactylus commented Mar 28, 2025

This (admittedly huge) PR ports the package to astrapy v2, the new major release, and introduces hybrid search (a.k.a. findAndRerank-based search) in the vector store.

This is proposed as version 0.6.0 of this package.

Several further changes and improvements are bundled along the way.

Summary

AstraDBVectorStore (+codecs, +autodetect logic)

  • new vector store constructor params: hybrid+reranker collection settings, reranker API key, hybrid search on/off/default, hybrid (sub)limits
  • a new named tuple for advanced configuration of hybrid-search (sub)limits
  • codecs and autodetect can speak the hybrid and reranking specs (creating the sort parameter, detecting hybrid, etc.)
  • better constructor docstring detailing hybrid
  • the relevant similarity-search methods now accommodate a hybrid-search flow, depending on collection config and store setup (this prompted a slight rename of some internal methods: "search" is generic, while "find" denotes a non-hybrid find)
  • added extensive testing of the hybrid capabilities: store "lifecycle" usage, autodetect, vectorize/nonvectorize, force hybrid/nonhybrid
  • and specific hybrid-related doc codecs unit testing
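To make the new constructor surface concrete, here is a minimal sketch of a hybrid-aware store setup. The hybrid-related parameter names below are hypothetical placeholders: the PR lists the concepts (hybrid/reranker collection settings, reranker API key, on/off/default switch, sub-limits) but this sketch does not reproduce the exact signature. The non-hybrid parameters match those appearing elsewhere in this PR's own examples.

```python
import os

# Hypothetical parameter names for the hybrid-related settings; the
# pre-existing parameters (collection_name, token, api_endpoint) are real.
store_kwargs = {
    "collection_name": "my_collection",
    "token": os.environ.get("ASTRA_DB_APPLICATION_TOKEN", ""),
    "api_endpoint": os.environ.get("ASTRA_DB_API_ENDPOINT", ""),
    # hypothetical hybrid-related parameters:
    "hybrid_search": "default",  # on / off / default (HybridSearchMode)
    "reranker_api_key": os.environ.get("HEADER_RERANKING_API_KEY_NVIDIA", ""),
}
# store = AstraDBVectorStore(**store_kwargs)  # requires a live Astra DB
```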

"astradb.py" util:

  • introduced HybridSearchMode enum to control vectorstore strategy for running searches
  • introduced handling of hybrid-related collection configuration (lexical + reranker)
  • added a warning about collection-config-mismatches to better guide users to troubleshooting
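The enum and the strategy it controls can be sketched as follows. This is a minimal illustration of the on/off/default semantics described above; the actual enum in the package may differ in member names and values, and `runs_hybrid` is a hypothetical helper, not the package's internal function.

```python
from enum import Enum

class HybridSearchMode(Enum):
    """Sketch of the vector store's hybrid-search strategy switch."""
    DEFAULT = "default"  # defer to what the collection configuration supports
    ON = "on"            # force hybrid (findAndRerank-based) searches
    OFF = "off"          # force plain ANN searches

def runs_hybrid(mode: HybridSearchMode, collection_is_hybrid: bool) -> bool:
    # DEFAULT follows the collection config; ON/OFF override it either way.
    if mode is HybridSearchMode.DEFAULT:
        return collection_is_hybrid
    return mode is HybridSearchMode.ON
```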

Other changes:

  • propagated AstraDBVectorStore changes to AstraDBGraphVectorStore
  • better vector store constructor docstring with more class usage patterns
  • all changes required by porting from astrapy v1 to v2 (esp. collection creation and detection of overwrites in doc insertions)
  • removal of the constructor pattern of passing a ready-made "astrapy 1 client" (deprecated in past versions and slated for removal) in all classes
  • update the subdir README, in particular with a new section to help with collection-config mismatches (also added a helpful diagram)
  • expand on the root README for clarity
  • adapted blockbuster exemptions to fully cover running on a local Data API (i.e. http as opposed to https)
  • suppressed all graph vector store tests by default to save on the number of collections during IT (this is unavoidable at this point)

Some explanations on the "hybrid search" strategy

Intro

Hybrid-capable collections will be created by default; there is no need to specify anything (though being explicit is still possible).
The LC vector store adapts to this philosophy while exposing manual hybrid-related controls: once the Data API supports hybrid, new stores get it automatically.

There is a distinction between

  • a collection being hybrid-capable: in which case, documents need to be saved to the collection with certain fields (viz. "$lexical")
  • a store running searches using hybrid: if so, a different astrapy primitive is used (<collection>.find_and_rerank), with different syntax and return type.

In other words, one can decide that the store does not run hybrid searches: nevertheless, if the collection is detected as hybrid-capable, the codec will still honour the hybrid-compatible format (otherwise inconsistencies on the saved documents will arise).
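The two notions above can be illustrated concretely. The field names follow this PR's description (the "$lexical" field; the find_and_rerank primitive with its composite sort), but the exact Data API payloads may differ slightly, so treat this as a sketch.

```python
# 1. A document written to a *hybrid-capable* collection carries a "$lexical"
#    field alongside its content, regardless of how searches are later run:
doc_for_hybrid_collection = {
    "_id": "doc-1",
    "content": "LangChain + Astra DB hybrid search",
    "$vectorize": "LangChain + Astra DB hybrid search",
    "$lexical": "LangChain + Astra DB hybrid search",
}

# 2. A store *running hybrid searches* issues a find_and_rerank call, whose
#    sort clause carries the query for both legs of the composite search:
hybrid_sort = {"$hybrid": "what is hybrid search?"}
# results = collection.find_and_rerank(sort=hybrid_sort, limit=4)  # astrapy v2, live DB
```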

Defaults

Whether to run hybrid searches, by default, is decided on the basis of the collection config (for autodetect and regular creation likewise); but it can be overridden.

In almost all cases, backward compatibility is guaranteed by the sequence of defaults and the store behaviour. In one case a user might encounter a "collection config mismatch" error: if hybrid is rolled out to the server after the collection is created. A helpful warning, plus a dedicated section in the README, gives the user ample guidance on the possible roads to resolving the error in such a case. (This was deemed the least disruptive approach.)

Hybrid search

The vector store obeys the same contract whether hybrid or not -- namely, search methods accept one query: str parameter.

This parameter, in case of hybrid, is used for both parts of the composite search. One can optionally specify a separate search query string, lexical_query, for the lexical part of the search, different from that of the ANN search.

Note that this applies to only some methods:

  • those that accept a query: str
  • NOT those that accept an embedding vector directly (which by their nature assume just ANN search is required)
  • NOT the "_with_embedding" methods, used by the GraphRAG project graph store (as well as by the now-deprecated AstraDBGraphVectorStore class). Using hybrid capabilities within GraphRAG is not a goal for this PR (possibly a future development) [Thanks to @epinzur for clarifying this item]
  • NOT the MMR search mode (which falls back to regular ANN search). If necessary, future work may lift this limitation - however, MMR is a form of reranking in itself, hence combining hybrid and MMR was not deemed an immediate priority.

Where applicable, the following holds: regardless of whether the search is hybrid or regular-ANN, the return value has the same shape (as required by the VectorStore abstract class).
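For the methods where hybrid applies, the call shape looks like the following sketch (not executed here, since it needs a live Astra DB store). `lexical_query` is the optional parameter described above for a separate lexical query; everything else keeps the usual VectorStore contract.

```python
# Kwargs for a hybrid-enabled similarity search; `lexical_query` is optional
# and, when given, replaces `query` for the lexical leg only.
search_kwargs = {
    "query": "how do I enable hybrid search?",  # ANN leg (and lexical, by default)
    "k": 4,
    "lexical_query": "hybrid findAndRerank",    # optional: overrides the lexical leg
}
# docs = store.similarity_search(**search_kwargs)  # requires a live store
```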

Caveats and notes

  • Since the hybrid capabilities are not rolled out to all of Astra DB at the time of writing, all hybrid tests are suppressed. In order to include them, prepend the test invocation with LANGCHAIN_TEST_HYBRID=y
  • If an authentication key is required for the hybrid-related integration tests, environment variable HEADER_RERANKING_API_KEY_NVIDIA should contain it
  • The AstraDBGraphVectorStore integration tests can be run manually, if necessary, by prepending LANGCHAIN_TEST_ASTRADBGRAPHVECTORSTORE=y to the test invocation
  • For the temporary situation where the local Data API (HCD) lacks the fix for CNDB-13480 (unresponsive database for findAndRerank with deleted rows), run the tests with LANGCHAIN_TEST_NO_CNDB13480=y if hybrid tests are included.
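Putting the flags above together, a full opt-in test run might look like this. The environment variable names are the ones listed above; the pytest target is illustrative, so adapt it to the repo's usual test invocation.

```shell
export LANGCHAIN_TEST_HYBRID=y                      # include the hybrid tests
export HEADER_RERANKING_API_KEY_NVIDIA="<api-key>"  # if reranker auth is needed
export LANGCHAIN_TEST_ASTRADBGRAPHVECTORSTORE=y     # include graph store tests
export LANGCHAIN_TEST_NO_CNDB13480=y                # only if local HCD lacks the fix
# then run the usual test command, e.g.:
# pytest tests/integration_tests
```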

I have extensively run all tests, including those excluded by the default CI process, on various environments and have carefully verified that everything passes with the code of this PR.

Also ...

  • at the moment, the one available reranker model makes no guarantees on the range of the associated scores: rerank scores are not confined to the [0, 1] interval. Since these scores are returned as "similarity" by certain search methods, this may require special handling by the caller.
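One possible caller-side handling, sketched below: a caller that needs bounded "similarities" can squash the unbounded rerank scores, e.g. with a logistic function. The choice of mapping is up to the application; this is not something the package does for you.

```python
import math

def squash_score(score: float) -> float:
    """Map an unbounded rerank score into (0, 1) via the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))
```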

Follow-up work

A close companion to this PR will be documentation work on the main LangChain repo (sample notebook on AstraDBVectorStore + another couple of related pages).

Also, some (minor) refinements were mentioned earlier (possibly hybrid support in MMR search, for example).

In two or three cases, optimizations (reduction of code duplication, improvement in internal implementations) are left for future work.

Stefano Lottini added 30 commits February 26, 2025 09:47

class _AstraDBCollectionEnvironment(_AstraDBEnvironment):
def __init__(
def __init__( # noqa: C901
Collaborator

We could maybe exclude C90 if we don't intend to follow it ?

Collaborator Author

I would rather leave it like this, optimistically hoping for a fast follow-up rewrite of this whole constructor.

Collaborator Author

Update: using the ternary operator as you suggested has reduced the complexity down to "acceptable". So this noqa is (serendipitously) going away.

UserWarning,
stacklevel=2,
)
raise
@cbornet cbornet (Collaborator) commented Mar 29, 2025

Since we raise immediately, I wonder if it wouldn't be better to wrap the DataAPIException in a "langchain-datastax" exception with the explanation message instead of sending a warning.

Collaborator Author

Yes, I will raise ... from <underlying exception> and add a note to the message, instead of issuing the warning. Very good suggestion, and apt timing at that (with 0.6 going out).

Collaborator Author

The name would be AstraDBError (the parent package making it clear that it's a langchain-datastax thing).

Trying to be consistent with the vectorstores.AstraDBVectorStoreError class name already introduced.

@hemidactylus hemidactylus (Collaborator Author) commented Mar 31, 2025

Done, not bad indeed. With a pre-existing (and different) collection,

>>> try:
...     store1 = AstraDBVectorStore(
...         collection_name="man_tst",
...         token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
...         api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
...         collection_vector_service_options=VectorServiceOptions(provider="nvidia", model_name="NV-Embed-QA"),
...         collection_embedding_api_key=os.environ["HEADER_EMBEDDING_API_KEY_OPENAI"],
...     )
... except Exception as e:
...     print("ERR!")
...     err = e
... 
APICommander about to raise from: [{'message': "Collection already exists: trying to create Collection ('man_tst') with different settings", 'errorCode': 'EXISTING_COLLECTION_DIFFERENT_SETTINGS', 'id': 'd2146774-1bd2-4e2f-a06c-bedc0f48d6a1', 'title': 'Collection already exists', 'family': 'REQUEST', 'scope': 'EMPTY'}]
ERR!
>>> err
AstraDBError("Astra DB collection 'man_tst' was found to be configured differently than requested by the vector store creation. This is resulting in a hard exception from the Data API (accessible as `<this-exception>.__cause__`). Please see https://github.com/langchain-ai/langchain-datastax/blob/main/libs/astradb/README.md#collection-defaults-mismatch for more context about this issue and possible mitigations.")
>>> err.__cause__
DataAPIResponseException(text="Collection already exists: Collection already exists: trying to create Collection ('man_tst') with different settings (EXISTING_COLLECTION_DIFFERENT_SETTINGS)", command={'createCollection': {'name': 'man_tst', 'options': {'vector': {'service': {'provider': 'nvidia', 'modelName': 'NV-Embed-QA'}}, 'indexing': {'allow': ['metadata']}}}}, raw_response={'errors': [{'message': "Collection already exists: trying to create Collection ('man_tst') with different settings", 'errorCode': 'EXISTING_COLLECTION_DIFFERENT_SETTINGS', 'id': 'd2146774-1bd2-4e2f-a06c-bedc0f48d6a1', 'title': 'Collection already exists', 'family': 'REQUEST', 'scope': 'EMPTY'}]}, error_descriptors=[DataAPIErrorDescriptor('Collection already exists', error_code='EXISTING_COLLECTION_DIFFERENT_SETTINGS', message="Collection already exists: trying to create Collection ('man_tst') with different settings", family='REQUEST', scope='EMPTY', id='d2146774-1bd2-4e2f-a06c-bedc0f48d6a1')], warning_descriptors=[])

k: int = 4,
filter: dict[str, Any] | None = None, # noqa: A002
k: int,
filter: dict[str, Any] | None, # noqa: A002
Collaborator

Could rename filter arg since it's a private method and remove the noqa

@hemidactylus hemidactylus (Collaborator Author) commented Mar 31, 2025

Done (though I'm not sure the name change filter => filter_dict => filter in some call patterns is worth the hassle; no strong opinion anyway).

s_matches_z = sorted(matches_z, key=doc_sorter)
assert s_matches_z[0].metadata == {"m1": "A", "m2": "x", "mZ": "Z"}
assert s_matches_z[1].metadata == {"m1": "B", "m2": "y", "mZ": "Z"}
# TODO: restore this test once the issue with deleting rows is fixed
Collaborator

Should we open an issue and link it here ?

Collaborator Author

Here (and in the other 3 similar places): the fix for the issue, a datastax/cassandra bug, is merged to main.
What counts is whether it is rolled out consistently to the test kubes, which last time I checked was the case on only one kube. I have marked the comments accordingly (linking the issue), pending further roll-outs.

@cbornet cbornet (Collaborator) left a comment

Awesome work !
A few comments, nothing blocking.

@hemidactylus hemidactylus merged commit b840c15 into main Mar 31, 2025
13 checks passed
@hemidactylus hemidactylus deleted the SL-vectorstore-hybrid branch March 31, 2025 19:19