Commit 1356d5f

Python: Feature new memory stores and collections (#7614)
### Motivation and Context

This PR adds new vector store and vector store record collections, as well as implementations for:

- Azure AI Search
- Redis
- Qdrant
- Volatile (in-memory)

It also adds samples, tests, and unit tests for these. It further adds the vector store record fields, definition, and decorator, with tests and samples. All are marked experimental.

The existing Redis, Azure AI Search, Qdrant, and Volatile memory stores will be marked as deprecated in the future, once the new collections are feature complete.

### Description

### Contribution Checklist

- [x] The code builds clean without any errors or warnings
- [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations
- [x] All unit tests pass, and I have added new tests where possible
- [x] I didn't break anyone 😄
1 parent 07f94f2 commit 1356d5f

63 files changed

Lines changed: 6359 additions & 282 deletions

.github/workflows/python-integration-tests.yml

Lines changed: 2 additions & 0 deletions
@@ -131,6 +131,7 @@ jobs:
           VERTEX_AI_PROJECT_ID: ${{ vars.VERTEX_AI_PROJECT_ID }}
           VERTEX_AI_GEMINI_MODEL_ID: ${{ vars.VERTEX_AI_GEMINI_MODEL_ID }}
           VERTEX_AI_EMBEDDING_MODEL_ID: ${{ vars.VERTEX_AI_EMBEDDING_MODEL_ID }}
+          REDIS_CONNECTION_STRING: ${{ vars.REDIS_CONNECTION_STRING }}
         run: |
           cd python
           poetry run pytest ./tests/integration ./tests/samples -v --junitxml=pytest.xml
@@ -242,6 +243,7 @@ jobs:
           VERTEX_AI_PROJECT_ID: ${{ vars.VERTEX_AI_PROJECT_ID }}
           VERTEX_AI_GEMINI_MODEL_ID: ${{ vars.VERTEX_AI_GEMINI_MODEL_ID }}
           VERTEX_AI_EMBEDDING_MODEL_ID: ${{ vars.VERTEX_AI_EMBEDDING_MODEL_ID }}
+          REDIS_CONNECTION_STRING: ${{ vars.REDIS_CONNECTION_STRING }}
         run: |
           if ${{ matrix.os == 'ubuntu-latest' }}; then
             docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest

python/.coveragerc

Lines changed: 3 additions & 3 deletions
@@ -10,8 +10,8 @@ omit =
     semantic_kernel/connectors/memory/mongodb_atlas/*
     semantic_kernel/connectors/memory/pinecone/*
     semantic_kernel/connectors/memory/postgres/*
-    semantic_kernel/connectors/memory/qdrant/*
-    semantic_kernel/connectors/memory/redis/*
+    semantic_kernel/connectors/memory/qdrant/qdrant_memory_store.py
+    semantic_kernel/connectors/memory/redis/redis_memory_store.py
     semantic_kernel/connectors/memory/usearch/*
     semantic_kernel/connectors/memory/weaviate/*
     semantic_kernel/reliability/*
@@ -33,4 +33,4 @@ exclude_lines =
     # TYPE_CHECKING and @overload blocks are never executed during pytest run
     if TYPE_CHECKING:
     @overload
-    @abstractmethod
+    @abstractmethod

python/.cspell.json

Lines changed: 5 additions & 1 deletion
@@ -47,6 +47,10 @@
     "protos",
     "endregion",
     "vertexai",
-    "aiplatform"
+    "aiplatform",
+    "serde",
+    "datamodel",
+    "vectorstoremodel",
+    "qdrant"
   ]
 }

python/mypy.ini

Lines changed: 6 additions & 0 deletions
@@ -26,6 +26,8 @@ ignore_errors = true
 [mypy-semantic_kernel.connectors.memory.astradb.*]
 ignore_errors = true
 
+[mypy-semantic_kernel.connectors.memory.azure_ai_search.*]
+ignore_errors = false
 [mypy-semantic_kernel.connectors.memory.azure_cognitive_search.*]
 ignore_errors = true
 
@@ -50,9 +52,13 @@ ignore_errors = true
 [mypy-semantic_kernel.connectors.memory.postgres.*]
 ignore_errors = true
 
+[mypy-semantic_kernel.connectors.memory.qdrant.qdrant_vector_record_store.*]
+ignore_errors = true
 [mypy-semantic_kernel.connectors.memory.qdrant.*]
 ignore_errors = true
 
+[mypy-semantic_kernel.connectors.memory.redis.redis_vector_record_store.*]
+ignore_errors = true
 [mypy-semantic_kernel.connectors.memory.redis.*]
 ignore_errors = true

python/poetry.lock

Lines changed: 229 additions & 98 deletions
Some generated files are not rendered by default.

python/pyproject.toml

Lines changed: 24 additions & 13 deletions
@@ -57,8 +57,9 @@ chromadb = { version = ">=0.4.13,<0.6.0", optional = true}
 google-cloud-aiplatform = { version = "^1.60.0", optional = true}
 google-generativeai = { version = "^0.7.2", optional = true}
 # hugging face
-transformers = { version = "^4.28.1", extras=["torch"], optional = true}
+transformers = { version = "^4.28.1", extras=['torch'], optional = true}
 sentence-transformers = { version = "^2.2.2", optional = true}
+torch = {version = "2.2.2", optional = true}
 # mongo
 motor = { version = "^3.3.2", optional = true }
 # notebooks
@@ -73,20 +74,20 @@ ollama = { version = "^0.2.1", optional = true}
 # pinecone
 pinecone-client = { version = ">=3.0.0", optional = true}
 # postgres
-psycopg = { version="^3.1.9", extras=["binary","pool"], optional = true}
+psycopg = { version="^3.2.1", extras=["binary","pool"], optional = true}
 # qdrant
 qdrant-client = { version = '^1.9', optional = true}
 # redis
-redis = { version = "^4.6.0", optional = true}
+redis = { version = "^5.0.7", extras=['hiredis'], optional = true}
+types-redis = { version="^4.6.0.20240425", optional = true }
 # usearch
 usearch = { version = "^2.9", optional = true}
 pyarrow = { version = ">=12.0.1,<18.0.0", optional = true}
 weaviate-client = { version = ">=3.18,<5.0", optional = true}
-ruff = "0.5.2"
+pandas = {version = "^2.2.2", optional = true}
 
 [tool.poetry.group.dev.dependencies]
 pre-commit = ">=3.7.1"
-ruff = ">=0.5"
 ipykernel = "^6.29.4"
 nbconvert = "^7.16.4"
 pytest = "^8.2.1"
@@ -96,6 +97,7 @@ pytest-asyncio = "^0.23.7"
 snoop = "^0.4.3"
 mypy = ">=1.10.0"
 types-PyYAML = "^6.0.12.20240311"
+ruff = "^0.5.2"
 
 [tool.poetry.group.unit-tests]
 optional = true
@@ -109,8 +111,14 @@ mistralai = "^0.4.1"
 ollama = "^0.2.1"
 google-cloud-aiplatform = "^1.60.0"
 google-generativeai = "^0.7.2"
-transformers = { version = "^4.28.1", extras=["torch"]}
-sentence-transformers = "^2.2.2"
+transformers = { version = "^4.28.1", extras=['torch']}
+sentence-transformers = { version = "^2.2.2"}
+torch = {version = "2.2.2"}
+# qdrant
+qdrant-client = '^1.9'
+# redis
+redis = { version = "^5.0.7", extras=['hiredis']}
+pandas = {version = "^2.2.2"}
 
 [tool.poetry.group.tests]
 optional = true
@@ -129,8 +137,9 @@ chromadb = ">=0.4.13,<0.6.0"
 google-cloud-aiplatform = "^1.60.0"
 google-generativeai = "^0.7.2"
 # hugging face
-transformers = { version = "^4.28.1", extras=["torch"]}
-sentence-transformers = "^2.2.2"
+transformers = { version = "^4.28.1", extras=['torch']}
+sentence-transformers = { version = "^2.2.2"}
+torch = {version = "2.2.2"}
 # milvus
 pymilvus = ">=2.3,<2.4.4"
 milvus = { version = ">=2.3,<2.3.8", markers = 'sys_platform != "win32"'}
@@ -147,21 +156,23 @@ psycopg = { version="^3.1.9", extras=["binary","pool"]}
 # qdrant
 qdrant-client = '^1.9'
 # redis
-redis = "^4.6.0"
+redis = { version="^5.0.7", extras=['hiredis']}
+types-redis = { version="^4.6.0.20240425" }
 # usearch
 usearch = "^2.9"
 pyarrow = ">=12.0.1,<18.0.0"
 # weaviate
 weaviate-client = ">=3.18,<5.0"
+pandas = {version = "^2.2.2"}
 
 # Extras are exposed to pip, this allows a user to easily add the right dependencies to their environment
 [tool.poetry.extras]
-all = ["transformers", "sentence-transformers", "qdrant-client", "chromadb", "pymilvus", "milvus", "mistralai", "ollama", "google", "weaviate-client", "pinecone-client", "psycopg", "redis", "azure-ai-inference", "azure-search-documents", "azure-core", "azure-identity", "azure-cosmos", "usearch", "pyarrow", "ipykernel", "motor"]
+all = ["transformers", "sentence-transformers", "torch", "qdrant-client", "chromadb", "pymilvus", "milvus", "mistralai", "ollama", "google", "weaviate-client", "pinecone-client", "psycopg", "redis", "azure-ai-inference", "azure-search-documents", "azure-core", "azure-identity", "azure-cosmos", "usearch", "pyarrow", "ipykernel", "motor"]
 
 azure = ["azure-ai-inference", "azure-search-documents", "azure-core", "azure-identity", "azure-cosmos", "msgraph-sdk"]
 chromadb = ["chromadb"]
 google = ["google-cloud-aiplatform", "google-generativeai"]
-hugging_face = ["transformers", "sentence-transformers"]
+hugging_face = ["transformers", "sentence-transformers", "torch"]
 milvus = ["pymilvus", "milvus"]
 mistralai = ["mistralai"]
 ollama = ["ollama"]
@@ -170,7 +181,7 @@ notebooks = ["ipykernel"]
 pinecone = ["pinecone-client"]
 postgres = ["psycopg"]
 qdrant = ["qdrant-client"]
-redis = ["redis"]
+redis = ["redis", "types-redis"]
 usearch = ["usearch", "pyarrow"]
 weaviate = ["weaviate-client"]
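
Several constraints in the pyproject.toml diff use Poetry's caret operator, which allows updates that leave the leftmost non-zero version component unchanged. This is why moving `redis` to the 5.x line needed an explicit change from `^4.6.0` to `^5.0.7`. The following is a minimal sketch of that rule for plain numeric versions only. `caret_range` and `satisfies` are hypothetical helper names for illustration, not part of Poetry, and pre-release or local version tags are not handled.

```python
def _parse(version: str) -> list[int]:
    """Split a purely numeric dotted version into integers."""
    return [int(part) for part in version.split(".")]


def caret_range(constraint: str) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Return (inclusive lower bound, exclusive upper bound) for '^X.Y.Z'."""
    parts = _parse(constraint.removeprefix("^"))
    width = max(3, len(parts))
    lower = parts + [0] * (width - len(parts))
    # Bump the leftmost non-zero component that was written out;
    # if every written component is zero, bump the last written one.
    bump = next((i for i, p in enumerate(parts) if p != 0), len(parts) - 1)
    upper = lower[:bump] + [lower[bump] + 1] + [0] * (width - bump - 1)
    return tuple(lower), tuple(upper)


def satisfies(version: str, constraint: str) -> bool:
    """Check whether a numeric version falls inside a caret range."""
    lower, upper = caret_range(constraint)
    parsed = _parse(version)
    padded = tuple(parsed + [0] * (len(lower) - len(parsed)))
    return lower <= padded < upper


if __name__ == "__main__":
    for spec in ("^4.6.0", "^5.0.7", "^0.5.2", "^1.9"):
        print(spec, "->", caret_range(spec))
    # The old redis constraint does not admit the 5.x series.
    print(satisfies("5.0.7", "^4.6.0"))
```

Under this rule, `ruff = "^0.5.2"` stays within the 0.5.x series, whereas the removed `ruff = ">=0.5"` had no upper bound.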

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
+# Copyright (c) Microsoft. All rights reserved.
+
+from dataclasses import dataclass, field
+from typing import Annotated, Any
+from uuid import uuid4
+
+from pandas import DataFrame
+from pydantic import Field
+
+from semantic_kernel.data.vector_store_model_decorator import vectorstoremodel
+from semantic_kernel.data.vector_store_model_definition import VectorStoreRecordDefinition
+from semantic_kernel.data.vector_store_record_fields import (
+    VectorStoreRecordDataField,
+    VectorStoreRecordKeyField,
+    VectorStoreRecordVectorField,
+)
+from semantic_kernel.kernel_pydantic import KernelBaseModel
+
+# This concept shows the different ways you can create a vector store data model
+# using dataclasses, Pydantic, and Python classes.
+# As well as using types like Pandas Dataframes.
+
+# There are a number of universal things about these data models:
+# they must specify the type of field through the annotation (or the definition).
+# there must be at least one field of type VectorStoreRecordKeyField.
+# If you set the embedding_property_name in the VectorStoreRecordDataField, that field must exist and be a vector field.
+# An unannotated field is allowed but must have a default value.
+
+# The purpose of these models is to be what you pass to and get back from a vector store.
+# There may be limitations to data types that the vector store can handle,
+# so not every store will be able to handle completely the same model.
+# for instance, some stores only allow a string as the keyfield, while others allow str and int,
+# so defining the key with an int, might make some stores unusable.
+
+# The decorator takes the class and pulls out the fields and annotations to create a definition,
+# of type VectorStoreRecordDefinition.
+# This definition is used for the vector store to know how to handle the data model.
+
+# You can also create the definition yourself, and pass it to the vector stores together with a standard type,
+# like a dict or list.
+# Or you can use the definition in container mode with something like a Pandas Dataframe.
+
+
+# Data model using built-in Python dataclasses
+@vectorstoremodel
+@dataclass
+class DataModelDataclass:
+    vector: Annotated[list[float], VectorStoreRecordVectorField]
+    key: Annotated[str, VectorStoreRecordKeyField()] = field(default_factory=lambda: str(uuid4()))
+    content: Annotated[str, VectorStoreRecordDataField(has_embedding=True, embedding_property_name="vector")] = (
+        "content1"
+    )
+    other: str | None = None
+
+
+# Data model using Pydantic BaseModels
+@vectorstoremodel
+class DataModelPydantic(KernelBaseModel):
+    vector: Annotated[list[float], VectorStoreRecordVectorField]
+    key: Annotated[str, VectorStoreRecordKeyField()] = Field(default_factory=lambda: str(uuid4()))
+    content: Annotated[str, VectorStoreRecordDataField(has_embedding=True, embedding_property_name="vector")] = (
+        "content1"
+    )
+    other: str | None = None
+
+
+# Data model using Pydantic BaseModels with mixed annotations (from pydantic and SK)
+@vectorstoremodel
+class DataModelPydanticComplex(KernelBaseModel):
+    vector: Annotated[list[float], VectorStoreRecordVectorField]
+    key: Annotated[str, Field(default_factory=lambda: str(uuid4())), VectorStoreRecordKeyField()]
+    content: Annotated[str, VectorStoreRecordDataField(has_embedding=True, embedding_property_name="vector")] = (
+        "content1"
+    )
+    other: str | None = None
+
+
+# Data model using Python classes
+# This one includes a custom serialize and deserialize method
+@vectorstoremodel
+class DataModelPython:
+    def __init__(
+        self,
+        vector: Annotated[list[float], VectorStoreRecordVectorField],
+        key: Annotated[str, VectorStoreRecordKeyField] = None,
+        content: Annotated[
+            str, VectorStoreRecordDataField(has_embedding=True, embedding_property_name="vector")
+        ] = "content1",
+        other: str | None = None,
+    ):
+        self.vector = vector
+        self.other = other
+        self.key = key or str(uuid4())
+        self.content = content
+
+    def __str__(self) -> str:
+        return f"DataModelPython(vector={self.vector}, key={self.key}, content={self.content}, other={self.other})"
+
+    def serialize(self) -> dict[str, Any]:
+        return {
+            "vector": self.vector,
+            "key": self.key,
+            "content": self.content,
+        }
+
+    @classmethod
+    def deserialize(cls, obj: dict[str, Any]) -> "DataModelPython":
+        return cls(
+            vector=obj["vector"],
+            key=obj["key"],
+            content=obj["content"],
+        )
+
+
+# Data model definition for use with Pandas
+# note the container mode flag, which makes sure that records that are returned are in a container
+# even when requesting a batch of records.
+# There is also a to_dict and from_dict method, which are used to convert the data model to and from a dict,
+# these should be specific to the type used, if using dict as type then these can be left off.
+data_model_definition_pandas = VectorStoreRecordDefinition(
+    fields={
+        "vector": VectorStoreRecordVectorField(property_type="list[float]"),
+        "key": VectorStoreRecordKeyField(property_type="str"),
+        "content": VectorStoreRecordDataField(
+            property_type="str", has_embedding=True, embedding_property_name="vector"
+        ),
+    },
+    container_mode=True,
+    to_dict=lambda record, **_: record.to_dict(orient="records"),
+    from_dict=lambda records, **_: DataFrame(records),
+)
+
+
+if __name__ == "__main__":
+    data_item1 = DataModelDataclass(content="Hello, world!", vector=[1.0, 2.0, 3.0], other=None)
+    data_item2 = DataModelPydantic(content="Hello, world!", vector=[1.0, 2.0, 3.0], other=None)
+    data_item3 = DataModelPydanticComplex(content="Hello, world!", vector=[1.0, 2.0, 3.0], other=None)
+    data_item4 = DataModelPython(content="Hello, world!", vector=[1.0, 2.0, 3.0], other=None)
+    print("Example records:")
+    print(f"DataClass:\n  {data_item1}", end="\n\n")
+    print(f"Pydantic:\n  {data_item2}", end="\n\n")
+    print(f"Pydantic with annotations:\n  {data_item3}", end="\n\n")
+    print(f"Python:\n  {data_item4}", end="\n\n")
+
+    print("Item definitions:")
+    print(f"DataClass:\n  {data_item1.__kernel_vectorstoremodel_definition__}", end="\n\n")
+    print(f"Pydantic:\n  {data_item2.__kernel_vectorstoremodel_definition__}", end="\n\n")
+    print(f"Pydantic with annotations:\n  {data_item3.__kernel_vectorstoremodel_definition__}", end="\n\n")
+    print(f"Python:\n  {data_item4.__kernel_vectorstoremodel_definition__}", end="\n\n")
+    print(f"Definition for use with Pandas:\n  {data_model_definition_pandas}", end="\n\n")
+    if (
+        data_item1.__kernel_vectorstoremodel_definition__.fields
+        == data_item2.__kernel_vectorstoremodel_definition__.fields
+        == data_item3.__kernel_vectorstoremodel_definition__.fields
+        == data_item4.__kernel_vectorstoremodel_definition__.fields
+        == data_model_definition_pandas.fields
+    ):
+        print("All data models are the same")
+    else:
+        print("Data models are not the same")
