feat: impl StructArray -- support embedding searches embeddings in embedding list with element level filter expression#45830

Merged
sre-ci-robot merged 22 commits into milvus-io:master from SpadeA-Tang:element-filter
Dec 15, 2025

@SpadeA-Tang SpadeA-Tang commented Nov 25, 2025

issue: #42148

Because a STRUCT can only appear as the element type of an ARRAY field, a vector field inside a STRUCT is effectively an array of vectors, i.e. an embedding list.
Milvus already supports searching embedding lists with metrics whose names start with the prefix MAX_SIM_.

This PR allows Milvus to search embeddings inside an embedding list using the same metrics as normal embedding fields. Each embedding in the list is treated as an independent vector and participates in ANN search.
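
To illustrate what "each embedding is treated as an independent vector" could mean, here is a minimal brute-force sketch in plain Python. It is not the Milvus implementation: the function names, the use of inner product as the "normal" metric, and the `(score, row_id, element_index)` hit layout are all assumptions made for this example.

```python
from typing import List, Tuple

Vector = List[float]


def inner_product(a: Vector, b: Vector) -> float:
    """Plain inner product, standing in for an ordinary metric such as IP."""
    return sum(x * y for x, y in zip(a, b))


def element_level_search(
    rows: List[List[Vector]], query: Vector, limit: int
) -> List[Tuple[float, int, int]]:
    """Score every embedding of every embedding list on its own.

    Returns up to `limit` hits as (score, row_id, element_index), best first.
    The hit layout is hypothetical, chosen only for this illustration.
    """
    hits = []
    for row_id, embedding_list in enumerate(rows):
        for element_index, embedding in enumerate(embedding_list):
            hits.append((inner_product(query, embedding), row_id, element_index))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:limit]


# Two rows: the first holds two embeddings, the second holds one.
rows = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5]],
]
top = element_level_search(rows, query=[1.0, 0.0], limit=2)
```

In contrast to a MAX_SIM_ metric, which scores an embedding list as a whole, each candidate here is a single element of a list.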

Further, since STRUCT may contain scalar fields that are highly related to the embedding field, this PR introduces an element-level filter expression to refine search results.
The grammar of the element-level filter is:

element_filter(structFieldName, $[subFieldName] == 3)

where $[subFieldName] refers to the value of subFieldName in each element of the STRUCT array structFieldName.

It can be combined with existing filter expressions, for example:

"varcharField == 'aaa' && element_filter(struct_field, $[struct_int] == 3)"
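
One way to read these semantics is sketched below in plain Python. This is an interpretation for illustration only, not the parser or executor from this PR: the helper `element_filter` and the dict-based row are hypothetical, and the assumption is that the regular predicate selects rows while the element-level predicate selects elements inside a selected row.

```python
from typing import Any, Callable, Dict, List

Element = Dict[str, Any]


def element_filter(
    struct_array: List[Element], predicate: Callable[[Element], bool]
) -> List[int]:
    """Return the indices of the struct elements that satisfy `predicate`."""
    return [i for i, element in enumerate(struct_array) if predicate(element)]


# A single row with a regular scalar field and an array of structs.
row = {
    "varcharField": "aaa",
    "struct_field": [
        {"struct_int": 3, "struct_str": "abc"},
        {"struct_int": 5, "struct_str": "abc"},
        {"struct_int": 3, "struct_str": "xyz"},
    ],
}

# Mirrors: varcharField == 'aaa' && element_filter(struct_field, $[struct_int] == 3)
row_passes = row["varcharField"] == "aaa"
matching = (
    element_filter(row["struct_field"], lambda e: e["struct_int"] == 3)
    if row_passes
    else []
)
```

Under this reading, only the elements at the returned indices would remain candidates for the vector search.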

A full example:

struct_schema = milvus_client.create_struct_field_schema()
struct_schema.add_field("struct_str", DataType.VARCHAR, max_length=65535)
struct_schema.add_field("struct_int", DataType.INT32)
struct_schema.add_field("struct_float_vec", DataType.FLOAT_VECTOR, dim=EMBEDDING_DIM)

schema.add_field(
    "struct_field",
    datatype=DataType.ARRAY,
    element_type=DataType.STRUCT,
    struct_schema=struct_schema,
    max_capacity=1000,
)
...

filter = "varcharField == 'aaa' && element_filter(struct_field, $[struct_int] == 3 && $[struct_str] == 'abc')"
res = milvus_client.search(
    COLLECTION_NAME,
    data=query_embeddings,
    limit=10,
    anns_field="struct_field[struct_float_vec]",
    filter=filter,
    output_fields=["struct_field[struct_int]", "varcharField"],
)

TODO:

  1. When an element_filter expression is used, a regular filter expression must also be present. Remove this restriction.
  2. Implement element_filter expressions in the query.

@sre-ci-robot sre-ci-robot added the area/compilation, area/dependency, and size/XXL labels Nov 25, 2025
@mergify mergify bot added the dco-passed and kind/feature labels Nov 25, 2025
@sre-ci-robot
Contributor

[ci-v2-notice]
Notice: We are gradually rolling out the new ci-v2 system.

  • Legacy CI jobs remain unaffected, you can just ignore ci-v2 if you don't want to run it.
  • Additional "ci-v2/*" checkers will run for this PR to ensure the new ci-v2 system is working as expected.
  • For tests that exist in both v1 and v2, passing in either system is considered PASS.

To rerun ci-v2 checks, comment with:

  • /ci-rerun-code-check // for ci-v2/code-check
  • /ci-rerun-build // for ci-v2/build
  • /ci-rerun-ut-integration // for ci-v2/ut-integration
  • /ci-rerun-ut-go // for ci-v2/ut-go
  • /ci-rerun-ut-cpp // for ci-v2/ut-cpp
  • /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp
  • /ci-rerun-e2e-arm // for ci-v2/e2e-arm [master branch only]
  • /ci-rerun-e2e-default // for ci-v2/e2e-default [master branch only]

If you have any questions or requests, please contact @zhikunyao.

mergify bot commented Nov 25, 2025

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify bot commented Nov 25, 2025

@SpadeA-Tang cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

Signed-off-by: SpadeA <[email protected]>
Signed-off-by: SpadeA <[email protected]>
Signed-off-by: SpadeA <[email protected]>

mergify bot commented Nov 26, 2025

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify bot commented Nov 26, 2025

@SpadeA-Tang cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

Signed-off-by: SpadeA <[email protected]>
Signed-off-by: SpadeA <[email protected]>

codecov bot commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 85.26265% with 303 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.00%. Comparing base (3f063a2) to head (06b62f7).
⚠️ Report is 27 commits behind head on master.

Files with missing lines | Patch % | Lines
internal/core/src/segcore/SegmentGrowingImpl.cpp | 38.67% | 65 Missing ⚠️
internal/core/src/query/PlanProto.cpp | 77.22% | 41 Missing ⚠️
internal/proxy/search_reduce_util.go | 20.00% | 21 Missing and 3 partials ⚠️
internal/querynodev2/segments/search_reduce.go | 20.00% | 18 Missing and 6 partials ⚠️
internal/core/src/common/ArrayOffsets.cpp | 86.66% | 18 Missing ⚠️
internal/core/src/plan/PlanNode.h | 55.00% | 18 Missing ⚠️
internal/parser/planparserv2/parser_visitor.go | 88.18% | 10 Missing and 5 partials ⚠️
internal/proxy/task_index.go | 0.00% | 15 Missing ⚠️
...ernal/util/indexparamcheck/vector_index_checker.go | 0.00% | 15 Missing ⚠️
internal/core/src/segcore/reduce/Reduce.cpp | 52.17% | 11 Missing ⚠️
... and 18 more
Additional details and impacted files

@@            Coverage Diff             @@
##           master   #45830      +/-   ##
==========================================
+ Coverage   73.29%   76.00%   +2.70%     
==========================================
  Files        1369     1905     +536     
  Lines      213893   298568   +84675     
==========================================
+ Hits       156778   226913   +70135     
- Misses      49633    64160   +14527     
- Partials     7482     7495      +13     
Components | Coverage Δ
Client | 78.18% <ø> (ø)
Core | 82.88% <88.91%> (∅)
Go | 73.93% <56.83%> (-0.02%) ⬇️

Files with missing lines | Coverage Δ
internal/core/src/common/ArrayOffsets.h | 100.00% <100.00%> (ø)
internal/core/src/common/ArrayOffsetsTest.cpp | 100.00% <100.00%> (ø)
internal/core/src/common/QueryInfo.h | 100.00% <ø> (ø)
internal/core/src/common/QueryResult.h | 98.55% <100.00%> (ø)
internal/core/src/common/Schema.h | 98.14% <100.00%> (ø)
internal/core/src/common/Utils.h | 97.43% <ø> (ø)
internal/core/src/common/VectorArray.h | 82.44% <100.00%> (ø)
internal/core/src/exec/Driver.cpp | 83.91% <100.00%> (ø)
internal/core/src/exec/QueryContext.h | 88.88% <100.00%> (ø)
...ernal/core/src/exec/expression/BinaryRangeExpr.cpp | 72.96% <100.00%> (ø)
... and 47 more

... and 526 files with indirect coverage changes

Signed-off-by: SpadeA <[email protected]>
const FieldMeta&
Schema::GetFirstArrayFieldInStruct(const std::string& struct_name) const {
// Check cache first
auto cache_it = struct_array_field_cache_.find(struct_name);
Collaborator

I was thinking of caching it in the constructor. With this method we need to worry about concurrency safety.

Contributor Author

Schema is initialized by the default constructor and fields are added one by one, so this cannot be handled in the constructor. I think adding a mutex for struct_array_field_cache_ is fine.
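
The pattern discussed in this thread, a lazily filled cache guarded by a mutex, can be sketched as follows. This is a Python illustration rather than the C++ from the PR; `SchemaSketch`, `add_field`, and `first_field_in_struct` are hypothetical names, and the lookup rule (first field registered under a struct name) is an assumption made for the example.

```python
import threading
from typing import Dict, List, Tuple


class SchemaSketch:
    """Fields are added one by one after construction, so the cache is filled lazily."""

    def __init__(self) -> None:
        self._fields: List[Tuple[str, str]] = []  # (struct_name, field_name)
        self._cache: Dict[str, str] = {}
        self._lock = threading.Lock()  # guards both _fields and _cache

    def add_field(self, struct_name: str, field_name: str) -> None:
        with self._lock:
            self._fields.append((struct_name, field_name))

    def first_field_in_struct(self, struct_name: str) -> str:
        with self._lock:
            # Check the cache first, then fall back to a linear scan.
            if struct_name in self._cache:
                return self._cache[struct_name]
            for owner, field_name in self._fields:
                if owner == struct_name:
                    self._cache[struct_name] = field_name
                    return field_name
        raise KeyError(struct_name)


schema = SchemaSketch()
schema.add_field("struct_field", "struct_int")
schema.add_field("struct_field", "struct_str")
first = schema.first_field_in_struct("struct_field")
```

Holding one lock around both the lookup and the fill keeps concurrent readers from observing a partially updated cache, at the cost of serializing lookups.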

Signed-off-by: SpadeA <[email protected]>
BuildFromSegment(const void* segment, const FieldMeta& field_meta);

private:
const std::vector<int32_t> element_row_ids_;
Collaborator

Remember to report this portion of memory usage to the caching layer. Reach out to @sparknack for details.

@sparknack
Contributor

For the resource management section, LGTM.
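
For readers unfamiliar with this kind of structure, the sketch below shows one plausible way an element-to-row mapping like `element_row_ids_` could be built and how its memory footprint might be estimated. Only the name `element_row_ids_` comes from the diff; the builder function, the `row_offsets` companion array, and the byte estimate are assumptions for illustration, written in Python rather than C++.

```python
from array import array
from typing import List, Tuple


def build_element_row_ids(array_lengths: List[int]) -> Tuple[array, array]:
    """Map each flattened element index to its owning row.

    `array_lengths[r]` is the number of struct elements in row r.
    Returns (element_row_ids, row_offsets); row_offsets[r] is the index of the
    first element of row r, and row_offsets[-1] is the total element count.
    """
    element_row_ids = array("i")
    row_offsets = array("i", [0])
    for row_id, length in enumerate(array_lengths):
        element_row_ids.extend([row_id] * length)
        row_offsets.append(len(element_row_ids))
    return element_row_ids, row_offsets


# Three rows holding 2, 0, and 3 elements respectively.
ids, offsets = build_element_row_ids([2, 0, 3])

# A rough size estimate of the two arrays, the kind of figure that could be
# reported to a resource manager (payload only, ignoring object overhead).
approx_bytes = ids.itemsize * len(ids) + offsets.itemsize * len(offsets)
```

The mapping grows with the total number of elements rather than the number of rows, which is why its memory is worth accounting for.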

Signed-off-by: SpadeA <[email protected]>
@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SpadeA-Tang, zhengbuqian

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: SpadeA <[email protected]>
mergify bot commented Dec 10, 2025

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

Signed-off-by: SpadeA <[email protected]>
Signed-off-by: SpadeA <[email protected]>
Signed-off-by: SpadeA <[email protected]>
Signed-off-by: SpadeA <[email protected]>
@SpadeA-Tang
Contributor Author

/ci-rerun-e2e-default

Signed-off-by: SpadeA <[email protected]>
@SpadeA-Tang
Contributor Author

/ci-rerun-e2e-default

@mergify mergify bot added the ci-passed label Dec 15, 2025
zhagnlu commented Dec 15, 2025

/lgtm

@sre-ci-robot sre-ci-robot merged commit f6f716b into milvus-io:master Dec 15, 2025
16 of 19 checks passed

Labels

approved, area/compilation, area/dependency, area/test, ci-passed, dco-passed, kind/feature, lgtm, sig/testing, size/XXL

5 participants