New LSMVector index using JVector open source vector engine#2816
Conversation
…internal map (with threadId)
fix: error after compaction
🧪 CI InsightsHere's what we observed from your CI run for c55fa41. 🟢 All jobs passed!But CI Insights is watching 👀 |
Summary of ChangesHello @lvca, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant enhancement to ArcadeDB's vector search capabilities by integrating the JVector library to provide a new LSM-tree-based vector index. This new Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a significant new feature: a vector index implementation (LSMVectorIndex) based on the JVector library, with an LSM-tree-like architecture for transactional support and compaction. The changes are extensive, touching the core engine, schema management, SQL functions, and data importers. The addition of a comprehensive test suite for the new index is a great plus. However, the review has identified several critical issues that need to be addressed. There's a correctness issue in the ComparableVector class that violates the Comparable contract, which could lead to data loss in transactions. The compaction logic appears to have a potential race condition that could lead to data corruption. Furthermore, the vectorNeighbors SQL function has a hardcoded property name assumption and an inefficient distance recalculation that should be fixed. Finally, the jvector dependency is a release candidate, which poses a risk for production use.
engine/src/main/java/com/arcadedb/index/vector/LSMVectorIndexCompactor.java
Show resolved
Hide resolved
engine/src/main/java/com/arcadedb/query/sql/function/vector/SQLFunctionVectorNeighbors.java
Outdated
Show resolved
Hide resolved
engine/src/main/java/com/arcadedb/query/sql/function/vector/SQLFunctionVectorNeighbors.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
This PR integrates JVector library to provide an LSM-based vector index implementation as an alternative to the existing HNSW approach. The integration introduces LSM-style storage with lazy graph rebuilding, transactional support, and compaction capabilities.
Key changes:
- Adds new
LSMVectorIndeximplementation using JVector library with LSM-style page storage - Implements compaction support via
LSMVectorIndexCompactorandLSMVectorIndexCompacted - Updates import formats and tests to support the new LSM vector index type
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
engine/pom.xml |
Adds JVector 4.0.0-rc.6 dependency |
engine/src/main/java/com/arcadedb/index/vector/LSMVectorIndex.java |
New core LSM vector index implementation with JVector integration |
engine/src/main/java/com/arcadedb/index/vector/LSMVectorIndexCompactor.java |
K-way merge compaction logic for LSM pages |
engine/src/main/java/com/arcadedb/index/vector/LSMVectorIndexCompacted.java |
Compacted immutable page storage implementation |
engine/src/main/java/com/arcadedb/schema/LSMVectorIndexBuilder.java |
Builder for creating LSM vector indexes |
engine/src/main/java/com/arcadedb/schema/Schema.java |
Adds LSM_VECTOR index type enum |
engine/src/main/java/com/arcadedb/schema/LocalSchema.java |
Registers LSM_VECTOR factory handlers |
engine/src/main/java/com/arcadedb/query/sql/parser/CreateIndexStatement.java |
SQL support for creating LSM_VECTOR indexes |
engine/src/main/java/com/arcadedb/query/sql/function/vector/SQLFunctionVectorNeighbors.java |
Extends vectorNeighbors function to support LSMVectorIndex |
integration/src/main/java/com/arcadedb/integration/importer/vector/TextEmbeddingsImporterLSM.java |
New importer using LSM vector indexes |
integration/src/main/java/com/arcadedb/integration/importer/format/*ImporterFormat.java |
Updates import formats to use LSMVectorIndex |
engine/src/test/java/com/arcadedb/index/vector/LSMVectorIndexTest.java |
Comprehensive test coverage for LSM vector index functionality |
integration/src/test/java/com/arcadedb/integration/importer/*IT.java |
Integration tests updated for LSM vector support |
engine/src/main/java/com/arcadedb/serializer/BinarySerializer.java |
Code cleanup with improved type safety |
engine/src/main/java/com/arcadedb/database/TransactionIndexContext.java |
Code cleanup with pattern matching |
network/src/main/java/com/arcadedb/remote/RemoteSchema.java |
Adds buildLSMVectorIndex stub for remote schema |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
...ration/src/main/java/com/arcadedb/integration/importer/vector/TextEmbeddingsImporterLSM.java
Outdated
Show resolved
Hide resolved
engine/src/main/java/com/arcadedb/database/TransactionIndexContext.java
Outdated
Show resolved
Hide resolved
engine/src/main/java/com/arcadedb/index/vector/LSMVectorIndexCompactor.java
Show resolved
Hide resolved
Coverage summary from CodacySee diff coverage on Codacy
Coverage variation details
Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: Diff coverage details
Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: See your quality gate settings Change summary preferences |
* Initial plan * Fix ComparableVector to maintain Comparable contract Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]>
…internal map (with threadId)
fix: error after compaction
* Initial plan * Fix ComparableVector to maintain Comparable contract Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]>
Co-authored-by: Copilot <[email protected]>
…nce recalculation (#2820) * Initial plan * Add findNeighborsFromVector method to LSMVectorIndex to return scores directly Co-authored-by: lvca <[email protected]> * Remove test artifacts and update .gitignore Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]>
* Initial plan * Add mutable byte indicator to vector index pages - Added mutable flag byte at offset 8 in page header (after offsetFreeContent and numberOfEntries) - New pages are created with mutable=1 (actively being written to) - Pages are marked as immutable (mutable=0) when they become full and a new page is created - Updated findLastImmutablePage() to scan from end backwards and stop at first immutable page - Updated all page reading/writing code to account for the mutable byte in header - All vector index tests passing Co-authored-by: lvca <[email protected]> * Remove test database files and update .gitignore Co-authored-by: lvca <[email protected]> * Add constants for page header offsets to improve maintainability - Added OFFSET_FREE_CONTENT, OFFSET_NUM_ENTRIES, OFFSET_MUTABLE, and HEADER_BASE_SIZE constants - Replaced magic numbers throughout the code with named constants - Makes the code more maintainable and self-documenting Co-authored-by: lvca <[email protected]> * Update comments to reference constants instead of hardcoded offsets Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]> Co-authored-by: Luca Garulli <[email protected]>
* Initial plan * Make ID property configurable in LSMVectorIndex Co-authored-by: lvca <[email protected]> * Remove test database files and add to .gitignore Co-authored-by: lvca <[email protected]> * Improve documentation for metadata JSON configuration Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]> Co-authored-by: Luca Garulli <[email protected]>
f76edcf to
eb636d1
Compare
…text.java Co-authored-by: Copilot <[email protected]>
* First version with jvector * Implemented compaction of vector indexes * Added test cases * Fixed compilation problems * Fixed test cases, now all pass * Refactor vector index using the transaction index changes instead of internal map (with threadId) * feat: integrated new vector index with the `database import` command * Supported lsmvector in `vectorNeighbors()` sql function * Upgraded to jvector 4.0.0-rc.6 * Update LSMVectorIndexCompacted.java fix: error after compaction * Fix ComparableVector Comparable contract violation (#2817) * Initial plan * Fix ComparableVector to maintain Comparable contract Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]> * Apply suggestion from @Copilot Co-authored-by: Copilot <[email protected]> * Return similarity scores from LSMVectorIndex to avoid redundant distance recalculation (#2820) * Initial plan * Add findNeighborsFromVector method to LSMVectorIndex to return scores directly Co-authored-by: lvca <[email protected]> * Remove test artifacts and update .gitignore Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]> * Add mutable flag to vector index pages for safe compaction (#2819) * Initial plan * Add mutable byte indicator to vector index pages - Added mutable flag byte at offset 8 in page header (after offsetFreeContent and numberOfEntries) - New pages are created with mutable=1 (actively being written to) - Pages are marked as immutable (mutable=0) when they become full and a new page is created - Updated findLastImmutablePage() to scan from end backwards and stop at first immutable page - Updated all page reading/writing code to account for the mutable byte in header - All vector index tests passing Co-authored-by: lvca <[email protected]> * Remove test database files and update .gitignore Co-authored-by: lvca <[email protected]> * Add constants for page header offsets to improve maintainability - Added OFFSET_FREE_CONTENT, OFFSET_NUM_ENTRIES, OFFSET_MUTABLE, and HEADER_BASE_SIZE constants - Replaced magic numbers throughout the code with named constants - Makes the code more maintainable and self-documenting Co-authored-by: lvca <[email protected]> * Update comments to reference constants instead of hardcoded offsets Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]> Co-authored-by: Luca Garulli <[email protected]> * Make LSMVectorIndex ID property configurable (#2818) * Initial plan * Make ID property configurable in LSMVectorIndex Co-authored-by: lvca <[email protected]> * Remove test database files and add to .gitignore Co-authored-by: lvca <[email protected]> * Improve documentation for metadata JSON configuration Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]> Co-authored-by: Luca Garulli <[email protected]> * First version with jvector * Implemented compaction of vector indexes * Added test cases * Fixed compilation problems * Fixed test cases, now all pass * Refactor vector index using the transaction index changes instead of internal map (with threadId) * feat: integrated new vector index with the `database import` command * Supported lsmvector in `vectorNeighbors()` sql function * Upgraded to jvector 4.0.0-rc.6 * Update LSMVectorIndexCompacted.java fix: error after compaction * Fix ComparableVector Comparable contract violation (#2817) * Initial plan * Fix ComparableVector to maintain Comparable contract Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]> * Apply suggestion from @Copilot Co-authored-by: Copilot <[email protected]> * Return similarity scores from LSMVectorIndex to avoid redundant distance recalculation (#2820) * Initial plan * Add findNeighborsFromVector method to LSMVectorIndex to return scores directly Co-authored-by: lvca <[email protected]> * Remove test artifacts and update .gitignore Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]> * Add mutable flag to vector index pages for safe compaction (#2819) * Initial plan * Add mutable byte indicator to vector index pages - Added mutable flag byte at offset 8 in page header (after offsetFreeContent and numberOfEntries) - New pages are created with mutable=1 (actively being written to) - Pages are marked as immutable (mutable=0) when they become full and a new page is created - Updated findLastImmutablePage() to scan from end backwards and stop at first immutable page - Updated all page reading/writing code to account for the mutable byte in header - All vector index tests passing Co-authored-by: lvca <[email protected]> * Remove test database files and update .gitignore Co-authored-by: lvca <[email protected]> * Add constants for page header offsets to improve maintainability - Added OFFSET_FREE_CONTENT, OFFSET_NUM_ENTRIES, OFFSET_MUTABLE, and HEADER_BASE_SIZE constants - Replaced magic numbers throughout the code with named constants - Makes the code more maintainable and self-documenting Co-authored-by: lvca <[email protected]> * Update comments to reference constants instead of hardcoded offsets Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]> Co-authored-by: Luca Garulli <[email protected]> * Make LSMVectorIndex ID property configurable (#2818) * Initial plan * Make ID property configurable in LSMVectorIndex Co-authored-by: lvca <[email protected]> * Remove test database files and add to .gitignore Co-authored-by: lvca <[email protected]> * Improve documentation for metadata JSON configuration Co-authored-by: lvca <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: lvca <[email protected]> Co-authored-by: Luca Garulli <[email protected]> * fix pre-commit * Update engine/src/main/java/com/arcadedb/database/TransactionIndexContext.java Co-authored-by: Copilot <[email protected]> * Fixed compaction * test: fixed test --------- Co-authored-by: Copilot <[email protected]> Co-authored-by: lvca <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: Roberto Franchini <[email protected]> (cherry picked from commit c470e6d)
What does this PR do?
A brief description of the change being made with this pull request.
Motivation
What inspired you to submit this pull request?
Related issues
A list of issues either fixed, containing architectural discussions, otherwise relevant
for this Pull Request.
Additional Notes
Anything else we should know when reviewing?
Checklist
mvn clean packagecommand