New LSMVector index using JVector open source vector engine#2816

Merged
lvca merged 37 commits into main from jvector-integration on Nov 22, 2025

Conversation

@lvca lvca commented Nov 21, 2025

What does this PR do?

A brief description of the change being made with this pull request.

Motivation

What inspired you to submit this pull request?

Related issues

A list of issues either fixed, containing architectural discussions, otherwise relevant
for this Pull Request.

Additional Notes

Anything else we should know when reviewing?

Checklist

  • I have run the build using mvn clean package command
  • My unit tests cover both failure and success scenarios

@lvca lvca requested review from Copilot and robfrank November 21, 2025 16:34
@lvca lvca self-assigned this Nov 21, 2025
@lvca lvca added the enhancement New feature or request label Nov 21, 2025

mergify bot commented Nov 21, 2025

🧪 CI Insights

Here's what we observed from your CI run for c55fa41.

🟢 All jobs passed!

But CI Insights is watching 👀

@gemini-code-assist

Summary of Changes

Hello @lvca, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to ArcadeDB's vector search capabilities by integrating the JVector library to provide a new LSM-tree-based vector index. This new LSMVectorIndex offers efficient approximate nearest neighbor search with full transactional support and automatic compaction, ensuring high performance and scalable storage for vector embeddings. It allows users to define and query vector indexes directly through SQL, and includes updated data importers to streamline the process of loading vector data.

Highlights

  • New LSM-based Vector Index (LSMVectorIndex): Introduced a new LSMVectorIndex implementation leveraging the JVector library for efficient approximate nearest neighbor (ANN) search. This index is designed with LSM-tree principles, supporting transactional operations and compaction for optimal performance and storage.
  • Transactional Support and Compaction: The LSMVectorIndex provides robust transactional support, buffering operations per transaction and persisting changes incrementally. It also includes LSMVectorIndexCompacted and LSMVectorIndexCompactor components to manage immutable pages and perform background compaction, merging and deduplicating vector data to maintain efficiency.
  • Integration with SQL and Schema: The new LSM_VECTOR index type is fully integrated into ArcadeDB's SQL engine, allowing creation via CREATE INDEX ... LSM_VECTOR statements with configurable metadata (dimensions, similarity function, max connections, beam width). The vectorNeighbors SQL function has been extended to support querying this new index type.
  • Improved Importer for Text Embeddings: A new TextEmbeddingsImporterLSM has been introduced, specifically designed to import text embeddings directly into the LSMVectorIndex. Existing GloVeImporterFormat and Word2VecImporterFormat have been updated to utilize this new importer, and the previous TextEmbeddingsImporter has been deprecated.
  • Core Library Updates: The project's pom.xml has been updated to include the jvector library dependency. Minor code modernizations were applied in TransactionIndexContext.java and BinarySerializer.java to leverage newer Java language features like instanceof pattern matching and improve type safety.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant new feature: a vector index implementation (LSMVectorIndex) based on the JVector library, with an LSM-tree-like architecture for transactional support and compaction. The changes are extensive, touching the core engine, schema management, SQL functions, and data importers. The addition of a comprehensive test suite for the new index is a great plus. However, the review has identified several critical issues that need to be addressed. There's a correctness issue in the ComparableVector class that violates the Comparable contract, which could lead to data loss in transactions. The compaction logic appears to have a potential race condition that could lead to data corruption. Furthermore, the vectorNeighbors SQL function has a hardcoded property name assumption and an inefficient distance recalculation that should be fixed. Finally, the jvector dependency is a release candidate, which poses a risk for production use.
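
The `Comparable` contract concern raised in this review can be illustrated with a minimal sketch. The class `VectorKey` below is hypothetical and is not ArcadeDB's `ComparableVector`; it only shows one way to keep `compareTo`, `equals`, and `hashCode` mutually consistent, so that a sorted collection such as `TreeMap` treats two keys as the same entry only when their contents really match.

```java
import java.util.Arrays;
import java.util.TreeMap;

// Hypothetical wrapper, NOT ArcadeDB's ComparableVector: it illustrates a
// compareTo that is consistent with equals for float[] keys.
final class VectorKey implements Comparable<VectorKey> {
  private final float[] values;

  VectorKey(final float[] values) {
    this.values = values.clone(); // defensive copy keeps the key immutable
  }

  @Override
  public int compareTo(final VectorKey other) {
    // Lexicographic comparison: 0 only when both arrays hold the same content.
    return Arrays.compare(values, other.values);
  }

  @Override
  public boolean equals(final Object obj) {
    return obj instanceof VectorKey other && Arrays.equals(values, other.values);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(values);
  }

  public static void main(final String[] args) {
    final TreeMap<VectorKey, String> pending = new TreeMap<>();
    pending.put(new VectorKey(new float[] { 1f, 2f }), "a");
    pending.put(new VectorKey(new float[] { 1f, 3f }), "b");
    // Same content as the first key, so this replaces "a" instead of adding an entry.
    pending.put(new VectorKey(new float[] { 1f, 2f }), "c");
    System.out.println(pending.size());
  }
}
```

If `compareTo` returned 0 for vectors with different contents (for example, by comparing only hash codes), a `TreeMap` would overwrite one entry with the other, which is the kind of silent loss the review warns about.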

Copilot AI left a comment

Pull request overview

This PR integrates the JVector library to provide an LSM-based vector index implementation as an alternative to the existing HNSW approach. The integration introduces LSM-style storage with lazy graph rebuilding, transactional support, and compaction capabilities.

Key changes:

  • Adds new LSMVectorIndex implementation using JVector library with LSM-style page storage
  • Implements compaction support via LSMVectorIndexCompactor and LSMVectorIndexCompacted
  • Updates import formats and tests to support the new LSM vector index type

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| engine/pom.xml | Adds JVector 4.0.0-rc.6 dependency |
| engine/src/main/java/com/arcadedb/index/vector/LSMVectorIndex.java | New core LSM vector index implementation with JVector integration |
| engine/src/main/java/com/arcadedb/index/vector/LSMVectorIndexCompactor.java | K-way merge compaction logic for LSM pages |
| engine/src/main/java/com/arcadedb/index/vector/LSMVectorIndexCompacted.java | Compacted immutable page storage implementation |
| engine/src/main/java/com/arcadedb/schema/LSMVectorIndexBuilder.java | Builder for creating LSM vector indexes |
| engine/src/main/java/com/arcadedb/schema/Schema.java | Adds LSM_VECTOR index type enum |
| engine/src/main/java/com/arcadedb/schema/LocalSchema.java | Registers LSM_VECTOR factory handlers |
| engine/src/main/java/com/arcadedb/query/sql/parser/CreateIndexStatement.java | SQL support for creating LSM_VECTOR indexes |
| engine/src/main/java/com/arcadedb/query/sql/function/vector/SQLFunctionVectorNeighbors.java | Extends vectorNeighbors function to support LSMVectorIndex |
| integration/src/main/java/com/arcadedb/integration/importer/vector/TextEmbeddingsImporterLSM.java | New importer using LSM vector indexes |
| integration/src/main/java/com/arcadedb/integration/importer/format/*ImporterFormat.java | Updates import formats to use LSMVectorIndex |
| engine/src/test/java/com/arcadedb/index/vector/LSMVectorIndexTest.java | Comprehensive test coverage for LSM vector index functionality |
| integration/src/test/java/com/arcadedb/integration/importer/*IT.java | Integration tests updated for LSM vector support |
| engine/src/main/java/com/arcadedb/serializer/BinarySerializer.java | Code cleanup with improved type safety |
| engine/src/main/java/com/arcadedb/database/TransactionIndexContext.java | Code cleanup with pattern matching |
| network/src/main/java/com/arcadedb/remote/RemoteSchema.java | Adds buildLSMVectorIndex stub for remote schema |
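
For readers unfamiliar with the K-way merge mentioned for `LSMVectorIndexCompactor`, the general technique can be sketched as below. This is not the code from the PR: the `Entry` record, the "newest page wins" rule, and the assumption that each page is sorted by id with unique ids are all illustrative choices made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative K-way merge with deduplication; names and rules are assumptions,
// not taken from LSMVectorIndexCompactor.
final class KWayMergeSketch {
  // Hypothetical entry: an id plus an opaque payload standing in for vector data.
  record Entry(int id, String payload) {}

  // Tracks the current read position inside one input page.
  private record Cursor(int pageIndex, int position, Entry entry) {}

  // Each page must be sorted by id ascending with unique ids; pages later in
  // the list are treated as newer and win when the same id appears twice.
  static List<Entry> merge(final List<List<Entry>> pages) {
    final PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> {
      final int byId = Integer.compare(a.entry().id(), b.entry().id());
      // For equal ids, the cursor from the newer page comes out first.
      return byId != 0 ? byId : Integer.compare(b.pageIndex(), a.pageIndex());
    });

    for (int p = 0; p < pages.size(); p++)
      if (!pages.get(p).isEmpty())
        heap.add(new Cursor(p, 0, pages.get(p).get(0)));

    final List<Entry> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      final Cursor current = heap.poll();
      // Keep only the first (newest) entry seen for each id.
      if (merged.isEmpty() || merged.get(merged.size() - 1).id() != current.entry().id())
        merged.add(current.entry());

      // Advance the cursor of the page we just consumed from.
      final List<Entry> page = pages.get(current.pageIndex());
      final int next = current.position() + 1;
      if (next < page.size())
        heap.add(new Cursor(current.pageIndex(), next, page.get(next)));
    }
    return merged;
  }
}
```

The heap holds at most one cursor per input page, so memory use grows with the number of pages being merged rather than with the number of entries.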

codacy-production bot commented Nov 21, 2025

Coverage summary from Codacy

| Coverage variation | Diff coverage |
| --- | --- |
| -0.74% | 61.57% |

Coverage variation details

| | Coverable lines | Covered lines | Coverage |
| --- | --- | --- | --- |
| Common ancestor commit (57ba7b5) | 73415 | 46804 | 63.75% |
| Head commit (c55fa41) | 74521 (+1106) | 46954 (+150) | 63.01% (-0.74%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
| --- | --- | --- | --- |
| Pull request (#2816) | 1145 | 705 | 61.57% |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%
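
The two formulas quoted in this report can be checked against the reported line counts with a few lines of arithmetic. The class and method names below are made up for this sketch and are not part of Codacy or ArcadeDB.

```java
import java.util.Locale;

// Recomputes the percentages in the Codacy report from its raw line counts.
// Class and method names are illustrative only.
final class CoverageMath {
  // covered / coverable * 100, as stated in the report.
  static double percent(final long covered, final long coverable) {
    return 100.0 * covered / coverable;
  }

  // Formats with two decimals, matching the precision used in the report.
  static String fmt(final double value) {
    return String.format(Locale.ROOT, "%.2f", value);
  }

  public static void main(final String[] args) {
    final double ancestor = percent(46804, 73415); // common ancestor commit
    final double head = percent(46954, 74521);     // head commit
    final double diff = percent(705, 1145);        // lines added or modified by the PR

    System.out.println("ancestor coverage:  " + fmt(ancestor) + "%");
    System.out.println("head coverage:      " + fmt(head) + "%");
    System.out.println("coverage variation: " + fmt(head - ancestor) + "%");
    System.out.println("diff coverage:      " + fmt(diff) + "%");
  }
}
```

The overall figure drops even though 150 more lines are covered, because the 1106 newly coverable lines are covered at a lower rate than the existing code base.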

Copilot AI commented Nov 21, 2025

@lvca I've opened a new pull request, #2817, to work on those changes. Once the pull request is ready, I'll request review from you.

* Initial plan

* Fix ComparableVector to maintain Comparable contract

Co-authored-by: lvca <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: lvca <[email protected]>

Copilot AI commented Nov 21, 2025

@lvca I've opened a new pull request, #2818, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI commented Nov 21, 2025

@lvca I've opened a new pull request, #2819, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI commented Nov 21, 2025

@lvca I've opened a new pull request, #2820, to work on those changes. Once the pull request is ready, I'll request review from you.

@lvca lvca added this to the 25.11.1 milestone Nov 22, 2025
lvca and others added 17 commits November 22, 2025 11:08
fix: error after compaction
* Initial plan

* Fix ComparableVector to maintain Comparable contract

Co-authored-by: lvca <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: lvca <[email protected]>
…nce recalculation (#2820)

* Initial plan

* Add findNeighborsFromVector method to LSMVectorIndex to return scores directly

Co-authored-by: lvca <[email protected]>

* Remove test artifacts and update .gitignore

Co-authored-by: lvca <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: lvca <[email protected]>
* Initial plan

* Add mutable byte indicator to vector index pages

- Added mutable flag byte at offset 8 in page header (after offsetFreeContent and numberOfEntries)
- New pages are created with mutable=1 (actively being written to)
- Pages are marked as immutable (mutable=0) when they become full and a new page is created
- Updated findLastImmutablePage() to scan from end backwards and stop at first immutable page
- Updated all page reading/writing code to account for the mutable byte in header
- All vector index tests passing
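
The header layout described in the bullets above can be sketched as follows. Only the mutable byte at offset 8 and the constant names come from the commit messages; the 4-byte width of the two preceding fields, the value of `HEADER_BASE_SIZE`, the `VectorPageHeader` class, and the use of `ByteBuffer` in place of ArcadeDB's page classes are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Illustrative only: constant names follow the commit message, but the field
// widths, HEADER_BASE_SIZE value and helper methods are assumptions.
final class VectorPageHeader {
  static final int OFFSET_FREE_CONTENT = 0; // assumed 4-byte int
  static final int OFFSET_NUM_ENTRIES  = 4; // assumed 4-byte int
  static final int OFFSET_MUTABLE      = 8; // 1 byte, per the commit message
  static final int HEADER_BASE_SIZE    = 9; // assumed: 4 + 4 + 1

  // New pages start as mutable (1): they are still being written to.
  static ByteBuffer newPage(final int pageSize) {
    final ByteBuffer page = ByteBuffer.allocate(pageSize);
    page.putInt(OFFSET_FREE_CONTENT, HEADER_BASE_SIZE); // placeholder value for this sketch
    page.putInt(OFFSET_NUM_ENTRIES, 0);
    page.put(OFFSET_MUTABLE, (byte) 1);
    return page;
  }

  // Called when a page is full and a new page takes over the writes.
  static void markImmutable(final ByteBuffer page) {
    page.put(OFFSET_MUTABLE, (byte) 0);
  }

  static boolean isMutable(final ByteBuffer page) {
    return page.get(OFFSET_MUTABLE) == 1;
  }

  // Scans from the last page backwards and stops at the first immutable page.
  // Returns its index, or -1 when every page is still mutable.
  static int findLastImmutablePage(final List<ByteBuffer> pages) {
    for (int i = pages.size() - 1; i >= 0; i--)
      if (!isMutable(pages.get(i)))
        return i;
    return -1;
  }
}
```

With a flag like this, a compactor can restrict itself to pages up to the returned index and leave the page that is still receiving writes untouched.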

Co-authored-by: lvca <[email protected]>

* Remove test database files and update .gitignore

Co-authored-by: lvca <[email protected]>

* Add constants for page header offsets to improve maintainability

- Added OFFSET_FREE_CONTENT, OFFSET_NUM_ENTRIES, OFFSET_MUTABLE, and HEADER_BASE_SIZE constants
- Replaced magic numbers throughout the code with named constants
- Makes the code more maintainable and self-documenting

Co-authored-by: lvca <[email protected]>

* Update comments to reference constants instead of hardcoded offsets

Co-authored-by: lvca <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: lvca <[email protected]>
Co-authored-by: Luca Garulli <[email protected]>
* Initial plan

* Make ID property configurable in LSMVectorIndex

Co-authored-by: lvca <[email protected]>

* Remove test database files and add to .gitignore

Co-authored-by: lvca <[email protected]>

* Improve documentation for metadata JSON configuration

Co-authored-by: lvca <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: lvca <[email protected]>
Co-authored-by: Luca Garulli <[email protected]>
@robfrank robfrank force-pushed the jvector-integration branch from f76edcf to eb636d1 Compare November 22, 2025 16:13
@lvca lvca removed the request for review from robfrank November 22, 2025 16:14
@lvca lvca changed the title JVector integration New LSMVector index using JVector open source vector engine Nov 22, 2025
@lvca lvca merged commit c470e6d into main Nov 22, 2025
16 of 19 checks passed
@robfrank robfrank linked an issue Nov 25, 2025 that may be closed by this pull request
4 tasks
robfrank pushed a commit that referenced this pull request Feb 11, 2026
* First version with jvector

* Implemented compaction of vector indexes

* Added test cases

* Fixed compilation problems

* Fixed test cases, now all pass

* Refactor vector index using the transaction index changes instead of internal map (with threadId)

* feat: integrated new vector index with the `database import` command

* Supported lsmvector in `vectorNeighbors()` sql function

* Upgraded to jvector 4.0.0-rc.6

* Update LSMVectorIndexCompacted.java

fix: error after compaction

* Fix ComparableVector Comparable contract violation (#2817)

* Initial plan

* Fix ComparableVector to maintain Comparable contract

Co-authored-by: lvca <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: lvca <[email protected]>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <[email protected]>

* Return similarity scores from LSMVectorIndex to avoid redundant distance recalculation (#2820)

* Initial plan

* Add findNeighborsFromVector method to LSMVectorIndex to return scores directly

Co-authored-by: lvca <[email protected]>

* Remove test artifacts and update .gitignore

Co-authored-by: lvca <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: lvca <[email protected]>

* Add mutable flag to vector index pages for safe compaction (#2819)

* Initial plan

* Add mutable byte indicator to vector index pages

- Added mutable flag byte at offset 8 in page header (after offsetFreeContent and numberOfEntries)
- New pages are created with mutable=1 (actively being written to)
- Pages are marked as immutable (mutable=0) when they become full and a new page is created
- Updated findLastImmutablePage() to scan from end backwards and stop at first immutable page
- Updated all page reading/writing code to account for the mutable byte in header
- All vector index tests passing

Co-authored-by: lvca <[email protected]>

* Remove test database files and update .gitignore

Co-authored-by: lvca <[email protected]>

* Add constants for page header offsets to improve maintainability

- Added OFFSET_FREE_CONTENT, OFFSET_NUM_ENTRIES, OFFSET_MUTABLE, and HEADER_BASE_SIZE constants
- Replaced magic numbers throughout the code with named constants
- Makes the code more maintainable and self-documenting

Co-authored-by: lvca <[email protected]>

* Update comments to reference constants instead of hardcoded offsets

Co-authored-by: lvca <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: lvca <[email protected]>
Co-authored-by: Luca Garulli <[email protected]>

* Make LSMVectorIndex ID property configurable (#2818)

* Initial plan

* Make ID property configurable in LSMVectorIndex

Co-authored-by: lvca <[email protected]>

* Remove test database files and add to .gitignore

Co-authored-by: lvca <[email protected]>

* Improve documentation for metadata JSON configuration

Co-authored-by: lvca <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: lvca <[email protected]>
Co-authored-by: Luca Garulli <[email protected]>

* fix pre-commit

* Update engine/src/main/java/com/arcadedb/database/TransactionIndexContext.java

Co-authored-by: Copilot <[email protected]>

* Fixed compaction

* test: fixed test

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: lvca <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Roberto Franchini <[email protected]>

(cherry picked from commit c470e6d)

Development

Successfully merging this pull request may close these issues.

Implement new vector index using JVector library

4 participants