feat: Support cuVS-backed CAGRA index building and searching capabilities for Lance Dataset #9

@qingfeng-occ

Overview

This issue proposes integrating the CAGRA index into Lance-cuvs. Unlike the existing IVF_PQ index, which is based on clustering and quantization, CAGRA is a graph-based index that is built and searched on the GPU, with the aim of higher recall and faster Approximate Nearest Neighbor (ANN) search.

The diagram below shows the call hierarchy: the Lance Python API is the entry point, and it calls the Lance-cuvs Python API to build, load, and search the CAGRA index.

graph TD
    %% Define styles
    classDef lancePythonLayer fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
    classDef rustLayer fill:#fff3e0,stroke:#e65100,stroke-width:2px;
    classDef cLayer fill:#fce4ec,stroke:#880e4f,stroke-width:2px;
    classDef process fill:#ffffff,stroke:#333,stroke-dasharray: 5 5;

    %% Top layer: Lance Python Layer
    A[User Client]
    subgraph LancePy ["Lance Python Layer"]
        direction TB
        A1["dataset.create_index(..., accelerator='cuvs', index_type='CAGRA')"]
        A2["index = dataset.load_index()"]
        A3["dataset.to_table(nearest=..., index=index)"]
        style A1 fill:#fff,stroke:#333
        style A2 fill:#fff,stroke:#333
        style A3 fill:#fff,stroke:#333
    end

    %% Middle layer: Lance-cuvs Python Layer (adapter layer)
    subgraph LanceCuVsPy ["Lance-cuvs Python Layer"]
        direction TB
        B1["tmp_file_path=build_cagra_index()"]
        %%B2["serialize_cagra_index<br>(cagraIndex,tmp_file_path)"]
        B3["(neighbors, distances)=<br>search_cagra_index"]
        B5["deserialize_cagra_index<br>(tmp_file_path)"]
        
        style B1 fill:#fff,stroke:#333
        style B3 fill:#fff,stroke:#333
        style B5 fill:#fff,stroke:#333
    end

    %% Bottom layer: cuVS C Layer
    subgraph CuVsC ["cuVS C Layer"]
        direction TB

        D1[cuvsCagraBuild]
        D2[cuvsCagraSerialize]

        D5[cuvsCagraDeserialize]
        D3[cuvsCagraSearch]
        
        style D1 fill:#fff,stroke:#333
        style D2 fill:#fff,stroke:#333
        style D3 fill:#fff,stroke:#333
        style D5 fill:#fff,stroke:#333
    end

    %% Persistence layer: Lance Rust Layer
    subgraph LanceRust ["Lance Rust Layer"]
    
        C1["lance_file::writer::FileWriter"]
        C3["lance_file::reader::FileReader"]
        style C1 fill:#fff,stroke:#333
        style C3 fill:#fff,stroke:#333
    end

    A -- 1. build index --> A1
    A1 -- 1.1 <br> build CAGRA index and serialize to a tmp file --> B1
    A -- 3. search --> A3
    A -- 2. load index to GPU before search--> A2
    A2 -- 2.1 <br>Restore the Lance format index file so that cuVS can recognize it--> C3

    %% Connections: Lance Python Layer -> Lance-cuvs Python Layer
    
    A2 -- 2.2 <br>deserialize cagra index to GPU --> B5
    A3 --> B3
    
    %% Connections: Lance-cuvs Python Layer -> cuVS C Layer (core changes)

    B1 -- 1.1.1 --> D1
    B1 -- 1.1.2 --> D2
    B3 -- calls --> D3
    B5 -- calls --> D5


    %% Connections: serialization and persistence in the build flow
    %%B1 --> B2
    A1 -- 1.2 <br> write the tmp file to Lance format --> C1
   
    %% Connections: returning data in the search flow
    B3 -- dataset.take --> A3
    
    

    %% Apply styles
    class LancePy,A,A1,A2,A3 lancePythonLayer;
    class LanceCuVsPy,B1,B3,B5 lanceCuvsPythonLayer;
    class CuVsC,D1,D2,D3,D5 cLayer;
    class LanceRust,C1,C3 rustLayer;

Goals

Expose three core Python functions: build_cagra_index, deserialize_cagra_index, and search_cagra_index.

  1. build_cagra_index

    • Logic: Call the cuVS C API cuvsCagraBuild to construct the CAGRA graph on the GPU, then cuvsCagraSerialize to write the index object directly to a temporary file on the host filesystem. The path of that temporary file is returned to the Lance Python layer.

    • Input:

      • Dataset
      • CagraIndexParams
    • Output:

      • temporary index file path
    • Note: The output file is in raw cuVS binary format, not the standard Lance format. We will need to use lance_file::writer::FileWriter on the Lance Rust side to write the file to _indices/${uuid}/index.idx.

  2. deserialize_cagra_index

    • Logic: Deserialize the index file into an index object by calling the cuVS C API cuvsCagraDeserialize.

    • Input:

      • temporary index file path
    • Output:

      • Index
    • Note: Before deserialization, we need to use lance_file::reader::FileReader on the Lance Rust side to read the index files in the _indices directory, strip the Lance-specific file header and footer, and write the remaining bytes to a temporary index file that cuVS can read.

  3. search_cagra_index

    • Logic: Run a parallel nearest-neighbor search on the GPU by calling the cuVS C API cuvsCagraSearch.

    • Input:

      • Index
      • nearest
    • Output:

      • distances: pyarrow.FixedSizeListArray
      • neighbors: pyarrow.FixedSizeListArray
    • Note: We need to use the Lance dataset.take method to fetch the rows for the returned neighbors and convert the CAGRA search results into the pa.Table format that Lance requires.
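
The contract of the three functions above could be shaped roughly as follows. This is a minimal sketch, not the lance-cuvs implementation: `FakeCagraBackend` is a hypothetical CPU stand-in for the cuVS calls (cuvsCagraBuild, cuvsCagraSerialize, cuvsCagraDeserialize, cuvsCagraSearch), it does a brute-force scan instead of a graph search, it writes JSON instead of the raw cuVS binary format, and plain lists stand in for the pyarrow arrays. The `backend` and `k` parameters and the `take_rows` helper are assumptions added for illustration.

```python
import json
import os
import tempfile


class FakeCagraBackend:
    """Hypothetical CPU stand-in for the cuVS C calls (assumption, not cuVS)."""

    def build(self, vectors, params):
        # Real layer: cuvsCagraBuild constructs the graph on the GPU.
        return {"vectors": [list(v) for v in vectors], "params": dict(params)}

    def serialize(self, index, path):
        # Real layer: cuvsCagraSerialize writes the raw cuVS binary format.
        with open(path, "w") as f:
            json.dump(index, f)

    def deserialize(self, path):
        # Real layer: cuvsCagraDeserialize loads the index onto the GPU.
        with open(path) as f:
            return json.load(f)

    def search(self, index, queries, k):
        # Real layer: cuvsCagraSearch. Here: exact squared-L2 brute force.
        neighbors, distances = [], []
        for q in queries:
            scored = sorted(
                (sum((a - b) ** 2 for a, b in zip(q, v)), row_id)
                for row_id, v in enumerate(index["vectors"])
            )[:k]
            neighbors.append([row_id for _, row_id in scored])
            distances.append([d for d, _ in scored])
        return neighbors, distances


def build_cagra_index(vectors, params, backend):
    """Build the index, serialize it to a temp file, and return the path."""
    index = backend.build(vectors, params)
    fd, tmp_file_path = tempfile.mkstemp(suffix=".cagra")
    os.close(fd)
    backend.serialize(index, tmp_file_path)
    return tmp_file_path


def deserialize_cagra_index(tmp_file_path, backend):
    """Turn a temporary index file back into an index object."""
    return backend.deserialize(tmp_file_path)


def search_cagra_index(index, queries, k, backend):
    """Return (neighbors, distances), one list of length k per query."""
    return backend.search(index, queries, k)


def take_rows(rows, neighbor_ids):
    """Stand-in for dataset.take: fetch rows by the returned row ids."""
    return [rows[i] for i in neighbor_ids]


backend = FakeCagraBackend()
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
rows = [{"id": i, "vector": v} for i, v in enumerate(vectors)]

path = build_cagra_index(vectors, {"graph_degree": 2}, backend)
index = deserialize_cagra_index(path, backend)
neighbors, distances = search_cagra_index(index, [[0.9, 0.1]], 2, backend)
hits = take_rows(rows, neighbors[0])
os.remove(path)  # the temporary file is only a hand-off artifact
```

Passing the backend in explicitly keeps the sketch testable without a GPU; the real functions would presumably bind to cuVS directly.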

With these functions in place, the Lance side can delegate the building and searching of the CAGRA index to lance-cuvs when accelerator="cuvs" and index_type="CAGRA" are configured. The Lance side only needs to pass parameters through and convert file formats; it does not contain any cuVS code.
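
The Lance-side glue (parameter passing plus file format conversion) might look roughly like the sketch below. Everything in it is an assumption for illustration: `BUILDERS`, `register_builder`, `create_index`, `store_index_file`, and `restore_index_file` are hypothetical names, the chunked in-memory `store` dict stands in for the file that lance_file::writer::FileWriter would write under _indices/${uuid}/index.idx, and the chunk size is arbitrary.

```python
import os
import tempfile
import uuid

CHUNK_SIZE = 4  # tiny on purpose so the example exercises several chunks


def store_index_file(tmp_file_path, store):
    """Copy the raw index file into the store as chunks; return its uuid."""
    index_uuid = str(uuid.uuid4())
    chunks = []
    with open(tmp_file_path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            chunks.append(chunk)
    store[f"_indices/{index_uuid}/index.idx"] = chunks
    return index_uuid


def restore_index_file(index_uuid, store):
    """Reassemble the stored chunks into a temp file in the raw format."""
    chunks = store[f"_indices/{index_uuid}/index.idx"]
    fd, tmp_file_path = tempfile.mkstemp(suffix=".cagra")
    with os.fdopen(fd, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    return tmp_file_path


# Hypothetical registry: (accelerator, index_type) -> build function.
BUILDERS = {}


def register_builder(accelerator, index_type, build_fn):
    BUILDERS[(accelerator, index_type)] = build_fn


def create_index(vectors, store, accelerator, index_type, **params):
    """Delegate the build, then persist the returned temp file."""
    try:
        build_fn = BUILDERS[(accelerator, index_type)]
    except KeyError:
        raise ValueError(
            f"unsupported accelerator={accelerator!r}, index_type={index_type!r}"
        )
    tmp_file_path = build_fn(vectors, params)
    try:
        return store_index_file(tmp_file_path, store)
    finally:
        os.remove(tmp_file_path)


def fake_build_cagra_index(vectors, params):
    """Placeholder for the lance-cuvs build; writes dummy bytes."""
    fd, path = tempfile.mkstemp(suffix=".cagra")
    with os.fdopen(fd, "wb") as f:
        f.write(b"raw-cuvs-bytes")
    return path


register_builder("cuvs", "CAGRA", fake_build_cagra_index)

store = {}
index_uuid = create_index(
    [[0.0, 1.0]], store, accelerator="cuvs", index_type="CAGRA"
)
restored_path = restore_index_file(index_uuid, store)
with open(restored_path, "rb") as f:
    restored = f.read()
os.remove(restored_path)
```

The round trip returns the bytes unchanged, which is the property the load path depends on: whatever container Lance uses for storage, cuVS must receive exactly the file that cuvsCagraSerialize produced.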
