
Bug: Vector Quantization (INT8/BINARY) is fundamentally broken across all dimensions #3052

@tae898

Description


Summary

Both INT8 and BINARY quantization in LSM_VECTOR indexes are currently unusable. They fail with critical errors ranging from storage overflows and negative index exceptions to severe data loss, even at extremely low dimensions (e.g., 4).

  • INT8: Fails with IndexOutOfBoundsException (negative indices) for dims < 16, and IllegalArgumentException (storage overflow) for dims >= 16.
  • BINARY: Successfully builds the index but drops ~50% of vectors and yields ~0% recall due to read-time errors.

Environment

  • Component: LSM_VECTOR Index (JVector integration)
  • Quantization: INT8, BINARY
  • Dimensions: Tested on 4, 8, 16, 32, 64, 100

Symptoms

INT8 Symptoms

  1. Dims 4-8: Fails with IndexOutOfBoundsException accessing negative indices (e.g., -8, -64).
  2. Dims >= 16: Fails with IllegalArgumentException: Variable length (70) quantity is too long (must be <= 63).
  3. Dims >= 32: Fails with IllegalArgumentException: vector dimensions differ.

BINARY Symptoms

  1. All Dims: Logs Filtered out X vectors with deleted/invalid documents (indicating data loss).
  2. Search: Returns near-zero results or fails with NullPointerException.

Analysis

The error message Variable length (70) quantity is too long (must be <= 63) strongly suggests an overflow in the variable-length integer (VInt) encoding used by the underlying storage engine (likely in com.arcadedb.database.Binary or related serialization logic).

It appears that when INT8 quantization is active, the serialized size of the graph node (or a specific field within it) grows beyond the capacity of the variable-length encoding field being used.
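For context, VInt schemes typically pack 7 payload bits per byte with a continuation bit, and reject values needing more than the 63 magnitude bits of a signed long. The sketch below is illustrative only, not ArcadeDB's actual com.arcadedb.database.Binary code; it shows how an overflowed (negative) size computation would trip exactly this kind of guard:

```java
// Minimal LEB128-style variable-length integer encoder. Illustrative sketch,
// NOT ArcadeDB's com.arcadedb.database.Binary implementation; the guard only
// mirrors the spirit of the "must be <= 63" error.
import java.io.ByteArrayOutputStream;

public class VIntSketch {
    // Encodes a non-negative long using 7 payload bits per byte; the high
    // bit of each byte signals that more bytes follow.
    static byte[] encodeVarLong(long value) {
        if (value < 0)
            // A signed long has only 63 magnitude bits; a negative value here
            // usually means an upstream size computation overflowed.
            throw new IllegalArgumentException(
                "Variable length quantity is too long (must be <= 63 bits)");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        do {
            int b = (int) (value & 0x7F);
            value >>>= 7;
            if (value != 0)
                b |= 0x80; // continuation bit: more bytes follow
            out.write(b);
        } while (value != 0);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeVarLong(127).length); // 1 byte (fits in 7 bits)
        System.out.println(encodeVarLong(128).length); // 2 bytes
        // A serialized-size field that overflows int arithmetic goes negative:
        int corruptedSize = Integer.MAX_VALUE + Integer.MAX_VALUE; // wraps to -2
        try {
            encodeVarLong(corruptedSize);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

If the serialized node size for INT8 vectors is computed in a narrower field than needed, it would wrap and be rejected by such a guard regardless of the actual dimensionality.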

  • Dim 16: Fails (storage overflow; recall 0%).
  • Dim 32: Fails (Variable length (70) > 63).
  • Dim 64: Fails.
  • Dim 100: Fails.

This prevents the use of INT8 quantization for any practical vector dimensionality (e.g., 768, 1536) used in modern embeddings.

Note on BINARY Quantization

I also tested BINARY quantization. It behaves differently but is equally broken:

  1. Index Creation: Succeeded for all dimensions (up to 100). It does not hit the "Variable length" storage limit.
  2. Search: Fails immediately with IndexOutOfBoundsException and NullPointerException.

While INT8 fails at write time (storage overflow), BINARY fails at read time (incorrect offset calculation or data corruption during retrieval). Both are unusable for high-dimensional vectors.

Accuracy & Correctness (Dim=4, 8, 16, 32)

Further testing across multiple dimensions confirms that quantization is fundamentally broken. To validate the test harness, we also measured the recall of the unquantized (NONE) index against its own results (ground truth). The NONE index achieved 100% recall in all cases, confirming that the test data and search logic are correct.
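For clarity, recall@K here is the fraction of the ground-truth top-K neighbor ids that the index under test also returns, so the NONE-vs-itself sanity check must be exactly 100%. A minimal sketch (hypothetical helper, not the actual test harness):

```java
// Recall@K: fraction of the ground-truth top-K ids that also appear among the
// first K candidate results. Hypothetical helper, not the issue's harness.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RecallSketch {
    static double recallAtK(List<Integer> groundTruth, List<Integer> results, int k) {
        Set<Integer> truth =
            new HashSet<>(groundTruth.subList(0, Math.min(k, groundTruth.size())));
        long hits = results.stream().limit(k).filter(truth::contains).count();
        return (double) hits / k;
    }

    public static void main(String[] args) {
        List<Integer> truth = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // NONE vs itself: identical lists -> 100% recall (the sanity check).
        System.out.println(recallAtK(truth, truth, 10)); // 1.0
        // A broken quantized index returning mostly wrong ids -> near-zero recall.
        System.out.println(
            recallAtK(truth, List.of(1, 99, 98, 97, 96, 95, 94, 93, 92, 91), 10)); // 0.1
    }
}
```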

Benchmark Results (N=1,000, K=10):

Dim   NONE (Self)   INT8 (vs NONE)   BINARY (vs NONE)   Notes
  4     100.00%          7.00%             0.00%        INT8: Index -8 out of bounds
  8     100.00%          3.50%             4.50%        INT8: Index -64 out of bounds
 16     100.00%          1.50%             3.00%        INT8: Variable length (70) > 63
 32     100.00%          0.00%             3.00%        INT8: Variable length (70) > 63

Logs Analysis

INT8 Errors:

  • Dim 4: Index -8 out of bounds for length 3 (Negative index access).
  • Dim 8: Index -64 out of bounds for length 3 (Negative index access).
  • Dim 16+: IllegalArgumentException - Variable length (70) quantity is too long (Storage overflow).
  • Dim 32: IllegalArgumentException: vector dimensions differ: 65536!=32 (Severe serialization mismatch).

BINARY Errors:

  • All Dims: Filtered out X vectors with deleted/invalid documents (Data loss during indexing).
  • Dim 8+: Error reading vector from offset ...: null (Read failure).

This confirms that INT8 suffers from severe offset calculation errors (negative indices) at very low dimensions and storage overflow at slightly higher dimensions. BINARY consistently fails to retrieve vectors correctly, often reading null or dropping data.
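The 65536 != 32 mismatch is itself suggestive: 65536 = 2^16, and shifting a big-endian int read by a single byte scales the decoded value by a power of 256. A hypothetical java.nio.ByteBuffer illustration of this class of bug (not the actual ArcadeDB/JVector read path):

```java
// Demonstrates how a one-byte offset error corrupts a decoded int field.
// Hypothetical layout; not the actual ArcadeDB/JVector serialization.
import java.nio.ByteBuffer;

public class OffsetSketch {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8); // big-endian by default
        buf.putInt(256); // some preceding field: bytes 00 00 01 00
        buf.putInt(0);   // next field:           bytes 00 00 00 00
        System.out.println(buf.getInt(0)); // correct read: 256
        // Reading the same field one byte late spans into the next field
        // and inflates the value by a factor of 256: 256 -> 65536.
        System.out.println(buf.getInt(1)); // 65536
    }
}
```

The same kind of off-by-N offset arithmetic would also explain the negative array indices (-8, -64) seen at low dimensions.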

Performance Benchmark (Dim=16)

Warning: These performance numbers are for an index that produces incorrect results (0% recall). They are provided only to show the potential speedup if the feature were working.

Benchmark Results (Dim=16, N=10,000):

Metric                 NONE     BINARY   Speedup
Insert Time            0.28s    0.15s    ~1.9x
Index Time             0.12s    0.04s    ~2.7x
Search Latency (Avg)   9.36ms   1.98ms   ~4.7x

Conclusion: BINARY quantization offers a ~4.7x speedup in search latency, but currently yields ~0% accuracy and drops data.

Conclusion

Both INT8 and BINARY quantization are currently unusable.

  1. INT8: Fails with IllegalArgumentException (storage overflow) even at low dimensions (16).
  2. BINARY: Builds the index but drops ~50% of data and yields ~0% recall.
