Bug: Vector Quantization (INT8/BINARY) is fundamentally broken across all dimensions
Summary
Both INT8 and BINARY quantization in LSM_VECTOR indexes are currently unusable. They fail with critical errors ranging from storage overflows and negative index exceptions to severe data loss, even at extremely low dimensions (e.g., 4).
- INT8: Fails with `IndexOutOfBoundsException` (negative indices) for dims < 16, and `IllegalArgumentException` (storage overflow) for dims >= 16.
- BINARY: Successfully builds the index but drops ~50% of vectors and yields ~0% recall due to read-time errors.
Environment
- Component: `LSM_VECTOR` index (JVector integration)
- Quantization: `INT8`, `BINARY`
- Dimensions: tested on 4, 8, 16, 32, 64, 100
Symptoms
INT8 Symptoms
- Dims 4-8: Fails with `IndexOutOfBoundsException` accessing negative indices (e.g., `-8`, `-64`).
- Dims >= 16: Fails with `IllegalArgumentException: Variable length (70) quantity is too long (must be <= 63)`.
- Dims >= 32: Fails with `IllegalArgumentException: vector dimensions differ`.
BINARY Symptoms
- All dims: Logs `Filtered out X vectors with deleted/invalid documents` (indicating data loss).
- Search: Returns near-zero results or fails with `NullPointerException`.
Analysis
The error message `Variable length (70) quantity is too long (must be <= 63)` strongly suggests an overflow in the variable-length integer (VInt) encoding used by the underlying storage engine (likely in `com.arcadedb.database.Binary` or related serialization logic).
It appears that when INT8 quantization is active, the serialized size of the graph node (or a specific field within it) grows beyond the capacity of the variable-length encoding field being used.
- Dimension 16: fails (0% recall, storage overflow).
- Dimension 32: fails (length 70 > 63).
- Dimension 64: fails.
- Dimension 100: fails.
This prevents the use of INT8 quantization for any practical vector dimensionality (e.g., 768, 1536) used in modern embeddings.
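One way the "must be <= 63" limit could arise is a length-prefixed field whose byte count is packed into 6 bits of a header byte. The sketch below is purely illustrative (the class and method names are hypothetical, not ArcadeDB's actual serialization code), but it reproduces the observed failure mode: once an INT8-quantized node serializes to 70 bytes, it overflows a 63-byte cap.

```java
// Hypothetical illustration of the suspected overflow; these names are NOT
// ArcadeDB's real API. A 6-bit length prefix can describe at most 63 bytes.
public class VLenSketch {
    static byte[] writeLengthPrefixed(final byte[] payload) {
        if (payload.length > 63)
            throw new IllegalArgumentException("Variable length (" + payload.length
                + ") quantity is too long (must be <= 63)");
        final byte[] out = new byte[1 + payload.length];
        out[0] = (byte) payload.length; // only 6 bits are available in this scheme
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }

    public static void main(String[] args) {
        // A 16-byte payload fits; a 70-byte quantized node does not.
        System.out.println(writeLengthPrefixed(new byte[16]).length); // 17
        try {
            writeLengthPrefixed(new byte[70]);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this hypothesis the fix would be either a wider length field or chunking the serialized node, rather than anything in the quantizer itself.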
Note on BINARY Quantization
I also tested BINARY quantization. It behaves differently but is equally broken:
- Index Creation: Succeeded for all dimensions (up to 100). It does not hit the "Variable length" storage limit.
- Search: Fails immediately with `IndexOutOfBoundsException` and `NullPointerException`.
While INT8 fails at write time (storage overflow), BINARY fails at read time (incorrect offset calculation or data corruption during retrieval). Both are unusable for high-dimensional vectors.
Accuracy & Correctness (Dim=4, 8, 16, 32)
Further testing across multiple dimensions confirms that quantization is fundamentally broken. To validate the test harness, we also measured the recall of the unquantized (NONE) index against itself (ground truth). The NONE index achieved 100% recall in all cases, proving the test data and search logic are correct.
Benchmark Results (N=1,000, K=10):
| Dim | NONE (Self) | INT8 (vs NONE) | BINARY (vs NONE) | Notes |
|---|---|---|---|---|
| 4 | 100.00% | 7.00% | 0.00% | INT8: Index -8 out of bounds |
| 8 | 100.00% | 3.50% | 4.50% | INT8: Index -64 out of bounds |
| 16 | 100.00% | 1.50% | 3.00% | INT8: Variable length (70) > 63 |
| 32 | 100.00% | 0.00% | 3.00% | INT8: Variable length (70) > 63 |
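For context, the recall figures above compare the top-K id sets returned by the quantized index against those returned by the unquantized NONE index. A minimal sketch of that metric (the class and method names are ours, not the test harness's):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RecallSketch {
    // recall@K: fraction of ground-truth neighbour ids that the candidate
    // list recovered. Both lists are expected to contain K ids.
    static double recallAtK(final List<Integer> groundTruth, final List<Integer> candidates) {
        final Set<Integer> truth = new HashSet<>(groundTruth);
        int hits = 0;
        for (final Integer id : candidates)
            if (truth.contains(id))
                ++hits;
        return (double) hits / groundTruth.size();
    }
}
```

A correct index (the NONE self-check) returns the ground truth verbatim and scores 100%; INT8 at dim 32 returned no overlapping ids at all, hence 0.00%.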
Logs Analysis
INT8 Errors:
- Dim 4: `Index -8 out of bounds for length 3` (negative index access).
- Dim 8: `Index -64 out of bounds for length 3` (negative index access).
- Dim 16+: `IllegalArgumentException - Variable length (70) quantity is too long` (storage overflow).
- Dim 32: `IllegalArgumentException: vector dimensions differ: 65536!=32` (severe serialization mismatch).
BINARY Errors:
- All dims: `Filtered out X vectors with deleted/invalid documents` (data loss during indexing).
- Dim 8+: `Error reading vector from offset ...: null` (read failure).
This confirms that INT8 suffers from severe offset calculation errors (negative indices) at very low dimensions and storage overflow at slightly higher dimensions. BINARY consistently fails to retrieve vectors correctly, often reading null or dropping data.
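One classic Java pitfall that produces exactly this kind of negative index is sign extension: bytes are signed in Java, so an unsigned offset or length field read without masking goes negative once it exceeds 127. Whether this is the actual root cause here is unconfirmed; the snippet below only illustrates the pattern.

```java
// Illustrative only: not confirmed as the root cause in ArcadeDB.
public class SignExtension {
    public static void main(String[] args) {
        final byte raw = (byte) 0xF8; // stored on disk as unsigned 248
        final int wrong = raw;        // sign-extended to -8 (cf. "Index -8 out of bounds")
        final int right = raw & 0xFF; // masked read recovers 248
        System.out.println(wrong + " vs " + right); // -8 vs 248
    }
}
```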
Performance Benchmark (Dim=16)
Warning: These performance numbers are for an index that produces incorrect results (0% recall). They are provided only to show the potential speedup if the feature were working.
Benchmark Results (Dim=16, N=10,000):
| Metric | NONE | BINARY | Speedup |
|---|---|---|---|
| Insert Time | 0.28s | 0.15s | ~1.9x |
| Index Time | 0.12s | 0.04s | ~2.7x |
| Search Latency (Avg) | 9.36ms | 1.98ms | ~4.7x |
Conclusion: BINARY quantization offers a ~4.7x speedup in search latency, but currently yields ~0% accuracy and drops data.
Conclusion
Both INT8 and BINARY quantization are currently unusable.
- INT8: Fails with `IllegalArgumentException` (storage overflow) even at low dimensions (16).
- BINARY: Builds the index but drops ~50% of data and yields ~0% recall.