-
-
Notifications
You must be signed in to change notification settings - Fork 91
Description
Creating NOTUNIQUE String Index on Large Dataset Breaks = Operator
Summary
Creating a NOTUNIQUE index on a string property after loading a large dataset (86K+ records) causes the = operator to return 0 results for exact string matches, while the LIKE operator continues to work correctly.
Reproduction
Setup
-- Create schema
CREATE VERTEX TYPE Movie;
CREATE PROPERTY Movie.title STRING;
CREATE PROPERTY Movie.movieId INTEGER;
-- Load large dataset (86,537 movies)
-- ... bulk insert ...
-- Test BEFORE creating index
SELECT FROM Movie WHERE title = 'Toy Story (1995)'; -- Returns 1 result ✓Trigger Bug
-- Create NOTUNIQUE index on existing data
CREATE INDEX ON Movie (title) NOTUNIQUE;
-- Test AFTER creating index
SELECT FROM Movie WHERE title = 'Toy Story (1995)'; -- Returns 0 results ✗
SELECT FROM Movie WHERE title LIKE 'Toy Story (1995)'; -- Returns 1 result ✓Test Results
| Database Size | Before Index | After Index | Bug Triggered? |
|---|---|---|---|
| 9,742 records | = works ✓ |
= works ✓ |
No |
| 86,537 records | = works ✓ |
= broken ✗ |
Yes |
Detailed Test (86K records)
- Before index: 5/5 test movies return 1 result with
=operator - After index: 1/5 test movies return 1 result with
=operator (4 return 0 results) LIKEoperator: Works correctly in both cases
Environment
- ArcadeDB version: [embedded in arcadedb-embedded-python]
- Dataset: MovieLens 25M (86,537 movies)
- Index type: NOTUNIQUE
- Property type: STRING
Expected Behavior
The = operator should continue to work correctly after creating an index.
Actual Behavior
After creating NOTUNIQUE index on existing large dataset, the = operator returns 0 results for most exact string matches, while LIKE continues to work.
Workaround
- Avoid creating
NOTUNIQUEindexes on string properties in large databases - Use
LIKEoperator for string matching (slower, cannot use index) - Use numeric IDs with
UNIQUEindexes instead
Note: FULL_TEXT is NOT an alternative
FULL_TEXT index changes the semantics of the = operator to perform tokenized word search instead of exact matching:
-- Before FULL_TEXT index
SELECT FROM Movie WHERE title = 'Toy Story (1995)'; -- Returns 1 result (exact match) ✓
-- After creating FULL_TEXT index
CREATE INDEX ON Movie (title) FULL_TEXT;
SELECT FROM Movie WHERE title = 'Toy Story (1995)'; -- Returns 1,686 results ✗
-- (matches any movie containing words "Toy", "Story", or "1995")Result: While FULL_TEXT doesn't break the = operator (no 0 results), it makes it return semantically incorrect results (thousands of partial word matches instead of 1 exact match). The LIKE operator continues to work correctly for exact matching.
Notes
- The bug is size-dependent: Only occurs with large datasets (86K+ records)
- Small datasets (9K records) are not affected
- The bug occurs when index is created after data is loaded
- Unknown if creating index before loading data prevents the issue