Skip to content

Creating NOTUNIQUE String Index on Large Dataset Breaks = Operator #2757

@tae898

Description

@tae898

Creating NOTUNIQUE String Index on Large Dataset Breaks = Operator

Summary

Creating a NOTUNIQUE index on a string property after loading a large dataset (86K+ records) causes the = operator to return 0 results for exact string matches, while the LIKE operator continues to work correctly.

Reproduction

Setup

-- Create schema
CREATE VERTEX TYPE Movie;
CREATE PROPERTY Movie.title STRING;
CREATE PROPERTY Movie.movieId INTEGER;

-- Load large dataset (86,537 movies)
-- ... bulk insert ...

-- Test BEFORE creating index
SELECT FROM Movie WHERE title = 'Toy Story (1995)';  -- Returns 1 result ✓

Trigger Bug

-- Create NOTUNIQUE index on existing data
CREATE INDEX ON Movie (title) NOTUNIQUE;

-- Test AFTER creating index
SELECT FROM Movie WHERE title = 'Toy Story (1995)';  -- Returns 0 results ✗
SELECT FROM Movie WHERE title LIKE 'Toy Story (1995)';  -- Returns 1 result ✓

Test Results

Database Size Before Index After Index Bug Triggered?
9,742 records = works ✓ = works ✓ No
86,537 records = works ✓ = broken ✗ Yes

Detailed Test (86K records)

  • Before index: 5/5 test movies return 1 result with = operator
  • After index: 1/5 test movies return 1 result with = operator (4 return 0 results)
  • LIKE operator: Works correctly in both cases

Environment

  • ArcadeDB version: [embedded in arcadedb-embedded-python]
  • Dataset: MovieLens 25M (86,537 movies)
  • Index type: NOTUNIQUE
  • Property type: STRING

Expected Behavior

The = operator should continue to work correctly after creating an index.

Actual Behavior

After creating NOTUNIQUE index on existing large dataset, the = operator returns 0 results for most exact string matches, while LIKE continues to work.

Workaround

  1. Avoid creating NOTUNIQUE indexes on string properties in large databases
  2. Use LIKE operator for string matching (slower, cannot use index)
  3. Use numeric IDs with UNIQUE indexes instead

Note: FULL_TEXT is NOT an alternative

FULL_TEXT index changes the semantics of the = operator to perform tokenized word search instead of exact matching:

-- Before FULL_TEXT index
SELECT FROM Movie WHERE title = 'Toy Story (1995)';  -- Returns 1 result (exact match) ✓

-- After creating FULL_TEXT index
CREATE INDEX ON Movie (title) FULL_TEXT;
SELECT FROM Movie WHERE title = 'Toy Story (1995)';  -- Returns 1,686 results ✗
-- (matches any movie containing words "Toy", "Story", or "1995")

Result: While FULL_TEXT doesn't break the = operator (no 0 results), it makes it return semantically incorrect results (thousands of partial word matches instead of 1 exact match). The LIKE operator continues to work correctly for exact matching.

Notes

  • The bug is size-dependent: Only occurs with large datasets (86K+ records)
  • Small datasets (9K records) are not affected
  • The bug occurs when index is created after data is loaded
  • Unknown if creating index before loading data prevents the issue

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions