-
-
Notifications
You must be signed in to change notification settings - Fork 91
Description
Problem Description
When creating FULL_TEXT indexes on large datasets, the index build process is completely silent for extended periods (hours), making it impossible to determine if:
- The process is making progress
- The process is stuck/frozen
- How long until completion
This is particularly problematic for datasets with millions of records where the build phase can take 5-10+ hours with zero feedback.
Current Behavior
LSM_TREE indexes provide good logging during compaction:
2025-11-08 15:34:58.737 WARNI [LSMTreeIndex] - Creating a new entry in index 'PostLink_0_...' root page [2024-02-12...]->385
2025-11-08 15:34:59.334 WARNI [LSMTreeIndex] Index 'PostLink_0_...' compacted in 83381ms (keys=5758994 values=6541209...)
FULL_TEXT indexes are silent during the build phase:
15:34:59 - Previous index completes
[5-10 HOURS OF SILENCE - NO LOGS]
23:45:00 - Compaction starts (finally some logs)
Reproduction
-
Create a FULL_TEXT index on a large text field with millions of records:
database.getSchema().createTypeIndex( SchemaImpl.INDEX_TYPE.FULL_TEXT, false, "Post", new String[] { "Body" } );
-
Observe complete silence during the initial build phase (scanning records, tokenizing text, inserting into LSM tree)
-
Only when compaction starts do logs appear
Root Cause
In LSMTreeIndex.java (line 626-657), the build() method has no logging:
public long build(final int buildIndexBatchSize, final BuildIndexCallback callback) {
// ... setup code ...
db.scanBucket(db.getSchema().getBucketById(associatedBucketId).getName(), record -> {
db.getIndexer().addToIndex(LSMTreeIndex.this, record.getIdentity(), (Document) record);
total.incrementAndGet();
// NO LOGGING HERE - could be processing for hours!
if (total.get() % buildIndexBatchSize == 0) {
db.getWrappedDatabaseInstance().commit();
db.getWrappedDatabaseInstance().begin();
}
if (callback != null)
callback.onDocumentIndexed((Document) record, total.get());
return true;
});
// ... cleanup code ...
}For FULL_TEXT indexes, this means:
- Scanning millions of records
- Tokenizing billions of words (via Lucene analyzer)
- Inserting each token into the underlying LSM tree
- All happening silently
Impact on Large Datasets
Based on testing with Stack Overflow dataset:
| Dataset Size | Records | Build Time | Logging |
|---|---|---|---|
| Medium (425K posts) | 425,735 | 3m 25s | ❌ None |
| Large (60M posts) | 59,819,048 | 5-7 hours | ❌ None |
During the 5-7 hour build phase, there is zero indication that work is being done.
Proposed Solution
Add periodic progress logging in LSMTreeIndex.build():
public long build(final int buildIndexBatchSize, final BuildIndexCallback callback) {
checkIsValid();
final AtomicLong total = new AtomicLong();
final long LOG_INTERVAL = 10000; // Log every 10K records
if (propertyNames == null || propertyNames.isEmpty())
throw new IndexException("Cannot rebuild index '" + name + "' because metadata information are missing");
final DatabaseInternal db = getDatabase();
if (status.compareAndSet(INDEX_STATUS.AVAILABLE, INDEX_STATUS.UNAVAILABLE)) {
LogManager.instance().log(this, Level.INFO,
"Building index '%s' on %d properties...", name, propertyNames.size());
final long startTime = System.currentTimeMillis();
db.scanBucket(db.getSchema().getBucketById(associatedBucketId).getName(), record -> {
db.getIndexer().addToIndex(LSMTreeIndex.this, record.getIdentity(), (Document) record);
total.incrementAndGet();
// ADD: Periodic progress logging
if (total.get() % LOG_INTERVAL == 0) {
final long elapsed = System.currentTimeMillis() - startTime;
final double rate = total.get() / (elapsed / 1000.0);
LogManager.instance().log(this, Level.INFO,
"Building index '%s': processed %d records (%.0f records/sec)...",
name, total.get(), rate);
}
if (total.get() % buildIndexBatchSize == 0) {
db.getWrappedDatabaseInstance().commit();
db.getWrappedDatabaseInstance().begin();
}
if (callback != null)
callback.onDocumentIndexed((Document) record, total.get());
return true;
});
// ADD: Completion logging
final long elapsed = System.currentTimeMillis() - startTime;
LogManager.instance().log(this, Level.INFO,
"Completed building index '%s': processed %d records in %dms",
name, total.get(), elapsed);
status.set(INDEX_STATUS.AVAILABLE);
} else
throw new NeedRetryException("Error on building index '" + name + "' because not available");
return total.get();
}Expected Output (After Fix)
2025-11-08 15:35:00 INFO [LSMTreeIndex] Building index 'Post[Body]' on 1 properties...
2025-11-08 15:35:12 INFO [LSMTreeIndex] Building index 'Post[Body]': processed 10000 records (833 records/sec)...
2025-11-08 15:35:24 INFO [LSMTreeIndex] Building index 'Post[Body]': processed 20000 records (833 records/sec)...
2025-11-08 15:35:36 INFO [LSMTreeIndex] Building index 'Post[Body]': processed 30000 records (833 records/sec)...
...
2025-11-08 21:45:23 INFO [LSMTreeIndex] Completed building index 'Post[Body]': processed 59819048 records in 22823000ms
2025-11-08 21:45:24 INFO [LSMTreeIndex] Compacting index 'Post_0_...' (pages=150000 pageSize=262144...)
Benefits
- Progress visibility: Users can monitor index build progress in real-time
- Time estimation: Records/sec rate allows estimating time to completion
- Debugging: Immediately identify if process is stuck vs. just slow
- Consistency: Matches existing logging behavior during compaction phase
- Operations: Easier to monitor production index builds
Environment
- ArcadeDB version: 25.11.1-SNAPSHOT
- Affected index types: FULL_TEXT (and all other index types using
LSMTreeIndex.build()) - Dataset: Stack Overflow (60M documents, 97GB XML)
- Observed on: Python bindings, but affects all language bindings