Skip to content

No logging during FULL_TEXT index build phase #2763

@tae898

Description

@tae898

Problem Description

When creating FULL_TEXT indexes on large datasets, the index build process is completely silent for extended periods (hours), making it impossible to determine if:

  • The process is making progress
  • The process is stuck/frozen
  • How long until completion

This is particularly problematic for datasets with millions of records where the build phase can take 5-10+ hours with zero feedback.

Current Behavior

LSM_TREE indexes provide good logging during compaction:

2025-11-08 15:34:58.737 WARNI [LSMTreeIndex] - Creating a new entry in index 'PostLink_0_...' root page [2024-02-12...]->385
2025-11-08 15:34:59.334 WARNI [LSMTreeIndex] Index 'PostLink_0_...' compacted in 83381ms (keys=5758994 values=6541209...)

FULL_TEXT indexes are silent during the build phase:

15:34:59 - Previous index completes
[5-10 HOURS OF SILENCE - NO LOGS]
23:45:00 - Compaction starts (finally some logs)

Reproduction

  1. Create a FULL_TEXT index on a large text field with millions of records:

    database.getSchema().createTypeIndex(
        SchemaImpl.INDEX_TYPE.FULL_TEXT,
        false,
        "Post",
        new String[] { "Body" }
    );
  2. Observe complete silence during the initial build phase (scanning records, tokenizing text, inserting into LSM tree)

  3. Only when compaction starts do logs appear

Root Cause

In LSMTreeIndex.java (line 626-657), the build() method has no logging:

public long build(final int buildIndexBatchSize, final BuildIndexCallback callback) {
    // ... setup code ...
    
    db.scanBucket(db.getSchema().getBucketById(associatedBucketId).getName(), record -> {
        db.getIndexer().addToIndex(LSMTreeIndex.this, record.getIdentity(), (Document) record);
        total.incrementAndGet();
        
        // NO LOGGING HERE - could be processing for hours!
        
        if (total.get() % buildIndexBatchSize == 0) {
            db.getWrappedDatabaseInstance().commit();
            db.getWrappedDatabaseInstance().begin();
        }
        
        if (callback != null)
            callback.onDocumentIndexed((Document) record, total.get());
        
        return true;
    });
    
    // ... cleanup code ...
}

For FULL_TEXT indexes, this means:

  • Scanning millions of records
  • Tokenizing billions of words (via Lucene analyzer)
  • Inserting each token into the underlying LSM tree
  • All happening silently

Impact on Large Datasets

Based on testing with Stack Overflow dataset:

Dataset Size Records Build Time Logging
Medium (425K posts) 425,735 3m 25s ❌ None
Large (60M posts) 59,819,048 5-7 hours ❌ None

During the 5-7 hour build phase, there is zero indication that work is being done.

Proposed Solution

Add periodic progress logging in LSMTreeIndex.build():

public long build(final int buildIndexBatchSize, final BuildIndexCallback callback) {
    checkIsValid();
    final AtomicLong total = new AtomicLong();
    final long LOG_INTERVAL = 10000; // Log every 10K records
    
    if (propertyNames == null || propertyNames.isEmpty())
        throw new IndexException("Cannot rebuild index '" + name + "' because metadata information are missing");
    
    final DatabaseInternal db = getDatabase();
    
    if (status.compareAndSet(INDEX_STATUS.AVAILABLE, INDEX_STATUS.UNAVAILABLE)) {
        
        LogManager.instance().log(this, Level.INFO, 
            "Building index '%s' on %d properties...", name, propertyNames.size());
        
        final long startTime = System.currentTimeMillis();
        
        db.scanBucket(db.getSchema().getBucketById(associatedBucketId).getName(), record -> {
            db.getIndexer().addToIndex(LSMTreeIndex.this, record.getIdentity(), (Document) record);
            total.incrementAndGet();
            
            // ADD: Periodic progress logging
            if (total.get() % LOG_INTERVAL == 0) {
                final long elapsed = System.currentTimeMillis() - startTime;
                final double rate = total.get() / (elapsed / 1000.0);
                LogManager.instance().log(this, Level.INFO, 
                    "Building index '%s': processed %d records (%.0f records/sec)...", 
                    name, total.get(), rate);
            }
            
            if (total.get() % buildIndexBatchSize == 0) {
                db.getWrappedDatabaseInstance().commit();
                db.getWrappedDatabaseInstance().begin();
            }
            
            if (callback != null)
                callback.onDocumentIndexed((Document) record, total.get());
            
            return true;
        });
        
        // ADD: Completion logging
        final long elapsed = System.currentTimeMillis() - startTime;
        LogManager.instance().log(this, Level.INFO, 
            "Completed building index '%s': processed %d records in %dms", 
            name, total.get(), elapsed);
        
        status.set(INDEX_STATUS.AVAILABLE);
        
    } else
        throw new NeedRetryException("Error on building index '" + name + "' because not available");
    
    return total.get();
}

Expected Output (After Fix)

2025-11-08 15:35:00 INFO [LSMTreeIndex] Building index 'Post[Body]' on 1 properties...
2025-11-08 15:35:12 INFO [LSMTreeIndex] Building index 'Post[Body]': processed 10000 records (833 records/sec)...
2025-11-08 15:35:24 INFO [LSMTreeIndex] Building index 'Post[Body]': processed 20000 records (833 records/sec)...
2025-11-08 15:35:36 INFO [LSMTreeIndex] Building index 'Post[Body]': processed 30000 records (833 records/sec)...
...
2025-11-08 21:45:23 INFO [LSMTreeIndex] Completed building index 'Post[Body]': processed 59819048 records in 22823000ms
2025-11-08 21:45:24 INFO [LSMTreeIndex] Compacting index 'Post_0_...' (pages=150000 pageSize=262144...)

Benefits

  1. Progress visibility: Users can monitor index build progress in real-time
  2. Time estimation: Records/sec rate allows estimating time to completion
  3. Debugging: Immediately identify if process is stuck vs. just slow
  4. Consistency: Matches existing logging behavior during compaction phase
  5. Operations: Easier to monitor production index builds

Environment

  • ArcadeDB version: 25.11.1-SNAPSHOT
  • Affected index types: FULL_TEXT (and all other index types using LSMTreeIndex.build())
  • Dataset: Stack Overflow (60M documents, 97GB XML)
  • Observed on: Python bindings, but affects all language bindings

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions