Scaling to 20M is not working #41

Zsclarx · 2025-07-01T09:05:44Z

Zsclarx
Jul 1, 2025

⚡️ Performance Bottleneck on Large-scale HNSW Querying (20M x 384D vectors)

I'm working on scaling approximate nearest neighbor search using HNSW in Spark on a large dataset of sentence embeddings.

✅ Setup Summary

Data: 20 million records, each with a 384-dimensional embedding (from all-MiniLM-L6-v2)
Cluster Config:
- 12 Executors, each with:
  - 8 cores
  - 64 GB RAM
- 1 Driver with same config

⚙️ Code Snippet

val hnsw = new HnswSimilarity()
  .setIdentifierCol("row_id")
  .setFeaturesCol("normFeatures")  // Manually normalized to unit vectors
  .setDistanceFunction("inner-product")
  .setNumPartitions(224)
  .setM(16)
  .setEfConstruction(100)
  .setEf(30)
  .setK(10)
  .setPredictionCol("neighbors")

val model = hnsw.fit(data)            // ⚡ Fast: ~15 minutes for 20M records
val knnDf = model.transform(data)     // 🐢 Very slow: ~10 hours!
knnDf.cache()
knnDf.count()

❗ The Problem

Index build time is excellent (~15 mins) ✅
But the querying phase (model.transform) is extremely slow (~10 hours) ❌
Reducing partitions causes executor OOM/failures
Dataset is read in advance (no lazy transformations)
Vectors are pre-normalized to avoid on-the-fly overhead

💡 Questions

Why is model.transform() so slow despite fast index construction?
Could this be due to internal broadcast or shuffle mechanisms used by the library?
Are there any best practices for:
- Optimal numPartitions (e.g. per executor/core)?
- Avoiding serialization/deserialization bottlenecks?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Scaling to 20M is not working #41

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Scaling to 20M is not working #41

Uh oh!

Uh oh!

Zsclarx Jul 1, 2025

⚡️ Performance Bottleneck on Large-scale HNSW Querying (20M x 384D vectors)

✅ Setup Summary

⚙️ Code Snippet

❗ The Problem

💡 Questions

🙏 Any ideas, tuning tips, or architectural suggestions are highly appreciated!

Replies: 0 comments

Zsclarx
Jul 1, 2025