fix: pass entity_chunks_storage and relation_chunks_storage to all merge_nodes_and_edges calls#250
Conversation
…rge_nodes_and_edges calls Three call sites in processor.py and modalprocessors.py were omitting entity_chunks_storage and relation_chunks_storage (and in modalprocessors.py also full_entities_storage, full_relations_storage, and doc_id). Because these default to None, calls succeeded silently but entity-to-chunk and relation-to-chunk mappings were never persisted for multimodal content, degrading retrieval quality. Fixes HKUDS#241 Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Abdeltoto
left a comment
There was a problem hiding this comment.
Hi @peterCheng123321 👋
Nice catch on this fix. I went through the diff carefully against lightrag/operate.py and noticed that this PR is actually slightly more complete than the parallel #247 and my own #260 — both of those forward the four *_storage kwargs to BaseModalProcessor._process_chunk_for_extraction, but only this PR also forwards doc_id (extracted from chunk_data["full_doc_id"]). doc_id is used inside merge_nodes_and_edges for traceability of which document each entity/relation came from, so omitting it leaves a real gap. Thanks for spotting that.
LGTM 👍
One small cohesion nit (non-blocking): in modalprocessors.py, entity_chunks_storage and relation_chunks_storage are placed after llm_response_cache, while full_entities_storage / full_relations_storage are grouped earlier with global_config. #247 groups all four *_storage kwargs together — the call site reads a bit more uniformly that way. Purely cosmetic, feel free to ignore.
Disclosure: I opened #260 a few hours after this one with a partial version of the same fix (without doc_id). Closing #260 in favor of this PR.
|
Thanks for the detailed review @Abdeltoto and for approving @ashah1992 (the original reporter — glad it addresses the issue fully)! Good point on the cosmetic ordering nit — happy to reorder the kwargs to group all four |
Summary
Fixes #241.
Three
merge_nodes_and_edgescall sites were omitting storage parameters that are required to persist entity-to-chunk and relation-to-chunk mappings for multimodal content:processor.py_process_multimodal_content_individual(~line 756)entity_chunks_storage,relation_chunks_storageprocessor.py_batch_merge_lightrag_style_type_aware(~line 1362)entity_chunks_storage,relation_chunks_storagemodalprocessors.pyBaseModalProcessor._process_chunk_for_extraction(~line 800)full_entities_storage,full_relations_storage,doc_id,entity_chunks_storage,relation_chunks_storageBecause these parameters default to
None, the calls succeeded silently — butkv_store_entity_chunks.jsonandkv_store_relation_chunks.jsonwere never populated for multimodal content, degrading retrieval quality compared to text-only ingestion.For
modalprocessors.py,doc_idis derived fromchunk_data["full_doc_id"]which is already stored when the chunk is created.Test plan
process_document_complete()kv_store_entity_chunks.jsonandkv_store_relation_chunks.jsoncontain entries for multimodal entities