
Fix Ray deduplicator shared state #978

Draft
macroguo-ghy wants to merge 3 commits into datajuicer:main from macroguo-ghy:codex/fix-ray-dedup-state-971

@macroguo-ghy macroguo-ghy commented May 14, 2026

Summary

  • Share Ray deduplicator backend state across map_batches tasks for a single execution.
  • Prepare Ray actor-backed dedup sets before the operator is serialized into Ray tasks, reusing existing actor handles instead of recreating them.
  • Materialize Ray basic deduplicator stats for all stateful backends, including Redis, before later actions can re-run the lazy stats stage.
  • Add regression coverage for document deduplication across Ray blocks, repeated executions, Redis materialization signaling, and actor handle reuse.
  • Fix Ray test helper conversion by using RayDataset.to_list() instead of iterating RayDataset directly.
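To make the first two bullets concrete, here is a minimal pure-Python sketch of why per-task dedup state misses cross-block duplicates while a single backend prepared up front catches them. It does not use Ray or the Data-Juicer code: `DedupSet`, `text_hash`, and `dedup_batch` are hypothetical names chosen for illustration, and a plain list of lists stands in for Ray blocks.

```python
import hashlib
from typing import List, Set


class DedupSet:
    """Hypothetical stand-in for a shared dedup backend (not the real class)."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def is_unique(self, key: str) -> bool:
        # Returns True only the first time a key is observed.
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def text_hash(text: str) -> str:
    # Hash the document text so the backend stores fixed-size keys.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedup_batch(batch: List[str], backend: DedupSet) -> List[str]:
    # Keep only rows whose hash has not been seen by this backend.
    return [t for t in batch if backend.is_unique(text_hash(t))]


# Three "blocks", standing in for the batches handed to separate tasks.
blocks = [["a", "b"], ["b", "c"], ["a", "d"]]

# Per-task state: each block gets a fresh backend, so cross-block duplicates survive.
per_task = [t for block in blocks for t in dedup_batch(block, DedupSet())]

# Shared state: one backend is prepared once and reused by every block.
shared_backend = DedupSet()
shared = [t for block in blocks for t in dedup_batch(block, shared_backend)]

print(per_task)  # ['a', 'b', 'b', 'c', 'a', 'd']
print(shared)    # ['a', 'b', 'c', 'd']
```

In the real operator the shared backend would be an actor handle or a Redis connection rather than an in-process set, but the requirement is the same: every task in one execution has to talk to the same state.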

Fixes #971.

Validation

python3 -m pytest tests/ops/deduplicator/test_ray_document_deduplicator.py -q

Result: 9 passed, 10 warnings in 69.43s.

@macroguo-ghy macroguo-ghy changed the title from "[codex] Fix Ray deduplicator shared state" to "Fix Ray deduplicator shared state" May 14, 2026
@gemini-code-assist gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request introduces a mechanism to handle stateful operators within Ray datasets by allowing operators to trigger dataset materialization after execution. This change specifically addresses potential issues in deduplication where Ray's lazy re-execution could lead to incorrect results due to persistent state in actors or external backends. The feedback highlights that the RedisBackend should also trigger this materialization to prevent similar state conflicts and suggests refactoring the actor initialization logic to eliminate code duplication.
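The re-execution problem described in this review can be reproduced with a toy model. The sketch below is pure Python and only an illustration: `LazyDataset` is a hypothetical class whose `materialize()` and `to_list()` methods imitate lazy evaluation in general, not Ray's implementation, and the closure-held `seen` set stands in for state kept in an actor or an external backend.

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Toy lazy dataset: the filter re-runs on every action unless materialized."""

    def __init__(self, rows: List[str], keep: Callable[[str], bool]) -> None:
        self._rows = list(rows)
        self._keep = keep
        self._cache: Optional[List[str]] = None

    def materialize(self) -> "LazyDataset":
        # Run the filter exactly once and pin the result.
        if self._cache is None:
            self._cache = [r for r in self._rows if self._keep(r)]
        return self

    def to_list(self) -> List[str]:
        # Without a cached result, every action re-executes the filter.
        if self._cache is not None:
            return list(self._cache)
        return [r for r in self._rows if self._keep(r)]


def make_stateful_filter() -> Callable[[str], bool]:
    seen = set()  # persists across executions, like actor or Redis state

    def keep(row: str) -> bool:
        if row in seen:
            return False
        seen.add(row)
        return True

    return keep


rows = ["a", "b", "a", "c"]

lazy = LazyDataset(rows, make_stateful_filter())
first_lazy = lazy.to_list()    # first run sees fresh state
second_lazy = lazy.to_list()   # re-run finds every key already recorded

pinned = LazyDataset(rows, make_stateful_filter()).materialize()
first_pinned = pinned.to_list()
second_pinned = pinned.to_list()

print(first_lazy, second_lazy)      # ['a', 'b', 'c'] []
print(first_pinned, second_pinned)  # ['a', 'b', 'c'] ['a', 'b', 'c']
```

The second lazy action returns nothing because the persistent state already contains every key, which is the failure mode that materializing right after the stateful stage is meant to avoid.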

Comment thread data_juicer/ops/deduplicator/ray_basic_deduplicator.py Outdated
Comment thread data_juicer/ops/deduplicator/ray_basic_deduplicator.py Outdated

Development

Successfully merging this pull request may close these issues.

RayBasicDeduplicator lazy-loading strategy prevents global deduplication