Fix Ray deduplicator shared state by macroguo-ghy · Pull Request #978 · datajuicer/data-juicer

macroguo-ghy · 2026-05-14T04:12:06Z

Summary

Share Ray deduplicator backend state across map_batches tasks for a single execution.
Prepare Ray actor-backed dedup sets before the operator is serialized into Ray tasks, reusing existing actor handles instead of recreating them.
Materialize Ray basic deduplicator stats for all stateful backends, including Redis, before later actions can re-run the lazy stats stage.
Add regression coverage for document deduplication across Ray blocks, repeated executions, Redis materialization signaling, and actor handle reuse.
Fix Ray test helper conversion by using RayDataset.to_list() instead of iterating RayDataset directly.
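
To illustrate the first two points, here is a minimal pure-Python sketch of why dedup state has to be shared across blocks within one execution. It does not use Ray and is not the code from this PR: `SharedSeenSet`, `dedup_block`, and both driver functions are hypothetical names, and each inner list stands in for one block handed to a separate map_batches task.

```python
from __future__ import annotations


class SharedSeenSet:
    """Hypothetical stand-in for an actor- or Redis-backed dedup set."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_unique(self, key: str) -> bool:
        # True the first time a key is observed, False on every later call.
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def dedup_block(block: list[str], backend: SharedSeenSet) -> list[str]:
    # Keep only the documents this backend has not seen before.
    return [doc for doc in block if backend.is_unique(doc)]


def dedup_with_per_task_state(blocks: list[list[str]]) -> list[str]:
    # Problem case: every block gets a fresh backend, so duplicates that
    # span two blocks are never detected.
    return [doc for block in blocks for doc in dedup_block(block, SharedSeenSet())]


def dedup_with_shared_state(blocks: list[list[str]]) -> list[str]:
    # Intended behavior: the backend is created once, before any block is
    # processed, and the same handle is reused for every block.
    backend = SharedSeenSet()
    return [doc for block in blocks for doc in dedup_block(block, backend)]


blocks = [["a", "b"], ["b", "c"], ["a", "d"]]
per_task = dedup_with_per_task_state(blocks)
shared = dedup_with_shared_state(blocks)
```

With per-task state the cross-block duplicates `"b"` and `"a"` survive; with one shared backend only the first occurrence of each document is kept.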

Fixes #971.

Validation

python3 -m pytest tests/ops/deduplicator/test_ray_document_deduplicator.py -q

Result: 9 passed, 10 warnings in 69.43s.

gemini-code-assist

Code Review

This pull request introduces a mechanism to handle stateful operators within Ray datasets by allowing operators to trigger dataset materialization after execution. This change specifically addresses potential issues in deduplication where Ray's lazy re-execution could lead to incorrect results due to persistent state in actors or external backends. The feedback highlights that the RedisBackend should also trigger this materialization to prevent similar state conflicts and suggests refactoring the actor initialization logic to eliminate code duplication.
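
The re-execution hazard described in the review can be sketched without Ray. The toy `LazyStage` class below is an assumption made for illustration, not data-juicer or Ray code: it re-runs its function on every action unless it has been materialized, which mimics how a lazy stage backed by persistent state can return different results on a second pass.

```python
from __future__ import annotations

from typing import Callable, Optional


class LazyStage:
    """Toy lazy dataset: the stage re-runs on each action unless materialized."""

    def __init__(self, rows: list[str], fn: Callable[[list[str]], list[str]]) -> None:
        self._rows = rows
        self._fn = fn
        self._cache: Optional[list[str]] = None

    def materialize(self) -> "LazyStage":
        # Run the stage once and pin its output.
        if self._cache is None:
            self._cache = self._fn(self._rows)
        return self

    def take_all(self) -> list[str]:
        # Without a pinned result, every action re-executes the stage.
        if self._cache is not None:
            return list(self._cache)
        return self._fn(self._rows)


def make_stateful_dedup() -> Callable[[list[str]], list[str]]:
    # `seen` outlives a single run, like state held in an actor or in Redis.
    seen: set[str] = set()

    def dedup(rows: list[str]) -> list[str]:
        kept = []
        for row in rows:
            if row not in seen:
                seen.add(row)
                kept.append(row)
        return kept

    return dedup


rows = ["a", "b", "a", "c"]

lazy = LazyStage(rows, make_stateful_dedup())
first_lazy = lazy.take_all()
second_lazy = lazy.take_all()  # re-runs against state that is already full

pinned = LazyStage(rows, make_stateful_dedup()).materialize()
first_pinned = pinned.take_all()
second_pinned = pinned.take_all()
```

In the unmaterialized case the second action sees every row as a duplicate and returns nothing, while the materialized stage returns the same deduplicated rows both times. Ray Data offers `Dataset.materialize()` for pinning a dataset's blocks; how this PR wires materialization into the operator is not shown here.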

macroguo-ghy added 2 commits May 14, 2026 10:58

fix(ray): share dedup state per execution

09a756f

Fix Ray document deduplicator test dataset conversion

14899c8

macroguo-ghy requested a deployment to Testing May 14, 2026 04:12 — with GitHub Actions Waiting

macroguo-ghy changed the title from "[codex] Fix Ray deduplicator shared state" to "Fix Ray deduplicator shared state" May 14, 2026

gemini-code-assist Bot reviewed May 14, 2026


Comment thread data_juicer/ops/deduplicator/ray_basic_deduplicator.py Outdated

Comment thread data_juicer/ops/deduplicator/ray_basic_deduplicator.py Outdated

Address Ray deduplicator review comments

5dce715

macroguo-ghy requested a deployment to Testing May 14, 2026 06:18 — with GitHub Actions Waiting

macroguo-ghy wants to merge 3 commits into datajuicer:main from macroguo-ghy:codex/fix-ray-dedup-state-971
