
[WIP] feat(agent): training-ready data recipes, learnable-value mappers, cross-model similarity#969

Open
yxdyc wants to merge 7 commits into main from agent_data_dev

Conversation

@yxdyc
Collaborator

@yxdyc yxdyc commented Apr 20, 2026

  • Add training-data YAMLs (R0 bridge, R1–R3, CPU-only R3), recipes README, TRAIN_DATA_FIELD_GUIDE.

  • New mappers: cross-model cohorts (exact | normalized_query | simhash_lsh + match_basis), syslog/harness noise, error taxonomy (string evidence leaves), learnable value scorer, safety gate, distill trajectory, rewrite hints, training card (JSON string; forces turbo map).

  • Wire MetaKeys and mapper init exports; extend bad_case / insight / dialog normalize for the training-data stack.

  • Add a training-free before/after report (migration matrix, stage retention, hardness/quality ranking checks).

  • Stabilize HF Arrow meta typing: tool_success_ratio uses -1.0 when undefined; usage total_tokens always int; bad_case respects ratio >= 0.

  • Docs: demos/agent README links; analysis recipe points to R0 + field guide.
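The simhash_lsh cohort mode mentioned above groups near-duplicate queries across models. As a rough illustration of the idea (not the actual mapper implementation; function names and the token-hashing scheme here are assumptions), a minimal SimHash puts near-identical queries within a small Hamming distance of each other:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Toy SimHash: sum signed bit-votes from each token's hash."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Near-duplicate queries should be closer than unrelated ones.
q1 = "list all files in the current directory"
q2 = "list all the files in the current directory"
q3 = "train a neural network on mnist"
near = hamming(simhash(q1), simhash(q2))
far = hamming(simhash(q1), simhash(q3))
assert near < far
```

An LSH layer would then bucket fingerprints by bit-bands so only candidates sharing a band are compared, avoiding an all-pairs scan.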

yxdyc added 3 commits April 20, 2026 16:24
…del similarity

- Add delivery YAMLs (R0 bridge, R1–R3, CPU-only R3), recipes README, DELIVERY_FIELD_GUIDE, and diff_agent_exports.py for export diffs.

- New mappers: cross-model cohorts (exact | normalized_query | simhash_lsh + match_basis), SLS/harness noise, error taxonomy (string evidence leaves), learnable value scorer, safety gate, distill trajectory, rewrite hints, training card (JSON string; forces turbo map).

- Wire MetaKeys and mapper __init__ exports; extend bad_case / insight / dialog normalize for the delivery stack.

- Stabilize HF Arrow meta typing: tool_success_ratio uses -1.0 when undefined; usage total_tokens always int; bad_case respects ratio >= 0.

- Docs: demos/agent README links; analysis recipe points to R0 + field guide.

Made-with: Cursor
Align agent mapper/meta naming with training-dataset semantics by replacing delivery-tier keys and tier-gating params, and rename SLS noise signals to sys_log across recipes, docs, and tests for consistent pipeline usage.

Made-with: Cursor
Align agent mapper/meta naming with training-dataset semantics by replacing delivery-tier keys and tier-gating params, and rename SLS noise signals to sys_log across recipes, docs, and tests for consistent pipeline usage.

Made-with: Cursor
@yxdyc yxdyc requested review from HYLcool and ShenQianli April 20, 2026 09:37
@yxdyc yxdyc added dj:op issues/PRs about some specific OPs agent related to agent dj:post-tuning issues/PRs about post-tuning scenarios labels Apr 20, 2026
@yxdyc yxdyc changed the title feat(agent): training-ready data recipes, learnable-value mappers, cross-model similarity [WIP] feat(agent): training-ready data recipes, learnable-value mappers, cross-model similarity Apr 20, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive suite of new Mapper operators and recipes designed to transform agent interaction logs into high-quality training datasets. Key additions include mappers for cross-model pairing, error taxonomy classification, learnable value scoring, and trajectory distillation using teacher models. The PR also updates existing mappers to support these new features and provides detailed documentation and recipes for end-to-end training data pipelines. A critical issue was identified in the AgentCrossModelPairMapper regarding its scalability, as it currently loads the entire dataset into memory, which may lead to out-of-memory errors for large-scale data processing.

desc=f"{self._name}_add_meta",
)

rows: List[dict] = copy.deepcopy(dataset.to_list())
Contributor


critical

This implementation loads the entire dataset into memory using dataset.to_list(), which will not scale to large datasets and can lead to out-of-memory errors. This is a critical issue for a data processing operator, as it bypasses the streaming capabilities of the underlying datasets library.

For group_key_mode set to exact or normalized_query, this can be implemented more scalably. Consider the following approach:

  1. Use dataset.map to compute the grouping key for each sample and add it as a new column.
  2. Sort the dataset by this new grouping key column using dataset.sort().
  3. Iterate through the sorted dataset to process samples group by group. This avoids loading the entire dataset into memory at once.

For simhash_lsh mode, a global view of the data is harder to avoid. However, the current implementation's memory limitation should be clearly documented in the class docstring, noting that it's only suitable for small to medium-sized datasets. For future scalability, you might consider exploring approximate nearest neighbor (ANN) libraries that support distributed or out-of-core computation.
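The sort-then-group approach suggested above can be sketched in plain Python (a simulation of the idea, not the project's actual `datasets` code; in the real operator the sort would be `dataset.sort()` on a key column added via `dataset.map`):

```python
from itertools import groupby

def iter_groups(rows, key_fn):
    """Yield (key, members) one group at a time.

    Sorting is the only global step; once sorted, grouping is a single
    streaming pass, so only one group is held in memory at a time."""
    rows = sorted(rows, key=key_fn)  # stands in for dataset.sort(key_column)
    for key, members in groupby(rows, key=key_fn):
        yield key, list(members)

rows = [
    {"query": "list files", "model": "qwen-plus"},
    {"query": "train mnist", "model": "gpt-4o"},
    {"query": "list files", "model": "gpt-4o"},
]
groups = {k: g for k, g in iter_groups(rows, lambda r: r["query"])}
assert len(groups["list files"]) == 2
assert len(groups["train mnist"]) == 1
```

This keeps peak memory proportional to the largest single group rather than the whole dataset.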

…-value op

Add transition_report.py and docs; extend bad-case / learnable-value mapper notes.
Rename agent_learnable_value_scorer to agent_learnable_value_mapper (Mapper-consistent).
Fix build_op_doc: parse OP type from markdown section; default translators region before import.

Made-with: Cursor
…ure100s

- agent_cross_model_pair_mapper: full-table grouping via REQUIRES_FULL_DATASET_PASS
  and Ray take_all/apply_full_dataset_annotations; update tests
- ray_dataset: honor full-dataset mappers in Ray mode
- agent_dialog_normalize_mapper: stringify dict tool arguments for HF Arrow
- Move agent_interaction_quality_analysis.yaml under demos/agent/recipes
- Add R0_synthesis_from_pure100s.yaml (normalized_query + agent_request_model)
- Refresh R0–R3 bridge docs and recipe cross-links

Made-with: Cursor
peer_models = sorted(
{str(m) for m in models if m is not None and str(m).strip()},
)
has_contrast = len(members) >= self.min_group_size and len(peer_models) >= 2
Collaborator


Here has_contrast only captures groups with >= 2 distinct models, regardless of version. For example, {model=qwen-plus, version=3.5} and {model=qwen-plus, version=3.6} give len(peer_models) = 1, so has_contrast = False. Feel free to ignore if this is the desired behavior :)
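The distinction raised above can be made concrete with a hypothetical version-aware peer key (field names `model` and `version` are assumptions for illustration):

```python
def peer_key(row, include_version=True):
    """Hypothetical grouping key treating (model, version) as distinct peers."""
    model = str(row.get("model", "")).strip()
    version = str(row.get("version", "")).strip()
    return (model, version) if include_version else (model,)

rows = [{"model": "qwen-plus", "version": "3.5"},
        {"model": "qwen-plus", "version": "3.6"}]

by_model = {peer_key(r, include_version=False) for r in rows}
by_model_version = {peer_key(r) for r in rows}
assert len(by_model) == 1           # current behavior: no contrast
assert len(by_model_version) == 2   # version-aware: contrast detected
```

Whether version-level contrast is wanted depends on whether different versions of the same model should count as distinct peers for pairing.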

