
[WIP] feat(agent): training-ready data recipes, learnable-value mappers, cross-model similarity#969

Open
yxdyc wants to merge 7 commits into main from agent_data_dev

Conversation

@yxdyc
Collaborator

@yxdyc yxdyc commented Apr 20, 2026

  • Add training-data YAMLs (R0 bridge, R1–R3, CPU-only R3), recipes README, TRAIN_DATA_FIELD_GUIDE.

  • New mappers: cross-model cohorts (exact | normalized_query | simhash_lsh + match_basis), syslog/harness noise, error taxonomy (string evidence leaves), learnable value scorer, safety gate, distill trajectory, rewrite hints, training card (JSON string; forces turbo map).

  • Wire MetaKeys and mapper init exports; extend bad_case / insight / dialog normalize for the training-data stack.

  • Add a training-free before/after report (migration matrix, stage retention, hardness/quality ranking checks).

  • Stabilize HF Arrow meta typing: tool_success_ratio uses -1.0 when undefined; usage total_tokens always int; bad_case respects ratio >= 0.

  • Docs: demos/agent README links; analysis recipe points to R0 + field guide.
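The simhash_lsh cohort mode mentioned above groups near-duplicate queries across models. As a rough illustration of the idea (not the actual mapper implementation; function names and the token-hashing scheme here are assumptions), a minimal SimHash puts near-identical queries within a small Hamming distance of each other:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Toy SimHash: sum signed bit-votes from each token's hash."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Near-duplicate queries should be closer than unrelated ones.
q1 = "list all files in the current directory"
q2 = "list all the files in the current directory"
q3 = "train a neural network on mnist"
near = hamming(simhash(q1), simhash(q2))
far = hamming(simhash(q1), simhash(q3))
assert near < far
```

An LSH layer would then bucket fingerprints by bit-bands so only candidates sharing a band are compared, avoiding an all-pairs scan.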

yxdyc added 3 commits April 20, 2026 16:24
…del similarity

- Add delivery YAMLs (R0 bridge, R1–R3, CPU-only R3), recipes README, DELIVERY_FIELD_GUIDE, and diff_agent_exports.py for export diffs.

- New mappers: cross-model cohorts (exact | normalized_query | simhash_lsh + match_basis), SLS/harness noise, error taxonomy (string evidence leaves), learnable value scorer, safety gate, distill trajectory, rewrite hints, training card (JSON string; forces turbo map).

- Wire MetaKeys and mapper __init__ exports; extend bad_case / insight / dialog normalize for the delivery stack.

- Stabilize HF Arrow meta typing: tool_success_ratio uses -1.0 when undefined; usage total_tokens always int; bad_case respects ratio >= 0.

- Docs: demos/agent README links; analysis recipe points to R0 + field guide.

Made-with: Cursor
Align agent mapper/meta naming with training-dataset semantics by replacing delivery-tier keys and tier-gating params, and rename SLS noise signals to sys_log across recipes, docs, and tests for consistent pipeline usage.

Made-with: Cursor
Align agent mapper/meta naming with training-dataset semantics by replacing delivery-tier keys and tier-gating params, and rename SLS noise signals to sys_log across recipes, docs, and tests for consistent pipeline usage.

Made-with: Cursor
@yxdyc yxdyc requested review from HYLcool and ShenQianli April 20, 2026 09:37
@yxdyc yxdyc added dj:op issues/PRs about some specific OPs agent related to agent dj:post-tuning issues/PRs about post-tuning scenarios labels Apr 20, 2026
@yxdyc yxdyc changed the title feat(agent): training-ready data recipes, learnable-value mappers, cross-model similarity [WIP] feat(agent): training-ready data recipes, learnable-value mappers, cross-model similarity Apr 20, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive suite of new Mapper operators and recipes designed to transform agent interaction logs into high-quality training datasets. Key additions include mappers for cross-model pairing, error taxonomy classification, learnable value scoring, and trajectory distillation using teacher models. The PR also updates existing mappers to support these new features and provides detailed documentation and recipes for end-to-end training data pipelines. A critical issue was identified in the AgentCrossModelPairMapper regarding its scalability, as it currently loads the entire dataset into memory, which may lead to out-of-memory errors for large-scale data processing.

desc=f"{self._name}_add_meta",
)

rows: List[dict] = copy.deepcopy(dataset.to_list())
Contributor


critical

This implementation loads the entire dataset into memory using dataset.to_list(), which will not scale to large datasets and can lead to out-of-memory errors. This is a critical issue for a data processing operator, as it bypasses the streaming capabilities of the underlying datasets library.

For group_key_mode set to exact or normalized_query, this can be implemented more scalably. Consider the following approach:

  1. Use dataset.map to compute the grouping key for each sample and add it as a new column.
  2. Sort the dataset by this new grouping key column using dataset.sort().
  3. Iterate through the sorted dataset to process samples group by group. This avoids loading the entire dataset into memory at once.

For simhash_lsh mode, a global view of the data is harder to avoid. However, the current implementation's memory limitation should be clearly documented in the class docstring, noting that it's only suitable for small to medium-sized datasets. For future scalability, you might consider exploring approximate nearest neighbor (ANN) libraries that support distributed or out-of-core computation.
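The sort-then-group approach suggested above can be sketched in plain Python (a simulation of the idea, not the project's actual `datasets` code; in the real operator the sort would be `dataset.sort()` on a key column added via `dataset.map`):

```python
from itertools import groupby

def iter_groups(rows, key_fn):
    """Yield (key, members) one group at a time.

    Sorting is the only global step; once sorted, grouping is a single
    streaming pass, so only one group is held in memory at a time."""
    rows = sorted(rows, key=key_fn)  # stands in for dataset.sort(key_column)
    for key, members in groupby(rows, key=key_fn):
        yield key, list(members)

rows = [
    {"query": "list files", "model": "qwen-plus"},
    {"query": "train mnist", "model": "gpt-4o"},
    {"query": "list files", "model": "gpt-4o"},
]
groups = {k: g for k, g in iter_groups(rows, lambda r: r["query"])}
assert len(groups["list files"]) == 2
assert len(groups["train mnist"]) == 1
```

This keeps peak memory proportional to the largest single group rather than the whole dataset.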

…-value op

Add transition_report.py and docs; extend bad-case / learnable-value mapper notes.
Rename agent_learnable_value_scorer to agent_learnable_value_mapper (Mapper-consistent).
Fix build_op_doc: parse OP type from markdown section; default translators region before import.

Made-with: Cursor
…ure100s

- agent_cross_model_pair_mapper: full-table grouping via REQUIRES_FULL_DATASET_PASS
  and Ray take_all/apply_full_dataset_annotations; update tests
- ray_dataset: honor full-dataset mappers in Ray mode
- agent_dialog_normalize_mapper: stringify dict tool arguments for HF Arrow
- Move agent_interaction_quality_analysis.yaml under demos/agent/recipes
- Add R0_synthesis_from_pure100s.yaml (normalized_query + agent_request_model)
- Refresh R0–R3 bridge docs and recipe cross-links

Made-with: Cursor
peer_models = sorted(
{str(m) for m in models if m is not None and str(m).strip()},
)
has_contrast = len(members) >= self.min_group_size and len(peer_models) >= 2
Collaborator


Here has_contrast only captures groups with >= 2 distinct models, regardless of version. For example, {model=qwen-plus, version=3.5} and {model=qwen-plus, version=3.6} give len(peer_models) = 1, so has_contrast = False. Feel free to ignore if this is the desired behavior :)
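The distinction raised above can be made concrete with a hypothetical version-aware peer key (field names `model` and `version` are assumptions for illustration):

```python
def peer_key(row, include_version=True):
    """Hypothetical grouping key treating (model, version) as distinct peers."""
    model = str(row.get("model", "")).strip()
    version = str(row.get("version", "")).strip()
    return (model, version) if include_version else (model,)

rows = [{"model": "qwen-plus", "version": "3.5"},
        {"model": "qwen-plus", "version": "3.6"}]

by_model = {peer_key(r, include_version=False) for r in rows}
by_model_version = {peer_key(r) for r in rows}
assert len(by_model) == 1           # current behavior: no contrast
assert len(by_model_version) == 2   # version-aware: contrast detected
```

Whether version-level contrast is wanted depends on whether different versions of the same model should count as distinct peers for pairing.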

