[WIP] feat(agent): training-ready data recipes, learnable-value mappers, cross-model similarity #969
yxdyc wants to merge 7 commits into
Conversation
…del similarity

- Add delivery YAMLs (R0 bridge, R1–R3, CPU-only R3), recipes README, DELIVERY_FIELD_GUIDE, and diff_agent_exports.py for export diffs.
- New mappers: cross-model cohorts (exact | normalized_query | simhash_lsh + match_basis), SLS/harness noise, error taxonomy (string evidence leaves), learnable value scorer, safety gate, distill trajectory, rewrite hints, training card (JSON string; forces turbo map).
- Wire MetaKeys and mapper __init__ exports; extend bad_case / insight / dialog normalize for the delivery stack.
- Stabilize HF Arrow meta typing: tool_success_ratio uses -1.0 when undefined; usage total_tokens always int; bad_case respects ratio >= 0.
- Docs: demos/agent README links; analysis recipe points to R0 + field guide.

Made-with: Cursor
Align agent mapper/meta naming with training-dataset semantics by replacing delivery-tier keys and tier-gating params, and renaming SLS noise signals to sys_log across recipes, docs, and tests for consistent pipeline usage. Made-with: Cursor
Code Review
This pull request introduces a comprehensive suite of new Mapper operators and recipes designed to transform agent interaction logs into high-quality training datasets. Key additions include mappers for cross-model pairing, error taxonomy classification, learnable value scoring, and trajectory distillation using teacher models. The PR also updates existing mappers to support these new features and provides detailed documentation and recipes for end-to-end training data pipelines. A critical issue was identified in the AgentCrossModelPairMapper regarding its scalability, as it currently loads the entire dataset into memory, which may lead to out-of-memory errors for large-scale data processing.
```python
            desc=f"{self._name}_add_meta",
        )

        rows: List[dict] = copy.deepcopy(dataset.to_list())
```
This implementation loads the entire dataset into memory using dataset.to_list(), which will not scale to large datasets and can lead to out-of-memory errors. This is a critical issue for a data processing operator, as it bypasses the streaming capabilities of the underlying datasets library.
For group_key_mode set to exact or normalized_query, this can be implemented more scalably. Consider the following approach:

1. Use dataset.map to compute the grouping key for each sample and add it as a new column.
2. Sort the dataset by this new grouping key column using dataset.sort().
3. Iterate through the sorted dataset to process samples group by group.

This avoids loading the entire dataset into memory at once.
For simhash_lsh mode, a global view of the data is harder to avoid. However, the current implementation's memory limitation should be clearly documented in the class docstring, noting that it's only suitable for small to medium-sized datasets. For future scalability, you might consider exploring approximate nearest neighbor (ANN) libraries that support distributed or out-of-core computation.
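The map-then-sort-then-stream pattern suggested above can be sketched in plain Python. This is a hypothetical illustration, not the mapper's actual code: `normalized_query_key` is an assumed key function, `sorted()` stands in for `dataset.sort()` (which, unlike an in-memory sort, can spill to disk), and `itertools.groupby` stands in for iterating the sorted dataset group by group.

```python
from itertools import groupby

def normalized_query_key(sample: dict) -> str:
    # Collapse whitespace and lowercase so near-identical queries share a key.
    # (Hypothetical normalization; the real mapper's logic may differ.)
    return " ".join(sample.get("query", "").lower().split())

def iter_groups(samples):
    # 1) compute the grouping key per sample (the dataset.map step),
    # 2) sort by that key (the dataset.sort step),
    # 3) stream one contiguous group at a time instead of holding a
    #    per-key dict over the whole table.
    keyed = sorted(samples, key=normalized_query_key)
    for key, members in groupby(keyed, key=normalized_query_key):
        yield key, list(members)

rows = [
    {"query": "List  open PRs", "model": "m1"},
    {"query": "list open prs", "model": "m2"},
    {"query": "Run the tests", "model": "m1"},
]
groups = {k: len(v) for k, v in iter_groups(rows)}
# groups == {"list open prs": 2, "run the tests": 1}
```

Only one group is materialized at a time, which is the property that matters once the sort itself is delegated to the datasets library.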
…-value op

- Add transition_report.py and docs; extend bad-case / learnable-value mapper notes.
- Rename agent_learnable_value_scorer to agent_learnable_value_mapper (Mapper-consistent).
- Fix build_op_doc: parse OP type from markdown section; default translators region before import.

Made-with: Cursor
…ure100s

- agent_cross_model_pair_mapper: full-table grouping via REQUIRES_FULL_DATASET_PASS and Ray take_all/apply_full_dataset_annotations; update tests
- ray_dataset: honor full-dataset mappers in Ray mode
- agent_dialog_normalize_mapper: stringify dict tool arguments for HF Arrow
- Move agent_interaction_quality_analysis.yaml under demos/agent/recipes
- Add R0_synthesis_from_pure100s.yaml (normalized_query + agent_request_model)
- Refresh R0–R3 bridge docs and recipe cross-links

Made-with: Cursor
```python
peer_models = sorted(
    {str(m) for m in models if m is not None and str(m).strip()},
)
has_contrast = len(members) >= self.min_group_size and len(peer_models) >= 2
```
Here has_contrast only captures groups with >= 2 distinct models, regardless of version. For example, {model=qwen-plus, version=3.5} and {model=qwen-plus, version=3.6} give len(peer_models) == 1, so has_contrast is False. Feel free to ignore if this is the desired behavior :)
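The version-blindness the reviewer describes can be illustrated with a small sketch. Everything here is hypothetical (the `peer_ids` helper, the `model`/`version` field names, and the `model@version` key format are assumptions, not the mapper's API); it only shows how folding the version into the peer identity would change the contrast decision.

```python
def peer_ids(members, include_version: bool):
    # Build the set of distinct peer identities within one cohort.
    ids = set()
    for m in members:
        model = str(m.get("model", "")).strip()
        if not model:
            continue
        if include_version and m.get("version"):
            # Hypothetical composite key: two versions of one model
            # now count as two peers.
            ids.add(f"{model}@{m['version']}")
        else:
            ids.add(model)
    return sorted(ids)

members = [
    {"model": "qwen-plus", "version": "3.5"},
    {"model": "qwen-plus", "version": "3.6"},
]
# Model-only keying: one peer, so has_contrast would be False.
assert peer_ids(members, include_version=False) == ["qwen-plus"]
# Model@version keying: two peers, so the group would qualify.
assert peer_ids(members, include_version=True) == ["qwen-plus@3.5", "qwen-plus@3.6"]
```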
Add training-data YAMLs (R0 bridge, R1–R3, CPU-only R3), recipes README, TRAIN_DATA_FIELD_GUIDE.
New mappers: cross-model cohorts (exact | normalized_query | simhash_lsh + match_basis), syslog/harness noise, error taxonomy (string evidence leaves), learnable value scorer, safety gate, distill trajectory, rewrite hints, training card (JSON string; forces turbo map).
Wire MetaKeys and mapper __init__ exports; extend bad_case / insight / dialog normalize for the training-data stack.
Add a training-free before/after report (migration matrix, stage retention, hardness/quality ranking checks).
Stabilize HF Arrow meta typing: tool_success_ratio uses -1.0 when undefined; usage total_tokens always int; bad_case respects ratio >= 0.
Docs: demos/agent README links; analysis recipe points to R0 + field guide.
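The simhash_lsh grouping mode listed above rests on SimHash fingerprints, which place near-duplicate queries at small Hamming distance. Below is a minimal textbook-style sketch of the idea, not this PR's implementation: the tokenization, the MD5-based token hash, and the 64-bit width are all assumptions for illustration.

```python
import hashlib

def simhash64(text: str) -> int:
    # Classic SimHash: each token votes +1/-1 on every bit position;
    # the sign of each bit's vote total becomes that fingerprint bit.
    votes = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(64):
            votes[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(64):
        if votes[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a: int, b: int) -> int:
    # Near-duplicates share most tokens, hence most votes, hence most bits.
    return bin(a ^ b).count("1")

near = hamming(simhash64("list all open pull requests"),
               simhash64("list all open pull request"))
far = hamming(simhash64("list all open pull requests"),
              simhash64("delete the staging database now"))
assert near < far
```

An LSH layer on top would then bucket fingerprints by bit-bands so only candidates sharing a band are compared, which is what makes the mode usable beyond pairwise comparison.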