Skip to content

boazlavon/eg_cfg

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

282 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Execution-Guided Line-by-Line Code Generation

Our paper presents a fundamentally new approach to code generation.

EG-CFG is an inference-time algorithm for code generation that injects real-time execution feedback directly into the modelโ€™s decoding loop. By incorporating dynamic runtime signals during generation, it steers the model toward solutions that are not only syntactically valid, but also functionally correct and executable.

Using the open-source DeepSeek-V3 model, our experiments demonstrate that EG-CFG significantly improves code generation performance, achieving state-of-the-art (SOTA) results across various levels of complexity. This includes foundational problems like MBPP (96.6%) and HumanEval (99.4%), challenging data science tasks on DS-1000 (69.9%), and competitive programming problems on CodeContests (60.6%). Furthermore, EG-CFG establishes new SOTA results on the more rigorous variants, MBPP-ET (73.0%) and HumanEval-ET (89.02%), highlighting the method's robustness and ability to generalize to complex coding challenges.

๐ŸŽ‰ Our paper got accepted to NeurIPS 2025

NeurIPS 2025 arXiv Video Patent


๐Ÿš€ Highlights

๐Ÿ“ˆ State-of-the-art (SOTA) results:

  • MBPP: 96.6%
  • MBPP-ET: 73.0%
  • HumanEval: 99.4%
  • HumanEval-ET: 89.02%
  • DS-1000: 69.9%
  • CodeContests: 60.6%

โœ… Achieved using open models only (DeepSeek-V3-0324)
โšก Real-time execution feedback integrated during decoding
๐Ÿ› ๏ธ Fully configurable pipeline โ€” supports both local and endpoint inference
๐Ÿ” Reproducible and extensible framework for code generation research

๐Ÿ—ž๏ธ Media Coverage

๐Ÿง  Models

EG-CFG supports any causal language model that provides token-level log probabilities. In our experiments, we use two models from the DeepSeek family:

๐Ÿ”น DeepSeek-V3-0324

  • Large-scale foundation model
  • Used via inference endpoint
  • 1.3B parameter instruction-tuned model
  • Suitable for local inference
  • Efficient yet surprisingly strong for Python code generation

๐Ÿ“Š Benchmark Results

MBPP and MBPP-ET

Model Method MBPP (%) MBPP-ET (%) RSR (MBPP) RSR (MBPP-ET)
DeepSeek-Coder 1.3B Baseline LLM 49.4 42.6 0.0 0.0
DeepSeek-Coder 1.3B EG-CFG (Ours) 83.2 59.8 66.79 29.96
DeepSeek-V3-0324 Baseline LLM 82.8 64.8 0.0 0.0
DeepSeek-V3-0324 EG-CFG (Ours) 96.6 73.0 80.23 23.30
GPT-4o LPW 84.4 65.3 N/A N/A
Claude-Sonnet-3.5 QualityFlow 94.2 N/A N/A N/A
GPT-4 MetaGPT 87.7 N/A N/A N/A

HumanEval and HumanEval-ET

Model Method HumanEval (%) HumanEval-ET (%) RSR (HE) RSR (HE-ET)
DeepSeek-V3-0324 Baseline LLM 82.92 79.20 0.0 0.0
DeepSeek-V3-0324 EG-CFG (Ours) 99.4 89.02 94.04 47.21
DeepSeek-V3-0324 MapCoder 96.95 81.70 81.88 12.02
DeepSeek-V3-0324 MGDebugger 87.20 81.09 25.39 9.44
DeepSeek-V3-0324 LPW 95.12 84.74 68.02 26.89
GPT-4o LPW 98.2 84.8 N/A N/A

CodeContests

Model Method Accuracy (%) RSR (%)
DeepSeek-V3-0324 Baseline LLM 41.81 0.00
DeepSeek-V3-0324 EG-CFG (Ours) 60.6 32.29
DeepSeek-V3-0324 MapCoder 50.30 14.59
GPT-4o LPW 34.7 N/A
GPT-4o LDB 29.3 N/A
GPT-4 CodeSim 29.1 N/A
GPT-4 MapCoder 28.5 N/A
GPT-3.5 Turbo CodeSim 16.4 N/A
GPT-3.5 Turbo MapCoder 12.7 N/A
MoTCoder-15B MoTCoder 26.34 N/A

DS-1000

Model Method Accuracy (%) RSR (%)
DeepSeek-V3-0324 EG-CFG (Ours) 69.9 50.73
DeepSeek-V3-0324 Baseline LLM 38.9 0.00
GPT-4 CONLINE 68.0 N/A
GPT-4 Baseline LLM 60.2 N/A
GPT-3.5 Turbo SelfEvolve 57.1 N/A

RSR: Relative Success Rate = Accuracy gain over baseline normalized to full success. See full tables and ablations in the paper.

Evaluation Limitations

We manually reviewed all 17 MBPP tasks that were not solved by DeepSeek-V3-0324 and found that 9 contain invalid unit tests, with some also having incorrect reference solutions. In these cases, the model-generated code is correct but marked as failed due to flawed benchmark tests. Full details are available in the analysis/mbpp_analysis/ directory.


๐Ÿงฑ Project Structure

eg_cfg/           # Core implementation (EG-CFG inference loop, CFG logic, and prompts)
traces_dumper/    # Tools for extracting execution traces
scripts/          # Entry points for launching and monitoring experiments
configs/          # Configuration files
trials/           # Generated results from inference runs
output/           # Stdout logs from inference runs
data/             # Input data for inference, such as prompts and baseline results
submodules/       # Local submodules (e.g., xpython, trepan, transformers)
environment.yml   # Conda environment definition

โšก Quickstart

git clone --recurse-submodules https://github.com/boazlavon/eg_cfg.git
cd eg_cfg
conda env create -f environment.yml -n eg-cfg-env
conda activate eg-cfg-env
python scripts/redirect_env_to_submodules.py $PWD/submodules/

๐Ÿค– Multi-Agent Launch Options

EG-CFG supports two ways to define and launch multiple agents:

๐Ÿ› ๏ธ Option 1: Launch from Parameter Combinations

Use dynamic_signals_params.json to define a set of decoding parameter values, and automatically launch all combinations.

python eg_cfg/eg_cfg_grid.py \
  --dynamic-signals-params configs/dynamic_signals_params.json \
  --session-config-json configs/session_config.local.json

Example dynamic_signals_params.json:

{
  "t": [0.7, 0.75],         # Sampling Temperatures
  "s": [3],                 # Number of Candidates (Beam Size)
  "d": [2, 3],              # Completion Horizon (lines)
  "k": [1],                 # New Dynamic Signal Frequency (lines)
  "prompt_type": ["deepseek_instruct", "long_code"]
}

This launches one agent for each combination of the above parameters (e.g., 2 ร— 1 ร— 2 ร— 1 ร— 2 = 8 combinations).
All agents run in parallel with full synchronization support.

๐Ÿ› ๏ธ Option 2: Launch from Config Strings

Use dynamic_signals_configs.json to define a list of specific config strings, each representing a complete decoding configuration.

python eg_cfg/eg_cfg_trails.py \
  --trials-json configs/dynamic_signals_configs.json \
  --session-config-json configs/session_config.local.json

Example dynamic_signals_configs.json:

[
  "ns3t0.75d5k1_lci_ln",
  "ns2t0.9d3k1_lci_ln",
  "ns3t1.2d5k3_ln"
]

Each string is automatically parsed into a full configuration. The format includes:

  • ns3 โ†’ 3 candidates (beam size)
  • t0.75 โ†’ sampling temperature = 0.75
  • d5 โ†’ completion horizon = 5 lines
  • k1 โ†’ signal update frequency = every 1 line
  • _ln or _lci_ln suffix โ†’ prompt type (deepseek_instruct or long_code)

This method is best when you want to explicitly control and review the exact configs.

๐Ÿ”ง Session Configuration File

Defines the runtime setup for each session as specified in session_config.local.json or session_config.inference_endpoint.json:

Field Description
model_name Model to use (local path or HuggingFace hub name)
gammas CFG guidance strengths (e.g., [0.0, 0.5, 1.0, 3.0])
deployment_type "local" or "inference_endpoint"
dataset Target dataset: "mbpp", "humaneval", or "CodeContests"
results_dir Root directory for saving results
inference_endpoint_url (if endpoint) API URL for inference
inference_endpoint_api_key (if endpoint) API key for Fireworks
use_global_cache Avoid recomputing same completions
debug_mode Enable logging/debug information
is_prod Run in production mode (disable debug/test toggles)
minimal_trace Use final-state-only traces instead of full step-by-step traces
exec_eval Use the ExecEval evaluation for CodeContests dataset
exec_eval_host_ip IP address of the ExecEval server (used only if exec_eval is true)
exec_eval_host_port Port of the ExecEval server (used only if exec_eval is true)

๐Ÿš€ SLURM Integration for Parallel Inference

To maximize throughput, launch the following script multiple timesโ€”once per available node.
The pipeline supports full synchronization across jobs, so no manual coordination is needed.
Agents will automatically run in parallel.

./scripts/job_runners/inference_sbatch.local.sh
# Or monitor in watch mode
./scripts/job_runners/inference_sbatch.local.sh watch

๐Ÿ“ Results Directory Structure

Each trial is written under the path defined by results_dir in your session config. For example:

{
  "results_dir": "trials/local_results",
  "model_name": "deepseek-ai/deepseek-coder-1.3b-instruct",
  "deployment_type": "local",
  "dataset": "mbpp",
  ...
}

This results in directories like:

trials/local_results/mbpp/deepseek-ai_deepseek-coder-1.3b-instruct/ns2t0.75d2_ln/

The folder name encodes the run configuration:

  • s2 โ†’ 2 candidates
  • t0.75 โ†’ temperature 0.75
  • d2 โ†’ horizon 2 lines
  • _ln or _lci_ln suffix โ†’ prompt type (deepseek_instruct or long_code)

Each config directory contains:

  • One JSON per task and gamma (e.g. task_id=395_gamma=1.0.json)

๐Ÿงช JSON file format

Each file includes:

{
  "code": "...",  # Model-generated Python code
  "results": {
    "assert ...": {
      "result": true,           # Whether test case passed
      "time": 0.123,            # Execution time in seconds
      "error": null             # Any runtime error (or null)
    }
  },
  "passed": true,              # True if all test cases passed
  "accuracy": 1.0,             # Fraction of passed test cases
  "general_error": null,       # Top-level failure unrelated to test cases
  "has_testcase_error": false, # True if any test case raised an exception
  "stats": {
    "start_time": "...",
    "end_time": "...",
    "input_tokens": 1234,      # Total prompt tokens
    "output_tokens": 456,      # Total generated tokens
    "duration": "00:01:23"     # Inference wall-time duration
  }
}

A successful solution is:

  • passed = true
  • accuracy = 1.0

These fields are used for filtering and reporting.


๐Ÿ“ˆ Monitor Results

Solved task outputs are stored under the solved_tasks/ directory within each trial.

For example:
trials/local_results/mbpp/deepseek-ai_deepseek-coder-1.3b-instruct/solved_tasks/

Each JSON file in this directory corresponds to a task that was successfully solved ("passed": true) and includes the final code and execution metadata.

You can iterate over this folder to analyze solved tasks.


๐Ÿ”ง Submodules and Custom Modifications

Some core functionality in EG-CFG relies on custom extensions of external libraries, which are included as Git submodules and redirected into the conda environment via symlinks.

๐Ÿ› ๏ธ Modified transformers Library

In local inference mode, we extend the internal decoding loop of the HuggingFace transformers library to support execution-aware generation. Specifically, our modifications in transformers/generation/utils.py enable token-level integration of runtime feedback, allowing the model to dynamically condition on execution traces as described in Section 3 of the paper. This integration is essential for realizing EG-CFG's line-by-line guidance mechanism during inference.

๐Ÿ› ๏ธ Execution Tracing via trepan-xpy

We use the trepan-xpy debugger to execute partially completed code and extract execution traces during inference. To support our framework, we extended the debugger to emit canonicalized traces โ€” a consistent structure that captures all relevant runtime signals, regardless of whether the execution succeeds or fails. This includes not only variable values and function calls, but also bytecode-level events such as instruction execution, enabling fine-grained introspection. The canonical format allows us to easily manipulate the trace to retain only the information most relevant for guiding generation.

These are included in submodules/ and linked into site-packages/ using:

python scripts/redirect_env_to_submodules.py $PWD/submodules/

๐Ÿ“š Data

We evaluate EG-CFG on three widely used Python code generation benchmarks:

๐Ÿ”น MBPP

The MBPP (Mostly Basic Python Problems) benchmark [Austin et al., 2021] includes 500 Python tasks, each with a natural language description, function name, and 3 unit tests. It is a popular dataset for evaluating basic code generation.

๐Ÿ”น HumanEval

The HumanEval benchmark [Chen et al., 2021] consists of 164 hand-written Python programming tasks with hidden test cases. Each task defines a function signature and problem description, designed to measure functional correctness.

๐Ÿ”น MBPP-ET & HumanEval-ET

We also evaluate on MBPP-ET and HumanEval-ET, extended test suites proposed in CodeScore [Dong et al., 2025]. These enhancements add more challenging edge cases and improve coverage, offering better estimates of real-world generalization.

๐Ÿ”น CodeContests

The CodeContests benchmark [Li et al., 2022] is a suite of competitive programming problems designed to evaluate advanced algorithmic reasoning and problem-solving skills. Each task includes a problem description and multiple hidden test cases. Solutions are evaluated using the ExecEval framework [Khan et al., 2024]. Performance on CodeContests reflects a modelโ€™s robustness and problem-solving depth under competitive constraints.

๐Ÿ”น DS-1000

The DS-1000 benchmark [Lai et al., 2022] is a collection of 1000 data science problems designed to test code generation capabilities on popular libraries like Pandas and NumPy. It provides a challenging evaluation of practical, domain-specific programming skills.

๐Ÿงพ Prompt Format

We use two prompt types to ensure broad and reproducible evaluation:

๐Ÿ”น Official Few-Shot Prompt (DeepSeek-Coder)

We adopt the official evaluation prompt provided by DeepSeek-Coderโ€™s GitHub [Guo et al., 2024]:

๐Ÿ”น Long-Code Prompt (ours)

In addition, we introduce a long-code instruction-only prompt that:

  • Encourages line-by-line, traceable completions
  • Follows stylistic constraints aligned with dynamic execution trace extraction
  • Designed for EG-CFGโ€™s runtime-guided generation
  • Detailed in Appendix A of our paper

โ˜๏ธ Inference Endpoint

For large-scale model inference (e.g., using DeepSeek-V3-0324), we use Fireworks.ai as the inference endpoint provider. Fireworks supports token-level log probabilities, which are essential for performing Classifier-Free Guidance (CFG) during decoding.

No local GPU is requiredโ€”all inference runs remotely on Fireworks infrastructure.

Endpoint access is configured via session_config.inference_endpoint.json using your Fireworks API key and endpoint URL.


๐Ÿ“– Citation

If you find our work helpful, please consider citing:

@inproceedings{lavon2025execution,
  title={Execution Guided Line-by-Line Code Generation},
  author={Lavon, Boaz and Katz, Shahar and Wolf, Lior},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}

๐Ÿ“š Related Work Citations

We gratefully acknowledge the authors of the following works for their implementations and publicly available models. If you find this repository helpful, please consider citing their papers as well.

@article{guo2024deepseek,
  title={DeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence},
  author={Guo, Daya and Zhu, Qihao and Yang, Dejian and Xie, Zhenda and Dong, Kai and Zhang, Wentao and Chen, Guanting and Bi, Xiao and Wu, Yu and Li, YK and others},
  journal={arXiv preprint arXiv:2401.14196},
  year={2024}
}
@article{liu2024deepseek,
  title={DeepSeek-V3 Technical Report},
  author={Liu, Aixin and Feng, Bei and Xue, Bing and Wang, Bingxuan and others},
  journal={arXiv preprint arXiv:2412.19437},
  year={2024}
}
@article{austin2021program,
  title={Program synthesis with large language models},
  author={Austin, Jacob and Odena, Augustus and Nye, Maxwell and Bosma, Maarten and Michalewski, Henryk and Dohan, David and Jiang, Ellen and Cai, Carrie and Terry, Michael and Le, Quoc and others},
  journal={arXiv preprint arXiv:2108.07732},
  year={2021}
}
@article{chen2021evaluating,
  title={Evaluating large language models trained on code},
  author={Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and Pinto, Henrique Ponde de Oliveira and Kaplan, Jared and Edwards, Harri and Burda, Yuri and Joseph, Nicholas and Brockman, Greg and others},
  journal={arXiv preprint arXiv:2107.03374},
  year={2021}
}
@article{dong2025codescore,
  title={CodeScore: Evaluating Code Generation by Learning Code Execution},
  author={Dong, Yihong and Ding, Jiazheng and Jiang, Xue and Li, Ge and Li, Zhuo and Jin, Zhi},
  journal={ACM Transactions on Software Engineering and Methodology},
  volume={34},
  number={3},
  pages={1--22},
  year={2025}
}
@article{li2022alphacode,
  title={Competition-level code generation with AlphaCode},
  author={Li, Yujia and Choi, David and Chung, Junyoung and Kushman, Nate and Schrittwieser, Julian and others},
  journal={Science},
  volume={378},
  number={6624},
  pages={1092--1097},
  year={2022}
}
@inproceedings{khan2024xcodeeval,
  title={XCodeEval: An Execution-Based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval},
  author={Khan, Mohammad Abdullah Matin and Bari, M Saiful and Long, Do and Wang, Weishi and Parvez, Md Rizwan and Joty, Shafiq},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={6766--6805},
  year={2024}
}

โœ… ML Code Checklist

  • Dependency spec: environment.yml
  • Inference + Analysis code
  • Evaluation scripts and commands
  • Result tables + reproducibility

๐Ÿงพ License

This repository is licensed under the CC BY-NC-SA 4.0 license. This software is provided for non-commercial use only. For commercial use, you must obtain a commercial license by contacting Ramot - Technology Transfer Company of Tel Aviv University ([email protected]). The underlying technology is patented. For more information on commercial licensing, please visit the official technology page at Ramot.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors