TextAtari: 100K Frames Game Playing with Language Agent

This project benchmarks the decision-making capabilities of large language models (LLMs) in simulated environments through text-based interfaces. A translation layer converts Gym observations and actions into natural language, so that LLMs can interact with and control classic Atari games.
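As a rough illustration of what such a translation layer enables, the sketch below runs one episode in which a state is described in text, a model replies, and the reply is parsed back into an action. Everything here is hypothetical: `ToyCorridorEnv`, `describe`, `parse_action`, `run_episode`, and the `always_right` stub are illustrative names, not part of this repository, and the toy environment stands in for a real Gym environment.

```python
import re
from typing import Callable, Tuple


class ToyCorridorEnv:
    """Hypothetical stand-in for a Gym environment: walk right to reach position 3."""

    def reset(self) -> int:
        self.pos = 0
        return self.pos

    def step(self, action: int) -> Tuple[int, float, bool]:
        # Action 1 moves right, any other action moves left (never below 0).
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos >= 3
        return self.pos, (1.0 if done else 0.0), done


def describe(obs: int) -> str:
    """Translate a numeric observation into a natural-language prompt."""
    return (f"You are at position {obs} of 3. "
            "Reply 'Action: 0' to go left or 'Action: 1' to go right.")


def parse_action(reply: str, n_actions: int = 2, default: int = 0) -> int:
    """Extract the first valid action index from a free-form reply."""
    match = re.search(r"Action:\s*(\d+)", reply)
    if match and int(match.group(1)) < n_actions:
        return int(match.group(1))
    return default  # fall back when the reply is missing or out of range


def run_episode(env, llm: Callable[[str], str], max_steps: int = 20) -> float:
    """Describe the state, query the model, parse its action, step the environment."""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        action = parse_action(llm(describe(obs)))
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


def always_right(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return "I should move toward the goal. Action: 1"
```

Calling `run_episode(ToyCorridorEnv(), always_right)` plays one episode with the stub policy. Replacing `always_right` with a function that queries a real model is the only change needed to test an actual LLM in this toy setting.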

🧠 Purpose

This extension focuses on evaluating prompting strategies (e.g., naive, cot, reflexion_last, reflexion_max) across different LLMs (qwen7b, llama, gemma) and four distinct levels of information access.

📊 Benchmark Structure

We design environments with increasing access to trajectory or expert information. The benchmark has four levels:

Basic: No trajectory is provided. This is the zero-shot baseline.
Obscuring: A trajectory from a random policy is provided, but the agent is not told that it is random. This simulates limited, noisy feedback.
RL_traj: A trajectory generated by a trained RL agent is provided. This resembles on-policy imitation learning.
Expert_goal: The goal or core instruction of an expert agent is provided. This is a high-level instruction-following scenario.
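One way to picture the four levels is as a choice of what extra text, if any, is placed in front of the agent's prompt. The sketch below is an assumption about how that could look, not the repository's implementation: `Level`, `HEADERS`, `build_context`, and the header wording are all hypothetical.

```python
from enum import Enum


class Level(Enum):
    BASIC = "basic"
    OBSCURING = "obscuring"
    RL_TRAJ = "RL_traj"
    EXPERT_GOAL = "expert_goal"


# Hypothetical prompt headers; the exact wording is an assumption.
# The Obscuring header deliberately does not reveal that the trajectory is random.
HEADERS = {
    Level.OBSCURING: "Here is an example trajectory:",
    Level.RL_TRAJ: "Here is a trajectory produced by a trained RL agent:",
    Level.EXPERT_GOAL: "An expert agent follows this goal:",
}


def build_context(level: Level, material: str = "") -> str:
    """Return the extra text shown to the agent at the given level."""
    if level is Level.BASIC:
        return ""  # zero-shot baseline: nothing is added
    if not material:
        raise ValueError(f"level {level.value!r} needs a trajectory or goal text")
    return f"{HEADERS[level]}\n{material}"
```

For example, `build_context(Level.OBSCURING, trajectory_text)` and `build_context(Level.RL_TRAJ, trajectory_text)` differ only in the header, which mirrors the idea that the agent sees a trajectory in both cases but is told its origin in only one.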

🔍 Visualization Results (by Model + Scenario)

Each image is named using the structure:

<model>_<scenario>_<agent1>_<agent2>.png

<model>: one of qwen7b, llama, or gemma

<scenario>: one of basic, obscuring, RL_traj, expert_goal

<agent1> vs. <agent2>: compared agent strategies
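Because some scenario and strategy names contain underscores themselves (RL_traj, expert_goal, reflexion_last, reflexion_max), splitting a file name on "_" is ambiguous. The sketch below parses the name by matching known tokens from left to right instead. The function and type names are hypothetical, and the token lists are taken from the names mentioned in this README, so they may need extending for other files.

```python
from typing import NamedTuple, Sequence, Tuple

MODELS = ("qwen7b", "llama", "gemma")
SCENARIOS = ("basic", "obscuring", "RL_traj", "expert_goal")
AGENTS = ("naive", "cot", "reflexion_last", "reflexion_max")


class FigureName(NamedTuple):
    model: str
    scenario: str
    agent1: str
    agent2: str


def _take_prefix(text: str, options: Sequence[str]) -> Tuple[str, str]:
    """Split off the option that `text` starts with, plus its trailing underscore."""
    # Try longer options first so a shorter token never shadows a longer one.
    for option in sorted(options, key=len, reverse=True):
        if text == option:
            return option, ""
        if text.startswith(option + "_"):
            return option, text[len(option) + 1:]
    raise ValueError(f"{text!r} does not start with any of {tuple(options)}")


def parse_figure_name(filename: str) -> FigureName:
    """Parse '<model>_<scenario>_<agent1>_<agent2>.png' into its four parts."""
    stem = filename[:-4] if filename.endswith(".png") else filename
    model, rest = _take_prefix(stem, MODELS)
    scenario, rest = _take_prefix(rest, SCENARIOS)
    agent1, rest = _take_prefix(rest, AGENTS)
    agent2, rest = _take_prefix(rest, AGENTS)
    if rest:
        raise ValueError(f"unexpected trailing text {rest!r} in {filename!r}")
    return FigureName(model, scenario, agent1, agent2)
```

With an illustrative file name such as `llama_RL_traj_cot_reflexion_last.png`, the parser yields the model `llama`, the scenario `RL_traj`, and the strategies `cot` and `reflexion_last`.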

🛠️ Setup & Usage

1. GPT Agent Setup

Create ./deciders/gpt.py to define your GPT-based decision-making agent:

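import openai


class gpt:
    def __init__(self, args):
        # Module-level configuration in the style of the pre-1.0 openai package.
        # Replace the placeholder values with your own endpoint and key.
        if args.api_type == "azure":
            openai.api_type = "azure"
            openai.api_version = "2023-05-15"
            openai.api_base = "https://<your-endpoint>.openai.azure.com/"
            openai.api_key = "your-azure-key"
        else:
            # Default: the public OpenAI API only needs a key.
            openai.api_key = "your-openai-key"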

2. Environment Installation

conda env create --file environment.yaml

3. Run Experiments

Run with shell script:

sh shell/test_cartpole.sh

Or manually:

python main_reflexion.py --env_name CartPole-v0 \
    --init_summarizer cart_init_translator \
    --curr_summarizer cart_basic_translator \
    --decider exe_actor --prompt_level 1 --num_trails 1 \
    --distiller guide_generator --api_type openai
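When running several variants of the manual command, it can help to build the argument list programmatically. The sketch below is a hypothetical helper, not part of the repository: it only reuses the flags and values shown in the command above, and the set of other valid values for each flag is not documented here, so any override beyond those is an assumption to check against main_reflexion.py.

```python
import shlex
from typing import Dict, List

# Flag values copied from the manual command above.
DEFAULT_FLAGS: Dict[str, str] = {
    "env_name": "CartPole-v0",
    "init_summarizer": "cart_init_translator",
    "curr_summarizer": "cart_basic_translator",
    "decider": "exe_actor",
    "prompt_level": "1",
    "num_trails": "1",
    "distiller": "guide_generator",
    "api_type": "openai",
}


def build_command(**overrides: str) -> List[str]:
    """Return the argument list for main_reflexion.py with selected flags overridden."""
    unknown = set(overrides) - set(DEFAULT_FLAGS)
    if unknown:
        # Refuse flags that do not appear in the documented command.
        raise KeyError(f"flags not shown in the README: {sorted(unknown)}")
    flags = {**DEFAULT_FLAGS, **overrides}
    command = ["python", "main_reflexion.py"]
    for name, value in flags.items():
        command += [f"--{name}", str(value)]
    return command


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_command(api_type="azure")))
```

The resulting list can be passed to `subprocess.run` once the printed command looks right. `shlex.join` requires Python 3.8 or later.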

📊 Visualization Results

Rl In Traj: cot vs llama; cot vs qwen7b; naive vs llama; naive vs qwen7b; reflexion vs last; reflexion vs max
Basic In Cot: llama vs gemma; qwen7b vs gemma; qwen7b vs llama
Basic In Naive: llama vs gemma; qwen7b vs gemma; qwen7b vs llama
Basic In Reflexion: last vs llama; last vs qwen7b; max vs llama; max vs qwen7b
Game In Manual: cot vs llama; cot vs qwen7b; naive vs llama; naive vs qwen7b; reflexion vs last; reflexion vs max
Gemma In Rl: traj vs cot; traj vs naive; traj vs reflexion
Gemma In Basic: cot vs reflexion; naive vs cot; naive vs reflexion; reflexion vs last
Gemma In Cot: basic vs RL; basic vs game; basic vs obscuring; game vs manual; obscuring vs RL; obscuring vs game
Gemma In Game: manual vs cot; manual vs naive; manual vs reflexion
Gemma In Naive: basic vs RL; basic vs game; basic vs obscuring; game vs manual; obscuring vs RL; obscuring vs game
Gemma In Obscuring: cot vs reflexion; naive vs cot; naive vs reflexion; reflexion vs last
Gemma In Reflexion: last vs basic; last vs game; last vs obscuring; max vs basic; max vs game; max vs obscuring
Llama In Rl: traj vs cot; traj vs naive; traj vs reflexion
Llama In Basic: cot vs reflexion; naive vs cot; naive vs reflexion; reflexion vs last
Llama In Cot: basic vs RL; basic vs game; basic vs obscuring; game vs manual; obscuring vs RL; obscuring vs game
Llama In Game: manual vs cot; manual vs naive; manual vs reflexion
Llama In Naive: basic vs RL; basic vs game; basic vs obscuring; game vs manual; obscuring vs RL; obscuring vs game
Llama In Obscuring: cot vs reflexion; naive vs cot; naive vs reflexion; reflexion vs last
Llama In Reflexion: last vs basic; last vs game; last vs obscuring; max vs basic; max vs game; max vs obscuring
Obscuring In Cot: llama vs gemma; qwen7b vs gemma; qwen7b vs llama
Obscuring In Naive: llama vs gemma; qwen7b vs gemma; qwen7b vs llama
Obscuring In Reflexion: last vs llama; last vs qwen7b; max vs llama; max vs qwen7b
Qwen7B In Rl: traj vs cot; traj vs naive; traj vs reflexion
Qwen7B In Basic: cot vs reflexion; naive vs cot; naive vs reflexion; reflexion vs last
Qwen7B In Cot: basic vs RL; basic vs game; basic vs obscuring; game vs manual; obscuring vs RL; obscuring vs game
Qwen7B In Game: manual vs cot; manual vs naive; manual vs reflexion
Qwen7B In Naive: basic vs RL; basic vs game; basic vs obscuring; game vs manual; obscuring vs RL; obscuring vs game
Qwen7B In Obscuring: cot vs reflexion; naive vs cot; naive vs reflexion; reflexion vs last
Qwen7B In Reflexion: last vs basic; last vs game; last vs obscuring; max vs basic; max vs game; max vs obscuring

🧪 Add New Environments

Translate your Gym env to TextGym format in envs/
Add PPO/expert results to record_reflexion.csv
Add prompt templates and few-shot examples in prompts/
Test using shell/ scripts or CLI
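For the first step, a translator essentially turns the environment's numeric observation and action indices into sentences. The sketch below shows the idea for CartPole, whose observation is (cart position, cart velocity, pole angle in radians, pole angular velocity) and whose actions are 0 (push left) and 1 (push right). The class name, method names, and wording are hypothetical and do not reflect the interface expected by the code in envs/. Treating positive values as "right" is an assumed sign convention.

```python
from typing import Sequence


class CartPoleTextTranslator:
    """Hypothetical translator: class and method names are illustrative only."""

    ACTIONS = {0: "push the cart to the left", 1: "push the cart to the right"}

    def describe_game(self) -> str:
        """One-off description of the task and the available actions."""
        options = "; ".join(f"{idx} = {text}" for idx, text in self.ACTIONS.items())
        return "Keep the pole balanced on the cart. Available actions: " + options + "."

    @staticmethod
    def _direction(value: float) -> str:
        # Assumed sign convention: non-negative values point to the right.
        return "right" if value >= 0 else "left"

    def describe_state(self, obs: Sequence[float]) -> str:
        """Per-step description of the current observation."""
        position, velocity, angle, angular_velocity = obs
        return (
            f"The cart is at position {position:.2f}, moving {self._direction(velocity)} "
            f"at speed {abs(velocity):.2f}. The pole is tilted {abs(angle):.2f} rad to the "
            f"{self._direction(angle)}, rotating {self._direction(angular_velocity)} "
            f"at {abs(angular_velocity):.2f} rad/s."
        )
```

Splitting the game description from the per-step state description loosely mirrors the separate --init_summarizer and --curr_summarizer options in the run command, though the actual translators in this repository may be organized differently.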

📁 Project Structure

├── fig1/                     # All visualizations
├── envs/                     # Text-based Gym translators
├── prompts/                 # Prompt templates & few-shots
├── deciders/                # LLM-based agents
├── shell/                   # Experiment shell scripts
├── README_visualizations.md # Image gallery (192 images)
└── record_reflexion.csv     # Reward logs

📚 Citation

If you use this project or our visualization protocol, please cite this work (citation info to be added post-publication).
