Lww007/Text-Atari-Agents

TextAtari: 100K Frames Game Playing with Language Agent

This project benchmarks the decision-making capabilities of large language models (LLMs) across simulated environments through text-based interfaces. It introduces a translation layer that converts Gym environments into natural language, enabling LLMs to interact with and control classic Atari games.

🧠 Purpose

This extension focuses on evaluating prompting strategies (e.g., naive, cot, reflexion_last, reflexion_max) across different LLMs (qwen7b, llama, gemma) and four distinct levels of information access.

📊 Benchmark Structure

We design environments with increasing access to trajectory or expert information. The four levels of the benchmark are:

  • Basic: No trajectory is provided. This is the zero-shot baseline.
  • Obscuring: A random-policy trajectory is given, but the agent is not told that it is random. This simulates limited, noisy feedback.
  • RL_traj: A trajectory generated by a trained RL agent is provided. This setting resembles on-policy imitation learning.
  • Expert_goal: The goal or core instruction of an expert agent is provided. This is a high-level instruction-following scenario.
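
As a rough illustration of how the four levels differ, the sketch below maps each level to the extra text an agent would see. The `Level` enum, the `extra_context` function, and the label strings are hypothetical and are not taken from the repository; only the level names come from the list above.

```python
from enum import Enum
from typing import Optional


class Level(Enum):
    # Level names follow the benchmark description; the enum itself is hypothetical.
    BASIC = "basic"
    OBSCURING = "obscuring"
    RL_TRAJ = "RL_traj"
    EXPERT_GOAL = "expert_goal"


def extra_context(level: Level,
                  random_traj: Optional[str] = None,
                  rl_traj: Optional[str] = None,
                  expert_goal: Optional[str] = None) -> str:
    """Return the additional text shown to the agent at a given level."""
    if level is Level.BASIC:
        # Zero-shot baseline: nothing beyond the current observation.
        return ""
    if level is Level.OBSCURING:
        # The trajectory comes from a random policy, but the label does not say so.
        return f"Example trajectory:\n{random_traj}"
    if level is Level.RL_TRAJ:
        return f"Trajectory from a trained RL agent:\n{rl_traj}"
    if level is Level.EXPERT_GOAL:
        return f"Expert goal:\n{expert_goal}"
    raise ValueError(f"unknown level: {level}")
```

The point of the Obscuring branch is that the agent receives a trajectory without any hint about its quality, which is what separates it from RL_traj.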

🔍 Visualization Results (by Model + Scenario)

Each image is named using the structure:

<model>_<scenario>_<agent1>_<agent2>.png

  • <model>: one of qwen7b, llama, or gemma
  • <scenario>: one of basic, obscuring, RL_traj, expert_goal
  • <agent1>, <agent2>: the two agent strategies being compared
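
Because scenario and strategy names can themselves contain underscores (RL_traj, reflexion_last), splitting a file name on `_` is ambiguous. The sketch below parses names against the known vocabularies instead. The function name `parse_image_name` is hypothetical, and the vocabularies are assumed from the lists in this README; actual file names in the repository may deviate from this pattern.

```python
import re

# Vocabularies assumed from this README.
MODELS = ("qwen7b", "llama", "gemma")
SCENARIOS = ("basic", "obscuring", "RL_traj", "expert_goal")
AGENTS = ("naive", "cot", "reflexion_last", "reflexion_max")


def _alt(names):
    # Build a regex alternation, longest names first.
    return "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))


PATTERN = re.compile(
    rf"^(?P<model>{_alt(MODELS)})_(?P<scenario>{_alt(SCENARIOS)})"
    rf"_(?P<agent1>{_alt(AGENTS)})_(?P<agent2>{_alt(AGENTS)})\.png$"
)


def parse_image_name(filename):
    """Split '<model>_<scenario>_<agent1>_<agent2>.png' into its four fields."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized image name: {filename}")
    return match.groupdict()
```

For example, `parse_image_name("gemma_RL_traj_cot_reflexion_max.png")` yields the model `gemma`, the scenario `RL_traj`, and the strategies `cot` and `reflexion_max`.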

🛠️ Setup & Usage

1. GPT Agent Setup

Create ./deciders/gpt.py to define your GPT-based decision-making agent:

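import openai


class gpt:
    """GPT-based decider that configures the openai module from CLI arguments."""

    def __init__(self, args):
        # Module-level settings in the style of the pre-1.0 openai package.
        if args.api_type == "azure":
            # Azure OpenAI: replace the endpoint and key with your own values.
            openai.api_type = "azure"
            openai.api_version = "2023-05-15"
            openai.api_base = "https://<your-endpoint>.openai.azure.com/"
            openai.api_key = "your-azure-key"
        else:
            # Default: the public OpenAI API.
            openai.api_key = "your-openai-key"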
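
To keep keys out of source control, the settings can be read from environment variables instead of being hard-coded. The following is a minimal sketch under that assumption: the helper names `load_api_settings` and `apply_settings` are hypothetical, and the variable names `OPENAI_API_KEY`, `AZURE_OPENAI_API_KEY`, and `AZURE_OPENAI_ENDPOINT` are a common convention rather than something this project prescribes.

```python
import os
from types import SimpleNamespace


def load_api_settings(args, environ=os.environ):
    """Collect API settings from environment variables instead of source code."""
    if args.api_type == "azure":
        return {
            "api_type": "azure",
            "api_version": "2023-05-15",
            "api_base": environ["AZURE_OPENAI_ENDPOINT"],  # assumed variable name
            "api_key": environ["AZURE_OPENAI_API_KEY"],    # assumed variable name
        }
    return {"api_key": environ["OPENAI_API_KEY"]}


def apply_settings(module, settings):
    """Copy each setting onto the given module, e.g. the imported openai package."""
    for name, value in settings.items():
        setattr(module, name, value)


if __name__ == "__main__":
    # Demo with a stand-in args object and an explicit environment mapping.
    demo_args = SimpleNamespace(api_type="openai")
    print(load_api_settings(demo_args, {"OPENAI_API_KEY": "sk-demo"}))
```

Inside `__init__`, the hard-coded assignments could then become `apply_settings(openai, load_api_settings(args))`. A missing variable raises `KeyError`, so a misconfigured shell fails early instead of sending requests with a placeholder key.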

2. Environment Installation

conda env create --file environment.yaml

3. Run Experiments

Run with shell script:

sh shell/test_cartpole.sh

Or manually:

python main_reflexion.py --env_name CartPole-v0 \
    --init_summarizer cart_init_translator \
    --curr_summarizer cart_basic_translator \
    --decider exe_actor --prompt_level 1 --num_trails 1 \
    --distiller guide_generator --api_type openai
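
For repeated runs, the same command can be assembled in Python. This is a sketch only: `build_command` is a hypothetical helper, the flags and values are copied from the manual command above, and this README does not document which values of `--prompt_level` other than 1 are valid.

```python
import shlex
import sys


def build_command(prompt_level=1, api_type="openai", num_trails=1):
    """Assemble the argv list for one CartPole run of main_reflexion.py."""
    return [
        sys.executable, "main_reflexion.py",
        "--env_name", "CartPole-v0",
        "--init_summarizer", "cart_init_translator",
        "--curr_summarizer", "cart_basic_translator",
        "--decider", "exe_actor",
        "--prompt_level", str(prompt_level),
        "--num_trails", str(num_trails),  # flag spelled as in the command above
        "--distiller", "guide_generator",
        "--api_type", api_type,
    ]


if __name__ == "__main__":
    # Print the command rather than running it; pass the list to
    # subprocess.run(...) from the repository root to execute it.
    print(shlex.join(build_command()))
```

Building the argument list explicitly avoids shell quoting issues and makes it easy to loop over settings.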

📊 Visualization Results

[Image gallery: pairwise comparison plots, grouped by model (gemma, llama, qwen7b), scenario (basic, obscuring, RL_traj, game_manual), and prompting strategy (naive, cot, reflexion_last, reflexion_max). The image files are stored under fig1/; see also README_visualizations.md.]

🧪 Add New Environments

  1. Translate your Gym env to TextGym format in envs/
  2. Add PPO/expert results to record_reflexion.csv
  3. Add prompt templates and few-shot examples in prompts/
  4. Test using shell/ scripts or CLI
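
To give a feel for step 1, the sketch below turns a CartPole observation into a sentence an LLM can read. It is an illustration only: `describe_cartpole` is a hypothetical function and does not reproduce the TextGym interface used in envs/. The observation layout (cart position, cart velocity, pole angle, pole angular velocity) and the two actions follow the standard Gym CartPole environment.

```python
def describe_cartpole(observation):
    """Turn a CartPole observation into a short natural-language description."""
    position, velocity, angle, angular_velocity = observation
    side = "left" if position < 0 else "right"
    direction = "left" if velocity < 0 else "right"
    return (
        f"The cart is {abs(position):.2f} units to the {side} of center, "
        f"moving {direction} at speed {abs(velocity):.2f}. "
        f"The pole angle is {angle:.3f} radians from vertical, "
        f"with angular velocity {angular_velocity:.2f}. "
        "Actions: 0 = push cart left, 1 = push cart right."
    )
```

A real translator would typically also describe the goal and the reward, and would pair this per-step description with a one-time introduction of the game.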

📁 Project Structure

├── fig1/                     # All visualizations
├── envs/                     # Text-based Gym translators
├── prompts/                  # Prompt templates & few-shots
├── deciders/                 # LLM-based agents
├── shell/                    # Experiment shell scripts
├── README_visualizations.md  # Image gallery (192 images)
└── record_reflexion.csv      # Reward logs

📚 Citation

If you use this project or our visualization protocol, please cite this work (citation info to be added post-publication).

About

This project provides a set of translators to convert OpenAI Gym environments into text-based environments. It is designed to investigate the capabilities of large language models in decision-making tasks within these text-based environments.
