# TextAtari: 100K Frames Game Playing with Language Agent
This project benchmarks decision-making capabilities of large language models (LLMs) across different simulated environments using text-based interfaces. It introduces a translation layer from Gym to natural language, enabling LLMs to interact with and control classic Atari games.
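The translation layer described above can be sketched as follows. This is a hypothetical illustration, not the project's actual code: the `ObsTranslator` class, the `ACTION_NAMES` subset, and the parsing heuristic are all assumptions about how a Gym observation might be rendered as a prompt and an LLM reply mapped back to an action index.

```python
# Hypothetical sketch of a Gym-to-text translation layer: render a structured
# observation and the legal action set as a natural-language prompt, then map
# the LLM's textual reply back to a Gym action index.

ACTION_NAMES = ["NOOP", "FIRE", "LEFT", "RIGHT"]  # illustrative Atari subset

class ObsTranslator:
    def obs_to_text(self, obs: dict) -> str:
        # Turn observation fields into a readable sentence plus the action menu.
        facts = ", ".join(f"{k} is {v}" for k, v in obs.items())
        actions = ", ".join(ACTION_NAMES)
        return f"State: {facts}. Choose one action from: {actions}."

    def text_to_action(self, reply: str) -> int:
        # Pick the first known action name mentioned in the LLM's reply.
        upper = reply.upper()
        for i, name in enumerate(ACTION_NAMES):
            if name in upper:
                return i
        return 0  # fall back to NOOP when the reply is unparseable

translator = ObsTranslator()
prompt = translator.obs_to_text({"ball_x": 42, "paddle_x": 37})
action = translator.text_to_action("I will move LEFT to intercept the ball.")
# action == 2 (LEFT)
```

A real translator would also carry reward and termination signals into the prompt; the point here is only the round trip from observation to text and from text back to a discrete action.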
## Purpose
This extension focuses on evaluating prompting strategies (e.g., `naive`, `cot`, `reflexion_last`, `reflexion_max`) across different LLMs (`qwen7b`, `llama`, `gemma`) and four distinct levels of information access.
## Benchmark Structure
We design environments with increasing access to trajectory or expert information. The four levels of the benchmark are:
- **Basic**: No trajectory is provided. This is the zero-shot baseline.
- **Obscuring**: A random-policy trajectory is given, but the agent is not told it is random. This simulates limited, noisy feedback.
- **RL_traj**: A trajectory generated by a trained RL agent is provided, an imitation-like setting.
- **Expert_goal**: The goal or core instruction of an expert agent is provided, a high-level instruction-following scenario.
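The four levels above differ only in what context is prepended to the state prompt. A minimal sketch of that idea, assuming hypothetical field names and placeholder content (the benchmark's real configuration may differ):

```python
# Illustrative mapping from benchmark level to the extra context shown to the
# agent. The trajectory sources, labels, and goal text are assumptions.
LEVELS = {
    "basic":       {"trajectory": None,     "label": None},          # zero-shot
    "obscuring":   {"trajectory": "random", "label": "reference"},   # agent not told it's random
    "RL_traj":     {"trajectory": "rl",     "label": "RL-generated"},
    "expert_goal": {"trajectory": None,     "label": None},          # goal text only
}

def build_context(level: str, obs_text: str) -> str:
    """Prepend level-specific context to the observation prompt."""
    cfg = LEVELS[level]
    parts = [obs_text]
    if cfg["trajectory"] is not None:
        parts.append(f"Reference trajectory ({cfg['label']}): ...")
    if level == "expert_goal":
        parts.append("Expert goal: ...")
    return "\n".join(parts)
```

Note that `obscuring` and `RL_traj` produce structurally identical prompts; only the trajectory's source (and hence its quality) differs, which is what the benchmark isolates.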
## Visualization Results (by Model + Scenario)
Each image is named using the structure `<model>_<scenario>_<agent1>_<agent2>.png`, where:

- `<model>`: one of `qwen7b`, `llama`, or `gemma`
- `<scenario>`: one of `basic`, `obscuring`, `RL_traj`, `expert_goal`
- `<agent1>` vs. `<agent2>`: the two agent strategies being compared
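Because scenario and strategy names themselves contain underscores (`RL_traj`, `reflexion_max`), the filename scheme is ambiguous under naive splitting. A small helper that resolves this by matching against the known name sets (taken from the sections above; the function itself is illustrative, not part of the project):

```python
import re

# Anchored pattern over the known model, scenario, and strategy names.
# Fixed alternatives are needed because some names contain underscores.
PATTERN = re.compile(
    r"^(?P<model>qwen7b|llama|gemma)_"
    r"(?P<scenario>basic|obscuring|RL_traj|expert_goal)_"
    r"(?P<agent1>naive|cot|reflexion_last|reflexion_max)_"
    r"(?P<agent2>naive|cot|reflexion_last|reflexion_max)\.png$"
)

def parse_result_image(name: str) -> dict:
    """Split a result-image filename into its model/scenario/agent fields."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognised result image name: {name}")
    return m.groupdict()

info = parse_result_image("gemma_expert_goal_cot_reflexion_last.png")
# info["scenario"] == "expert_goal", info["agent2"] == "reflexion_last"
```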
## Setup & Usage

### 1. GPT Agent Setup
Create `./deciders/gpt.py` to define your GPT-based decision-making agent.
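A minimal sketch of what `./deciders/gpt.py` might contain. The decider interface (a class with an `act(state_description)` method returning an action string) and the model name are assumptions about this project's conventions; the chat-completions call itself follows the official OpenAI Python SDK.

```python
# Hypothetical GPT-based decider. The class/method names are assumptions;
# adapt them to the interface the other deciders in ./deciders use.

class GPTAgent:
    def __init__(self, client, model: str = "gpt-4o-mini"):
        self.client = client  # e.g. openai.OpenAI(), injected for testability
        self.model = model

    def act(self, state_description: str) -> str:
        # Ask the model for a single action given the text-rendered game state.
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system",
                 "content": "You are playing an Atari game. Reply with exactly one action name."},
                {"role": "user", "content": state_description},
            ],
        )
        return resp.choices[0].message.content.strip()
```

Passing the client in as a constructor argument keeps the decider free of API-key handling and lets it be exercised with a stub in tests.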
If you use this project or our visualization protocol, please cite this work (citation info to be added post-publication).