Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm of interacting with LLM agents, which we call overhearing agents. Overhearing agents do not actively participate in conversation--instead, they "listen in" to human-human conversations and perform background actions or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to use implicit audio cues to perform the overhearing task.
Requires Python 3.10 or higher (developed on 3.12; other versions from 3.10 up should also work).
$ pip install -r requirements.txt
You will also need to export an OpenAI API key:
$ export OPENAI_API_KEY="sk-proj-your-key-here..."
Since various open-weight models require extra heavyweight dependencies, their requirements are listed separately.
Use one of the following requirements files instead:
# Phi-4-multimodal, ultravox-0.5
$ pip install -r requirements-hf.txt
# Qwen-2.5-omni
$ pip install -r requirements-qwen.txt
# Step-Audio
# NOTE: Requires Python 3.10 exactly!
$ pip install -r requirements-step.txt
Additionally, some HF models might require/recommend Flash Attention, which must be installed with a GPU attached:
$ pip install flash-attn --no-build-isolation
Full Webserver
- First, you must build the web UI. Run:
$ cd viz
$ npm i
$ npm run build
Alternatively, you can run npm run dev to automatically rebuild the web UI when you make any changes.
- Then, run (from the repo root):
$ python server.py
The web UI will be at http://127.0.0.1:8000. (Or a different port if you're using dev)
Test on File
To run the system on a given input file, first run the steps below in Batch Process/Preprocessing.
Then run:
$ python sandbox/fromfile_demo.py
This will load a random input file, seek to a random position in it, and run the system as if it had received that data over the mic.
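The random seek described above can be sketched as follows, assuming the input is one of the muxed raw PCM files described under Preprocessing (signed 16-bit mono at 24 kHz). The function names are hypothetical and this is not the code in sandbox/fromfile_demo.py; it only illustrates that a seek into raw PCM should land on a sample boundary.

```python
import random

SAMPLE_RATE_HZ = 24_000  # muxed format: 24 kHz
BYTES_PER_SAMPLE = 2     # signed 16-bit mono -> 2 bytes per sample


def duration_seconds(num_bytes: int) -> float:
    """Length of a raw PCM buffer in seconds."""
    return num_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def offset_for_time(seconds: float) -> int:
    """Byte offset of the sample at `seconds`, aligned to a sample boundary."""
    return int(seconds * SAMPLE_RATE_HZ) * BYTES_PER_SAMPLE


def random_seek_offset(num_bytes: int, rng: random.Random) -> int:
    """Pick a random sample-aligned byte offset inside a buffer of `num_bytes`."""
    num_samples = num_bytes // BYTES_PER_SAMPLE
    if num_samples == 0:
        return 0
    return rng.randrange(num_samples) * BYTES_PER_SAMPLE
```

Seeking to an odd byte offset would split a 16-bit sample in half and turn the rest of the stream into noise, which is why the offset is computed in samples first and converted to bytes afterwards.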
Chat in Terminal
This is a good way to test function implementations without having to use audio. Run:
$ python sandbox/terminal_chat.py
Overhearing agents are implemented using Kani's function calling capability. To add new tasks to the overhearing agent, define a class that extends BaseKani (from overhearing_agents.kanis.base import BaseKani). In this class, define one or more @ai_function methods: these are the tasks that your agent will be able to do. These functions can have any implementation!
To launch the overhearing agents server, define a function that returns a new agent instance for each session, then pass that factory function to the server, as in this example:
from overhearing_agents.kanis.base import BaseKani
from overhearing_agents.server import VizServer
from overhearing_agents.session import OverhearingAgentsSession


class MyAgentClass(BaseKani):
    # define your @ai_function methods here
    ...


async def create_session():
    # called once per session, so every session gets a fresh agent
    ai = MyAgentClass()
    return OverhearingAgentsSession(ai)


server = VizServer(create_session)
server.serve()

Example agents (e.g. the D&D agent used in the Overhearing Agents paper) can be found in overhearing_agents/kanis.
Preprocessing the data will take a lot of storage space (>250GB)!
- Data goes in /data/<src>/*.[m4a|mp3|...]
  - To download the Critical Role datasets, run download-cr.sh and download-cr2.sh
- Run /data/mux.sh and /data/transcribe.sh
  - This will create muxed and transcribed dirs in each datasrc dir
  - Muxed PCM files are signed 16bit PCM mono @ 24kHz
  - Transcript files are JSON
- Make sure the requisite engine is configured in experiments/models.py
- Run python experiments/main.py:
usage: main.py [-h] --model-key ... [--force-rerun] audio_file [audio_file ...]
positional arguments:
audio_file
options:
-h, --help show this help message and exit
--model-key ... use a model key in experiments/models.py
--force-rerun force a model to rerun even if its logs exist in the index
Each experiment run will record its state in experiments/logs/index.json.
Each experiment's logs will be saved to experiments/logs/<filename>/<system-key>.
Note
If an experiment crashes partway through, it will record how much it was successfully able to process in the index.
Later runs of the same experiment will start from the point where the previous run crashed.
The log directory for the previous crashed run will be renamed
to experiments/logs/<filename>/<system-key>__until-<crashed-duration>.
Warning
To force an experiment to rerun, either run with --force-rerun or delete the corresponding entry
in experiments/logs/index.json.
This will overwrite any existing log directory for that experiment!
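The resume behaviour described in the notes above can be sketched like this. The index schema (a mapping from "<filename>/<system-key>" to a dict with "processed_seconds" and "complete"), the formatting of the crashed duration, and all function names are assumptions made for illustration; the actual layout of experiments/logs/index.json is not documented here.

```python
import json
from pathlib import Path


def load_index(index_path: Path) -> dict:
    """Read the index file, or return an empty index if it does not exist yet."""
    if index_path.exists():
        return json.loads(index_path.read_text())
    return {}


def plan_run(index: dict, filename: str, system_key: str, force_rerun: bool = False) -> dict:
    """Decide where a run should start (hypothetical schema, see the text above)."""
    entry = index.get(f"{filename}/{system_key}")
    if force_rerun or entry is None:
        # No record, or the user asked for a clean rerun: start from the beginning.
        return {"action": "run", "start_at": 0.0, "rename_to": None}
    if entry["complete"]:
        # Logs already exist in the index, so there is nothing to do.
        return {"action": "skip", "start_at": None, "rename_to": None}
    # A previous run crashed partway: resume there and set the old logs aside.
    done = entry["processed_seconds"]
    return {"action": "resume", "start_at": done, "rename_to": f"{system_key}__until-{done}"}
```

For example, an entry recording 120.5 processed seconds for system key "gpt" would yield a resume at 120.5 and a rename target of "gpt__until-120.5" under these assumptions.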
Text
- gpt-4o-2024-11-20
- gpt-4o-mini-2024-07-18
- microsoft/Phi-4-multimodal-instruct (5.6B)
- Qwen/Qwen2.5-Omni-7B (7B)
- meta-llama/Llama-3.3-70B-Instruct (70B)
Audio
- gpt-4o-realtime-preview-2024-12-17
- gpt-4o-mini-realtime-preview-2024-12-17
- microsoft/Phi-4-multimodal-instruct (5.6B, audio embeddings)
- Qwen/Qwen2.5-Omni-7B (7B, audio embeddings)
- fixie-ai/ultravox-v0_5-llama-3_3-70b (70B, audio embeddings based off whisper encoder)
See evaluation/README.md.
- Install Theatre Inserts (Foundry main menu -> Add-on Modules -> Install Module -> search "Theatre Inserts" as package name)
- Install the (foundry-module/dev) pa folder to .../FoundryVTT/Data/Modules (right click Foundry on taskbar -> Browse User Data for quick access) (TODO(Evan): add support for manifest or package search)
- Within a Foundry game session, enable the modules (Passive Agents, Theatre Inserts, and dependencies libWrapper and socketlib) from Game Settings -> Manage Modules. Follow the Foundry prompts to relaunch.
The PA module connects to the web socket specified at \scripts\main.js and executes the received foundry requests, supporting the following types:
list_all_npcs
Returns an array containing the npcs (foundry actors by name) within the foundry actor folder "npcs"
list_stage_npcs
Returns an array containing all staged npcs (that is, all npcs currently displayed on screen with pop-ups; not to be confused with the npcs in the TI NavBar just above the text chat window, which the TI docs also refer to as a stage)
add_npc_to_stage
Stages the passed npc (also adds it to the NavBar)
remove_npc_from_stage
Unstages the passed npc (does not remove it from the NavBar)
send_npc_speech
Stages the passed npc who says the passed dialogue (also adds npc to NavBar)
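As a rough illustration of the request types above, here is a sketch of how a client might build and validate a request before sending it over the web socket. The JSON field names ("type", "npc", "dialogue") and the function name are hypothetical; the actual message schema is defined by the module's scripts and is not specified here.

```python
import json

# Request types listed above, mapped to the extra fields each one needs.
# Field names are hypothetical, not the module's actual schema.
REQUEST_FIELDS = {
    "list_all_npcs": (),
    "list_stage_npcs": (),
    "add_npc_to_stage": ("npc",),
    "remove_npc_from_stage": ("npc",),
    "send_npc_speech": ("npc", "dialogue"),
}


def build_request(request_type: str, **fields: str) -> str:
    """Serialize a request to JSON, rejecting unknown types and wrong fields."""
    if request_type not in REQUEST_FIELDS:
        raise ValueError(f"unknown request type: {request_type}")
    expected = REQUEST_FIELDS[request_type]
    if set(fields) != set(expected):
        raise ValueError(f"{request_type} expects fields {expected}, got {tuple(fields)}")
    return json.dumps({"type": request_type, **fields})
```

Validating on the sending side keeps malformed requests out of the game session, where a bad NPC name or a missing dialogue field would be harder to diagnose.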
For Python, use PEP-8 compliant code styling with a width of 120
characters. I recommend using Black
and isort to automatically format your code after major changes. The project file
(pyproject.toml) will automatically configure these tools.
$ black .
$ isort .
For JavaScript and TypeScript, use Prettier to format your code with the code format found
in viz/.prettierrc.json.
- data: Directory containing the raw input data for experiments. Mostly gitignored aside from the scripts.
  - data/<datasrc>: For each data source, contains the m4a files, as well as their muxed and transcribed versions.
- evaluation: Python module containing everything related to evaluating the logs generated by experiments.
- experiments: Python module containing model definitions and prompts for offline experiments.
  - experiments/logs: For each input data file, contains logs for each model/prompt system run on it.
- foundry-module: Directory containing the source code of the Foundry module for integrating PA with Foundry VTT.
- logs (gitignored): Default directory containing the logs of the interactive server.
- overhearing_agents: The main Python module for the project. Contains all the business logic. Should not have any dependencies on the experiments or evaluation modules.
- sandbox: A Python module containing throwaway/one-off scripts. Should not be depended upon by anything.
- viz: Contains the full frontend project (a Vue 3 + TS frontend).
Much of the logging and visualization code is based off of ReDel (Zhu et al., 2024)'s codebase, under the MIT license.
Additional acknowledgements will be added here after the anonymity period.