Code repository for the ACL 2025 Main Conference paper "Frictional Agent Alignment Framework: Slow Down and Don’t Break Things". Read our paper.
- `faaf_config.py`: Configuration and hyperparameter settings for FAAF training
  - Includes definitions for the FAAF ablation losses
- `faaf_main_training_file.py`: Main training script for FAAF
  - Supports DPO (`loss_type="sigmoid"`) and IPO (`loss_type="ipo"`) baselines
  - Controls hyperparameter sweeps and logging
- `faaf_trainer.py`: Core FAAF trainer implementation
  - Handles loss computation
  - Manages phi-unconditioned forward passes
  - Implements preference alignment
<details>
<summary>FAAF Loss Function (click to expand)</summary>

```python
# Excerpt from faaf_trainer.py (assumes torch and typing.Tuple are imported).
def faaf_loss(
    self,
    policy_chosen_logps: torch.FloatTensor,        # π_θ(f_w|φ,x)
    policy_rejected_logps: torch.FloatTensor,      # π_θ(f_l|φ,x)
    reference_chosen_logps: torch.FloatTensor,     # π_ref(f_w|φ,x)
    reference_rejected_logps: torch.FloatTensor,   # π_ref(f_l|φ,x)
    policy_chosen_friction_logps: torch.FloatTensor,     # π_θ(f_w|x)
    policy_rejected_friction_logps: torch.FloatTensor,   # π_θ(f_l|x)
    reference_chosen_friction_logps: torch.FloatTensor,  # π_ref(f_w|x)
    reference_rejected_friction_logps: torch.FloatTensor,  # π_ref(f_l|x)
) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
    """
    FAAF loss combining conditional and unconditioned policy ratios for preference learning.
    Adapted from the DPO loss in the DPO trainer.

    Returns:
        Tuple of the FAAF loss and reward components (each of shape (batch_size,)).
    """
    # Conditional log-ratios: log π_θ(f|φ,x) − log π_ref(f|φ,x). Multiplying by
    # the bool (not self.reference_free) zeroes the reference term when running
    # reference-free.
    chosen_logratios = policy_chosen_logps.to(self.accelerator.device) - (
        not self.reference_free
    ) * reference_chosen_logps.to(self.accelerator.device)
    rejected_logratios = policy_rejected_logps.to(self.accelerator.device) - (
        not self.reference_free
    ) * reference_rejected_logps.to(self.accelerator.device)
    # ... (remainder of the loss computation; see faaf_trainer.py)
```

</details>
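The `(not self.reference_free)` factor above is a Python bool used as a 0/1 multiplier: with `reference_free=True` the reference log-probabilities drop out of the ratios entirely. A minimal standalone sketch of that toggle with dummy tensors (the tensors are random stand-ins, not repo code):

```python
# Standalone illustration of the reference-free toggle used in faaf_loss;
# the tensors are random stand-ins for per-example log-probabilities.
import torch

torch.manual_seed(0)
policy_chosen_logps = torch.randn(4)     # stand-in for log π_θ(f_w|φ,x)
reference_chosen_logps = torch.randn(4)  # stand-in for log π_ref(f_w|φ,x)

for reference_free in (False, True):
    # (not reference_free) evaluates to a bool (0 or 1), so the reference
    # term vanishes when reference_free=True.
    chosen_logratios = policy_chosen_logps - (not reference_free) * reference_chosen_logps
    print(f"reference_free={reference_free}: {chosen_logratios}")
```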
- `llm_judge_evals.py`: LLM evaluation pipeline
  - Runs pairwise position-swapped evaluations
  - Generates results for the Table 1 comparisons
- `opt_reward_modeling.py`: Reward modeling implementation
  - Uses OPT models for reward computation
  - Supports PPO training and the Table 2 evaluations
- `ppo_baseline_training.py`: PPO baseline implementation
- `friction_agent_inference.py`: FAAF model inference
  - Handles generation and parsing
  - Computes evaluation metrics
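As a rough picture of the inference step, here is a generic Hugging Face generate-and-decode loop; the checkpoint path, prompt format, and decoding settings are placeholders, not `friction_agent_inference.py`'s actual configuration:

```python
# Illustrative only: a generic generation loop. The checkpoint path, prompt,
# and decoding settings below are placeholders, not the repo's settings.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/faaf-checkpoint"  # hypothetical fine-tuned FAAF model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

dialogue = "Participant A: ...\nParticipant B: ..."  # placeholder dialogue context
inputs = tokenizer(dialogue, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Strip the prompt tokens and keep only the generated friction intervention.
friction = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(friction)
```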
Install dependencies: `pip install -r requirements.txt`
All training and evaluation data splits and preference pairs are available at:
- Run training through `faaf_main_training_file.py`
  - Uses configs from `faaf_config.py`
  - Implements the FAAF trainer from `faaf_trainer.py`
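A minimal sketch of how these three files might fit together in a run; the class names and constructor arguments below are assumptions inferred from the file descriptions, not the repo's verified API:

```python
# Hypothetical sketch: FAAFConfig/FAAFTrainer names and arguments are assumed
# from the file roles above, not the repo's verified API.
from faaf_config import FAAFConfig    # assumed location of the training config
from faaf_trainer import FAAFTrainer  # assumed location of the trainer class

config = FAAFConfig(
    loss_type="sigmoid",  # per the file list: "sigmoid" = DPO baseline, "ipo" = IPO baseline
    beta=0.1,             # assumed DPO-style KL-tradeoff hyperparameter
)
trainer = FAAFTrainer(config=config)  # the real script also passes model/data arguments
trainer.train()
```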
- For baselines:
  - Use `opt_reward_modeling.py` for reward modeling
  - Use `ppo_baseline_training.py` for PPO comparison
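For context, the standard transformers pattern for a scalar OPT reward head looks like the sketch below; the checkpoint and head configuration are illustrative assumptions, not necessarily what `opt_reward_modeling.py` implements:

```python
# Illustrative sketch: a common way to put a scalar reward head on OPT with
# Hugging Face transformers. The checkpoint and pooling details are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-350m", num_labels=1  # single logit used as the reward;
)                                      # the head is randomly initialized until trained

text = "Dialogue context and a candidate friction intervention."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits.squeeze(-1)  # shape: (batch_size,)
print(reward)
```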
- For evaluation:
  - Run `friction_agent_inference.py` for model generations
  - Use `llm_judge_evals.py` for preference scoring
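Per the file list, `llm_judge_evals.py` runs pairwise position-swapped evaluations; the sketch below shows that bias-control pattern in isolation, with `judge` as a placeholder for the actual LLM-judge call:

```python
# Illustrative sketch of pairwise position-swapped judging. judge() is a
# placeholder for an actual LLM-judge call, not the repo's implementation.
def judge(first: str, second: str) -> str:
    """Placeholder LLM judge: returns 'first' or 'second'."""
    raise NotImplementedError  # would query the judge model here

def position_swapped_verdict(response_a: str, response_b: str) -> str:
    # Query the judge twice with the candidates' positions swapped, so a
    # preference only counts when it survives the swap (controls position bias).
    v1 = judge(response_a, response_b)  # A shown first
    v2 = judge(response_b, response_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"  # inconsistent verdicts count as a tie
```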
