AI Auto-Moderation — intelligent automated moderation powered by Claude SDK #56

@AnExiledDev

Description

Blocked by #55

Replace passive moderation logging with intelligent, automated moderation actions. When the triage layer (#55) classifies a message as moderate, an AI assessment determines content category, severity, and confidence — then automatically applies tiered actions (warn/timeout/ban), deletes high-severity messages, and alerts moderators with a detailed embed they can reverse via reaction. This eliminates the gap between detection and action while keeping humans in the loop for oversight and edge cases.

How It Works

When triage flags a message as moderate, the auto-mod module performs a detailed assessment:

  1. AI Assessment — SDK query() with Sonnet evaluates the message content, surrounding context, and the target user's moderation history
  2. Classification — Returns content category, severity (low/medium/high), confidence score (0–100), and explanation
  3. Confidence Gate — If confidence is below threshold (default 70%), alert moderators without acting
  4. Tiered Action — If confident: low → warn, medium → timeout, high → ban + delete message
  5. Alert Embed — Posts assessment details to the mod alert channel for moderator oversight
  6. Reaction Reversal — Moderators react to the alert embed to reverse any auto-mod action
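The confidence gate (step 3) and tiered action mapping (step 4) can be sketched as a pure function. This is an illustrative sketch only — the type and function names (`Assessment`, `decideAction`) are assumptions, not the actual module's API:

```typescript
// Severity tiers and actions as defined in the spec.
type Severity = "low" | "medium" | "high";
type Action = "warn" | "timeout" | "ban" | "defer";

interface Assessment {
  category: string;
  severity: Severity;
  confidence: number; // 0-100
  explanation: string;
}

// Default confidence threshold from the spec.
const CONFIDENCE_THRESHOLD = 70;

// Severity -> action mapping; high severity also triggers message deletion.
const SEVERITY_ACTION: Record<Severity, Action> = {
  low: "warn",
  medium: "timeout",
  high: "ban",
};

function decideAction(
  a: Assessment,
  threshold: number = CONFIDENCE_THRESHOLD,
): Action {
  // Below the confidence gate: alert moderators, take no automatic action.
  if (a.confidence < threshold) return "defer";
  return SEVERITY_ACTION[a.severity];
}
```

Keeping this mapping as a pure function makes the gate/tier logic trivially unit-testable, independent of the SDK call and Discord side effects.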

Content Categories

Harassment, threats, hate speech, NSFW content, doxxing/personal information exposure, spam patterns, general toxicity.
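The broad categories above could be modeled as a closed union so assessments can't return an unrecognized label (the identifier names here are assumptions for illustration):

```typescript
// Closed set of content categories from the spec; names are illustrative.
type ContentCategory =
  | "harassment"
  | "threats"
  | "hate_speech"
  | "nsfw"
  | "doxxing"
  | "spam"
  | "toxicity";

const ALL_CATEGORIES: ContentCategory[] = [
  "harassment", "threats", "hate_speech", "nsfw", "doxxing", "spam", "toxicity",
];
```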

Model Selection

Mirrors the triage verification pattern from #55 — Sonnet by default for assessment, escalates to Opus when the model's own confidence is low or content is ambiguous.
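One way to express that escalation rule as a small helper (a sketch under assumed names — the real #55 implementation may differ):

```typescript
type Model = "sonnet" | "opus";

interface InitialAssessment {
  confidence: number; // 0-100, the model's self-reported confidence
  ambiguous: boolean; // content the model flagged as hard to classify
}

// Escalate to Opus when Sonnet's own confidence is low or the content is
// ambiguous; return null when no re-assessment is needed.
function pickModelForReassessment(
  first: InitialAssessment,
  lowConfidence: number = 70,
): Model | null {
  return first.confidence < lowConfidence || first.ambiguous ? "opus" : null;
}
```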

Requirements

  • When triage classifies a message as moderate, perform a detailed AI assessment using SDK query() with Sonnet, including message content, context, and the target user's moderation history
  • Assessment returns structured response: content category, severity tier (low/medium/high), confidence score (0–100), and brief explanation
  • When confidence meets threshold (default 70%): low → warn, medium → timeout, high → ban
  • When confidence is below threshold, alert moderators without taking automatic action
  • High-severity violations above confidence threshold trigger automatic message deletion
  • Query target user's moderation history (case count by type, recent actions within 30 days) and include in assessment prompt for context-aware severity decisions
  • All auto-mod actions create cases via existing createCase() with bot's user ID as moderator and AI assessment summary as reason
  • Post alert embed to configured mod alert channel with: content summary, category, severity, confidence, action taken (or "deferred to moderators"), and user history summary
  • Moderators react to alert embed with configured reversal emoji to reverse the action (untimeout, unban) — creates reversal case attributed to the reacting moderator
  • If Sonnet returns low confidence on its own assessment, re-assess with Opus before deciding (mirrors the triage verification pattern from #55, "Claude Agent SDK — replace OpenClaw with intelligent triage and dynamic model selection")
  • Broad content categories: harassment, threats, hate speech, NSFW, doxxing, spam, toxicity
  • Auto-mod toggleable via config — global enable/disable and per-channel exclusion list
  • Timeout duration for medium-severity configurable (default: 1 hour)
  • DM notifications follow existing dmNotifications config for auto-mod actions
  • Existing slash commands (/unban, /untimeout) work as fallback reversal alongside reaction-based undo
  • Per-assessment cost logged via Winston structured logging
  • Assessment timeout: 15 seconds — on timeout, defer to moderators (no automatic action)
  • Per-assessment budget configurable (default: $0.10)
  • 80% test coverage maintained
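The configurable knobs scattered through the requirements above could be collected into one config shape. The field names and the reversal emoji default are assumptions; the numeric defaults mirror the spec:

```typescript
// Illustrative auto-mod config; the actual schema may differ.
interface AutoModConfig {
  enabled: boolean;             // global enable/disable
  excludedChannelIds: string[]; // per-channel exclusion list
  confidenceThreshold: number;  // default 70 (0-100 scale)
  timeoutDurationMs: number;    // medium-severity timeout, default 1 hour
  reversalEmoji: string;        // reaction that reverses an auto-mod action
  assessmentTimeoutMs: number;  // 15s; on timeout, defer to moderators
  assessmentBudgetUsd: number;  // per-assessment budget, default $0.10
}

const defaultAutoModConfig: AutoModConfig = {
  enabled: true,
  excludedChannelIds: [],
  confidenceThreshold: 70,
  timeoutDurationMs: 60 * 60 * 1000,
  reversalEmoji: "\u21A9\uFE0F", // assumed default, e.g. the ↩️ reaction
  assessmentTimeoutMs: 15_000,
  assessmentBudgetUsd: 0.1,
};
```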

Out of Scope

  • Custom keyword/regex filter rules (see backlog: "Auto-mod rules engine")
  • Appeal system for auto-mod actions (see backlog: "Appeal system")
  • Multi-guild auto-mod customization (single config for all guilds initially)
  • Training or fine-tuning moderation models
  • Image/attachment content moderation (text messages only)
  • Voice channel moderation
  • Rate limiting auto-mod assessments per user (triage interval serves as natural throttle)
  • Auto-mod statistics dashboard (logging only)

Dependencies

  • Blocked by #55 (triage layer)

— Generated by Claude Code
