Description
Blocked by #55
Replace passive moderation logging with intelligent, automated moderation actions. When the triage layer (#55) classifies a message as moderate, an AI assessment determines content category, severity, and confidence — then automatically applies tiered actions (warn/timeout/ban), deletes high-severity messages, and alerts moderators with a detailed embed they can reverse via reaction. This eliminates the gap between detection and action while keeping humans in the loop for oversight and edge cases.
How It Works
When triage flags a message as moderate, the auto-mod module performs a detailed assessment:
- AI Assessment — SDK query() with Sonnet evaluates the message content, surrounding context, and the target user's moderation history
- Classification — Returns content category, severity (low/medium/high), confidence score (0–100), and explanation
- Confidence Gate — If confidence is below threshold (default 70%), alert moderators without acting
- Tiered Action — If confident: low → warn, medium → timeout, high → ban + delete message
- Alert Embed — Posts assessment details to the mod alert channel for moderator oversight
- Reaction Reversal — Moderators react to the alert embed to reverse any auto-mod action
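The confidence gate and tiered action steps above can be sketched as a pure decision function. This is a minimal sketch, not the actual module API — the names (`Assessment`, `decideAction`) and the exact return shape are illustrative assumptions; only the thresholds and severity-to-action mapping come from the spec:

```typescript
// Severity tiers and the structured result of the AI assessment (illustrative shape).
type Severity = "low" | "medium" | "high";

interface Assessment {
  category: string;    // e.g. "harassment", "spam"
  severity: Severity;
  confidence: number;  // 0–100
  explanation: string;
}

type Action = "warn" | "timeout" | "ban" | "defer";

// Confidence gate + tiered action: below threshold, defer to moderators;
// otherwise low → warn, medium → timeout, high → ban and delete the message.
function decideAction(
  a: Assessment,
  threshold = 70,
): { action: Action; deleteMessage: boolean } {
  if (a.confidence < threshold) {
    return { action: "defer", deleteMessage: false };
  }
  const action: Action =
    a.severity === "low" ? "warn" : a.severity === "medium" ? "timeout" : "ban";
  return { action, deleteMessage: a.severity === "high" };
}
```

Keeping this mapping pure (no Discord calls, no SDK calls) makes the gate trivially unit-testable, which matters for the 80% coverage requirement.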
Content Categories
Harassment, threats, hate speech, NSFW content, doxxing/personal information exposure, spam patterns, general toxicity.
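Since the assessment comes back from a model, the category list doubles as a validation whitelist. A hedged sketch of parsing the structured reply against these categories — the JSON field names and the null-on-failure convention are assumptions about how the assessment prompt would be structured, not the implemented behavior:

```typescript
// The seven broad categories above, as a closed set for validation.
const CATEGORIES = [
  "harassment", "threats", "hate_speech", "nsfw", "doxxing", "spam", "toxicity",
] as const;
type ContentCategory = (typeof CATEGORIES)[number];

// Parse the model's JSON reply; return null (treated like a low-confidence
// result, i.e. defer to moderators) rather than throwing on a bad shape.
function parseAssessment(raw: string): {
  category: ContentCategory;
  severity: "low" | "medium" | "high";
  confidence: number;
  explanation: string;
} | null {
  try {
    const obj = JSON.parse(raw);
    if (
      CATEGORIES.includes(obj.category) &&
      ["low", "medium", "high"].includes(obj.severity) &&
      typeof obj.confidence === "number" &&
      obj.confidence >= 0 && obj.confidence <= 100 &&
      typeof obj.explanation === "string"
    ) {
      return obj;
    }
  } catch {
    // fall through: malformed JSON is handled the same as a bad shape
  }
  return null;
}
```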
Model Selection
Mirrors the triage verification pattern from #55 — Sonnet by default for assessment, escalates to Opus when the model's own confidence is low or content is ambiguous.
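A minimal sketch of that escalation, with the SDK call abstracted behind an injected `assess` callback so only the escalation logic is shown. The two-model setup comes from the spec; the callback signature and the 70 cutoff for "low confidence" are assumptions for illustration:

```typescript
interface ModelAssessment {
  confidence: number;  // the model's self-reported confidence, 0–100
  severity: "low" | "medium" | "high";
}

// Run the cheaper model first; only when it reports low confidence do we pay
// for a second opinion from the stronger model (mirrors the #55 pattern).
async function assessWithEscalation(
  assess: (model: "sonnet" | "opus") => Promise<ModelAssessment>,
  escalateBelow = 70,
): Promise<ModelAssessment> {
  const first = await assess("sonnet");
  if (first.confidence >= escalateBelow) return first;
  return assess("opus");
}
```

Injecting the model call also makes the escalation path testable without spending tokens, which keeps the per-assessment budget out of the test suite.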
Requirements
- When triage classifies a message as moderate, perform a detailed AI assessment using SDK query() with Sonnet, including message content, context, and the target user's moderation history
- Assessment returns a structured response: content category, severity tier (low/medium/high), confidence score (0–100), and brief explanation
- When confidence meets threshold (default 70%): low → warn, medium → timeout, high → ban
- When confidence is below threshold, alert moderators without taking automatic action
- High-severity violations above confidence threshold trigger automatic message deletion
- Query target user's moderation history (case count by type, recent actions within 30 days) and include in assessment prompt for context-aware severity decisions
- All auto-mod actions create cases via the existing createCase() with the bot's user ID as moderator and the AI assessment summary as reason
- Post alert embed to the configured mod alert channel with: content summary, category, severity, confidence, action taken (or "deferred to moderators"), and user history summary
- Moderators react to alert embed with configured reversal emoji to reverse the action (untimeout, unban) — creates reversal case attributed to the reacting moderator
- If Sonnet returns low confidence on its own assessment, re-assess with Opus before deciding (mirrors the triage verification pattern from #55)
- Broad content categories: harassment, threats, hate speech, NSFW, doxxing, spam, toxicity
- Auto-mod toggleable via config — global enable/disable and per-channel exclusion list
- Timeout duration for medium-severity configurable (default: 1 hour)
- DM notifications follow the existing dmNotifications config for auto-mod actions
- Existing slash commands (/unban, /untimeout) work as a fallback reversal path alongside reaction-based undo
- Per-assessment cost logged via Winston structured logging
- Assessment timeout: 15 seconds — on timeout, defer to moderators (no automatic action)
- Per-assessment budget configurable (default: $0.10)
- 80% test coverage maintained
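Pulling the configurable knobs from the list above into one place, a hedged sketch of what the auto-mod config block might look like — the field names are assumptions; only the default values come from the requirements:

```typescript
// Illustrative config shape collecting the defaults named in the requirements.
interface AutoModConfig {
  enabled: boolean;             // global on/off switch
  excludedChannels: string[];   // per-channel exclusion list (channel IDs)
  confidenceThreshold: number;  // act only at/above this (default 70)
  timeoutDurationMs: number;    // medium-severity timeout (default 1 hour)
  assessmentTimeoutMs: number;  // defer to mods past this (default 15 s)
  assessmentBudgetUsd: number;  // per-assessment cost cap (default $0.10)
  reversalEmoji: string;        // reaction that reverses an auto-mod action
}

const defaults: AutoModConfig = {
  enabled: true,
  excludedChannels: [],
  confidenceThreshold: 70,
  timeoutDurationMs: 60 * 60 * 1000,
  assessmentTimeoutMs: 15_000,
  assessmentBudgetUsd: 0.1,
  reversalEmoji: "↩️",
};
```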
Out of Scope
- Custom keyword/regex filter rules (see backlog: "Auto-mod rules engine")
- Appeal system for auto-mod actions (see backlog: "Appeal system")
- Multi-guild auto-mod customization (single config for all guilds initially)
- Training or fine-tuning moderation models
- Image/attachment content moderation (text messages only)
- Voice channel moderation
- Rate limiting auto-mod assessments per user (triage interval serves as natural throttle)
- Auto-mod statistics dashboard (logging only)
Dependencies
- Claude Agent SDK — replace OpenClaw with intelligent triage and dynamic model selection #55 — must be implemented first (provides the triage moderate classification, SDK integration, cost logging)
- @anthropic-ai/claude-agent-sdk — SDK for AI assessment queries
- Existing moderation infrastructure (createCase(), sendModLogEmbed(), DM notifications)
- Existing case history in PostgreSQL
— Generated by Claude Code