AI Auto-Moderation — intelligent automated moderation powered by Claude SDK #56

@AnExiledDev

Description

Blocked by #55

Replace passive moderation logging with intelligent, automated moderation actions. When the triage layer (#55) classifies a message as moderate, an AI assessment determines content category, severity, and confidence — then automatically applies tiered actions (warn/timeout/ban), deletes high-severity messages, and alerts moderators with a detailed embed they can reverse via reaction. This eliminates the gap between detection and action while keeping humans in the loop for oversight and edge cases.

How It Works

When triage flags a message as moderate, the auto-mod module performs a detailed assessment:

  1. AI Assessment — SDK query() with Sonnet evaluates the message content, surrounding context, and the target user's moderation history
  2. Classification — Returns content category, severity (low/medium/high), confidence score (0–100), and explanation
  3. Confidence Gate — If confidence is below threshold (default 70%), alert moderators without acting
  4. Tiered Action — If confident: low → warn, medium → timeout, high → ban + delete message
  5. Alert Embed — Posts assessment details to the mod alert channel for moderator oversight
  6. Reaction Reversal — Moderators react to the alert embed to reverse any auto-mod action
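The confidence gate (step 3) and tiered action mapping (step 4) can be sketched as a pure function. This is an illustrative sketch only — the type and function names (`Assessment`, `decideAction`) are assumptions, not the actual module's API:

```typescript
// Severity tiers and actions as defined in the spec.
type Severity = "low" | "medium" | "high";
type Action = "warn" | "timeout" | "ban" | "defer";

interface Assessment {
  category: string;
  severity: Severity;
  confidence: number; // 0-100
  explanation: string;
}

// Default confidence threshold from the spec.
const CONFIDENCE_THRESHOLD = 70;

// Severity -> action mapping; high severity also triggers message deletion.
const SEVERITY_ACTION: Record<Severity, Action> = {
  low: "warn",
  medium: "timeout",
  high: "ban",
};

function decideAction(
  a: Assessment,
  threshold: number = CONFIDENCE_THRESHOLD,
): Action {
  // Below the confidence gate: alert moderators, take no automatic action.
  if (a.confidence < threshold) return "defer";
  return SEVERITY_ACTION[a.severity];
}
```

Keeping this mapping as a pure function makes the gate/tier logic trivially unit-testable, independent of the SDK call and Discord side effects.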

Content Categories

Harassment, threats, hate speech, NSFW content, doxxing/personal information exposure, spam patterns, general toxicity.
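The broad categories above could be modeled as a closed union so assessments can't return an unrecognized label (the identifier names here are assumptions for illustration):

```typescript
// Closed set of content categories from the spec; names are illustrative.
type ContentCategory =
  | "harassment"
  | "threats"
  | "hate_speech"
  | "nsfw"
  | "doxxing"
  | "spam"
  | "toxicity";

const ALL_CATEGORIES: ContentCategory[] = [
  "harassment", "threats", "hate_speech", "nsfw", "doxxing", "spam", "toxicity",
];
```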

Model Selection

Mirrors the triage verification pattern from #55 — Sonnet by default for assessment, escalates to Opus when the model's own confidence is low or content is ambiguous.
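One way to express that escalation rule as a small helper (a sketch under assumed names — the real #55 implementation may differ):

```typescript
type Model = "sonnet" | "opus";

interface InitialAssessment {
  confidence: number; // 0-100, the model's self-reported confidence
  ambiguous: boolean; // content the model flagged as hard to classify
}

// Escalate to Opus when Sonnet's own confidence is low or the content is
// ambiguous; return null when no re-assessment is needed.
function pickModelForReassessment(
  first: InitialAssessment,
  lowConfidence: number = 70,
): Model | null {
  return first.confidence < lowConfidence || first.ambiguous ? "opus" : null;
}
```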

Requirements

  • When triage classifies a message as moderate, perform a detailed AI assessment using SDK query() with Sonnet, including message content, context, and the target user's moderation history
  • Assessment returns structured response: content category, severity tier (low/medium/high), confidence score (0–100), and brief explanation
  • When confidence meets threshold (default 70%): low → warn, medium → timeout, high → ban
  • When confidence is below threshold, alert moderators without taking automatic action
  • High-severity violations above confidence threshold trigger automatic message deletion
  • Query target user's moderation history (case count by type, recent actions within 30 days) and include in assessment prompt for context-aware severity decisions
  • All auto-mod actions create cases via existing createCase() with bot's user ID as moderator and AI assessment summary as reason
  • Post alert embed to configured mod alert channel with: content summary, category, severity, confidence, action taken (or "deferred to moderators"), and user history summary
  • Moderators react to alert embed with configured reversal emoji to reverse the action (untimeout, unban) — creates reversal case attributed to the reacting moderator
  • If Sonnet returns low confidence on its own assessment, re-assess with Opus before deciding (mirrors the triage verification pattern from #55, "Claude Agent SDK — replace OpenClaw with intelligent triage and dynamic model selection")
  • Broad content categories: harassment, threats, hate speech, NSFW, doxxing, spam, toxicity
  • Auto-mod toggleable via config — global enable/disable and per-channel exclusion list
  • Timeout duration for medium-severity configurable (default: 1 hour)
  • DM notifications follow existing dmNotifications config for auto-mod actions
  • Existing slash commands (/unban, /untimeout) work as fallback reversal alongside reaction-based undo
  • Per-assessment cost logged via Winston structured logging
  • Assessment timeout: 15 seconds — on timeout, defer to moderators (no automatic action)
  • Per-assessment budget configurable (default: $0.10)
  • 80% test coverage maintained
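The configurable knobs scattered through the requirements above could be collected into one config shape. The field names and the reversal emoji default are assumptions; the numeric defaults mirror the spec:

```typescript
// Illustrative auto-mod config; the actual schema may differ.
interface AutoModConfig {
  enabled: boolean;             // global enable/disable
  excludedChannelIds: string[]; // per-channel exclusion list
  confidenceThreshold: number;  // default 70 (0-100 scale)
  timeoutDurationMs: number;    // medium-severity timeout, default 1 hour
  reversalEmoji: string;        // reaction that reverses an auto-mod action
  assessmentTimeoutMs: number;  // 15s; on timeout, defer to moderators
  assessmentBudgetUsd: number;  // per-assessment budget, default $0.10
}

const defaultAutoModConfig: AutoModConfig = {
  enabled: true,
  excludedChannelIds: [],
  confidenceThreshold: 70,
  timeoutDurationMs: 60 * 60 * 1000,
  reversalEmoji: "\u21A9\uFE0F", // assumed default, e.g. the ↩️ reaction
  assessmentTimeoutMs: 15_000,
  assessmentBudgetUsd: 0.1,
};
```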

Out of Scope

  • Custom keyword/regex filter rules (see backlog: "Auto-mod rules engine")
  • Appeal system for auto-mod actions (see backlog: "Appeal system")
  • Multi-guild auto-mod customization (single config for all guilds initially)
  • Training or fine-tuning moderation models
  • Image/attachment content moderation (text messages only)
  • Voice channel moderation
  • Rate limiting auto-mod assessments per user (triage interval serves as natural throttle)
  • Auto-mod statistics dashboard (logging only)

Dependencies

  • Blocked by #55 (triage layer)

— Generated by Claude Code
