Public roadmap for @iris-eval/mcp-server. Updated 2026-03-18.
Status: Released
The foundation: an MCP server that evaluates agent output quality, logs traces, and surfaces results in a web dashboard.
- 3 MCP tools: `log_trace`, `evaluate_output`, `get_traces`, each with Zod-validated input schemas
- Evaluation engine: 12 built-in rules across 4 categories (completeness, relevance, safety, cost), weighted scoring with a configurable threshold, custom rule support
- SQLite storage: single-file database via better-sqlite3, schema migrations, queryable trace and evaluation history
- Web dashboard: React-based dark-mode UI with summary cards, trace list, span tree view, and evaluation results (served via Express on port 6920)
- Security hardening: API key authentication, rate limiting (express-rate-limit), helmet security headers, CORS, input validation, ReDoS-safe regex, 1MB body limit
- Dual transport: stdio for local MCP clients (Claude Desktop, Cursor), HTTP for networked deployments
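The weighted-scoring model described above can be sketched in a few lines of TypeScript. The `EvalRule` shape, the rule ids, and the 0.7 threshold here are illustrative assumptions, not the shipped API:

```typescript
// Sketch of weighted rule scoring; names and shapes are assumptions.
interface EvalRule {
  id: string;
  category: "completeness" | "relevance" | "safety" | "cost";
  weight: number;
  check: (output: string) => boolean; // true = rule passed
}

// Score = (weight of passed rules) / (total weight), in [0, 1].
function weightedScore(rules: EvalRule[], output: string): number {
  const totalWeight = rules.reduce((sum, r) => sum + r.weight, 0);
  const passedWeight = rules
    .filter((r) => r.check(output))
    .reduce((sum, r) => sum + r.weight, 0);
  return totalWeight === 0 ? 0 : passedWeight / totalWeight;
}

// Two toy rules, including a "custom rule" style predicate.
const rules: EvalRule[] = [
  { id: "non-empty", category: "completeness", weight: 2, check: (o) => o.trim().length > 0 },
  { id: "no-apology-loop", category: "relevance", weight: 1, check: (o) => !/sorry[\s\S]*sorry/i.test(o) },
];

const score = weightedScore(rules, "The capital of France is Paris.");
const passed = score >= 0.7; // configurable threshold
```

A custom rule is just another predicate with a weight; the engine treats built-in and custom rules uniformly under this model.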
Status: Planned
Take agent eval from local to team scale: shared eval results, managed infrastructure, usage-based pricing.
- PostgreSQL storage adapter: drop-in replacement for SQLite, connection pooling, concurrent writes
- Multi-tenancy: workspace isolation, per-workspace database schemas or row-level security
- Team eval dashboards: shared eval results, team-level quality scores, agent-by-agent comparison
- Eval history: track scores per rule over time per agent — detect quality regressions before users do
- API key management: create, rotate, and revoke API keys per workspace via dashboard UI
- Data export: CSV and JSON export of evaluations and traces
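For the PostgreSQL adapter to be a drop-in replacement, storage would sit behind a common interface. This is a minimal sketch under assumed names (better-sqlite3 is synchronous, so the sketch is too; a real PostgreSQL adapter would likely return Promises):

```typescript
// Hypothetical storage-adapter surface; method names are assumptions.
interface Trace {
  id: string;
  agent: string;
  startedAt: number;
}

interface StorageAdapter {
  insertTrace(trace: Trace): void;
  getTraces(limit: number): Trace[];
}

// In-memory stand-in showing where a PostgreSQL adapter would slot in.
class MemoryAdapter implements StorageAdapter {
  private traces: Trace[] = [];
  insertTrace(trace: Trace): void {
    this.traces.push(trace);
  }
  getTraces(limit: number): Trace[] {
    return this.traces.slice(-limit); // most recent traces
  }
}

const db = new MemoryAdapter();
db.insertTrace({ id: "t1", agent: "demo", startedAt: 1 });
db.insertTrace({ id: "t2", agent: "demo", startedAt: 2 });
```

Keeping the adapter interface narrow is what makes per-workspace schemas or row-level security an implementation detail of the PostgreSQL adapter rather than a change to the eval engine.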
Status: Planned
Alert on quality regressions, bring non-MCP frameworks into the eval pipeline, and open the door for community-contributed eval rules.
- Quality alert rules: configurable conditions (e.g., average eval score drops below 0.6 over 1 hour, safety rule failure rate exceeds 5%, cost per trace exceeds $0.50)
- Webhook notifications: POST alert payloads to any URL (Slack, PagerDuty, custom endpoints)
- Email notifications: SMTP integration for alert emails with summary and affected evaluations
- Retention policies: automatic trace/eval deletion after configurable TTL (e.g., 30 days), storage usage tracking
- Dashboard alert panel: view active alerts, alert history, and configure rules from the UI
- Framework integration packages: SDK adapters that pipe framework events to Iris for eval — each install is a permanent adoption channel
  - `@iris-eval/langchain` — LangChain callback handler → Iris eval pipeline (scaffolded)
  - `@iris-eval/crewai` — CrewAI workflow eval integration (scaffolded)
  - `@iris-eval/autogen` — AutoGen conversation eval integration (scaffolded)
- Community eval rule registry: publish and discover community-contributed eval rules as npm packages (like ESLint plugins for agent output)
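A quality alert rule like the examples above could reduce to a small condition object evaluated against a windowed metric. The field names and rule shape here are illustrative assumptions, not a committed schema:

```typescript
// Illustrative alert-rule shape; field names are assumptions.
interface AlertRule {
  metric: "avg_score" | "safety_failure_rate" | "cost_per_trace";
  op: "lt" | "gt"; // fire when observed value is below/above threshold
  threshold: number;
  windowMinutes: number;
}

function shouldFire(rule: AlertRule, observed: number): boolean {
  return rule.op === "lt" ? observed < rule.threshold : observed > rule.threshold;
}

// "Average eval score drops below 0.6 over 1 hour"
const rule: AlertRule = { metric: "avg_score", op: "lt", threshold: 0.6, windowMinutes: 60 };
const fire = shouldFire(rule, 0.52);
```

When a rule fires, the same object could serve as the webhook payload's `rule` field, so Slack, PagerDuty, and custom endpoints all receive an identical structure.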
Status: Planned
Semantic evaluation powered by LLMs, plus export to your existing monitoring stack. Iris evaluates; your APM monitors. They work together.
- LLM-as-judge evaluation: use an LLM to score output quality on dimensions like accuracy, helpfulness, and safety (configurable model, prompt templates, cost controls)
- OpenTelemetry trace export: export Iris traces as OTel spans to Jaeger, Grafana Tempo, Datadog, Sentry, or any OTel-compatible backend — Iris handles eval, your APM handles infrastructure
- Annotation API: human-in-the-loop feedback — annotate traces with ground truth, corrections, or quality labels via API and dashboard
- Comparison mode: side-by-side trace comparison for A/B testing different prompts, models, or agent configurations
- Eval regression detection: automated alerts when eval scores trend downward per agent or per rule
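One plausible shape for LLM-as-judge is to prompt the judge for JSON scores and validate the reply before trusting it. The prompt template, the three dimensions, and the 0-to-1 scale below are assumptions for illustration:

```typescript
// Sketch of LLM-as-judge prompting and reply parsing; all shapes are assumptions.
const judgePrompt = (output: string): string =>
  `Score the following agent output from 0 to 1 on accuracy, helpfulness, and safety.
Respond with JSON only: {"accuracy": n, "helpfulness": n, "safety": n}

Output:
${output}`;

interface JudgeScores {
  accuracy: number;
  helpfulness: number;
  safety: number;
}

// Validate the judge's reply: malformed or out-of-range scores are rejected.
function parseJudgeReply(reply: string): JudgeScores {
  const parsed = JSON.parse(reply) as JudgeScores;
  for (const v of Object.values(parsed)) {
    if (typeof v !== "number" || v < 0 || v > 1) {
      throw new Error("judge score out of range");
    }
  }
  return parsed;
}

const scores = parseJudgeReply('{"accuracy": 0.9, "helpfulness": 0.8, "safety": 1}');
```

Validating the reply matters for cost controls too: a rejected reply can be retried or discarded without silently skewing aggregate scores.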
Status: Planned
Features for regulated environments and large organizations.
- SSO / SAML: single sign-on via SAML 2.0 and OIDC providers (Okta, Azure AD, Google Workspace)
- RBAC: role-based access control with predefined roles (admin, editor, viewer) and custom roles
- Audit logs: immutable log of all user actions (trace access, config changes, API key operations) with export capability
- SOC 2 compliance: documentation, controls, and architecture changes to support SOC 2 Type II certification
- SLA and support tiers: uptime guarantees, priority support, dedicated onboarding for enterprise customers
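The predefined roles could map to permission sets along these lines; the role and action names are illustrative assumptions, not a committed design:

```typescript
// Minimal RBAC check under assumed predefined roles and actions.
type Role = "admin" | "editor" | "viewer";

const permissions: Record<Role, Set<string>> = {
  admin: new Set(["read", "write", "manage_keys"]),
  editor: new Set(["read", "write"]),
  viewer: new Set(["read"]),
};

const allowed = (role: Role, action: string): boolean =>
  permissions[role].has(action);
```

Custom roles would then be additional entries in the same table, and every allow/deny decision is a natural audit-log event.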
Status: Ongoing
Community-driven features and ecosystem growth.
- Framework integration guides: step-by-step guides for using Iris with LangChain, CrewAI, AutoGen, Semantic Kernel, and the MCP SDK directly
- Eval rule marketplace: community-contributed evaluation rules published as npm packages, discoverable via a registry — the ESLint of agent output
- Plugin system: extend Iris with plugins for custom storage adapters, notification channels, authentication providers, and dashboard widgets
- Example agents: reference implementations of agents with Iris eval baked in, covering common patterns (RAG, tool-use, multi-agent)
- Contributing guide: documentation for contributing rules, storage adapters, and dashboard components
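A plugin system for notification channels might reduce to a registry keyed by a small interface, as in this sketch (all names are hypothetical):

```typescript
// Hypothetical plugin surface for notification channels; names are illustrative.
interface NotificationChannel {
  name: string;
  send(message: string): void;
}

const channels = new Map<string, NotificationChannel>();

function registerChannel(channel: NotificationChannel): void {
  channels.set(channel.name, channel);
}

// A trivial built-in channel; community plugins would register the same way.
registerChannel({
  name: "console",
  send(message) {
    console.log(`[iris] ${message}`);
  },
});
```

Storage adapters, auth providers, and dashboard widgets could follow the same register-by-interface pattern, which keeps the core small while the ecosystem grows.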
- Open an issue on GitHub with a feature request
- Upvote existing feature requests with a thumbs-up reaction
- Join the discussion in pull requests and issues
- Contribute directly: see CONTRIBUTING.md for guidelines