A two-goal system:
- Spec-driven development — developers write C# record types and Gherkin scenarios that drive handler scaffolding, projection stubs, and executable tests for Wolverine/Marten/Polecat
- An AI-friendly Gherkin tool — focused on the Critter Stack first, but potentially replacing some acceptance testing in Wolverine and Marten themselves
The developer defines domain types as plain records. These are the source of truth for structure:
// Commands: imperative naming convention
public record PlaceOrder(string CustomerName, List<OrderItem> Items);
public record ShipOrder(Guid OrderId);
public record CancelOrder(Guid OrderId, string Reason);
// Events: past-tense naming convention
public record OrderPlaced(Guid OrderId, string CustomerName, List<OrderItem> Items);
public record OrderShipped(Guid OrderId, DateTimeOffset ShippedAt);
public record OrderCancelled(Guid OrderId, string Reason);
// Value objects
public record OrderItem(string Name, decimal Price);
// Aggregate
public class Order
{
    public Guid Id { get; set; }
    public string CustomerName { get; set; } = "";
    public List<OrderItem> Items { get; set; } = new();
    public OrderStatus Status { get; set; }

    public void Apply(OrderPlaced e) { /* ... */ }
    public void Apply(OrderShipped e) { /* ... */ }
    public void Apply(OrderCancelled e) { /* ... */ }
}

Role inference from naming conventions:
- Imperative form (PlaceOrder, ShipOrder) → command
- Past tense (OrderPlaced, OrderShipped) → event
- Class with Apply() methods → aggregate
- Optional: ICommand, IEvent marker interfaces still work if teams prefer them
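The naming rules above could be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: `RoleInference`, `InferredRole`, and the small verb list are hypothetical names introduced here, and the past-tense check is deliberately naive (it only recognizes regular "-ed" endings).

```csharp
using System;
using System.Linq;
using System.Reflection;

public enum InferredRole { Command, Event, Aggregate, Unknown }

// Hypothetical helper: none of these names come from the Critter Stack itself.
public static class RoleInference
{
    // Assumed verb list; a real tool would need a larger dictionary or marker interfaces.
    private static readonly string[] ImperativeVerbs =
        { "Place", "Ship", "Cancel", "Create", "Update", "Delete" };

    public static InferredRole Infer(Type type)
    {
        // A type exposing public instance Apply(x) methods is treated as an aggregate.
        bool hasApply = type
            .GetMethods(BindingFlags.Public | BindingFlags.Instance)
            .Any(m => m.Name == "Apply" && m.GetParameters().Length == 1);
        return hasApply ? InferredRole.Aggregate : InferFromName(type.Name);
    }

    public static InferredRole InferFromName(string name)
    {
        // Naive past-tense check: OrderPlaced, OrderCancelled. Irregular forms
        // such as OrderSent would fall through and need a marker interface.
        if (name.EndsWith("ed", StringComparison.Ordinal)) return InferredRole.Event;

        // Imperative check: the name starts with a known verb (PlaceOrder, ShipOrder).
        if (ImperativeVerbs.Any(v => name.StartsWith(v, StringComparison.Ordinal)))
            return InferredRole.Command;

        // Value objects such as OrderItem match neither rule.
        return InferredRole.Unknown;
    }
}
```

Returning `Unknown` instead of guessing leaves room for the optional marker interfaces to settle ambiguous names.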
The AI reads the C# types and generates .feature files describing expected behavior. These are continuously regenerated when types change — the developer does not hand-edit these files.
# Auto-generated by Critter Stack Spec Tool — DO NOT EDIT
# Source: Order aggregate, PlaceOrder command, OrderPlaced event
Feature: Order Aggregate
Background:
Given an empty Order stream
Scenario: Place an order
When PlaceOrder is received
| CustomerName | Items |
| Alice | Widget ($9.99) |
Then OrderPlaced is emitted
| OrderId | CustomerName | Items |
| <generated> | Alice | Widget ($9.99) |
Scenario: Ship an order
Given events for Order
| OrderPlaced |
| OrderId: <id>, CustomerName: Alice |
When ShipOrder is received
| OrderId |
| <id> |
Then OrderShipped is emitted
Scenario: Cancel an order
Given events for Order
| OrderPlaced |
| OrderId: <id>, CustomerName: Alice |
When CancelOrder is received
| OrderId | Reason |
| <id> | Changed mind |
Then OrderCancelled is emitted

The developer writes additional scenarios covering business rules, edge cases, and validation. These files are never overwritten by the AI tool:
# Developer-authored — managed by the team
# Locked scenarios are marked with @locked tag
Feature: Order Business Rules
@locked
Scenario: Cannot ship a cancelled order
Given events for Order
| OrderPlaced |
| OrderId: <id>, CustomerName: Alice |
| OrderCancelled |
| OrderId: <id>, Reason: Changed mind |
When ShipOrder is received
| OrderId |
| <id> |
Then validation fails with "Cannot ship a cancelled order"
@locked
Scenario: High-value order gets flagged
Given events for Order
| OrderPlaced |
| OrderId: <id>, CustomerName: Alice |
| Items: Expensive Thing ($5000) |
When PlaceOrder is received
| CustomerName | Items |
| Alice | Expensive Thing ($5000)|
Then OrderPlaced is emitted
And HighValueOrderFlagged is emitted

The @locked tag tells the sync tool to never modify this scenario.
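One way the sync step might honor @locked is sketched below. The `Scenario` record and `ScenarioSync.Merge` are hypothetical names used only for illustration, and the sketch assumes scenarios are matched by name. It also assumes unlocked scenarios that no longer appear in the regenerated output are kept, which the design above does not specify.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory model of a parsed scenario.
public sealed record Scenario(string Name, IReadOnlyList<string> Tags, string Body);

public static class ScenarioSync
{
    // Regenerated scenarios replace existing ones with the same name,
    // except scenarios tagged @locked, which are kept verbatim.
    public static IReadOnlyList<Scenario> Merge(
        IEnumerable<Scenario> existing, IEnumerable<Scenario> regenerated)
    {
        var regeneratedList = regenerated.ToList();
        var fresh = regeneratedList.ToDictionary(s => s.Name);
        var result = new List<Scenario>();
        var seen = new HashSet<string>();

        foreach (var current in existing)
        {
            bool locked = current.Tags.Contains("@locked");
            if (!locked && fresh.TryGetValue(current.Name, out var updated))
                result.Add(updated);   // unlocked: take the regenerated version
            else
                result.Add(current);   // locked, or nothing regenerated: keep as-is
            seen.Add(current.Name);
        }

        // Append scenarios that only exist in the regenerated output.
        result.AddRange(regeneratedList.Where(s => !seen.Contains(s.Name)));
        return result;
    }
}
```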
Pre-built step definitions that work with Marten, Polecat, and EF Core:
Event stream setup (Given):
# Start with an empty stream
Given an empty {aggregateType} stream
# Pre-populate with events (Marten/Polecat)
Given events for {aggregateType}
| OrderPlaced |
| OrderId: <id>, CustomerName: Alice |
| OrderShipped |
| OrderId: <id> |
# Load existing entity (EF Core)
Given an existing {entityType}
| Id | Name | Status |
| <id> | Alice | Active |
# Table/document setup with defaults
Given a {documentType} document
| Name | Status |
| Alice | |
# Status uses default value, not specified

Command execution (When):
When {CommandType} is received
| Field1 | Field2 |
| value | value |
# HTTP endpoint variant
When POST /api/orders
| CustomerName | Items |
| Alice | Widget ($9.99) |
Then status code is 201

Assertions (Then):
# Event emission
Then {EventType} is emitted
Then {EventType} is emitted
| Field | Expected |
| OrderId | <id> |
# Multiple events
Then events emitted in order
| OrderPlaced |
| OrderShipped |
# No events
Then no events are emitted
# Validation failure
Then validation fails with "{message}"
# Aggregate state verification (Storyteller-inspired)
Then the Order aggregate state is
| Property | Expected |
| Status | Shipped |
| CustomerName | Alice |
# Cascading messages (Wolverine)
Then {MessageType} is published
Then {MessageType} is scheduled for {duration}
# Collection/set verification
Then the projected OrderSummary contains
| OrderId | CustomerName | Status |
| <id1> | Alice | Placed |
| <id2> | Bob | Shipped |
# Extra rows in actual = reported
# Missing rows from expected = reported

Step definitions support default values to reduce verbosity:
// Step definition with defaults
[Given("an Order with defaults")]
public void GivenOrderWithDefaults(
    [DefaultValue("Alice")] string customerName,
    [DefaultValue("Widget ($9.99)")] string items)
{
    // Sets up an Order stream with these defaults
    // Developer only specifies fields they care about
}

In Gherkin, unspecified table columns use defaults:
Given an Order with defaults
| CustomerName |
| Bob |
# Items defaults to "Widget ($9.99)"

For constrained values:
[SelectionValues("Placed", "Shipped", "Cancelled", "Refunded")]
public OrderStatus Status { get; set; }

Storyteller's key insight: the specification IS the documentation. The HTML rendering shows:
- The specification — Gherkin scenarios rendered as readable prose with input values highlighted
- Pass/fail state — each step colored green (pass), red (fail), or yellow (pending)
- Actual vs expected — for failed assertions, show what was expected and what was received
- Set verification — tables with matched rows (green), missing rows (red), extra rows (yellow)
- Correlated logging — Wolverine tracked session output rendered inline between steps
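The set verification item in the list above (matched, missing, and extra rows) amounts to a multiset comparison. A minimal sketch follows, assuming rows are compared by value equality (for example C# records or tuples); `SetVerification` and `SetVerificationResult` are illustrative names, not part of any existing library.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical result shape: which expected rows were found, which were not,
// and which actual rows nobody asked for.
public sealed record SetVerificationResult<T>(
    IReadOnlyList<T> Matched, IReadOnlyList<T> Missing, IReadOnlyList<T> Extra);

public static class SetVerification
{
    public static SetVerificationResult<T> Compare<T>(
        IEnumerable<T> expected, IEnumerable<T> actual)
    {
        // Work on a copy so each actual row can satisfy at most one expected row.
        var remaining = actual.ToList();
        var matched = new List<T>();
        var missing = new List<T>();

        foreach (var row in expected)
        {
            // List<T>.Remove returns true when an equal row was present.
            if (remaining.Remove(row)) matched.Add(row);
            else missing.Add(row);
        }

        // Whatever is left in 'remaining' was never expected.
        return new SetVerificationResult<T>(matched, missing, remaining);
    }
}
```

Row order is ignored, and duplicate rows are counted, so two identical expected rows need two identical actual rows to match.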
Command-line output for dotnet run -- spec run:
Order Aggregate
═══════════════════════════════════════════════════════
✅ Place an order
Given an empty Order stream
When PlaceOrder is received
│ CustomerName │ Items │
│ Alice │ Widget ($9.99) │
Then OrderPlaced is emitted ✅
│ OrderId │ CustomerName │ Items │
│ a1b2c3d4-... │ Alice │ Widget ($9.99) │
❌ Cannot ship a cancelled order
Given events for Order
│ OrderPlaced │
│ OrderCancelled │
When ShipOrder is received
│ OrderId │
│ a1b2c3d4-... │
Then validation fails with "Cannot ship a cancelled order" ❌
Expected: ProblemDetails with "Cannot ship a cancelled order"
Actual: OrderShipped event emitted (handler missing validation)
Tracked Session:
┌─────────┬──────────────────┬────────────┬──────────┐
│ Order │ Message Type │ Direction │ Duration │
├─────────┼──────────────────┼────────────┼──────────┤
│ 1 │ ShipOrder │ Received │ 2ms │
│ 2 │ OrderShipped │ Sent │ 0ms │
└─────────┴──────────────────┴────────────┴──────────┘
Marten Events Appended:
┌─────────┬──────────────────┬───────────┐
│ Version │ Event Type │ Stream │
├─────────┼──────────────────┼───────────┤
│ 1 │ OrderPlaced │ a1b2c3d4 │
│ 2 │ OrderCancelled │ a1b2c3d4 │
│ 3 │ OrderShipped │ a1b2c3d4 │
└─────────┴──────────────────┴───────────┘
After each When step, the tool captures and renders:
- TrackedSession.RecordsInOrder(): tabular display of all messages sent, received, and executed during the step
- Marten/Polecat events appended: all events written to the event store during the step
- EF Core changes: entities inserted, updated, deleted (from ChangeTracker)
- Timing: execution time per step and per message
This output serves two purposes:
- Human readable in HTML and Spectre.Console
- AI scrapable for troubleshooting failing tests (the textual output is structured enough for an LLM to reason about)
# Generate Gherkin from C# types (Layer 2)
dotnet run -- spec generate
# Run all specifications
dotnet run -- spec run
# Run with HTML report output
dotnet run -- spec run --html report.html
# Run a specific feature
dotnet run -- spec run --feature "Order Aggregate"
# Sync Gherkin after type changes (non-destructive to @locked scenarios)
dotnet run -- spec sync

MCP tools:
- generate_gherkin_from_types: read C# types, generate .feature files
- sync_gherkin_with_types: update generated Gherkin after type changes
- run_specification: execute specs, return structured results
- diagnose_failing_spec: analyze TrackedSession output from a failed spec
The generated step definitions are Reqnroll-compatible. The .feature files can be executed by:
- The Critter Stack spec tool (primary — full rendering, tracked session capture)
- Standard Reqnroll/xUnit (fallback — runs in CI without the full rendering)
| Artifact | Who writes it | AI regenerates? |
|---|---|---|
| C# record types (commands, events) | Developer | Never |
| Aggregate classes with Apply methods | Developer | Never |
| Generated .feature files (Layer 2) | AI | Yes, continuously |
| Developer .feature files (Layer 3) | Developer | Never (except obvious syntax fixes) |
| Step definition classes | AI | Yes, for generated steps |
| @locked step definitions | Developer | Never |
| Handler scaffolds | AI (from Gherkin scenarios) | On request |
| Projection stubs | AI (from event types) | On request |
| HTML/Spectre rendering | Tool | Always |
This builds on:
- ScaffoldVerticalSlice MCP tool → generates handler skeletons from feature descriptions
- ScaffoldHandler MCP tool → generates handler stubs from message types
- DiagnoseDescribeOutput MCP tool → analyzes runtime configuration
- Integration Testing skill → tracked sessions, WaitForNonStaleProjectionDataAsync
- Pure Functions & Testability skill → Given/When/Then maps to aggregate handler testing
Storyteller 3 introduced a fail-fast model that distinguished between exceptions at different severity levels. The key insight was that not all failures mean the same thing:
- A wrong assertion value means the business logic is wrong — you want to keep running to gather more diagnostic data
- An NpgsqlException mid-scenario means the database connection dropped; continuing this scenario is pointless, but other scenarios might still work
- The IHost failing to start means nothing can run; stop the entire suite immediately
Storyteller formalized this as StorytellerCriticalException (abort the current spec, skip retries) and StorytellerCatastrophicException (halt all execution). Fixture SetUp() and TearDown() failures were automatically treated as critical. The system's ISystem failing to bootstrap was automatically catastrophic.
Storyteller also enforced strict retry discipline: acceptance specs were informational only (never fail the build), regression specs were mandatory (failures break the build), and critical/catastrophic exceptions were never retried regardless of configuration. This prevented CI builds from burning time retrying fundamentally broken infrastructure.
We adopt and extend this model for the Critter Stack spec runner:
| Level | Scope | Behavior | Continue? | Retry? |
|---|---|---|---|---|
| Assertion failure | Single step | Mark step as failed, continue executing remaining steps in the scenario | Yes — gather all failures | Per policy |
| Critical failure | Single scenario | Abort this scenario immediately, skip remaining steps, proceed to next scenario | No — skip to next scenario | Never |
| Catastrophic failure | Entire suite | Stop all execution, report results gathered so far, exit with non-zero code | No — stop everything | Never |
Assertion failures are the expected case — the system ran but produced the wrong result. These are the most valuable failures because they tell you what the code is doing wrong:
// Assertion: expected vs actual mismatch
// The step ran, the handler executed, but the output wasn't what we expected
// → Mark step as FAILED, continue to next step in this scenario
Then OrderShipped is emitted // actual: no events emitted → FAIL, continue
Then the Order status is Shipped // actual: status is Placed → FAIL, continue

Critical failures are infrastructure exceptions during step execution: the step couldn't even run properly. Continuing the scenario is pointless because subsequent steps depend on the failed step's side effects:
// Critical: infrastructure exception during a When step
// The handler threw an unhandled NpgsqlException, timeout, etc.
// → Abort this scenario, mark remaining steps as SKIPPED, move to next scenario
// Automatically classified as critical:
// - Unhandled exceptions in Given steps (data setup failure)
// - Unhandled exceptions in When steps (command execution failure)
// - Connection timeouts, serialization errors, OOM
// - Fixture SetUp/TearDown failures
// Developer can also force this:
throw new SpecCriticalException("Database connection lost mid-scenario");

Catastrophic failures mean the entire test environment is broken, so nothing else will work either:
// Catastrophic: system-level failure
// → Stop all execution immediately, report what we have, exit with error code
// Automatically classified as catastrophic:
// - IHost fails to start (UseWolverine, AddMarten, etc. throws)
// - Marten/Polecat schema migration fails
// - No database connection at all (initial connection refused)
// - Docker container not running
// - Port already in use
// Developer can also force:
throw new SpecCatastrophicException("External dependency permanently unavailable");

The step type matters for classification:
| Step type | Exception thrown | Classification | Rationale |
|---|---|---|---|
| Given | Any exception | Critical | Data setup failed; scenario can't proceed |
| When | Any exception | Critical | Command execution failed; nothing to assert against |
| Then | Assertion mismatch | Assertion (continue) | System ran but produced wrong result; gather more data |
| Then | Unexpected exception | Critical | Assertion code itself crashed; different from a mismatch |
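The classification table can be read as a small pure function. The sketch below assumes that a `Then` step signals an expected-versus-actual mismatch by throwing a dedicated exception type; `SpecAssertionException`, `StepKind`, `FailureLevel`, and `FailureClassifier` are hypothetical names, and the constructors given for `SpecCriticalException` and `SpecCatastrophicException` are assumptions, since only the type names appear in this design.

```csharp
using System;

public enum StepKind { Given, When, Then }
public enum FailureLevel { Assertion, Critical, Catastrophic }

// Type names come from the design above; the constructor shape is an assumption.
public class SpecCriticalException : Exception
{
    public SpecCriticalException(string message) : base(message) { }
}

public class SpecCatastrophicException : Exception
{
    public SpecCatastrophicException(string message) : base(message) { }
}

// Hypothetical: thrown by a Then step on an expected-vs-actual mismatch.
public class SpecAssertionException : Exception
{
    public SpecAssertionException(string message) : base(message) { }
}

public static class FailureClassifier
{
    public static FailureLevel Classify(StepKind step, Exception ex) => ex switch
    {
        // Explicit developer signals win over step-type rules.
        SpecCatastrophicException => FailureLevel.Catastrophic,
        SpecCriticalException => FailureLevel.Critical,

        // Only a mismatch raised from a Then step counts as an assertion failure.
        SpecAssertionException when step == StepKind.Then => FailureLevel.Assertion,

        // Everything else (Given/When exceptions, crashes inside Then) is critical.
        _ => FailureLevel.Critical,
    };
}
```

Detecting environment-level problems such as the IHost failing to start would happen before any step runs, so it is outside the scope of this per-step function.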
This means a scenario like "Ship a cancelled order" that throws InvalidOperationException from the handler would be classified differently depending on WHERE the exception surfaces:
- If the handler throws and it's unhandled by Wolverine → Critical (When step failed)
- If the handler returns ProblemDetails and the Then step checks the wrong status code → Assertion failure (continue)
- If the Then step's own assertion code crashes with NullReferenceException → Critical (test infrastructure bug)
When a Then step fails with an assertion (not an exception), the runner continues executing subsequent Then steps. This produces richer diagnostic output — you see ALL failures in a scenario, not just the first one:
❌ Ship an order
Given events for Order
│ OrderPlaced │ OrderCancelled │
When ShipOrder is received ✅
Then validation fails with "Cannot ship a cancelled order" ❌
Expected: ProblemDetails with "Cannot ship a cancelled order"
Actual: No ProblemDetails returned
Then no events are emitted ❌
Actual: OrderShipped was emitted
Then the Order aggregate status is Cancelled ✅
(this still ran and passed — useful diagnostic info)
Both failures are reported. The fact that the status IS Cancelled (last step passed) while events WERE emitted (second-to-last step failed) tells you the handler is appending OrderShipped without checking the aggregate status — that's a much more specific diagnosis than just "first assertion failed."
This continuation behavior is essential for AI troubleshooting: when the spec runner's structured output is fed to an LLM via the diagnose_failing_spec MCP tool, more assertion results = better diagnostic reasoning.
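The continuation behavior can be sketched as a loop that keeps going after assertion mismatches and skips the remaining steps after anything else. This is a simplified illustration: steps are modeled as plain `Action` delegates, and `ScenarioRunner`, `StepStatus`, and `SpecAssertionException` are hypothetical names. For brevity it does not check that the mismatch came from a `Then` step.

```csharp
using System;
using System.Collections.Generic;

public enum StepStatus { Passed, Failed, Critical, Skipped }

// Hypothetical: thrown by a Then step on an expected-vs-actual mismatch.
public class SpecAssertionException : Exception
{
    public SpecAssertionException(string message) : base(message) { }
}

public static class ScenarioRunner
{
    public static IReadOnlyList<StepStatus> Run(IEnumerable<Action> steps)
    {
        var statuses = new List<StepStatus>();
        bool aborted = false;

        foreach (var step in steps)
        {
            if (aborted)
            {
                // A critical failure earlier in the scenario skips everything after it.
                statuses.Add(StepStatus.Skipped);
                continue;
            }

            try
            {
                step();
                statuses.Add(StepStatus.Passed);
            }
            catch (SpecAssertionException)
            {
                // Mismatch: record it and keep running to gather more diagnostics.
                statuses.Add(StepStatus.Failed);
            }
            catch (Exception)
            {
                // Any other exception aborts the rest of this scenario.
                statuses.Add(StepStatus.Critical);
                aborted = true;
            }
        }

        return statuses;
    }
}
```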
Wolverine has its own error handling (retry, circuit breaker, DLQ). The spec runner needs to work with these, not fight them:
- Handler retries: If the handler is configured with RetryNow(typeof(SqlException), 50, 100, 250), Wolverine will retry transparently. The spec runner only sees the final outcome (success or exhausted retries). This is correct: the spec tests business behavior, not retry infrastructure.
- DLQ routing: If a message hits the DLQ during a When step, the spec runner should report this as diagnostic info (not necessarily a failure, since the scenario might be testing DLQ behavior).
- Circuit breaker trips: If a circuit breaker trips during test execution, classify as Critical (the endpoint is paused, subsequent steps can't run).
- TrackedSession exceptions: When using InvokeMessageAndWaitAsync, the tracked session captures exceptions from cascading handlers. The spec runner should check session.HasExceptions() and report them as part of the diagnostic output, even if the immediate When step "succeeded."
Borrowed from Storyteller's two-lifecycle model:
- @acceptance: work-in-progress scenarios. Failures are informational; they appear in reports but don't fail the CI build. Useful during active development.
- @regression: production-ready scenarios. Any failure breaks the CI build. This is the default if no tag is specified.
@acceptance
Scenario: Complex multi-tenant order flow
# Still being developed — won't break CI
@regression
Scenario: Place an order
# Must pass in CI: this is a regression test

The dotnet run -- spec run command returns:
- Exit code 0: all @regression scenarios passed (acceptance failures are OK)
- Exit code 1: one or more @regression scenarios failed
- Exit code 2: catastrophic failure (suite aborted early)
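The exit-code rules above reduce to a few lines. In this sketch, `ScenarioResult`, `Lifecycle`, and `ExitCodes` are illustrative names rather than an existing API, and the catastrophic flag is assumed to be set by whatever part of the runner aborted the suite.

```csharp
using System.Collections.Generic;
using System.Linq;

public enum Lifecycle { Acceptance, Regression }

// Hypothetical summary of one executed scenario.
public sealed record ScenarioResult(string Name, Lifecycle Kind, bool Passed);

public static class ExitCodes
{
    public static int Compute(IEnumerable<ScenarioResult> results, bool catastrophic)
    {
        // A catastrophic failure takes precedence over everything gathered so far.
        if (catastrophic) return 2;

        // Only @regression failures break the build; @acceptance failures are informational.
        bool regressionFailed = results.Any(r => r.Kind == Lifecycle.Regression && !r.Passed);
        return regressionFailed ? 1 : 0;
    }
}
```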
Strict by default — matches Storyteller's philosophy that flaky tests must be fixed, not masked:
- Default: no retries. A failing scenario fails.
- @regression scenarios: Never retried, period. If it fails, the build fails.
- @acceptance scenarios: Optionally retried once (configurable via --retry-acceptance flag). This gives early-stage scenarios a small grace period.
- Critical/catastrophic failures: Never retried regardless of any configuration. An infrastructure failure won't magically fix itself on retry.
- Per-scenario override: @retry(3) tag allows a specific scenario to retry (useful for known-flaky external dependencies in acceptance tests only).
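The retry rules can be expressed as a single decision function. The sketch below makes two assumptions that the text leaves open: `@retry(n)` is read as "up to n retries", and it is ignored on `@regression` scenarios. `RetryPolicy` and its parameter names are hypothetical.

```csharp
using System;

public enum Lifecycle { Acceptance, Regression }

public static class RetryPolicy
{
    // retryTag: the n from a parsed @retry(n) tag, or null when the tag is absent.
    // infrastructureFailure: true for critical or catastrophic failures.
    public static int AllowedRetries(
        Lifecycle lifecycle, bool retryAcceptanceFlag, int? retryTag, bool infrastructureFailure)
    {
        // Critical and catastrophic failures are never retried.
        if (infrastructureFailure) return 0;

        // Regression scenarios are never retried, even with a @retry tag (assumption).
        if (lifecycle == Lifecycle.Regression) return 0;

        // A per-scenario tag overrides the global flag for acceptance scenarios.
        if (retryTag is int n) return Math.Max(0, n);

        // The --retry-acceptance flag grants a single retry; the default is none.
        return retryAcceptanceFlag ? 1 : 0;
    }
}
```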
Each scenario has a configurable timeout (default: 30 seconds). When a scenario times out:
- The current step is marked as Critical (timed out)
- Remaining steps are marked as Skipped
- The TearDown runs regardless (borrowed from Storyteller: Dispose always runs)
- The next scenario proceeds
@timeout(60)
Scenario: Long-running projection rebuild
Given 10000 events for Order
When projection OrderSummary is rebuilt
Then the projected OrderSummary contains 10000 orders

Global timeout override: dotnet run -- spec run --timeout 120
The specification format for the Critter Stack is C#/F# record types as the source of truth for domain shapes, combined with Gherkin for behavioral specifications and test authoring. No external DSL, no YAML schema, no XML format.
EMN (Event Modeling Notation) — Holixon/AxonIQ's XML-based specification format. Evaluated against our requirements:
| Criterion | EMN | C# Records + Gherkin |
|---|---|---|
| Developer authoring experience | XML with embedded JSON Schema in CDATA blocks. Designed for visual tooling, not hand-authoring | C# records: public record PlaceOrder(string CustomerName); — already compilable, IDE-supported |
| IDE support | No .NET tooling exists. Would need a custom XML parser | Full IntelliSense for C# types. Gherkin has syntax highlighting + step completion in Rider/VS Code |
| Type richness | JSON Schema for field definitions — no generics, no nullable reference types, no C# type system | Full C# type system — generics, nullable, enums, value objects, inheritance |
| Test data expressiveness | Specifications reference flow node IDs only — no inline test data, no tables | Gherkin tables with inline values, default values, set verification |
| Ecosystem | Java/AxonIQ only. No .NET parser, no community SDK | Reqnroll for .NET, Cucumber ecosystem, wide IDE support |
| Information redundancy | Re-encodes command→event flows that Wolverine already discovers from handler signatures | Zero redundancy — the C# types ARE the authoritative definitions |
| AI friendliness | XML is parseable but verbose; LLMs handle C# and Gherkin much better | C# records and Gherkin are among the best-understood formats by LLMs |
TEML / Emlang — YAML-based event modeling languages. Cleaner than EMN but still introduce an external DSL that duplicates what C# records already express. The YAML provides no additional semantic value over the records themselves, and requires maintaining two representations of the same information.
EventSauce YAML — PHP-specific, generates classes from YAML definitions. This is the opposite direction from what we want: we want types-first, not schema-first. C# records are already more expressive than any YAML schema could be.
C# and F# record syntax is already the most concise way to define domain types:
// This IS the specification. No DSL needed.
public record PlaceOrder(string CustomerName, List<OrderItem> Items);
public record OrderPlaced(Guid OrderId, string CustomerName, List<OrderItem> Items);
public record OrderItem(string Name, decimal Price);

// F# is even more terse (OrderItem is declared first because F# resolves types in declaration order)
open System
type OrderItem = { Name: string; Price: decimal }
type PlaceOrder = { CustomerName: string; Items: OrderItem list }
type OrderPlaced = { OrderId: Guid; CustomerName: string; Items: OrderItem list }

Benefits:
- Compiles — the types are real code, not a disconnected specification artifact
- IDE support from day one — IntelliSense, refactoring, Find Usages all work
- No synchronization problem — the types are the single source of truth, not a copy
- AI reads them natively — LLMs understand C# records better than any custom DSL
- Role inference from naming conventions: imperative (PlaceOrder) = command, past tense (OrderPlaced) = event, class with Apply() methods = aggregate. No metadata annotations needed (though marker interfaces remain available for teams that want them)
Gherkin expresses what C# types cannot: expected behavior under specific conditions. The types tell you the shape; Gherkin tells you the story:
Scenario: Cannot ship a cancelled order
Given events for Order
| OrderPlaced |
| OrderId: <id>, CustomerName: Alice |
| OrderCancelled |
| OrderId: <id>, Reason: Changed mind |
When ShipOrder is received
| OrderId |
| <id> |
Then validation fails with "Cannot ship a cancelled order"

No type definition can express this. The Gherkin adds the behavioral dimension: the business rules, edge cases, and validation constraints that make the domain meaningful.
The tool reads C# types (via Roslyn or reflection), infers roles from naming conventions, and generates Gherkin scenarios describing the expected happy-path behavior. The developer then extends with business rules and edge cases. The Gherkin drives both test execution (Reqnroll-compatible) and handler scaffold generation.
No EMN import/export is planned. If an AxonIQ migration scenario arises, a one-time conversion tool (dotnet run -- import-emn) could be built as a weekend project — it would output C# records, not maintain an ongoing EMN representation.
- Package structure: New NuGet (e.g., CritterStack.Specs)? Or part of Wolverine.CritterWatch?
- Feature file location convention: Specs/ folder alongside src/? Or colocated with the aggregate?
- How to handle projections in Gherkin: Then the OrderSummary projection contains with table verification?
- Multi-tenant scenarios: Given tenant "tenant-1" step that sets the tenant context?
- Saga testing: How to express multi-step saga workflows in Gherkin?
- F# support: F# discriminated unions as event types — the AI needs to understand these too
- Type discovery: Roslyn source analysis (works at design time, before compilation) vs reflection (requires compiled assembly)?
| Competitor | What they have | What we'd have |
|---|---|---|
| AxonIQ EMN + AI Dev Agent | Proprietary XML format, Java-only, platform-locked, visual-tool-oriented | Open Gherkin, C# types as source, any .NET IDE, AI-assisted |
| TEML/Emlang | YAML DSL, generates diagrams only | Real C# types + executable Gherkin + full test execution + HTML reports |
| EventSauce (PHP) | YAML → class codegen (opposite direction) | C# types → Gherkin → handlers + tests + living documentation |
| Context Mapper | CML DSL for modeling, no code generation | Types → specs → code → tests → living documentation |
| Reqnroll | Generic BDD, no event sourcing awareness | Event sourcing-native step definitions, TrackedSession integration |
| Storyteller (legacy) | Rich rendering, correlated logging, set verification | Same vision, modern implementation, AI-assisted, Gherkin-based |