Spec-Driven Development for the Critter Stack — Design Document

Vision

A two-goal system:

Spec-driven development — developers write C# record types and Gherkin scenarios that drive handler scaffolding, projection stubs, and executable tests for Wolverine/Marten/Polecat
An AI-friendly Gherkin tool — focused on the Critter Stack first, but potentially replacing some acceptance testing in Wolverine and Marten themselves

The Three Layers

Layer 1: Shape Definition (developer writes C#/F#)

The developer defines domain types as plain records. These are the source of truth for structure:

// Commands: imperative naming convention
public record PlaceOrder(string CustomerName, List<OrderItem> Items);
public record ShipOrder(Guid OrderId);
public record CancelOrder(Guid OrderId, string Reason);

// Events: past-tense naming convention
public record OrderPlaced(Guid OrderId, string CustomerName, List<OrderItem> Items);
public record OrderShipped(Guid OrderId, DateTimeOffset ShippedAt);
public record OrderCancelled(Guid OrderId, string Reason);

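// Value objects
public record OrderItem(string Name, decimal Price);

// Status values (the same set listed under "Selection lists" in this document)
public enum OrderStatus { Placed, Shipped, Cancelled, Refunded }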

// Aggregate
public class Order
{
    public Guid Id { get; set; }
    public string CustomerName { get; set; } = "";
    public List<OrderItem> Items { get; set; } = new();
    public OrderStatus Status { get; set; }

    public void Apply(OrderPlaced e) { /* ... */ }
    public void Apply(OrderShipped e) { /* ... */ }
    public void Apply(OrderCancelled e) { /* ... */ }
}

Role inference from naming conventions:

Imperative form (PlaceOrder, ShipOrder) → command
Past tense (OrderPlaced, OrderShipped) → event
Class with Apply() methods → aggregate
Optional: ICommand, IEvent marker interfaces still work if teams prefer them
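
The inference rules above could be prototyped with plain reflection. The following is a minimal sketch, not the tool's actual implementation: `RoleInference`, `TypeRole`, and the "name ends in ed" check are assumptions made for illustration, and the past-tense test is deliberately naive (irregular verbs such as `OrderSent` would need a marker interface or an override list, and value objects would have to be excluded separately).

```csharp
using System;
using System.Linq;
using System.Reflection;

// Usage: classify three sample types shaped like the ones in this document.
foreach (var type in new[] { typeof(PlaceOrder), typeof(OrderPlaced), typeof(Order) })
{
    Console.WriteLine($"{type.Name} -> {RoleInference.Infer(type)}");
}

public enum TypeRole { Command, Event, Aggregate }

// Optional marker interfaces (declared here only to keep the sketch self-contained).
public interface ICommand { }
public interface IEvent { }

public static class RoleInference
{
    public static TypeRole Infer(Type type)
    {
        // 1. Explicit markers win over any naming heuristic.
        if (typeof(ICommand).IsAssignableFrom(type)) return TypeRole.Command;
        if (typeof(IEvent).IsAssignableFrom(type)) return TypeRole.Event;

        // 2. A type with a public instance Apply(x) method is treated as an aggregate.
        var hasApply = type
            .GetMethods(BindingFlags.Public | BindingFlags.Instance)
            .Any(m => m.Name == "Apply" && m.GetParameters().Length == 1);
        if (hasApply) return TypeRole.Aggregate;

        // 3. Naive heuristic (assumption): a name ending in "ed" reads as past tense,
        //    so it is an event; anything else is assumed to be an imperative command.
        return type.Name.EndsWith("ed", StringComparison.Ordinal)
            ? TypeRole.Event
            : TypeRole.Command;
    }
}

public record PlaceOrder(string CustomerName);
public record OrderPlaced(Guid OrderId, string CustomerName);

public class Order
{
    public Guid Id { get; set; }
    public void Apply(OrderPlaced e) { }
}
```

Checking marker interfaces first means a team can always override a wrong guess from the naming heuristic without renaming a type.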

Layer 2: Implied Gherkin (AI generates, continuously synced)

The AI reads the C# types and generates .feature files describing expected behavior. These are continuously regenerated when types change — the developer does not hand-edit these files.

# Auto-generated by Critter Stack Spec Tool — DO NOT EDIT
# Source: Order aggregate, PlaceOrder command, OrderPlaced event

Feature: Order Aggregate

  Background:
    Given an empty Order stream

  Scenario: Place an order
    When PlaceOrder is received
      | CustomerName | Items                    |
      | Alice        | Widget ($9.99)           |
    Then OrderPlaced is emitted
      | OrderId      | CustomerName | Items          |
      | <generated>  | Alice        | Widget ($9.99) |

  Scenario: Ship an order
    Given events for Order
      | OrderPlaced                          |
      | OrderId: <id>, CustomerName: Alice   |
    When ShipOrder is received
      | OrderId |
      | <id>    |
    Then OrderShipped is emitted

  Scenario: Cancel an order
    Given events for Order
      | OrderPlaced                          |
      | OrderId: <id>, CustomerName: Alice   |
    When CancelOrder is received
      | OrderId | Reason       |
      | <id>    | Changed mind |
    Then OrderCancelled is emitted

Layer 3: Developer-authored scenarios (Gherkin, developer owns)

The developer writes additional scenarios covering business rules, edge cases, and validation. These files are never overwritten by the AI tool:

# Developer-authored — managed by the team
# Locked scenarios are marked with @locked tag

Feature: Order Business Rules

  @locked
  Scenario: Cannot ship a cancelled order
    Given events for Order
      | OrderPlaced                          |
      | OrderId: <id>, CustomerName: Alice   |
      | OrderCancelled                       |
      | OrderId: <id>, Reason: Changed mind  |
    When ShipOrder is received
      | OrderId |
      | <id>    |
    Then validation fails with "Cannot ship a cancelled order"

  @locked
  Scenario: High-value order gets flagged
    Given an empty Order stream
    When PlaceOrder is received
      | CustomerName | Items                   |
      | Alice        | Expensive Thing ($5000) |
    Then OrderPlaced is emitted
    And HighValueOrderFlagged is emitted

The @locked tag tells the sync tool to never modify this scenario.

Step Definitions

Critter Stack-specific step definitions

Pre-built step definitions that work with Marten, Polecat, and EF Core:

Event stream setup (Given):

# Start with an empty stream
Given an empty {aggregateType} stream

# Pre-populate with events (Marten/Polecat)
Given events for {aggregateType}
  | OrderPlaced                              |
  | OrderId: <id>, CustomerName: Alice       |
  | OrderShipped                             |
  | OrderId: <id>                            |

# Load existing entity (EF Core)
Given an existing {entityType}
  | Id   | Name  | Status |
  | <id> | Alice | Active |

# Table/document setup with defaults
Given a {documentType} document
  | Name  | Status |
  | Alice |        |
  # Status uses default value, not specified

Command execution (When):

When {CommandType} is received
  | Field1 | Field2 |
  | value  | value  |

# HTTP endpoint variant
When POST /api/orders
  | CustomerName | Items          |
  | Alice        | Widget ($9.99) |
Then status code is 201

Assertions (Then):

# Event emission
Then {EventType} is emitted
Then {EventType} is emitted
  | Field    | Expected |
  | OrderId  | <id>     |

# Multiple events
Then events emitted in order
  | OrderPlaced    |
  | OrderShipped   |

# No events
Then no events are emitted

# Validation failure
Then validation fails with "{message}"

# Aggregate state verification
Then the Order aggregate state is
  | Property     | Expected    |
  | Status       | Shipped     |
  | CustomerName | Alice       |

# Cascading messages (Wolverine)
Then {MessageType} is published
Then {MessageType} is scheduled for {duration}

# Collection/set verification
Then the projected OrderSummary contains
  | OrderId | CustomerName | Status  |
  | <id1>   | Alice        | Placed  |
  | <id2>   | Bob          | Shipped |
  # Rows present in the actual data but not listed here are reported as extra
  # Rows listed here but absent from the actual data are reported as missing
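
The matched/missing/extra bookkeeping behind set verification can be sketched in a few lines. This is an illustrative sketch only: `SetVerification`, `SetComparison<T>`, and `OrderSummaryRow` are hypothetical names, and it assumes rows are compared by value (which C# records provide by default).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var expected = new[]
{
    new OrderSummaryRow("id1", "Alice", "Placed"),
    new OrderSummaryRow("id2", "Bob", "Shipped"),
};
var actual = new[]
{
    new OrderSummaryRow("id1", "Alice", "Placed"),
    new OrderSummaryRow("id3", "Carol", "Placed"),
};

var result = SetVerification.Compare(expected, actual);
// prints "matched=1 missing=1 extra=1"
Console.WriteLine($"matched={result.Matched.Count} missing={result.Missing.Count} extra={result.Extra.Count}");

public record OrderSummaryRow(string OrderId, string CustomerName, string Status);

public record SetComparison<T>(
    IReadOnlyList<T> Matched,
    IReadOnlyList<T> Missing,
    IReadOnlyList<T> Extra);

public static class SetVerification
{
    public static SetComparison<T> Compare<T>(IEnumerable<T> expected, IEnumerable<T> actual)
        where T : notnull
    {
        // Treat the actual rows as a multiset so duplicates are matched one-for-one.
        var remaining = actual.ToList();
        var matched = new List<T>();
        var missing = new List<T>();

        foreach (var row in expected)
        {
            // List<T>.Remove uses the default equality comparer; records compare by value.
            if (remaining.Remove(row)) matched.Add(row);
            else missing.Add(row);
        }

        // Whatever is left in `remaining` was never expected.
        return new SetComparison<T>(matched, missing, remaining);
    }
}
```

Row order is ignored on purpose, since projections rarely guarantee ordering; an ordered comparison would be a separate step.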

Default values (Storyteller-inspired)

Step definitions support default values to reduce verbosity:

// Step definition with defaults
[Given("an Order with defaults")]
public void GivenOrderWithDefaults(
    [DefaultValue("Alice")] string customerName,
    [DefaultValue("Widget ($9.99)")] string items)
{
    // Sets up an Order stream with these defaults
    // Developer only specifies fields they care about
}

In Gherkin, unspecified table columns use defaults:

Given an Order with defaults
  | CustomerName |
  | Bob          |
  # Items defaults to "Widget ($9.99)"
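
One way to picture the default-value resolution is as a dictionary merge: start from the declared defaults and overlay whatever the Gherkin row specifies. The sketch below is an assumption about how this could work, not the tool's API; `StepDefaults.Resolve` is a hypothetical helper, and it treats a blank cell the same as a missing column, as in the document-setup example earlier.

```csharp
using System;
using System.Collections.Generic;

var defaults = new Dictionary<string, string>
{
    ["CustomerName"] = "Alice",
    ["Items"] = "Widget ($9.99)",
};
var row = new Dictionary<string, string> { ["CustomerName"] = "Bob" };

var resolved = StepDefaults.Resolve(defaults, row);
// prints "Bob, Widget ($9.99)"
Console.WriteLine($"{resolved["CustomerName"]}, {resolved["Items"]}");

public static class StepDefaults
{
    public static Dictionary<string, string> Resolve(
        IReadOnlyDictionary<string, string> defaults,
        IReadOnlyDictionary<string, string> row)
    {
        // Copy the defaults first, then let explicit cell values win.
        var resolved = new Dictionary<string, string>(defaults);
        foreach (var (column, value) in row)
        {
            // A blank cell means "use the default", same as an omitted column.
            if (!string.IsNullOrWhiteSpace(value)) resolved[column] = value;
        }
        return resolved;
    }
}
```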

Selection lists (Storyteller-inspired)

For constrained values:

[SelectionValues("Placed", "Shipped", "Cancelled", "Refunded")]
public OrderStatus Status { get; set; }

Rendering & Output

HTML specification rendering

Storyteller's key insight: the specification IS the documentation. The HTML rendering shows:

The specification — Gherkin scenarios rendered as readable prose with input values highlighted
Pass/fail state — each step colored green (pass), red (fail), or yellow (pending)
Actual vs expected — for failed assertions, show what was expected and what was received
Set verification — tables with matched rows (green), missing rows (red), extra rows (yellow)
Correlated logging — Wolverine tracked session output rendered inline between steps

Spectre.Console CLI rendering

Command-line output for dotnet run -- spec run:

Order Aggregate
═══════════════════════════════════════════════════════

  ✅ Place an order
     Given an empty Order stream
     When PlaceOrder is received
       │ CustomerName │ Items          │
       │ Alice        │ Widget ($9.99) │
     Then OrderPlaced is emitted ✅
       │ OrderId       │ CustomerName │ Items          │
       │ a1b2c3d4-...  │ Alice        │ Widget ($9.99) │

  ❌ Cannot ship a cancelled order
     Given events for Order
       │ OrderPlaced                        │
       │ OrderCancelled                     │
     When ShipOrder is received
       │ OrderId      │
       │ a1b2c3d4-... │
     Then validation fails with "Cannot ship a cancelled order" ❌
       Expected: ProblemDetails with "Cannot ship a cancelled order"
       Actual:   OrderShipped event emitted (handler missing validation)

  Tracked Session:
  ┌─────────┬──────────────────┬────────────┬──────────┐
  │ Order   │ Message Type     │ Direction  │ Duration │
  ├─────────┼──────────────────┼────────────┼──────────┤
  │ 1       │ ShipOrder        │ Received   │ 2ms      │
  │ 2       │ OrderShipped     │ Sent       │ 0ms      │
  └─────────┴──────────────────┴────────────┴──────────┘

  Marten Events Appended:
  ┌─────────┬──────────────────┬───────────┐
  │ Version │ Event Type       │ Stream    │
  ├─────────┼──────────────────┼───────────┤
  │ 1       │ OrderPlaced      │ a1b2c3d4  │
  │ 2       │ OrderCancelled   │ a1b2c3d4  │
  │ 3       │ OrderShipped     │ a1b2c3d4  │
  └─────────┴──────────────────┴───────────┘

TrackedSession integration

After each When step, the tool captures and renders:

TrackedSession.RecordsInOrder() — tabular display of all messages sent, received, and executed during the step
Marten/Polecat events appended — all events written to the event store during the step
EF Core changes — entities inserted, updated, deleted (from ChangeTracker)
Timing — execution time per step and per message

This output serves two purposes:

Human readable in HTML and Spectre.Console
AI scrapable for troubleshooting failing tests (the textual output is structured enough for an LLM to reason about)

Tooling

CLI Commands (JasperFx commands in Wolverine.CritterWatch)

# Generate Gherkin from C# types (Layer 2)
dotnet run -- spec generate

# Run all specifications
dotnet run -- spec run

# Run with HTML report output
dotnet run -- spec run --html report.html

# Run a specific feature
dotnet run -- spec run --feature "Order Aggregate"

# Sync Gherkin after type changes (non-destructive to @locked scenarios)
dotnet run -- spec sync

MCP Tools (CritterWatch.Services)

generate_gherkin_from_types   — Read C# types, generate .feature files
sync_gherkin_with_types       — Update generated Gherkin after type changes
run_specification             — Execute specs, return structured results
diagnose_failing_spec         — Analyze TrackedSession output from a failed spec

Reqnroll Integration

The generated step definitions are Reqnroll-compatible. The .feature files can be executed by:

The Critter Stack spec tool (primary — full rendering, tracked session capture)
Standard Reqnroll/xUnit (fallback — runs in CI without the full rendering)

What the AI generates vs what the developer writes

| Artifact | Who writes it | AI regenerates? |
| --- | --- | --- |
| C# record types (commands, events) | Developer | Never |
| Aggregate classes with Apply methods | Developer | Never |
| Generated `.feature` files (Layer 2) | AI | Yes, continuously |
| Developer `.feature` files (Layer 3) | Developer | Never (except obvious syntax fixes) |
| Step definition classes | AI | Yes, for generated steps |
| `@locked` step definitions | Developer | Never |
| Handler scaffolds | AI (from Gherkin scenarios) | On request |
| Projection stubs | AI (from event types) | On request |
| HTML/Spectre rendering | Tool | Always |

Relationship to Existing AI Skills & MCP Tools

This builds on:

ScaffoldVerticalSlice MCP tool → generates handler skeletons from feature descriptions
ScaffoldHandler MCP tool → generates handler stubs from message types
DiagnoseDescribeOutput MCP tool → analyzes runtime configuration
Integration Testing skill → tracked sessions, WaitForNonStaleProjectionDataAsync
Pure Functions & Testability skill → Given/When/Then maps to aggregate handler testing

Error Handling & Abort Semantics

Background: Lessons from Storyteller

Storyteller 3 introduced a fail-fast model that distinguished between exceptions at different severity levels. The key insight was that not all failures mean the same thing:

A wrong assertion value means the business logic is wrong — you want to keep running to gather more diagnostic data
An NpgsqlException mid-scenario means the database connection dropped — continuing this scenario is pointless, but other scenarios might still work
The IHost failing to start means nothing can run — stop the entire suite immediately

Storyteller formalized this as StorytellerCriticalException (abort the current spec, skip retries) and StorytellerCatastrophicException (halt all execution). Fixture SetUp() and TearDown() failures were automatically treated as critical. The system's ISystem failing to bootstrap was automatically catastrophic.

Storyteller also enforced strict retry discipline: acceptance specs were informational only (never fail the build), regression specs were mandatory (failures break the build), and critical/catastrophic exceptions were never retried regardless of configuration. This prevented CI builds from burning time retrying fundamentally broken infrastructure.

Three failure levels

We adopt and extend this model for the Critter Stack spec runner:

| Level | Scope | Behavior | Continue? | Retry? |
| --- | --- | --- | --- | --- |
| Assertion failure | Single step | Mark step as failed, continue executing remaining steps in the scenario | Yes (gather all failures) | Per policy |
| Critical failure | Single scenario | Abort this scenario immediately, skip remaining steps, proceed to next scenario | No (skip to next scenario) | Never |
| Catastrophic failure | Entire suite | Stop all execution, report results gathered so far, exit with non-zero code | No (stop everything) | Never |

How failures are classified

Assertion failures are the expected case — the system ran but produced the wrong result. These are the most valuable failures because they tell you what the code is doing wrong:

# Assertion: expected vs actual mismatch
# The step ran, the handler executed, but the output wasn't what we expected
# → Mark step as FAILED, continue to next step in this scenario
Then OrderShipped is emitted
  # actual: no events emitted → FAIL, continue
Then the Order status is Shipped
  # actual: status is Placed → FAIL, continue

Critical failures are infrastructure exceptions during step execution — the step couldn't even run properly. Continuing the scenario is pointless because subsequent steps depend on the failed step's side effects:

// Critical: infrastructure exception during a When step
// The handler threw an unhandled NpgsqlException, timeout, etc.
// → Abort this scenario, mark remaining steps as SKIPPED, move to next scenario

// Automatically classified as critical:
//   - Unhandled exceptions in Given steps (data setup failure)
//   - Unhandled exceptions in When steps (command execution failure)
//   - Connection timeouts, serialization errors, OOM
//   - Fixture SetUp/TearDown failures

// Developer can also force this:
throw new SpecCriticalException("Database connection lost mid-scenario");

Catastrophic failures mean the entire test environment is broken — nothing else will work either:

// Catastrophic: system-level failure
// → Stop all execution immediately, report what we have, exit with error code

// Automatically classified as catastrophic:
//   - IHost fails to start (UseWolverine, AddMarten, etc. throws)
//   - Marten/Polecat schema migration fails
//   - No database connection at all (initial connection refused)
//   - Docker container not running
//   - Port already in use

// Developer can also force:
throw new SpecCatastrophicException("External dependency permanently unavailable");

The key distinction: When vs Then failures

The step type matters for classification:

| Step type | Exception thrown | Classification | Rationale |
| --- | --- | --- | --- |
| `Given` | Any exception | Critical | Data setup failed, so the scenario can't proceed |
| `When` | Any exception | Critical | Command execution failed, so there is nothing to assert against |
| `Then` | Assertion mismatch | Assertion (continue) | System ran but produced wrong result; gather more data |
| `Then` | Unexpected exception | Critical | Assertion code itself crashed, which is different from a mismatch |

This means a scenario like "Ship a cancelled order" that throws InvalidOperationException from the handler would be classified differently depending on WHERE the exception surfaces:

If the handler throws and it's unhandled by Wolverine → Critical (When step failed)
If the handler returns ProblemDetails and the Then step finds a status code other than the expected one → Assertion failure (continue)
If the Then step's own assertion code crashes with NullReferenceException → Critical (test infrastructure bug)
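
The classification rules above reduce to a small decision function. The sketch below is one possible shape under stated assumptions: `SpecCriticalException` and `SpecCatastrophicException` are the names used in this document, while `SpecAssertionException`, `StepKind`, `FailureLevel`, and `FailureClassifier` are hypothetical names introduced for illustration.

```csharp
using System;

Console.WriteLine(FailureClassifier.Classify(StepKind.Then, new SpecAssertionException("status mismatch")));  // Assertion
Console.WriteLine(FailureClassifier.Classify(StepKind.Then, new NullReferenceException()));                   // Critical
Console.WriteLine(FailureClassifier.Classify(StepKind.When, new TimeoutException()));                         // Critical
Console.WriteLine(FailureClassifier.Classify(StepKind.Given, new SpecCatastrophicException("host down")));    // Catastrophic

public enum StepKind { Given, When, Then }
public enum FailureLevel { Assertion, Critical, Catastrophic }

// Hypothetical: thrown by Then steps when expected and actual values differ.
public class SpecAssertionException : Exception
{
    public SpecAssertionException(string message) : base(message) { }
}

public class SpecCriticalException : Exception
{
    public SpecCriticalException(string message) : base(message) { }
}

public class SpecCatastrophicException : Exception
{
    public SpecCatastrophicException(string message) : base(message) { }
}

public static class FailureClassifier
{
    public static FailureLevel Classify(StepKind step, Exception ex) => ex switch
    {
        // Explicit developer signals are honored regardless of step type.
        SpecCatastrophicException => FailureLevel.Catastrophic,
        SpecCriticalException => FailureLevel.Critical,

        // Only a mismatch raised from a Then step counts as an assertion failure.
        SpecAssertionException when step == StepKind.Then => FailureLevel.Assertion,

        // Everything else (Given/When exceptions, crashes inside Then) is critical.
        _ => FailureLevel.Critical,
    };
}
```

Automatic catastrophic classification (host startup, schema migration) would happen outside step execution, so it is not modeled here.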

Assertion continuation within a scenario

When a Then step fails with an assertion (not an exception), the runner continues executing subsequent Then steps. This produces richer diagnostic output — you see ALL failures in a scenario, not just the first one:

  ❌ Ship an order
     Given events for Order
       │ OrderPlaced │ OrderCancelled │
     When ShipOrder is received ✅
     Then validation fails with "Cannot ship a cancelled order" ❌
       Expected: ProblemDetails with "Cannot ship a cancelled order"
       Actual: No ProblemDetails returned
     Then no events are emitted ❌
       Actual: OrderShipped was emitted
     Then the Order aggregate status is Cancelled ✅
       (this still ran and passed — useful diagnostic info)

Both failures are reported. The fact that the status IS Cancelled (last step passed) while events WERE emitted (second-to-last step failed) tells you the handler is appending OrderShipped without checking the aggregate status — that's a much more specific diagnosis than just "first assertion failed."

This continuation behavior is essential for AI troubleshooting: when the spec runner's structured output is fed to an LLM via the diagnose_failing_spec MCP tool, more assertion results = better diagnostic reasoning.
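
The continuation behavior can be sketched as a loop that records assertion mismatches and keeps going, but skips everything after a critical failure. This is a simplified illustration, not the runner's design: `ScenarioRunner`, `Step`, `Outcome`, and `SpecAssertionException` are hypothetical names, and the sketch treats any `SpecAssertionException` as a mismatch without checking the step type.

```csharp
using System;
using System.Collections.Generic;

var steps = new List<Step>
{
    new("When ShipOrder is received", () => { }),
    new("Then validation fails", () => throw new SpecAssertionException("No ProblemDetails returned")),
    new("Then no events are emitted", () => throw new SpecAssertionException("OrderShipped was emitted")),
    new("Then the Order aggregate status is Cancelled", () => { }),
};

foreach (var (text, outcome) in ScenarioRunner.Run(steps))
{
    Console.WriteLine($"{outcome,-8} {text}");
}

public enum Outcome { Passed, Failed, Critical, Skipped }

public record Step(string Text, Action Body);

// Hypothetical: signals an expected-vs-actual mismatch.
public class SpecAssertionException : Exception
{
    public SpecAssertionException(string message) : base(message) { }
}

public static class ScenarioRunner
{
    public static List<(string Text, Outcome Outcome)> Run(IEnumerable<Step> steps)
    {
        var results = new List<(string Text, Outcome Outcome)>();
        var aborted = false;

        foreach (var step in steps)
        {
            if (aborted)
            {
                // After a critical failure the remaining steps are only reported.
                results.Add((step.Text, Outcome.Skipped));
                continue;
            }

            try
            {
                step.Body();
                results.Add((step.Text, Outcome.Passed));
            }
            catch (SpecAssertionException)
            {
                // Mismatch: record it and keep executing later steps.
                results.Add((step.Text, Outcome.Failed));
            }
            catch (Exception)
            {
                // Anything else is treated as critical and aborts the scenario.
                results.Add((step.Text, Outcome.Critical));
                aborted = true;
            }
        }

        return results;
    }
}
```

With the sample steps, both failing Then steps are reported and the final Then step still runs and passes, which mirrors the rendered output above.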

Interaction with Wolverine's error handling

Wolverine has its own error handling (retry, circuit breaker, DLQ). The spec runner needs to work with these, not fight them:

Handler retries: If the handler is configured with RetryNow(typeof(SqlException), 50, 100, 250), Wolverine will retry transparently. The spec runner only sees the final outcome (success or exhausted retries). This is correct — the spec tests business behavior, not retry infrastructure.
DLQ routing: If a message hits the DLQ during a When step, the spec runner should report this as diagnostic info (not necessarily a failure — the scenario might be testing DLQ behavior).
Circuit breaker trips: If a circuit breaker trips during test execution, classify as Critical (the endpoint is paused, subsequent steps can't run).
TrackedSession exceptions: When using InvokeMessageAndWaitAsync, the tracked session captures exceptions from cascading handlers. The spec runner should check session.HasExceptions() and report them as part of the diagnostic output, even if the immediate When step "succeeded."

Lifecycle states (CI integration)

Borrowed from Storyteller's two-lifecycle model:

@acceptance — work-in-progress scenarios. Failures are informational — they appear in reports but don't fail the CI build. Useful during active development.
@regression — production-ready scenarios. Any failure breaks the CI build. This is the default if no tag is specified.

@acceptance
Scenario: Complex multi-tenant order flow
  # Still being developed — won't break CI

@regression
Scenario: Place an order
  # Must pass in CI — this is a regression test

The dotnet run -- spec run command returns:

Exit code 0: all @regression scenarios passed (acceptance failures are OK)
Exit code 1: one or more @regression scenarios failed
Exit code 2: catastrophic failure (suite aborted early)
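
The exit-code rules can be expressed as a short function. This is a sketch under assumptions: `SpecExitCode`, `ScenarioResult`, and `Lifecycle` are hypothetical names, and an untagged scenario is treated as `@regression`, as stated above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var results = new[]
{
    new ScenarioResult("Place an order", SpecExitCode.FromTags(new[] { "@regression" }), Passed: true),
    new ScenarioResult("Complex multi-tenant order flow", SpecExitCode.FromTags(new[] { "@acceptance" }), Passed: false),
};

// prints 0: the only failure is an @acceptance scenario
Console.WriteLine(SpecExitCode.Compute(results, suiteAborted: false));

public enum Lifecycle { Acceptance, Regression }

public record ScenarioResult(string Name, Lifecycle Stage, bool Passed);

public static class SpecExitCode
{
    // No lifecycle tag means @regression.
    public static Lifecycle FromTags(IEnumerable<string> tags) =>
        tags.Contains("@acceptance") ? Lifecycle.Acceptance : Lifecycle.Regression;

    public static int Compute(IEnumerable<ScenarioResult> results, bool suiteAborted)
    {
        // A catastrophic failure takes precedence over individual results.
        if (suiteAborted) return 2;

        var regressionFailed = results.Any(r => r.Stage == Lifecycle.Regression && !r.Passed);
        return regressionFailed ? 1 : 0;
    }
}
```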

Retry policy

Strict by default — matches Storyteller's philosophy that flaky tests must be fixed, not masked:

Default: no retries. A failing scenario fails.
@regression scenarios: Never retried, period. If it fails, the build fails.
@acceptance scenarios: Optionally retried once (configurable via --retry-acceptance flag). This gives early-stage scenarios a small grace period.
Critical/catastrophic failures: Never retried regardless of any configuration. An infrastructure failure won't magically fix itself on retry.
Per-scenario override: @retry(3) tag allows a specific scenario to retry (useful for known-flaky external dependencies in acceptance tests only).
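
The retry rules can be summarized as a function from scenario tags to an allowed retry count. The sketch below is illustrative only: `RetryPolicy` is a hypothetical helper, and it assumes `@retry(3)` means up to three retries after the first attempt, which the text above does not specify.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

Console.WriteLine(RetryPolicy.AllowedRetries(new[] { "@regression" }, retryAcceptance: true));               // 0
Console.WriteLine(RetryPolicy.AllowedRetries(new[] { "@acceptance" }, retryAcceptance: true));               // 1
Console.WriteLine(RetryPolicy.AllowedRetries(new[] { "@acceptance", "@retry(3)" }, retryAcceptance: false)); // 3

public static class RetryPolicy
{
    private static readonly Regex RetryTag = new(@"^@retry\((\d+)\)$");

    public static int AllowedRetries(string[] tags, bool retryAcceptance)
    {
        // Untagged scenarios are @regression, and regression scenarios are never retried.
        if (!tags.Contains("@acceptance")) return 0;

        // A per-scenario @retry(n) tag overrides the global flag (acceptance only).
        foreach (var tag in tags)
        {
            var match = RetryTag.Match(tag);
            if (match.Success) return int.Parse(match.Groups[1].Value);
        }

        // Assumed mapping of the --retry-acceptance flag: one retry when set.
        return retryAcceptance ? 1 : 0;
    }

    // Critical and catastrophic failures are never retried, whatever the tags say.
    public static bool ShouldRetry(int retriesUsed, int allowedRetries, bool infrastructureFailure) =>
        !infrastructureFailure && retriesUsed < allowedRetries;
}
```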

Timeout handling

Each scenario has a configurable timeout (default: 30 seconds). When a scenario times out:

The current step is marked as Critical (timed out)
Remaining steps are marked as Skipped
The TearDown runs regardless (borrowed from Storyteller — Dispose always runs)
The next scenario proceeds

@timeout(60)
Scenario: Long-running projection rebuild
  Given 10000 events for Order
  When projection OrderSummary is rebuilt
  Then the projected OrderSummary contains 10000 orders

Global timeout override: dotnet run -- spec run --timeout 120
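
A scenario timeout with guaranteed teardown could be built on `CancellationTokenSource` and a `finally` block. This is a minimal sketch with hypothetical names (`TimeoutRunner`, `ScenarioOutcome`); it relies on cooperative cancellation, so a step that ignores the token would not be interrupted and a real runner would need an additional safeguard.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var outcome = await TimeoutRunner.RunAsync(
    scenario: async ct => await Task.Delay(TimeSpan.FromSeconds(5), ct),
    tearDown: () => { Console.WriteLine("TearDown ran"); return Task.CompletedTask; },
    timeout: TimeSpan.FromMilliseconds(100));

Console.WriteLine(outcome); // TimedOut

public enum ScenarioOutcome { Completed, TimedOut }

public static class TimeoutRunner
{
    public static async Task<ScenarioOutcome> RunAsync(
        Func<CancellationToken, Task> scenario,
        Func<Task> tearDown,
        TimeSpan timeout)
    {
        // The token is cancelled automatically once the timeout elapses.
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            await scenario(cts.Token);
            return ScenarioOutcome.Completed;
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            // The caller would mark the current step Critical and the rest Skipped.
            return ScenarioOutcome.TimedOut;
        }
        finally
        {
            // TearDown runs whether the scenario completed, failed, or timed out.
            await tearDown();
        }
    }
}
```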

Design Decision: C#/F# Records + Gherkin (not EMN, not YAML DSL)

Decision

The specification format for the Critter Stack is C#/F# record types as the source of truth for domain shapes, combined with Gherkin for behavioral specifications and test authoring. No external DSL, no YAML schema, no XML format.

Alternatives evaluated

EMN (Event Modeling Notation) — Holixon/AxonIQ's XML-based specification format. Evaluated against our requirements:

| Criterion | EMN | C# Records + Gherkin |
| --- | --- | --- |
| Developer authoring experience | XML with embedded JSON Schema in CDATA blocks. Designed for visual tooling, not hand-authoring | C# records: `public record PlaceOrder(string CustomerName);` (already compilable, IDE-supported) |
| IDE support | No .NET tooling exists. Would need a custom XML parser | Full IntelliSense for C# types. Gherkin has syntax highlighting + step completion in Rider/VS Code |
| Type richness | JSON Schema for field definitions: no generics, no nullable reference types, no C# type system | Full C# type system: generics, nullable, enums, value objects, inheritance |
| Test data expressiveness | Specifications reference flow node IDs only: no inline test data, no tables | Gherkin tables with inline values, default values, set verification |
| Ecosystem | Java/AxonIQ only. No .NET parser, no community SDK | Reqnroll for .NET, Cucumber ecosystem, wide IDE support |
| Information redundancy | Re-encodes command→event flows that Wolverine already discovers from handler signatures | Zero redundancy: the C# types ARE the authoritative definitions |
| AI friendliness | XML is parseable but verbose; LLMs handle C# and Gherkin much better | C# records and Gherkin are among the best-understood formats by LLMs |

TEML / Emlang — YAML-based event modeling languages. Cleaner than EMN but still introduce an external DSL that duplicates what C# records already express. The YAML provides no additional semantic value over the records themselves, and requires maintaining two representations of the same information.

EventSauce YAML — PHP-specific, generates classes from YAML definitions. This is the opposite direction from what we want: we want types-first, not schema-first. C# records are already more expressive than any YAML schema could be.

Why C#/F# records as source of truth

C# and F# record syntax is already the most concise way to define domain types:

// This IS the specification. No DSL needed.
public record PlaceOrder(string CustomerName, List<OrderItem> Items);
public record OrderPlaced(Guid OrderId, string CustomerName, List<OrderItem> Items);
public record OrderItem(string Name, decimal Price);

// F# is even more terse (assumes `open System` for Guid).
// F# needs a type declared before it is used, so OrderItem comes first.
type OrderItem = { Name: string; Price: decimal }
type PlaceOrder = { CustomerName: string; Items: OrderItem list }
type OrderPlaced = { OrderId: Guid; CustomerName: string; Items: OrderItem list }

Benefits:

Compiles — the types are real code, not a disconnected specification artifact
IDE support from day one — IntelliSense, refactoring, Find Usages all work
No synchronization problem — the types are the single source of truth, not a copy
AI reads them natively — LLMs understand C# records better than any custom DSL
Role inference from naming conventions — imperative (PlaceOrder) = command, past tense (OrderPlaced) = event, class with Apply() methods = aggregate. No metadata annotations needed (though marker interfaces remain available for teams that want them)

Why Gherkin for behavior (not the types)

Gherkin expresses what C# types cannot: expected behavior under specific conditions. The types tell you the shape; Gherkin tells you the story:

Scenario: Cannot ship a cancelled order
  Given events for Order
    | OrderPlaced                          |
    | OrderId: <id>, CustomerName: Alice   |
    | OrderCancelled                       |
    | OrderId: <id>, Reason: Changed mind  |
  When ShipOrder is received
    | OrderId |
    | <id>    |
  Then validation fails with "Cannot ship a cancelled order"

No type definition can express this. The Gherkin adds the behavioral dimension — the business rules, edge cases, and validation constraints that make the domain meaningful.

What this means for the tooling

The tool reads C# types (via Roslyn or reflection), infers roles from naming conventions, and generates Gherkin scenarios describing the expected happy-path behavior. The developer then extends with business rules and edge cases. The Gherkin drives both test execution (Reqnroll-compatible) and handler scaffold generation.
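
As a rough illustration of the reflection path, the sketch below derives a `When` step and its table header from a command record. It is an assumption about one small piece of the generator, not its design: `GherkinSketch` is a hypothetical helper, and it relies on `GetProperties` returning properties in declaration order, which is typical for records but not formally guaranteed.

```csharp
using System;
using System.Linq;
using System.Reflection;

Console.WriteLine(GherkinSketch.WhenStep(typeof(CancelOrder)));

public record CancelOrder(Guid OrderId, string Reason);

public static class GherkinSketch
{
    public static string WhenStep(Type commandType)
    {
        // Public instance properties of a positional record map to table columns.
        var columns = commandType
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Select(p => p.Name)
            .ToArray();

        var header = "| " + string.Join(" | ", columns) + " |";
        return $"When {commandType.Name} is received{Environment.NewLine}  {header}";
    }
}
```

For `CancelOrder` this typically yields a step line followed by a `| OrderId | Reason |` header; example values for the rows would still have to come from the AI or the developer.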

No EMN import/export is planned. If an AxonIQ migration scenario arises, a one-time conversion tool (dotnet run -- import-emn) could be built as a weekend project — it would output C# records, not maintain an ongoing EMN representation.

Open Questions

Package structure: New NuGet (e.g., CritterStack.Specs)? Or part of Wolverine.CritterWatch?
Feature file location convention: Specs/ folder alongside src/? Or colocated with the aggregate?
How to handle projections in Gherkin: a "Then the OrderSummary projection contains" step with table verification?
Multi-tenant scenarios: a Given tenant "tenant-1" step that sets the tenant context?
Saga testing: How to express multi-step saga workflows in Gherkin?
F# support: F# discriminated unions as event types — the AI needs to understand these too
Type discovery: Roslyn source analysis (works at design time, before compilation) vs reflection (requires compiled assembly)?

Competitive Position

| Competitor | What they have | What we'd have |
| --- | --- | --- |
| AxonIQ EMN + AI Dev Agent | Proprietary XML format, Java-only, platform-locked, visual-tool-oriented | Open Gherkin, C# types as source, any .NET IDE, AI-assisted |
| TEML/Emlang | YAML DSL, generates diagrams only | Real C# types + executable Gherkin + full test execution + HTML reports |
| EventSauce (PHP) | YAML → class codegen (opposite direction) | C# types → Gherkin → handlers + tests + living documentation |
| Context Mapper | CML DSL for modeling, no code generation | Types → specs → code → tests → living documentation |
| Reqnroll | Generic BDD, no event sourcing awareness | Event sourcing-native step definitions, TrackedSession integration |
| Storyteller (legacy) | Rich rendering, correlated logging, set verification | Same vision, modern implementation, AI-assisted, Gherkin-based |
