Skip to content

Redesign the test reordering subsystem #6585

@sebastianbergmann

Description

@sebastianbergmann

Status quo

Test reordering is implemented by TestSuiteSorter and driven by two orthogonal knobs plus one toggle:

  • a main order: default, reverse, random, duration-ascending, duration-descending, size-ascending, size-descending
  • a defects overlay: on/off
  • a dependency resolver: on/off

Both the XML executionOrder attribute and the CLI --order-by option accept a comma-separated list where each token flips one of these three knobs. Under the hood, TestSuiteSorter::sort() applies them in a hard-coded three-phase pipeline:

  1. main order
  2. defects-first overlay (stable, so the main order is preserved inside
    each defect tier)
  3. dependency resolution

Recent fixes

Three real bugs in TestSuiteSorter have just been corrected:

  1. cmpDefectPriorityAndTime() used to fall through to a duration-based tie-break for any pair of equally-weighted items. Combined with PHP's stable usort(), this could invert the order produced by the main phase, breaking the implicit "unit suite before end-to-end suite" ordering when both contained equally-weighted defects. The comparator now returns 0 for equal weights and the preceding phase's order is preserved.
  2. Skipped and incomplete tests were being incorrectly treated as defects. They are no longer reordered by the "defects first" strategy.
  3. cmpSize treated every non-TestCase, non-DataProviderTestSuite input as unknown, which meant a plain TestSuite wrapping a TestCase class was always sorted as unknown rather than by the size of the tests it contained. The comparator now recurses into child tests and takes the maximum size weight.

What is still "awkward"

Even with the bug fixes, several things about the current design feel wrong:

  • Phase order is hard-coded. The user cannot express "resolve dependencies first, then apply defects on top of the result", or "defects first, then randomize the non-defective tail", even though both are sensible asks. The comma-separated token list looks like a pipeline but is really just three flags whose precedence is fixed in sort().
  • Contradictory tokens are silently accepted. duration-ascending,duration-descending resolves to "last wins" without a warning, and reverse,random is similarly under-specified. A misspelled XML token is ignored entirely.
  • default means two different things. As a token it resets every knob to its default; as an enum value it means "no main order". This conflates "I have an opinion about the default" with "I don't".
  • The "defects" notion is binary. Every defective status collapses into "hoist to front, ordered by severity". There is no way to say "prioritize only failures and errors", or "put warnings last, before success".
  • The result cache drives ordering, but the cache is opaque to the user. There is no way to see the priorities the sorter is about to apply without reading the JSON by hand, and no diagnostic when the cache is stale or absent.
  • Suites and test cases are treated the same by the comparator. A top-level TestSuite aggregates priority as max() of its direct children, and that aggregate then competes with leaf TestCase weights in the same comparator. This is why the original bug only surfaced in the mixed unit-vs-end-to-end case.
  • No single source of truth for "what reordering applied". The Test Suite Sorted event fires with no payload; users cannot inspect the effective pipeline after configuration is resolved.

Proposal

The goal should be an ordering subsystem where the user expresses an ordered pipeline of stages, each stage is a well-defined transform over a list of tests, and the configuration surface (XML + CLI) round-trips cleanly to that pipeline.

Sketch of the new model

  • A ReorderStage interface with a single method: apply(list<Test> $tests, Context $context): list<Test>.
  • Concrete stages:
    • ByDuration(Direction), BySize(Direction), Reverse, Random
    • DefectsFirst(WeightPolicy) — weight policy lets callers exclude specific statuses or remap weights (this subsumes the current "skipped is not a defect" rule)
    • ResolveDependencies
  • A ReorderPipeline is an ordered list of stages. Applied at each TestSuite level, exactly as today, but the phase order is whatever the user declared instead of the hard-coded three-phase sequence.
  • Context carries the ResultCache, the current TestSuite, and a way for a stage to emit diagnostics (for a future --explain-order feature).

Configuration

  • XML executionOrder and CLI --order-by both accept an ordered, comma-separated list of stage tokens. Precedence is token order, not hard-coded phase order.
  • Contradictory tokens (two main orders, depends and no-depends together, etc.) raise a configuration error instead of silently resolving to last-wins.
  • Unknown tokens raise a configuration error, not silent skip.
  • The existing short tokens (default, depends, no-depends, defects, reverse, random, duration-ascending, size-ascending, etc.) map onto the new stage types. default stays as a reset word but is disambiguated from "no main order".
  • Deprecated tokens (duration, size) keep emitting their existing deprecation events during the transition and are removed in PHPUnit 14 as already announced.

Diagnostics

The "Test Suite Sorted" event gains a payload describing the effective pipeline and, optionally, the per-test weights that were applied. A new --explain-order CLI flag (or extension to --debug) prints the effective pipeline once at startup. This makes the "why did test X run at position Y?" question answerable without reading the result cache JSON.

Backward compatibility

  • Single-token inputs (defects, depends, duration-ascending, ...) behave exactly as today.
  • Multi-token inputs that happen to match the existing three-phase interpretation (depends,defects, depends,defects,duration-ascending, ...) continue to produce the same order, now pinned by the end-to-end tests added in the bug-fix commit.
  • Multi-token inputs that previously relied on the silent last-wins behavior (e.g. duration-ascending,duration-descending) will start raising configuration errors. This is a deliberate, user-visible change; it should go through the usual deprecation cycle and be released in a major version.

Open questions

  • Should DefectsFirst be weight-customizable via XML/CLI at all, or should the "skipped and incomplete are not defects" decision stay as a hard-coded framework policy? Exposing it invites bikeshedding; hiding it loses flexibility.
  • Should the top-level TestSuite comparator aggregate children by max() (current behavior), by sum(), or by a count of leaf defects? The current aggregation is why this bug surfaced only under specific conditions, and the new design is a good opportunity to revisit it.
  • Can the ResolveDependencies stage be generalized so that a test's declared #[Depends] can influence the sort order even when depends is not explicitly enabled? In practice, violating a declared dependency is always wrong, and making it opt-out feels safer than making it opt-in.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions