Redesign the test reordering subsystem

## Status quo

Test reordering is implemented by `TestSuiteSorter` and driven by two orthogonal knobs plus one toggle:

- a **main order**: `default`, `reverse`, `random`, `duration-ascending`, `duration-descending`, `size-ascending`, `size-descending`
- a **defects overlay**: on/off
- a **dependency resolver**: on/off

Both the XML `executionOrder` attribute and the CLI `--order-by` option accept a comma-separated list where each token flips one of these three knobs. Under the hood, `TestSuiteSorter::sort()` applies them in a hard-coded three-phase pipeline:

1. main order
2. defects-first overlay (stable, so the main order is preserved inside
   each defect tier)
3. dependency resolution

### Recent fixes

Three real bugs in `TestSuiteSorter` have just been corrected:

1. `cmpDefectPriorityAndTime()` used to fall through to a duration-based tie-break for any pair of equally-weighted items. Combined with PHP's stable `usort()`, this could invert the order produced by the main phase, breaking the implicit "unit suite before end-to-end suite" ordering when both contained equally-weighted defects. The comparator now returns `0` for equal weights and the preceding phase's order is preserved.
2. Skipped and incomplete tests were being incorrectly treated as defects. They are no longer reordered by the "defects first" strategy.
3. `cmpSize` treated every non-`TestCase`, non-`DataProviderTestSuite` input as `unknown`, which meant a plain `TestSuite` wrapping a `TestCase` class was always sorted as `unknown` rather than by the size of the tests it contained. The comparator now recurses into child tests and takes the maximum size weight.

### What is still "awkward"

Even with the bug fixes, several things about the current design feel wrong:

- Phase order is hard-coded. The user cannot express "resolve dependencies first, then apply defects on top of the result", or "defects first, then randomize the non-defective tail", even though both are sensible asks. The comma-separated token list looks like a pipeline but is really just three flags whose precedence is fixed in `sort()`.
- Contradictory tokens are silently accepted. `duration-ascending,duration-descending` resolves to "last wins" without a warning, and `reverse,random` is similarly under-specified. A misspelled XML token is ignored entirely.
- `default` means two different things. As a token it resets every knob to its default; as an enum value it means "no main order". This conflates "I have an opinion about the default" with "I don't".
- The "defects" notion is binary. Every defective status collapses into "hoist to front, ordered by severity". There is no way to say "prioritize only failures and errors", or "put warnings last, before success".
- The result cache drives ordering, but the cache is opaque to the user. There is no way to see the priorities the sorter is about to apply without reading the JSON by hand, and no diagnostic when the cache is stale or absent.
- Suites and test cases are treated the same by the comparator. A top-level `TestSuite` aggregates priority as `max()` of its direct children, and that aggregate then competes with leaf `TestCase` weights in the same comparator. This is why the original bug only surfaced in the mixed unit-vs-end-to-end case.
- No single source of truth for "what reordering applied". The `Test Suite Sorted` event fires with no payload; users cannot inspect the effective pipeline after configuration is resolved.

## Proposal

The goal should be an ordering subsystem where the user expresses an ordered pipeline of stages, each stage is a well-defined transform over a list of tests, and the configuration surface (XML + CLI) round-trips cleanly to that pipeline.

### Sketch of the new model

- A `ReorderStage` interface with a single method: `apply(list<Test> $tests, Context $context): list<Test>`.
- Concrete stages:
  - `ByDuration(Direction)`, `BySize(Direction)`, `Reverse`, `Random`
  - `DefectsFirst(WeightPolicy)` — weight policy lets callers exclude specific statuses or remap weights (this subsumes the current "skipped is not a defect" rule)
  - `ResolveDependencies`
- A `ReorderPipeline` is an ordered list of stages. Applied at each `TestSuite` level, exactly as today, but the phase order is whatever the user declared instead of the hard-coded three-phase sequence.
- `Context` carries the `ResultCache`, the current `TestSuite`, and a way for a stage to emit diagnostics (for a future `--explain-order` feature).

### Configuration

- XML `executionOrder` and CLI `--order-by` both accept an ordered, comma-separated list of stage tokens. Precedence is token order, not hard-coded phase order.
- Contradictory tokens (two main orders, `depends` and `no-depends` together, etc.) raise a configuration error instead of silently resolving to last-wins.
- Unknown tokens raise a configuration error, not silent skip.
- The existing short tokens (`default`, `depends`, `no-depends`, `defects`, `reverse`, `random`, `duration-ascending`, `size-ascending`, etc.) map onto the new stage types. `default` stays as a reset word but is disambiguated from "no main order".
- Deprecated tokens (`duration`, `size`) keep emitting their existing deprecation events during the transition and are removed in PHPUnit 14 as already announced.

### Diagnostics

The "Test Suite Sorted" event gains a payload describing the effective pipeline and, optionally, the per-test weights that were applied. A new `--explain-order` CLI flag (or extension to `--debug`) prints the effective pipeline once at startup. This makes the "why did test X run at position Y?" question answerable without reading the result cache JSON.

### Backward compatibility

- Single-token inputs (`defects`, `depends`, `duration-ascending`, ...) behave exactly as today.
- Multi-token inputs that happen to match the existing three-phase interpretation (`depends,defects`, `depends,defects,duration-ascending`, ...) continue to produce the same order, now pinned by the end-to-end tests added in the bug-fix commit.
- Multi-token inputs that previously relied on the silent last-wins behavior (e.g. `duration-ascending,duration-descending`) will start raising configuration errors. This is a deliberate, user-visible change; it should go through the usual deprecation cycle and be released in a major version.

## Open questions

- Should `DefectsFirst` be weight-customizable via XML/CLI at all, or should the "skipped and incomplete are not defects" decision stay as a hard-coded framework policy? Exposing it invites bikeshedding; hiding it loses flexibility.
- Should the top-level `TestSuite` comparator aggregate children by `max()` (current behavior), by `sum()`, or by a count of leaf defects? The current aggregation is why this [bug](https://github.com/sebastianbergmann/phpunit/pull/6270#issuecomment-3174755852) surfaced only under specific conditions, and the new design is a good opportunity to revisit it.
- Can the `ResolveDependencies` stage be generalized so that a test's declared `#[Depends]` can influence the sort order even when `depends` is not explicitly enabled? In practice, violating a declared dependency is always wrong, and making it opt-out feels safer than making it opt-in.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Redesign the test reordering subsystem #6585

Status quo

Recent fixes

What is still "awkward"

Proposal

Sketch of the new model

Configuration

Diagnostics

Backward compatibility

Open questions

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Redesign the test reordering subsystem #6585

Description

Status quo

Recent fixes

What is still "awkward"

Proposal

Sketch of the new model

Configuration

Diagnostics

Backward compatibility

Open questions

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions