LLAR Testing System Research Summary
1. Starting Point
This research did not begin as a generic question about “how to write tests.” It began with a more specific problem:
- LLAR faces a large build matrix derived from `default options + require`.
- That matrix cannot be fully executed in practice.
- Random sampling is not acceptable as a release gate.
- Configuration failures should not be discovered only after packages reach end users.
The research therefore focused on four questions:
- How do mainstream package ecosystems handle large configuration matrices?
- Do they rely on exhaustive testing, incremental testing, or layered validation?
- Are pairwise or other combinatorial reduction strategies used as primary release gates?
- Under LLAR’s constraints of being black-box, multi-language, and cloud-native, what is a realistic design direction?
2. Phase One: Industry Practice
2.1 Nix / NixOS / Hydra / ofborg
The Nix ecosystem was studied first, with attention to the full workflow rather than isolated commands:
- PR workflows rely heavily on ofborg for incremental evaluation and limited builds.
- Mainline and release infrastructure rely on Hydra for evaluation, scheduling, building, and binary publication.
- Test scope is not “all combinations”; it is constrained through mechanisms such as `supportedSystems`, `hydraPlatforms`, and related platform gates.
- NixOS testing is fundamentally a platformized design with a clear separation between control plane and execution plane.
Examples:
- In Nixpkgs pull requests, maintainers can trigger `@ofborg eval`, `@ofborg build ...`, or `@ofborg test ...`. ofborg then runs the corresponding `nix-instantiate`/`nix-build` style workflows, but only on approved machines and targets.
- A package can declare `meta.platforms = lib.platforms.linux`, or even `meta.hydraPlatforms = [ ]`. This means a package may logically support Linux while Hydra does not promise official binary production for it.
Core takeaway:
- Nix does not attempt to run every combination.
- It controls the problem by shrinking the matrix it officially promises to build and validate.
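This shrinking can be illustrated with a small sketch. The package metadata below is hypothetical, and the model is written in Python rather than Nix, but it captures the mechanism: a Hydra-style jobset contains only the (package, platform) pairs the metadata promises, so `hydraPlatforms = [ ]` removes a package from the official build matrix entirely.

```python
# Toy model of Hydra-style matrix shrinking (hypothetical packages).
all_platforms = ["x86_64-linux", "aarch64-linux", "x86_64-darwin"]

packages = {
    "foo": {"platforms": all_platforms, "hydraPlatforms": None},        # default: build everywhere supported
    "bar": {"platforms": ["x86_64-linux"], "hydraPlatforms": None},     # supports Linux only
    "baz": {"platforms": all_platforms, "hydraPlatforms": []},          # logically portable, but no Hydra promise
}

def jobset(packages):
    """Expand package metadata into the (package, platform) jobs
    the infrastructure actually promises to build."""
    jobs = []
    for name, meta in packages.items():
        promised = meta["hydraPlatforms"]
        if promised is None:                 # no override: fall back to meta.platforms
            promised = meta["platforms"]
        jobs.extend((name, p) for p in promised)
    return jobs
```

The point of the sketch is that the matrix is reduced declaratively, before any build is scheduled, rather than by sampling at build time.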
2.2 Conan / ConanCenter
Conan was then examined to answer two concrete questions:
- How are complex binary matrices handled in practice?
- What kind of test contract does Conan actually promote?
Findings:
- ConanCenter uses fixed profile sets plus `package_id` deduplication to build binaries; it does not exhaustively enumerate all option combinations.
- ConanCenter explicitly focuses on binary creation and consumer-level validation rather than running upstream test suites in full.
- For large product/configuration spaces, Conan reduces rebuild cost through `build-order`, `build-order-merge`, and lockfile consistency.
Examples:
- The Conan documentation uses `libpng` as an example for `conan graph build-order --requires=libpng/1.5.30 --order-by=recipe`. The result places `zlib` before `libpng`; if `zlib` already exists in the cache, Conan does not rebuild it.
- In the header-only tutorial, `sum/0.1` uses `package_id()` with `self.info.clear()` so that dimensions such as `Debug` or `C++17` do not affect binary identity. As a result, only one package ID is produced.
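The deduplication mechanism can be modeled in a few lines. The sketch below is not Conan's actual implementation; it is a toy Python model in which a package ID is a hash over only those settings the recipe declares as identity-relevant, so clearing them (as `self.info.clear()` does for header-only packages) collapses every configuration onto one ID.

```python
import hashlib

def package_id(settings: dict, cleared: bool) -> str:
    """Toy model of package-ID computation: hash the settings that
    are declared to affect binary identity. `cleared=True` models a
    header-only recipe that calls self.info.clear()."""
    relevant = {} if cleared else settings
    blob = repr(sorted(relevant.items())).encode()
    return hashlib.sha1(blob).hexdigest()[:12]

debug = {"build_type": "Debug", "cppstd": "17"}
release = {"build_type": "Release", "cppstd": "14"}
```

In this model, `debug` and `release` produce two distinct IDs (two binaries) for a normal package, but a single ID when the identity info is cleared, which is exactly the matrix collapse the tutorial describes.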
Core takeaway:
- Conan does not test all combinations.
- Instead, the recipe explicitly defines which dimensions affect binary identity, and the system builds and caches accordingly.
2.3 Bazel
Bazel was included because it offers a strong engineering perspective on reproducibility and cacheability:
- Bazel stores action results under `/ac/<action-hash>` and output files under `/cas/<sha256>`.
- If another machine computes the same action hash, it can reuse the result directly instead of recompiling.
- However, Bazel explicitly warns that if an action depends on undeclared external tools such as `/usr/bin/gcc`, cache hits may become incorrect.
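The AC/CAS split can be sketched as follows. This is a simplified Python model of the caching scheme, not Bazel code: the action cache maps an action hash to an output digest, and the content-addressable store maps digests to bytes, so a second machine that computes the same action hash never re-runs the action.

```python
import hashlib
import json

action_cache = {}  # models /ac/<action-hash> -> output digest
cas = {}           # models /cas/<sha256>     -> output bytes

def action_hash(cmd, input_digests, env):
    # The key covers the command line, the digests of all *declared*
    # inputs, and the environment. Anything not hashed here (e.g. the
    # version of an undeclared /usr/bin/gcc) can silently poison hits.
    payload = json.dumps(
        {"cmd": cmd, "inputs": sorted(input_digests), "env": env},
        sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_or_reuse(cmd, input_digests, env, execute):
    """Return the action's output, reusing the cache when the key matches."""
    key = action_hash(cmd, input_digests, env)
    if key in action_cache:
        return cas[action_cache[key]]      # cache hit: execute() never runs
    output = execute()
    digest = hashlib.sha256(output).hexdigest()
    cas[digest] = output
    action_cache[key] = digest
    return output
```

The hermeticity warning follows directly from the model: correctness of a hit depends entirely on the key covering every real input of the action.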
Core takeaway:
- Reuse is not based on “this looks similar.”
- Reuse depends on strictly defined inputs and sufficiently hermetic / reproducible execution.
2.5 Debian / autopkgtest / debci / britney
For Debian, the relevant testing-system line is not reproducible builds, but the interaction among autopkgtest, debci, and britney.
The main findings are:
- Debian packages can declare `autopkgtest` tests.
- These tests are executed on Debian CI infrastructure, with `debci` serving as the execution framework.
- `britney` triggers relevant tests during the migration process from unstable to testing, and uses the results to influence migration decisions.
- However, Debian does not perform a full-matrix or full-archive retest of all affected packages. In scenarios such as library transitions, it runs the tests triggered by the library package itself, rather than rerunning autopkgtests for every package that could be rebuilt against the new library version.
More concrete examples include:
- The Debian Wiki explicitly states that maintainers can add autopkgtests to packages, and that these tests are run on `ci.debian.net`.
- The same documentation explains that `britney` calls the `debci` API to test migration candidates and uses the results when deciding whether a package can migrate from unstable to testing.
- The documentation also makes clear that, in a library transition, Debian does not rerun autopkgtests for every package that could be rebuilt against the new library version; it only runs the tests that are triggered by that library package.
The core takeaway is:
- Debian’s strategy is not full-matrix testing, but rather “package-declared tests + CI execution + migration-triggered relevant testing.”
- It treats test results as an input to release migration decisions, rather than assuming that every possible configuration deserves equal testing priority.
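The decision structure can be sketched in a few lines. The function name and result encoding below are hypothetical, and real britney policy is far richer, but the shape is the point: only the tests triggered for the migrating package feed the verdict, not a full-archive rerun.

```python
# Hypothetical sketch of migration gating: tests *triggered by* the
# migrating source package gate its migration; untriggered tests in
# the archive are simply not consulted.
def migration_verdict(candidate, triggered, results):
    """triggered: test names the CI ran for this candidate;
    results: {(source, test): "pass" | "fail" | "neutral"}.
    A missing result is treated conservatively as a failure."""
    outcomes = [results.get((candidate, t), "fail") for t in triggered]
    return "blocked" if "fail" in outcomes else "candidate"

results = {
    ("libfoo", "self-test"): "pass",
    ("app-using-libfoo", "smoke"): "pass",   # present in the archive, but not triggered
}
```

In this model, `libfoo` migrates on the strength of its own triggered tests; `app-using-libfoo`'s suite exists but never enters the decision, mirroring Debian's refusal to treat every possible configuration as equally test-worthy.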
3. Phase Two: Cross-Ecosystem Synthesis
After studying Nix and Conan in depth, a stable external pattern emerged:
- No mainstream package manager was found to use pairwise testing as its primary release gate.
- No mainstream ecosystem was found to exhaustively test all option combinations.
- Common engineering patterns are:
- fixed platform matrices,
- incremental triggering,
- key product / key test subsets,
- separation between build and test,
- binary deduplication,
- lockfile-based dependency consistency.
This clarified two points:
- The absence of pairwise as a release gate is not accidental. It lacks the determinism required for release correctness.
- The absence of exhaustive execution is not laziness. Large matrices are genuinely infeasible to cover in full.
Compressed into one sentence:
- Nix / Conan: first reduce the matrix you are willing to promise.
- Bazel / Debian: then make build results strict, reproducible, and comparable.
- GitHub / Homebrew: then add provenance and attestation to prove the artifact came from a trusted build flow.
This also exposed an important boundary:
- Industry provides mature techniques for scope reduction, cache reuse, and provenance.
- But there is no ready-made solution that satisfies all of LLAR’s constraints at once: black-box, multi-language, no random sampling, and deterministic treatment of untested combinations.
4. Phase Three: Research Directions Beyond Industry Practice
In addition to production systems, several common research directions were reviewed in order to determine what is useful and what is incompatible with LLAR’s constraints.
4.1 Pairwise / Covering Arrays
The standard argument is:
- Many faults are triggered by interactions among only a small number of parameters.
- Therefore, pairwise or higher-order `t`-wise testing can cover a large interaction space with relatively few test cases.
Examples:
- NIST materials often use examples such as `if (A && B)` to illustrate that many faults arise from low-order interactions.
- NIST also explicitly notes that real faults may require interaction strength up to 6, which means pairwise is not always sufficient.
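To make the cost reduction concrete, here is a minimal greedy pairwise-suite generator. It is a sketch, not a production covering-array tool: it enumerates all full configurations as candidates (exponential, so only viable for small matrices) and repeatedly picks the one covering the most still-uncovered value pairs.

```python
from itertools import combinations, product

def pairwise_suite(params):
    """Greedy t=2 covering-suite construction for a dict of
    {parameter: [values]}. Returns a list of full configurations
    that together cover every pair of parameter values."""
    names = sorted(params)
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in params[a]
        for vb in params[b]
    }
    candidates = [dict(zip(names, vs))
                  for vs in product(*(params[n] for n in names))]
    suite = []
    while uncovered:
        # Pick the configuration that covers the most uncovered pairs.
        best = max(candidates, key=lambda c: sum(
            1 for (pa, pb) in uncovered
            if c[pa[0]] == pa[1] and c[pb[0]] == pb[1]))
        suite.append(best)
        uncovered = {(pa, pb) for (pa, pb) in uncovered
                     if not (best[pa[0]] == pa[1] and best[pb[0]] == pb[1])}
    return suite

opts = {"A": [0, 1], "B": [0, 1], "C": [0, 1], "D": [0, 1]}
```

For four binary options the exhaustive matrix has 16 configurations, while the greedy suite covers all 24 value pairs with far fewer, illustrating the cost argument but also the limitation: a pair being covered says nothing deterministic about the untested full combinations.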
Conclusion:
- This is useful for cost-effective bug discovery.
- It is not suitable for issuing deterministic release approval for untested combinations.
4.2 Family-Based / Product-Line Model Checking
The core idea is:
- Build a formal feature model and behavior model.
- Verify the whole product family using SAT / IC3 / IMC style techniques instead of testing configurations one by one.
Example:
- Research papers encode an entire feature family into a single SMV model and perform product-line verification over the whole family.
Conclusion:
- The method is powerful.
- But it requires formal feature models and behavioral models.
- That is not realistic for LLAR, which only orchestrates shell builds and does not understand language semantics.
4.3 Variational Execution
The core idea is:
- Share common execution paths among multiple configurations within a single run.
Example:
- The OOPSLA 2018 variational execution work rewrites JVM bytecode and reports 2x to 46x speedups on several highly configurable systems.
Conclusion:
- This requires deep integration with a specific language runtime or execution model.
- It is not language-agnostic and therefore does not fit LLAR’s role.
5. Phase Four: Returning to LLAR’s Real Constraints
A major turning point in the research was shifting the discussion from “what the outside world does” back to LLAR’s actual constraints:
- LLAR is a multi-language package manager; it is not appropriate to elevate a language-specific ABI model such as DWARF-based C/C++ analysis into platform infrastructure.
- LLAR is a black-box orchestrator; it should not require formula authors to understand sophisticated internal analysis models.
- LLAR is cloud-native; it needs a CI-time method that can produce a reusable validation model without relying on local build workflows.
At this point it became clear that “incremental triggering” alone is insufficient, because it does not address hidden option coupling inside a large configuration space, such as structure-layout interactions in C.
LLAR therefore needs a matrix-reduction strategy that remains black-box and language-agnostic.
6. Phase Five: Exploring LLAR’s Own Matrix-Reduction Model
This phase began with ideas such as physical footprints, orthogonality analysis, and ABI safety nets, and gradually converged toward a more robust black-box model.
6.1 Initial Direction: File Footprints and Orthogonality Inference (Rejected)
The initial idea was intuitive:
- Run a single-variable probe for each option.
- Observe which paths it reads and writes.
- If two options touch disjoint source files, treat them as orthogonal and place them into separate “independent islands.”
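The island construction can be sketched with a union-find pass over probe footprints. The probe data below is hypothetical; the sketch merges any two options whose touched-path sets overlap and leaves the rest in separate islands.

```python
# Sketch of the footprint-orthogonality heuristic: options whose
# single-option probes touch overlapping paths go in one island
# (and must be co-tested); disjoint options land in separate islands.
def islands(footprints):
    """footprints: {option: set of paths its probe read or wrote}."""
    opts = list(footprints)
    parent = {o: o for o in opts}
    def find(o):                       # union-find with path halving
        while parent[o] != o:
            parent[o] = parent[parent[o]]
            o = parent[o]
        return o
    for i, a in enumerate(opts):
        for b in opts[i + 1:]:
            if footprints[a] & footprints[b]:   # shared footprint
                parent[find(a)] = find(b)
    groups = {}
    for o in opts:
        groups.setdefault(find(o), set()).add(o)
    return list(groups.values())

fp = {"A": {"src/s.c"}, "B": {"src/s.c"}, "C": {"util/io.c"}}
```

Here `A` and `B` collide on `src/s.c` and form one island while `C` is isolated, which is precisely the inference the next paragraph shows to be unsound.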
Fatal flaw: the `struct S` padding trap.
This file-level view cannot reliably prevent hidden ABI breakage. A typical counterexample is structure-layout coupling: options A and B may each add different fields under different compile-time branches of the same struct S. Considered independently, both deltas may look harmless. But under the combined configuration A+B, C padding and alignment rules may cause a non-linear change in structure size and field offsets. Treating them as orthogonal without running the combined test would risk severe runtime failure.
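The non-linearity is easy to demonstrate with `ctypes`, which lays structures out according to the platform C ABI. The field names below are invented for illustration: option A adds a `char` field to `struct S`, option B adds a `double`, and the size deltas fail to add up.

```python
import ctypes

class Base(ctypes.Structure):        # struct S, no options
    _fields_ = [("x", ctypes.c_int)]

class WithA(ctypes.Structure):       # struct S under option A
    _fields_ = [("x", ctypes.c_int), ("a_flag", ctypes.c_char)]

class WithB(ctypes.Structure):       # struct S under option B
    _fields_ = [("x", ctypes.c_int), ("b_val", ctypes.c_double)]

class WithAB(ctypes.Structure):      # struct S under A+B
    _fields_ = [("x", ctypes.c_int), ("a_flag", ctypes.c_char),
                ("b_val", ctypes.c_double)]

delta_a = ctypes.sizeof(WithA) - ctypes.sizeof(Base)
delta_b = ctypes.sizeof(WithB) - ctypes.sizeof(Base)
delta_ab = ctypes.sizeof(WithAB) - ctypes.sizeof(Base)
# On a typical LP64 ABI: delta_a == 4, delta_b == 12, yet delta_ab == 12.
# Padding and alignment make the combined delta non-additive, so no
# file-level view of A and B in isolation can predict the A+B layout.
```

Because the deltas are non-additive, any model that certifies A+B from the A-only and B-only probes is unsound for ABI purposes, which is why this direction was rejected.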
6.2 Revised Direction: ABI-Level Semantic Analysis (Rejected)
To address the structure-layout problem, the discussion temporarily moved toward tools such as abidiff, which compare binary ABI structure through DWARF information.
Engineering blocker:
- This approach is precise, but tightly bound to the C/C++ ecosystem.
- LLAR’s role is a language-agnostic black-box orchestrator.
- Such a solution does not generalize cleanly to Python extensions, Go modules, or pure binary packages.
6.3 Converged Direction: Action Graph Collision
Under the dual requirement of staying black-box while still reducing the matrix conservatively, the discussion shifted from code semantics to the physical behavior of the build pipeline.
The converged idea is:
- Decompose the build into command-level actions such as compile, link, package, or install steps.
- For each action, observe only black-box facts: command arguments, working directory, input paths, and changed paths.
- Compare baseline and single-option probe runs to determine which actions and downstream paths are