Summary
Extend the backend cost model (src/blis/backend_selection.rs) with a computational reducibility classifier that estimates whether a workload's computation is reducible (shortcuttable, parallelizable) or irreducible (inherently sequential, no speedup from GPU offload) before dispatching to SIMD vs GPU.
Motivation
Wolfram's recent ruliological analysis of P vs. NP (Jan 2026) formalizes a concept directly applicable to our backend selection: not all computations benefit from a larger/different compute model. His exhaustive enumeration of Turing machines shows:
- Some functions see exponential→constant speedup when moved to a "bigger machine" (analogous to CPU→GPU offload)
- Other functions see zero speedup regardless of machine size — they are computationally irreducible
Currently, BackendCostModel::select_backend() decides SIMD vs GPU based on:
- Arithmetic intensity vs ridge point (roofline model)
- Element count threshold (1M minimum for GPU)
- Hardware availability
This misses a critical dimension: whether the computation itself can exploit parallelism. A matmul is highly reducible (GPU wins). A sequential recurrence like x[n] = f(x[n-1]) is irreducible — shipping it to GPU wastes PCIe transfer time for no parallel gain.
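The contrast can be seen in miniature. This sketch is purely illustrative and independent of the crate's API; it only shows why one shape of computation parallelizes and the other cannot:

```rust
/// Elementwise square: every output element is independent of the others,
/// so the work can be split across SIMD lanes or GPU threads (reducible).
fn elementwise_square(xs: &[f64]) -> Vec<f64> {
    xs.iter().map(|x| x * x).collect()
}

/// Sequential recurrence x[n] = f(x[n-1]): step n cannot start until
/// step n-1 has finished, so extra parallel hardware buys nothing
/// (irreducible). Offloading this to a GPU only adds PCIe transfer cost.
fn recurrence(x0: f64, steps: usize, f: impl Fn(f64) -> f64) -> f64 {
    let mut x = x0;
    for _ in 0..steps {
        x = f(x);
    }
    x
}
```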
Proposed Change
Add a Reducibility enum and classification step before the existing roofline check:
```rust
pub enum Reducibility {
    /// Highly parallel, data-independent — GPU candidate
    Reducible,
    /// Mixed: some parallel, some sequential dependencies
    Partial { parallel_fraction: f64 },
    /// Inherently sequential — SIMD-only, skip GPU consideration
    Irreducible,
}
```

Classification heuristics (not exhaustive enumeration — we're practical, not ruliological):
- Data-parallel ops (elementwise, map, broadcast): `Reducible`
- Reductions with associative operators (sum, max, dot): `Reducible`
- Tiled matmul / convolution: `Reducible`
- Scan / prefix-sum: `Partial { parallel_fraction: 0.5 }` (Blelloch-parallelizable but with overhead)
- Recurrences / sequential dependencies: `Irreducible`
- Sparse irregular access patterns: `Partial` (memory-bound, not compute-bound on GPU)
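The heuristics above could be encoded as a simple mapping. The `OpKind` taxonomy and `classify` function below are hypothetical names, not part of the existing crate, and the sparse `parallel_fraction` is a placeholder value that would come from profiling:

```rust
/// Hypothetical op taxonomy; the real crate would derive this from its
/// own op representation rather than a standalone enum.
enum OpKind {
    Elementwise,
    AssociativeReduction,
    TiledMatmul,
    Convolution,
    Scan,
    Recurrence,
    SparseIrregular,
}

enum Reducibility {
    Reducible,
    Partial { parallel_fraction: f64 },
    Irreducible,
}

/// Map an op kind to a reducibility class per the heuristics above.
fn classify(op: &OpKind) -> Reducibility {
    match op {
        OpKind::Elementwise
        | OpKind::AssociativeReduction
        | OpKind::TiledMatmul
        | OpKind::Convolution => Reducibility::Reducible,
        // Blelloch-parallelizable, but with scan overhead
        OpKind::Scan => Reducibility::Partial { parallel_fraction: 0.5 },
        OpKind::Recurrence => Reducibility::Irreducible,
        // Memory-bound, not compute-bound on GPU; fraction is illustrative
        OpKind::SparseIrregular => Reducibility::Partial { parallel_fraction: 0.3 },
    }
}
```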
Integrate into select_backend():

```rust
pub fn select_backend(&self, m: usize, n: usize, k: usize, reducibility: Reducibility) -> ComputeBackend {
    match reducibility {
        Reducibility::Irreducible => {
            // Skip GPU entirely — no parallel gain, avoid PCIe cost
            self.select_simd_backend()
        }
        Reducibility::Partial { parallel_fraction } => {
            // Adjust effective arithmetic intensity by the parallel fraction
            // (Amdahl's law: speedup bounded by 1 / (1 - parallel_fraction))
            let effective_ai = self.arithmetic_intensity(m, n, k) * parallel_fraction;
            self.select_by_roofline(effective_ai, m * n * k)
        }
        Reducibility::Reducible => {
            // Existing logic: full roofline + size threshold
            self.select_by_roofline(self.arithmetic_intensity(m, n, k), m * n * k)
        }
    }
}
```

Impact
- Avoids GPU dispatch for irreducible workloads (saves PCIe round-trip)
- More accurate cost model for mixed workloads via Amdahl-adjusted AI
- Enables downstream crates (aprender, realizar) to annotate their ops with reducibility hints
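The Amdahl adjustment in the `Partial` arm can be checked in isolation. These helpers are standalone stand-ins for the cost model's internals, not the crate's actual methods:

```rust
/// Amdahl's law: maximum speedup when a fraction p of the work is parallel.
/// With p = 0.5 (the scan heuristic above), speedup is capped at 2x.
fn amdahl_speedup_bound(parallel_fraction: f64) -> f64 {
    1.0 / (1.0 - parallel_fraction)
}

/// Scale raw arithmetic intensity by the parallel fraction, as in the
/// Partial arm of select_backend. A half-parallel workload presents half
/// its raw intensity to the roofline check, biasing it toward SIMD.
fn effective_ai(raw_ai: f64, parallel_fraction: f64) -> f64 {
    raw_ai * parallel_fraction
}
```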
References
- Wolfram, "P vs. NP and the Difficulty of Computation: A Ruliological Approach" (Jan 2026) — empirical evidence that computational irreducibility is ubiquitous and speedup from larger machines is not guaranteed
- Gregg & Hazelwood (2011) — existing 5× PCIe rule already in backend_selection.rs
- Amdahl (1967) — parallel-fraction bound for the `Partial` case
Files
- src/blis/backend_selection.rs — primary integration point
- src/hardware.rs — may need `Reducibility` in the `HardwareCapability` profile
- src/brick/mod.rs — `ComputeBackend` dispatch