[XLA:GPU] Update HLO cublas workspace size after autotuner select the algorithm by shawnwang18 · Pull Request #35132 · openxla/xla

shawnwang18 · 2025-12-11T05:21:53Z

📝 Summary of Changes

This PR introduces a pass that updates the workspace size for cuBLAS/cuBLASLt GEMM operations after autotuning has selected a specific algorithm. The GemmRewriter pass runs before autotuning, so it allocates workspace conservatively. Once autotuning has run, the selected algorithm is known and its actual workspace requirement can be queried, which can reduce memory usage.
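The idea can be sketched in a few lines. This is a minimal illustration in Python, not the C++ pass from this PR: `GemmCall`, `rewrite_workspace`, the `required_bytes` table (standing in for a query to the BLAS library), and the 4 MiB default are all hypothetical. The sketch also assumes the pass only ever shrinks the workspace, which the PR text suggests but does not state.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional

# Hypothetical conservative default chosen before autotuning (illustrative value).
CONSERVATIVE_WORKSPACE_BYTES = 4 * 1024 * 1024


@dataclass(frozen=True)
class GemmCall:
    """Toy stand-in for a GEMM custom call and its workspace buffer size."""
    name: str
    workspace_bytes: int
    selected_algorithm: Optional[int] = None  # filled in by the autotuner


def rewrite_workspace(call: GemmCall, required_bytes: Dict[int, int]) -> GemmCall:
    """Shrink the workspace to what the selected algorithm needs.

    `required_bytes` maps a hypothetical algorithm id to its workspace
    requirement; a real pass would query the BLAS library instead.
    """
    if call.selected_algorithm is None:
        return call  # not autotuned: keep the conservative size
    needed = required_bytes.get(call.selected_algorithm)
    if needed is None or needed >= call.workspace_bytes:
        return call  # unknown requirement, or nothing to save
    return replace(call, workspace_bytes=needed)


# Usage: one autotuned call and one call the autotuner did not touch.
requirements = {3: 0, 7: 1 << 20}
tuned = GemmCall("dot.1", CONSERVATIVE_WORKSPACE_BYTES, selected_algorithm=7)
untuned = GemmCall("dot.2", CONSERVATIVE_WORKSPACE_BYTES)
rewritten = [rewrite_workspace(c, requirements) for c in (tuned, untuned)]
```

Here `dot.1` ends up with a 1 MiB workspace, while `dot.2` keeps the conservative size because no algorithm was selected for it.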

🎯 Justification
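Potentially reduces memory usage: the workspace for a GEMM can shrink from the conservative pre-autotuning size to what the selected algorithm actually requires.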

🚀 Kind of Contribution
⚡️ Performance Improvement

🧪 Unit Tests:
Existing gemm tests should cover the workspace size config.

akuegel

Thanks for adjusting the comment. Can you please also add a test for GemmWorkspaceRewriter?

shawnwang18 · 2025-12-15T14:08:58Z

> Thanks for adjusting the comment. Can you please also add a test for GemmWorkspaceRewriter?

Done. Added a new test: xla/service/gpu/transforms/gemm_workspace_rewriter_test.cc

… select the algorithm

Imported from GitHub PR #35132

📝 Summary of Changes

This PR introduces a pass that updates the workspace size for cuBLAS/cuBLASLt GEMM operations after autotuning has selected a specific algorithm. The GemmRewriter pass conservatively allocates workspace before autotuning. After autotuning, we know the exact algorithm selected and can query its actual workspace requirement, potentially reducing memory usage.

🎯 Justification

Potentially reducing memory usage.

🚀 Kind of Contribution

⚡️ Performance Improvement

🧪 Unit Tests:

Existing gemm tests should cover the workspace size config.

Copybara import of the project:

-- a6ed265 by Shawn Wang <[email protected]>: Update cublas workspace size with the exact size extracted from algorithm
-- d67a48a by Shawn Wang <[email protected]>: fix comments
-- 613e090 by Shawn Wang <[email protected]>: add unittest

Merging this change closes #35132

FUTURE_COPYBARA_INTEGRATE_REVIEW=#35132 from shawnwang18:shawnw/cublas_workspace 613e090
PiperOrigin-RevId: 845083661


The gpu_types.h header is no longer included by any code and can be safely removed. This also allows for the removal of the associated build target. FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#35132 from shawnwang18:shawnw/cublas_workspace 613e0909ff390c5c1962345fc5f36f174e45393f PiperOrigin-RevId: 845604473

They don't need a separate reference to BufferAssignment. It is available via IrEmitterContext. FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#35132 from shawnwang18:shawnw/cublas_workspace 613e0909ff390c5c1962345fc5f36f174e45393f PiperOrigin-RevId: 845605653

… select the algorithm

Imported from GitHub PR openxla/xla#35132

📝 Summary of Changes

This PR introduces a pass that updates the workspace size for cuBLAS/cuBLASLt GEMM operations after autotuning has selected a specific algorithm. The GemmRewriter pass conservatively allocates workspace before autotuning. After autotuning, we know the exact algorithm selected and can query its actual workspace requirement, potentially reducing memory usage.

🎯 Justification

Potentially reducing memory usage.

🚀 Kind of Contribution

⚡️ Performance Improvement

🧪 Unit Tests:

Existing gemm tests should cover the workspace size config.

Copybara import of the project:

-- a6ed2653e758a2a57e9bf2ce994549c5bc3e72d3 by Shawn Wang <[email protected]>: Update cublas workspace size with the exact size extracted from algorithm
-- d67a48ae705069e68c854f403abc7f1c1a07ef47 by Shawn Wang <[email protected]>: fix comments
-- 613e0909ff390c5c1962345fc5f36f174e45393f by Shawn Wang <[email protected]>: add unittest

Merging this change closes #35132

PiperOrigin-RevId: 845601031

The duplicate AsBlasLtEpilogue in gemm_workspace_rewriter.cc, introduced in PR openxla#35132, was missing SILU epilogue support, which broke the ROCm Swish fusion tests. Replace the duplicate with the shared function to fix the issue and prevent future divergence. Fixes CublasLtGemmRewriteTest.MatrixBiasSwishActivation on ROCm.
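The failure mode described here, two copies of a mapping drifting apart, can be illustrated with a small sketch. It is written in Python for brevity and is not the XLA code: the `Epilogue` members, the `BLASLT_*` strings, and the two caller functions are hypothetical stand-ins. The point is only that both call sites go through one shared function, so a newly supported epilogue such as SILU cannot be present in one place and missing in the other.

```python
from enum import Enum


class Epilogue(Enum):
    """Illustrative epilogue kinds; not XLA's actual enum."""
    DEFAULT = "default"
    RELU = "relu"
    GELU = "gelu"
    SILU = "silu"


# Single source of truth for the translation (values are made-up labels).
_BLAS_LT_EPILOGUE = {
    Epilogue.DEFAULT: "BLASLT_DEFAULT",
    Epilogue.RELU: "BLASLT_RELU",
    Epilogue.GELU: "BLASLT_GELU",
    Epilogue.SILU: "BLASLT_SILU",
}


def as_blas_lt_epilogue(epilogue: Epilogue) -> str:
    """Shared translation used by every caller."""
    try:
        return _BLAS_LT_EPILOGUE[epilogue]
    except KeyError:
        raise ValueError(f"unsupported epilogue: {epilogue}") from None


def emit_matmul_epilogue(epilogue: Epilogue) -> str:
    # Hypothetical caller 1: code that emits the matmul.
    return as_blas_lt_epilogue(epilogue)


def workspace_query_epilogue(epilogue: Epilogue) -> str:
    # Hypothetical caller 2: code that queries the workspace size.
    # Reusing the shared function keeps it in sync with caller 1.
    return as_blas_lt_epilogue(epilogue)
```

With a private copy of the table in the second caller, adding `SILU` to only one table would make the two callers disagree, which is the kind of divergence this fix removes.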

Imported from GitHub PR #36963

Replace the duplicate with the shared function to fix the issue and prevent future divergence. The duplicate AsBlasLtEpilogue in gemm_workspace_rewriter.cc was missing SILU epilogue support, breaking ROCm Swish fusion tests. This duplicate was introduced in PR #35132. Fixes CublasLtGemmRewriteTest.MatrixBiasSwishActivation on ROCm.

Copybara import of the project:

-- 14a67c8 by Aleksei Nurmukhametov <[email protected]>: [ROCm] Use shared AsBlasLtEpilogue in GemmWorkspaceRewriter

Replace the duplicate with the shared function to fix the issue and prevent future divergence. The duplicate AsBlasLtEpilogue in gemm_workspace_rewriter.cc was missing SILU epilogue support, breaking ROCm Swish fusion tests. This duplicate was introduced in PR #35132. Fixes CublasLtGemmRewriteTest.MatrixBiasSwishActivation on ROCm.

Merging this change closes #36963

FUTURE_COPYBARA_INTEGRATE_REVIEW=#36963 from ROCm:anurmukh/use-shared-asblasltepilogue-in-gemmworkspacerewriter 14a67c8
PiperOrigin-RevId: 864310855

* [ROCm] Build infrastructure and CI scripts
* Fix infinite recursion in HloInstruction::Accept/Visit const wrappers (#470)
  The const wrapper methods for Accept() and Visit() were calling themselves instead of the template versions, causing infinite recursion and stack overflow.
* Mark nvshmem tests as cuda-only (#458)
* Skipped CanNotEmitTritonCustomCallOnPreAmpereGpu test for ROCM.
* Make device_count_ atomic (#343)
  * Make device_count_ atomic
  * Use relaxed memory order
  * Fix build error
* [ROCm] Enable embeded bitcode libs and inprocess lld (#507)
  Added TF_ROCM_INPROCESS_LLD and TF_ROCM_EMBEDDED_DEVICE_LIB from 0.6.0, otherwise identical to openxla#32439. Env vars only needed for 0.8.0.
* [ROCm] Pass warp size to Triton compilation pipeline
* [ROCm] Add FNUZ FP8 type support in Triton
* [ROCm] Temporary workaround for column reduction warp size
* PR openxla#36046: [ROCm] Fix failing unit tests on ROCm platform
  Imported from GitHub PR openxla#36046
  📝 Summary of Changes
  - layout_assignment tests are marked cuda-only.
  - sample_file_test needs higher autotuner level for MIOpen to return conv algorithm. Earlier this was coming from GetDebugOptionsForTest.
  - buffer_debug_log test is made gpu agnostic by using canonical gpu name.
  - cublas_gemm_rewriter_test_amdgpu_any fix unit test to remove padding for ROCm as introduced in openxla#33854
  - gpu_kernel_tiling_test_amdgpu_any is updated to respect higher launch dimensions now supported by hipruntime
  - Mark dynamic_shared_memory_test as cuda-only
  - Add arch specific checks for barriers to sorting.hlo
  🎯 Justification
  Fixes failing unit tests on ROCm platform
* Fix build break in tfrt_gpu_buffer_test using absl_testing::StatusIs (#534)
* Port transpose changes from v0.8.0 to v0.8.2 (#526)
  It should be dropped after the rebase on top of 330a305
* [ROCm] Fix failing test TritonEmitterTest/RocmWarpSizeIsSetCorrectly (#545)
  * [ROCm] Fix failing test TritonEmitterTest/RocmWarpSizeIsSetCorrectly
    Define valid tile parameters and non-zero shared memory.
  * Update xla/backends/gpu/codegen/triton/fusion_emitter_device_test.cc
    Co-authored-by: Maxime France-Pillois <[email protected]>
  * Update xla/backends/gpu/codegen/triton/fusion_emitter_device_test.cc
    Co-authored-by: Maxime France-Pillois <[email protected]>
  ---------
  Co-authored-by: Maxime France-Pillois <[email protected]>
* Fix MIOpen linking for RNN kernels
  Add explicit linkopts to miopen cc_library target to ensure libMIOpen.so is properly linked at runtime. This fixes AttributeError: module 'jaxlib.gpu_rnn' has no attribute 'compute_rnn_workspace_reserve_space_sizes' in experimental_rnn_test in JAX. Without this change, the _rnn.so shared library fails to load MIOpen symbols properly, causing RNN test failures.
* Force rbe incompatible tests to be executed locally (#485)
* [ROCm] Add missing cuda-only tag
* enable mx datatype for rocm (#462)
  * enable mx datatype for rocm
  * add // TF_ROCM_VERSION >= 70000
  * fix unit test build
* Add rocprofiler-sdk (v3) integration with roctracer fallback
  Integrate rocprofiler-sdk for ROCm profiling with fallback to roctracer (v1) when rocprofiler-sdk is not available.
* [ROCm] Always process convolutions through MIOpen backend for decomposition
  Override AddConvAndGemmAutotuningPass in AMDGPUCompiler to ensure convolutions are always sent to MIOpen for processing, regardless of xla_gpu_autotune_level. This is required because MIOpen handles decomposition of unsupported fused convolutions back to regular convs, which must happen even when autotuning is disabled. Fixes cudnn_fused_conv_rewriter_autotune_disabled_test failures on ROCm.
* Changed error value for SplitK test in fusion_emitter_device_legacy_port_test.cc (#538)
* [ROCm] Add PJRT_Triton_Extension support (#548)
  This change is PJRT_Triton_Extension support for ROCm as counterpart of that for CUDA. Pallas Triton calls are lowered to HSACO directly rather than PTX on ROCm platform.
* Fix expected output in fusion_emitter_int4_device_test for ROCm.
* skip conditional graph tests
* Fixed missing rtne in Triton to pass support_test.
* [ROCm] Add rocm-only tag to triton_rocm target
  Fix dependency validation by tagging triton_rocm as rocm-only since it depends on the rocm-only amdgpu_backend target.
* Avoid upcast of lib func operands to F32 for F16 type.
* Modify fusion_emitter_large_test to work on ROCm. (#568)
  * Modify fusion_emitter_large_test to work on ROCm.
  * Fix fall-through warning in support_legacy.cc
* Fixed dot_algorithms_test. Updated support_legacy and test itself.
* Modified triton_fusion_numerics_verifier_test to work on ROCm.
* [ROCm] Use shared AsBlasLtEpilogue in GemmWorkspaceRewriter
  Replace the duplicate with the shared function to fix the issue and prevent future divergence. The duplicate AsBlasLtEpilogue in gemm_workspace_rewriter.cc was missing SILU epilogue support, breaking ROCm Swish fusion tests. This duplicate was introduced in PR openxla#35132.
* Sync mgpu tests with xla_mgpu config
* [ROCm] Fix RocmWarpSizeIsSetCorrectly test to use new dump file naming
  After commit 4ce9326, Triton pass dumps use the naming pattern {module}.{kernel}.{pass_manager_name}.txt instead of *.triton-passes.log. Update the test to match the new convention.
* Enable hlo_runner_main_gpu for rocm
* enable hipblaslt as a default choice and disable nccl comm split to avoid hanging
* Add flag to control swish activation fusion. (#577)
  Add flag to control swish activation fusion.
* Improve test strategy for swish fusion flag (#585)
  Move tests to a more suitable file.
* Revert "Fix infinite recursion in HloInstruction::Accept/Visit const wrappers (#470)"
  This reverts commit 21a2d57.
* Disable hipblaslt as default choice
* Execute test directly if running on system without GPU (#608)
  * Execute test directly if running on system without GPU
  * Address review comments
  * Address review comments
* Remove non-existent test targets from ROCm CI exclusion list
  The following targets no longer exist in their respective BUILD files and were causing Bazel target pattern parsing failures.
* Bundle librocm_smi64.so for MI200 lit tests
  MI200 lit tests use hlo-opt which links against ROCm libraries. When running on remote workers without ROCm installed, hlo-opt fails with: "error while loading shared libraries: librocm_smi64.so.1"
  The _tools_on_path rule bundles libraries into lit_lib/ by extracting them from CcInfo.linking_context.linker_inputs[].dynamic_library. However, ROCm's cc_library targets with .so files in srcs don't populate dynamic_library (unlike CUDA which uses cc_import).
  Add a new rocm_smi_import target using cc_import, which properly exposes the shared library via CcInfo. Use this target in lit.bzl so librocm_smi64.so.1 gets bundled into lit_lib/ and is available at runtime via hlo-opt's rpath.

---------

Co-authored-by: Pham Binh <[email protected]>
Co-authored-by: Alex <[email protected]>
Co-authored-by: Zoran Jovanovic <[email protected]>
Co-authored-by: Dragan Mladjenovic <[email protected]>
Co-authored-by: Harsha H S <[email protected]>
Co-authored-by: Maxime France-Pillois <[email protected]>
Co-authored-by: magaonka-amd <[email protected]>
Co-authored-by: Xuefei Jiang <[email protected]>
Co-authored-by: cj401-amd <[email protected]>
Co-authored-by: zoranjovanovic-ns <[email protected]>
Co-authored-by: Jian Li <[email protected]>
Co-authored-by: Chao Chen <[email protected]>
Co-authored-by: Alexandros Theodoridis <[email protected]>
Co-authored-by: Milica Makevic <[email protected]>

shawnwang18 changed the title ~~[XLA:GPU] Update cublas workspace size with size of algorithm~~ [XLA:GPU] Update cublas workspace size after autotuner select the algorithm Dec 11, 2025

shawnwang18 changed the title ~~[XLA:GPU] Update cublas workspace size after autotuner select the algorithm~~ [XLA:GPU] Update HLO cublas workspace size after autotuner select the algorithm Dec 11, 2025

shawnwang18 marked this pull request as draft December 11, 2025 08:37

Update cublas workspace size with the exact size extracted from algorithm

a6ed265

shawnwang18 force-pushed the shawnw/cublas_workspace branch from d8a991d to a6ed265 Compare December 12, 2025 05:21

shawnwang18 requested a review from akuegel December 12, 2025 06:02

shawnwang18 marked this pull request as ready for review December 12, 2025 06:03

akuegel reviewed Dec 15, 2025

Comment thread xla/backends/gpu/runtime/gpublas_lt_matmul_thunk.cc

fix comments

d67a48a

akuegel reviewed Dec 15, 2025

add unittest

613e090

shawnwang18 requested a review from akuegel December 15, 2025 22:53

akuegel approved these changes Dec 16, 2025

copybara-service Bot mentioned this pull request Dec 16, 2025

PR #35132: [XLA:GPU] Update HLO cublas workspace size after autotuner select the algorithm #35328

Closed

copybara-service Bot mentioned this pull request Dec 16, 2025

PR #35132: [XLA:GPU] Update HLO cublas workspace size after autotuner select the algorithm tensorflow/tensorflow#106335

Closed

copybara-service Bot mentioned this pull request Dec 16, 2025

PR #35132: [XLA:GPU] Update HLO cublas workspace size after autotuner select the algorithm #35342

Merged

copybara-service Bot mentioned this pull request Dec 16, 2025

PR #35132: [XLA:GPU] Update HLO cublas workspace size after autotuner select the algorithm tensorflow/tensorflow#106358

Merged

copybara-service Bot mentioned this pull request Dec 17, 2025

Automated Code Change tensorflow/tensorflow#106306

Merged

copybara-service Bot mentioned this pull request Dec 17, 2025

Remove unused gpu_types.h header and build target. tensorflow/tensorflow#106499

Merged

copybara-service Bot mentioned this pull request Dec 17, 2025

Automated Code Change tensorflow/tensorflow#106500

Closed

copybara-service Bot mentioned this pull request Dec 17, 2025

Automated Code Change tensorflow/tensorflow#106502

Closed

copybara-service Bot mentioned this pull request Dec 17, 2025

[XLA:GPU] Simplify Copy Fusions. tensorflow/tensorflow#106503

Merged

copybara-service Bot mentioned this pull request Dec 17, 2025

Automated Code Change tensorflow/tensorflow#106239

Merged

copybara-service Bot mentioned this pull request Dec 17, 2025

Automated Code Change tensorflow/tensorflow#106504

Merged

copybara-service Bot mentioned this pull request Dec 17, 2025

Automated Code Change tensorflow/tensorflow#106505

Merged

copybara-service Bot mentioned this pull request Dec 17, 2025

Automated Code Change tensorflow/tensorflow#106506

Closed

copybara-service Bot mentioned this pull request Dec 17, 2025

Automated Code Change tensorflow/tensorflow#106471

Closed

nurmukhametov mentioned this pull request Jan 28, 2026

[ROCm] Use shared AsBlasLtEpilogue in GemmWorkspaceRewriter #36963

Closed

copybara-service Bot mentioned this pull request Feb 2, 2026

PR #36963: [ROCm] Use shared AsBlasLtEpilogue in GemmWorkspaceRewriter #37149

Merged

shawnwang18 wants to merge 3 commits into openxla:main from shawnwang18:shawnw/cublas_workspace


2 participants
