Update rocprofiler-sdk (v3) along with roctracer (v1) for rocm-jaxlib-v0.6.0 #302
i-chaochen
left a comment
Could you add instructions for compiling it with roctracer (v1) or rocprofiler-sdk (v3)?
Force-pushed from fa59b34 to 069b25f.

    #ifdef HIP_R_8F_E5M2
      return layout.type() == HIP_R_8F_E5M2_FNUZ ||
             layout.type() == HIP_R_8F_E4M3_FNUZ ||
             layout.type() == HIP_R_8F_E5M2 ||
             layout.type() == HIP_R_8F_E4M3;
    #else
      return false;
    #endif

I'm not sure I understand this part. How are these hipBLASLt changes related to your profiler work?

HIP_R_8F_E5M2 does not seem to be available on ROCm 6.2, which I found while testing roctracer (v1), so I tried to guard it here. Alternatively, I can submit a separate PR for this.

I think a separate PR might be a good idea. What do you think, @i-chaochen?

I guess it's because hipBLASLt doesn't support FP8 on ROCm 6.2? Yes, please open a separate PR, just for our local branch, that guards this with TF_ROCM_VERSION <= 60200.
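
The version guard suggested in this thread can be sketched as follows. This is a minimal, self-contained illustration and not the hipBLASLt code: `ScalarType` and `IsFp8Type` are hypothetical stand-ins, `TF_ROCM_VERSION` is given a local default only so the sketch compiles on its own, and the 60200 threshold is taken from the comment above.

```cpp
#include <cassert>

// Assumption: in a real build TF_ROCM_VERSION is provided by the build system.
// It is defaulted here only so that the sketch compiles standalone.
#ifndef TF_ROCM_VERSION
#define TF_ROCM_VERSION 60200
#endif

// Hypothetical stand-in for the data type values that appear in the diff.
enum class ScalarType { kF16, kF8E5M2Fnuz, kF8E4M3Fnuz, kF8E5M2, kF8E4M3 };

// Reports whether `type` is an FP8 type that this build may use.
// For ROCm <= 6.2 the FP8 comparison is compiled out, so the build never
// references FP8 constants that the toolkit might not define.
inline bool IsFp8Type(ScalarType type) {
#if TF_ROCM_VERSION <= 60200
  (void)type;  // parameter intentionally unused in this configuration
  return false;
#else
  return type == ScalarType::kF8E5M2Fnuz || type == ScalarType::kF8E4M3Fnuz ||
         type == ScalarType::kF8E5M2 || type == ScalarType::kF8E4M3;
#endif
}
```

Guarding on the version macro rather than on the constant itself keeps the condition explicit about which ROCm releases are affected.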
Pull Request Overview
This PR integrates rocprofiler-sdk (v3) along with maintaining roctracer (v1) support for rocm-jaxlib-v0.6.0. The implementation uses compile-time version guards to select between the two profiling systems based on ROCm version, with v3 being used for ROCm >= 6.3.
Key changes:
- Add rocprofiler-sdk (v3) support with improved profiling capabilities including time/step-based profiling
- Maintain backward compatibility with roctracer (v1) for ROCm versions < 6.3
- Implement version-specific compilation guards and API mappings
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| xla/stream_executor/rocm/roctracer_wrapper.h | Adds version guards and rocprofiler-sdk headers/API mappings for v3 |
| xla/stream_executor/rocm/hip_blas_lt.cc | Adds compile guards for FP8 data type constants to prevent undefined symbols |
| xla/backends/profiler/gpu/rocm_tracer_test.cc | New unit tests for rocm_tracer v3 functionality |
| xla/backends/profiler/gpu/rocm_tracer.h | Adds v3 RocmTracer class definition with rocprofiler-sdk integration |
| xla/backends/profiler/gpu/rocm_tracer.cc | Implements v3 tracer with rocprofiler-sdk callbacks and event handling |
| xla/backends/profiler/gpu/rocm_collector_test.cc | New unit tests for rocm_collector v3 functionality |
| xla/backends/profiler/gpu/rocm_collector.h | Adds v3 collector interfaces and data structures |
| xla/backends/profiler/gpu/rocm_collector.cc | Implements v3 collector with event processing and export logic |
| xla/backends/profiler/gpu/device_tracer_rocm.cc | Updates device tracer to use appropriate version-specific APIs |
| xla/backends/profiler/gpu/BUILD | Adds rocprofiler-sdk linkage and new test targets |
Comments suppressed due to low confidence (4)
xla/backends/profiler/gpu/rocm_tracer.cc:277
- Inconsistent use of optional type: the header file uses 'absl/types/optional.h' but the implementation uses 'std::optional'. This should be consistent throughout.
return "Invalid";
xla/backends/profiler/gpu/rocm_tracer.cc:163
- The stream_id type has been changed from int64_t to uint64_t in the header but this change should be consistently applied throughout the codebase to avoid potential issues.
oss << ", sizeBytes=" << data->args.hipMemcpyDtoD.sizeBytes;
xla/backends/profiler/gpu/rocm_collector.cc:1
- [nitpick] Empty line at the beginning of the file should be removed for consistency with code style.
/* Copyright 2024 The OpenXLA Authors. All Rights Reserved.

    void __attribute__((constructor)) init_rocm_lib() {
      rocprofiler_force_configure(xla::profiler::rocprofiler_configure);
    }

Using compiler-specific attributes like `__attribute__((constructor))` reduces portability. Consider using a more portable initialization mechanism.

Suggested change:

    namespace {
    struct RocmLibInitializer {
      RocmLibInitializer() {
        rocprofiler_force_configure(xla::profiler::rocprofiler_configure);
      }
    };
    static RocmLibInitializer rocm_lib_initializer;
    }  // namespace

Why do you need rocprofiler_force_configure?

We need rocprofiler_force_configure to initialize rocprofiler-sdk's hooks into the HIP runtime before hipInit. It is called automatically when xla_rocm_plugin.so is loaded. I believe the rocprofiler-sdk team is working on a fix, after which we can update our code.

Can you explain a bit further? From a look at the source, rocprofiler_force_configure cannot work if hipInit has already been called. So at the point where rocprofiler_force_configure is called, hipInit has not been called yet, and I see no reason for rocprofiler_force_configure to be called.
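
The ordering argument in this thread can be illustrated with a small, self-contained sketch. `FakeForceConfigure` and `FakeHipInit` are hypothetical stand-ins, not rocprofiler-sdk or HIP calls; they only record the order in which they run. The static-object initializer follows the shape of the suggested change, and the sketch says nothing about whether the real call is needed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Call log. A function-local static avoids static-initialization-order issues.
std::vector<std::string>& CallLog() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical stand-in for rocprofiler_force_configure(...).
void FakeForceConfigure() { CallLog().push_back("configure"); }

// Hypothetical stand-in for the first initialization of the HIP runtime.
void FakeHipInit() {
  // The profiler hooks are expected to be registered before the runtime starts.
  assert(!CallLog().empty() && CallLog().front() == "configure");
  CallLog().push_back("hipInit");
}

namespace {
// Constructed when the enclosing binary or shared object is loaded,
// which happens before main() or before dlopen() returns.
struct LibInitializer {
  LibInitializer() { FakeForceConfigure(); }
};
LibInitializer lib_initializer;
}  // namespace
```

Calling `FakeHipInit()` from `main()` then leaves the log as `configure`, `hipInit`, which is the order the author wants to guarantee for workloads that initialize HIP early.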
@cj401-amd I'm confused by this PR. What is the difference between this and #251? Why did you create a separate one?
mrodden
left a comment
Wasn't able to finish my review fully yet but here's what I have so far.

    linkopts = select({
        "//conditions:default": [
            "-L/opt/rocm/lib",    # search path for all ROCm shared objects
            "-lrocprofiler-sdk",  # the library that owns the missing symbols
        ],
    }),

You shouldn't need this anymore if you are using the macros in roctracer_wrapper.h, which dlopen the library at runtime. If you remove this and it doesn't work, then we probably have another issue that needs to be fixed.

And you should not link against rocprofiler-sdk if you don't use it, say on ROCm 6.2. If we end up with a compile-time switch between the tracers, I suggest modeling this as a proper library and adding it to the dependencies.

    deps = if_rocm([
        ":rocm_tracer",
        ":rocm_collector",
    ]) + [

Kind of weird that this has to be under an `if`. I would think it wouldn't matter, since it shouldn't be invoked anyway due to the rocm-only tag...

This is just to align with the upstream syntax.

        options.device_type() != ProfileOptions::UNSPECIFIED)
      return nullptr;

    #if TF_ROCM_VERSION < 60300

I still think we need to do this switch at runtime, not at compile time. It should be possible, since both symbol sets from roctracer and rocprofiler are available through roctracer_wrapper.h, so you just have to switch on an environment variable or something similar.

It would be great if you could open another PR to make that feasible on the tested workloads.

@mrodden But we only have a single ROCm version where both rocprofiler-sdk and roctracer coexist?
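
A runtime switch of the kind proposed in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: the environment variable name `XLA_ROCM_PROFILER`, the accepted value `v1`, and the choice to fall back to v3 are all hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

enum class ProfilerBackend { kRoctracerV1, kRocprofilerSdkV3 };

// Maps a flag value to a backend. An unset or unrecognized value falls back
// to v3 (assumed default for this sketch).
ProfilerBackend ParseBackend(const char* value) {
  if (value != nullptr && std::string(value) == "v1") {
    return ProfilerBackend::kRoctracerV1;
  }
  return ProfilerBackend::kRocprofilerSdkV3;
}

// Reads the choice once from the environment.
// XLA_ROCM_PROFILER is a hypothetical variable name.
ProfilerBackend SelectBackendFromEnv() {
  return ParseBackend(std::getenv("XLA_ROCM_PROFILER"));
}
```

Keeping the parsing separate from the `getenv` call makes the selection logic testable without touching the process environment. A runtime switch only helps on installations where both libraries can be loaded, which is the concern raised in the last reply.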

    #include "absl/strings/str_cat.h"
    #include "absl/strings/str_format.h"
    #include "absl/strings/str_join.h"
    #include "xla/stream_executor/rocm/roctracer_wrapper.h"

pemeliya
left a comment
As I see it, the rocm_tracer/collector implementations for rocprofiler v1 and v3 are two orthogonal implementations with very little in common. If we really insist on keeping rocprofiler v1, could we split rocm_tracer.h/.cc into rocm_tracer_v1.h/.cc and rocm_tracer_v3.h/.cc, and maybe do the same for rocm_collector? This would improve code maintainability and get rid of the major #if / #else clauses by letting Bazel do the conditional compilation:
e.g. if ROCM_VERSION < 60300 -> compile rocm_tracer/collector_v1
else -> compile rocm_tracer/collector_v3

    LOG(INFO) << "agent id = " << agent.id.handle
              << ", dev = " << agent.device_id
              << ", name = " << (agent.name ? agent.name : "null");
    agents_[agent.id.handle] = agent;

You should include only HIP-visible GPUs: https://github.com/ROCm/rocprofiler-sdk/blob/3a924a0d36a3b4be70ce3ef6e81f7e302e30258e/source/include/rocprofiler-sdk/agent.h#L211C43-L211C61 Otherwise your numbering scheme ignores HIP_VISIBLE_DEVICES.

              << ", dev = " << agent.device_id
              << ", name = " << (agent.name ? agent.name : "null");
    agents_[agent.id.handle] = agent;
    if (agent.type == ROCPROFILER_AGENT_TYPE_GPU) {

What is the reason we include both CPU and GPU agents in the map?

rocprofiler-sdk retrieves all agents on the system, including both CPUs and GPUs. The GPUs are then filtered.

Where are the GPUs filtered?

> Where are the GPUs filtered?

First, CPU agents are filtered out and only GPU agents are kept: https://github.com/ROCm/xla/blob/ci_cj-rocprofv3-v1-rocm-jaxlib-v0.6.0/xla/backends/profiler/gpu/rocm_profiler_sdk.cc#L406
Then events are written out based on GPU IDs: https://github.com/ROCm/xla/blob/ci_cj-rocprofv3-v1-rocm-jaxlib-v0.6.0/xla/backends/profiler/gpu/rocm_collector.cc#L536
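
The filtering discussed in this thread can be sketched in isolation as below. The `Agent` struct, its `hip_visible` flag, and `BuildGpuOrdinals` are hypothetical simplifications for illustration; they are not the rocprofiler-sdk agent type or the PR's code, and the real field that reports HIP visibility should be taken from the agent.h header linked above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

enum class AgentType { kCpu, kGpu };

// Hypothetical, simplified stand-in for an agent record.
struct Agent {
  uint64_t handle;   // opaque agent id
  AgentType type;
  bool hip_visible;  // assumption: whether the HIP runtime exposes this device
};

// Keeps only HIP-visible GPU agents and numbers them in enumeration order,
// so CPU agents and hidden GPUs never receive a device ordinal.
std::map<uint64_t, uint32_t> BuildGpuOrdinals(const std::vector<Agent>& agents) {
  std::map<uint64_t, uint32_t> ordinals;
  uint32_t next = 0;
  for (const Agent& agent : agents) {
    if (agent.type != AgentType::kGpu || !agent.hip_visible) continue;
    // Advance the ordinal only when the handle was actually inserted.
    if (ordinals.emplace(agent.handle, next).second) ++next;
  }
  return ordinals;
}
```

Filtering at insertion time means later code that iterates over the map starts at the first visible GPU rather than at a CPU entry.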
Initially, the idea was to keep v1 alongside v3 for a transition period (v1 is being phased out) and then remove v1 from the code entirely.

@cj401-amd @i-chaochen Here is a modification of the current changes to switch between them at runtime: ae1c5d8. I am not sure my plan of extending RocmTraceCollector and overriding its one member function is going to work, but it could be changed to composition or templated out instead.

@cj401-amd I think you need to rebase your branch onto the latest 0.6 to get a green CI pass.
Force-pushed from 70e5252 to a585235.
The backend is now split into rocm_tracer_v1 and rocm_profiler_sdk, defaulting to rocm_profiler_sdk (v3). For ROCm < 6.3 this can be changed with --bazel_options="--define=xla_rocm_profiler=v1". Based on the latest experiments, v1 does not seem to build on ROCm 6.4, and rocprofiler-sdk could not be built on ROCm 6.2.
i-chaochen
left a comment
It's OK with me. @pemeliya @ScXfjiang please take a look.

We are going to remove it once rocprofiler-sdk fixes the initialization of its hooks. If we remove it now, rocprofiler-sdk cannot trace any GPU events for some workloads, e.g. the MaxText LLAMA2-7B workload, because hipInit is called before rocprofiler_configure.

LGTM. Please squash your commits into one as well. If @pemeliya or @draganmladjenovic approves, we can merge it. Thanks!
pemeliya
left a comment
@cj401-amd, can you also take a look at my last two comments? I mean, we do not really need to access the 'agents_' array to get the device ID. Furthermore, we can experiment with the profiler buffer size to see whether increasing it solves the problem with missing events (as we saw previously).

    const auto &src_gpu = agents_[static_cast<uint32_t>(rec.src_agent_id.handle)],
               &dst_gpu = agents_[static_cast<uint32_t>(rec.dst_agent_id.handle)];

I think we can simplify this, because src_gpu.id.handle seems to be the same as rec.src_agent_id.handle, and the same goes for dst_gpu. So we do not really need to access the 'agents_' map here.

The reason I did it that way was for the check at https://github.com/ROCm/xla/blob/ci_cj-rocprofv3-v1-rocm-jaxlib-v0.6.0/xla/backends/profiler/gpu/rocm_profiler_sdk.cc#L225; otherwise we sometimes got a CPU ID, and then we would start iterating from cpu_id here: https://github.com/ROCm/xla/blob/ci_cj-rocprofv3-v1-rocm-jaxlib-v0.6.0/xla/backends/profiler/gpu/rocm_collector.cc#L534.
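
One way to keep the CPU-id check discussed above while avoiding accidental map insertions is sketched below. `AgentInfo`, `AgentKind`, `AgentMap`, and `FindGpuAgent` are hypothetical names for illustration, not the PR's types. The sketch relies on the standard behavior that `operator[]` on a `std::unordered_map` inserts a default-constructed entry for a missing key, whereas `find()` does not.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

enum class AgentKind { kCpu, kGpu };

// Hypothetical, simplified agent record.
struct AgentInfo {
  AgentKind kind;
  uint32_t device_id;
};

using AgentMap = std::unordered_map<uint64_t, AgentInfo>;

// Returns the GPU agent for `handle`, or nullptr when the handle is unknown
// or refers to a CPU agent (for example the host side of a memory copy).
const AgentInfo* FindGpuAgent(const AgentMap& agents, uint64_t handle) {
  auto it = agents.find(handle);  // find() never inserts, unlike operator[]
  if (it == agents.end() || it->second.kind != AgentKind::kGpu) {
    return nullptr;
  }
  return &it->second;
}
```

A caller handling a copy record would look up the source and destination handles this way and skip, or attribute to the host, any side for which the result is `nullptr`.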
Hi @draganmladjenovic, is it possible to help remove the lock now?
Force-pushed from 7c7eaba to cd5615f.
I can see there are still 2 failed unit tests in CI. Is that expected?

Fixed them already.
… CLI update with VLOG(2) and put more comments for using rocprofiler_force_configure
Force-pushed from e978e5c to 6f2106d.
upstream PR: openxla/pull/29769 Squash following commits.. Update rocprofiler-sdk (v3) along with roctracer (v1) for rocm-jaxlib-v0.6.0 (#302) * update for integration of rocprofiler-sdk (along with roctracer as a backup based on bazel_options from CLI) (cherry picked from commit 7775dd0) use VLOG(2) to replace LOG(INFO), so PGLE has no verbose info (#357) (cherry picked from commit 5950125) update with kernel details for rocm-7.x (#364) * update with kernel details for rocm-7.x (cherry picked from commit 5597c0d) update to remove previously hard-coded rocprofiler-sdk path (#369) * update to remove previously hard-coded rocprofiler-sdk path and add skip_rocprofiler_sdk to avoid loading `rocprofiler-sdk` (cherry picked from commit ff74b5f)
* rocprof-sdk addition, upstream PR: openxla/pull/29769 Squash following commits.. Update rocprofiler-sdk (v3) along with roctracer (v1) for rocm-jaxlib-v0.6.0 (#302) * update for integration of rocprofiler-sdk (along with roctracer as a backup based on bazel_options from CLI) (cherry picked from commit 7775dd0) use VLOG(2) to replace LOG(INFO), so PGLE has no verbose info (#357) (cherry picked from commit 5950125) update with kernel details for rocm-7.x (#364) * update with kernel details for rocm-7.x (cherry picked from commit 5597c0d) update to remove previously hard-coded rocprofiler-sdk path (#369) * update to remove previously hard-coded rocprofiler-sdk path and add skip_rocprofiler_sdk to avoid loading `rocprofiler-sdk` (cherry picked from commit ff74b5f) * fixed buffer comparator test --------- Co-authored-by: Chunyu Jin <[email protected]>
* rocprof-sdk addition, upstream PR: openxla/pull/29769 Squash following commits.. Update rocprofiler-sdk (v3) along with roctracer (v1) for rocm-jaxlib-v0.6.0 (#302) * update for integration of rocprofiler-sdk (along with roctracer as a backup based on bazel_options from CLI) (cherry picked from commit 7775dd0) use VLOG(2) to replace LOG(INFO), so PGLE has no verbose info (#357) (cherry picked from commit 5950125) update with kernel details for rocm-7.x (#364) * update with kernel details for rocm-7.x (cherry picked from commit 5597c0d) update to remove previously hard-coded rocprofiler-sdk path (#369) * update to remove previously hard-coded rocprofiler-sdk path and add skip_rocprofiler_sdk to avoid loading `rocprofiler-sdk` (cherry picked from commit ff74b5f) * fixed buffer comparator test * misc fixes ported from rocm-jaxlib-v0.6.0 --------- Co-authored-by: Pavel Emeliyanenko <[email protected]> (cherry picked from commit f013645) (cherry picked from commit b03cd94) Added support for waves_per_eu function attribute. (#181) (cherry picked from commit bc1d816) (cherry picked from commit d3f94e9) removed two line change (revert of half of the openxla#25959 commit (cherry picked from commit 109e138) Fixes for jax 0.6.0 (#207) * Add fixes for jax plugin 0.6.0 Drop NEEDED linking to unnecessary libs. These are loaded by amdhipruntime and not us. Fix missing NEEDED on MIOpen shared object. * Minor rocblas related changes for rocm 70 (cherry picked from commit 0de7d49) --------- Co-authored-by: Zoran Jovanovic <[email protected]> (cherry picked from commit 28f10a0) Add hipBLASLt support for gfx11. (#301) (cherry picked from commit f814bff) Add bf16 starting from gfx11, bugfix & optimize RocmComputeCapability (#303) * Bugfix and improve device_description.h::RocmComputeCompatibility * Enable ALG_DOT_BF16* on rocm with HW support (cherry picked from commit 510ea06) [ROCm] Use bundled bitcode files (#196) Also trim bitcode file list to ockl.bc and ocml.bc only. 
(cherry picked from commit fc9e3c3) Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms (#312) * Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms * Exclude failing CollectiveOpsE2E tests (cherry picked from commit fb6ddfb) Restore RocmComputeCapability:: gfx11_rx7900() and gfx12_rx8900() methods (#333) At least gfx11_rx7900() is still needed for TF build. (cherry picked from commit 13c3de1) Make device_count_ atomic (#343) * Make device_count_ atomic * Use relaxed memory order * Fix build error (cherry picked from commit 8513f2d) fix hardcoded max registers (#345) (cherry picked from commit f3e170a) fix hardcoded ecc enabled (#348) (cherry picked from commit 9cfa74a) remove reserved memory (#349) (cherry picked from commit 0015d0e) Add rocm_dev config for remote caching (#353) (cherry picked from commit c815420) added rocm7 support to EnablePeerAccess (#347) * added rocm7 support to EnablePeerAccess * use wrap namespace, clang-format and add comments (cherry picked from commit 85548a7) [ROCm] Disable Cudnn fusions (#358) (cherry picked from commit edab8b2) --------- Co-authored-by: Chunyu Jin <[email protected]> Co-authored-by: zoranjovanovic-ns <[email protected]>
* rocprof-sdk addition, upstream PR: openxla/pull/29769 Squash following commits.. Update rocprofiler-sdk (v3) along with roctracer (v1) for rocm-jaxlib-v0.6.0 (#302) * update for integration of rocprofiler-sdk (along with roctracer as a backup based on bazel_options from CLI) (cherry picked from commit 7775dd0) use VLOG(2) to replace LOG(INFO), so PGLE has no verbose info (#357) (cherry picked from commit 5950125) update with kernel details for rocm-7.x (#364) * update with kernel details for rocm-7.x (cherry picked from commit 5597c0d) update to remove previously hard-coded rocprofiler-sdk path (#369) * update to remove previously hard-coded rocprofiler-sdk path and add skip_rocprofiler_sdk to avoid loading `rocprofiler-sdk` (cherry picked from commit ff74b5f) * fixed buffer comparator test * misc fixes ported from rocm-jaxlib-v0.6.0 --------- Co-authored-by: Pavel Emeliyanenko <[email protected]> (cherry picked from commit f013645) (cherry picked from commit b03cd94) Added support for waves_per_eu function attribute. (#181) (cherry picked from commit bc1d816) (cherry picked from commit d3f94e9) removed two line change (revert of half of the openxla#25959 commit (cherry picked from commit 109e138) Fixes for jax 0.6.0 (#207) * Add fixes for jax plugin 0.6.0 Drop NEEDED linking to unnecessary libs. These are loaded by amdhipruntime and not us. Fix missing NEEDED on MIOpen shared object. * Minor rocblas related changes for rocm 70 (cherry picked from commit 0de7d49) --------- Co-authored-by: Zoran Jovanovic <[email protected]> (cherry picked from commit 28f10a0) Add hipBLASLt support for gfx11. (#301) (cherry picked from commit f814bff) Add bf16 starting from gfx11, bugfix & optimize RocmComputeCapability (#303) * Bugfix and improve device_description.h::RocmComputeCompatibility * Enable ALG_DOT_BF16* on rocm with HW support (cherry picked from commit 510ea06) [ROCm] Use bundled bitcode files (#196) Also trim bitcode file list to ockl.bc and ocml.bc only. 
(cherry picked from commit fc9e3c3) Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms (#312) * Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms * Exclude failing CollectiveOpsE2E tests (cherry picked from commit fb6ddfb) Restore RocmComputeCapability:: gfx11_rx7900() and gfx12_rx8900() methods (#333) At least gfx11_rx7900() is still needed for TF build. (cherry picked from commit 13c3de1) Make device_count_ atomic (#343) * Make device_count_ atomic * Use relaxed memory order * Fix build error (cherry picked from commit 8513f2d) fix hardcoded max registers (#345) (cherry picked from commit f3e170a) fix hardcoded ecc enabled (#348) (cherry picked from commit 9cfa74a) remove reserved memory (#349) (cherry picked from commit 0015d0e) Add rocm_dev config for remote caching (#353) (cherry picked from commit c815420) added rocm7 support to EnablePeerAccess (#347) * added rocm7 support to EnablePeerAccess * use wrap namespace, clang-format and add comments (cherry picked from commit 85548a7) [ROCm] Disable Cudnn fusions (#358) (cherry picked from commit edab8b2) * Ported all triton related changes from v0.6.0 to v0.7.1 (cherry picked from commit 1851bcc) Disable softmax triton fusion if triton gemm is off (#281) * Disable softmax rewriter triton if triton gemm is disabled * Add specific flag to enable triton softmax fusion * Address review comments (cherry picked from commit 51a7f4b) [ROCm][Triton] Disable transposed load in certain conditions (cherry picked from commit 50860e9) Enable unit tests that pass after fixing some Triton related issues. (#285) * Enable unit tests that pass after fixing some Triton related issues. * fusion_emitter_device_legacy_test still fails on MI200 (cherry picked from commit 97dd565) Rocm jaxlib v0.6.0 triton support ut (#279) * Fixed triton/support_test - no fmfa. * Fix issue with rounding mode in accelerate amd matmul. * Fixed issues with usage of mfma in support_test. 
(cherry picked from commit 44f7d87) Restore gpu_triton_custom_call_test (#262) (cherry picked from commit 32eafa4) Skipped CanNotEmitTritonCustomCallOnPreAmpereGpu test for ROCM. (cherry picked from commit 56ec7ec) (cherry picked from commit b1f3e9f) fixed createTritonAMDGPULowerInstructionSchedHintsPass (#179) (cherry picked from commit 8517a3a) (cherry picked from commit c62e47d) fixed bazel build issue --------- Co-authored-by: Chunyu Jin <[email protected]> Co-authored-by: zoranjovanovic-ns <[email protected]> Co-authored-by: Alex <[email protected]>
* rocprof-sdk addition, upstream PR: openxla/pull/29769 Squash following commits.. Update rocprofiler-sdk (v3) along with roctracer (v1) for rocm-jaxlib-v0.6.0 (#302) * update for integration of rocprofiler-sdk (along with roctracer as a backup based on bazel_options from CLI) (cherry picked from commit 7775dd0) use VLOG(2) to replace LOG(INFO), so PGLE has no verbose info (#357) (cherry picked from commit 5950125) update with kernel details for rocm-7.x (#364) * update with kernel details for rocm-7.x (cherry picked from commit 5597c0d) update to remove previously hard-coded rocprofiler-sdk path (#369) * update to remove previously hard-coded rocprofiler-sdk path and add skip_rocprofiler_sdk to avoid loading `rocprofiler-sdk` (cherry picked from commit ff74b5f) * fixed buffer comparator test --------- Co-authored-by: Chunyu Jin <[email protected]> (cherry picked from commit 492d1ee)
* Rocm jaxlib v0.5.0 warpsize global (#177) * cherry-picked warp size passing to triton calls, and globally enabled warpsize=64 * Fix. --------- Co-authored-by: Pavel Emeliyanenko <[email protected]> (cherry picked from commit f013645) * Added support for waves_per_eu function attribute. (#181) (cherry picked from commit bc1d816) * Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms (#312) * Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms * Exclude failing CollectiveOpsE2E tests * Make device_count_ atomic (#343) * Make device_count_ atomic * Use relaxed memory order * Fix build error * added rocm7 support to EnablePeerAccess (#347) * added rocm7 support to EnablePeerAccess * use wrap namespace, clang-format and add comments * [ROCm] Disable Cudnn fusions (#358) * Get CI green * Triton fixes porting from v0.6.0 (#389) * rocprof-sdk addition, upstream PR: openxla/pull/29769 Squash following commits.. Update rocprofiler-sdk (v3) along with roctracer (v1) for rocm-jaxlib-v0.6.0 (#302) * update for integration of rocprofiler-sdk (along with roctracer as a backup based on bazel_options from CLI) (cherry picked from commit 7775dd0) use VLOG(2) to replace LOG(INFO), so PGLE has no verbose info (#357) (cherry picked from commit 5950125) update with kernel details for rocm-7.x (#364) * update with kernel details for rocm-7.x (cherry picked from commit 5597c0d) update to remove previously hard-coded rocprofiler-sdk path (#369) * update to remove previously hard-coded rocprofiler-sdk path and add skip_rocprofiler_sdk to avoid loading `rocprofiler-sdk` (cherry picked from commit ff74b5f) * fixed buffer comparator test * misc fixes ported from rocm-jaxlib-v0.6.0 --------- Co-authored-by: Pavel Emeliyanenko <[email protected]> (cherry picked from commit f013645) (cherry picked from commit b03cd94) Added support for waves_per_eu function attribute. 
(#181) (cherry picked from commit bc1d816) (cherry picked from commit d3f94e9) removed two line change (revert of half of the openxla#25959 commit (cherry picked from commit 109e138) Fixes for jax 0.6.0 (#207) * Add fixes for jax plugin 0.6.0 Drop NEEDED linking to unnecessary libs. These are loaded by amdhipruntime and not us. Fix missing NEEDED on MIOpen shared object. * Minor rocblas related changes for rocm 70 (cherry picked from commit 0de7d49) --------- Co-authored-by: Zoran Jovanovic <[email protected]> (cherry picked from commit 28f10a0) Add hipBLASLt support for gfx11. (#301) (cherry picked from commit f814bff) Add bf16 starting from gfx11, bugfix & optimize RocmComputeCapability (#303) * Bugfix and improve device_description.h::RocmComputeCompatibility * Enable ALG_DOT_BF16* on rocm with HW support (cherry picked from commit 510ea06) [ROCm] Use bundled bitcode files (#196) Also trim bitcode file list to ockl.bc and ocml.bc only. (cherry picked from commit fc9e3c3) Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms (#312) * Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms * Exclude failing CollectiveOpsE2E tests (cherry picked from commit fb6ddfb) Restore RocmComputeCapability:: gfx11_rx7900() and gfx12_rx8900() methods (#333) At least gfx11_rx7900() is still needed for TF build. 
(cherry picked from commit 13c3de1) Make device_count_ atomic (#343) * Make device_count_ atomic * Use relaxed memory order * Fix build error (cherry picked from commit 8513f2d) fix hardcoded max registers (#345) (cherry picked from commit f3e170a) fix hardcoded ecc enabled (#348) (cherry picked from commit 9cfa74a) remove reserved memory (#349) (cherry picked from commit 0015d0e) Add rocm_dev config for remote caching (#353) (cherry picked from commit c815420) added rocm7 support to EnablePeerAccess (#347) * added rocm7 support to EnablePeerAccess * use wrap namespace, clang-format and add comments (cherry picked from commit 85548a7) [ROCm] Disable Cudnn fusions (#358) (cherry picked from commit edab8b2) * Ported all triton related changes from v0.6.0 to v0.7.1 (cherry picked from commit 1851bcc) Disable softmax triton fusion if triton gemm is off (#281) * Disable softmax rewriter triton if triton gemm is disabled * Add specific flag to enable triton softmax fusion * Address review comments (cherry picked from commit 51a7f4b) [ROCm][Triton] Disable transposed load in certain conditions (cherry picked from commit 50860e9) Enable unit tests that pass after fixing some Triton related issues. (#285) * Enable unit tests that pass after fixing some Triton related issues. * fusion_emitter_device_legacy_test still fails on MI200 (cherry picked from commit 97dd565) Rocm jaxlib v0.6.0 triton support ut (#279) * Fixed triton/support_test - no fmfa. * Fix issue with rounding mode in accelerate amd matmul. * Fixed issues with usage of mfma in support_test. (cherry picked from commit 44f7d87) Restore gpu_triton_custom_call_test (#262) (cherry picked from commit 32eafa4) Skipped CanNotEmitTritonCustomCallOnPreAmpereGpu test for ROCM. 
(cherry picked from commit 56ec7ec) (cherry picked from commit b1f3e9f) fixed createTritonAMDGPULowerInstructionSchedHintsPass (#179) (cherry picked from commit 8517a3a) (cherry picked from commit c62e47d) fixed bazel build issue --------- Co-authored-by: Chunyu Jin <[email protected]> Co-authored-by: zoranjovanovic-ns <[email protected]> Co-authored-by: Alex <[email protected]> * Fix asan error reported when running set_dimension_size_test under rocm * PR openxla#33909: [ROCm] Make multi gpu tests exclusive if executed locally Imported from GitHub PR openxla#33909 📝 Summary of Changes Make multigpu tests being executed exclusively. 🎯 Justification Multigpu tests have to run exclusively to not clash the utilization of gpus in parallel. This PR marks multigpu tests as such by adding 'multi_gpu' tag, and in case of rocm make them run exclusively if executed locally. 🚀 Kind of Contribution Please remove what does not apply: ♻️ Cleanup 📊 Benchmark (for Performance Improvements) Not relevant, no logic change 🧪 Unit Tests: Not relevant, no logic change 🧪 Execution Tests: Not relevant, no logic change Copybara import of the project: -- 7e84a78 by Alexandros Theodoridis <[email protected]>: Mark multigpu tests as such, make them exclusive if executed locally -- 23fc24c by Alexandros Theodoridis <[email protected]>: Remove invalid multi_gpu tag from cpu tests -- a2a35de by Alexandros Theodoridis <[email protected]>: Restore build_config not exposing has_tag -- babe5c7 by Alexandros Theodoridis <[email protected]>: Unmark test which is not multigpu -- 3e30cf6 by Alexandros Theodoridis <[email protected]>: Add missing tag documentation Merging this change closes openxla#33909 COPYBARA_INTEGRATE_REVIEW=openxla#33909 from ROCm:ci_make_multi_gpu_tests_execute_sequentially_when_locally_but_not_in_rbe 3e30cf6 PiperOrigin-RevId: 836288457 * Cleanup nccl communicator shared memory files * Add missing parallel_gpu_execute target * Fix ci build command, push args to the end * Use 
upstream version of ci run script * Add cuda-only tag to cuda_test target * Add noasan notsan filters * Fix buildifier error --------- Co-authored-by: zoranjovanovic-ns <[email protected]> Co-authored-by: Pavel Emeliyanenko <[email protected]> Co-authored-by: Alex <[email protected]> Co-authored-by: spiao <[email protected]> Co-authored-by: Dragan Mladjenovic <[email protected]> Co-authored-by: Harsha HS <[email protected]> Co-authored-by: Zahid Iqbal <[email protected]> Co-authored-by: Chunyu Jin <[email protected]> Co-authored-by: Alexandros Theodoridis <[email protected]>
* rocprof-sdk addition, upstream PR: openxla/pull/29769 Squash following commits.. Update rocprofiler-sdk (v3) along with roctracer (v1) for rocm-jaxlib-v0.6.0 (#302) * update for integration of rocprofiler-sdk (along with roctracer as a backup based on bazel_options from CLI) (cherry picked from commit 7775dd0) use VLOG(2) to replace LOG(INFO), so PGLE has no verbose info (#357) (cherry picked from commit 5950125) update with kernel details for rocm-7.x (#364) * update with kernel details for rocm-7.x (cherry picked from commit 5597c0d) update to remove previously hard-coded rocprofiler-sdk path (#369) * update to remove previously hard-coded rocprofiler-sdk path and add skip_rocprofiler_sdk to avoid loading `rocprofiler-sdk` (cherry picked from commit ff74b5f) * fixed buffer comparator test --------- Co-authored-by: Chunyu Jin <[email protected]> (cherry picked from commit 492d1ee)
* Fix AMD GPU alloca address space errors (#433) * Fix AMD GPU alloca address space errors AMD GPUs require stack allocations (alloca instructions) to be in address space 5 (private/local memory), not address space 0 (generic memory) * Optimize AMDGPU allocas by keeping AS5 pointers throughout * Fix AMDGPU allocas to use address space 5 in MLIR lowering * [AIXLA-171] Jaxlib 0.8.0 requried backports (#437) * Rocm jaxlib v0.5.0 warpsize global (#177) * cherry-picked warp size passing to triton calls, and globally enabled warpsize=64 * Fix. --------- Co-authored-by: Pavel Emeliyanenko <[email protected]> (cherry picked from commit f013645) * Added support for waves_per_eu function attribute. (#181) (cherry picked from commit bc1d816) * Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms (#312) * Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms * Exclude failing CollectiveOpsE2E tests * Make device_count_ atomic (#343) * Make device_count_ atomic * Use relaxed memory order * Fix build error * added rocm7 support to EnablePeerAccess (#347) * added rocm7 support to EnablePeerAccess * use wrap namespace, clang-format and add comments * [ROCm] Disable Cudnn fusions (#358) * Get CI green * Triton fixes porting from v0.6.0 (#389) * rocprof-sdk addition, upstream PR: openxla/pull/29769 Squash following commits.. 
Update rocprofiler-sdk (v3) along with roctracer (v1) for rocm-jaxlib-v0.6.0 (#302) * update for integration of rocprofiler-sdk (along with roctracer as a backup based on bazel_options from CLI) (cherry picked from commit 7775dd0) use VLOG(2) to replace LOG(INFO), so PGLE has no verbose info (#357) (cherry picked from commit 5950125) update with kernel details for rocm-7.x (#364) * update with kernel details for rocm-7.x (cherry picked from commit 5597c0d) update to remove previously hard-coded rocprofiler-sdk path (#369) * update to remove previously hard-coded rocprofiler-sdk path and add skip_rocprofiler_sdk to avoid loading `rocprofiler-sdk` (cherry picked from commit ff74b5f) * fixed buffer comparator test * misc fixes ported from rocm-jaxlib-v0.6.0 --------- Co-authored-by: Pavel Emeliyanenko <[email protected]> (cherry picked from commit f013645) (cherry picked from commit b03cd94) Added support for waves_per_eu function attribute. (#181) (cherry picked from commit bc1d816) (cherry picked from commit d3f94e9) removed two line change (revert of half of the openxla#25959 commit (cherry picked from commit 109e138) Fixes for jax 0.6.0 (#207) * Add fixes for jax plugin 0.6.0 Drop NEEDED linking to unnecessary libs. These are loaded by amdhipruntime and not us. Fix missing NEEDED on MIOpen shared object. * Minor rocblas related changes for rocm 70 (cherry picked from commit 0de7d49) --------- Co-authored-by: Zoran Jovanovic <[email protected]> (cherry picked from commit 28f10a0) Add hipBLASLt support for gfx11. (#301) (cherry picked from commit f814bff) Add bf16 starting from gfx11, bugfix & optimize RocmComputeCapability (#303) * Bugfix and improve device_description.h::RocmComputeCompatibility * Enable ALG_DOT_BF16* on rocm with HW support (cherry picked from commit 510ea06) [ROCm] Use bundled bitcode files (#196) Also trim bitcode file list to ockl.bc and ocml.bc only. 
(cherry picked from commit fc9e3c3) Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms (#312) * Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms * Exclude failing CollectiveOpsE2E tests (cherry picked from commit fb6ddfb) Restore RocmComputeCapability:: gfx11_rx7900() and gfx12_rx8900() methods (#333) At least gfx11_rx7900() is still needed for TF build. (cherry picked from commit 13c3de1) Make device_count_ atomic (#343) * Make device_count_ atomic * Use relaxed memory order * Fix build error (cherry picked from commit 8513f2d) fix hardcoded max registers (#345) (cherry picked from commit f3e170a) fix hardcoded ecc enabled (#348) (cherry picked from commit 9cfa74a) remove reserved memory (#349) (cherry picked from commit 0015d0e) Add rocm_dev config for remote caching (#353) (cherry picked from commit c815420) added rocm7 support to EnablePeerAccess (#347) * added rocm7 support to EnablePeerAccess * use wrap namespace, clang-format and add comments (cherry picked from commit 85548a7) [ROCm] Disable Cudnn fusions (#358) (cherry picked from commit edab8b2) * Ported all triton related changes from v0.6.0 to v0.7.1 (cherry picked from commit 1851bcc) Disable softmax triton fusion if triton gemm is off (#281) * Disable softmax rewriter triton if triton gemm is disabled * Add specific flag to enable triton softmax fusion * Address review comments (cherry picked from commit 51a7f4b) [ROCm][Triton] Disable transposed load in certain conditions (cherry picked from commit 50860e9) Enable unit tests that pass after fixing some Triton related issues. (#285) * Enable unit tests that pass after fixing some Triton related issues. * fusion_emitter_device_legacy_test still fails on MI200 (cherry picked from commit 97dd565) Rocm jaxlib v0.6.0 triton support ut (#279) * Fixed triton/support_test - no fmfa. * Fix issue with rounding mode in accelerate amd matmul. * Fixed issues with usage of mfma in support_test. 
(cherry picked from commit 44f7d87) Restore gpu_triton_custom_call_test (#262) (cherry picked from commit 32eafa4) Skipped CanNotEmitTritonCustomCallOnPreAmpereGpu test for ROCM. (cherry picked from commit 56ec7ec) (cherry picked from commit b1f3e9f) fixed createTritonAMDGPULowerInstructionSchedHintsPass (#179) (cherry picked from commit 8517a3a) (cherry picked from commit c62e47d) fixed bazel build issue --------- Co-authored-by: Chunyu Jin <[email protected]> Co-authored-by: zoranjovanovic-ns <[email protected]> Co-authored-by: Alex <[email protected]> * Fix asan error reported when running set_dimension_size_test under rocm * PR openxla#33909: [ROCm] Make multi gpu tests exclusive if executed locally Imported from GitHub PR openxla#33909 📝 Summary of Changes Make multigpu tests being executed exclusively. 🎯 Justification Multigpu tests have to run exclusively to not clash the utilization of gpus in parallel. This PR marks multigpu tests as such by adding 'multi_gpu' tag, and in case of rocm make them run exclusively if executed locally. 
🚀 Kind of Contribution Please remove what does not apply: ♻️ Cleanup 📊 Benchmark (for Performance Improvements) Not relevant, no logic change 🧪 Unit Tests: Not relevant, no logic change 🧪 Execution Tests: Not relevant, no logic change Copybara import of the project: -- 7e84a78 by Alexandros Theodoridis <[email protected]>: Mark multigpu tests as such, make them exclusive if executed locally -- 23fc24c by Alexandros Theodoridis <[email protected]>: Remove invalid multi_gpu tag from cpu tests -- a2a35de by Alexandros Theodoridis <[email protected]>: Restore build_config not exposing has_tag -- babe5c7 by Alexandros Theodoridis <[email protected]>: Unmark test which is not multigpu -- 3e30cf6 by Alexandros Theodoridis <[email protected]>: Add missing tag documentation Merging this change closes openxla#33909 COPYBARA_INTEGRATE_REVIEW=openxla#33909 from ROCm:ci_make_multi_gpu_tests_execute_sequentially_when_locally_but_not_in_rbe 3e30cf6 PiperOrigin-RevId: 836288457 * Cleanup nccl communicator shared memory files * Add missing parallel_gpu_execute target * Fix ci build command, push args to the end * Use upstream version of ci run script * Add cuda-only tag to cuda_test target * Add noasan notsan filters * Fix buildifier error --------- Co-authored-by: zoranjovanovic-ns <[email protected]> Co-authored-by: Pavel Emeliyanenko <[email protected]> Co-authored-by: Alex <[email protected]> Co-authored-by: spiao <[email protected]> Co-authored-by: Dragan Mladjenovic <[email protected]> Co-authored-by: Harsha HS <[email protected]> Co-authored-by: Zahid Iqbal <[email protected]> Co-authored-by: Chunyu Jin <[email protected]> Co-authored-by: Alexandros Theodoridis <[email protected]> * Backport fixes from upstream (#443) * Consider runfiles in lit tests * Address review comments * Introduce pool name for rbe * Introduce rocm rbe pools * First check for multigpu tag * Address review comments * Fix buildifier issue * Assigne specific pool only to test targets * Make rbe amdgpu 
pools configurable * Trigger CI/CD pipeline * Fix build error * Add excluded tests * Trigger CI/CD pipeline * Switch back to rocminfo to detect number of gpus * Backport rocm_configure * Trigger CI/CD pipeline * Add ignore tests list * Trigger CI/CD pipeline * Add disk cache * Enable rbe for this branch only * Add one more failing test * Update tsan ignore list * Remove printing out the script commands * Update local test jobs * Trigger CI/CD pipeline * Trigger CI/CD pipeline * Disable rbe * Add failing test * Enable rbe * Ignore TransferLiteralFromDevice * Ignore transfer from outfeed * Fix hermetic builds * Print bazel command * Add back communicator test * Remove require amd gpu filter * Remove filter nccl_communicator_test * Remove no_oss filter * Ignore //xla/pjrt/gpu/tfrt:tfrt_gpu_buffer_test * Exclude failing tsts * Mark multigpu tests as such, make them exclusive if executed locally * Remove invalid multi_gpu tag from cpu tests * Restore build_config not exposing has_tag * Unmark test which is not multigpu * Add missing tag documentation * Run multigpu tests locally * Revert tests to fix, use taget * Remove disk_cache * Return disk cache and ignore tsan warning * Fix tsan issues (#450) * Fix tsan issues * Fix hermetic build llvm clash * Set timeouts, ignore tsan warning * Ignore execute on stream * Port PackedTranspose performance regression fix (#441) * [XLA:GPU] Rename warp to shmem_group in PackedTranspose Also calculate their count as kNumThreadsPerBlock / kNumShmemBanks to avoid inconsistency when manually specified. This change is NFC for any GPU in upstream. However, it fixes a performance regression in downstream for AMD GPUs caused by inconsistency between shmem_group size, kNumThreadsPerBlock and kNumShmemBanks. It ended up in a situation downstream where half of the launched threads per block were not utilized at all. Update packed transpose tests to verify correct thread utilization. 
* [XLA:GPU] Fuse shmem write loops for transposes in PackedTranspose Replace per-transpose loops with a single unified loop that processes all transposes simultaneously, computing indices once and reusing them across all operations. Update packed_transpose_multiple_heroes.hlo test to verify the single-loop structure with multiple iter_args. * Support split for multigpu and single gpu pipelines (#453) * Support split for multigpu and single gpu pipelines * Use placeholder config * Include all multigpu tests * Add back multigpu tests * Ignore ExecuteReplicatedImpl * Migrate settings from run_multi_gpu tests to bazelrc * Include sanitizer wrapper * Exclude failing test * Migrate more settings and enable collective tests * add missing cuda-only tag * Add back failing test * Mark nvshmem tests as cuda-only (#458) * bugfix - consider the situation where the best time is infinite (#440) * bugfix - consider the situation where the best time is infinite * added an unit test where best_tiled_run_time_data contains Infinite. * Enable rbe for testing (#463) * Enable rbe for testing * Force multigpu tests to run locally * Reduce disk_cache size (#467) * Exclude //xla/tests:collective_ops_test_amdgpu_any (#468) * Add register spilling detection AMD v0.8.0 (#464) * register spilling by disassembling object file * added time measurement to the spilling check * adapt the num_warps so that the hlo could be compiled on both amd and nvidia * pass though is_autotuning_compilation flag to the function CompileToHsaco * implementation of register spilling by reading meta data of hasco file using llvm-readobj * utilize amd code object manager library for parsing HSACO metadata * Fix infinite recursion in HloInstruction::Accept/Visit const wrappers (#470) The const wrapper methods for Accept() and Visit() were calling themselves instead of the template versions, causing infinite recursion and stack overflow. 
* enable mx datatype for rocm (#462) * enable mx datatype for rocm * add // TF_ROCM_VERSION >= 70000 * fix unit test build * Rocm jaxlib v0.8.0 fix fp support ut 2 (#466) * Fix expected output in fusion_emitter_int4_device_test for ROCm. * Enable triton/fusion_emitter_int4_device_test on ci. * Rocm jaxlib v0.8.0 fix triton support (#465) * Fixed triton support_test on 0.7.1. (cherry picked from commit be9da6d) * Fixed triton/support_test on ROCm 0.8.0. Updated number of IsTritonSupport functions. * Remove invalid ignore targets (#475) * revert xla_gpu_enable_triton_softmax_fusion related changes (#474) CI status is irrelevant to this PR * update rocprofiler-sdk (v3) and roctracer (v1) (#473) * update rocprofiler-sdk (v3) and roctracer (v1) for 0.8.0 * update for skip_rocprofiler_sdk * update for the fixing asan check * fix to show kernel details in trace file * Refine rocprofiler-sdk (v3) integration based on Arech8's comments along with rocprofiler_flush_buffer explicitly * update rocprofiler-sdk based on comments * Force rbe incompatible tests to be executed locally (#485) * [ROCm] Restore CudnnFusedConvRewriter (#372) (#392) (#481) * [ROCm] Restore CudnnFusedConvRewriter (#372) (#392) * [ROCm] Allways run GpuConvAlgorithmPicker * [ROCm] Restore CudnnFusedConvRewriter Introduce CudnnFusedConvDecomposer to revert back fused convs if no fused algorithm could be found with ConvAlgorithmPicker. Remove unfused fallback paths from RocmFusedConvRunner. Co-authored-by: Dragan Mladjenovic <[email protected]> * Enable u-tests * Fix * Remove gemm algorithm picker * Backport gemmalgorithm picker * Fix --------- Co-authored-by: Harsha H S <[email protected]> Co-authored-by: Dragan Mladjenovic <[email protected]> * [XLA:GPU] Consistently check which bitcasts we can fuse. (#486) github.com/openxla/pull/30864 added a check that we shouldn't fuse bitcasts that change bit width, but only in one place. 
We need to do that consistently, because there are multiple places in PriorityFusion that have special handling for bitcasts. PiperOrigin-RevId: 833281452 (cherry picked from commit 8c64c14) Co-authored-by: Oleg Shyshkov <[email protected]> * take silu epilogues into account in autotuning (#488) * Rocm jaxlib v0.8.0 fix fusion emitter device test 2 (#482) * Clean up disk cache using bash trap on exit (#480) * Increase the step for iota test (#487) * Make cublaslt_test platform independent (#489) * Exclude iota_test for tsan builds (#490) * Fix hipSolver FFI errors (#483) * Fix hipSolver FFI errors * Remove srcs and add linkopts instead * Add -L for different ROCm configs * Add similar fixes for rocprofiler-sdk * Revert moving rocprofiler sdk to data --------- Co-authored-by: Alexandros Theodoridis <[email protected]> * Introduce jax utest script (#498) * Make it compatible with rocm plugin * Introduce jax utest script * Sync filter for jax tests * [ROCm] Initialze collectives to nullptr to force its allocation later (#502) * Add SiLU epilogue unit tests (#491) * [ROCm] Enable embeded bitcode libs and inprocess lld (#507) Added TF_ROCM_INPROCESS_LLD and TF_ROCM_EMBEDDED_DEVICE_LIB form 0.6.0 otherwise identical to openxla#32439. Env vars only needed for 0.8.0. * Fixing self_adj_test and fixing sort_rewriter on ROCM (#493) * fixing self_adj_test and fixing sort_rewriter on ROCM * fixing sortrewriter pass * fixing gpu_compiler_test after adding 2nd SortRewriter pass * Multioutput fusion test fix (#496) * fixing warp size for multioutput fusion test * added build fix * Fix TopK algorithm for RDNA architectures by properly handling WAVEFRONT_SIZE and removing __AMDGCN_WAVEFRONT_SIZE that got deprecated in newer versions of ROCm. Fixes sort_ops_test_gpu and sort_ops_test_gpu_mlir_bridge_test UT failures. 
(#509) * [rocm-jaxlib-v0.8.0] Addressing CI warnings (#508) * Remove unused variable * Add fallthrough attribute in the IsSupportedDotAlgorithm * Specify all cases in the CudnnFusedDecomposer * Specify mx_mode in GemmConfig * Remove unused cleanup_on_error * Specify missing mx_mode in GemmConfig::For (#516) * PR openxla#36046: [ROCm] Fix failing unit tests on ROCm platform Imported from GitHub PR openxla#36046 📝 Summary of Changes - layout_assignment tests are marked cuda-only. - sample_file_test needs higher autotuner level for MIOpen to return conv algorithm. Earlier this was coming from GetDebugOptionsForTest. - buffer_debug_log test is made gpu agnostic by using cannonical gpu name. - cublas_gemm_rewriter_test_amdgpu_any fix unit test to remove padding for ROCm as introduced in openxla#33854 - gpu_kernel_tiling_test_amdgpu_any is updated to respect higher launch dimensions now supported by hipruntime - Mark dynamic_shared_memory_test as cuda-only - Add arch specific checks for barriers to sorting.hlo 🎯 Justification Fixes failing unit tests on ROCm platform 🚀 Kind of Contribution 🐛 Bug Fix, 🧪 Tests Copybara import of the project: -- 472cd54 by Harsha HS <[email protected]>: [ROCm] Fix failing unit tests on ROCm platform - layout_assignment tests are marked cuda-only. - sample_file_test needs higher autotuner level for MIOpen to return conv algorithm. Earlier this was coming from GetDebugOptionsForTest. - buffer_debug_log test is made gpu agnostic by using cannonical gpu name. 
-- 3bb9422 by Harsha HS <[email protected]>: Fix tests which started to fail due to openxla#33854 -- 850d955 by Harsha HS <[email protected]>: HIP now respects highter launch dimension similar to CUDA -- b504a7e by Harsha HS <[email protected]>: Make dynamic_shared_memory_test cuda only -- 1e4e57a by Harsha HS <[email protected]>: Add arch specific checks to sorting.hlo -- ce1241c by Harsha HS <[email protected]>: Address review comments Merging this change closes openxla#36046 FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla#36046 from ROCm:ci_fix_upstream_ut_20260107 ce1241c PiperOrigin-RevId: 856068530 * PR openxla#35744: [ROCm] Fix TopK algorithm for RDNA Imported from GitHub PR openxla#35744 📝 Summary of Changes xla/stream_executor/rocm/topk_kernel_rocm_common.cu.h now implements proper handling of WAVEFRONT_SIZE based on the architecture. __AMDGCN_WAVEFRONT_SIZE is removed as it is deprecated in newer versions of ROCm. 🎯 Justification Fixes sort_ops_test_gpu and sort_ops_test_gpu_mlir_bridge_test UT failures. 🚀 Kind of Contribution 🐛 Bug Fix 🧪 Unit Tests: Fixes sort_ops_test_gpu and sort_ops_test_gpu_mlir_bridge_test UT failures. **NOTE:** The changes are verified on gfx1100, gfx1201, and gfx90a. Copybara import of the project: -- 7097b9c by Aleksa Arsic <[email protected]>: Fix TopK algorithm for RDNA architectures by properly handling WAVEFRONT_SIZE and removing __AMDGCN_WAVEFRONT_SIZE that got deprecated in newer versions of ROCm. Fixes sort_ops_test_gpu and sort_ops_test_gpu_mlir_bridge_test UT failures. 
-- 6e1aaab by Harsha HS <[email protected]>: [ROCm] Fix TopK kernel Serialization checks -- 9fa6021 by Harsha HS <[email protected]>: [ROCm] Get Warpsize from device descriptor -- deb4932 by Harsha H S <[email protected]>: Include link to depricated macro Merging this change closes openxla#35744 COPYBARA_INTEGRATE_REVIEW=openxla#35744 from ROCm:ci_fix_topk_algorithm_for_rdna deb4932 PiperOrigin-RevId: 855689898 * PR openxla#36283: [ROCm] Skip conditional tests on ROCm as they are not supported by HIP Graphs Imported from GitHub PR openxla#36283 📝 Summary of Changes Conditionals are not supported by HIP graphs and those tests are skipped. 🚀 Kind of Contribution 🐛 Bug Fix Copybara import of the project: -- 1c607b6 by Harsha HS <[email protected]>: [ROCm] Skip conditional tests on ROCm as they are not supported by HIP Graphs Merging this change closes openxla#36283 COPYBARA_INTEGRATE_REVIEW=openxla#36283 from ROCm:ci_skip_conditionals_on_rocm_20260112 1c607b6 PiperOrigin-RevId: 855601660 * Include more passing tests --------- Co-authored-by: Pham Binh <[email protected]> Co-authored-by: zoranjovanovic-ns <[email protected]> Co-authored-by: Pavel Emeliyanenko <[email protected]> Co-authored-by: Alex <[email protected]> Co-authored-by: spiao <[email protected]> Co-authored-by: Dragan Mladjenovic <[email protected]> Co-authored-by: Harsha HS <[email protected]> Co-authored-by: Zahid Iqbal <[email protected]> Co-authored-by: Chunyu Jin <[email protected]> Co-authored-by: Alexandros Theodoridis <[email protected]> Co-authored-by: Aleksei Nurmukhametov <[email protected]> Co-authored-by: Xuefei Jiang <[email protected]> Co-authored-by: Eetu Sjöblom <[email protected]> Co-authored-by: Oleg Shyshkov <[email protected]> Co-authored-by: charleshofer <[email protected]> Co-authored-by: pemeliya <[email protected]> Co-authored-by: Aleksa Arsic <[email protected]> Co-authored-by: mmakevic-amd <[email protected]>
This PR combines rocprofiler-sdk (v3) and roctracer (v1) for rocm-jaxlib-v0.6.0. roctracer (v1) is kept as a backup that can be selected from the CLI with:
--bazel_options=--define=xla_rocm_profiler=v1
Previously, there was a PR for v3 only: #251.
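For readers unfamiliar with how such a switch usually reaches the C++ side, here is a minimal sketch of a compile-time backend choice. It is not the code from this PR. `XLA_ROCM_PROFILER_FORCE_V1` is a hypothetical macro standing in for whatever the build maps `--define=xla_rocm_profiler=v1` to. The `60300` threshold assumes the `TF_ROCM_VERSION` encoding quoted in the review thread (for example `60200` for ROCm 6.2) together with the reviewer summary that v3 is used for ROCm >= 6.3.

```cpp
// Sketch only: the names below are illustrative, not taken from the PR.

// Assumption: the ROCm version is encoded as major*10000 + minor*100.
#ifndef TF_ROCM_VERSION
#define TF_ROCM_VERSION 60300  // placeholder so the sketch compiles standalone
#endif

enum class ProfilerBackend { kRoctracerV1, kRocprofilerSdkV3 };

// Picks roctracer (v1) when it is forced from the command line or when the
// ROCm version predates 6.3; otherwise picks rocprofiler-sdk (v3).
constexpr ProfilerBackend SelectBackend(int rocm_version, bool force_v1) {
  return (force_v1 || rocm_version < 60300)
             ? ProfilerBackend::kRoctracerV1
             : ProfilerBackend::kRocprofilerSdkV3;
}

// Hypothetical macro: assumed to be defined by the build when the user passes
// --define=xla_rocm_profiler=v1.
#if defined(XLA_ROCM_PROFILER_FORCE_V1)
constexpr bool kForceV1 = true;
#else
constexpr bool kForceV1 = false;
#endif

// The backend this translation unit would be compiled against.
constexpr ProfilerBackend kBackend = SelectBackend(TF_ROCM_VERSION, kForceV1);

// Human-readable name, e.g. for a VLOG line.
constexpr const char* BackendName(ProfilerBackend backend) {
  return backend == ProfilerBackend::kRoctracerV1 ? "roctracer (v1)"
                                                  : "rocprofiler-sdk (v3)";
}
```

Keeping the decision in one `constexpr` function means the version threshold and the override live in a single place, and the rest of the code can branch on `kBackend` instead of repeating preprocessor checks.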