@cj401-amd cj401-amd commented Aug 1, 2025

This PR combines rocprofiler-sdk (v3) and roctracer (v1) for rocm-jaxlib-v0.6.0

  • Update the rocprofiler-sdk integration for improved profiling with rocprofiler_force_configure() and annotations; support both time-based and step-based profiling.
  • Keep roctracer (v1), which can be built by passing --bazel_options="--define=xla_rocm_profiler=v1".
  • Still to do: figure out how to add more kernel-related stats, e.g., kernel size, occupancy, DMA copy, etc.

A previous PR, #251, covered v3 only.

Collaborator

@i-chaochen i-chaochen left a comment

Could you add instructions for how to compile it with roctracer (v1) or rocprofiler-sdk (v3)?

@cj401-amd cj401-amd force-pushed the ci_cj-rocprofv3-v1-rocm-jaxlib-v0.6.0 branch from fa59b34 to 069b25f Compare August 4, 2025 10:40
Comment on lines 256 to 263
#ifdef HIP_R_8F_E5M2
  return layout.type() == HIP_R_8F_E5M2_FNUZ ||
         layout.type() == HIP_R_8F_E4M3_FNUZ ||
         layout.type() == HIP_R_8F_E5M2 ||
         layout.type() == HIP_R_8F_E4M3;
#else
  return false;
#endif

I'm not sure I understand this part. How are these hipBLASlt changes related to your profiler work?

Author

HIP_R_8F_E5M2 does not seem to be available on ROCm 6.2, which I hit while testing roctracer (v1), so I tried to guard it here. Or should I submit a separate PR for this?

I think a separate PR might be a good idea. What do you think @i-chaochen ?

Collaborator

@i-chaochen i-chaochen Aug 5, 2025

I guess it's because hipBLASLt doesn't support FP8 on ROCm 6.2? Yes, please open a separate PR, just for our local fork, that guards this with TF_ROCM_VERSION <= 60200.
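
The version-based guard suggested here could look roughly like the sketch below. It is only an illustration that compiles without ROCm: DEMO_ROCM_VERSION stands in for TF_ROCM_VERSION, and DemoDataType stands in for the real HIP data type enum; both names are hypothetical. One general point it illustrates is that #ifdef only sees preprocessor macros, so if the FP8 constants are enumerators in a given release (an assumption not verified here), a version check is the more reliable guard.

```cpp
#include <cassert>

// Hypothetical stand-in for TF_ROCM_VERSION (60200 would mean ROCm 6.2).
#ifndef DEMO_ROCM_VERSION
#define DEMO_ROCM_VERSION 60300
#endif

// Hypothetical stand-in for the HIP data type enum; not the real hipDataType.
enum DemoDataType {
  DEMO_R_32F,
  DEMO_R_8F_E4M3_FNUZ,
  DEMO_R_8F_E5M2_FNUZ,
#if DEMO_ROCM_VERSION > 60200
  // Assumed to exist only in newer releases for the purpose of this sketch.
  DEMO_R_8F_E4M3,
  DEMO_R_8F_E5M2,
#endif
};

// Returns true when `type` is an FP8 format. Mirrors the snippet under
// review: builds at or below the cutoff version always report false.
inline bool IsFp8Type(DemoDataType type) {
#if DEMO_ROCM_VERSION > 60200
  return type == DEMO_R_8F_E5M2_FNUZ || type == DEMO_R_8F_E4M3_FNUZ ||
         type == DEMO_R_8F_E5M2 || type == DEMO_R_8F_E4M3;
#else
  (void)type;  // unused on older releases
  return false;
#endif
}
```

Building with -DDEMO_ROCM_VERSION=60200 selects the fallback branch without touching the source.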

@cj401-amd cj401-amd requested a review from Copilot August 6, 2025 08:23
Copilot AI left a comment

Pull Request Overview

This PR integrates rocprofiler-sdk (v3) along with maintaining roctracer (v1) support for rocm-jaxlib-v0.6.0. The implementation uses compile-time version guards to select between the two profiling systems based on ROCm version, with v3 being used for ROCm >= 6.3.

Key changes:

  • Add rocprofiler-sdk (v3) support with improved profiling capabilities including time/step-based profiling
  • Maintain backward compatibility with roctracer (v1) for ROCm versions < 6.3
  • Implement version-specific compilation guards and API mappings

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

File | Description
xla/stream_executor/rocm/roctracer_wrapper.h | Adds version guards and rocprofiler-sdk headers/API mappings for v3
xla/stream_executor/rocm/hip_blas_lt.cc | Adds compile guards for FP8 data type constants to prevent undefined symbols
xla/backends/profiler/gpu/rocm_tracer_test.cc | New unit tests for rocm_tracer v3 functionality
xla/backends/profiler/gpu/rocm_tracer.h | Adds v3 RocmTracer class definition with rocprofiler-sdk integration
xla/backends/profiler/gpu/rocm_tracer.cc | Implements v3 tracer with rocprofiler-sdk callbacks and event handling
xla/backends/profiler/gpu/rocm_collector_test.cc | New unit tests for rocm_collector v3 functionality
xla/backends/profiler/gpu/rocm_collector.h | Adds v3 collector interfaces and data structures
xla/backends/profiler/gpu/rocm_collector.cc | Implements v3 collector with event processing and export logic
xla/backends/profiler/gpu/device_tracer_rocm.cc | Updates device tracer to use appropriate version-specific APIs
xla/backends/profiler/gpu/BUILD | Adds rocprofiler-sdk linkage and new test targets
Comments suppressed due to low confidence (4)

xla/backends/profiler/gpu/rocm_tracer.cc:277

  • Inconsistent use of optional type - header file uses 'absl::types::optional.h' but implementation uses 'std::optional'. Should be consistent throughout.
      return "Invalid";

xla/backends/profiler/gpu/rocm_tracer.cc:163

  • The stream_id type has been changed from int64_t to uint64_t in the header but this change should be consistently applied throughout the codebase to avoid potential issues.
        oss << ", sizeBytes=" << data->args.hipMemcpyDtoD.sizeBytes;

xla/backends/profiler/gpu/rocm_collector.cc:1

  • [nitpick] Empty line at the beginning of the file should be removed for consistency with code style.
/* Copyright 2024 The OpenXLA Authors. All Rights Reserved.

Comment on lines 2200 to 2202
void __attribute__((constructor)) init_rocm_lib() {
  rocprofiler_force_configure(xla::profiler::rocprofiler_configure);
}
Copilot AI Aug 6, 2025

Using compiler-specific attributes like '__attribute__((constructor))' reduces portability. Consider using a more portable initialization mechanism.

Suggested change
- void __attribute__((constructor)) init_rocm_lib() {
-   rocprofiler_force_configure(xla::profiler::rocprofiler_configure);
- }
+ namespace {
+ struct RocmLibInitializer {
+   RocmLibInitializer() {
+     rocprofiler_force_configure(xla::profiler::rocprofiler_configure);
+   }
+ };
+ static RocmLibInitializer rocm_lib_initializer;
+ }  // namespace

Why do you need rocprofiler_force_configure?

Author

We need rocprofiler_force_configure to initialize rocprofiler-sdk's hooks into the HIP runtime before hipInit. It is called automatically when xla_rocm_plugin.so is loaded. I believe the rocprofiler-sdk team is working on a fix; once that lands, we can update our code.

Can you explain a bit further? From a look at the source, rocprofiler_force_configure cannot work if hipInit has already been called. So at the point where rocprofiler_force_configure is called, hipInit has not been called yet, and I see no reason for rocprofiler_force_configure to be called.
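
For readers following the two initialization styles discussed in this thread, here is a minimal sketch showing that both the constructor attribute and a namespace-scope static object run their code at load time, before main. ForceConfigure and ConfigureCalls are hypothetical stand-ins that only count invocations; they are not part of rocprofiler-sdk.

```cpp
#include <cassert>

namespace demo {
// Hypothetical stand-in for the profiler configuration call. The counter is
// a function-local static, so it is safe to touch from any initializer
// regardless of the order in which the initializers run.
inline int& ConfigureCalls() {
  static int calls = 0;
  return calls;
}
inline void ForceConfigure() { ++ConfigureCalls(); }

// Portable variant: the constructor of a namespace-scope object runs during
// dynamic initialization, i.e. when the program or shared library is loaded.
struct LibInitializer {
  LibInitializer() { ForceConfigure(); }
};
static LibInitializer lib_initializer;
}  // namespace demo

#if defined(__GNUC__) || defined(__clang__)
// Compiler-specific variant, in the style used by the PR: also runs at load
// time on GCC and Clang.
static void __attribute__((constructor)) InitLib() { demo::ForceConfigure(); }
#endif
```

Either style only moves the call to load time; neither answers the reviewer's question of whether the call is needed at that point.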

@i-chaochen
Collaborator

@cj401-amd I'm confused by this PR. What is the difference between this and #251? Why did you create a separate one?

@mrodden mrodden left a comment

Wasn't able to finish my review fully yet but here's what I have so far.

Comment on lines 267 to 338
linkopts = select({
    "//conditions:default": [
        "-L/opt/rocm/lib",  # search path for all ROCm shared objects
        "-lrocprofiler-sdk",  # the library that owns the missing symbols
    ],
}),

You shouldn't need this anymore if you are using the macros in roctracer_wrapper.h, which dlopen the library at runtime.

If you remove this and it doesn't work, then we probably have another issue that needs to be fixed.

And you should not link against rocprofiler-sdk if you don't use it, say on 6.2. If we end up with a compile-time switch between the tracers, I suggest modeling this as a proper library and adding it to the dependencies.

Comment on lines 340 to 343
deps = if_rocm([
":rocm_tracer",
":rocm_collector",
]) + [

Kinda weird that this has to be under an if. I would think it wouldn't matter since it shouldn't be invoked already due to the rocm-only tag...

Collaborator

This is just to align with the upstream syntax.

options.device_type() != ProfileOptions::UNSPECIFIED)
return nullptr;

#if TF_ROCM_VERSION < 60300

I still think we need to do this switch at runtime, not compile time. It should be possible since you have both symbol sets from roctracer and rocprofiler available from the roctracer_wrapper.h, so you just have to switch on an environment variable or something.

Collaborator

It would be great if you could open another PR to make that feasible on the tested workloads.

@mrodden But we only have a single rocm version where both rocprofiler-sdk and roctracer coexist?
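
The runtime switch proposed in this thread could be sketched as follows. This is an illustration only: the TracerBackend interface, the two backend classes, and the XLA_ROCM_PROFILER_DEMO environment variable are all hypothetical and do not correspond to real XLA classes or flags. The sketch treats v3 as the default and selects v1 only on an explicit request.

```cpp
#include <cassert>
#include <cstdlib>
#include <memory>
#include <string>

// Hypothetical common interface; the real tracer classes in the PR differ.
class TracerBackend {
 public:
  virtual ~TracerBackend() = default;
  virtual std::string Name() const = 0;
};

class RoctracerBackend : public TracerBackend {
 public:
  std::string Name() const override { return "roctracer-v1"; }
};

class RocprofilerSdkBackend : public TracerBackend {
 public:
  std::string Name() const override { return "rocprofiler-sdk-v3"; }
};

// Picks a backend from a selector string; anything other than "v1" falls
// back to the v3 backend, which this sketch treats as the default.
inline std::unique_ptr<TracerBackend> MakeBackend(const std::string& selector) {
  if (selector == "v1") return std::make_unique<RoctracerBackend>();
  return std::make_unique<RocprofilerSdkBackend>();
}

// Reads the hypothetical XLA_ROCM_PROFILER_DEMO variable (not a real XLA
// flag) and treats an unset variable like an empty selector.
inline std::unique_ptr<TracerBackend> MakeBackendFromEnv() {
  const char* value = std::getenv("XLA_ROCM_PROFILER_DEMO");
  return MakeBackend(value != nullptr ? value : "");
}
```

A runtime switch like this only helps when both backends can be built and loaded in the same binary, which is the concern raised in the question above.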

#include "absl/strings/str_cat.h"
#include "absl/strings/str_format.h"
#include "absl/strings/str_join.h"
#include "xla/stream_executor/rocm/roctracer_wrapper.h"

I think you still need this?

@pemeliya pemeliya left a comment

As I see it, rocm_tracer/collector for rocprofiler v1 and for v3 are two orthogonal implementations with very little in common. If we really insist on keeping rocprofiler v1, could we split rocm_tracer.h/.cc into rocm_tracer_v1.h/.cc and rocm_tracer_v3.h/.cc, and maybe do the same for rocm_collector? That would improve code maintainability and get rid of the major #if / #else clauses by letting Bazel do the conditional compilation:

e.g. if ROCM_VERSION < 60300 -> compile rocm_tracer/collector_v1
else -> compile rocm_tracer/collector_v3

LOG(INFO) << "agent id = " << agent.id.handle
<< ", dev = " << agent.device_id
<< ", name = " << (agent.name ? agent.name : "null");
agents_[agent.id.handle] = agent;

<< ", dev = " << agent.device_id
<< ", name = " << (agent.name ? agent.name : "null");
agents_[agent.id.handle] = agent;
if (agent.type == ROCPROFILER_AGENT_TYPE_GPU) {

What is the reason we include both CPU and GPU agents in the map?

Author

rocprofiler-sdk retrieves all agents on the system, both CPUs and GPUs; the GPUs are then filtered.

Where are the gpus filtered?

Author

@cj401-amd
Author

As I see it, rocm_tracer/collector for rocprofiler v1 and for v3 are two orthogonal implementations with very little in common. If we really insist on keeping rocprofiler v1, could we split rocm_tracer.h/.cc into rocm_tracer_v1.h/.cc and rocm_tracer_v3.h/.cc, and maybe do the same for rocm_collector? That would improve code maintainability and get rid of the major #if / #else clauses by letting Bazel do the conditional compilation:

e.g. if ROCM_VERSION < 60300 -> compile rocm_tracer/collector_v1 else -> compile rocm_tracer/collector_v3

Initially, the idea was to keep v1 alongside v3 for a transition period (v1 is being phased out) and then remove v1 from the code entirely.

@mrodden

mrodden commented Aug 15, 2025

@cj401-amd @i-chaochen Here is a modification of the current changes to switch between them at runtime ae1c5d8

I am not sure my plan of trying to extend and override the one function member of RocmTraceCollector is going to work, but it could be changed to a composition or templated out instead.

@i-chaochen
Collaborator

i-chaochen commented Aug 20, 2025

@cj401-amd I think you need to rebase your branch to the latest 0.6 to get a green CI pass

@cj401-amd cj401-amd force-pushed the ci_cj-rocprofv3-v1-rocm-jaxlib-v0.6.0 branch from 70e5252 to a585235 Compare August 20, 2025 17:00
@cj401-amd
Author

As I see it, rocm_tracer/collector for rocprofiler v1 and for v3 are two orthogonal implementations with very little in common. If we really insist on keeping rocprofiler v1, could we split rocm_tracer.h/.cc into rocm_tracer_v1.h/.cc and rocm_tracer_v3.h/.cc, and maybe do the same for rocm_collector? That would improve code maintainability and get rid of the major #if / #else clauses by letting Bazel do the conditional compilation:

e.g. if ROCM_VERSION < 60300 -> compile rocm_tracer/collector_v1 else -> compile rocm_tracer/collector_v3

Now the backend is split into rocm_tracer_v1 and rocm_profiler_sdk, defaulting to rocm_profiler_sdk (v3). This can be changed with --bazel_options="--define=xla_rocm_profiler=v1" for ROCm < 6.3.

It seems v1 cannot be built on ROCm 6.4, and rocprofiler-sdk cannot be built on ROCm 6.2.

@cj401-amd cj401-amd closed this Aug 20, 2025
@cj401-amd
Author

@cj401-amd @i-chaochen Here is a modification of the current changes to switch between them at runtime ae1c5d8

I am not sure my plan of trying to extend and override the one function member of RocmTraceCollector is going to work, but it could be changed to a composition or templated out instead.

Based on the latest experiments, v1 cannot be built on ROCm 6.4, and rocprofiler-sdk cannot be built on ROCm 6.2.

@cj401-amd cj401-amd reopened this Aug 20, 2025
Collaborator

@i-chaochen i-chaochen left a comment

It looks OK to me. @pemeliya @ScXfjiang, please take a look.

@draganmladjenovic draganmladjenovic left a comment

I think rocprofiler_force_configure should be removed.

@cj401-amd
Author

I think rocprofiler_force_configure should be removed.

We are going to remove it once rocprofiler-sdk fixes the initialization of its hooks. If we remove it now, rocprofiler-sdk cannot trace any GPU events for some workloads (e.g., the MaxText LLAMA2-7B workload) because hipInit is called before rocprofiler_configure.

Collaborator

@i-chaochen i-chaochen left a comment

LGTM. Please also squash your commits into one; once @pemeliya or @draganmladjenovic approves, we can merge it. Thanks!

@pemeliya pemeliya left a comment

@cj401-amd, can you also take a look at my last two comments? I mean that we do not really need to access the 'agents_' array to get the device ID. Furthermore, we can experiment with the profiler buffer size to see whether increasing it solves the problem with missing events that we saw previously.

Comment on lines +220 to +221
const auto &src_gpu = agents_[static_cast<uint32_t>(rec.src_agent_id.handle)],
&dst_gpu = agents_[static_cast<uint32_t>(rec.dst_agent_id.handle)];

I think we can simplify this, because src_gpu.id.handle seems to be the same as rec.src_agent_id.handle, and the same holds for dst_gpu. So we do not really need to access the 'agents_' map here.
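
The simplification suggested here can be illustrated with a small sketch. The AgentId, Agent, and MemcpyRecord structs below are hypothetical, simplified stand-ins for the rocprofiler-sdk types, and the sketch assumes the map is keyed by id.handle, as in the insertion shown earlier in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical, simplified stand-ins for the rocprofiler-sdk agent types.
struct AgentId { uint64_t handle; };
struct Agent { AgentId id; uint32_t device_id; };
struct MemcpyRecord { AgentId src_agent_id; AgentId dst_agent_id; };

using AgentMap = std::unordered_map<uint64_t, Agent>;

// Lookup-based version. find() is used because operator[] would silently
// insert a default-constructed Agent for an unknown key.
inline uint64_t SrcHandleViaMap(const AgentMap& agents,
                                const MemcpyRecord& rec) {
  auto it = agents.find(rec.src_agent_id.handle);
  return it != agents.end() ? it->second.id.handle : 0;
}

// Direct version: when the map is keyed by id.handle, the handle obtained
// through the lookup is the key itself, so the record already holds it.
inline uint64_t SrcHandleDirect(const MemcpyRecord& rec) {
  return rec.src_agent_id.handle;
}
```

The lookup remains necessary only if a field other than the handle, such as device_id in this sketch, is what the caller needs.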

Author

@cj401-amd
Author

I think rocprofiler_force_configure should be removed.

Hi @draganmladjenovic, is it possible to help remove the lock now?

@cj401-amd cj401-amd force-pushed the ci_cj-rocprofv3-v1-rocm-jaxlib-v0.6.0 branch from 7c7eaba to cd5615f Compare September 8, 2025 17:27
@i-chaochen
Collaborator

I can see there are still 2 failed UTs in CI. Is that expected?

[2025-09-08T21:01:33.346Z] //xla/backends/profiler/gpu:rocm_collector_test_cpu                      FAILED in 3 out of 3 in 3.4s
[2025-09-08T21:01:33.346Z] //xla/backends/profiler/gpu:rocm_collector_test_gpu_amd_any              FAILED in 3 out of 3 in 12.7s

@draganmladjenovic draganmladjenovic self-requested a review September 9, 2025 06:44
@draganmladjenovic draganmladjenovic left a comment

Here you go.

@cj401-amd
Author

I can see there are still 2 failed UTs in CI. Is that expected?

[2025-09-08T21:01:33.346Z] //xla/backends/profiler/gpu:rocm_collector_test_cpu                      FAILED in 3 out of 3 in 3.4s
[2025-09-08T21:01:33.346Z] //xla/backends/profiler/gpu:rocm_collector_test_gpu_amd_any              FAILED in 3 out of 3 in 12.7s

fixed them already.

@cj401-amd cj401-amd force-pushed the ci_cj-rocprofv3-v1-rocm-jaxlib-v0.6.0 branch from e978e5c to 6f2106d Compare September 10, 2025 21:02
@cj401-amd cj401-amd merged commit 7775dd0 into rocm-jaxlib-v0.6.0 Sep 10, 2025
8 checks passed
zahiqbal pushed a commit that referenced this pull request Oct 1, 2025
upstream PR: openxla/pull/29769

Squash following commits..
Update rocprofiler-sdk (v3) along with roctracer (v1) for rocm-jaxlib-v0.6.0 (#302)

* update for integration of rocprofiler-sdk (along with roctracer as a backup based on bazel_options from CLI)

(cherry picked from commit 7775dd0)

use VLOG(2) to replace LOG(INFO), so PGLE has no verbose info (#357)

(cherry picked from commit 5950125)

update with kernel details for rocm-7.x (#364)

* update with kernel details for rocm-7.x

(cherry picked from commit 5597c0d)

update to remove previously hard-coded rocprofiler-sdk path (#369)

* update to remove previously hard-coded rocprofiler-sdk path and add skip_rocprofiler_sdk to avoid loading `rocprofiler-sdk`

(cherry picked from commit ff74b5f)
zahiqbal pushed a commit that referenced this pull request Oct 1, 2025
zahiqbal pushed a commit that referenced this pull request Oct 2, 2025
zahiqbal pushed a commit that referenced this pull request Oct 3, 2025
hsharsha pushed a commit that referenced this pull request Oct 6, 2025
* rocprof-sdk addition,
upstream PR: openxla/pull/29769

Squash following commits..
Update rocprofiler-sdk (v3) along with roctracer (v1) for rocm-jaxlib-v0.6.0 (#302)

* update for integration of rocprofiler-sdk (along with roctracer as a backup based on bazel_options from CLI)

(cherry picked from commit 7775dd0)

use VLOG(2) to replace LOG(INFO), so PGLE has no verbose info (#357)

(cherry picked from commit 5950125)

update with kernel details for rocm-7.x (#364)

* update with kernel details for rocm-7.x

(cherry picked from commit 5597c0d)

update to remove previously hard-coded rocprofiler-sdk path (#369)

* update to remove previously hard-coded rocprofiler-sdk path and add skip_rocprofiler_sdk to avoid loading `rocprofiler-sdk`

(cherry picked from commit ff74b5f)

* fixed buffer comparator test

---------

Co-authored-by: Chunyu Jin <[email protected]>
hsharsha pushed a commit that referenced this pull request Oct 6, 2025
* misc fixes ported from rocm-jaxlib-v0.6.0

---------

Co-authored-by: Pavel Emeliyanenko <[email protected]>
(cherry picked from commit f013645)
(cherry picked from commit b03cd94)

Added support for waves_per_eu function attribute. (#181)

(cherry picked from commit bc1d816)
(cherry picked from commit d3f94e9)

removed two line change (revert of half of the openxla#25959 commit

(cherry picked from commit 109e138)

Fixes for jax 0.6.0 (#207)

* Add fixes for jax plugin 0.6.0

Drop NEEDED linking to unnecessary libs.
These are loaded by amdhipruntime and not us.

Fix missing NEEDED on MIOpen shared object.

* Minor rocblas related changes for rocm 70

(cherry picked from commit 0de7d49)

---------

Co-authored-by: Zoran Jovanovic <[email protected]>
(cherry picked from commit 28f10a0)

Add hipBLASLt support for gfx11. (#301)

(cherry picked from commit f814bff)

Add bf16 starting from gfx11, bugfix & optimize RocmComputeCapability (#303)

* Bugfix and improve device_description.h::RocmComputeCompatibility

* Enable ALG_DOT_BF16* on rocm with HW support

(cherry picked from commit 510ea06)

[ROCm] Use bundled bitcode files (#196)

Also trim bitcode file list to ockl.bc and ocml.bc only.

(cherry picked from commit fc9e3c3)

Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms (#312)

* Add MIOPEN_FIND_ENFORCE For ROCm 7 for convolution gemms

* Exclude failing CollectiveOpsE2E tests

(cherry picked from commit fb6ddfb)

Restore RocmComputeCapability:: gfx11_rx7900() and gfx12_rx8900() methods (#333)

At least gfx11_rx7900() is still needed for TF build.

(cherry picked from commit 13c3de1)

Make device_count_ atomic (#343)

* Make device_count_ atomic

* Use relaxed memory order

* Fix build error

(cherry picked from commit 8513f2d)

fix hardcoded max registers (#345)

(cherry picked from commit f3e170a)

fix hardcoded ecc enabled (#348)

(cherry picked from commit 9cfa74a)

remove reserved memory (#349)

(cherry picked from commit 0015d0e)

Add rocm_dev config for remote caching (#353)

(cherry picked from commit c815420)

added rocm7 support to EnablePeerAccess (#347)

* added rocm7 support to EnablePeerAccess

* use wrap namespace, clang-format and add comments

(cherry picked from commit 85548a7)

[ROCm] Disable Cudnn fusions (#358)

(cherry picked from commit edab8b2)

---------

Co-authored-by: Chunyu Jin <[email protected]>
Co-authored-by: zoranjovanovic-ns <[email protected]>
hsharsha pushed a commit that referenced this pull request Oct 6, 2025
* Ported all triton related changes from v0.6.0 to v0.7.1

(cherry picked from commit 1851bcc)

Disable softmax triton fusion if triton gemm is off (#281)

* Disable softmax rewriter triton if triton gemm is disabled

* Add specific flag to enable triton softmax fusion

* Address review comments

(cherry picked from commit 51a7f4b)

[ROCm][Triton] Disable transposed load in certain conditions

(cherry picked from commit 50860e9)

Enable unit tests that pass after fixing some Triton related issues. (#285)

* Enable unit tests that pass after fixing some Triton related issues.

* fusion_emitter_device_legacy_test still fails on MI200

(cherry picked from commit 97dd565)

Rocm jaxlib v0.6.0 triton support ut (#279)

* Fixed triton/support_test - no fmfa.

* Fix issue with rounding mode in accelerate amd matmul.

* Fixed issues with usage of mfma in support_test.

(cherry picked from commit 44f7d87)

Restore gpu_triton_custom_call_test (#262)

(cherry picked from commit 32eafa4)

Skipped CanNotEmitTritonCustomCallOnPreAmpereGpu test for ROCM.

(cherry picked from commit 56ec7ec)
(cherry picked from commit b1f3e9f)

fixed createTritonAMDGPULowerInstructionSchedHintsPass (#179)

(cherry picked from commit 8517a3a)
(cherry picked from commit c62e47d)

fixed bazel build issue

---------

Co-authored-by: Chunyu Jin <[email protected]>
Co-authored-by: zoranjovanovic-ns <[email protected]>
Co-authored-by: Alex <[email protected]>