update rocprofiler-sdk with stuff in 0.8.0 #513
| build:ci_multi_gpu --experimental_guard_against_concurrent_changes | ||
| build:ci_multi_gpu --test_env=HIP_VISIBLE_DEVICES=0,1,2,3 | ||
| build:ci_multi_gpu --strategy=TestRunner=local | ||
Why does your rocprof-sdk PR have this change? I don't think it is relevant; please don't put any other changes in this profiling backport PR.
Those were mainly for the CI tests. Without them CI fails straight away, because ci_single/multi_gpu requires those definitions. We still seem to get failures that are not related to the backported changes.
If this is only for the CI tests due to #453 and f6785ef, please create another PR.
cc @alekstheod
- six failed ci_multi_gpu tests on a local MI300
bazel --bazelrc=/work/xla/build_tools/rocm/rocm_xla.bazelrc test \
--config=rocm_ci \
--config=ci_multi_gpu \
--test_output=errors \
--spawn_strategy=local \
--strategy=TestRunner=local \
--repo_env=TF_ROCM_AMDGPU_TARGETS=gfx942,gfx90a \
//xla/tests:array_elementwise_ops_test_amdgpu_any
//xla/tests:array_elementwise_ops_test_amdgpu_any \
//xla/tests:convert_test_amdgpu_any \
//xla/tests:dot_operation_test_autotune_disabled_amdgpu_any \
//xla/tests:iota_test_amdgpu_any \
//xla/service/gpu/transforms:command_buffer_scheduling_test_amdgpu_any \
//xla/tests:local_client_execute_test_amdgpu_any
//xla/tests:array_elementwise_ops_test_amdgpu_any
INFO: Found 1 test target...
Target //xla/tests:array_elementwise_ops_test_amdgpu_any up-to-date:
bazel-bin/xla/tests/array_elementwise_ops_test_amdgpu_any
INFO: Elapsed time: 240.038s, Critical Path: 237.85s
INFO: 12 processes: 7 internal, 5 local.
INFO: Build completed successfully, 12 total actions
//xla/tests:array_elementwise_ops_test_amdgpu_any PASSED in 212.8s
Executed 1 out of 1 test: 1 test passes.
//xla/tests:convert_test_amdgpu_any
-- Test timed out at 2026-01-12 18:13:37 UTC --
================================================================================
INFO: Found 1 test target...
Target //xla/tests:convert_test_amdgpu_any up-to-date:
bazel-bin/xla/tests/convert_test_amdgpu_any
INFO: Elapsed time: 928.377s, Critical Path: 681.32s
INFO: 8616 processes: 75 internal, 8541 local.
INFO: Build completed successfully, 8616 total actions
//xla/tests:convert_test_amdgpu_any FLAKY, failed in 1 out of 2 in 300.7s
Stats over 2 runs: max = 300.7s, min = 222.3s, avg = 261.5s, dev = 39.2s
/root/.cache/bazel/_bazel_root/ea1efa0977f8828bf242d5b6a382af7f/execroot/xla/bazel-out/k8-opt/testlogs/xla/tests/convert_test_amdgpu_any/test_attempts/attempt_1.log
Executed 1 out of 1 test: 1 test passes.
//xla/tests:dot_operation_test_autotune_disabled_amdgpu_any
INFO: Found 1 test target...
Target //xla/tests:dot_operation_test_autotune_disabled_amdgpu_any up-to-date:
bazel-bin/xla/tests/dot_operation_test_autotune_disabled_amdgpu_any
INFO: Elapsed time: 88.008s, Critical Path: 86.59s
INFO: 16 processes: 9 internal, 7 local.
INFO: Build completed successfully, 16 total actions
//xla/tests:dot_operation_test_autotune_disabled_amdgpu_any PASSED in 63.4s
Executed 1 out of 1 test: 1 test passes.
//xla/tests:iota_test_amdgpu_any
[ RUN ] IotaR2TestInstantiation/IotaR2Test.DoIt/1171
I0000 00:00:1768218999.961985 3016214 se_gpu_pjrt_client.cc:1381] Using BFC allocator.
I0000 00:00:1768218999.962074 3016214 gpu_helpers.cc:136] XLA backend allocating 16491332239 bytes on device 0 for BFCAllocator.
I0000 00:00:1768218999.962097 3016214 gpu_helpers.cc:136] XLA backend allocating 16491332239 bytes on device 1 for BFCAllocator.
I0000 00:00:1768218999.962110 3016214 gpu_helpers.cc:136] XLA backend allocating 16491332239 bytes on device 2 for BFCAllocator.
I0000 00:00:1768218999.962119 3016214 gpu_helpers.cc:136] XLA backend allocating 16491332239 bytes on device 3 for BFCAllocator.
I0000 00:00:1768218999.962128 3016214 gpu_helpers.cc:177] XLA backend will use up to 189650320752 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1768218999.962136 3016214 gpu_helpers.cc:177] XLA backend will use up to 189650320752 bytes on device 1 for CollectiveBFCAllocator.
I0000 00:00:1768218999.962144 3016214 gpu_helpers.cc:177] XLA backend will use up to 189650320752 bytes on device 2 for CollectiveBFCAllocator.
I0000 00:00:1768218999.962151 3016214 gpu_helpers.cc:177] XLA backend will use up to 189650320752 bytes on device 3 for CollectiveBFCAllocator.
-- Test timed out at 2026-01-12 11:56:40 UTC --
WARNING: Build options --action_env, --run_under, and --test_env have changed, discarding analysis cache (this can be expensive, see https://bazel.build/advanced/performance/iteration-speed).
INFO: Analyzed target //xla/tests:iota_test_amdgpu_any (376 packages loaded, 53666 targets configured).
INFO: Found 1 test target...
Target //xla/tests:iota_test_amdgpu_any up-to-date:
bazel-bin/xla/tests/iota_test_amdgpu_any
INFO: Elapsed time: 6654.189s, Critical Path: 301.90s
INFO: 8663 processes: 77 internal, 8586 local.
INFO: Build completed successfully, 8663 total actions
//xla/tests:iota_test_amdgpu_any PASSED in 126.7s
Stats over 50 runs: max = 126.7s, min = 123.1s, avg = 124.5s, dev = 0.7s
Executed 1 out of 1 test: 1 test passes.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.
//xla/service/gpu/transforms:command_buffer_scheduling_test_amdgpu_any
[ RUN ] CommandBufferSchedulingTest.DynamicSliceFusionWithDynamicAddressesNotACommand
2026-01-12 12:34:57.785081: W ./xla/service/compiler.h:234] Ignoring the buffer assignment proto provided.
xla/service/gpu/transforms/command_buffer_scheduling_test.cc:1477: Failure
Value of: RunAndCompareTwoModulesReplicated(std::move(m_ref), std::move(m), true, true, std::nullopt)
Actual: false (UNIMPLEMENTED: Empty nodes are not supported on ROCM.)
Expected: true
[ FAILED ] CommandBufferSchedulingTest.DynamicSliceFusionWithDynamicAddressesNotACommand (338 ms)
[ RUN ] CommandBufferSchedulingTest.AllGatherStartFollowedByDone
[ OK ] CommandBufferSchedulingTest.AllGatherStartFollowedByDone (3 ms)
[ RUN ] CommandBufferSchedulingTest.MoveGTEs
[ OK ] CommandBufferSchedulingTest.MoveGTEs (3 ms)
[ RUN ] CommandBufferSchedulingTest.SingleCommandBuffer
[ OK ] CommandBufferSchedulingTest.SingleCommandBuffer (1 ms)
[----------] 29 tests from CommandBufferSchedulingTest (3838 ms total)
[----------] Global test environment tear-down
[==========] 30 tests from 2 test suites ran. (4605 ms total)
[ PASSED ] 27 tests.
[ SKIPPED ] 2 tests, listed below:
[ SKIPPED ] CommandBufferSchedulingTest.Conditional
[ SKIPPED ] CommandBufferSchedulingTest.While
[ FAILED ] 1 test, listed below:
[ FAILED ] CommandBufferSchedulingTest.DynamicSliceFusionWithDynamicAddressesNotACommand
1 FAILED TEST
//xla/tests:local_client_execute_test_amdgpu_any
[ RUN ] LocalClientExecuteTest.CompilePartitionedExecutable
2026-01-12 12:38:10.563499: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
xla/tests/local_client_execute_test.cc:767: Failure
Expected equality of these values:
2
executables.size()
Which is: 1
[ FAILED ] LocalClientExecuteTest.CompilePartitionedExecutable (34 ms)
[ RUN ] LocalClientExecuteTest.AddArraysWithDifferentInputLayouts
2026-01-12 12:38:10.597713: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.AddArraysWithDifferentInputLayouts (57 ms)
[ RUN ] LocalClientExecuteTest.Constant
2026-01-12 12:38:10.655456: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.Constant (8 ms)
[ RUN ] LocalClientExecuteTest.SizeOfGeneratedCodeInBytes
2026-01-12 12:38:10.664086: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.SizeOfGeneratedCodeInBytes (34 ms)
[ RUN ] LocalClientExecuteTest.InfeedOutfeedTest
2026-01-12 12:38:10.698750: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.InfeedOutfeedTest (29 ms)
[ RUN ] LocalClientExecuteTest.ValidateMemoryFittingLevel
2026-01-12 12:38:10.728554: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.ValidateMemoryFittingLevel (30 ms)
[ RUN ] LocalClientExecuteTest.ShapeBufferToLiteralConversion
2026-01-12 12:38:10.759389: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.ShapeBufferToLiteralConversion (2 ms)
[ RUN ] LocalClientExecuteTest.AddScalars
2026-01-12 12:38:10.762242: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.AddScalars (29 ms)
[ RUN ] LocalClientExecuteTest.ValidateOptimizationLevel
2026-01-12 12:38:10.792232: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.ValidateOptimizationLevel (23 ms)
[ RUN ] LocalClientExecuteTest.TupleArguments
2026-01-12 12:38:10.815470: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.TupleArguments (31 ms)
[ RUN ] LocalClientExecuteTest.LargeNestedTuple
2026-01-12 12:38:10.846715: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.LargeNestedTuple (5281 ms)
[ RUN ] LocalClientExecuteTest.ValidateExecTimeOptimizationEffort
2026-01-12 12:38:16.128208: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.ValidateExecTimeOptimizationEffort (28 ms)
[ RUN ] LocalClientExecuteTest.RunOnStreamForWrongPlatform
2026-01-12 12:38:16.156942: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.RunOnStreamForWrongPlatform (5 ms)
[ RUN ] LocalClientExecuteTest.DeepTuple
2026-01-12 12:38:16.162605: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.DeepTuple (131 ms)
[ RUN ] LocalClientExecuteTest.ValidateDeviceMemorySize
2026-01-12 12:38:16.294098: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.ValidateDeviceMemorySize (24 ms)
[ RUN ] LocalClientExecuteTest.ValidateFDOProfile
2026-01-12 12:38:16.318581: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
2026-01-12 12:38:16.324076: I xla/service/gpu/gpu_hlo_schedule.cc:342] Attempting to parse as a binary proto.
2026-01-12 12:38:16.324101: I xla/service/gpu/gpu_hlo_schedule.cc:347] Not a binary proto, attempt to parse it as a text proto.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1768221496.324192 3356912 text_format.cc:378] Error parsing text-format tensorflow.profiler.ProfiledInstructionsProto: 1:8: Message type "tensorflow.profiler.ProfiledInstructionsProto" has no field named "Testing".
2026-01-12 12:38:16.324219: E xla/service/gpu/gpu_hlo_schedule.cc:356] Unable to parse fdo_profile: not a valid text or binary ProfiledInstructionsProto
[ OK ] LocalClientExecuteTest.ValidateFDOProfile (26 ms)
[ RUN ] LocalClientExecuteTest.AddArraysWithDifferentOutputLayouts
2026-01-12 12:38:16.344869: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.AddArraysWithDifferentOutputLayouts (58 ms)
[ RUN ] LocalClientExecuteTest.RunOnAllDeviceOrdinals
2026-01-12 12:38:16.403325: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
[ OK ] LocalClientExecuteTest.RunOnAllDeviceOrdinals (34 ms)
[----------] 37 tests from LocalClientExecuteTest (11475 ms total)
[----------] Global test environment tear-down
[==========] 37 tests from 1 test suite ran. (11475 ms total)
[ PASSED ] 36 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] LocalClientExecuteTest.CompilePartitionedExecutable
1 FAILED TEST
- cherry-pick from 926af0e to 0.7.1
The following two tests still fail and need further investigation:
//xla/service/gpu/transforms:command_buffer_scheduling_test_amdgpu_any \
//xla/tests:local_client_execute_test_amdgpu_any
//xla/service/gpu/transforms:command_buffer_scheduling_test_amdgpu_any
[ SKIPPED ] CommandBufferSchedulingTest.Conditional (0 ms)
[ RUN ] CommandBufferSchedulingTest.CollectCommandBufferSequence
[ OK ] CommandBufferSchedulingTest.CollectCommandBufferSequence (0 ms)
[ RUN ] CommandBufferSchedulingTest.DynamicSliceFusionWithDynamicAddressesNotACommand
2026-01-12 17:55:42.615535: W ./xla/service/compiler.h:234] Ignoring the buffer assignment proto provided.
xla/service/gpu/transforms/command_buffer_scheduling_test.cc:1477: Failure
Value of: RunAndCompareTwoModulesReplicated(std::move(m_ref), std::move(m), true, true, std::nullopt)
Actual: false (UNIMPLEMENTED: Empty nodes are not supported on ROCM.)
Expected: true
[ FAILED ] CommandBufferSchedulingTest.DynamicSliceFusionWithDynamicAddressesNotACommand (386 ms)
[ RUN ] CommandBufferSchedulingTest.While
xla/service/gpu/transforms/command_buffer_scheduling_test.cc:962: Skipped
Not supported for ROCm!
[ SKIPPED ] CommandBufferSchedulingTest.While (1 ms)
[ RUN ] CommandBufferSchedulingTest.CollectivePermuteStartFollowedByAnotherStart
[ OK ] CommandBufferSchedulingTest.CollectivePermuteStartFollowedByAnotherStart (3 ms)
[ RUN ] CommandBufferSchedulingTest.ReduceScatterStartFollowedByDone
[ OK ] CommandBufferSchedulingTest.ReduceScatterStartFollowedByDone (1 ms)
[----------] 29 tests from CommandBufferSchedulingTest (5484 ms total)
[----------] Global test environment tear-down
[==========] 30 tests from 2 test suites ran. (7164 ms total)
[ PASSED ] 27 tests.
[ SKIPPED ] 2 tests, listed below:
[ SKIPPED ] CommandBufferSchedulingTest.Conditional
[ SKIPPED ] CommandBufferSchedulingTest.While
[ FAILED ] 1 test, listed below:
[ FAILED ] CommandBufferSchedulingTest.DynamicSliceFusionWithDynamicAddressesNotACommand
1 FAILED TEST
//xla/tests:local_client_execute_test_amdgpu_any
[ RUN ] LocalClientExecuteTest.CompilePartitionedExecutable
2026-01-12 17:59:44.042897: I xla/service/platform_util.cc:84] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- was support for that platform linked in?
2026-01-12 17:59:45.376245: I xla/service/service.cc:163] XLA service 0x55bfd2d516d0 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
2026-01-12 17:59:45.376623: I xla/service/service.cc:171] StreamExecutor device (0): gfx942:sramecc+:xnack-, AMDGPU ISA version: gfx942:sramecc+:xnack-
2026-01-12 17:59:45.376629: I xla/service/service.cc:171] StreamExecutor device (1): gfx942:sramecc+:xnack-, AMDGPU ISA version: gfx942:sramecc+:xnack-
2026-01-12 17:59:45.376634: I xla/service/service.cc:171] StreamExecutor device (2): gfx942:sramecc+:xnack-, AMDGPU ISA version: gfx942:sramecc+:xnack-
2026-01-12 17:59:45.376638: I xla/service/service.cc:171] StreamExecutor device (3): gfx942:sramecc+:xnack-, AMDGPU ISA version: gfx942:sramecc+:xnack-
2026-01-12 17:59:45.376642: I xla/service/service.cc:171] StreamExecutor device (4): gfx942:sramecc+:xnack-, AMDGPU ISA version: gfx942:sramecc+:xnack-
2026-01-12 17:59:45.376645: I xla/service/service.cc:171] StreamExecutor device (5): gfx942:sramecc+:xnack-, AMDGPU ISA version: gfx942:sramecc+:xnack-
2026-01-12 17:59:45.376649: I xla/service/service.cc:171] StreamExecutor device (6): gfx942:sramecc+:xnack-, AMDGPU ISA version: gfx942:sramecc+:xnack-
2026-01-12 17:59:45.376653: I xla/service/service.cc:171] StreamExecutor device (7): gfx942:sramecc+:xnack-, AMDGPU ISA version: gfx942:sramecc+:xnack-
xla/tests/local_client_execute_test.cc:767: Failure
Expected equality of these values:
2
executables.size()
Which is: 1
[ FAILED ] LocalClientExecuteTest.CompilePartitionedExecutable (1427 ms)
[----------] 1 test from LocalClientExecuteTest (1427 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1427 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] LocalClientExecuteTest.CompilePartitionedExecutable
1 FAILED TEST
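For triaging logs like the ones above, a small helper can tally the per-test gtest result lines. This is only a sketch: the function name `summarize_gtest_log` and the regular expression are illustrative assumptions based on the log format shown in this thread, not part of XLA or the CI tooling.

```python
import re
from collections import defaultdict

# Matches per-test result lines such as
#   "[ FAILED ] Suite.Test (386 ms)"
# Summary lines ("[ FAILED ] 1 test, listed below:") have no "(N ms)"
# suffix, so they are not counted twice.
RESULT_RE = re.compile(r"\[\s*(OK|FAILED|SKIPPED)\s*\]\s+(\S+)\s+\((\d+) ms\)")


def summarize_gtest_log(text):
    """Group test names and durations (ms) by gtest status."""
    results = defaultdict(list)
    for line in text.splitlines():
        match = RESULT_RE.search(line)
        if match:
            status, name, millis = match.groups()
            results[status].append((name, int(millis)))
    return dict(results)


# Excerpt copied from the command_buffer_scheduling_test log in this thread.
SAMPLE_LOG = """\
[ SKIPPED ] CommandBufferSchedulingTest.Conditional (0 ms)
[ RUN ] CommandBufferSchedulingTest.CollectCommandBufferSequence
[ OK ] CommandBufferSchedulingTest.CollectCommandBufferSequence (0 ms)
[ RUN ] CommandBufferSchedulingTest.DynamicSliceFusionWithDynamicAddressesNotACommand
[ FAILED ] CommandBufferSchedulingTest.DynamicSliceFusionWithDynamicAddressesNotACommand (386 ms)
[ SKIPPED ] CommandBufferSchedulingTest.While (1 ms)
[ PASSED ] 27 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] CommandBufferSchedulingTest.DynamicSliceFusionWithDynamicAddressesNotACommand
"""

summary = summarize_gtest_log(SAMPLE_LOG)
for name, millis in summary.get("FAILED", []):
    print(f"FAILED: {name} ({millis} ms)")
```

Running it over a saved test log lists each failing test once, which makes it easier to compare the failures before and after the cherry-pick.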
Motivation
Backporting rocprofiler-sdk from 0.8.0: update rocprofiler-sdk (v3) and roctracer (v1) #473
There are still no kernel details in the trace file when building from rocm-jax.
Running python3 profiler_test.py from jaxci_cj_profiler_test_rocm-jaxlib-v0.8.0, which requires xprof (it may be needed for CI later).