Skip to content

Conversation

@psalz
Copy link
Contributor

@psalz psalz commented Dec 9, 2021

This bumps the version of DPC++ to 6699a5d (December 8) and hipSYCL to b836149 (December 9). I've also updated the CI filter for DPC++, which is now able to compile 12 additional test categories, failing for 2 new ones, for a total of 5 categories that do not compile yet.

I've also adjusted the CI workflow to encode an implementation's version into the Docker image tag (i.e. there is now a dpcpp-6699a5d tag, as opposed to just dpcpp) so that submitting this PR won't affect current CI runs (until it is merged).

Disable DPC++ range rounding feature to speed up compilation.

psalz added 3 commits December 9, 2021 11:31
In order to be able to bump a container's version as part of a PR but
without disrupting CI on other branches, we have to include the version
in the container image tag.

A downside that comes with this is that we now need to manually keep the
versions in sync between the two workflows.
DPC++ to 6699a5d (December 8)
hipSYCL to b836149 (December 9)
@psalz
Copy link
Contributor Author

psalz commented Dec 9, 2021

Shoot, of course we now need to adjust the naming scheme for the ComputeCpp image as well. @ProGTX who should I talk to regarding this going forward? You, Scott Straughan, Rod Burns..? Also, can we bump the image to 2.8?

@bader
Copy link
Contributor

bader commented Dec 9, 2021

Compilation with dpcpp takes 148 minutes (~2.5 hours). Is this acceptable for pre-commit?

@psalz
Copy link
Contributor Author

psalz commented Dec 10, 2021

Compilation with dpcpp takes 148 minutes (~2.5 hours). Is this acceptable for pre-commit?

No, not really. These are the top 5 culprits:

1971.5 accessor_api_buffer_core.cpp.o (../tests/accessor/accessor_api_buffer_core.cpp)
1709.1 handler_copy.cpp.o (../tests/handler/handler_copy.cpp)
532.1 math_builtin_integer.cpp.o (tests/math_builtin_api/math_builtin_integer.cpp)
469.2 nd_item_async_work_group_copy.cpp.o (../tests/nd_item/nd_item_async_work_group_copy.cpp)
448.3 group_async_work_group_copy.cpp.o (../tests/group/group_async_work_group_copy.cpp)

I will take a look at the top two, I remember discussing long build times in the context of the handler_copy tests a while back. We'll probably have to seriously cull the set of tested combinations in non-full-conformance mode.

Edit: Still, we need to pick up discussions about improving the CI infrastructure early next year.

@alexbatashev
Copy link
Contributor

Compilation with dpcpp takes 148 minutes (~2.5 hours). Is this acceptable for pre-commit?

No, not really. These are the top 5 culprits:

1971.5 accessor_api_buffer_core.cpp.o (../tests/accessor/accessor_api_buffer_core.cpp)
1709.1 handler_copy.cpp.o (../tests/handler/handler_copy.cpp)
532.1 math_builtin_integer.cpp.o (tests/math_builtin_api/math_builtin_integer.cpp)
469.2 nd_item_async_work_group_copy.cpp.o (../tests/nd_item/nd_item_async_work_group_copy.cpp)
448.3 group_async_work_group_copy.cpp.o (../tests/group/group_async_work_group_copy.cpp)

This is mostly spent in the kernel function instantiation. One hack, that you can employ here is passing -D__SYCL_DISABLE_PARALLEL_FOR_RANGE_ROUNDING__ which halves compile time for at least the first test out of these five. I guess, the other tests will see a similar speedup.

Disable range rounding feature to reduce # of SYCL kernels.
@bader
Copy link
Contributor

bader commented Dec 12, 2021

Compilation with dpcpp takes 148 minutes (~2.5 hours). Is this acceptable for pre-commit?

No, not really. These are the top 5 culprits:

1971.5 accessor_api_buffer_core.cpp.o (../tests/accessor/accessor_api_buffer_core.cpp)
1709.1 handler_copy.cpp.o (../tests/handler/handler_copy.cpp)
532.1 math_builtin_integer.cpp.o (tests/math_builtin_api/math_builtin_integer.cpp)
469.2 nd_item_async_work_group_copy.cpp.o (../tests/nd_item/nd_item_async_work_group_copy.cpp)
448.3 group_async_work_group_copy.cpp.o (../tests/group/group_async_work_group_copy.cpp)

This is mostly spent in the kernel function instantiation. One hack, that you can employ here is passing -D__SYCL_DISABLE_PARALLEL_FOR_RANGE_ROUNDING__ which halves compile time for at least the first test out of these five. I guess, the other tests will see a similar speedup.

Applied in d752a19.
I think DPC++ should improve default compilation time.

@alexbatashev
Copy link
Contributor

@psalz we now build docker images right in the intel/llvm repository, you can re-use those if you want. See https://github.com/intel/llvm/pkgs/container/llvm%2Fsycl_ubuntu2004_nightly

latest tag comes with all Intel drivers pre-installed, and no-drivers tag bundles compiler toolchain only. Also, there're tags for each commit hash, that was used to build the image.

@bader
Copy link
Contributor

bader commented Dec 12, 2021

This is mostly spent in the kernel function instantiation. One hack, that you can employ here is passing -D__SYCL_DISABLE_PARALLEL_FOR_RANGE_ROUNDING__ which halves compile time for at least the first test out of these five. I guess, the other tests will see a similar speedup.

It looks like the right way to disable this feature is to use -fsycl-disable-range-rounding.

@bader
Copy link
Contributor

bader commented Dec 12, 2021

@alexbatashev, -D__SYCL_DISABLE_PARALLEL_FOR_RANGE_ROUNDING__ saved ~10 minutes overall.

@alexbatashev
Copy link
Contributor

alexbatashev commented Dec 12, 2021

@bader handler is not the only template-heavy thing we have. Here's before disabling range rounding:

1971.5 accessor_api_buffer_core.cpp.o (../tests/accessor/accessor_api_buffer_core.cpp)
1709.1 handler_copy.cpp.o (../tests/handler/handler_copy.cpp)
532.1 math_builtin_integer.cpp.o (tests/math_builtin_api/math_builtin_integer.cpp)
469.2 nd_item_async_work_group_copy.cpp.o (../tests/nd_item/nd_item_async_work_group_copy.cpp)
448.3 group_async_work_group_copy.cpp.o (../tests/group/group_async_work_group_copy.cpp)
371.3 vector_swizzles_sycl__half.cpp.o (tests/vector_swizzles/vector_swizzles_sycl__half.cpp)
367.7 vector_swizzles_sycl__byte.cpp.o (tests/vector_swizzles/vector_swizzles_sycl__byte.cpp)
367.3 vector_swizzles_int8_t.cpp.o (tests/vector_swizzles/vector_swizzles_int8_t.cpp)
364.0 vector_swizzles_int32_t.cpp.o (tests/vector_swizzles/vector_swizzles_int32_t.cpp)
361.1 vector_swizzles_char.cpp.o (tests/vector_swizzles/vector_swizzles_char.cpp)
360.2 vector_swizzles_double.cpp.o (tests/vector_swizzles/vector_swizzles_double.cpp)
360.1 vector_swizzles_int.cpp.o (tests/vector_swizzles/vector_swizzles_int.cpp)
344.6 vector_swizzles_float.cpp.o (tests/vector_swizzles/vector_swizzles_float.cpp)
311.3 accessor_api_buffer_fp16.cpp.o (../tests/accessor/accessor_api_buffer_fp16.cpp)
310.7 accessor_api_buffer_fp64.cpp.o (../tests/accessor/accessor_api_buffer_fp64.cpp)
291.1 accessor_constructors_buffer.cpp.o (../tests/accessor/accessor_constructors_buffer.cpp)

And here's the after:

1456.0 accessor_api_buffer_core.cpp.o (../tests/accessor/accessor_api_buffer_core.cpp)
1445.6 handler_copy.cpp.o (../tests/handler/handler_copy.cpp)
507.2 math_builtin_integer.cpp.o (tests/math_builtin_api/math_builtin_integer.cpp)
468.7 nd_item_async_work_group_copy.cpp.o (../tests/nd_item/nd_item_async_work_group_copy.cpp)
445.2 group_async_work_group_copy.cpp.o (../tests/group/group_async_work_group_copy.cpp)
362.5 vector_swizzles_int32_t.cpp.o (tests/vector_swizzles/vector_swizzles_int32_t.cpp)
362.0 vector_swizzles_sycl__half.cpp.o (tests/vector_swizzles/vector_swizzles_sycl__half.cpp)
359.3 vector_swizzles_double.cpp.o (tests/vector_swizzles/vector_swizzles_double.cpp)
358.5 vector_swizzles_sycl__byte.cpp.o (tests/vector_swizzles/vector_swizzles_sycl__byte.cpp)
355.3 vector_swizzles_int.cpp.o (tests/vector_swizzles/vector_swizzles_int.cpp)
354.9 vector_swizzles_int8_t.cpp.o (tests/vector_swizzles/vector_swizzles_int8_t.cpp)
353.6 vector_swizzles_float.cpp.o (tests/vector_swizzles/vector_swizzles_float.cpp)
342.1 vector_swizzles_char.cpp.o (tests/vector_swizzles/vector_swizzles_char.cpp)
280.5 accessor_constructors_buffer.cpp.o (../tests/accessor/accessor_constructors_buffer.cpp)
276.9 accessor_api_buffer_fp16.cpp.o (../tests/accessor/accessor_api_buffer_fp16.cpp)
272.3 accessor_api_buffer_fp64.cpp.o (../tests/accessor/accessor_api_buffer_fp64.cpp)

intel/llvm#5127 should also give a small improvement for the first two tests. The trick with disabling this-free-function queries on host device I showed you earlier in the morning also gives nice speedup to those two tests (I'm yet to test all three tricks applied to see the final difference). The next big thing is sycl::vec class, which probably also has some template abuse. There's clearly a lot of room for improvement in terms of compile times.

Edit: applying all three hacks drives compile times for accessor_api_buffer_core.cpp from ~9 minutes to ~130 seconds. Considering, that GitHub hosted runners compile this file for over 30 minutes, 4.5 time sounds like a big potential improvement here.

@psalz
Copy link
Contributor Author

psalz commented Dec 16, 2021

Thanks guys for looking into this.

applying all three hacks drives compile times for accessor_api_buffer_core.cpp from ~9 minutes to ~130 seconds. Considering, that GitHub hosted runners compile this file for over 30 minutes, 4.5 time sounds like a big potential improvement here.

So we already have the -D__SYCL_DISABLE_PARALLEL_FOR_RANGE_ROUNDING__ setting, and I can bump the DPC++ version to include intel/llvm#5127. What would be the third hack then? Is this something we can apply here?

@psalz we now build docker images right in the intel/llvm repository, you can re-use those if you want. See https://github.com/intel/llvm/pkgs/container/llvm%2Fsycl_ubuntu2004_nightly

Good to know, thanks. Since we don't really need any special setup within the containers we can probably switch to using this. I'll make sure to look into it when I have some time.

@alexbatashev
Copy link
Contributor

What would be the third hack then? Is this something we can apply here?

This requires some work on the clang frontend side. I'll ping you once I have more to share.

@psalz psalz closed this Dec 20, 2021
@psalz psalz deleted the bump-ci-container-versions branch December 20, 2021 18:04
@psalz
Copy link
Contributor Author

psalz commented Dec 20, 2021

Closed in favor of #241, let's move discussions on compilation time to a dedicated PR.

steffenlarsen pushed a commit to steffenlarsen/SYCL-CTS that referenced this pull request Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants