
Correctly align trees in experimental FIL#6397

Merged
rapids-bot[bot] merged 37 commits into rapidsai:branch-25.04 from wphicks:fea/fil_shallow
Mar 13, 2025

Conversation

@wphicks
Contributor

@wphicks wphicks commented Mar 5, 2025

Due to a bug in the import code, experimental FIL was previously not making use of the align_bytes argument correctly. The effect was not just a failure to take advantage of cache line boundaries but a severe pessimization in which padding nodes were inserted in the forest structure at highly non-optimal places.

This PR corrects this, resulting in a substantial performance improvement. It also introduces the layered layout type, in which nodes of the same depth are stored together. This allows for a moderate performance improvement in some models. It also allows CPU FIL to intelligently set the number of threads rather than accepting the highly non-optimal default, which provides a significant performance improvement for small batch sizes.

@copy-pr-bot

copy-pr-bot Bot commented Mar 5, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions Bot added the Cython / Python, CMake, and CUDA/C++ labels Mar 5, 2025
@wphicks wphicks added the bug and non-breaking labels Mar 10, 2025
@wphicks wphicks marked this pull request as ready for review March 11, 2025 22:04
@wphicks wphicks requested review from a team as code owners March 11, 2025 22:04
@wphicks
Contributor Author

wphicks commented Mar 11, 2025

Performance Benchmarks

Overview

Prior to these changes, we held back on promoting this version of FIL out of experimental due to a mild performance regression on shallow trees relative to legacy FIL. Our previous benchmarking of FIL was also spotty, covering a haphazard combination of depths, features, tree counts, and batch sizes. To help assess the impact of the current changes, I have conducted a much more comprehensive study of the performance of both new and old FIL.

Benchmarking FIL is challenging not only because performance varies so widely over so many model parameters but also because FIL itself offers a number of performance hyperparameters. In all of the results presented below, we have obtained optimal hyperparameters through exhaustive exploration of available options.
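The exhaustive hyperparameter exploration can be sketched as a simple grid search that keeps the fastest configuration. This is an illustrative sketch, not the actual benchmark harness: `run_benchmark` is a placeholder cost model standing in for real timed inference, and the hyperparameter names are examples.

```python
# Hypothetical sketch: sweep performance hyperparameters exhaustively and
# keep the fastest configuration. run_benchmark is a stub standing in for
# a real timing harness around model.predict().
import itertools


def run_benchmark(chunk_size, layout):
    # Placeholder cost model; a real harness would time inference here.
    return chunk_size * 0.001 + {
        "depth_first": 1.0, "layered": 0.9, "breadth_first": 1.1
    }[layout]


def best_config(chunk_sizes, layouts):
    best = None
    for chunk_size, layout in itertools.product(chunk_sizes, layouts):
        elapsed = run_benchmark(chunk_size, layout)
        if best is None or elapsed < best[0]:
            best = (elapsed, chunk_size, layout)
    return best


print(best_config([1, 8, 32], ["depth_first", "layered", "breadth_first"]))
```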

An additional challenge comes from an unusual behavior observed with legacy FIL. Occasionally, legacy FIL's performance shows a dramatic (~10x) slowdown for a particular model. Rerunning the same benchmark many times in the same process does not yield improved results, but rerunning multiple times in different processes eventually yields ~10x better results. In order to ensure that we consider legacy FIL's best possible performance, legacy FIL's benchmark runs were repeated in their entirety multiple times, and the minimum runtime was recorded.
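The best-of-N-processes methodology described above can be sketched as follows. The benchmark body is a placeholder; the point is that each repetition runs in a fresh process so an intermittent in-process slowdown cannot inflate the recorded minimum.

```python
# Hedged sketch: rerun a benchmark in fresh subprocesses and record the
# minimum runtime, as described above for legacy FIL. The workload below
# is a placeholder for actual FIL inference.
import subprocess
import sys

SCRIPT = """
import time
start = time.perf_counter()
sum(i * i for i in range(100_000))  # placeholder workload
print(time.perf_counter() - start)
"""


def min_runtime(repeats=5):
    times = []
    for _ in range(repeats):
        out = subprocess.run([sys.executable, "-c", SCRIPT],
                             capture_output=True, text=True, check=True)
        times.append(float(out.stdout))
    return min(times)


print(f"best of 5: {min_runtime():.6f} s")
```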

All GPU results were obtained on a single H100 (80 GB HBM3). All CPU results were obtained with 224x Intel Xeon Platinum 8480CL CPUs. Similar trends were observed on other hardware, but no comprehensive study was performed on any other hardware. All results were obtained through Python, using cupy array inputs for GPU execution and numpy array inputs for CPU execution.

High-level Results

The following model and batch size values were explored:

  • Feature count: 8, 32, 128, 512
  • Maximum tree depth: 2, 4, 8, 16, 32
  • Number of trees: 16, 128, 1024, 2048
  • Batch Size: 1, 16, 128, 1024, 1048576, 16777216

The largest batch size was omitted for runs with 512 features due to memory constraints.
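The benchmark grid above, including the memory-constraint exclusion, can be reconstructed like so (an illustrative sketch of the scenario count, not the benchmark code itself):

```python
# Reconstruct the scenario grid described above, excluding the largest
# batch size for 512-feature runs due to memory constraints.
import itertools

FEATURES = [8, 32, 128, 512]
DEPTHS = [2, 4, 8, 16, 32]
TREES = [16, 128, 1024, 2048]
BATCHES = [1, 16, 128, 1024, 1_048_576, 16_777_216]

grid = [
    (f, d, t, b)
    for f, d, t, b in itertools.product(FEATURES, DEPTHS, TREES, BATCHES)
    if not (f == 512 and b == 16_777_216)
]
print(len(grid))  # 480 total combinations minus 20 excluded = 460
```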

Across all of these scenarios, new FIL outperformed legacy FIL in 75% of cases. In the worst case, new FIL underperformed by 27% (an absolute difference of 56 microseconds). In the best case, new FIL outperformed legacy FIL by 4.1x (an absolute difference of 5 seconds). The median speedup was 1.16x, the average was 1.31x, and the quartiles for speedup factor are presented below:

| Percentile | Speedup Factor (New vs. Legacy FIL) |
|-----------:|------------------------------------:|
| 0          | 0.73 |
| 25         | 1.00 |
| 50         | 1.16 |
| 75         | 1.51 |
| 100        | 4.10 |

The worst absolute regression was 60 milliseconds, and the best absolute improvement was 5 seconds.
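Quartile tables like the one above come from per-scenario speedup factors (legacy runtime divided by new runtime). The sketch below shows the computation with synthetic stand-in timings, not the actual benchmark data:

```python
# Compute speedup-factor percentiles from per-scenario timings.
# The timing arrays here are synthetic stand-ins for the real results.
import numpy as np

legacy_times = np.array([1.0, 2.0, 4.0, 8.0])  # seconds, synthetic
new_times = np.array([1.2, 1.6, 2.0, 2.0])     # seconds, synthetic

speedup = legacy_times / new_times  # >1 means new FIL is faster

print(np.percentile(speedup, [0, 25, 50, 75, 100]))
```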

We can get a sense of when new FIL does and does not outperform legacy FIL using the following heatmap, which covers the entire range of scenarios mentioned above. Blue represents cases in which new FIL outperformed legacy FIL. Red represents a regression. Labels are speedup factors.

[Figure: overall_speedup — heatmap of speedup factors across all scenarios]

While this exhaustive picture is illuminating for applications that require a specific batch size, it is often the case that we care only about maximum throughput or about batch size 1 performance. The first is important for applications that care only about minimizing processing time for large amounts of data. The second is important for applications where batching is impossible or where minimizing latency for individual requests is critical. We will consider collated results for each of these separately.

Maximum Throughput Results

The maximum throughput is often (though not always) obtained at the largest batch size. Future work should include automated division of batches into sub-batches for cases where there is a significant peak in performance at batch sizes below the size dictated by device memory constraints.
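The suggested follow-up, automatically dividing large batches into throughput-optimal sub-batches, could look roughly like this. `predict` is a placeholder for a FIL predict call, and the sub-batch size would come from a tuning step rather than being fixed:

```python
# Hedged sketch of automated sub-batching: split a large batch at a
# hypothetical throughput-optimal size and run inference per sub-batch.
import numpy as np


def predict(batch):
    # Placeholder for fil_model.predict(batch).
    return batch.sum(axis=1)


def predict_in_subbatches(X, sub_batch_size):
    outputs = [predict(X[i:i + sub_batch_size])
               for i in range(0, len(X), sub_batch_size)]
    return np.concatenate(outputs)


X = np.ones((10, 3))
print(predict_in_subbatches(X, 4))  # same result as predict(X), chunked
```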

Results for maximum throughput unsurprisingly follow the general trends already described. In 76% of explored models, new FIL obtained a higher maximum throughput than legacy FIL. In the worst case, new FIL underperformed by 18%. In the best case, new FIL outperformed legacy FIL by 3.8x. The median speedup was 1.37x, the average was 1.46x, and the quartiles for speedup factor are presented below:

| Percentile | Speedup Factor (New vs. Legacy FIL) |
|-----------:|------------------------------------:|
| 0          | 0.82 |
| 25         | 1.04 |
| 50         | 1.37 |
| 75         | 1.71 |
| 100        | 3.77 |

The heatmap again gives a good general sense of where new FIL outperforms legacy FIL:
[Figure: Large_Batch_Speedup_experimental_vs_legacy — heatmap of maximum-throughput speedups]

Batch Size 1 Results

For batch size 1, new FIL outperforms legacy FIL in 81% of explored models. In the worst case, new FIL underperformed by 22%. In the best case, new FIL outperformed legacy FIL by 3.02x. The median speedup was 1.55x, the average was 1.55x, and the quartiles for speedup factor are presented below:

| Percentile | Speedup Factor (New vs. Legacy FIL) |
|-----------:|------------------------------------:|
| 0          | 0.78 |
| 25         | 1.19 |
| 50         | 1.55 |
| 75         | 1.65 |
| 100        | 3.02 |

The heatmap for batch size 1 is presented below:
[Figure: Batch_Size_1_Speedup_experimental_vs_legacy — heatmap of batch size 1 speedups]

Note that for batch size 1, new FIL benefits significantly from the fact that it allows CPU execution. While the general algorithmic designs of both new and legacy FIL parallelize extremely well over trees and rows, neither implementation includes parallelization over individual tree nodes. As a result, the lower overhead of the CPU implementation gives it an edge for small numbers of shallow trees.

If we exclude CPU execution from consideration, we see the following batch size 1 performance comparison:
[Figure: Batch_Size_1_Speedup_experimental_vs_legacy_GPU_only — heatmap of batch size 1 speedups, GPU only]

Neither new nor legacy FIL has greatly prioritized the batch size 1 case on GPU, but ongoing work is exploring methods to improve this. In the meantime, new FIL's CPU execution offers a way for FIL users to obtain the desired batch size 1 performance for small forest models. For those users who make use of FIL through Triton, Triton's dynamic batching feature allows individual requests to be combined into larger batches, allowing effective use of GPU FIL.

Additional findings

CumlArray Overhead

During this benchmarking, it was noted that CumlArray construction accounted for 62% of inference time in some cases. If it is possible to reduce this overhead, or at least to provide a mechanism for advanced users to avoid it, performance (especially at batch size 1) may improve significantly.

align_bytes

After correcting the implementation of align_bytes, it was discovered that GPU execution usually (though not always) did not benefit from alignment to cache line boundaries, while CPU execution usually (though not always) did. The defaults for this value were updated accordingly. Given this finding, it would also be useful to have the optimize method test performance with and without alignment in order to obtain the best possible performance from new FIL.
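The proposed optimize-method extension could be sketched as below. This is hypothetical, not the actual cuML API: `time_inference` is a placeholder for rebuilding the forest with a given align_bytes value and timing predict() on sample data.

```python
# Hedged sketch of the proposed follow-up: time the model with and
# without cache-line alignment and keep the faster option.
# time_inference is a hypothetical stand-in, not the cuML API.


def time_inference(align_bytes):
    # Placeholder cost model; a real version would rebuild the forest
    # with the given align_bytes and time predict() on sample data.
    return 1.0 if align_bytes == 0 else 0.9


def pick_align_bytes(candidates=(0, 128)):
    return min(candidates, key=time_inference)


print(pick_align_bytes())  # 128 under this placeholder cost model
```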

Optimal layout

Across all scenarios explored, the depth_first layout was optimal in 45% of cases. The percentage of scenarios in which each layout was optimal is shown in the following table:

| Layout        | GPU | CPU | Overall |
|---------------|----:|----:|--------:|
| depth_first   | 47% | 43% | 45%     |
| layered       | 24% | 36% | 30%     |
| breadth_first | 28% | 22% | 25%     |

The fact that no layout dominated a majority of these scenarios emphasizes the importance of exploring available layouts for optimal runtime performance.
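Exploring the available layouts for a given model amounts to a small sweep like the one below. This is a minimal sketch: `benchmark_layout` is a hypothetical stand-in for loading the forest with each layout and timing inference.

```python
# Minimal sketch: try each available layout and keep the fastest.
# benchmark_layout is a hypothetical stand-in with placeholder timings;
# real values depend on the model and hardware.


def benchmark_layout(layout):
    return {"depth_first": 1.00, "layered": 0.95, "breadth_first": 1.05}[layout]


LAYOUTS = ["depth_first", "layered", "breadth_first"]
best = min(LAYOUTS, key=benchmark_layout)
print(best)  # "layered" under these placeholder timings
```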

Summary

In a comprehensive exploration of realistic model parameters, this change allows new FIL to outperform legacy FIL in a significant majority of cases. Where regressions still exist, they are typically on the order of microseconds to milliseconds, while performance improvements can be on the order of seconds. If we more narrowly focus on the two scenarios that are most important for most users (maximum throughput and batch size 1), these changes still offer significant advantages.

Additionally, new FIL does not appear to suffer from the intermittent slowdown that occasionally appears when benchmarking legacy FIL.

Recommended follow-ups

  1. Promote new FIL to stable
  2. Add align_bytes as an optimization target of the optimize method
  3. Explore CumlArray overhead
  4. Explore improved small-batch performance on GPU through parallelization over nodes

Contributor

@hcho3 hcho3 left a comment


LGTM. I really like how you defined a good set of abstractions and primitives for tree traversal.

Comment thread cpp/include/cuml/experimental/forest/integrations/treelite.hpp
Comment thread cpp/include/cuml/experimental/forest/traversal/traversal_forest.hpp
@wphicks wphicks requested a review from a team as a code owner March 13, 2025 16:06
@github-actions github-actions Bot added the conda conda issue label Mar 13, 2025
@wphicks
Contributor Author

wphicks commented Mar 13, 2025

cudf-pandas failures are unrelated. One is a recently-introduced issue that is being worked on separately. The other is a flaky issue that has been around for some time, is related to original FIL, and will probably be eliminated by promoting new FIL to stable.

Member

@jakirkham jakirkham left a comment


Thanks Will! 🙏

Had a question on the OpenMP addition below

Comment thread dependencies.yaml
```yaml
# clang 15 required by libcudacxx.
- clang==15.0.7
- clang-tools==15.0.7
- llvm-openmp==15.0.7
```
Member


Is there more context on how OpenMP is used in the clang-tidy step?

Is OpenMP needed at build and runtime as well?

Contributor Author


Ah, my apologies! Should have addressed that in the PR description, but I forgot to update it. cuML should be compilable and runnable with or without OpenMP. The problem is that when we clang-tidy, we fairly naively forward our build flags on to clang-tidy itself, including -fopenmp, which sets the _OPENMP compilation definition.

For actual compilation, this is fine, since gcc compilation makes omp.h available by default. That's no longer true with llvm, so we need to either explicitly make that header available to clang-tidy or remove the -fopenmp flag from the clang-tidy invocation. Following offline discussion with @robertmaynard, I opted for providing the header. That will ensure that we're tidying OpenMP-only code paths.

This is the first PR that exposes this problem because we do not clang-tidy our .cu files, and this is the first time a .cpp file required OpenMP.

@wphicks
Contributor Author

wphicks commented Mar 13, 2025

/merge

@rapids-bot rapids-bot Bot merged commit 3a8ea8c into rapidsai:branch-25.04 Mar 13, 2025