
Correctly align trees in experimental FIL#6397

Merged
rapids-bot[bot] merged 37 commits into rapidsai:branch-25.04 from wphicks:fea/fil_shallow
Mar 13, 2025

Conversation

@wphicks
Contributor

@wphicks wphicks commented Mar 5, 2025

Due to a bug in the import code, experimental FIL was previously not making use of the align_bytes argument correctly. The effect was not just a failure to take advantage of cache line boundaries but a severe pessimization in which padding nodes were inserted in the forest structure at highly non-optimal places.

This PR corrects this, resulting in a substantial performance improvement. It also introduces the layered layout type, in which nodes of the same depth are stored together. This allows for a moderate performance improvement in some models. It also allows CPU FIL to intelligently set the number of threads rather than accepting the highly non-optimal default, which provides a significant performance improvement for small batch sizes.

@copy-pr-bot

copy-pr-bot Bot commented Mar 5, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions Bot added the Cython / Python, CMake, and CUDA/C++ labels Mar 5, 2025
@wphicks wphicks added the bug and non-breaking labels Mar 10, 2025
@wphicks wphicks marked this pull request as ready for review March 11, 2025 22:04
@wphicks wphicks requested review from a team as code owners March 11, 2025 22:04
@wphicks
Contributor Author

wphicks commented Mar 11, 2025

Performance Benchmarks

Overview

Prior to these changes, we held back on promoting this version of FIL out of experimental due to a mild performance regression on shallow trees relative to legacy FIL. Our previous benchmarking of FIL was also spotty, covering a haphazard combination of depths, features, tree counts, and batch sizes. To help assess the impact of the current changes, I have conducted a much more comprehensive study of the performance of both new and old FIL.

Benchmarking FIL is challenging not only because performance varies so widely over so many model parameters but also because FIL itself offers a number of performance hyperparameters. In all of the results presented below, we have obtained optimal hyperparameters through exhaustive exploration of available options.
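The exhaustive hyperparameter exploration can be sketched as a simple grid search that keeps the fastest configuration. This is an illustrative sketch, not the actual benchmark harness: `run_benchmark` is a placeholder cost model standing in for real timed inference, and the hyperparameter names are examples.

```python
# Hypothetical sketch: sweep performance hyperparameters exhaustively and
# keep the fastest configuration. run_benchmark is a stub standing in for
# a real timing harness around model.predict().
import itertools


def run_benchmark(chunk_size, layout):
    # Placeholder cost model; a real harness would time inference here.
    return chunk_size * 0.001 + {
        "depth_first": 1.0, "layered": 0.9, "breadth_first": 1.1
    }[layout]


def best_config(chunk_sizes, layouts):
    best = None
    for chunk_size, layout in itertools.product(chunk_sizes, layouts):
        elapsed = run_benchmark(chunk_size, layout)
        if best is None or elapsed < best[0]:
            best = (elapsed, chunk_size, layout)
    return best


print(best_config([1, 8, 32], ["depth_first", "layered", "breadth_first"]))
```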

An additional challenge comes from an unusual behavior observed with legacy FIL. Occasionally, legacy FIL's performance shows a dramatic (~10x) slowdown for a particular model. Rerunning the same benchmark many times in the same process does not yield improved results, but rerunning multiple times in different processes eventually yields ~10x better results. In order to ensure that we consider legacy FIL's best possible performance, legacy FIL's benchmark runs were repeated in their entirety multiple times, and the minimum runtime was recorded.
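The best-of-N-processes methodology described above can be sketched as follows. The benchmark body is a placeholder; the point is that each repetition runs in a fresh process so an intermittent in-process slowdown cannot inflate the recorded minimum.

```python
# Hedged sketch: rerun a benchmark in fresh subprocesses and record the
# minimum runtime, as described above for legacy FIL. The workload below
# is a placeholder for actual FIL inference.
import subprocess
import sys

SCRIPT = """
import time
start = time.perf_counter()
sum(i * i for i in range(100_000))  # placeholder workload
print(time.perf_counter() - start)
"""


def min_runtime(repeats=5):
    times = []
    for _ in range(repeats):
        out = subprocess.run([sys.executable, "-c", SCRIPT],
                             capture_output=True, text=True, check=True)
        times.append(float(out.stdout))
    return min(times)


print(f"best of 5: {min_runtime():.6f} s")
```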

All GPU results were obtained on a single H100 (80 GB HBM3). All CPU results were obtained with 224x Intel Xeon Platinum 8480CL CPUs. Similar trends were observed on other hardware, but no comprehensive study was performed on any other hardware. All results were obtained through Python, using cupy array inputs for GPU execution and numpy array inputs for CPU execution.

High-level Results

The following model and batch size values were explored:

  • Feature count: 8, 32, 128, 512
  • Maximum tree depth: 2, 4, 8, 16, 32
  • Number of trees: 16, 128, 1024, 2048
  • Batch Size: 1, 16, 128, 1024, 1048576, 16777216

The largest batch size was omitted for runs with 512 features due to memory constraints.
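The benchmark grid above, including the memory-constraint exclusion, can be reconstructed like so (an illustrative sketch of the scenario count, not the benchmark code itself):

```python
# Reconstruct the scenario grid described above, excluding the largest
# batch size for 512-feature runs due to memory constraints.
import itertools

FEATURES = [8, 32, 128, 512]
DEPTHS = [2, 4, 8, 16, 32]
TREES = [16, 128, 1024, 2048]
BATCHES = [1, 16, 128, 1024, 1_048_576, 16_777_216]

grid = [
    (f, d, t, b)
    for f, d, t, b in itertools.product(FEATURES, DEPTHS, TREES, BATCHES)
    if not (f == 512 and b == 16_777_216)
]
print(len(grid))  # 480 total combinations minus 20 excluded = 460
```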

Across all of these scenarios, new FIL outperformed legacy FIL in 75% of cases. In the worst case, new FIL underperformed by 27% (an absolute difference of 56 microseconds). In the best case, new FIL outperformed legacy FIL by 4.1x (an absolute difference of 5 seconds). The median speedup was 1.16x, the average was 1.31x, and the quartiles for speedup factor are presented below:

| Percentile | Speedup Factor (New vs. Legacy FIL) |
|-----------:|------------------------------------:|
| 0          | 0.73 |
| 25         | 1.00 |
| 50         | 1.16 |
| 75         | 1.51 |
| 100        | 4.10 |

The worst absolute regression was 60 milliseconds, and the best absolute improvement was 5 seconds.
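Quartile tables like the one above come from per-scenario speedup factors (legacy runtime divided by new runtime). The sketch below shows the computation with synthetic stand-in timings, not the actual benchmark data:

```python
# Compute speedup-factor percentiles from per-scenario timings.
# The timing arrays here are synthetic stand-ins for the real results.
import numpy as np

legacy_times = np.array([1.0, 2.0, 4.0, 8.0])  # seconds, synthetic
new_times = np.array([1.2, 1.6, 2.0, 2.0])     # seconds, synthetic

speedup = legacy_times / new_times  # >1 means new FIL is faster

print(np.percentile(speedup, [0, 25, 50, 75, 100]))
```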

We can get a sense of when new FIL does and does not outperform legacy FIL using the following heatmap, which covers the entire range of scenarios mentioned above. Blue represents cases in which new FIL outperformed legacy FIL. Red represents a regression. Labels are speedup factors.

[Figure: overall_speedup — heatmap of speedup factors across all scenarios]

While this exhaustive picture is illuminating for applications that require a specific batch size, it is often the case that we care only about maximum throughput or about batch size 1 performance. The first is important for applications that care only about minimizing processing time for large amounts of data. The second is important for applications where batching is impossible or where minimizing latency for individual requests is critical. We will consider collated results for each of these separately.

Maximum Throughput Results

The maximum throughput is often (though not always) obtained at the largest batch size. Future work should include automated division of batches into sub-batches for cases where there is a significant peak in performance at batch sizes below the size dictated by device memory constraints.
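The suggested follow-up, automatically dividing large batches into throughput-optimal sub-batches, could look roughly like this. `predict` is a placeholder for a FIL predict call, and the sub-batch size would come from a tuning step rather than being fixed:

```python
# Hedged sketch of automated sub-batching: split a large batch at a
# hypothetical throughput-optimal size and run inference per sub-batch.
import numpy as np


def predict(batch):
    # Placeholder for fil_model.predict(batch).
    return batch.sum(axis=1)


def predict_in_subbatches(X, sub_batch_size):
    outputs = [predict(X[i:i + sub_batch_size])
               for i in range(0, len(X), sub_batch_size)]
    return np.concatenate(outputs)


X = np.ones((10, 3))
print(predict_in_subbatches(X, 4))  # same result as predict(X), chunked
```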

Results for maximum throughput unsurprisingly follow the general trends already described. In 76% of explored models, new FIL obtained a higher maximum throughput than legacy FIL. In the worst case, new FIL underperformed by 18%. In the best case, new FIL outperformed legacy FIL by 3.8x. The median speedup was 1.37x, the average was 1.46x, and the quartiles for speedup factor are presented below:

| Percentile | Speedup Factor (New vs. Legacy FIL) |
|-----------:|------------------------------------:|
| 0          | 0.82 |
| 25         | 1.04 |
| 50         | 1.37 |
| 75         | 1.71 |
| 100        | 3.77 |

The heatmap again gives a good general sense of where new FIL outperforms legacy FIL:
[Figure: Large_Batch_Speedup_experimental_vs_legacy — heatmap of maximum-throughput speedups]

Batch Size 1 Results

For batch size 1, new FIL outperforms legacy FIL in 81% of explored models. In the worst case, new FIL underperformed by 22%. In the best case, new FIL outperformed legacy FIL by 3.02x. The median speedup was 1.55x, the average was 1.55x, and the quartiles for speedup factor are presented below:

| Percentile | Speedup Factor (New vs. Legacy FIL) |
|-----------:|------------------------------------:|
| 0          | 0.78 |
| 25         | 1.19 |
| 50         | 1.55 |
| 75         | 1.65 |
| 100        | 3.02 |

The heatmap for batch size 1 is presented below:
[Figure: Batch_Size_1_Speedup_experimental_vs_legacy — heatmap of batch size 1 speedups]

Note that for batch size 1, new FIL benefits significantly from the fact that it allows CPU execution. While the general algorithmic designs of both new and legacy FIL parallelize extremely well over trees and rows, neither implementation includes parallelization over individual tree nodes. As a result, the lower overhead of the CPU implementation gives it an edge for small numbers of shallow trees.

If we exclude CPU execution from consideration, we see the following batch size 1 performance comparison:
[Figure: Batch_Size_1_Speedup_experimental_vs_legacy_GPU_only — heatmap of batch size 1 speedups, GPU only]

Neither new nor legacy FIL has greatly prioritized the batch size 1 case on GPU, but ongoing work is exploring methods to improve this. In the meantime, new FIL's CPU execution offers a way for FIL users to obtain the desired batch size 1 performance for small forest models. For those users who make use of FIL through Triton, Triton's dynamic batching feature allows individual requests to be combined into larger batches, allowing effective use of GPU FIL.

Additional findings

CumlArray Overhead

During this benchmarking, it was noted that CumlArray construction accounted for 62% of inference time in some cases. If it is possible to reduce this overhead, or at least to provide a mechanism for advanced users to avoid it, performance (especially at batch size 1) may improve significantly.

align_bytes

After correcting the implementation of align_bytes, it was discovered that GPU execution usually (though not always) did not benefit from alignment to cache line boundaries, while CPU execution usually (though not always) did. The defaults for this value were updated accordingly. Given this finding, it would also be useful to have the optimize method test performance with and without alignment in order to obtain the best possible performance from new FIL.
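The proposed optimize-method extension could be sketched as below. This is hypothetical, not the actual cuML API: `time_inference` is a placeholder for rebuilding the forest with a given align_bytes value and timing predict() on sample data.

```python
# Hedged sketch of the proposed follow-up: time the model with and
# without cache-line alignment and keep the faster option.
# time_inference is a hypothetical stand-in, not the cuML API.


def time_inference(align_bytes):
    # Placeholder cost model; a real version would rebuild the forest
    # with the given align_bytes and time predict() on sample data.
    return 1.0 if align_bytes == 0 else 0.9


def pick_align_bytes(candidates=(0, 128)):
    return min(candidates, key=time_inference)


print(pick_align_bytes())  # 128 under this placeholder cost model
```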

Optimal layout

Across all scenarios explored, the depth_first layout was optimal in 45% of cases. The percentage of scenarios in which each layout was optimal is shown in the following table:

| Layout        | GPU | CPU | Overall |
|---------------|----:|----:|--------:|
| depth_first   | 47% | 43% | 45%     |
| layered       | 24% | 36% | 30%     |
| breadth_first | 28% | 22% | 25%     |

The fact that no layout dominated a majority of these scenarios emphasizes the importance of exploring available layouts for optimal runtime performance.
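Exploring the available layouts for a given model amounts to a small sweep like the one below. This is a minimal sketch: `benchmark_layout` is a hypothetical stand-in for loading the forest with each layout and timing inference.

```python
# Minimal sketch: try each available layout and keep the fastest.
# benchmark_layout is a hypothetical stand-in with placeholder timings;
# real values depend on the model and hardware.


def benchmark_layout(layout):
    return {"depth_first": 1.00, "layered": 0.95, "breadth_first": 1.05}[layout]


LAYOUTS = ["depth_first", "layered", "breadth_first"]
best = min(LAYOUTS, key=benchmark_layout)
print(best)  # "layered" under these placeholder timings
```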

Summary

In a comprehensive exploration of realistic model parameters, this change allows new FIL to outperform legacy FIL in a significant majority of cases. Where regressions still exist, they are typically on the order of microseconds to milliseconds, while performance improvements can be on the order of seconds. If we more narrowly focus on the two scenarios that are most important for most users (maximum throughput and batch size 1), these changes still offer significant advantages.

Additionally, new FIL does not appear to suffer from the intermittent slowdown that occasionally appears when benchmarking legacy FIL.

Recommended follow-ups

  1. Promote new FIL to stable
  2. Add align_bytes as an optimization target of the optimize method
  3. Explore CumlArray overhead
  4. Explore improved small-batch performance on GPU through parallelization over nodes

Contributor

@hcho3 hcho3 left a comment


LGTM. I really like how you defined a good set of abstractions and primitives for tree traversal.

Comment thread cpp/include/cuml/experimental/forest/integrations/treelite.hpp
Comment thread cpp/include/cuml/experimental/forest/traversal/traversal_forest.hpp
@wphicks wphicks requested a review from a team as a code owner March 13, 2025 16:06
@github-actions github-actions Bot added the conda conda issue label Mar 13, 2025
@wphicks
Contributor Author

wphicks commented Mar 13, 2025

cudf-pandas failures are unrelated. One is a recently-introduced issue that is being worked on separately. The other is a flaky issue that has been around for some time, is related to original FIL, and will probably be eliminated by promoting new FIL to stable.

Member

@jakirkham jakirkham left a comment


Thanks Will! 🙏

Had a question on the OpenMP addition below

Comment thread dependencies.yaml
```yaml
# clang 15 required by libcudacxx.
- clang==15.0.7
- clang-tools==15.0.7
- llvm-openmp==15.0.7
```
Member


Is there more context on how OpenMP is used in the clang-tidy step?

Is OpenMP needed at build and runtime as well?

Contributor Author


Ah, my apologies! Should have addressed that in the PR description, but I forgot to update it. cuML should be compilable and runnable with or without OpenMP. The problem is that when we clang-tidy, we fairly naively forward our build flags on to clang-tidy itself, including -fopenmp, which sets the _OPENMP compilation definition.

For actual compilation, this is fine, since gcc compilation makes omp.h available by default. That's no longer true with llvm, so we need to either explicitly make that header available to clang-tidy or remove the -fopenmp flag from the clang-tidy invocation. Following offline discussion with @robertmaynard, I opted for providing the header. That will ensure that we're tidying OpenMP-only code paths.

This is the first PR that exposes this problem because we do not clang-tidy our .cu files, and this is the first time a .cpp file required OpenMP.

@wphicks
Contributor Author

wphicks commented Mar 13, 2025

/merge

@rapids-bot rapids-bot Bot merged commit 3a8ea8c into rapidsai:branch-25.04 Mar 13, 2025