PCA preprocessor #1808

Merged
rapids-bot[bot] merged 26 commits into rapidsai:release/26.04 from aamijar:pca-preprocessor on Mar 20, 2026

@aamijar (Member) commented Feb 16, 2026

Resolves #1207. Depends on rapidsai/raft#2952.
This PR introduces `cuvs::preprocessing::pca` with `float` support. The following APIs are supported:
`fit`, `transform`, `fit_transform`, `inverse_transform`.

@aamijar aamijar self-assigned this Feb 16, 2026
@aamijar aamijar moved this to In Progress in Unstructured Data Processing Feb 16, 2026
@aamijar aamijar added the labels feature request (New feature or request), milvus, AlloyDB, and non-breaking (Introduces a non-breaking change) Feb 16, 2026
Comment thread cpp/src/preprocessing/pca/detail/pca.cuh Outdated
raft::device_vector_view<float, int64_t> mu,
raft::device_scalar_view<float, int64_t> noise_vars,
bool flip_signs_based_on_U = false);

@aamijar (Member Author) commented Feb 18, 2026

Making a note here that I don't think the existing cuml implementation has the ability to tune the percentage of explained variance.
For example, in sklearn we can set `0 < n_components < 1`: the user picks a fraction of the explained variance to retain, and the algorithm automatically chooses the number of components needed to reach it.

We will have to build that piece out since it doesn't exist in the current implementation.

Member

Tuning is not what's being asked for. Exposing the explained variance is what's being requested (it is used for tuning / selecting the number of components but that's something the user does, not something we need to do).
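
For illustration only, here is a minimal sketch of how a caller might do that selection on the host once the explained variances are exposed. `select_n_components` is a hypothetical helper, not part of cuVS, and it assumes the vector holds the variances of all components (sorted in descending order), so that their sum is the total variance.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not a cuVS API): given per-component explained
// variances sorted in descending order, return the smallest k whose
// cumulative share of the total variance reaches `target` (0 < target <= 1).
// Assumes `explained_var` covers every component, so its sum is the total.
std::size_t select_n_components(const std::vector<double>& explained_var, double target)
{
  assert(!explained_var.empty());
  assert(target > 0.0 && target <= 1.0);

  const double total = std::accumulate(explained_var.begin(), explained_var.end(), 0.0);
  assert(total > 0.0);

  double cumulative = 0.0;
  for (std::size_t k = 0; k < explained_var.size(); ++k) {
    cumulative += explained_var[k];
    if (cumulative / total >= target) { return k + 1; }
  }
  // Only reachable through floating-point round-off: keep every component.
  return explained_var.size();
}
```

With variances `{4, 3, 2, 1}` and a target of `0.5`, the first component covers 40% and the first two cover 70%, so the helper returns 2.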

Comment thread cpp/tests/preprocessing/pca.cu Outdated
Comment on lines +29 to +31
prms.copy = config.copy;
prms.whiten = config.whiten;
return prms;
Member

Suggested change
- prms.copy = config.copy;
- prms.whiten = config.whiten;
- return prms;
+ prms.copy = config.copy;
+ prms.whiten = config.whiten;
+ prms.verbose = config.verbose;
+ return prms;

We are missing verbose here, no?

@aamijar (Member Author) commented Mar 4, 2026

`verbose` was an unused parameter; removed it from pca.hpp in 99f32fc.

Comment thread cpp/tests/preprocessing/pca.cu Outdated
Comment on lines +93 to +94
// prms.n_cols = params.n_col;
// prms.n_rows = params.n_row;
Member

Remove commented code

Member Author

Addressed in 074fd96

Comment thread cpp/include/cuvs/preprocessing/pca.hpp Outdated
* @param[in] flip_signs_based_on_U whether to determine signs by U (true) or V.T (false)
*/
void fit(raft::resources const& handle,
params config,
Member

Small thing, but important for API consistency: `params config` is passed by value here and in all other functions in the PR. The cuVS convention, afaik, is `const params&`, for example in `kmeans::fit`:

void fit(raft::resources const& handle,
         const cuvs::cluster::kmeans::params& params,
         raft::device_matrix_view<const float, int> X,
         std::optional<raft::device_vector_view<const float, int>> sample_weight,
         raft::device_matrix_view<float, int> centroids,
         raft::host_scalar_view<float> inertia,
         raft::host_scalar_view<int> n_iter);

This won't affect performance or correctness at all, but for consistency I would suggest changing to `const params& config` throughout. This applies to all 8 overloads in this header plus the detail layer.
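
To make the two signature shapes concrete, here is a small sketch. The `params` struct below is a hypothetical stand-in whose fields are illustrative assumptions (borrowed from names that appear in this thread), not the actual `cuvs::preprocessing::pca::params` definition.

```cpp
#include <cassert>

// Hypothetical stand-in for a PCA parameter struct; the fields are
// illustrative assumptions, not the real cuVS definition.
struct params {
  int n_components = 1;
  bool copy        = true;
  bool whiten      = false;
};

// By-value form, as first written in the PR: the callee works on its own copy.
int n_components_by_value(params config)
{
  assert(config.n_components > 0);
  return config.n_components;
}

// By-const-reference form, matching the kmeans::fit signature quoted above:
// nothing is copied and the callee cannot modify the caller's object.
int n_components_by_cref(const params& config)
{
  assert(config.n_components > 0);
  return config.n_components;
}
```

Both forms are called the same way and return the same result for a small struct like this one, which is why the change is about consistency rather than behavior.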

Member Author

Addressed in c7c52a7

Comment thread cpp/tests/preprocessing/pca.cu Outdated
}

protected:
void basicTest()
Member

Both basicTest and advancedTest set n_components = n_col, so the inverse transform should perfectly reconstruct the input.

Consider adding a test with n_components < n_col that verifies the reconstruction error is bounded but non-zero. That would confirm the dimensionality reduction is actually working rather than just passing data through unchanged, and it would be a useful test case.
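
The property such a test checks can be sketched on the CPU for 2-D data, where the principal axes have a closed form. This is only an illustration under that assumption: `point` and `pca_reconstruction_mse` are hypothetical names, and nothing here uses the cuVS API or mirrors the actual test in pca.cu.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct point {
  double x, y;
};

// Mean squared reconstruction error after projecting 2-D points onto the
// first `n_components` principal axes (1 or 2) and mapping them back.
double pca_reconstruction_mse(const std::vector<point>& data, int n_components)
{
  assert(!data.empty());
  assert(n_components == 1 || n_components == 2);
  const double n = static_cast<double>(data.size());

  // Column means.
  double mx = 0.0, my = 0.0;
  for (const auto& p : data) { mx += p.x; my += p.y; }
  mx /= n;
  my /= n;

  // Entries of the (unnormalized) 2x2 scatter matrix; scaling it does not
  // change its eigenvectors.
  double sxx = 0.0, sxy = 0.0, syy = 0.0;
  for (const auto& p : data) {
    const double dx = p.x - mx, dy = p.y - my;
    sxx += dx * dx;
    sxy += dx * dy;
    syy += dy * dy;
  }

  // Leading eigenvector of a symmetric 2x2 matrix via its rotation angle.
  const double theta = 0.5 * std::atan2(2.0 * sxy, sxx - syy);
  const double ux = std::cos(theta), uy = std::sin(theta);  // first axis
  const double vx = -uy, vy = ux;                           // second axis

  double sq_err = 0.0;
  for (const auto& p : data) {
    const double dx = p.x - mx, dy = p.y - my;
    // Forward transform: coordinates along the kept axes.
    const double t1 = dx * ux + dy * uy;
    const double t2 = (n_components == 2) ? dx * vx + dy * vy : 0.0;
    // Inverse transform: back to the original space.
    const double rx = mx + t1 * ux + t2 * vx;
    const double ry = my + t1 * uy + t2 * vy;
    sq_err += (p.x - rx) * (p.x - rx) + (p.y - ry) * (p.y - ry);
  }
  return sq_err / n;
}
```

With both components kept, the error is zero up to round-off. With one component kept on data that is not perfectly collinear, the error is strictly positive but no larger than the total variance, which is the "bounded but non-zero" behavior described above.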

Member Author

Addressed in 948882c

@divyegala (Member) left a comment

Why were double instantiations needed? Where is the code intended to be used?

Comment thread cpp/tests/preprocessing/pca.cu Outdated
@aamijar (Member Author) commented Mar 4, 2026

> Why were double instantiations needed? Where is the code intended to be used?

The cuML PCA Python interface supports double inputs. However, cuML will use the RAFT API, so cuVS does not need double instantiations if we think they're not valuable.

@divyegala (Member) commented

> The cuML PCA Python interface supports double inputs. However, cuML will use the RAFT API, so cuVS does not need double instantiations if we think they're not valuable.

In that case, please remove double.

@aamijar aamijar changed the base branch from main to release/26.04 March 14, 2026 00:01
@cjnolet cjnolet dismissed dantegd’s stale review March 20, 2026 19:06

Comments addressed. Dismissing while Dante on PTO.

@aamijar (Member Author) commented Mar 20, 2026

/merge

@rapids-bot rapids-bot Bot merged commit cede915 into rapidsai:release/26.04 Mar 20, 2026
102 of 118 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Unstructured Data Processing Mar 20, 2026
gforsyth pushed a commit to gforsyth/cuvs that referenced this pull request Mar 20, 2026
Resolves rapidsai#1207. Depends on rapidsai/raft#2952
This PR introduces the `cuvs::preprocessing::pca` with `float` support. The following APIs are supported:
`fit`, `transform`, `fit_transform`, `inverse_transform`.

Authors:
  - Anupam (https://github.com/aamijar)
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Divye Gala (https://github.com/divyegala)

URL: rapidsai#1808
jrbourbeau pushed a commit to jrbourbeau/cuvs that referenced this pull request Mar 25, 2026
Resolves rapidsai#1207. Depends on rapidsai/raft#2952
This PR introduces the `cuvs::preprocessing::pca` with `float` support. The following APIs are supported:
`fit`, `transform`, `fit_transform`, `inverse_transform`.

Authors:
  - Anupam (https://github.com/aamijar)
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Divye Gala (https://github.com/divyegala)

URL: rapidsai#1808

Labels: AlloyDB, feature request, milvus, non-breaking

Project status: Done

Linked issue: [FEA] Implement cuvs::preprocessing::pca

4 participants