
Conversation

@joshdunnlime
Contributor

Reference Issues/PRs

Background on why we want this: #601 and #605
Supersedes this PR: #608

What does this implement/fix? Explain your changes.

This implementation adds a DifferentiableTransformer (DT). It acts as a wrapper around an sklearn transformer and gives the user the option to use one of the following (in order of precedence):

  1. an explicit derivative via inverse_func_diff.
  2. an explicit derivative for a scaler transformer via scaler_.
  3. numerical differentiation of the inverse_transform.
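The resolution order above could be sketched roughly as follows (a hypothetical helper, not the PR's actual code; the function name `resolve_inverse_diff` and the `step` parameter are illustrative):

```python
import numpy as np


def resolve_inverse_diff(transformer, inverse_func_diff=None, step=1e-6):
    """Pick a derivative for transformer.inverse_transform, following the
    order of precedence above (hypothetical sketch, not the PR's code)."""
    # 1. an explicit user-supplied derivative wins
    if inverse_func_diff is not None:
        return inverse_func_diff
    # 2. exact derivative for scaler-type transformers: e.g. MinMaxScaler's
    #    inverse is affine, X = (X_scaled - min_) / scale_, so the
    #    elementwise derivative is 1 / scale_
    if hasattr(transformer, "scale_"):
        return lambda z: np.ones_like(z) / transformer.scale_
    # 3. fall back to a central-difference numerical derivative
    def numerical_diff(z):
        f = transformer.inverse_transform
        return (f(z + step) - f(z - step)) / (2 * step)

    return numerical_diff
```

For a fitted MinMaxScaler, case 2 returns the exact constant derivative; for anything else with an `inverse_transform`, case 3 approximates it numerically.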

The derivative is available for both the (forward) transform and the inverse_transform.

The DT has a coerce classmethod that takes an sklearn transformer or a function and coerces it to a DT.
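A minimal sketch of how such a coerce classmethod might behave (the behaviour is assumed from the description above; the class body here is hypothetical, not the PR's implementation):

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer


class DifferentiableTransformer(TransformerMixin, BaseEstimator):
    """Sketch only; the real DT lives in the PR's _transformer.py."""

    def __init__(self, transformer=None):
        self.transformer = transformer

    @classmethod
    def coerce(cls, obj):
        # already a DT: pass through unchanged
        if isinstance(obj, cls):
            return obj
        # bare function: wrap it in a FunctionTransformer first
        if callable(obj) and not hasattr(obj, "transform"):
            obj = FunctionTransformer(func=obj)
        # sklearn transformer: wrap directly
        return cls(transformer=obj)
```

This lets TTR/TD accept a function, an sklearn transformer, or a DT through a single code path.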

In addition to the above, it also makes changes to #605 to apply the Jacobian (the derivative of the transform/inverse_transform) to the pdf and log_pdf. In cases 1) and 2) above this returns the exact pdf and log_pdf.

This preserves the current functionality of being able to pass a function to TTR (TransformedTargetRegressor) and allows a user to extend this by passing an sklearn transformer or their own DT. The final point allows the user to configure their DT with either the explicit derivative, or to pass kwargs that can be used for numerical differentiation. In theory, it also allows a user to pass these kwargs as hyperparameters via a gridsearch/optimisation.

Does your contribution introduce a new dependency? If yes, which one?

No.

What should a reviewer concentrate their feedback on?

  1. Implementation of the DT is the most important part:

    • Are we happy with the transformer inheriting from sklearn?
    • Do we need this much abstraction at this stage?
    • Do we wish to add/implement tags for this?
  2. Changes to TD (TransformedDistribution) - the main change here is internally handling a transformer instead of a function.

  3. Changes to the TTR are minor.

On points 2) and 3), note that if we wish to explicitly pass transform only as a method/function, it is very straightforward to change this back. We would simply need to add inverse_transform and inverse_func_diff as kwargs to the TD. This does mean adding more kwargs if we wish to support numerical derivative kwargs.

Secondary feedback

Did you add any tests for the change?

Yes, added a param3 to TD, passing a FunctionTransformer instead of just a function.

Any other comments?

I had the bulk of this implementation down prior to the discussion on what we pass to TTR/TD here. Again, it is trivial to move the creation of the DT into TTR and pass the transform, inverse_transform and inverse_func_diff, but I feel the current approach is much cleaner, as it keeps the majority of the new logic in the DT and within the new _transformer.py module.

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR). maintenance - CI, test framework, release.
    See here for full badge reference
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

…ion.

This allows us to get more information from the transformer class, e.g. the MinMaxScaler scale_ parameter.

Though not part of the typical forward-facing interface, TransformedDistribution.transform is now no longer callable. Instead TransformedDistribution.transform.inverse_transform would be needed.
This is useful when testing XGBoostLSS and hyperparameter optimisation is not important. XGBoostLSS can be computationally expensive when searching over the default 30 trials.
Does so by adding the log_pdf with change-of-variables (Jacobian of the inverse transform), using a direct method for linear transforms and a numerical method for non-linear transforms.

TransformedDistribution now takes transformer instead of transform - that is, the transformer class instead of just the inverse transform function/method.
This is not actually intended as an example. It is a simple way to share findings. This can (should) be removed before any final merges into main.
This reverts commit 4146a6f, reversing
changes made to 8bc5897.
This is needed so that when TransformedDistribution.distribution is called, it has the same indices as the wrapper distribution.
@joshdunnlime
Contributor Author

Usage examples:

Create some data:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

from skpro.metrics import LogLoss, CRPS
from skpro.regression.xgboostlss import XGBoostLSS
from skpro.regression.compose import TransformedTargetRegressor
from skpro.compose import DifferentiableTransformer

size = 1000

X = np.random.normal(1, 1, size)
y = (2 * X + np.random.normal(0, 0.5, size)) # / 15 + 0.5 # for Beta

Xy = pd.DataFrame(
    {"target": y, "feature": X},
    index=range(size)
)

Now we can pass an sklearn transformer to TTR:

xgb = XGBoostLSS(**params)
mms = MinMaxScaler((0.1, 0.9))
pipe = TransformedTargetRegressor(regressor=xgb, transformer=mms)

or a DT:

xgb = XGBoostLSS(**params)
mms = MinMaxScaler((0.1, 0.9))
dt = DifferentiableTransformer(transformer=mms)
pipe = TransformedTargetRegressor(regressor=xgb, transformer=dt)

Both are handled the same:

pipe.fit(X=Xy[["feature"]], y=Xy["target"])
pp = pipe.predict_proba(Xy[["feature"]])
p = pipe.predict(Xy[["feature"]])
CRPS()(y_true=Xy["target"], y_pred=pp)
LogLoss()(y_true=Xy["target"], y_pred=pp)

0.41678

And in either case, if we inspect the TD (pp) returned by predict_proba we get a DT as the transformer.

pp.transformer_

DifferentiableTransformer(transformer=MinMaxScaler(feature_range=(0.1, 0.9)))

@fkiraly
Collaborator

fkiraly commented Oct 7, 2025

I tried to resolve the conflicts, have a look

Here we consider the numerical differentiation to be classed as approx. For well-behaved functions (which change-of-variables requires), this is likely to be overkill.
@joshdunnlime
Contributor Author

@fkiraly

This should give a working implementation.

There are a couple of design choices that might be worth discussing:

  1. numdifftools was added to handle fast and accurate numerical differentiation.

    • Added as an optional extra. We could add an "install numdifftools/extras" warning when the fallback is used, so users know there is a better option.
    • The scipy derivative function has some breaking changes between versions. We could handle this by checking versions; however, the scipy implementation is a simplified version of numdifftools, so numdifftools is likely to give more robust results.
  2. Do we need a DifferentiableTransformer (DT)? And does it need to inherit from BaseTransformer, BaseEstimator etc?

    • We could have a DifferentiableFunction class, as you mentioned previously. My main concern with it being a Transformer is that it makes design choices about the transformer base class which might narrow the design choices for that in the future.
    • Does it need to inherit from sklearn's BaseEstimator or should it have a pure skpro/skbase implementation?
  3. This implementation can pass a function, an sklearn transformer (skT) or a DT to TTR (all via the transform kwarg). It doesn't support passing inverse_transform or inverse_transform_diff, but I have kept that in mind while designing this. It should be trivial to extend this (allow passing an skT, DT or inverse_transform + inverse_transform_diff), or to constrain this to only allow inverse_transform + inverse_transform_diff.

    • I am in favour of supporting skT or DT as it feels more sklearn-like. E.g. wrap an sklearn-like object in another sklearn-like object and then just fit-predict.
    • Supporting all of the above options doesn't introduce any breaking changes.
  4. Should we consider numerical differentiation as approx? The current implementation does; however, the "change-of-variables" only works with well-behaved functions. These are incredibly easy for numerical differentiation to handle, with errors typically on the order of the 7th decimal place when comparing the exact derivative of a function to the numerical derivative for CRPS, LogLoss and MAE of the log_pdf.

  5. Currently, multivariate targets are handled completely independently, that is to say, each target is transformed independently. Thus, the Jacobian is treated as a diagonal matrix with no partial cross-derivatives. This is mathematically correct for all sklearn transformers where the transform (F_i(y)) depends only on its respective target (y_i). Other custom transformers could be created where some F_i(y) depends on any/all y, but I would consider this extremely unlikely.
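For the independent-columns case, the diagonal-Jacobian treatment reduces the log-density correction to a per-column sum. A small numpy check (illustrative numbers only, not PR code):

```python
import numpy as np

# With column-wise transforms, the Jacobian is diagonal, so the
# log-density correction is just the sum of per-column log-derivatives.
# Hypothetical per-column derivatives dF_i/dy_i of the forward transforms:
diag = np.array([0.08, 2.0])
log_abs_det_J = np.sum(np.log(np.abs(diag)))

# identical to the log-determinant of the full (diagonal) Jacobian
full = np.log(np.abs(np.linalg.det(np.diag(diag))))
assert np.isclose(log_abs_det_J, full)
```

Cross-dependent transforms (some F_i depending on several y_j) would instead require the full determinant of a non-diagonal Jacobian.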

In addition to the above, the documentation needs some tidying, and comments, tags and licenses need updating.

Further improvements:

  1. Tidy up repeated handling of index and columns.
  2. Tests - do we need specific tests added?
  3. Tests don't catch a false-negative exact for pdf_log_pdf. E.g. if the tag logic is changed and something that is exact gets tagged as "approx", this won't be caught.
  4. The fallback derivative could also be improved.

@fkiraly
Collaborator

fkiraly commented Oct 14, 2025

This should give a working implementation.

Oh, very nice! I will review but first leave some comments on the design choices.

  1. numdifftools was added to handle fast and accurate numerical differentiation.

I think we should not take a dependency on numdifftools - the package looks abandoned: last release in 2022, and a sole maintainer. It implies python 3.10 or python 3.11, so it will lapse into outdated python within 2 years.

I will contact the author to see what is going on, but for now I think we should avoid it.

  1. Do we need a DifferentiableTransformer (DT)?

That is unclear to me and we should really think carefully about pros and cons.

We could have a DifferentiableFunction class, as you mentioned previously.

But then we would need to turn the fitted transformer into a DifferentiableFunction, right? Feels less extensible to me. Or how would everything fit together? Would appreciate your thoughts.

Does it need to inherit from sklearn's BaseEstimator or should it have a pure skpro/skbase implementation?

If a transformer, I am not strongly in favour of either option. If a function, I would strongly favour an skbase implementation, because it is too far from the sklearn transformer.

  1. This implementation can pass a function

I like this, though I still need time to think about it too.

  1. Should we consider numerical differentiation as approx?

Where exactly? I have already implemented numerical differentiation (with a sixth-order approximation) in the pdf and log_pdf defaults.

  1. Currently, multivariate targets are handled completely independently,

That is correct; currently skpro cannot support multivariate distributions. That is a bit of a project, and the API is not fleshed out.

I think that API question needs to be answered before we move to the differentiation question (but it might make sense to look at both in close sequence to validate the design).

E.g., do we need to have a separate method for multivariate pdf? Note that the current pdf produces marginals of an independent distribution across variables. But for joint distributions the marginals do not carry the full information, so there needs to be a way to return joint pdf.

@fkiraly
Collaborator

fkiraly commented Oct 14, 2025

FYI, I opened an issue to collect API discussion for multivariate distributions here:
#622

@joshdunnlime
Contributor Author

joshdunnlime commented Oct 19, 2025

I think we should not take a dependency on numdifftools - the package looks abandoned. Last release in 2022, and sole maintainer. It implies python 3.10 or python 3.11, so will lapse into outdated python within 2 years.

I'll revert to scipy with version handling for the misc.derivative and differentiate.derivative functions, or write our own.
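A version-handling shim along those lines might look like the following (a sketch, not the PR code; `scipy.misc.derivative` was removed in SciPy 1.12, and `scipy.differentiate.derivative` arrived in SciPy 1.15 with a different, result-object-based API):

```python
import numpy as np

try:
    # SciPy >= 1.15: adaptive finite differences, returns a result object
    from scipy.differentiate import derivative as _derivative

    def numerical_diff(f, x):
        return _derivative(f, x).df
except ImportError:
    try:
        # SciPy < 1.12: the old fixed-step central-difference helper
        from scipy.misc import derivative as _derivative

        def numerical_diff(f, x):
            return _derivative(f, x, dx=1e-6)
    except ImportError:
        # neither API available: plain central-difference fallback
        def numerical_diff(f, x, h=1e-6):
            return (f(x + h) - f(x - h)) / (2 * h)
```

Whichever branch is taken, `numerical_diff(np.exp, 1.0)` should come out close to e.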

But then we would need to turn the fitted transformer into a DifferentiableFunction, right? Feels less extensible to me. Or how would everything fit together? Would appreciate your thoughts.

I've had much more of a think about having a DifferentiableFunction vs DifferentiableTransformer. I am confident the DiffT is the right way to go:

  • It follows existing sklearn-like patterns. DiffF introduces a new pattern.
  • Adding numerical differentiation is unrestrictive to future development of BaseTransformer. The BaseDiffT only requires inverse_diff and transform_diff as public methods and _fit_with_fitted_transformer as a private method; it currently has _numerical_diff, but that is not a hard requirement.

My impression is that if we were implementing TTR and TD from scratch, with LogLoss, pdf and log_pdf support, I would probably go with something like:

# what this PR has
xgb = XGBoostLSS(**params)
mms = MinMaxScaler((0.1, 0.9))
dt = DifferentiableTransformer(transformer=mms)
pipe = TransformedTargetRegressor(regressor=xgb, transformer=dt)

or

# we could easily add "dft" with the current PR
xgb = XGBoostLSS(**params)
dft = DifferentiableFuncTransformer(func, inverse_func, func_diff, inverse_diff)
pipe = TransformedTargetRegressor(regressor=xgb, transformer=dft)

I think the main issue here is extending/changing the TTR input options, e.g. point 3 above. For example, the _fit_with_fitted_ is only used to accommodate patterns unlike the above examples.

If transformer, I am not in favour strongly of either option. If a function, I would favour an skbase implementation strongly, because it is too far from the sklearn transformer.

I would keep it inheriting from sklearn for now then. My understanding is it should be straightforward to replace this with skbase if/when needed.

  1. This implementation can pass a function

I like this, though I still need time to think about it too.

It would be good to get some more thoughts on this. As mentioned above, I think this is the main constraint in how this gets implemented.

  1. Should we consider numerical differentiation as approx?

Where?

In the pdf and log_pdf, when applying the Jacobian. The plain pdf and log_pdf are always exact for the transformed distribution. For the original distribution, the pdf and log_pdf are always exact when using scale_ or inverse_transform_diff for the Jacobian. When using _numerical_diff, the pdf and log_pdf are still exact, but the Jacobian is technically an approximation. Therefore, in this case, I have set the tag as "approx". However, for all well-behaved functions (which is a theoretical prerequisite), numerical differentiation would effectively give an exact Jacobian (to some relatively small rounding error). Do we wish to keep this as "approx" or change it to "exact"?
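For reference, the change-of-variables identity the Jacobian implements can be checked directly against a known density (a standard result, not PR code): with z = t(y), p_Y(y) = p_Z(t(y)) * |t'(y)|, and taking t = log recovers the lognormal density.

```python
import numpy as np
from scipy import stats

# Change of variables: if z = t(y) is the forward transform, then
# p_Y(y) = p_Z(t(y)) * |t'(y)|.  Here t = log, so Y = exp(Z) is lognormal.
y = np.array([0.5, 1.0, 2.5])
p_z = stats.norm.pdf(np.log(y))  # density of the transformed variable
jac = np.abs(1.0 / y)            # |d/dy log(y)|
p_y = p_z * jac

# agrees with the known lognormal density
assert np.allclose(p_y, stats.lognorm.pdf(y, s=1.0))
```

Replacing the exact 1/y Jacobian with a central-difference estimate changes the result only at the level of rounding error, which is the "approx vs exact" question above.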

  1. Currently, multivariate targets are handled completely independently,

I assumed that this was fully supported. Until these other parts are implemented, it will be difficult to design around it. I would consider this an edge case - sklearn transformers don't even handle combined column transformations (afaik).

@fkiraly
Collaborator

fkiraly commented Oct 19, 2025

ok, I also think that transformer is the better way to go after some thinking.

How do we proceed practically - do you want to give it a stab and then we maybe refine?

@joshdunnlime
Contributor Author

@fkiraly - I have removed any additional packages and the differentiation is now done by scipy with both the old and new derivative APIs implemented.

The transformer implementation we spoke about is implemented and now just needs refining. I have moved some methods down to the DiffT class so as to keep the BaseT as generic as possible.

@fkiraly
Collaborator

fkiraly commented Nov 24, 2025

Nice! Will review in the coming days - this is much appreciated, but I think I need more time to digest the API design.
(please ping if I need more than a few days)

@joshdunnlime
Contributor Author

Nice! Will review in the coming days - this is much appreciated, but I think I need more time to digest the API design. (please ping if I need more than a few days)

@fkiraly - just bringing this one to your attention again. Thanks
