
Conversation

@fkiraly
Collaborator

@fkiraly fkiraly commented Oct 5, 2025

This PR adds cdf support to TransformedTargetRegressor, via changes in TransformedDistribution:

  • TransformedDistribution now accepts an inverse_transform argument, which can be used for an exact cdf
  • TransformedTargetRegressor passes the inverse transform of the fitted self.transformer_ as the inverse_transform of TransformedDistribution

Together with #610, it means that TransformedTargetRegressor can now produce distributions with reasonably reliable cdf and pdf.

Goes partially towards #601.
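To illustrate why an inverse transform yields an exact cdf (a minimal sketch, not skpro's actual code): for a monotonically increasing transform t, P(t(X) <= y) = P(X <= t_inverse(y)) = F_X(t_inverse(y)), so the cdf of the transformed distribution is just the base cdf composed with the inverse transform, with no sampling involved.

```python
# Sketch only: exact cdf of a transformed distribution via the
# inverse transform. Function names here are illustrative.
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """cdf of a normal base distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def transformed_cdf(y, base_cdf, inverse_transform):
    """Exact cdf of t(X) for increasing t, given t_inverse and F_X."""
    return base_cdf(inverse_transform(y))

# Example: Y = exp(X) with X ~ N(0, 1), i.e., Y is log-normal.
# The median of Y is exp(0) = 1, so the cdf at 1 is exactly 0.5.
print(transformed_cdf(1.0, normal_cdf, math.log))  # 0.5
```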

@fkiraly fkiraly added enhancement module:probability&simulation probability distributions and simulators module:regression probabilistic regression module labels Oct 5, 2025
@fkiraly
Collaborator Author

fkiraly commented Oct 5, 2025

FYI @joshdunnlime, this is what I meant for a quick fix

@fkiraly
Collaborator Author

fkiraly commented Oct 5, 2025

@joshdunnlime, any idea why the test is failing? cdf and ppf do not seem to be inverse to each other.

@joshdunnlime
Contributor

joshdunnlime commented Oct 6, 2025

@fkiraly is cdf meant to be approximate? The class tags mark cdf as approx, but get_tags shows cdf as exact. I can see this gets set if an inverse is passed. However, I get:

<class 'skpro.distributions.trafo._transformed.TransformedDistribution'> does not have a numerically exact implementation of the 'cdf' method, it is filled in by approximating the expected value by the indicator function on 1000 samples.

So _cdf seems to be falling back to the approximation (I get that warning when running in a script). I had a similar issue with _pdf and _log_pdf, and it was an indices issue (set the index and column names on the transformed output). See here and here.
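For context, a minimal sketch of the fallback the warning describes (not skpro's actual implementation): when no exact cdf is available, the cdf at x is approximated as the expected value of the indicator function 1[X <= x] over a number of samples.

```python
# Sketch only: Monte Carlo cdf approximation by the indicator
# function on samples, as described in the warning message.
import random

def approx_cdf(sample_fn, x, n_samples=1000, seed=42):
    """Monte Carlo cdf: mean of the indicator 1[sample <= x]."""
    rng = random.Random(seed)
    samples = [sample_fn(rng) for _ in range(n_samples)]
    return sum(s <= x for s in samples) / n_samples

# Example: uniform(0, 1) base samples; the true cdf at 0.5 is 0.5.
est = approx_cdf(lambda rng: rng.random(), 0.5)
print(est)  # close to 0.5, but only approximately
```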

@joshdunnlime
Contributor

@fkiraly - What are the loc/iloc methods doing? They are being called by _pdf via the boilerplate, but the class is being initialized again without passing the inverse_transform.

@joshdunnlime
Contributor

@fkiraly merge https://github.com/joshdunnlime/skpro/tree/ttr-cdf-pdf-add-inv or just add

        inverse_transform=self.inverse_transform,

to the cls call in _iloc.
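A minimal toy sketch of the bug and the one-line fix (class and parameter names here are illustrative, not skpro's exact code): if a composite distribution is rebuilt inside _iloc without forwarding every constructor argument, the rebuilt object silently loses capabilities, here the exact-cdf path via inverse_transform.

```python
# Sketch only: why dropping a constructor argument on
# reconstruction in _iloc loses the exact-cdf capability.
class ToyTransformedDistribution:
    def __init__(self, distribution, transform, inverse_transform=None):
        self.distribution = distribution
        self.transform = transform
        self.inverse_transform = inverse_transform

    def _iloc_buggy(self, rows):
        cls = type(self)
        # BUG: inverse_transform is dropped on reconstruction
        # (row subsetting itself omitted for brevity)
        return cls(self.distribution, self.transform)

    def _iloc_fixed(self, rows):
        cls = type(self)
        return cls(
            self.distribution,
            self.transform,
            inverse_transform=self.inverse_transform,  # the fix
        )

d = ToyTransformedDistribution([0, 1, 2], abs, inverse_transform=abs)
assert d._iloc_buggy([0]).inverse_transform is None
assert d._iloc_fixed([0]).inverse_transform is abs
```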

@joshdunnlime
Contributor

This still raises an "X does not have valid feature names, but MinMaxScaler was fitted with feature names" warning, so I think the indices issue mentioned above is still there, but the tests pass.

@fkiraly
Collaborator Author

fkiraly commented Oct 6, 2025

@fkiraly - What are the methods loc/iloc doing?

These are methods to subset or reorder the indices of the array distribution. loc and iloc work like their pandas.DataFrame counterparts; the inner _loc / _iloc must define the equivalent operations for distributions.

If the distribution is parametric, the default takes care of this, but in general, e.g., if the distribution is composite, it currently needs to be done manually.

There is space to define a broader default including distribution objects as components, but that is currently an open issue:
#559

It's being called by _pdf via the boilerplate, but the class is being initialized again without passing the inverse_transform.

I see, that must be it!
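The subsetting semantics described above can be sketched with a toy parametric distribution (illustrative names; skpro's real boilerplate differs): iloc-style subsetting rebuilds the distribution from row-subset parameters, analogous to pandas.DataFrame.iloc.

```python
# Sketch only: the parametric default for iloc-style subsetting,
# which subsets every parameter the same way and reconstructs.
class ToyNormalArray:
    def __init__(self, mu, sigma):
        self.mu = list(mu)        # one entry per distribution row
        self.sigma = list(sigma)

    def iloc_rows(self, rows):
        # parametric default: subset each parameter, then rebuild
        return type(self)(
            [self.mu[i] for i in rows],
            [self.sigma[i] for i in rows],
        )

d = ToyNormalArray(mu=[0.0, 1.0, 2.0], sigma=[1.0, 1.0, 1.0])
sub = d.iloc_rows([0, 2])
print(sub.mu)  # [0.0, 2.0]
```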

@fkiraly fkiraly changed the title [ENH] TransformedDistribution and TransformedTargetRegressor cdf support &joshdunnlime [ENH] TransformedDistribution and TransformedTargetRegressor cdf support Oct 6, 2025
@fkiraly
Collaborator Author

fkiraly commented Oct 6, 2025

This still raises a X does not have valid feature names, but MinMaxScaler was fitted with feature names so I think there is still the indices issue mentioned above, but the tests pass.

can you outline your understanding of why this occurs? Which two objects exactly, where and when passed to MinMaxScaler, have missing or inconsistent feature names?

        assume_monotonic=self.assume_monotonic,
        index=new_index,
        columns=new_columns,
        **params_dict,
Collaborator Author

I think this is safer and more extensible. This might already be almost a solution for #559 (it only needs to be combined with type checks?)

@fkiraly
Collaborator Author

fkiraly commented Oct 6, 2025

If you have a fix, @joshdunnlime, could you open a PR with only the fix? So we can merge it quickly while the more complicated design questions remain open?

@joshdunnlime
Contributor

Done. It is literally that one line to allow the exact cdf.


        inv_trafo = self.inverse_transform
        inv_x = inv_trafo(x)
Contributor

This is where MinMaxScaler in my local script raised the warning.

I used:

        warnings.filterwarnings('error')

to catch and debug.
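A small self-contained sketch of this debugging trick: promoting warnings to errors makes the exact call site raise, so a traceback points at the offending line instead of a one-off warning message. The function below is a made-up stand-in for the code that emits the warning.

```python
# Sketch only: locate a warning's source by raising it as an error.
import warnings

def leaky_function():
    # stand-in for code that triggers the feature-names warning
    warnings.warn("X does not have valid feature names", UserWarning)
    return 42

with warnings.catch_warnings():
    warnings.simplefilter("error")  # same effect as filterwarnings('error')
    try:
        leaky_function()
        raised = False
    except UserWarning as exc:
        raised = True
        print(f"caught at source: {exc}")

print(raised)  # True
```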

Collaborator Author

I see.

I think this is resolved by ensuring x is a pd.DataFrame when we pass it.
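A minimal sketch of that fix (helper name is illustrative): before handing x to a transformer fitted on a DataFrame, coerce a bare numpy array back to a DataFrame carrying the fitted feature names, so sklearn's feature-name check is satisfied.

```python
# Sketch only: coerce numpy input to a labeled DataFrame before
# passing it to a DataFrame-fitted transformer.
import numpy as np
import pandas as pd

def as_frame(x, columns, index=None):
    """Coerce x to a DataFrame with the given column names."""
    if isinstance(x, pd.DataFrame):
        return x
    return pd.DataFrame(np.asarray(x), columns=columns, index=index)

x_np = np.array([[0.1], [0.7]])
x_df = as_frame(x_np, columns=["target"])
print(list(x_df.columns))  # ['target']
```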

@joshdunnlime
Contributor

joshdunnlime commented Oct 6, 2025

This still raises a X does not have valid feature names, but MinMaxScaler was fitted with feature names so I think there is still the indices issue mentioned above, but the tests pass.

can you outline your understanding of why this occurs? Which two objects exactly, where and when passed to MinMaxScaler, have missing or inconsistent feature names?

I have added a comment on where the warning is raised. The inverse_transform function returns a numpy array. This was easy to find when raising warnings as errors (see comment).

I think the solution I have implemented in #612, where every function or transformer is wrapped in a DifferentiableTransformer, would be the nicest way to fix this. It is then very easy to keep all of the transformation logic in there and out of TransformedDistribution. We can guarantee that transform, inverse_transform and their _diff functions always return a dataframe. This means much less code like:

if not isinstance(x_t, pd.DataFrame):
    x_t = pd.DataFrame(x_t, index=x.index, columns=x.columns)
else:
    x_t.columns = x.columns
    x_t.index = x.index

This is the case for _pdf, _log_pdf, _cdf and probably all the other functions that need transformer outputs.
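The wrapper idea can be sketched as follows; the class name and interface here are illustrative, echoing the DifferentiableTransformer proposal from #612, not skpro's final API. The wrapper coerces every output back to a DataFrame with the input's labels, so callers never need ad-hoc isinstance checks.

```python
# Sketch only: a wrapper that guarantees DataFrame outputs for a
# transform/inverse pair, reusing the input's index and columns.
import numpy as np
import pandas as pd

class FrameCoercingTransformer:
    """Wraps a transform/inverse pair; outputs are always DataFrames."""

    def __init__(self, transform, inverse_transform):
        self._transform = transform
        self._inverse = inverse_transform

    def _coerce(self, out, like):
        # attach the input's labels, converting numpy if needed
        if not isinstance(out, pd.DataFrame):
            return pd.DataFrame(out, index=like.index, columns=like.columns)
        out.index = like.index
        out.columns = like.columns
        return out

    def transform(self, x):
        return self._coerce(self._transform(x), x)

    def inverse_transform(self, x):
        return self._coerce(self._inverse(x), x)

x = pd.DataFrame({"y": [1.0, 2.0]})
t = FrameCoercingTransformer(lambda v: np.log(v), lambda v: np.exp(v))
out = t.inverse_transform(t.transform(x))  # round trip recovers x
print(type(out).__name__, list(out.columns))  # DataFrame ['y']
```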

@fkiraly
Collaborator Author

fkiraly commented Oct 6, 2025

We can guarantee that transform, inverse_transform and their _diff functions always return a dataframe.

Can you explain why this would be the case? The data arrives inside _cdf and _pdf as numpy already.

This may have been an unfortunate design choice (not sure about this yet), but that also means whatever we do with x or p, it starts with numpy. So, the DifferentiableTransformer may not get the pd.DataFrame it expects?

@fkiraly
Collaborator Author

fkiraly commented Oct 6, 2025

I have opened a new design issue here: #615
I think there are some interesting questions about nested/composite distributions where I would appreciate your opinion, given that you have dug into the problems above!

I will now merge this PR to expedite the various improvements.

@fkiraly fkiraly merged commit 1c54087 into main Oct 6, 2025
34 checks passed
fkiraly pushed a commit that referenced this pull request Oct 6, 2025
…rm` (#614)

#### Reference Issues/PRs
#611

#### What does this implement/fix? Explain your changes.
Fixes the call to exact instead of approx for the `_cdf` method.
@joshdunnlime
Contributor

We can guarantee that transform, inverse_transform and their _diff functions always return a dataframe.

Can you explain why this would be the case? The data arrives inside _cdf and _pdf as numpy already.

This may have been an unfortunate design choice (not sure about this yet), but that also means whatever we do with x or p, it starts with numpy. So, the DifferentiableTransformer may not get the pd.DataFrame it expects?

I'm talking more about the outputs of cdf, ppf and pdf. We are nearly always applying some transform in the TransformedDistribution (TD), so having the outputs (e.g. the Jacobian) as dataframes would tidy up the TD code. It does deviate from default sklearn behaviour, but sklearn does have a transformer-to-dataframe setting.

Skpro TD method outputs are dataframes so this keeps consistency with that and makes debugging easier IMO.

I have opened a new design issue here: #615 I think there are some interesting questions about nested/composite distributions where I would appreciate your opinion! Given that you have dug into the problems above.

I will now merge this PR to expedite the various improvements.

I'll take a look. Thanks.
