[SPARK-41008][MLLIB] Dedup isotonic regression duplicate features
### What changes were proposed in this pull request?
Add a pre-processing step to isotonic regression in mllib to handle duplicate features, matching the `sklearn` implementation. Input points that share a feature value are aggregated into a single point. For each unique feature value, the points are aggregated as follows:
- Aggregated label is the weighted average of all labels
- Aggregated feature is the weighted average of all equal features. Feature values may be equal only up to a resolution because of representation errors; since we cannot know which feature value to use in that case, we compute the weighted average of the features. Ideally, all feature values are exactly equal and the weighted average is just that common value.
- Aggregated weight is the sum of all weights
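
The three aggregation rules above can be sketched in a few lines of Python. This is only an illustration, not the code in this patch: the helper name `dedup_points` and the `(label, feature, weight)` tuple layout are hypothetical, features are grouped by exact equality, and all weights are assumed to be positive.

```python
from itertools import groupby


def dedup_points(points):
    """Aggregate points that share a feature value.

    `points` is an iterable of (label, feature, weight) tuples.
    Hypothetical helper for illustration; assumes positive weights
    and groups features by exact equality.
    """
    # Sort by feature so that equal features are adjacent for groupby.
    ordered = sorted(points, key=lambda p: p[1])
    aggregated = []
    for _, group in groupby(ordered, key=lambda p: p[1]):
        group = list(group)
        total_weight = sum(w for _, _, w in group)
        # Weighted average of the labels.
        label = sum(l * w for l, _, w in group) / total_weight
        # Weighted average of the (equal) features.
        feature = sum(f * w for _, f, w in group) / total_weight
        aggregated.append((label, feature, total_weight))
    return aggregated
```

With exact equality the weighted average of the features simply reproduces the shared value (up to rounding); it only makes a difference if values are grouped up to a resolution.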
### Why are the changes needed?
As per the discussion on ticket [SPARK-41008](https://issues.apache.org/jira/browse/SPARK-41008), this is a bug and results should match `sklearn`.
### Does this PR introduce _any_ user-facing change?
There are no changes to the API, documentation, or error messages. However, users should expect results to change for inputs that contain duplicate feature values.
### How was this patch tested?
Existing test cases for duplicate features failed with this change and were adjusted accordingly. New tests were also added.
Here is a Python snippet that can be used to verify the results against `sklearn`:
```python
from sklearn.isotonic import IsotonicRegression


def test(x, y, x_test, isotonic=True):
    # Fit sklearn's isotonic regression and predict on the test points.
    ir = IsotonicRegression(out_of_bounds='clip', increasing=isotonic).fit(x, y)
    y_test = ir.predict(x_test)

    def print_array(label, a):
        print(f"{label}: [{', '.join([str(i) for i in a])}]")

    # Print the fitted thresholds and the predictions for x_test.
    print_array("boundaries", ir.X_thresholds_)
    print_array("predictions", ir.y_thresholds_)
    print_array("y_test", y_test)


test(
    x=[0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    y=[1, 0, 0, 1, 0, 1, 0, 0, 0],
    x_test=[0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
)
```
srowen zapletal-martin
Closes #38966 from ahmed-mahran/ml-isotonic-reg-dups.
Authored-by: Ahmed Mahran <[email protected]>
Signed-off-by: Sean Owen <[email protected]>