We love scikit-learn, but we often find ourselves writing custom transformers, metrics and models. The goal of this project is to consolidate these into a package that offers code quality and testing. This project started as a collaboration between multiple companies in the Netherlands but has since received contributions from around the globe. It was initiated by Matthijs Brouns and Vincent D. Warmerdam as a tool to teach people how to contribute to open source.
Note that we're not formally affiliated with the scikit-learn project at all, but we aim to strictly adhere to their standards.
The same holds for LEGO. LEGO® is a trademark of the LEGO Group of companies which does not sponsor, authorize or endorse this project.
Install scikit-lego via pip with
python -m pip install scikit-lego
Via conda with
conda install -c conda-forge scikit-lego
Alternatively, to edit and contribute you can fork/clone and run:
python -m pip install -e ".[dev]"
python setup.py develop
The documentation can be found here.
We offer custom metrics, models and transformers. You can import them just like you would in scikit-learn.
# the scikit-learn stuff we love
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# the scikit-lego stuff we add
from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier
...
mod = Pipeline([
    ("scale", StandardScaler()),
    ("random_noise", RandomAdder()),
    ("model", GMMClassifier())
])
...
Here's a list of features that this library currently offers:
- sklego.datasets.load_abalone: loads in the abalone dataset
- sklego.datasets.load_arrests: loads in a dataset with fairness concerns
- sklego.datasets.load_chicken: loads in the joyful chickweight dataset
- sklego.datasets.load_heroes: loads a heroes of the storm dataset
- sklego.datasets.load_hearts: loads a dataset about hearts
- sklego.datasets.load_penguins: loads a lovely dataset about penguins
- sklego.datasets.fetch_creditcard: fetch a fraud dataset from openml
- sklego.datasets.make_simpleseries: make a simulated timeseries
- sklego.pandas_utils.add_lags: adds lag values in a pandas dataframe
- sklego.pandas_utils.log_step: a useful decorator to log your pipeline steps
- sklego.dummy.RandomRegressor: dummy benchmark that predicts random values
- sklego.linear_model.DeadZoneRegressor: experimental feature that has a deadzone in the cost function
- sklego.linear_model.DemographicParityClassifier: logistic classifier constrained on demographic parity
- sklego.linear_model.EqualOpportunityClassifier: logistic classifier constrained on equal opportunity
- sklego.linear_model.ProbWeightRegression: linear model that treats coefficients as probabilistic weights
- sklego.linear_model.LowessRegression: locally weighted linear regression
- sklego.linear_model.LADRegression: least absolute deviation regression
- sklego.linear_model.QuantileRegression: linear quantile regression, generalizes LADRegression
- sklego.linear_model.ImbalancedLinearRegression: punish over/under-estimation of a model directly
- sklego.naive_bayes.GaussianMixtureNB: classifies by training a 1D GMM per column per class
- sklego.naive_bayes.BayesianGaussianMixtureNB: classifies by training a bayesian 1D GMM per class
- sklego.mixture.BayesianGMMClassifier: classifies by training a bayesian GMM per class
- sklego.mixture.BayesianGMMOutlierDetector: detects outliers based on a trained bayesian GMM
- sklego.mixture.GMMClassifier: classifies by training a GMM per class
- sklego.mixture.GMMOutlierDetector: detects outliers based on a trained GMM
- sklego.meta.ConfusionBalancer: experimental feature that allows you to balance the confusion matrix
- sklego.meta.DecayEstimator: adds decay to the sample_weight that the model accepts
- sklego.meta.EstimatorTransformer: adds a model output as a feature
- sklego.meta.OutlierClassifier: turns outlier models into classifiers for gridsearch
- sklego.meta.GroupedPredictor: can split the data into runs and run a model on each
- sklego.meta.GroupedTransformer: can split the data into runs and run a transformer on each
- sklego.meta.SubjectiveClassifier: experimental feature to add a prior to your classifier
- sklego.meta.Thresholder: meta model that allows you to gridsearch over the threshold
- sklego.meta.RegressionOutlierDetector: meta model that finds outliers by adding a threshold to regression
- sklego.meta.ZeroInflatedRegressor: predicts zero or applies a regression based on a classifier
- sklego.preprocessing.ColumnCapper: limits extreme values of the model features
- sklego.preprocessing.ColumnDropper: drops a column from pandas
- sklego.preprocessing.ColumnSelector: selects columns based on column name
- sklego.preprocessing.InformationFilter: transformer that can de-correlate features
- sklego.preprocessing.IdentityTransformer: returns the same data, allows for concatenating pipelines
- sklego.preprocessing.LinearEmbedder: reweight features using coefficients from a fitted linear model
- sklego.preprocessing.OrthogonalTransformer: makes all features linearly independent
- sklego.preprocessing.TypeSelector: selects columns based on type
- sklego.preprocessing.RandomAdder: adds randomness in training
- sklego.preprocessing.RepeatingBasisFunction: repeating feature engineering, useful for timeseries
- sklego.preprocessing.DictMapper: assign numeric values on categorical columns
- sklego.preprocessing.OutlierRemover: experimental method to remove outliers during training
- sklego.preprocessing.MonotonicSplineTransformer: re-uses SplineTransformer in an attempt to make monotonic features
- sklego.model_selection.GroupTimeSeriesSplit: timeseries Kfold for groups with different numbers of observations per group
- sklego.model_selection.KlusterFoldValidation: experimental feature that does K folds based on clustering
- sklego.model_selection.TimeGapSplit: timeseries Kfold with a gap between train/test
- sklego.pipeline.DebugPipeline: adds debug information to make debugging easier
- sklego.pipeline.make_debug_pipeline: shorthand function to create a debuggable pipeline
- sklego.metrics.correlation_score: calculates correlation between model output and feature
- sklego.metrics.equal_opportunity_score: calculates equal opportunity metric
- sklego.metrics.p_percent_score: proxy for model fairness with regard to a sensitive attribute
- sklego.metrics.subset_score: calculate a score on a subset of your data (meant for fairness tracking)
We want to be rather open in what we accept, but we do require three things before a feature is added to the project:
- any new feature contributes towards a demonstrable real-world use case
- any new feature passes standard unit tests (we use the ones from scikit-learn)
- the feature has been discussed in the issue list beforehand
We automate all of our testing and use pre-commit hooks to keep the code working.
