Feature mrmr #622
Merged
41 commits
0e2e45f Wip 1
14ff603 new ideas
4d8e4ae new ideas
e324389 commit (fabioscantamburlo)
2a78dc1 commit (fabioscantamburlo)
55fcd63 exclude venv (fabioscantamburlo)
82d9719 Wip2 (fabioscantamburlo)
bfe8ba3 Wip3 (fabioscantamburlo)
5397d34 Wip 3.5 (fabioscantamburlo)
7ae7d10 Pushing some optim (fabioscantamburlo)
365a78d Doc string and examples (fabioscantamburlo)
acf356a Docstring WIP (fabioscantamburlo)
b5160af Docstring WIP2 (fabioscantamburlo)
5d12ada Adding something (fabioscantamburlo)
899f315 Bugfix (fabioscantamburlo)
a0f0470 Mkdocs and small fixes (fabioscantamburlo)
7591f85 Added tests (fabioscantamburlo)
0053735 Added scripts (fabioscantamburlo)
8fecf4a Wip4 (fabioscantamburlo)
b08b04b removing tests (fabioscantamburlo)
9d8c20b Added tests and some bugifx (fabioscantamburlo)
077b0d1 revert pandastransformer (fabioscantamburlo)
5d55a2e Update sklego/feature_selection/mrmr.py (fabioscantamburlo)
2e107ef Resolving comments on PR (fabioscantamburlo)
1d4340c features (fabioscantamburlo)
8f4481e venv (fabioscantamburlo)
a6713d6 Add missing file (fabioscantamburlo)
25f5613 Wip userguide (fabioscantamburlo)
774f170 Merge branch 'FEATURE-MRMR-UserGuide' into FEATURE-MRMR (fabioscantamburlo)
4454266 Merge branch 'main' into FEATURE-MRMR (fabioscantamburlo)
3f21b0a typing (fabioscantamburlo)
249d17f Update sklego/feature_selection/mrmr.py (fabioscantamburlo)
6c772d8 Update sklego/feature_selection/mrmr.py (fabioscantamburlo)
b9b5bfc Update docs/user-guide/feature-selection.md (fabioscantamburlo)
86e0dc6 Update sklego/feature_selection/mrmr.py (fabioscantamburlo)
f788026 resolve comments (fabioscantamburlo)
87da50c clean (fabioscantamburlo)
287977e suggestions + general rephrase (fabioscantamburlo)
8b4c40a Typo (fabioscantamburlo)
a1df5fe Update docs/user-guide/feature-selection.md (fabioscantamburlo)
3b2bc7a Merge branch 'main' into FEATURE-MRMR (fabioscantamburlo)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,122 @@ | ||
| from pathlib import Path | ||
|
|
||
| _file = Path(__file__) | ||
| print(f"Executing {_file}") | ||
|
|
||
|
|
||
| _static_path = Path("_static") / _file.stem | ||
| _static_path.mkdir(parents=True, exist_ok=True) | ||
|
|
||
| # --8<-- [start:mrmr-commonimports] | ||
| from sklearn.datasets import fetch_openml | ||
| from sklearn.ensemble import HistGradientBoostingClassifier | ||
| from sklearn.feature_selection import f_classif, mutual_info_classif | ||
| from sklearn.metrics import f1_score | ||
| from sklearn.model_selection import train_test_split | ||
| from sklego.feature_selection import MaximumRelevanceMinimumRedundancy | ||
| import matplotlib.pyplot as plt | ||
| import numpy as np | ||
| import seaborn as sns | ||
| # --8<-- [end:mrmr-commonimports] | ||
|
|
||
| # --8<-- [start:mrmr-intro] | ||
|
|
||
| # Download MNIST dataset using scikit-learn | ||
| mnist = fetch_openml("mnist_784", cache=True) | ||
|
|
||
| # Assign features and labels | ||
| X_pd, y_pd = mnist["data"], mnist["target"].astype(int) | ||
|
|
||
| X, y = X_pd.to_numpy(), y_pd.to_numpy() | ||
| t_t_s_params = {'test_size': 10000, 'random_state': 42} | ||
| X_train, X_test, y_train, y_test = train_test_split(X, y, **t_t_s_params) | ||
| X_train = X_train.reshape(60000, 28 * 28) | ||
| X_test = X_test.reshape(10000, 28 * 28) | ||
| # --8<-- [end:mrmr-intro] | ||
|
|
||
| # --8<-- [start:mrmr-smile] | ||
| def smile_relevance(X, y): | ||
| rows = 28 | ||
| cols = 28 | ||
| smiling_face = np.zeros((rows, cols), dtype=int) | ||
|
|
||
| # Set the values for the eyes, nose, | ||
| # and mouth with adjusted positions and sizes | ||
| # Left eye | ||
| smiling_face[10:13, 8:10] = 1 | ||
| # Right eye | ||
| smiling_face[10:13, 18:20] = 1 | ||
| # Bottom of the mouth (rows 18-19 sit below the corners at rows 16-17) | ||
| smiling_face[18:20, 10:18] = 1 | ||
| # Left edge of the open mouth | ||
| smiling_face[16:18, 8:10] = 1 | ||
| # Right edge of the open mouth | ||
| smiling_face[16:18, 18:20] = 1 | ||
|
|
||
| # Add the nose as two pixels | ||
| smiling_face[14, 13:15] = 1 | ||
| # Fill the bottom row of the image | ||
| smiling_face[27, :] = 1 | ||
| return smiling_face.reshape(rows * cols,) | ||
|
|
||
|
|
||
| def smile_redundancy(X, selected, left): | ||
| return np.ones(len(left)) | ||
| # --8<-- [end:mrmr-smile] | ||
|
|
||
| # --8<-- [start:mrmr-core] | ||
| K = 38 | ||
| mrmr = MaximumRelevanceMinimumRedundancy(k=K, | ||
| kind="auto", | ||
| redundancy_func="p", | ||
| relevance_func="f") | ||
| mrmr_s = MaximumRelevanceMinimumRedundancy(k=K, | ||
| redundancy_func=smile_redundancy, | ||
| relevance_func=smile_relevance) | ||
|
|
||
| f = f_classif(X_train, y_train.reshape(60000,))[0] | ||
| f_features = np.argsort(np.nan_to_num(f, nan=np.finfo(float).eps))[-K:] | ||
| mi = mutual_info_classif(X_train, y_train.reshape(60000,)) | ||
| mi_features = np.argsort(np.nan_to_num(mi, nan=np.finfo(float).eps))[-K:] | ||
| mrmr_features = mrmr.fit(X_train, y_train).selected_features_ | ||
| mrmr_smile_features = mrmr_s.fit(X_train, y_train).selected_features_ | ||
|
|
||
| # --8<-- [end:mrmr-core] | ||
| # --8<-- [start:mrmr-selected-features] | ||
| # Define features dictionary | ||
| features = { | ||
| "f_classif": f_features, | ||
| "mutual_info": mi_features, | ||
| "mrmr": mrmr_features, | ||
| "mrmr_smile": mrmr_smile_features, | ||
| } | ||
| for name, s_f in features.items(): | ||
| model = HistGradientBoostingClassifier(random_state=42) | ||
| model.fit(X_train[:, s_f], y_train.squeeze()) | ||
| y_pred = model.predict(X_test[:, s_f]) | ||
| print(f"Feature selection method: {name}") | ||
| print(f"F1 score: {round(f1_score(y_test, y_pred, average='weighted'), 3)}") | ||
|
|
||
| # --8<-- [end:mrmr-selected-features] | ||
|
|
||
| # --8<-- [start:mrmr-plots] | ||
| # Create figure and axes for the plots | ||
| fig, axes = plt.subplots(2, 2, figsize=(12, 8)) | ||
|
|
||
| # Iterate through the features dictionary and plot the images | ||
| for idx, (name, s_f) in enumerate(features.items()): | ||
| row = idx // 2 | ||
| col = idx % 2 | ||
|
|
||
| a = np.zeros(28 * 28) | ||
| a[s_f] = 1 | ||
| ax = axes[row, col] | ||
| sns.heatmap(a.reshape(28, 28), cmap="binary", ax=ax, cbar=False) | ||
| ax.set_title(name) | ||
|
|
||
|
|
||
|
|
||
|
|
||
| # --8<-- [end:mrmr-plots] | ||
| plt.tight_layout() | ||
| plt.savefig(_static_path / "mrmr-feature-selection-mnist.png") | ||
| plt.clf() | ||
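The link between `K = 38` and the mask built in `smile_relevance` can be checked directly. The sketch below rebuilds the same mask in a hypothetical helper, `smile_mask`, and counts its pixels: the face itself has 38 pixels, and the filled bottom row adds 28 more, so 66 features share the top relevance score. Since `smile_redundancy` returns a constant, the choice among those 66 comes down to tie-breaking. The index-order tie-breaking used here is an assumption, not something the script states.

```python
import numpy as np

ROWS, COLS = 28, 28


def smile_mask(include_bottom_row=True):
    """Rebuild the 28x28 mask used by smile_relevance (hypothetical helper)."""
    face = np.zeros((ROWS, COLS), dtype=int)
    face[10:13, 8:10] = 1    # left eye
    face[10:13, 18:20] = 1   # right eye
    face[14, 13:15] = 1      # nose (two pixels)
    face[16:18, 8:10] = 1    # left corner of the mouth
    face[16:18, 18:20] = 1   # right corner of the mouth
    face[18:20, 10:18] = 1   # bottom of the mouth
    if include_bottom_row:
        face[27, :] = 1      # full bottom row of the image
    return face


full = smile_mask()
face_only = smile_mask(include_bottom_row=False)
print("pixels with relevance 1:", int(full.sum()))
print("pixels in the face itself:", int(face_only.sum()))

# Assumption: ties are broken in flattened (row-major) index order.
first_38 = np.flatnonzero(full.reshape(-1))[:38]
print("first 38 all above the bottom row:", bool((first_38 // COLS < 27).all()))
```

Under that assumption, the 38 selected features are exactly the face, and the bottom row is left out.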
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| # Feature Selection | ||
|
|
||
| :::sklego.feature_selection.mrmr.MaximumRelevanceMinimumRedundancy | ||
| options: | ||
| show_root_full_path: true | ||
| show_root_heading: true |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| # Feature Selection | ||
|
|
||
| ## Maximum Relevance Minimum Redundancy | ||
|
|
||
| The [`Maximum Relevance Minimum Redundancy`][MaximumRelevanceMinimumRedundancy-api] (MRMR) is an iterative feature selection method commonly used in data science to select a subset of features from a larger feature set. The goal of MRMR is to choose features that have high *relevance* to the target variable while minimizing *redundancy* among the already selected features. | ||
|
|
||
| MRMR is heavily dependent on the two functions used to determine relevance and redundancy. However, the paper [Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform](https://arxiv.org/pdf/1908.05376.pdf) shows that using [f_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html) or [f_regression](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html) as the relevance function and Pearson correlation as the redundancy function performs best on a variety of problems and is a good default in general. | ||
|
|
||
| Inspired by the Medium article [Feature Selection: How To Throw Away 95% of Your Data and Get 95% Accuracy](https://towardsdatascience.com/feature-selection-how-to-throw-away-95-of-your-data-and-get-95-accuracy-ad41ca016877), we showcase a practical application using the well-known MNIST dataset. | ||
|
|
||
| Note that although the default scikit-lego MRMR implementation uses redundancy and relevance as defined in [Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform](https://arxiv.org/pdf/1908.05376.pdf), it also accepts custom functions, which may be necessary in some scenarios depending on the data. | ||
|
|
||
| We will consider the following well-known filter methods: | ||
|
|
||
| - F statistical test ([ANOVA F-test](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html)). | ||
| - Mutual information, as approximated by the scikit-learn implementation (`mutual_info_classif`). | ||
|
|
||
| We compare them against the default scikit-lego MRMR implementation and a custom MRMR implementation designed to select the features that draw a smiling face on the plot of the MNIST digits. | ||
|
|
||
|
|
||
|
|
||
| ??? example "MRMR imports" | ||
| ```py | ||
| --8<-- "docs/_scripts/feature-selection.py:mrmr-commonimports" | ||
| ``` | ||
|
|
||
| ```py title="MRMR mnist" | ||
| --8<-- "docs/_scripts/feature-selection.py:mrmr-intro" | ||
| ``` | ||
|
|
||
| As custom functions, we implemented the smile redundancy and smile relevance. | ||
|
|
||
| ```py title="MRMR smile functions" | ||
| --8<-- "docs/_scripts/feature-selection.py:mrmr-smile" | ||
| ``` | ||
|
|
||
| Then we execute the main code part. | ||
|
|
||
| ```py title="MRMR core" | ||
| --8<-- "docs/_scripts/feature-selection.py:mrmr-core" | ||
| ``` | ||
|
|
||
| After the execution it is possible to inspect the F1-score for the selected features: | ||
|
|
||
| ```py title="MRMR mnist selected features" | ||
| --8<-- "docs/_scripts/feature-selection.py:mrmr-selected-features" | ||
| ``` | ||
|
|
||
| ```console hl_lines="5-6" | ||
| Feature selection method: f_classif | ||
| F1 score: 0.854 | ||
| Feature selection method: mutual_info | ||
| F1 score: 0.879 | ||
| Feature selection method: mrmr | ||
| F1 score: 0.925 | ||
| Feature selection method: mrmr_smile | ||
| F1 score: 0.849 | ||
| ``` | ||
|
|
||
| The default MRMR selection gives the best results of the four (F1 of 0.925, versus 0.879 for mutual information and 0.854 for the F-test), although the smile technique performs rather well too (0.849). | ||
|
|
||
| Finally, we can take a look at the selected features. | ||
|
|
||
| ??? example "MRMR generate plots" | ||
| ```py | ||
| --8<-- "docs/_scripts/feature-selection.py:mrmr-plots" | ||
| ``` | ||
|
|
||
|  | ||
|
|
||
| [MaximumRelevanceMinimumRedundancy-api]: ../../api/feature-selection#sklego.feature_selection.mrmr.MaximumRelevanceMinimumRedundancy |
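The iterative selection described in the user guide above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the scikit-lego implementation: the function names (`greedy_mrmr`, `abs_corr_relevance`, `mean_abs_corr_redundancy`) are hypothetical, absolute Pearson correlation with the target stands in for the F-statistic as relevance, and candidates are ranked by the quotient relevance / redundancy. The callables follow the same `(X, y)` and `(X, selected, left)` signatures as the smile functions in the script.

```python
import numpy as np


def _unit_columns(X):
    """Center each column and scale it to unit length (constant columns become zeros)."""
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = 1.0
    return Xc / norms


def abs_corr_relevance(X, y):
    """Illustrative relevance: |Pearson correlation| of each feature with y."""
    Z = _unit_columns(X)
    t = _unit_columns(np.asarray(y, dtype=float).reshape(-1, 1))[:, 0]
    return np.abs(Z.T @ t)


def mean_abs_corr_redundancy(X, selected, left):
    """Mean |Pearson correlation| of each candidate in `left` with the selected features."""
    Z = _unit_columns(X)
    return np.abs(Z[:, left].T @ Z[:, selected]).mean(axis=1)


def greedy_mrmr(X, y, k, relevance_func, redundancy_func, eps=1e-12):
    """Pick k feature indices, one at a time, by relevance / redundancy."""
    relevance = np.asarray(relevance_func(X, y), dtype=float)
    left = list(range(X.shape[1]))
    first = int(np.argmax(relevance))  # first pick: most relevant feature
    selected = [first]
    left.remove(first)
    while len(selected) < k and left:
        redundancy = np.asarray(redundancy_func(X, selected, left), dtype=float)
        scores = relevance[left] / np.maximum(redundancy, eps)
        best = left[int(np.argmax(scores))]
        selected.append(best)
        left.remove(best)
    return selected


# Toy data: column 1 nearly duplicates column 0, column 2 is an independent, weaker signal.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + 0.01 * rng.normal(size=500)
x2 = rng.normal(size=500)
X_demo = np.column_stack([x0, x1, x2])
y_demo = x0 + 0.5 * x2

top2 = np.argsort(abs_corr_relevance(X_demo, y_demo))[-2:]
picked = greedy_mrmr(X_demo, y_demo, 2, abs_corr_relevance, mean_abs_corr_redundancy)
print("top-2 by relevance only:", sorted(top2.tolist()))
print("greedy MRMR pick:", picked)
```

On this toy data, ranking by relevance alone keeps both near-duplicate columns. The greedy loop instead takes one of them and then the independent column, which is the behaviour the relevance/redundancy trade-off is meant to produce.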
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| __all__ = [ | ||
| "MaximumRelevanceMinimumRedundancy", | ||
| ] | ||
|
|
||
| from sklego.feature_selection.mrmr import MaximumRelevanceMinimumRedundancy |