Add a block version of ShuffleSplit cross-validator #251

leouieda · 2020-04-08T11:09:44Z

Cross-validation of spatial data with random splits can often lead to overestimation of accuracy (see Roberts et al., 2017 for a nice overview). To account for this, the splits can be done by spatial blocks: the data are partitioned into blocks and the blocks are split randomly. That guarantees some independence of measurements between blocks assigned to the test and train sets.

This change adds the BlockShuffleSplit cross-validator, which does the random splitting of blocks in a scikit-learn compatible class. It can be used with verde.cross_val_score like any other scikit-learn cross-validator. In fact, it can be used entirely outside of Verde for any type of machine learning on spatial data. The class takes care of balancing the random splits so that the right proportion of train/test data is included in each (since blocks can have wildly different numbers of points).
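To make the idea concrete, here is a minimal NumPy-only sketch of a blocked random split with a simple form of balancing. It is not the code in this PR: the function names (block_ids, block_shuffle_split), the square-block labelling, and the attempts parameter are all assumptions made for illustration. The sketch draws several random selections of test blocks and keeps the one whose share of test points is closest to the requested test_size.

```python
import numpy as np


def block_ids(easting, northing, spacing):
    """Label each point with the integer ID of the square block it falls in."""
    col = np.floor((easting - easting.min()) / spacing).astype(int)
    row = np.floor((northing - northing.min()) / spacing).astype(int)
    n_cols = col.max() + 1
    return row * n_cols + col


def block_shuffle_split(easting, northing, spacing, test_size=0.2, attempts=10, seed=0):
    """Return (train, test) point indices from a random split of blocks.

    Hypothetical balancing step: try several random block selections and
    keep the one whose fraction of test points is closest to test_size.
    """
    rng = np.random.default_rng(seed)
    labels = block_ids(easting, northing, spacing)
    unique_blocks = np.unique(labels)
    n_test_blocks = max(1, int(round(test_size * unique_blocks.size)))
    best_mask, best_error = None, np.inf
    for _ in range(attempts):
        # Whole blocks go to the test set, never individual points
        test_blocks = rng.choice(unique_blocks, size=n_test_blocks, replace=False)
        mask = np.isin(labels, test_blocks)
        error = abs(mask.mean() - test_size)
        if error < best_error:
            best_mask, best_error = mask, error
    indices = np.arange(labels.size)
    return indices[~best_mask], indices[best_mask]


# Synthetic scattered data, for illustration only
rng = np.random.default_rng(42)
easting = rng.uniform(0, 100, size=500)
northing = rng.uniform(0, 100, size=500)
train, test = block_shuffle_split(easting, northing, spacing=10)
labels = block_ids(easting, northing, spacing=10)
```

Because whole blocks are assigned to one side or the other, no block contributes points to both the train and test sets, which is the property the description relies on.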

TODO:

Finish the docstring for BlockShuffleSplit

Follow up PRs will:

Add support for this CV in train_test_split (either by adding a blocked=False argument or creating a train_test_split_blocks function).
Add a section in the tutorial about blocked cross-validation and use this class for splitting in other examples.

Reminders:

Run make format and make check to make sure the code follows the style guide.
Add tests for new features or tests that would have caught the bug that you're fixing.
Add new public functions/methods/classes to doc/api/index.rst and the base __init__.py file for the package.
Write detailed docstrings for all functions/classes/methods. It often helps to design better code if you write the docstrings first.
If adding new functionality, add an example to the docstring, gallery, and/or tutorials.
Add your full name, affiliation, and ORCID (optional) to the AUTHORS.md file (if you haven't already) in case you'd like to be listed as an author on the Zenodo archive of the next release.

It imitates the scikit-learn classes but requires X to have easting, northing as the columns. This way, it can be used directly in vd.cross_val_score or other libraries doing modelling with scikit-learn. Still need to test this and add docstrings.


leouieda · 2020-04-08T16:26:46Z

verde/model_selection.py


+class BlockShuffleSplit(BaseCrossValidator):
+    """
+    Random permutation of spatial blocks cross-validator.


I'm not very happy with this description but I wanted to stay close to the sklearn description "Random permutation cross-validator". Any suggestions?

jessepisel · 2020-04-08

At first read it seems like a strange description, but I think it succinctly describes the class. The other ideas I had were just different versions saying the same thing but in more words. I'll think about this for a while.

leouieda · 2020-04-08T17:25:35Z

Hi @jessepisel sorry to keep bugging you with review requests. Feel free to ignore me or tell me to go bother someone else 🙂 You have been working with geostatistics and ML, right? So I thought this might be interesting to you.

jessepisel · 2020-04-08T19:43:10Z

@leouieda This looks really cool. I think that this will be useful to cross validate spatial data via blocks rather than random splits. I will take a look through this afternoon when I get a block of time and give it a review and read through the linked article to make sure I understand how it works.

jessepisel

I am digging through the BaseCrossValidator in sklearn to double check and make sure I understand how _iter_test_indices works. Tests, examples, index, and references all look good to me.

leouieda · 2020-04-09T10:20:14Z

Thanks, Jesse! That method yields the indices of the test set for each split. The BaseCrossValidator.split method uses it to yield train and test indices (basically doing np.logical_not). The rest of the methods are just there for compatibility. split is only implemented to check the inputs and so I could specify that X has coordinates in the docstring.
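As a rough illustration of that pattern, the sketch below is a simplified NumPy-only stand-in, not scikit-learn's or Verde's actual code. The class name SketchSplitter and its parameters are hypothetical. A private generator yields test indices, and split turns each set of test indices into a boolean mask, negates it with np.logical_not to get the train indices, and checks that X has two coordinate columns.

```python
import numpy as np


class SketchSplitter:
    """Hypothetical stand-in that mirrors the test-indices-to-split pattern."""

    def __init__(self, n_splits=3, test_size=2, seed=0):
        self.n_splits = n_splits
        self.test_size = test_size
        self.seed = seed

    def _iter_test_indices(self, X):
        # Yield only the indices of the test set for each split
        rng = np.random.default_rng(self.seed)
        for _ in range(self.n_splits):
            yield rng.choice(X.shape[0], size=self.test_size, replace=False)

    def split(self, X):
        # Check the input: X is expected to hold (easting, northing) columns
        X = np.asarray(X)
        if X.ndim != 2 or X.shape[1] != 2:
            raise ValueError("X must have exactly 2 columns (easting, northing).")
        indices = np.arange(X.shape[0])
        for test_index in self._iter_test_indices(X):
            test_mask = np.zeros(X.shape[0], dtype=bool)
            test_mask[test_index] = True
            # The train set is everything that is not in the test set
            yield indices[np.logical_not(test_mask)], indices[test_mask]


# Six points along a diagonal, for illustration only
coordinates = np.column_stack((np.arange(6.0), np.arange(6.0)))
splits = list(SketchSplitter().split(coordinates))
```

Each element of splits is a (train, test) pair of index arrays that together cover every row of the coordinate matrix exactly once.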

jessepisel

I had a chance to look through the rest of the code after the sklearn rabbit hole yesterday. I played with it and ran the tests, and it looks like it should be good to go. This could make a really neat visual tutorial later on.

leouieda · 2020-04-10T08:39:56Z

Thanks again @jessepisel! I've been down that rabbit hole a bit too often in the past as well.

leouieda added 5 commits April 7, 2020 19:40

Add BlockShuffleSplit to the public API

b371131

Add doctest for BlockShuffleSplit in cross_val_score

0602244

Had to change the cross-validation to use a coordinate feature matrix instead of an array of indices. But this way, the BlockShuffleSplit is a true scikit-learn cross-validator.

Revert change to train_test_split

68c8fed

Leave it for a follow up PR.

Finish docstring and reduce the number of balancing iterations

3850984

leouieda commented Apr 8, 2020

leouieda added 2 commits April 8, 2020 17:28

Typo fix in the docstring

486484e

cross_val_score can take any CV from Verde as well

f8e5fba

leouieda changed the title ~~WIP Add a "blocked" version of ShuffleSplit cross-validator~~ Add a "blocked" version of ShuffleSplit cross-validator Apr 8, 2020

Disable pylint checks for sklearn variable names

3c492c4

leouieda changed the title ~~Add a "blocked" version of ShuffleSplit cross-validator~~ Add a block version of ShuffleSplit cross-validator Apr 8, 2020

leouieda requested a review from jessepisel April 8, 2020 17:24

jessepisel reviewed Apr 8, 2020

jessepisel approved these changes Apr 9, 2020

leouieda merged commit f768bab into master Apr 10, 2020

leouieda deleted the blockshufflesplit branch April 10, 2020 08:41

leouieda mentioned this pull request Apr 10, 2020

Add a blocked K-Fold cross-validator #254

Merged

6 tasks
