@leouieda leouieda commented Apr 8, 2020

Cross-validation of spatial data with random splits often overestimates accuracy (see Roberts et al., 2017 for a nice overview). To account for this, the splits can be done by spatial blocks: the data are partitioned into blocks and the blocks, rather than the individual points, are split randomly. That guarantees some independence between the measurements in blocks assigned to the train and test sets.

This change adds the BlockShuffleSplit cross-validator, which does the random splitting of blocks in a scikit-learn compatible class. It can be used with verde.cross_val_score like any other scikit-learn cross-validator. In fact, it can be used entirely outside of Verde for any type of machine learning on spatial data. The class takes care of balancing the random splits so that each one contains the right proportion of train/test data (blocks can have wildly different numbers of points).
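To make the idea concrete, here is a minimal NumPy-only sketch of a single blocked split. It is not the BlockShuffleSplit implementation from this PR: the function name `block_shuffle_split` and its parameters are hypothetical, and the greedy "add blocks until the test set is big enough" loop is a simplified stand-in for the balancing that the class does.

```python
import numpy as np


def block_shuffle_split(easting, northing, spacing, test_size=0.2, seed=0):
    """Return (train, test) index arrays, keeping whole blocks together.

    Hypothetical helper for illustration only (not part of Verde).
    """
    easting = np.asarray(easting, dtype=float)
    northing = np.asarray(northing, dtype=float)
    # Label each point with the square block (grid cell) it falls in.
    col = np.floor((easting - easting.min()) / spacing).astype(int)
    row = np.floor((northing - northing.min()) / spacing).astype(int)
    block_ids = row * (col.max() + 1) + col
    # Shuffle the unique block labels, not the individual points.
    rng = np.random.default_rng(seed)
    blocks = rng.permutation(np.unique(block_ids))
    # Greedily add blocks to the test set until it holds at least
    # test_size of the points (blocks can have very different sizes).
    target = test_size * easting.size
    is_test = np.zeros(easting.size, dtype=bool)
    for block in blocks:
        if is_test.sum() >= target:
            break
        is_test |= block_ids == block
    return np.flatnonzero(~is_test), np.flatnonzero(is_test)


# Synthetic scattered points, just to exercise the function.
rng = np.random.default_rng(42)
easting = rng.uniform(0, 100, size=500)
northing = rng.uniform(0, 100, size=500)
train, test = block_shuffle_split(easting, northing, spacing=10, test_size=0.3)
```

Because every point in a block goes to the same side of the split, no test point shares a block with a training point, which is what gives the train and test sets some spatial independence.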

TODO:

  • Finish the docstring for BlockShuffleSplit

Follow up PRs will:

  1. Add support for this CV in train_test_split (either by adding a blocked=False argument or creating a train_test_split_blocks function).
  2. Add a section in the tutorial about blocked cross-validation and use this class for splitting in other examples.

Reminders:

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst and the base __init__.py file for the package.
  • Write detailed docstrings for all functions/classes/methods. It often helps to design better code if you write the docstrings first.
  • If adding new functionality, add an example to the docstring, gallery, and/or tutorials.
  • Add your full name, affiliation, and ORCID (optional) to the AUTHORS.md file (if you haven't already) in case you'd like to be listed as an author on the Zenodo archive of the next release.

leouieda added 5 commits April 7, 2020 19:40
It imitates the scikit-learn classes but requires X to have easting,
northing as the columns. This way, it can be used directly in
vd.cross_val_score or other libraries doing modelling with scikit-learn.
Still need to test this and add docstrings.
Had to change the cross-validation to use a coordinate feature matrix
instead of an array of indices. But this way, the BlockShuffleSplit is a
true scikit-learn cross-validator.
Leave it for a follow up PR.

class BlockShuffleSplit(BaseCrossValidator):
    """
    Random permutation of spatial blocks cross-validator.
Member Author

@leouieda leouieda Apr 8, 2020

I'm not very happy with this description but I wanted to stay close to the sklearn description "Random permutation cross-validator". Any suggestions?

Contributor

At first read it seems like a strange description, but I think it succinctly describes the class. The other ideas I had were just different versions saying the same thing in more words. I'll think about this for a while.

@leouieda leouieda changed the title WIP Add a "blocked" version of ShuffleSplit cross-validator Add a "blocked" version of ShuffleSplit cross-validator Apr 8, 2020
@leouieda leouieda changed the title Add a "blocked" version of ShuffleSplit cross-validator Add a block version of ShuffleSplit cross-validator Apr 8, 2020
@leouieda leouieda requested a review from jessepisel April 8, 2020 17:24
@leouieda
Member Author

leouieda commented Apr 8, 2020

Hi @jessepisel, sorry to keep bugging you with review requests. Feel free to ignore me or tell me to go bother someone else 🙂 You have been working with geostatistics and ML, right? So I thought this might be interesting to you.

@jessepisel
Contributor

@leouieda This looks really cool. I think it will be useful for cross-validating spatial data by blocks rather than by random splits. I will review it this afternoon when I get a block of time, and I'll read through the linked article to make sure I understand how it works.

Contributor

@jessepisel jessepisel left a comment

I am digging through BaseCrossValidator in sklearn to make sure I understand how _iter_test_indices works. Tests, examples, index, and references all look good to me.

@leouieda
Member Author

leouieda commented Apr 9, 2020

Thanks, Jesse! That method yields the indices of the test set for each split. The BaseCrossValidator.split method uses it to yield train and test indices (basically doing np.logical_not on the test mask to get the train set). The rest of the methods are just there for compatibility. I only overrode split to check the inputs and so that I could specify in the docstring that X has coordinates.
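For readers following along, the pattern described above can be sketched roughly as follows. This is a simplified, hypothetical stand-in (`TinyCrossValidator` is not a scikit-learn or Verde class, and the real methods also accept `y` and `groups` arguments); it only illustrates how a `split` method can derive the train indices from the test indices with `np.logical_not`.

```python
import numpy as np


class TinyCrossValidator:
    """Hypothetical stand-in that mimics the pattern, not scikit-learn code."""

    def __init__(self, test_index_sets):
        # One collection of test indices per split (assumed given here).
        self.test_index_sets = test_index_sets

    def _iter_test_indices(self, X):
        # A subclass only has to say which rows are in each test set.
        for indices in self.test_index_sets:
            yield np.asarray(indices)

    def split(self, X):
        all_indices = np.arange(len(X))
        for test_indices in self._iter_test_indices(X):
            # Turn the test indices into a boolean mask...
            test_mask = np.zeros(len(X), dtype=bool)
            test_mask[test_indices] = True
            # ...and take everything that is not in it as the train set.
            train = all_indices[np.logical_not(test_mask)]
            test = all_indices[test_mask]
            yield train, test


X = np.zeros((6, 2))  # 6 points with (easting, northing) columns
cv = TinyCrossValidator([[0, 1], [4, 5]])
splits = [(train.tolist(), test.tolist()) for train, test in cv.split(X)]
# splits == [([2, 3, 4, 5], [0, 1]), ([0, 1, 2, 3], [4, 5])]
```

With this division of labour, a new cross-validator only needs to decide which points are tested in each split; the train/test bookkeeping stays in one place.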

Contributor

@jessepisel jessepisel left a comment

I had a chance to look through the rest of the code after the sklearn rabbit hole yesterday. I played with it and ran the tests, and it looks good to go. This could make a really neat visual tutorial later on.

@leouieda
Member Author

Thanks again @jessepisel! I've been down that rabbit hole a bit too often in the past as well.

@leouieda leouieda merged commit f768bab into master Apr 10, 2020
@leouieda leouieda deleted the blockshufflesplit branch April 10, 2020 08:41