Add a block version of ShuffleSplit cross-validator #251
It imitates the scikit-learn classes but requires X to have easting, northing as the columns. This way, it can be used directly in vd.cross_val_score or other libraries doing modelling with scikit-learn. Still need to test this and add docstrings.
Had to change the cross-validation to use a coordinate feature matrix instead of an array of indices. But this way, the BlockShuffleSplit is a true scikit-learn cross-validator.
Leave it for a follow up PR.
    class BlockShuffleSplit(BaseCrossValidator):
        """
        Random permutation of spatial blocks cross-validator.
I'm not very happy with this description but I wanted to stay close to the sklearn description "Random permutation cross-validator". Any suggestions?
At first read it seems like a strange description, but I think it succinctly describes the class. The other ideas I had were just different versions saying the same thing but in more words. I'll think about this for a while
Hi @jessepisel, sorry to keep bugging you with review requests. Feel free to ignore me or tell me to go bother someone else 🙂 You have been working with geostatistics and ML, right? So I thought this might be interesting to you.
@leouieda This looks really cool. I think this will be useful for cross-validating spatial data via blocks rather than random splits. I will take a look this afternoon when I get a block of time, give it a review, and read through the linked article to make sure I understand how it works.
jessepisel left a comment
I am digging through the BaseCrossValidator in sklearn to double check and make sure I understand how _iter_test_indices works. Tests, examples, index, and references all look good to me.
Thanks, Jesse! That method yields the indices of the test set for each split. The |
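For readers following the `_iter_test_indices` discussion, here is a minimal sketch of the pattern in plain Python. It is not scikit-learn's code: `SketchCrossValidator` and `EveryOtherSplit` are hypothetical names made up for illustration. The idea it shows is that a subclass only yields the test indices for each split, and the base class derives the training indices as everything else.

```python
class SketchCrossValidator:
    """Simplified stand-in for the split/_iter_test_indices pattern.

    Hypothetical class for illustration only, not scikit-learn code.
    """

    def split(self, X):
        n_samples = len(X)
        for test_indices in self._iter_test_indices(X):
            test_set = set(test_indices)
            # The training set is the complement of the test set.
            train = [i for i in range(n_samples) if i not in test_set]
            test = sorted(test_set)
            yield train, test

    def _iter_test_indices(self, X):
        # Subclasses yield one collection of test indices per split.
        raise NotImplementedError


class EveryOtherSplit(SketchCrossValidator):
    """Hypothetical subclass: test on even rows first, then on odd rows."""

    def _iter_test_indices(self, X):
        yield [i for i in range(len(X)) if i % 2 == 0]
        yield [i for i in range(len(X)) if i % 2 == 1]


# Four made-up (easting, northing) points.
coordinates = [(0, 0), (1, 0), (2, 0), (3, 0)]
splits = list(EveryOtherSplit().split(coordinates))
```

Each item in `splits` is a `(train, test)` pair of index lists, which is the same shape of output that scikit-learn cross-validators produce from `split`.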
jessepisel left a comment
I had a chance to look through the rest of the code after the sklearn rabbit hole yesterday. I played with it and ran the tests, and it looks like it should be good to go. This could make a really neat visual tutorial later on.
Thanks again @jessepisel! I've been down that rabbit hole a bit too often in the past as well.
Cross-validation of spatial data with random splits can often lead to overestimation of accuracy (see Roberts et al., 2017 for a nice overview). To account for this, the splits can be done by spatial blocks. In this case, the data are partitioned into blocks and blocks are split randomly. That guarantees some independence of measurements between blocks assigned to test/train sets. This change adds the `BlockShuffleSplit` cross-validator which does the random splitting of blocks in a scikit-learn compatible class. It can be used with `verde.cross_val_score` like any other scikit-learn cross-validator. In fact, it can be used entirely outside of Verde for any type of machine learning on spatial data. The class takes care of balancing the random splits so that the right amount of train/test data are included in each (since blocks can have wildly different numbers of points).

TODO:

- `BlockShuffleSplit`

Follow up PRs will:

- `train_test_split` (either by adding a `blocked=False` argument or creating a `train_test_split_blocks` function).

Reminders:

- `make format` and `make check` to make sure the code follows the style guide.
- `doc/api/index.rst` and the base `__init__.py` file for the package.
- `AUTHORS.md` file (if you haven't already) in case you'd like to be listed as an author on the Zenodo archive of the next release.
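The block splitting idea from the description can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not Verde's implementation: the helper names `block_ids` and `block_shuffle_split`, the `spacing`, `test_size`, and `seed` parameters, and the sample coordinates are all made up here, and the sketch leaves out the balancing of train/test sizes that the class performs.

```python
import math
import random


def block_ids(coordinates, spacing):
    """Label each (easting, northing) point with the grid block it falls in."""
    return [
        (math.floor(easting / spacing), math.floor(northing / spacing))
        for easting, northing in coordinates
    ]


def block_shuffle_split(coordinates, spacing, test_size=0.25, seed=0):
    """Return (train, test) point indices from one random split of the blocks."""
    labels = block_ids(coordinates, spacing)
    blocks = sorted(set(labels))
    # Shuffle whole blocks, not individual points.
    random.Random(seed).shuffle(blocks)
    n_test = max(1, round(test_size * len(blocks)))
    test_blocks = set(blocks[:n_test])
    # Every point follows its block into either the test or the train set.
    test = [i for i, label in enumerate(labels) if label in test_blocks]
    train = [i for i, label in enumerate(labels) if label not in test_blocks]
    return train, test


# Eight made-up points spread over a 2 x 2 arrangement of blocks of size 10.
coordinates = [(1, 1), (2, 3), (11, 2), (12, 8), (3, 14), (4, 17), (15, 15), (18, 12)]
train, test = block_shuffle_split(coordinates, spacing=10)
```

Because points are assigned by block, no block contributes to both sets, which is the source of the independence between train and test data mentioned above. With blocks holding very different numbers of points, a naive split like this one can miss the requested test fraction, which is why the class balances its splits.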