Skip to content

test: replace fetch_openml() with local data make_classification()#430

Open
shamykyzer wants to merge 6 commits intomainfrom
410-openml-ci-tests
Open

test: replace fetch_openml() with local data make_classification()#430
shamykyzer wants to merge 6 commits intomainfrom
410-openml-ci-tests

Conversation

@shamykyzer
Copy link
Copy Markdown
Contributor

@shamykyzer shamykyzer self-assigned this Mar 26, 2026
@shamykyzer shamykyzer marked this pull request as ready for review March 26, 2026 23:08
@shamykyzer
Copy link
Copy Markdown
Contributor Author

Hello @rpreen, could you please have a look at this?

  1. test_factory.py - The TPR/FPR/score assertions are hardcoded to the real nursery dataset values. With synthetic data the metrics shift (TPR 0.958 vs expected 0.91).

Can I widen the tolerances or use range checks since this test verifies the factory pipeline, not model performance?

  1. test_adaboost_nondisclosive - min_samples_leaf=200 with ~40 training samples means the tree can never split. But AdaBoost rejects it as worse than random.

Also I believe the adaboost test is a pre-existing bug: min_samples_leaf=200 on ~40 samples means the tree can never split, and the missing random_state makes it pass or fail depending on test ordering.

Can you confirm that I am not misunderstanding?

Thanks

@shamykyzer shamykyzer requested a review from rpreen March 27, 2026 03:26
@rpreen
Copy link
Copy Markdown
Contributor

rpreen commented Mar 30, 2026

If there's a random seed set, does the test number need to be widened or just slightly adjusted? If a slight tweak is needed because it's using synth data that seems fine to me.

The AdaBoost test should also have the seed set (they all should really). It looks like originally the disclosive and non-disclosive tests were in a single function which did set the seed to 42 (in fact I explicitly added the setting of the seed in #318 because it was causing problems) but then in #367 it was split into two functions and the random seed was removed and there were some other parameter changes which may have changed the behaviour. Perhaps you can use the original as a guide since - @jim-smith will know more than me as I didn't work on that test.

@rpreen
Copy link
Copy Markdown
Contributor

rpreen commented Mar 30, 2026

There must be some way of avoiding the code duplication with the data creation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Tests] Replace network calls to fetch data from OpenML with local data

2 participants