test: replace fetch_openml() with local data make_classification() by shamykyzer · Pull Request #430 · AI-SDC/SACRO-ML

shamykyzer · 2026-03-26T22:39:43Z

Replaces fetch_openml() calls in test fixtures with data generated locally by sklearn.datasets.make_classification(), so that CI no longer fails because of network issues.

Closes #410: [Tests] Replace network calls to fetch data from OpenML with local data
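A minimal sketch of the kind of swap described above, assuming a fixture that previously downloaded a dataset with fetch_openml(). The helper name `make_local_data`, the sample and feature counts, and the seed are illustrative assumptions, not the values used in this PR.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def make_local_data(n_samples=200, n_features=8, seed=42):
    """Generate a small synthetic classification dataset with no network access.

    Hypothetical stand-in for a fetch_openml()-based fixture.
    """
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=4,
        n_redundant=0,
        n_classes=2,
        random_state=seed,  # fixed seed so every CI run sees the same data
    )
    # Stratify so the train and test splits keep the same class balance.
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=seed)


X_train, X_test, y_train, y_test = make_local_data()
```

Because the data is generated in-process, the fixture has no dependency on OpenML being reachable.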

shamykyzer · 2026-03-27T03:26:10Z

Hello @rpreen, could you please have a look at this?

test_factory.py - The TPR/FPR/score assertions are hardcoded to the real nursery dataset values. With synthetic data the metrics shift (TPR 0.958 vs expected 0.91).

Can I widen the tolerances or use range checks since this test verifies the factory pipeline, not model performance?
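To make the two options concrete, here is a rough sketch using the TPR figures quoted above (0.958 observed, 0.91 expected). The tolerance values and the range bounds are assumptions chosen for illustration; the actual tolerance used in test_factory.py is not shown in this thread.

```python
import math

OBSERVED_TPR = 0.958  # value reported with synthetic data
EXPECTED_TPR = 0.91   # value hardcoded for the nursery dataset


def within_tolerance(actual, expected, abs_tol):
    """Return True if actual is within abs_tol of expected."""
    return math.isclose(actual, expected, abs_tol=abs_tol)


def within_range(actual, low, high):
    """Return True if actual lies in the closed interval [low, high]."""
    return low <= actual <= high


# Option 1: keep a point estimate and widen the tolerance.
tight_ok = within_tolerance(OBSERVED_TPR, EXPECTED_TPR, abs_tol=0.01)
wide_ok = within_tolerance(OBSERVED_TPR, EXPECTED_TPR, abs_tol=0.05)

# Option 2: only check that the metric falls in a plausible range.
range_ok = within_range(OBSERVED_TPR, 0.8, 1.0)
```

The range check says less about the exact metric but is less likely to break when the data source changes, which may suit a test aimed at the pipeline rather than at model quality.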

test_adaboost_nondisclosive - min_samples_leaf=200 with ~40 training samples means the tree can never split, so AdaBoost rejects the base estimator as worse than random.

I believe this is a pre-existing bug in the adaboost test rather than something introduced here: besides the tree never being able to split, the missing random_state makes the test pass or fail depending on test ordering.
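The "can never split" part can be checked in isolation. The sketch below fits a plain DecisionTreeClassifier with min_samples_leaf=200 on 40 synthetic samples; the feature count and seed are arbitrary assumptions. It does not run AdaBoost itself, it only shows that the base tree degenerates to a single leaf that predicts one class for every input.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# 40 samples, far fewer than the 200 required in each leaf.
X, y = make_classification(n_samples=40, n_features=4, random_state=0)

tree = DecisionTreeClassifier(min_samples_leaf=200, random_state=0).fit(X, y)

n_nodes = tree.tree_.node_count  # total nodes in the fitted tree
depth = tree.get_depth()         # 0 means the root is the only node
n_distinct_predictions = len(set(tree.predict(X).tolist()))
```

A constant predictor like this can sit at or near chance level on a roughly balanced binary problem, which is consistent with the "worse than random" rejection described above.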

Can you confirm that I am not misunderstanding?

Thanks

rpreen · 2026-03-30T09:47:32Z

If a random seed is set, do the expected values in the test really need wider tolerances, or just a slight adjustment? If a slight tweak is needed because the test now uses synthetic data, that seems fine to me.

The AdaBoost test should also have the seed set (they all should, really). It looks like the disclosive and non-disclosive tests were originally a single function that did set the seed to 42 (in fact I explicitly added the seed in #318 because its absence was causing problems). In #367 that function was split in two, the random seed was removed, and some other parameters were changed, which may have altered the behaviour. Perhaps you can use the original as a guide. @jim-smith will know more than me, as I didn't work on that test.
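As a sketch of what setting the seed buys, the snippet below fits AdaBoostClassifier twice with the same random_state and compares the predictions. The seed value 42 is the one mentioned above; the helper name `fit_predict`, the dataset size, and n_estimators are assumptions for illustration and are not taken from the actual test.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

SEED = 42


def fit_predict(seed):
    """Generate data and fit AdaBoost, seeding both steps (hypothetical helper)."""
    X, y = make_classification(n_samples=200, n_features=8, random_state=seed)
    model = AdaBoostClassifier(n_estimators=10, random_state=seed)
    return model.fit(X, y).predict(X)


first = fit_predict(SEED)
second = fit_predict(SEED)

# With both the data and the model seeded, repeated runs agree exactly.
runs_match = bool((first == second).all())
```

Seeding both the data generation and the estimator removes the dependence on global random state, which is what lets results vary with test ordering.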

rpreen · 2026-03-30T09:48:16Z

There must be some way of avoiding the code duplication with the data creation?
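One possible way to remove the duplication, sketched under assumptions: a single cached helper that every test calls instead of building its own data. The name `synthetic_dataset`, its parameters, and the two example tests are hypothetical and are not code from this PR.

```python
from functools import lru_cache

from sklearn.datasets import make_classification


@lru_cache(maxsize=None)
def synthetic_dataset(n_samples=100, n_features=6, seed=42):
    """Single place where test data is created (hypothetical helper)."""
    X, y = make_classification(
        n_samples=n_samples, n_features=n_features, random_state=seed
    )
    # Mark the arrays read-only so one test cannot mutate data seen by another.
    X.setflags(write=False)
    y.setflags(write=False)
    return X, y


def test_shapes():
    X, y = synthetic_dataset()
    assert X.shape == (100, 6)
    assert y.shape == (100,)


def test_same_data_reused():
    # The cache returns the same objects, so the data is generated only once.
    assert synthetic_dataset()[0] is synthetic_dataset()[0]
```

In a pytest suite the same function could instead be exposed as a fixture in a shared conftest.py, so individual test modules would not need to import it.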

test: replace fetch_openml() with local data make_classification()

901ac1c

shamykyzer self-assigned this Mar 26, 2026

pre-commit-ci bot and others added 3 commits March 26, 2026 22:39

style: pre-commit fixes

2b64b21

ci: fixed failed ruff check

cdaa644

test: fix flaky test_factory and test_adaboost assertions

eb6a732

shamykyzer marked this pull request as ready for review March 26, 2026 23:08

shamykyzer requested a review from rpreen March 27, 2026 03:26

test: revert test_factory score tolerance back to original

8296b8e

Merge branch 'main' into 410-openml-ci-tests

2eb9d01

shamykyzer wants to merge 6 commits into main from 410-openml-ci-tests
