@albertvillanova albertvillanova commented Sep 28, 2022

This PR speeds up the PackagedDatasetTest tests in CI. For Ubuntu (latest):

  • Duration (without parallelism) before: 334.78s (5.58m)
  • Duration (without parallelism) afterwards: 0.48s

The approach is to pass a dummy data_files argument when loading the builder, so that the data files are not inferred by a slow scan of the entire root directory of the repo.
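The idea can be illustrated with a small, self-contained sketch. This is not the library's code: `infer_data_files` and `resolve_data_files` are hypothetical names used only to show why supplying the file list up front avoids the directory walk.

```python
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple


def infer_data_files(root: Path) -> List[str]:
    """Hypothetical stand-in for the slow path: walk everything under root."""
    return sorted(str(p) for p in root.rglob("*") if p.is_file())


def resolve_data_files(
    root: Path, data_files: Optional[List[str]] = None
) -> Tuple[List[str], bool]:
    """Return (files, scanned). An explicit data_files skips the directory walk."""
    if data_files is not None:
        return list(data_files), False
    return infer_data_files(root), True


def demo() -> Tuple[Tuple[List[str], bool], Tuple[List[str], bool]]:
    # Build a tiny throwaway tree so the example does not depend on any repo.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for i in range(3):
            (root / f"file_{i}.csv").write_text("a,b\n1,2\n")
        slow = resolve_data_files(root)  # walks the tree
        fast = resolve_data_files(root, data_files=["dummy.csv"])  # no walk
    return slow, fast
```

In a large checkout the walk touches every file under the root, which is presumably where the per-test time went; with a dummy list the cost no longer depends on the size of the repo.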

Total duration of PackagedDatasetTest

| OS      | Before  | Afterwards | Improvement |
|---------|---------|------------|-------------|
| Linux   | 334.78s | 0.48s      | x700        |
| Windows | 513.02s | 1.09s      | x500        |
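The improvement factors are rounded. A quick check of the underlying ratios, using only the durations reported above (the `speedup` helper is just for illustration):

```python
def speedup(before_s: float, after_s: float) -> float:
    """Ratio of the old duration to the new one."""
    return before_s / after_s


linux = speedup(334.78, 0.48)
windows = speedup(513.02, 1.09)
print(f"Linux: x{linux:.0f}, Windows: x{windows:.0f}")
# prints "Linux: x697, Windows: x471"
```

Rounded to one significant figure, these give the x700 and x500 figures quoted for Linux and Windows.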

Durations by each individual sub-test

More accurate durations, measured by running the tests on GitHub, for Linux (latest).

Before this PR, the total test time (without parallelism) for tests/test_dataset_common.py::PackagedDatasetTest was 334.78s (5.58m):

39.07s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_imagefolder
38.94s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_audiofolder
34.18s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_parquet
34.12s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_csv
34.00s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_pandas
34.00s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_text
33.86s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_json
10.39s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_class_audiofolder
6.50s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_configs_audiofolder
6.46s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_configs_imagefolder
6.40s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_class_imagefolder
5.77s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_class_csv
5.77s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_class_text
5.74s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_configs_parquet
5.69s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_class_json
5.68s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_configs_pandas
5.67s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_class_parquet
5.67s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_class_pandas
5.66s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_configs_json
5.66s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_configs_csv
5.55s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_configs_text

(42 durations < 0.005s hidden.)

With this PR: 0.48s

0.09s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_audiofolder
0.08s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_csv
0.08s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_imagefolder
0.06s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_json
0.05s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_class_audiofolder
0.05s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_parquet
0.04s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_pandas
0.03s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_text

(55 durations < 0.005s hidden.)
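The reported totals can be cross-checked against the per-test lines. The sketch below sums the "call" durations from the report with this PR; `total_call_time` is a hypothetical helper, and `Decimal` is used so the hundredths add up exactly.

```python
from decimal import Decimal


def total_call_time(report: str) -> Decimal:
    """Sum the '<seconds>s call <test id>' lines of a pytest durations report."""
    total = Decimal("0")
    for line in report.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[1] == "call" and parts[0].endswith("s"):
            total += Decimal(parts[0][:-1])  # strip the trailing "s"
    return total


after_report = """\
0.09s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_audiofolder
0.08s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_csv
0.08s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_imagefolder
0.06s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_json
0.05s call     tests/test_dataset_common.py::PackagedDatasetTest::test_builder_class_audiofolder
0.05s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_parquet
0.04s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_pandas
0.03s call     tests/test_dataset_common.py::PackagedDatasetTest::test_load_dataset_offline_text
"""

print(total_call_time(after_report))  # prints 0.48
```

The entries hidden for being under 0.005s are not included, so the agreement with the 0.48s total holds only to the displayed precision. Summing the 21 listed lines from before the PR the same way gives 334.78s.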


@albertvillanova albertvillanova changed the title from "Improve PackagedDatasetTest CI performance speed by x6k" to "Improve CI performance speed of PackagedDatasetTest" Sep 29, 2022
@albertvillanova albertvillanova marked this pull request as ready for review September 30, 2022 13:06
@albertvillanova
Member Author

There was a CI error which seemed unrelated: https://github.com/huggingface/datasets/actions/runs/3143581330/jobs/5111807056

FAILED tests/test_load.py::test_load_dataset_private_zipped_images[True] - FileNotFoundError: https://hub-ci.huggingface.co/datasets/__DUMMY_TRANSFORMERS_USER__/repo_zipped_img_data-16643808721979/resolve/75c3fc424a3b898a828b2b3fd84d96da4703228a/data.zip

It disappeared after merging the main branch.

@lhoestq lhoestq left a comment

Nice speed-up!

@albertvillanova albertvillanova merged commit a4a571a into huggingface:main Sep 30, 2022
@albertvillanova albertvillanova deleted the ci-packaged-faster branch September 30, 2022 16:03