Fix `Overview.ipynb` & detach Jupyter Notebooks from `datasets` repository #5902

alvarobartt · 2023-05-26T10:25:01Z

What's in this PR?

This PR solves #5887. The tokenizer and the model were mismatched: the tokenizer was `bert-base-cased` while the model was `distilbert-base-cased`, in both the PyTorch and TensorFlow alternatives. Since DistilBERT doesn't use `token_type_ids`, passing `**batch` to the model was failing: the batch contained `input_ids`, `attention_mask`, `token_type_ids`, `start_positions` and `end_positions`, and the model does not accept `token_type_ids`.
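To illustrate the failure mode, here is a minimal, library-free sketch. The `forward` function is a hypothetical stand-in for a model whose forward pass has no `token_type_ids` parameter, and `keep_accepted_keys` is an illustrative helper, not something from the notebook. The PR itself appears to fix the problem by aligning the tokenizer and model checkpoints rather than by filtering keys.

```python
import inspect


def forward(input_ids, attention_mask, start_positions=None, end_positions=None):
    """Hypothetical stand-in for a forward pass that takes no token_type_ids."""
    return len(input_ids)


def keep_accepted_keys(fn, batch):
    """Illustrative helper: keep only the batch entries that fn accepts."""
    accepted = inspect.signature(fn).parameters
    return {key: value for key, value in batch.items() if key in accepted}


# A batch shaped like the one described above (toy values).
batch = {
    "input_ids": [[101, 2054, 102]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],
    "start_positions": [1],
    "end_positions": [1],
}

try:
    forward(**batch)
    failed = False
except TypeError as err:
    # Unpacking the batch passes token_type_ids, which forward does not accept.
    failed = True
    print(f"forward(**batch) failed: {err}")

filtered = keep_accepted_keys(forward, batch)
result = forward(**filtered)
```

Using a tokenizer that matches the model checkpoint avoids the extra key in the first place, which is the cleaner fix.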

Besides that, `seqeval` was being used at the end to evaluate the model predictions, but only `evaluate` was being installed, so I've also included the `seqeval` installation.
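As a rough sketch of how a notebook could catch this kind of gap early, the snippet below checks whether the packages it needs are importable before running anything else. The helper name `missing_packages` and the printed command are illustrative assumptions, not part of the notebook.

```python
import importlib.util


def missing_packages(required):
    """Return the names in `required` that cannot be found for import."""
    return [name for name in required if importlib.util.find_spec(name) is None]


# Packages the evaluation step relies on, per the description above.
required = ["evaluate", "seqeval"]
missing = missing_packages(required)
if missing:
    print("pip install " + " ".join(missing))
```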

Finally, I've re-run everything in Google Colab, and every cell was successfully executed!

What was done on top of the original PR?

Based on the comments from @mariosasko and @stevhliu, I've updated this PR to also review `quickstart.mdx` and update what was needed. Besides that, we may eventually move the `Overview.ipynb` notebook to `huggingface/notebooks`, following @stevhliu's suggestions.

alvarobartt · 2023-05-26T10:26:57Z

Random fact: previous run was showing that the Hub was hosting 13336 datasets, while the most recent run shows 36662 👀🎉

HuggingFaceDocBuilderDev · 2023-05-26T10:29:34Z

The documentation is not available anymore as the PR was closed or merged.

mariosasko · 2023-06-09T16:57:05Z

Thanks!

However, I think we should stop linking this notebook and use the notebook version of the Quickstart doc page instead of it for easier maintenance (we would have the "Open in Colab" button in the Quickstart doc as Transformers does).

@stevhliu should be able to help with this. If I'm not mistaken, this can be done by adding the [[open in colab]] marker to the doc page.

Also, if some useful info from the Overview notebook is not in the docs, feel free to add it so we don't lose it 🙂.

alvarobartt · 2023-06-10T09:58:04Z

Cool, makes sense @mariosasko. I'll check both notebooks to see whether there's something in `Overview.ipynb` worth including in `docs/source/quickstart.mdx`, then remove `Overview.ipynb` and update the references in favour of `docs/source/quickstart.mdx`.

Are you OK if I do that @stevhliu @mariosasko? Thanks 🤗

alvarobartt · 2023-06-10T10:25:55Z

For the moment I've just updated `quickstart.mdx` to be more similar to `quicktour.mdx`. Regarding the `Overview.ipynb` notebook, I was planning to create a PR in https://github.com/huggingface/notebooks to add it there; does that make sense @stevhliu? Then I'd create a `README.md` in `notebooks/` in this repository, as `transformers` does, pointing to the related notebooks hosted in https://github.com/huggingface/notebooks. WDYT? 🤗

stevhliu

Thanks for the fix! I left a few nits but looks great overall 🤗

At the end of the Quickstart, I think it would be nice to add links in the What's Next section on how to create an image or audio dataset.

> regarding the Overview.ipynb notebook I was planning to create a PR in https://github.com/huggingface/notebooks to add it there

The notebooks in https://github.com/huggingface/notebooks are automatically generated by the doc-builder, so we can definitely enable that for Datasets by:

1. Creating a `_config.py` file in `docs/source` like this one here.
2. Adding this line to the build documentation workflow.
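The linked example file is not reproduced in this thread. A minimal sketch of what such a `_config.py` might look like, assuming it follows the convention used in the Transformers docs (an install cell prepended to every generated notebook), is shown below. The variable names and the install command are assumptions to be checked against the linked file.

```python
# docs/source/_config.py (hypothetical sketch, names assumed)

# Text of the first code cell to prepend to each generated notebook.
INSTALL_CONTENT = """
# Datasets installation
! pip install datasets
"""

# Cells added at the top of every notebook built from the docs.
notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
```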

I think it'd be nice to drop the Overview.ipynb notebook entirely for easier maintenance, as @mariosasko mentioned, and I think it'd also be better for users to go through the docs instead of the GitHub repository (it helps to keep all the information in one place). In the Datasets GitHub notebook folder, we can add a README.md with a link to the notebooks generated by the doc-builder. What do you think @mariosasko?

docs/source/quickstart.mdx

alvarobartt · 2023-06-13T09:27:36Z

Hi @stevhliu thanks for the feedback! Already applied your suggestions, I'll also add the pointers to both audio and image datasets in the "What's next" section.

Besides that, let me know if I can help with the notebook being hosted in huggingface/notebooks instead, and I'll happily do so!

mariosasko

Nice! Some comments:

docs/source/quickstart.mdx

notebooks/Overview.ipynb

docs/source/quickstart.mdx

alvarobartt · 2023-06-14T05:53:48Z

Thanks a lot for the detailed feedback @mariosasko, I'll apply the changes today!

stevhliu · 2023-06-14T15:39:31Z

> Besides that, let me know if I can help with the notebook being hosted in huggingface/notebooks instead, and I'll happily do so!

Awesome! If you're up for it, I think you can go ahead and open a PR with the changes I've outlined here to add the notebook building workflow.

mariosasko

I think we can soon merge :). Some nits:

docs/source/_config.py

docs/source/quickstart.mdx

notebooks/README.md

docs/source/quickstart.mdx

stevhliu

Added a few more nits, thanks for iterating on this and adding the notebook workflow! 🤗

docs/source/quickstart.mdx

notebooks/README.md

alvarobartt · 2023-06-29T07:07:46Z

Hi @stevhliu @mariosasko, sorry for the delay, I had a busy week. I'll tackle this either today or tomorrow to ideally close it before the weekend. Thanks again for the help and guidance 😄

alvarobartt · 2023-07-24T16:36:47Z

Hi guys @stevhliu @mariosasko sorry for the delay! I've resolved all the comments and applied your reviews 👍🏻 Let me know if this works and we can finally close this PR, thanks for the help in the meantime!

stevhliu

One more nit to resolve the link to the prepare_tf_dataset method.

Thanks for iterating on this and wrapping it up! 🤗

docs/source/quickstart.mdx

alvarobartt · 2023-07-25T07:45:38Z

> Thanks for iterating on this and wrapping it up! 🤗

No need to! Always a pleasure to collaborate with you guys 🤗

mariosasko

Thanks, great work indeed!

github-actions · 2023-07-25T13:49:09Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009814 | 0.011353 | -0.001539 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004632 | 0.011008 | -0.006376 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.103059 | 0.038508 | 0.064551 |
| read_batch_unformated after write_array2d | 0.090277 | 0.023109 | 0.067167 |
| read_batch_unformated after write_flattened_sequence | 0.389344 | 0.275898 | 0.113446 |
| read_batch_unformated after write_nested_sequence | 0.464536 | 0.323480 | 0.141056 |
| read_col_formatted_as_numpy after write_array2d | 0.008196 | 0.007986 | 0.000210 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003872 | 0.004328 | -0.000457 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.081912 | 0.004250 | 0.077662 |
| read_col_unformated after write_array2d | 0.073197 | 0.037052 | 0.036145 |
| read_col_unformated after write_flattened_sequence | 0.407545 | 0.258489 | 0.149056 |
| read_col_unformated after write_nested_sequence | 0.458035 | 0.293841 | 0.164194 |
| read_formatted_as_numpy after write_array2d | 0.037485 | 0.128546 | -0.091061 |
| read_formatted_as_numpy after write_flattened_sequence | 0.010141 | 0.075646 | -0.065505 |
| read_formatted_as_numpy after write_nested_sequence | 0.365998 | 0.419271 | -0.053273 |
| read_unformated after write_array2d | 0.065218 | 0.043533 | 0.021685 |
| read_unformated after write_flattened_sequence | 0.414091 | 0.255139 | 0.158952 |
| read_unformated after write_nested_sequence | 0.435617 | 0.283200 | 0.152417 |
| write_array2d | 0.028850 | 0.141683 | -0.112833 |
| write_flattened_sequence | 1.883510 | 1.452155 | 0.431355 |
| write_nested_sequence | 1.979986 | 1.492716 | 0.487269 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.236623 | 0.018006 | 0.218616 |
| get_batch_of_1024_rows | 0.467128 | 0.000490 | 0.466638 |
| get_first_row | 0.008273 | 0.000200 | 0.008074 |
| get_last_row | 0.000699 | 0.000054 | 0.000645 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.033061 | 0.037411 | -0.004350 |
| shard | 0.101381 | 0.014526 | 0.086856 |
| shuffle | 0.110862 | 0.176557 | -0.065695 |
| sort | 0.180982 | 0.737135 | -0.556154 |
| train_test_split | 0.113791 | 0.296338 | -0.182548 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.450805 | 0.215209 | 0.235596 |
| read 50000 | 4.478374 | 2.077655 | 2.400719 |
| read_batch 50000 10 | 2.190814 | 1.504120 | 0.686694 |
| read_batch 50000 100 | 1.976726 | 1.541195 | 0.435532 |
| read_batch 50000 1000 | 2.078527 | 1.468490 | 0.610037 |
| read_formatted numpy 5000 | 0.569150 | 4.584777 | -4.015627 |
| read_formatted pandas 5000 | 4.557790 | 3.745712 | 0.812078 |
| read_formatted tensorflow 5000 | 3.794964 | 5.269862 | -1.474898 |
| read_formatted torch 5000 | 2.555689 | 4.565676 | -2.009987 |
| read_formatted_batch numpy 5000 10 | 0.067380 | 0.424275 | -0.356896 |
| read_formatted_batch numpy 5000 1000 | 0.008741 | 0.007607 | 0.001134 |
| shuffled read 5000 | 0.536913 | 0.226044 | 0.310868 |
| shuffled read 50000 | 5.364588 | 2.268929 | 3.095659 |
| shuffled read_batch 50000 10 | 2.725602 | 55.444624 | -52.719022 |
| shuffled read_batch 50000 100 | 2.332012 | 6.876477 | -4.544465 |
| shuffled read_batch 50000 1000 | 2.560550 | 2.142072 | 0.418477 |
| shuffled read_formatted numpy 5000 | 0.672490 | 4.805227 | -4.132738 |
| shuffled read_formatted_batch numpy 5000 10 | 0.153629 | 6.500664 | -6.347035 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.070583 | 0.075469 | -0.004886 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.620083 | 1.841788 | -0.221704 |
| map fast-tokenizer batched | 23.094248 | 8.074308 | 15.019939 |
| map identity | 17.797625 | 10.191392 | 7.606233 |
| map identity batched | 0.167993 | 0.680424 | -0.512430 |
| map no-op batched | 0.021151 | 0.534201 | -0.513050 |
| map no-op batched numpy | 0.470216 | 0.579283 | -0.109067 |
| map no-op batched pandas | 0.515492 | 0.434364 | 0.081128 |
| map no-op batched pytorch | 0.666359 | 0.540337 | 0.126021 |
| map no-op batched tensorflow | 0.772928 | 1.386936 | -0.614008 |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007853 | 0.011353 | -0.003500 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004627 | 0.011008 | -0.006381 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.079803 | 0.038508 | 0.041295 |
| read_batch_unformated after write_array2d | 0.091562 | 0.023109 | 0.068453 |
| read_batch_unformated after write_flattened_sequence | 0.488537 | 0.275898 | 0.212639 |
| read_batch_unformated after write_nested_sequence | 0.579207 | 0.323480 | 0.255728 |
| read_col_formatted_as_numpy after write_array2d | 0.006579 | 0.007986 | -0.001406 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003946 | 0.004328 | -0.000382 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.080224 | 0.004250 | 0.075973 |
| read_col_unformated after write_array2d | 0.074499 | 0.037052 | 0.037446 |
| read_col_unformated after write_flattened_sequence | 0.488292 | 0.258489 | 0.229803 |
| read_col_unformated after write_nested_sequence | 0.569246 | 0.293841 | 0.275405 |
| read_formatted_as_numpy after write_array2d | 0.039994 | 0.128546 | -0.088553 |
| read_formatted_as_numpy after write_flattened_sequence | 0.012867 | 0.075646 | -0.062780 |
| read_formatted_as_numpy after write_nested_sequence | 0.092563 | 0.419271 | -0.326709 |
| read_unformated after write_array2d | 0.061656 | 0.043533 | 0.018124 |
| read_unformated after write_flattened_sequence | 0.488271 | 0.255139 | 0.233132 |
| read_unformated after write_nested_sequence | 0.550651 | 0.283200 | 0.267451 |
| write_array2d | 0.032078 | 0.141683 | -0.109605 |
| write_flattened_sequence | 1.874440 | 1.452155 | 0.422286 |
| write_nested_sequence | 1.973480 | 1.492716 | 0.480763 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.238789 | 0.018006 | 0.220782 |
| get_batch_of_1024_rows | 0.460237 | 0.000490 | 0.459748 |
| get_first_row | 0.000500 | 0.000200 | 0.000300 |
| get_last_row | 0.000067 | 0.000054 | 0.000012 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.034961 | 0.037411 | -0.002450 |
| shard | 0.102696 | 0.014526 | 0.088170 |
| shuffle | 0.117772 | 0.176557 | -0.058784 |
| sort | 0.183865 | 0.737135 | -0.553270 |
| train_test_split | 0.119216 | 0.296338 | -0.177122 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.528894 | 0.215209 | 0.313685 |
| read 50000 | 5.303954 | 2.077655 | 3.226300 |
| read_batch 50000 10 | 2.897505 | 1.504120 | 1.393385 |
| read_batch 50000 100 | 2.475898 | 1.541195 | 0.934703 |
| read_batch 50000 1000 | 2.553479 | 1.468490 | 1.084988 |
| read_formatted numpy 5000 | 0.625847 | 4.584777 | -3.958930 |
| read_formatted pandas 5000 | 4.656595 | 3.745712 | 0.910882 |
| read_formatted tensorflow 5000 | 3.745170 | 5.269862 | -1.524691 |
| read_formatted torch 5000 | 2.470922 | 4.565676 | -2.094755 |
| read_formatted_batch numpy 5000 10 | 0.066908 | 0.424275 | -0.357367 |
| read_formatted_batch numpy 5000 1000 | 0.009172 | 0.007607 | 0.001565 |
| shuffled read 5000 | 0.572695 | 0.226044 | 0.346650 |
| shuffled read 50000 | 5.753428 | 2.268929 | 3.484499 |
| shuffled read_batch 50000 10 | 3.033226 | 55.444624 | -52.411398 |
| shuffled read_batch 50000 100 | 2.677280 | 6.876477 | -4.199197 |
| shuffled read_batch 50000 1000 | 2.908857 | 2.142072 | 0.766785 |
| shuffled read_formatted numpy 5000 | 0.681595 | 4.805227 | -4.123632 |
| shuffled read_formatted_batch numpy 5000 10 | 0.154602 | 6.500664 | -6.346062 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.072608 | 0.075469 | -0.002861 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.738550 | 1.841788 | -0.103237 |
| map fast-tokenizer batched | 25.090637 | 8.074308 | 17.016329 |
| map identity | 18.371478 | 10.191392 | 8.180086 |
| map identity batched | 0.207357 | 0.680424 | -0.473067 |
| map no-op batched | 0.023396 | 0.534201 | -0.510805 |
| map no-op batched numpy | 0.505663 | 0.579283 | -0.073620 |
| map no-op batched pandas | 0.503137 | 0.434364 | 0.068773 |
| map no-op batched pytorch | 0.598015 | 0.540337 | 0.057678 |
| map no-op batched tensorflow | 0.714122 | 1.386936 | -0.672814 |

alvarobartt · 2023-07-25T13:49:57Z

Just as a heads up @mariosasko: the `quickstart.ipynb` Jupyter Notebook has been built at https://github.com/huggingface/notebooks/blob/main/datasets_doc/en/quickstart.ipynb, while the URLs in here point to https://github.com/huggingface/notebooks/blob/main/datasets_doc/quickstart.ipynb instead. Should we update that?
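The difference between the two URLs is only the language subfolder. As a small sketch, assuming the built notebooks live under a per-language directory such as `en/`, the two forms can be generated as follows (the helper `notebook_url` is hypothetical and only illustrates the path layout):

```python
BASE_URL = "https://github.com/huggingface/notebooks/blob/main/datasets_doc"


def notebook_url(page, language=None):
    """Build the GitHub URL of a generated notebook, optionally under a language folder."""
    parts = [BASE_URL]
    if language:
        parts.append(language)
    parts.append(f"{page}.ipynb")
    return "/".join(parts)


built_url = notebook_url("quickstart", language="en")  # where the notebook was built
linked_url = notebook_url("quickstart")  # what the links in this PR point to
```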

Fix and re-run Overview.ipynb

a1fbcf7

alvarobartt mentioned this pull request May 26, 2023

HuggingsFace dataset example give error #5887

Closed

alvarobartt mentioned this pull request May 26, 2023

Relax ci.yml trigger for pull_request based on modified paths #5903

Open

Update quickstart.mdx

89b1476

* Re-ordered subsections so that `Text` goes first
* Add machine learning frameworks missing install instructions
* Add [[open-in-colab]] button
* Add missing license

alvarobartt added 2 commits June 10, 2023 10:32

Fix references to new sub-sections

b0b1513

Remove not required exclamation marks

066dcf5

My guess was that the exclamation mark was used for highlighting but it's not, so reverted: 🤗 Datasets! -> 🤗 Datasets

stevhliu reviewed Jun 12, 2023

docs/source/quickstart.mdx (outdated, resolved)

docs/source/quickstart.mdx (outdated, resolved)

Apply suggestions from code review

97702ce

Co-authored-by: Steven Liu <[email protected]>

alvarobartt requested a review from stevhliu June 13, 2023 18:27

mariosasko reviewed Jun 13, 2023

alvarobartt added 2 commits June 15, 2023 17:12

Add datasets_doc to host notebooks in hugginface/notebooks

f22988b

Add notebooks/README.md

43c7a9f

As of this commit, the URLs throw a 404 as those are pointing to unpushed notebooks, to be pushed as part of `build_documentation`

alvarobartt changed the title ~~Align bert-base-cased usage, install missing seqeval, and re-run Overview.ipynb~~ Fix Overview.ipynb & detach Jupyter Notebooks from datasets repository Jun 15, 2023

Apply suggestions from code review

cf3c837

Co-authored-by: Mario Šaško <[email protected]>

alvarobartt requested a review from mariosasko June 15, 2023 15:44

mariosasko reviewed Jun 15, 2023

stevhliu approved these changes Jun 16, 2023

docs/source/quickstart.mdx (outdated, resolved)

docs/source/quickstart.mdx (outdated, resolved)

docs/source/quickstart.mdx (outdated, resolved)

notebooks/README.md (resolved)

alvarobartt and others added 4 commits July 24, 2023 18:05

Apply suggestions from code review

1f5e852

Co-authored-by: Mario Šaško <[email protected]>

Revert Image and Text renames

bb355e9

Co-authored-by: Mario Šaško <[email protected]>

Remove reference to to_tf_dataset

aff2c32

Co-authored-by: Mario Šaško <[email protected]>
Co-authored-by: Steven Liu <[email protected]>

Add deprecation message in Overview.ipynb

4597136

In favor of https://github.com/huggingface/notebooks/blob/main/datasets_doc/quickstart.ipynb

Co-authored-by: Mario Šaško <[email protected]>

alvarobartt and others added 2 commits July 24, 2023 18:35

Add transformers, torch, and tensorflow in docs extra

c6c911d

For the `TFPreTrainedModel.prepare_tf_dataset` and `DataLoader` to be built properly

Co-authored-by: Mario Šaško <[email protected]>

Add albumentations to extend data preparation

09b1433

Co-authored-by: Mario Šaško <[email protected]>
Co-authored-by: Steven Liu <[email protected]>

stevhliu approved these changes Jul 24, 2023

docs/source/quickstart.mdx (outdated, resolved)

docs/source/quickstart.mdx (outdated, resolved)

docs/source/quickstart.mdx (outdated, resolved)

Apply suggestions from code review

fff88ba

Co-authored-by: Steven Liu <[email protected]>

mariosasko added 2 commits July 25, 2023 15:03

Minor improvements

3992864

Merge branch 'main' into fix/colab-example

1435bda

mariosasko approved these changes Jul 25, 2023

mariosasko merged commit 971e33e into huggingface:main Jul 25, 2023

mariosasko mentioned this pull request Jul 25, 2023

Fix Quickstart notebook link #6070

Merged
