@alvarobartt
Member

@alvarobartt alvarobartt commented May 26, 2023

What's in this PR?

This PR solves #5887, which was caused by a mismatch between the tokenizer and the model: the tokenizer was bert-base-cased while the model was distilbert-base-cased, in both the PyTorch and TensorFlow alternatives. DistilBERT doesn't use or need token_type_ids, so passing `**batch` to the model failed: the batch contained input_ids, attention_mask, token_type_ids, start_positions and end_positions, and the model does not accept token_type_ids.
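To illustrate the failure mode described above, here is a minimal sketch that needs no model download. The function `distilbert_like_forward` and the helper `keep_accepted_keys` are hypothetical stand-ins, not part of Transformers; they only mimic a forward signature without a token_type_ids parameter. The PR itself fixes the problem by aligning the tokenizer and model checkpoints, so the key filtering shown here is only an illustrative alternative.

```python
import inspect


def distilbert_like_forward(input_ids, attention_mask=None,
                            start_positions=None, end_positions=None):
    """Hypothetical stand-in for a forward pass with no token_type_ids parameter."""
    return {"num_inputs": len(input_ids)}


# A batch as produced by a BERT-style tokenizer, which adds token_type_ids.
batch = {
    "input_ids": [[101, 102]],
    "attention_mask": [[1, 1]],
    "token_type_ids": [[0, 0]],
    "start_positions": [0],
    "end_positions": [1],
}

# Unpacking the whole batch passes an unexpected keyword argument.
try:
    distilbert_like_forward(**batch)
    failed = False
    error_message = ""
except TypeError as err:
    failed = True
    error_message = str(err)


def keep_accepted_keys(fn, batch):
    """Keep only the batch entries that fn's signature accepts."""
    accepted = inspect.signature(fn).parameters
    return {key: value for key, value in batch.items() if key in accepted}


filtered = keep_accepted_keys(distilbert_like_forward, batch)
outputs = distilbert_like_forward(**filtered)
```

Using a tokenizer that matches the model checkpoint avoids the extra key in the first place, which is why aligning the two is the simpler fix.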

Besides that, seqeval was used at the end to evaluate the model predictions, but only evaluate was being installed, so I've also added the seqeval installation.

Finally, I've re-run everything in Google Colab, and every cell was successfully executed!

What was done on top of the original PR?

Based on the comments from @mariosasko and @stevhliu, I've updated this PR to also review quickstart.mdx and update what was needed. Besides that, we may eventually move the Overview.ipynb notebook to huggingface/notebooks, following @stevhliu's suggestions.

@alvarobartt
Member Author

Random fact: previous run was showing that the Hub was hosting 13336 datasets, while the most recent run shows 36662 👀🎉

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 26, 2023

The documentation is not available anymore as the PR was closed or merged.

@mariosasko
Collaborator

Thanks!

However, I think we should stop linking this notebook and use the notebook version of the Quickstart doc page instead, for easier maintenance (we would have the "Open in Colab" button in the Quickstart doc, as Transformers does).

@stevhliu should be able to help with this. If I'm not mistaken, this can be done by adding the [[open-in-colab]] marker to the doc page.

Also, if some useful info from the Overview notebook is not in the docs, feel free to add it so we don't lose it 🙂.

@alvarobartt
Member Author

Cool, makes sense @mariosasko. I'll check both notebooks to see whether there's anything in Overview.ipynb worth including in docs/source/quickstart.mdx, then remove Overview.ipynb and update the references in favour of docs/source/quickstart.mdx.

Are you OK if I do that @stevhliu @mariosasko? Thanks 🤗

* Re-ordered subsections so that `Text` goes first
* Add machine learning frameworks missing install instructions
* Add [[open-in-colab]] button
* Add missing license
@alvarobartt
Member Author

alvarobartt commented Jun 10, 2023

For the moment I've just updated quickstart.mdx to be more similar to quicktour.mdx. Regarding the Overview.ipynb notebook, I was planning to create a PR in https://github.com/huggingface/notebooks to add it there; does that make sense @stevhliu? I would then create a README.md under notebooks/ in this repository, as transformers does, pointing to the related notebooks hosted in https://github.com/huggingface/notebooks. WDYT? 🤗

My guess was that the exclamation mark was used for highlighting but it's not, so reverted: 🤗 Datasets! -> 🤗 Datasets
Member

@stevhliu stevhliu left a comment

Thanks for the fix! I left a few nits but looks great overall 🤗

I think it may be nice to add links on how to create an image or audio dataset to the What's Next section at the end of the Quickstart.

> regarding the Overview.ipynb notebook I was planning to create a PR in https://github.com/huggingface/notebooks to add it there

The notebooks in https://github.com/huggingface/notebooks are automatically generated by the doc-builder, so we can definitely enable that for Datasets by:

  1. Creating a _config.py file in docs/source like this one here.
  2. Adding this line to the build documentation workflow.
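
As a rough idea of what step 1 could look like, here is a hedged sketch of a `docs/source/_config.py`. The variable names `INSTALL_CONTENT` and `notebook_first_cells` are assumptions modeled on the config file used in the Transformers docs, and the install commands are illustrative; the actual file linked above may differ.

```python
# Hypothetical docs/source/_config.py, modeled on the Transformers docs config.
# The names and contents below are assumptions, not the confirmed Datasets config.

# Installation commands intended for the first cell of each generated notebook.
INSTALL_CONTENT = """
# Datasets installation
! pip install datasets transformers
# To install from source instead of the last release, comment the command above
# and uncomment the following one.
# ! pip install git+https://github.com/huggingface/datasets.git
"""

# Cells the doc-builder is assumed to prepend to every notebook it generates.
notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
```

Keeping the install cell in one config file means every generated notebook starts from the same setup, which fits the maintenance argument made in this thread.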

I think it'd be nice to drop the Overview.ipynb notebook entirely for easier maintenance, as @mariosasko mentioned, and I think it'd also be better for users to go through the docs instead of the GitHub repository (it helps to keep all the information in one place). In the Datasets GitHub notebook folder, we can add a README.md with a link to the notebooks generated by the doc-builder. What do you think @mariosasko?

@alvarobartt
Member Author

Hi @stevhliu, thanks for the feedback! I've already applied your suggestions, and I'll also add the pointers to both audio and image datasets in the "What's next" section.

Besides that, let me know if I can help with the notebook being hosted in huggingface/notebooks instead, and I'll happily do so!

@alvarobartt alvarobartt requested a review from stevhliu June 13, 2023 18:27
Collaborator

@mariosasko mariosasko left a comment

Nice! Some comments:

@alvarobartt
Member Author

Thanks a lot for the detailed feedback @mariosasko, I'll apply the changes today!

@stevhliu
Member

> Besides that, let me know if I can help with the notebook being hosted in huggingface/notebooks instead, and I'll happily do so!

Awesome! If you're up for it, I think you can go ahead and open a PR with the changes I've outlined here to add the notebook building workflow.

As of this commit, the URLs throw a 404 as those are pointing to unpushed notebooks, to be pushed as part of `build_documentation`
@alvarobartt alvarobartt changed the title from "Align bert-base-cased usage, install missing seqeval, and re-run Overview.ipynb" to "Fix Overview.ipynb & detach Jupyter Notebooks from datasets repository" Jun 15, 2023
@alvarobartt alvarobartt requested a review from mariosasko June 15, 2023 15:44
Collaborator

@mariosasko mariosasko left a comment

I think we can soon merge :). Some nits:

Member

@stevhliu stevhliu left a comment

Added a few more nits, thanks for iterating on this and adding the notebook workflow! 🤗

@alvarobartt
Member Author

Hi @stevhliu @mariosasko, sorry for the delay, I had a busy week. I'll tackle this either today or tomorrow to ideally close it before the weekend. Thanks again for the help and guidance 😄

alvarobartt and others added 2 commits July 24, 2023 18:35
For the `TFPreTrainedModel.prepare_tf_dataset` and `DataLoader` to be built properly

Co-authored-by: Mario Šaško
Co-authored-by: Mario Šaško
Co-authored-by: Steven Liu
@alvarobartt
Member Author

Hi guys @stevhliu @mariosasko, sorry for the delay! I've resolved all the comments and applied your reviews 👍🏻 Let me know if this works and we can finally close this PR. Thanks for the help in the meantime!

Member

@stevhliu stevhliu left a comment

One more nit to resolve the link to the prepare_tf_dataset method.

Thanks for iterating on this and wrapping it up! 🤗

@alvarobartt
Member Author

> Thanks for iterating on this and wrapping it up! 🤗

No need to! Always a pleasure to collaborate with you guys 🤗

Collaborator

@mariosasko mariosasko left a comment

Thanks, great work indeed!

@mariosasko mariosasko merged commit 971e33e into huggingface:main Jul 25, 2023
@github-actions

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009814 | 0.011353 | -0.001539 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004632 | 0.011008 | -0.006376 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.103059 | 0.038508 | 0.064551 |
| read_batch_unformated after write_array2d | 0.090277 | 0.023109 | 0.067167 |
| read_batch_unformated after write_flattened_sequence | 0.389344 | 0.275898 | 0.113446 |
| read_batch_unformated after write_nested_sequence | 0.464536 | 0.323480 | 0.141056 |
| read_col_formatted_as_numpy after write_array2d | 0.008196 | 0.007986 | 0.000210 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003872 | 0.004328 | -0.000457 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.081912 | 0.004250 | 0.077662 |
| read_col_unformated after write_array2d | 0.073197 | 0.037052 | 0.036145 |
| read_col_unformated after write_flattened_sequence | 0.407545 | 0.258489 | 0.149056 |
| read_col_unformated after write_nested_sequence | 0.458035 | 0.293841 | 0.164194 |
| read_formatted_as_numpy after write_array2d | 0.037485 | 0.128546 | -0.091061 |
| read_formatted_as_numpy after write_flattened_sequence | 0.010141 | 0.075646 | -0.065505 |
| read_formatted_as_numpy after write_nested_sequence | 0.365998 | 0.419271 | -0.053273 |
| read_unformated after write_array2d | 0.065218 | 0.043533 | 0.021685 |
| read_unformated after write_flattened_sequence | 0.414091 | 0.255139 | 0.158952 |
| read_unformated after write_nested_sequence | 0.435617 | 0.283200 | 0.152417 |
| write_array2d | 0.028850 | 0.141683 | -0.112833 |
| write_flattened_sequence | 1.883510 | 1.452155 | 0.431355 |
| write_nested_sequence | 1.979986 | 1.492716 | 0.487269 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.236623 | 0.018006 | 0.218616 |
| get_batch_of_1024_rows | 0.467128 | 0.000490 | 0.466638 |
| get_first_row | 0.008273 | 0.000200 | 0.008074 |
| get_last_row | 0.000699 | 0.000054 | 0.000645 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.033061 | 0.037411 | -0.004350 |
| shard | 0.101381 | 0.014526 | 0.086856 |
| shuffle | 0.110862 | 0.176557 | -0.065695 |
| sort | 0.180982 | 0.737135 | -0.556154 |
| train_test_split | 0.113791 | 0.296338 | -0.182548 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.450805 | 0.215209 | 0.235596 |
| read 50000 | 4.478374 | 2.077655 | 2.400719 |
| read_batch 50000 10 | 2.190814 | 1.504120 | 0.686694 |
| read_batch 50000 100 | 1.976726 | 1.541195 | 0.435532 |
| read_batch 50000 1000 | 2.078527 | 1.468490 | 0.610037 |
| read_formatted numpy 5000 | 0.569150 | 4.584777 | -4.015627 |
| read_formatted pandas 5000 | 4.557790 | 3.745712 | 0.812078 |
| read_formatted tensorflow 5000 | 3.794964 | 5.269862 | -1.474898 |
| read_formatted torch 5000 | 2.555689 | 4.565676 | -2.009987 |
| read_formatted_batch numpy 5000 10 | 0.067380 | 0.424275 | -0.356896 |
| read_formatted_batch numpy 5000 1000 | 0.008741 | 0.007607 | 0.001134 |
| shuffled read 5000 | 0.536913 | 0.226044 | 0.310868 |
| shuffled read 50000 | 5.364588 | 2.268929 | 3.095659 |
| shuffled read_batch 50000 10 | 2.725602 | 55.444624 | -52.719022 |
| shuffled read_batch 50000 100 | 2.332012 | 6.876477 | -4.544465 |
| shuffled read_batch 50000 1000 | 2.560550 | 2.142072 | 0.418477 |
| shuffled read_formatted numpy 5000 | 0.672490 | 4.805227 | -4.132738 |
| shuffled read_formatted_batch numpy 5000 10 | 0.153629 | 6.500664 | -6.347035 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.070583 | 0.075469 | -0.004886 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.620083 | 1.841788 | -0.221704 |
| map fast-tokenizer batched | 23.094248 | 8.074308 | 15.019939 |
| map identity | 17.797625 | 10.191392 | 7.606233 |
| map identity batched | 0.167993 | 0.680424 | -0.512430 |
| map no-op batched | 0.021151 | 0.534201 | -0.513050 |
| map no-op batched numpy | 0.470216 | 0.579283 | -0.109067 |
| map no-op batched pandas | 0.515492 | 0.434364 | 0.081128 |
| map no-op batched pytorch | 0.666359 | 0.540337 | 0.126021 |
| map no-op batched tensorflow | 0.772928 | 1.386936 | -0.614008 |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007853 | 0.011353 | -0.003500 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004627 | 0.011008 | -0.006381 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.079803 | 0.038508 | 0.041295 |
| read_batch_unformated after write_array2d | 0.091562 | 0.023109 | 0.068453 |
| read_batch_unformated after write_flattened_sequence | 0.488537 | 0.275898 | 0.212639 |
| read_batch_unformated after write_nested_sequence | 0.579207 | 0.323480 | 0.255728 |
| read_col_formatted_as_numpy after write_array2d | 0.006579 | 0.007986 | -0.001406 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003946 | 0.004328 | -0.000382 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.080224 | 0.004250 | 0.075973 |
| read_col_unformated after write_array2d | 0.074499 | 0.037052 | 0.037446 |
| read_col_unformated after write_flattened_sequence | 0.488292 | 0.258489 | 0.229803 |
| read_col_unformated after write_nested_sequence | 0.569246 | 0.293841 | 0.275405 |
| read_formatted_as_numpy after write_array2d | 0.039994 | 0.128546 | -0.088553 |
| read_formatted_as_numpy after write_flattened_sequence | 0.012867 | 0.075646 | -0.062780 |
| read_formatted_as_numpy after write_nested_sequence | 0.092563 | 0.419271 | -0.326709 |
| read_unformated after write_array2d | 0.061656 | 0.043533 | 0.018124 |
| read_unformated after write_flattened_sequence | 0.488271 | 0.255139 | 0.233132 |
| read_unformated after write_nested_sequence | 0.550651 | 0.283200 | 0.267451 |
| write_array2d | 0.032078 | 0.141683 | -0.109605 |
| write_flattened_sequence | 1.874440 | 1.452155 | 0.422286 |
| write_nested_sequence | 1.973480 | 1.492716 | 0.480763 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.238789 | 0.018006 | 0.220782 |
| get_batch_of_1024_rows | 0.460237 | 0.000490 | 0.459748 |
| get_first_row | 0.000500 | 0.000200 | 0.000300 |
| get_last_row | 0.000067 | 0.000054 | 0.000012 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.034961 | 0.037411 | -0.002450 |
| shard | 0.102696 | 0.014526 | 0.088170 |
| shuffle | 0.117772 | 0.176557 | -0.058784 |
| sort | 0.183865 | 0.737135 | -0.553270 |
| train_test_split | 0.119216 | 0.296338 | -0.177122 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.528894 | 0.215209 | 0.313685 |
| read 50000 | 5.303954 | 2.077655 | 3.226300 |
| read_batch 50000 10 | 2.897505 | 1.504120 | 1.393385 |
| read_batch 50000 100 | 2.475898 | 1.541195 | 0.934703 |
| read_batch 50000 1000 | 2.553479 | 1.468490 | 1.084988 |
| read_formatted numpy 5000 | 0.625847 | 4.584777 | -3.958930 |
| read_formatted pandas 5000 | 4.656595 | 3.745712 | 0.910882 |
| read_formatted tensorflow 5000 | 3.745170 | 5.269862 | -1.524691 |
| read_formatted torch 5000 | 2.470922 | 4.565676 | -2.094755 |
| read_formatted_batch numpy 5000 10 | 0.066908 | 0.424275 | -0.357367 |
| read_formatted_batch numpy 5000 1000 | 0.009172 | 0.007607 | 0.001565 |
| shuffled read 5000 | 0.572695 | 0.226044 | 0.346650 |
| shuffled read 50000 | 5.753428 | 2.268929 | 3.484499 |
| shuffled read_batch 50000 10 | 3.033226 | 55.444624 | -52.411398 |
| shuffled read_batch 50000 100 | 2.677280 | 6.876477 | -4.199197 |
| shuffled read_batch 50000 1000 | 2.908857 | 2.142072 | 0.766785 |
| shuffled read_formatted numpy 5000 | 0.681595 | 4.805227 | -4.123632 |
| shuffled read_formatted_batch numpy 5000 10 | 0.154602 | 6.500664 | -6.346062 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.072608 | 0.075469 | -0.002861 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.738550 | 1.841788 | -0.103237 |
| map fast-tokenizer batched | 25.090637 | 8.074308 | 17.016329 |
| map identity | 18.371478 | 10.191392 | 8.180086 |
| map identity batched | 0.207357 | 0.680424 | -0.473067 |
| map no-op batched | 0.023396 | 0.534201 | -0.510805 |
| map no-op batched numpy | 0.505663 | 0.579283 | -0.073620 |
| map no-op batched pandas | 0.503137 | 0.434364 | 0.068773 |
| map no-op batched pytorch | 0.598015 | 0.540337 | 0.057678 |
| map no-op batched tensorflow | 0.714122 | 1.386936 | -0.672814 |

@alvarobartt
Member Author

alvarobartt commented Jul 25, 2023

Just as a heads up @mariosasko, the quickstart.ipynb Jupyter Notebook has been built at https://github.com/huggingface/notebooks/blob/main/datasets_doc/en/quickstart.ipynb, while the URLs in here point to https://github.com/huggingface/notebooks/blob/main/datasets_doc/quickstart.ipynb instead. Should we update that?
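
For illustration, the difference between the two URLs above is only the missing `en/` language folder. A small sketch of how the references could be rewritten follows; the helper name `add_language_segment` and its defaults are hypothetical, not something in the datasets repository or the doc-builder.

```python
def add_language_segment(url: str, language: str = "en",
                         doc_folder: str = "datasets_doc") -> str:
    """Insert the language folder right after the doc folder if it is missing.

    Hypothetical helper: it only rewrites the string and does not check
    that the resulting URL actually exists.
    """
    marker = f"/{doc_folder}/"
    head, separator, tail = url.partition(marker)
    # Leave the URL untouched if the doc folder is absent or already localized.
    if not separator or tail.startswith(f"{language}/"):
        return url
    return f"{head}{marker}{language}/{tail}"


old_url = "https://github.com/huggingface/notebooks/blob/main/datasets_doc/quickstart.ipynb"
new_url = add_language_segment(old_url)
```

Applying the helper a second time returns the URL unchanged, so it is safe to run over references that have already been updated.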
