-
Notifications
You must be signed in to change notification settings - Fork 3k
Fix Overview.ipynb & detach Jupyter Notebooks from datasets repository
#5902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Random fact: previous run was showing that the Hub was hosting 13336 datasets, while the most recent run shows 36662 👀🎉 |
|
The documentation is not available anymore as the PR was closed or merged. |
|
Thanks! However, I think we should stop linking this notebook and use the notebook version of the Quickstart doc page instead of it for easier maintenance (we would have the "Open in Colab" button in the Quickstart doc as Transformers does). @stevhliu should be able to help with this. If I'm not mistaken, this can be done by adding the Also, if some useful info from the Overview notebook is not in the docs, feel free to add it so we don't lose it 🙂. |
|
Cool, makes sense @mariosasko, then I'll check both notebooks and see whether there's something in Are you OK if I do that @stevhliu @mariosasko? Thanks 🤗 |
* Re-ordered subsections so that `Text` goes first * Add machine learning frameworks missing install instructions * Add [[open-in-colab]] button * Add missing license
|
For the moment I've just updated the |
My guess was that the exclamation mark was used for highlighting but it's not, so reverted: 🤗 Datasets! -> 🤗 Datasets
stevhliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix! I left a few nits but looks great overall 🤗
I think at the end of the Quickstart, it may be nice to add in the What's Next section, links to how to create an image or audio dataset.
regarding the Overview.ipynb notebook I was planning to create a PR in https://github.com/huggingface/notebooks to add it there
The notebooks in https://github.com/huggingface/notebooks are automatically generated by the doc-builder, so we can definitely enable that for Datasets by:
- Creating a
_config.pyfile indocs/sourcelike this one here. - Adding this line to the build documentation workflow.
I think it'd be nice to drop the Overview.ipynb notebook entirely for easier maintenance, as @mariosasko mentioned, and I think it'd also be better for users to go through the docs instead of the GitHub repository (it helps to keep all the information in one place). In the Datasets GitHub notebook folder, we can add a README.md with a link to the notebooks generated by the doc-builder. What do you think @mariosasko?
Co-authored-by: Steven Liu <[email protected]>
|
Hi @stevhliu thanks for the feedback! Already applied your suggestions, I'll also add the pointers to both audio and image datasets in the "What's next" section. Besides that, let me know if I can help with the notebook being hosted in |
mariosasko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Some comments:
|
Thanks a lot for the detailed feedback @mariosasko, I'll apply the changes today! |
Awesome! If you're up for it, I think you can go ahead and open a PR with the changes I've outlined here to add the notebook building workflow. |
As of this commit, the URLs throw a 404 as those are pointing to unpushed notebooks, to be pushed as part of `build_documentation`
bert-base-cased usage, install missing seqeval, and re-run Overview.ipynbOverview.ipynb & detach Jupyter Notebooks from datasets repository
Co-authored-by: Mario Šaško <[email protected]>
mariosasko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can soon merge :). Some nits:
stevhliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few more nits, thanks for iterating on this and adding the notebook workflow! 🤗
|
Hi @stevhliu @mariosasko, sorry for the delay I had a busy week, I'll tackle this either today or tomorrow to ideally close it before the weekend, thanks again for the help and guidance 😄 |
Co-authored-by: Mario Šaško <[email protected]>
Co-authored-by: Mario Šaško <[email protected]>
Co-authored-by: Mario Šaško <[email protected]> Co-authored-by: Steven Liu <[email protected]>
In favor of https://github.com/huggingface/notebooks/blob/main/datasets_doc/quickstart.ipynb Co-authored-by: Mario Šaško <[email protected]>
For the `TFPreTrainedModel.prepare_tf_dataset` and `DataLoader` to be built properly Co-authored-by: Mario Šaško <[email protected]>
Co-authored-by: Mario Šaško <[email protected]> Co-authored-by: Steven Liu <[email protected]>
|
Hi guys @stevhliu @mariosasko sorry for the delay! I've resolved all the comments and applied your reviews 👍🏻 Let me know if this works and we can finally close this PR, thanks for the help in the meantime! |
stevhliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more nit to resolve the link to the prepare_tf_dataset method.
Thanks for iterating on this and wrapping it up! 🤗
Co-authored-by: Steven Liu <[email protected]>
No need to! Always a pleasure to collaborate with you guys 🤗 |
mariosasko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, great work indeed!
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
|
Just as a heads up @mariosasko, the |

What's in this PR?
This PR solves #5887 since there was a mismatch between the tokenizer and the model used, since the tokenizer was
bert-base-casedwhile the model wasdistilbert-base-caseboth for the PyTorch and TensorFlow alternatives. Since DistilBERT doesn't use/need thetoken_type_ids, the**batchwas failing, as the batch containedinput_ids,attention_mask,token_type_ids,start_positionsandend_positions, andtoken_type_idswas not required.Besides that, at the end
seqevalwas being used to evaluate the model predictions, and justevaluatewas being installed, so I've also included theseqevalinstallation.Finally, I've re-run everything in Google Colab, and every cell was successfully executed!
What was done on top of the original PR?
Based on the comments from @mariosasko and @stevhliu, I've updated the contents of this PR to also review the
quickstart.mdxand update what was needed, besides that, we may eventually move theOverview.ipynbdataset tohuggingface/notebooksfollowing @stevhliu suggestions.