add books3 #2801
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,169 @@ | ||
| --- | ||
| annotations_creators: | ||
| - no-annotation | ||
| language_creators: | ||
| - found | ||
| languages: | ||
| - en | ||
| licenses: | ||
| - mit | ||
| multilinguality: | ||
| - monolingual | ||
| pretty_name: Books3 | ||
| size_categories: | ||
| - unknown | ||
| source_datasets: | ||
| - original | ||
| task_categories: | ||
| - sequence-modeling | ||
| task_ids: | ||
| - language-modeling | ||
| --- | ||
|
|
||
| # Dataset Card for books3 | ||
|
|
||
| ## Table of Contents | ||
| - [Dataset Card for books3](#dataset-card-for-books3) | ||
|   - [Table of Contents](#table-of-contents) | ||
|   - [Dataset Description](#dataset-description) | ||
|     - [Dataset Summary](#dataset-summary) | ||
|     - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) | ||
|     - [Languages](#languages) | ||
|   - [Dataset Structure](#dataset-structure) | ||
|     - [Data Instances](#data-instances) | ||
|     - [Data Fields](#data-fields) | ||
|     - [Data Splits](#data-splits) | ||
|   - [Dataset Creation](#dataset-creation) | ||
|     - [Curation Rationale](#curation-rationale) | ||
|     - [Source Data](#source-data) | ||
|       - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization) | ||
|       - [Who are the source language producers?](#who-are-the-source-language-producers) | ||
|     - [Annotations](#annotations) | ||
|       - [Annotation process](#annotation-process) | ||
|       - [Who are the annotators?](#who-are-the-annotators) | ||
|     - [Personal and Sensitive Information](#personal-and-sensitive-information) | ||
|   - [Considerations for Using the Data](#considerations-for-using-the-data) | ||
|     - [Social Impact of Dataset](#social-impact-of-dataset) | ||
|     - [Discussion of Biases](#discussion-of-biases) | ||
|     - [Other Known Limitations](#other-known-limitations) | ||
|   - [Additional Information](#additional-information) | ||
|     - [Dataset Curators](#dataset-curators) | ||
|     - [Licensing Information](#licensing-information) | ||
|     - [Citation Information](#citation-information) | ||
|     - [Contributions](#contributions) | ||
|
|
||
| ## Dataset Description | ||
|
|
||
| - **Homepage:** https://github.com/soskek/bookcorpus/issues/27#issuecomment-716104208 | ||
| - **Repository:** [Needs More Information] | ||
| - **Paper:** https://arxiv.org/abs/2101.00027 | ||
|
||
| - **Leaderboard:** [Needs More Information] | ||
| - **Point of Contact:** [Needs More Information] | ||
|
|
||
| ### Dataset Summary | ||
|
|
||
| This dataset is Shawn Presser's work and is part of EleutherAI's The Pile dataset. | ||
|
|
||
| This dataset contains all of Bibliotik in plain .txt form: about 197,000 books processed in exactly the same way as bookcorpusopen (a.k.a. books1). It seems to be similar to OpenAI's mysterious "books2" dataset referenced in their papers. Unfortunately, OpenAI will not give details, so we know very little about any differences. People suspect it is "all of libgen", but that is purely conjecture. | ||
|
|
||
| |download_size|36.8 GiB| | ||
| |dataset_size|100.9 GiB| | ||
|
|
||
| ### Supported Tasks and Leaderboards | ||
|
|
||
| - `lm` | ||
|
||
|
|
||
| ### Languages | ||
|
|
||
| - `en` | ||
|
||
|
|
||
| ## Dataset Structure | ||
|
|
||
| ### Data Instances | ||
|
|
||
| ``` | ||
| {'title': '07 LEGO Ninjago - The Search For Zane (Scholastic) - Kate Howard (retail)', | ||
|  'text': '\n\nTITLE PAGE\n\nFROM THE JOURNAL OF SENSEI GARMADON\n\nCHAPTER 1\n\nCHAPTER 2\n\nCHAPTER 3\n\nCHAPTER 4\n\nCHAPTER 5\n\nCHAPTER 6\n\nCHAPTER 7\n\nCHAPTER 8\n\nCHAPTER 9\n\nCOPYRIGHT\n\nThroughout Ninjago", five ninja are well-known for their speed, strength, and of course the elemental powers that help them protect our world from evil. But there are others who possess some of the same powers as the ninja. Others who may not always use their powers for good.\n\nBefore now, the ninja believed they were special. They di.......'} | ||
| ``` | ||
|
|
||
| ### Data Fields | ||
|
|
||
| - `title` | ||
| - `text` | ||
|
||
|
|
||
| ### Data Splits | ||
|
|
||
| |split|num examples| | ||
| |---|---| | ||
| |train|196640| | ||
|
|
||
| ## Dataset Creation | ||
|
|
||
| ### Curation Rationale | ||
|
|
||
| [Needs More Information] | ||
|
|
||
| ### Source Data | ||
|
|
||
| #### Initial Data Collection and Normalization | ||
|
|
||
| [Needs More Information] | ||
|
|
||
| #### Who are the source language producers? | ||
|
|
||
| [Needs More Information] | ||
|
|
||
| ### Annotations | ||
|
|
||
| #### Annotation process | ||
|
|
||
| [Needs More Information] | ||
|
|
||
| #### Who are the annotators? | ||
|
|
||
| [Needs More Information] | ||
|
|
||
| ### Personal and Sensitive Information | ||
|
|
||
| [Needs More Information] | ||
|
|
||
| ## Considerations for Using the Data | ||
|
|
||
| ### Social Impact of Dataset | ||
|
|
||
| [Needs More Information] | ||
|
|
||
| ### Discussion of Biases | ||
|
|
||
| [Needs More Information] | ||
|
|
||
| ### Other Known Limitations | ||
|
|
||
| [Needs More Information] | ||
|
|
||
| ## Additional Information | ||
|
|
||
| ### Dataset Curators | ||
|
|
||
| [Needs More Information] | ||
|
|
||
| ### Licensing Information | ||
|
|
||
| MIT | ||
|
|
||
| ### Citation Information | ||
|
|
||
| ``` | ||
| @article{pile, | ||
| title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling}, | ||
| author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor}, | ||
| journal={arXiv preprint arXiv:2101.00027}, | ||
| year={2020} | ||
| } | ||
| ``` | ||
|
|
||
| ### Contributions | ||
|
|
||
| [@shawwn](https://github.com/shawwn) created this dataset. | ||
| [@richarddwang](https://github.com/richarddwang) added this dataset. | ||
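
The Data Fields section of the card lists two fields, `title` and `text`, and the loading script declares both as strings. A minimal sketch of checking a record against that shape follows; the helper name `check_record` and the shortened sample text are illustrative assumptions, not part of the PR.

```python
# Field names and types as declared in the card and the loading script.
EXPECTED_FIELDS = {"title": str, "text": str}

# Shortened from the Data Instances example in the card.
SAMPLE_RECORD = {
    "title": "07 LEGO Ninjago - The Search For Zane (Scholastic) - Kate Howard (retail)",
    "text": "\n\nTITLE PAGE\n\nFROM THE JOURNAL OF SENSEI GARMADON\n\nCHAPTER 1",
}


def check_record(record):
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"{name} should be {expected_type.__name__}, got {actual}")
    for name in record:
        if name not in EXPECTED_FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems


if __name__ == "__main__":
    print(check_record(SAMPLE_RECORD))
```

Because every example is an entire book, a check like this is cheap compared with reading the text itself, so it can run on each record while iterating.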
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,98 @@ | ||
| # coding=utf-8 | ||
| # Copyright 2020 The TensorFlow Datasets Authors and the HuggingFace Datasets Authors. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| # Lint as: python3 | ||
| """The Books3 dataset, based on Shawn Presser's work: https://github.com/soskek/bookcorpus/issues/27""" | ||
|
|
||
|
|
||
| from pathlib import Path | ||
|
|
||
| import datasets | ||
|
|
||
|
|
||
| _DESCRIPTION = """\ | ||
| This dataset is Shawn Presser's work and is part of EleutherAI's The Pile dataset. \ | ||
| This dataset contains all of Bibliotik in plain .txt form: about 197,000 books processed in exactly \ | ||
| the same way as bookcorpusopen (a.k.a. books1). It seems to be similar to OpenAI's mysterious \ | ||
| "books2" dataset referenced in their papers. Unfortunately, OpenAI will not give details, so we know \ | ||
| very little about any differences. People suspect it is "all of libgen", but that is purely conjecture. | ||
| """ | ||
|
|
||
| _CITATION = """\ | ||
| @article{pile, | ||
| title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling}, | ||
| author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor}, | ||
| journal={arXiv preprint arXiv:2101.00027}, | ||
| year={2020} | ||
| } | ||
| """ | ||
| _PROJECT_URL = "https://github.com/soskek/bookcorpus/issues/27#issuecomment-716104208" | ||
| _DOWNLOAD_URL = "https://the-eye.eu/public/AI/pile_preliminary_components/books3.tar.gz" | ||
|
|
||
|
|
||
| class Books3Config(datasets.BuilderConfig): | ||
|     """BuilderConfig for Books3.""" | ||
|
|
||
|     def __init__(self, **kwargs): | ||
|         """BuilderConfig for Books3. | ||
|         Args: | ||
|           **kwargs: keyword arguments forwarded to super. | ||
|         """ | ||
|         super(Books3Config, self).__init__(version=datasets.Version("1.0.0", ""), **kwargs) | ||
|
|
||
|
|
||
| class Books3(datasets.GeneratorBasedBuilder): | ||
|     """Books3 dataset.""" | ||
|
|
||
|     BUILDER_CONFIGS = [ | ||
|         Books3Config( | ||
|             name="plain_text", | ||
|             description="Plain text", | ||
|         ) | ||
|     ] | ||
|     # Each example is an entire book and therefore large; use a small writer batch size to avoid OOM, at the cost of write speed | ||
|     DEFAULT_WRITER_BATCH_SIZE = 500 | ||
|
|
||
|     def _info(self): | ||
|         return datasets.DatasetInfo( | ||
|             description=_DESCRIPTION, | ||
|             features=datasets.Features( | ||
|                 { | ||
|                     "title": datasets.Value("string"), | ||
|                     "text": datasets.Value("string"), | ||
|                 } | ||
|             ), | ||
|             supervised_keys=None, | ||
|             homepage=_PROJECT_URL, | ||
|             citation=_CITATION, | ||
|         ) | ||
|
|
||
|     def _split_generators(self, dl_manager): | ||
|         extracted_dir = dl_manager.download_and_extract(_DOWNLOAD_URL) | ||
|         leaf_dirs = Path(extracted_dir).glob("**/Bibliotik/*") | ||
|         leaf_dirs = sorted(leaf_dirs) | ||
|
|
||
|         return [ | ||
|             datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"leaf_dirs": leaf_dirs}), | ||
|         ] | ||
|
|
||
|     def _generate_examples(self, leaf_dirs): | ||
|         _id = 0 | ||
|         for leaf_dir in leaf_dirs: | ||
|             for path in sorted(leaf_dir.glob("**/*.epub.txt")): | ||
|                 with path.open(mode="r", encoding="utf-8") as f: | ||
|                     entry = {"title": path.name, "text": f.read()} | ||
|                     yield _id, entry | ||
|                 _id += 1 |
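
The traversal in `_split_generators` and `_generate_examples` can be tried without downloading the archive. The sketch below reproduces the same two glob patterns and the same sorting with only the standard library; the helper names `iter_books` and `make_demo_tree` and the directory layout under `books3/` are made up for illustration, since the real layout inside the tarball is not shown in this diff.

```python
import tempfile
from pathlib import Path


def iter_books(root):
    """Yield (id, {"title", "text"}) pairs in the order the loader would."""
    # Sorting at both levels keeps the numbering deterministic across runs.
    leaf_dirs = sorted(Path(root).glob("**/Bibliotik/*"))
    _id = 0
    for leaf_dir in leaf_dirs:
        for path in sorted(leaf_dir.glob("**/*.epub.txt")):
            with path.open(mode="r", encoding="utf-8") as f:
                yield _id, {"title": path.name, "text": f.read()}
            _id += 1


def make_demo_tree(root):
    """Create a tiny, hypothetical layout: two leaf directories, four files."""
    files = {
        "books3/Bibliotik/A/Zeta.epub.txt": "last book in A",
        "books3/Bibliotik/A/Alpha.epub.txt": "first book in A",
        "books3/Bibliotik/B/Beta.epub.txt": "only book in B",
        "books3/Bibliotik/B/notes.txt": "ignored: does not end in .epub.txt",
    }
    for relative, text in files.items():
        path = Path(root) / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_demo_tree(tmp)
        for key, entry in iter_books(tmp):
            print(key, entry["title"], len(entry["text"]))
```

Note that `path.name` keeps the `.epub.txt` suffix, whereas the example in the dataset card shows a title without it, so the titles produced by this revision of the script may differ from the card's example.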
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| {"plain_text": {"description": "Shawn Presser's work. All of bibliotik in plain .txt form, aka 197,000 books processed in exactly \nthe same way as did for bookcorpusopen (a.k.a. books1). seems to be similar to OpenAI's mysterious \n\"books2\" dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know\nvery little about any differences. People suspect it's \"all of libgen\", but it's purely conjecture.\n", "citation": "@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n", "homepage": "https://github.com/soskek/bookcorpus/issues/27#issuecomment-716104208", "license": "", "features": {"title": {"dtype": "string", "id": null, "_type": "Value"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "books3", "config_name": "plain_text", "version": {"version_str": "1.0.0", "description": "", "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 108395639965, "num_examples": 196640, "dataset_name": "books3"}}, "download_checksums": {"https://the-eye.eu/public/AI/pile_preliminary_components/books3.tar.gz": {"num_bytes": 39516981435, "checksum": "016b90fa6b8507328b6a90d13b0f68c2b87dfd281b35e449a1d466fd9eebc14a"}}, "download_size": 39516981435, "post_processing_size": null, "dataset_size": 108395639965, "size_in_bytes": 147912621400}} |
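
The byte counts in this metadata can be cross-checked against the sizes quoted in the dataset card (36.8 and 100.9). The sketch below copies the values from the JSON above and assumes the card's unit is GiB (2^30 bytes); the helper name `to_gib` is illustrative.

```python
GIB = 2 ** 30  # bytes per gibibyte

# Values copied from the dataset metadata above.
DOWNLOAD_SIZE = 39_516_981_435
DATASET_SIZE = 108_395_639_965
SIZE_IN_BYTES = 147_912_621_400
NUM_EXAMPLES = 196_640


def to_gib(num_bytes):
    """Convert a byte count to gibibytes."""
    return num_bytes / GIB


if __name__ == "__main__":
    print(f"download: {to_gib(DOWNLOAD_SIZE):.2f} GiB")
    print(f"dataset:  {to_gib(DATASET_SIZE):.2f} GiB")
    print(f"total matches: {DOWNLOAD_SIZE + DATASET_SIZE == SIZE_IN_BYTES}")
    print(f"average bytes per book: {DATASET_SIZE / NUM_EXAMPLES:,.0f}")
```

The dataset size comes out slightly above 100.95 GiB, so the card's 100.9 looks truncated rather than rounded. The average of roughly half a megabyte per book also gives a sense of why the loading script keeps its writer batch size small.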
Binary file not shown.