Closed
Description
Describe the bug
When I used load_dataset to download this dataset, the errors below occurred. The main problem is that the target data does not exist at the expected URL.
Steps to reproduce the bug
1. I tried downloading directly.
wiki_dataset = load_dataset("wikipedia", "20220301.en")
An exception occurred:
MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in `load_dataset` or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at https://beam.apache.org/documentation/runners/capability-matrix/
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called `DirectRunner` (you may run out of memory).
Example of usage:
`load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')`
2. I modified the code as prompted.
wiki_dataset = load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
An exception occurred:
FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json
Expected behavior
I looked in the parent directory of that URL, but it contains no "20220301" directory.
I really need this dataset and hope a way to download it can be provided.
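The missing dump can be checked independently of `datasets`. The following is a minimal sketch using only the Python standard library; the helper names `dumpstatus_url` and `dump_exists` are hypothetical, and the URL pattern is inferred from the FileNotFoundError above rather than taken from the loader's source.

```python
import urllib.error
import urllib.request

# Root of the Wikimedia dump server, as seen in the error message.
DUMPS_ROOT = "https://dumps.wikimedia.org"


def dumpstatus_url(language: str, date: str) -> str:
    """Build the dumpstatus.json URL for a config such as '20220301.en'.

    Assumption: the path follows the pattern shown in the FileNotFoundError.
    """
    return f"{DUMPS_ROOT}/{language}wiki/{date}/dumpstatus.json"


def dump_exists(language: str, date: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request for dumpstatus.json gets HTTP 200.

    Any HTTP error status (404, 403, ...) is reported as False.
    Network-level failures (DNS, timeout) are not caught and will raise.
    """
    request = urllib.request.Request(dumpstatus_url(language, date), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False
```

Calling `dump_exists("en", "20220301")` performs the network request; a `False` result would be consistent with the error in step 2, although it can also mean the server rejected the request for another reason.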
Environment info
python 3.8
datasets 2.16.0
apache-beam 2.52.0
dill 0.3.7