Datasets : wikipedia 20220301.en error  #6542

@ppx666

Describe the bug

When I used load_dataset to download this dataset, the following errors occurred. The main problem is that the target data does not exist.

Steps to reproduce the bug

1. I tried downloading directly.

from datasets import load_dataset

wiki_dataset = load_dataset("wikipedia", "20220301.en")

An exception occurred:

MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in `load_dataset` or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at https://beam.apache.org/documentation/runners/capability-matrix/
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called `DirectRunner` (you may run out of memory). 
Example of usage: 
	`load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')`

2. I modified the code as prompted.

wiki_dataset = load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')

An exception occurred:

FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json

Expected behavior

I checked the parent directory of that URL (https://dumps.wikimedia.org/enwiki/), but it contains no "20220301" directory.
I really need this dataset and hope a download method can be provided.
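
The directory check described above can be reproduced with a short script. This is only a sketch: it assumes the index page at https://dumps.wikimedia.org/enwiki/ lists each dump as a link of the form `href="YYYYMMDD/"`, and the helper names (`dumpstatus_url`, `list_dump_dates`, `fetch_dump_dates`) are made up for illustration and are not part of `datasets`.

```python
import re
import urllib.request
from typing import List

# Root of the Wikimedia dumps site, taken from the URL in the error message.
DUMPS_ROOT = "https://dumps.wikimedia.org"


def dumpstatus_url(language: str, date: str) -> str:
    """Build the dumpstatus.json URL for a language code and dump date."""
    return f"{DUMPS_ROOT}/{language}wiki/{date}/dumpstatus.json"


def list_dump_dates(index_html: str) -> List[str]:
    """Extract 8-digit dump dates from links such as href="20231101/".

    Assumption: the directory listing exposes each dump as a relative link
    whose target is the date followed by an optional slash.
    """
    return sorted(set(re.findall(r'href="(\d{8})/?"', index_html)))


def fetch_dump_dates(language: str) -> List[str]:
    """Download the index page for one wiki and list its dump dates.

    Requires network access; not called automatically.
    """
    url = f"{DUMPS_ROOT}/{language}wiki/"
    with urllib.request.urlopen(url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return list_dump_dates(html)
```

Running `"20220301" in fetch_dump_dates("en")` then shows whether the dump that the loader asks for is still listed, and `dumpstatus_url("en", "20220301")` rebuilds the URL from the `FileNotFoundError`.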

Environment info

python 3.8
datasets 2.16.0
apache-beam 2.52.0
dill 0.3.7
