dataset-compatible-libraries gives an UnexpectedError for some datasets #2607

@severo

Description

On https://huggingface.co/datasets/HackerNoon/tech-company-news-data-dump, the step dataset-compatible-libraries gives:

{
  "error": "Dataset at 'hf://datasets/HackerNoon/tech-company-news-data-dump' doesn't contain data files matching the patterns for config 'default', check `data_files` and `data_fir` parameters in the `configs` YAML field in README.md. ",
  "cause_exception": "EmptyDatasetError",
  "cause_message": "Dataset at 'hf://datasets/HackerNoon/tech-company-news-data-dump' doesn't contain data files matching the patterns for config 'default', check `data_files` and `data_fir` parameters in the `configs` YAML field in README.md. ",
  "cause_traceback": [
    "Traceback (most recent call last):\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py\", line 622, in create_builder_configs_from_metadata_configs\n else get_data_patterns(config_base_path)\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/data_files.py\", line 485, in get_data_patterns\n raise EmptyDatasetError(f\"The directory at {base_path} doesn't contain any data files\") from None\n",
    "datasets.data_files.EmptyDatasetError: The directory at hf://datasets/HackerNoon/tech-company-news-data-dump doesn't contain any data files\n",
    "\nThe above exception was the direct cause of the following exception:\n\n",
    "Traceback (most recent call last):\n",
    " File \"/src/services/worker/src/worker/job_manager.py\", line 125, in process\n job_result = self.job_runner.compute()\n",
    " File \"/src/services/worker/src/worker/job_runners/dataset/compatible_libraries.py\", line 632, in compute\n response_content = compute_compatible_libraries_response(\n",
    " File \"/src/services/worker/src/worker/job_runners/dataset/compatible_libraries.py\", line 619, in compute_compatible_libraries_response\n compatible_library = get_compatible_library_for_builder[builder_name](dataset, hf_token)\n",
    " File \"/src/services/worker/src/worker/job_runners/dataset/compatible_libraries.py\", line 416, in get_compatible_libraries_for_csv\n builder_configs = get_builder_configs_with_simplified_data_files(dataset, module_name=\"csv\", hf_token=hf_token)\n",
    " File \"/src/services/worker/src/worker/job_runners/dataset/compatible_libraries.py\", line 107, in get_builder_configs_with_simplified_data_files\n builder_configs, _ = create_builder_configs_from_metadata_configs(\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py\", line 629, in create_builder_configs_from_metadata_configs\n raise EmptyDatasetError(\n",
    "datasets.data_files.EmptyDatasetError: Dataset at 'hf://datasets/HackerNoon/tech-company-news-data-dump' doesn't contain data files matching the patterns for config 'default', check `data_files` and `data_fir` parameters in the `configs` YAML field in README.md. \n"
  ]
}

Some ideas to explore: the dataset is gated, and its Parquet export is only partial.
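To illustrate the gated-dataset hypothesis, here is a minimal stdlib-only sketch (not the actual `datasets` implementation; `resolve_data_files` and `DEFAULT_PATTERNS` are hypothetical names): if the repo listing comes back empty, which is what an anonymous listing of a gated repo would produce, no pattern matches and the resolution step fails the same way `EmptyDatasetError` does.

```python
import fnmatch

# Hypothetical stand-in for the default data-file patterns used during
# config resolution (the real defaults live in datasets.data_files).
DEFAULT_PATTERNS = ["*.csv", "*.parquet", "data/*"]

def resolve_data_files(listing, patterns=DEFAULT_PATTERNS):
    """Match a repo file listing against data-file patterns.

    An empty listing (e.g. a gated repo listed without a valid token)
    matches nothing, mirroring the EmptyDatasetError path in the trace.
    """
    matches = sorted({f for p in patterns for f in fnmatch.filter(listing, p)})
    if not matches:
        # Stand-in for datasets.data_files.EmptyDatasetError
        raise FileNotFoundError("directory doesn't contain any data files")
    return matches

# An authenticated listing finds the files:
print(resolve_data_files(["data/train.csv"]))  # ['data/train.csv']
```

Under this model, the fix would need to ensure the job runner lists the gated repo with a token that has access, rather than changing the patterns themselves.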

Metadata

Labels

P1: Not as needed as P0, but still important/wanted
blocked-by-upstream: The issue must be fixed in a dependency
bug: Something isn't working
