tokenize_and_cache cooks up wacky paths #1281

@eritain

Describe the bug
Supplying a relative path to the data downloader lays a trap for tokenize_and_cache.py.

To Reproduce
Call jiant/scripts/download_data/runscript.py to download some task data. Use a relative --output_path such as experiment/tasks.

Download a model, including its tokenizer.

Call jiant/proj/main/tokenize_and_cache.py to preprocess the task data for the model. Use a relative --task_config_path such as experiment/tasks/configs/taskname_config.json. It will die:

FileNotFoundError: [Errno 2] No such file or directory: 'experiment/tasks/configs/experiment/tasks/data/taskname/train.jsonl'
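
The doubled prefix in that error is what you would get if the loader joined the stored path onto the directory of the config file. Here is a minimal sketch of that behavior. It is an assumption about what happens internally, not a quote from the jiant source, and the variable names are made up for illustration:

```python
import os

# Values taken from the reproduction steps above.
task_config_path = "experiment/tasks/configs/taskname_config.json"
# Assumption: the downloader wrote this relative path into the task config.
stored_data_path = "experiment/tasks/data/taskname/train.jsonl"

# Assumption: non-absolute paths are resolved against the config file's directory.
config_dir = os.path.dirname(task_config_path)
resolved = os.path.join(config_dir, stored_data_path)
print(resolved)

# os.path.join discards earlier components when a later one is absolute,
# which would explain why an absolute path survives the same join untouched.
absolute_data_path = os.path.abspath(stored_data_path)
print(os.path.join(config_dir, absolute_data_path) == absolute_data_path)
```

On a POSIX system the first print reproduces the path from the traceback, and the second prints True.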

Expected behavior
tokenize_and_cache should build the correct path, experiment/tasks/data/taskname/train.jsonl.

Additional context
Giving an absolute path to the downloader allows tokenize_and_cache to formulate the correct path and produce correct outputs. Hand-patching absolute paths into experiment/tasks/configs/taskname_config.json after the downloader creates it, but before tokenize_and_cache uses it, appears to work too.
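
For anyone who needs the hand-patching workaround in the meantime, it could be scripted roughly like this. This is a sketch that assumes the task config is a JSON file with a "paths" mapping from split name to file path; absolutize_config_paths is a hypothetical helper, not part of jiant:

```python
import json
import os


def absolutize_config_paths(config_path, base_dir):
    """Rewrite relative entries under "paths" as absolute paths rooted at base_dir.

    Assumption: the config is JSON with a "paths" dict of split name -> path.
    """
    with open(config_path) as f:
        config = json.load(f)

    for split, path in config.get("paths", {}).items():
        # Leave non-string values and already-absolute paths alone.
        if isinstance(path, str) and not os.path.isabs(path):
            config["paths"][split] = os.path.abspath(os.path.join(base_dir, path))

    # Overwrite the config in place with the patched paths.
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config
```

Run from the directory where the downloader was invoked, the call would be absolutize_config_paths("experiment/tasks/configs/taskname_config.json", os.getcwd()), since that is the directory the relative paths were originally relative to.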

At a minimum, or while working on a better solution, stick a warning on all examples of using the downloader, including README.md and guides/tutorials/quick_start_main.md. For extra credit, stick it in the source of both download_data/runscript.py and tokenize_and_cache.py as a comment. But the ideal thing would be to patch tokenize_and_cache to handle relative paths correctly. Forcing the downloader to build absolute paths before writing the task config would be OK too.
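
The downloader-side option might look something like the following. This is only a sketch: build_task_config is a hypothetical stand-in for whatever the downloader really does, the data/taskname/train.jsonl layout comes from the error message above, and the config keys and the val and test split names are assumptions:

```python
import os


def build_task_config(task_name, output_path, splits=("train", "val", "test")):
    """Build a task config whose data paths are absolute.

    Hypothetical helper for illustration; not the downloader's real code.
    """
    # Resolve against the current working directory once, up front, so the
    # written config no longer depends on where later scripts are run from.
    data_dir = os.path.abspath(os.path.join(output_path, "data", task_name))
    return {
        "task": task_name,
        "paths": {split: os.path.join(data_dir, f"{split}.jsonl") for split in splits},
        "name": task_name,
    }


config = build_task_config("taskname", "experiment/tasks")
print(config["paths"]["train"])
```

The trade-off is that a config with absolute paths stops being portable if the experiment directory is moved, which may be why patching tokenize_and_cache is the nicer fix.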

Metadata

Labels: assigned (Is being looked into/followed-up on by a dev)