@lhoestq lhoestq commented Aug 25, 2021

The TIMIT ASR dataset had two issues that were preventing it from being streamable:

  1. it was missing a call to open before pd.read_csv
  2. it was using os.path.dirname, which was not supported in streaming mode

I made the dataset streamable by using open to load the CSV, and by adding support for os.path.dirname in dataset scripts when streaming.
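To illustrate why os.path.dirname needs special handling when streaming, here is a minimal sketch of taking the parent directory of a chained URL such as the "file" value shown in the output further down. The helper name chained_dirname is hypothetical and is not the function added by this PR. It assumes that the "::" separator chains the path inside the archive to the archive URL, and that only the first part should be shortened.

```python
import posixpath


def chained_dirname(urlpath: str) -> str:
    """Hypothetical helper: parent directory of the first hop of a '::'-chained URL."""
    # Split "zip://inner/path::https://host/archive.zip" into its hops.
    first, *rest = urlpath.split("::")
    # URLs always use forward slashes, so use posixpath rather than os.path.
    parent = posixpath.dirname(first)
    # posixpath.dirname("zip://file.wav") returns "zip:", so restore the "//".
    if parent.endswith(":"):
        parent += "//"
    return "::".join([parent] + rest)


wav = "zip://data/TRAIN/DR4/MMDM0/SI681.WAV::https://data.deepai.org/timit.zip"
print(chained_dirname(wav))
```

The point of the sketch is only that the archive URL after "::" has to be preserved while the inner path is shortened.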

You can now do:

from datasets import load_dataset

timit_asr = load_dataset("timit_asr", streaming=True)
print(next(iter(timit_asr["train"])))

This prints:

{"file": "zip://data/TRAIN/DR4/MMDM0/SI681.WAV::https://data.deepai.org/timit.zip",
"phonetic_detail": {"start": [0, 1960, 2466, 3480, 4000, 5960, 7480, 7880, 9400, 9960, 10680, 13480, 15680, 15880, 16920, 18297, 18882, 19480, 21723, 22516, 24040, 25190, 27080, 28160, 28560, 30120, 31832, 33240, 34640, 35968, 37720],
"utterance": ["h#", "w", "ix", "dcl", "s", "ah", "tcl", "ch", "ix", "n", "ae", "kcl", "t", "ix", "v", "r", "ix", "f", "y", "ux", "zh", "el", "bcl", "b", "iy", "y", "ux", "s", "f", "el", "h#"],
"stop": [1960, 2466, 3480, 4000, 5960, 7480, 7880, 9400, 9960, 10680, 13480, 15680, 15880, 16920, 18297, 18882, 19480, 21723, 22516, 24040, 25190, 27080, 28160, 28560, 30120, 31832, 33240, 34640, 35968, 37720, 39920]},
"sentence_type": "SI", "id": "SI681",
"speaker_id": "MMDM0",
"dialect_region": "DR4",
"text": "Would such an act of refusal be useful?",
"word_detail": {
    "start": [1960, 4000, 9400, 10680, 15880, 18297, 27080, 30120],
    "utterance": ["would", "such", "an", "act", "of", "refusal", "be", "useful"],
    "stop": [4000, 9400, 10680, 15880, 18297, 27080, 30120, 37720]
}}
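As a small usage sketch, the parallel lists in word_detail can be zipped into per-word spans. The word_detail values are copied from the example printed above; the helper name word_spans is hypothetical, and the start/stop values are assumed to be sample offsets into the audio.

```python
# word_detail copied from the streamed example printed above
word_detail = {
    "start": [1960, 4000, 9400, 10680, 15880, 18297, 27080, 30120],
    "utterance": ["would", "such", "an", "act", "of", "refusal", "be", "useful"],
    "stop": [4000, 9400, 10680, 15880, 18297, 27080, 30120, 37720],
}


def word_spans(detail):
    """Hypothetical helper: pair each word with its (start, stop) offsets."""
    return list(zip(detail["utterance"], detail["start"], detail["stop"]))


spans = word_spans(word_detail)
for word, start, stop in spans:
    # Offsets are assumed to be in samples, not seconds.
    print(f"{word}: {start}-{stop} ({stop - start} samples)")
```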

cc @patrickvonplaten @vrindaprabhu

@lhoestq lhoestq merged commit 9a2dff6 into master Sep 7, 2021
@lhoestq lhoestq deleted the timit_asr-streaming branch September 7, 2021 13:15
JayantGoel001 added a commit to JayantGoel001/datasets-1 that referenced this pull request Sep 8, 2021
Update: timit_asr - make the dataset streamable (huggingface#2835)