VIVOS dataset for Vietnamese ASR #2780
Merged
Commits (7):

- e9b63e8 Add VIVOS dataset (binh234)
- 6793fc7 Update VIVOS dataset and dataset card for Vietnamese ASR (binh234)
- 6df51e3 Merge remote-tracking branch 'upstream/master' into vivos (binh234)
- e58311d Update dataset_infos.json file (binh234)
- 9b6e097 Correct task_categories and task_ids tags in dataset card (binh234)
- 612bbb0 Apply suggestions from code review (binh234)
- 0fbe818 Merge branch 'huggingface:master' into vivos (binh234)
Dataset card (new file):

---
pretty_name: vivos
annotations_creators:
- expert-generated
language_creators:
- crowdsourced
- expert-generated
languages:
- vi
licenses:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- speech-processing
task_ids:
- automatic-speech-recognition
---

# Dataset Card for VIVOS

## Table of Contents
- [Dataset Card for VIVOS](#dataset-card-for-vivos)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://ailab.hcmus.edu.vn/vivos
- **Repository:** [Needs More Information]
- **Paper:** [Needs More Information]
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [email protected]

### Dataset Summary

VIVOS is a free Vietnamese speech corpus consisting of 15 hours of recorded speech prepared for the Vietnamese Automatic Speech Recognition task.
The corpus was prepared by AILAB, a computer science lab of VNUHCM - University of Science, headed by Prof. Vu Hai Quan.
The corpus is published in the hope of attracting more scientists to work on Vietnamese speech recognition problems.

### Supported Tasks and Leaderboards

[Needs More Information]

### Languages

Vietnamese

## Dataset Structure

### Data Instances

A typical data point comprises the path to the audio file, called `path`, and its transcription, called `sentence`. Additional information about the speaker is provided through the `speaker_id` field.

```
{'speaker_id': 'VIVOSSPK01',
 'path': '/home/admin/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/vivos/train/waves/VIVOSSPK01/VIVOSSPK01_R001.wav',
 'sentence': 'KHÁCH SẠN'}
```
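
As an illustration only (not part of the dataset card), here is a minimal loading sketch, assuming the loading script becomes available under the name `vivos` once merged:

```python
from datasets import load_dataset

# Assumption: the loading script is published under the dataset name "vivos".
vivos = load_dataset("vivos")

sample = vivos["train"][0]
print(sample["speaker_id"])  # e.g. "VIVOSSPK01"
print(sample["path"])        # local path to the extracted .wav file
print(sample["sentence"])    # transcription, e.g. "KHÁCH SẠN"
```
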
### Data Fields

- `speaker_id`: an ID identifying which speaker (voice) made the recording
- `path`: the path to the audio file
- `sentence`: the sentence the user was prompted to speak

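Since `path` is stored as a plain string, decoding the audio is left to the user. A hedged sketch follows, assuming a `datasets` version that ships the `Audio` feature and a 16 kHz sampling rate for the VIVOS recordings (both are assumptions, not stated in the PR):

```python
import datasets

ds = datasets.load_dataset("vivos", split="train")  # dataset name assumed, as above

# Assumption: a datasets release that includes the Audio feature; 16 kHz assumed.
ds = ds.cast_column("path", datasets.Audio(sampling_rate=16_000))

example = ds[0]
print(example["path"]["sampling_rate"])  # 16000
print(example["path"]["array"].shape)    # decoded waveform as a NumPy array
```
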
### Data Splits

The speech material has been subdivided into train and test portions.

Speech was recorded in a quiet environment with a high-quality microphone; speakers were asked to read one sentence at a time.

|                  | Train | Test  |
| ---------------- | ----- | ----- |
| Speakers         | 46    | 19    |
| Utterances       | 11660 | 760   |
| Duration (hh:mm) | 14:55 | 00:45 |
| Unique Syllables | 4617  | 1692  |
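
A quick way to double-check these split sizes after loading (illustrative only, same naming assumption as above):

```python
from datasets import load_dataset

vivos = load_dataset("vivos")  # dataset name assumed
print({split: len(ds) for split, ds in vivos.items()})
# expected: {'train': 11660, 'test': 760}
```
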
## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

The dataset consists of recordings from people who have donated their voices. You agree to not attempt to determine the identity of speakers in the VIVOS dataset.
## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

The dataset was initially prepared by AILAB, a computer science lab of VNUHCM - University of Science.

### Licensing Information

Creative Commons Attribution NonCommercial ShareAlike v4.0 (CC BY-NC-SA 4.0)

### Citation Information

```
@InProceedings{vivos:2016,
Address = {Ho Chi Minh, Vietnam},
title = {VIVOS: 15 hours of recording speech prepared for Vietnamese Automatic Speech Recognition},
author={Prof. Vu Hai Quan},
year={2016}
}
```

### Contributions

Thanks to [@binh234](https://github.com/binh234) for adding this dataset.
dataset_infos.json (new file):

```json
{"default": {"description": "VIVOS is a free Vietnamese speech corpus consisting of 15 hours of recording speech prepared for\nVietnamese Automatic Speech Recognition task.\nThe corpus was prepared by AILAB, a computer science lab of VNUHCM - University of Science, with Prof. Vu Hai Quan is the head of.\nWe publish this corpus in hope to attract more scientists to solve Vietnamese speech recognition problems.\n", "citation": "@InProceedings{vivos:2016,\nAddress = {Ho Chi Minh, Vietnam}\ntitle = {VIVOS: 15 hours of recording speech prepared for Vietnamese Automatic Speech Recognition},\nauthor={Prof. Vu Hai Quan},\nyear={2016}\n}\n", "homepage": "https://ailab.hcmus.edu.vn/vivos", "license": "cc-by-sa-4.0", "features": {"speaker_id": {"dtype": "string", "id": null, "_type": "Value"}, "path": {"dtype": "string", "id": null, "_type": "Value"}, "sentence": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "vivos_dataset", "config_name": "default", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 3186233, "num_examples": 11660, "dataset_name": "vivos_dataset"}, "test": {"name": "test", "num_bytes": 193258, "num_examples": 760, "dataset_name": "vivos_dataset"}}, "download_checksums": {"https://ailab.hcmus.edu.vn/assets/vivos.tar.gz": {"num_bytes": 1474408300, "checksum": "147477f7a7702cbafc2ee3808d1c142989d0dbc8d9fce8e07d5f329d5119e4ca"}}, "download_size": 1474408300, "post_processing_size": null, "dataset_size": 3379491, "size_in_bytes": 1477787791}}
```
Binary file not shown.
Dataset loading script (new file):

```python
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os

import datasets


# Find for instance the citation on arxiv or on the dataset repo/website
_CITATION = """\
@InProceedings{vivos:2016,
Address = {Ho Chi Minh, Vietnam}
title = {VIVOS: 15 hours of recording speech prepared for Vietnamese Automatic Speech Recognition},
author={Prof. Vu Hai Quan},
year={2016}
}
"""

_DESCRIPTION = """\
VIVOS is a free Vietnamese speech corpus consisting of 15 hours of recording speech prepared for
Vietnamese Automatic Speech Recognition task.
The corpus was prepared by AILAB, a computer science lab of VNUHCM - University of Science, with Prof. Vu Hai Quan is the head of.
We publish this corpus in hope to attract more scientists to solve Vietnamese speech recognition problems.
"""

_HOMEPAGE = "https://ailab.hcmus.edu.vn/vivos"

_LICENSE = "cc-by-sa-4.0"

_DATA_URL = "https://ailab.hcmus.edu.vn/assets/vivos.tar.gz"


class VivosDataset(datasets.GeneratorBasedBuilder):
    """VIVOS is a free Vietnamese speech corpus consisting of 15 hours of recording speech prepared for
    Vietnamese Automatic Speech Recognition task."""

    VERSION = datasets.Version("1.1.0")

    # This is an example of a dataset with multiple configurations.
    # If you don't want/need to define several sub-sets in your dataset,
    # just remove the BUILDER_CONFIG_CLASS and the BUILDER_CONFIGS attributes.

    # If you need to make complex sub-parts in the datasets with configurable options
    # You can create your own builder configuration class to store attribute, inheriting from datasets.BuilderConfig
    # BUILDER_CONFIG_CLASS = MyBuilderConfig

    def _info(self):
        return datasets.DatasetInfo(
            # This is the description that will appear on the datasets page.
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "speaker_id": datasets.Value("string"),
                    "path": datasets.Value("string"),
                    "sentence": datasets.Value("string"),
                }
            ),
            supervised_keys=None,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        # If several configurations are possible (listed in BUILDER_CONFIGS), the configuration selected by the user is in self.config.name

        # dl_manager is a datasets.download.DownloadManager that can be used to download and extract URLs
        # It can accept any type or nested list/dict and will give back the same structure with the url replaced with path to local files.
        # By default the archives will be extracted and a path to a cached folder where they are extracted is returned instead of the archive
        dl_path = dl_manager.download_and_extract(_DATA_URL)
        data_dir = os.path.join(dl_path, "vivos")
        train_dir = os.path.join(data_dir, "train")
        test_dir = os.path.join(data_dir, "test")

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "filepath": os.path.join(train_dir, "prompts.txt"),
                    "path_to_clips": os.path.join(train_dir, "waves"),
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "filepath": os.path.join(test_dir, "prompts.txt"),
                    "path_to_clips": os.path.join(test_dir, "waves"),
                },
            ),
        ]

    def _generate_examples(
        self,
        filepath,
        path_to_clips,  # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
    ):
        """Yields examples as (key, example) tuples."""
        # This method handles input defined in _split_generators to yield (key, example) tuples from the dataset.
        # The `key` is here for legacy reason (tfds) and is not important in itself.

        with open(filepath, encoding="utf-8") as f:
            lines = f.readlines()
            for id_, row in enumerate(lines):
                data = row.strip().split(" ", 1)
                speaker_id = data[0].split("_")[0]
                yield id_, {
                    "speaker_id": speaker_id,
                    "path": os.path.join(path_to_clips, speaker_id, data[0] + ".wav"),
                    "sentence": data[1],
                }
```
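
As a reading aid for `_generate_examples` (not part of the PR), here is a standalone sketch of how one `prompts.txt` line is parsed, using a made-up line in the "<utterance_id> <transcript>" format the script expects:

```python
import os

# Hypothetical prompts.txt line: "<utterance_id> <transcript>".
row = "VIVOSSPK01_R001 KHÁCH SẠN\n"

utt_id, sentence = row.strip().split(" ", 1)  # split off the utterance id
speaker_id = utt_id.split("_")[0]             # "VIVOSSPK01"
wav_path = os.path.join("waves", speaker_id, utt_id + ".wav")

print(speaker_id, wav_path, sentence)
# VIVOSSPK01 waves/VIVOSSPK01/VIVOSSPK01_R001.wav KHÁCH SẠN
```
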