- Default hyper parameters, used during the preprocessing/training/synthesis stages, are tuned for English TTS using the LJSpeech dataset. You will have to change some of the parameters if you want to try other datasets. See `hparams.py` for details.
- `builder` specifies which model you want to use. `deepvoice3`, `deepvoice3_multispeaker` [1] and `nyanko` [2] are supported.
- `presets` represents hyper parameters known to work well for a particular dataset/model from my experiments. Before you try to find your best parameters, I would recommend trying those presets by setting `preset=${name}`, e.g. `preset=deepvoice3_ljspeech` for LJSpeech (see the sketch after this list).
- Hyper parameters described in the DeepVoice3 paper for the single-speaker model didn't work for the LJSpeech dataset, so I changed a few things: dilated convolutions, more channels, more layers and a guided attention loss, among others. See the code for details. The changes also apply to the multi-speaker model.
- Multiple attention layers are hard to learn. Empirically, one or two (first and last) attention layers seem to be enough.
- With guided attention (see https://arxiv.org/abs/1710.08969), alignments become monotonic more quickly and reliably when multiple attention layers are used. With guided attention, I can confirm that five attention layers become monotonic, though I did not get speech quality improvements.
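For example, a minimal sketch of trying the LJSpeech presets with the `--hparams` style described above, assuming your preprocessed LJSpeech features live in `./data/ljspeech`; the `nyanko_ljspeech` preset name is an assumption, so check the `presets` directory for the names that actually ship with the repository:

```
# DeepVoice3 single-speaker model with the LJSpeech preset
python train.py --data-root=./data/ljspeech \
    --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech"

# Nyanko model with its LJSpeech preset (assumed preset name)
python train.py --data-root=./data/ljspeech \
    --hparams="builder=nyanko,preset=nyanko_ljspeech"
```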
```
git clone https://github.com/r9y9/deepvoice3_pytorch && cd deepvoice3_pytorch
pip install -e ".[train]"
```
If you want the Japanese text processing frontend, install the additional dependencies by:
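For example, assuming the Japanese dependencies are declared as a `jp` extra in `setup.py`:

```
pip install -e ".[jp]"
```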
## Getting started
### Preset parameters
There are many hyper parameters to be tuned depending on what model and data you are working on. For typical datasets and models, parameters known to work well (**presets**) are provided in the repository. See the `presets` directory for details. Notice that
1. `preprocess.py`
2. `train.py`
3. `synthesis.py`
accept an optional `--preset=<json>` parameter, which specifies where to load preset parameters from. If you are going to use preset parameters, then you must use the same `--preset=<json>` throughout preprocessing, training and evaluation, e.g.:
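A sketch, assuming the LJSpeech preset file is named `presets/deepvoice3_ljspeech.json` (file names under `presets/` may differ):

```
# pass the SAME preset JSON at every stage
python preprocess.py --preset=presets/deepvoice3_ljspeech.json ljspeech ~/data/LJSpeech-1.0/ ./data/ljspeech
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech
# mixing different presets (or dropping --preset at one stage) can silently
# produce features and a model built with mismatched hyper parameters
```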
Assuming you use the preset parameters known to work well for the LJSpeech dataset / DeepVoice3 and have the data in `~/data/LJSpeech-1.0`, you can preprocess the data by:
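A minimal sketch (assuming `ljspeech` is the dataset identifier registered in `preprocess.py`):

```
# arguments: dataset name, input directory, output directory for extracted features
python preprocess.py ljspeech ~/data/LJSpeech-1.0/ ./data/ljspeech \
    --preset=presets/deepvoice3_ljspeech.json
```

The output directory (here `./data/ljspeech`) is what you later pass to `train.py` as `--data-root`.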
`frontend=jp` tells the training script to use the Japanese text processing frontend. The default is `en`, which uses the English text processing frontend.
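For example, a sketch of a training invocation that selects the Japanese frontend via `--hparams` (the data directory is a placeholder for wherever your preprocessed Japanese dataset lives):

```
# frontend=jp switches text processing to the Japanese frontend
python train.py --data-root=./data/jsut --hparams="frontend=jp"
```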
Note that there are many hyper parameters and design choices. Some are configurable via `hparams.py` and some are hardcoded in the source (e.g., the dilation factor for each convolution layer). If you find better hyper parameters, please let me know!
Model checkpoints (.pth) and alignments (.png) are saved in the `./checkpoints` directory every 10000 steps by default.
#### NIKL
Please check [this](https://github.com/homink/deepvoice3_pytorch/blob/master/nikl_preprocess/README.md) in advance and follow the commands below.
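The authoritative commands are in the linked README; roughly, preprocessing should look like the sketch below (the `nikl_s` dataset identifier and the preset file name are assumptions, so use whatever the linked instructions specify):

```
# single-speaker NIKL preprocessing; dataset identifier and preset name are assumptions
python preprocess.py nikl_s ${your_nikl_root_path} ./data/nikl_s \
    --preset=presets/<your_nikl_preset>.json
```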
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
Note that you have to use the same hyper parameters as used for training. For example, if you are using the hyper parameters `preset=deepvoice3_ljspeech,builder=deepvoice3` for training, then the synthesis command should be:
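Roughly like the following, assuming `synthesis.py` takes the checkpoint, a text-list file and an output directory as positional arguments (the paths are placeholders):

```
python synthesis.py --hparams="preset=deepvoice3_ljspeech,builder=deepvoice3" \
    checkpoints/<your_checkpoint>.pth text_list.txt output_dir
```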
VCTK and NIKL are supported datasets for building a multi-speaker model.
#### VCTK
Since some audio samples in VCTK have long silences that affect performance, it's recommended to do phoneme alignment and remove silences according to [vctk_preprocess](vctk_preprocess/).
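After that cleanup, preprocessing and training follow the same pattern as the single-speaker case; a sketch, assuming `vctk` is the dataset identifier in `preprocess.py` and a VCTK preset ships under `presets/`:

```
# multi-speaker VCTK; dataset identifier and preset file name are assumptions
python preprocess.py vctk ${your_vctk_root_path} ./data/vctk \
    --preset=presets/deepvoice3_vctk.json
python train.py --data-root=./data/vctk --preset=presets/deepvoice3_vctk.json
```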
For NIKL, you will be able to obtain cleaned-up audio samples in `../nikl_preprocess`. Details can be found [here](https://github.com/homink/speech.ko).