
Commit 747f2e0

Merge branch 'master' into master
2 parents a155fb9 + 52d1026

12 files changed: +440 / -202 lines

README.md

Lines changed: 43 additions & 52 deletions
@@ -29,6 +29,8 @@ Audio samples are available at https://r9y9.github.io/deepvoice3_pytorch/.
 
 ## Pretrained models
 
+**NOTE**: pretrained models are not compatible with master. To be updated soon.
+
 | URL | Model | Data | Hyper parameters | Git commit | Steps |
 |-----|------------|----------|--------------------------------------------------|----------------------|--------|
 | [link](https://www.dropbox.com/s/cs6d070ommy2lmh/20171213_deepvoice3_checkpoint_step000210000.pth?dl=0) | DeepVoice3 | LJSpeech | `builder=deepvoice3,preset=deepvoice3_ljspeech` | [4357976](https://github.com/r9y9/deepvoice3_pytorch/tree/43579764f35de6b8bac2b18b52a06e4e11b705b2)| 210k ~ |
@@ -41,18 +43,6 @@ See "Synthesize from a checkpoint" section in the README for how to generate spe
 
 - Default hyper parameters, used during the preprocessing/training/synthesis stages, are tuned for English TTS using the LJSpeech dataset. You will have to change some of the parameters if you want to try other datasets. See `hparams.py` for details.
 - `builder` specifies which model you want to use. `deepvoice3`, `deepvoice3_multispeaker` [1] and `nyanko` [2] are supported.
-- `presets` represents hyper parameters known to work well for a particular dataset/model from my experiments. Before you try to find your best parameters, I would recommend you to try those presets by setting `preset=${name}`. e.g., for LJSpeech, you can try either
-```
-python train.py --data-root=./data/ljspeech --checkpoint-dir=checkpoints_deepvoice3 \
-    --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech" \
-    --log-event-path=log/deepvoice3_preset
-```
-or
-```
-python train.py --data-root=./data/ljspeech --checkpoint-dir=checkpoints_nyanko \
-    --hparams="builder=nyanko,preset=nyanko_ljspeech" \
-    --log-event-path=log/nyanko_preset
-```
 - Hyper parameters described in the DeepVoice3 paper for the single-speaker model didn't work for the LJSpeech dataset, so I changed a few things: added dilated convolution, more channels, more layers, guided attention loss, etc. See the code for details. The changes are also applied to the multi-speaker model.
 - Multiple attention layers are hard to learn. Empirically, one or two (first and last) attention layers seem enough.
 - With guided attention (see https://arxiv.org/abs/1710.08969), alignments become monotonic more quickly and reliably if we use multiple attention layers. With guided attention, I can confirm five attention layers get monotonic, though I cannot get speech quality improvements.
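
For intuition, the guided attention penalty from the paper cited in the last bullet puts near-zero weight on the diagonal of the attention matrix and grows away from it; the loss is the mean attention mass under that penalty. A minimal numpy sketch (the function and variable names are mine and the `g=0.2` default comes from that paper; this is not the repository's implementation):

```
import numpy as np

def guided_attention_matrix(N, T, g=0.2):
    # W[n, t] is ~0 near the diagonal (monotonic text/audio alignment)
    # and approaches 1 far from it, following arXiv:1710.08969.
    n = np.arange(N).reshape(-1, 1) / N  # normalized text positions
    t = np.arange(T).reshape(1, -1) / T  # normalized decoder timesteps
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

# `attn` stands in for an (N, T) attention matrix produced by the model.
attn = np.full((50, 200), 1.0 / 50)  # dummy uniform attention
loss = float(np.mean(attn * guided_attention_matrix(50, 200)))
```
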
@@ -74,18 +64,34 @@ python train.py --data-root=./data/ljspeech --checkpoint-dir=checkpoints_nyanko
 Please install packages listed above first, and then
 
 ```
-git clone https://github.com/r9y9/deepvoice3_pytorch
-cd deepvoice3_pytorch
+git clone https://github.com/r9y9/deepvoice3_pytorch && cd deepvoice3_pytorch
 pip install -e ".[train]"
 ```
 
-If you want the Japanese text processing frontend, install additional dependencies by:
+## Getting started
+
+### Preset parameters
+
+There are many hyper parameters to be tuned, depending on which model and data you are working on. For typical datasets and models, parameters known to work well (a **preset**) are provided in the repository. See the `presets` directory for details. Notice that
+
+1. `preprocess.py`
+2. `train.py`
+3. `synthesis.py`
+
+accept an optional `--preset=<json>` parameter, which specifies where to load preset parameters from. If you are going to use preset parameters, then you must use the same `--preset=<json>` throughout preprocessing, training and evaluation. e.g.,
 
 ```
-pip install -e ".[jp]"
+python preprocess.py --preset=presets/deepvoice3_ljspeech.json ljspeech ~/data/LJSpeech-1.0
+python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech
 ```
 
-## Getting started
+instead of
+
+```
+python preprocess.py ljspeech ~/data/LJSpeech-1.0
+# warning! this may use hyper parameters different from those used at the preprocessing stage
+python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech
+```
 
 ### 0. Download dataset
 
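The loading pattern this commit introduces is worth spelling out: the preset JSON is parsed first, and `--hparams` overrides are applied afterwards, so explicit command-line values win. A minimal sketch of that pattern, mirroring the `compute_timestamp_ratio.py` hunk further down in this commit (the helper name is mine, not the repository's):

```
from hparams import hparams

def load_hparams(preset=None, hparams_string=""):
    # Preset JSON first, then explicit overrides on top.
    if preset is not None:
        with open(preset) as f:
            hparams.parse_json(f.read())
    hparams.parse(hparams_string)
    return hparams
```
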

@@ -96,62 +102,52 @@ pip install -e ".[jp]"
 
 ### 1. Preprocessing
 
-Preprocessing can be done by `preprocess.py`. Usage is:
+Usage:
 
 ```
-python preprocess.py ${dataset_name} ${dataset_path} ${out_dir}
+python preprocess.py ${dataset_name} ${dataset_path} ${out_dir} --preset=<json>
 ```
 
-Supported `${dataset_name}`s for now are
+Supported `${dataset_name}`s are:
 
 - `ljspeech` (en, single speaker)
 - `vctk` (en, multi-speaker)
 - `jsut` (jp, single speaker)
 - `nikl_m` (ko, multi-speaker)
 - `nikl_s` (ko, single speaker)
 
-Suppose you want to preprocess the LJSpeech dataset and have it in `~/data/LJSpeech-1.0`; then you can preprocess the data by:
+Assuming you use preset parameters known to work well for the LJSpeech dataset / DeepVoice3 and have the data in `~/data/LJSpeech-1.0`, you can preprocess it by:
 
 ```
-python preprocess.py ljspeech ~/data/LJSpeech-1.0/ ./data/ljspeech
+python preprocess.py --preset=presets/deepvoice3_ljspeech.json ljspeech ~/data/LJSpeech-1.0/ ./data/ljspeech
 ```
 
 When this is done, you will see extracted features (mel-spectrograms and linear spectrograms) in `./data/ljspeech`.
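
If you want to sanity-check the output, the features are plain numpy arrays; the file names below are an assumption (Tacotron-style naming), so verify them against what `preprocess.py` actually wrote to `./data/ljspeech`:

```
import numpy as np

# Hypothetical file names -- check your ./data/ljspeech directory.
mel = np.load("./data/ljspeech/ljspeech-mel-00001.npy")    # (frames, num_mels)
spec = np.load("./data/ljspeech/ljspeech-spec-00001.npy")  # (frames, fft_size // 2 + 1)
print(mel.shape, spec.shape)
```
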
 
 ### 2. Training
 
-Basic usage of `train.py` is:
+Usage:
 
 ```
-python train.py --data-root=${data-root} --hparams="parameters you want to override"
+python train.py --data-root=${data-root} --preset=<json> --hparams="parameters you may want to override"
 ```
 
-Suppose you want to build a DeepVoice3-style model using the LJSpeech dataset with default hyper parameters; then you can train your model by:
+Suppose you build a DeepVoice3-style model using the LJSpeech dataset; then you can train your model by:
 
 ```
-python train.py --data-root=./data/ljspeech/ --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech"
+python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/
 ```
 
-Model checkpoints (.pth) and alignments (.png) are saved in the `./checkpoints` directory per 5000 steps by default.
-
-If you are building a Japanese TTS model, then for example,
-
-```
-python train.py --data-root=./data/jsut --hparams="frontend=jp" --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech"
-```
-
-`frontend=jp` tells the training script to use the Japanese text processing frontend. The default is `en`, which uses the English text processing frontend.
-
-Note that there are many hyper parameters and design choices. Some are configurable via `hparams.py` and some are hardcoded in the source (e.g., dilation factor for each convolution layer). If you find better hyper parameters, please let me know!
+Model checkpoints (.pth) and alignments (.png) are saved in the `./checkpoints` directory per 10000 steps by default.
 
 #### NIKL
+
 Please check [this](https://github.com/homink/deepvoice3_pytorch/blob/master/nikl_preprocess/README.md) in advance and follow the commands below.
 
 ```
-python preprocess.py nikl_s ${your_nikl_root_path} data/nikl_s
+python preprocess.py nikl_s ${your_nikl_root_path} data/nikl_s --preset=presets/deepvoice3_nikls.json
 
-python train.py --data-root=./data/nikl_s --checkpoint-dir checkpoint_nikl_s \
-    --hparams="frontend=ko,builder=deepvoice3,preset=deepvoice3_nikls"
+python train.py --data-root=./data/nikl_s --checkpoint-dir checkpoint_nikl_s --preset=presets/deepvoice3_nikls.json
 ```
 
 ### 4. Monitor with Tensorboard
@@ -167,7 +163,7 @@ tensorboard --logdir=log
 Given a list of texts, `synthesis.py` synthesizes audio signals from a trained model. Usage is:
 
 ```
-python synthesis.py ${checkpoint_path} ${text_list.txt} ${output_dir}
+python synthesis.py ${checkpoint_path} ${text_list.txt} ${output_dir} --preset=<json>
 ```
 
 Example test_list.txt:
@@ -178,17 +174,11 @@ Once upon a time there was a dear little girl who was loved by every one who loo
 A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
 ```
 
-Note that you have to use the same hyper parameters used for training. For example, if you used hyper parameters `preset=deepvoice3_ljspeech,builder=deepvoice3` for training, then the synthesis command should be:
-
-```
-python synthesis.py --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech" ${checkpoint_path} ${text_list.txt} ${output_dir}
-```
-
 ## Advanced usage
 
 ### Multi-speaker model
 
-VCTK and NIKL are supported dataset for building a multi-speaker model.
+VCTK and NIKL are the supported datasets for building a multi-speaker model.
 
 #### VCTK
 Since some audio samples in VCTK have long silences that affect performance, it's recommended to do phoneme alignment and remove silences according to [vctk_preprocess](vctk_preprocess/).
@@ -203,22 +193,23 @@ Now that you have data prepared, then you can train a multi-speaker version of D
 
 ```
 python train.py --data-root=./data/vctk --checkpoint-dir=checkpoints_vctk \
-    --hparams="preset=deepvoice3_vctk,builder=deepvoice3_multispeaker" \
+    --preset=presets/deepvoice3_vctk.json \
     --log-event-path=log/deepvoice3_multispeaker_vctk_preset
 ```
 
 If you want to reuse a learned embedding from another dataset, then you can do this instead:
 
 ```
 python train.py --data-root=./data/vctk --checkpoint-dir=checkpoints_vctk \
-    --hparams="preset=deepvoice3_vctk,builder=deepvoice3_multispeaker" \
+    --preset=presets/deepvoice3_vctk.json \
     --log-event-path=log/deepvoice3_multispeaker_vctk_preset \
     --load-embedding=20171213_deepvoice3_checkpoint_step000210000.pth
 ```
 
 This may improve training speed a bit.
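
Conceptually, `--load-embedding` copies learned embedding weights from the given checkpoint into the new model before training starts. A rough sketch of the idea (the checkpoint layout and the parameter key below are assumptions, not the repository's actual names):

```
import torch

def load_embedding(model, checkpoint_path, key="embed_tokens.weight"):
    # `key` is a guess -- inspect checkpoint["state_dict"].keys() for
    # the real parameter name in your checkpoint.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    with torch.no_grad():
        model.state_dict()[key].copy_(checkpoint["state_dict"][key])
    return model
```
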
 
 #### NIKL
+
 You will be able to obtain cleaned-up audio samples in ../nikl_preprocess. Details can be found [here](https://github.com/homink/speech.ko).
 
 
@@ -232,7 +223,7 @@ Now that you have data prepared, then you can train a multi-speaker version of D
 
 ```
 python train.py --data-root=./data/nikl_m --checkpoint-dir checkpoint_nikl_m \
-    --hparams="frontend=ko,builder=deepvoice3,preset=deepvoice3_niklm,builder=deepvoice3_multispeaker"
+    --preset=presets/deepvoice3_niklm.json
 ```
 
 ### Speaker adaptation
@@ -241,7 +232,7 @@ If you have very limited data, then you can consider to try fine-turn pre-traine
 
 ```
 python train.py --data-root=./data/vctk --checkpoint-dir=checkpoints_vctk_adaptation \
-    --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech" \
+    --preset=presets/deepvoice3_ljspeech.json \
     --log-event-path=log/deepvoice3_vctk_adaptation \
     --restore-parts="20171213_deepvoice3_checkpoint_step000210000.pth"
     --speaker-id=0
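
The `--restore-parts` flag restores pretrained weights where they fit, leaving the remainder (e.g., speaker embeddings that don't exist in a single-speaker checkpoint) at their fresh initialization. A hedged sketch of that idea in PyTorch, not the repository's actual implementation:

```
import torch

def restore_parts(model, checkpoint_path):
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    pretrained = checkpoint["state_dict"]
    own = model.state_dict()
    # Keep only parameters whose names and shapes match the new model.
    compatible = {k: v for k, v in pretrained.items()
                  if k in own and v.shape == own[k].shape}
    own.update(compatible)
    model.load_state_dict(own)
    return model
```
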

compute_timestamp_ratio.py

Lines changed: 6 additions & 0 deletions
@@ -4,6 +4,7 @@
 
 options:
     --hparams=<params>       Hyper parameters [default: ].
+    --preset=<json>          Path of preset parameters (json).
     -h, --help               Show this help message and exit
 """
 from docopt import docopt
@@ -19,7 +20,12 @@
 if __name__ == "__main__":
     args = docopt(__doc__)
     data_root = args["<data_root>"]
+    preset = args["--preset"]
 
+    # Load preset if specified
+    if preset is not None:
+        with open(preset) as f:
+            hparams.parse_json(f.read())
     # Override hyper parameters
     hparams.parse(args["--hparams"])
     assert hparams.name == "deepvoice3"
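
Note the ordering the hunk above establishes: the preset JSON is parsed before `--hparams`, so explicit command-line overrides win over preset values. The same load-then-override pattern presumably applies in `preprocess.py`, `train.py` and `synthesis.py` among the other files changed by this commit.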

dump_hparams_to_json.py

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+# coding: utf-8
+"""
+Dump hyper parameters to json file.
+
+usage: dump_hparams_to_json.py [options] <output_json_path>
+
+options:
+    -h, --help               Show help message.
+"""
+from docopt import docopt
+
+import sys
+import os
+from os.path import dirname, join, basename, splitext
+
+import audio
+
+# The deepvoice3 model
+from deepvoice3_pytorch import frontend
+from hparams import hparams
+import json
+
+if __name__ == "__main__":
+    args = docopt(__doc__)
+    output_json_path = args["<output_json_path>"]
+
+    j = hparams.values()
+
+    # for compat legacy
+    for k in ["preset", "presets"]:
+        if k in j:
+            del j[k]
+
+    with open(output_json_path, "w") as f:
+        json.dump(j, f, indent=2)
+    sys.exit(0)
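
This new helper snapshots the live `hparams` values to JSON, e.g. `python dump_hparams_to_json.py presets/my_preset.json` (the output path is just an example), producing a file that can be fed back through `--preset`. The legacy `preset`/`presets` keys are dropped first, presumably so the dumped file doesn't re-trigger the old in-`hparams` preset mechanism.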
