PyTorch BERT Trainer using HuggingFace transformers
- python 3.6
- pytorch 1.12
- cuda 10.0
- tensorflow 1.14 (for tensorboard)
- pytorch_transformers
- gluonnlp >= 0.6.0
- apex (for mixed precision training)
- flask (for using api)
Pretrained Korean BERT Model (ETRI or SKT)
Create a pretrained_model directory with one subdirectory per model, laid out as below
pretrained_model
├── etri
│   ├── bert_config.json
│   ├── pytorch_model.bin
│   ├── tokenization.py
│   └── vocab.korean.rawtext.list
└── skt
    ├── bert_config.json
    ├── pytorch_model.bin
    ├── tokenizer.model
    └── vocab.json
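Before training, it can help to confirm that the layout is complete. The following is a minimal sketch, not part of this repository: the helper name `missing_files` is hypothetical, and the file lists are copied from the tree above.

```python
from pathlib import Path

# Expected files per model type, copied from the directory tree above.
EXPECTED_FILES = {
    "etri": ["bert_config.json", "pytorch_model.bin",
             "tokenization.py", "vocab.korean.rawtext.list"],
    "skt": ["bert_config.json", "pytorch_model.bin",
            "tokenizer.model", "vocab.json"],
}


def missing_files(root, pretrained_type):
    """Return the expected files that are absent under root/pretrained_type."""
    model_dir = Path(root) / pretrained_type
    return [name for name in EXPECTED_FILES[pretrained_type]
            if not (model_dir / name).is_file()]


if __name__ == "__main__":
    # Hypothetical usage: report the status of both model directories.
    for kind in EXPECTED_FILES:
        missing = missing_files("pretrained_model", kind)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{kind}: {status}")
```

An empty list means every expected file for that model type was found.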
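- Korean single-turn conversation dataset (한국어 단발성 대화 데이터셋), labeled with seven emotions: 공포 (fear), 놀람 (surprise), 분노 (anger), 슬픔 (sadness), 중립 (neutral), 행복 (happiness), 혐오 (disgust)
- Any dataset with binary labels: 긍정 (positive), 부정 (negative)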
Datasets should be CSV files with two columns named 'Sentence' and 'Emotion'.
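A quick header check can catch a mismatched file before training starts. This is a sketch using only the standard library; the helper `check_columns` and the sample rows are made up for illustration and are not part of this repository.

```python
import csv
import io

REQUIRED_COLUMNS = ("Sentence", "Emotion")


def check_columns(csv_file):
    """Return the required columns that are missing from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [col for col in REQUIRED_COLUMNS if col not in header]


# A tiny in-memory example in the expected layout (rows are made up).
sample = io.StringIO(
    "Sentence,Emotion\n"
    "I passed the exam!,행복\n"
    "Leave me alone.,분노\n"
)
print(check_columns(sample))  # → []
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` object.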
Alternatively, modify the following lines of datasets.py to fit your own dataset:
...
# line 50 - 58
def get_data(self, file_path):
    data = pd.read_csv(file_path)
    corpus = data['Sentence']
    label = None
    try:
        # Map each emotion string to its integer index
        label = [self.label2idx[l] for l in data['Emotion']]
    except KeyError:
        # No 'Emotion' column or an unknown label: leave label as None
        pass
    return corpus, label
...
For masked language model pretraining
$ python train_mlm.py \
    --pretrained_type="etri"
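For readers unfamiliar with the objective, here is a sketch of the token masking recipe from the original BERT paper (select 15% of tokens, then 80% `[MASK]`, 10% random token, 10% unchanged). It is an illustration under those assumptions and does not necessarily match what train_mlm.py does. The function name and the toy vocabulary are hypothetical, and special tokens such as `[CLS]` are not handled.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels are None where no prediction is required."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(token)              # 10%: keep the token unchanged
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Toy example with a made-up vocabulary; the output varies with the seed.
vocab = ["the", "cat", "sat", "on", "mat"]
print(mask_tokens(["the", "cat", "sat"], vocab, rng=random.Random(0)))
```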
For text classification
$ python train_classification.py \
    --pretrained_type="etri"
Classification after further MLM pretraining
$ python train_classification.py \
    --pretrained_model_path=".../best_model.bin"
Use the fp16 argument for mixed precision training
$ python train_classification.py \
    --fp16 \
    --fp16_opt_level="O1"
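Flags of this shape are commonly declared with `argparse`. The parser below is a hypothetical sketch that mirrors the two flags shown above; the actual parser in train_classification.py may differ, and the default value `"O1"` is an assumption.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the flags above; not the repository's code.
    parser = argparse.ArgumentParser(description="mixed precision flags (sketch)")
    parser.add_argument("--fp16", action="store_true",
                        help="enable mixed precision training")
    parser.add_argument("--fp16_opt_level", type=str, default="O1",  # assumed default
                        help="Apex optimization level, e.g. O1")
    return parser


args = build_parser().parse_args(["--fp16", "--fp16_opt_level=O1"])
print(args.fp16, args.fp16_opt_level)  # → True O1
```

Apex defines optimization levels `O0` through `O3`; `O1` is the mixed precision mode used in the command above.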
Inference
$ python test.py \
    --pretrained_model_path="./data/korean_single_test.csv"
After inference, the result files are saved to the /result folder.
- /result/test_result.csv: predicted labels for the test data
- /result/test_result.png: confusion matrix for the test data
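A confusion matrix counts, for each true class, how often each class was predicted. The sketch below shows the idea with the standard library only; the helper and the labels are made up for illustration, and this is not the code that produces test_result.png.

```python
from collections import Counter


def confusion_matrix(true_labels, pred_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(true_labels, pred_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


# Made-up labels for illustration only.
classes = ["positive", "negative"]
y_true = ["positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]
print(confusion_matrix(y_true, y_pred, classes))  # → [[1, 1], [1, 2]]
```

The diagonal entries are the correct predictions; off-diagonal entries show which classes are confused with each other.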
Overall
| Metric | Test set (3,859) |
|---|---|
| Accuracy | 57.69% |
| Macro F1 | 56.84% |
F1 score for each Emotion
| Emotion | F1 |
|---|---|
| 공포 (fear) | 60.00% |
| 놀람 (surprise) | 57.49% |
| 분노 (anger) | 54.60% |
| 슬픔 (sadness) | 62.64% |
| 중립 (neutral) | 44.21% |
| 행복 (happiness) | 81.88% |
| 혐오 (disgust) | 37.04% |
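Macro F1 is the unweighted mean of the per-class F1 scores, so the overall figure can be checked against the table above. A short sketch using the reported values:

```python
# Per-class F1 scores (in percent), copied from the table above.
per_class_f1 = {
    "공포": 60.00, "놀람": 57.49, "분노": 54.60, "슬픔": 62.64,
    "중립": 44.21, "행복": 81.88, "혐오": 37.04,
}

# Macro F1 weights every class equally, regardless of class size.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"{macro_f1:.2f}%")  # → 56.84%
```

Because every class counts equally, the weak 혐오 and 중립 scores pull the macro average down as much as the strong 행복 score lifts it.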
$ python app.py
Example screenshots (images not included): a sad case and a happy case.


