---
language:
- "zh"
thumbnail: "https://user-images.githubusercontent.com/9592150/97142000-cad08e00-179a-11eb-88df-aff9221482d8.png"
tags:
- "chinese"
- "classical chinese"
- "literary chinese"
- "ancient chinese"
- "bert"
- "pytorch"
license: "apache-2.0"
pipeline_tag: "fill-mask"
widget:
- text: "[MASK]太元中,武陵人捕鱼为业。"
- text: "山不在[MASK],有仙则名。"
- text: "浔阳江头夜送客,枫叶[MASK]花秋瑟瑟。"
---

# GuwenBERT

## Model description

This is a RoBERTa model pre-trained on Classical Chinese. You can fine-tune GuwenBERT for downstream tasks such as sentence breaking, punctuation restoration, and named entity recognition.

For more information about RoBERTa, take a look at RoBERTa's official repository.

## How to use

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
model = AutoModel.from_pretrained("ethanyt/guwenbert-large")
```

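Because the model is released with the `fill-mask` pipeline tag, you can also query it through the pipeline API. This is only a minimal usage sketch; the example sentence is taken from the widget above.

```python
from transformers import pipeline

# Fill-mask pipeline built on GuwenBERT (minimal sketch).
fill_mask = pipeline("fill-mask", model="ethanyt/guwenbert-large")

# Example from the widget above: "山不在[MASK],有仙则名。"
for prediction in fill_mask("山不在[MASK],有仙则名。"):
    print(prediction["token_str"], prediction["score"])
```
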
## Training data

The training data is the Daizhige dataset (殆知阁古代文献), which contains 15,694 books in Classical Chinese covering Buddhism, Confucianism, Medicine, History, Zi, Yi, Yizang, Shizang, Taoism, and Jizang.
76% of the books are punctuated.
The total number of characters is 1.7B (1,743,337,673).
All traditional characters are converted to simplified characters.
The vocabulary is constructed from this data set and its size is 23,292.

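The preprocessing scripts are not part of this card; the snippet below is only a rough sketch of the two data-preparation facts above, assuming the third-party `opencc` package for the traditional-to-simplified conversion.

```python
from opencc import OpenCC          # third-party package, assumed here for illustration
from transformers import AutoTokenizer

# Traditional -> simplified conversion, as described above.
converter = OpenCC("t2s")
print(converter.convert("山不在高,有仙則名。"))  # -> 山不在高,有仙则名。

# The released tokenizer should reflect the 23,292-token vocabulary mentioned above.
tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
print(len(tokenizer))
```
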
## Training procedure

The models are initialized with `hfl/chinese-roberta-wwm-ext-large` and then pre-trained with a 2-step strategy.
In the first step, the model learns MLM with only the word embeddings updated during training, until convergence. In the second step, all parameters are updated during training.

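The original training scripts are not included in this card; the following is only a sketch of what step 1 could look like in PyTorch with `transformers`, freezing everything except the (tied) word embeddings.

```python
from transformers import AutoModelForMaskedLM

# Initialize from the Chinese RoBERTa checkpoint named above.
model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext-large")

# Step 1: freeze all parameters, then unfreeze only the word embeddings,
# so the Classical Chinese vocabulary is learned first.
# Note: GuwenBERT uses its own 23,292-token vocabulary, so in the real setup
# the embedding matrix would also need to be rebuilt for that vocabulary.
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True

# Step 2 (after step 1 converges): unfreeze everything and continue MLM training.
# for param in model.parameters():
#     param.requires_grad = True
```
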
The models are trained on 4 V100 GPUs for 120K steps (20K for step 1, 100K for step 2) with a batch size of 2,048 and a sequence length of 512. The optimizer is Adam with a learning rate of 1e-4, betas of (0.9, 0.98), epsilon of 1e-6, and a weight decay of 0.01, with learning-rate warmup over the first 5K steps followed by linear decay.

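These hyperparameters map onto standard PyTorch and `transformers` utilities. The sketch below only restates the values from the paragraph above; `AdamW` and the linear-warmup schedule are stand-ins for whatever the original training code used, and `model` refers to the previous sketch.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Values copied from the description above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)
# Linear warmup for the first 5K steps, then linear decay over the remaining 115K.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=5_000, num_training_steps=120_000
)
```
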
## Eval results

### "Gulian Cup" Ancient Books Named Entity Recognition Evaluation

Second place in the competition. Detailed test results:

| NE Type    | Precision | Recall | F1    |
|:----------:|:---------:|:------:|:-----:|
| Book Name  | 77.50     | 73.73  | 75.57 |
| Other Name | 85.85     | 89.32  | 87.55 |
| Micro Avg. | 83.88     | 85.39  | 84.63 |

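The scores above come from a token-classification (NER) head fine-tuned on top of GuwenBERT; that fine-tuned model is not shipped with this card, so the snippet below is only an illustrative sketch of how such a setup could be loaded. The label set is hypothetical.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical BIO label set over the entity types in the table above;
# the actual competition system is not released here.
labels = ["O", "B-BOOK", "I-BOOK", "B-OTHER", "I-OTHER"]

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
model = AutoModelForTokenClassification.from_pretrained(
    "ethanyt/guwenbert-large",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# `model` can then be fine-tuned on character-level NER data with the usual
# token-classification training loop (e.g. the `Trainer` API).
```
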
## About Us

We are from [Datahammer](https://datahammer.net), Beijing Institute of Technology.
For cooperation, please contact us by email: ethanyt [at] qq.com

> Created with ❤️ by [Tan Yan](https://github.com/Ethan-yt) and [Zewen Chi](https://github.com/CZWin32768)