Merged
Changes from 18 commits
Commits
44 commits
c158781
glm_config
Aug 31, 2023
9eb7eb9
Merge branch 'main' into GLM-PR
GGuanl Aug 31, 2023
3f6afe7
fix_#1
GGuanl Sep 1, 2023
c48bd1b
glm-config_updated
GGuanl Sep 6, 2023
acbf4b6
glm-config-updated#2
GGuanl Sep 6, 2023
53b0999
glm_config-updated#2
GGuanl Sep 6, 2023
dd3c478
glm_config-#2
GGuanl Sep 6, 2023
7659bea
Update README.md
GGuanl Sep 6, 2023
f953f4e
Update README.md
GGuanl Sep 6, 2023
04d5bd9
Update README.md
GGuanl Sep 6, 2023
c038ab7
Update README.md
GGuanl Sep 6, 2023
046fdf7
Update README.md
GGuanl Sep 6, 2023
6b6fd85
Update README.md
GGuanl Sep 6, 2023
4f998e3
Update README.md
GGuanl Sep 6, 2023
629b37a
Update README.md
GGuanl Sep 6, 2023
be5eb37
Update pytorch_install.sh
GGuanl Sep 6, 2023
22eeefc
Create config_common
GGuanl Sep 6, 2023
0dee798
Update README.md
GGuanl Sep 6, 2023
c2f993f
Rename config_common to config_common.py
GGuanl Sep 7, 2023
ca34bb6
Update config_R300x2x8.py
GGuanl Sep 7, 2023
a9d02b8
Update config_R300x1x1.py
GGuanl Sep 7, 2023
972ed9f
Update config_R300x1x8.py
GGuanl Sep 7, 2023
f1714c5
Update README.md
GGuanl Sep 7, 2023
ace9fea
Update README.md
GGuanl Sep 7, 2023
c474b63
Update README.md
GGuanl Sep 7, 2023
9e30809
Update README.md
GGuanl Sep 7, 2023
015a751
Update README.md
GGuanl Sep 7, 2023
60dee8a
Update requirements.txt
GGuanl Sep 7, 2023
32eb6f1
Update README.md
GGuanl Sep 7, 2023
9d81ff1
Update config_R300x1x1.py
GGuanl Sep 7, 2023
7952993
Update config_R300x1x8.py
GGuanl Sep 7, 2023
27b48b5
Update config_R300x2x8.py
GGuanl Sep 7, 2023
c0e3ab4
Update config_R300x1x1.py
GGuanl Sep 7, 2023
7394352
Update config_R300x1x8.py
GGuanl Sep 7, 2023
186ae5d
Update config_R300x2x8.py
GGuanl Sep 7, 2023
2f204f5
Update config_common.py
GGuanl Sep 7, 2023
26acd75
Update config_R300x1x1.py
GGuanl Sep 7, 2023
13cc7be
Update config_R300x2x8.py
GGuanl Sep 7, 2023
d207087
Update README.md
GGuanl Sep 7, 2023
27033ba
Update README.md
GGuanl Sep 7, 2023
a44db3f
Update README.md
GGuanl Sep 7, 2023
bce5f61
Update config_R300x1x1.py
GGuanl Sep 7, 2023
42f761d
Update config_R300x1x8.py
GGuanl Sep 7, 2023
cda8284
Update config_R300x2x8.py
GGuanl Sep 7, 2023
5 changes: 4 additions & 1 deletion training/kunlunxin/README.md
@@ -26,7 +26,10 @@ R480-X8 is built on high-speed multi-chip interconnect technology; a single machine can deliver up to 1 Peta Ops @F
- OS version: Ubuntu 20.04
- OS kernel version: 5.4.0-26-generic
- Accelerator driver version: 4.0.25
- Docker image and version: pytorch1.12.1-cpu-ubuntu18.04:v0.04
- Docker image and version: pytorch1.12.1-cpu-ubuntu20.04:v0.01
- Training framework version: xmlir+111e7d45 ([xmlir download](https://bd.bcebos.com/klx-pytorch-ipipe-bd/flagperf/111e7d45/xacc-0.1.0-cp38-cp38-linux_x86_64.whl))
Review comment (Contributor): Are these two download URLs reachable? [image]

- Training compiler version: xacc+111e7d45 ([xacc download](https://bd.bcebos.com/klx-pytorch-ipipe-bd/flagperf/111e7d45/xmlir-0.0.1-cp38-cp38-linux_x86_64.whl))
- Dependency version: pytorch-1.12.1+cpu

## Container image information
- Container build information
70 changes: 30 additions & 40 deletions training/kunlunxin/glm-pytorch/README.md
@@ -14,45 +14,35 @@
- OS version: Ubuntu 20.04
- OS kernel version: 5.4.0-26-generic
- Accelerator driver version: 4.0.25
- Docker image and version: pytorch1.12.1-cpu-ubuntu18.04:v0.04
- Training framework version: xmlir+e70db8f6
- Docker image and version: pytorch1.12.1-cpu-ubuntu20.04:v0.01
- Training framework version: xmlir+111e7d45 ([xmlir download](https://bd.bcebos.com/klx-pytorch-ipipe-bd/flagperf/archives/111e7d45/xmlir-0.0.1-cp38-cp38-linux_x86_64.whl))
- Training compiler version: xacc+111e7d45 ([xacc download](https://bd.bcebos.com/klx-pytorch-ipipe-bd/flagperf/archives/111e7d45/xacc-0.1.0-cp38-cp38-linux_x86_64.whl))
- Dependency version: pytorch-1.12.1+cpu

### How to run the test

Edit the following settings in `FlagPerf/training/run_benchmarks/config/test_conf.py`:

```python
VENDOR = "kunlunxin"

ACCE_CONTAINER_OPT = " --device=/dev/xpu0 --device=/dev/xpu1 --device=/dev/xpu2" + \
" --device=/dev/xpu3 --device=/dev/xpu4 --device=/dev/xpu5" + \
" --device=/dev/xpu6 --device=/dev/xpu7 --device=/dev/xpuctrl"

ACCE_VISIBLE_DEVICE_ENV_NAME = "XPU_VISIBLE_DEVICES"

CASES = [
"GLM_TORCH_DEMO_R300_1X1",
"GLM_TORCH_DEMO_R300_1X2",
"GLM_TORCH_DEMO_R300_1X4",
"GLM_TORCH_DEMO_R300_1X8",
"GLM_TORCH_DEMO_R300_2X8"
]
```
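
The hand-written `ACCE_CONTAINER_OPT` string above follows a regular pattern, so it could also be generated. The sketch below is only an illustration: the helper name `xpu_device_opts` is hypothetical and not part of FlagPerf, and it assumes the devices are numbered `/dev/xpu0` through `/dev/xpu{N-1}` plus `/dev/xpuctrl`, as in the snippet above.

```python
def xpu_device_opts(num_devices: int = 8) -> str:
    """Build docker --device flags for XPU cards 0..num_devices-1 plus the control node.

    Hypothetical helper for illustration; not a FlagPerf API.
    """
    flags = [f"--device=/dev/xpu{i}" for i in range(num_devices)]
    flags.append("--device=/dev/xpuctrl")
    # Keep the leading space so the result matches the hand-written value.
    return " " + " ".join(flags)


ACCE_CONTAINER_OPT = xpu_device_opts(8)
```

With `num_devices=8` this yields the same string as the literal concatenation above, which makes it easier to adapt the setting to a machine with fewer cards.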

For the remaining steps, follow the ["Quick Start"](../../../README.md#快速启动) section of the README in the project root.


### Reference results

| Training resources | Config file | Run time (s) | Target accuracy | Converged accuracy | Steps | Throughput (samples/s) |
|---------| --------------- | ----------- | -------- | -------- | ------- | ---------------- |
| 1 node, 1 card | config_R300x1x1 | 121371.25| 0.8 | 0.8021 | 14400(fp32)| 0.50 |
| 1 node, 2 cards | config_R300x1x2 | 106709.60| 0.8 | 0.8085 | 12000(fp32)| 0.92 |
| 1 node, 4 cards | config_R300x1x4 | 44162.12 | 0.8 | 0.8027 | 4800(fp32) | 1.79 |
| 1 node, 8 cards | config_R300x1x8 | 22902.82 | 0.8 | 0.8003 | 2400(fp32) | 3.47 |
| 2 nodes, 8 cards each | config_R300x2x8 | 16217.80 | 0.8 | 0.8012 | 1500(fp32) | 6.08 |

### License

Apache 2.0 license.
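#### Results

* General metrics

| Metric | Value | Notes |
| -------------- | ------------------------------ | ------------------------------------------- |
| Task category | General language model | |
| Model | glm | |
| Dataset | ReCoRD | |
| Data precision | precision, see "Performance metrics" | One of fp32/amp/fp16 |
| Hyperparameter changes | fix_hp, see "Performance metrics" | Special hyperparameters needed to saturate the hardware when measuring throughput |
| Hardware short name | R300 | |
| Device memory usage | mem (actual/total), see "Performance metrics" | Commonly called "GPU memory"; unit is GiB |
| End-to-end time | e2e_time, see "Performance metrics" | Total time plus Perf initialization and similar overhead |
| Overall throughput | p_whole, see "Performance metrics" | Actual number of training samples divided by total time (performance_whole) |
| Training throughput | p_train, see "Performance metrics" | Excludes the evaluation time at the end of each epoch |
| **Compute throughput** | **p_core, see "Performance metrics"** | Excludes data I/O time (p3>p2>p1) |
| Training result | acc, see "Performance metrics" | Classification accuracy (mlm_accuracy) |
| Additional changes | None | |

* Performance metrics

| Configuration | precision | fix_hp | e2e_time | p_whole | p_train | p_core | acc | mem |
| ------------------- | --------- | ---------------- | -------- | ------- | ------- | ------ | ----- | --------- |
| R300, 1 node, 1 card (1x1) | fp32 | bs=5,lr=1e-05 | | | | | | |
| R300, 1 node, 8 cards (1x8) | fp32 | bs=5,lr=1e-05 | 30764 | 3.21 | 3.64 | 3.646 | 80.52%| 31.8/32.0 |
| R300, 2 nodes, 8 cards each (2x8) | fp32 | bs=5,lr=1e-05 | | | | | | |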
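
The relationship between the three throughput columns (`p_whole`, `p_train`, `p_core`) can be sketched as below. This is one reading of the definitions in the metrics table, not FlagPerf's actual timing code; the function name and the input numbers are made up for illustration and are not measurements from this PR.

```python
def throughputs(samples: int, total_s: float, eval_s: float, io_s: float):
    """Return (p_whole, p_train, p_core) in samples/s.

    Assumed definitions, following the metrics table:
      p_whole: samples / total time
      p_train: same, with end-of-epoch evaluation time removed
      p_core:  same, with data I/O time removed as well
    """
    train_s = total_s - eval_s
    core_s = train_s - io_s
    return samples / total_s, samples / train_s, samples / core_s


# Illustrative numbers only.
p_whole, p_train, p_core = throughputs(samples=1000, total_s=400.0, eval_s=50.0, io_s=30.0)
```

Because each step removes time from the denominator, the three values are ordered `p_core > p_train > p_whole` whenever evaluation and I/O take non-zero time, which is consistent with the 1x8 row (3.21, 3.64, 3.646).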
2 changes: 1 addition & 1 deletion training/kunlunxin/glm-pytorch/config/config_R300x1x1.py
@@ -2,7 +2,7 @@
fp16 = False
Review comment (yuzhou03, Contributor, Sep 7, 2023): Redundant code; fp16 already appears in common.

train_batch_size = 4
-eval_batch_size = 6
+eval_batch_size = 4

dist_backend = "xccl"

4 changes: 2 additions & 2 deletions training/kunlunxin/glm-pytorch/config/config_R300x1x8.py
@@ -1,8 +1,8 @@
vendor = 'kunlunxin'
fp16 = False

-train_batch_size = 4
-eval_batch_size = 6
+train_batch_size = 5
+eval_batch_size = 5

dist_backend = "xccl"

2 changes: 1 addition & 1 deletion training/kunlunxin/glm-pytorch/config/config_R300x2x8.py
@@ -2,7 +2,7 @@
fp16 = False

train_batch_size = 4
-eval_batch_size = 6
+eval_batch_size = 4

dist_backend = "xccl"

18 changes: 18 additions & 0 deletions training/kunlunxin/glm-pytorch/config/config_common
Review comment (Contributor): Is the file name missing the .py suffix?
@@ -0,0 +1,18 @@
vendor = 'kunlunxin'
fp16 = False

dist_backend = "xccl"

lr = 1e-5
weight_decay = 0.1
adam_beta1 = 0.9
adam_beta2 = 0.999
adam_eps = 1e-08
gradient_accumulation_steps = 1
warmup = 0.1
lr_decay_ratio = 0.1
lr_decay_iters = 4338
log_freq = 1
seed = 4096
max_samples_termination = 5553080
training_event = None
@@ -7,6 +7,11 @@ export BKCL_TIMEOUT=1800
# when using tree allreduce, the number of nodes must be a multiple of 2
export BKCL_SOCKET_FORCE_TREE=1

export XMLIR_D_XPU_L3_SIZE=32505856

export BKCL_CCIX_RING=1
export BKCL_FORCE_SYNC=1

export ALLREDUCE_ASYNC=false
export ALLREDUCE_FUSION=0
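
The value exported as `XMLIR_D_XPU_L3_SIZE` in the hunk above is easier to read once converted. The sketch below assumes the variable is expressed in bytes, which the diff does not state; under that assumption it works out to a whole number of MiB.

```python
L3_SIZE = 32505856  # value exported as XMLIR_D_XPU_L3_SIZE; unit assumed to be bytes
MIB = 1024 * 1024

l3_mib = L3_SIZE / MIB
print(f"{L3_SIZE} bytes = {l3_mib:g} MiB")  # prints "32505856 bytes = 31 MiB"
```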

3 changes: 3 additions & 0 deletions training/kunlunxin/glm-pytorch/config/requirements.txt
@@ -1,3 +1,6 @@
h5sparse
boto3
h5py
numpy>=1.15.4
sentencepiece>=0.1.8
jieba