
add parallel build script to ci …#16901

Merged
wopeizl merged 9 commits into PaddlePaddle:develop from wopeizl:add_parallel_test_to_ci
Apr 22, 2019

Conversation


@wopeizl wopeizl commented Apr 16, 2019

… test=develop

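  1. Isolate the GPUs of a multi-card machine into several single-card/dual-card environments, so that multiple unit tests can run in parallel.
  2. Classify the unit tests into three types: runnable on a single card, runnable on two cards, and exclusive (must run alone).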
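
As a rough illustration of the first point, the sketch below splits a machine's cards into fixed-size groups and prints the `CUDA_VISIBLE_DEVICES` value each group would receive. The helper name `build_cuda_list` and the 8-card/2-card figures are assumptions made for this example; only the `group * cards_per_group + j` indexing mirrors the `cuda_list` fragment quoted later in the review.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print the comma-separated card IDs
# owned by group $1 when every group holds $2 consecutive cards.
build_cuda_list() {
    local group=$1 cards_per_group=$2 list="" j id
    for (( j = 0; j < cards_per_group; j++ )); do
        id=$(( group * cards_per_group + j ))
        if [[ -z "$list" ]]; then
            list="$id"
        else
            list="$list,$id"
        fi
    done
    echo "$list"
}

# Assumed example: an 8-card machine split into four 2-card environments.
total_cards=8
cards_per_group=2
for (( g = 0; g < total_cards / cards_per_group; g++ )); do
    echo "group $g -> CUDA_VISIBLE_DEVICES=$(build_cuda_list "$g" "$cards_per_group")"
done
```

Each group could then launch its own test process with that variable exported, so processes in different groups never see each other's cards.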

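CI unit-test time on April 1 (617 unit tests): 56 minutes on the 4-card machine, 53 minutes on the 8-card machine
CI unit-test time on April 19 (617 unit tests): 33 minutes on the 4-card machine, 27 minutes on the 8-card machine

Partial list of cases that run exclusively: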
1/24 Test #146: test_inference_label_semantic_roles .......... Passed 11.74 sec
2/24 Test #142: test_api_impl ................................ Passed 15.35 sec
3/24 Test #144: test_inference_image_classification_vgg ...... Passed 17.29 sec
4/24 Test #145: test_inference_image_classification_resnet ... Passed 17.39 sec
5/24 Test #147: test_inference_recognize_digits_mlp .......... Passed 8.66 sec
6/24 Test #151: test_inference_nlp ........................... Passed 2.11 sec
7/24 Test #149: test_inference_recommender_system ............ Passed 8.33 sec
8/24 Test #174: test_train_recognize_digits_mlp .............. Passed 2.88 sec
9/24 Test #150: test_inference_word2vec ...................... Passed 8.58 sec
10/24 Test #148: test_inference_recognize_digits_conv ......... Passed 14.64 sec
11/24 Test #307: test_hsigmoid_remote_table_op ................ Passed 3.21 sec
12/24 Test #252: test_parallel_executor_test_while_train ...... Passed 7.83 sec
13/24 Test #365: test_nce_remote_table_op ..................... Passed 3.63 sec
14/24 Test #175: test_train_recognize_digits_conv ............. Passed 16.14 sec
15/24 Test #384: test_listen_and_serv_op ...................... Passed 4.85 sec
16/24 Test #438: test_alloc_continuous_space_op ............... Passed 5.71 sec
17/24 Test #337: test_parallel_executor_mnist ................. Passed 21.12 sec
18/24 Test #445: test_conv_shift_op ........................... Passed 5.49 sec
19/24 Test #541: test_adam_op_multi_thread .................... Passed 5.99 sec
20/24 Test #205: test_weight_decay ............................ Passed 62.28 sec
21/24 Test #427: test_recordio_reader ......................... Passed 56.72 sec
22/24 Test #452: test_parallel_executor_seresnext ............. Passed 428.13 sec
23/24 Test #544: test_nearest_interp_op ....................... Passed 58.07 sec
24/24 Test #552: test_parallel_executor_crf ................... Passed 81.98 sec

Some of the dual-card cases:
Test #255: test_dist_base ................... Passed 2.23 sec
Test #192: test_distribute_fpn_proposals_op ... Passed 6.51 sec
Test #267: test_dist_save_load .............. Passed 24.87 sec
Test #410: test_dist_allreduce_op ............. Passed 25.29 sec
Test #232: test_dist_mnist_pg ............... Passed 25.48 sec
Test #278: test_dist_mnist_batch_merge ...... Passed 19.90 sec
Test #614: test_distillation_strategy ....... Passed 62.71 sec
Test #275: test_dist_simnet_bow ............... Passed 47.75 sec
Test #319: test_dist_ctr .................... Passed 26.57 sec
Test #492: test_dist_word2vec ............... Passed 66.63 sec
Test #531: test_dist_text_classification .... Passed 31.28 sec
Test #358: test_dist_mnist .................. Passed 125.71 sec
Test #548: test_dist_train .................. Passed 2.59 sec
Test #551: test_dist_se_resnext_nccl ........ Passed 60.43 sec
Test #550: test_dist_se_resnext ............... Passed 233.77 sec

@wopeizl wopeizl changed the title add parallel build script to ci and remove the fluid process from py2… add parallel build script to ci … Apr 18, 2019
Contributor

@junjun315 junjun315 left a comment

LGTM

@chengduoZH chengduoZH requested a review from luotao1 April 22, 2019 07:49
Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

LGTM

wait; # wait for all subshells to finish
}

function aggresive_test() {
Contributor

There is no such word as "aggresive".

Contributor Author

Thanks, triggered a new PR: #17020

test_cases=$(ctest -N -V)
exclusive_tests=''
single_card_tests=''
multiple_card_tests=''
Contributor

Please explain what each variable means:

exclusive_tests='': I don't understand what kind of test this is. Tests that run exclusively?
single_card_tests='': single-card tests
multiple_card_tests='': multi-card tests

Contributor Author

Thanks, triggered a new PR: #17020
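
For readers following this thread, here is a minimal sketch of how three such lists might be filled. It assumes each list is a `|`-separated alternation of anchored test names, suitable for passing to `ctest -R`. The helper `append_case` and the name `test_example_op` are hypothetical. The other test names are taken from the case lists in the PR description.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): append an anchored test name
# to a '|'-separated list and print the new list.
append_case() {
    local list=$1 name=$2
    if [[ -z "$list" ]]; then
        echo "^${name}\$"
    else
        echo "${list}|^${name}\$"
    fi
}

exclusive_tests=''      # cases that must run alone
single_card_tests=''    # cases that need one card
multiple_card_tests=''  # cases that need more than one card

exclusive_tests=$(append_case "$exclusive_tests" "test_weight_decay")
exclusive_tests=$(append_case "$exclusive_tests" "test_recordio_reader")
multiple_card_tests=$(append_case "$multiple_card_tests" "test_dist_mnist")
single_card_tests=$(append_case "$single_card_tests" "test_example_op")  # made-up name

echo "exclusive:     $exclusive_tests"
echo "single card:   $single_card_tests"
echo "multiple card: $multiple_card_tests"
```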

if [[ "$matchstr" == "" ]]; then
# Any test case with LABELS property would be parse here
# RUN_TYPE=EXCLUSIVE mean the case would run exclusively
# RUN_TYPE=DIST mean the case would take two graph cards during runtime
Contributor

Does the explanation on line 706 mean that every distributed unit test must use exactly 2 cards, and that even 4 cards would not work?

Contributor Author

In general, two cards should be enough.

# RUN_TYPE=EXCLUSIVE mean the case would run exclusively
# RUN_TYPE=DIST mean the case would take two graph cards during runtime
read is_exclusive <<< $(echo "$line"|grep -oEi "RUN_TYPE=EXCLUSIVE")
read is_multicard <<< $(echo "$line"|grep -oEi "RUN_TYPE=DIST")
Contributor

RUN_TYPE=EXCLUSIVE duplicates SERIAL.

Contributor Author

The meaning is somewhat different now. With the current handling, SERIAL should no longer have any effect.
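
The label check discussed in this thread can be sketched as a small classifier. This is an illustration only: the function name `classify_run_type` and the sample input lines are assumptions, and the real layout of the `ctest -N -V` output is not reproduced here. Only the two `RUN_TYPE=...` patterns and the case-insensitive `grep` match come from the quoted diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): map one line of label text
# to the category of test it describes.
classify_run_type() {
    local line=$1
    if echo "$line" | grep -qEi "RUN_TYPE=EXCLUSIVE"; then
        echo "exclusive"        # must run alone
    elif echo "$line" | grep -qEi "RUN_TYPE=DIST"; then
        echo "multiple_card"    # takes two cards at runtime
    else
        echo "single_card"      # no RUN_TYPE label: one card is enough
    fi
}

# Made-up sample lines, only to exercise the function.
classify_run_type "Labels: RUN_TYPE=EXCLUSIVE"
classify_run_type "Labels: RUN_TYPE=DIST"
classify_run_type "Labels:"
```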

if [ ${WITH_TESTING:-ON} == "ON" ] ; then
cat <<EOF
========================================
Running unit tests ...
Contributor

"Running unit tests ..." does not convey that the tests run in parallel.

Contributor Author

Thanks, triggered a new PR: #17020

wait
}

function parallel_test() {
Contributor

I don't see parallel_test being used anywhere. Can it be deleted?

Contributor Author

Thanks, triggered a new PR: #17020

if [[ $cardnumber == $CUDA_DEVICE_COUNT ]]; then
ctest -I $i,,$NUM_PROC -R "($testcases)" --output-on-failure &
else
# echo "env CUDA_VISIBLE_DEVICES=$cuda_list ctest -I $i,,$NUM_PROC -R \"($testcases)\" --output-on-failure &"
Contributor

Line 671 can be removed.

Contributor Author

Thanks, triggered a new PR: #17020
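
The `ctest -I $i,,$NUM_PROC` call in the quoted fragment uses the `Start,End,Stride` form of `-I`. With the end left empty, each worker starts at its own index and steps by the number of workers. The sketch below only prints which test indices each worker would pick; it does not invoke `ctest`. The helper name `stride_indices`, the 1-based worker numbering, and the 4-worker/10-test figures are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): list the test indices chosen when
# starting at $1 and stepping by $2, up to a total of $3 tests.
stride_indices() {
    local start=$1 stride=$2 total=$3 out="" k
    for (( k = start; k <= total; k += stride )); do
        out="${out:+$out }$k"
    done
    echo "$out"
}

num_proc=4    # assumed number of parallel workers
total=10      # assumed number of matching tests
for (( i = 1; i <= num_proc; i++ )); do
    echo "worker $i runs tests: $(stride_indices "$i" "$num_proc" "$total")"
done
```

Under these assumptions the workers cover every index exactly once, which is what lets the script start them in the background with `&` and collect them with a single `wait`.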

cuda_list="$cuda_list,$[i*cardnumber+j]"
fi
done
# echo $cuda_list
Contributor

Line 660 can be removed.

Contributor Author

Thanks, triggered a new PR: #17020
