- [ ] AllReduce SelectedRows
  - [ ] without CSC
  - [ ] with CSC
- [ ] Optimizing Network Performance for Distributed DNN Training on GPU Clusters
  - [x] Get the system architecture and performance.
  - [x] Analyze the operator time and communication time.
  - [ ] Mixed precision.
    - [ ] On BERT.
    - [ ] On ResNet-50 with the ImageNet dataset.
  - [x] Dynamic (static) LA (lazy all-reduce) overlap.
  - [x] Fuse all-reduce tensors and analyze the performance.
  - [x] Implement the hierarchical all-reduce.
- [ ] CSC communication
  - [ ] ResNet
  - [ ] BERT
- [ ] Pserver sync from step to var
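The hierarchical all-reduce item above presumably refers to the standard two-level pattern: reduce within each node onto a leader, all-reduce across the leaders, then broadcast back inside each node. A minimal sketch of that pattern, with plain Python lists standing in for real GPU collectives (all names here are hypothetical, not from this repo):

```python
from typing import List

def hierarchical_allreduce(grads: List[List[float]], gpus_per_node: int) -> List[List[float]]:
    """grads[i] is the gradient vector on GPU i; returns the summed vector on every GPU."""
    assert len(grads) % gpus_per_node == 0
    nodes = [grads[i:i + gpus_per_node] for i in range(0, len(grads), gpus_per_node)]
    # Stage 1: intra-node reduce onto each node's leader (GPU 0 of the node).
    leaders = [[sum(vals) for vals in zip(*node)] for node in nodes]
    # Stage 2: inter-node all-reduce among the leaders only.
    global_sum = [sum(vals) for vals in zip(*leaders)]
    # Stage 3: intra-node broadcast of the global result back to every GPU.
    return [list(global_sum) for _ in grads]

# Two nodes with two GPUs each; after the call every GPU holds the global sum.
out = hierarchical_allreduce([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
                             gpus_per_node=2)
```

The point of the hierarchy is that only one rank per node participates in the expensive inter-node step, which reduces cross-node traffic relative to a flat all-reduce over all GPUs.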