Mxnet executor survey #4298
Conversation
helinwang left a comment
Nice!
> - The Executor builds the Graph, including inserting backward operators / copy operators; it also runs InferShape / InferType and allocates memory (note that when the input data size changes, a new Executor must be obtained by re-binding)
> - The Executor has a RunOps method, which pushes the operators into the Engine one by one
Are Executor and Engine the same thing?
No, they are not.
The interface Mxnet exposes to users is Symbol. The Executor's job is to parse a Symbol into a Graph and run several optimization passes over that Graph. The Executor also provides a RunOps interface that executes every Op in the Graph.
The Engine, by contrast, is a fairly general data-dependency engine: its main job is to analyze the data dependencies among Ops, which allows some parallel optimization. Mxnet's Engine is designed to be general and standalone, so it can be reused in other scenarios that need data-dependency analysis.
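For concreteness, a minimal sketch of the Symbol → Executor → Engine flow, assuming the standard mxnet Python API (layer sizes and shapes here are made up for illustration):

```python
import mxnet as mx

# Declare a Symbol: the user-facing, declarative description of the network.
data = mx.sym.Variable('data')
fc = mx.sym.FullyConnected(data=data, num_hidden=10, name='fc')
net = mx.sym.SoftmaxOutput(data=fc, name='softmax')

# Bind to an Executor for a fixed input shape; this is where the Graph is
# built, InferShape/InferType run, and memory is allocated.
exe = net.simple_bind(ctx=mx.cpu(), data=(32, 100))

# forward/backward push the Ops into the Engine, which runs them
# asynchronously; waitall() blocks until everything pushed has finished.
exe.forward(is_train=True, data=mx.random.uniform(0, 1, shape=(32, 100)))
exe.backward()
mx.nd.waitall()
```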
Got it!
The design where the Executor (or call it a converter) does the Graph optimization and the Engine runs the Graph feels very reasonable.
I think the Engine should simply run the optimized Graph directly. The Executor here looks more like an optimizer, so giving the Executor a RunOps interface does not seem right to me; it couples the Executor and the Engine together.
The Executor's RunOps interface only pushes the Ops of the Graph into the Engine in order; the Ops execute asynchronously inside the Engine, so the coupling is actually not that tight. The Engine has no Run interface; it only analyzes the data dependencies of the pushed operations and launches the Ops whose dependencies have been satisfied.
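To make that division of labor concrete, here is a toy dependency engine in the same spirit (purely illustrative; this is not mxnet's actual Engine API): operations are pushed with the variables they read and write, and an operation runs only once everything it reads has been produced.

```python
class ToyEngine:
    """Toy stand-in for a data-dependency engine (illustration only)."""

    def __init__(self):
        self.ready = set()    # variables that have already been written
        self.pending = []     # (fn, reads, writes) whose reads are not yet satisfied

    def push(self, fn, reads, writes):
        """Register an operation; it runs whenever its read set is satisfied."""
        self.pending.append((fn, set(reads), set(writes)))
        self._dispatch()

    def _dispatch(self):
        progress = True
        while progress:
            progress = False
            for op in list(self.pending):
                fn, reads, writes = op
                if reads <= self.ready:      # all inputs are available
                    fn()
                    self.ready |= writes
                    self.pending.remove(op)
                    progress = True


engine = ToyEngine()
state = {}
# Pushed first, but it only runs after both 'a' and 'b' have been written.
engine.push(lambda: state.update(c=state['a'] + state['b']),
            reads={'a', 'b'}, writes={'c'})
engine.push(lambda: state.update(a=1), reads=set(), writes={'a'})
engine.push(lambda: state.update(b=2), reads=set(), writes={'b'})
print(state['c'])   # 3
```

The order of pushes does not matter; execution order is driven purely by the declared read/write dependencies, which is the property the Engine exploits for parallelism.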
> Mxnet exposes imperative interfaces for operating on input data and parameters, which is very clear and easy to understand
> - Loading input data and initializing/loading/saving parameters are essentially set/load/save operations on variables; operating on them directly is simpler than going through an Operator
It is indeed simpler. But if the core of training is not expressed with OPs, scheduling distributed training becomes harder. For example, if parameter initialization is not an OP, does every node have to initialize the same parameter once? If it is an OP, that OP can be assigned to one particular node, so only one node initializes the parameter.
Mxnet manages parameters in a unified way: a kvstore stores them, and a distributed version is also provided. I don't quite follow the part about one particular node doing the parameter initialization. The kvstore is quite friendly for parameter operations; you can use imperative set/load/save commands on it directly.
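For reference, the imperative kvstore usage looks roughly like this (standard mxnet Python API; the key and shape are arbitrary):

```python
import mxnet as mx

kv = mx.kv.create('local')          # 'dist_sync' / 'dist_async' for the distributed version
shape = (2, 3)
kv.init(3, mx.nd.ones(shape))       # initialize key 3 ("set")
kv.push(3, mx.nd.ones(shape) * 8)   # send a new value / gradient
a = mx.nd.zeros(shape)
kv.pull(3, out=a)                   # read the stored value back
print(a.asnumpy())
```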
In multi-machine training, one machine has to be designated to do the initialization. If initialization is expressed as an OP, a scheduling system can assign the initialization OP to some node. If it is not expressed as an OP, the program needs some "hack" to pick a node for initialization, for example letting the node with trainerID == "0" do it.
That's true. Of course our scheduling system can also support this naive policy, defaulting to the node with trainerID=0 being responsible for initialization.
@QiJune Once there is a scheduling system there is no need to look for trainerID=0; any trainer can be picked. In a distributed system, without special handling you cannot tell which node is the trainerID=0 node (all nodes are treated equally).
> Mxnet exposes imperative interfaces for operating on input data and parameters, which is very clear and easy to understand
> - Loading input data and initializing/loading/saving parameters are essentially set/load/save operations on variables; operating on them directly is simpler than going through an Operator
> - Updating a parameter is essentially reading the variable, computing, and then assigning the result back to the same memory; adding this to the Graph as an Operator introduces cycles, which is bad for optimization
Even if the parameter update is an OP, there is not necessarily a cycle; it depends on how the graph is defined. Our graph does have cycles, but TF's does not: TF's parameter OP has only outputs and no inputs, so there is no cycle. What it outputs is a handle to the parameter (e.g. a memory pointer), not the parameter's contents.
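A sketch of that point using the TF 1.x graph API (illustrative only; the gradient is a stand-in constant): the Variable op has no inputs and yields a reference to the state, and the update op consumes that reference, so no cycle is introduced.

```python
import tensorflow as tf   # TF 1.x style graph API

w = tf.Variable(tf.zeros([10]))        # no inputs; outputs a reference to the state
grad = tf.ones([10])                   # stand-in for a computed gradient
update = tf.assign_sub(w, 0.1 * grad)  # consumes the reference; no edge back into w's producers

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(update)
    print(sess.run(w))                 # every entry becomes -0.1
```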
Right, but cycles in the Graph still cause problems, and people generally avoid introducing them. The parameter-update logic really is somewhat different from a network's forward/backward.
Whatever the design, one issue that deserves close attention here is how to overlap computation with communication. Parameter update is mostly communication cost, and the network's backward pass can be fully parallelized with the parameter update layer by layer, as sketched below.
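A minimal sketch of that overlap idea, assuming a hypothetical `layers` list holding (name, gradient, weight) per layer and an already-initialized kvstore `kv`: because push/pull are asynchronous, issuing them per layer right after that layer's backward step lets the communication run underneath the remaining computation.

```python
# Hypothetical per-layer loop: push/pull return immediately, and the priority
# hint encourages the engine to schedule earlier-issued transfers first, so
# communication overlaps with the backward pass of the remaining layers.
for index, (name, grad, weight) in enumerate(layers):
    kv.push(name, grad, priority=-index)        # send this layer's gradient
    kv.pull(name, out=weight, priority=-index)  # fetch the updated weight
```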
"Overlapping computation with communication" can simply be handled by the scheduler.
```python
self._params_dirty = True
if self._update_on_kvstore:
    _update_params_on_kvstore(self._exec_group.param_arrays,
```
A few things I am curious about:
- In the single-machine multi-GPU case, does the kvstore keep the parameters in GPU memory or CPU memory?
- Does the kvstore itself perform the parameter arithmetic, or does some compute device read the parameters out of the kvstore, compute the updated values, write them back, and then the other devices read them? If the latter, does one device compute all parameter updates, or is the work spread across devices?
- For single-machine multi-GPU, several modes are provided: the parameters can live in CPU memory or in GPU memory.
- The kvstore stores the model's global parameters, and each device also keeps its own local copy. In the CPU-memory case, the gradients produced on each device are first copied to the CPU, averaged there, and then broadcast back to every device. How the GPU-memory case works needs further reading of the source code to confirm.
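A small sketch of the aggregation behavior, following the kvstore tutorial (standard mxnet API; shapes and the number of devices are arbitrary): a list of per-device gradients pushed under one key is summed in the store, and `mx.kv.create('device')` keeps that aggregation in GPU memory instead of CPU memory.

```python
import mxnet as mx

kv = mx.kv.create('local')        # aggregate in CPU memory; 'device' aggregates on the GPUs
shape = (2, 3)
kv.init(3, mx.nd.ones(shape))

# One gradient per device; the list pushed under a single key is summed first.
grads = [mx.nd.ones(shape) for _ in range(4)]
kv.push(3, grads)

out = mx.nd.zeros(shape)
kv.pull(3, out=out)
print(out.asnumpy())              # 4.0 everywhere: the four gradients were summed
```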
> In the CPU-memory case, the gradients produced on each device are first copied to the CPU, averaged there, and then broadcast back to every device

If, in the multi-machine case, all gradients are copied to the same machine for the optimization step, that becomes a network-throughput bottleneck.
For single-machine multi-GPU this kind of gradient aggregation is still acceptable. I have not yet looked at how the multi-machine case is implemented.
> 4. The C++ executor builds a graph; NNVM has a dedicated place_device pass that traverses every node of the Graph and sets its device information, and inserts a copy operator whenever an edge crosses devices
Interesting. I am curious how data loading gets assigned to specific machines?
I don't quite understand that sentence.
Note that mxnet currently only supports cross-device placement within a single machine, not across machines.
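For reference, single-machine cross-device placement is expressed roughly like this (standard mxnet API as far as I understand it; the layer sizes and the group-to-context mapping are made up): operators are tagged with a ctx_group, bind maps each group to a context, and the place_device pass inserts copy operators on edges that cross devices.

```python
import mxnet as mx

with mx.AttrScope(ctx_group='dev1'):
    data = mx.sym.Variable('data')
    fc1 = mx.sym.FullyConnected(data=data, num_hidden=128, name='fc1')
with mx.AttrScope(ctx_group='dev2'):
    fc2 = mx.sym.FullyConnected(data=fc1, num_hidden=10, name='fc2')

# A copy operator is inserted automatically on the fc1 -> fc2 edge because
# the two groups are mapped to different contexts.
exe = fc2.simple_bind(ctx=mx.cpu(),
                      group2ctx={'dev1': mx.cpu(0), 'dev2': mx.cpu(1)},
                      data=(32, 100))
```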
I meant multi-machine model parallelism. In that case not every machine needs to read data, so I am curious how Mxnet, if it supports this, designates which machines read data and which do not.
> - Once the Symbol has been written, it is bound to an Executor
> - The Executor builds the Graph, including inserting backward operators / copy operators; it also runs InferShape / InferType and allocates memory (note that when the input data size changes, a new Executor must be obtained by re-binding)
> when the input data size changes, a new Executor must be obtained by re-binding

Our current design does not fit symbol very well; in mxnet a symbol is equivalent to an expression, and every node carries the information of the entire graph.
```
 * Symbol acts as an interface for building graphs from different components
 * like Variable, Functor and Group. Symbol is also exported to python front-end
 * (while Graph is not) to enable quick test and deployment. Conceptually,
 * symbol is the final operation of a graph and thus including all the information
 * required (the graph) to evaluate its output value.
```
Yes, so we need to figure out as soon as possible how our Graph is defined and how the high-level python API configures the network.
> 4. update
> Module provides an update method that is responsible for optimizing the parameters and updating the ones stored on the kvstore:
Module is only a compatibility replacement that wraps the symbolic interface into the older-style API.
There are two problems here: 1) the kvstore abstraction is not friendly to sparse parameters; 2) packaging the parameter-update logic as an updater makes it hard for users to customize the optimization algorithm.
Below are mxnet's two corresponding ways of updating parameters, on the parameter server and locally, respectively.
```python
def _update_params_on_kvstore(param_arrays, grad_arrays, kvstore, param_names):
    """Perform update of param_arrays from grad_arrays on kvstore."""
    for index, pair in enumerate(zip(param_arrays, grad_arrays)):
        arg_list, grad_list = pair
        if grad_list[0] is None:
            continue
        name = param_names[index]
        # push gradient, priority is negative index
        kvstore.push(name, grad_list, priority=-index)
        # pull back the weights
        kvstore.pull(name, arg_list, priority=-index)


def _update_params(param_arrays, grad_arrays, updater, num_device,
                   kvstore=None, param_names=None):
    """Perform update of param_arrays from grad_arrays not on kvstore."""
    for i, pair in enumerate(zip(param_arrays, grad_arrays)):
        arg_list, grad_list = pair
        if grad_list[0] is None:
            continue
        index = i
        if kvstore:
            name = param_names[index]
            # push gradient, priority is negative index
            kvstore.push(name, grad_list, priority=-index)
            # pull back the sum gradients, to the same locations.
            kvstore.pull(name, grad_list, priority=-index)
        for k, p in enumerate(zip(arg_list, grad_list)):
            # faked an index here, to make optimizer create diff
            # state for the same index but on diff devs, TODO(mli)
            # use a better solution later
            w, g = p
            updater(index*num_device+k, g, w)
```
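For the on-kvstore path, the usage is roughly the following (standard mxnet API; the numbers are only for illustration): once an optimizer is registered on the store, `push` applies the update inside the store and `pull` returns the updated weights.

```python
import mxnet as mx

kv = mx.kv.create('local')
kv.set_optimizer(mx.optimizer.SGD(learning_rate=0.1))

shape = (2, 3)
kv.init(0, mx.nd.ones(shape))   # weight w = 1
kv.push(0, mx.nd.ones(shape))   # gradient g = 1; the store applies w := w - 0.1 * g
w = mx.nd.zeros(shape)
kv.pull(0, out=w)
print(w.asnumpy())              # roughly 0.9 everywhere
```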
Worth thinking through more deeply: if the parameter update is described as an Op, wouldn't sparse updates and user-defined update algorithms also become inconvenient?
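For comparison, the updater-style customization being discussed looks roughly like this (a sketch following the kvstore tutorial; `_set_updater` is a semi-private helper, so treat this as illustrative):

```python
import mxnet as mx

def sgd_update(key, grad, weight):
    # grad is the value just pushed; weight is the value stored under this key
    weight[:] -= 0.1 * grad

kv = mx.kv.create('local')
kv._set_updater(sgd_update)
kv.init(0, mx.nd.ones((2, 3)))
kv.push(0, mx.nd.ones((2, 3)))   # triggers sgd_update on the stored weight
out = mx.nd.zeros((2, 3))
kv.pull(0, out=out)              # roughly 0.9 everywhere
```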
@QiJune If it is described with OPs, users can customize it freely, and the converter automatically assigns the relevant OPs to the responsible pserver workers.
This is a better place for review.
We have had a lot of discussion in #4031. The designs of TensorFlow and Mxnet are both instructive. I have made an Mxnet executor survey for reference and further discussion.