# add block execution doc #4125
## Overview

PaddlePaddle represents a neural network as a `Block`. The words block and graph are interchangeable in the design of PaddlePaddle. A block is like a pair of curly braces in programming languages such as C++ and Java: it contains some variables and a sequence of instructions, or operators. Usually we run a block using a CPU thread, but we want to accelerate the execution of some operators using devices like GPUs and FPGAs.

There are some challenges to achieving high performance and scalability when executing a `Block`:

1. Execution Plan. It is often the case that one kind of device cannot accelerate all operators in a block. This requires an execution plan that uses multiple devices. For example, some computational operators run on the GPU while cross-node data communication operators like SendOp and RecvOp run on the CPU.

2. Parallel Execution. A computer often has more than one acceleration device. For example, most servers and portable computers like the PX2 have more than one CUDA GPU. We want to make full use of them. Intuitively there are two approaches: (1) running the operators in a block on multiple devices, and (2) duplicating the block and running the copies on multiple devices. The former is known as model parallelism and the latter as data parallelism.

> **Contributor:** Maybe we can add an example of a CPU with multiple cores. This happens almost everywhere.
>
> **Author:** Exactly, we can implement both data parallelism and model parallelism on a CPU with multiple cores first.

3. Variable Places. Devices often have their own onboard memories, and it is usually much more efficient for them to operate on variables in those onboard memories. The execution plan should include the placement of variables in a block, as illustrated in the sketch after this list.
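
To make the idea of an execution plan concrete, here is a minimal sketch, assuming a plan can be represented as a mapping from operators and variables to devices; the names below are illustrative, not actual PaddlePaddle data structures:

```
# Hypothetical sketch: an execution plan maps every operator and
# variable in a block to the device that hosts it.
execution_plan = {
    # computational operators are accelerated on the GPU
    'mul': 'gpu:0',
    'relu': 'gpu:0',
    # cross-node communication operators stay on the CPU
    'send': 'cpu',
    'recv': 'cpu',
}

variable_places = {
    'x': 'gpu:0',   # activations live in GPU onboard memory
    'w': 'gpu:0',   # parameters are placed next to the ops that use them
    'buf': 'cpu',   # buffers crossing nodes stay in host memory
}
```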

Here are the features we want to support:

> **Contributor:** Are the features below sorted by priority? It looks like so because of the "1. 2. 3. ..." numbering, but the second point in "3. Model Parallelism" seems very important.
>
> **Author:** They are sorted by priority; data parallelism has a higher priority than model parallelism.

1. Baseline
   - a single thread runs a block on a single device, e.g., a CPU or a GPU.

2. Data Parallelism
   - multi-threads in a stand-alone machine: only one copy of the parameters needs to be stored in CPU memory.
   - multi-GPUs in a stand-alone machine: each GPU card has a full copy of the parameters in its own GPU memory.
   - multi-threads/multi-GPUs in a cluster: each node in the cluster supports multi-threads/multi-GPUs.

3. Model Parallelism
   - Operators in a block can be located on different devices (CPU/GPU/FPGA) in a stand-alone machine. Users have to set a device id for every operator manually.

> **Contributor:** I think we need to change this line to "Users can optionally set device id for operators", because forcing the user to set every OP manually is too burdensome.
>
> **Author:** Yes, we should allow users to set a specific device id for operators.

   - We should also note another situation: operators in a block may be executed in different CPU threads or CUDA streams to achieve high performance while actually residing on the same device. This is scheduled by Paddle automatically, and users do not need to set a specific CPU thread id or CUDA stream id.

4. Long-term
   - Paddle schedules CPU/GPU/FPGA in a stand-alone machine automatically and makes full use of them to get the best performance when executing a neural network.
   - Paddle schedules CPU/GPU/FPGA in a cluster automatically, overlapping communication and computation well to get the best performance.

## Analysis

Since a block is actually a graph, the data members of a graph should be Nodes and Edges, and every Node must have a device id describing its placement. Please refer to the design of [TensorFlow](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/node_def.proto#L48). In our current block design, the data members are VarDesc and OpDesc, so VarDesc and OpDesc must have a field that describes concrete device information.
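
As a hedged illustration of what such a field could look like (the class and field names below are assumptions for this sketch, not Paddle's actual protobuf definitions), the descriptions could carry a device string much like TensorFlow's `NodeDef.device`:

```
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch: VarDesc and OpDesc extended with a `device`
# field, analogous to the device string in TensorFlow's NodeDef.
@dataclass
class VarDesc:
    name: str
    data_type: str = 'float32'
    dims: List[int] = field(default_factory=list)
    device: str = ''   # e.g. 'cpu', 'gpu:0', 'fpga:1'; empty means unplaced

@dataclass
class OpDesc:
    type: str          # e.g. 'mul', 'relu', 'send', 'recv'
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    device: str = ''   # the device this operator is assigned to
```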

Let's consider this question: what configuration should Paddle get from users to execute a block in various hardware environments?

- Data Parallelism
  - Stand-alone machine: users set several device ids, and each device executes the same block.
  - Cluster: users set node ids and device ids, and each device executes the same block.

- Model Parallelism
  - Users set a concrete device id for each operator in a topology. Paddle first executes the operators strictly following the users' configuration, and then performs some additional parallel optimization automatically. A sketch of both configurations follows this list.
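
To make these two configurations concrete, here is a minimal sketch using the same hypothetical `pd` API as the pseudocode in the Solution section below; the function and block names are assumptions:

```
import paddle as pd

# Data parallelism: the same block is duplicated onto every listed
# device, and each copy processes a different shard of the input.
pd.run(train_block, devices=['gpu:0', 'gpu:1', 'gpu:2', 'gpu:3'])

# Model parallelism: each operator carries its own device id, so a
# single block spans several devices.
h = pd.v2.mul_op(x, w, device='gpu:0')
out = pd.v2.relu_op(h, device='gpu:1')
```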

## Solution

Actually, we can have both data parallelism and model parallelism. In order to support this mixed parallelism, we propose two strategies:

- A simple set of device id setting rules
  - First, users set one or several default device ids for a topology. These devices are of the same kind and usually perform most of the computation of the topology. We mainly consider data parallelism in this step.
  - Second, users may set a device id for each operator in the topology. If users set nothing, the operator just uses the default device id. When devices have to be switched, users add a copy operator manually.
  - We then traverse the topology and set a concrete device id for each operator.

Here is pseudocode illustrating the device setting rules:

```
import paddle as pd

def fpga_infer(fpga_id):
    # `fpga_id` is the default device for this copy of the block.
    with pd.device(fpga_id):
        # Inputs and parameters start in CPU memory.
        x = pd.v2.variable(data_type='float32', dims=[32, 784], device='cpu')
        w = pd.v2.variable(data_type='float32', dims=[784, 100], device='cpu')
        # Copy them onto the current default device (the FPGA).
        x_fpga = pd.v2.copy_op(x)
        w_fpga = pd.v2.copy_op(w)
        h = pd.v2.mul_op(x_fpga, w_fpga)
        # Switching to another device requires an explicit copy operator.
        h_gpu = pd.v2.copy_op(h, device='gpu:0')
        out = pd.v2.relu_op(h_gpu, device='gpu:0')
        out_cpu = pd.v2.copy_op(out, device='cpu')
        return out_cpu

# Data parallelism: duplicate the block onto two FPGAs.
pd.run(fpga_infer, devices=['fpga:0', 'fpga:1'])
```
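
Note that in this sketch every device switch is made explicit by a copy operator, so the resulting block records exactly where each variable lives, and the two device ids passed to `pd.run` express the data-parallel duplication of the block.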

- A data dependency analysis engine
  - We need to analyze the data dependencies of each operator in the topology. When the dependent variables of an operator are ready, the operator can be executed on its corresponding device.

The concept `Block` contains the full information for executing a neural network topology, and we will introduce another concept, `Executor`, which has a runtime data dependency analysis engine to execute the operators in a `Block` on given hardware with high performance.
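
A minimal sketch of such a dependency-driven engine is shown below, assuming each operator is a plain dict and the dependency graph is acyclic; these data structures are illustrative only, since the real `Executor` would operate on the Block's OpDesc and VarDesc:

```
from collections import deque

def run_block(ops, ready_vars):
    # Execute each op only once all of its input variables are ready,
    # dispatching it to its assigned device.
    ready_vars = set(ready_vars)
    pending = deque(ops)
    while pending:
        op = pending.popleft()
        if set(op['inputs']) <= ready_vars:
            print('run %s on %s' % (op['name'], op['device']))
            ready_vars.update(op['outputs'])
        else:
            pending.append(op)  # dependencies not ready yet; retry later

ops = [
    {'name': 'relu', 'inputs': ['h'], 'outputs': ['out'], 'device': 'gpu:0'},
    {'name': 'mul', 'inputs': ['x', 'w'], 'outputs': ['h'], 'device': 'gpu:0'},
]
run_block(ops, ready_vars=['x', 'w'])  # runs mul first, then relu
```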

## Conclusion

We will split our implementation of `Block` into the following steps:

1. Baseline feature
2. Divide `Block` into two concepts, `Block` and `Executor`
3. Implement simple device id setting rules
4. Data parallelism feature (running sequentially on each device)
5. Implement a data dependency analysis engine
6. Model parallelism feature

> @QiJune @helinwang We discussed face to face and reached some agreements.
>
> 1. To unify our discussion results, we define some common concepts to represent these modules. When an operator run call happens, the operator needs to be assigned to a specific device; we define the Placement Policy as this operator assignment policy. Currently, we only need a simple priority rule to implement the simplest version. In a multi-node or multi-device environment, we have more than one block and need a module to manage the op running process; we define the Scheduler as the op run manager. The user-defined block is different from the one that is really executed, and there is a gap between the original block and the optimized block; we define the Converter as the optimizer module.
>
> 2. `Block` should not contain a `Run` interface. Currently, the Block design contains a `Run` interface inherited from `OpBase`, and this interface overlaps with the Scheduler module. When `Block` replaces `NetOp`, there are two concerns. First, the backward module uses `NetOp` to represent a nested network and a group of operators; a block contains much more than the group of operators needed in the backward process. Second, some ops like `FC` use `NetOp` to contain more than one operator and also need to be replaced.
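
As a hedged sketch of the "simple priority rule" mentioned above (every name here is an assumption for illustration):

```
# Hypothetical Placement Policy: assign each operator the first device
# in a fixed priority list that has a kernel for it.
DEVICE_PRIORITY = ['fpga:0', 'gpu:0', 'cpu']

def place(op_type, supported):
    # `supported` maps an op type to the set of devices that can run it.
    for dev in DEVICE_PRIORITY:
        if dev in supported[op_type]:
            return dev
    raise ValueError('no device can run %s' % op_type)

supported = {'mul': {'gpu:0', 'cpu'}, 'send': {'cpu'}}
print(place('mul', supported))   # -> 'gpu:0'
print(place('send', supported))  # -> 'cpu'
```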

> Sorry, I didn't catch up with your conclusions.

> Is `Executor` a better name? `BlockOptimizer` is more specific.

> (picture) This picture shows how to define variable places and add copy ops automatically.

> @typhoonzero So, where will the operator run, on gpu:0 or gpu:1? And where do we insert the copy operator, before or after the calculation? I think we have to specify a concrete device for an operator. Please refer to the design of TensorFlow: we should have a GraphView of a Block, whose data members are Nodes and Edges instead of vectors of VarDesc and OpDesc. The device info will be located in the Node of the GraphView of the Block.

> @typhoonzero @QiJune We still need scheduling: schedule the execution of an OP only when it is ready to run (i.e., when all of its dependencies are done). One example would be when a `recv` OP on node 1 is waiting for the `send` OP on node 0: the `recv` OP is not ready until the `send` OP is done. If the scheduler schedules the `recv` OP before it is ready, it will be waiting for the `send` OP and block the running thread in the thread pool unnecessarily.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@QiJune I see your point, I agree with defining device place for each "Node"(or operator). The reason is people define the graph by ops not by every variable, like the output of two ops is not defined by user.
When we know the variable's device place, we can easily figure out where the operator runs, this is not the reason I think.

> @helinwang Is this scheduler only for scheduling on the GPU? Because when using the CPU, the Linux kernel will put the thread to sleep in this situation and not consume CPU resources.

> @typhoonzero Sorry, my example is misleading. I mean we need a scheduler that puts an OP into the "run queue" only when the dependencies of the OP are done, because by definition the OP should only run when all of its dependencies are done.