Add fluid distribute transpiler parameter split strategy doc by velconia · Pull Request #11283 · PaddlePaddle/Paddle

velconia · 2018-06-07T10:31:01Z

This PR closes #11250

Replace with the new version of paddle build.sh in write_doc.md
Add Fluid parameter split strategy doc

typhoonzero · 2018-06-07T11:59:12Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

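+## Model Split Strategy Design
+### Why Parameters Are Split
+
+When designing a model, we usually do not limit the size of the parameters used by each layer, but when we design a network like the following: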


typhoonzero · 2018-06-07T11:59:34Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

+
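+When designing a model, we usually do not limit the size of the parameters used by each layer, but when we design a network like the following: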
+
+![fluid_3_layer_network](src/fluid_3_layers_network.png)


typhoonzero · 2018-06-07T12:04:19Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

@@ -0,0 +1,57 @@
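+# Fluid Distributed Training: Model Parameter Split Strategy in Detail
+This article explains the design of the model parameter placement scheme used for Parameter Server based distributed training with PaddlePaddle Fluid, and gives a "栗子:)" ("chestnut", a playful pun on 例子, "example") of how to use this split scheme;


Documentation should use plain language wherever possible; "栗子:)" is one example of wording to avoid.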

typhoonzero · 2018-06-08T01:36:39Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

+
+![fluid_3_layer_network](src/fluid_3_layers_network.png)
+
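+The fluid.input layer may be very wide, so the dimension (纬度) of parameters w1 and b1 may be very large, while the fluid.fc layer may be very narrow, so the dimension (纬度) of parameters w2 and b2 is very small. If the model is simply assigned to the parameter servers, the parameters each server receives may be uneven in size, and lightly loaded parameter servers will wait for heavily loaded ones. So, for the case of uneven parameter sizes, we provide a parameter splitting feature;


纬度 ("latitude") => 维度 ("dimension")

"we provide a parameter splitting feature;" => "By default, the Fluid Distribute Transpiler splits larger parameters and gradients across different pservers for updating."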

typhoonzero · 2018-06-08T01:46:05Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

+
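+The fluid.input layer may be very wide, so the dimension of parameters w1 and b1 may be very large, while the fluid.fc layer may be very narrow, so the dimension of parameters w2 and b2 is very small. If the model is simply assigned to the parameter servers, the parameters each server receives may be uneven in size, and lightly loaded parameter servers will wait for heavily loaded ones. So, for the case of uneven parameter sizes, we provide a parameter splitting feature;
+
+### How Parameters Are Split


Just title this "How Splitting Works"; parameters are not the only thing that needs to be split.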

typhoonzero · 2018-06-08T01:47:01Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

+
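+### How Parameters Are Split
+
+After splitting, parameters become parameter blocks, and when splitting parameters, if they are split too finely the parameter servers compute inefficiently, but if they are not split evenly enough we cannot achieve the effect described above, so we first set a minimum parameter block size: 8192, and compute the number of splits as follows:


This sentence is too long and its logic is convoluted.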

typhoonzero · 2018-06-08T01:48:15Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

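+After splitting, parameters become parameter blocks, and when splitting parameters, if they are split too finely the parameter servers compute inefficiently, but if they are not split evenly enough we cannot achieve the effect described above, so we first set a minimum parameter block size: 8192, and compute the number of splits as follows:
+
+```
+# parameter_size: size of the parameter


These parameters do not appear anywhere earlier in the doc, so describing them directly here will be confusing.

Done, I removed these code blocks.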

typhoonzero · 2018-06-08T01:48:54Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

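+# parameter_server_count: total number of parameter servers
+math.min(parameter_size / MIN_PARAMETER_BLOCK_SIZE, parameter_server_count)
+```
+After a parameter has been split into multiple parameter blocks, we also need to scatter (打散) the blocks and distribute them evenly across the parameter servers


With the word "打散" (scatter), it is hard for readers to understand what is actually being done.

Right, I assumed readers would understand the concept of scattering. I'll revise it.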
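
For readers following the thread, the split-count rule quoted in the hunk above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Distribute Transpiler's actual code: the helper names `split_count` and `split_into_blocks` are hypothetical, Python's built-in `min` stands in for the `math.min` written in the draft (Python's `math` module has no `min`), integer division is assumed, and a parameter smaller than the minimum block size is assumed to stay in a single block.

```python
MIN_PARAMETER_BLOCK_SIZE = 8192  # minimum block size quoted in the draft doc


def split_count(parameter_size, parameter_server_count):
    """Number of blocks for one parameter (hypothetical helper)."""
    # Never split finer than the minimum block size allows ...
    by_size = parameter_size // MIN_PARAMETER_BLOCK_SIZE
    # ... never use more blocks than there are parameter servers,
    # and keep at least one block (assumption for small parameters).
    return max(1, min(by_size, parameter_server_count))


def split_into_blocks(parameter_size, parameter_server_count):
    """Sizes of the near-equal blocks a parameter is cut into."""
    count = split_count(parameter_size, parameter_server_count)
    base, extra = divmod(parameter_size, count)
    # The first `extra` blocks take one additional element each.
    return [base + 1 if i < extra else base for i in range(count)]


if __name__ == "__main__":
    print(split_count(10 * 1000, 3))          # w1 from the thread's example
    print(split_into_blocks(1000 * 1000, 3))  # a larger, made-up parameter
```

Under these assumptions the 10 * 1000 parameter from the thread's example stays in one block, because 10000 // 8192 is 1; only parameters with at least 2 * 8192 elements are actually split.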

typhoonzero · 2018-06-08T01:49:15Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

+```
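+After a parameter has been split into multiple parameter blocks, we also need to scatter (打散) the blocks and distribute them evenly across the parameter servers
+
+### How Parameters Are Assigned


Same as above: this covers gradients too, not just parameters.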

typhoonzero · 2018-06-08T01:50:06Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

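+After a parameter has been split into multiple parameter blocks, we also need to scatter (打散) the blocks and distribute them evenly across the parameter servers
+
+### How Parameters Are Assigned
+We currently support two simple and effective [Partition](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/transpiler/ps_dispatcher.py) methods: [Round Robin](https://en.wikipedia.org/wiki/Round-robin_scheduling) and [Hash](https://en.wikipedia.org/wiki/Hash_function);


The page linked from "Partition" never mentions partition.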
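
The two placement methods named in the hunk above can be illustrated with a small Python sketch. The class names `RoundRobinDispatcher` and `HashDispatcher`, the `dispatch` method, the endpoint strings and the block names are all hypothetical and chosen for illustration; this is not the code in `ps_dispatcher.py`. `zlib.crc32` is used as the hash only because it returns the same value on every run, unlike Python's built-in `hash` on strings.

```python
import zlib


class RoundRobinDispatcher:
    """Hand out endpoints in rotation (illustrative sketch)."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.step = 0

    def dispatch(self, block_names):
        placed = []
        for _ in block_names:
            placed.append(self.endpoints[self.step])
            # Advance and wrap around so consecutive blocks spread out.
            self.step = (self.step + 1) % len(self.endpoints)
        return placed


class HashDispatcher:
    """Pick an endpoint from a stable hash of the block name."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)

    def dispatch(self, block_names):
        return [
            self.endpoints[zlib.crc32(name.encode("utf-8")) % len(self.endpoints)]
            for name in block_names
        ]


if __name__ == "__main__":
    pservers = ["ps0:6174", "ps1:6174", "ps2:6174"]  # made-up endpoints
    blocks = ["w1.block0", "w1.block1", "w1.block2", "b1.block0"]
    print(RoundRobinDispatcher(pservers).dispatch(blocks))
    print(HashDispatcher(pservers).dispatch(blocks))
```

In this sketch, round robin keeps the number of blocks per server within one of each other, while the hash method sends a given block name to the same server regardless of the order in which blocks are listed.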

typhoonzero · 2018-06-13T01:51:39Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

@@ -0,0 +1,67 @@
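+# Fluid Distributed Training: Model Parameter Split Strategy in Detail


=> "Parameter Split Design for Distributed Training"

That is indeed more concise, thanks. I'll change it.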

typhoonzero · 2018-06-13T01:55:04Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

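+This article explains the design of the model parameter split scheme used for Parameter Server based distributed training with PaddlePaddle Fluid, and gives a simple example of how to apply this split scheme;
+
+## Model Parameter Split Strategy Design
+### Why Split


The reason for splitting should not sit under the "xxxx Design" heading. The reason is background: it is why we do this, not how. Add a second-level heading "Background" that explains why splitting is needed.

OK, I'll change it.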

typhoonzero · 2018-06-13T01:55:41Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

@@ -0,0 +1,67 @@
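+# Fluid Distributed Training: Model Parameter Split Strategy in Detail
+This article explains the design of the model parameter split scheme used for Parameter Server based distributed training with PaddlePaddle Fluid, and gives a simple example of how to apply this split scheme;


This sentence does not really add anything; a short, clear title is enough.

I think this sentence tells readers what they can get from the article: for example, that they can copy the example code directly, or that they can learn how the underlying design decisions were made. That saves the reader time, and a title alone cannot do that.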

typhoonzero · 2018-06-13T02:20:31Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

+
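+When designing a model, we usually do not limit the size of the parameters used by each layer. Suppose we now have 3 parameter servers and want to train the following network:
+
+![fluid_3_layer_network](src/fluid_3_layers_network.png)


This figure has a number of problems:

There are no functions named fluid.input or fluid.output, and fluid.fc should be fluid.layers.fc.

w and b belong to the fc layer.

The earlier "suppose we have 3 servers" is not reflected in the figure, and nothing later follows up on it.

Thanks, I'll update the figure.

As for w and b, my understanding is that the input to fluid.layers.fc is obtained from w * fluid.layers.data + b, so I think they belong on the connecting edges? See this figure for reference: http://www.paddlepaddle.org/docs/develop/book/02.recognize_digits/image/mlp.png

The relationship between the number of servers and the number of splits is in fact discussed later on, so it really is necessary to state it clearly here.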

2. change the 3_layers_network image

shanyi15 · 2018-07-03T07:29:51Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

@@ -0,0 +1,79 @@
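+# Parameter Split Design for Fluid Distributed Training
+This article explains why model parameters are split in Parameter Server based distributed training with PaddlePaddle Fluid, describes the design details of the split scheme, and gives a simple example of how to apply this split scheme;


It would be better to end this with a full stop.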

shanyi15 · 2018-07-03T07:30:14Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

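+## Background
+Now, suppose we have 3 machines acting as Parameter Servers and 2 machines acting as Trainers, with the overall architecture as follows:
+
+![fluid_3_ps_design.png](src/fluid_3_ps_design.png)


This image does not render.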

shanyi15 · 2018-07-03T07:30:46Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

+
+And we want to train the following network:
+
+![fluid_3_layer_network](src/fluid_3_layers_network.png)


It would be best to center all the images. Here is how:

shanyi15 · 2018-07-03T07:31:27Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

+
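+### Why Parameters Are Split
+
+As we can see, in the network above the input layer fluid.layers.data is very wide, so the dimensions of parameter w1 and parameter b1 are very large, reaching 10 * 1000, while the hidden layer fluid.layers.fc is very narrow, so the dimensions of parameter w2 and parameter b2 are very small, only 1 * 10.


Punctuation in Chinese docs should preferably all be full-width Chinese punctuation, for example the full stop here.

OK, thanks shanyi for the review. I'll fix this in the next commit.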

Yancey0623 · 2018-07-03T07:50:11Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

+
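+### Why Parameters Are Split
+
+As we can see, in the network above the input layer fluid.layers.data is very wide, so the dimensions of parameter w1 and parameter b1 are very large, reaching 10 * 1000, while the hidden layer fluid.layers.fc is very narrow, so the dimensions of parameter w2 and parameter b2 are very small, only 1 * 10。


Please wrap fluid.layers.fc in backticks (``).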

Yancey0623 · 2018-07-03T07:57:30Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

+
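+As we can see, in the network above the input layer fluid.layers.data is very wide, so the dimensions of parameter w1 and parameter b1 are very large, reaching 10 * 1000, while the hidden layer fluid.layers.fc is very narrow, so the dimensions of parameter w2 and parameter b2 are very small, only 1 * 10。
+
+If we simply assign these parameters to the parameter servers at random, each parameter server will receive a very uneven amount of parameters and their loads will be completely different, which causes two problems:


If we simply assign these parameters to the parameter servers at random, each parameter server will receive a very uneven amount of parameters and their loads will be completely different

"simply" -> how should readers interpret "simply"? Avoid vague words where possible. This could be changed to: "If these parameters are just distributed across multiple parameter servers at random or one by one."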
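
To make the imbalance discussed here concrete, the following Python sketch places whole, unsplit parameters on 3 parameter servers one by one and sums the elements each server ends up holding. The shapes of w1 (10 * 1000) and w2 (1 * 10) come from the quoted paragraph; the sizes of b1 and b2 and the helper name `loads_one_by_one` are assumptions made only for illustration.

```python
def loads_one_by_one(param_sizes, server_count):
    """Total elements per server when whole parameters are placed in turn."""
    loads = [0] * server_count
    for index, size in enumerate(param_sizes.values()):
        loads[index % server_count] += size
    return loads


if __name__ == "__main__":
    params = {
        "w1": 10 * 1000,  # shape from the quoted paragraph
        "b1": 10,         # assumed size, not given in the thread
        "w2": 1 * 10,     # shape from the quoted paragraph
        "b2": 1,          # assumed size, not given in the thread
    }
    loads = loads_one_by_one(params, server_count=3)
    print(loads)                    # elements held by each server
    print(max(loads) / min(loads))  # how lopsided the placement is
```

With these sizes, the server that receives w1 holds almost all of the elements, which is the situation the draft's two listed problems describe.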

Yancey0623 · 2018-07-03T07:58:04Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

+
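+If we simply assign these parameters to the parameter servers at random, each parameter server will receive a very uneven amount of parameters and their loads will be completely different, which causes two problems:
+
+1. Lightly loaded (负载较轻) parameter servers wait for heavily loaded (负载较重) ones, making the high-load (负载高) parameter servers the system bottleneck;


Please make the wording consistent: light/heavy load (轻/重) versus high/low load (高/低).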

Yancey0623 · 2018-07-03T07:59:23Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

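+1. Lightly loaded parameter servers wait for heavily loaded ones, making the high-load parameter servers the system bottleneck;
+2. High-load parameter servers are easily limited by network bandwidth;
+
+To address this problem, the Distribute Transpiler splits the model's parameters and their corresponding gradients: according to their size and the number of parameter servers, the parameters and gradients are split into one or more finer-grained, evenly sized parameter blocks, and we then, following (我们再对依照) a fixed algorithm, distribute these parameter blocks evenly across the parameter servers. Below we describe the concrete splitting process in detail.


"我们再对依照" ("we then, following"; the original Chinese has a stray 对 here)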

Yancey0623 · 2018-07-03T08:00:34Z

doc/fluid/design/dist_train/fluid_parameter_split_strategy_cn.md

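+1. Lightly loaded parameter servers wait for heavily loaded ones, making the high-load parameter servers the system bottleneck;
+2. High-load parameter servers are easily limited by network bandwidth;
+
+To address this problem, the Distribute Transpiler splits the model's parameters and their corresponding gradients: according to their size and the number of parameter servers, the parameters and gradients are split into one or more finer-grained, evenly sized parameter blocks, and we then, following a fixed algorithm, distribute these parameter blocks evenly across the parameter servers. Below we describe the concrete splitting process in detail.


Below we describe the concrete splitting process in detail.

The heading that follows is "Detailed Design of Parameter Splitting"; please make the two consistent.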

luotao1 · 2019-02-01T05:54:47Z

Thanks for contributing to PaddlePaddle! Since documents have been moved to FluidDoc repo, we close this PR. Welcome to contribute to FluidDoc repo.

Add fluid distribute transpiler parameter slipt strategy doc

0664717

velconia changed the title ~~Add fluid distribute transpiler parameter slipt strategy doc~~ Add fluid distribute transpiler parameter split strategy doc Jun 7, 2018

typhoonzero reviewed Jun 8, 2018

Change the doc content following wuyi's cool advise

c6ff912

velconia closed this Jun 8, 2018

velconia reopened this Jun 8, 2018

typhoonzero reviewed Jun 13, 2018

1. Add Background paragraph level

d6d4d96

2. change the 3_layers_network image

typhoonzero requested a review from Yancey0623 June 15, 2018 07:36

Add a ps-trainer design graph and update the content

314f0d9

shanyi15 reviewed Jul 3, 2018

velconia added 2 commits July 3, 2018 15:44

Remove ';' in doc and align the image to center

d587221

Align the last image to center

232e1e2

Yancey0623 reviewed Jul 3, 2018

luotao1 closed this Feb 1, 2019


