Skip to content
Closed
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
79 changes: 79 additions & 0 deletions paddle/cuda/Readme.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
## Runtime Check SIMD for x86 architecture
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个design doc的标题可以修改一下,Runtime Check是其中一个工作;整个工作(或design)的目标应该是,只发布一个Paddle支持各种CPU环境。
我的理解这里至少包括三块工作,
a. Runtime Check
b. 代码修改和目录结构调整 (比如下面讲到的把naive/sse/avx的一些实现放到不同的目录里;另外,在issue #1116 里面也有相关的内容)
c. 最后是编译相关的工作;比如,如何编译一个即包含naive实现,也包含sse、avx实现的paddle(这里涉及编译以外,也涉及代码修改相关工作)。

虽然,后面的实现上也都写到了一些相关的技术细节,但建议design doc还是从整个工作包含哪些方面去讲,这样基于这个design doc可以创建一个个相关issue,比如cuda下面的那些代码怎么调整,把细节的讨论放到issue里面去(一些细节问题可以在issue里面讨论完之后再merge回这个design doc)。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


### Background

Currently, PaddlePaddle supports AVX and SSE3 intrinsics (extensions to the x86 instruction set architecture). When using CMake to compile PaddlePaddle source code, it will check and detect the host which SIMD instruction is supported, then automatically set the legal one. Developer or user also could manually set CMake option `WITH_AVX=ON/OFF` before PaddlePaddle compilation. That's good for local usage.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里我想强调一下hosttarget。目前我们常用的local usage方式,host就是target,所以没有什么区别。如果hosttarget不同,check and detect the host which SIMD instruction is supported其实没有意义。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

正是如此。



### Problem Involved

Nonetheless, from the perspective of the deployment, there are some drawbacks:

1. The online runtime environment is very complex, if an older node does not support AVX or others,
PaddlePaddle will crash and throw out `illegal instruction is used`. This problem will appear
frequently on cluster environment, like Kubernetes. **It must be addressed before PaddlePaddle on Cloud**

2. Once new version is ready to deliver, we have to release more products to users, for example, `no-avx-cpu`, `avx-cpu`, `no-avx-gpu`, `avx=gpu`. Users do not need to care about details. It sucks!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avx=gpu ==> avx-gpu

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍



### How to Address it?

1. We can utilize CPU ID information to check SIMD info at runtime. This functionality already merged into
current develop branch. For full details, please check out [CpuId.cpp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/utils/CpuId.cpp) and [CpuId.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/utils/CpuId.h).

You can use `HAS_SIMD(__flags)` to runtime check SIMD. For instance,

```c++
if (HAS_SIMD(SIMD_AVX2 | SIMD_FMA4)) {
avx2_fm4_stub();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xxx_xx_stub不太合适。

stub似乎指测试时候模拟某种行为的函数,好像是用在单元测试里的术语。

https://zh.wikipedia.org/wiki/%E6%A1%A9_(%E8%AE%A1%E7%AE%97%E6%9C%BA)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有道理。

} else if (HAS_SIMD(SIMD_SSE3)) {
sse3_stub();
}
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里接口上可以简单一些吧,比如可以定义一个SIMD的宏(或者其他方式),然后调用出简化为SIMD(stub())

Copy link
Contributor Author

@gangliao gangliao Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

其实CpuID.h里面也是有宏的, 可以直接用, 函数调用和宏都支持

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/**
 * @brief   Check SIMD flags at runtime.
 *
 * 1. Check all SIMD flags at runtime:
 *
 * @code{.cpp}
 * if (HAS_AVX && HAS_AVX2) {
 *      avx2_stub();
 * }
 * @endcod
 *
 * 2. Check one SIMD flag at runtime:
 *
 * @code{.cpp}
 * if (HAS_SSE41 || HAS_SSE42) {
 *      sse4_stub();
 * }
 * @endcode
 */


`avx2_fm4_stub` and `sse3_stub` could be located in different directory:

```text
------x84---naive
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naive的实现应该是和x86, arm并列的一种实现。因为ARM和X86实现某一个操作的时候,应该会调用naive的代码进行尾数处理。

譬如,目录结构可以是

---simd
        `---naive
        `---X86
                  `--- AVX
        `---ARM

另外,谨慎怀疑我们应该直接使用一些封装好的,成熟的SIMD库做这个事情。直接把Paddle内部的SIMD优化删掉,依赖其他开源项目也许更方便。

例如 Vc

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

补充回复下, Vc 不太能用。因为那个库里面没有做运行时的实现切换。

所以还是把目录结构改一下。

| |
| |---avx2 -- avx2_fm4_stub()
| |
| |---sse -- sse_stub()
| |
| |---sse3 -- sse3_stub()
|
arm--- ...
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naive指的是纯c++的实现吧?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

恩, 顺手写了,当然可以直接放在x86目录下,不需要naive目录


Here, each directory uses the different compile options (`-mavx` or `-msse`) to generate the corresponding binaries. Then, at
runtime, `if(HAS_SIMD(__flags)` can select the supported branch (intrinsics) to execute.

The method could fix the releases and deployment problems.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

并不是所有代码都需要写naive和avx两份,有的只需要写naive一份,编译的时候通过是否加-mavx生成两份,对应到不同的CPU环境下执行。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是的,我的理解也是这样。



### How to implement it?

Since the current `cuda` directory includes heterogeneous source code, we want to refactor `cuda` directory as follows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我理解这里目录结构是一个example?有几个问题:

  1. kernels是替换cuda目录?还是新增一个目录?
  2. 这里面好些文件名与当前paddle里面的文件对应不上,比如activation.cc是指的哪一个?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我的理解应该是替换,之后再issue里面讨论目录问题


```
kernels--- cpu --- inc -- x86 -- avx ----- avx_mathfun.h activation.h gru.h ...
| | |
| | |- naive --- activation.h gru.h ...
| |
| |- src -- x86 -- avx ----- activation.cc
| | |- gru.cc
| | |- ...
| |
| |- naive --- activation.cc
| | |- gru.cc
| |- ...
|- gpu -- ...
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里我有两点不确定:

  • gpu的实现以后是不是都会移到function里面去
  • 这个目录还需要保持includesrc两个目录吗

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. gpu这一块可以先放着不动,先增加一个cpu目录开展工作,并不会冲突。
  2. 觉得还是需要include目录来存放接口,清晰规范一些,方便调用。

Copy link
Contributor Author

@gangliao gangliao Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naive目录直接去掉,相应代码直接放x86目录很合理。


For simplicity, different arches or intrinsics will be inside the different directories. we need to
modified CMake files to support this solution.


### Reference

AVX Cheat Sheet, TUM, https://db.in.tum.de/~finis/x86%20intrinsics%20cheat%20sheet%20v1.0.pdf