This repository was archived by the owner on Aug 11, 2020. It is now read-only.

Improve batch gemm performance using MKL#342

Merged
piiswrong merged 6 commits into dmlc:master from xinyu-intel:master on Jun 23, 2018

Conversation

@xinyu-intel (Member)

This PR improves the performance of batch GEMM on small matrices by roughly 5-10x by using MKL. This optimization will be useful for the attention layer in Sockeye.

Performance comparison (1000 loops):

size                                 mshadow        MKL
[1120, 10, 256] * [1120, 256, 10]    1.4739921093   0.180208921432
[1120, 40, 512] * [1120, 512, 1]     3.45011711121  0.670109033585

@pengzhao-intel

FYI, @fhieber @tdomhan @mjpost

@pengzhao-intel

@piiswrong @sxjscience please help review :) Thanks in advance.

const float *A, int lda, const float *B, int ldb,
float beta, float *C, int ldc, int batch_count,
float **workspace) {
#if MSHADOW_USE_MKL
Member

Are cblas_sgemm_batch and cblas_dgemm_batch generally supported in MKL? Do we need to check the version?

Member Author

According to this page, Intel MKL 11.3 Beta (part of Intel® Parallel Studio XE 2016 Beta) includes a new flavor of GEMM feature called "Batch GEMM".

@piiswrong piiswrong merged commit 757a91c into dmlc:master Jun 23, 2018