I have noticed that `innerProduct` (ejml-ddense/src/org/ejml/dense/row/mult/MatrixVectorMult_DDRM.java, lines 338 to 344 at 2c9d1dc):

```java
for (int k = 0; k < B.numCols; k++) {
    double sum = 0;
    for (int i = 0; i < B.numRows; i++) {
        sum += a[offsetA + i]*B.data[k + i*cols];
    }
    output += sum*c[offsetC + k];
}
```
performs a lot worse (at least 2x, but it varies with size; sizes in the 1000s) than my original naive implementation.
I think I figured out the reason.
The matrix data access pattern likely thrashes the CPU cache: `B.data[k + i*cols]` jumps ahead by a full row (`cols` doubles) every time `i` is incremented in the inner loop, so it strides down a column of the row-major storage instead of reading memory sequentially.
If I swap the loops, I get back the lost speed.
Before I provide a PR, is there any reason it is done this way?
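For reference, the loop interchange I have in mind looks roughly like this. This is a standalone sketch, not the actual EJML code: the class, method names, and flat arrays (without the `offsetA`/`offsetC` bookkeeping) are my own for illustration. Both orders compute the same quadratic form a^T·B·c; only the memory access pattern differs.

```java
public class LoopOrderDemo {
    // Original order: for each column k, walk down that column of a
    // row-major matrix -> each inner step jumps 'cols' doubles in memory.
    static double columnOrder(double[] a, double[] data, double[] c,
                              int rows, int cols) {
        double output = 0;
        for (int k = 0; k < cols; k++) {
            double sum = 0;
            for (int i = 0; i < rows; i++) {
                sum += a[i]*data[k + i*cols]; // strided, cache-unfriendly
            }
            output += sum*c[k];
        }
        return output;
    }

    // Swapped order: rows outermost, so 'data' is read with unit stride.
    static double rowOrder(double[] a, double[] data, double[] c,
                           int rows, int cols) {
        double output = 0;
        for (int i = 0; i < rows; i++) {
            double ai = a[i];
            int rowStart = i*cols;
            for (int k = 0; k < cols; k++) {
                output += ai*data[rowStart + k]*c[k]; // sequential access
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // 3x2 row-major matrix B = [[1,2],[3,4],[5,6]]
        double[] a = {1, 2, 3};
        double[] data = {1, 2, 3, 4, 5, 6};
        double[] c = {0.5, -1};
        System.out.println(columnOrder(a, data, c, 3, 2)); // prints -17.0
        System.out.println(rowOrder(a, data, c, 3, 2));    // prints -17.0
    }
}
```

The interchange is safe here because the accumulation is a plain sum over all (i, k) pairs, so reassociating the additions changes only floating-point rounding, not the mathematical result.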