Compare C/Java/Python Performance in Matrix Multiplication
- Implement matrix multiplication in C, Java and Python
- For C, two versions
- Array + for loop
- Pointer arithmetic + for loop
- For Java, three versions
- Standard for loop
- Stream (Parallel)
- Without JIT
- For Python, two versions
- Standard for loop
- Numpy
- `c-std`: Implementation in C using a for loop and arrays (this should have been named `c-for` for coherence, but too late)
- `c-pointer`: Implementation in C using a for loop and pointer arithmetic
- `java-for`: Implementation in Java using a for loop and arrays
- `java-no-jit`: Same as above, but with JIT disabled at runtime
- `java-stream`: Implementation in Java using Parallel Streams (JDK8+) and arrays
- `python-for`: Implementation in Python using a for loop and arrays
- `python-np`: Implementation in Python using the numpy library
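For reference, the kind of triple-loop kernel the `*-for` variations implement can be sketched as follows (shown in Python; this is an illustrative sketch, not the actual benchmark code):

```python
def matmul(a, b):
    """Naive O(n^3) multiplication of square n x n matrices
    stored as lists of lists of ints."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c

# [[1,2],[3,4]] x [[5,6],[7,8]] -> [[19,22],[43,50]]
```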
- We'll be working with square matrices ranging from `16x16` to `1024x1024`. We'll focus more on matrices above `256x256`.
- We'll only consider `int` types for simplicity.
- Only benchmark the matrix multiplication operation
- Won't consider time taken for other actions such as I/O
- Use same matrices for all variations
- Generate matrices on first run (Java) and save them to files
- For all variations, read from file and execute
- Iterations
  - For each size, we will multiply `20` times and sum the durations to get larger, more measurable duration results.
  - For example, `python-np: 1024, 122.44` means it took 122.44 seconds to multiply 20 `1024x1024` matrices with 20 other corresponding `1024x1024` matrices in the Python implementation using numpy.
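A minimal sketch of the benchmarking loop described above (the function and parameter names here are illustrative, not the actual harness): only the multiplication call is timed, repeated 20 times, with the durations summed.

```python
import time

def benchmark(multiply, pairs, iterations=20):
    """Sum the wall-clock time of `iterations` multiplications.
    Loading `pairs` (the pre-generated matrices) from files happens
    before this function is called, so I/O is never measured."""
    total = 0.0
    for a, b in pairs[:iterations]:
        start = time.perf_counter()
        multiply(a, b)  # only this call is measured
        total += time.perf_counter() - start
    return total
```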
- CPU: Ryzen 3950X (16 cores / 32 threads)
- RAM: 64GB
- GPU: N/A (Maybe in future, we can do CUDA)
- C: MinGW, C23, CMake
- JDK: 18 (Corretto)
- Python: 3.9 (Conda)
- OS: Windows 10
TBA if requested
- `c-std`: C, being the low-level systems language with the lowest overhead, will be fast (well, 2nd fastest; see below).
- `c-pointer`: Using pointer arithmetic, we should be able to squeeze out a little more performance; this should be the fastest (faster than `c-std`).
- `java-for`: Java, being a byte-code interpreted language at runtime, will be slower than the C variations.
- `java-no-jit`: Without `JIT` optimization, Java will be extremely slow.
- `java-stream`: This should be really fast as Java will utilize multiple CPU cores, but can it be faster than C?
- `python-for`: Similar to `java-for` but even slower, as Python's interpreter is much slower than Java's.
- `python-np`: This should get us results similar to (but slightly slower than) `c-std`, as `numpy` is an extension module written in C.
The chart contains 7 data points for each variation, and there are 7 variations, so we have 49 data points in total.
NOTE: We have ignored three data points in the above chart, namely `java-no-jit` @ 1024, `python-for` @ 512, and `python-for` @ 1024, as each of these took over 500 seconds and made it harder to visualize other points of interest. We will talk about these values later.
- C is really fast for smaller datasets but does not perform as expected on larger ones
- Java turned out to be much faster than expected
- Stock Python is many orders of magnitude slower than everything else

We'll be explaining each of these shortly.
- Variations of C show absolute dominance.
- `java-stream` appears to be very slow.
- `python-for` is already slow compared to the others (except `java-stream`).
- `python-np` closely follows `c-std`, as predicted.
- Variations of `c` are still closely holding up.
- The `java` variations are slow (as expected) and `java-no-jit` looks really bad; almost as bad as `python-for`.
- `python-for` is already showing Python's extreme lack of performance: `python-for 256, 86s` vs `python-np 256, 0.319s`. Just by replacing `for` loops with `numpy` functions, which execute in C, we see an insane performance difference.
- `python-np` is closely following `c-std`, as before.
Things get really interesting here. Compared to the other matrix sizes, these two are huge. The number of mathematical operations is way higher and requires much more memory. Multiplying two `n x n` matrices takes O(n^3) operations:

- `256x256`: 16M
- `512x512`: 134M (8 times `256x256`)
- `1024x1024`: 1B (64 times `256x256`)
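These counts are easy to sanity-check, assuming one multiply-add per innermost-loop step (n^3 total):

```python
# Operation counts for the n^3 inner-loop steps of naive matmul
ops = {n: n ** 3 for n in (256, 512, 1024)}

print(ops[256])               # 16777216   (~16M)
print(ops[512])               # 134217728  (~134M)
print(ops[1024])              # 1073741824 (~1B)
print(ops[512] // ops[256])   # 8,  i.e. (512/256)^3
print(ops[1024] // ops[256])  # 64, i.e. (1024/256)^3
```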
- As expected of the `c` variations, their times shot up due to the huge increase in the number of operations.
- While `java-for` and `java-stream` were slower than the `c` variations before, they have turned it upside down and are now faster than the `c` variations, and in one particular case, amazingly faster. We'll explain this below.
- `java-no-jit` is showing how useless Java is without JIT optimization and is only trailed by `python-for`.
- `python-for` is taking over 5000 seconds for `1024x1024`, compared to C's ~100 seconds.
- As expected, `python-np` is trailing C's performance closely but is slightly slower.
Assuming `c-std` as the baseline performance:
- Why is `c-pointer` slower than `c-std`?
- Why does `java-for` suddenly do better as matrix size increases?
- How did `java-stream` outperform everything?
- Why is `java-no-jit` so slow?
- Why is `python-for` even slower, and excessively so?
- Why is `python-np` so much faster than `python-for` and almost the same as `c-std`?
- While theoretically `c-pointer` should be faster, modern C compilers do amazing optimizations during compilation, resulting in extremely fast machine code. By hand-rolling an array system using pointers, we have prevented the compiler from making every optimization it could, and thus `c-pointer` is slightly slower than `c-std`.
- The `java-for` implementation runs the Java program with default settings, where `JIT Optimization` is enabled. This lets Java apply various loop optimizations such as `loop unrolling` and `loop tiling`. This is possible in C too, but it would require a huge amount of code/time. Libraries like `BLAS` implement these optimizations and will be even faster than any Java implementation (to be tested, though).
- The `java-stream` implementation is similar to `java-for` but is fully parallel [ Link to Code ]. It can and will saturate every ounce of CPU power available, and it produces the result faster the more CPU cores you give it. Again, if we implemented a similar algorithm in C (very difficult), the C variation would outperform the Java variation.
- `java-no-jit` is 100% pure interpretation without a single optimization. This is theoretically Java at its slowest.
- `python-for` is the same story as `java-no-jit`, except Python's interpreter is even slower.
- As mentioned before, `numpy` delegates its operations to modules written in pure C (well, 95% C, 5% C++), so it can harness the performance of C.
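To illustrate the `python-np` point: the interpreted triple loop and numpy's `@` operator compute the same product, but the latter runs inside numpy's compiled C routines. A tiny sketch (sizes here are deliberately small; this is not the benchmark code):

```python
import numpy as np

a = np.arange(9, dtype=np.int64).reshape(3, 3)   # [[0,1,2],[3,4,5],[6,7,8]]
b = np.ones((3, 3), dtype=np.int64)

# python-for style: every multiply-add runs in the interpreter
slow = [[sum(int(a[i][k]) * int(b[k][j]) for k in range(3))
         for j in range(3)]
        for i in range(3)]

# python-np style: the whole product is delegated to compiled C code
fast = a @ b

assert (np.array(slow) == fast).all()  # same result, vastly different speed at scale
```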
In general, C is faster, but for large datasets we need to implement a non-trivial amount of optimizations ourselves (or use libraries); otherwise, it will fall behind.
Java, on the other hand, is a breeze to write and automatically optimizes code paths using JIT. It is truly an excellent balance between performance and ease of use. Finally, Python is the slowest of them all, but it can delegate computationally expensive tasks to extension modules written in C and harness that power. Therefore, as long as Python has great extension modules, it is an excellent choice for most operations.
Loop unrolling and loop tiling are excellent techniques and can provide a huge performance boost if the programmer knows what they are doing. Cache-oblivious algorithms are another concept to explore to take loop tiling further, maximize cache usage, and increase performance manyfold.
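For illustration, here is the structure of loop tiling (blocking), sketched in Python. In Python itself the interpreter overhead dwarfs any cache benefit, but the same loop structure in C or Java keeps each `block x block` tile of the operands hot in cache while it is reused:

```python
def matmul_tiled(a, b, block=32):
    """Blocked O(n^3) multiplication. Simplified sketch: assumes
    square n x n matrices with n a multiple of `block`."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Work on one tile at a time; the a/b tiles touched
                # here fit in cache and are reused across iterations.
                for i in range(ii, ii + block):
                    for k in range(kk, kk + block):
                        aik = a[i][k]
                        for j in range(jj, jj + block):
                            c[i][j] += aik * b[k][j]
    return c
```

The result is identical to the naive version; only the traversal order changes, which is exactly why compilers (and Java's JIT) can apply it automatically.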
- END
