Lancern's Treasure Chest
10:38 · Jul 4, 2024 · Thu
Beating NumPy matrix multiplication in 150 lines of C
https://salykova.github.io/matmul-cpu
salykova
Advanced GEMM Optimization on Modern x86-64 Multi-Core Processors
This blog post explains how to optimize multi-threaded FP32 matrix multiplication for modern processors using FMA3 and AVX2 vector instructions. The optimized custom implementation resembles the BLIS design and outperforms existing BLAS libraries (including…