High Performance GPU Kernels

Published: 23 Dec 2025

Disclaimer: This video is generated with Google's NotebookLM.

https://www.aleksagordic.com/blog/matmul

This technical blog post provides a comprehensive deep dive into the architecture and programming of high-performance NVIDIA GPU matrix multiplication kernels. The author explains the hardware evolution from Ampere to Hopper, highlighting how specialized components like Tensor Cores and the Tensor Memory Accelerator (TMA) drastically improve computational throughput. By examining PTX and SASS assembly, the text illustrates how low-level optimizations—such as loop unrolling and memory swizzling—maximize efficiency and prevent shared-memory bank conflicts. The narrative moves from naive implementations to state-of-the-art techniques, including warp-tiling, persistent kernels, and Hilbert-curve scheduling. Ultimately, the source serves as a guide for developers seeking to squeeze maximum performance out of modern AI accelerators by aligning software design with physical hardware constraints.

#ai #nvidia #gpu #computer