Lesson 35: cuBLAS, Thrust & When Not to Write a Kernel

Throughout the course we wrote quite a few kernels by hand: vector add, reduction, matrix multiply. But for standard operations, NVIDIA and the CUDA community have already written implementations that have been aggressively optimized over years: cuBLAS for linear algebra (matrix multiply sgemm, dot product), and Thrust, an STL-style C++ library for CUDA.

Before you build an electric saw from scratch to cut a board, check if there is already a great saw in the store. cuBLAS and Thrust are the ready-made professional tools: for common operations they are almost always better than anything you could build alone in one evening.
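To make the "ready-made tool" concrete, here is a minimal sketch of a matrix multiply done through cuBLAS instead of a hand-written kernel. The helper name gemm and the 2x2 test matrices are illustrative assumptions, not part of the lesson; error handling is reduced to a single status check to keep the sketch short. Note that cuBLAS expects column-major storage.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper (not from the lesson): computes C = A * B with cuBLAS.
// A is m x k, B is k x n, C is m x n, all stored column-major on the host.
std::vector<float> gemm(const std::vector<float>& A, const std::vector<float>& B,
                        int m, int n, int k) {
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, sizeof(float) * m * k);
    cudaMalloc(&dB, sizeof(float) * k * n);
    cudaMalloc(&dC, sizeof(float) * m * n);
    cudaMemcpy(dA, A.data(), sizeof(float) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), sizeof(float) * k * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // sgemm computes C = alpha * A * B + beta * C; here alpha = 1 and beta = 0.
    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                        m, n, k,
                                        &alpha, dA, m,   // lda = rows of A
                                        dB, k,           // ldb = rows of B
                                        &beta, dC, m);   // ldc = rows of C

    std::vector<float> C(static_cast<size_t>(m) * n, 0.0f);
    if (status == CUBLAS_STATUS_SUCCESS) {
        // Blocking copy: also waits for the sgemm call to finish.
        cudaMemcpy(C.data(), dC, sizeof(float) * m * n, cudaMemcpyDeviceToHost);
    }

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}

int main() {
    // A = [1 2; 3 4] and B = [5 6; 7 8], written column by column.
    std::vector<float> A = {1, 3, 2, 4};
    std::vector<float> B = {5, 7, 6, 8};
    std::vector<float> C = gemm(A, B, 2, 2, 2);
    // Print C row by row from its column-major storage.
    std::printf("%g %g\n%g %g\n", C[0], C[2], C[1], C[3]);
    return 0;
}
```

A file like this would typically be built with nvcc and linked against cuBLAS (for example, nvcc gemm.cu -lcublas). The only GPU computation in the sketch is the single cublasSgemm call; everything else is allocation and copying.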

cuBLAS: NVIDIA's GPU linear-algebra library. Includes sgemm (single-precision general matrix multiply, exposed as cublasSgemm) and more, tuned for peak performance.
Thrust: An STL-style C++ library for CUDA that provides thrust::reduce, thrust::sort, thrust::transform and more, all running on the GPU.
thrust::reduce: Performs a reduction (e.g. a sum) over a range, on the GPU, in one line — replacing a hand-written reduction kernel.
thrust::device_vector: A Thrust container that lives in device memory and manages allocation/free automatically, like std::vector.
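The Thrust terms above can be combined in a short sketch: a thrust::device_vector holds the data, thrust::transform squares each element, thrust::reduce sums the result, and thrust::sort orders a second array. The helper names sum_of_squares and sort_on_gpu and the sample inputs are assumptions made for illustration, not code from the lesson.

```cuda
#include <cstdio>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/transform.h>

// Hypothetical helper: 0^2 + 1^2 + ... + (n-1)^2, computed on the GPU.
long long sum_of_squares(int n) {
    thrust::device_vector<long long> v(n);   // device memory, freed automatically
    thrust::sequence(v.begin(), v.end());    // v = 0, 1, ..., n-1
    // Element-wise v[i] = v[i] * v[i]: binary transform with v as both inputs.
    thrust::transform(v.begin(), v.end(), v.begin(), v.begin(),
                      thrust::multiplies<long long>());
    // One-line reduction in place of a hand-written reduction kernel.
    return thrust::reduce(v.begin(), v.end(), 0LL, thrust::plus<long long>());
}

// Hypothetical helper: sorts host data on the GPU and returns the result.
std::vector<int> sort_on_gpu(const std::vector<int>& host) {
    thrust::device_vector<int> d(host.begin(), host.end());  // host -> device
    thrust::sort(d.begin(), d.end());                         // ascending, on device
    std::vector<int> out(host.size());
    thrust::copy(d.begin(), d.end(), out.begin());            // device -> host
    return out;
}

int main() {
    std::printf("sum of squares 0..9 = %lld\n", sum_of_squares(10));
    for (int x : sort_on_gpu({5, 2, 9, 1})) std::printf("%d ", x);
    std::printf("\n");
    return 0;
}
```

No kernel launch syntax, cudaMalloc, or cudaFree appears anywhere in this sketch: the container handles memory and the algorithms handle the launches, which is the main reason to reach for Thrust first for operations like these.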