Lesson 35: cuBLAS, Thrust & When Not to Write a Kernel
Throughout the course we wrote quite a few kernels by hand: vector add, reduction, matrix multiply. But for standard operations, NVIDIA and the CUDA community have already written implementations that have been aggressively optimized over years: cuBLAS for linear algebra (matrix multiply via sgemm, dot product), and Thrust, a
Before you build an electric saw from scratch to cut a board, check if there is already a great saw in the store. cuBLAS and Thrust are the ready-made professional tools: for common operations they are almost always better than anything you could build alone in one evening.
- cuBLAS: NVIDIA's GPU linear-algebra library. Includes sgemm (matrix multiply) and more, tuned for peak performance.
- Thrust: An STL-style C++ library for CUDA: thrust::reduce, thrust::sort, thrust::transform and more, running on the GPU.
- thrust::reduce: Performs a reduction (e.g. a sum) over a range, on the GPU, in one line, replacing a hand-written reduction kernel.
- thrust::device_vector: A Thrust container that lives in device memory and manages allocation/free automatically, like std::vector.