Lesson 24: strace and ltrace — Tracing System Calls
A program that is stuck, slow, or behaving oddly — where do you start investigating? strace is the microscope of Linux engineers: it intercepts every system call a program makes and displays it in real time. ltrace does the same for library calls. At NVIDIA, when a GPU program hangs, strace reveals
Imagine following a child and recording every action they take: opened door, picked up cup, drank water, closed door. That is strace — it writes down every 'action' the program asks the kernel to perform. If the child stands for a full minute in front of a closed door without moving — that is exactly what strace shows: 'the program has been blocked on open() for a minute'.
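The "standing in front of a closed door" case is easiest to spot when strace reports how long each call took. The sketch below is one possible way to assemble such a command line from Python. The helper names `build_strace_argv` and `run_traced` are hypothetical, and the list of syscalls to watch is only an assumption about what is interesting. On current Linux systems the C library typically issues openat() rather than open(), which is why openat appears in the filter.

```python
import shutil
import subprocess


def build_strace_argv(command, syscalls=("openat", "read", "write", "close")):
    """Return an strace command line that times selected syscalls of `command`."""
    return [
        "strace",
        "-f",                                 # also follow child processes
        "-T",                                 # append the time spent inside each syscall
        "-e", "trace=" + ",".join(syscalls),  # only report these syscalls
        *command,
    ]


def run_traced(command):
    """Run `command` under strace; return the trace text, or None if strace is missing."""
    if shutil.which("strace") is None:
        return None  # strace is not installed on this machine
    result = subprocess.run(
        build_strace_argv(command),
        stdout=subprocess.DEVNULL,  # discard the traced program's normal output
        stderr=subprocess.PIPE,     # strace writes its trace to stderr by default,
                                    # mixed with the traced program's own stderr
        text=True,
        errors="replace",           # tolerate non-UTF-8 bytes in the trace
    )
    return result.stderr
```

Calling `run_traced(["cat", "/etc/hostname"])` would return the trace as a string on a machine where strace is installed and tracing is permitted. A call that blocks for a long time stands out by the large duration that `-T` appends at the end of its line.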
- strace
- A Linux tool that intercepts and displays every system call a program makes in real time. Uses the kernel's ptrace() syscall. Essential for debugging stuck programs, crashes, and permission issues.
- ltrace
- Similar to strace, but traces calls into shared libraries (libc, libm, etc.) rather than system calls. It shows calls such as malloc(), printf(), and fopen() as the program makes them, at the library level.
- system call
- The interface between user-space programs and the kernel. Every privileged operation (reading a file, creating a socket, requesting memory from the OS) goes through a syscall, and strace can show each one.
- ioctl
- A general-purpose device-control syscall. The NVIDIA driver exposes much of its GPU management (for example VRAM allocation and device configuration) through ioctl() on its device files, so strace on a CUDA program typically shows many ioctl calls.
- ptrace
- The syscall that strace is built on. Allows one process to monitor, stop, and inspect another process. Debuggers like gdb also use ptrace.
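To make the glossary concrete, here is a minimal sketch of a program whose every step is a system call that strace would display, including one ioctl(). It uses the FIONREAD request on an ordinary pipe as a stand-in. The ioctl requests issued by the CUDA driver are driver-specific and are not shown here. The function names `pending_bytes` and `demo` are hypothetical, and the example assumes Linux.

```python
import fcntl
import os
import struct
import termios


def pending_bytes(fd):
    """Ask the kernel, via ioctl(FIONREAD), how many unread bytes are queued on fd."""
    raw = fcntl.ioctl(fd, termios.FIONREAD, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def demo():
    read_end, write_end = os.pipe()        # pipe2() (or pipe()) syscall
    try:
        os.write(write_end, b"hello")      # write() syscall
        waiting = pending_bytes(read_end)  # ioctl() syscall
        data = os.read(read_end, waiting)  # read() syscall
    finally:
        os.close(read_end)                 # close() syscall
        os.close(write_end)                # close() syscall
    return waiting, data
```

If this were saved as a script that calls `demo()`, running it as `strace -e trace=ioctl,read,write python3 script.py` should list these calls among the ones the Python interpreter itself makes during startup.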