Reading Notes on Programming Massively Parallel Processors

This post contains my reading notes on the book “Programming Massively Parallel Processors: A Hands-on Approach” by David B. Kirk and Wen-mei W. Hwu.

Chap4: Compute architecture and scheduling

This chapter introduces the compute architecture of modern GPUs and how threads are scheduled onto it.

Architecture of a modern GPU

From a hardware’s perspective, a GPU is partitioned into a 2-level hierarchy of parallel processing units. The top level is the Streaming Multiprocessor (SM), which is a collection of CUDA cores. For example, the Ampere A100 GPU has 108 SMs, each with 64 CUDA cores. The bottom level is the CUDA core, which is the smallest unit of computation in a GPU.

Just like the computation units, the memory is also organized as a hierarchy. Global memory is the largest and slowest memory and is shared by all the SMs. Each SM also has its own shared memory and L1 cache, which are faster than global memory. Between global memory and the shared memory/L1 cache sits the L2 cache, which is shared by all the SMs. Registers are the fastest memory and are private to each thread at runtime.

Figure: GPU architecture (compute and memory hierarchy)

Here is a table of the different memory types on an NVIDIA A100 40GB GPU; you can find more details in the NVIDIA A100 datasheet:

| Memory Type | Bandwidth | Size |
| --- | --- | --- |
| Global Memory | 1555 GB/s | 40 GB |
| L2 Cache | not disclosed | 40 MB |
| L1 Cache | not disclosed | 192 KB (per SM) |
| Shared Memory | not disclosed | 164 KB (per SM) |
| Register File per SM | not disclosed | 256 KB |
| Register File per GPU | not disclosed | 27648 KB |

Except for the bandwidth of global memory, the bandwidths of the other memory types are not disclosed in the whitepaper; you may be able to measure these numbers yourself following here.
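
As a rough starting point (my own sketch, not from the book), effective global memory bandwidth can be estimated with a device-to-device copy timed by CUDA events; the buffer size and iteration count below are arbitrary choices, and a careful benchmark should follow the methodology linked above.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ull << 30;   // 1 GiB per buffer (arbitrary choice)
    const int iters = 20;              // average over several copies
    float *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each copy reads and writes `bytes`, so 2 * bytes of traffic per iteration.
    double gbps = (2.0 * bytes * iters) / (ms / 1e3) / 1e9;
    printf("effective global memory bandwidth: %.1f GB/s\n", gbps);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```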

Block, warp, thread, and their scheduling

When a CUDA kernel is called, the CUDA runtime system launches threads that execute the kernel code. These threads are organized in a two-level abstraction: the block and the thread.

A block is the abstraction used to group threads when they are launched onto the GPU (using the <<<grid, block>>> syntax). Multiple blocks can be assigned to the same SM at the same time, but the number of blocks that can be assigned to an SM is limited by the hardware. Blocks execute independently of one another: threads in different blocks cannot synchronize with each other, so blocks can be scheduled in any order and fully in parallel. Within a block, programmers can synchronize threads and share data between them, e.g., calling __syncthreads() to make sure all threads in the block have finished the previous computation before moving on, or using shared memory to exchange data between threads in the same block. A minimal sketch of both is given below.
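
As an illustration of both mechanisms, here is a minimal sketch of my own (the kernel name and block size are made up for the example): each block stages its slice of the input in shared memory, synchronizes with __syncthreads(), and then reads a neighboring element.

```cpp
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Each block copies its slice of `in` into shared memory, waits for all
// threads of the block, then reads its right-hand neighbor from shared memory.
__global__ void shiftLeft(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();  // every thread's element is now visible in `tile`
    if (i < n) {
        // Read a neighbor within the block; the last thread keeps its own value.
        int j = (threadIdx.x + 1 < blockDim.x && i + 1 < n) ? threadIdx.x + 1 : threadIdx.x;
        out[i] = tile[j];
    }
}

void launchShiftLeft(const float *d_in, float *d_out, int n) {
    dim3 block(BLOCK_SIZE);
    dim3 grid((n + BLOCK_SIZE - 1) / BLOCK_SIZE);   // <<<grid, block>>> launch configuration
    shiftLeft<<<grid, block>>>(d_in, d_out, n);
}
```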

A warp is the abstraction used to group threads when they are scheduled on the SM. The warp size on most NVIDIA GPUs is 32. Threads in a block are partitioned into warps according to their thread index order. For a three-dimensional block, the threads are partitioned into warps in the following order (from slowest- to fastest-varying dimension):

threadIdx.z -> threadIdx.y -> threadIdx.x

For example, for a three-dimensional block of size 2x8x4 (z, y, x), the 64 threads are partitioned into two warps: threads (0, 0, 0) through (0, 7, 3) form the first warp and threads (1, 0, 0) through (1, 7, 3) form the second warp. Threads within a warp are executed in a SIMD (Single Instruction, Multiple Data) fashion, which means that the threads in a warp execute the same instruction at the same time. When control flow diverges, threads in the same warp that take different paths are serialized on architectures before Volta, which leads to performance degradation. From the Volta architecture onwards, threads on different divergent paths may be executed concurrently thanks to independent thread scheduling.
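
To make the partitioning concrete, here is a small sketch of my own (not from the book) that computes the linearized thread index and the resulting warp ID inside a kernel, following the z -> y -> x ordering above.

```cpp
// Assumes a single-block launch; warpIds must hold blockDim.x*blockDim.y*blockDim.z ints.
__global__ void whichWarp(int *warpIds) {
    // Linearize the 3D thread index: threadIdx.x varies fastest, threadIdx.z slowest.
    int linear = threadIdx.z * (blockDim.y * blockDim.x)
               + threadIdx.y * blockDim.x
               + threadIdx.x;
    warpIds[linear] = linear / warpSize;   // warpSize is a built-in variable (32 on current GPUs)
}
```

Launched as `whichWarp<<<1, dim3(4, 8, 2)>>>(d_warpIds);` (note that dim3 takes x, y, z), threads with linear index 0-31 report warp 0 and the rest report warp 1, matching the 2x8x4 example above.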

With many warps assigned to each SM and zero-overhead warp scheduling, the GPU can hide the latency of memory accesses by switching to another ready warp when the current warp is waiting on a long-latency operation.

Resource partitioning and occupancy

The ratio of the number of active warps to the maximum number of warps that can be active on an SM is called the occupancy. In this book, the authors use many examples to show how to calculate the occupancy of an SM and how to tune it to achieve better performance. Generally, the following per-SM resource limits should be taken into account when trying to improve occupancy on a particular piece of hardware (a query sketch follows the list):

- the maximum number of resident blocks and resident threads per SM
- the number of registers each thread uses, against the size of the SM's register file
- the amount of shared memory each block uses, against the SM's shared memory capacity
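
As a practical complement (not from the book), the CUDA runtime can report how many blocks of a given kernel fit on one SM via cudaOccupancyMaxActiveBlocksPerMultiprocessor; here is a minimal sketch using a made-up placeholder kernel.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// A placeholder kernel just to have something to query occupancy for.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int blockSize = 256;   // chosen arbitrarily for the example
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, scale, blockSize, /*dynamicSMemSize=*/0);

    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("theoretical occupancy: %d/%d warps = %.0f%%\n",
           activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    return 0;
}
```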

Querying device properties

CUDA C provides a set of functions to query device properties. The ones mentioned in this chapter are cudaGetDeviceCount, which returns the number of available CUDA devices, and cudaGetDeviceProperties, which fills a cudaDeviceProp structure whose fields include the number of SMs, the maximum number of threads per block, the warp size, and the per-block register and shared memory limits.
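
A minimal query sketch (the fields printed here are my own selection, not necessarily the exact set highlighted in the book):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int devCount = 0;
    cudaGetDeviceCount(&devCount);             // number of CUDA-capable devices

    for (int d = 0; d < devCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);     // fill the property structure for device d
        printf("device %d: %s\n", d, prop.name);
        printf("  SMs:                     %d\n", prop.multiProcessorCount);
        printf("  max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("  warp size:               %d\n", prop.warpSize);
        printf("  registers per block:     %d\n", prop.regsPerBlock);
        printf("  shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    }
    return 0;
}
```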

Chap5: Memory architecture and data locality

Estimate peak performance of a kernel

At the beginning of this chapter, the authors introduce a methodology to estimate the peak performance of a kernel. Before visiting that part, we first need to know about arithmetic (or computational) intensity: the ratio of floating-point operations performed to bytes read from global memory (FLOP/B). It is a good indicator of whether a kernel is memory-bound or compute-bound. Take the naive matrix multiplication kernel as an example: in each iteration of its inner loop we read two single-precision numbers (8 bytes) from global memory to perform one addition and one multiplication, so the arithmetic intensity is 2/8 = 0.25 FLOP/B. Multiplying the arithmetic intensity by the peak memory bandwidth of the device gives the peak throughput this kernel can achieve, which is 1555 GB/s * 0.25 FLOP/B = 388.75 GFLOP/s on an NVIDIA A100 40GB GPU. According to the whitepaper, however, the single-precision peak throughput of the A100 is 19500 GFLOP/s (or 156000 GFLOP/s with TF32 Tensor Cores). This indicates that the matrix multiplication kernel is memory-bound rather than compute-bound: it is far from fully utilizing the peak compute throughput of the device. The inner loop of such a kernel is sketched below.
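
For reference, here is a minimal sketch of such a naive matrix multiplication kernel (square width x width matrices, with the M, N, P naming commonly used in the book's matmul examples); each iteration of the inner loop loads two 4-byte values from global memory and performs one multiply and one add, which is where the 0.25 FLOP/B comes from.

```cpp
// Naive matrix multiplication: P = M * N for square width x width matrices.
__global__ void matMulNaive(const float *M, const float *N, float *P, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float acc = 0.0f;
        for (int k = 0; k < width; ++k) {
            // Two 4-byte loads from global memory, one multiply and one add:
            // 2 FLOP / 8 B = 0.25 FLOP/B arithmetic intensity.
            acc += M[row * width + k] * N[k * width + col];
        }
        P[row * width + col] = acc;
    }
}
```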

The Roofline Model

The Roofline Model is a performance model that helps us understand the performance of a kernel and identify its bottleneck.

Figure: the Roofline Model

The x-axis of the Roofline Model is the arithmetic intensity, and the y-axis is the attainable performance of the kernel. As arithmetic intensity increases, the attainable performance increases until it reaches the peak compute throughput of the device. This divides the area below the peak-throughput line into two parts: to the left of the ridge point (the vertical line in the figure) a kernel is memory-bound, and its throughput can only be raised by increasing its arithmetic intensity; to the right of the ridge point a kernel is compute-bound.
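
Written as a formula (standard roofline notation, not quoted from the book), the attainable throughput for a kernel with arithmetic intensity $I$ on a device with peak compute throughput $\pi$ and peak memory bandwidth $\beta$ is:

$$
P_{\text{attainable}} = \min(\pi,\; I \cdot \beta)
$$

The ridge point, where the two terms meet, is at $I = \pi / \beta$, which is what the calculation below evaluates for the A100.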

From this model, we learn that a kernel needs an arithmetic intensity of at least 19500 GFLOP/s / 1555 GB/s ≈ 12.5 FLOP/B to reach the peak throughput of the A100 GPU.

CUDA memory types

Figure: CUDA memory types

When discussing the CUDA memory types, we describe each memory type from these aspects:

- memory scope (which threads can access a variable placed in it)
- memory lifetime (how long such a variable lives)

#CUDA Programming