CUDA Programming in C/C++

By Peter Blain

April 01, 2022

Introduction

CUDA is a software layer that allows you to run multiple instances of the same code on a GPU in parallel. This is useful for AI and anything else that requires massive parallelisation. ML frameworks such as TensorFlow and PyTorch support CUDA (noting that this article is about writing the low-level CUDA code yourself rather than using frameworks). CUDA is compatible with most NVIDIA GPUs, and can be coded in C, C++, Python and Fortran. CUDA makes it relatively easy to parallelise new or existing code.

CUDA code can be compiled like a C/C++ program as follows:

nvcc cuda_program.cu -o cuda_program

Quick Example

The following code is often provided as an easy example of how to parallelise CPU-only code using CUDA, so let's start with this:

#include <stdio.h>
#include <stdlib.h>  // for malloc and free

#define SIZE	1024

void VectorAdd(int *a, int *b, int *c, int n) {
    int i;
    for (i=0; i < n; ++i)
        c[i] = a[i] + b[i];
}

int main() {
    int *a, *b, *c;

    a = (int *)malloc(SIZE * sizeof(int));
    b = (int *)malloc(SIZE * sizeof(int));
    c = (int *)malloc(SIZE * sizeof(int));

    for (int i = 0; i < SIZE; ++i) {
        a[i] = i;
        b[i] = i;
        c[i] = 0;
    }

    VectorAdd(a, b, c, SIZE);

    for (int i = 0; i < 10; ++i)
        printf("c[%d] = %d\n", i, c[i]);

    free(a);
    free(b);
    free(c);

    return 0;
}

As you can see, the above code adds the two vectors element by element in a CPU-based serial loop. The following code is the parallelised CUDA equivalent:

#include <stdio.h>

#define SIZE	1024

__global__
void VectorAdd(int *a, int *b, int *c, int n) {
    int i = threadIdx.x;

    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    int *a, *b, *c;

    cudaMallocManaged(&a, SIZE * sizeof(int));
    cudaMallocManaged(&b, SIZE * sizeof(int));
    cudaMallocManaged(&c, SIZE * sizeof(int));

    for (int i = 0; i < SIZE; ++i) {
        a[i] = i;
        b[i] = i;
        c[i] = 0;
    }

    VectorAdd <<<1, SIZE>>> (a, b, c, SIZE);
    cudaDeviceSynchronize();

    for (int i = 0; i < 10; ++i)
        printf("c[%d] = %d\n", i, c[i]);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);

    return 0;
}

The above code produces the same result as the CPU-only version, but with parallelised GPU code. The coding differences are as follows:

__global__

The __global__ keyword indicates that the function will run on the GPU, and that it can be invoked globally. This code is known as a kernel.

cudaMallocManaged

cudaMallocManaged is used rather than malloc. This makes the data available to both the GPU and the CPU in a space known as unified memory. This saves manually copying data back and forth between the host and the GPU device. CUDA programmers are always quick to point out that cudaMallocManaged can (in some cases) lead to suboptimal performance because of the bandwidth limitations of the PCIe bus. This is less of a problem with the latest NVIDIA GPUs because they include hardware support for Unified Memory page faulting and migration. In some cases, however, you’ll still want to handle the transfers yourself rather than use cudaMallocManaged (more on this later in the post).

cudaFree

cudaFree is used rather than the usual free. It's needed when memory is allocated with cudaMallocManaged.

<<<1, SIZE>>>

The triple angle brackets <<<1, SIZE>>> specify the launch configuration of the kernel. The first parameter is the number of thread blocks. The second parameter is the number of threads in each thread block.

For best performance, use a multiple of 32 (the warp size) for the number of threads per block.

Each instance of the kernel (i.e. the VectorAdd function) runs in a thread. Threads run in parallel on the GPU. Threads are grouped together in blocks, each of which has the same number of threads. The collection of blocks is a grid. (The latest H100 GPUs support another layer known as a cluster, but I'll write about this in a separate post).
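As a sketch, a launch that uses multiple blocks with a multiple-of-32 block size might look like this (note that with more than one block the kernel would need to compute its global index, as described under "Multiple blocks" below; the variable names here are just illustrative):

int threadsPerBlock = 256;  // a multiple of 32
int numberOfBlocks = (SIZE + threadsPerBlock - 1) / threadsPerBlock;  // round up so every element is covered

VectorAdd <<<numberOfBlocks, threadsPerBlock>>> (a, b, c, SIZE);
cudaDeviceSynchronize();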

You can determine the maximum number of threads per block programmatically as follows:

int deviceId;
int maxThreadsPerBlock;

cudaGetDevice(&deviceId);
cudaDeviceGetAttribute(&maxThreadsPerBlock, cudaDevAttrMaxThreadsPerBlock, deviceId);
printf("Max threads per block: %d\n", maxThreadsPerBlock);

cudaDeviceSynchronize

cudaDeviceSynchronize prevents the CPU from continuing until the threads have completed.

A Few More Things

The above example is enough to get started, but let's take things a tiny step further, with a few more techniques.

Compile and run

You can compile and run at the same time by adding -run. This is handy when developing: nvcc -o something something.cu -run

threadIdx.x and blockIdx.x

threadIdx.x and blockIdx.x are special variables available inside the kernel that indicate the thread and block indices. You'll need to know the index in many cases, in the same way that code inside a for loop often needs to know the iteration index.
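As a quick illustration (a hypothetical kernel, not part of the example above), device-side printf can be used to show which block and thread each instance runs in:

__global__ void WhoAmI() {
    // Each thread prints its own block and thread index
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// e.g. launch with WhoAmI<<<2, 4>>>(); then call cudaDeviceSynchronize() to see the output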

Multiple blocks

The code above used a single block, but you'll often need more than one. In this case you probably don't want the index of the thread in the current block, but the index across all blocks. You can calculate this using the following idiomatic expression:

int index = threadIdx.x + blockIdx.x * blockDim.x;

Block/thread mismatch

Often the number of data elements N is not an exact multiple of the block size, so the launch creates more threads than there are elements. The extra threads must be prevented from doing out-of-range work, which can be handled as follows:

int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < N) {
    // Only do work if the index is less than N (the number of data elements)
}

Strides

If there are more data elements than threads in the grid (i.e. the number of threads per block multiplied by the number of blocks), each thread can process multiple elements using a grid-stride loop:

int indexWithinTheGrid = threadIdx.x + blockIdx.x * blockDim.x;
int gridStride = gridDim.x * blockDim.x;

for (int i = indexWithinTheGrid; i < N; i += gridStride) {
    // do work on a[i];
}
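Putting this together, a grid-stride version of the vector addition kernel might look like the following sketch (VectorAddStride is just an illustrative name):

__global__
void VectorAddStride(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int gridStride = gridDim.x * blockDim.x;

    // Each thread handles every gridStride-th element, starting at its own index
    for (int i = index; i < n; i += gridStride)
        c[i] = a[i] + b[i];
}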

Error handling

There are two types of error to catch: those that occur when launching the kernel, and those that occur asynchronously while the kernel executes on the device. You can catch both as follows:

VectorAdd<<<numberOfBlocks, threadsPerBlock>>>(a, b, c, N);

cudaError_t addVectorsErr = cudaGetLastError();
if (addVectorsErr != cudaSuccess)
    printf("Error: %s\n", cudaGetErrorString(addVectorsErr));

cudaError_t asyncErr = cudaDeviceSynchronize();
if (asyncErr != cudaSuccess)
    printf("Error: %s\n", cudaGetErrorString(asyncErr));

Profiling

It's important to optimise your code, so you'll need to know what's going on under the hood. NVIDIA provides both a command line profiling tool and a visual one. Use nsys to profile from the command line:

nsys profile --stats=true ./the_program_to_test

The visual tool is called Nsight. You can use it to open and analyse reports generated by nsys. This allows you to see things like page faulting, for example. Generate the report as follows (and then open it in Nsight):

nsys profile --stats=true -o output_report ./the_thing_to_test

Streaming Multiprocessors

NVIDIA GPUs have processing units called streaming multiprocessors (SMs). For best utilisation, the grid size should be chosen so that the number of blocks is a multiple of the number of SMs. For portability, you can query the number of SMs (and other device info) like this:

int deviceId;
cudaGetDevice(&deviceId);

cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceId);

The number of SMs is in the multiProcessorCount property. See the documentation to decipher the contents of the other props.
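For example, you could size the grid as a multiple of the SM count. A minimal sketch, building on the props queried above (the factor of 32 is just an illustrative choice):

int numberOfSMs = props.multiProcessorCount;
int numberOfBlocks = 32 * numberOfSMs;  // grid size as a multiple of the SM count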

Asynchronous Prefetching

As mentioned at the top of this post, there are times when blindly using cudaMallocManaged can lead to sub-optimal results. When unified memory is allocated, it is not resident on either the host or the device. A page fault occurs when the host or the device first accesses the memory, which triggers a migration. This is fine in some cases, but not in others (e.g. when large contiguous blocks of memory are needed and the access pattern is known in advance). One way (but not the only way) to deal with this is asynchronous prefetching:

int deviceId;
cudaGetDevice(&deviceId);

// Prefetch the data to the GPU before the kernel runs...
cudaMemPrefetchAsync(pointerToTheData, size, deviceId);

// ...and back to the CPU before the host accesses it again
cudaMemPrefetchAsync(pointerToTheData, size, cudaCpuDeviceId);

cudaMalloc

Another way to optimise your code is to manually allocate memory on the GPU with cudaMalloc. This is more work to code, but can also be more performant in some cases.
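A minimal sketch of the manual approach, reusing the VectorAdd example from above (the host arrays a, b and c are as before; the d_ prefix for device pointers is just a naming convention used here):

int *d_a, *d_b, *d_c;
size_t size = SIZE * sizeof(int);

// Allocate memory on the device only
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);

// Copy the input data from host to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

VectorAdd <<<1, SIZE>>> (d_a, d_b, d_c, SIZE);

// Copy the result back from device to host (cudaMemcpy waits for the kernel to finish)
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);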

CUDA Streams

Kernel launches and memory transfers occur in CUDA streams. Within a single stream, operations run serially in the order they were issued, even though each kernel runs its threads in parallel. It is possible, however, to create more than one stream to get concurrency between kernels and transfers. Kernels run in the default stream if no streams are created.

The default stream differs from user-created streams in that it is blocking: it waits for work in other streams to complete before running, and it blocks other streams from running until its own work completes.

Here is how you create a stream and launch a kernel:

cudaStream_t stream;
cudaStreamCreate(&stream);

theKernel<<<number_of_blocks, threads_per_block, 0, stream>>>();
cudaStreamDestroy(stream);

The fourth argument is the stream. The third argument defines the number of bytes of shared memory to allocate dynamically per block; it is passed here (as 0) only so that the fourth (stream) argument can be supplied. Both are optional. Note also that you pass a pointer when creating the stream, but a value (not a pointer) when destroying it.
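As a sketch of how two non-default streams can run work concurrently (kernelA and kernelB are hypothetical kernels; the launch configuration names match the example above):

cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);

// These two launches can overlap because they are in different non-default streams
kernelA<<<number_of_blocks, threads_per_block, 0, streamA>>>();
kernelB<<<number_of_blocks, threads_per_block, 0, streamB>>>();

// Wait for each stream to finish before cleaning up
cudaStreamSynchronize(streamA);
cudaStreamSynchronize(streamB);

cudaStreamDestroy(streamA);
cudaStreamDestroy(streamB);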

Conclusion

It's relatively easy to get started with CUDA programming. You can chip away at it bit by bit as you optimise. Read the NVIDIA documentation to understand it more fully, and to further optimise your code.