Comprehensive Guide to Getting Started with CUDA Programming

Introduction

Understanding the Basics of CUDA

GPU programming involves several components: the CPU, the GPU, system memory, and the GPU's own video memory. It is important to keep two terms clear:

  • Host: Refers to the CPU and system memory.
  • Device: Refers to the GPU and video memory.

Let’s start with a simple "Hello World" example in CUDA:

#include <stdio.h>

__global__ void kernel(void) {
    // An empty kernel: it runs on the GPU and does nothing.
}

int main() {
    kernel<<<1, 1>>>();          // launch the kernel with 1 block of 1 thread
    printf("Hello world\n");
    return 0;
}

This example showcases a few important differences from a traditional C program:

  1. The __global__ qualifier before the function kernel tells the compiler that this function runs on the GPU (although it is called from host code).
  2. The <<<1, 1>>> notation is the launch configuration: the first value is the number of thread blocks and the second is the number of threads per block, so this launch runs a single GPU thread (it is not a count of GPU "cores"). A small sketch of a larger launch follows this list.
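
As an illustration (not part of the program above), here is a minimal sketch that launches a hypothetical kernel, identify, with 2 blocks of 4 threads each and uses device-side printf so every thread can report which block and thread it is:

#include <stdio.h>

__global__ void identify(void) {
    // Each of the 8 launched threads prints its block and thread index.
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    identify<<<2, 4>>>();        // 2 blocks, 4 threads per block
    cudaDeviceSynchronize();     // wait for the kernel so its output is flushed
    return 0;
}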

In this "Hello World" example, the GPU function does not accept any parameters or return values. In a more meaningful scenario, we usually expect input and output. Consider the following example that adds two integers:

#include <stdio.h>

__global__ void add(int a, int b, int *c) {
    *c = a + b;    // runs on the GPU; the result is written to device memory
}

int main() {
    int c;
    int *dev_c;
    cudaMalloc((void**)&dev_c, sizeof(int));                      // allocate one int on the device
    add<<<1, 1>>>(2, 7, dev_c);                                   // compute 2 + 7 on the GPU
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);   // copy the result back to the host
    printf("2+7=%d\n", c);
    cudaFree(dev_c);                                              // release the device memory
    return 0;
}

In this example, the GPU performs the addition of two integers. Here are some key points to note:

  • Memory Allocation: The cudaMalloc function allocates memory on the device. Unlike malloc, which returns the allocated pointer directly, cudaMalloc returns an error code indicating success or failure, so we pass the address of our pointer (a pointer to a pointer) for it to fill in with the device address. A small error-checking sketch follows this list.
  • Function Invocation: The line add<<<1, 1>>>(2, 7, dev_c) invokes the GPU function to compute the sum.
  • Data Transfer: The cudaMemcpy function copies data between host and device; its last argument (here cudaMemcpyDeviceToHost) specifies the direction of the transfer.
  • Freeing Memory: After use, the allocated device memory is freed with cudaFree.
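
Because cudaMalloc, cudaMemcpy, and cudaFree all return a cudaError_t, a common pattern is to wrap each call in a small checking helper. The CHECK macro below is a hypothetical helper sketched for illustration, not part of the CUDA API:

#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                 \
                    __FILE__, __LINE__, cudaGetErrorString(err));        \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Applied to the calls from the example above:
// CHECK(cudaMalloc((void**)&dev_c, sizeof(int)));
// CHECK(cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost));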

Retrieving Device Information

To check the number of GPUs available on the current machine, use the following function:

int count;  
cudaGetDeviceCount(&count);  

You can also query the properties of each GPU:

cudaDeviceProp prop;  
for (int i = 0; i < count; i++) {  
    cudaGetDeviceProperties(&prop, i);  
}  

The cudaDeviceProp structure contains various attributes such as device name, total memory, and supported features.
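
For example, a minimal sketch that prints a few commonly used fields of cudaDeviceProp for every device might look like this:

#include <stdio.h>

int main() {
    int count;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s\n", i, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Total global memory: %zu bytes\n", prop.totalGlobalMem);
        printf("  Multiprocessors:     %d\n", prop.multiProcessorCount);
    }
    return 0;
}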

When multiple GPU devices are present (for instance, both integrated and dedicated graphics), certain operations may require a minimum compute capability. CUDA provides two APIs, cudaChooseDevice and cudaSetDevice, for quickly selecting a suitable device:

cudaDeviceProp prop;  
memset(&prop, 0, sizeof(cudaDeviceProp));  
prop.major = 1;  
prop.minor = 3;  

int dev;  
cudaChooseDevice(&dev, &prop);  
printf("ID of CUDA device closest to revision 1.3: %d\n", dev);  
cudaSetDevice(dev);  

Parallel Programming in CUDA C

Example: Vector Addition

In the previous section, we introduced the basic structure of a CUDA program. Now, let's explore how to implement parallel computation in CUDA.

Here’s a simple example of adding two vectors:

#define N 10  
int a[N], b[N], c[N];  
// Initialize a and b, and set c to zero  
for (int i = 0; i < N; i++) {  
    c[i] = a[i] + b[i];  
}  

The GPU version launches one block per element (N blocks of one thread each); inside the kernel, blockIdx.x tells each block which element to add:

__global__ void add(int *a, int *b, int *c) {  
    int tid = blockIdx.x;  
    if (tid < N) {  
        c[tid] = a[tid] + b[tid];  
    }  
}  
  
int main() {  
    // Suppose `int a[N], b[N], c[N]` are initialized  
    int *dev_a, *dev_b, *dev_c;  
    cudaMalloc((void**)&dev_a, N * sizeof(int));  
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);  
    cudaMalloc((void**)&dev_b, N * sizeof(int));  
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);  
    cudaMalloc((void**)&dev_c, N * sizeof(int));  
    add<<<N, 1>>>(dev_a, dev_b, dev_c);  
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);  
    for (int i = 0; i < N; i++) {  
        printf("%d\n", c[i]);  
    }  
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);  
    return 0;  
}  

Julia Set Example

The vector addition example is relatively straightforward. Now let's take a look at a more complex example: generating the Julia Set. The Julia Set is a type of fractal, where each pixel is determined by an iterative process based on its position.

Here’s the CPU implementation:

int main() {  
    CPUBitmap bitmap(DIM, DIM);  
    unsigned char *ptr = bitmap.get_ptr();  
    kernel(ptr);  
    bitmap.display_and_exit();  
}  
  
void kernel(unsigned char *ptr) {  
    for (int y = 0; y < DIM; y++) {  
        for(int x = 0; x < DIM; x++) {  
            int offset = x + y * DIM;  
            int juliaValue = julia(x, y);  
            ptr[offset * 4 + 0] = 255 * juliaValue;  
            ptr[offset * 4 + 1] = ptr[offset * 4 + 2] = 0;  
            ptr[offset * 4 + 3] = 255;  
        }  
    }  
}  
  
int julia(int x, int y) {  
    const float scale = 1.5;  
    float jx = scale * (float)(DIM / 2 - x) / (DIM / 2);  
    float jy = scale * (float)(DIM / 2 - y) / (DIM / 2);  
    cuComplex c(-0.8, -0.156);  
    cuComplex a(jx, jy);  
    for (int i = 0; i < 200; i++) {  
        a = a * a + c;  
        if (a.magnitude2() > 1000) {  
            return 0;  
        }  
    }  
    return 1;  
}  

Now, let's look at the GPU version of this implementation:

int main() {  
    CPUBitmap bitmap(DIM, DIM);  
    unsigned char *dev_bitmap;  
    cudaMalloc((void**)&dev_bitmap, bitmap.image_size());  
    dim3 grid(DIM, DIM);  
    kernel<<<grid, 1>>>(dev_bitmap);  
    cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost);  
    bitmap.display_and_exit();  
    cudaFree(dev_bitmap);  
}  

Understanding Kernel Function Implementation

The kernel function implementation for the GPU is as follows:

__global__ void kernel(unsigned char *ptr) {  
    int x = blockIdx.x;  
    int y = blockIdx.y;  
    int offset = x + y * gridDim.x;  
    int juliaValue = julia(x, y);  
    ptr[offset * 4 + 0] = 255 * juliaValue;  
    ptr[offset * 4 + 1] = ptr[offset * 4 + 2] = 0;  
    ptr[offset * 4 + 3] = 255;  
}

Key points to note in this implementation:

  1. The __global__ keyword indicates that this function runs on the GPU but can be called from the host.
  2. The block indices blockIdx.x and blockIdx.y determine which pixel this kernel invocation computes.
  3. gridDim is a built-in constant, identical for every block, that holds the dimensions of the grid specified at launch (here DIM x DIM); it is used to turn the 2D block index into a linear offset into the bitmap. The device-side julia helper that the kernel calls is sketched below.
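
The kernel calls julia(x, y) on the GPU, so that helper must be compiled as device code. Below is a minimal sketch that mirrors the CPU version; note that the cuComplex constructor and member functions sketched earlier would likewise need the __device__ qualifier so they can run on the GPU:

__device__ int julia(int x, int y) {
    const float scale = 1.5;
    // Map the pixel coordinate to a point in the complex plane, as in the CPU version.
    float jx = scale * (float)(DIM / 2 - x) / (DIM / 2);
    float jy = scale * (float)(DIM / 2 - y) / (DIM / 2);
    cuComplex c(-0.8, -0.156);
    cuComplex a(jx, jy);
    for (int i = 0; i < 200; i++) {
        a = a * a + c;
        if (a.magnitude2() > 1000) return 0;   // diverged: pixel is not in the set
    }
    return 1;                                  // stayed bounded: pixel is in the set
}

__device__ functions can only be called from code that already runs on the GPU (kernels or other __device__ functions), not directly from the host.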

Thread Cooperation

In the previous section, we introduced block-level parallelism. Blocks, however, are isolated and cannot communicate directly with one another. In this section we turn to thread-level cooperation: threads within the same block can share data and synchronize.

Splitting Blocks into Threads

CUDA allows each block to contain multiple threads, specified by the second parameter in the launch configuration <<<N, M>>>. Here’s a simple example demonstrating this concept:

__global__ void add(int *a, int *b, int *c) {  
    int tid = threadIdx.x;  
    if (tid < N) {  
        c[tid] = a[tid] + b[tid];  
    }  
}

In the main function:

int main() {  
    int a[N], b[N], c[N];  
    // Initialize a[N] and b[N]  
    int *dev_a, *dev_b, *dev_c;  
    cudaMalloc((void**)&dev_a, N * sizeof(int));  
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);  
    cudaMalloc((void**)&dev_b, N * sizeof(int));  
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);  
    cudaMalloc((void**)&dev_c, N * sizeof(int));  
    add<<<1, N>>>(dev_a, dev_b, dev_c);  
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);  
    // Display c  
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);  
    return 0;  
}  

In this case, the launch configuration <<<1, N>>> specifies one block containing N threads; each thread uses threadIdx.x to pick the element it adds. Blocks and threads can also be combined, as sketched below.
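
When blocks and threads are combined in a single launch, each thread derives a global index from both its thread index and its block index. A minimal sketch of that indexing pattern (the choice of 128 threads per block is arbitrary, for illustration):

__global__ void add(int *a, int *b, int *c) {
    // Global index of this thread across the whole grid.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

// Launch enough 128-thread blocks to cover all N elements:
// add<<<(N + 127) / 128, 128>>>(dev_a, dev_b, dev_c);

This is exactly the indexing scheme used by the dot-product kernel in the next subsection.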

Synchronization and Shared Memory

Thread-level parallelism also lets the threads within a block share data through fast, on-chip shared memory, declared with the __shared__ keyword. Here is the core of a kernel that computes a dot product using shared memory:

const int N = 33 * 1024;  
const int threadsPerBlock = 256;  
__global__ void dot(float *a, float *b, float *c) {
    __shared__ float cache[threadsPerBlock];   // one slot per thread in this block
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    // Each thread accumulates a partial sum, striding across the whole array.
    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[cacheIndex] = temp;

    __syncthreads();
    // Reduction logic follows (see the Reduction Techniques section below)
}

The synchronization function __syncthreads() ensures that every thread in the block has finished writing its partial sum to cache before any thread goes on to read from it.

Reduction Techniques

To sum the partial results held in the shared cache efficiently, a parallel (tree) reduction is typically used: at each step, the first half of the active threads each add in a value held by a thread in the second half, so the number of values is halved until a single sum per block remains. The per-block sums are then combined on the host (or in a second kernel), as sketched below.
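
A minimal sketch of that in-block reduction, picking up where the dot kernel above left off (it assumes threadsPerBlock is a power of two and writes one partial sum per block into c, to be added up afterwards on the host):

    // Parallel (tree) reduction over the shared cache.
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i) {
            cache[cacheIndex] += cache[cacheIndex + i];
        }
        __syncthreads();   // finish this round before halving the stride again
        i /= 2;
    }
    if (cacheIndex == 0) {
        c[blockIdx.x] = cache[0];   // one partial sum per block
    }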

Separately, the cudaMemcpyAsync function issues memory transfers asynchronously into a CUDA stream, which makes it possible to overlap computation with data transfer and thus make better use of the hardware.
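
A minimal sketch of that pattern is shown below; stream, host_buf, dev_buf, and the commented-out kernel launch are illustrative names, and true overlap additionally requires page-locked (pinned) host memory, allocated here with cudaHostAlloc:

const int n = 1024;

cudaStream_t stream;
cudaStreamCreate(&stream);

float *host_buf, *dev_buf;
cudaHostAlloc((void**)&host_buf, n * sizeof(float), cudaHostAllocDefault);   // pinned host memory
cudaMalloc((void**)&dev_buf, n * sizeof(float));

// Queue the copy (and, optionally, a kernel) in the same stream; both calls return immediately.
cudaMemcpyAsync(dev_buf, host_buf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
// some_kernel<<<blocks, threads, 0, stream>>>(dev_buf);   // hypothetical kernel launch in this stream

cudaStreamSynchronize(stream);   // block until all work queued in the stream has finished

cudaFreeHost(host_buf);
cudaFree(dev_buf);
cudaStreamDestroy(stream);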

Conclusion and Next Steps

This introduction to CUDA programming provides a foundation for understanding how to leverage GPU resources for parallel computation. In future sections, we will delve deeper into advanced topics such as constant memory, texture memory, and multi-GPU setups to further enhance performance.

