(4) CUDA environment installation and programming

1. Determine the installation version

        1. Check the highest CUDA version supported by the graphics card, so that the corresponding CUDA installation package can be downloaded:

Install the NVIDIA graphics card driver, then view detailed information about the NVIDIA graphics card in the current system with nvidia-smi.

Look at the "CUDA Version" field in the nvidia-smi output: it shows the highest CUDA version supported by the currently installed driver (the GPU's compute capability can also be checked when choosing libraries).
        2. Determine the cuDNN version corresponding to the chosen CUDA version (it is recommended to refer to the official documentation or release notes provided by NVIDIA).

        3. Toolkit (NVIDIA): the complete CUDA Toolkit installation package. It provides the NVIDIA driver, the development tool kits needed for writing CUDA programs, and other installation options, including the CUDA compiler, IDE integration, and debugger, as well as the various CUDA libraries and their header files.

2. Installation method

Download the CUDA Toolkit installer from the official NVIDIA website and install it, then download and install the matching cuDNN package.

 3. CUDA environment variable configuration

If the nvcc -V command used to verify the installation reports an error, add the following two lines to the environment variable configuration file ~/.bashrc (assuming the CUDA version in the environment is 12.0), then reload it with source ~/.bashrc:

export PATH=$PATH:/usr/local/cuda-12.0/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.0/lib64

        NVCC (NVIDIA CUDA Compiler) is the compiler provided by NVIDIA for compiling CUDA code. It converts CUDA C/C++ code into binary code that can be executed on the GPU, thereby enabling GPU-accelerated computing. NVCC is more than just a compiler; it also provides options for managing the compilation and build process.
        NVCC supports mixed compilation of CUDA code and ordinary C/C++ code, allowing developers to write host (CPU) code and device (GPU) code in the same file. Developers can use the CUDA extensions to C/C++ syntax to write device code, including CUDA kernel functions, thread and block control, and so on.

4. NVCC compilation options

Not many device-side options are used in practice. Commonly used options include, for example, -o (name of the output file), -arch (target GPU architecture), -G (generate debug information for device code), and -O (optimization level).

5. Use NVCC to compile simple CUDA programs

1. Write the CUDA source code: write source files containing CUDA code. Typically, CUDA code includes host code (running on the CPU) and device code (running on the GPU).
2. Compile using nvcc (a sample command sequence is shown below).
3. Run the executable file: run the generated executable to execute the CUDA program.
4. Perform performance analysis on the program.
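
As a minimal illustration of steps 2-4, assuming a hypothetical source file named my_cuda_app.cu, the command sequence might look like this:

nvcc -o my_cuda_app my_cuda_app.cu
./my_cuda_app
nvprof ./my_cuda_app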

6. Coding test - checkDeviceInfo

#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    cudaDeviceProp devProp;
    // Query the properties of device 0
    cudaGetDeviceProperties(&devProp, dev);
    std::cout << "GPU Device " << dev << ": " << devProp.name << std::endl;
    std::cout << "SM Count: " << devProp.multiProcessorCount << std::endl;
    std::cout << "Shared Memory Size per Thread Block: "
              << devProp.sharedMemPerBlock / 1024.0 << " KB" << std::endl;
    std::cout << "Threads per Thread Block: " << devProp.maxThreadsPerBlock << std::endl;
    std::cout << "Threads per SM: " << devProp.maxThreadsPerMultiProcessor << std::endl;
    std::cout << "Warps per SM: " << devProp.maxThreadsPerMultiProcessor / 32 << std::endl;
    return 0;
}
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel thread configuration
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    // Kernel invocation
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
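
The "..." above elides the host-side setup. As a self-contained sketch of what that could look like (the value of N and the names h_A, h_B, h_C, d_A, d_B, d_C are assumptions made here for illustration), the complete example might be written as:

#include <cuda_runtime.h>

#define N 1024  // assumed matrix dimension (compile-time constant)

// Kernel: one thread per matrix element
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    size_t bytes = N * N * sizeof(float);

    // Host matrices (initialization omitted)
    float (*h_A)[N] = new float[N][N];
    float (*h_B)[N] = new float[N][N];
    float (*h_C)[N] = new float[N][N];

    // Device matrices
    float (*d_A)[N], (*d_B)[N], (*d_C)[N];
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);

    // Copy the inputs to the device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // Kernel thread configuration and invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C);

    // Copy the result back and release memory
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    delete[] h_A; delete[] h_B; delete[] h_C;
    return 0;
}

Ceiling division is used for numBlocks so that the launch also covers an N that is not an exact multiple of the block dimensions; the index check inside the kernel then discards the surplus threads.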

Coding Experiment - Differences from Serial Code (Device Section)

In CUDA programming, threadIdx, blockIdx, blockDim, and gridDim are built-in variables used to determine the indices and dimensions of threads and thread blocks in parallel computation; they are used to address thread blocks and threads on the GPU.

 threadIdx.x, threadIdx.y, and threadIdx.z represent the index of the current thread in the x, y, and z directions within the thread block, respectively.

Threads within each thread block have their own threadIdx. 

 blockIdx.x, blockIdx.y, and blockIdx.z represent the index of the current thread block in the x, y, and z directions within the grid. Each thread block has its own blockIdx. 

 blockDim.x, blockDim.y, and blockDim.z represent the number of threads in a thread block in the x, y, and z directions, respectively.

blockDim is fixed when the kernel is launched and is the same for every thread block in the grid.

gridDim.x, gridDim.y, and gridDim.z represent the dimensions of the entire grid in the x, y, and z directions, that is, the number of thread blocks in each direction. This is also fixed at launch time and is the same for the entire grid.
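
To see these built-in variables concretely, a small sketch (assuming an arbitrary 2 x 3 grid of 4 x 2 blocks and a hypothetical kernel name whoAmI) can print them with the device-side printf:

#include <cstdio>
#include <cuda_runtime.h>

// Each thread prints its own coordinates plus the (uniform) block and grid dimensions.
__global__ void whoAmI()
{
    printf("block (%d,%d) in grid (%d,%d) - thread (%d,%d) in block (%d,%d)\n",
           blockIdx.x, blockIdx.y, gridDim.x, gridDim.y,
           threadIdx.x, threadIdx.y, blockDim.x, blockDim.y);
}

int main()
{
    dim3 grid(2, 3);   // gridDim  = (2, 3)
    dim3 block(4, 2);  // blockDim = (4, 2)
    whoAmI<<<grid, block>>>();
    cudaDeviceSynchronize();  // wait for the device-side printf output to be flushed
    return 0;
}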

7. CUDA program performance analysis tool - nvprof

        nvprof is a command-line tool provided by NVIDIA that can be used to collect performance metric data about CUDA applications, such as GPU utilization, memory bandwidth, running time, and latency. Additionally, nvprof can trace CUDA API calls during GPU code execution to identify functions and code paths that may cause performance bottlenecks.
        nvprof provides a variety of analysis options, such as a timeline view, a function summary view, and an instruction analysis view, to help better understand the performance bottlenecks of CUDA applications.

Simple timeline view
nvprof ./my_cuda_app

Specify statistical information: GPU utilization, occupancy
nvprof --metrics gpu_utilization,achieved_occupancy ./my_cuda_app

Specify the output file
nvprof --output-profile my_profile.nvvp ./my_cuda_app

Use the analysis view
nvprof --analysis-metrics -o my_profile.nvvp ./my_cuda_app

8. Mapping of block and thread index (matrix related)

 Normally, a matrix is stored linearly in host memory in row-major order. In a CUDA program, a two-dimensional grid (e.g. 2 x 3 blocks) of two-dimensional blocks (e.g. 4 x 2 threads) can be created, and the block and thread indices can then be used to map each thread to a matrix index.

For addition, element (i, j) of the result matrix is the sum of the elements at the same coordinates (i, j) of the two n x m input matrices. A total of n * m additions are performed, so n * m threads are needed; that is, at least N = n * m threads must be launched.

For multiplication, element (i, j) of the result matrix is the dot product of the i-th row of the first matrix (n x k) and the j-th column of the second matrix (k x m); each thread computes one such element, so the total number of threads is again N = n * m.

Once N (= n * m) is known, blockDim can be chosen (there is an upper limit of 1024 threads per block, and the value is usually an integer multiple of 32), and gridDim is then derived from N and blockDim.

In C code as it is usually written, (i, j) denotes the i-th row and j-th column.

In matrix coordinates, (ix, iy) denotes the iy-th row and ix-th column, i.e. the same coordinate system as the x and y axes (x horizontal, y vertical).

`blockIdx.x` is a built-in variable in CUDA programming, used to represent the index of the current thread block in the x direction.

In CUDA, threads are organized in a grid in the form of thread blocks. Each thread block consists of several threads, and the thread blocks are organized into a three-dimensional grid. `blockIdx.x` represents the index of the current thread block in the x direction, that is, its position in the entire grid.

Through `blockIdx.x`, conditional judgment or calculation can be performed in the CUDA program to make different thread blocks perform different operations or access different data.

int ix = blockIdx.x * blockDim.x + threadIdx.x;

int iy = blockIdx.y * blockDim.y + threadIdx.y;

This code calculates the global index of the current thread in the 2D grid, where:

- `blockIdx.x` and `blockIdx.y` represent the index of the current thread block in the x and y directions respectively.
- `blockDim.x` and `blockDim.y` represent the number of threads in each thread block (in x and y directions) respectively.
- `threadIdx.x` and `threadIdx.y` represent the index of the current thread within the thread block to which it belongs (in the x and y directions) respectively.

From these values, the global index of the current thread in the entire grid can be calculated.

Specifically, `ix` is calculated as: the x coordinate of the current thread is `blockIdx.x * blockDim.x + threadIdx.x`.

Similarly, `iy` is calculated as: the y coordinate of the current thread is `blockIdx.y * blockDim.y + threadIdx.y`.

The `ix` and `iy` computed in this way allow each thread in a CUDA program to access and process its own portion of the data.
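
Putting this together, a sketch of how (ix, iy) can map onto a matrix stored linearly in row-major order (the kernel name matAddLinear and the dimensions nx, ny are illustrative assumptions):

// Element-wise addition of two ny x nx matrices stored linearly in row-major order
__global__ void matAddLinear(const float *A, const float *B, float *C, int nx, int ny)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int iy = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (ix < nx && iy < ny) {
        int idx = iy * nx + ix;   // row-major linear index
        C[idx] = A[idx] + B[idx];
    }
}

The boundary check handles the case where nx or ny is not an exact multiple of the block dimensions.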

9. Number of threads in the thread block

The number of threads in a thread block is best configured as a multiple of 32, which is determined by the hardware characteristics of the GPU and the design of the parallel computing model.

The GPU performs parallel computing tasks in units of thread blocks, and the threads within each thread block can work together to share data and communication. When the GPU performs computing tasks, it divides thread blocks into smaller thread groups (warps), and each thread group contains a set of consecutive threads.

GPU hardware schedules and executes work in units of warps. Specifically, in each clock cycle the warp scheduler selects a warp to issue, rather than scheduling and executing every thread individually. Threads in the same warp execute the same instruction stream, which is called the SIMT (Single Instruction, Multiple Threads) execution model.

This is why the number of threads in a thread block is best configured as a multiple of 32: a warp has a fixed size, usually 32 threads (the exact number may vary across GPU architectures). If the number of threads in a block is not a multiple of 32, the last warp is only partially filled. For example, a block of 40 threads occupies two warps: one fully populated with 32 threads and a second in which only 8 of the 32 lanes are active, which wastes GPU computing resources.

Therefore, in order to make full use of the parallel computing capability of the GPU and keep the warps balanced, it is best to configure the number of threads in a thread block as a multiple of 32. This ensures that every scheduled warp is fully populated, thereby improving computing performance.

However, in practice the number of threads created by the launch configuration often does not exactly match the number needed for the parallel loop. For example, 1230 iterations may actually need to be executed, while 2048 threads end up being configured.

1. Set configuration parameters so that the total number of threads exceeds the number required for actual work.

2. When passing parameters to the kernel function, pass an N that represents the total size of the data set to be processed or the total number of threads required to complete the work.

3. After calculating the thread index in the grid (using threadIdx + blockIdx*blockDim), determine whether the index exceeds N, and only perform work related to the kernel function if it does not exceed N.

Note: Applicable when the total amount of work N and the number of threads in the thread block are known.

// Assume N is known
int N = 100000;
// Set the number of threads per block to 256
size_t threads_per_block = 256;
// Compute the number of blocks from N and the thread count (rounding up)
size_t number_of_blocks = (N + threads_per_block - 1) / threads_per_block;
// Pass N to the kernel
some_kernel<<<number_of_blocks, threads_per_block>>>(N);

number_of_blocks = (N + threads_per_block - 1) / threads_per_block;

This line computes N / threads_per_block rounded up (ceiling division).

The advantage of rounding up is that the total number of threads is guaranteed to be greater than or equal to N, i.e. enough to execute all N tasks, while wasting at most threads_per_block - 1 threads (less than one block's worth).
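
A minimal sketch of the kernel side of this pattern, i.e. the index check described in step 3 above (the body of some_kernel here is a placeholder):

__global__ void some_kernel(int N)
{
    // Global index of this thread within the whole grid
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Only the first N threads do real work; the surplus threads in the last block fall through
    if (idx < N) {
        // ... per-element work for index idx goes here ...
    }
}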

10. Kernel function 

Define the kernel function:

Kernel Function refers to a function executed on the GPU in parallel computing. In the CUDA programming model, kernel functions are parallel computing task codes written by developers that are executed on the GPU.

Kernel functions are identified in CUDA by the `__global__` modifier and can accept parameters, but they must have a void return type (a kernel cannot return a value directly). In kernel functions, specific syntax and built-in variables can be used to control the parallel computation, such as using thread indices, thread blocks, and grids to manage parallel execution.

In CUDA programming, a kernel function is launched once and then executed by many threads in parallel, with each thread independently performing the same computation on its own data. When a kernel executes, information such as the thread index, the thread block index, and the grid dimensions can be used to determine the role and task of each thread in the calculation.

Kernel functions are usually used to perform intensive numerical computing tasks, such as vector addition, matrix multiplication, etc., and can make full use of the parallel computing capabilities of the GPU to improve computing performance. By writing appropriate kernel functions, computing tasks can be divided into multiple parallel thread blocks and executed simultaneously on the GPU, thus accelerating the computing process.

It should be noted that the kernel function cannot directly call functions on the CPU or access data in the CPU memory, because the GPU and CPU are two independent computing devices. If you need to use data on the CPU in the kernel function, you need to copy the data from the host (CPU) memory to the device (GPU) memory and access the device memory in the kernel function. Similarly, if the calculation results need to be copied from the device memory back to the host memory, corresponding data transfer operations are also required.

In short, the kernel function is a parallel computing task code executed on the GPU. By fully utilizing the parallel computing capabilities of the GPU, intensive numerical computing tasks can be accelerated.

Work distribution:

Assign tasks to thread blocks and threads. The number of blocks and the number of threads per block are chosen according to the size of the problem and the capabilities of the GPU.

Data access and synchronization:

Make sure that multiple threads do not interfere with each other. Within the kernel function, shared memory can be used to reduce the number of global memory accesses and improve performance. Pay attention to synchronization between threads, especially when using shared memory; use __syncthreads() to synchronize the threads of a block (see the sketch below).
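
As an illustration of the shared-memory and __syncthreads() points (a sketch assuming a block size of 256 and a hypothetical kernel name neighbour_sum), each thread below stages one element in shared memory and, only after the barrier, reads the element loaded by its neighbouring thread:

#define BLOCK 256  // assumed number of threads per block

__global__ void neighbour_sum(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        tile[threadIdx.x] = in[idx];   // each thread loads one element into shared memory

    __syncthreads();  // all loads must be visible before any thread reads its neighbour's slot

    if (idx < n) {
        float right = (threadIdx.x + 1 < blockDim.x && idx + 1 < n)
                          ? tile[threadIdx.x + 1]
                          : 0.0f;
        out[idx] = tile[threadIdx.x] + right;  // sum with the right-hand neighbour in the block
    }
}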

11. Reduction algorithm 

Reduction Algorithm is a common parallel computing algorithm that is used to transform a larger problem into a smaller sub-problem, and obtain the solution to the original problem by merging the results of the sub-problems.

In parallel computing, reduction algorithms are usually used to perform aggregation operations on elements in a data set, such as summing, maximizing, minimizing, etc. Reduction algorithms work by dividing a data set into multiple parts, assigning them to different processing units (such as threads, thread blocks, or processor cores) for parallel computation, and merging the results of each part to get the final result.

Here is an example of a common reduction algorithm, using the sum operation as an example:

1. Divide the input data set evenly into multiple small parts and assign them to different processing units.
2. Each processing unit computes the local sum of its assigned part.
3. The local sums are combined (for example, in pairs) into larger partial sums.
4. Repeat the previous step until all partial sums have been merged into a single result, the global sum.

This reduction algorithm can be implemented by repeatedly halving the size of the problem, thereby achieving high parallel performance. In each iteration the remaining data is divided into parts and a reduction operation is applied to each part in parallel, so the amount of work left is halved at every step.

Reduction algorithms are widely used in many applications, especially in parallel computing and parallel programming models (such as CUDA, OpenMP, etc.). It can improve computing efficiency and make full use of the parallel computing capabilities of multiple processing units, thereby accelerating the problem-solving process.

It should be noted that the performance of the reduction algorithm is closely related to factors such as the way the data is divided and merged, the load balancing of parallel computing, and communication overhead. When designing and implementing reduction algorithms, these factors need to be considered to ensure the scalability and efficiency of the algorithm.
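
To make the summation example concrete, here is a sketch of a classic in-block shared-memory sum reduction (the kernel name reduce_sum_naive and the block size of 256 are assumptions): each block writes one partial sum, and the partial sums are then reduced again or added up on the host. This naive version uses interleaved addressing and therefore suffers from the warp divergence discussed below:

#define BLOCK 256  // assumed threads per block (a power of two)

__global__ void reduce_sum_naive(const float *in, float *partial, int n)
{
    __shared__ float sdata[BLOCK];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element (0 if out of range)
    sdata[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Interleaved addressing: at step s only threads with tid % (2*s) == 0 are active
    for (int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum
    if (tid == 0)
        partial[blockIdx.x] = sdata[0];
}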

 Optimizing the reduction algorithm

·Avoid warp divergence: when the threads in a warp take different branches of a conditional statement, the threads that satisfy the branch condition execute that branch's instructions while the remaining threads sit idle and cannot simply be skipped. As a result, the execution efficiency of the whole warp can drop to roughly half of the branch-free case.

·Sequential (coalesced) addressing: having threads read consecutive addresses is more efficient in CUDA than other access patterns, so the addressing scheme can be changed so that the active threads access contiguous data.

·Loop unrolling: the bottleneck may lie in the addressing and in the loop control itself. Unrolling several reduction steps (for example, the final warp) reduces the control-flow overhead.

·Multi-step reduction: use multiple reduction stages that perform successive reduction operations. Each stage can use different methods and parameters to gradually shrink the data and thus exploit parallelism more effectively.
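
As an illustration of the first two bullets, the same reduction can be rewritten with sequential (reversed) addressing, so that in every step the active threads are contiguous: whole warps drop out together instead of diverging, and the active threads touch consecutive shared-memory addresses (again a sketch with an assumed power-of-two block size of 256):

__global__ void reduce_sum_sequential(const float *in, float *partial, int n)
{
    __shared__ float sdata[256];  // assumed block size, a power of two

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Sequential addressing: in each step only the first s threads are active
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = sdata[0];
}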

The reason why threads that do not meet the branch conditions are idle and cannot be skipped:

In GPU parallel computing, a thread warp is a group of continuous threads, usually containing 32 threads. These threads will execute the same instructions simultaneously but may have different data. When the threads in the thread warp execute different branches of the judgment statement, the threads that meet the branch conditions will execute the instructions of the corresponding branch, while the threads that do not meet the conditions have no need to execute.

Since the thread warp in the GPU architecture works in SIMD (Single Instruction Multiple Data) mode, that is, one instruction acts on all threads in the thread warp at the same time, therefore when executing a branch statement, threads that do not meet the conditions cannot be skipped. This is determined by the GPU hardware design. Each thread must execute according to the rhythm of the instruction pipeline, and cannot individually choose to execute or skip certain instructions.

When threads in a warp execute different branches, threads that do not meet the conditions will be idle. This means that they do not perform any substantive computing tasks, but just wait for other threads to complete the instruction execution of the corresponding branch. When this happens, the throughput of the warp decreases because some threads have no valid work to perform.

In order to avoid warp divergence and improve GPU utilization, some optimization strategies can be adopted. For example, write code so that branch statements are avoided where possible, or use data rearrangement, data prefetching, and similar techniques to increase the useful work of the threads in a warp and reduce idle time. Additionally, since the warp size is fixed by the hardware (32 threads on current NVIDIA GPUs), arranging the work so that all threads of the same warp take the same branch limits the impact of divergence.
