[CUDA Online Training Camp] -- Storage Units and Matrix Multiplication

1. GPU storage units

GPU storage units fall into two categories:

On-board memory (the memory chips around the GPU die): relatively slow to access. This includes local memory, global memory, constant memory, and texture memory.
On-chip memory (inside the GPU die): relatively fast to access, such as registers and shared memory.

In the memory-hierarchy figure below, a double-headed arrow means the memory can be both read and written, while a one-way arrow means it is read-only. These memories can be further subdivided:

R/W (readable and writable) memory:
registers and local memory: thread-private, accessible only by the owning thread.
shared memory: accessible by all threads in a block; used to share data and communicate.
global memory: readable and writable by every thread.
R (read-only) memory: constant memory and texture memory, readable by every thread.

Global memory, constant memory, and texture memory can also exchange data with the host. The video memory size listed in a graphics card's specification usually refers to global memory.
[Figure: GPU memory hierarchy, showing on-chip and on-board memories and their read/write paths]
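
To make these memory spaces concrete, here is a minimal kernel sketch; the kernel name and variables are illustrative assumptions, not from the original post:

__constant__ int c_scale;   // constant memory: read-only for threads,
                            // set from the host with cudaMemcpyToSymbol

__global__ void memory_spaces_demo(int *g_data)   // g_data points into global memory
{
    __shared__ int s_buf[256];   // shared memory: visible to all threads in the block
    int tid = threadIdx.x;       // tid lives in a register (thread-private)

    s_buf[tid] = g_data[tid];    // read global memory, write shared memory
    __syncthreads();             // make the shared-memory writes visible block-wide

    g_data[tid] = s_buf[tid] * c_scale;   // read constant memory, write global memory
}

(The sketch assumes a single block with blockDim.x <= 256.)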

2. Allocation and release of GPU storage units

1. Allocating a GPU storage unit
To allocate device memory for a square matrix M (m * m), use:

cudaMalloc((void **)&d_m, sizeof(int) * m * m);

The parameters mean:
1) d_m: pointer that receives the address of the allocated storage on the Device side
2) sizeof(int) * m * m: size in bytes of the storage to allocate on the Device side


2. Releasing GPU storage: cudaFree(d_m)
d_m: pointer to the data stored on the Device side.

3. Transferring data from CPU memory to a GPU storage unit:

cudaMemcpy(d_m, h_m, sizeof(int) * m * m, cudaMemcpyHostToDevice);

The parameters are:
d_m: destination of the transfer, the GPU storage unit
h_m: source address of the data, in CPU memory
sizeof(int) * m * m: size in bytes of the data to transfer
cudaMemcpyHostToDevice: direction of the transfer, CPU to GPU
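
Putting these calls together, a minimal host-side sketch of the allocate / copy / free lifecycle for M looks like this (error checking omitted; filling h_m is left out):

int *h_m = (int *)malloc(sizeof(int) * m * m);   // host buffer for M
int *d_m;
// ... fill h_m with the matrix data ...

cudaMalloc((void **)&d_m, sizeof(int) * m * m);  // allocate device storage
cudaMemcpy(d_m, h_m, sizeof(int) * m * m,
           cudaMemcpyHostToDevice);              // copy host -> device

// ... launch kernels that read and write d_m ...

cudaFree(d_m);                                   // release device storage
free(h_m);                                       // release host buffer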


3. Matrix multiplication

CPU implementation

A straightforward triple-loop version on the CPU: h_a is m x n, h_b is n x k, and the product h_result is m x k, all stored in row-major order.

void cpu_matrix_mult(int *h_a, int *h_b, int *h_result, int m, int n, int k)
{
    // h_a is m x n, h_b is n x k, h_result is m x k, all row-major.
    for (int i = 0; i < m; ++i)
    {
        for (int j = 0; j < k; ++j)
        {
            int tmp = 0;
            // Dot product of row i of h_a with column j of h_b.
            for (int h = 0; h < n; ++h)
            {
                tmp += h_a[i * n + h] * h_b[h * k + j];
            }
            h_result[i * k + j] = tmp;
        }
    }
}

GPU algorithm analysis

On the GPU, each thread computes one element of the result matrix: its global row and column indices are derived from the block index, block dimension, and thread index, and the thread accumulates the dot product of one row of a with one column of b.

Algorithm implementation:

__global__ void gpu_matrix_mult(int *a, int *b, int *c, int m, int n, int k)
{
    // Each thread computes one element of the m x k result matrix c.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    // Skip threads that fall outside the matrix (the grid is usually
    // rounded up to a whole number of blocks).
    if (col < k && row < m)
    {
        for (int i = 0; i < n; i++)
        {
            sum += a[row * n + i] * b[i * k + col];
        }
        c[row * k + col] = sum;
    }
}
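
To run the kernel, the host lays out a 2D grid of thread blocks covering the m x k result matrix. A minimal launch sketch (BLOCK_SIZE and the device pointers d_a, d_b, d_c are illustrative assumptions, not defined in the original post):

#define BLOCK_SIZE 16

dim3 block(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid((k + BLOCK_SIZE - 1) / BLOCK_SIZE,   // enough blocks to cover the k columns
          (m + BLOCK_SIZE - 1) / BLOCK_SIZE);  // enough blocks to cover the m rows
gpu_matrix_mult<<<grid, block>>>(d_a, d_b, d_c, m, n, k);
cudaDeviceSynchronize();   // wait for the kernel to finish before using d_c

Rounding the grid up to whole blocks is exactly why the kernel needs the (col < k && row < m) bounds check.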


Origin blog.csdn.net/weixin_47665864/article/details/128920678