Ascend C sqrt operator practice

Abstract: Write a sqrt operator in Ascend C and verify it in CPU and NPU modes by invoking the kernel directly.

This article is shared from the HUAWEI CLOUD community post "[2023 CANN Training Camp Season 1] Ascend C Sqrt Operator in Practice", author: dayao.

Foreword

This article writes a sqrt operator in Ascend C and verifies it in CPU and NPU modes by invoking the kernel directly. In the training camp sandbox environment, CPU mode works fine and the results are correct.

1. Overview

First, briefly review the process and implementation of TIK C++ vector operator programming.

The main tasks in the vector operator development process are:

1. Operator analysis: determine the inputs and outputs, the mathematical expression and the underlying implementation interface, and the definition of the kernel function.

2. Operator class implementation: implement Init() and Process(). Init() handles memory initialization, which is where the multi-core split, the tiling of each core's data, and whether double-buffer optimization is enabled are reflected; Process() implements the three pipeline stages CopyIn, Compute, and CopyOut.

3. Operator verification: invoke the kernel function directly through its kernel launcher, compute the result, and compare it with a numpy result produced from the same input; the error should stay within a given tolerance. In practical applications, the corresponding operator of the original framework should be used to compare computation accuracy.

2. Operator Analysis

The operator is defined as follows: it computes z = sqrt(x) element-wise on a half-precision (float16) input of shape [8, 2048]; as in the add routine, 8 logical cores are assumed.

Querying the TIK C++ API reference shows that the Sqrt interface (TIK C++ API / Vector Computing / Unary / Sqrt), used in its Level-2 form, can complete the operation and produce the final result.
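
For orientation, the Level-2 (whole-tensor) form of the interface takes a destination LocalTensor, a source LocalTensor, and an element count. Schematically (the parameter names here are illustrative; the concrete call appears later in Compute()):

// dstLocal[i] = sqrt(srcLocal[i]) for i in [0, calCount)
Sqrt(dstLocal, srcLocal, calCount);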

3. Code Analysis

The code is modified directly from the add_tik2 operator project provided by the training camp course. Code address: https://gitee.com/zgx950813/samples/tree/master/tik2_demo/kernel_samples/kernel_add_sample

The modified code directory structure is as follows: CMakeLists.txt and data_utils.h are unchanged, and in the build-and-run script run.sh only the part that compares the computed result against the golden result was changed.

1) Kernel function definition

Compared with the add routine, the only input parameter is x.

extern "C" __global__ __aicore__ void sqrt_tik2(__gm__ uint8_t* x, __gm__ uint8_t* z)
{
 KernelSqrt op;
 op.Init(x, z);
 op.Process();
}

2) Operator class

The implementation is similar to the add routine. Init() initializes memory: the Global Memory buffers for x and z, plus the queue memory used for pipeline-task communication. Process() implements the pipeline tasks, with CopyIn, Compute, and CopyOut written according to the standard paradigm. The biggest difference from the add routine is that Compute() calls the Level-2 Sqrt API to perform the calculation.
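
The class below also relies on a few tiling constants (BLOCK_LENGTH, TILE_NUM, TILE_LENGTH, BUFFER_NUM) defined at file scope and not shown in the excerpt. As a sketch only, a set of values consistent with the 8 * 2048 half input and with the add routine this project is based on would be:

constexpr int32_t TOTAL_LENGTH = 8 * 2048;                            // total number of half elements
constexpr int32_t USE_CORE_NUM = 8;                                   // number of logical cores used
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;         // elements handled by each core
constexpr int32_t TILE_NUM = 8;                                       // tiles per core
constexpr int32_t BUFFER_NUM = 2;                                     // 2 buffers per queue for double buffering
constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // elements per tile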

class KernelSqrt {
public:
    __aicore__ inline KernelSqrt() {}
    __aicore__ inline void Init(__gm__ uint8_t* x, __gm__ uint8_t* z)
    {
        // get start index for current core, core parallel
        xGm.SetGlobalBuffer((__gm__ half*)x + block_idx * BLOCK_LENGTH, BLOCK_LENGTH);
        zGm.SetGlobalBuffer((__gm__ half*)z + block_idx * BLOCK_LENGTH, BLOCK_LENGTH);
        // pipe alloc memory to queue, the unit is Bytes
        pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
        pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        // loop count need to be doubled, due to double buffer
        constexpr int32_t loopCount = TILE_NUM * BUFFER_NUM;
        // tiling strategy, pipeline parallel
        for (int32_t i = 0; i < loopCount; i++) {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }

private:
    __aicore__ inline void CopyIn(int32_t progress)
    {
        // alloc tensor from queue memory
        LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        // copy progress_th tile from global tensor to local tensor
        DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);
        // enque input tensor to VECIN queue
        inQueueX.EnQue(xLocal);
    }
    __aicore__ inline void Compute(int32_t progress)
    {
        // deque input tensor from VECIN queue
        LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
        // call Sqrt instr for computation
        Sqrt(zLocal, xLocal, TILE_LENGTH);
        // enque the output tensor to VECOUT queue
        outQueueZ.EnQue<half>(zLocal);
        // free input tensor for reuse
        inQueueX.FreeTensor(xLocal);
    }
    __aicore__ inline void CopyOut(int32_t progress)
    {
        // deque output tensor from VECOUT queue
        LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
        // copy progress_th tile from local tensor to global tensor
        DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);
        // free output tensor for reuse
        outQueueZ.FreeTensor(zLocal);
    }

private:
    TPipe pipe;
    // create queue for input, in this case depth is equal to buffer num
    TQue<QuePosition::VECIN, BUFFER_NUM> inQueueX;
    // create queue for output, in this case depth is equal to buffer num
    TQue<QuePosition::VECOUT, BUFFER_NUM> outQueueZ;
    GlobalTensor<half> xGm, zGm;
};

3) Kernel function call

1. In CPU mode, the kernel is invoked through the ICPU_RUN_KF macro:

ICPU_RUN_KF(sqrt_tik2, blockDim, x, z); // use this macro for cpu debug

2. In NPU mode, the kernel is launched with the <<<...>>> syntax:

#ifndef __CCE_KT_TEST__
// call of kernel function
void sqrt_tik2_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* z)
{
    sqrt_tik2<<<blockDim, l2ctrl, stream>>>(x, z);
}
#endif

Because the <<<...>>> launch syntax is only available in NPU mode and is not valid in CPU debug mode, the call must be guarded by conditional compilation. When calling sqrt_tik2_do, the host side must follow the requirements of AscendCL application programming (device, context, stream, and memory management, as in the calling code below).

3. Calling code

The CPU and NPU modes are distinguished by the "__CCE_KT_TEST__" macro.

int32_t main(int32_t argc, char* argv[])
{
    size_t inputByteSize = 8 * 2048 * sizeof(uint16_t);   // uint16_t represent half
    size_t outputByteSize = 8 * 2048 * sizeof(uint16_t);  // uint16_t represent half
    uint32_t blockDim = 8;
#ifdef __CCE_KT_TEST__
    uint8_t* x = (uint8_t*)tik2::GmAlloc(inputByteSize);
    uint8_t* z = (uint8_t*)tik2::GmAlloc(outputByteSize);
    ReadFile("./input/input_x.bin", inputByteSize, x, inputByteSize);
    // PrintData(x, 16, printDataType::HALF);
    ICPU_RUN_KF(sqrt_tik2, blockDim, x, z); // use this macro for cpu debug
    // PrintData(z, 16, printDataType::HALF);
    WriteFile("./output/output_z.bin", z, outputByteSize);
    tik2::GmFree((void *)x);
    tik2::GmFree((void *)z);
#else
    aclInit(nullptr);
    aclrtContext context;
    aclError error;
    int32_t deviceId = 0;
    aclrtCreateContext(&context, deviceId);
    aclrtStream stream = nullptr;
    aclrtCreateStream(&stream);
    uint8_t *xHost, *zHost;
    uint8_t *xDevice, *zDevice;
    aclrtMallocHost((void**)(&xHost), inputByteSize);
    aclrtMallocHost((void**)(&zHost), outputByteSize);
    aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMalloc((void**)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST);
    ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize);
    // PrintData(xHost, 16, printDataType::HALF);
    aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE);
    sqrt_tik2_do(blockDim, nullptr, stream, xDevice, zDevice); // call kernel in this function
    aclrtSynchronizeStream(stream);
    aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST);
    // PrintData(zHost, 16, printDataType::HALF);
    WriteFile("./output/output_z.bin", zHost, outputByteSize);
    aclrtFree(xDevice);
    aclrtFree(zDevice);
    aclrtFreeHost(xHost);
    aclrtFreeHost(zHost);
    aclrtDestroyStream(stream);
    aclrtResetDevice(deviceId);
    aclFinalize();
#endif
    return 0;
}

4) Benchmark data generation - sqrt_tik2.py

Generate the input input_x and the golden benchmark result using numpy:

import numpy as np


def gen_golden_data_simple():
    input_x = np.random.uniform(0, 100, [8, 2048]).astype(np.float16)
    golden = np.sqrt(input_x).astype(np.float16)
    input_x.tofile("./input/input_x.bin")
    golden.tofile("./output/golden.bin")


if __name__ == "__main__":
    gen_golden_data_simple()

5) Comparing calculation results

numpy's allclose() function is used to compare the operator's output with the benchmark data. In practice, because compilation failed in NPU mode, the modified comparison was never actually exercised there. In CPU mode, the operator's output is completely consistent with the golden benchmark data, and the md5 checksums of the two files are identical.

4. Compile and Run

The course provides a sandbox environment for running the code; the code first needs to be transferred into it.

1) Configure environment variables

2) CPU mode

CPU mode compiles and runs smoothly, and the result is exactly the same as the golden reference data.

3) NPU mode

Compilation in NPU mode reports an error; since sandbox time is limited, this will be investigated later.

 

