Abstract: Write an Ascend C sqrt operator and verify it in CPU and NPU mode by invoking the kernel directly.
This article is shared from the HUAWEI CLOUD community post "[2023 CANN Training Camp, Season 1] Ascend C sqrt Operator in Practice", by dayao.
Foreword
This post writes an Ascend C sqrt operator and verifies it by invoking the kernel directly, in both CPU and NPU mode. In the training camp sandbox environment, CPU mode builds and runs fine and the results are correct.
1. Overview
First, a brief review of the TIK C++ vector operator programming process and implementation.
The vector operator development process is as follows:
![](https://pic3.zhimg.com/80/v2-2ba3335ffc6721abea493bf5061e9a22_720w.webp)
The main tasks are:
1. Operator analysis: determine the inputs and outputs, the mathematical expression and the underlying implementation interface, and the kernel function definition.
2. Operator class implementation: implement Init() and Process(). Init() performs the memory initialization, which essentially reflects the multi-core split: how each core's data is tiled and whether double-buffer optimization is enabled. Process() implements the three pipeline tasks CopyIn, Compute, and CopyOut.
3. Operator verification: invoke the kernel function directly through the kernel launcher, compute the result, and compare it with a numpy result computed from the same input; the error should fall within a given tolerance. In real applications, accuracy must be verified against the corresponding operator in the original framework.
2. Operator analysis
The operator is defined as follows (as before, assume 8 logical cores):
![](https://pic2.zhimg.com/80/v2-e2e57d9558864922a27ff9df656d22dd_720w.webp)
Consulting the TIK C++ API (TIK C++ API / Vector Computation / Unary / Sqrt; the Level-2 interface is used here) shows that Sqrt can complete the operation and produce the final result.
![](https://pic2.zhimg.com/80/v2-93f1e5d5645368a21c3df0c96a534b39_720w.webp)
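The operator's reference semantics are simply an elementwise z = sqrt(x) over half-precision data. A minimal numpy sketch of those semantics (the helper name `ref_sqrt` is illustrative, not part of the project):

```python
import numpy as np

def ref_sqrt(x):
    """Reference semantics of the sqrt operator: elementwise sqrt on float16 data."""
    x = np.asarray(x, dtype=np.float16)
    return np.sqrt(x).astype(np.float16)

# Elementwise square root over a few half-precision values
print(ref_sqrt([4.0, 9.0, 16.0]))
```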
3. Code analysis
The code is modified directly from the add_tik2 operator project provided with the training camp course. Code address: https://gitee.com/zgx950813/samples/tree/master/tik2_demo/kernel_samples/kernel_add_sample
The modified code directory structure is shown below. CMakeLists.txt and data_utils.h are unchanged; in the build-and-run script run.sh, only the part that compares the computed result against the golden data was changed.
![](https://pic3.zhimg.com/80/v2-8f24df9d0ba640fda4fd0cf404633092_720w.webp)
1) Kernel function definition
Compared with the add example, the only input parameter is x.
extern "C" __global__ __aicore__ void sqrt_tik2(__gm__ uint8_t* x, __gm__ uint8_t* z)
{
    KernelSqrt op;
    op.Init(x, z);
    op.Process();
}
2) Operator class
The implementation is similar to the add example. Init() initializes memory: the Global Memory of x and z, plus the queue memory used for pipeline-task communication. Process() implements the pipeline tasks, with CopyIn, Compute, and CopyOut written according to the paradigm. The biggest difference from the add example is that Compute() calls the Level-2 Sqrt API to perform the calculation.
// tiling constants (taken from the add sample this project is derived from,
// matching the 8-core, 8 * 2048 half-element shape used in main())
constexpr int32_t TOTAL_LENGTH = 8 * 2048;                            // total number of elements
constexpr int32_t USE_CORE_NUM = 8;                                   // number of logical cores
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;         // elements handled per core
constexpr int32_t TILE_NUM = 8;                                       // tiles per core
constexpr int32_t BUFFER_NUM = 2;                                     // double buffer
constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // elements per tile

class KernelSqrt {
public:
    __aicore__ inline KernelSqrt() {}
    __aicore__ inline void Init(__gm__ uint8_t* x, __gm__ uint8_t* z)
    {
        // get start index for current core, core parallel
        xGm.SetGlobalBuffer((__gm__ half*)x + block_idx * BLOCK_LENGTH, BLOCK_LENGTH);
        zGm.SetGlobalBuffer((__gm__ half*)z + block_idx * BLOCK_LENGTH, BLOCK_LENGTH);
        // pipe alloc memory to queue, the unit is Bytes
        pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
        pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        // loop count needs to be doubled, due to double buffer
        constexpr int32_t loopCount = TILE_NUM * BUFFER_NUM;
        // tiling strategy, pipeline parallel
        for (int32_t i = 0; i < loopCount; i++) {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }

private:
    __aicore__ inline void CopyIn(int32_t progress)
    {
        // alloc tensor from queue memory
        LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        // copy progress-th tile from global tensor to local tensor
        DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);
        // enque input tensor to VECIN queue
        inQueueX.EnQue(xLocal);
    }
    __aicore__ inline void Compute(int32_t progress)
    {
        // deque input tensor from VECIN queue
        LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
        // call the Sqrt instruction for the computation
        Sqrt(zLocal, xLocal, TILE_LENGTH);
        // enque the output tensor to VECOUT queue
        outQueueZ.EnQue<half>(zLocal);
        // free input tensor for reuse
        inQueueX.FreeTensor(xLocal);
    }
    __aicore__ inline void CopyOut(int32_t progress)
    {
        // deque output tensor from VECOUT queue
        LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
        // copy progress-th tile from local tensor to global tensor
        DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);
        // free output tensor for reuse
        outQueueZ.FreeTensor(zLocal);
    }

private:
    TPipe pipe;
    // create queue for input; in this case depth equals buffer num
    TQue<QuePosition::VECIN, BUFFER_NUM> inQueueX;
    // create queue for output; in this case depth equals buffer num
    TQue<QuePosition::VECOUT, BUFFER_NUM> outQueueZ;
    GlobalTensor<half> xGm, zGm;
};
3) Kernel function call
1. In CPU mode, call through ICPU_RUN_KF
ICPU_RUN_KF(sqrt_tik2, blockDim, x, z); // use this macro for cpu debug
2. In NPU mode, call via <<<...>>>
#ifndef __CCE_KT_TEST__
// call of kernel function
void sqrt_tik2_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* z)
{
    sqrt_tik2<<<blockDim, l2ctrl, stream>>>(x, z);
}
#endif
Because the <<<...>>> launch syntax is only available in NPU mode, it must be guarded by conditional compilation; it is invalid in CPU debug mode. When calling sqrt_tik2_do, the AscendCL application programming requirements must be followed.
3. Calling code
CPU and NPU modes are distinguished by the __CCE_KT_TEST__ macro.
int32_t main(int32_t argc, char* argv[])
{
    size_t inputByteSize = 8 * 2048 * sizeof(uint16_t);  // uint16_t represents half
    size_t outputByteSize = 8 * 2048 * sizeof(uint16_t); // uint16_t represents half
    uint32_t blockDim = 8;

#ifdef __CCE_KT_TEST__
    uint8_t* x = (uint8_t*)tik2::GmAlloc(inputByteSize);
    uint8_t* z = (uint8_t*)tik2::GmAlloc(outputByteSize);

    ReadFile("./input/input_x.bin", inputByteSize, x, inputByteSize);
    // PrintData(x, 16, printDataType::HALF);

    ICPU_RUN_KF(sqrt_tik2, blockDim, x, z); // use this macro for cpu debug

    // PrintData(z, 16, printDataType::HALF);
    WriteFile("./output/output_z.bin", z, outputByteSize);

    tik2::GmFree((void*)x);
    tik2::GmFree((void*)z);
#else
    aclInit(nullptr);
    aclrtContext context;
    int32_t deviceId = 0;
    aclrtCreateContext(&context, deviceId);
    aclrtStream stream = nullptr;
    aclrtCreateStream(&stream);

    uint8_t *xHost, *zHost;
    uint8_t *xDevice, *zDevice;
    aclrtMallocHost((void**)(&xHost), inputByteSize);
    aclrtMallocHost((void**)(&zHost), outputByteSize);
    aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMalloc((void**)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST);

    ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize);
    // PrintData(xHost, 16, printDataType::HALF);
    aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE);

    sqrt_tik2_do(blockDim, nullptr, stream, xDevice, zDevice); // call kernel in this function
    aclrtSynchronizeStream(stream);

    aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST);
    // PrintData(zHost, 16, printDataType::HALF);
    WriteFile("./output/output_z.bin", zHost, outputByteSize);

    aclrtFree(xDevice);
    aclrtFree(zDevice);
    aclrtFreeHost(xHost);
    aclrtFreeHost(zHost);
    aclrtDestroyStream(stream);
    aclrtResetDevice(deviceId);
    aclFinalize();
#endif
    return 0;
}
4) Benchmark data generation - sqrt_tik2.py
Generate the input input_x and the golden reference result using numpy.
import numpy as np

def gen_golden_data_simple():
    input_x = np.random.uniform(0, 100, [8, 2048]).astype(np.float16)
    golden = np.sqrt(input_x).astype(np.float16)
    input_x.tofile("./input/input_x.bin")
    golden.tofile("./output/golden.bin")

if __name__ == "__main__":
    gen_golden_data_simple()
5) Comparing calculation results
Use numpy's allclose() function to compare the operator output against the golden data. (Because of the NPU-mode compilation error, the modified comparison in run.sh was only actually exercised in CPU mode.) In CPU mode, the operator output matches the golden data exactly: the md5 checksums of the two files are identical.
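A minimal sketch of that comparison step (the helper name `compare_to_golden` and the tolerances are illustrative, not the actual run.sh code):

```python
import numpy as np

def compare_to_golden(output_path, golden_path, rtol=1e-3, atol=1e-3):
    """Load the operator output and the golden data (both float16 binaries)
    and compare them elementwise with numpy's allclose()."""
    output = np.fromfile(output_path, dtype=np.float16)
    golden = np.fromfile(golden_path, dtype=np.float16)
    return output.shape == golden.shape and np.allclose(output, golden, rtol=rtol, atol=atol)
```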
4. Compile and run
The course provides a sandbox runtime environment; first get the code uploaded into it.
![](https://pic4.zhimg.com/80/v2-aba692750c7704470f634e328af100c7_720w.webp)
1) Configure environment variables
![](https://pic2.zhimg.com/80/v2-153fb4acfba7fa83f06467fe63722029_720w.webp)
2) CPU mode
CPU mode compiles and runs smoothly, and the result is exactly the same as the golden reference.
![](https://pic3.zhimg.com/80/v2-6b217312da74435acf09d7bf0bb0779a_720w.webp)
3) NPU mode
Compiling in NPU mode reports an error; since sandbox time is limited, it is left for later study.
![](https://pic2.zhimg.com/80/v2-667492615c1cd5f00463f56f431bba59_720w.webp)