(C/C++) Efficiency Trick: Duff's Device

Scenario

Iterating over a sequence is a very common operation.

Consider a scenario where one string needs to be copied into another.

In C, the standard library function void* memcpy( void *dest, const void *src, size_t count ); does this for us.

#include <stdio.h>
#include <string.h>

int main() {
    char src[128] = "Hello World!";
    char dest[128] = "";

    memcpy(dest, src, sizeof(src));

    printf("dest = %s\n", dest);
    printf("src = %s\n", src);

    return 0;
}

Now suppose we have to implement such a function ourselves. We define the following interface.

#include <stdio.h>
#include <string.h>

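/**
 * @brief
 * Implements string copying in the style of
 * void* memcpy( void *dest, const void *src, size_t count );
 * @param dest address of the first byte of the destination string
 * @param src  address of the first byte of the source string
 * @param len  number of bytes to copy
 */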
void my_memcpy(char* dest, char* src, int len) ;

int main() {
    char src[128] = "Hello World!";
    char dest[128] = "";

    my_memcpy(dest, src, strlen(src));

    printf("dest = %s\n", dest);
    printf("src = %s\n", src);

    return 0;
}

Implementation

Brute-force copy

The most direct approach is to walk the sequence and assign one element at a time.

The body of this loop is trivial, but the loop condition still has to be evaluated on every iteration.

In a loop like this, the overhead of the condition check in the for statement can be comparable to, or greater than, the work done in the loop body.

void my_memcpy(char* dest, char* src, int len) {
    for (int i = 0; i < len; i += 1) {
        *dest++ = *src++;
    }
}

Here is a brief explanation for readers who are not familiar with C.

In *dest++ = *src++; the postfix ++ operator has higher precedence than the dereference operator *, so it binds to the pointer, not to the value pointed at.

Postfix ++ yields the pointer's old value, which is what gets dereferenced, and the increment happens as a side effect. The statement therefore copies the character at the current position and advances both pointers by one.

In the next iteration, dest and src both point at the next position, so the two pointers always stay in step.

It is roughly equivalent to the following:

for (int i = 0; i < len; i += 1) {
    dest[i] = src[i];
}
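To make the behaviour of postfix ++ concrete, here is a minimal sketch that writes each step out by hand and compares it with the compact idiom. The names copy_compact, copy_expanded and same_result are made up for this illustration and are not part of the original code.

```c
#include <string.h>

/* The compact idiom: copy one char, then advance both pointers. */
void copy_compact(char *dest, const char *src, int len) {
    for (int i = 0; i < len; i += 1) {
        *dest++ = *src++;
    }
}

/* The same loop with each postfix ++ written out by hand. */
void copy_expanded(char *dest, const char *src, int len) {
    for (int i = 0; i < len; i += 1) {
        const char *old_src = src;  /* the value that src++ evaluates to */
        char *old_dest = dest;      /* the value that dest++ evaluates to */
        src = src + 1;              /* side effect of src++ */
        dest = dest + 1;            /* side effect of dest++ */
        *old_dest = *old_src;       /* the assignment uses the old positions */
    }
}

/* Returns 1 if both versions produce identical buffers for text, else 0. */
int same_result(const char *text) {
    char a[128] = "";
    char b[128] = "";
    int len = (int)strlen(text);
    if (len >= (int)sizeof(a)) {
        return 0;  /* text is too long for the demo buffers */
    }
    copy_compact(a, text, len);
    copy_expanded(b, text, len);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

Calling same_result("Hello World!") should return 1, since both loops read and write the same positions in the same order.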

More operations per iteration, fewer condition checks

Treat BASE = 8 copy operations as one group.

Handle the remainder separately. The remainder len % BASE can only be an integer in the closed interval [0, 7].

Then perform the same operation 8 times for each full group.

This way the copy statements are simply repeated in the code, and the number of loop condition checks drops to roughly one eighth.

Concretely, the code can be written in either of the following forms.

Code 1

void my_memcpy(char* dest, char* src, int len) {
    const int BASE = 8;

    // Handle the remainder
    int remainder = len % BASE;
    while (remainder--) {
        *dest++ = *src++;
    }

    // Process the rest in groups of BASE
    int cnt = len / BASE;
    while (cnt--) {
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
    }
}

Code 2

void my_memcpy(char* dest, char* src, int len) {
    const int BASE = 8;

    // Handle the remainder: jump to the matching case and fall through
    switch (len % BASE) {
    case 7: *dest++ = *src++;
    case 6: *dest++ = *src++;
    case 5: *dest++ = *src++;
    case 4: *dest++ = *src++;
    case 3: *dest++ = *src++;
    case 2: *dest++ = *src++;
    case 1: *dest++ = *src++;
    }

    // Process the rest in groups of BASE
    int cnt = len / BASE;
    while (cnt--) {
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
        *dest++ = *src++;
    }
}
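The saving in loop condition checks can be counted directly. The sketch below is only a counting model, not a benchmark: checks_plain mirrors the plain for loop, and checks_unrolled mirrors the two while loops of Code 1 with the copy statements left out. Both function names are invented for this example, and len is assumed to be non-negative.

```c
/* Counts how often the condition i < len is evaluated in the plain loop:
 * once per copied byte, plus the final check that fails. */
int checks_plain(int len) {
    int checks = 0;
    int i = 0;
    while (checks += 1, i < len) {  /* the comma operator counts each check */
        i += 1;
    }
    return checks;
}

/* Counts the condition evaluations of the two while loops in Code 1. */
int checks_unrolled(int len) {
    const int BASE = 8;
    int checks = 0;

    int remainder = len % BASE;
    while (checks += 1, remainder--) {
        /* one byte would be copied here */
    }

    int cnt = len / BASE;
    while (checks += 1, cnt--) {
        /* eight bytes would be copied here */
    }
    return checks;
}
```

For len = 1000000 this model gives 1000001 checks for the plain loop and 125002 for the unrolled one. For len = 13 it gives 14 and 8, because the remainder loop still checks once per leftover byte.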

Duff’s Device

The code above achieves the basic goal. But it is visibly split into two separate blocks, which is not very elegant.

The key subject of this article is Duff’s Device.

First, we need a solid understanding of how switch works in C/C++.

In a switch statement, the most common things we see are case /**/ : and default :.

Both are essentially jump labels, just like the labels used with goto: the switch jumps directly to the matching label.

These labels do not form a scope of their own, so arbitrary statements (and blocks) may appear before and after them.

This lets us attach the case labels to the 8 copy statements inside a loop. For convenience, the do {} while (bool); form is used here.

The loop body always runs 8 statements per pass, so how is the remainder handled? The answer is simple: number the cases in reverse order.

Then, when len % BASE == 3, the program jumps directly to case 3:. Since there is no break or continue, the statements under case 2: and case 1: are executed as well. That is exactly 3 copy operations.
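The jump-and-fall-through behaviour can be observed in isolation. In this small sketch the copy statements are replaced by a counter, so the return value is the number of statements that ran after the jump. The function name statements_run is invented for this example.

```c
/* Returns how many case statements execute when the switch is entered
 * with the given value. There is no default label, so values outside
 * 1..7 execute nothing. */
int statements_run(int entry) {
    int executed = 0;
    switch (entry) {
    case 7: executed += 1;  /* falls through */
    case 6: executed += 1;  /* falls through */
    case 5: executed += 1;  /* falls through */
    case 4: executed += 1;  /* falls through */
    case 3: executed += 1;  /* falls through */
    case 2: executed += 1;  /* falls through */
    case 1: executed += 1;
    }
    return executed;
}
```

statements_run(3) returns 3: execution starts at case 3: and continues through case 2: and case 1:. Because the labels are numbered in reverse, the entry value equals the number of statements executed.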

One accompanying detail: the number of groups len / BASE must be rounded up, because the first pass, which handles the remainder, also counts as one pass of the loop. The loop condition must then be written as while (--cnt).

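void my_memcpy(char* dest, char* src, int len) {
    const int BASE = 8;

    // Nothing to copy. Without this check, len == 0 would enter at case 0,
    // copy 8 bytes, and then loop on a negative cnt
    if (len <= 0) {
        return;
    }

    // Round up the number of passes
    int cnt = (len + BASE - 1) / BASE;
    // The remainder is computed as usual
    switch (len % BASE) {
        do {
        case 0: *dest++ = *src++;
        case 7: *dest++ = *src++;
        case 6: *dest++ = *src++;
        case 5: *dest++ = *src++;
        case 4: *dest++ = *src++;
        case 3: *dest++ = *src++;
        case 2: *dest++ = *src++;
        case 1: *dest++ = *src++;
        }
        // Note the pre-decrement: cnt is decremented before it is tested
        while (--cnt);
    }
}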

Unfortunately, some languages, such as Java, do not allow this construct.
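Because the control flow is unusual, it is worth checking the result against memcpy() for every remainder. The following is a minimal sketch of such a check. duff_copy repeats the structure of my_memcpy with an explicit guard for len <= 0, and count_mismatches is a hypothetical helper written for this example.

```c
#include <string.h>

/* Duff's Device copy with an explicit guard for len <= 0. */
void duff_copy(char *dest, const char *src, int len) {
    if (len <= 0) {
        return;
    }
    int cnt = (len + 7) / 8;  /* number of passes, rounded up */
    switch (len % 8) {
        do {
        case 0: *dest++ = *src++;  /* falls through */
        case 7: *dest++ = *src++;  /* falls through */
        case 6: *dest++ = *src++;  /* falls through */
        case 5: *dest++ = *src++;  /* falls through */
        case 4: *dest++ = *src++;  /* falls through */
        case 3: *dest++ = *src++;  /* falls through */
        case 2: *dest++ = *src++;  /* falls through */
        case 1: *dest++ = *src++;
        } while (--cnt > 0);
    }
}

/* Returns the number of lengths in [0, max_len] for which duff_copy and
 * memcpy leave different buffer contents. 0 means they always agree. */
int count_mismatches(int max_len) {
    enum { CAP = 256 };
    char src[CAP], expected[CAP], actual[CAP];
    int bad = 0;

    if (max_len >= CAP) {
        max_len = CAP - 1;  /* stay inside the demo buffers */
    }
    for (int i = 0; i < CAP; i += 1) {
        src[i] = (char)('A' + i % 26);
    }
    for (int len = 0; len <= max_len; len += 1) {
        /* Fill both targets with '.', so stray writes past len show up. */
        memset(expected, '.', CAP);
        memset(actual, '.', CAP);
        memcpy(expected, src, (size_t)len);
        duff_copy(actual, src, len);
        if (memcmp(expected, actual, CAP) != 0) {
            bad += 1;
        }
    }
    return bad;
}
```

As a worked case, len = 11 gives cnt = 2 and len % 8 = 3: the first pass enters at case 3: and copies 3 bytes, the second pass copies 8, for 11 in total. count_mismatches(200) should return 0.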

Performance Testing

Next, we increase the amount of data and use a timer to run a simple performance test. Each printed run time is the difference between two clock() readings, in clock ticks.

The test harness is as follows.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define STR_LENGTH 1000000
#define TEST_COUNT 5
#define RUN_COUNT 1000

void my_memcpy(char* dest, char* src, int len);

int main() {
    int test_count = TEST_COUNT;
    // static storage: two 1 MB arrays may not fit on the default stack
    static char src[STR_LENGTH];
    static char dest[STR_LENGTH];

    while (test_count--) {
        int run_count = RUN_COUNT;

        clock_t start = clock();
        while (run_count--) {
            my_memcpy(dest, src, STR_LENGTH);
        }
        clock_t stop = clock();

        // clock_t is implementation-defined, so cast to match %ld
        printf("run time = %ld\n", (long)(stop - start));
    }

    return 0;
}
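The values printed by this harness are raw clock() ticks, and the length of a tick differs between platforms. A small helper like the one below converts an interval to milliseconds using the standard CLOCKS_PER_SEC macro. The name ticks_to_ms is an assumption made for this sketch and does not appear in the original harness.

```c
#include <time.h>

/* Converts an interval measured with clock() into milliseconds.
 * CLOCKS_PER_SEC is the number of clock ticks per second. */
double ticks_to_ms(clock_t start, clock_t stop) {
    return 1000.0 * (double)(stop - start) / (double)CLOCKS_PER_SEC;
}
```

In the harness it could be used as printf("run time = %.1f ms\n", ticks_to_ms(start, stop)); which makes results from different platforms easier to compare.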

More operations per iteration, fewer condition checks

The unrolled versions (Code 1 and Code 2) were not benchmarked.

Plain for loop

run time = 2177
run time = 2067
run time = 1991
run time = 2009
run time = 1996

Duff’s Device

Surprisingly, Duff's Device is no faster than the ordinary for loop here. It is even slightly slower.

This suggests that modern compilers already optimize basic loops very well.

run time = 2266
run time = 2139
run time = 2232
run time = 2239
run time = 2421

void* memcpy( void *, const void *, size_t);

Finally, try the library memcpy(). It is dramatically faster.

Functions such as memcpy() and memset() are heavily optimized by most modern compilers and standard libraries.

How exactly depends on the specific implementation. Some sources indicate that functions of this kind use specialized assembly instructions, which greatly increases their speed.

#include <string.h>
void my_memcpy(char* dest, char* src, int len) {
    memcpy(dest, src, len);
}

run time = 80
run time = 69
run time = 53
run time = 56
run time = 55

END

Reference: How does Duff's Device work?

Origin blog.csdn.net/CUBE_lotus/article/details/131237873