Faster divisibility test than % operator?

DaBler :

I noticed a curious thing on my computer.* The handwritten divisibility test is significantly faster than the % operator. Consider the minimal example:

* AMD Ryzen Threadripper 2990WX, GCC 9.2.0

static int divisible_ui_p(unsigned int m, unsigned int a)
{
    if (m <= a) {
        if (m == a) {
            return 1;
        }

        return 0;
    }

    m += a;

    m >>= __builtin_ctz(m);

    return divisible_ui_p(m, a);
}

The example is limited by odd a and m > 0. However, it can be easily generalized to all a and m. The code just converts the division to a series of additions.

Now consider the test program compiled with -std=c99 -march=native -O3:

    for (unsigned int a = 1; a < 100000; a += 2) {
        for (unsigned int m = 1; m < 100000; m += 1) {
#if 1
            volatile int r = divisible_ui_p(m, a);
#else
            volatile int r = (m % a == 0);
#endif
        }
    }

... and the results on my computer:

| implementation     | time [secs] |
|--------------------|-------------|
| divisible_ui_p     |    8.52user |
| builtin % operator |   17.61user |

Therefore more than 2 times faster.

The question: Can you tell me how the code behaves on your machine? Is it missed optimization opportunity in GCC? Can you do this test even faster?


UPDATE: As requested, here is a minimal reproducible example:

#include <assert.h>

static int divisible_ui_p(unsigned int m, unsigned int a)
{
    if (m <= a) {
        if (m == a) {
            return 1;
        }

        return 0;
    }

    m += a;

    m >>= __builtin_ctz(m);

    return divisible_ui_p(m, a);
}

int main()
{
    for (unsigned int a = 1; a < 100000; a += 2) {
        for (unsigned int m = 1; m < 100000; m += 1) {
            assert(divisible_ui_p(m, a) == (m % a == 0));
#if 1
            volatile int r = divisible_ui_p(m, a);
#else
            volatile int r = (m % a == 0);
#endif
        }
    }

    return 0;
}

compiled with gcc -std=c99 -march=native -O3 -DNDEBUG on AMD Ryzen Threadripper 2990WX with

gcc --version
gcc (Gentoo 9.2.0-r2 p3) 9.2.0

UPDATE2: As requested, the version that can handle any a and m (if you also want to avoid integer overflow, the test has to be implemented with integer type twice as long as the input integers):

int divisible_ui_p(unsigned int m, unsigned int a)
{
#if 1
    /* handles even a */
    int alpha = __builtin_ctz(a);

    if (alpha) {
        if (__builtin_ctz(m) < alpha) {
            return 0;
        }

        a >>= alpha;
    }
#endif

    while (m > a) {
        m += a;
        m >>= __builtin_ctz(m);
    }

    if (m == a) {
        return 1;
    }

#if 1
    /* ensures that 0 is divisible by anything */
    if (m == 0) {
        return 1;
    }
#endif

    return 0;
}
DaBler :

I will answer my question myself. It seems that I became a victim of branch prediction. The mutual size of the operands does not seem to matter, only their order.

Consider the following implementation

int divisible_ui_p(unsigned int m, unsigned int a)
{
    while (m > a) {
        m += a;
        m >>= __builtin_ctz(m);
    }

    if (m == a) {
        return 1;
    }

    return 0;
}

and the arrays

unsigned int A[100000/2];
unsigned int M[100000-1];

for (unsigned int a = 1; a < 100000; a += 2) {
    A[a/2] = a;
}
for (unsigned int m = 1; m < 100000; m += 1) {
    M[m-1] = m;
}

which are / are not shuffled using the shuffle function.

Without shuffling, the results are still

| implementation     | time [secs] |
|--------------------|-------------|
| divisible_ui_p     |    8.56user |
| builtin % operator |   17.59user |

However, once I shuffle these arrays, the results are different

| implementation     | time [secs] |
|--------------------|-------------|
| divisible_ui_p     |   31.34user |
| builtin % operator |   17.53user |

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=367719&siteId=1