2012-10-05

How much do __builtin_expect(), likely(), and unlikely() improve performance?

I got curious about how much __builtin_expect() helps performance and when its use is justified, so I wrote a small test program to study its effects.

As the gcc documentation says, you can use this compiler built-in function to give the optimizer a clue about the likely result of an integer (or Boolean) expression. In the context of an if statement, this allows the optimizer to reorder the code for best performance, by ensuring that the code that is most likely to execute after the conditional immediately follows the conditional in the instruction stream that is fed to the CPU pipeline.

The __builtin_expect() function takes two arguments: a value to be tested, and the expected result. Both of these are integral values. The interface is a little clumsy for most uses, since the common case is that we want to test for "true" (non-zero) or "false" (zero). Thus, the Linux kernel defines two simpler interfaces: likely() and unlikely() (in include/linux/compiler.h):

    #define likely(x)      __builtin_expect(!!(x), 1)
    #define unlikely(x)    __builtin_expect(!!(x), 0)


In other words, likely(x) means "I expect x is true", and unlikely(x) means "I expect x is false".
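
For example (a hypothetical fragment, not taken from the kernel source), an error path that we expect to be rare can be annotated with unlikely(), so that the common case becomes the straight-line (fall-through) code:

    #include <stddef.h>

    #define unlikely(x)    __builtin_expect(!!(x), 0)

    /* Hypothetical example: we expect p to be non-NULL in the common
       case, so we mark the NULL test as unlikely; the optimizer can
       then lay out the common path as the fall-through code */

    static int
    get_value(int *p)
    {
        if (unlikely(p == NULL))
            return -1;          /* rare error path */

        return *p;              /* common path */
    }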

Here's my test program. The comments should be enough to help you understand some of the more obscure pieces. Below, I'll just skip to looking at the test results.


/* builtin_expect_test.c */

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

#define BSIZE 1000000

#ifndef BINCR
#define BINCR 1
#endif

#if defined(EXPECT_RESULT) && defined(DONT_EXPECT)
#error "Specifying both EXPECT_RESULT and DONT_EXPECT makes no sense"
#endif

/* We add some seemingly unneeded complexity to the code, simply to
   make the optimizer's task tricky enough that it won't optimize away
   the effect of __builtin_expect(). In this particular program, all
   of the following are needed:

        * Calling a *non-inline* function inside the loop in main().
        * Looping over an array in main() (rather than checking a
          single variable).
        * Dynamically allocating the array with calloc(), rather than
          declaring an array and initializing with memset().
        * Acting on two different variables (m1, m2) in each branch
          of the 'if' statement in main() (if the two branches after
          the 'if' execute the same code, gcc is clever enough to
          recognize this and optimize the 'if' away).
        * Printing the resulting values of the variables modified in
          the loop (otherwise gcc may optimize away the entire loop
          inside main()).

   Compile with at least -O2 (on x86) to see a difference in
   performance due to __builtin_expect().
*/

static __attribute__ ((noinline)) int
f(int a)
{
    return a;
}

int
main(int argc, char *argv[])
{
    int *p;
    int j, k, m1, m2, nloops;

    if (argc != 2 || strcmp(argv[1], "--help") == 0) {
        fprintf(stderr, "Usage: %s num-loops\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    m1 = m2 = 0;
    nloops = atoi(argv[1]);

    /* calloc() allocates an array and zeros its contents */

    p = calloc(BSIZE, sizeof(int));
    if (p == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

#if defined(BREAK_STEP) && BREAK_STEP > 0

    /* This provides us with a way to inject some values into the
       array that differ from our expected test value, in order
       to get an idea of how much the __builtin_expect()
       optimization is negatively affected by unexpected values. */

    for (k = 0, j = 0; j < BSIZE; j += BREAK_STEP) {
        p[j] += BINCR;
        k++;
    }

    printf("Adjusted %d items by %d\n", k, BINCR);

#endif

    for (j = 0; j < nloops; j++) {
        for (k = 0; k < BSIZE; k++) {
#ifdef DONT_EXPECT
            if (p[k]) {
#else
            if (__builtin_expect(p[k], EXPECT_RESULT)) {
#endif
                m1 = f(++m1);
            } else {
                m2 = f(++m2);
            }
        }
    }

    printf("%d, %d\n", m1, m2);

    exit(EXIT_SUCCESS);
}


(You can download the test code here.)

The program repeatedly scans a one-million-element integer array whose contents are (by default) all zero. Using the program, we can time checks on the array contents that are performed either with or without __builtin_expect().

For example, here we scan the array without __builtin_expect():
    $ cc -DDONT_EXPECT -O3 builtin_expect_test.c -o bn
    $ time -f "%E real, %U user, %S sys" ./bn 1000
    0, 1000000000
    0:02.68 real,  2.67 user, 0.00 sys
In this case, the program looped one thousand times through the array, to perform a total of one billion checks, and the real time for execution was 2.68 seconds. (The test machine is an Intel Core Duo 2.53GHz, and the gcc version is 4.6.3.)

Here's what happens if we employ __builtin_expect(), telling the compiler that the expected result of the test is 0:
    $ cc -DEXPECT_RESULT=0 -O3 builtin_expect_test.c -o b0
    $ time -f "%E real, %U user, %S sys" ./b0 1000
    0, 1000000000
    0:02.28 real,  2.28 user, 0.00 sys
The execution time fell to 2.28 seconds. In other words (for this particular CPU, compiler version, and program), __builtin_expect() improved the execution time of each check by 0.4 nanoseconds (0.4 seconds for one billion checks).

Well and good. What if we tell __builtin_expect() to expect the wrong value?
    $ cc -DEXPECT_RESULT=1 -O3 builtin_expect_test.c -o b1
    $ time -f "%E real, %U user, %S sys" ./b1 1000
    0, 1000000000
    0:04.19 real,  4.18 user, 0.00 sys
In this case, unsurprisingly, we made each check run slower, by about 1.5 nanoseconds (i.e., 4.19 - 2.68 = 1.51 seconds over one billion checks).

So, should you use __builtin_expect()?

You should only use __builtin_expect()—or the Linux kernel's likely() and unlikely()—if it's "very likely" that your code will follow the predicted branch. How much is "very likely"? If you're looking for actual numbers, the answer will depend on your compiler version, CPU, and code. But to illustrate that you should generally avoid these optimizations unless your code is very likely to follow one branch, here are some further tests using the above code.

In this test, the program first injects some nonzero values into the array before doing tests for zero using __builtin_expect(). Nonzero values are placed at every tenth element in the array:
    $ cc -DEXPECT_RESULT=0 -DBREAK_STEP=10 -O3 builtin_expect_test.c -o b0
    $ time -f "%E real, %U user, %S sys" ./b0 1000
    100000000, 900000000
    0:02.79 real,  2.76 user, 0.01 sys
Note what happened. Even though 90% of the array elements contained the expected zero value, execution speed was actually worse (2.79 seconds versus 2.68 seconds) than not using __builtin_expect() at all! In fact, even when only one in ten thousand values is nonzero, we're still only at roughly the break-even point:
    $ cc -DEXPECT_RESULT=0 -DBREAK_STEP=10000 -O3 builtin_expect_test.c -o b0
    $ time -f "%E real, %U user, %S sys" ./b0 1000
    100000, 999900000
    0:02.66 real,  2.64 user, 0.00 sys
The point at which these optimizations become worthwhile depends on the factors mentioned above. The conclusion, though, is that you should use them only when your predicted path is very likely; if it is not, you're better off avoiding them, since you'll actually slow your code down a little.

Compiler-assisted run-time profiling

The gcc documentation contains the following advice regarding the use of __builtin_expect():
    In general, you should prefer to use actual profile feedback for this
    (-fprofile-arcs), as programmers are notoriously bad at predicting how
    their programs actually perform. However, there are applications in
    which this data is hard to collect.
That's good concise advice. To put things another way, the only time you should use __builtin_expect() is when you can't use compiler-assisted runtime optimization (perhaps because your program has no easily repeatable pattern of execution—the Linux kernel is an obvious example) and you are certain that your predicted code path is very (very) likely to be the one that will be taken.

The example program above does have a very predictable, repeatable flow of execution. Let's see what happens when we use compiler-assisted optimization. Building the program now involves two steps: a profiling phase and an optimized compile. In the profiling phase, we build and run an instrumented version of the executable. We build as follows:
    $ cc -O3 -DDONT_EXPECT -fprofile-generate builtin_expect_test.c -o bn.prof
(The -fprofile-generate option implies -fprofile-arcs, as well as one or two other profiling options.)

We then run the executable, which generates profiling information that is stored in a file (with the extension .gcda):
    $ time -f "%E real, %U user, %S sys" ./bn.prof 1000
    0, 1000000000
    0:05.39 real,  5.37 user, 0.00 sys
Note that, because of the instrumentation code, the profiled version runs rather slower than the normally compiled code. Running this code created a file containing the profiling results:
    $ ls *.gcda
    builtin_expect_test.gcda
We then employ the -fprofile-use compiler option, which (implicitly) uses the profiling results to create an optimized executable:
    $ cc -O3 -DDONT_EXPECT -fprofile-use builtin_expect_test.c -o bn.opt
And then we run the optimized program:
    $ time -f "%E real, %U user, %S sys" ./bn.opt 1000
    0, 1000000000
    0:01.95 real,  1.94 user, 0.00 sys
This optimized version runs significantly faster (1.95 versus 2.28 seconds) than our version that used __builtin_expect(). This is because, in addition to the branching in the if statement, the branching in the for loops was also optimized.

It's left as an exercise for the reader to show that employing __builtin_expect() (to expect 0) in conjunction with compiler-assisted optimization doesn't improve things further: the profile feedback already optimizes the if branching at least as well as the programmer-directed hint does. One other interesting exercise is, of course, to compare the assembler (cc -S) code generated for each of the above cases.
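
For instance (a sketch, reusing the compilation commands from above), the assembler for the three variants can be generated and compared like this:
    $ cc -S -O3 -DDONT_EXPECT builtin_expect_test.c -o bn.s
    $ cc -S -O3 -DEXPECT_RESULT=0 builtin_expect_test.c -o b0.s
    $ cc -S -O3 -DEXPECT_RESULT=1 builtin_expect_test.c -o b1.s
    $ diff b0.s b1.s
One difference to look for is how the code for the two branches of the inner if statement is laid out.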

4 comments:

  1. > This optimized version runs significantly faster
    > (1.95 versus 2.28 seconds) than our version that
    > used __builtin_expect().

    I think this may not be comparing apples to apples:
    the -fprofile-use option not only optimizes
    branch prediction (as you expect) but also
    optimizes other things. See man gcc:

    === BEGIN QUOTE ===
    -fprofile-use
    -fprofile-use=path
    Enable profile feedback directed optimizations, and optimizations generally
    profitable only with profile feedback available.

    The following options are enabled: "-fbranch-probabilities", "-fvpt",
    "-funroll-loops", "-fpeel-loops", "-ftracer"
    === END QUOTE ===

  2. Thank you for your nice article. I tried your program and I am a bit surprised because, on my machine, b0 is not faster than bn when compiled with gcc 4.7. However, when compiled with gcc 4.6, b0 clearly outperforms bn.

    My CPU is an Intel Core i5-2410M CPU at 2.30 GHz and I am running Debian Wheezy. I am using the gcc compilers packaged by Debian.

    Here are the numbers:

    gcc 4.7.2
    bn 2.517 +/- 0.0025 seconds
    b0 2.520 +/- 0.0069 seconds
    b1 3.893 +/- 0.0037 seconds

    gcc 4.6.3
    bn 2.861 +/- 0.0062 seconds
    b0 2.179 +/- 0.0083 seconds
    b1 3.550 +/- 0.0017 seconds

    * with gcc 4.7.2, bn and b0 have the same performance.
    * with gcc 4.6.3, b0 is faster than bn (24% faster).
    * b0 is faster when compiled with gcc 4.6.3 (13% faster).

  3. I think your test code may not be showing __builtin_expect() to best effect. In a small inner loop where the branch almost always goes the same way, dynamic branch prediction is easily capable of avoiding branch misprediction latency. __builtin_expect() is primarily useful in code which is complex enough to overflow the branch history buffer.

  4. Good article. Just wanted to add a point here: yes, as illustrated, if the prediction is wrong, expect poor performance.
    But sometimes you want to optimize code for a certain scenario regardless of whether it is likely or unlikely. For example, in a datapath you may want to optimize the code for a successful lookup: even if the majority of the packets are dropped (because of some bad config or an ACL deny), we still want to optimize the good-packet path if the intention is to have good performance in the forwarded traffic rate. More than the code's likelihood, the decision to optimize depends on what you want.
