2012-11-18

Japanese translation of TLPI will be published in December

The Japanese translation of TLPI (published by O'Reilly Japan) will come out next month. Japanese readers can read (the Japanese translation of) a short preface that I wrote for the translation.

Although I will never be able to read it, I have a couple of reasons to believe that the translation will be excellent. First, the translation was done by a kernel developer, Junjiro Okajima (developer of the aufs file system). Second, along the way Junjiro has been extremely attentive in reading TLPI and checking details. As a consequence, he contributed a large number of errata reports (making up more than 20% of all current errata reports for TLPI), ranging from typos to corrections and improvements to rather deep technical details presented in the English text.


2012-11-06

Nominated for NZ Open Source Award

To my pleasant surprise, I'm informed that I'm not merely nominated, but am also a finalist for a New Zealand Open Source Award, in the Open Source Contributor category, for "contributions to Linux documentation" (sounds about right!). But, being in Barcelona for LinuxCon Europe / ELCE, I won't be able to be present for the awards event. Old friend Dan Randow has kindly agreed to proxy for me at the award event tomorrow.

(Update, 2012-11-15: Congratulations to Grant McLean, who won the award.)

2012-10-05

How much do __builtin_expect(), likely(), and unlikely() improve performance?

I got curious about how much __builtin_expect() helps performance and when its use is justified, so I wrote a small test program to study its effects.

As the gcc documentation says, you can use this compiler built-in function to give the optimizer a clue about the likely result of an integer (or Boolean) expression. In the context of an if statement, this enables the optimizer to reorder the code in a way that gives best performance, by ensuring that the code that is most likely to execute after the conditional immediately follows the conditional when the instruction stream is fed to the CPU pipeline.

The __builtin_expect() function takes two arguments: a value to be tested, and the expected result. Both of these are integral values. The interface is a little clumsy for most uses, since the common case is that we want to test for "true" (non-zero) or "false" (zero). Thus, the Linux kernel defines two simpler interfaces: likely() and unlikely() (in include/linux/compiler.h):

    #define likely(x)      __builtin_expect(!!(x), 1)
    #define unlikely(x)    __builtin_expect(!!(x), 0)


In other words, likely(x) means "I expect x is true", and unlikely(x) means "I expect x is false".
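To make the intent concrete, here's a small user-space sketch (my own illustrative example, not code from the kernel) showing why the macros use the !!(x) normalization, and a typical use of unlikely() on a rarely taken error path:

```c
#include <stddef.h>

/* User-space copies of the kernel macros. The !!(x) double negation
   normalizes any nonzero value to exactly 1, so that the value handed
   to __builtin_expect() is always 0 or 1, matching the expectation. */
#define likely(x)      __builtin_expect(!!(x), 1)
#define unlikely(x)    __builtin_expect(!!(x), 0)

/* Hypothetical example: a lookup in which a NULL table is expected
   to be rare, so the error path is marked unlikely(). */
int
get_value(const int *table, int idx, int *val)
{
    if (unlikely(table == NULL))
        return -1;              /* rarely taken error path */

    *val = table[idx];
    return 0;
}
```

The hints change only code layout, not semantics: get_value() behaves identically with or without them.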

Here's my test program. The comments should be enough to help you understand some of the more obscure pieces. Below, I'll just skip to looking at the test results.


/* builtin_expect_test.c */

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

#define BSIZE 1000000

#ifndef BINCR
#define BINCR 1
#endif

#if defined(EXPECT_RESULT) && defined(DONT_EXPECT)
#error "Specifying both EXPECT_RESULT and DONT_EXPECT makes no sense"
#endif

/* We add some seemingly unneeded complexity to the code, simply to
   make the optimizer's task tricky enough that it won't optimize away
   the effect of __builtin_expect(). In this particular program, all
   of the following are needed:

        * Calling a *non-inline* function inside the loop in main().
        * Looping over an array in main() (rather than checking a
          single variable).
        * Dynamically allocating the array with calloc(), rather than
          declaring an array and initializing with memset().
        * Acting on two different variables (m1, m2) in each branch
          of the 'if' statement in main() (if the two branches after
          the 'if' execute the same code, gcc is clever enough to
          recognize this and optimize the 'if' away).
        * Printing the resulting values of the variables modified in
          the loop (otherwise gcc may optimize away the entire loop
          inside main()).

   Compile with at least -O2 (on x86) to see a difference in
   performance due to __builtin_expect().
*/

static __attribute__ ((noinline))  int
f(int a)
{
    return a;
}

int
main(int argc, char *argv[])
{
    int *p;
    int j, k, m1, m2, nloops;

    if (argc != 2 || strcmp(argv[1], "--help") == 0) {
        fprintf(stderr, "Usage: %s num-loops\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    m1 = m2 = 0;
    nloops = atoi(argv[1]);

    /* calloc() allocates an array and zeros its contents */

    p = calloc(BSIZE, sizeof(int));
    if (p == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

#if defined(BREAK_STEP) && BREAK_STEP > 0

    /* This provides us with a way to inject some values into the
       array that differ from our expected test value, in order
       to get an idea of how much the __builtin_expect()
       optimization is negatively affected by unexpected values. */

    for (k = 0, j = 0; j < BSIZE; j += BREAK_STEP) {
        p[j] += BINCR;
        k++;
    }

    printf("Adjusted %d items by %d\n", k, BINCR);

#endif

    for (j = 0; j < nloops; j++) {
        for (k = 0; k < BSIZE; k++) {
#ifdef DONT_EXPECT
            if (p[k]) {
#else
            if (__builtin_expect(p[k], EXPECT_RESULT)) {
#endif
                m1 = f(++m1);
            } else {
                m2 = f(++m2);
            }
        }
    }

    printf("%d, %d\n", m1, m2);

    exit(EXIT_SUCCESS);
}


(You can download the test code here.)

The program essentially repeatedly scans a one-million-element integer array whose contents are zero (in the default case). Using the program, we can time the results of the checks that are performed either with or without using __builtin_expect().

For example, here we scan the array without __builtin_expect():
    $ cc -DDONT_EXPECT -O3 builtin_expect_test.c -o bn
    $ time -f "%E real, %U user, %S sys" ./bn 1000
    0, 1000000000
    0:02.68 real,  2.67 user, 0.00 sys
In this case, the program looped one thousand times through the array, to perform a total of one billion checks, and the real time for execution was 2.68 seconds. (The test machine is an Intel Core Duo 2.53GHz, and the gcc version is 4.6.3.)

Here's what happens if we employ __builtin_expect(), telling the compiler that the expected result of the test is 0:
    $ cc -DEXPECT_RESULT=0 -O3 builtin_expect_test.c -o b0
    $ time -f "%E real, %U user, %S sys" ./b0 1000
    0, 1000000000
    0:02.28 real,  2.28 user, 0.00 sys
The execution time fell to 2.28 seconds. In other words (for this particular CPU, compiler version, and program), __builtin_expect() improved the execution time of each check by 0.4 nanoseconds (0.4 seconds for one billion checks).

Well and good. What if we tell __builtin_expect() to expect the wrong value?
    $ cc -DEXPECT_RESULT=1 -O3 builtin_expect_test.c -o b1
    $ time -f "%E real, %U user, %S sys" ./b1 1000
    0, 1000000000
    0:04.19 real,  4.18 user, 0.00 sys
In this case, unsurprisingly, we made each check run slower, by about 1.5 (i.e., 4.19 - 2.68) nanoseconds.

So, should you use __builtin_expect()?

You should use __builtin_expect()—or the Linux kernel's likely() and unlikely()—only if it's "very likely" that your code will follow the predicted branch. How much is "very likely"? If you're looking for actual numbers, the answer will depend on your compiler version, CPU, and code. But to illustrate why you should generally avoid these optimizations unless your code is very likely to follow one branch, here are some further tests using the above code.

In this test, the program first injects some nonzero values into the array before doing tests for zero using __builtin_expect(). Nonzero values are placed at every tenth element in the array:
    $ cc -DEXPECT_RESULT=0 -DBREAK_STEP=10 -O3 builtin_expect_test.c -o b0
    $ time -f "%E real, %U user, %S sys" ./b0 1000
    100000000, 900000000
    0:02.79 real,  2.76 user, 0.01 sys
Note what happened. Even though most array elements contained the expected zero value, execution speed was actually worse (2.79 seconds versus 2.68 seconds) than not using __builtin_expect() at all! In fact, even when only one in ten thousand values is nonzero, we're still only at roughly the break-even point:
    $ cc -DEXPECT_RESULT=0 -DBREAK_STEP=10000 -O3 builtin_expect_test.c -o b0
    $ time -f "%E real, %U user, %S sys" ./b0 1000
    100000, 999900000
    0:02.66 real,  2.64 user, 0.00 sys
The point at which these optimizations become worthwhile will depend on the factors mentioned above, but the general rule is clear: use them only when your predicted path is very likely. If the predicted path is not very likely, you're better off avoiding them, since you'll actually slow your code down a little.
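As a rough sanity check on where break-even might lie, here's a naive linear cost model built from the timings measured above (the 0.4 ns saving and 1.5 ns penalty come from the earlier runs; the linear model itself is my own simplification, not something from the gcc documentation):

```c
/* Naive model: a correct static hint saves about 0.4 ns per check,
   and an incorrect one costs about 1.5 ns per check, both relative
   to compiling with no hint at all. If a fraction w of checks defy
   the hint, the expected per-check gain is 0.4*(1-w) - 1.5*w, which
   is zero at w = 0.4/1.9, i.e. w of roughly 0.21. */
double
expected_gain_ns(double wrong_fraction)
{
    const double save_ns = 0.4;     /* saving when the hint is right */
    const double cost_ns = 1.5;     /* extra cost when the hint is wrong */

    return save_ns * (1.0 - wrong_fraction) - cost_ns * wrong_fraction;
}
```

The linear model predicts break-even near w ≈ 0.21, vastly more tolerant than the measured break-even of roughly 1 in 10,000. Presumably the scattered unexpected values also degrade the CPU's dynamic branch prediction, so the static-hint arithmetic alone badly understates the damage; treat the model as an optimistic upper bound only.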

Compiler-assisted run-time profiling

The gcc documentation contains the following advice regarding the use of __builtin_expect():
    In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
That's good concise advice. To put things another way, the only time you should use __builtin_expect() is when you can't use compiler-assisted runtime optimization (perhaps because your program has no easily repeatable pattern of execution—the Linux kernel is an obvious example) and you are certain that your predicted code path is very (very) likely to be the one that will be taken.

The example program above does have a very predictable, repeatable flow of execution. Let's see what happens when we use compiler-assisted optimization. Building the program now involves two steps: a profiling phase and an optimized compile. In the profiling phase, we build and run an instrumented version of the executable. We build as follows:
    $ cc -O3 -DDONT_EXPECT -fprofile-generate builtin_expect_test.c -o bn.prof
(The -fprofile-generate option implies -fprofile-arcs, as well as one or two other profiling options.)

We then run the executable, which generates profiling information that is stored in a file (with the extension .gcda).
    $ time -f "%E real, %U user, %S sys" ./bn.prof 1000
    0, 1000000000
    0:05.39 real,  5.37 user, 0.00 sys
Note that, because of the instrumentation code, the profiled version runs rather slower than the normally compiled code. Running this code created a file containing the profiling results:
    $ ls *.gcda
    builtin_expect_test.gcda
We then employ the -fprofile-use compiler option, which (implicitly) uses the profiling results to create an optimized executable.
    $ cc -O3 -DDONT_EXPECT -fprofile-use builtin_expect_test.c -o bn.opt
And then we run the optimized program:
    $ time -f "%E real, %U user, %S sys" ./bn.opt 1000
    0, 1000000000
    0:01.95 real,  1.94 user, 0.00 sys
This optimized version runs significantly faster (1.95 versus 2.28 seconds) than our version that used __builtin_expect(). This is because, in addition to the branching in the if statement, the branching in the for loops was also optimized.

It's left as an exercise for the reader to show that employing __builtin_expect() (to expect 0) in conjunction with compiler-assisted optimization doesn't improve things further: the profile-based optimization of the if branching already achieves what the programmer-directed optimization does. Another interesting exercise is, of course, to compare the assembler code (cc -S) generated in each of the above cases.
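For the assembler comparison, a much smaller compilation unit than the benchmark is easier to read. Something like the following (a hypothetical minimal example of mine, not part of the benchmark) can be compiled with cc -O2 -S, once with -DEXPECTED=0 and once with -DEXPECTED=1, to see how gcc reorders the branches:

```c
/* A tiny function for inspecting the effect of __builtin_expect()
   in the generated assembler (cc -O2 -S -DEXPECTED=... expect_asm.c).
   The hint affects only which path the compiler lays out in the
   fall-through position; the function's result is unchanged. */

#ifndef EXPECTED
#define EXPECTED 0
#endif

int
classify(int v)
{
    if (__builtin_expect(v != 0, EXPECTED))
        return 1;       /* path favored when EXPECTED is 1 */
    else
        return 2;       /* path favored when EXPECTED is 0 */
}
```

On x86 at -O2, gcc typically places the expected branch immediately after the conditional jump, so diffing the two .s files makes the reordering visible.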

2012-07-14

The Korean translation of TLPI is out

The Korean translation of The Linux Programming Interface came out on 10 July. You can find further information about the translation in my earlier blog post and on the translations page of the TLPI web site.

The publisher kindly sent me a few photographs of the finished work:



(The table on the left corresponds to page 194 in the English original, and the figure on the right corresponds to page 902.)



(You have to love the bookends...)

2012-06-25

Some licensing changes to the source code

When originally published, the code from TLPI was licensed under the GNU Affero General Public License, version 3.

One reader commented that they are using one of my library functions as part of their everyday "toolbox" of handy pieces of code. As things stood, that meant that if they redistributed any code that linked against that function, or ran that code as part of a larger network application, then the complete source code of their application would also need to be licensed under the Affero GPL (i.e., all source code linked against my library functions would also need to be distributed).

That scenario wasn't really my intention (I'd overlooked the case of library functions in my code). So, I've now relicensed the library functions in TLPI under the GNU Lesser General Public License, version 3. The relicensing applies to all files in the lib/ directory of the source code distribution, and the changed licenses are already present in the latest source code tarball and online versions of the programs that went up in an update of the web site a couple of weeks ago.

Aside from the licensing changes, it's worth mentioning that since publication of TLPI, I've made a number of small fixes to various example programs. The fixes are listed in the CHANGES file that is distributed as part of the source code tarball.

2012-06-22

Korean translation of TLPI available soon

The Korean publisher Acorn is completing the final pieces in the production of the Korean translation of TLPI, which should be published next month.

The translation will be published in two volumes that together run to nearly 2000 pages. As well as splitting the book into two volumes, the chapters have been reorganized somewhat: Chapters 29 to 33 (POSIX threads) of the English original move to a later position in the translation, so that they form the start of the second volume. (In a two-volume version, this reordering seemed a little better both to the publisher and to me.)
The two volumes will be sold both separately (priced at 50,000 and 35,000 South Korean won, respectively) and as a two-volume package (79,000 won).

Considering that translation work began just 18 months ago, the speed of publication is impressive. One thing that helped was that the translation was the work of a team of translators.

It is of course an honor and a pleasure to see my book translated, for which I thank both the publisher and translators. Among the translators, a special thanks to Kiju Kim (nontechnical private blog), who kept me up to date on progress and also submitted a number of error reports on the English original while translating. And though I'll never be able to read them, I look forward to holding copies of the Korean translation in a few weeks.

2012-04-09

TLPI third print run now available

I'm happy to announce that the third print run of TLPI is now printed, and should be available for sale shortly. The third print run incorporates the 131 errata shown on the errata page. (If that seems like a lot of errata, take a look at this FAQ.)

2012-03-13

Kernel capability usage statistics

(Update, 2012-07-16: a related article based on this blog post, "CAP_SYS_ADMIN: the new root", was published on LWN.net on 2012-03-14.)

The idea of capabilities is to break the power of root (user ID 0) into independently assigned pieces governing specific privileged operations. Implicit in that model is that the set of privileged operations governed by each capability should be small (otherwise, why break the power of root into pieces at all?). However, that implication hasn't turned out to be true in practice.
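In the kernel source, each privileged operation is typically guarded by a call to capable() naming a specific capability; those calls are what the statistics below count. Here's a user-space sketch of the shape of such a check (the bit-mask representation and the capable() stub are my own stand-ins, not the kernel's implementation; only the form of the guard matches kernel code):

```c
#include <errno.h>

#define CAP_SYS_ADMIN 21        /* constant from <linux/capability.h> */

/* Stand-in for the caller's effective capability set, represented
   here as a simple bit mask (not how the kernel stores it). */
unsigned long long effective_caps;

/* Stand-in for the kernel's capable(): does the caller have the
   given capability? */
int
capable(int cap)
{
    return (effective_caps & (1ULL << cap)) != 0;
}

/* A privileged operation guarded the way kernel code typically
   guards one: refuse with EPERM unless the capability is held. */
int
do_admin_op(void)
{
    if (!capable(CAP_SYS_ADMIN))
        return -EPERM;

    /* ... perform the privileged work ... */
    return 0;
}
```

Counting occurrences of patterns like capable(CAP_XXX) across the source tree is essentially how the tables below were produced.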

Table 1 shows some statistics on the use of the 36 currently existing capabilities in the C files of the Linux 3.2 kernel source code. The "#uses" column is the number of uses of the capability across all source files; the "#files" column is the number of distinct source files where the capability is used.

Table 1: Breakdown of Linux capability uses in Linux 3.2
    Capability              #uses   #files
    CAP_AUDIT_CONTROL           2        2
    CAP_AUDIT_WRITE             1        1
    CAP_CHOWN                   4        2
    CAP_DAC_OVERRIDE            2        1
    CAP_DAC_READ_SEARCH         4        2
    CAP_FOWNER                  9        8
    CAP_FSETID                  8        6
    CAP_IPC_LOCK               13        8
    CAP_IPC_OWNER               1        1
    CAP_KILL                    2        2
    CAP_LEASE                   1        1
    CAP_LINUX_IMMUTABLE        13       13
    CAP_MAC_ADMIN              25        5
    CAP_MAC_OVERRIDE            6        2
    CAP_MKNOD                   3        3
    CAP_NET_ADMIN             395      182
    CAP_NET_BIND_SERVICE       13       10
    CAP_NET_BROADCAST           0        0
    CAP_NET_RAW                18       11
    CAP_SETFCAP                 3        2
    CAP_SETGID                 10        6
    CAP_SETPCAP                 2        2
    CAP_SETUID                  8        4
    CAP_SYS_ADMIN             451      229
    CAP_SYS_BOOT                2        2
    CAP_SYS_CHROOT              1        1
    CAP_SYSLOG                  2        2
    CAP_SYS_MODULE              4        3
    CAP_SYS_NICE               14        8
    CAP_SYS_PACCT               1        1
    CAP_SYS_PTRACE             11        6
    CAP_SYS_RAWIO              67       42
    CAP_SYS_RESOURCE           36       24
    CAP_SYS_TIME               22       13
    CAP_SYS_TTY_CONFIG         11        4
    CAP_WAKE_ALARM              2        1
    Total                    1167      610

What is of course notable is the extremely heavy use of two capabilities: CAP_SYS_ADMIN and CAP_NET_ADMIN. Together, these two capabilities account for more than 70% of all uses of capabilities! The uses of CAP_NET_ADMIN are limited to the drivers/ (mainly drivers/net/) and net/ directories. On the other hand, the uses of CAP_SYS_ADMIN, which accounts for nearly 39% of all capability uses, are spread widely across the kernel source tree.

One might wonder whether either of these two capabilities is overrepresented because of duplications in the drivers/ or the arch/ trees. (In particular, CAP_SYS_ADMIN is used for similar administrative functions on a lot of device drivers.) However, even if we strip drivers/ and architectures other than x86 from the measurements, the overall picture doesn't change greatly, as shown in Table 2.

Table 2: Breakdown of Linux capability uses in Linux 3.2, excluding drivers and architectures other than x86
    Capability              #uses   #files
    CAP_AUDIT_CONTROL           2        2
    CAP_AUDIT_WRITE             1        1
    CAP_CHOWN                   4        2
    CAP_DAC_OVERRIDE            2        1
    CAP_DAC_READ_SEARCH         4        2
    CAP_FOWNER                  9        8
    CAP_FSETID                  8        6
    CAP_IPC_LOCK               11        6
    CAP_IPC_OWNER               1        1
    CAP_KILL                    1        1
    CAP_LEASE                   1        1
    CAP_LINUX_IMMUTABLE        13       13
    CAP_MAC_ADMIN              25        5
    CAP_MAC_OVERRIDE            6        2
    CAP_MKNOD                   3        3
    CAP_NET_ADMIN             167       73
    CAP_NET_BIND_SERVICE       12        9
    CAP_NET_BROADCAST           0        0
    CAP_NET_RAW                18       11
    CAP_SETFCAP                 3        2
    CAP_SETGID                  9        5
    CAP_SETPCAP                 2        2
    CAP_SETUID                  8        4
    CAP_SYS_ADMIN             167       80
    CAP_SYS_BOOT                2        2
    CAP_SYS_CHROOT              1        1
    CAP_SYSLOG                  2        2
    CAP_SYS_MODULE              4        3
    CAP_SYS_NICE               12        6
    CAP_SYS_PACCT               1        1
    CAP_SYS_PTRACE             10        5
    CAP_SYS_RAWIO              10        9
    CAP_SYS_RESOURCE           26       18
    CAP_SYS_TIME                4        2
    CAP_SYS_TTY_CONFIG          1        1
    CAP_WAKE_ALARM              2        1
    Total                     552      291

CAP_SYS_ADMIN still accounts for 167 of the 552 uses of capabilities (about 30%), and, by coincidence, CAP_NET_ADMIN has exactly the same usage count.

It turns out that the overall picture hasn't changed that much since capabilities were first introduced with Linux 2.2 (Jan 1999). Then, there were 27 capabilities, broken down as shown in Table 3.

Table 3: Breakdown of Linux capability uses in Linux 2.2
    Capability              #uses   #files
    CAP_CHOWN                   2        1
    CAP_DAC_OVERRIDE            5        5
    CAP_DAC_READ_SEARCH         4        4
    CAP_FOWNER                  7        5
    CAP_FSETID                  3        2
    CAP_IPC_LOCK                4        2
    CAP_IPC_OWNER               1        1
    CAP_KILL                    0        0
    CAP_LINUX_IMMUTABLE         2        2
    CAP_NET_ADMIN              75       32
    CAP_NET_BIND_SERVICE        3        3
    CAP_NET_BROADCAST           0        0
    CAP_NET_RAW                 8        6
    CAP_SETGID                  7        2
    CAP_SETPCAP                 2        2
    CAP_SETUID                  7        3
    CAP_SYS_ADMIN             127       69
    CAP_SYS_BOOT                1        1
    CAP_SYS_CHROOT              1        1
    CAP_SYS_MODULE              4        2
    CAP_SYS_NICE                5        2
    CAP_SYS_PACCT               1        1
    CAP_SYS_PTRACE              9        9
    CAP_SYS_RAWIO               2        1
    CAP_SYS_RESOURCE           10        8
    CAP_SYS_TIME                7        4
    CAP_SYS_TTY_CONFIG          1        1
    Total                     298      169

In Linux 2.2, CAP_SYS_ADMIN accounted for 42% of the uses of capabilities, and CAP_NET_ADMIN accounted for 25%.

Table 4 repeats the earlier exercise of excluding drivers/ and architectures other than i386 (as the Intel arch/ directory was then named) from the Linux 2.2 data. In this case, an interesting difference emerges.

Table 4: Breakdown of Linux capability uses in Linux 2.2, excluding drivers and architectures other than i386
    Capability              #uses   #files
    CAP_CHOWN                   2        1
    CAP_DAC_OVERRIDE            5        5
    CAP_DAC_READ_SEARCH         4        4
    CAP_FOWNER                  7        5
    CAP_FSETID                  3        2
    CAP_IPC_LOCK                4        2
    CAP_IPC_OWNER               1        1
    CAP_KILL                    0        0
    CAP_LINUX_IMMUTABLE         2        2
    CAP_NET_ADMIN              44       23
    CAP_NET_BIND_SERVICE        3        3
    CAP_NET_BROADCAST           0        0
    CAP_NET_RAW                 8        6
    CAP_SETGID                  7        2
    CAP_SETPCAP                 2        2
    CAP_SETUID                  7        3
    CAP_SYS_ADMIN              23       17
    CAP_SYS_BOOT                1        1
    CAP_SYS_CHROOT              1        1
    CAP_SYS_MODULE              4        2
    CAP_SYS_NICE                5        2
    CAP_SYS_PACCT               1        1
    CAP_SYS_PTRACE              2        2
    CAP_SYS_RAWIO               2        1
    CAP_SYS_RESOURCE            5        4
    CAP_SYS_TIME                3        1
    CAP_SYS_TTY_CONFIG          1        1
    Total                     147       94

The overall picture differs in a way that I suspect is significant: just under 16% (23/147) of the uses of capabilities are CAP_SYS_ADMIN. (As we saw in Table 2, by Linux 3.2, this figure had grown to 30% (167/552).) This difference suggests to me that as successive kernel developers faced the question "What capability should I use to govern my new privileged kernel feature?", the answer was often something like "I don't know; maybe CAP_SYS_ADMIN?". (That certainly fits with a few anecdotal cases I've encountered while discussing things with kernel developers as I wrote man pages for new kernel features.)

The script (count_kernel_cap_uses.sh) used to generate the data for these statistics can be found here. The first and third tables above are based on analysis of the "p2" output files produced by the script. The second and fourth tables are based on analysis of the "p4" output files.

2012-02-22

Look hard, and you can see TLPI

This is fun... I'm fairly certain that the guess by this redditor about what you can see sitting behind the laptop in this photo (taken by John Snyder, and used in the 21 Feb 2012 Wired article, Lord of the Files: How GitHub Tamed Free Software (And More)) is correct. Probably, it's the copy I gave Linus last year. It's nice to know he held on to it, though I'm not so sure there's much in the book that he doesn't already know.

Update, 2015-07-06: fix a broken URL, and add the title of the Wired article.
Update, 2016-07-25: fix a broken URL (again... Wired.com doesn't seem to embrace the notion of stable URLs).

2012-02-20

Traditional Chinese translation of TLPI

The Taiwanese publisher GOTOP has acquired the rights to publish a Traditional Chinese translation of TLPI.

There are now four translations of TLPI in progress. It looks like the first to appear will be the Korean translation, sometime around the middle of this year.

2012-02-04

Web site updates

I've rolled out a few updates to the man7.org web site, including a redesigned main page for TLPI, and the addition of several errata to the errata page. Special thanks for a long list of error reports to Junjiro Okajima, who's working on the Japanese translation of TLPI.

2012-01-16

Third print run (and request for bug reports)

Sales of The Linux Programming Interface have continued well enough that the publisher will soon start preparing the third print run. That print run will incorporate all of the outstanding errata.

If you've been reading TLPI and noticed any errors or typos, now would be a good time to report them, so that fixes can be included in the next printing.