2012-11-18

Japanese translation of TLPI will be published in December

The Japanese translation of TLPI (published by O'Reilly Japan) will come out next month. Japanese readers can read (the Japanese translation of) a short preface that I wrote for the translation.

Although I will never be able to read it, I have a couple of reasons to believe that the translation will be excellent. First, the translation was done by a kernel developer, Junjiro Okajima (developer of the aufs file system). Second, along the way Junjiro has been extremely attentive in reading TLPI and checking details. As a consequence, he contributed a large number of errata reports (making up more than 20% of all current errata reports for TLPI), ranging from typos to corrections and improvements to rather deep technical details presented in the English text.


2012-11-06

Nominated for NZ Open Source Award

To my pleasant surprise, I'm informed that I'm not merely nominated, but am also a finalist for a New Zealand Open Source Award, in the Open Source Contributor category, for "contributions to Linux documentation" (sounds about right!). But, being in Barcelona for LinuxCon Europe / ELCE, I won't be able to be present for the awards event. Old friend Dan Randow has kindly agreed to proxy for me at the award event tomorrow.

(Update, 2012-11-15: Congratulations to Grant McLean, who won the award.)

2012-10-05

How much do __builtin_expect(), likely(), and unlikely() improve performance?

I got curious about how much __builtin_expect() helps performance and when its use is justified, so I wrote a small test program to study its effects.

As the gcc documentation says, you can use this compiler built-in function to give the optimizer a clue about the likely result of an integer (or Boolean) expression. In the context of an if statement, this enables the optimizer to reorder the code in a way that gives best performance, by ensuring that the code that is most likely to execute after the conditional immediately follows the conditional when the instruction stream is fed to the CPU pipeline.

The __builtin_expect() function takes two arguments: a value to be tested, and the expected result. Both of these are integral values. The interface is a little clumsy for most uses, since the common case is that we want to test for "true" (non-zero) or "false" (zero). Thus, the Linux kernel defines two simpler interfaces: likely() and unlikely() (in include/linux/compiler.h):

    #define likely(x)      __builtin_expect(!!(x), 1)
    #define unlikely(x)    __builtin_expect(!!(x), 0)


In other words, likely(x) means "I expect x is true", and unlikely(x) means "I expect x is false".
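To make the intent concrete, here's a small user-space sketch (my own illustrative example, not code from the kernel) showing why the macros use the !!(x) normalization, and a typical use of unlikely() on a rarely taken error path:

```c
#include <stddef.h>

/* User-space copies of the kernel macros. The !!(x) double negation
   normalizes any nonzero value to exactly 1, so that the value handed
   to __builtin_expect() is always 0 or 1, matching the expectation. */
#define likely(x)      __builtin_expect(!!(x), 1)
#define unlikely(x)    __builtin_expect(!!(x), 0)

/* Hypothetical example: a lookup in which a NULL table is expected
   to be rare, so the error path is marked unlikely(). */
int
get_value(const int *table, int idx, int *val)
{
    if (unlikely(table == NULL))
        return -1;              /* rarely taken error path */

    *val = table[idx];
    return 0;
}
```

The hints change only code layout, not semantics: get_value() behaves identically with or without them.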

Here's my test program. The comments should be enough to help you understand some of the more obscure pieces. Below, I'll just skip to looking at the test results.


/* builtin_expect_test.c */

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

#define BSIZE 1000000

#ifndef BINCR
#define BINCR 1
#endif

#if defined(EXPECT_RESULT) && defined(DONT_EXPECT)
#error "Specifying both EXPECT_RESULT and DONT_EXPECT makes no sense"
#endif

/* We add some seemingly unneeded complexity to the code, simply to
   make the optimizer's task tricky enough that it won't optimize away
   the effect of __builtin_expect(). In this particular program, all
   of the following are needed:

        * Calling a *non-inline* function inside the loop in main().
        * Looping over an array in main() (rather than checking a
          single variable).
        * Dynamically allocating the array with calloc(), rather than
          declaring an array and initializing with memset().
        * Acting on two different variables (m1, m2) in each branch
          of the 'if' statement in main() (if the two branches after
          the 'if' execute the same code, gcc is clever enough to
          recognize this and optimize the 'if' away).
        * Printing the resulting values of the variables modified in
          the loop (otherwise gcc may optimize away the entire loop
          inside main()).

   Compile with at least -O2 (on x86) to see a difference in
   performance due to __builtin_expect().
*/

static __attribute__ ((noinline))  int
f(int a)
{
    return a;
}

int
main(int argc, char *argv[])
{
    int *p;
    int j, k, m1, m2, nloops;

    if (argc != 2 || strcmp(argv[1], "--help") == 0) {
        fprintf(stderr, "Usage: %s num-loops\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    m1 = m2 = 0;
    nloops = atoi(argv[1]);

    /* calloc() allocates an array and zeros its contents */

    p = calloc(BSIZE, sizeof(int));
    if (p == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

#if defined(BREAK_STEP) && BREAK_STEP > 0

    /* This provides us with a way to inject some values into the
       array that differ from our expected test value, in order
       to get an idea of how much the __builtin_expect()
       optimization is negatively affected by unexpected values. */

    for (k = 0, j = 0; j < BSIZE; j += BREAK_STEP) {
        p[j] += BINCR;
        k++;
    }

    printf("Adjusted %d items by %d\n", k, BINCR);

#endif

    for (j = 0; j < nloops; j++) {
        for (k = 0; k < BSIZE; k++) {
#ifdef DONT_EXPECT
            if (p[k]) {
#else
            if (__builtin_expect(p[k], EXPECT_RESULT)) {
#endif
                m1 = f(++m1);
            } else {
                m2 = f(++m2);
            }
        }
    }

    printf("%d, %d\n", m1, m2);

    exit(EXIT_SUCCESS);
}


(You can download the test code here.)

The program essentially repeatedly scans a one-million-element integer array whose contents are zero (in the default case). Using the program, we can time the results of the checks that are performed either with or without using __builtin_expect().

For example, here we scan the array without __builtin_expect():
    $ cc -DDONT_EXPECT -O3 builtin_expect_test.c -o bn
    $ time -f "%E real, %U user, %S sys" ./bn 1000
    0, 1000000000
    0:02.68 real,  2.67 user, 0.00 sys
In this case, the program looped one thousand times through the array, to perform a total of one billion checks, and the real time for execution was 2.68 seconds. (The test machine is an Intel Core Duo 2.53GHz, and the gcc version is 4.6.3.)

Here's what happens if we employ __builtin_expect(), telling the compiler that the expected result of the test is 0:
    $ cc -DEXPECT_RESULT=0 -O3 builtin_expect_test.c -o b0
    $ time -f "%E real, %U user, %S sys" ./b0 1000
    0, 1000000000
    0:02.28 real,  2.28 user, 0.00 sys
The execution time fell to 2.28 seconds. In other words (for this particular CPU, compiler version, and program), __builtin_expect() improved the execution time of each check by 0.4 nanoseconds (0.4 seconds for one billion checks).

Well and good. What if we tell __builtin_expect() to expect the wrong value?
    $ cc -DEXPECT_RESULT=1 -O3 builtin_expect_test.c -o b1
    $ time -f "%E real, %U user, %S sys" ./b1 1000
    0, 1000000000
    0:04.19 real,  4.18 user, 0.00 sys
In this case, unsurprisingly, we made each check run slower, by about 1.5 (i.e., 4.19 - 2.68) nanoseconds.

So, should you use __builtin_expect()?

You should use __builtin_expect()—or the Linux kernel's likely() and unlikely()—only if it's "very likely" that your code will follow the predicted branch. How much is "very likely"? If you're looking for actual numbers, the answer will depend on your compiler version, CPU, and code. But to illustrate why you should generally avoid these optimizations unless your code is very likely to follow one branch, here are some further tests using the above code.

In this test, the program first injects some nonzero values into the array before doing tests for zero using __builtin_expect(). Nonzero values are placed at every tenth element in the array:
    $ cc -DEXPECT_RESULT=0 -DBREAK_STEP=10 -O3 builtin_expect_test.c -o b0
    $ time -f "%E real, %U user, %S sys" ./b0 1000
    100000000, 900000000
    0:02.79 real,  2.76 user, 0.01 sys
Note what happened. Even though most array elements contained the expected zero value, execution speed was actually worse (2.79 seconds versus 2.68 seconds) than not using __builtin_expect() at all! In fact, even when only one in ten thousand values is nonzero, we're still only at roughly the break-even point:
    $ cc -DEXPECT_RESULT=0 -DBREAK_STEP=10000 -O3 builtin_expect_test.c -o b0
    $ time -f "%E real, %U user, %S sys" ./b0 1000
    100000, 999900000
    0:02.66 real,  2.64 user, 0.00 sys
The point at which these optimizations become worthwhile will depend on the factors mentioned above, but the general rule is clear: use them only when your predicted path is very likely. If the predicted path is not very likely, you're better off avoiding them, since you'll actually slow your code down a little.
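As a rough sanity check on where break-even might lie, here's a naive linear cost model built from the timings measured above (the 0.4 ns saving and 1.5 ns penalty come from the earlier runs; the linear model itself is my own simplification, not something from the gcc documentation):

```c
/* Naive model: a correct static hint saves about 0.4 ns per check,
   and an incorrect one costs about 1.5 ns per check, both relative
   to compiling with no hint at all. If a fraction w of checks defy
   the hint, the expected per-check gain is 0.4*(1-w) - 1.5*w, which
   is zero at w = 0.4/1.9, i.e. w of roughly 0.21. */
double
expected_gain_ns(double wrong_fraction)
{
    const double save_ns = 0.4;     /* saving when the hint is right */
    const double cost_ns = 1.5;     /* extra cost when the hint is wrong */

    return save_ns * (1.0 - wrong_fraction) - cost_ns * wrong_fraction;
}
```

The linear model predicts break-even near w ≈ 0.21, vastly more tolerant than the measured break-even of roughly 1 in 10,000. Presumably the scattered unexpected values also degrade the CPU's dynamic branch prediction, so the static-hint arithmetic alone badly understates the damage; treat the model as an optimistic upper bound only.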

Compiler-assisted run-time profiling

The gcc documentation contains the following advice regarding the use of __builtin_expect():
    In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
That's good concise advice. To put things another way, the only time you should use __builtin_expect() is when you can't use compiler-assisted runtime optimization (perhaps because your program has no easily repeatable pattern of execution—the Linux kernel is an obvious example) and you are certain that your predicted code path is very (very) likely to be the one that will be taken.

The example program above does have a very predictable, repeatable flow of execution. Let's see what happens when we use compiler-assisted optimization. Building the program now involves two steps: a profiling phase and an optimized compile. In the profiling phase, we build and run an instrumented version of the executable. We build as follows:
    $ cc -O3 -DDONT_EXPECT -fprofile-generate builtin_expect_test.c -o bn.prof
(The -fprofile-generate option implies -fprofile-arcs, as well as one or two other profiling options.)

We then run the executable, which generates profiling information that is stored in a file (with the extension .gcda).
    $ time -f "%E real, %U user, %S sys" ./bn.prof 1000
    0, 1000000000
    0:05.39 real,  5.37 user, 0.00 sys
Note that, because of the instrumentation code, the profiled version runs rather slower than the normally compiled code. Running this code created a file containing the profiling results:
    $ ls *.gcda
    builtin_expect_test.gcda
We then employ the -fprofile-use compiler option, which (implicitly) uses the profiling results to create an optimized executable.
    $ cc -O3 -DDONT_EXPECT -fprofile-use builtin_expect_test.c -o bn.opt
And then we run the optimized program:
    $ time -f "%E real, %U user, %S sys" ./bn.opt 1000
    0, 1000000000
    0:01.95 real,  1.94 user, 0.00 sys
This optimized version runs significantly faster (1.95 versus 2.28 seconds) than our version that used __builtin_expect(). This is because, in addition to the branching in the if statement, the branching in the for loops was also optimized.

It's left as an exercise for the reader to show that employing __builtin_expect() (to expect 0) in conjunction with compiler-assisted optimization doesn't improve things further: the profile-based optimization of the if branching already achieves what the programmer-directed optimization does. Another interesting exercise is, of course, to compare the assembler code (cc -S) generated in each of the above cases.
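For the assembler comparison, a much smaller compilation unit than the benchmark is easier to read. Something like the following (a hypothetical minimal example of mine, not part of the benchmark) can be compiled with cc -O2 -S, once with -DEXPECTED=0 and once with -DEXPECTED=1, to see how gcc reorders the branches:

```c
/* A tiny function for inspecting the effect of __builtin_expect()
   in the generated assembler (cc -O2 -S -DEXPECTED=... expect_asm.c).
   The hint affects only which path the compiler lays out in the
   fall-through position; the function's result is unchanged. */

#ifndef EXPECTED
#define EXPECTED 0
#endif

int
classify(int v)
{
    if (__builtin_expect(v != 0, EXPECTED))
        return 1;       /* path favored when EXPECTED is 1 */
    else
        return 2;       /* path favored when EXPECTED is 0 */
}
```

On x86 at -O2, gcc typically places the expected branch immediately after the conditional jump, so diffing the two .s files makes the reordering visible.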

2012-07-14

The Korean translation of TLPI is out

The Korean translation of The Linux Programming Interface came out on 10 July. You can find further information about the translation in my earlier blog post and on the translations page of the TLPI web site.

The publisher kindly sent me a few photographs of the finished work:



(The table on the left corresponds to page 194 in the English original, and the figure on the right corresponds to page 902.)



(You have to love the bookends...)

2012-06-25

Some licensing changes to the source code

When originally published, the code from TLPI was licensed under the GNU Affero General Public License, version 3.

One reader commented that they are using one of my library functions as part of their everyday "toolbox" of handy pieces of code. As things stood, that meant that if they redistributed any code that linked against that function, or ran that code as part of a larger network application, then the complete source code of their application would also need to be licensed under the Affero GPL (i.e., all source code linked against my library functions would also need to be distributed).

That scenario wasn't really my intention (I'd overlooked the case of library functions in my code). So, I've now relicensed the library functions in TLPI under the GNU Lesser General Public License, version 3. The relicensing applies to all files in the lib/ directory of the source code distribution, and the changed licenses are already present in the latest source code tarball and online versions of the programs that went up in an update of the web site a couple of weeks ago.

Aside from the licensing changes, it's worth mentioning that since publication of TLPI, I've made a number of small fixes to various example programs. The fixes are listed in the CHANGES file that is distributed as part of the source code tarball.

2012-06-22

Korean translation of TLPI available soon

The Korean publisher Acorn is completing the final pieces in the production of the Korean translation of TLPI, which should be published next month.

The translation will be published in two volumes that together run to nearly 2000 pages. As well as splitting the book into two volumes, the chapters have been reorganized somewhat: Chapters 29 to 33 (POSIX threads) of the English original move to a later position in the translation, so that they form the start of the second volume. (In a two-volume version, this reordering seemed a little better both to the publisher and to me.)
The two volumes will be sold both separately (priced at 50,000 and 35,000 South Korean won, respectively) and as a two-volume package (79,000 won).

Considering that translation work began just 18 months ago, the speed of publication is impressive. One thing that helped was that the translation was the work of a team of translators.

It is of course an honor and a pleasure to see my book translated, for which I thank both the publisher and translators. Among the translators, a special thanks to Kiju Kim (nontechnical private blog), who kept me up to date on progress and also submitted a number of error reports on the English original while translating. And though I'll never be able to read them, I look forward to holding copies of the Korean translation in a few weeks.

2012-04-09

TLPI third print run now available

I'm happy to announce that the third print run of TLPI is now printed, and should be available for sale shortly. The third print run incorporates the 131 errata shown on the errata page. (If that seems like a lot of errata, take a look at this FAQ.)

2012-03-13

Kernel capability usage statistics

(Update, 2012-07-16: a related article based on this blog post, "CAP_SYS_ADMIN: the new root", was published on LWN.net on 2012-03-14.)

The idea of capabilities is to break the power of root (user ID 0) into independently assigned pieces governing specific privileged operations. Implicit in that model is that the set of privileged operations governed by each capability should be small (otherwise, why break the power of root into pieces at all?). However, that implication hasn't turned out to be true in practice.
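In the kernel source, each privileged operation is typically guarded by a call to capable() naming a specific capability; those calls are what the statistics below count. Here's a user-space sketch of the shape of such a check (the bit-mask representation and the capable() stub are my own stand-ins, not the kernel's implementation; only the form of the guard matches kernel code):

```c
#include <errno.h>

#define CAP_SYS_ADMIN 21        /* constant from <linux/capability.h> */

/* Stand-in for the caller's effective capability set, represented
   here as a simple bit mask (not how the kernel stores it). */
unsigned long long effective_caps;

/* Stand-in for the kernel's capable(): does the caller have the
   given capability? */
int
capable(int cap)
{
    return (effective_caps & (1ULL << cap)) != 0;
}

/* A privileged operation guarded the way kernel code typically
   guards one: refuse with EPERM unless the capability is held. */
int
do_admin_op(void)
{
    if (!capable(CAP_SYS_ADMIN))
        return -EPERM;

    /* ... perform the privileged work ... */
    return 0;
}
```

Counting occurrences of patterns like capable(CAP_XXX) across the source tree is essentially how the tables below were produced.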

Table 1 shows some statistics on the use of the 36 currently existing capabilities in the C files of the Linux 3.2 kernel source code. The "#uses" column is the number of uses of the capability across all source files; the "#files" column is the number of distinct source files where the capability is used.

Table 1: Breakdown of Linux capability uses in Linux 3.2
    Capability              #uses   #files
    CAP_AUDIT_CONTROL           2        2
    CAP_AUDIT_WRITE             1        1
    CAP_CHOWN                   4        2
    CAP_DAC_OVERRIDE            2        1
    CAP_DAC_READ_SEARCH         4        2
    CAP_FOWNER                  9        8
    CAP_FSETID                  8        6
    CAP_IPC_LOCK               13        8
    CAP_IPC_OWNER               1        1
    CAP_KILL                    2        2
    CAP_LEASE                   1        1
    CAP_LINUX_IMMUTABLE        13       13
    CAP_MAC_ADMIN              25        5
    CAP_MAC_OVERRIDE            6        2
    CAP_MKNOD                   3        3
    CAP_NET_ADMIN             395      182
    CAP_NET_BIND_SERVICE       13       10
    CAP_NET_BROADCAST           0        0
    CAP_NET_RAW                18       11
    CAP_SETFCAP                 3        2
    CAP_SETGID                 10        6
    CAP_SETPCAP                 2        2
    CAP_SETUID                  8        4
    CAP_SYS_ADMIN             451      229
    CAP_SYS_BOOT                2        2
    CAP_SYS_CHROOT              1        1
    CAP_SYSLOG                  2        2
    CAP_SYS_MODULE              4        3
    CAP_SYS_NICE               14        8
    CAP_SYS_PACCT               1        1
    CAP_SYS_PTRACE             11        6
    CAP_SYS_RAWIO              67       42
    CAP_SYS_RESOURCE           36       24
    CAP_SYS_TIME               22       13
    CAP_SYS_TTY_CONFIG         11        4
    CAP_WAKE_ALARM              2        1
    Total                    1167      610

What is of course notable is the extremely heavy use of two capabilities: CAP_SYS_ADMIN and CAP_NET_ADMIN. Together, these two capabilities account for more than 70% of all uses of capabilities! The uses of CAP_NET_ADMIN are limited to the drivers/ (mainly drivers/net/) and net/ directories. On the other hand, the uses of CAP_SYS_ADMIN, which accounts for nearly 39% of all capability uses, are spread widely across the kernel source tree.

One might wonder whether either of these two capabilities is overrepresented because of duplications in the drivers/ or the arch/ trees. (In particular, CAP_SYS_ADMIN is used for similar administrative functions on a lot of device drivers.) However, even if we strip drivers/ and architectures other than x86 from the measurements, the overall picture doesn't change greatly, as shown in Table 2.

Table 2: Breakdown of Linux capability uses in Linux 3.2, excluding drivers and architectures other than x86
    Capability              #uses   #files
    CAP_AUDIT_CONTROL           2        2
    CAP_AUDIT_WRITE             1        1
    CAP_CHOWN                   4        2
    CAP_DAC_OVERRIDE            2        1
    CAP_DAC_READ_SEARCH         4        2
    CAP_FOWNER                  9        8
    CAP_FSETID                  8        6
    CAP_IPC_LOCK               11        6
    CAP_IPC_OWNER               1        1
    CAP_KILL                    1        1
    CAP_LEASE                   1        1
    CAP_LINUX_IMMUTABLE        13       13
    CAP_MAC_ADMIN              25        5
    CAP_MAC_OVERRIDE            6        2
    CAP_MKNOD                   3        3
    CAP_NET_ADMIN             167       73
    CAP_NET_BIND_SERVICE       12        9
    CAP_NET_BROADCAST           0        0
    CAP_NET_RAW                18       11
    CAP_SETFCAP                 3        2
    CAP_SETGID                  9        5
    CAP_SETPCAP                 2        2
    CAP_SETUID                  8        4
    CAP_SYS_ADMIN             167       80
    CAP_SYS_BOOT                2        2
    CAP_SYS_CHROOT              1        1
    CAP_SYSLOG                  2        2
    CAP_SYS_MODULE              4        3
    CAP_SYS_NICE               12        6
    CAP_SYS_PACCT               1        1
    CAP_SYS_PTRACE             10        5
    CAP_SYS_RAWIO              10        9
    CAP_SYS_RESOURCE           26       18
    CAP_SYS_TIME                4        2
    CAP_SYS_TTY_CONFIG          1        1
    CAP_WAKE_ALARM              2        1
    Total                     552      291

CAP_SYS_ADMIN still accounts for 167 of the 552 uses of capabilities (about 30%), and, by coincidence, CAP_NET_ADMIN has exactly the same usage count.

It turns out that the overall picture hasn't changed that much since capabilities were first introduced with Linux 2.2 (Jan 1999). Then, there were 27 capabilities, broken down as shown in Table 3.

Table 3: Breakdown of Linux capability uses in Linux 2.2
    Capability              #uses   #files
    CAP_CHOWN                   2        1
    CAP_DAC_OVERRIDE            5        5
    CAP_DAC_READ_SEARCH         4        4
    CAP_FOWNER                  7        5
    CAP_FSETID                  3        2
    CAP_IPC_LOCK                4        2
    CAP_IPC_OWNER               1        1
    CAP_KILL                    0        0
    CAP_LINUX_IMMUTABLE         2        2
    CAP_NET_ADMIN              75       32
    CAP_NET_BIND_SERVICE        3        3
    CAP_NET_BROADCAST           0        0
    CAP_NET_RAW                 8        6
    CAP_SETGID                  7        2
    CAP_SETPCAP                 2        2
    CAP_SETUID                  7        3
    CAP_SYS_ADMIN             127       69
    CAP_SYS_BOOT                1        1
    CAP_SYS_CHROOT              1        1
    CAP_SYS_MODULE              4        2
    CAP_SYS_NICE                5        2
    CAP_SYS_PACCT               1        1
    CAP_SYS_PTRACE              9        9
    CAP_SYS_RAWIO               2        1
    CAP_SYS_RESOURCE           10        8
    CAP_SYS_TIME                7        4
    CAP_SYS_TTY_CONFIG          1        1
    Total                     298      169

In Linux 2.2, CAP_SYS_ADMIN accounted for 42% of the uses of capabilities, and CAP_NET_ADMIN accounted for 25%.

Table 4 repeats the earlier exercise of excluding drivers/ and architectures other than i386 (as the Intel arch/ directory was then named) from the Linux 2.2 data. In this case, an interesting difference emerges.

Table 4: Breakdown of Linux capability uses in Linux 2.2, excluding drivers and architectures other than i386
    Capability              #uses   #files
    CAP_CHOWN                   2        1
    CAP_DAC_OVERRIDE            5        5
    CAP_DAC_READ_SEARCH         4        4
    CAP_FOWNER                  7        5
    CAP_FSETID                  3        2
    CAP_IPC_LOCK                4        2
    CAP_IPC_OWNER               1        1
    CAP_KILL                    0        0
    CAP_LINUX_IMMUTABLE         2        2
    CAP_NET_ADMIN              44       23
    CAP_NET_BIND_SERVICE        3        3
    CAP_NET_BROADCAST           0        0
    CAP_NET_RAW                 8        6
    CAP_SETGID                  7        2
    CAP_SETPCAP                 2        2
    CAP_SETUID                  7        3
    CAP_SYS_ADMIN              23       17
    CAP_SYS_BOOT                1        1
    CAP_SYS_CHROOT              1        1
    CAP_SYS_MODULE              4        2
    CAP_SYS_NICE                5        2
    CAP_SYS_PACCT               1        1
    CAP_SYS_PTRACE              2        2
    CAP_SYS_RAWIO               2        1
    CAP_SYS_RESOURCE            5        4
    CAP_SYS_TIME                3        1
    CAP_SYS_TTY_CONFIG          1        1
    Total                     147       94

The overall picture differs in a way that I suspect is significant: just under 16% (23/147) of the uses of capabilities are CAP_SYS_ADMIN. (As we saw in Table 2, by Linux 3.2, this figure had grown to 30% (167/552).) This difference suggests to me that as successive kernel developers faced the question "What capability should I use to govern my new privileged kernel feature?", the answer was often something like "I don't know; maybe CAP_SYS_ADMIN?". (That certainly fits with a few anecdotal cases I've encountered while discussing things with kernel developers as I wrote man pages for new kernel features.)

The script (count_kernel_cap_uses.sh) used to generate the data for these statistics can be found here. The first and third tables above are based on analysis of the "p2" output files produced by the script. The second and fourth tables are based on analysis of the "p4" output files.

2012-02-22

Look hard, and you can see TLPI

This is fun... I'm fairly certain that the guess by this redditor about what you can see sitting behind the laptop in this photo (taken by John Snyder, and used in the 21 Feb 2012 Wired article, Lord of the Files: How GitHub Tamed Free Software (And More)) is correct. Probably, it's the copy I gave Linus last year. It's nice to know he held on to it, though I'm not so sure there's much in the book that he doesn't already know.

Update, 2015-07-06: fix a broken URL, and add the title of the Wired article.
Update, 2016-07-25: fix a broken URL (again... Wired.com doesn't seem to embrace the notion of stable URLs).

2012-02-20

Traditional Chinese translation of TLPI

The Taiwanese publisher GOTOP has acquired the rights to publish a Traditional Chinese translation of TLPI.

There are now four translations of TLPI in progress. It looks like the first to appear will be the Korean translation, sometime around the middle of this year.

2012-02-04

Web site updates

I've rolled out a few updates to the man7.org web site, including a redesigned main page for TLPI, and the addition of several errata to the errata page. Special thanks for a long list of error reports to Junjiro Okajima, who's working on the Japanese translation of TLPI.

2012-01-16

Third print run (and request for bug reports)

Sales of The Linux Programming Interface have continued well enough that the publisher will soon start preparing the third print run. That print run will incorporate all of the outstanding errata.

If you've been reading TLPI and noticed any errors or typos, now would be a good time to report them, so that fixes can be included in the next printing.