2016-12-14

Introspecting namespace relationships

One of the interesting new features added in the just-released Linux 4.9 kernel is the ability to introspect namespace relationships. Two kinds of relationship can be discovered: the parent-child relationships for hierarchical namespace types (i.e. PID namespaces and user namespaces), and the ownership relationship between a non-user namespace and its associated user namespace.

There are various uses for this sort of introspection. One is to answer the question: what capabilities does process X have in namespace Y? The rules that determine the answer to that question have been documented in the user_namespaces(7) manual page for quite a while, but until now, there was no way of empirically answering that question with respect to a particular process and a particular namespace on a running system. This changes in Linux 4.9, thanks to work that Andrei Vagin did after I asked about this possibility on the Linux kernel mailing list back in July.

The solution, suggested by Eric Biederman, is rather elegant (even if implemented as ioctl() operations), and is based on returning file descriptors referring to objects in the (unmounted) namespace filesystem (NSFS). Given a file descriptor, fd, that refers to one of the /proc/PID/ns/xxxx symbolic links, two operations can be performed:

  • ioctl(fd, NS_GET_USERNS): Returns a file descriptor that refers to the owning user namespace for the namespace referred to by fd.
  • ioctl(fd, NS_GET_PARENT): Returns a file descriptor that refers to the parent namespace for the namespace referred to by fd. This operation can be applied only to hierarchical namespaces (PID namespaces and user namespaces). The operation fails with the error EPERM if the parent namespace is outside the namespace scope of the caller. This might be the case if, for example, the parent of a PID namespace is an ancestor namespace of the caller's PID namespace. The same error occurs when trying to find the parent of the initial PID or user namespace. When working our way backward through the chain of ancestors of a namespace, this fact can be used to determine whether we have reached the initial namespace.

By applying fstat() to a file descriptor returned by either of these operations, one can discover the device ID and inode number of the NSFS object referred to by the descriptor. By comparing these two values with the values for another namespace file descriptor, we can determine whether the two file descriptors refer to the same namespace.
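
The fstat() comparison described above can be sketched in a few lines of Go. This is only an illustration: the names nsID, fdID and samePathObject are hypothetical helpers invented for this sketch, and the choice of PID 1 in main() is arbitrary (opening another process's ns files generally requires privilege).

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// nsID identifies an NSFS object (and hence a namespace) by the
// device ID and inode number that fstat() reports for it.
type nsID struct {
	dev uint64
	ino uint64
}

// fdID returns the (device ID, inode number) pair for an open
// file descriptor.
func fdID(fd int) (nsID, error) {
	var sb syscall.Stat_t
	if err := syscall.Fstat(fd, &sb); err != nil {
		return nsID{}, err
	}
	return nsID{uint64(sb.Dev), uint64(sb.Ino)}, nil
}

// samePathObject opens two paths and reports whether the two
// resulting descriptors refer to the same object. For
// /proc/PID/ns/xxxx paths, that means the same namespace.
func samePathObject(pathA, pathB string) (bool, error) {
	a, err := os.Open(pathA)
	if err != nil {
		return false, err
	}
	defer a.Close()

	b, err := os.Open(pathB)
	if err != nil {
		return false, err
	}
	defer b.Close()

	idA, err := fdID(int(a.Fd()))
	if err != nil {
		return false, err
	}
	idB, err := fdID(int(b.Fd()))
	if err != nil {
		return false, err
	}
	return idA == idB, nil
}

func main() {
	// Illustrative question: are we in the same user namespace as PID 1?
	same, err := samePathObject("/proc/self/ns/user", "/proc/1/ns/user")
	if err != nil {
		fmt.Println("comparison failed:", err)
		os.Exit(1)
	}
	fmt.Println("same user namespace as PID 1:", same)
}
```

Because nsID is a plain struct of two integers, it can be compared with == and used directly as a map key, which is the same approach the longer program below takes.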

Another possible use of this feature is to introspect across all processes on the system to discover the PID and user namespace hierarchies on a live system. (And also to discover the relationship of non-user namespaces to their owning user namespaces.)
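
As a rough sketch of the second kind of query (finding the owning user namespace of a non-user namespace), something like the following could be used. The helper name owningUserNS is made up for this example, and the constant 0xb701 is an assumption: it is the value that _IO(0xb7, 0x1) from <linux/nsfs.h> yields on x86 and most other architectures, but a few architectures encode ioctl numbers differently.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// Assumed value of NS_GET_USERNS, i.e. _IO(0xb7, 0x1), as computed
// on x86. Check <linux/nsfs.h> on other architectures.
const nsGetUserNS = 0xb701

// owningUserNS returns the device ID and inode number of the user
// namespace that owns the namespace referred to by 'path'
// (for example, /proc/self/ns/net).
func owningUserNS(path string) (uint64, uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	// On success, the ioctl() returns a new file descriptor that
	// refers to the owning user namespace.
	r, _, e := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(),
		uintptr(nsGetUserNS), 0)
	if e != 0 {
		return 0, 0, e
	}
	userFD := int(r)
	defer syscall.Close(userFD)

	var sb syscall.Stat_t
	if err := syscall.Fstat(userFD, &sb); err != nil {
		return 0, 0, err
	}
	return uint64(sb.Dev), uint64(sb.Ino), nil
}

func main() {
	// Report the owning user namespace of each of our own
	// non-user namespaces.
	for _, t := range []string{"cgroup", "ipc", "mnt", "net", "pid", "uts"} {
		dev, ino, err := owningUserNS("/proc/self/ns/" + t)
		if err != nil {
			fmt.Printf("%-6s error: %v\n", t, err)
			continue
		}
		fmt.Printf("%-6s owned by user namespace {%d %d}\n", t, dev, ino)
	}
}
```

On a kernel older than 4.9 the ioctl() is expected to fail (the longer program below treats ENOTTY as "not supported"), so the sketch simply prints the error and moves on.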

The following Go program provides an example of such introspection. It inspects the /proc/PID/ns/user files for all processes on the system and builds up a map of the user namespace hierarchy along with the processes that reside in each namespace.

The program is fairly well commented, so without further explanation, I'll just present the code. (I should add that this is my first attempt at using Go, which is a nice language! The code may therefore not be idiomatic, and may also have some errors, but it should serve to illustrate what's going on.) An example run is shown below. The program code can be found in the code tarball available for download on my website.

 /* userns_overview.go  
   
   Display a hierarchical view of the user namespaces on the  
   system along with the member processes for each namespace.  
   This requires features new in Linux 4.9. See the  
   namespaces(7) man page.  
   (http://man7.org/linux/man-pages/man7/namespaces.7.html)  
 */  
   
 package main  
   
 import (  
     "fmt"  
     "io/ioutil"  
     "os"  
     "sort"  
     "strconv"  
     "strings"  
     "syscall"  
     "unsafe"  
 )  
   
 // A namespace is identified by device ID and inode number  
   
 type NamespaceID struct {  
     device  uint64 // dev_t  
     inode_num uint64 // ino_t  
 }  
   
 // A namespace has associated attributes: a set of  
 // child namespaces and a set of member processes  
   
 type NamespaceAttribs struct {  
     children []NamespaceID // Child namespaces  
     pids   []int     // Member processes  
 }  
   
 // The following map records all of the namespaces that  
 // we find on the system  
   
 var NSList = make(map[NamespaceID]*NamespaceAttribs)  
   
 // Along the way, we'll discover the ancestor of all user  
 // namespaces (the root of the user namespace hierarchy).  
   
 var initialNS NamespaceID  
   
 // AddNamespace adds a PID to the list of PIDs associated with  
 // the user namespace referred to by 'namespaceFD'.  
 //  
 // The set of namespaces is recorded in the 'NSList' map.  
 // If the map does not yet contain an entry corresponding to  
 // 'namespaceFD', then an entry is created. This process is  
 // recursive: if the parent of the user namespace referred  
 // to by 'namespaceFD' does not have an entry in 'NSList'  
 // then an entry is created for the parent, and the namespace  
 // referred to by 'namespaceFD' is made a child of that namespace.  
 //  
 // When called recursively to create the ancestor namespace  
 // entries, this function is called with 'pid' as -1, meaning  
 // that no PID needs to be added for this namespace entry.  
 //  
 // The return value of the function is the ID of the namespace  
 // entry (i.e., the device ID and inode number corresponding to  
 // the user namespace file referred to by 'namespaceFD').  
   
 func AddNamespace(namespaceFD int, pid int) NamespaceID {  
     const NS_GET_PARENT = 0xb702 // ioctl() to get namespace parent  
     var sb syscall.Stat_t  
     var err error  
   
     // Obtain the device ID and inode number of the namespace  
     // file. These values together form the key for the 'NSList'  
     // map entry.  
   
     err = syscall.Fstat(namespaceFD, &sb)  
     if err != nil {  
         fmt.Println("syscall.Fstat(): ", err)  
         os.Exit(1)  
     }  
   
     ns := NamespaceID{uint64(sb.Dev), uint64(sb.Ino)}
   
     if _, fnd := NSList[ns]; fnd {  
   
         // Namespace already exists; nothing to do  
   
     } else {  
   
         // Namespace entry does not yet exist; create it  
   
         np := new(NamespaceAttribs)  
         NSList[ns] = np  
   
         // Get file descriptor for parent user namespace  
   
         r, _, e := syscall.Syscall(syscall.SYS_IOCTL,  
             uintptr(namespaceFD), uintptr(NS_GET_PARENT), 0)  
         parentFD := (int)((uintptr)(unsafe.Pointer(r)))  
   
         if parentFD == -1 {  
             switch e {
             case syscall.EPERM:
                 // This is the initial NS; remember it
                 initialNS = ns
             case syscall.ENOTTY:
                 fmt.Println("This kernel doesn't support " +
                     "namespace introspection")
                 os.Exit(1)  
             default:  
                 // Unexpected error; bail  
                 fmt.Println("ioctl()", e)  
                 os.Exit(1)  
             }  
   
         } else {  
   
             // We have a parent user namespace; make sure it  
             // has an entry in the map. No need to add any  
             // PID for the parent entry.  
   
             par := AddNamespace(parentFD, -1)  
   
             // Make the current namespace entry ('ns') a child of  
             // the parent namespace entry  
   
             NSList[par].children = append(NSList[par].children, ns)  
   
             syscall.Close(parentFD)  
         }  
     }  
   
     // Add PID to PID list for this namespace entry  
   
     if pid > 0 {  
         NSList[ns].pids = append(NSList[ns].pids, pid)  
     }  
   
     return ns  
 }  
   
 // ProcessProcFile processes a single /proc/PID entry, creating  
 // a namespace entry for this PID's /proc/PID/ns/user file  
 // (and, as necessary, namespace entries for all ancestor namespaces  
 // going back to the initial user namespace).  
 // 'name' is the name of a PID directory under /proc.  
   
 func ProcessProcFile(name string) {  
     var namespaceFD int  
     var err error  
   
     // Obtain a file descriptor that refers to the user namespace  
     // of this process  
   
     namespaceFD, err = syscall.Open("/proc/"+name+"/ns/user",  
         syscall.O_RDONLY, 0)  
   
     if err != nil {
         fmt.Println("Open: ", err)
         os.Exit(1)  
     }  
   
     pid, _ := strconv.Atoi(name)  
   
     AddNamespace(namespaceFD, pid)  
   
     syscall.Close(namespaceFD)  
 }  
   
 // DisplayNamespaceTree() recursively displays the namespace  
 // tree rooted at 'ns'. 'level' is our current level in the  
 // tree, and is used for producing suitably indented output.  
   
 func DisplayNamespaceTree(ns NamespaceID, level int) { 
     prefix := strings.Repeat(" ", level*4)  
   
     // Display the namespace ID (device ID + inode number)  
   
     fmt.Print(prefix)  
     fmt.Println(ns)  
   
     // Print a sorted list of the PIDs that are members of this  
     // namespace. We do a bit of a dance here to produce a list  
     // of PIDs that is suitably wrapped, rather than a long  
     // single-line list.  
   
     sort.Ints(NSList[ns].pids)  
     base := len(prefix) + 25  
     col := base  
     for i, p := range NSList[ns].pids {  
         if i == 0 || col >= 80 && col > base+32 {  
             col = base  
             if i > 0 {  
                 fmt.Println()  
             }  
             fmt.Print(prefix)  
             fmt.Print("      ")  
             if i == 0 {  
                 fmt.Print("PIDs: ")  
             } else {  
                 fmt.Print("   ")  
             }  
         }  
         fmt.Print(strconv.Itoa(p) + " ")  
         col += len(strconv.Itoa(p)) + 1  
     }  
     fmt.Println()  
   
     // Recursively display the children namespaces  
   
     for _, v := range NSList[ns].children {  
         DisplayNamespaceTree(v, level+1)  
     }  
 }  
   
 func main() {  
   
     // Fetch a list of files from /proc  
   
     files, err := ioutil.ReadDir("/proc")  
     if err != nil {  
         fmt.Println("ioutil.ReadDir(): ", err)
         os.Exit(1)  
     }  
   
     // Process each /proc/PID (PID starts with a digit)  
   
     for _, f := range files {  
         if f.Name()[0] >= '0' && f.Name()[0] <= '9' {  
             ProcessProcFile(f.Name())  
         }  
     }  
   
     // Display the namespace tree rooted at the initial  
     // user namespace  
   
     DisplayNamespaceTree(initialNS, 0)  
 }  

The following (abbreviated) output shows what happens when we run the program on a system where there are a few user namespaces. (We must run the program with privilege so that we can access the /proc/PID/ns/user files of all users' processes.)

 $ sudo go run userns_overview.go   
 {3 4026531837}  
       PIDs: 1 2 3 5 7 8 9 10 11 12 13 14 15 16 17 18 20 21 22 23 24   
          25 26 28 29 30 31 32 33 34 36 37 38 39 40 41 42 43 44 45   
          ...  
          27101 27225 27245 27971 28142 28619 28870 28922 28995 29043   
          29109 29209 29279 29455 29466 29481 29489 29532 29533 29550   
    {3 4026532459}

        {3 4026532663}
                    PIDs: 29745 29749 29823 29847 
        {3 4026532450}

            {3 4026532662}
                        PIDs: 29746 

The output of the program is somewhat primitive, but employs indentation to show the hierarchical relationships between the user namespaces. In all, there are five user namespaces shown above.

The first few lines show the initial user namespace and its member processes. The other user namespaces were created by an instance of the Google Chrome browser. The namespace with the inode number 4026532459 is a child of the initial user namespace. That namespace in turn has two children (4026532663 and 4026532450), and the second of those namespaces in turn has a child (4026532662).
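
To make the parent-child structure concrete, here is a small stand-alone sketch that rebuilds the same indented tree from a list of child/parent links, using the inode numbers from the output above. The link type and the renderTree function are hypothetical names used only for this illustration; the real program discovers the links with NS_GET_PARENT rather than taking them as input.

```go
package main

import (
	"fmt"
	"strings"
)

// link records that the namespace with inode number 'child' has
// the namespace with inode number 'parent' as its parent.
type link struct {
	child, parent uint64
}

// renderTree returns one line per namespace, indented by four
// spaces per level, starting from 'root'. Children appear in the
// order in which their links were supplied.
func renderTree(links []link, root uint64) []string {
	kids := make(map[uint64][]uint64)
	for _, l := range links {
		kids[l.parent] = append(kids[l.parent], l.child)
	}

	var lines []string
	var walk func(ns uint64, level int)
	walk = func(ns uint64, level int) {
		lines = append(lines,
			strings.Repeat(" ", level*4)+fmt.Sprint(ns))
		for _, c := range kids[ns] {
			walk(c, level+1)
		}
	}
	walk(root, 0)
	return lines
}

func main() {
	// The hierarchy shown in the example run above
	links := []link{
		{4026532459, 4026531837},
		{4026532663, 4026532459},
		{4026532450, 4026532459},
		{4026532662, 4026532450},
	}
	for _, line := range renderTree(links, 4026531837) {
		fmt.Println(line)
	}
}
```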

The output also shows the PIDs of the processes that reside in each namespace. Two of the namespaces (inode numbers 4026532459 and 4026532450) have no member processes (but are pinned into existence by the presence of descendant user namespaces).

Some more details about the namespace introspection feature, as well as a simpler example program (in C), can be found in the namespaces(7) manual page.

2016-10-27

New Zealand Open Source Awards 2016 Special Prize

This is nice! For my work on the Linux man-pages project over the last 16 years, I was this week awarded the New Zealand Open Source Awards 2016 Special Prize (tweet, announcement on the NZOSA website). A video of the award announcement and my prerecorded acceptance speech is available on YouTube.

2016-10-22

Getting ready to say goodbye to some old friends

Anyone who spent time around me in the 2000s will have an idea of what the picture below is about.

As I wrote newer chapters of TLPI, I constantly reread chapters that I'd already written, red pen in hand. For years, I'd always have printed copies of a couple of chapters nearby to review, to fill spare moments while I was commuting, waiting for something, or just sitting in the Biergarten on a sunny weekend afternoon.

Each chapter got printed anything from around 6 to 12 times, and typically I'd read each printed copy twice, so that I could better detect ordering and forward-reference issues in the text. So, by the time the book was published, I'd read drafts of each chapter around 12 to 20 times. After reviewing each printed draft, I'd then integrate changes and improvements (sometimes informed by what I'd learned during the writing of later chapters) into the text.

I kept all those printed copies, in part because I wanted to get an idea of how some chapters progressed over time. All of the printed copies are sitting in the pile in the photo. Well, nearly all; after that photo was taken, I discovered another 15cm pile that was still sitting in my hometown in New Zealand.

Because those printouts were my constant companions for such a long time, and writing the book was so central to my life during those years, I haven't quite been able to bring myself to let them go. So, for the last few years, the chapters have been sitting in three banana cartons. But they have to go out sometime, and I think I'm nearly ready to let them go, though I'll probably keep drafts of a few chapters just as reminders.

2016-10-09

Seventh print run of TLPI

The files for the seventh print run of The Linux Programming Interface are heading off to the printer about now. The changes since the last print run are few: just 15 fixes (all minor) were required for this print run.

2016-10-06

Traditional Chinese translation of TLPI is published this month

The Traditional Chinese translation of TLPI is being published this month. This brings the number of translations of TLPI to four!

The translation is published in two volumes (1, 2) by the Taiwanese publisher GOTOP. My special thanks to Aaron MingYi Liao (廖明沂), who has worked very diligently and for a long time on the translation.