White Paper: Red Hat Crash Utility


< Prev Contents Next >

Crash Usage: A Case Study

The steps taken to debug a kernel crash dump are not etched in stone, and the crash commands used to debug a kernel issue vary according to the problem exhibited. The section contains of a case study that shows how the capabilities of the crash utility were used to to debug a specific kernel problem. However, before doing so, it should be noted that the following commands are typically the most commonly-used:

bt Display the backtrace of the current context, or as specified with arguments. This command is typically the first command entered after starting a dumpfile session. Since the initial context is the panic context, it will show the function trace leading up to the kernel panic. bt -a will show the trace of the active task on each CPU, since there may be an interrelationship between the panicking task on one CPU and the running task(s) on the other CPU(s). When bt is given as the argument to foreach. displays the backtraces of all tasks.
struct Print the contents of a data structure at a specified address. This command is so common that it is typically unnecessary to enter the struct command name on the command line; if the first command line argument is not a crash or gdb command, but it is the name of a known data structure, then all the command line arguments are passed to the struct command. So for example, the following two commands yield the same result:

crash> struct vm_area_struct d3cb2600

crash> vm_area_struct d3cb2600
set Set a new task context by PID, task address, or cpu. Since several crash commands are context-sensitive, it's helpful to be able to change the context to avoid having to pass the PID or task address to those context-sensitive commands in order to access the data of a task that is not the current context.
p Prints the contents of a kernel variable; since it's a gateway to the print command of the mbedded gdb module, it can also be used to print complex C language expressions.
rd Read memory, which may be either kernel virtual, user virtual, or physical, and display it several different formats and sizes.
ps Lists basic task information for each process; it can also display parent and child hierarchies.
log Dump the kernel log_buf, which often contains clues leading up to a subsequent kernel crash.
foreach Execute a crash command on all tasks, or those specified, in the system; can be used with bt, vm, task, files, net, set, sig and vtop.
files Dump the open file descriptor data of a task; most usefully, the file, dentry and inode structure addresses for each open file descriptor.
vm Dump the virtual memory map of a task, including the vital information concerning each vm_area_struct making up a task's address space. It can also dump the physical address of each page in the address space, or if not mapped, its location in a file or on the swap device.


A Case Study: "kernel BUG at pipe.c:120!"

Upon bringing up a crash session, a great deal of information can be gained just by the invocation data. Here is what what displayed in this particular case:

...
      KERNEL: vmlinux-2.4.9-e.10.13enterprise-g
    DUMPFILE: vmcore-incomplete
        CPUS: 2
        DATE: Mon Feb 17 08:20:56 2003
      UPTIME: 4 days, 20:04:41
LOAD AVERAGE: 0.95, 1.04, 1.25
       TASKS: 110
    NODENAME: testbox.redhat.com
     RELEASE: 2.4.9-e.10.13enterprise
     VERSION: #1 SMP Mon Feb 3 12:59:26 EST 2003
     MACHINE: i686  (2788 Mhz)
      MEMORY: 6 GB
       PANIC: "kernel BUG at pipe.c:120!"
         PID: 20571
     COMMAND: "imp"
        TASK: d1566000
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash>

In this case the PANIC string "kernel BUG at pipe.c:120!" points to the exact kernel source code line at which the panic occurred.

Then, getting a backtrace of panicking task is typically the first order of the day:

crash> bt
PID: 20571  TASK: d1566000  CPU: 1   COMMAND: "imp"
 #0 [d1567e44] die at c010785c
 #1 [d1567e54] do_invalid_op at c0107b2c
 #2 [d1567f0c] error_code (via invalid_op) at c01073de
    EAX: 0000001d  EBX: ed87b2e0  ECX: c02f6064  EDX: 00005fa1  EBP: 00001000
    DS:  0018      ESI: f640e740  ES:  0018      EDI: 00001000
    CS:  0010      EIP: c0150b6d  ERR: ffffffff  EFLAGS: 00010292
 #3 [d1567f48] pipe_read at c0150b6d
 #4 [d1567f6c] sys_read at c01468d4
 #5 [d1567fc0] system_call at c01072dc
    EAX: 00000003  EBX: 0000000a  ECX: 40b4e05c  EDX: 00002000
    DS:  002b      ESI: 00002000  ES:  002b      EDI: 40b4e05c
    SS:  002b      ESP: bffe9e88  EBP: bffe9eb8
    CS:  0023      EIP: 40aaa1d4  ERR: 00000003  EFLAGS: 00000286

The backtrace shows that the call to die() was generated by an invalid_op exception. The exception was caused by the BUG() call in the pipe_read() function:

        if (count && PIPE_WAITING_WRITERS(*inode) &&
                !(filp->f_flags & O_NONBLOCK)) {
                /*
                 * We know that we are going to sleep: signal
                 * writers synchronously that there is more
                 * room.
                 */
                wake_up_interruptible_sync(PIPE_WAIT(*inode));
                if (!PIPE_EMPTY(*inode))
                        BUG();
                goto do_more_read;
        }

In the code segment above, the pipe_read() code has previously down'd the semaphore of the inode associated with the pipe, giving it exclusive access. It had read all data in the pipe, but still needed more to satisfy the count requested. Finding that there was a writer with more data -- and who was waiting on the semaphore -- it woke up the writer. However, after doing the wakeup, it did a sanity-check on the pipe contents, and found that it was no longer empty -- which is theoretically impossible since it was still holding the semaphore. It appeared that the writer process wrote to the pipe while the reader process still had exclusive access -- somehow overriding the semaphore.

Since the semaphore mechanism was seemingly not working, it was first necessary to look at the actual semaphore structure associated with the pipe's inode. This first required looking at the first argument to the pipe_read() function; the whatis command shows that it is a struct file pointer:

crash> whatis pipe_read
ssize_t pipe_read(struct file *, char *, size_t, loff_t *);
crash>

Using the bt -f option, each frame in the backtrace is expanded to show all stack data in the frame. Looking at the expansion of the sys_read() frame, we can see that the last thing pushed on the stack before calling pipe_read() was the file pointer address of edf3f740:

...
#3 [d1567f48] pipe_read at c0150b6d
    [RA: c01468d6  SP: d1567f4c  FP: d1567f6c  SIZE: 36]
    d1567f4c: c026701c  00000078  fffffff2  00001000
    d1567f5c: 00000000  edf3f740  ffffffea  00002000
    d1567f6c: c01468d6
 #4 [d1567f6c] sys_read at c01468d4
    [RA: c01072e3  SP: d1567f70  FP: d1567fc0  SIZE: 84]
    d1567f70: edf3f740  40b4f05c  00002000  edf3f760
    d1567f80: c03683d0  fffffffb  00000001  c0120d3b
    d1567f90: 00000046  00000046  0000000b  c0350960
    d1567fa0: 0000000b  f639eb00  c0108e0e  00000020
    d1567fb0: d1566000  00002000  40b4e05c  bffe9eb8
    d1567fc0: c01072e3
...

The task at hand is finding the inode containing the suspect semaphore from the file structure address. The file structure's f_dentry member points to its dentry structure, whose d_inode member in turn points to the pipe's inode. The struct command can be used to dump the complete contents of a data structure at a given address; by tagging the .member onto the structure name, we can print just the member desired. By following the structure chain, the inode address can be determined like so:

crash> struct file.f_dentry edf3f740
  f_dentry = 0xdb0ec440,
crash> struct dentry.d_inode db0ec440
  d_inode = 0xf640e740,
crash> struct inode.i_sem f640e740
  i_sem = {
    count = {
      counter = 2
    },
    sleepers = 0,
    wait = {
      lock = {
        lock = 1
      },
      task_list = {
        next = 0xf640e7ac,
        prev = 0xf640e7ac
      }
    }
  },
crash>

The dump of the semaphore structure above showed the problem: the counter value of 2 is illegal. It should never be greater than 1; in this case a value of 2 allows two successful down operations, i.e., giving two tasks access to the pipe at the same time.

(As an aside, determining the inode address above could also be accomplished by using the context-sensitive files command, which dumps the associated file, dentry and inode structure addresses for each open file descriptor of a task. The dumped file descriptor list would contain one with a reference to the file structure at edf3f740, and would also show the associated inode address of f640e740.)

Before getting a dumpfile, this same panic had occurred several times. It was erroneously presumed that the problem was in the pipe-handling code, but it was eventually determined not to be the case. By instrumenting a kernel with debug code, the starting counter value of a pipe was found to be 3. Compounding that problem was the fact that the inode slab cache is one of a few special cases that presume that the freed inode's contents are left in a legitimate state so that they do not have to be completely reinitialized with each subsequent reallocation. So when the pipe's inode was created, it received an inode with a bogus counter value.

Confirming the existence of bogus inode structures in the slab cache was a multi-stepped procedure. Using the command kmem command to access the inode slab cache, we can get the addresses of all free and currently-allocated inodes. Since there are typically several thousand inodes, the output is extremely verbose, but here is the beginning of it:

crash> kmem -S inode_cache
CACHE    NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
c7666564 inode_cache              448      11563     12339   1371     4k
SLAB      MEMORY    TOTAL  ALLOCATED  FREE
d1d82000  d1d82040      9          9     0
FREE / [ALLOCATED]
  [d1d82040]
  [d1d82200]
  [d1d823c0]
  [d1d82580]
  [d1d82740]
  [d1d82900]
  [d1d82ac0]
  [d1d82c80]
  [d1d82e40]
SLAB      MEMORY    TOTAL  ALLOCATED  FREE
f4e52000  f4e52040      9          7     2
FREE / [ALLOCATED]
   f4e52040  (cpu 1 cache)
   f4e52200  (cpu 1 cache)
  [f4e523c0]
  [f4e52580]
  [f4e52740]
  [f4e52900]
  [f4e52ac0]
  [f4e52c80]
  [f4e52e40]
...

In the truncated output above, all of the inode address in the slab cache are dumped; the ones currently in use are surrounded by brackets, the free ones are not. So, for example, the inodes at addresses f4e52040 and f4e52200 are free; the others are not. The full output was piped to a script that pulled out just the free inode addresses (i.e., output lines starting with three spaces), and redirected them into a file. The file was modified to be a crash input file by making each extracted inode address to be the arguments of the struct command, using its short-cut method that allows the dropping of the struct command name; therefore the input file contained hundreds of crash commands of the form:

inode.i_sem f4e52040
inode.i_sem f4e52200
inode.i_sem f5cdc040
inode.i_sem f5cdc200
inode.i_sem f5cdc3c0
inode.i_sem f5cdc580
...

Note that the struct command would be used by default above, as documented in its help page; if the first command line argument is not a crash or gdb command, but it is the name of a known data structure, it passes the arguments to the struct command.

Using the capability of feeding an input file, in this case consisting of hundreds of short-cut struct commands like those above, the output was again quite verbose, consisting of structure member dumps of the form:

crash> < input.file
crash> inode.i_sem f4e52040
  i_sem = {
    count = {
      counter = 1
    },
    sleepers = 0,
    wait = {
      lock = {
        lock = 1
      },
      task_list = {
        next = 0xf4e520ac,
        prev = 0xf4e520ac
      }
    }
  },
crash> inode.i_sem f4e52200
  i_sem = {
    count = {
      counter = 1
    },
    sleepers = 0,
    wait = {
      lock = {
        lock = 1
      },
      task_list = {
        next = 0xf4e5226c,
        prev = 0xf4e5226c
      }
    }
  },
...

However, it was a simple matter of piping the output to grep, and looking for counter values not equal to 1:

crash> < input.file | grep counter | grep -v "= 1"
      counter = 3
      counter = 3
      counter = 3
      counter = 3
crash> 

This turned out to be the smoking gun. Another round of debugging with an instrumented kernel that trapped attempts to free an inode with a semaphore counter of 3 caught the perpetrator in the act.


< Prev Contents Next >