Introduction
We often get bug reports on the
parisc-linux mailing list about problems people have with their kernels. This document
attempts to point out what is the minimal amount of information
that needs to be included in order for the kernel developers to
be able to try to track down the problem.
If you have never posted questions/bug reports to a mailing
list before, you must read
How-To Ask Smart Questions first.
The Basics
Information you should always include in a bug report are:
- Version of the kernel you are running (output of the
uname -a
command)
- What compiler and linker did you use to build the kernel? (gcc 3.0.4? 3.2? What version of binutils? Are you cross-compiling?)
- What machine you are booting the kernel on? (712/80? B132L? C3000?)
- Does it have a remote management card?
- Are you trying to use serial console or graphics console?
- What kernel command line is palo using?
- Are you using a 32-bit or 64-bit kernel
- Register dumps -
Usually, when a parisc-linux kernel crashes (or encounters a problem),
a register dump (and/or a stack dump -- especially in 2.6 kernels) will
be displayed on the console, or logged to syslog. You should include
this information in your bug report.
- The System.map file that corresponds to the kernel you are booting -
it's preferable if you can put it someplace and supply a URL for it. If not,
please bzip the file before sending it to the list
- Console output immediately preceding the crash or problem you observed
- If you are not using a default config, the .config you used to build your
kernel
HPMC
A HPMC (High-Priority Machine Check) is triggered when the hardware
detects an illegal operation. When a HPMC occurs, the kernel may not
be able to print out a register dump. In this case, the firmware should
have written a copy of the registers into NVRAM. On the next reboot
of the machine, you can recover the saved HPMC data from the firmware
prompt.
The command ser pim
at the firmware prompt will
usually bring up the info needed. Capture the register dump of the HPMC
output, together with any information reported about requestor/responder
addresses, etc in your report. If you are booting a SMP kernel,
remember to include the output for all the processors in your report.
After you have captured the output, issue a ser clearpim
command to clear the saved data. This will ensure that you do not
accidentally pick up stale data the next time around.
"Hung" kernels
If a kernel appears "hung", you may be able to determine
where it's stuck by using the TOC (Transfer of Control) feature of PA
firmware. A TOC is usually triggered by a small button on the back of
the machine. On systems with GSP, you can also do this by issuing the
'TC' command at the GSP prompt. After a TOC is triggered, the machine
should automatically reboot.
Similar to HPMC data, the TOC data can be recovered from the firmware
prompt. Enter 'ser pim' to display the saved data, and copy down
the TOC section of the output in your report.
Note: Once you've captured the output of HPMC or TOC, it may
be useful to issue the command 'ser clearpim' to clear the saved data.
This way, next time when you look at the output of 'ser pim' you can
be sure you are not looking at stale data.
More advanced bug reports
The following information, if available, will help with the
debugging process. In addition to the information included
in the "Basic" section above, you may also try to collect
the following pieces of information.
Let's look at a typical register dump:
Kernel Fault: Code=26 regs=15fb07c0 (Addr=00000000)
YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00000000000001001111111100001111 Not tainted
r00-03 00000000 00000000 101f0814 102a7ce4
r04-07 00000500 102a7ce4 00002000 10061c00
r08-11 10362010 10029c60 00000400 1029e940
r12-15 000000a0 00000140 00000040 00000000
r16-19 00005000 15fb04c8 15f4d000 00000013
r20-23 00000000 00000000 10029c60 00000001
r24-27 00000000 00080000 00000000 1027e010
r28-31 00000000 00000400 15fb07c0 101f4264
sr0-3 00000000 00000000 00000000 00000000
sr4-7 00000000 00000000 00000000 00000000
IASQ: 00000000 00000000 IAOQ: 101f4290 101f4294
IIR: 0ea01085 ISR: 00000000 IOR: 00000000
CPU: 0 CR30: 15fb0000 CR31: 102b7000
ORIG_R28: 00000000
Several of these numbers are especially important:
- The "code" at the beginning of the register dump
indicates the type of "trap"
- The r02 register is the "return pointer" - it
indicates the calling function
- The IAOQ contents specify the address of the
instruction where the fault occured.
- IIR is the faulting instruction
- IOR, for a memory access, indicates the memory
location being accessed
For IAOQ and R02 addresses, it is helpful to understand
the following:
- Kernel text section addresses are of the form 0x1xxxxxxx
- User text section addresses are usually low
- Shared libraries are loaded at 0x4xxxxxxx
- Stack addresses are at 0xfxxxxxxx
- palo addresses are at 0x06xxxxxx
If the address you get is in the kernel section, you can find the
function that address belongs to by looking at a System.map file.
There are two ways to do this, either manually, or with the 'astk'
tool in parisc-linux cvs (build-tools module). Here, I will
only describe the manual way.
Suppose you get an address 0x1014f818 for IAOQ. Your System.map
will contain entries like this:
...
1014f510 t specific_send_sig_info
1014f678 t .L1157
1014f6b8 t .L1152
1014f790 T force_sig_info
1014f848 t specific_force_sig_info
1014f8f8 T __broadcast_thread_group
...
You want to find the largest address that is smaller than the one you
are looking for, ignoring .Lxxxx entries. In this case, the
function involved is force_sig_info
You should do this lookup for the IAOQ and GR02 addresses if
they are in kernel space.
If the address is in userspace, and it belongs in the shared libs
region, the output of 'ldd' on the offending program is also useful.
Another item that is sometimes useful is the instruction pointed to
by IIR. Unfortunately, there is no easy way to decode the number
to the symbolic instruction (unless you are Lamont :-) One way
that I learned from Richard Hirst is shown below:
legolas[12:00] ~% gdb /bin/ls # most any binary would do....
(gdb) break main
Breakpoint 1 at 0x11c30
(gdb) run
Starting program: /bin/ls
Breakpoint 1, 0x00011c30 in main ()
(gdb) set *(int *)0x11c30 = 0x0ea01085 # we store the instruction we want to decode, 0ea01085 in this example, at the beginning of main
(gdb) x/i 0x11c30 # and then ask gdb to decode it for us....
0x11c30 <main+136>: ldw 0(sr0,r21),r5
(gdb) quit
So, here, the instruction is a ldw from r21. In this particular
example, r21 is 0, so we have a null pointer dereference of some kind.
If the source/target or a load/store instruction is a non-zero register,
then you should see if that address is again a kernel address. If so, you
can look it up in System.map using the procedure above as well. This
is particularly useful to determine which spinlock is being contended, etc
Last revised: Sat, 9 Nov 2002 14:03:14 -0700