A: Steps to reproduce the bug
1: Connect gdb to kgdb (GDB was configured as "--host=i686-pc-linux-gnu --target=powerpc-linux-gnu"):
    (gdb) target remote udp:10.0.0.15:6443
2: Set a breakpoint at "module_event":
    (gdb) b module_event
3: Insert a module on the target.
At the point where the "module_event" breakpoint should be hit, we get the following error:
    [email protected]:/root> insmod /tmp/dummy.ko
    Unable to handle kernel paging request for instruction fetch
    Faulting instruction address: 0x7d82100800095c80
    Oops: Kernel access of bad area, sig: 11 [#1]
    PREEMPT NUMA LTT NESTING LEVEL : 0
    Maple
    Modules linked in: dummy(+) kgdboe
    NIP: 7d82100800095c80 LR: c00000000007a14c CTR: 7d82100800095c80
    REGS: c00000017817b9b0 TRAP: 0400 Not tainted (2.6.27.37-WR3.0.2as_standard-00080-gb14bbdf-dirty)
    MSR: 9000000040009032 <EE,ME,IR,DR> CR: 24002088 XER: 00000000
    TASK = c00000017a183180[2223] 'insmod' THREAD: c000000178178000
    GPR00: 7d82100800095c80 c00000017817bc30 c0000000005fa2e8 c000000000547c88
    GPR04: 0000000000000001 d00000000002d100 0000000024002022 c000000000011770
    GPR08: c00000017817b660 c0000000005b0360 c00000000060c300 0000000000000000
    GPR12: 0000000044002088 c00000000060c300 0000000000000000 000000001008a334
    GPR16: 00000000100b142c 00000000100ad5c0 00000000100eb278 00000000100ad72c
    GPR20: 00000000100eb2c0 00000000100e5030 0000000000000000 00000000100ad5c4
    GPR24: c000000000491940 0000000000000000 0000000000000001 d00000000002d100
    GPR28: 0000000000000000 fffffffffffffffc c000000000593ae0 0000000000000000
    NIP [7d82100800095c80] 0x7d82100800095c80
    LR [c00000000007a14c] .notifier_call_chain+0xcc/0x120
    Call Trace:
    [c00000017817bc30] [c00000000007a15c] .notifier_call_chain+0xdc/0x120 (unreliable)
    [c00000017817bce0] [c00000000007a520] .__blocking_notifier_call_chain+0x70/0xb0
    [c00000017817bd90] [c000000000089ae0] .SyS_init_module+0x100/0x260
    [c00000017817be30] [c00000000000852c] syscall_exit+0x0/0x40
    Instruction dump:
    XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
    XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
    ---[ end trace ff196d014336a31d ]---
    Segmentation fault
My colleague summed up the problem concisely:
    I disassembled the vmlinux file to see that the start of the module_event
    is as follows:

    (gdb) i line *0xc0000000000a1840
    Line 1622 of "/workspace/6101/build/linux/kernel/kgdb.c" starts at address 0xc0000000000a1840 and ends at 0xc0000000000a1850 .

    And yet gdb insists on putting a breakpoint at a different location.

    (gdb) i line *0xc0000000005b7b98
    No line number information available for address 0xc0000000005b7b98
    (gdb) i line module_event
    Line 1622 of "/workspace/6101/build/linux/kernel/kgdb.c" starts at address 0xc0000000000a1840 and ends at 0xc0000000000a1850 .

    [jl@prt-server5 linux-emer_atca6101-standard-build]$ grep module_event System.map
    c0000000000a1840 t .module_event
    c0000000005b7b98 d module_event

    So gdb is picking the "d" one... I don't know what the "d" means in the
    System.map file though.
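As a side note on the "d" question: System.map is generated from nm output, where "t" marks a symbol in the text (code) section and "d" one in the initialized data section, with lowercase meaning a local symbol. The snippet below is only a small sketch that parses the two lines quoted above; `parse_system_map` and `SYMBOL_KINDS` are names made up for this illustration.

```python
# Meanings follow the nm symbol-type letters; lowercase means a local symbol.
SYMBOL_KINDS = {"t": "text (code) section", "d": "initialized data section"}


def parse_system_map(text):
    """Return a list of (address, type_letter, name) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        address, kind, name = parts
        entries.append((int(address, 16), kind, name))
    return entries


# The two lines from the grep output quoted above.
system_map = """\
c0000000000a1840 t .module_event
c0000000005b7b98 d module_event
"""

for address, kind, name in parse_system_map(system_map):
    description = SYMBOL_KINDS.get(kind.lower(), "other")
    print(f"{name:>14} @ {address:#018x}: {description}")
```

So the symbol gdb picked lives in a data section, which already hints that it is not code.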
B: Analysis of the crash site
    "Unable to handle kernel paging request for instruction fetch
    Faulting instruction address: 0x7d82100800095c80"
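If we look closely at the address "7d82100800095c80", we can see that its leading "7d821008" is the instruction that triggers a breakpoint on the PPC platform. The most likely cause of this bug is that gdb/kgdb meant to modify the instruction that a pointer points to but, for some reason, modified the pointer itself. A concrete example:
    void *ptr;
    &ptr == 0x005b0360                     (where ptr itself is stored)
    ptr == *(0x005b0360) == 0x00095c80     (the address of the instruction)
The intent was to patch the instruction that ptr points to, i.e. to write 0x7d821008 over the code at 0x00095c80.
Because of some faulty operation, the pointer itself was overwritten instead, i.e. the value stored at 0x005b0360 was changed to 0x7d821008.
So the system ran into trouble when it followed ptr and tried to fetch instructions from 0x7d821008.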
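The observation about the faulting address can be checked mechanically. The following is a minimal sketch that splits the 64-bit address from the oops into two 32-bit words and compares the high word with the breakpoint instruction; `split_doubleword` and `TRAP_OPCODE` are names invented for this example.

```python
TRAP_OPCODE = 0x7D821008  # the PPC breakpoint instruction named in the text


def split_doubleword(value):
    """Split a 64-bit value into its high and low 32-bit words."""
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF


faulting_address = 0x7D82100800095C80  # taken from the oops message
high, low = split_doubleword(faulting_address)

print(f"high word: {high:#010x}")
print(f"low word:  {low:#010x}")
print("high word is the trap opcode:", high == TRAP_OPCODE)
```

The low word is left untouched, which is what one would expect if only the first four bytes of an 8-byte pointer had been overwritten.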
C: Root cause of the bug
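After a painful round of print-based debugging inside kgdb, I found nothing abnormal in kgdb at all.
With kgdb a dead end, I turned to gdb.
Generally speaking, gdb decides which value gets written to which address, and kgdb merely carries out the requested action. Since kgdb was behaving correctly,
that probably meant gdb had the address wrong: it resolved the address of the module_event function incorrectly, and that triggered the problem.
So I used objdump to dump the symbol addresses of vmlinux, grepped for the module_event symbol, and found the following: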
    *****************************************************************************
    ...
    c0000000005b0360 <module_event>:
    c0000000005b0360:       c0 00 00 00     lfs     f0,0(0)
    c0000000005b0364:       00 09 5c 80     .long 0x95c80
    c0000000005b0368:       c0 00 00 00     lfs     f0,0(0)
    c0000000005b036c:       00 5f a2 e8     .long 0x5fa2e8
    ...
    c000000000095c80 <.module_event>:
    c000000000095c80:       7c 08 02 a6     mflr    r0
    c000000000095c84:       fb c1 ff f0     std     r30,-16(r1)
    ...
    *****************************************************************************
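There are two symbols for module_event, and their relationship is fairly obvious:
the upper module_event looks like some kind of function symbol table entry whose content points to the real function address
(the first eight bytes at 0xc0000000005b0360 read c0 00 00 00 00 09 5c 80, i.e. 0xc000000000095c80), and
<.module_event> is the real function entry point address.
I checked the ppc64 ABI document and found the explanation for this.
Here are the key parts: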
    *****************************************************************************
    In PPC64 ABI, there is a function descriptors structure.

    PPC64 ABI Function Descriptors

    A function descriptor is a three doubleword data structure that contains
    the following values:
    * The first doubleword contains the address of the entry point of the function.
    * The second doubleword contains the TOC base address for the function.
    * The third doubleword contains the environment pointer for languages such
      as Pascal and PL/1.

    For an externally visible function, the value of the symbol with the same
    name as the function is the address of the function descriptor. Symbol
    names with a dot (.) prefix are reserved for holding entry point addresses.
    The value of a symbol named ".FN" is the entry point of the function "FN".

    The value of a function pointer in a language like C is the address of the
    function descriptor.
    *****************************************************************************
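To tie the ABI text back to the objdump listing shown earlier, here is a small sketch that decodes the 16 bytes printed at c0000000005b0360 as the first two big-endian doublewords of a function descriptor. The listing does not show the third doubleword, so it is not decoded; `parse_descriptor` is a name made up for this example.

```python
import struct


def parse_descriptor(raw):
    """Decode the first two big-endian doublewords of a function descriptor."""
    entry, toc = struct.unpack(">QQ", raw[:16])
    return entry, toc


# The 16 bytes objdump printed at c0000000005b0360 <module_event>.
raw = bytes.fromhex("c0000000" "00095c80" "c0000000" "005fa2e8")
entry, toc = parse_descriptor(raw)

print(f"entry point: {entry:#018x}")
print(f"TOC base:    {toc:#018x}")
```

The decoded entry point is the address of <.module_event> in the listing, and the decoded TOC base equals the third value on the GPR00 line of the oops (r2, the register that holds the TOC pointer on ppc64), which fits the descriptor layout quoted above.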
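For more information about the ppc64 ABI, see the ppc64 ABI document.
So "c0000000005b0360 <module_event>" is the function descriptor of module_event, not its code.
At this point everything became clear: that fool gdb took 0xc0000000005b0360 to be the address of the module_event function and wrote the breakpoint instruction there.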
    *****************************************************************************
    c0000000005b0360 <module_event>:
    c0000000005b0360:       7d 82 10 08     ******----> here was modified to "7d 82 10 08"
    c0000000005b0364:       00 09 5c 80     .long 0x95c80
    c0000000005b0368:       c0 00 00 00     lfs     f0,0(0)
    c0000000005b036c:       00 5f a2 e8     .long 0x5fa2e8
    ...
    c000000000095c80 <.module_event>:
    c000000000095c80:       7c 08 02 a6     mflr    r0
    c000000000095c84:       fb c1 ff f0     std     r30,-16(r1)
    ...
    *****************************************************************************
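As a result, when the system read the entry address out of the function descriptor and went to fetch instructions from it, it accessed an invalid address and crashed...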
    The correct behavior for gdb would be:
    *****************************************************************************
    ...
    c0000000005b0360 <module_event>:
    c0000000005b0360:       c0 00 00 00     lfs     f0,0(0)
    c0000000005b0364:       00 09 5c 80     .long 0x95c80
    c0000000005b0368:       c0 00 00 00     lfs     f0,0(0)
    c0000000005b036c:       00 5f a2 e8     .long 0x5fa2e8
    ...
    c000000000095c80 <.module_event>:
    c000000000095c80:       7d 82 10 08     ********modify here to "7d 82 10 08"*******
    c000000000095c84:       fb c1 ff f0     std     r30,-16(r1)
    ...
    *****************************************************************************
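The two listings can be replayed in a toy model. The sketch below is not gdb or kgdb code: it keeps the descriptor and the first instruction in a small dictionary-based "memory" (an assumption made purely for illustration, as are all function names) and shows where a call through the descriptor ends up after the breakpoint is written to each of the two addresses.

```python
import struct

DESCRIPTOR_ADDR = 0xC0000000005B0360  # module_event (function descriptor)
ENTRY_ADDR = 0xC000000000095C80       # .module_event (actual code)
TRAP = bytes.fromhex("7d821008")      # PPC breakpoint instruction


def make_memory():
    """Toy memory: address -> bytearray for the two regions of interest."""
    return {
        DESCRIPTOR_ADDR: bytearray(struct.pack(">Q", ENTRY_ADDR)),
        ENTRY_ADDR: bytearray.fromhex("7c0802a6"),  # mflr r0
    }


def insert_breakpoint(memory, address):
    """Overwrite the 4 bytes at `address` with the trap instruction."""
    memory[address][0:4] = TRAP


def call_target(memory):
    """Address a call through the descriptor jumps to (its first doubleword)."""
    return struct.unpack(">Q", memory[DESCRIPTOR_ADDR][:8])[0]


wrong = make_memory()
insert_breakpoint(wrong, DESCRIPTOR_ADDR)  # what gdb did
print(f"breakpoint on descriptor -> jump to {call_target(wrong):#018x}")

right = make_memory()
insert_breakpoint(right, ENTRY_ADDR)       # what gdb should have done
print(f"breakpoint on entry      -> jump to {call_target(right):#018x}")
```

In the first case the jump target becomes the faulting address from the oops; in the second the descriptor is intact and the trap sits at the first instruction of the function.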
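D: Fixing the bug
Change gdb's function symbol resolution rules for the ppc64 architecture so that it obtains the real function entry address rather than the address of the function descriptor.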
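The idea behind such a resolution rule can be sketched as follows. This is not the actual gdb change, only an illustration under simplifying assumptions: the symbol table is a plain dictionary, memory is read through a caller-supplied function, and every plain symbol without a dot-prefixed twin is assumed to be a descriptor. All names here (`resolve_entry_point`, `read_u64`, the sample tables) are hypothetical.

```python
import struct


def resolve_entry_point(symbols, name, read_u64):
    """Pick the address where a breakpoint on function `name` belongs.

    symbols:  dict mapping symbol name -> address
    read_u64: callable returning the big-endian doubleword at an address
    """
    dot_name = "." + name
    if dot_name in symbols:
        # ".FN" holds the entry point of "FN" under the ppc64 ABI.
        return symbols[dot_name]
    # Otherwise treat the plain symbol as a descriptor and follow
    # its first doubleword to the entry point.
    return read_u64(symbols[name])


# Values taken from the objdump listing in section C.
symbols = {
    "module_event": 0xC0000000005B0360,
    ".module_event": 0xC000000000095C80,
}
memory = {0xC0000000005B0360: struct.pack(">Q", 0xC000000000095C80)}


def read_u64(address):
    """Read a big-endian doubleword from the toy memory."""
    return struct.unpack(">Q", memory[address])[0]


print(f"{resolve_entry_point(symbols, 'module_event', read_u64):#018x}")

# Same lookup when only the descriptor symbol is known.
only_descriptor = {"module_event": 0xC0000000005B0360}
print(f"{resolve_entry_point(only_descriptor, 'module_event', read_u64):#018x}")
```

Both lookups yield the address of <.module_event>, which is where the breakpoint instruction belongs.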