kgdb bug-hunting diary: setting a breakpoint on ppc64

A: Steps to reproduce the bug

1: Connect gdb to kgdb (gdb was configured as "--host=i686-pc-linux-gnu --target=powerpc-linux-gnu"):

(gdb) target remote udp:10.0.0.15:6443

2: Set a breakpoint at "module_event":

(gdb) b module_event

3: Insert a module on the target.
Inserting the module should trigger the "module_event" breakpoint; instead we get the following error:

[email protected]:/root> insmod /tmp/dummy.ko 
Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x7d82100800095c80
Oops: Kernel access of bad area, sig: 11 [#1]
PREEMPT NUMA LTT NESTING LEVEL : 0 
Maple
Modules linked in: dummy(+) kgdboe
NIP: 7d82100800095c80 LR: c00000000007a14c CTR: 7d82100800095c80
REGS: c00000017817b9b0 TRAP: 0400   Not tainted  (2.6.27.37-WR3.0.2as_standard-00080-gb14bbdf-dirty)
MSR: 9000000040009032 <EE,ME,IR,DR>  CR: 24002088  XER: 00000000
TASK = c00000017a183180[2223] 'insmod' THREAD: c000000178178000
GPR00: 7d82100800095c80 c00000017817bc30 c0000000005fa2e8 c000000000547c88 
GPR04: 0000000000000001 d00000000002d100 0000000024002022 c000000000011770 
GPR08: c00000017817b660 c0000000005b0360 c00000000060c300 0000000000000000 
GPR12: 0000000044002088 c00000000060c300 0000000000000000 000000001008a334 
GPR16: 00000000100b142c 00000000100ad5c0 00000000100eb278 00000000100ad72c 
GPR20: 00000000100eb2c0 00000000100e5030 0000000000000000 00000000100ad5c4 
GPR24: c000000000491940 0000000000000000 0000000000000001 d00000000002d100 
GPR28: 0000000000000000 fffffffffffffffc c000000000593ae0 0000000000000000 
NIP [7d82100800095c80] 0x7d82100800095c80
LR [c00000000007a14c] .notifier_call_chain+0xcc/0x120
Call Trace:
[c00000017817bc30] [c00000000007a15c] .notifier_call_chain+0xdc/0x120 (unreliable)
[c00000017817bce0] [c00000000007a520] .__blocking_notifier_call_chain+0x70/0xb0
[c00000017817bd90] [c000000000089ae0] .SyS_init_module+0x100/0x260
[c00000017817be30] [c00000000000852c] syscall_exit+0x0/0x40
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
---[ end trace ff196d014336a31d ]---
Segmentation fault

My colleague's concise description of the problem:

I disassembled the vmlinux file to see that the start of the module_event is as follows:
(gdb) i line *0xc0000000000a1840
Line 1622 of "/workspace/6101/build/linux/kernel/kgdb.c"
starts at address 0xc0000000000a1840
and ends at 0xc0000000000a1850 .
And yet gdb insists on putting a breakpoint at a different location.
(gdb) i line *0xc0000000005b7b98
No line number information available for address
0xc0000000005b7b98
(gdb) i line module_event
Line 1622 of "/workspace/6101/build/linux/kernel/kgdb.c"
starts at address 0xc0000000000a1840
and ends at 0xc0000000000a1850 .
 
[jl@prt-server5 linux-emer_atca6101-standard-build]$ grep module_event System.map
c0000000000a1840 t .module_event
c0000000005b7b98 d module_event
So gdb is picking the "d" one... I don't know what the "d" means in the System.map file though.
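
The "d" question has a standard answer: System.map uses the symbol type letters of nm, where "t"/"T" marks a symbol in the text (code) section and "d"/"D" marks one in the initialized data section (lowercase means a local symbol). As a minimal sketch, the hypothetical helper `classify` below parses the two grep lines above and labels each symbol; the helper name and the lookup table are illustrative assumptions, not part of any kernel tooling.

```python
# Symbol type letters as documented for nm(1); only the two needed here.
NM_TYPES = {
    "t": "text (code) section",
    "d": "initialized data section",
}

def classify(system_map_text):
    """Return {symbol_name: (address, description)} for each System.map line."""
    result = {}
    for line in system_map_text.strip().splitlines():
        addr, kind, name = line.split()
        result[name] = (int(addr, 16), NM_TYPES.get(kind.lower(), "other"))
    return result

# The two lines from the grep output above.
SYSTEM_MAP = """
c0000000000a1840 t .module_event
c0000000005b7b98 d module_event
"""

symbols = classify(SYSTEM_MAP)
for name, (addr, desc) in symbols.items():
    print(f"{name:15s} {addr:#018x}  {desc}")
```

So the undotted module_event lives in data, not code, which is already a hint that it is not something a breakpoint should be written into.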

B: Analyzing the crash scene

"Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x7d82100800095c80"

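If we look closely at the address "7d82100800095c80", we can see that its leading "7d821008" is the instruction used to trigger a breakpoint on the PPC platform. The most likely cause of this bug is that gdb/kgdb meant to modify the instruction that a pointer points to but, for some reason, modified the pointer value itself. For example: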

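void *ptr;                              /* a pointer to an instruction */
&ptr == 0x005b0360                      /* address where ptr itself is stored */
ptr  == *(0x005b0360) == 0x00095c80     /* value of ptr: address of the instruction */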

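The intent was to modify what ptr points to, that is, to replace the instruction stored at 0x00095c80 with 0x7d821008.

Because of some faulty operation, the pointer itself was modified instead: the value stored at 0x005b0360 was overwritten with 0x7d821008.
So when the system dereferenced ptr and tried to fetch instructions from the corrupted address, it faulted.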
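
The observation about the faulting address can be checked with a little arithmetic. The sketch below splits the 64-bit address from the oops into its two 32-bit halves; the constant names are illustrative only.

```python
FAULT_ADDR = 0x7D82100800095C80   # "Faulting instruction address" from the oops
BREAK_INSTR = 0x7D821008          # PPC breakpoint instruction noted above

upper = FAULT_ADDR >> 32          # most significant 32 bits
lower = FAULT_ADDR & 0xFFFFFFFF   # least significant 32 bits

print(f"upper word: {upper:#010x}")
print(f"lower word: {lower:#010x}")
print("upper word is the breakpoint instruction:", upper == BREAK_INSTR)
```

The upper half is the breakpoint instruction and the lower half looks like the tail of an ordinary kernel address, which is what one would expect if a 4-byte breakpoint had been written over the first half of an 8-byte pointer.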

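C: Root cause of the bug

After a painstaking round of print debugging inside kgdb, I found nothing abnormal in kgdb.

With kgdb a dead end, I turned to gdb.

Generally speaking, gdb decides which value is written to which address, and kgdb merely carries out the request. Since kgdb was executing correctly,
that probably meant gdb had the address wrong: it resolved the wrong address for the module_event function, which then triggered the problem.

So I used objdump to dump the symbol addresses of vmlinux, grepped for the module_event symbol, and found the following: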

*****************************************************************************
...
c0000000005b0360 <module_event>:
c0000000005b0360:       c0 00 00 00     lfs     f0,0(0)
c0000000005b0364:       00 09 5c 80     .long 0x95c80
c0000000005b0368:       c0 00 00 00     lfs     f0,0(0)
c0000000005b036c:       00 5f a2 e8     .long 0x5fa2e8
...
c000000000095c80 <.module_event>:
c000000000095c80:       7c 08 02 a6     mflr    r0
c000000000095c84:       fb c1 ff f0     std     r30,-16(r1)
...
*****************************************************************************

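There are two symbols for module_event, and their relationship is evident:
the upper module_event looks like some kind of function symbol table entry, and its contents point to the real function address.

(* 0xc0000000005b0360 ) -> c000000000095c80 <.module_event>
<.module_event> is the real function entry point address.

I checked the ppc64 ABI document and found the explanation for the above.
Here are the key passages: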

*****************************************************************************
In the PPC64 ABI, there is a function descriptor structure.
 
PPC64 ABI Function Descriptors
A function descriptor is a three doubleword data structure that contains the following values:
    * The first doubleword contains the address of the entry point of the function.
    * The second doubleword contains the TOC base address for the function.
    * The third doubleword contains the environment pointer for languages such as Pascal and PL/1.
 
For an externally visible function, the value of the symbol with the same name as the function is the address of the function descriptor. Symbol names with a dot (.) prefix are reserved for holding entry point addresses. The value of a symbol named ".FN" is the entry point of the function "FN".
 
The value of a function pointer in a language like C is the address of the function descriptor.
*****************************************************************************
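
To make the descriptor layout concrete, here is a minimal sketch that decodes the bytes objdump printed at <module_event> as three big-endian doublewords. The listing only shows the first 16 bytes, so the third doubleword (the environment pointer) is filled with zero purely as an assumption.

```python
import struct

# First 16 bytes at <module_event>, copied from the objdump listing above.
# The third doubleword is not shown in the listing; zero is an ASSUMPTION.
raw = bytes.fromhex("c0000000" "00095c80" "c0000000" "005fa2e8") + bytes(8)

# ">QQQ": three big-endian unsigned 64-bit doublewords.
entry, toc, env = struct.unpack(">QQQ", raw)

print(f"entry point : {entry:#018x}")   # first doubleword
print(f"TOC base    : {toc:#018x}")     # second doubleword
print(f"environment : {env:#018x}")     # third doubleword (assumed zero)
```

The decoded entry point is the address of <.module_event>, and the decoded TOC base equals the GPR02 value in the oops register dump, which is consistent with r2 holding the TOC pointer under this ABI. It also explains why objdump shows nonsense such as "lfs f0,0(0)" at that location: it is disassembling data, not code.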

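For more information about the ppc64 ABI, see the ppc64 ABI document.

So "c0000000005b0360" is the function descriptor, and the address it points to, "c000000000095c80 <.module_event>", is the real function address.

At this point everything became clear: that silly gdb treated 0xc0000000005b0360 as the address of the module_event function and wrote the breakpoint instruction there.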

*****************************************************************************
c0000000005b0360 <module_event>:
c0000000005b0360:       7d 82 21 08     ******----> here was modified to "7d 82 21 08"
c0000000005b0364:       00 09 5c 80     .long 0x95c80
c0000000005b0368:       c0 00 00 00     lfs     f0,0(0)
c0000000005b036c:       00 5f a2 e8     .long 0x5fa2e8
...
c000000000095c80 <.module_event>:
c000000000095c80:       7c 08 02 a6     mflr    r0
c000000000095c84:       fb c1 ff f0     std     r30,-16(r1)
...
*****************************************************************************

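As a result, when the system read the entry address from the function descriptor and went to fetch instructions there, it accessed an invalid address and crashed. The first word of the descriptor had become 7d821008 while the second word, 00095c80, was untouched, so the entry address read back was 0x7d82100800095c80, exactly the faulting address in the oops.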
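
The corruption can be replayed on a toy model of memory. The sketch below is an illustration under simplifying assumptions: memory is a dictionary with just two entries, and `set_breakpoint` is a hypothetical stand-in for what the debugger asks the target to do, not the actual gdb or kgdb code.

```python
import struct

BREAK_INSTR = 0x7D821008
DESC_ADDR = 0xC0000000005B0360    # <module_event>, the function descriptor
ENTRY_ADDR = 0xC000000000095C80   # <.module_event>, the real entry point

def fresh_memory():
    """Toy big-endian memory: the descriptor's first doubleword and the first instruction."""
    return {
        DESC_ADDR: bytearray(struct.pack(">Q", ENTRY_ADDR)),
        ENTRY_ADDR: bytearray.fromhex("7c0802a6"),  # mflr r0
    }

def set_breakpoint(memory, addr):
    """Overwrite the first 4 bytes at addr with the breakpoint instruction."""
    memory[addr][0:4] = struct.pack(">I", BREAK_INSTR)

# What happened: the breakpoint landed on the descriptor.
bad = fresh_memory()
set_breakpoint(bad, DESC_ADDR)
corrupted_entry = struct.unpack(">Q", bad[DESC_ADDR])[0]
print(f"entry read from corrupted descriptor: {corrupted_entry:#018x}")

# What should happen: the breakpoint lands on the first instruction.
good = fresh_memory()
set_breakpoint(good, ENTRY_ADDR)
intact_entry = struct.unpack(">Q", good[DESC_ADDR])[0]
print(f"entry read from intact descriptor:    {intact_entry:#018x}")
```

In the first case the entry address read from the descriptor is the faulting address from the oops; in the second the descriptor is untouched and only the instruction at the entry point is replaced.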

The correct behavior for gdb would be:
*****************************************************************************
...
c0000000005b0360 <module_event>:
c0000000005b0360:       c0 00 00 00     lfs     f0,0(0)
c0000000005b0364:       00 09 5c 80     .long 0x95c80
c0000000005b0368:       c0 00 00 00     lfs     f0,0(0)
c0000000005b036c:       00 5f a2 e8     .long 0x5fa2e8
...
c000000000095c80 <.module_event>:
c000000000095c80:       7d 82 21 08     ********modify here to "7d 82 21 08"*******
c000000000095c84:       fb c1 ff f0     std     r30,-16(r1)
...
*****************************************************************************

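D: Fix

Change gdb's function symbol resolution rules for the ppc64 arch so that it obtains the correct function entry address instead of the address of the function descriptor.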
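
One way such a rule could look is sketched below. This is not gdb's actual implementation; `resolve_breakpoint_addr`, the symbol dictionary, and the `read_memory` callback are all hypothetical names chosen for illustration. The idea is to prefer the dot-prefixed symbol when it exists and otherwise, when the symbol lives in a data section, to read the entry point from the first doubleword of the descriptor.

```python
import struct

def resolve_breakpoint_addr(name, symbols, read_memory):
    """Pick the code address for a function-name breakpoint (hypothetical rule).

    symbols: {name: (address, nm_type)}; read_memory(addr, size) -> bytes.
    """
    dot_name = "." + name
    if dot_name in symbols:              # dot symbol holds the entry point
        return symbols[dot_name][0]
    addr, kind = symbols[name]
    if kind.lower() == "d":              # data symbol: treat it as a descriptor
        return struct.unpack(">Q", read_memory(addr, 8))[0]
    return addr                          # already a code address

# Addresses from the objdump listing; type letters as in the System.map excerpt.
symbols = {
    "module_event": (0xC0000000005B0360, "d"),
    ".module_event": (0xC000000000095C80, "t"),
}

def read_memory(addr, size):
    """Toy target memory containing only the descriptor's first doubleword."""
    assert addr == 0xC0000000005B0360 and size == 8
    return struct.pack(">Q", 0xC000000000095C80)

via_dot_symbol = resolve_breakpoint_addr("module_event", symbols, read_memory)
via_descriptor = resolve_breakpoint_addr(
    "module_event", {"module_event": symbols["module_event"]}, read_memory)
print(f"{via_dot_symbol:#018x} {via_descriptor:#018x}")
```

Both paths arrive at the entry point of <.module_event>; the second path matters when the dot symbol has been stripped and only the descriptor symbol remains.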
