kgdb抓虫日记 – kgdb 与 sched_clock[softlockup]

/ 1评 / 0

A: BUG现象

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
intel_mpcbl0040-standard-glibc_small has the simillar problem
# BUG: soft lockup - CPU#3 stuck for 78s! [swapper:0]
Modules linked in: kgdboe ipv6 [last unloaded: kgdboe]
 
Pid: 0, comm: swapper Not tainted (2.6.27.23-WR3.0ae_standard #1)
EIP: 0060:[<c0394844>] EFLAGS: 00000283 CPU: 3
EIP is at _spin_unlock_irq+0xb/0x28
EAX: 00000001 EBX: 0000007d ECX: 000a9a7e EDX: f78603f4
ESI: f7860000 EDI: 00000102 EBP: f7867e98 ESP: f7867e98
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: 4c42b6e0 CR3: 004c0000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c012e30f>] run_timer_softirq+0x142/0x1b4
 [<c0135113>] ? delayed_work_timer_fn+0x0/0x32
 [<c012a667>] __do_softirq+0xb6/0x19c
 [<c012a78c>] do_softirq+0x3f/0x45
 [<c013b55e>] ? hrtimer_interrupt+0x0/0x1bc
 [<c012a942>] irq_exit+0x48/0x55
 [<c0111e8c>] smp_apic_timer_interrupt+0xdf/0x129
 [<c0103cd8>] apic_timer_interrupt+0x28/0x30
 [<c0109452>] ? mwait_idle+0x32/0x37
 [<c01021ab>] cpu_idle+0x95/0x11a
 [<c0390263>] start_secondary+0x156/0x1c3
 =======================
BUG: soft lockup - CPU#0 stuck for 78s! [swapper:0]
Modules linked in: kgdboe ipv6 [last unloaded: kgdboe]
 
Pid: 0, comm: swapper Not tainted (2.6.27.23-WR3.0ae_standard #1)
EIP: 0060:[<c0109452>] EFLAGS: 00000246 CPU: 0
EIP is at mwait_idle+0x32/0x37
EAX: 00000000 EBX: 00000004 ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: c04bf300 EBP: c0475fa4 ESP: c0475fa4
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: b80e62a8 CR3: 004c0000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c01021ab>] cpu_idle+0x95/0x11a
 [<c0384f32>] rest_init+0x62/0x70
 =======================

B: BUG重现步骤&现场分析

When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set sched_clock() gets the
time from hardware, such as from TSC. In this configuration kgdb will
report a softlock warning messages on resuming or detaching from a
debug session.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
Sequence of events in the problem case:
 
1) "cpu sched clock" and "hardware time" are at 100 sec prior
   to a call to kgdb_handle_exception()
 
2) Debugger waits in kgdb_handle_exception() for 80 seconds and on
   exit the following is called
    ...
    touch_softlockup_watchdog()
       --> __raw_get_cpu_var(touch_timestamp) = 0;
 
3) "cpu sched clock" = 100s (it was not updated, because the interrupt
   was disabled in kgdb) but the "hardware time" = 180 sec
 
4) The first timer interrupt after resuming from kgdb_handle_exception
   updates the watchdog from the "cpu sched clock"
 
update_process_times()
{
    ...
    run_local_timers()
          --> softlockup_tick()
              --> check (touch_timestamp == 0)
                (it is "YES" here, we have set "touch_timestamp = 0"
                at kgdb)
                --> __touch_softlockup_watchdog()
              ***(A)--> reset "touch_timestamp" to "get_timestamp()"
                     (Here, the "touch_timestamp" will still be set to
                      100s.)
    ...
 
    scheduler_tick()
      ***(B)--> sched_clock_tick()
               (update "cpu sched clock" to "hardware time" = 180s)
    ...
}
 
 
5) The Second timer interrupt handler appears to have a large jump and
   trips the softlockup warning.
 
update_process_times()
{
    ...
    run_local_timers()
        --> softlockup_tick()
           --> "cpu sched clock" - "touch_timestamp" = 180s-100s > 60s
           --> printk "soft lockup error messages"
    ...
}
1
2
3
4
5
6
7
8
9
10
11
note: ***(A) reset "touch_timestamp" to "get_timestamp(this_cpu)"
 
Why "touch_timestamp" is 100 sec, instead of 180 sec?
 
With the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK" set the call trace of
get_timestamp() is:
 
get_timestamp(this_cpu)
      -->cpu_clock(this_cpu)
         -->sched_clock_cpu(this_cpu)
            -->__update_sched_clock(sched_clock_data, now)

C: BUG触发原因

The __update_sched_clock() function uses the GTOD tick value to create
a window to normalize the "now" values. So if "now" values is too big
for sched_clock_data, it will be ignored.

D: BUG解决方法

The fix is to invoke sched_clock_tick() to update "cpu sched clock" in
order to recover from this state. This is done by introducing the
function touch_softlockup_watchdog_sync(), which allows kgdb to
request that the sched clock is updated when the watchdog thread runs
the first time after a resume from kgdb.

[PATCH] softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume

E: BUG简要总结

这里的软锁警告并不是由真正的软锁产生的,从使用角度来说可以忽略,但由于是kgdb引起的,对于kgdb来讲,所以还是有责任去解决的.
这个问题是因kgdb关闭了所有中断,导致sched_clock_data存储的时间没有与硬件时间同步造成的.
所以解决方法就是在kgdb退出后立刻更新sched_clock_data.

一条回应:“kgdb抓虫日记 – kgdb 与 sched_clock[softlockup]”

  1. Kgdb说道:

    Kgdb…

    […] something about kgdb[…]…

发表评论

电子邮件地址不会被公开。 必填项已用*标注

此站点使用Akismet来减少垃圾评论。了解我们如何处理您的评论数据