A: BUG现象
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | intel_mpcbl0040-standard-glibc_small has the simillar problem # BUG: soft lockup - CPU#3 stuck for 78s! [swapper:0] Modules linked in: kgdboe ipv6 [last unloaded: kgdboe] Pid: 0, comm: swapper Not tainted (2.6.27.23-WR3.0ae_standard #1) EIP: 0060:[<c0394844>] EFLAGS: 00000283 CPU: 3 EIP is at _spin_unlock_irq+0xb/0x28 EAX: 00000001 EBX: 0000007d ECX: 000a9a7e EDX: f78603f4 ESI: f7860000 EDI: 00000102 EBP: f7867e98 ESP: f7867e98 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 8005003b CR2: 4c42b6e0 CR3: 004c0000 CR4: 000006d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 [<c012e30f>] run_timer_softirq+0x142/0x1b4 [<c0135113>] ? delayed_work_timer_fn+0x0/0x32 [<c012a667>] __do_softirq+0xb6/0x19c [<c012a78c>] do_softirq+0x3f/0x45 [<c013b55e>] ? hrtimer_interrupt+0x0/0x1bc [<c012a942>] irq_exit+0x48/0x55 [<c0111e8c>] smp_apic_timer_interrupt+0xdf/0x129 [<c0103cd8>] apic_timer_interrupt+0x28/0x30 [<c0109452>] ? mwait_idle+0x32/0x37 [<c01021ab>] cpu_idle+0x95/0x11a [<c0390263>] start_secondary+0x156/0x1c3 ======================= BUG: soft lockup - CPU#0 stuck for 78s! [swapper:0] Modules linked in: kgdboe ipv6 [last unloaded: kgdboe] Pid: 0, comm: swapper Not tainted (2.6.27.23-WR3.0ae_standard #1) EIP: 0060:[<c0109452>] EFLAGS: 00000246 CPU: 0 EIP is at mwait_idle+0x32/0x37 EAX: 00000000 EBX: 00000004 ECX: 00000000 EDX: 00000000 ESI: 00000000 EDI: c04bf300 EBP: c0475fa4 ESP: c0475fa4 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 8005003b CR2: b80e62a8 CR3: 004c0000 CR4: 000006d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 [<c01021ab>] cpu_idle+0x95/0x11a [<c0384f32>] rest_init+0x62/0x70 ======================= |
B: BUG重现步骤&现场分析
When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set sched_clock() gets the
time from hardware, such as from TSC. In this configuration kgdb will
report a softlock warning messages on resuming or detaching from a
debug session.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 | Sequence of events in the problem case: 1) "cpu sched clock" and "hardware time" are at 100 sec prior to a call to kgdb_handle_exception() 2) Debugger waits in kgdb_handle_exception() for 80 seconds and on exit the following is called ... touch_softlockup_watchdog() --> __raw_get_cpu_var(touch_timestamp) = 0; 3) "cpu sched clock" = 100s (it was not updated, because the interrupt was disabled in kgdb) but the "hardware time" = 180 sec 4) The first timer interrupt after resuming from kgdb_handle_exception updates the watchdog from the "cpu sched clock" update_process_times() { ... run_local_timers() --> softlockup_tick() --> check (touch_timestamp == 0) (it is "YES" here, we have set "touch_timestamp = 0" at kgdb) --> __touch_softlockup_watchdog() ***(A)--> reset "touch_timestamp" to "get_timestamp()" (Here, the "touch_timestamp" will still be set to 100s.) ... scheduler_tick() ***(B)--> sched_clock_tick() (update "cpu sched clock" to "hardware time" = 180s) ... } 5) The Second timer interrupt handler appears to have a large jump and trips the softlockup warning. update_process_times() { ... run_local_timers() --> softlockup_tick() --> "cpu sched clock" - "touch_timestamp" = 180s-100s > 60s --> printk "soft lockup error messages" ... } |
1 2 3 4 5 6 7 8 9 10 11 | note: ***(A) reset "touch_timestamp" to "get_timestamp(this_cpu)" Why "touch_timestamp" is 100 sec, instead of 180 sec? With the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK" set the call trace of get_timestamp() is: get_timestamp(this_cpu) -->cpu_clock(this_cpu) -->sched_clock_cpu(this_cpu) -->__update_sched_clock(sched_clock_data, now) |
C: BUG触发原因
The __update_sched_clock() function uses the GTOD tick value to create
a window to normalize the "now" values. So if "now" values is too big
for sched_clock_data, it will be ignored.
D: BUG解决方法
The fix is to invoke sched_clock_tick() to update "cpu sched clock" in
order to recover from this state. This is done by introducing the
function touch_softlockup_watchdog_sync(), which allows kgdb to
request that the sched clock is updated when the watchdog thread runs
the first time after a resume from kgdb.
[PATCH] softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume
E: BUG简要总结
这里的软锁警告并不是由真正的软锁产生的,从使用角度来说可以忽略,但由于是kgdb引起的,对于kgdb来讲,所以还是有责任去解决的.
这个问题是因kgdb关闭了所有中断,导致sched_clock_data存储的时间没有与硬件时间同步造成的.
所以解决方法就是在kgdb退出后立刻更新sched_clock_data.
Kgdb…
[…] something about kgdb[…]…