
Nmi_watchdog and x86_64 lockups: msg#00003

linux.smp

Subject: Nmi_watchdog and x86_64 lockups

Having narrowly skirted death by allowing photographers near the lab
today...

I would like to enlist the open source community in debugging our Oracle
problem. Online kernel docs recommend reporting issues with NMI
(related to our lockup/dump issue) to the kernel-smp list.

I have composed the following email, but want to make sure you are
comfortable with me pursuing this. I do not mention any application
details, but it is not possible to omit fairly detailed descriptions of
the hardware when submitting to the kernel list. I am not sure whether
that is kosher.

Please let me know how I should proceed with this.

Domo Arigato.

Jeremy


Hypothetical email:
-------------------
I am looking for assistance with x86_64 SMP systems locking up. Under a
heavy application workload, the system freezes and I am unable to send
an alt-sysrq-d to trigger a dump. The systems are booting with
nmi_watchdog=1 set, but the watchdog is not working. No oops events
are registered in messages and I have observed nothing on the console
(direct attached KVM - working on setting up a term server and logging
serial console).

According to nmi_watchdog.txt, I should see non-zero counters in
/proc/interrupts with this enabled, or "you probably have a processor
that needs to be added to the nmi code".
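
To make that check repeatable across machines, here is a minimal Python
sketch that pulls the NMI row out of /proc/interrupts-style text and
reports whether any CPU has a non-zero count. The function names and the
SAMPLE text are hypothetical, chosen for illustration. A non-zero count
only shows that NMIs have been delivered at some point; comparing two
readings taken a few seconds apart would be a stronger test.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters as a list of ints, or None if
    the text has no NMI row."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                # Some kernels append a text description after the
                # counters; stop at the first non-numeric field.
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return None


def nmi_has_fired(interrupts_text):
    """True if at least one CPU shows a non-zero NMI counter."""
    counts = nmi_counts(interrupts_text)
    return bool(counts) and any(c > 0 for c in counts)


# Hypothetical excerpt in the layout of /proc/interrupts.
SAMPLE = """\
           CPU0       CPU1
  0:     383170   23276745   IO-APIC-edge   timer
NMI:          0          0
LOC:   23656684   23657709
"""

if __name__ == "__main__":
    print(nmi_counts(SAMPLE), nmi_has_fired(SAMPLE))
```

On a live system the same functions could be fed the contents of
/proc/interrupts instead of SAMPLE.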

The lockups are occurring in two separate configurations (details
below), both of which show all zeros for NMI in /proc/interrupts.
Any advice on whether these configurations are supported by the NMI
code, or suggestions for how to get a dump successfully, would be most
appreciated.

Thanks in advance,

Jeremy Ulstad

Config 1: 2 x AMD Opteron 240 (8 GB RAM)
SLES 9
Linux number6 2.6.5-7.111.19-smp #1 SMP Fri Dec 10 15:10:58 UTC 2004 x86_64 x86_64 x86_64 GNU/Linux

number6:~ # cat /proc/interrupts
           CPU0       CPU1
  0:     383170   23276745   IO-APIC-edge   timer
  1:          9        227   IO-APIC-edge   i8042
  2:          0          0   XT-PIC         cascade
  8:          0          0   IO-APIC-edge   rtc
  9:          0          0   IO-APIC-level  acpi
 12:        207          0   IO-APIC-edge   i8042
 14:       4900      57432   IO-APIC-edge   ide0
 15:         54          0   IO-APIC-edge   ide1
 19:          0          0   IO-APIC-level  ohci_hcd, ohci_hcd
 27:  327047839          0   IO-APIC-level  eth0, eth1
NMI:          0          0
LOC:   23656684   23657709
ERR:          0
MIS:          0

Config 2: 4 x AMD Opteron 850 (8 GB RAM)
SLES 9
Linux riddick 2.6.5-7.145-smp #1 SMP Thu Jan 27 09:19:29 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux

riddick:~ # cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:   20317266   25048606   25048495   25048500   IO-APIC-edge   timer
  1:          9          0          0          0   IO-APIC-edge   i8042
  2:          0          0          0          0   XT-PIC         cascade
  4:        652         92          0          0   IO-APIC-edge   serial
  8:          0          0          0          0   IO-APIC-edge   rtc
  9:          0          0          0          0   IO-APIC-level  acpi
 12:         59          0          0          0   IO-APIC-edge   i8042
 15:         63          4          0          0   IO-APIC-edge   ide1
 19:          0          0          0          0   IO-APIC-level  ohci_hcd, ohci_hcd
 25:   93875682          0          1         81   IO-APIC-level  eth0
 27:          0     275078      99550       4603   IO-APIC-level  ioc0
NMI:          0          0          0          0
LOC:   95441672   95441724   95441724   95441606
ERR:          0
MIS:          0

I should also note that all the config 1 systems are being limited to
3.8 GB of memory with "mem=3800m" to work around an lkcd bug that causes
dumps (triggered manually while the system is up) to fail with 4 GB or
more of RAM.
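
Since both nmi_watchdog=1 and mem=3800m are boot parameters, it may be
worth confirming that each machine actually booted with them. Below is a
minimal Python sketch that splits a kernel command line (as exposed in
/proc/cmdline) into key/value pairs. The function name and the example
string, including its root= value, are hypothetical, and the sketch does
not handle quoted values that contain spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict.

    key=value tokens map key to value; bare flags map to True.
    Quoted values containing spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


# Hypothetical command line for illustration only.
EXAMPLE = "root=/dev/sda1 nmi_watchdog=1 mem=3800m quiet"

if __name__ == "__main__":
    params = parse_cmdline(EXAMPLE)
    print(params.get("nmi_watchdog"), params.get("mem"))
```

On a running system, the text would come from reading /proc/cmdline
instead of EXAMPLE.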
-
To unsubscribe from this list: send the line "unsubscribe linux-smp" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html


