
Re: PE2500 with RedHat v8.0 experiencing high load and hanging/lockups: msg#00439

Subject: Re: PE2500 with RedHat v8.0 experiencing high load and hanging/lockups
At 9:10am CST this machine locked up again, from what appears to be massive, uncontrollable load. This time I upgraded the PERC 3/Di firmware from #3153 to #3170, and the machine is currently up. I won't know whether this is the fix for some time, probably a few weeks. I should mention some of the things I've seen on the screen when I get to the frozen system. This time I saw the following:

"aacraid: Host adapter reset request. SCSI hang ?
<msg repeated many times>
I/O error: dev 08:02, sector 30382784
<same msg> 30382904
<same msg> 30383032
<etc etc>
<same msg> 30386648
<same msg> 30386656"

Previously, the screen had shown the following messages:
"unable to handle kernel paging request at virtual address 4F072740
oops: 0
unable to handle kernel paging request at virtual address 138EE508
I/O error: dev 08:31, sector 4456760
<same msg> 4456792
<same msg> 4460600
<etc etc>
<same msg> 5243104
<same msg> 5243128"
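For anyone trying to map the "dev 08:02" style numbers in these messages to actual devices, here is a minimal sketch in Python. It assumes the 2.4 kernel prints the pair as hexadecimal major:minor, and that major 8 is the first block of SCSI disks with 16 minors per disk (sda through sdp). The function name `decode_dev` is made up for illustration and is not part of any tool.

```python
import re

SCSI_DISK_MAJOR = 8        # assumption: first SCSI disk major (sda..sdp)
PARTITIONS_PER_DISK = 16   # minors 0-15 belong to sda, 16-31 to sdb, ...


def decode_dev(dev):
    """Turn a 'MM:mm' hex pair from a kernel I/O error line into a device name."""
    match = re.fullmatch(r"([0-9a-fA-F]{2}):([0-9a-fA-F]{2})", dev)
    if match is None:
        raise ValueError("expected a pair like '08:02', got %r" % dev)
    major = int(match.group(1), 16)
    minor = int(match.group(2), 16)
    if major != SCSI_DISK_MAJOR:
        raise ValueError("only SCSI disk major 8 is handled in this sketch")
    # The disk index selects the letter; the remainder is the partition number.
    disk, partition = divmod(minor, PARTITIONS_PER_DISK)
    name = "sd" + chr(ord("a") + disk)
    return name if partition == 0 else name + str(partition)


if __name__ == "__main__":
    for dev in ("08:02", "08:31", "08:21"):
        print(dev, "->", decode_dev(dev))
```

Under those assumptions, the three device numbers quoted in this thread would decode to partitions on three different logical disks, which may be worth checking against the container layout on the PERC.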

I am thinking this info might help someone...

Peter Smith


Peter Smith wrote:

I upgraded the kernel on this box to the then-newest 2.4.18-26.8.0smp kernel on 3/7/2003. Since then it has locked up on 3/15, 3/17, and 3/22. It was down from 3/22 until this morning, with the message "I/O error: dev 08:21, sector 24641776" displayed many times with different sectors, finally ending in "0". This morning I upgraded it to the now-newest 2.4.18-27.8.0smp kernel. I'm fairly certain there are no disk issues. I'll continue with my testing over the next week. I also noticed a massive load spike (>2000) on 3/22 immediately before it hung. If I don't see any load spikes and/or lockups over the next week, then I will move on to updating the firmware on both the system and the PERC (per Jason Andrade's suggestions).

Peter Smith


Peter Smith wrote:

This is an odd issue which is why I'm notifying/contacting the list.

I have a PE2500 which, up until about 1 1/2 weeks ago, was running RedHat v7.1 without a hitch or hiccup. Since things were going so well, I decided it was high time to upgrade to RedHat v8.0. At the same time, I upgraded Squid, its main application. Keep in mind this PE2500 is an older unit, shipped on 9/5/2001, and it is using a PERC 2/Di. The reason I upgraded is that I have another, newer PE2500 (shipped 3/26/2002) which has been running RedHat v8.0 and the newer Squid (all the same software revs) on the same PERC 2/Di.

The problem I am having is that the failing machine is experiencing massive load (>1000) at certain, somewhat cyclic, times. I reboot this particular machine every morning at 3:00am. I don't believe the massive load has to do with anything other than drive access. It seems the RAID driver is sometimes taking up too much time and can lock up the machine. Only one other time did I have a problem which seemed unrelated to the RAID driver: recently, after it rebooted at 3:00am, it got stuck attempting to initialize the AIC7XXX driver at startup. I understand this is somewhat of a known issue (but for RedHat v8?) and I'm working on getting the newest AIC7XXX driver installed, so this probably isn't too much of a problem. However, I am running the RedHat '2.4.18-24.8.0smp' kernel and am still experiencing massive load problems (which I never saw when running RedHat v7.1 on this box). I'll be setting up the newest kernel, '2.4.18-26.8.0smp', probably tonight and will give that a whirl. I have a feeling that unless the aacraid driver has been changed I'll experience the same problems. I see no massive load or hangs on my other machine at this time.

The only other thing is that this machine is using the on-board Eepro card and two add-on 3c905's. I've left the configuration on these fairly generic. Plus, nothing changed as far as networking goes in the upgrade to RedHat v8.0.

Any ideas? Pointers? More data? I'm fairly stumped... I suppose, at worst, I could learn how to hook up a remote kernel profiler/debugger to get some real numbers on it. When running "iostat", this box shows a lot more RAID-driver service time than all the other boxes, which leads me to believe it is a RAID-driver (aacraid) issue again.
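Short of a kernel profiler, one cheap way to get numbers on the load spikes would be to sample /proc/loadavg from cron and flag anything above a threshold. The following is only a sketch: the helper names `parse_loadavg` and `is_load_spike` are hypothetical, and the default threshold of 1000 is simply taken from the load figure mentioned above.

```python
def parse_loadavg(line):
    """Parse one line in /proc/loadavg format, e.g. '0.20 0.18 0.12 1/80 11206'."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("unexpected loadavg line: %r" % line)
    load1, load5, load15 = (float(v) for v in fields[:3])
    # The fourth field is 'runnable/total' scheduling entities.
    runnable, total = (int(v) for v in fields[3].split("/"))
    return {
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "runnable": runnable,
        "total": total,
    }


def is_load_spike(line, threshold=1000.0):
    """Return True when the 1-minute load average exceeds the threshold."""
    return parse_loadavg(line)["load1"] > threshold


if __name__ == "__main__":
    # Reads the live value; only works on a Linux box with /proc mounted.
    with open("/proc/loadavg") as handle:
        current = handle.read()
    print(parse_loadavg(current), "spike" if is_load_spike(current) else "ok")
```

Logging the parsed values with a timestamp every minute would at least show how quickly the load climbs before a hang, and whether the climb lines up with the I/O errors on the console.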

Thank you in advance...

Peter Smith

_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge@xxxxxxxx
http://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq or search the list archives at http://lists.us.dell.com/htdig/



