At 9:10am CST this machine again locked up from what appears to be
massive uncontrollable load. This time I've upgraded the PERC 3/Di
firmware from #3153 to #3170. It is currently up. I'll only know if
this is the fix after some time, probably a few weeks. I should mention
some of the things I've seen on the screen when I get to the frozen
system. This time I saw the following.
"aacraid: Host adapter reset request. SCSI hang ?
<msg repeated many times>
I/O error: dev 08:02, sector 30382784
<same msg> 30382904
<same msg> 30383032
<etc etc>
<same msg> 30386648
<same msg> 30386656"
Previously I'd had the following message on the screen.
"unable to handle kernel paging request at virtual address 4F072740
oops: 0
unable to handle kernel paging request at virtual address 138EE508
I/O error: dev 08:31, sector 4456760
<same msg> 4456792
<same msg> 4460600
<etc etc>
<same msg> 5243104
<same msg> 5243128"
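For anyone trying to map the "dev 08:02" style numbers in these messages to an actual disk, here is a minimal sketch. It assumes the 2.4 kernel prints the device as major:minor in hexadecimal, and that major 8 is the standard SCSI disk major with 16 minors per disk. The name decode_dev is hypothetical, for illustration only.

```python
def decode_dev(dev):
    """Decode a 'MM:mm' device string (assumed hex) into (major, minor, name).

    name is only filled in for major 8 (assumed: first SCSI disk major,
    16 minors per disk, minor % 16 == 0 meaning the whole disk).
    """
    major_hex, minor_hex = dev.split(":")
    major = int(major_hex, 16)
    minor = int(minor_hex, 16)

    name = None
    if major == 8:
        disk, part = divmod(minor, 16)
        name = "sd" + chr(ord("a") + disk)
        if part:
            name += str(part)
    return major, minor, name


if __name__ == "__main__":
    # The three device strings that appear in this thread.
    for dev in ("08:02", "08:21", "08:31"):
        print(dev, decode_dev(dev))
```

Under those assumptions, 08:02 comes out as sda2, 08:21 as sdc1, and 08:31 as sdd1, which would mean the errors are spread over more than one logical drive rather than confined to one partition.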
I am thinking this info might help someone...
Peter Smith
Peter Smith wrote:
I upgraded the kernel on this box to the then-newest 2.4.18-26.8.0smp
kernel on 3/7/2003. Since then it has locked up on 3/15, 3/17, and
3/22. It was down from 3/22 until this morning with the following
message displayed many times, "I/O error: dev 08:21, sector 24641776",
with different sectors, finally ending in "0". This morning I upgraded
the kernel to the now-newest 2.4.18-27.8.0smp.
I'm fairly certain there are no disk issues. I'll continue with my
testing over the next week. Also I notice there was a massive load
spike >2000 on 3/22 immediately before it hung. If I don't see any
load spikes and/or locking over the next week then I will move on to
updating the firmware on both the system and the PERC (per Jason
Andrade's suggestions.)
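One way to watch for load spikes over a week like this is to sample /proc/loadavg on a schedule and keep a timestamped log. The following is a rough sketch only, not what is running on this box. The function names, the log path argument, and the threshold of 50.0 are all assumptions chosen for illustration.

```python
import os
import time


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict of numbers."""
    fields = text.split()
    running, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "running": int(running),
        "total": int(total),
    }


def format_sample(text, threshold=50.0, now=None):
    """Build one log line; append ' SPIKE' when the 1-minute load
    is at or above the (assumed) threshold."""
    sample = parse_loadavg(text)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    flag = " SPIKE" if sample["load1"] >= threshold else ""
    return "%s load1=%.2f load5=%.2f load15=%.2f procs=%d/%d%s" % (
        stamp, sample["load1"], sample["load5"], sample["load15"],
        sample["running"], sample["total"], flag)


def log_once(log_path, proc_path="/proc/loadavg", threshold=50.0):
    """Read the current load and append one line to log_path,
    forcing it to disk so it is less likely to be lost in a hang."""
    with open(proc_path) as f:
        text = f.read()
    line = format_sample(text, threshold)
    with open(log_path, "a") as log:
        log.write(line + "\n")
        log.flush()
        os.fsync(log.fileno())
    return line
```

Calling log_once() from cron every minute would leave a record of how quickly the load climbs before a hang. If the disk itself is what is stalling, the fsync may block too, so sending the line to another host might be the safer choice.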
Peter Smith
Peter Smith wrote:
This is an odd issue which is why I'm notifying/contacting the list.
I have a PE2500 which, up until about 1 1/2 weeks ago, was running
RedHat v7.1 without a hitch or hiccup. Since things were going so
well, I decided it was high time to upgrade to RedHat v8.0. At the
same time, I upgraded Squid, its main application. Keep in mind this
PE2500 is an older unit, shipped on 9/5/2001, and it is using a PERC
2/Di. The reason I upgraded it is I have another, newer, PE2500
which has been running RedHat v8.0 and my newer Squid (all same
software revs) using the same PERC 2/Di but in a newer box, shipped
3/26/2002.
The problem I am having is that the failing machine is experiencing
massive load (>1000) at certain somewhat cyclic times. I reboot this
particular machine every morning at 3:00am. I don't believe the
massive load has to do with anything other than drive access. It
seems the raid driver is sometimes taking up too much time and can
lock up the machine. Only one other time did I have a problem which
seemed unrelated to the raid driver--recently after it rebooted at
3:00am it got stuck attempting to initialize the AIC7XXX driver at
startup. I understand this is somewhat of a known issue (but for
RedHat v8?) and I'm working on getting the newest newest happiest
AIC7XXX driver installed, so this probably isn't too much of a
problem. However, I am running the RedHat '2.4.18-24.8.0smp' kernel
and am still experiencing massive load problems (which I used to not
see when running RedHat v7.1 on this box.) I'll be setting up the
newest newest kernel '2.4.18-26.8.0smp' probably tonight and will
give that a whirl. I have a feeling that unless the Aacraid driver
has been changed I'll experience the same problems. I see no
massive-load or hangs on my other machine at this time.
The only other thing is this machine is using the on-board Eepro card
and two add-on 3c905's. I've left the configuration on these fairly
generic. Plus, nothing, as far as network goes, changed in the
upgrade to RedHat v8.0.
Any ideas? Pointers? More data? I'm fairly stumped... I suppose
at the worst, I could maybe learn how to hook up a remote kernel
profiler/debugger to get some real numbers on it. When running
"iostat" it looks like this box does a lot more raid-driver service
time than all the other boxes which leads me to believe it is a
raid-driver (aacraid) issue again.
Thank you in advance...
Peter Smith
_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge@xxxxxxxx
http://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq or search the
list archives at http://lists.us.dell.com/htdig/