Hi Steve,
A quick reply for the moment. Our environment sounds similar to yours. We
have a number of Dell servers (PE1550, 1650, 2450, 2550, and 2650s)
running Linux. We also own a NetApp F820 filer, and are planning the
purchase of an F825 to keep it company. The Linux servers mount a lot of stuff
over NFS from the filer. This includes home-grown code, using things
like mod_perl. It also includes some pretty heavy database-type stuff:
some PostgreSQL databases, a big Notes Domino db, and Verity K2 collections.
Plus we plan to mount some Oracle databases this way within a month or so.
Generally this works very well for us. We have a mix of kernels talking to the
filer, and no problems like you are describing. We only run stock RH kernels
(including AS kernels).
My biggest tip is to check out the following, if you haven't already. It's
pretty thorough, has a lot of good info in it, and was written by a NetApp
engineer: "Using the Linux® NFS Client with Network Appliance Filers"
http://www.netapp.com/tech_library/3183.html
Some of what we've found:
- Our standard mount options are:
proto=udp,hard,bg,intr,wsize=8192,rsize=8192
- For Oracle we'll try the recommended tcp protocol, and probably test 16k
windows
- Early kernels seem to work best with network stack socket buffer tuning in
/etc/sysctl.conf
- We've tried to ensure heavy traffic servers run newer kernels - 2.4.18-17 or
later
- We've got the filer (twice) and major servers on Gb ethernet, which improved
things a lot over 100Mb
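A minimal sketch of how the standard options above could be checked against what a given mount is configured with. It compares option strings as they would be written in /etc/fstab or on the mount command line; the kernel may display the same options differently in /proc/mounts. The helper names parse_opts and diff_opts are made up for illustration and are not part of any existing tool.

```python
# Sketch only: compare an NFS mount's option string against a site standard.
# parse_opts and diff_opts are illustrative names, not part of any tool.

STANDARD = "proto=udp,hard,bg,intr,wsize=8192,rsize=8192"

def parse_opts(opts):
    """Turn 'a,b=1' into {'a': True, 'b': '1'}."""
    parsed = {}
    for item in opts.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flag-style options (hard, bg, intr) have no '=' and map to True.
        parsed[key] = value if sep else True
    return parsed

def diff_opts(actual, standard=STANDARD):
    """Return {option: (wanted, found)} for each standard option that differs."""
    want = parse_opts(standard)
    have = parse_opts(actual)
    return {k: (v, have.get(k)) for k, v in want.items() if have.get(k) != v}

if __name__ == "__main__":
    # A mount switched to tcp but otherwise standard.
    print(diff_opts("proto=tcp,hard,bg,intr,wsize=8192,rsize=8192"))
```

An empty result means the mount matches the standard; anything else lists the wanted value next to what was found (None if the option is missing).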
I think that the IP fragmentation issue can be partly resolved by newer
kernels and/or bigger socket buffers set in sysctl.conf. At one point, in
addition to the NetApp suggestions, I played with:
net.ipv4.tcp_rmem = 8192 262143 8388608
net.ipv4.tcp_wmem = 4096 262143 8388608
Later testing suggested this wasn't needed, though, so use at your own risk.
It may also be worth checking nfsstat -r to see whether you are getting a lot
of RPC retransmits.
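As a rough way to put a number on the retransmits, the sketch below works out the ratio of retransmits to calls. It assumes the client counters are exposed in /proc/net/rpc/nfs on a line of the form "rpc <calls> <retrans> <authrefresh>", which is my understanding of where nfsstat gets its figures. The function name retrans_ratio is made up for illustration.

```python
# Sketch only: estimate the RPC retransmit rate on an NFS client.
# Assumes /proc/net/rpc/nfs has a line "rpc <calls> <retrans> <authrefresh>".

def retrans_ratio(stats_text):
    """Return retransmits / calls from the 'rpc' line, or None if absent."""
    for line in stats_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "rpc":
            calls, retrans = int(fields[1]), int(fields[2])
            return retrans / calls if calls else 0.0
    return None

if __name__ == "__main__":
    try:
        with open("/proc/net/rpc/nfs") as f:
            ratio = retrans_ratio(f.read())
    except OSError:
        # File missing, e.g. NFS client support not loaded or not Linux.
        ratio = None
    if ratio is None:
        print("no NFS client RPC counters found")
    else:
        print("retransmit ratio: %.2f%%" % (ratio * 100))
```

The counters are cumulative since boot, so sampling twice and comparing the difference gives a better picture of what is happening during a stall.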
Hope this helps!
Yours,
Ed
>I've actually been dealing with a couple different NFS-related issues,
>on desktops and servers, on RedHat 7.2, 7.3, and 8.0. We have a fleet
>(~ 22) of PE2650s in the back room (for batch job processing), and
>several Precision 530ns for desktops. I would imagine other people may
>not see the same severity of problems I'm seeing because our environment
>is heavily NFS-driven, i.e. I run Oracle on Linux over NFS, /usr/local,
>our project directories, and just about everything is NFS mounted. We
>use numerous Network Appliance NFS filers to serve data.
>
>On heavily loaded (NFS traffic) RH 7.3 systems (2.4.18-4), the NFS
>performance is spotty and erratic. I see tons of "kernel: nfs: server
>xxx not responding, still trying" errors in /var/log/messages, followed
>by variable amounts of time (usu. 1-60secs), then "server xxx OK". At
>its worst my Oracle server stalled for 1.5 hours in the middle of a
>long-running report. While this is slowing things down, at least
>nothing is dying because of it :-), as NFS does pick up eventually and
>things continue.
>
>NetApp has a bug on file similar to this issue, claiming the bug is
>actually in the Linux IP fragmentation code, and that switching to tcp
>mounts will help. And so at first using tcp mounts seemed to help,
>because the barrage of "server not responding" messages went away, but
>sadly they were replaced by random hangings, accompanied by "kernel:
>lockd: server xx.xx.xx.xx not responding, still trying" messages.
>Great--now it's lockd. Argh :-).
>
>I've wanted to try later kernels, but given the widespread reports of
>problems with the tg3 driver, I felt I'd be trading one set of problems
>for another :-). Also, I'm speculating that since RedHat is just
>patching the same old 2.4.18 kernel, there probably aren't really any
>bug fixes for the NFS code between, say 2.4.18-4 and 2.4.18-26 (maybe
>I'm mistaken here, please let me know if so). What I need is fixes to
>the NFS client-side code which I'm thinking will only come with an
>upgrade to a later kernel version (e.g. 2.4.2x).
>
>On the desktop side, our Precisions came with RedHat 8.0 (2.4.18-14),
>and we would experience random hangings periodically throughout the day
>while accessing files over NFS. I tried upgrading to a stock 2.4.20
>kernel, but then reading files was preceded by a 1-2 second pause (I've
>seen other reports of this with 2.4.20). I then installed RH's
>2.4.18-24 and my pauses went away, but other users are still
>complaining. I tried converting those users to tcp, but then the NFS
>performance dropped through the floor, so I had to switch them back to
>udp :-).
>
>I've also tried a bunch of other things (e.g. changing NFS block
>sizes), but it's hard to remember everything. If it weren't my job, at
>this point I'd just say "oh well" and wait 6-12 months for Linux's NFS
>to get better :-). But I've gotten enough positive responses to my
>query about tg3 in 2.4.18-26, so I'll try upgrading one of my 2650s to
>it and report back here. Thanks again everyone.
---
Ed Martin
Head of Systems and Network Performance
IOP Publishing Ltd
Dirac House, Temple Back
Bristol BS1 6BE
ddi: +44 (0)117 930 1102
www: http://www.iop.org
_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge@xxxxxxxx
http://lists.us.dell.com/mailman/listinfo/linux-poweredge