osdir.com
mailing list archive

Subject: problems with gfs locking - msg#00021

List: linux.redhat.cluster

Hello everybody!

Could someone please help me with a serious problem?

We have two HP DL380G4 servers connected to an HP MSA500 (Modular Smart
Array, currently with 3 disks installed as a single RAID-5 volume of
274 GB). The servers run Red Hat Enterprise Linux, and the storage is
formatted with GFS. For two months the two-node system worked fine, but
two weeks ago we started experiencing problems with system load. The
symptoms are as follows:

1. The server on which httpd is running becomes unstable because the
number of simultaneously running processes keeps growing: uptime shows
load averages of 10, 20, ..., 120, 160 within a few minutes, and top
hangs once the number is big enough. ps shows all httpd processes in
state D (uninterruptible sleep), so Apache starts MaxClients processes,
none of which ever finishes. I can't kill any of them, and they are
most likely blocked by GFS: there are two gulm_Cb_Handler processes,
each taking about 100% of CPU.

2. Apache server-status shows almost every process stuck in state W
(sending reply). MySQL shows a lot of open connections (each script
opens a connection in its auto-prepend file), but they are sleeping.
The Apache document_root points to the GFS volume, so every HTTP
request causes the filesystem to read or write files (user activity
was about 8 GB in 10000 files last month, twice as much as in the
previous month, when the system seemed stable). The filesystem is now
15% full (about 40 GB of 274 GB), and the biggest folder contains over
30000 files. Maybe this is the cause of the problems, a case of
quantity turning into (low) quality.

3. Another thing that has caused the filesystem to lock up is cvs,
which walks through all of those thousands of files. But we cannot
reproduce this: cvs has hung only a few times while updating (in fact,
checking) some folders (sometimes not very big ones).

4. The traffic graph (from MRTG) shows that when GFS goes down there
are suspicious spikes of activity on the network interface that links
the GFS nodes, rising to 4 Mbit/s in both directions (while average
throughput is about 100 kbit/s). We suspect that our problems started
when we replaced the link between the two nodes, a plain patch cord,
with a Cisco Catalyst switch (which may only support 10 Mbit/s
throughput). Can a slow network be the cause of our troubles? And
another question: are journals synchronized, or is there any other
activity between the two nodes, while data is being read from GFS on
one of them?
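
To put numbers on symptom 1, the processes stuck in state D can be
counted straight from /proc instead of through ps or top, which
themselves hang. What follows is only a rough sketch, assuming a Linux
/proc layout with one "stat" file per process; the function names
parse_state and count_states are made up for illustration and are not
part of any GFS or Red Hat tool.

```python
import os
from collections import Counter


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_states(proc_root: str = "/proc") -> Counter:
    """Count processes per state letter (R, S, D, Z, ...) under proc_root."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                counts[parse_state(fh.read())] += 1
        except (OSError, IndexError):
            continue  # process exited mid-read, or the line was malformed
    return counts
```

Calling count_states() on the web node every few seconds and logging
the "D" entry next to the MRTG graph should show whether the pile-up of
blocked httpd processes begins before or after the traffic spikes on
the inter-node link.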

Thanks for any qualified answers.



--
Sergey


