
Solaris 8/Bind 9 Processor/Memory Leak: msg#00267

Subject: Solaris 8/Bind 9 Processor/Memory Leak

Howdy,

Last week, we upgraded our primary name server to Bind 9.2.1 running on
a Sun Netra T1, 512mb ram, with Solaris 8 kernel revision Generic_108528-12.

Something seems to be going wrong: the kernel is slowly but surely eating up
all of the CPU time, and the free memory seems to be steadily declining, down
to just over 100mb free from over 400mb free just two days ago, when named was
started.

Currently a top on the system shows this:
load averages:  1.17,  1.16,  1.23                                     11:21:02
23 processes:  22 sleeping, 1 on cpu
CPU states: 42.3% idle,  4.8% user, 52.9% kernel,  0.0% iowait,  0.0% swap
Memory: 512M real, 119M free, 98M swap in use, 1574M swap free

   PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
   161 named      6  58    0   70M   69M sleep  150:02  6.80% named
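
For anyone who wants to log the kernel share over time rather than eyeball it, here is a minimal Python sketch that pulls the percentages out of a top "CPU states" line like the one above. The function name parse_cpu_states is made up for illustration, and the regex assumes the exact "NN.N% label" layout shown in this output; other top versions may format the line differently.

```python
import re

def parse_cpu_states(line):
    """Return e.g. {'idle': 42.3, 'kernel': 52.9, ...} from a top 'CPU states' line."""
    # Each field looks like '42.3% idle': capture the number, then the label.
    pairs = re.findall(r"([\d.]+)%\s+(\w+)", line)
    return {label: float(value) for value, label in pairs}

# Sample line copied from the top output in this message.
line = "CPU states: 42.3% idle,  4.8% user, 52.9% kernel,  0.0% iowait,  0.0% swap"
states = parse_cpu_states(line)
print(states["kernel"])
```

Appending the kernel figure to a file once a minute would give a rough record of the trend.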


We have seen this pattern repeat itself three times so far in the week since 
our upgrade.  
The other two times the kernel grew to consuming over 70% of the cpu time and 
the free memory dropped into the single digits, grinding the system to a near 
halt and causing named to miss queries.  
Killing the named job neither restored the free memory, nor returned the kernel 
to a normal level of cpu usage.
A reboot returned the system to a usable state but only for a few days as the 
kernel slowly but steadily began to eat up more of the cpu time and the free 
memory dwindled.
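
As a rough back-of-the-envelope check on how quickly this becomes critical, the Python sketch below extrapolates the free-memory decline linearly. The inputs are the approximate figures from this message (about 400mb free two days ago, 119mb free now); the helper name days_until_exhausted is hypothetical, and a linear decline is only an assumption, since the real curve may not be linear.

```python
def days_until_exhausted(free_then_mb, free_now_mb, days_elapsed):
    """Estimate days until free memory hits zero, assuming a linear decline."""
    # Average loss per day between the two samples.
    rate = (free_then_mb - free_now_mb) / days_elapsed
    if rate <= 0:
        return None  # memory is not shrinking, so no exhaustion estimate
    return free_now_mb / rate

# Approximate figures from the message: ~400mb free two days ago, 119mb now.
remaining = days_until_exhausted(400, 119, 2)
print(round(remaining, 2))
```

With these figures the loss rate works out to roughly 140mb per day, which suggests less than a day of headroom if the trend holds.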

For a dramatic demonstration of the trend we are seeing take a look at these
graphs which begin shortly after the last reboot:

this one shows cpu usage as 'idle', 'user', or 'system':
http://hotrod.uchicago.edu/orca/o_unicron_usr_pct,__sys_pct,__100_-_usr_pct_-_sys_pct.html

this one shows free memory:
http://hotrod.uchicago.edu/orca/o_unicron_1024_X_freememK.html

Moreover, our secondary name servers, which we upgraded to bind 9 many moons 
ago, are all running the same build of bind on the same build of solaris on the 
same hardware (netra t1s) and have never had a problem like this. 
And some of our secondaries answer far more queries than our primary, which, 
judging from the number of packets in/out of its interfaces per second, is not 
especially busy.

We are aware of the general bind 9 memory leak wherein named's cache simply 
grows w/o bound, but have added the option:

    max-cache-size 300000000; # 300 million bytes

to our named.conf file, which is supposed to fix this.  Moreover, on our 
secondary name servers the cache memory leak was a slow going thing, taking a 
month or more to go through 512mb of memory.

Anybody seen anything like this before?
The general consensus among aged and wise sys admins around here is that
something must be going terribly wrong for the kernel to be gobbling up all
the cpu time, but no one is sure what.

Thanks,
Scott Matott sXe
-----
Scott Matott sXe
Networking Specialist
Data Network Operations
Voice and Data Networking
The University of Chicago




