Howdy,
Last week, we upgraded our primary name server to BIND 9.2.1 running on
a Sun Netra T1, 512 MB RAM, with Solaris 8 kernel revision Generic_108528-12.
Something seems to be going wrong in that the kernel is slowly but surely eating
up all of the CPU time. And the free memory seems to be steadily declining, down
to just over 100 MB free from over 400 MB free just two days ago, when named was
started.
Currently a top on the system shows this:
load averages: 1.17, 1.16, 1.23 11:21:02
23 processes: 22 sleeping, 1 on cpu
CPU states: 42.3% idle, 4.8% user, 52.9% kernel, 0.0% iowait, 0.0% swap
Memory: 512M real, 119M free, 98M swap in use, 1574M swap free
PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
161 named 6 58 0 70M 69M sleep 150:02 6.80% named
We have seen this pattern repeat itself three times so far in the week since
our upgrade.
The other two times the kernel grew to consuming over 70% of the cpu time and
the free memory dropped into the single digits, grinding the system to a near
halt and causing named to miss queries.
Killing the named job neither restored the free memory, nor returned the kernel
to a normal level of cpu usage.
A reboot returned the system to a usable state but only for a few days as the
kernel slowly but steadily began to eat up more of the cpu time and the free
memory dwindled.
For a dramatic demonstration of the trend we are seeing, take a look at these
graphs, which begin shortly after the last reboot.
This one shows cpu usage as 'idle', 'user', or 'system':
http://hotrod.uchicago.edu/orca/o_unicron_usr_pct,__sys_pct,__100_-_usr_pct_-_sys_pct.html
This one shows free memory:
http://hotrod.uchicago.edu/orca/o_unicron_1024_X_freememK.html
Moreover, our secondary name servers, which we upgraded to BIND 9 many moons
ago, are all running the same build of BIND on the same build of Solaris on the
same hardware (Netra T1s) and have never had a problem like this.
And some of our secondaries answer far more queries than our primary, which,
judging from the number of packets in/out of its interfaces per second, is not
especially busy.
We are aware of the general BIND 9 memory leak wherein named's cache simply
grows without bound, but we have added the option:
    max-cache-size 300000000; # 300 million bytes
to our named.conf file, which is supposed to fix this. Moreover, on our
secondary name servers the cache memory leak was a slow-going thing, taking a
month or more to go through 512 MB of memory.
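To put rough numbers on that difference, here is a back-of-the-envelope sketch in Python. It assumes the decline is roughly linear, rounds "over 400mb" and "just over 100mb" to 400 and 100, and reads "a month or more" as 30 days; those round figures are assumptions for illustration, not measurements.

```python
# Back-of-the-envelope comparison of memory consumption rates.
# All inputs are rounded from the figures quoted in this message and
# assume a roughly linear decline; they are estimates, not measurements.

def mb_per_day(mb_consumed, days):
    """Average memory consumed per day, in MB."""
    return mb_consumed / days

# Primary: free memory fell from ~400 MB to ~100 MB in about 2 days.
primary_rate = mb_per_day(400 - 100, 2)

# Secondaries: took at least ~30 days to go through ~512 MB.
secondary_rate = mb_per_day(512, 30)

ratio = primary_rate / secondary_rate

print(f"primary:   {primary_rate:.0f} MB/day")
print(f"secondary: {secondary_rate:.0f} MB/day")
print(f"primary is consuming memory roughly {ratio:.0f}x faster")
```

Even with generous rounding, the primary appears to be losing memory close to an order of magnitude faster than the secondaries ever did, which suggests this is something other than the ordinary cache growth.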
Anybody seen anything like this before?
The general consensus among aged and wise sys admins around here is that something
must be going terribly wrong for the kernel to be gobbling up all the cpu time,
but no one is sure what.
Thanks,
Scott Matott sXe
-----
Scott Matott sXe
Networking Specialist
Data Network Operations
Voice and Data Networking
The University of Chicago