RE: FW: [TEC] How to monitor the availability of TEC: msg#00492

sysutils.tivoli.tme10

Subject: RE: FW: [TEC] How to monitor the availability of TEC

After I came down off my coffee high, I noticed that what I wrote before looked kind of like techno-gibberish.  Here's a little more info for anyone interested.
 
We use the method below to make sure TEC is still running.  TEC continually gives us an "I'm OK" signal by updating a file on the TEC server, and a Tivoli-independent process (crontab/perl script) checks that the "I'm OK" file has not gone too long without an update.  When TEC quits processing events, the problem may be (and has been) any number of things: a network problem, a DB problem, a TMR problem, or some other weird TEC problem that just causes it to hang (fortunately this doesn't happen very much any more).  When TEC quits working, the perl script sends emails and pages to get somebody to check it out.  It's usually not a Tivoli problem, but the Tivoli admin who checks it out gets the ball rolling to get it fixed.
 
Besides the "TEC heartbeat", we also went so far as to make a cron heartbeat.  That is because *sometimes* the crontab, which we depend on for the TEC heartbeat, quits working (usually because the root password has expired).  The crontab heartbeat works similarly: 1) touch a file from the crontab, 2) check the file with a perl script running as a scheduled Tivoli Job (a cron-independent checker).
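A minimal Perl sketch of the two halves of such a heartbeat (one side touches the file, the other checks its age) might look like the following. It is only an illustration under stated assumptions: the file path, the 15-minute threshold, and the sub names touch_heartbeat and heartbeat_is_stale are hypothetical and do not come from the setup described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical path and threshold, chosen only for illustration.
my $hb_file   = $ENV{HB_FILE} || '/tmp/cron_heartbeat.example';
my $threshold = 15 * 60;    # seconds without an update before we complain

# Writer half: what a crontab entry could run instead of /usr/bin/touch.
# Creates the file if needed, then sets atime and mtime to "now".
sub touch_heartbeat {
    my ($file) = @_;
    open(my $fh, '>>', $file) or die "cannot open $file: $!";
    close($fh);
    my $now = time();
    utime($now, $now, $file) or die "cannot update $file: $!";
    return $now;
}

# Checker half: true if the file is missing or older than $max_age seconds.
# $now is optional and defaults to the current time.
sub heartbeat_is_stale {
    my ($file, $max_age, $now) = @_;
    $now = time() unless defined $now;
    my $mtime = (stat($file))[9];      # element 9 of stat() is mtime
    return 1 unless defined $mtime;    # no file at all counts as stale
    return ($now - $mtime) > $max_age ? 1 : 0;
}

touch_heartbeat($hb_file);
print heartbeat_is_stale($hb_file, $threshold) ? "STALE\n" : "OK\n";
```

In practice the two subs would live in separate scripts, so that the writer runs under one scheduler and the checker under the other, and neither depends on the thing it is watching.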
 
This process has worked well.  BTW, it wasn't accurate to say we are "preparing to implement" it.  We've been using it a long time.  But recently I moved the "touch" of the "I'm OK" file inside the rules for efficiency.  We used to have exec_program() call a shell script which executed /usr/bin/touch for every new event (I know, that was stoopid).
 
-James
-----Original Message-----
From: owner-tme10-XtjxT7Vmt5b1ENwx4SLHqw@xxxxxxxxxxxxxxxx [mailto:owner-tme10-XtjxT7Vmt5b1ENwx4SLHqw@xxxxxxxxxxxxxxxx]On Behalf Of Redusj-pWz/JrSLlZPq+pQ9gifPwA@xxxxxxxxxxxxxxxx
Sent: Tuesday, March 22, 2005 11:40 AM
To: tme10-XtjxT7Vmt5b1ENwx4SLHqw@xxxxxxxxxxxxxxxx
Subject: RE: [tme10] FW: [TEC] How to monitor the availability of TEC

We are preparing to implement the rules below to 'touch' a file each XXX seconds if events were received. To know if events were received we are counting every new event.  I didn't want to have to count every event, but I have observed situations where events quit being processed (e.g., if we lose DB comm) and the timer still fires. 
 
While the file is being 'touched' from the rules, the crontab is running a perl script that does a stat() on the file to see how long since it was last modified.  I.e.,
 
...
    $now = time();
    # Element 9 of stat()'s return list is the last-modified time (mtime).
    $mtime = (stat($Tec_check_file))[9];
    # A missing file (stat fails, $mtime undefined) is treated as stale too.
    $diff = defined($mtime) ? $now - $mtime : $threshold + 1;
    if ($diff > $threshold) {
        do_something();    # send emails, pages
    }
...
 
(I haven't looked at the other methods posted yet - they might be more efficient than mine)
 
 
-James
 
 
--------------------------------------------------
 
 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% This rule counts each event. Used by tec_hb.
%
rule: init_count_each_evt: (
   event: _event of_class _class,
 
   reception_action: (
      get_global_var('TEC_HB', 'COUNT', _old_count, 0),
      _new_count is _old_count + 1,
      set_global_var('TEC_HB', 'COUNT', _new_count)
   )
).
 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% This rule schedules a tec_hb timer.
%
rule: init_tec_start_schedule_hb_timer: (
   event: _event of_class 'TEC_Start',
 
   reception_action: (
      get_global_var('TEC_HB', 'TIMER_STARTED', _started, 'NOPE'),
      _started == 'NOPE',
      set_global_var('TEC_HB', 'TIMER_STARTED', 'YEP'),
      first_instance(event: _tic of_class 'TEC_Tick' where []),
      set_timer(_tic, 30, 'tec_hb')
   )
).
 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Every _duration seconds we'll 'touch' a file for tec hb.
%
timer_rule: tec_hb_touch: (
   event: _tic of_class 'TEC_Tick' where [],
   timer_info: equals 'tec_hb',
   timer_duration: _duration,
 
   action: (
 
      get_global_var('TEC_HB', 'COUNT', _count, 0),
      get_global_var('TEC_HB', 'LAST_COUNT', _last_count, 0),
      set_global_var('TEC_HB', 'LAST_COUNT', _count),
      _interval_count is _count - _last_count,
 
      % only continue w/ 'touch' if _interval_count > 0.
      _interval_count > 0,
 
      get_local_time(_time_local_struct),
      resolve_time(_time_local_struct, _seconds, _minutes, _hours, _day_of_month, _month0, _year0, _day_of_week, _day_of_year, _daylight_savings),
      _year4 is _year0 + 1900,
      _month is _month0 + 1,
 
      sprintf(_log_entry, '%04d-%02d-%02d/%02d:%02d:%02d Events(Total/Interval):%d/%d', [_year4, _month, _day_of_month,_hours,_minutes,_seconds, _count, _interval_count ]),
      % Mode a appends a line on every heartbeat, so the file keeps growing.
      % Probably want to change file mode from a->w.
      fopen(_hbfile, '/Tivoli/custom/log/dm_hb/heartbeat.tec', a),
      fprintf(_hbfile,'%s\n',[_log_entry]),
      fclose(_hbfile)
   ),
 
   action: (
      set_timer(_tic, _duration, 'tec_hb')
   )
 
).
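The gating done in the timer rule above (write a heartbeat only if the event count has moved since the previous tick) can be illustrated outside TEC. The Perl sketch below is not a translation of the rules, just a stand-alone model of the same comparison; the sub name make_tick and the simulated totals are hypothetical.

```perl
use strict;
use warnings;

# Returns a closure that mimics one timer tick: it compares the running
# event total with the total seen at the previous tick and reports whether
# a heartbeat should be written.
sub make_tick {
    my $last_count = 0;
    return sub {
        my ($count) = @_;
        my $interval = $count - $last_count;
        $last_count = $count;    # remembered even when no heartbeat is due
        return ($interval > 0, $interval);
    };
}

my $tick = make_tick();

# Simulated running totals at four consecutive ticks (made-up numbers).
for my $total (5, 12, 12, 20) {
    my ($beat, $interval) = $tick->($total);
    printf "total=%d interval=%d %s\n", $total, $interval,
        $beat ? 'touch heartbeat' : 'skip (no new events)';
}
```

At the third tick the total has not moved, so no heartbeat is written even though the timer fired. That is the case the event counter is there to catch: a timer that keeps running while event processing has stalled.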
 
 
 
 
 
 
 
 
 
 
-----Original Message-----
From: owner-tme10-XtjxT7Vmt5b1ENwx4SLHqw@xxxxxxxxxxxxxxxx [mailto:owner-tme10-XtjxT7Vmt5b1ENwx4SLHqw@xxxxxxxxxxxxxxxx]On Behalf Of Nes van, P (Peter)
Sent: Tuesday, March 22, 2005 5:32 AM
To: tme10-XtjxT7Vmt5b1ENwx4SLHqw@xxxxxxxxxxxxxxxx
Subject: [tme10] FW: [TEC] How to monitor the availability of TEC

Hi list,
 
Just curious...
 
How do you monitor the availability of your Tivoli environment?
 
When you have a single-TMR environment with separate TMR and TEC servers, your automated incident registration is connected to your TMR. Monitoring the availability of your TEC is then essential. What we need is an indication when the TEC server is unavailable.
When a TEC server is shut down using the wstopesvr command, a TEC_Stop event is generated which is visible on the TEC console. In this case you will get a notification that the event server is unavailable. When the tec_* processes are killed (or aborted by a core dump), or the event server gets flooded by events, the console is unable to detect the unavailability.
This is because the TEC (Java) console queries the DB directly and does not communicate with the tec_ui_server unless a person changes something through the interface (acknowledging or closing events).
 
Has anyone found the ultimate solution, or does anyone know about future developments concerning the TEC Console which will deal with this problem?
 
Cheers,
 
 
Peter
 
 
 
