[Bug 62318] healthcheck executed more often than configured in hcinterval


https://bz.apache.org/bugzilla/show_bug.cgi?id=62318

--- Comment #1 from WJCarpenter <bill-apache@xxxxxxxxxxxxx> ---
I recently hit this problem, and I think I understand why it happens. I did a
simple proof-of-concept change to validate my theory. Although I was also
working with 2.4.33, I don't see any recent changes to suggest it's a new bug.
BTW, I saw this on a Windows 10 Enterprise machine.

The health checks are done via callbacks from the watchdog timer module. The
callback is scheduled to run every AP_WD_TM_SLICE interval. On my server,
that's 100 milliseconds (I didn't investigate whether it's 100ms for everyone).
When triggered, the callback iterates over all of the configured health checks
to see which need attention. And, for any that are "due", the health check is
performed.

In mod_proxy_hcheck.c, in function hc_watchdog_callback(), near line 928 (in
2.4.33), there is this "if" condition:

if (!PROXY_WORKER_IS(worker, PROXY_WORKER_STOPPED) &&
   (worker->s->method != NONE) &&
   (now > worker->s->updated + worker->s->interval)) {

The last part is intended to mean "if the HC interval has passed since the last
time we checked, then do a health check".

If a worker is "due", according to that "if" condition, the actual health
check is kicked off in function hc_check(), near line 811 (in 2.4.33). It
makes the TCP or HTTP health check call, evaluates the results, and updates
the "updated" timestamp.

====================

That's what the code does. Now, here is a description of the problem as I
understand it:

1. It's kind of icky to be polling every 100 ms compared to what would be a
normal health check interval [default is 60 s]. But I don't know if there is a
wiser way to do it in httpd.

2. The "if" condition and surrounding logic don't keep track of whether there
is already a health check in progress for a given target. I suppose there could
be a reason to allow multiple requests (hung connections?), but in any case it
at least contributes to this problem.

3. The "updated" timestamp is updated after the health check is completed, but
it is updated with a timestamp from when the health check began. In other
words, if a health check call takes 1000 ms, the "updated" timestamp will
eventually be set to a time 1000 ms in the past.

The points just mentioned conspire to keep the health check "due" on every 100
ms watchdog callback if the individual health checks take longer than 100 ms to
complete. Suppose, for example, each health check for a given target takes 1000
ms. Then there will be 9 or 10 in-progress health checks for that target. As
each finishes, it will mark the "updated" timestamp to 1000 ms in the past. The
situation will never resolve until the elapsed time for the health checks drops
below 100 ms. That's made less likely by the additional health checks
contributing to the load.

====================

To test the theory, I locally modified the hc_check() function to set the
"updated" timestamp to the time the health check finishes instead of the time
it started. That produced the desired behavior. I have not examined that
solution for unexpected consequences.

-- 
You are receiving this mail because:
You are the assignee for the bug.