I upgraded one of my UML instances from 2.4.21 to 2.4.24-um1 this evening
and started experiencing strange time dialation. For instance:
-=- cut here -=-
ewen@umldebwoody:~$ uname -r && date && sleep 1 && date
2.4.21-0um
Sun Feb 29 06:23:24 UTC 2004
Sun Feb 29 06:23:25 UTC 2004
-=- cut here -=-
-=- cut here -=-
ewen@umldebwoody:~$ uname -r && date && sleep 1 && date
2.4.24-1um
Sun Feb 29 06:25:56 UTC 2004
Sun Feb 29 06:26:34 UTC 2004
-=- cut here -=-
Notice how the sleep of 1 second actually takes approximately 38 seconds
of wallclock time to complete in 2.4.24-1um, but in the 2.4.21 UML kernel
it takes the expected 1 second. (It really took ages on the 2.4.24-um1
kernel; it's not just date(1) lying.)
The host machine is a Mobile Pentium III, running at 733MHz, with 512MB
of RAM, running Linux 2.4.24 with the skas3 patch (which is detected by
both the 2.4.21 UML and 2.4.24-um1 UML kernels). It has ACPI enabled,
including CONFIG_ACPI_SLEEP=y which uses ACPI for the CPU idle handler
(to reduce power consumption -- as the Mobile Pentium III might suggest
the machine is a laptop).
After spending the last 4 hours or so looking for differences I've
narrowed down the problem to CONFIG_UML_REAL_TIME_CLOCK; if it is set to
"y" (the recommended default) then weird time dialations occur. If it
is set to "n" then time passes relatively normally.
CONFIG_UML_REAL_TIME_CLOCK controls a small section of code in
arch/um/kernel/time_kern.c:
-=- cut here -=-
#if defined(CONFIG_UML_REAL_TIME_CLOCK)
unsigned long long tsc;
/* We've had 1 tick */
tsc = time_stamp();
delta += tsc - prev_tsc;
prev_tsc = tsc;
ticks += (delta * HZ) / host_hz;
delta -= (ticks * host_hz) / HZ;
#else
ticks = 1;
#endif
-=- cut here -=-
If it is set to "n" then there is always one tick activated for each call
to timer_irq() which will usually track real time fairly well except
under very heavy load. If it is set to "y" then the code attempts to
figure out how many ticks really have happened and put them in (and
tries to keep track of how far it is off from "real" time).
The relevant external values are:
- time_stamp() -- obtains the value of the host TSC counter (via adm rdtsc)
- host_hz -- calculated from the CPU MHz value of the host CPU on startup
- HZ -- value for UML kernel (currently 100)
The code assumes that that time_stamp() (viz, rdtsc, the TSC counter)
is a regularly increasing counter value that can be relied on to track
real time. Unfortunately that assumption is simply false on this host
CPU (I suspect because of CONFIG_ACPI_SLEEP=y, but I haven't verified this).
The TSC counter is monotonically increasing, but how much it increases
in any given wallclock time period varies depending on the load on the
machine. This can be observed with this simple program:
-=- cut here -=-
#include <stdio.h>
#include <unistd.h>
unsigned long long time_stamp(void)
{
unsigned long low, high;
asm("rdtsc" : "=a" (low), "=d" (high));
return((((unsigned long long) high) << 32) + low);
}
int main(int argc, char *argv[])
{
unsigned long long start_tsc = time_stamp();
unsigned long long end_tsc = 0ULL;
sleep (1);
end_tsc = time_stamp();
printf("%lld\n", (end_tsc-start_tsc));
}
-=- cut here -=-
When run with the CPU basically idle, you get values like this:
-=- cut here -=-
ewen@basilica:~ $ ./tscdiff
8571409
ewen@basilica:~ $ ./tscdiff
8048297
ewen@basilica:~ $ ./tscdiff
7950872
ewen@basilica:~ $ ./tscdiff
8492941
-=- cut here -=-
But when run with the CPU kept busy (bzip2 -d <... >/dev/null), you get
values like this:
-=- cut here -=-
ewen@basilica:~ $ ./tscdiff
725245279
ewen@basilica:~ $ ./tscdiff
715308846
ewen@basilica:~ $ ./tscdiff
722927651
ewen@basilica:~ $ ./tscdiff
737779053
-=- cut here -=-
Note how there are nearly two orders of magnitude difference in those
values. The ones with the CPU busy are close to what one might expect
for a 733.33 MHz machine (733330000 clock ticks per second), and the
ones with the CPU idle are _much_ smaller than that, suggesting that
the CPU is being clocked around 1% of the time when idle.
Needless to say the combination of the assumption that the host TSC
increases regularly with CPU_MHZ*10000000 clock ticks per second, and
a host TSC which only increases regularly with that many clock ticks
when it is kept busy plays extreme havoc with the time keeping inside
the UML kernel.
Obviously the answer for me for this host is to disable
CONFIG_UML_REAL_TIME_CLOCK since it does much more harm than good
on this host. However this will not be the only host with this issue
(I strongly suspect any host using ACPI and CONFIG_ACPI_SLEEP will
have this issue), and the kernel's Configure.help suggests that NUMA
boxes and various other things will have non-uniforum TSC values
(and provides an option to disable TSC in those cases).
I would suggest:
- a prominent warning be included in the the Configure.help entry
for CONFIG_UML_REAL_TIME_CLOCK that it should NOT be enabled if
your TSC does not increase regularly (due to ACPI/APM sleeping
the processor, NUMA, or otherwise).
- ideally there would be a kernel config option that could enable/disable
this feature at runtime, so that distributions can ship a UML kernel which
can be made to work reliably on boxes with both regularly increasing
TSCs and those without (otherwise the distributions will probably
have to disable this config option, because the time skew is MUCH
worse with it enabled and TSC not increasing regularly than without
it enabled on a host where TSC does increase regularly).
Or alternatively that the UML kernel's idea of the time be latched to
the host kernel's idea of the time, which is kept accurate by taking
care of these sorts of issues anyway. (Eg, with the CONFIG_ACPI_SLEEP
the host kernel manages to keep its time of day straight.)
This could perhaps be achieved by tying it to changes in the host kernel's
gettimeofday(2) values -- obviously gettimeofday(2) has more overhead
than "asm rdtsc" but with some care it could be done, eg, ten times a
second to resynchronise things reducing the overhead considerably.
This would probably also avoid any drift due to the TSC-increments/second
value being rounded to 5-6 significant figures (of 9-10) by the way
the CPU MHz value is retrieved and turned into clock ticks. (Likely
to be several seconds a week by my count.)
Ewen
-------------------------------------------------------
SF.Net is sponsored by: Speed Start Your Linux Apps Now.
Build and deploy apps & Web services for Linux with
a free DVD software kit from IBM. Click Now!
http://ads.osdn.com/?ad_id=1356&alloc_id=3438&op=click
|