Cet article est disponible en francais.

Using Hyper-V Server, you may find that the time is drifting a lot from the actual time, especially when Guest Virtual Machines are using CPUs heavily. The host OS is also virtualized, which means that the load of the host is also making the clock drift.

How to prevent the clock from drifting

  1. Disable the Time Synchronization in the Integration Services. (Warning, this setting is defined per snapshot)
  2. Import the following registry file :

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\W32Time\Config]
    "MaxAllowedPhaseOffset"=dword:00000001
    "SpecialPollInterval"=dword:00000005
    "SpecialInterval"=dword:00000001


    Note: If you are using notepad, make sure to save the file using an Unicode encoding.

  3. If the guest OS (and the server OS) is not on a domain, type in the following to set the time source :

    w32tm /config /manualpeerlist:"time.windows.com,0x01 1.ca.pool.ntp.org,0x01 2.ca.pool.ntp.org,0x01" /syncfromflags:MANUAL /update

    Note: Hosts FQDN are separated by spaces.
  4. Run the following command to force a time synchronization

    w32tm /resync
  5. Check that the clock is not drifting anymore by using this command :

    w32tm /monitor /computer:time.windows.com

 

A bit of background ...

The system I’m currently working on is heavily based on time. It relies a lot on timestamps taken from various steps of the process. If for some reason the system clock is unstable, that means the data generated by the system is unreliable. It sometimes generates corrupt data, and this is not good for the business.

I was investigating a sequence of events stored in the database in order that could not have happened, because the code cannot generate it this way.

After loads of investigation looking for code issues, I stumbled upon something rather odd in my application logs, considering that each line from the same thread should be time stamped later than the previous :

2009-10-13T17:15:26.541T [INFO][][7] ...
2009-10-13T17:15:26.556T [INFO][][7] ...
2009-10-13T17:15:24.203T [INFO][][7] ...
2009-10-13T17:15:24.219T [INFO][][7] ...
2009-10-13T17:15:24.234T [INFO][][7] ...

All the lines above were generated from the same thread, which means that the system time changed radically between the second and the third line. From the application point of view, the time went backward of about two seconds and that also means that during that two seconds, there were data generated in the future. This is not very good...

 

The Investigation

Looking at the Log4net source code, I confirmed that the time is grabbed using System.DateTime.Now call, which excludes any code issues.

Then I looked at the Windows Time Service utility, and by running the following command :

w32tm /stripchart /computer:time.windows.com

I found out that the time difference from the NTP source was very different, something like 10 seconds. But the most disturbing was not the time difference itself, but the evolution of that time difference.

Depending on the load of the virtual machine, the difference would grow very large, up to a second behind in less than a minute. Both the host and the guest machines were exposing this behavior. Since Hyper-V Integration Services are by default synchronizing the clock of all the virtual machines on the guest OS, that means that the load of a single virtual machine can influence the clock of all other virtual machines. The host machine CPU load can also influence the overall clock rate, because it is also virtualized.

 

Trying to explain this behavior

To try and make an educated guess, the time source used by windows seems to be the TSC of the processor (by the use of the RDTSC opcode), which is virtualized. The preemption of the CPU by other virtual machines seems to have an negative effect on the counter used as a reference by windows.

The more the CPU is preempted, the more the counter drifts.

 

Correcting the drift

By default, the Time Service has a “phase adjustment” process that slows down or speeds up the system clock rate to match a reliable time source. The TSC counter on the physical CPU is clocked by the system Quartz (If it is still like this). The “normal” drift of that kind of component is generally not very important, and may be related to external factors like the temperature of the room. The time service can deal with that kind of slow drift.

But the default configuration does not seem to be a good fit for a time source that drifts this quickly and is rather unpredictable. We need to shorten the process of phase adjustment.

Fixing this drift is rather simple, the Time Service needs to correct the clock rate more frequently, to cope with the load of the virtual machines that slow down the clock of the host.

Unfortunately, the default parameters on Hyper-V Server R2 are those of the default member of a domain, which are defined here. The default polling period from a reliable time source is way too long, 3600 seconds, considering the drift faced by the host clock.

A few parameters need to be adjusted in the registry for the clock to stay synchronized :

  • Set the SpecialInterval value to 0x1 to force the use of SpecialPollInterval.
  • Set SpecialPollInterval to 10, to force the source NTP to be polled every 10 seconds.
  • Set the MaxAllowedPhaseOffset to 1, to force the maximum drift to 1 second before the clock is set directly, if adjusting the clock rate failed.

Using these parameters will not mean that the clock will stay perfectly stable, but at the very least it will correct itself very quickly.

It seems that there is a hidden boot.ini parameter for Windows 2003, /USEPMTIMER, which forces windows to use the ACPI timer and avoid that kind of drift. I have not been able to confirm this has any effect at all, and I cannot confirm if the OS is actually using the PM Timer or the TSC.