Microarchitectural Determinants of Virtual Machine Performance
Some time ago I posted a tweet. That statement was an oversimplification, so in this article I will expand on the idea from a less abstracted perspective, focusing purely on the microarchitectural and hypervisor-level factors that influence VM performance. This article is a case study on hardware-assisted virtualization, with a primary focus on Intel VT-x.
Introduction
The performance profile of a virtual machine is the result of a complex interplay between physical resource availability, hypervisor orchestration logic, and the specific microarchitectural constraints of the underlying hardware. Virtualization efficiency is governed by the latency of hardware transitions, the depth of address translation walks, and the isolation of shared execution resources. This analysis explores the low-level determinants of core virtual machine performance.
Public Enemy No. 1: VM Transitions
The fundamental unit of overhead in hardware-assisted virtualization is the transition between the VMM (hypervisor) and the guest virtual machine. This transition occurs when control and execution switch between the hypervisor context and the guest. The transition from the hypervisor to the guest is called a VM entry, while the switch from the guest back to the hypervisor is called a VM exit.
Mechanics of VM Exit
A VM exit occurs when the guest executes an instruction or encounters an event that requires hypervisor intervention to preserve isolation and security. It is triggered by sensitive operations, such as privileged instructions (e.g. I/O port accesses, CPUID, modifications to control registers like CR3 for page-table switches), interrupts, exceptions, or memory accesses that violate virtualization rules. Upon detection, the processor saves the guest's state into the Virtual Machine Control Structure (VMCS), restores the host's state, and transfers control to the hypervisor.
When the transition occurs, the hypervisor determines the exit reason by reading the VMCS EXIT_REASON field (encoding 0x4402), which records why the guest triggered the VM exit, and then performs the necessary actions, such as trap-and-emulating the instruction or handling the interrupt. This process incurs latency because it is CPU-intensive, due to several factors, which include:
Reading the VMCS
When a guest executes a privileged instruction, the CPU performs a VM exit. The hypervisor then has to read the VMCS using a dedicated instruction, VMREAD. Although the Intel SDM does not specify the precise latency of this instruction, the good folks at KVM conducted performance benchmarks indicating that a VMREAD operation typically takes roughly 150 to 250 CPU cycles, depending on the microarchitecture and system conditions. This repeated VMREAD overhead constitutes a significant bottleneck in most conventional VMX-based hypervisors.
Pipeline flushes
Modern CPUs are deeply pipelined, meaning they process many instructions at different stages simultaneously. The challenge, and the reason for the pipeline flush, is that the CPU cannot mix guest instructions with host instructions in the same pipeline for security and architectural reasons. So, upon the exit, the pipeline must be completely flushed.
Trap-and-Emulate (Instruction interpretation)
If the exit was caused by a privileged instruction, e.g. CPUID, the hypervisor doesn't let the hardware run it; it must emulate it. Using the CPUID instruction as an example, that would mean:
- The hypervisor would need to parse the guest's RAX value at the time of the exit.
- Calculate what the guest should see, perhaps limiting certain CPU features.
- Write the fake results back into the guest's register image.
- Increment the guest's RIP (instruction pointer) to point to the next instruction so the guest doesn't get stuck in a loop.
Heavy Memory Operations (EPT Violations)
The most expensive exits often involve Extended Page Tables (EPT). If the guest tries to access memory that isn't mapped, an EPT violation occurs. When this happens, the hypervisor must walk its own page tables, verify permissions, potentially swap data in from disk, and update the EPT entry. This involves multiple memory accesses, which are orders of magnitude slower than register operations.