KVM TLB Shootdown Preemption Notes

Notes on “Towards a more Scalable KVM Hypervisor”, presented at KVM Forum 2018.

[PATCH v8 0/4] KVM: X86: Add Paravirt TLB Shootdown

Guest OSes are affected by the host OS scheduler during OS-level synchronization such as TLB shootdown and RCU processing. Operations that complete almost immediately on bare metal can incur significant delays in a virtual environment.

The TLB (Translation Lookaside Buffer) caches mappings between virtual and physical memory addresses. When one CPU changes a mapping, the stale entries must be flushed from the TLBs of the other CPUs. This is called TLB shootdown.

Modern OSes treat TLB shootdown as a performance-critical path and tune it to keep its latency low. The remote flush is implemented with an IPI (Inter-Processor Interrupt) on the assumption that it completes almost immediately, and the initiating CPU busy-waits until the remote CPUs have finished flushing. This works well on bare metal. In a virtual environment, however, a target vCPU may be preempted in favor of other guests or otherwise blocked from running, so the busy wait can last a long time.
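The cost of the busy wait can be illustrated with a deliberately simplified model. This is a sketch, not kernel code: the function name and the idea of a per-target acknowledgement delay are assumptions made for illustration. The initiator cannot proceed until the slowest target has acknowledged, so a single descheduled vCPU dominates the whole shootdown.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: ack_delay_us[i] is the time target CPU i needs
 * to receive the flush IPI, flush its TLB, and acknowledge.
 * The initiator spins until every target has acknowledged, so its
 * wait time is the maximum of the individual delays.
 */
unsigned long shootdown_wait_us(const unsigned long *ack_delay_us, size_t n)
{
    unsigned long wait = 0;

    assert(ack_delay_us != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (ack_delay_us[i] > wait)
            wait = ack_delay_us[i];
    }
    return wait;
}
```

With delays of a few microseconds on every target the wait stays at a few microseconds. If one target is a vCPU that will not be scheduled for several milliseconds, the initiator spins for that entire period.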

Paravirtualized TLB shootdown solves this problem. Flushes that target inactive vCPUs are deferred, and KVM performs the TLB flush when that vCPU next starts running. The performance improvement is particularly noticeable in overcommitted environments.

A flag indicating “whether a certain vCPU has been preempted” is kept in a memory area shared between guest and host. The pv_mmu_ops.flush_tlb_others function sends the TLB flush request via IPI to active vCPUs and sets the KVM_VCPU_FLUSH_TLB flag for inactive ones. Later, when KVM starts a vCPU that has the KVM_VCPU_FLUSH_TLB flag set, it issues INVVPID. The benefit of this optimization becomes more pronounced as the number of VMs increases.
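The flow described above can be sketched as a small single-threaded user-space model. This is not the kernel implementation: the struct and function names (vm_model, model_flush_tlb_others, and so on) are hypothetical, the counters stand in for real IPIs and real TLB flushes, and the atomic update the guest would need when racing with the host is omitted. Only the two flag values are taken from the Linux KVM paravirt header, where they live in the preempted field of struct kvm_steal_time.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits as defined in the Linux uapi header asm/kvm_para.h. */
#define KVM_VCPU_PREEMPTED (1 << 0)
#define KVM_VCPU_FLUSH_TLB (1 << 1)

#define MAX_VCPUS 64

/* Hypothetical stand-in for the per-vCPU area shared by guest and host. */
struct shared_state {
    uint8_t preempted;
};

/* Hypothetical model of one VM; counters replace real side effects. */
struct vm_model {
    struct shared_state st[MAX_VCPUS];
    unsigned ipis_sent;     /* flush IPIs sent by the guest */
    unsigned host_flushes;  /* deferred flushes done by the host */
};

/* Host side: the vCPU is descheduled. */
void model_vcpu_preempt(struct vm_model *vm, unsigned cpu)
{
    assert(cpu < MAX_VCPUS);
    vm->st[cpu].preempted = KVM_VCPU_PREEMPTED;
}

/* Guest side: IPI the running targets, flag the preempted ones. */
void model_flush_tlb_others(struct vm_model *vm, uint64_t targets)
{
    for (unsigned cpu = 0; cpu < MAX_VCPUS; cpu++) {
        if (!(targets & (UINT64_C(1) << cpu)))
            continue;
        if (vm->st[cpu].preempted & KVM_VCPU_PREEMPTED)
            vm->st[cpu].preempted |= KVM_VCPU_FLUSH_TLB; /* defer */
        else
            vm->ipis_sent++; /* would send an IPI and wait for it */
    }
}

/* Host side: the vCPU is about to run again. */
void model_vcpu_enter(struct vm_model *vm, unsigned cpu)
{
    assert(cpu < MAX_VCPUS);
    if (vm->st[cpu].preempted & KVM_VCPU_FLUSH_TLB)
        vm->host_flushes++; /* real KVM flushes this vCPU's TLB here */
    vm->st[cpu].preempted = 0;
}
```

In this model, a flush that targets one running and one preempted vCPU sends a single IPI, and the preempted vCPU's flush is performed once by the host when that vCPU is next entered. The initiator therefore never spins waiting for a vCPU that is not running.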