kvm PLE Code Analysis

Published by EwanHai on 2021-02-24

Linux source version: 5.3.0

Related data structures

#define KVM_DEFAULT_PLE_GAP		128  // ple_gap
#define KVM_VMX_DEFAULT_PLE_WINDOW	4096 // ple_window
// grow factor for ple_window: each call to grow_ple_window() doubles it
#define KVM_DEFAULT_PLE_WINDOW_GROW	2
// shrink factor for ple_window
#define KVM_DEFAULT_PLE_WINDOW_SHRINK	0
// upper bound that ple_window may never exceed
#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX	UINT_MAX

// Initialization of ple_window and ple_gap, provided this vcpu has not disabled PLE
if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
	vmcs_write32(PLE_GAP, ple_gap);
	vmx->ple_window = ple_window;
	vmx->ple_window_dirty = true;
}

Handling the PAUSE exit

On Intel CPUs, with KVM as the VMM, a guest vcpu that enters a busy-waiting (loop-wait) state triggers a VM exit under certain conditions.

Trigger condition: KVM does not enable the "PAUSE exiting" feature, so a single PAUSE instruction does not cause a VM exit. KVM only uses the "PAUSE-loop exiting" feature, i.e. only PAUSE instructions inside a loop (loop-wait) can cause a VM exit. Concretely: when the time between two successive PAUSEs in a loop does not exceed the PLE_gap constant, and the time between some PAUSE in that loop and the first PAUSE exceeds PLE_window, a VM exit is generated with the exit-reason field set to the PAUSE instruction.

In the KVM code, reaching handle_pause() means a PAUSE VM exit has already been triggered.

Rough structure of handle_pause():

Here, grow_ple_window() adjusts PLE_window for guests that have not disabled PLE.

/*
 * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
 * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
 */
static int handle_pause(struct kvm_vcpu *vcpu)
{
	if (!kvm_pause_in_guest(vcpu->kvm))  // 1. if this vm has not disabled PLE, grow PLE_window
		grow_ple_window(vcpu);

	/*
	 * Intel sdm vol3 ch-25.1.3 says: The "PAUSE-loop exiting"
	 * VM-execution control is ignored if CPL > 0. OTOH, KVM
	 * never set PAUSE_EXITING and just set PLE if supported,
	 * so the vcpu must be CPL=0 if it gets a PAUSE exit.
	 */
	kvm_vcpu_on_spin(vcpu, true); // 2. boost a previously preempted, now-runnable vcpu
				      //    and end the current vcpu's spinning

	/*
	 * 3. If the guest is being debugged, a single-step trap is raised, this
	 * returns 0, and vcpu_enter_guest() exits to userspace for handling;
	 * otherwise it returns 1 and no exit to userspace is needed.
	 */
	return kvm_skip_emulated_instruction(vcpu);
}


/*
 * Returns whether the current guest has disabled the PLE feature:
 * disabled: returns true
 * not disabled: returns false
 */
static inline bool kvm_pause_in_guest(struct kvm *kvm)
{
	return kvm->arch.pause_in_guest;
}

/* Grow PLE_window; with the default grow factor of 2, new_ple_window = old * 2 */
static void grow_ple_window(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	int old = vmx->ple_window;

	vmx->ple_window = __grow_ple_window(old, ple_window,
					    ple_window_grow,
					    ple_window_max);

	if (vmx->ple_window != old)
		vmx->ple_window_dirty = true;

	trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old);
}

kvm_pause_in_guest()

Analysis

It is not obvious at first what kvm_pause_in_guest() means, but handle_pause() shows that on every PAUSE VM exit its return value is checked, and PLE_window is grown whenever it returns false.

Analysis 1:

Under what condition does PLE_window need to grow?

Only when KVM decides that this guest exited too early: waiting just a little longer would have been enough to acquire the lock, so PLE_window should be enlarged.

Judging from the name alone, one might guess that kvm_pause_in_guest() returns whether the lock being waited on before the PAUSE VM exit is still held: true if still held, false if released.

Analysis 2:

The KVM development mailing list describes arch.pause_in_guest, the value returned by kvm_pause_in_guest(), as: "Allow to disable pause loop exit/pause filtering on a per VM basis. If some VMs have dedicated host CPUs, they won't be negatively affected due to needlessly intercepted PAUSE instructions." In other words, PLE (Intel) / PF (AMD) can be disabled for specific guests; guests with dedicated host CPUs are then spared the cost of needless PAUSE interception.

What does that mean? Suppose guestA has 2 vcpus pinned to the host's cpu0 and cpu1, while guestB has 2 vcpus with no host-cpu pinning.

When guestB's vcpu0 spin-loops waiting for a lock held by vcpu1, vcpu1 may have been scheduled away to do other work, so the lock cannot be released promptly; guestB's only option is to intercept the PAUSE and exit to the host.

When guestA's vcpu0 spin-loops waiting for a lock held by vcpu1, vcpu1 is dedicated to guestA and will not be scheduled away, so the average time until the lock is released is shorter than in guestB. Exiting to the host is therefore unnecessary; spin-waiting suffices.

Conclusion

In summary, kvm_pause_in_guest() returns whether this guest has disabled PLE: true if disabled, false otherwise.

Code supporting this conclusion:

// arch/x86/kvm/x86.c
int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
			    struct kvm_enable_cap *cap)
{
	...
	case KVM_CAP_X86_DISABLE_EXITS:
		...
		if (cap->args[0] & KVM_X86_DISABLE_EXITS_PAUSE)  // if pause exits are disabled for this vm,
			kvm->arch.pause_in_guest = true;         // this bool is set to true
		...
}

kvm_vcpu_on_spin()

A search of the mailing list turned up "KVM: introduce kvm_vcpu_on_spin": "Introduce kvm_vcpu_on_spin, to be used by VMX/SVM to yield processing once the cpu detects pause-based looping." This states the intent directly: once the cpu detects a pause loop, yield processing.

The function essentially hands the CPU time that the current vcpu would have wasted spinning over to another vcpu.

// virt/kvm/kvm_main.c
void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
{
	...
	// mark the vcpu that just took the PAUSE VM exit as in-spin-loop
	kvm_vcpu_set_in_spin_loop(me, true);

	// Boost the priority of vcpus that are not running but are runnable:
	// such a vcpu was preempted earlier and then went through the scheduler
	// in __vcpu_run(). We raise its priority hoping one of these vcpus
	// holds the lock we need, scanning from just after the last boosted
	// vcpu and trying to yield to each candidate in turn.
	for (pass = 0; pass < 2 && !yielded && try; pass++) {
		...
	}

	// mark the current vcpu as no longer in a spin-loop
	kvm_vcpu_set_in_spin_loop(me, false);

	// Make sure the current vcpu is not chosen as a yield target during the
	// next spin-loop: performance stays fair and high only if the vcpus
	// take turns running their spin-loops.
	kvm_vcpu_set_dy_eligible(me, false);
}

kvm_skip_emulated_instruction()

This function reads the current vcpu's RFLAGS register into the corresponding guest data structure and checks whether a single-step trap must be raised.

It advances the guest's RIP past one instruction.

// arch/x86/kvm/x86.c

int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
	unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
	int r = EMULATE_DONE;
	
	// advance rip and clear the last 2 bits of the guest interruptibility state,
	// i.e. the STI and MOV SS shadows, so interrupt delivery is not blocked
	kvm_x86_ops->skip_emulated_instruction(vcpu);

	/*
	 * rflags is the old, "raw" value of the flags.  The new value has
	 * not been saved yet.
	 *
	 * This is correct even for TF set by the guest, because "the
	 * processor will not generate this exception after the instruction
	 * that sets the TF flag".
	 */
	if (unlikely(rflags & X86_EFLAGS_TF)) // if the guest is being debugged, a single-step
		kvm_vcpu_do_singlestep(vcpu, &r); // trap is raised and r becomes EMULATE_USER_EXIT
	return r == EMULATE_DONE; // 0 means a single-step trap was raised: exit to userspace
}


static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
	// compute the new RIP (old RIP + length of the exiting instruction) and
	// write it back to the vcpu that just took the VM exit
	unsigned long rip;

	rip = kvm_rip_read(vcpu);
	rip += vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
	kvm_rip_write(vcpu, rip);

	/* skipping an emulated instruction also counts */
	vmx_set_interrupt_shadow(vcpu, 0);
}

/*
 * The intent of this function: if the vm's GUEST_INTERRUPTIBILITY_INFO changed
 * during the VM exit, write the change back to the VMCS. In
 * skip_emulated_instruction() above it is called as
 * vmx_set_interrupt_shadow(vcpu, 0); with mask 0 it simply ensures that the
 * last 2 bits of GUEST_INTERRUPTIBILITY_INFO, GUEST_INTR_STATE_STI and
 * GUEST_INTR_STATE_MOV_SS, are cleared; these bits record interrupt blocking
 * by STI and MOV SS.
 */
void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
{
	u32 interruptibility_old = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
	u32 interruptibility = interruptibility_old;

	interruptibility &= ~(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS);

	if (mask & KVM_X86_SHADOW_INT_MOV_SS)
		interruptibility |= GUEST_INTR_STATE_MOV_SS;
	else if (mask & KVM_X86_SHADOW_INT_STI)
		interruptibility |= GUEST_INTR_STATE_STI;

	if ((interruptibility != interruptibility_old))
		vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, interruptibility);
}

/* Fill in the single-step trap state: report KVM_EXIT_DEBUG to userspace, or queue a #DB for the guest */
static void kvm_vcpu_do_singlestep(struct kvm_vcpu *vcpu, int *r)
{
	struct kvm_run *kvm_run = vcpu->run;

	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
		kvm_run->debug.arch.dr6 = DR6_BS | DR6_FIXED_1 | DR6_RTM;
		kvm_run->debug.arch.pc = vcpu->arch.singlestep_rip;
		kvm_run->debug.arch.exception = DB_VECTOR;
		kvm_run->exit_reason = KVM_EXIT_DEBUG;
		*r = EMULATE_USER_EXIT;
	} else {
		kvm_queue_exception_p(vcpu, DB_VECTOR, DR6_BS);
	}
}
