Go runtime Scheduler Deep Dive (9): Preemption Caused by System Calls

Posted by 胡云Troy on 2024-09-16

Original URL: https://www.cnblogs.com/xingzheanan/p/18416149

Original article. Reposting is welcome; please credit the source. Thank you.

0. Preface

Lecture 8 covered how a goroutine that runs for too long gets preempted. This lecture looks at preemption of a goroutine whose system call runs for too long.

1. Preempting Long System Calls

Consider the following example:

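//go:build linux

package main

import (
	"fmt"
	"runtime"
	"syscall"
	"time"
)

// longSyscall blocks in a select(2) system call for 5s. With nfd = 0 and
// an empty fd set, select simply waits for the timeout.
// Linux-only: syscall.Select returns (n int, err error) on Linux.
func longSyscall() {
	timeout := syscall.NsecToTimeval(int64(5 * time.Second))
	var fds syscall.FdSet

	if _, err := syscall.Select(0, &fds, nil, nil, &timeout); err != nil {
		fmt.Println("Error:", err)
	}

	fmt.Println("Select returned after timeout")
}

func main() {
	// Start one long system call per P, so every P soon runs a goroutine
	// that is stuck in a system call.
	threads := runtime.GOMAXPROCS(0)
	for i := 0; i < threads; i++ {
		go longSyscall()
	}

	// Keep main alive until the system calls have returned.
	time.Sleep(8 * time.Second)
}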

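Each longSyscall goroutine makes a system call that lasts 5s. While the call is in progress, sysmon monitors longSyscall, notices that the system call is taking too long, and preempts it.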

Back in the sysmon thread, let's see how it preempts a goroutine whose system call is taking too long.

func sysmon() {
	...
	idle := 0 // how many cycles in succession we had not wokeup somebody
	delay := uint32(0)
	...

	for {
		if idle == 0 { // start with 20us sleep...
			delay = 20
		} else if idle > 50 { // start doubling the sleep after 1ms...
			delay *= 2
		}
		if delay > 10*1000 { // up to 10ms
			delay = 10 * 1000
		}
		usleep(delay)

		...
		// retake P's blocked in syscalls
		// and preempt long running G's
		if retake(now) != 0 {
			idle = 0
		} else {
			idle++
		}
		...
	}
}

As with goroutines that run too long, the preemption is done by calling retake:

func retake(now int64) uint32 {
	n := 0
	lock(&allpLock)
	for i := 0; i < len(allp); i++ {
		pp := allp[i]
		if pp == nil {
			continue
		}
		pd := &pp.sysmontick
		s := pp.status
		sysretake := false

		if s == _Prunning || s == _Psyscall { // a P in _Prunning or _Psyscall is a preemption candidate
			// Preempt G if it's running for too long.
			t := int64(pp.schedtick)
			if int64(pd.schedtick) != t {
				pd.schedtick = uint32(t)
				pd.schedwhen = now
			} else if pd.schedwhen+forcePreemptNS <= now {
				// Both _Prunning and _Psyscall Ps that have gone too long reach preemptone.
				// preemptone, covered in the lecture on long-running goroutines, marks the
				// goroutine for preemption. For a P in a system call it has no effect: the P
				// has no M attached (pp.m is cleared on syscall entry), so it returns early.
				preemptone(pp)
				// In case of syscall, preemptone() doesn't
				// work, because there is no M wired to P.
				sysretake = true
			}
		}

		if s == _Psyscall {
			// Retake P from syscall if it's there for more than 1 sysmon tick (at least 20us).
			// The P is in a system call; decide whether to retake it.
			// syscalltick counts system calls on this P and is incremented when one completes.
			t := int64(pp.syscalltick)
			if !sysretake && int64(pd.syscalltick) != t {
				// pd.syscalltick != pp.syscalltick: this is not the system call sysmon saw
				// last time but a new one, so record the new tick and when it was first seen.
				pd.syscalltick = uint32(t)
				pd.syscallwhen = now
				continue
			}

			// On the one hand we don't want to retake Ps if there is no other work to do,
			// but on the other hand we want to retake them eventually
			// because they can prevent the sysmon thread from deep sleep.
			// Retake if any one of the following holds:
			// 1. the P's local run queue has runnable goroutines;
			// 2. there is no spinning M and no idle P (everyone is busy, so a system
			//    call should not hold on to a P for long);
			// 3. the system call has lasted at least 10ms since sysmon first saw it.
			if runqempty(pp) && sched.nmspinning.Load()+sched.npidle.Load() > 0 && pd.syscallwhen+10*1000*1000 > now {
				continue
			}

			// The retake itself:
			unlock(&allpLock)
			// Need to decrement number of idle locked M's
			// (pretending that one more is running) before the CAS.
			// Otherwise the M from which we retake can exit the syscall,
			// increment nmidle and report deadlock.
			incidlelocked(-1)
			if atomic.Cas(&pp.status, s, _Pidle) { // switch the P to _Pidle
				n++              // one more P retaken
				pp.syscalltick++ // bump the P's syscall tick (the tick compared above, not a preemption counter)
				handoffp(pp)     // hand the P to another M, or park it as idle
			}
			incidlelocked(1)
			lock(&allpLock)
		}
	}
	unlock(&allpLock)
	return uint32(n)
}
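The _Psyscall branch of retake boils down to a small per-P decision. The following is a simplified model, not runtime code: sysmonView, schedLoad and shouldRetake are hypothetical names, the scheduler counters are passed in as plain values, and the sysretake shortcut from the first branch is ignored. It shows why a P must be seen in the same system call on two sysmon rounds before it can be retaken, and how each of the three conditions triggers the retake.

```go
package main

import "fmt"

// syscallRetakeNS mirrors the literal 10*1000*1000 (10ms) in retake.
// The constant name is hypothetical; the runtime uses the literal.
const syscallRetakeNS int64 = 10 * 1000 * 1000

// sysmonView is a hypothetical stand-in for the pp.sysmontick fields
// used by the syscall branch.
type sysmonView struct {
	syscalltick uint32 // last pp.syscalltick sysmon saw
	syscallwhen int64  // when sysmon first saw that tick (ns)
}

// schedLoad is a hypothetical snapshot of the state retake reads.
type schedLoad struct {
	runqEmpty  bool  // runqempty(pp)
	nmspinning int32 // sched.nmspinning.Load()
	npidle     int32 // sched.npidle.Load()
}

// shouldRetake models one pass of retake's _Psyscall branch for one P.
// It returns true when retake would CAS the P to _Pidle and call handoffp.
func shouldRetake(pd *sysmonView, pSyscalltick uint32, load schedLoad, now int64) bool {
	if pd.syscalltick != pSyscalltick {
		// A system call sysmon has not seen before: only record it.
		pd.syscalltick = pSyscalltick
		pd.syscallwhen = now
		return false
	}
	// Keep the P only if all three hold: no local work, some spinning M
	// or idle P exists, and the call has lasted less than 10ms.
	if load.runqEmpty && load.nmspinning+load.npidle > 0 && pd.syscallwhen+syscallRetakeNS > now {
		return false
	}
	return true
}

func main() {
	pd := &sysmonView{}
	quiet := schedLoad{runqEmpty: true, npidle: 1}
	fmt.Println(shouldRetake(pd, 7, quiet, 0))          // false: first sighting
	fmt.Println(shouldRetake(pd, 7, quiet, 5_000_000))  // false: 5ms on a quiet system
	fmt.Println(shouldRetake(pd, 7, quiet, 10_000_000)) // true: 10ms reached
}
```

Note that the 10ms clock starts when sysmon first observes the system call, not when the call actually began.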

Now into handoffp:

// Hands off P from syscall or locked M.
// Always runs without a P, so write barriers are not allowed.
//
//go:nowritebarrierrec
func handoffp(pp *p) {
	// if it has local work, start it straight away
	// If the P has work (goroutines) in its local run queue, or the global run queue
	// has work, start an M for this P: an M other than the one making the system call.
	// The thread in the system call no longer needs the P, and Ps are far more limited
	// than threads, so releasing the P puts it to better use.
	if !runqempty(pp) || sched.runqsize != 0 {
		startm(pp, false, false)
		return
	}

	...
	// no local work, check that there are no spinning/idle M's,
	// otherwise our help is not required
	if sched.nmspinning.Load()+sched.npidle.Load() == 0 && sched.nmspinning.CompareAndSwap(0, 1) { // TODO: fast atomic
		sched.needspinning.Store(0)
		startm(pp, true, false)
		return
	}

	...
	// Check the global run queue again, now while holding sched.lock
	if sched.runqsize != 0 {
		unlock(&sched.lock)
		startm(pp, false, false)
		return
	}

	...
	// No work anywhere: put the P on the global idle P list
	pidleput(pp, 0)
	unlock(&sched.lock)
}
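The branches of handoffp shown above form a fixed priority order. Here is a simplified model under the assumption that the counters form one consistent snapshot; handoffAction and handoffDecision are hypothetical names, and the GC, tracing, safe-point and netpoll branches hidden behind "..." (as well as the CompareAndSwap on nmspinning) are left out.

```go
package main

import "fmt"

// handoffAction names what the simplified handoffp decides to do.
// The values are labels for this sketch, not runtime identifiers.
type handoffAction string

const (
	startMForWork  handoffAction = "startm(pp, false, false)" // run queued work on this P
	startSpinningM handoffAction = "startm(pp, true, false)"  // spin, looking for work
	parkP          handoffAction = "pidleput(pp, 0)"          // nothing to do: idle the P
)

// handoffDecision models only the quoted branches of handoffp, in order.
func handoffDecision(localRunqEmpty bool, globalRunqSize int32, nmspinning, npidle int32) handoffAction {
	// Local or global work: start an M to run it on this P right away.
	if !localRunqEmpty || globalRunqSize != 0 {
		return startMForWork
	}
	// No work, but no M is spinning and no P is idle: start a spinning M
	// so goroutines that become runnable are picked up promptly.
	if nmspinning+npidle == 0 {
		return startSpinningM
	}
	// The runtime re-checks sched.runqsize under sched.lock because it may
	// have changed since the unlocked check; a fixed snapshot cannot change,
	// so the model goes straight to parking the P.
	return parkP
}

func main() {
	fmt.Println(handoffDecision(false, 0, 1, 1)) // startm(pp, false, false)
	fmt.Println(handoffDecision(true, 0, 0, 0))  // startm(pp, true, false)
	fmt.Println(handoffDecision(true, 0, 1, 2))  // pidleput(pp, 0)
}
```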

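So "preempting" a goroutine whose system call runs too long means releasing the P bound to the thread making the call. The thread is not stopped from doing its system call; only the P is freed so that it can run other work. (The preemptone call in retake's first branch does not mark this goroutine either: a P in _Psyscall has no M wired to it, so preemptone returns without setting the goroutine's stackguard0, as the runtime comment in retake points out.)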

A diagram makes this process easier to follow:

retake counts every P it takes back from a system call in n (the n++ just before handoffp) and returns that count to sysmon:

func sysmon() {
	...
	idle := 0 // how many cycles in succession we had not wokeup somebody
	delay := uint32(0)
	for {
		if idle == 0 { // start with 20us sleep...
			delay = 20 // idle == 0: sysmon should stay alert and check every 20us
		} else if idle > 50 { // start doubling the sleep after 1ms...
			delay *= 2 // over 50 rounds (50 x 20us = 1ms) without a retake: little to do, so sleep twice as long instead of wasting resources
		}
		if delay > 10*1000 { // up to 10ms
			delay = 10 * 1000 // but it cannot sleep indefinitely: the sleep is capped at 10ms
		}
		usleep(delay)
		...
		if retake(now) != 0 {
			idle = 0 // a P was retaken: reset idle so sysmon stays busy
		} else {
			idle++ // nothing retaken: idle + 1
		}
		...
	}
	...
}

2. Summary

This lecture covered preemption caused by system calls that run too long. The next lecture covers asynchronous preemption.
