A Detailed Look at the Source Code of the Go Scheduling Loop

Posted by luozhiyun on 2021-02-21

Original URL: https://www.cnblogs.com/luozhiyun/p/14426737.html

Please credit the source when reposting. This article was published on luozhiyun's blog: https://www.luozhiyun.com/archives/448

This article is based on the Go 1.15.7 source code.

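Overview

When we hear "scheduling", the first thing that comes to mind is how the operating system schedules processes and threads. The OS scheduler assigns the threads in the system to physical CPUs according to some algorithm. Although threads are lighter than processes, scheduling them still carries considerable overhead: each thread occupies 1 MB or more of memory, and switching threads means saving and restoring register contents, which also has to go through the kernel.

A goroutine in Go can be seen as a layer of abstraction over threads. It is more lightweight: it reduces the overhead of context switching and also uses fewer resources. For example, a newly created goroutine starts with a 2 KB stack, whereas a thread takes 1 MB or more. Threads are created and destroyed by the kernel, which is expensive, whereas goroutines are managed by the Go runtime, so creating and destroying them is very cheap. Switching between goroutines also costs far less than switching between threads.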
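To get a feel for how cheap goroutines are, here is a minimal sketch that starts a large number of them and waits for all of them to finish. The helper name spawn and the count of 100000 are made up for illustration; actual memory use and timing depend on the machine and the Go version.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawn starts n goroutines, each of which increments a shared counter once,
// waits for all of them to finish, and returns the final counter value.
func spawn(n int) int64 {
	var wg sync.WaitGroup
	var done int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&done, 1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&done)
}

func main() {
	// Starting this many OS threads would be impractical on most machines.
	fmt.Println("goroutines finished:", spawn(100000))
}
```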

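The G M P Model

The Go scheduler uses three structs to schedule goroutines: G, M and P.

G: represents a goroutine. Each goroutine has its own stack, which holds its current execution memory and state. You can think of a G as a task. When a goroutine is switched off the CPU, the scheduler saves the CPU register values into fields of the G object; when the goroutine is scheduled to run again, the scheduler restores those saved values from the G object back into the CPU registers.

M: represents a kernel thread. Each M is bound to one kernel thread, and every worker thread has exactly one M instance. Besides recording state such as the bounds of the worker thread's stack, the goroutine it is currently running and whether it is idle, the M struct also holds a pointer that binds it to a P instance.

P: represents a virtual processor. It maintains a local queue of runnable Gs. A worker thread uses its own local run queue first and only touches the global run queue when necessary, which greatly reduces lock contention and improves the concurrency of the worker threads. Before a G can actually run, it must be assigned to a P.

Besides these three structs there is also schedt, which holds the global queue of runnable goroutines. Each Go program has exactly one schedt instance. It is a shared global variable in the code, so every worker thread can access it and the goroutine run queue it owns.

The relationship between G, P, M and the global queue in schedt is shown below: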

[Figure: GMP]

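As the figure shows, each M is bound to a P, and each P has its own private local goroutine queue. The thread behind an M takes goroutines from the local and global queues and runs them. The green G is the one that is currently running.

By default the runtime sets GOMAXPROCS to the number of cores on the machine. On a four-core machine, for example, four active OS threads are created, each corresponding to an M in the runtime.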

[Figure: M_bind_CPU]
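The current setting can be inspected from a program. This small sketch uses only the standard runtime package; currentProcs is a hypothetical helper name. The two numbers it prints can differ when GOMAXPROCS is set through the environment or changed at run time.

```go
package main

import (
	"fmt"
	"runtime"
)

// currentProcs reports the current GOMAXPROCS setting without changing it:
// runtime.GOMAXPROCS only updates the value when its argument is at least 1.
func currentProcs() int {
	return runtime.GOMAXPROCS(0)
}

func main() {
	fmt.Println("logical CPUs:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:  ", currentProcs())
}
```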

In Detail

Structs

The G, M and P structs are defined in src/runtime/runtime2.go.

G

type g struct {
	// Stack memory range of this goroutine: [stack.lo, stack.hi)
	stack       stack
	// Used by the scheduler for preemptive scheduling
	stackguard0 uintptr

	_panic       *_panic
	_defer       *_defer
	// The thread (M) this goroutine is currently running on
	m            *m
	// Scheduling context saved for this goroutine
	sched        gobuf
	// Status of the goroutine
	atomicstatus uint32
	// Preemption signal; duplicates stackguard0 = stackpreempt
	preempt       bool
	// Transition to _Gpreempted on preemption; otherwise, just deschedule
	preemptStop   bool
	// Shrink the stack at a synchronous safe point
	preemptShrink bool
	...
}

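Next, the gobuf struct, which is mainly used when the scheduler saves or restores context:

type gobuf struct {
	// Stack pointer
	sp   uintptr
	// Program counter
	pc   uintptr
	// The goroutine this gobuf belongs to
	g    guintptr
	// Return value of a system call
	ret  sys.Uintreg
	...
}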

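While running, a G can be in one of the following states:

const (
	// Just allocated and not yet initialized
	_Gidle = iota // 0
	// Not executing code, does not own its stack, sits on a run queue
	_Grunnable // 1
	// May execute code, owns its stack, has been assigned an M and a P
	_Grunning // 2
	// Executing a system call, owns its stack, not executing user code;
	// assigned an M but not on a run queue
	_Gsyscall // 3
	// Blocked in the runtime, not executing user code and not on a run queue,
	// but possibly on a channel wait queue
	_Gwaiting // 4
	// (the unused values 5 and 7 are omitted from this excerpt)
	// Currently unused, not executing code, may have a stack allocated
	_Gdead // 6
	// Stack is being copied, not executing code, not on a run queue
	_Gcopystack // 8
	// Stopped for preemption, not executing user code and not on a run queue; waiting to be woken up
	_Gpreempted // 9
	// GC is scanning the stack, not executing code; combined with one of the other states
	_Gscan          = 0x1000
	...
)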

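That looks like a lot of states, but in practice you only need to pay attention to the following:

Waiting: _Gwaiting, _Gsyscall and _Gpreempted, which mean the G is not executing user code;
Runnable: _Grunnable, which means the G is ready and can be run on a thread;
Running: _Grunning, which means the G is running.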
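The grouping above can be written down as a small helper. This is only a sketch: the constant names (gIdle, gRunnable and so on) and the classify function are invented here, the numeric values are the ones annotated in the excerpt, and the scan bit is masked off because it can be combined with another state.

```go
package main

import "fmt"

// Status values as annotated in the excerpt. They are spelled out
// explicitly because the excerpt leaves out the unused slots 5 and 7.
const (
	gIdle      uint32 = 0
	gRunnable  uint32 = 1
	gRunning   uint32 = 2
	gSyscall   uint32 = 3
	gWaiting   uint32 = 4
	gDead      uint32 = 6
	gCopystack uint32 = 8
	gPreempted uint32 = 9
	gScan      uint32 = 0x1000
)

// classify maps a status word to the coarse groups used in this article.
// The scan bit is cleared first because it can be combined with other states.
func classify(status uint32) string {
	switch status &^ gScan {
	case gWaiting, gSyscall, gPreempted:
		return "waiting"
	case gRunnable:
		return "runnable"
	case gRunning:
		return "running"
	default:
		return "other"
	}
}

func main() {
	fmt.Println(classify(gRunnable))
	fmt.Println(classify(gScan | gWaiting))
	fmt.Println(classify(gDead))
}
```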

M

In Go's concurrency model, M is an OS thread. At most GOMAXPROCS threads can be actively executing Go code at any one time.

type m struct {
	// Goroutine that owns the scheduling stack
	g0      *g
	// G that handles signals
	gsignal       *g
	// Thread-local storage (for x86 extern register)
	tls           [6]uintptr
	// Currently running goroutine
	curg          *g
	caughtsig     guintptr // goroutine running during fatal signal
	// P attached for executing Go code (nil if not executing Go code)
	p             puintptr
	nextp         puintptr
	// The P that was attached before executing a syscall
	oldp          puintptr
	...
}

P

P, the processor in the scheduler, is the layer between the thread M and G. It is used to schedule Gs onto Ms.

type p struct {
	id          int32
	// Status of the P
	status      uint32
	// Incremented on every scheduler call
	schedtick   uint32
	// Incremented on every system call
	syscalltick uint32
	// The M this P is associated with
	m           muintptr
	mcache      *mcache
	pcache      pageCache
	// Pool of defer structs
	deferpool    [5][]*_defer
	deferpoolbuf [5][32]*_defer
	// Queue of runnable goroutines; accessed without a lock
	runqhead uint32
	runqtail uint32
	runq     [256]guintptr
	// A G that is ready to run next
	runnext guintptr
	// List of available Gs (status == Gdead)
	gFree struct {
		gList
		n int32
	}
	...
}

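The states of a P are:

const (
	// The P is not running user code or the scheduler
	_Pidle = iota
	// Owned by an M and running user code or the scheduler
	_Prunning
	// Not running user code; the current thread is in a system call
	_Psyscall
	// Owned by an M; the processor is halted for a garbage collection STW
	_Pgcstop
	// The processor is no longer used
	_Pdead
)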

sched

As mentioned above, sched mainly holds the global resources owned by the scheduler, such as the list of idle Ps and the global queue of Gs.

type schedt struct {
	...
	lock mutex
	// List of idle Ms
	midle        muintptr
	// Number of idle Ms
	nmidle       int32
	// Id of the next M to be created
	mnext        int64
	// Maximum number of Ms allowed
	maxmcount    int32
	// List of idle Ps
	pidle      puintptr
	// Number of idle Ps
	npidle     uint32
	// Number of spinning Ms
	nmspinning uint32
	// Global queue of runnable Gs
	runq     gQueue
	runqsize int32
	// Global cache of dead Gs
	gFree struct {
		lock    mutex
		stack   gList // Gs with stacks
		noStack gList // Gs without stacks
		n       int32
	}
	// Central cache of sudog structs
	sudoglock  mutex
	sudogcache *sudog
	// Pool of defer structs
	deferlock mutex
	deferpool [5]*_defer
	...
}

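Starting from Program Startup

Here we again use dlv for debugging. How to set breakpoints on assembly with dlv is covered in detail in an earlier post, "A Detailed Look at the Source Code of Memory Allocation in Go": https://www.luozhiyun.com/archives/434. Have a look if you are interested. One caveat: the example below was run on Linux.

First, let's write a very simple example: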

package main

import "fmt"

func main() {
	fmt.Println("hello world")
}

Then build it and open it in dlv:

go build main.go
dlv exec ./main

Once the program is open, enter the following commands in order:

(dlv) r
Process restarted with PID 33191
(dlv) list
> _rt0_amd64_linux() /usr/local/go/src/runtime/rt0_linux_amd64.s:8 (PC: 0x4648c0)
Warning: debugging optimized function
Warning: listing may not match stale executable
     3: // license that can be found in the LICENSE file.
     4:
     5: #include "textflag.h"
     6:
     7: TEXT _rt0_amd64_linux(SB),NOSPLIT,$-8
=>   8:         JMP     _rt0_amd64(SB)
     9:
    10: TEXT _rt0_amd64_linux_lib(SB),NOSPLIT,$0
    11:         JMP     _rt0_amd64_lib(SB) 
(dlv) si
> _rt0_amd64() /usr/local/go/src/runtime/asm_amd64.s:15 (PC: 0x4613e0)
Warning: debugging optimized function
Warning: listing may not match stale executable
    10: // _rt0_amd64 is common startup code for most amd64 systems when using
    11: // internal linking. This is the entry point for the program from the
    12: // kernel for an ordinary -buildmode=exe program. The stack holds the
    13: // number of arguments and the C-style argv.
    14: TEXT _rt0_amd64(SB),NOSPLIT,$-8
=>  15:         MOVQ    0(SP), DI       // argc
    16:         LEAQ    8(SP), SI       // argv
    17:         JMP     runtime·rt0_go(SB)
    18:
    19: // main is common startup code for most amd64 systems when using
    20: // external linking. The C startup code will call the symbol "main"
(dlv)

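From the breakpoints above we can see that on linux amd64, startup ends up in the runtime·rt0_go function in asm_amd64.s. Other platforms have different entry points; interested readers can look into those themselves.

Now let's look at runtime·rt0_go:

TEXT runtime·rt0_go(SB),NOSPLIT,$0
	...
	// Store the command-line arguments (argc, argv)
	CALL	runtime·args(SB)
	// Initialize the CPU count and the memory page size
	CALL	runtime·osinit(SB)
	// Initialize the scheduler
	CALL	runtime·schedinit(SB)
	// Create a new goroutine to start the program
	MOVQ	$runtime·mainPC(SB), AX		// entry
	// Create the new goroutine, which is bound to runtime.main
	CALL	runtime·newproc(SB)
	// Start the M and begin scheduling goroutines
	CALL	runtime·mstart(SB)
	...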

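Among the CALLs above:

schedinit initializes the various runtime components, including the scheduler, the memory allocator and the garbage collector;

newproc creates, from the entry address of the main G, an execution unit that the runtime can schedule;

mstart starts the scheduler's scheduling loop.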

Scheduler Initialization: runtime.schedinit

func schedinit() {
	...
	_g_ := getg()
	...
	// Maximum number of threads: 10000
	sched.maxmcount = 10000
	// Initialize M0
	mcommoninit(_g_.m, -1)
	...
	// Initialize the garbage collector
	gcinit()

	sched.lastpoll = uint64(nanotime())
	// Determine the number of Ps from the CPU count and the GOMAXPROCS environment variable
	procs := ncpu
	if n, ok := atoi32(gogetenv("GOMAXPROCS")); ok && n > 0 {
		procs = n
	}
	// Initialize the Ps
	if procresize(procs) != nil {
		throw("unknown runnable goroutine during bootstrap")
	}
	...
}

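schedinit sets maxmcount to 10000, which is the maximum number of threads a Go program can create. It then calls mcommoninit to initialize M0. After determining the number of Ps from the CPU core count and the GOMAXPROCS environment variable, it calls procresize to initialize the Ps.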
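The way the P count is chosen can be sketched outside the runtime. The function resolveProcs below is hypothetical, and it uses strconv.Atoi as a stand-in for the runtime's internal atoi32, so edge cases in parsing may differ from the real code.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveProcs follows the decision made in schedinit: start from the CPU
// count and let a valid, positive GOMAXPROCS value override it.
// ncpu and env are parameters so that the function is easy to try out.
func resolveProcs(ncpu int, env string) int {
	procs := ncpu
	if n, err := strconv.Atoi(env); err == nil && n > 0 {
		procs = n
	}
	return procs
}

func main() {
	fmt.Println(resolveProcs(4, ""))    // unset: falls back to the CPU count
	fmt.Println(resolveProcs(4, "8"))   // valid override
	fmt.Println(resolveProcs(4, "0"))   // non-positive values are ignored
	fmt.Println(resolveProcs(4, "abc")) // unparsable values are ignored
}
```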

M0 Initialization

func mcommoninit(mp *m, id int64) {
	_g_ := getg()
	...
	lock(&sched.lock)
	// If the id passed in is negative, get one from mReserveID; the first id it returns is 0
	if id >= 0 {
		mp.id = id
	} else {
		mp.id = mReserveID()
	}
	// Seed the random number generator used for work stealing
	mp.fastrand[0] = uint32(int64Hash(uint64(mp.id), fastrandseed))
	mp.fastrand[1] = uint32(int64Hash(uint64(cputicks()), ^fastrandseed))
	if mp.fastrand[0]|mp.fastrand[1] == 0 {
		mp.fastrand[1] = 1
	}
	// Create gsignal for signal handling: simply allocate a g on the heap, set up its stack and return
	mpreinit(mp)
	if mp.gsignal != nil {
		mp.gsignal.stackguard1 = mp.gsignal.stack.lo + _StackGuard
	}

	// Add the M to the global list allm
	mp.alllink = allm
	...
}

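The id passed in here is -1, so on this first call the id is set to 0. Nothing scheduling-related is initialized for m0 here, so you can simply think of this function as adding m0 to the global allm list and returning.

P Initialization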

runtime.procresize

var allp       []*p 

func procresize(nprocs int32) *p {
	// Previous number of Ps
	old := gomaxprocs
	// Update statistics
	now := nanotime()
	if sched.procresizetime != 0 {
		sched.totaltime += int64(old) * (now - sched.procresizetime)
	}
	sched.procresizetime = now
	// Grow allp if needed; the number of Ps can be changed by the user through runtime.GOMAXPROCS
	if nprocs > int32(len(allp)) { 
		lock(&allpLock)
		if nprocs <= int32(cap(allp)) {
			allp = allp[:nprocs]
		} else {
			nallp := make([]*p, nprocs) 
			copy(nallp, allp[:cap(allp)])
			allp = nallp
		}
		unlock(&allpLock)
	}
 
	// Initialize the new Ps
	for i := old; i < nprocs; i++ {
		pp := allp[i]
		// If the slot is empty, allocate a new P object
		if pp == nil {
			pp = new(p)
		}
		pp.init(i)
		atomicstorep(unsafe.Pointer(&allp[i]), unsafe.Pointer(pp))
	}

	_g_ := getg()
	// If the current P is set and its id is below nprocs, keep using it
	if _g_.m.p != 0 && _g_.m.p.ptr().id < nprocs {
		// continue to use the current P
		_g_.m.p.ptr().status = _Prunning
		_g_.m.p.ptr().mcache.prepareForSweep()
	} else { 
		// Release the current P because it is no longer valid
		if _g_.m.p != 0 { 
			_g_.m.p.ptr().m = 0
		}
		_g_.m.p = 0
		p := allp[0]
		p.m = 0
		p.status = _Pidle
		// Bind P0 to the current M0
		acquirep(p) 
	}
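	// Release resources from the Ps that are no longer used
	for i := nprocs; i < old; i++ {
		p := allp[i]
		p.destroy()
		// The p itself cannot be freed because an m in a system call may still reference it
	}
	// After releasing the Ps, reset the length of allp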
	if int32(len(allp)) != nprocs {
		lock(&allpLock)
		allp = allp[:nprocs]
		unlock(&allpLock)
	}
	var runnablePs *p
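	// Put Ps that have no local work on the idle list
	for i := nprocs - 1; i >= 0; i-- {
		p := allp[i]
		// Skip the P currently in use
		if _g_.m.p.ptr() == p {
			continue
		}
		// Set the status to _Pidle
		p.status = _Pidle
		// Is the P's run queue empty?
		if runqempty(p) {
			// Put it on the idle list
			pidleput(p)
		} else {
			// Take an idle M and bind it to the P
			p.m.set(mget())
			// Link the P into the list of Ps with local work, which is returned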
			p.link.set(runnablePs)
			runnablePs = p
		}
	}
	stealOrder.reset(uint32(nprocs))
	var int32p *int32 = &gomaxprocs // make compiler check that gomaxprocs is an int32
	atomic.Store((*uint32)(unsafe.Pointer(int32p)), uint32(nprocs))
	return runnablePs
}

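procresize proceeds as follows:

allp is the global pool of Ps. If the allp slice holds fewer processors than desired, the slice is grown;
New Ps are allocated with new and initialized with init. Note that the id of an initialized P is the value of i passed in, and its status is _Pgcstop;
The P bound to the current M (M0 during startup) is then read through _g_.m.p. If M0 is already bound to a valid P, that P's status is set to _Prunning. Otherwise allp[0] is taken as P0 and bound to M0 by calling runtime.acquirep;
Ps beyond the desired count release their resources through p.destroy, which frees the resources associated with the P and sets its status to _Pdead;
The global allp slice is truncated so that its length equals the desired number of processors;
allp is traversed, every P other than the current one is set to _Pidle, and those with an empty run queue are put on the idle list.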
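The grow, reuse and truncate pattern in these steps can be tried out with an ordinary slice. In this sketch, proc and resize are hypothetical stand-ins for the runtime's p struct and procresize; locking, M binding and the idle list are left out.

```go
package main

import "fmt"

// proc is a hypothetical stand-in for the runtime's p struct.
type proc struct {
	id   int
	dead bool
}

// resize grows or shrinks the pool to n entries, reusing existing entries
// and spare capacity where possible, similar in spirit to procresize.
func resize(pool []*proc, n int) []*proc {
	old := len(pool)
	if n > old {
		if n <= cap(pool) {
			pool = pool[:n] // reuse capacity; earlier entries may reappear
		} else {
			grown := make([]*proc, n)
			copy(grown, pool[:cap(pool)])
			pool = grown
		}
	}
	// Initialize the entries that were added.
	for i := old; i < n; i++ {
		if pool[i] == nil {
			pool[i] = &proc{}
		}
		pool[i].id = i
		pool[i].dead = false
	}
	// Mark surplus entries as dead but keep the objects around.
	for i := n; i < old; i++ {
		pool[i].dead = true
	}
	return pool[:n]
}

func main() {
	pool := resize(nil, 4)
	pool = resize(pool, 2)
	pool = resize(pool, 4)
	fmt.Println(len(pool), pool[3].id, pool[3].dead)
}
```

Keeping the surplus objects instead of freeing them mirrors the comment in procresize that a P cannot be freed because an M in a system call may still reference it.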

P.init

func (pp *p) init(id int32) {
	// Set the id
	pp.id = id
	// Set the status to _Pgcstop
	pp.status = _Pgcstop
	// sudog-related
	pp.sudogcache = pp.sudogbuf[:0]
	for i := range pp.deferpool {
		pp.deferpool[i] = pp.deferpoolbuf[i][:0]
	}
	pp.wbBuf.reset()
	// Initialize mcache
	if pp.mcache == nil {
		if id == 0 {
			if mcache0 == nil {
				throw("missing mcache?")
			} 
			pp.mcache = mcache0
		} else {
			pp.mcache = allocmcache()
		}
	}
	...
	lockInit(&pp.timersLock, lockRankTimers)
}

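This initializes some of P's fields, such as id, status, sudogcache, mcache and the lock-related fields.

The sudogcache field holds a collection of sudogs, which are related to channels. See "An Illustrated Look at the Channel Source Code in Go": https://www.luozhiyun.com/archives/427.

Each P stores its own mcache, which allows tiny and small objects to be allocated quickly. For details see "A Detailed Look at the Source Code of Memory Allocation in Go": https://www.luozhiyun.com/archives/434.

Now let's see how runtime.acquirep binds a P to an M: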

runtime.acquirep

func acquirep(_p_ *p) { 
	wirep(_p_)
	...
}

func wirep(_p_ *p) {
	_g_ := getg()

	...
	// Bind the P and the M to each other
	_g_.m.p.set(_p_)
	_p_.m.set(_g_.m)
	// Set the P's status to _Prunning
	_p_.status = _Prunning
}

This function is simple enough that it needs no explanation. Next, runtime.pidleput, which puts a P on the idle list:

func pidleput(_p_ *p) {
	// A P whose run queue is not empty must not be put on the idle list
	if !runqempty(_p_) {
		throw("pidleput: P has non-empty run queue")
	}
	// Link the P into the pidle list
	_p_.link = sched.pidle
	sched.pidle.set(_p_)
	atomic.Xadd(&sched.npidle, 1) // TODO: fast atomic
}

G Initialization

From the assembly we know that runtime.newproc runs right after runtime·schedinit. It is the entry point for creating a G.

runtime.newproc

func newproc(siz int32, fn *funcval) {
	argp := add(unsafe.Pointer(&fn), sys.PtrSize)
	// Get the current G
	gp := getg()
	// Get the caller's program counter (PC)
	pc := getcallerpc() 
	systemstack(func() {
		// Obtain a new G struct
		newg := newproc1(fn, argp, siz, gp, pc)
		_p_ := getg().m.p.ptr()
		// Put the G on the P's run queue
		runqput(_p_, newg, true)
		// mainStarted is true once the main M has started
		if mainStarted {
			// Wake up another P to run Gs
			wakep()
		}
	})
}

runtime.newproc gets the current G and the caller's program counter, calls newproc1 to obtain a new G struct, and then places that G in the P's runnext field.

runtime.newproc1

func newproc1(fn *funcval, argp unsafe.Pointer, narg int32, callergp *g, callerpc uintptr) *g {
	_g_ := getg()

	if fn == nil {
		_g_.m.throwing = -1 // do not dump full stacks
		throw("go of nil func value")
	}
	// Pin the current M so that it cannot be preempted
	acquirem() // disable preemption because it can be holding p in a local var
	siz := narg
	siz = (siz + 7) &^ 7 

	_p_ := _g_.m.p.ptr()
	// Look for a free G in the P's free list gFree
	newg := gfget(_p_)
	if newg == nil {
		// Create a G with a 2 KB stack
		newg = malg(_StackMin)
		// CAS the G's status to _Gdead
		casgstatus(newg, _Gidle, _Gdead)
		// Add the G to the global allgs list
		allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
	}
	...
	// Compute the size of the initial stack frame
	totalSize := 4*sys.RegSize + uintptr(siz) + sys.MinFrameSize // extra space in case of reads slightly beyond frame
	totalSize += -totalSize & (sys.SpAlign - 1)                  // align to spAlign
	sp := newg.stack.hi - totalSize
	spArg := sp
	...
	if narg > 0 {
		// Copy narg bytes starting at argp to spArg (copy the arguments)
		memmove(unsafe.Pointer(spArg), argp, uintptr(narg))
		...
	}
	// Clear and initialize the G's scheduling context
	memclrNoHeapPointers(unsafe.Pointer(&newg.sched), unsafe.Sizeof(newg.sched))
	newg.sched.sp = sp
	newg.stktopsp = sp
	newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
	newg.sched.g = guintptr(unsafe.Pointer(newg))
	gostartcallfn(&newg.sched, fn)
	newg.gopc = callerpc
	newg.ancestors = saveAncestors(callergp)
	newg.startpc = fn.fn
	if _g_.m.curg != nil {
		newg.labels = _g_.m.curg.labels
	}
	if isSystemGoroutine(newg, false) {
		atomic.Xadd(&sched.ngsys, +1)
	}
	// CAS the G's status to _Grunnable
	casgstatus(newg, _Gdead, _Grunnable) 
	newg.goid = int64(_p_.goidcache)
	_p_.goidcache++
	...
	// Release the M; pairs with acquirem above
	releasem(_g_.m)

	return newg
}

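newproc1 is fairly long, so here is a summary of what it does:

Look for a free G in the P's free list gFree;
If none is available, call malg to create a new G. Note that _StackMin is 2048, meaning the new G gets a 2 KB stack. Its status is then changed to _Gdead with a CAS and it is added to the global allgs list;
Based on the entry address and arguments of the function to run, initialize the stack pointer SP and the location of the arguments on the stack, then call memmove to copy the arguments;
Clear and initialize the G's scheduling context, change its status to _Grunnable with a CAS, and return.

Now let's see how runtime.gfget finds a G: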

runtime.gfget

func gfget(_p_ *p) *g {
	retry:
		// If the P's free list gFree is empty and sched's free list gFree is not
		if _p_.gFree.empty() && (!sched.gFree.stack.empty() || !sched.gFree.noStack.empty()) {
			lock(&sched.gFree.lock)
			// Move up to 32 Gs from sched's gFree list to the P's gFree
			for _p_.gFree.n < 32 { 
				gp := sched.gFree.stack.pop()
				if gp == nil {
					gp = sched.gFree.noStack.pop()
					if gp == nil {
						break
					}
				}
				sched.gFree.n--
				_p_.gFree.push(gp)
				_p_.gFree.n++
			}
			unlock(&sched.gFree.lock)
			goto retry
		}
		// If the gFree list is still empty at this point, return nil
		gp := _p_.gFree.pop()
		if gp == nil {
			return nil
		}
		...
		return gp
}

When the P's free list gFree is empty, up to 32 Gs are moved from the free list gFree held by sched to the current P's free list;
A G is then returned from the head of the P's gFree list.
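The same two-level idea, a lock-free local list that refills in batches from a locked global list, can be sketched with plain slices. All names here (item, globalFree, localFree, batch) are invented for illustration, and the runtime's split between Gs with and without stacks is ignored.

```go
package main

import (
	"fmt"
	"sync"
)

// item is a hypothetical reusable object standing in for a dead G.
type item struct{ id int }

// globalFree is a shared free list protected by a lock.
type globalFree struct {
	mu    sync.Mutex
	items []*item
}

// localFree is a per-worker free list that needs no lock.
type localFree struct {
	items []*item
}

const batch = 32

// get pops from the local list, refilling it from the global list
// (up to batch items under one lock acquisition) when it is empty.
// It returns nil when both lists are empty, so the caller allocates a new item.
func (l *localFree) get(g *globalFree) *item {
	if len(l.items) == 0 {
		g.mu.Lock()
		for len(l.items) < batch && len(g.items) > 0 {
			last := len(g.items) - 1
			l.items = append(l.items, g.items[last])
			g.items = g.items[:last]
		}
		g.mu.Unlock()
	}
	if len(l.items) == 0 {
		return nil
	}
	last := len(l.items) - 1
	it := l.items[last]
	l.items = l.items[:last]
	return it
}

func main() {
	g := &globalFree{}
	for i := 0; i < 40; i++ {
		g.items = append(g.items, &item{id: i})
	}
	l := &localFree{}
	fmt.Println(l.get(g) != nil, len(l.items), len(g.items))
}
```

Moving a batch at a time means the global lock is taken roughly once per 32 allocations rather than on every one.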

After newproc has run newproc1, it calls runtime.runqput to put the G on a run queue:

runtime.runqput

func runqput(_p_ *p, gp *g, next bool) {
	if randomizeScheduler && next && fastrand()%2 == 0 {
		next = false
	} 
	if next {
	retryNext:
		// Put the G in runnext so that it is the next task this P runs
		oldnext := _p_.runnext
		if !_p_.runnext.cas(oldnext, guintptr(unsafe.Pointer(gp))) {
			goto retryNext
		}
		if oldnext == 0 {
			return
		} 
		// Move the G that was in runnext to the regular run queue
		gp = oldnext.ptr()
	}

retry:
	h := atomic.LoadAcq(&_p_.runqhead)  
	t := _p_.runqtail
	// Put the G on the P's local run queue
	if t-h < uint32(len(_p_.runq)) {
		_p_.runq[t%uint32(len(_p_.runq))].set(gp)
		atomic.StoreRel(&_p_.runqtail, t+1)  
		return
	}
	// The local queue is full; move work to the global run queue
	if runqputslow(_p_, gp, h, t) {
		return
	} 
	goto retry
}

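runtime.runqput uses next to decide whether to put the G into runnext;
When next is false, it tries to put the G on the local queue, which is a ring buffer with 256 slots. If there is no room, it calls runqputslow, which moves half of the local queue together with the G to the global runq.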
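The behaviour described here can be modelled with a small single-threaded ring buffer. This is a sketch under simplifying assumptions: worker, put and the integer task ids are invented, 0 stands for "no task", and the atomics the runtime needs so that other Ps can steal concurrently are left out.

```go
package main

import "fmt"

const localCap = 256

// worker is a hypothetical, single-threaded model of a P's run queue.
type worker struct {
	next       int // 0 means empty, mirroring runnext == 0
	head, tail uint32
	ring       [localCap]int
}

// put enqueues the non-zero task id g. With next set, g takes the next
// slot and the task it displaces goes to the tail of the ring.
// When the ring is full, half of it plus g spill into *global.
func (w *worker) put(g int, next bool, global *[]int) {
	if next {
		g, w.next = w.next, g
		if g == 0 {
			return // the next slot was empty, nothing was displaced
		}
	}
	if w.tail-w.head < localCap {
		w.ring[w.tail%localCap] = g
		w.tail++
		return
	}
	// Slow path: move the older half of the ring, then g, to the global queue.
	n := (w.tail - w.head) / 2
	for i := uint32(0); i < n; i++ {
		*global = append(*global, w.ring[w.head%localCap])
		w.head++
	}
	*global = append(*global, g)
}

func main() {
	w := &worker{}
	var global []int
	for i := 1; i <= 257; i++ {
		w.put(i, false, &global)
	}
	fmt.Println("local:", w.tail-w.head, "global:", len(global))
}
```

Because head and tail only ever increase and are reduced modulo the capacity on access, the slots are reused without shifting any elements.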

[Figure: runq]

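The Scheduling Loop

Back in runtime·rt0_go: once initialization is done, runtime·mstart is called to start scheduling Gs.

TEXT runtime·rt0_go(SB),NOSPLIT,$0
	...
	// Store the command-line arguments (argc, argv)
	CALL	runtime·args(SB)
	// Initialize the CPU count and the memory page size
	CALL	runtime·osinit(SB)
	// Initialize the scheduler
	CALL	runtime·schedinit(SB)
	// Create a new goroutine to start the program
	MOVQ	$runtime·mainPC(SB), AX		// entry
	// Create the new goroutine, which is bound to runtime.main
	CALL	runtime·newproc(SB)
	// Start the M and begin scheduling goroutines
	CALL	runtime·mstart(SB)
	...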

runtime·mstart calls runtime·mstart1, which initializes M0 and calls runtime.schedule to enter the scheduling loop.

mstart

func mstart1() {
	_g_ := getg()

	if _g_ != _g_.m.g0 {
		throw("bad runtime·mstart")
	} 
	// schedule never returns once it is called, so save the caller's frame first
	save(getcallerpc(), getcallersp())
	asminit()
	minit()
	// Install signal handlers
	if _g_.m == &m0 {
		mstartm0()
	}
	// Run the start function
	if fn := _g_.m.mstartfn; fn != nil {
		fn()
	}
	// If the current m is not m0, it must bind a p
	if _g_.m != &m0 {
		acquirep(_g_.m.nextp.ptr())
		_g_.m.nextp = 0
	}
	// Start scheduling
	schedule()
}

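After saving the scheduling information, mstart1 calls schedule to enter the scheduling loop, which looks for a runnable G and runs it. Let's look at the schedule function.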

schedule

func schedule() {
	_g_ := getg()

	if _g_.m.locks != 0 {
		throw("schedule: holding locks")
	} 
	... 
top:
	pp := _g_.m.p.ptr()
	pp.preempt = false
	// Wait for GC
	if sched.gcwaiting != 0 {
		gcstopm()
		goto top
	}
	// Non-zero means a safe-point function is pending
	if pp.runSafePointFn != 0 {
		runSafePointFn()
	}

	// A spinning M should have an empty run queue
	if _g_.m.spinning && (pp.runnext != 0 || pp.runqhead != pp.runqtail) {
		throw("schedule: spinning with local work")
	}
	// Run the timers on this P that are ready
	checkTimers(pp, 0)

	var gp *g
	var inheritTime bool 
	...
	if gp == nil { 
		// For fairness, every 61st call to schedule takes a G from the global run queue
		if _g_.m.p.ptr().schedtick%61 == 0 && sched.runqsize > 0 {
			lock(&sched.lock)
			// Take one G from the global queue
			gp = globrunqget(_g_.m.p.ptr(), 1)
			unlock(&sched.lock)
		}
	}
	// Take a G from the P's local queue
	if gp == nil {
		gp, inheritTime = runqget(_g_.m.p.ptr())
	}
	// Reaching this point means neither the local nor the global run queue had a G to run
	if gp == nil {
		// Block until a runnable G is found
		gp, inheritTime = findrunnable() // blocks until work is available
	}
	...
	// Run the G's task function
	execute(gp, inheritTime)
}

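In this function we only care about the scheduling-related code. As the code above shows, it looks for a runnable G in the following places:

For fairness, when the global run queue has Gs waiting, schedtick is taken modulo 61, so that on every 61st scheduling round the scheduler tries to take a G from the global queue;
runqget is called to look for a G in the P's local run queue;
If neither of these finds a G, findrunnable tries to "steal" some Gs from other Ps; if there is nothing to steal, it blocks until a runnable G appears.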
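The order of these three steps can be condensed into one small decision function. pickSource is a hypothetical name, and the queue lengths are passed in as plain integers instead of being read from the runtime's structures.

```go
package main

import "fmt"

// pickSource reports where schedule would look for the next G, given the
// P's scheduler tick and the current lengths of the two run queues.
func pickSource(tick uint32, globalLen, localLen int) string {
	// Every 61st tick the global queue gets priority so it cannot starve.
	if tick%61 == 0 && globalLen > 0 {
		return "global"
	}
	if localLen > 0 {
		return "local"
	}
	// Nothing found cheaply: fall back to the slow path, which also steals.
	return "findrunnable"
}

func main() {
	fmt.Println(pickSource(61, 3, 5))
	fmt.Println(pickSource(62, 3, 5))
	fmt.Println(pickSource(62, 3, 0))
}
```

Without the periodic look at the global queue, two goroutines that keep rescheduling each other on one P could keep the local queue busy forever.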

Getting a G from the Global Queue

func globrunqget(_p_ *p, max int32) *g {
	// Return immediately if the global queue has no Gs
	if sched.runqsize == 0 {
		return nil
	}
	// Compute n, this P's share of the global queue
	n := sched.runqsize/gomaxprocs + 1
	if n > sched.runqsize {
		n = sched.runqsize
	}
	// Cap n at max
	if max > 0 && n > max {
		n = max
	}
	if n > int32(len(_p_.runq))/2 {
		n = int32(len(_p_.runq)) / 2
	}

	sched.runqsize -= n
	// Take the G at the head of the global queue
	gp := sched.runq.pop()
	n--
	// Move the remaining n-1 Gs from the global queue to the local queue
	for ; n > 0; n-- {
		gp1 := sched.runq.pop()
		runqput(_p_, gp1, false)
	}
	return gp
}

globrunqget takes n Gs from the global runq. The first one is returned to be run, and the other n-1 are moved from the global queue into the local queue.
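How n is derived is easier to see with concrete numbers. The function batchSize below is a hypothetical extraction of just the arithmetic in globrunqget, with the queue size, the P count, max and the local queue capacity passed in as parameters.

```go
package main

import "fmt"

// batchSize computes how many Gs one P takes from the global queue:
// an equal share plus one, limited by the queue size, by max (if positive)
// and by half of the local queue capacity.
func batchSize(queued, procs, max, localCap int) int {
	if queued == 0 {
		return 0
	}
	n := queued/procs + 1
	if n > queued {
		n = queued
	}
	if max > 0 && n > max {
		n = max
	}
	if n > localCap/2 {
		n = localCap / 2
	}
	return n
}

func main() {
	fmt.Println(batchSize(10, 4, 0, 256))    // 10/4 + 1 = 3
	fmt.Println(batchSize(10, 4, 1, 256))    // capped by max = 1
	fmt.Println(batchSize(10000, 4, 0, 256)) // capped by half the local queue = 128
}
```

With max set to 1, as in the fairness check in schedule, only a single G is taken; findrunnable passes 0 and so takes a full share.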

Getting a G from the Local Queue

func runqget(_p_ *p) (gp *g, inheritTime bool) {
	// If runnext is not empty, take it and return it directly
	for {
		next := _p_.runnext
		if next == 0 {
			break
		}
		if _p_.runnext.cas(next, 0) {
			return next.ptr(), true
		}
	}
	// Take the G at the head of the local queue
	for {
		h := atomic.LoadAcq(&_p_.runqhead)
		t := _p_.runqtail
		// The local queue is empty
		if t == h {
			return nil, false
		}
		gp := _p_.runq[h%uint32(len(_p_.runq))].ptr()
		if atomic.CasRel(&_p_.runqhead, h, h+1) { // cas-release, commits consume
			return gp, false
		}
	}
}

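The local lookup first checks the P's runnext field and returns that G directly if it is not empty. If runnext is empty, the G at the head of the local queue is taken. The local queue is a circular queue, which makes its slots easy to reuse.

Work Stealing

The work-stealing function findrunnable is very complex, around 300 lines long, so we will go through it step by step: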

func findrunnable() (gp *g, inheritTime bool) {
	_g_ := getg()
top:
	_p_ := _g_.m.p.ptr()
	// If GC is waiting, stop the current M; once it is woken up, go back to top
	if sched.gcwaiting != 0 {
		gcstopm()
		goto top
	}
	// Run the pending safe-point function
	if _p_.runSafePointFn != 0 {
		runSafePointFn()
	}

	now, pollUntil, _ := checkTimers(_p_, 0)
	...
	// Get a G from the local P's run queue
	if gp, inheritTime := runqget(_p_); gp != nil {
		return gp, inheritTime
	}

	// Get a G from the global run queue
	if sched.runqsize != 0 {
		lock(&sched.lock)
		gp := globrunqget(_p_, 0)
		unlock(&sched.lock)
		if gp != nil {
			return gp, false
		}
	} 
	// Get a G from the I/O poller
	if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Load64(&sched.lastpoll) != 0 {
		// Try to get a list of Gs from the netpoller
		if list := netpoll(0); !list.empty() { // non-blocking
			gp := list.pop()
			// Put the rest of the list on the run queues
			injectglist(&list)
			casgstatus(gp, _Gwaiting, _Grunnable)
			if trace.enabled {
				traceGoUnpark(gp, 0)
			}
			return gp, false
		}
	}
	...
	if !_g_.m.spinning {
		// Set spinning to indicate that this M is stealing Gs
		_g_.m.spinning = true
		atomic.Xadd(&sched.nmspinning, 1)
	}
	// Start stealing
	for i := 0; i < 4; i++ {
		for enum := stealOrder.start(fastrand()); !enum.done(); enum.next() {
			if sched.gcwaiting != 0 {
				goto top
			}
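			// On the last pass (i > 2), nothing was found in the other run queues, so also steal from their runnext
			stealRunNextG := i > 2 // first look for ready queues with more than 1 g
			// Pick a P in random order
			p2 := allp[enum.position()]
			if _p_ == p2 {
				continue
			}
			// Steal half of the Gs in the other P's run queue into the current queue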
			if gp := runqsteal(_p_, p2, stealRunNextG); gp != nil {
				return gp, false
			}

			// If the run queue had no Gs, run the timers on p2 that are ready
			if i > 2 || (i > 1 && shouldStealTimers(p2)) {
				tnow, w, ran := checkTimers(p2, now)
				now = tnow
				if w != 0 && (pollUntil == 0 || w < pollUntil) {
					pollUntil = w
				}
				if ran {
					if gp, inheritTime := runqget(_p_); gp != nil {
						return gp, inheritTime
					}
					ranTimer = true
				}
			}
		}
	}
	if ranTimer {
		goto top
	}

stop: 
	// During GC, take the G that runs the GC mark work
	if gcBlackenEnabled != 0 && _p_.gcBgMarkWorker != 0 && gcMarkWorkAvailable(_p_) {
		_p_.gcMarkWorkerMode = gcMarkWorkerIdleMode
		gp := _p_.gcBgMarkWorker.ptr()
		// Set the local P's dedicated GC mark G to _Grunnable
		casgstatus(gp, _Gwaiting, _Grunnable)
		if trace.enabled {
			traceGoUnpark(gp, 0)
		}
		return gp, false
	}

	...
	// Take a snapshot of allp before giving up the current P
	allpSnapshot := allp

	// return P and block
	lock(&sched.lock)
	// GC is waiting or a safe-point function is pending; go back to top
	if sched.gcwaiting != 0 || _p_.runSafePointFn != 0 {
		unlock(&sched.lock)
		goto top
	}
	// Work has appeared in the global queue again
	if sched.runqsize != 0 {
		gp := globrunqget(_p_, 0)
		unlock(&sched.lock)
		return gp, false
	}
	if releasep() != _p_ {
		throw("findrunnable: wrong p")
	}
	// Put the p on the idle list
	pidleput(_p_)
	unlock(&sched.lock)
 
	wasSpinning := _g_.m.spinning
	if _g_.m.spinning {
		// The M is about to sleep, so it is no longer spinning
		_g_.m.spinning = false
		if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 {
			throw("findrunnable: negative nmspinning")
		}
	}
 
	// Before sleeping, check all Ps again:
	// walk the snapshot of allp and look at each run queue
	for _, _p_ := range allpSnapshot {
		// If a local queue is no longer empty
		if !runqempty(_p_) {
			lock(&sched.lock)
			// Re-acquire an idle P
			_p_ = pidleget()
			unlock(&sched.lock)
			if _p_ != nil {
				// Bind the M to the P
				acquirep(_p_)
				if wasSpinning {
					// Switch spinning back to true
					_g_.m.spinning = true
					atomic.Xadd(&sched.nmspinning, 1)
				}
				// There is work now; go back to the top to look for a G
				goto top
			}
			break
		}
	}
 
	// Check for GC work once more before sleeping
	if gcBlackenEnabled != 0 && gcMarkWorkAvailable(nil) {
		lock(&sched.lock)
		_p_ = pidleget()
		if _p_ != nil && _p_.gcBgMarkWorker == 0 {
			pidleput(_p_)
			_p_ = nil
		}
		unlock(&sched.lock)
		if _p_ != nil {
			acquirep(_p_)
			if wasSpinning {
				_g_.m.spinning = true
				atomic.Xadd(&sched.nmspinning, 1)
			}
			// Go back to idle GC check.
			goto stop
		}
	}

	// Poll the network once more before sleeping
	if netpollinited() && (atomic.Load(&netpollWaiters) > 0 || pollUntil != 0) && atomic.Xchg64(&sched.lastpoll, 0) != 0 {
		...
		lock(&sched.lock)
		_p_ = pidleget()
		unlock(&sched.lock)
		if _p_ == nil {
			injectglist(&list)
		} else {
			acquirep(_p_)
			if !list.empty() {
				gp := list.pop()
				injectglist(&list)
				casgstatus(gp, _Gwaiting, _Grunnable)
				if trace.enabled {
					traceGoUnpark(gp, 0)
				}
				return gp, false
			}
			if wasSpinning {
				_g_.m.spinning = true
				atomic.Xadd(&sched.nmspinning, 1)
			}
			goto top
		}
	} else if pollUntil != 0 && netpollinited() {
		pollerPollUntil := int64(atomic.Load64(&sched.pollUntil))
		if pollerPollUntil == 0 || pollerPollUntil > pollUntil {
			netpollBreak()
		}
	}
	// Put the current M to sleep
	stopm()
	goto top
}

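One thing to note in this function is the spinning state of a worker thread M. A worker thread is said to be spinning while it is trying to steal Gs from the local run queues of other worker threads. For background on the netpoller, see "A Detailed Look at Go's netpoller I/O Multiplexing Model": https://www.luozhiyun.com/archives/439.

Here is what findrunnable does:

First it checks whether a GC is waiting to run; if so, the current M is stopped and put to sleep;
It looks for a G in the local run queue and the global run queue;
It asks the network poller whether any G is ready to run;
It sets spinning to true to indicate that stealing has begun. Stealing uses two nested for loops. The inner loop visits every P in allp and checks whether its run queue has Gs. If it does, half of them are moved to the current worker thread's run queue and findrunnable returns; if not, it moves on to the next P. Note that the traversal of allp starts from a random P, so that the elements of allp are not visited in the same order every time;
Once every possibility has been tried, some extra checks are made before the M goes to sleep;
First, if this is the GC mark phase and mark work is available, the mark worker G is returned directly;
Before sleeping, the snapshot of all Ps is checked again by walking over every P and inspecting its run queue;
It then checks again whether GC mark work has appeared; if so, it acquires a P and goes back to the GC mark check at the stop label;
It checks the network poller once more and returns a G directly if one is ready;
If nothing was found, the current M is put to sleep.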
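The stealing step on its own can be sketched with plain slices. This is a simplification with invented names: findWork scans sequentially from a caller-supplied start index, whereas the runtime's stealOrder visits the Ps in a pseudo-random permutation, and nothing here is safe for concurrent use.

```go
package main

import (
	"fmt"
	"math/rand"
)

// findWork scans the other queues beginning at index start and moves half
// (rounded up) of the first non-empty one into queues[self].
// It returns the index of the victim, or -1 if every other queue is empty.
// It assumes len(queues) > 0.
func findWork(queues [][]int, self, start int) int {
	n := len(queues)
	for i := 0; i < n; i++ {
		v := (start + i) % n
		if v == self || len(queues[v]) == 0 {
			continue
		}
		k := len(queues[v]) - len(queues[v])/2
		queues[self] = append(queues[self], queues[v][:k]...)
		queues[v] = queues[v][k:]
		return v
	}
	return -1
}

func main() {
	queues := [][]int{nil, {1, 2, 3, 4, 5}, nil, {6}}
	// A random starting point spreads the load across victims.
	victim := findWork(queues, 0, rand.Intn(len(queues)))
	fmt.Println("stole from", victim, "now have", queues[0])
}
```

Taking half rather than a single task means the thief does not have to come back immediately, while the victim still keeps work of its own.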

Running the Task

When schedule reaches this point, a runnable G has finally been found:

func execute(gp *g, inheritTime bool) {
	_g_ := getg()

	// Bind the G to the current M
	_g_.m.curg = gp
	gp.m = _g_.m
	// Switch the G to _Grunning
	casgstatus(gp, _Grunnable, _Grunning)
	gp.waitsince = 0
	// Clear the preemption signal
	gp.preempt = false
	gp.stackguard0 = gp.stack.lo + _StackGuard
	if !inheritTime {
		// Increment the scheduler tick count
		_g_.m.p.ptr().schedtick++
	}
	...
	// gogo performs the actual switch from g0 to gp
	gogo(&gp.sched)
}

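Once execute starts, the G is switched to _Grunning, the M and the G are bound to each other, and finally runtime.gogo is called to start running it.

runtime.gogo restores the stack pointer and program counter saved in the G's runtime.gobuf and jumps to the function to be run. Because newproc1 prepared the G's context with runtime.goexit as the return address, runtime.goexit runs once that function returns: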

TEXT runtime·goexit(SB),NOSPLIT,$0-0
	CALL	runtime·goexit1(SB)
	
func goexit1() {
	// Call goexit0 on the g0 stack
	mcall(goexit0)
}

goexit1 calls goexit0 through mcall:

func goexit0(gp *g) {
	_g_ := getg()
	// Set the G's status to _Gdead
	casgstatus(gp, _Grunning, _Gdead)
	// Clean up the G
	gp.m = nil
	...
	gp.writebuf = nil
	gp.waitreason = 0
	gp.param = nil
	gp.labels = nil
	gp.timer = nil
 
	// Unbind the M and the G
	dropg()
	...
	// Put the G on the gfree list to wait for reuse
	gfput(_g_.m.p.ptr(), gp)
	// Schedule again
	schedule()
}

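goexit0 resets the G, unbinds the M and the G, and puts the G on the gfree list so that a later go statement can reuse it for a new goroutine. Finally, goexit0 calls schedule again to trigger a new round of scheduling.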

[Figure: execute]