Exploring the Kubernetes probe parameters periodSeconds and timeoutSeconds

Posted by charlieroro on 2024-07-10

Origin of the question

The Kubernetes probe configuration has two parameters that are easy to confuse, periodSeconds and timeoutSeconds. They are configured as follows:

apiVersion: v1
kind: Pod
metadata:
  name: darwin-app
spec:
  containers:
  - name: darwin-container
    image: darwin-image
    livenessProbe:
      httpGet:
        path: /darwin-path
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3

The official documentation describes these two parameters as follows:

  • periodSeconds: How often (in seconds) to perform the probe. Default to 10 seconds. The minimum value is 1.
  • timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.

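In other words, periodSeconds is the interval at which the probe is run, and timeoutSeconds is the timeout for a single probe execution.

There are quite a few discussions of these two parameters online (listed below), and they raise one question: what happens if timeoutSeconds > periodSeconds?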

  1. What is the role of timeoutSeconds in kubernetes liveness/readiness probes?
  2. Kubernetes Health Check: timeoutSeconds exceeds periodSeconds
  3. Does periodSeconds in Kubernetes probe configuration count from the last probe time or the last response/failure time?

The third discussion above describes the timeoutSeconds > periodSeconds case as follows: if the probe times out, the probe period effectively equals timeoutSeconds. Is that claim correct?

If you had the opposite (timeoutSeconds=10, periodSeconds=5), then the probes would look as follows:

0s: liveness probe initiated
10s: liveness probe times out
10s: liveness probe initiated again

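Digging into the source code

Since opinions online differ, let's settle the question by reading the source code.

Kubernetes probes are executed by the kubelet, which currently supports four probe mechanisms: exec, grpc, httpGet, and tcpSocket.

The probe logic is not complicated. Taking the v1.30.2 code as an example, the entry function is shown below. It starts a ticker with a period of w.spec.PeriodSeconds (the periodSeconds defined on the probe) and runs the probe periodically.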

// run periodically probes the container.
func (w *worker) run() {
	ctx := context.Background()
	probeTickerPeriod := time.Duration(w.spec.PeriodSeconds) * time.Second
	...

	probeTicker := time.NewTicker(probeTickerPeriod)
	...
probeLoop:
	for w.doProbe(ctx) {
		// Wait for next probe tick.
		select {
		case <-w.stopCh:
			break probeLoop
		case <-probeTicker.C:
		case <-w.manualTriggerCh:
			// continue
		}
	}
}
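
One property of this loop shape is worth seeing in isolation: the Go documentation for time.Ticker states that the ticker drops ticks to make up for slow receivers, so a probe call that takes longer than the period does not cause missed ticks to pile up. The following is only a minimal sketch of that behavior, not kubelet code; measureStarts and the durations used are hypothetical, and time.Sleep stands in for w.doProbe.

```go
package main

import (
	"fmt"
	"time"
)

// measureStarts runs a loop shaped like worker.run: do the work, then wait
// for the next tick. It returns the gaps between consecutive loop rounds.
// workTime simulates how long one probe call takes (hypothetical value).
func measureStarts(period, workTime time.Duration, rounds int) []time.Duration {
	ticker := time.NewTicker(period)
	defer ticker.Stop()

	gaps := make([]time.Duration, 0, rounds)
	last := time.Now()
	for i := 0; i < rounds; i++ {
		time.Sleep(workTime) // stand-in for w.doProbe(ctx)
		<-ticker.C           // wait for the next tick; it may already be pending
		now := time.Now()
		gaps = append(gaps, now.Sub(last))
		last = now
	}
	return gaps
}

func main() {
	// Work shorter than the period: rounds are roughly one period apart.
	fmt.Println(measureStarts(50*time.Millisecond, 10*time.Millisecond, 3))
	// Work longer than the period: a tick is already pending when the work
	// returns, so rounds are roughly one work duration apart.
	fmt.Println(measureStarts(50*time.Millisecond, 120*time.Millisecond, 3))
}
```

The exact values printed vary from run to run, so no expected output is given here.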

That is where periodSeconds is used. The next step is to find timeoutSeconds:

  1. Start with the doProbe function, which calls w.probeManager.prober.probe:

    // doProbe probes the container once and records the result.
    // Returns whether the worker should continue.
    func (w *worker) doProbe(ctx context.Context) (keepGoing bool) {
    	...
    	// Note, exec probe does NOT have access to pod environment variables or downward API
    	result, err := w.probeManager.prober.probe(ctx, w.probeType, w.pod, status, w.container, w.containerID)
    	if err != nil {
    		// Prober error, throw away the result.
    		return true
    	}
    	...
    }
    
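  2. The probe function below runs one specific probe. Note that it calls pb.runProbeWithRetries with maxProbeRetries, whose value is 3, which means the probe command can be executed at most 3 times within one period (periodSeconds):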

    // probe probes the container.
    func (pb *prober) probe(ctx context.Context, probeType probeType, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID) (results.Result, error) {
    	var probeSpec *v1.Probe
    	switch probeType {
    	case readiness:
    		probeSpec = container.ReadinessProbe
    	case liveness:
    		probeSpec = container.LivenessProbe
    	case startup:
    		probeSpec = container.StartupProbe
    	default:
    		return results.Failure, fmt.Errorf("unknown probe type: %q", probeType)
    	}
    	...
    	result, output, err := pb.runProbeWithRetries(ctx, probeType, probeSpec, pod, status, container, containerID, maxProbeRetries)
    	...
    }
    
  3. The comment on runProbeWithRetries says that the probe may be run several times, until it succeeds or all attempts have failed:

    // runProbeWithRetries tries to probe the container in a finite loop, it returns the last result
    // if it never succeeds.
    func (pb *prober) runProbeWithRetries(ctx context.Context, probeType probeType, p *v1.Probe, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID, retries int) (probe.Result, string, error) {
    	...
    	for i := 0; i < retries; i++ {
    		result, output, err = pb.runProbe(ctx, probeType, p, pod, status, container, containerID)
    	  ...
    	}
    	...
    }
    
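  4. Finally, in runProbe we find p.TimeoutSeconds, the field that corresponds to timeoutSeconds. It is passed as the timeout to each probe mechanism. For an httpGet probe, for example, it becomes the request timeout of the HTTP client: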

    
    func (pb *prober) runProbe(ctx context.Context, probeType probeType, p *v1.Probe, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID) (probe.Result, string, error) {
    
      timeout := time.Duration(p.TimeoutSeconds) * time.Second
      
    	if p.Exec != nil {
    		command := kubecontainer.ExpandContainerCommandOnlyStatic(p.Exec.Command, container.Env)
    		return pb.exec.Probe(pb.newExecInContainer(ctx, container, containerID, command, timeout))
    	}
      
    	if p.HTTPGet != nil {
    		req, err := httpprobe.NewRequestForHTTPGetAction(p.HTTPGet, &container, status.PodIP, "probe")
    		...
    		return pb.http.Probe(req, timeout)
    	}
      
    	if p.TCPSocket != nil {
    		port, err := probe.ResolveContainerPort(p.TCPSocket.Port, &container)
    		...
    		host := p.TCPSocket.Host
    		if host == "" {
    			host = status.PodIP
    		}
    		return pb.tcp.Probe(host, port, timeout)
    	}
    
    	if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.GRPCContainerProbe) && p.GRPC != nil {
    		host := status.PodIP
    		service := ""
    		if p.GRPC.Service != nil {
    			service = *p.GRPC.Service
    		}
    		return pb.grpc.Probe(host, service, int(p.GRPC.Port), timeout)
    	}
    	...
    }
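
To make the role of the timeout in step 4 concrete, here is a minimal sketch of an HTTP check whose request is capped by a timeout, using only the Go standard library. It is not the kubelet's HTTP prober: httpProbe and newSlowServer are hypothetical helpers, and the local test server simply delays its answer to simulate a slow health endpoint.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// httpProbe is a simplified, hypothetical stand-in for an httpGet probe: one
// GET request whose total duration is capped by timeout. Any request error
// (including a timeout) or a status outside [200, 400) counts as a failure.
func httpProbe(url string, timeout time.Duration) bool {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 400
}

// newSlowServer starts a local test server that answers 200 after delay.
func newSlowServer(delay time.Duration) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(delay)
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := newSlowServer(200 * time.Millisecond)
	defer srv.Close()

	fmt.Println(httpProbe(srv.URL, 50*time.Millisecond)) // false: the request times out
	fmt.Println(httpProbe(srv.URL, 2*time.Second))       // true: the answer arrives in time
}
```

The point of the sketch is that the timeout bounds a single request only; it says nothing about how often requests are started.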
    

We can now piece together the relationship between periodSeconds and timeoutSeconds. The logic is similar to the following code.

probeTicker := time.NewTicker(periodSeconds)

for {
	select {
	case <-probeTicker.C:
		// Try the probe up to 3 times (maxProbeRetries) per tick.
		for i := 0; i < 3; i++ {
			if ok := probe(timeoutSeconds); ok {
				break
			}
		}
	}
}
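
The bounded retry part of this logic can also be written as a small standalone helper. This is a sketch only: runWithRetries, errTransient and the attempt callback are hypothetical names, and because the body of the kubelet loop is elided in the excerpt above, the stop condition used here (the first attempt that returns a nil error ends the loop) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient is a hypothetical error used to simulate a failed attempt.
var errTransient = errors.New("transient probe error")

// runWithRetries calls attempt at most retries times and returns how many
// calls were made together with the last error.
// ASSUMPTION: the loop stops at the first attempt that returns a nil error.
func runWithRetries(attempt func() error, retries int) (calls int, err error) {
	for i := 0; i < retries; i++ {
		calls++
		if err = attempt(); err == nil {
			return calls, nil
		}
	}
	return calls, err
}

func main() {
	n := 0
	flaky := func() error {
		n++
		if n < 3 {
			return errTransient
		}
		return nil
	}
	calls, err := runWithRetries(flaky, 3)
	fmt.Println(calls, err) // prints "3 <nil>"
}
```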

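Summary

  • periodSeconds sets the period of the ticker that triggers the probe, while timeoutSeconds is the timeout passed to each probe command.
  • There is no explicit relationship between timeoutSeconds and periodSeconds. With timeoutSeconds=10s and periodSeconds=5s, the actual probe period can be any value in [5s, 30s); it is not periodSeconds=timeoutSeconds as claimed in the answer quoted above. (That answer was written three years ago, so I also checked the v1.19.10 code; its logic is the same as the current version.)
  • Most health checks are not complicated, for example checking whether a file exists or whether the service's /healthz HTTP endpoint is reachable, so it is advisable to set timeoutSeconds to a reasonable value smaller than periodSeconds.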
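
The [5s, 30s) range in the second point is simple arithmetic, sketched below. cycleGapBounds is a hypothetical helper, and it encodes the simplified model used in this article as an assumption: a cycle makes at most 3 attempts, each attempt can last up to timeoutSeconds, and the next cycle starts at the later of "the cycle has finished" and "the next tick has arrived".

```go
package main

import "fmt"

// cycleGapBounds returns the smallest and largest gap, in seconds, between
// the starts of two consecutive probe cycles.
// ASSUMPTION (the article's simplified model): a cycle runs at most
// maxAttempts attempts, each lasting at most timeoutSec, and the next cycle
// starts at the later of "cycle finished" and "next tick".
func cycleGapBounds(periodSec, timeoutSec, maxAttempts int) (minSec, maxSec int) {
	minSec = periodSec
	maxSec = timeoutSec * maxAttempts
	if maxSec < periodSec {
		// The whole cycle always fits inside one period.
		maxSec = periodSec
	}
	return minSec, maxSec
}

func main() {
	lo, hi := cycleGapBounds(5, 10, 3)
	fmt.Printf("gap between cycle starts: %ds to %ds\n", lo, hi)
	// prints "gap between cycle starts: 5s to 30s"
}
```

With timeoutSeconds smaller than periodSeconds divided by the number of attempts, the two bounds coincide and the period is stable, which is the motivation for the recommendation in the last point.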

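The relationship between failureThreshold/successThreshold and maxProbeRetries

  • maxProbeRetries defines the maximum number of times the probe command is attempted within one probe period;
  • If the probe command returns success within a probe period, the count of consecutive successes (compared against successThreshold) increases by 1; otherwise the count of consecutive failures (compared against failureThreshold) increases by 1.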
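
The two points above can be sketched as a counter of consecutive results: each probe period contributes one final result, and the reported state changes only once enough identical results have been seen in a row. The tracker type and its fields are hypothetical and are not taken from the kubelet source.

```go
package main

import "fmt"

// tracker is a hypothetical sketch of consecutive-result counting.
type tracker struct {
	successThreshold int
	failureThreshold int
	healthy          bool // currently reported state
	lastOK           bool // final result of the previous probe period
	run              int  // number of consecutive identical results
}

// observe records the final result of one probe period and returns the
// reported state. The state flips only after enough consecutive results.
func (t *tracker) observe(ok bool) bool {
	if t.run > 0 && ok == t.lastOK {
		t.run++
	} else {
		t.lastOK = ok
		t.run = 1
	}
	if ok && t.run >= t.successThreshold {
		t.healthy = true
	}
	if !ok && t.run >= t.failureThreshold {
		t.healthy = false
	}
	return t.healthy
}

func main() {
	t := &tracker{successThreshold: 1, failureThreshold: 3, healthy: true}
	// Two failures, one success (which resets the run), then three failures.
	for _, ok := range []bool{false, false, true, false, false, false} {
		fmt.Println(t.observe(ok))
	}
	// Only the last call reports false: it is the third failure in a row.
}
```

In this sketch the retries inside one period never touch the counter directly; only the period's final result does.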
