VictoriaMetrics fails to discover a scrape target

Posted by charlieroro on 2022-05-08

Problem description

I recently deployed a service in a new environment that exposes its metrics on port 10299 under the path /metrics. Its scrape configuration is as follows (the name fields have been altered):

apiVersion: v1
items:
- apiVersion: operator.victoriametrics.com/v1beta1
  kind: VMServiceScrape
  metadata:
    labels:
      app_id: audit
    name: audit
    namespace: default
  spec:
    endpoints:
    - path: /metrics
      targetPort: 10299
    namespaceSelector:
      matchNames:
      - default
    selector:
      matchLabels:
        app_id: audit

However, vmagent could not discover the target; its status in the vmagent UI looked like this:

(screenshot: the vmagent targets page does not list the audit target)

Standard troubleshooting steps

  1. Make sure the service itself is healthy, i.e. the metrics can be fetched at ${podIp}:10299/metrics
  2. Make sure the vmservicescrape-->service-->endpoints chain is intact, i.e. the configured selector fields actually match the corresponding resources (see the sketch after this list)
  3. Make sure the vmservicescrape resource is well-formed. Note: a malformed vmservicescrape may prevent vmagent from loading its configuration at all, which step 5 will catch
  4. Make sure vmagent is allowed to discover targets in that namespace
  5. Trigger a reload from the vmagent UI and check the vmagent logs for related errors
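To make step 2 concrete, here is a minimal sketch of such a check (assumptions: k8s.io/client-go is available, a local kubeconfig can reach the cluster, and the namespace and label selector are taken from the VMServiceScrape above; this is illustrative code, not part of the original post):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List Services matching the VMServiceScrape selector.
	svcs, err := client.CoreV1().Services("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app_id=audit"})
	if err != nil {
		panic(err)
	}
	if len(svcs.Items) == 0 {
		fmt.Println("selector matched no Service: the chain is broken at the service step")
		return
	}
	// For each matched Service, check that a same-named Endpoints object exists.
	for _, svc := range svcs.Items {
		ep, err := client.CoreV1().Endpoints(svc.Namespace).Get(context.TODO(),
			svc.Name, metav1.GetOptions{})
		if err != nil {
			fmt.Printf("Service %s has no Endpoints object: %v\n", svc.Name, err)
			continue
		}
		fmt.Printf("Service %s -> %d endpoint subset(s)\n", svc.Name, len(ep.Subsets))
	}
}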

None of these checks solved the problem. Stranger still, the target did not appear in vmagent's api/v1/targets at all, which means vmagent had never discovered the service, i.e. the vmservicescrape configuration had not taken effect. The configuration that vmagent generated from the vmservicescrape above is shown below (static configuration gets appended to it); note that it relies on kubernetes_sd_configs to discover targets:

- job_name: serviceScrape/default/audit/0
  metrics_path: /metrics
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app_id]
    regex: audit
    action: keep
  - source_labels: [__meta_kubernetes_pod_container_port_number]
    regex: "10299"
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    target_label: node
    regex: Node;(.*)
    replacement: ${1}
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    target_label: pod
    regex: Pod;(.*)
    replacement: ${1}
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  - source_labels: [__meta_kubernetes_pod_container_name]
    target_label: container
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_service_name]
    target_label: service
  - source_labels: [__meta_kubernetes_service_name]
    target_label: job
    replacement: ${1}
  - target_label: endpoint
    replacement: "8080"
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      own_namespace: false
      names:
      - default

Code analysis

Since the configuration itself was fine, the only option left was to examine how victoriametrics' kubernetes_sd_configs machinery works to find out where things went wrong. The victoriametrics source code shows how the target URL is assembled:

scrapeURL := fmt.Sprintf("%s://%s%s%s%s", schemeRelabeled, addressRelabeled, metricsPathRelabeled, optionalQuestion, paramsStr)

Here:

  • schemeRelabeled: defaults to http
  • metricsPathRelabeled: the metrics_path field from the generated configuration
  • optionalQuestion and paramsStr: no params are configured, so these are empty and can be ignored
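Plugging concrete values into this format string for the audit service (the pod IP 10.233.64.10 is made up for illustration) shows how the final scrape URL comes together:

package main

import "fmt"

func main() {
	// Hypothetical values for the audit service: the scheme defaults to http,
	// __address__ is the pod IP plus the container port, metrics_path is
	// /metrics, and the query-string parts are empty since no params are set.
	scrapeURL := fmt.Sprintf("%s://%s%s%s%s",
		"http", "10.233.64.10:10299", "/metrics", "", "")
	fmt.Println(scrapeURL) // prints http://10.233.64.10:10299/metrics
}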

The key field is addressRelabeled, which comes from a label named "__address__":

func mergeLabels(swc *scrapeWorkConfig, target string, extraLabels, metaLabels map[string]string) []prompbmarshal.Label {
	...
	m["job"] = swc.jobName
	m["__address__"] = target
	m["__scheme__"] = swc.scheme
	m["__metrics_path__"] = swc.metricsPath
	m["__scrape_interval__"] = swc.scrapeInterval.String()
	m["__scrape_timeout__"] = swc.scrapeTimeout.String()
	...
}

Tracing the code further, this label is populated via sc.KubernetesSDConfigs[i].MustStart; as the name KubernetesSDConfigs suggests, this is the component that implements the kubernetes_sd_configs mechanism:

func (sc *ScrapeConfig) mustStart(baseDir string) {
	swosFunc := func(metaLabels map[string]string) interface{} {
		target := metaLabels["__address__"]
		sw, err := sc.swc.getScrapeWork(target, nil, metaLabels)
		if err != nil {
			logger.Errorf("cannot create kubernetes_sd_config target %q for job_name %q: %s", target, sc.swc.jobName, err)
			return nil
		}
		return sw
	}
	for i := range sc.KubernetesSDConfigs {
		sc.KubernetesSDConfigs[i].MustStart(baseDir, swosFunc)
	}
}

Digging deeper to find out what this "__address__" field really is, the call chain looks like:

MustStart --> cfg.aw.mustStart --> aw.gw.startWatchersForRole --> uw.reloadScrapeWorksForAPIWatchersLocked --> o.getTargetLabels

The last function in the chain, getTargetLabels, is an interface method:

type object interface {
	key() string

	// getTargetLabels must be called under gw.mu lock.
	getTargetLabels(gw *groupWatcher) []map[string]string
}

getTargetLabels has one concrete implementation per kubernetes_sd_configs role; the service above uses kubernetes_sd_configs with role endpoints:

(screenshot: the list of getTargetLabels implementations, one per role)

The endpoints implementation is as follows:

func (eps *Endpoints) getTargetLabels(gw *groupWatcher) []map[string]string {
	var svc *Service
	if o := gw.getObjectByRoleLocked("service", eps.Metadata.Namespace, eps.Metadata.Name); o != nil {
		svc = o.(*Service)
	}
	podPortsSeen := make(map[*Pod][]int)
	var ms []map[string]string
	for _, ess := range eps.Subsets {
		for _, epp := range ess.Ports {
			ms = appendEndpointLabelsForAddresses(ms, gw, podPortsSeen, eps, ess.Addresses, epp, svc, "true")
			ms = appendEndpointLabelsForAddresses(ms, gw, podPortsSeen, eps, ess.NotReadyAddresses, epp, svc, "false")
		}
	}
	// See https://kubernetes.io/docs/reference/labels-annotations-taints/#endpoints-kubernetes-io-over-capacity
	// and https://github.com/kubernetes/kubernetes/pull/99975
	switch eps.Metadata.Annotations.GetByName("endpoints.kubernetes.io/over-capacity") {
	case "truncated":
		logger.Warnf(`the number of targets for "role: endpoints" %q exceeds 1000 and has been truncated; please use "role: endpointslice" instead`, eps.Metadata.key())
	case "warning":
		logger.Warnf(`the number of targets for "role: endpoints" %q exceeds 1000 and will be truncated in the next k8s releases; please use "role: endpointslice" instead`, eps.Metadata.key())
	}

	// Append labels for skipped ports on seen pods.
	portSeen := func(port int, ports []int) bool {
		for _, p := range ports {
			if p == port {
				return true
			}
		}
		return false
	}
	for p, ports := range podPortsSeen {
		for _, c := range p.Spec.Containers {
			for _, cp := range c.Ports {
				if portSeen(cp.ContainerPort, ports) {
					continue
				}
				addr := discoveryutils.JoinHostPort(p.Status.PodIP, cp.ContainerPort)
				m := map[string]string{
					"__address__": addr,
				}
				p.appendCommonLabels(m)
				p.appendContainerLabels(m, c, &cp)
				if svc != nil {
					svc.appendCommonLabels(m)
				}
				ms = append(ms, m)
			}
		}
	}
	return ms
}

As you can see, "__address__" is simply p.Status.PodIP joined with cp.ContainerPort, where p is a Kubernetes pod object. Two requirements follow (sketched in code after this list):

  1. the pod must be in the Running state and have been assigned a PodIP
  2. the port exposing the metrics target must be declared in p.Spec.Containers[].ports[].ContainerPort
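The following sketch restates these two requirements as code (illustrative only, built on the k8s.io/api types rather than VictoriaMetrics source; podExposesPort is a made-up helper name):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podExposesPort reports whether discovery can produce a target for the
// given port: the pod must be running with a PodIP, and some container
// must declare that containerPort.
func podExposesPort(p *corev1.Pod, port int32) bool {
	if p.Status.Phase != corev1.PodRunning || p.Status.PodIP == "" {
		return false // requirement 1 not met
	}
	for _, c := range p.Spec.Containers {
		for _, cp := range c.Ports {
			if cp.ContainerPort == port {
				return true // requirement 2 met
			}
		}
	}
	return false
}

func main() {
	// A pod resembling the audit deployment below: running, has an IP,
	// but only declares port 8080.
	p := &corev1.Pod{}
	p.Status.Phase = corev1.PodRunning
	p.Status.PodIP = "10.233.64.10"
	p.Spec.Containers = []corev1.Container{{
		Ports: []corev1.ContainerPort{{ContainerPort: 8080}},
	}}
	fmt.Println(podExposesPort(p, 10299)) // false: 10299 is never declared
}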

Fixing the problem

Given the analysis above, I inspected the deployment in this environment and found that it only declared port 8080; the metrics port 10299 was not declared at all, as the manifest below shows. Declaring the missing port fixed the problem.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app_id: audit
  name: audit
  namespace: default
spec:
  ...
  template:
    metadata:
      ...
    spec:
      containers:
      - env:
        - name: APP_ID
          value: audit
        ports:
        - containerPort: 8080
          protocol: TCP
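The fix, then, is simply to declare the metrics port alongside 8080 in the pod template (a sketch; only the corrected ports stanza is shown):

        ports:
        - containerPort: 8080
          protocol: TCP
        # the missing metrics port; endpoints discovery needs it declared here
        - containerPort: 10299
          protocol: TCP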

Summary

The kubernetes_sd_configs mechanism essentially list-watches the objects of the configured role and assembles each target's __address__ from them. It also exposes a number of extra meta labels, for example:

  • __meta_kubernetes_endpoint_hostname: Hostname of the endpoint.
  • __meta_kubernetes_endpoint_node_name: Name of the node hosting the endpoint.
  • __meta_kubernetes_endpoint_ready: Set to true or false for the endpoint's ready state.
  • __meta_kubernetes_endpoint_port_name: Name of the endpoint port.
  • __meta_kubernetes_endpoint_port_protocol: Protocol of the endpoint port.
  • __meta_kubernetes_endpoint_address_target_kind: Kind of the endpoint address target.
  • __meta_kubernetes_endpoint_address_target_name: Name of the endpoint address target.
