# VictoriaMetrics: vmagent fails to discover a scrape target
## Problem description
I recently deployed a service in a new environment. It exposes its metrics at `:10299/metrics`, with the following configuration (name fields have been modified):
```yaml
apiVersion: v1
items:
- apiVersion: operator.victoriametrics.com/v1beta1
  kind: VMServiceScrape
  metadata:
    labels:
      app_id: audit
    name: audit
    namespace: default
  spec:
    endpoints:
    - path: /metrics
      targetPort: 10299
    namespaceSelector:
      matchNames:
      - default
    selector:
      matchLabels:
        app_id: audit
```
However, checking the target status in vmagent showed that it failed to discover this target.
## General troubleshooting steps
1. Make sure the service itself is healthy, i.e. the metrics can be fetched from `${podIp}:10299/metrics` (a quick check is sketched after this list).
2. Make sure the VMServiceScrape --> Service --> Endpoints chain is intact, i.e. the configured `selector` field actually matches the corresponding resources.
3. Make sure the VMServiceScrape resource is well-formed. Note: a malformed VMServiceScrape resource can prevent vmagent from loading the configuration; this can be detected via step 5.
4. Make sure vmagent is allowed to discover targets in this namespace.
5. Trigger a `reload` in the vmagent UI and check the vmagent logs for related error messages.
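For the first step, a minimal Go sketch that fetches the metrics endpoint directly (the pod IP is a hypothetical placeholder; substitute the real `${podIp}`):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// 10.0.0.12 is a hypothetical pod IP; replace it with the real ${podIp}.
	resp, err := http.Get("http://10.0.0.12:10299/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%d\n%s", resp.StatusCode, body)
}
```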
None of these steps solved the problem. Even stranger, the target could not be found in vmagent's `api/v1/targets` output at all, which means vmagent had never discovered the service in the first place, i.e. the VMServiceScrape configuration never took effect. The configuration that vmagent generated from the above VMServiceScrape is shown below (static configuration is appended to it); it uses `kubernetes_sd_configs` to discover targets:
```yaml
- job_name: serviceScrape/default/audit/0
  metrics_path: /metrics
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app_id]
    regex: audit
    action: keep
  - source_labels: [__meta_kubernetes_pod_container_port_number]
    regex: "10299"
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    target_label: node
    regex: Node;(.*)
    replacement: ${1}
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    target_label: pod
    regex: Pod;(.*)
    replacement: ${1}
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  - source_labels: [__meta_kubernetes_pod_container_name]
    target_label: container
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_service_name]
    target_label: service
  - source_labels: [__meta_kubernetes_service_name]
    target_label: job
    replacement: ${1}
  - target_label: endpoint
    replacement: "8080"
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      own_namespace: false
      names:
      - default
```
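Note the two `keep` rules at the top: any discovered target whose `app_id` service label or container port number fails the regex is dropped. A minimal sketch of the `keep` semantics (simplified; the real logic lives in VictoriaMetrics' relabeling code), assuming fully anchored regexes as in Prometheus-style relabeling:

```go
package main

import (
	"fmt"
	"regexp"
)

// keep reports whether a target survives a `keep` relabel rule:
// the joined source label values must match the anchored regex.
func keep(sourceValue, expr string) bool {
	re := regexp.MustCompile("^(?:" + expr + ")$")
	return re.MatchString(sourceValue)
}

func main() {
	// __meta_kubernetes_pod_container_port_number for each discovered port:
	fmt.Println(keep("8080", "10299"))  // false -> target dropped
	fmt.Println(keep("10299", "10299")) // true  -> target kept
}
```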
## Code analysis
Since the configuration itself was fine, the only way forward was to look at how VictoriaMetrics implements `kubernetes_sd_configs` and find out where things go wrong. In the VictoriaMetrics source code, the scrape target URL is assembled as follows:
```go
scrapeURL := fmt.Sprintf("%s://%s%s%s%s", schemeRelabeled, addressRelabeled, metricsPathRelabeled, optionalQuestion, paramsStr)
```
Where:

- `schemeRelabeled`: defaults to `http`
- `metricsPathRelabeled`: the `metrics_path` field from the generated configuration
- `optionalQuestion` and `paramsStr`: not configured here and can be ignored
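Putting these pieces together for this case, a quick sketch of the assembled URL (the pod IP is a hypothetical value):

```go
package main

import "fmt"

func main() {
	schemeRelabeled := "http"             // default scheme
	addressRelabeled := "10.0.0.12:10299" // from the __address__ label; hypothetical pod IP
	metricsPathRelabeled := "/metrics"    // the metrics_path field
	optionalQuestion, paramsStr := "", "" // no query params configured

	scrapeURL := fmt.Sprintf("%s://%s%s%s%s",
		schemeRelabeled, addressRelabeled, metricsPathRelabeled, optionalQuestion, paramsStr)
	fmt.Println(scrapeURL) // http://10.0.0.12:10299/metrics
}
```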
The key field is `addressRelabeled`, which comes from a label named `__address__`:
```go
func mergeLabels(swc *scrapeWorkConfig, target string, extraLabels, metaLabels map[string]string) []prompbmarshal.Label {
	...
	m["job"] = swc.jobName
	m["__address__"] = target
	m["__scheme__"] = swc.scheme
	m["__metrics_path__"] = swc.metricsPath
	m["__scrape_interval__"] = swc.scrapeInterval.String()
	m["__scrape_timeout__"] = swc.scrapeTimeout.String()
	...
}
```
Tracing the code further, this label is obtained via `sc.KubernetesSDConfigs[i].MustStart`. As the name suggests, `KubernetesSDConfigs` is the piece responsible for the `kubernetes_sd_configs` mechanism:
```go
func (sc *ScrapeConfig) mustStart(baseDir string) {
	swosFunc := func(metaLabels map[string]string) interface{} {
		target := metaLabels["__address__"]
		sw, err := sc.swc.getScrapeWork(target, nil, metaLabels)
		if err != nil {
			logger.Errorf("cannot create kubernetes_sd_config target %q for job_name %q: %s", target, sc.swc.jobName, err)
			return nil
		}
		return sw
	}
	for i := range sc.KubernetesSDConfigs {
		sc.KubernetesSDConfigs[i].MustStart(baseDir, swosFunc)
	}
}
```
Going one level deeper to see what this `__address__` label actually is, the call chain looks like this:
```
MustStart
  --> cfg.aw.mustStart
  --> aw.gw.startWatchersForRole
  --> uw.reloadScrapeWorksForAPIWatchersLocked
  --> o.getTargetLabels
```
The last function in the chain, `getTargetLabels`, is an interface method:
```go
type object interface {
	key() string
	// getTargetLabels must be called under gw.mu lock.
	getTargetLabels(gw *groupWatcher) []map[string]string
}
```
Each `kubernetes_sd_configs` role has its own implementation of `getTargetLabels`. The service above uses the `endpoints` role, whose implementation is shown below:
```go
func (eps *Endpoints) getTargetLabels(gw *groupWatcher) []map[string]string {
	var svc *Service
	if o := gw.getObjectByRoleLocked("service", eps.Metadata.Namespace, eps.Metadata.Name); o != nil {
		svc = o.(*Service)
	}
	podPortsSeen := make(map[*Pod][]int)
	var ms []map[string]string
	for _, ess := range eps.Subsets {
		for _, epp := range ess.Ports {
			ms = appendEndpointLabelsForAddresses(ms, gw, podPortsSeen, eps, ess.Addresses, epp, svc, "true")
			ms = appendEndpointLabelsForAddresses(ms, gw, podPortsSeen, eps, ess.NotReadyAddresses, epp, svc, "false")
		}
	}
	// See https://kubernetes.io/docs/reference/labels-annotations-taints/#endpoints-kubernetes-io-over-capacity
	// and https://github.com/kubernetes/kubernetes/pull/99975
	switch eps.Metadata.Annotations.GetByName("endpoints.kubernetes.io/over-capacity") {
	case "truncated":
		logger.Warnf(`the number of targets for "role: endpoints" %q exceeds 1000 and has been truncated; please use "role: endpointslice" instead`, eps.Metadata.key())
	case "warning":
		logger.Warnf(`the number of targets for "role: endpoints" %q exceeds 1000 and will be truncated in the next k8s releases; please use "role: endpointslice" instead`, eps.Metadata.key())
	}
	// Append labels for skipped ports on seen pods.
	portSeen := func(port int, ports []int) bool {
		for _, p := range ports {
			if p == port {
				return true
			}
		}
		return false
	}
	for p, ports := range podPortsSeen {
		for _, c := range p.Spec.Containers {
			for _, cp := range c.Ports {
				if portSeen(cp.ContainerPort, ports) {
					continue
				}
				addr := discoveryutils.JoinHostPort(p.Status.PodIP, cp.ContainerPort)
				m := map[string]string{
					"__address__": addr,
				}
				p.appendCommonLabels(m)
				p.appendContainerLabels(m, c, &cp)
				if svc != nil {
					svc.appendCommonLabels(m)
				}
				ms = append(ms, m)
			}
		}
	}
	return ms
}
```
As you can see, `__address__` is simply `p.Status.PodIP` joined with `cp.ContainerPort`, where `p` is a Kubernetes `Pod` data structure. This imposes two requirements (made concrete in the sketch after this list):

- the pod must be in the Running state and must have been assigned a `PodIP`;
- the port exposing the metrics target must be declared in `p.Spec.Containers[].Ports[].ContainerPort`.
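Here is a self-contained sketch of the discovery loop above, using toy stand-ins for the Kubernetes types (not the actual VictoriaMetrics structs): only ports declared in the pod spec ever produce an `__address__`:

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// Toy stand-ins for the pod structures referenced above.
type ContainerPort struct{ ContainerPort int }
type Container struct{ Ports []ContainerPort }
type Pod struct {
	PodIP      string
	Containers []Container
}

func main() {
	// The audit pod declares only port 8080 in its spec.
	pod := Pod{
		PodIP:      "10.0.0.12", // hypothetical pod IP
		Containers: []Container{{Ports: []ContainerPort{{ContainerPort: 8080}}}},
	}
	// Mirror of the loop in getTargetLabels: one __address__ per declared port.
	for _, c := range pod.Containers {
		for _, cp := range c.Ports {
			addr := net.JoinHostPort(pod.PodIP, strconv.Itoa(cp.ContainerPort))
			fmt.Println("__address__ =", addr) // __address__ = 10.0.0.12:8080
		}
	}
	// Port 10299 is never emitted, so the keep rule on
	// __meta_kubernetes_pod_container_port_number ("10299") matches nothing.
}
```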
## Resolution
Given the analysis above, I inspected the deployment in the environment and found that it only declared port 8080; port 10299, which exposes the metrics, was not declared at all. Declaring the missing port solved the problem.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app_id: audit
  name: audit
  namespace: default
spec:
  ...
  template:
    metadata:
      ...
    spec:
      containers:
      - env:
        - name: APP_ID
          value: audit
        ports:
        - containerPort: 8080
          protocol: TCP
```
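For reference, a minimal sketch of the fix, assuming the container really serves metrics on 10299: declare the port in the pod template so that endpoints discovery can see it:

```yaml
        ports:
        - containerPort: 8080
          protocol: TCP
        - containerPort: 10299   # the metrics port; must be declared for discovery
          protocol: TCP
```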
## Summary
The `kubernetes_sd_configs` mechanism essentially list-watches the objects of the configured role and assembles each target's `__address__` from them. In addition, it exposes a set of extra meta labels, such as:
- `__meta_kubernetes_endpoint_hostname`: Hostname of the endpoint.
- `__meta_kubernetes_endpoint_node_name`: Name of the node hosting the endpoint.
- `__meta_kubernetes_endpoint_ready`: Set to `true` or `false` for the endpoint's ready state.
- `__meta_kubernetes_endpoint_port_name`: Name of the endpoint port.
- `__meta_kubernetes_endpoint_port_protocol`: Protocol of the endpoint port.
- `__meta_kubernetes_endpoint_address_target_kind`: Kind of the endpoint address target.
- `__meta_kubernetes_endpoint_address_target_name`: Name of the endpoint address target.
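These meta labels can be used in `relabel_configs` like any other; for example, a sketch of a rule that keeps only ready endpoints:

```yaml
relabel_configs:
- source_labels: [__meta_kubernetes_endpoint_ready]
  regex: "true"
  action: keep
```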