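Author: caiyfc

From the Digital China Titanium Alloy Team

The Digital China Titanium Alloy Team is a professional technical team dedicated to providing enterprises with complete solutions for the distributed database TiDB. Its members have extensive database backgrounds, all hold advanced TiDB certifications, and are active in the TiDB open-source community; the team is an officially certified partner. It has delivered professional TiDB services to more than 10 customers across key industries including finance, securities, logistics, electric power, government, and retail.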

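Background

I have recently been working on site with a customer and found that there are a great many TiDB clusters here: nearly 150. The smallest have 6 nodes and the largest have more than 200. At this scale, manual inspection becomes very tedious.

So is there a way to replace manual inspection and obtain cluster information quickly and accurately? The answer is yes, but not entirely. TiDB's Prometheus can provide all kinds of cluster data, and alerting is a good example. Unfortunately, an alert only evaluates current data, whereas an inspection needs data over a period of time as the basis for judgment. Moreover, an alert fires when a critical threshold has already been reached, while an inspection aims to find hidden risks in the cluster so that planning can start early and failures can be avoided.

So why not simply pull a period of data from Prometheus and lower the alert thresholds?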

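Getting to know PromQL

To use Prometheus, you first need to understand what PromQL is.

PromQL differs from the SQL query language used with everyday databases (SELECT * FROM ...). PromQL is a nested functional language: the data to query is described as a set of nested expressions, each expression evaluates to an intermediate value, each intermediate value is used as an argument to the expression above it, and the outermost expression is the final return value you see in a table or graph. For example: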

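histogram_quantile(  # Root of the query; the final result is an approximate quantile.
  0.9,  # First argument of histogram_quantile(): the target quantile
  # Second argument of histogram_quantile(): the aggregated histogram
  sum by(le, method, path) (
    # Argument of sum(): per-second increase of the histogram over the past 5 minutes.
    rate(
      # Argument of rate(): the raw histogram series over the past 5 minutes
      demo_api_request_duration_seconds_bucket{job="demo"}[5m]
    )
  )
)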

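Next, here are some functions that appear frequently in alerting PromQL:

rate

The most common function for computing a rate of change is rate(). It calculates the average per-second increase of a counter over a specified time range. Because it averages over a time range, a range selector must be added after the series selector.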
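
As a rough illustration of what rate() computes, the sketch below averages a counter's per-second increase over a window. It is a simplification and not Prometheus's implementation: the real rate() also extrapolates to the window boundaries, which is omitted here. The function name simple_rate and the sample data are made up for illustration.

```python
def simple_rate(samples):
    """Average per-second increase of a counter over the given samples.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplified: handles counter resets but skips the boundary
    extrapolation that the real rate() performs.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in a counter means the process restarted and the counter reset to 0.
        increase += cur if cur < prev else cur - prev
    duration = samples[-1][0] - samples[0][0]
    return increase / duration


# Hypothetical counter scraped every 15 s over one minute.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 190.0), (60, 220.0)]
print(simple_rate(window))  # 2.0
```

The counter grows by 120 in 60 seconds, so the average rate is 2 per second regardless of how the growth was distributed inside the window.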

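irate

Using rate or increase to compute the average growth rate of samples easily runs into the long-tail problem: they cannot reflect sudden changes in the sample data within the time window.

For example, within a 2-minute window a host's CPU may hit 100% at some moment because of a traffic spike or another problem, but the average growth rate over the window will not reveal it.

To address this, PromQL provides a more sensitive function, irate(v range-vector). irate also computes the rate of increase of a range vector, but it reflects the instantaneous rate, calculated from the last two samples in the range.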
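
The difference can be seen in a small sketch. The two helpers below are illustrative stand-ins, not Prometheus code: avg_rate averages over the whole window the way rate() does, instant_rate looks only at the last two samples the way irate() does. Counter resets are ignored, and the sample data is invented.

```python
def avg_rate(samples):
    """Per-second increase averaged over the whole window (rate-like)."""
    return (samples[-1][1] - samples[0][1]) / (samples[-1][0] - samples[0][0])


def instant_rate(samples):
    """Per-second increase between the last two samples only (irate-like)."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)


# Hypothetical CPU-seconds counter over 2 minutes:
# about 10% busy for most of the window, then fully busy in the last 15 s.
cpu_seconds = [(0, 0.0), (15, 1.5), (30, 3.0), (45, 4.5), (60, 6.0),
               (75, 7.5), (90, 9.0), (105, 10.5), (120, 25.5)]
print(avg_rate(cpu_seconds))      # 0.2125
print(instant_rate(cpu_seconds))  # 1.0
```

The window average reports about 21% usage, while the last two samples show the core fully saturated, which is exactly the burst that an average hides.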

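histogram_quantile

Computes quantiles of the data. histogram_quantile(φ scalar, b instant-vector) calculates a quantile over a period of time from a histogram metric. It takes the target quantile (0 ≤ φ ≤ 1) and the histogram metric as input. This is the pxx people usually talk about; p50 is the median. The argument b must be an instant vector that carries the le label, since without it no quantile can be computed. The result is an estimate and not fully accurate, because the function assumes that samples are distributed linearly within each bucket. The accuracy depends on how finely the buckets are divided: the coarser the buckets, the lower the accuracy.

This section draws on: Prometheus 基礎相關--PromQL 基礎(2) (https://zhuanlan.zhihu.com/p/586528847). Readers who want to learn more can read the original article.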
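
The linear-interpolation idea can be sketched in a few lines of Python. This is an approximation of the approach described above, written for illustration only; the function name bucket_quantile and the latency histogram are assumptions, not part of Prometheus or TiDB.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the (math.inf, total) bucket, like the `le` label series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume samples are spread evenly between lower and upper.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request-latency histogram (seconds).
latency = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(bucket_quantile(0.95, latency))  # 0.8125
```

With these buckets the true p95 could lie anywhere between 0.5 s and 1.0 s; the estimate of 0.8125 s comes purely from the even-spread assumption, which is why coarse buckets give less accurate results.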

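Modifying the PromQL

To use PromQL for inspection, the PromQL in the alert rules must be modified. This requires one more function: max_over_time(range-vector), which returns the maximum value of each series in a range vector. There are other time-aggregation functions of this kind, such as avg_over_time, min_over_time, and sum_over_time, but we only need the maximum in order to alert the DBA.

Prometheus supports subqueries, which let us first run an inner query at a specified step over a period of time and then compute the outer query from the subquery's results. A subquery is written like the duration of a range vector, but with an extra step parameter after a colon: [<range>:<resolution>].

For example: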

# Original
sum(rate(tikv_thread_cpu_seconds_total{name=~"(raftstore|rs)_.*"}[1m])) by (instance)
# Modified
max_over_time(avg(rate(tikv_thread_cpu_seconds_total{name=~"(raftstore|rs)_.*"}[1m])) by (instance)[24h:1m])

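This is the alert item for the CPU usage of the TiKV raftstore thread pool. The original sums the 1-minute rates of all threads, whereas my modified version averages the 1-minute usage of all threads, runs that average query once per minute over the 24 hours leading up to now, and then takes the maximum.

In other words, from 24 hours ago until now, the following is executed once per minute (a step of 1 minute): avg(rate(tikv_thread_cpu_seconds_total{name=~"(raftstore|rs)_.*"}[1m])) by (instance), and the largest value is returned. This gives us the period of data we need to judge whether the cluster is at risk.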
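
The same rewrite can be applied mechanically. The helper below is a hypothetical convenience, not part of the inspection script: it wraps any instant-vector expression in a parenthesized subquery and max_over_time, and optionally appends a threshold comparison.

```python
def to_inspection_query(expr, days=1, step="1m", threshold=None):
    """Wrap an alert expression in max_over_time over a subquery.

    expr is evaluated every `step` across the last `days` days and the
    maximum per series is kept. threshold, if given, is appended as a
    comparison such as "> 0.8".
    """
    query = "max_over_time((%s)[%dh:%s])" % (expr, 24 * days, step)
    if threshold:
        query += " " + threshold
    return query


expr = ('avg(rate(tikv_thread_cpu_seconds_total'
        '{name=~"(raftstore|rs)_.*"}[1m])) by (instance)')
print(to_inspection_query(expr, days=1, step="1m", threshold="> 0.8"))
```

The extra parentheses around the inner expression are not required for this particular query, but they keep the subquery unambiguous when the expression ends in a binary operation.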

We can then pick suitable PromQL expressions and add the time-aggregation function together with the query range and step:

# TiKV 1
'TiDB.tikv.TiKV_server_is_down': {
    'pql': 'probe_success{group="tikv",instance=~".*"} == 0',
    'pql_max': '',
    'note': 'TiKV service is unavailable'
},
'TiDB.tikv.TiKV_node_restart': {
    'pql': 'changes(process_start_time_seconds{job="tikv",instance=~".*"}[24h])> 0',
    'pql_max': 'max(changes(process_start_time_seconds{job="tikv",instance=~".*"}[24h]))',
    'note': 'TiKV service restarted within the inspection window'
},
'TiDB.tikv.TiKV_GC_can_not_work': {
    'pql_max': '',
    'pql': 'sum(increase(tikv_gcworker_gc_tasks_vec{task="gc", instance=~".*"}[2d])) by (instance) < 1 and (sum(increase('
           'tikv_gc_compaction_filter_perform{instance=~".*"}[2d])) by (instance) < 1 and sum(increase('
           'tikv_engine_event_total{cf="write",db="kv",type="compaction",instance=~".*"}[2d])) by (instance) >= 1)',
    'note': 'TiKV GC is not working'
},
# TiKV 2
'TiDB.tikv.TiKV_raftstore_thread_cpu_seconds_total': {
    'pql_max': 'max_over_time(avg(rate(tikv_thread_cpu_seconds_total{name=~"(raftstore|rs)_.*"}[1m])) by (instance)[24h:1m])',
    'pql': 'max_over_time(avg(rate(tikv_thread_cpu_seconds_total{name=~"(raftstore|rs)_.*"}[1m])) by (instance)[24h:1m])  > 0.8',
    'note': 'TiKV raftstore thread pool CPU usage is too high'
},
'TiDB.tikv.TiKV_approximate_region_size': {
    'pql_max': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_raftstore_region_size_bucket{instance=~".*"}[1m])) '
               'by (le,instance))[24h:1m])',
    'pql': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_raftstore_region_size_bucket{instance=~".*"}[1m])) '
           'by (le,instance))[24h:1m]) > 1073741824',
    'note': 'Largest Region approximate size scanned by the TiKV split checker exceeds 1 GB'
},
'TiDB.tikv.TiKV_async_request_write_duration_seconds': {
    'pql_max': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_storage_engine_async_request_duration_seconds_bucket'
               '{type="write", instance=~".*"}[1m])) by (le, instance, type))[24h:1m])',
    'pql': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_storage_engine_async_request_duration_seconds_bucket'
           '{type="write", instance=~".*"}[1m])) by (le, instance, type))[24h:1m]) > 1',
    'note': 'TiKV Raft write response time is too long'
},
'TiDB.tikv.TiKV_scheduler_command_duration_seconds': {
    'pql_max': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_scheduler_command_duration_seconds_bucket[20m])) by (le, instance, type)  / 1000)[24h:20m]) ',
    'pql': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_scheduler_command_duration_seconds_bucket[20m])) by (le, instance, type)  / 1000)[24h:20m])  > 20 ',
    'note': 'TiKV scheduler command response time is too long'
},
'TiDB.tikv.TiKV_scheduler_latch_wait_duration_seconds': {
    'pql_max': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_scheduler_latch_wait_duration_seconds_bucket[20m])) by (le, instance, type))[24h:20m]) ',
    'pql': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_scheduler_latch_wait_duration_seconds_bucket[20m])) by (le, instance, type))[24h:20m])  > 20',
    'note': 'TiKV scheduler latch wait time is too long'
},
'TiDB.tikv.TiKV_write_stall': {
    'pql_max': 'max_over_time(delta(tikv_engine_write_stall{instance=~".*"}[10m])[24h:10m])',
    'pql': 'max_over_time(delta('
           'tikv_engine_write_stall{instance=~".*"}[10m])[24h:10m]) > 10',
    'note': 'TiKV has a write stall'
},
# TiKV 3
'TiDB.tikv.TiKV_server_report_failure_msg_total': {
    'pql_max': 'max_over_time(sum(rate(tikv_server_report_failure_msg_total{type="unreachable"}[10m])) BY (instance)[24h:10m])',
    'pql': 'max_over_time(sum(rate(tikv_server_report_failure_msg_total{type="unreachable"}[10m])) BY (instance)[24h:10m]) > 10',
    'note': 'TiKV node reports too many failures'
},
'TiDB.tikv.TiKV_channel_full_total': {
    'pql_max': 'max_over_time(sum(rate(tikv_channel_full_total{instance=~".*"}[10m])) BY (type, instance)[24h:10m])',
    'pql': 'max_over_time(sum(rate(tikv_channel_full_total{instance=~".*"}[10m])) BY (type, instance)[24h:10m]) > 0',
    'note': 'TiKV channel is full and TiKV is too busy'
},
'TiDB.tikv.TiKV_raft_log_lag': {
    'pql_max': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_raftstore_log_lag_bucket{instance=~".*"}[1m])) by (le,instance))[24h:10m])',
    'pql': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_raftstore_log_lag_bucket{instance=~".*"}[1m])) by (le, '
           'instance))[24h:10m]) > 5000',
    'note': 'TiKV raft log lag is too large'
},
'TiDB.tikv.TiKV_thread_unified_readpool_cpu_seconds': {
    'pql_max': 'max_over_time(avg(rate(tikv_thread_cpu_seconds_total{name=~"unified_read_po*", instance=~".*"}[1m])) by (instance)[24h:1m])',
    'pql': 'max_over_time(avg(rate(tikv_thread_cpu_seconds_total{name=~"unified_read_po*", instance=~".*"}[1m])) '
           'by (instance)[24h:1m]) > 0.7',
    'note': 'unified read pool usage is above 70%'
},
'TiDB.tikv.TiKV_low_space': {
    'pql_max': 'sum(tikv_store_size_bytes{type="available"}) by (instance) / sum(tikv_store_size_bytes{type="capacity"}) by (instance)',
    'pql': 'sum(tikv_store_size_bytes{type="available"}) by (instance) / sum(tikv_store_size_bytes{type="capacity"}) by (instance) < 0.3',
    'note': 'TiKV available storage space is below the threshold'
},

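Because some alert items use 5 or 10 minutes of data, the step must be changed to 5 or 10 minutes to match. Keeping the step equal to the window ensures that the check covers the entire selected period without overlapping calculations that waste resources.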
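
A small simulation makes the trade-off concrete. The sketch below is a simplified model, assuming evaluation points fall exactly on multiples of the step from the start of the range (real Prometheus aligns subquery steps to absolute time); window_coverage is a made-up name.

```python
def window_coverage(range_minutes, window_minutes, step_minutes):
    """Count how many inner windows cover each minute of the subquery range.

    The inner expression is evaluated every step_minutes; each evaluation
    looks back window_minutes. Returns (uncovered_minutes, max_overlap).
    """
    hits = [0] * range_minutes
    # Evaluation points at step, 2*step, ... up to the end of the range.
    for end in range(step_minutes, range_minutes + 1, step_minutes):
        for minute in range(max(0, end - window_minutes), end):
            hits[minute] += 1
    return hits.count(0), max(hits)


day = 24 * 60
print(window_coverage(day, 10, 10))  # (0, 1)
print(window_coverage(day, 10, 1))   # (0, 10)
print(window_coverage(day, 10, 20))  # (720, 1)
```

With a 10-minute window, a 10-minute step covers every minute exactly once, a 1-minute step recomputes most minutes ten times, and a 20-minute step leaves half of the day unchecked.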

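Incidentally, without max_over_time you get all the data points with their timestamps rather than only the maximum. The full timestamped data is convenient for plotting, to show trends the way Grafana does.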
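
As a sketch of how that timestamped data might be consumed, the function below flattens a "matrix" result, the shape the Prometheus HTTP API returns for a range vector, into per-instance lists of points ready for plotting. The name parse_matrix and the response body are invented for illustration; no request is actually sent.

```python
import datetime


def parse_matrix(payload):
    """Turn a Prometheus 'matrix' result into {instance: [(time, value), ...]}."""
    series = {}
    for item in payload["data"]["result"]:
        instance = item["metric"].get("instance", "cluster")
        series[instance] = [
            (datetime.datetime.fromtimestamp(ts, datetime.timezone.utc), float(val))
            for ts, val in item["values"]
        ]
    return series


# Made-up response body in the shape the HTTP API returns for a range vector.
sample = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [{
            "metric": {"instance": "10.0.0.1:20180"},
            "values": [[1700000000, "0.12"], [1700000060, "0.35"], [1700000120, "0.2"]],
        }],
    },
}
points = parse_matrix(sample)["10.0.0.1:20180"]
# The highest point and when it occurred, which max_over_time alone would not tell us.
print(max(points, key=lambda p: p[1]))
```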

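Inspection script

With all of the above in place, we can start writing the inspection script.

This is part of the inspection script that a colleague and I wrote together. The most important part is the PromQL in tasks, which must be written before the script runs; everything else can be changed freely. If a single inspection covers too many days, for example a whole month, Prometheus may return an error because there is too much data to examine, so watch for error messages when using the script to avoid missing inspection items.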

# -*- coding: utf-8 -*-
import subprocess
import re
import datetime
import requests
import sys
import pandas as pd
days = None
def get_cluster_name():
    try:
        command = "tiup cluster list"
        result = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        output, error = result.communicate()
        cluster_name_match = re.search(r'([a-zA-Z0-9_-]+)\s+tidb\s+v', output.decode('utf-8'))
        if cluster_name_match:
            return cluster_name_match.group(1)
        else:
            return None
    except Exception as e:
        print("An error occurred:", e)
        return None
def display_cluster_info(cluster_name):
    if not cluster_name:
        print("Cluster name not found.")
        return
    try:
        command = "tiup cluster display {0}".format(cluster_name)
        result = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        output, error = result.communicate()
        return output.decode('utf-8')
    except Exception as e:
        print("An error occurred:", e)
def extract_id_role(output):
    id_role_dict = {}
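    # display_cluster_info() returns None when no cluster was found
    lines = (output or "").strip().split("\n")
    for line in lines:
        print(line)
        parts = line.split()
        # Skip blank or single-column lines; only "ip:port role ..." rows matter
        if len(parts) >= 2 and is_valid_ip_port(parts[0]):
            node_id, role = parts[0], parts[1]
            id_role_dict[node_id] = role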
    return id_role_dict
def is_valid_ip_port(input_str):
    pattern = re.compile(r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}$')
    return bool(pattern.match(input_str))
def get_prometheus_ip(data_dict):
    prometheus_ip = None
    for key, value in data_dict.items():
        if value == 'prometheus':
            prometheus_ip = key
            break
    return prometheus_ip
def get_tasks():
    global days
    tasks = {
        # TiKV 1
        'TiDB.tikv.TiKV_server_is_down': {
            'pql': 'probe_success{group="tikv",instance=~".*"} == 0',
            'pql_max': '',
            'note': 'TiKV service is unavailable'
        },
        'TiDB.tikv.TiKV_node_restart': {
            'pql': 'changes(process_start_time_seconds{job="tikv",instance=~".*"}[24h])> 0',
            'pql_max': 'max(changes(process_start_time_seconds{job="tikv",instance=~".*"}[24h]))',
            'note': 'TiKV service restarted within the inspection window'
        },
        'TiDB.tikv.TiKV_GC_can_not_work': {
            'pql_max': '',
            'pql': 'sum(increase(tikv_gcworker_gc_tasks_vec{task="gc", instance=~".*"}[2d])) by (instance) < 1 and (sum(increase('
                   'tikv_gc_compaction_filter_perform{instance=~".*"}[2d])) by (instance) < 1 and sum(increase('
                   'tikv_engine_event_total{cf="write",db="kv",type="compaction",instance=~".*"}[2d])) by (instance) >= 1)',
            'note': 'TiKV GC is not working'
        },
        # TiKV 2
        'TiDB.tikv.TiKV_raftstore_thread_cpu_seconds_total': {
            'pql_max': 'max_over_time(avg(rate(tikv_thread_cpu_seconds_total{name=~"(raftstore|rs)_.*"}[1m])) by (instance)[24h:1m])',
            'pql': 'max_over_time(avg(rate(tikv_thread_cpu_seconds_total{name=~"(raftstore|rs)_.*"}[1m])) by (instance)[24h:1m])  > 0.8',
            'note': 'TiKV raftstore thread pool CPU usage is too high'
        },
        'TiDB.tikv.TiKV_approximate_region_size': {
            'pql_max': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_raftstore_region_size_bucket{instance=~".*"}[1m])) '
                       'by (le,instance))[24h:1m])',
            'pql': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_raftstore_region_size_bucket{instance=~".*"}[1m])) '
                   'by (le,instance))[24h:1m]) > 1073741824',
            'note': 'Largest Region approximate size scanned by the TiKV split checker exceeds 1 GB'
        },
        'TiDB.tikv.TiKV_async_request_write_duration_seconds': {
            'pql_max': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_storage_engine_async_request_duration_seconds_bucket'
                       '{type="write", instance=~".*"}[1m])) by (le, instance, type))[24h:1m])',
            'pql': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_storage_engine_async_request_duration_seconds_bucket'
                   '{type="write", instance=~".*"}[1m])) by (le, instance, type))[24h:1m]) > 1',
            'note': 'TiKV Raft write response time is too long'
        },
        'TiDB.tikv.TiKV_write_stall': {
            'pql_max': 'max_over_time(delta(tikv_engine_write_stall{instance=~".*"}[10m])[24h:10m])',
            'pql': 'max_over_time(delta('
                   'tikv_engine_write_stall{instance=~".*"}[10m])[24h:10m]) > 10',
            'note': 'TiKV has a write stall'
        },
        # TiKV 3
        'TiDB.tikv.TiKV_server_report_failure_msg_total': {
            'pql_max': 'max_over_time(sum(rate(tikv_server_report_failure_msg_total{type="unreachable"}[10m])) BY (instance)[24h:10m])',
            'pql': 'max_over_time(sum(rate(tikv_server_report_failure_msg_total{type="unreachable"}[10m])) BY (instance)[24h:10m]) > 10',
            'note': 'TiKV node reports too many failures'
        },
        'TiDB.tikv.TiKV_channel_full_total': {
            'pql_max': 'max_over_time(sum(rate(tikv_channel_full_total{instance=~".*"}[10m])) BY (type, instance)[24h:10m])',
            'pql': 'max_over_time(sum(rate(tikv_channel_full_total{instance=~".*"}[10m])) BY (type, instance)[24h:10m]) > 0',
            'note': 'TiKV channel is full and TiKV is too busy'
        },
        'TiDB.tikv.TiKV_raft_log_lag': {
            'pql_max': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_raftstore_log_lag_bucket{instance=~".*"}[1m])) by (le,instance))[24h:10m])',
            'pql': 'max_over_time(histogram_quantile(0.99, sum(rate(tikv_raftstore_log_lag_bucket{instance=~".*"}[1m])) by (le, '
                   'instance))[24h:10m]) > 5000',
            'note': 'TiKV raft log lag is too large'
        },
        'TiDB.tikv.TiKV_thread_unified_readpool_cpu_seconds': {
            'pql_max': 'max_over_time(avg(rate(tikv_thread_cpu_seconds_total{name=~"unified_read_po*", instance=~".*"}[1m])) by (instance)[24h:1m])',
            'pql': 'max_over_time(avg(rate(tikv_thread_cpu_seconds_total{name=~"unified_read_po*", instance=~".*"}[1m])) '
                   'by (instance)[24h:1m]) > 0.7',
            'note': 'unified read pool usage is above 70%'
        },
        'TiDB.tikv.TiKV_low_space': {
            'pql_max': 'sum(tikv_store_size_bytes{type="available"}) by (instance) / sum(tikv_store_size_bytes{type="capacity"}) by (instance)',
            'pql': 'sum(tikv_store_size_bytes{type="available"}) by (instance) / sum(tikv_store_size_bytes{type="capacity"}) by (instance) < 0.3',
            'note': 'TiKV available storage space is below the threshold'
        },
    }
    for key, value in tasks.items():
        for inner_key, inner_value in value.items():
            if isinstance(inner_value, str) and 'pql' in inner_key:
                value[inner_key] = inner_value.replace("24h:", f"{24 * days}h:").replace("[24h]", f"[{24 * days}h]")
    return tasks
def request_prome(prometheus_address, query):
    try:
        response = requests.get('http://%s/api/v1/query' % prometheus_address, params={'query': query})
        return response
    except requests.RequestException:
        return None
def has_response(prometheus_address, query):
    response = request_prome(prometheus_address, query)
    if not response:
        return False
    try:
        if response.json()["data"]['result']:
            return True
        else:
            return False
    except:
        return False
def check_prome_alive(prometheus_address):
    # dummy query is used to judge if prometheus is alive
    dummy_query = 'probe_success{}'
    return has_response(prometheus_address, dummy_query)
def find_alive_prome(prometheus_addresses):
    if check_prome_alive(prometheus_addresses):
        return prometheus_addresses
    return None
# ip:port -> ip_port
def decode_instance(instance):
    return instance.replace(':', '_')
def check_metric(alert_name, prometheus_address, pql, is_value, pql_max):
    record = []
    # Split "TiDB.<service>.<alert>" up front so the except branch can use it safely
    alert_name = alert_name.split('.')
    try:
        is_warning = "abnormal"
        response = request_prome(prometheus_address, pql)
        result = response.json()['data']['result']
        # An empty result means the alert condition was never met
        if len(result) == 0:
            is_warning = "normal"
            if pql_max == '':
                result = [{'metric': {}, 'value': [0, '0']}]
            else:
                # Fetch the observed maximum so the report still shows a value
                response = request_prome(prometheus_address, pql_max)
                result = response.json()['data']['result']
        for i in result:
            # Report per node when the series carries an instance label
            if 'instance' in i['metric']:
                instance = i['metric']['instance']
                node = decode_instance(instance)
            else:
                node = 'cluster'
            # Use the type label when present
            if 'type' in i['metric']:
                metric_type = i['metric']['type']
            else:
                metric_type = 'no type'
            value = i['value'][1]
            if value == 'NaN':
                value = 0
            else:
                value = round(float(value), 3)
            message = "%s,%s,%s,%s,%s,%s,%s,%s" % (
                datetime.datetime.now(), node, alert_name[1], alert_name[2], metric_type, is_warning, is_value, value)
            print(message)
            record.append(message)
    except Exception as e:
        print(alert_name[2] + "----An error occurred check_metric:", e)
    # Always return a list so the caller can iterate even after an error
    return record
def csv_report(record):
    data = pd.DataFrame([line.split(',') for line in record],
                        columns=['timestamp', 'ip_address', 'service', 'event_type', 'type', 'status', 'description',
                                 'value'])
    grouped = data.groupby("service")
    writer = pd.ExcelWriter("inspection_report.xlsx", engine="xlsxwriter")
    for name, group in grouped:
        group.to_excel(writer, sheet_name=name, index=False)
        worksheet = writer.sheets[name]
        for i, col in enumerate(group.columns):
            column_len = max(group[col].astype(str).str.len().max(), len(col)) + 2
            worksheet.set_column(i, i, column_len)
    # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
    writer.close()
def run_tasks(role_metrics, prometheus_address):
    record = []
    for alert in role_metrics:
        pql = role_metrics[alert]['pql']
        is_value = role_metrics[alert]['note']
        pql_max = role_metrics[alert]['pql_max']
        message = check_metric(alert, prometheus_address, pql, is_value, pql_max) or []
        for data in message:
            record.append(data)
    csv_report(record)
def run_script(prometheus_addresses):
    active_prometheus_address = find_alive_prome(prometheus_addresses)
    # check if all prometheus are down
    if not active_prometheus_address:
        sys.exit()
    tasks = get_tasks()
    run_tasks(tasks, active_prometheus_address)
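def get_user_input():
    global days
    # Keep asking until a positive integer is entered; get_tasks() needs days to be set
    while days is None:
        try:
            user_input = int(input("Enter the number of days to inspect: "))
            if user_input > 0:
                days = user_input
            else:
                print("Invalid input. Please enter a positive integer.")
        except ValueError:
            print("Invalid input. Please enter a valid number.")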
if __name__ == "__main__":
    # Ask for the number of days to inspect
    get_user_input()
    prometheus_ip = '10.3.65.136:9091'
    # prometheus_ip = None
    if prometheus_ip is None:
        cluster_name = get_cluster_name()
        cluster_info = display_cluster_info(cluster_name)
        id_role_dict = extract_id_role(cluster_info)
        print(id_role_dict)
        prometheus_ip = get_prometheus_ip(id_role_dict)
    print(prometheus_ip)
    run_script(prometheus_ip)
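
Regarding the note above that inspecting too many days at once can make Prometheus return an error: one possible workaround, sketched here as an assumption and not something the script does, is to run each 24h query once per day of the range and merge the per-day maxima. Each timestamp could be passed as the time parameter of /api/v1/query. The helper names and result values are hypothetical, and the merge only suits max_over_time-style items; counts such as changes() would need to be summed.

```python
def daily_eval_times(end_ts, days):
    """Evaluation timestamps so that a 24h query at each one covers `days` days."""
    return [end_ts - i * 86400 for i in range(days)]


def merge_max(per_chunk_results):
    """Combine per-chunk {instance: max_value} dicts into one overall maximum."""
    merged = {}
    for chunk in per_chunk_results:
        for instance, value in chunk.items():
            merged[instance] = max(value, merged.get(instance, value))
    return merged


# Hypothetical results of three 24h queries (newest first).
chunks = [{"tikv-1": 0.42, "tikv-2": 0.31},
          {"tikv-1": 0.77, "tikv-2": 0.29},
          {"tikv-1": 0.51}]
print(daily_eval_times(1700000000, 3))  # [1700000000, 1699913600, 1699827200]
print(merge_max(chunks))                # {'tikv-1': 0.77, 'tikv-2': 0.31}
```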

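Summary

Writing a complete inspection script is long-term work. Because time was limited, I have only written some of the inspection items based on Prometheus; interested readers are welcome to add more.
At present the inspection script bases its judgments entirely on Prometheus data, but in a real inspection a DBA also looks at data that Prometheus does not have, such as table health, slow SQL over a period of time, heat maps, and log information. These may gradually be added to the inspection script later.
The script is now open source on Gitee, and everyone is welcome to use it:

https://gitee.com/mystery-cyf/prometheus--for-inspection/tree/master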
