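Enterprise Practice | Distributed System Observability: Monitoring Application Business Metrics

About the authors:

Zhao Jun (趙君) | Cloud-native engineer in the Infrastructure Department of Nanjing Aifulu Automotive Technology Co., Ltd. (南京愛福路汽車科技有限公司). Previously worked on Java architecture and development. Currently leads the company's cloud-native adoption, including moving F6's infrastructure and core business applications fully to the cloud and re-architecting them as cloud-native.

Xu Hang (徐航) | Cloud-native engineer in the Infrastructure Department of Nanjing Aifulu Automotive Technology Co., Ltd. Previously responsible for database high availability, operations, and tuning. Currently focuses on rolling out DevOps for engineering productivity and on retrofitting business systems for cloud-native observability.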

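As distributed architectures have become the mainstream of system design, the term observability has come up more and more often.

After the 2017 Distributed Tracing Summit, Peter Bourgon wrote a summary article, "Metrics, Tracing, and Logging", that systematically lays out the definitions and characteristics of the three, as well as how they relate and differ. The article maps the observability problem onto how to handle three kinds of data: metrics, tracing, and logging.

Later, Cindy Sridharan's book "Distributed Systems Observability" went further and described metrics, tracing, and logs as the three pillars of observability.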


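In 2018, the concept of Observability first appeared in the CNCF Landscape, bringing the term from cybernetics (control theory) into IT. In control theory, observability is the degree to which a system's internal state can be inferred from its external outputs; the more observable a system is, the more controllable it is.

What problem does observability solve? Chapter 12 of the Google SRE Book gives a crisp answer: fast troubleshooting.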

There are many ways to simplify and speed troubleshooting. Perhaps the most fundamental are:
Building observability—with both white-box metrics and structured logs—into each component from the ground up
Designing systems with well-understood and observable interfaces between components.
Google SRE Book, Chapter 12

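In the cloud-native era, distributed systems keep getting more complex and change very frequently, and every change can introduce a new kind of failure. If an application goes live without effective monitoring, we may not even know when something breaks and only find out from user reports.

This article covers how to set up monitoring of application business metrics and how to deliver alerts precisely to the team that owns them. Metrics are key pieces of information that are sampled periodically as aggregatable numeric values and plotted as trend charts. They let us observe the state and trends of a system.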

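Choosing the technology stack

All of our applications are Spring Boot applications and use Spring Boot Actuator for health checks. Since Spring Boot 2.0, Actuator's metrics have been built on Micrometer, which offers stronger and more flexible monitoring. Micrometer can export to many monitoring systems, including Prometheus.

So we chose Micrometer to collect business metrics, Prometheus to store and query them, Grafana to display them, and Alibaba Cloud's alert center to deliver precise alerts.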

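Metric collection

An R&D organization as a whole should focus on the few core metrics that reflect the state of the business in real time. For example, Amazon and eBay track sales, while Google and Facebook track ad impressions and other real-time metrics directly tied to revenue.

Prometheus scrapes metrics in a text-based exposition format, on which the OpenMetrics standard is based. Below is a sample of metrics in this text format.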

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"}    3

# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693

A sample consists of a metric name (metric_name), a set of key/value labels (label_name=label_value), a numeric value (value), and an optional timestamp.

metric_name [
  "{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value [ timestamp ]
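
To make the grammar above concrete, here is a minimal sketch of a parser for a single sample line, using only the JDK. The class name SampleLine and its fields are hypothetical, the regular expressions cover only the simple cases that appear in the sample (including the \\, \" and \n escapes in label values), and this is not how Prometheus itself parses the format.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for one parsed sample line; not a Prometheus or Micrometer class. */
final class SampleLine {
    // metric_name [ "{" labels "}" ] value [ timestamp ]
    private static final Pattern LINE = Pattern.compile(
            "^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\\{(.*)\\})?\\s+(\\S+)(?:\\s+(-?\\d+))?$");
    // label_name="label_value"; the value may contain backslash escapes
    private static final Pattern LABEL = Pattern.compile(
            "([a-zA-Z_][a-zA-Z0-9_]*)=\"((?:[^\"\\\\]|\\\\.)*)\"");

    final String name;
    final Map<String, String> labels;
    final double value;
    final Long timestamp; // null when the line carries no timestamp

    private SampleLine(String name, Map<String, String> labels, double value, Long timestamp) {
        this.name = name;
        this.labels = labels;
        this.value = value;
        this.timestamp = timestamp;
    }

    static SampleLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a sample line: " + line);
        }
        Map<String, String> labels = new LinkedHashMap<>();
        if (m.group(2) != null) {
            Matcher lm = LABEL.matcher(m.group(2));
            while (lm.find()) {
                labels.put(lm.group(1), unescape(lm.group(2)));
            }
        }
        Long timestamp = m.group(4) == null ? null : Long.valueOf(m.group(4));
        return new SampleLine(m.group(1), labels, parseValue(m.group(3)), timestamp);
    }

    private static double parseValue(String text) {
        switch (text) {
            case "+Inf": return Double.POSITIVE_INFINITY;
            case "-Inf": return Double.NEGATIVE_INFINITY;
            default: return Double.parseDouble(text); // also accepts "NaN"
        }
    }

    private static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                char next = raw.charAt(++i);
                out.append(next == 'n' ? '\n' : next); // \n -> newline, \\ -> \, \" -> "
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        SampleLine s = parse("http_requests_total{method=\"post\",code=\"200\"} 1027");
        System.out.println(s.name + " " + s.labels + " " + s.value);
    }
}
```

Comment lines (# HELP, # TYPE) are deliberately out of scope here; a real scraper also has to handle them.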

Meter

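Micrometer provides several kinds of meters. A Meter is an interface for collecting a set of measurements about an application. The meter types in Micrometer are Timer, Counter, Gauge, DistributionSummary, LongTaskTimer, FunctionCounter, FunctionTimer, and TimeGauge.

Counter describes a monotonically increasing value, such as the number of calls to a method or the total number of cache hits or accesses. When used through the @Counted annotation, it supports recordFailuresOnly, which records only failed calls, and the resulting series carry four labels by default: class, method, exception, and result.
Timer records three values at once: the total count, the total time, and the maximum time. It carries one default label: exception.
Gauge describes a value that fluctuates within a range. It is typically used for measurements that go up and down, such as the number of messages in a queue or the number of tasks queued in a thread pool.
DistributionSummary tracks the distribution of a set of values.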
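
The differences between these meter types are easiest to see side by side. The following is a hand-rolled sketch using only the JDK; the class names MeterSketch, SketchCounter, SketchGauge, and SketchTimer are hypothetical, and this is not Micrometer's implementation, only an illustration of what each type tracks.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.DoubleSupplier;

final class MeterSketch {

    /** Counter: a value that only ever goes up. */
    static final class SketchCounter {
        private final LongAdder count = new LongAdder();
        void increment() { count.increment(); }
        long count() { return count.sum(); }
    }

    /** Gauge: stores nothing itself; it samples the current value whenever it is read. */
    static final class SketchGauge {
        private final DoubleSupplier source;
        SketchGauge(DoubleSupplier source) { this.source = source; }
        double value() { return source.getAsDouble(); }
    }

    /** Timer: tracks the count, total, and maximum of the recorded durations. */
    static final class SketchTimer {
        private final LongAdder count = new LongAdder();
        private final LongAdder totalNanos = new LongAdder();
        // Simplification: an all-time max. Micrometer's max decays over a time window.
        private final AtomicLong maxNanos = new AtomicLong();

        void record(long nanos) {
            count.increment();
            totalNanos.add(nanos);
            maxNanos.accumulateAndGet(nanos, Math::max);
        }
        long count() { return count.sum(); }
        long totalNanos() { return totalNanos.sum(); }
        long maxNanos() { return maxNanos.get(); }
    }

    public static void main(String[] args) {
        SketchCounter calls = new SketchCounter();
        Queue<String> queue = new ArrayDeque<>();
        SketchGauge queueSize = new SketchGauge(queue::size);
        SketchTimer timer = new SketchTimer();

        queue.add("job-1");
        long start = System.nanoTime();
        calls.increment();
        timer.record(System.nanoTime() - start);

        System.out.println("calls=" + calls.count()
                + " queueSize=" + queueSize.value()
                + " timerCount=" + timer.count());
    }
}
```

The design point to notice is that a counter and a timer accumulate state on every call, while a gauge is just a view over something else that is read at scrape time.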

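Application onboarding

To make onboarding easy for microservice applications, we wrapped everything in a micrometer-spring-boot-starter. It is implemented as follows.

Add the Spring Boot Actuator and Micrometer Prometheus registry dependencies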

<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
  <version>${micrometer.version}</version>
</dependency>

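Initial configuration

Actuator collects some metrics by default, such as system, jvm, and http; they can be turned off through configuration. We turn them off only because we already run the JMX exporter.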

management.metrics.enable.jvm=false
management.metrics.enable.process=false
management.metrics.enable.system=false

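If you do not want the Actuator management port to coincide with the application port, use management.server.port to set a separate one. This is good practice: attackers do probe Actuator endpoints, and moving them to a different port that is not exposed to the public internet avoids most of those problems.

management.server.port=xxxx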

Configure the Spring beans

Passing Tags.empty() to TimedAspect is deliberate: it keeps long class names out of the labels so they do not put pressure on Prometheus.

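import io.micrometer.core.aop.CountedAspect;
import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import io.prometheus.client.CollectorRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.PropertySource;
// PrometheusMetricScrapeEndpoint and PrometheusMetricScrapeMvcEndpoint are not imported here
// because the source does not show which package they live in.

@PropertySource(value = {"classpath:/micrometer.properties"})
@Configuration
public class MetricsConfig {

    // Enables @Timed on methods; Tags.empty() suppresses the join-point (class/method) tags.
    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry, (pjp) -> Tags.empty());
    }

    // Enables @Counted on methods.
    @Bean
    public CountedAspect countedAspect(MeterRegistry registry) {
        return new CountedAspect(registry);
    }

    // Exposes the collected metrics for Prometheus to scrape.
    @Bean
    public PrometheusMetricScrapeEndpoint prometheusMetricScrapeEndpoint(CollectorRegistry collectorRegistry) {
        return new PrometheusMetricScrapeEndpoint(collectorRegistry);
    }

    @Bean
    public PrometheusMetricScrapeMvcEndpoint prometheusMvcEndpoint(PrometheusMetricScrapeEndpoint delegate) {
        return new PrometheusMetricScrapeMvcEndpoint(delegate);
    }

}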

When onboarding an application, add the micrometer-spring-boot-starter dependency

<dependency>
  <groupId>xxx</groupId>
  <artifactId>micrometer-spring-boot-starter</artifactId>
</dependency>

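You can now visit http://ip:port/actuator/prome... to view the data recorded by Micrometer.

Custom business metrics

Micrometer ships with two annotations, @Counted and @Timed. Adding them to a method collects its call count, its duration, and whether it threw an exception.

@Timed

To record how many times the print method is called and how long it takes, add @Timed to the method and give the metric a name.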

@Timed(value = "biz.print", percentiles = {0.95, 0.99}, description = "metrics of print")
public String print(PrintData printData) {

}

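With @Timed on the print method, Micrometer records three metrics for it: the number of calls (count), the longest call (max), and the total time across calls (sum). percentiles = {0.95, 0.99} additionally computes the p95 and p99 latencies. The recorded data looks like this.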

biz_print_seconds_count{exception="none"} 4.0
biz_print_seconds_sum{exception="none"} 7.783213927
biz_print_seconds_max{exception="none"} 6.14639717
biz_print_seconds{exception="NullPointerException"} 0.318767104
biz_print_seconds{exception="none",quantile="0.95",} 0.58720256
biz_print_seconds{exception="none",quantile="0.99",} 6.157238272

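@Timed supports several attributes:

value: the metric name; we treat it as required
extraTags: labels to attach to the metric; multiple pairs are allowed, in the form {"key", "value", "key", "value"}
percentiles: values between 0 and 1 for which latency percentiles are computed, for example 0.95 and 0.99 for p95 and p99
histogram: whether to also publish the method latency as a histogram-type metric

@Timed also records exceptions thrown by the method, and each exception type is recorded as a separate series. The aspect catches the exception, notes its class name, and rethrows it unchanged: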

try {
    return pjp.proceed();
} catch (Exception ex) {
    exceptionClass = ex.getClass().getSimpleName();
    throw ex;
} finally {
    try {
        sample.stop(Timer.builder(metricName)
                    .description(timed.description().isEmpty() ? null : timed.description())
                    .tags(timed.extraTags())
                    .tags(EXCEPTION_TAG, exceptionClass)
                    .tags(tagsBasedOnJoinPoint.apply(pjp))
                    .publishPercentileHistogram(timed.histogram())
                    .publishPercentiles(timed.percentiles().length == 0 ? null : timed.percentiles())
                    .register(registry));
    } catch (Exception e) {
        // ignoring on purpose
    }
}

@Counted

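If you do not care how long a method takes, only how often it is called, or even only how often it fails, @Counted is the better choice. recordFailuresOnly = true records only the calls that threw an exception.

@Counted(value = "biz.print.failure", recordFailuresOnly = true, description = "metrics of print")
public String print(PrintData printData) {

}

The recorded data looks like this.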

biz_print_failure_total{class="com.xxx.print.service.impl.PrintServiceImpl",exception="NullPointerException",method="print",result="failure",} 4.0

A counter is a monotonically increasing value; it is incremented by 1 for each recorded method call.

private void record(ProceedingJoinPoint pjp, Counted counted, String exception, String result) {
    counter(pjp, counted)
            .tag(EXCEPTION_TAG, exception)
            .tag(RESULT_TAG, result)
            .register(meterRegistry)
            .increment();
}

private Counter.Builder counter(ProceedingJoinPoint pjp, Counted counted) {
    Counter.Builder builder = Counter.builder(counted.value()).tags(tagsBasedOnJoinPoint.apply(pjp));
    String description = counted.description();
    if (!description.isEmpty()) {
        builder.description(description);
    }
    return builder;
}
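
The effect of recordFailuresOnly can be reproduced without Spring or AOP. The following JDK-only sketch wraps a call, labels the outcome by exception and result, and skips successful calls when recordFailuresOnly is set. The class CountedSketch and its methods are hypothetical and only mirror the behaviour described above; it is not the code of Micrometer's CountedAspect.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

final class CountedSketch {
    // One counter per (exception, result) label combination.
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final boolean recordFailuresOnly;

    CountedSketch(boolean recordFailuresOnly) {
        this.recordFailuresOnly = recordFailuresOnly;
    }

    /** Runs the body and counts the call; failures are always counted and rethrown. */
    <T> T call(Supplier<T> body) {
        try {
            T result = body.get();
            if (!recordFailuresOnly) {
                increment("none", "success");
            }
            return result;
        } catch (RuntimeException ex) {
            increment(ex.getClass().getSimpleName(), "failure");
            throw ex;
        }
    }

    long count(String exception, String result) {
        LongAdder adder = counters.get(key(exception, result));
        return adder == null ? 0 : adder.sum();
    }

    private void increment(String exception, String result) {
        counters.computeIfAbsent(key(exception, result), k -> new LongAdder()).increment();
    }

    private static String key(String exception, String result) {
        return "exception=" + exception + ",result=" + result;
    }

    public static void main(String[] args) {
        CountedSketch failuresOnly = new CountedSketch(true);
        failuresOnly.call(() -> "printed");
        try {
            failuresOnly.call(() -> { throw new IllegalStateException("printer offline"); });
        } catch (IllegalStateException expected) {
            // the exception still reaches the caller, as with the real aspect
        }
        System.out.println("failures=" + failuresOnly.count("IllegalStateException", "failure"));
    }
}
```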

Gauge

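Gauge describes a value that fluctuates within a range. It is typically used for measurements that change over time, such as the workId of a Snowflake ID generator, the id of a print template, or the number of tasks queued in a thread pool.

Inject PrometheusMeterRegistry.
Build the Gauge, giving the metric a name and a value.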

@Autowired
private PrometheusMeterRegistry meterRegistry;

public void buildGauge(Long workId) {
    Gauge.builder("biz.alphard.snowFlakeIdGenerator.workId", workId, Long::longValue)
            .description("alphard snowFlakeIdGenerator workId")
            .tag("workId", workId.toString())
            .register(meterRegistry).measure();
}

The recorded data looks like this.

biz_alphard_snowFlakeIdGenerator_workId{workId="2"} 2

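Configuring SLA metrics

To record how timings are distributed across SLA boundaries, Spring Boot exposes a Micrometer setting for it. In Spring Boot 2.3 and later the property is named management.metrics.distribution.slo instead.

management.metrics.distribution.sla[biz.print]=300ms,400ms,500ms,1s,10s

The recorded data looks like this.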

biz_print_seconds_bucket{exception="none",le="0.3",} 1.0
biz_print_seconds_bucket{exception="none",le="0.4",} 3.0
biz_print_seconds_bucket{exception="none",le="0.5",} 10.0
biz_print_seconds_bucket{exception="none",le="0.6",} 11.0
biz_print_seconds_bucket{exception="none",le="1.0",} 11.0
biz_print_seconds_bucket{exception="none",le="10.0",} 12.0
biz_print_seconds_bucket{exception="none",le="+Inf",} 12.0
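
Cumulative buckets like these are what make quantile estimates possible at query time. As an illustration, the sketch below estimates a quantile by linear interpolation inside the bucket that contains the target rank, which is a simplified version of what PromQL's histogram_quantile does. The class BucketQuantile and the method estimateQuantile are hypothetical names, the lowest bucket is assumed to start at 0, and the sample arrays simply copy the values shown above.

```java
final class BucketQuantile {
    // Upper bounds (le) and cumulative counts copied from the sample output above.
    static final double[] SAMPLE_BOUNDS =
            {0.3, 0.4, 0.5, 0.6, 1.0, 10.0, Double.POSITIVE_INFINITY};
    static final double[] SAMPLE_COUNTS = {1, 3, 10, 11, 11, 12, 12};

    /**
     * Estimates the q-quantile (0 <= q <= 1) from cumulative histogram buckets.
     * The last bucket must be the +Inf bucket, whose count is the total.
     */
    static double estimateQuantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        if (q < 0 || q > 1) {
            throw new IllegalArgumentException("q must be in [0, 1]");
        }
        int last = upperBounds.length - 1;
        double total = cumulativeCounts[last];
        if (total == 0) {
            return Double.NaN; // no observations yet
        }
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < last && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == last) {
            // The rank falls into the +Inf bucket: report the highest finite bound.
            return last == 0 ? Double.NaN : upperBounds[last - 1];
        }

        double lowerBound = i == 0 ? 0.0 : upperBounds[i - 1];
        double lowerCount = i == 0 ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - lowerCount;
        if (inBucket == 0) {
            return upperBounds[i];
        }
        // Assume observations are spread evenly inside the bucket.
        return lowerBound + (upperBounds[i] - lowerBound) * (rank - lowerCount) / inBucket;
    }

    public static void main(String[] args) {
        System.out.println("p50 = " + estimateQuantile(0.50, SAMPLE_BOUNDS, SAMPLE_COUNTS));
        System.out.println("p95 = " + estimateQuantile(0.95, SAMPLE_BOUNDS, SAMPLE_COUNTS));
    }
}
```

For the sample data this gives a median of about 0.443 s and a p95 of about 4.6 s. The p95 lands in the wide 1 s to 10 s bucket, which shows why bucket boundaries should be chosen close to the latencies you actually care about.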

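Storage and querying

We use Prometheus to store and query metric data. Prometheus uses pull-based metrics collection: it actively pulls metrics from the target systems, whereas in a push model the target systems send their metrics to the monitoring system. The Prometheus documentation explains why it chose pull: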

Pulling over HTTP offers a number of advantages:
You can run your monitoring on your laptop when developing changes.
You can more easily tell if a target is down.
You can manually go to a target and inspect its health with a web browser.
Overall, we believe that pulling is slightly better than pushing, but it should not be considered a major point when considering a monitoring system.

Prometheus also supports push-based collection, through the Pushgateway.

For cases where you must push, we offer the Pushgateway.

For Prometheus to scrape an application's metrics, we need to do two things:

Expose the application's Actuator port through a Service and add the label monitor/metrics

apiVersion: v1
kind: Service
metadata:
  name: print-svc
  labels:
    monitor/metrics: ""
spec:
  ports:
  - name: custom-metrics
    port: xxxx
    targetPort: xxxx
    protocol: TCP
  type: ClusterIP
  selector:
    app: print-test

Add a ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: metrics
  labels:
    app: metric-monitor
spec:
  namespaceSelector:
    any: true
  endpoints:
  - interval: 15s
    port: custom-metrics
    path: "/manage/prometheusMetric"
  selector:
    matchLabels:
      monitor/metrics: ""

Prometheus periodically calls the Service's endpoints (http://podip:port/manage/prom...), pulls the application's metrics, and stores them in its time-series database.
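
The pull model itself is simple enough to demonstrate with the JDK alone. In the sketch below, a target serves its metrics as text over HTTP and a scraper fetches them with a plain GET, once per call. The class PullSketch, its method names, and the demo metric are hypothetical; the path reuses /manage/prometheusMetric from the ServiceMonitor above; and a real Prometheus server adds service discovery, scheduling, parsing, and storage on top of this.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class PullSketch {

    /** Starts a target on a free loopback port that serves the given metrics text. */
    static HttpServer startTarget(String path, String metricsText) {
        try {
            HttpServer server = HttpServer.create(
                    new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), 0);
            server.createContext(path, exchange -> {
                byte[] body = metricsText.getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders()
                        .add("Content-Type", "text/plain; version=0.0.4; charset=utf-8");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Performs one scrape: an HTTP GET that returns the target's metrics as text. */
    static String scrapeOnce(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer target = startTarget("/manage/prometheusMetric", "demo_requests_total 3\n");
        try {
            int port = target.getAddress().getPort();
            System.out.print(scrapeOnce("http://127.0.0.1:" + port + "/manage/prometheusMetric"));
        } finally {
            target.stop(0); // otherwise the server thread keeps the JVM alive
        }
    }
}
```

Because the target only has to answer a GET, you can inspect exactly what Prometheus sees with a browser or curl, which is one of the advantages the Prometheus documentation lists for pulling.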

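The metrics Prometheus scrapes are plain text, and although Prometheus has its own Graph view, it is not very polished and its features are limited. We still need a visualization tool that shows the data on standard, easy-to-use dashboards so we can see how the system is running right now. The common choice is Grafana. Prometheus has a built-in time-series database and the PromQL query language, which supports rich querying, aggregation, and logical operations over time series. With a Prometheus data source and PromQL queries configured in Grafana, Grafana queries Prometheus and renders the results as charts.

Configure the Prometheus data source in Grafana

Add a panel and configure its data source, query, and chart style

A single dashboard can hold multiple panels, which together form a monitoring overview.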

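Precise alerting

No system is perfect. When anomalies and failures occur, it is essential to detect the problem immediately and locate its cause quickly. Data collection alone cannot achieve either; it takes a solid monitoring and alerting setup that reacts fast and raises alerts.

Our first approach was to create alerting rules with the Prometheus Operator's PrometheusRule. The Prometheus servers sent alerts to Alertmanager, which forwarded them to a DingTalk group robot. After running this for a while, we found problems. The SRE team and the R&D team leads received too many alerts. Everything went to a single group, so opening the chat showed a screen full of alert titles, severities, and values. System alerts for operations were mixed with application alerts for developers; with that much noise it was hard to pick out high-priority alerts quickly and hard to reassign them to the right owner. So we wanted application alerts to go straight to the R&D team that owns the application.

After some research, we chose Alibaba Cloud's ARMS alert operations center (ARMS 告警運維中心) to manage alerts. It accepts Prometheus as a data source and supports adding DingTalk group robots as contacts.

Collect the webhook address of each R&D team's DingTalk group robot and create the robots as contacts.

Configure a notification policy for each R&D team. The policy filters alerts on the team field and is bound to that team's DingTalk group robot contact.

This way, application alerts go directly to the owning R&D team, which saves the time spent filtering and reassigning them and makes alert handling more efficient.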
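
The routing idea, matching an alert's team field to that team's robot, can be sketched in a few lines. This is an illustration only and not the ARMS API: the class TeamAlertRouter, the team names, and the webhook URLs are all made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class TeamAlertRouter {
    // One notification policy per team: team label value -> robot webhook.
    private final Map<String, String> webhookByTeam = new LinkedHashMap<>();

    void addPolicy(String team, String webhookUrl) {
        webhookByTeam.put(team, webhookUrl);
    }

    /** Returns the webhook for the alert's team label, or empty if no policy matches. */
    Optional<String> route(Map<String, String> alertLabels) {
        return Optional.ofNullable(alertLabels.get("team")).map(webhookByTeam::get);
    }

    public static void main(String[] args) {
        TeamAlertRouter router = new TeamAlertRouter();
        router.addPolicy("print", "https://example.invalid/robot/print-team");
        router.addPolicy("order", "https://example.invalid/robot/order-team");

        Map<String, String> alert = new LinkedHashMap<>();
        alert.put("alertname", "PrintFailureRateHigh");
        alert.put("team", "print");

        System.out.println(router.route(alert).orElse("no policy matched"));
    }
}
```

For this to work, every alert has to carry a team label in the first place, so the label needs to be set where the alert rule or the metric is defined.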

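[Figure: the resulting alert notification]

The ARMS alert operations center also accepts Grafana, Zabbix, ARMS, and other data sources. It supports alert assignment and claiming, alert aggregation and deduplication, and escalation: alerts left unhandled for a long time trigger repeated reminders or are escalated to managers, so that alerts get resolved in time.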
