Prometheus + VictoriaMetrics + Consul + Grafana Installation and Deployment

Published by 授客 on 2025-01-20

Test Environment

prometheus-2.54.1.linux-amd64.tar.gz

Download:

https://www.prometheus.io/download/

https://github.com/prometheus/prometheus/releases/download/v2.54.1/prometheus-2.54.1.linux-amd64.tar.gz

node_exporter-1.8.2.linux-amd64.tar.gz

Download:

https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz

https://prometheus.io/download/#node_exporter

consul_exporter-0.12.1.linux-amd64.tar.gz

Download:

https://github.com/prometheus/consul_exporter/releases/download/v0.12.1/consul_exporter-0.12.1.linux-amd64.tar.gz

pushgateway-1.9.0.linux-amd64.tar.gz

Download:

https://www.prometheus.io/download/#pushgateway

https://github.com/prometheus/pushgateway/releases/download/v1.9.0/pushgateway-1.9.0.linux-amd64.tar.gz

victoria-metrics-linux-amd64-v1.103.0.tar.gz

Download:

https://github.com/VictoriaMetrics/VictoriaMetrics/releases

https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.103.0

https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.103.0/victoria-metrics-linux-amd64-v1.103.0.tar.gz

consul_1.19.2_linux_amd64.zip

https://releases.hashicorp.com/consul/1.19.2/consul_1.19.2_linux_amd64.zip

grafana-7.5.6-1.x86_64.rpm

Download: https://dl.grafana.com/oss/release/grafana-7.5.6-1.x86_64.rpm

CentOS 7.9

Note: prometheus, victoria-metrics, grafana, and pushgateway can each be installed on a different machine. This article is only a learning exercise, so everything is installed on a single machine.

Walkthrough

Installing VictoriaMetrics

# wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.103.0/victoria-metrics-linux-amd64-v1.103.0.tar.gz
# tar -xvzf victoria-metrics-linux-amd64-v1.103.0.tar.gz -C /usr/local/bin # extraction produces a single binary named victoria-metrics-prod
# Create a directory to store VictoriaMetrics data
# mkdir -p /usr/data/victoria-metrics
# Create a systemd service
# vi /etc/systemd/system/victoriametrics.service
[Unit]
Description=Victoria metrics service
After=network.target

[Service]
Type=simple
TimeoutStartSec=30
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/victoria-metrics-prod -storageDataPath=/usr/data/victoria-metrics -retentionPeriod=30d -selfScrapeInterval=10s
ExecStop=/bin/kill $MAINPID
ExecReload=/bin/kill -HUP $MAINPID

PrivateTmp=yes

[Install]
WantedBy=multi-user.target

Notes:

-storageDataPath sets the data directory path (the directory is created automatically at startup if it does not exist). VictoriaMetrics stores all its data in this directory. The default is the victoria-metrics-data directory under the current working directory.

-retentionPeriod sets how long stored data is retained; older data is deleted automatically. The default retention is 1 month (31 days). The minimum value is 24h or 1d. For example, -retentionPeriod=3 keeps data for only 3 months, and -retentionPeriod=1d keeps data for only 1 day.

In most cases only these two flags need to be set; the other flags have good enough defaults and should only be changed when truly necessary. Run ./victoria-metrics-prod --help to see the descriptions and default values of all available flags.

By default, VictoriaMetrics listens on port 8428 for Prometheus query API requests.

It is recommended to set up monitoring for VictoriaMetrics itself.
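Once the instance is running, a quick sanity check is possible over HTTP: the /health endpoint answers when the instance is ready, and the Prometheus-compatible query API is served on the same port. A minimal sketch, assuming the default listen address localhost:8428:

```shell
# Sanity checks against a locally running VictoriaMetrics instance
# (assumes the default listen address localhost:8428; adjust if changed).
VM_ADDR="http://localhost:8428"

# The health endpoint returns "OK" when the instance is ready.
curl -s "$VM_ADDR/health" || true

# VictoriaMetrics also serves the Prometheus-compatible query API on this port:
curl -s "$VM_ADDR/api/v1/query" --data-urlencode 'query=up' || true
```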

Start in the foreground to check:

# /usr/local/bin/victoria-metrics-prod -storageDataPath=/usr/data/victoria-metrics -retentionPeriod=30d -selfScrapeInterval=10s
2024-09-03T16:33:42.187Z	info	VictoriaMetrics/lib/logger/flag.go:12	build version: victoria-metrics-20240828-135248-tags-v1.103.0-0-g5aeb759df9
2024-09-03T16:33:42.187Z	info	VictoriaMetrics/lib/logger/flag.go:13	command-line flags
2024-09-03T16:33:42.187Z	info	VictoriaMetrics/lib/logger/flag.go:20	  -retentionPeriod="30d"
2024-09-03T16:33:42.187Z	info	VictoriaMetrics/lib/logger/flag.go:20	  -selfScrapeInterval="10s"
2024-09-03T16:33:42.187Z	info	VictoriaMetrics/lib/logger/flag.go:20	  -storageDataPath="/usr/data/victoria-metrics"
2024-09-03T16:33:42.187Z	info	VictoriaMetrics/app/victoria-metrics/main.go:73	starting VictoriaMetrics at "[:8428]"...
2024-09-03T16:33:42.187Z	info	VictoriaMetrics/app/vmstorage/main.go:107	opening storage at "/usr/data/victoria-metrics" with -retentionPeriod=30d
2024-09-03T16:33:42.189Z	info	VictoriaMetrics/lib/memory/memory.go:42	limiting caches to 611758080 bytes, leaving 407838720 bytes to the OS according to -memory.allowedPercent=60
2024-09-03T16:33:42.205Z	info	VictoriaMetrics/app/vmstorage/main.go:121	successfully opened storage "/usr/data/victoria-metrics" in 0.018 seconds; partsCount: 0; blocksCount: 0; rowsCount: 0; sizeBytes: 0
2024-09-03T16:33:42.205Z	info	VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:127	loading rollupResult cache from "/usr/data/victoria-metrics/cache/rollupResult"...
2024-09-03T16:33:42.207Z	info	VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:156	loaded rollupResult cache from "/usr/data/victoria-metrics/cache/rollupResult" in 0.001 seconds; entriesCount: 0, sizeBytes: 0
2024-09-03T16:33:42.207Z	info	VictoriaMetrics/app/victoria-metrics/main.go:84	started VictoriaMetrics in 0.020 seconds
2024-09-03T16:33:42.207Z	info	VictoriaMetrics/lib/httpserver/httpserver.go:121	starting server at http://127.0.0.1:8428/
2024-09-03T16:33:42.207Z	info	VictoriaMetrics/lib/httpserver/httpserver.go:122	pprof handlers are exposed at http://127.0.0.1:8428/debug/pprof/
2024-09-03T16:33:42.208Z	info	VictoriaMetrics/app/victoria-metrics/self_scraper.go:46	started self-scraping `/metrics` page with interval 10.000 seconds
2024-09-03T16:33:52.293Z	info	VictoriaMetrics/lib/storage/partition.go:202	creating a partition "2024_09" with smallPartsPath="/usr/data/victoria-metrics/data/small/2024_09", bigPartsPath="/usr/data/victoria-metrics/data/big/2024_09"
2024-09-03T16:33:52.295Z	info	VictoriaMetrics/lib/storage/partition.go:211	partition "2024_09" has been created

Stop the foreground process, then start the service and enable it at boot:

# systemctl daemon-reload && sudo systemctl enable --now victoriametrics.service
# Check that the service started successfully
# systemctl status victoriametrics.service

Open the firewall port

# firewall-cmd --permanent --zone=public --add-port=8428/tcp
success
# firewall-cmd --reload
success

Besides running the binary directly, VictoriaMetrics can also be installed via Docker; see https://hub.docker.com/r/victoriametrics/victoria-metrics/ for details.

References

https://docs.victoriametrics.com/quick-start/

Installing and Configuring Prometheus

# wget https://github.com/prometheus/prometheus/releases/download/v2.54.1/prometheus-2.54.1.linux-amd64.tar.gz
# tar -C /usr/local/ -xvzf prometheus-2.54.1.linux-amd64.tar.gz
# cd /usr/local/prometheus-2.54.1.linux-amd64
# ls
console_libraries  consoles  LICENSE  NOTICE  prometheus  prometheus.yml  promtool
# ln -s /usr/local/prometheus-2.54.1.linux-amd64/prometheus /usr/local/bin/prometheus
# cp prometheus.yml prometheus.yml.bak
# echo ''> prometheus.yml 
# vi prometheus.yml

Replace the contents of prometheus.yml with the following:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

remote_write:
  - url: http://192.168.88.132:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000
      capacity: 20000
      max_shards: 30

Notes:

To send data to VictoriaMetrics, add a remote_write section to the Prometheus configuration file (usually prometheus.yml):

remote_write:
  - url: http://<victoriametrics-addr>:8428/api/v1/write

Note: replace <victoriametrics-addr> with the hostname or IP address of the VictoriaMetrics host, for example:

remote_write:
  - url: http://192.168.88.132:8428/api/v1/write

Start the Prometheus server

# ./prometheus 
ts=2024-09-04T15:50:32.906Z caller=main.go:601 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2024-09-04T15:50:32.906Z caller=main.go:645 level=info msg="Starting Prometheus Server" mode=server version="(version=2.54.1, branch=HEAD, revision=e6cfa720fbe6280153fab13090a483dbd40bece3)"
ts=2024-09-04T15:50:32.906Z caller=main.go:650 level=info build_context="(go=go1.22.6, platform=linux/amd64, user=root@812ffd741951, date=20240827-10:56:41, tags=netgo,builtinassets,stringlabels)"
ts=2024-09-04T15:50:32.906Z caller=main.go:651 level=info host_details="(Linux 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 localhost.localdomain (none))"
ts=2024-09-04T15:50:32.906Z caller=main.go:652 level=info fd_limits="(soft=4096, hard=4096)"
ts=2024-09-04T15:50:32.906Z caller=main.go:653 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2024-09-04T15:50:32.917Z caller=web.go:571 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2024-09-04T15:50:32.925Z caller=main.go:1160 level=info msg="Starting TSDB ..."
ts=2024-09-04T15:50:32.930Z caller=tls_config.go:313 level=info component=web msg="Listening on" address=[::]:9090
ts=2024-09-04T15:50:32.930Z caller=tls_config.go:316 level=info component=web msg="TLS is disabled." http2=false address=[::]:9090
ts=2024-09-04T15:50:32.932Z caller=head.go:626 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2024-09-04T15:50:32.932Z caller=head.go:713 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=5.601µs
ts=2024-09-04T15:50:32.933Z caller=head.go:721 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2024-09-04T15:50:32.933Z caller=head.go:793 level=info component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
ts=2024-09-04T15:50:32.933Z caller=head.go:830 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=25.237µs wal_replay_duration=560.14µs wbl_replay_duration=141ns chunk_snapshot_load_duration=0s mmap_chunk_replay_duration=5.601µs total_replay_duration=605.091µs
ts=2024-09-04T15:50:32.938Z caller=main.go:1181 level=info fs_type=XFS_SUPER_MAGIC
ts=2024-09-04T15:50:32.938Z caller=main.go:1184 level=info msg="TSDB started"
ts=2024-09-04T15:50:32.938Z caller=main.go:1367 level=info msg="Loading configuration file" filename=prometheus.yml
ts=2024-09-04T15:50:32.940Z caller=dedupe.go:112 component=remote level=info remote_name=b93975 url=http://192.168.88.132:8428/api/v1/write msg="Starting WAL watcher" queue=b93975
ts=2024-09-04T15:50:32.940Z caller=dedupe.go:112 component=remote level=info remote_name=b93975 url=http://192.168.88.132:8428/api/v1/write msg="Starting scraped metadata watcher"
ts=2024-09-04T15:50:32.945Z caller=main.go:1404 level=info msg="updated GOGC" old=100 new=75
ts=2024-09-04T15:50:32.945Z caller=main.go:1415 level=info msg="Completed loading of configuration file" filename=prometheus.yml totalDuration=6.619214ms db_storage=25.792µs remote_storage=1.190631ms web_handler=652ns query_engine=18.267µs scrape=3.897727ms scrape_sd=49.586µs notify=1.164µs notify_sd=954ns rules=60.122µs tracing=62.555µs
ts=2024-09-04T15:50:32.945Z caller=main.go:1145 level=info msg="Server is ready to receive web requests."
ts=2024-09-04T15:50:32.945Z caller=manager.go:164 level=info component="rule manager" msg="Starting rule manager..."
ts=2024-09-04T15:50:32.945Z caller=dedupe.go:112 component=remote level=info remote_name=b93975 url=http://192.168.88.132:8428/api/v1/write msg="Replaying WAL" queue=b93975
ts=2024-09-04T15:50:40.288Z caller=dedupe.go:112 component=remote level=info remote_name=b93975 url=http://192.168.88.132:8428/api/v1/write msg="Done replaying WAL" duration=7.342804783s

Note: to use a non-default configuration file, pass it explicitly when starting, for example:

# ./prometheus --config.file=./custom_prometheus.yml

Note: the configuration can be reloaded with:

kill -HUP `pid_of_prometheus`

Alternatively, press Ctrl+C to stop the Prometheus process and start it again.

Prometheus writes incoming data to local storage and simultaneously replicates it to remote storage. This means the locally stored data remains available for the retention period set by --storage.tsdb.retention.time even if the remote storage is unavailable.

To send data to VictoriaMetrics from multiple Prometheus instances, add an external_labels section under the global node of each Prometheus configuration file, like this:

global:
  external_labels:
    datacenter: dc-123

This configuration tells Prometheus to add the label datacenter=dc-123 to every time series sent to remote storage. The label name can be anything (datacenter is just an example), but its value must be unique across all Prometheus instances, so that time series can be filtered and grouped by it.
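With the external label in place, data from one specific Prometheus instance can be selected on the VictoriaMetrics side by filtering on that label. A small sketch, using this article's VictoriaMetrics address and the example label value above:

```shell
# Query VictoriaMetrics for series coming from one specific Prometheus
# instance, filtered by the external label configured above.
VM_ADDR="http://192.168.88.132:8428"

curl -s "$VM_ADDR/api/v1/query" \
  --data-urlencode 'query=up{datacenter="dc-123"}' || true
```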

For highly loaded Prometheus instances (200k+ samples per second), the following tuning can be applied:

remote_write:
  - url: http://<victoriametrics-addr>:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000
      capacity: 20000
      max_shards: 30

Using remote write increases Prometheus memory usage by up to ~25%, depending on the shape of the data. If you run into excessive memory consumption, try lowering the max_samples_per_send and capacity values (note that these two parameters are tightly related). See the remote write tuning documentation for more details.

It is recommended to upgrade Prometheus to v2.12.0 or newer, since earlier versions have known remote_write issues.

Also consider vmagent and vmalert, which are an alternative for reducing Prometheus resource consumption while being faster and lighter.

Reference: https://docs.victoriametrics.com/#prometheus-setup

Create a systemd service

# vi /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus service
After=network.target

[Service]
Type=simple
TimeoutStartSec=30
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/prometheus-2.54.1.linux-amd64/prometheus --config.file=/usr/local/prometheus-2.54.1.linux-amd64/prometheus.yml
ExecStop=/bin/kill $MAINPID
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Note: the configuration file must be given as an absolute path; otherwise the service fails with a file-not-found error for /prometheus.yml.

# First manually stop the prometheus instance running in the foreground above, then run:
# systemctl daemon-reload && systemctl enable --now prometheus
# systemctl status prometheus

Installing and Configuring Grafana

# yum install grafana-7.5.6-1.x86_64.rpm

Note: if yum cannot find the package with the command above, download the rpm and install it directly:

# wget https://dl.grafana.com/oss/release/grafana-7.5.6-1.x86_64.rpm
# yum install -y fontconfig urw-fonts
# rpm -ivh grafana-7.5.6-1.x86_64.rpm
warning: grafana-7.5.6-1.x86_64.rpm: Header V4 RSA/SHA256 Signature, key ID 24098cb6: NOKEY
Preparing...                          ################################# [100%]
Updating / installing...
   1:grafana-7.5.6-1                  ################################# [100%]
### NOT starting on installation, please execute the following statements to configure grafana to start automatically using systemd
 sudo /bin/systemctl daemon-reload
 sudo /bin/systemctl enable grafana-server.service
### You can start grafana-server by executing
 sudo /bin/systemctl start grafana-server.service
POSTTRANS: Running script
# /bin/systemctl daemon-reload
# /bin/systemctl enable grafana-server.service
Created symlink from /etc/systemd/system/multi-user.target.wants/grafana-server.service to /usr/lib/systemd/system/grafana-server.service.
# /bin/systemctl start grafana-server.service

Note: without running yum install -y fontconfig urw-fonts first, installing grafana may fail with the following error:

warning: grafana-7.5.6-1.x86_64.rpm: Header V4 RSA/SHA256 Signature, key ID 24098cb6: NOKEY
error: Failed dependencies:
        fontconfig is needed by grafana-7.5.6-1.x86_64
        urw-fonts is needed by grafana-7.5.6-1.x86_64

Modify the Grafana configuration [optional]

# vim /etc/grafana/grafana.ini

Open http://<grafana-addr>:3000 in a browser to verify:

Note: the default Grafana login is admin/admin.

Reference: https://grafana.com/grafana/download?pg=get&plcmt=selfmanaged-box1-cta1

Open the firewall port

# firewall-cmd --permanent --zone=public --add-port=3000/tcp
success
# firewall-cmd --reload
success

Create a Prometheus data source

Create the Prometheus data source with the following URL (change only the URL when creating the data source; leave everything else at the defaults):

http://<victoriametrics-addr>:8428

As before, replace <victoriametrics-addr> with the VictoriaMetrics hostname or IP address, e.g. http://192.168.88.132:8428. Then use PromQL or MetricsQL to build charts with the newly created data source.
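Alternatively, the data source can be provisioned from a file instead of the UI: Grafana reads YAML files under /etc/grafana/provisioning/datasources/ at startup. A minimal sketch (the file name and data source name below are arbitrary; the URL is this article's VictoriaMetrics address):

```yaml
# /etc/grafana/provisioning/datasources/victoriametrics.yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics        # display name, arbitrary
    type: prometheus             # VictoriaMetrics speaks the Prometheus API
    access: proxy                # corresponds to the "Server" access mode
    url: http://192.168.88.132:8428
    isDefault: true
```

After adding the file, restart grafana-server for it to take effect.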

About Access modes

The access mode controls how requests to the data source are handled. Unless there is a reason to do otherwise, Server (default) should be the preferred mode.

  • Server access mode (default)
    All requests from the browser are sent to the Grafana backend/server, which forwards them to the data source, thereby avoiding possible Cross-Origin Resource Sharing (CORS) restrictions. With this mode, the URL must be reachable from the Grafana backend/server.
  • Browser access mode
    All requests from the browser are sent directly to the data source and may be subject to CORS restrictions. With this mode, the URL must be reachable from the browser.

Reference: https://docs.victoriametrics.com/#grafana-setup

Installing pushgateway

# wget https://github.com/prometheus/pushgateway/releases/download/v1.9.0/pushgateway-1.9.0.linux-amd64.tar.gz
# tar -C /usr/local/ -xvzf pushgateway-1.9.0.linux-amd64.tar.gz
# ln -s /usr/local/pushgateway-1.9.0.linux-amd64/pushgateway /usr/local/bin/pushgateway
[root@localhost ~]# pushgateway
ts=2024-09-04T17:38:16.325Z caller=main.go:87 level=info msg="starting pushgateway" version="(version=1.9.0, branch=HEAD, revision=d1ca1a6a426126a09a21f745e8ffbaba550f9643)"
ts=2024-09-04T17:38:16.325Z caller=main.go:88 level=info build_context="(go=go1.22.4, platform=linux/amd64, user=root@2167597b1e9c, date=20240608-15:04:08, tags=unknown)"
ts=2024-09-04T17:38:16.328Z caller=tls_config.go:313 level=info msg="Listening on" address=[::]:9091
ts=2024-09-04T17:38:16.328Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=[::]:9091

Create a systemd service

# vi /etc/systemd/system/pushgateway.service
[Unit]
Description=Pushgateway service
After=network.target

[Service]
Type=simple
TimeoutStartSec=30
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/pushgateway-1.9.0.linux-amd64/pushgateway
ExecStop=/bin/kill $MAINPID
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

First manually stop the pushgateway instance running in the foreground above, then run:

# systemctl daemon-reload && systemctl enable --now pushgateway
# systemctl status pushgateway
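With the gateway running, a metric can be pushed to it with a plain HTTP request; Prometheus then scrapes the gateway. A minimal sketch, assuming the default listen address localhost:9091 (the job, instance, and metric names below are made up for illustration):

```shell
# Push one gauge sample to the Pushgateway under job "demo_job".
PUSHGATEWAY="http://localhost:9091"

cat <<'EOF' | curl -s --data-binary @- "$PUSHGATEWAY/metrics/job/demo_job/instance/demo1" || true
# TYPE demo_backup_duration_seconds gauge
demo_backup_duration_seconds 42.5
EOF

# The pushed series should now appear on the gateway's own /metrics page:
curl -s "$PUSHGATEWAY/metrics" | grep demo_backup_duration_seconds || true
```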

Installing and Configuring Node Exporter

Note: install it only on the machines that need to be monitored (in this example, a redis server with IP 192.168.88.131).

# wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
# tar -C /usr/local/ -xvzf node_exporter-1.8.2.linux-amd64.tar.gz
# ln -s /usr/local/node_exporter-1.8.2.linux-amd64/node_exporter  /usr/local/bin/node_exporter
# node_exporter 
ts=2024-09-04T16:01:28.241Z caller=node_exporter.go:193 level=info msg="Starting node_exporter" version="(version=1.8.2, branch=HEAD, revision=f1e0e8360aa60b6cb5e5cc1560bed348fc2c1895)"
ts=2024-09-04T16:01:28.241Z caller=node_exporter.go:194 level=info msg="Build context" build_context="(go=go1.22.5, platform=linux/amd64, user=root@03d440803209, date=20240714-11:53:45, tags=unknown)"
ts=2024-09-04T16:01:28.242Z caller=node_exporter.go:196 level=warn msg="Node Exporter is running as root user. This exporter is designed to run as unprivileged user, root is not required."
ts=2024-09-04T16:01:28.242Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)
ts=2024-09-04T16:01:28.242Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
ts=2024-09-04T16:01:28.242Z caller=diskstats_common.go:111 level=info collector=diskstats msg="Parsed flag --collector.diskstats.device-exclude" flag=^(z?ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:111 level=info msg="Enabled collectors"
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=arp
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=bcache
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=bonding
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=btrfs
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=conntrack
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=cpu
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=cpufreq
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=diskstats
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=dmi
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=edac
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=entropy
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=fibrechannel
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=filefd
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=filesystem
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=hwmon
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=infiniband
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=ipvs
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=loadavg
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=mdadm
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=meminfo
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=netclass
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=netdev
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=netstat
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=nfs
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=nfsd
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=nvme
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=os
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=powersupplyclass
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=pressure
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=rapl
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=schedstat
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=selinux
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=sockstat
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=softnet
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=stat
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=tapestats
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=textfile
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=thermal_zone
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=time
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=timex
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=udp_queues
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=uname
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=vmstat
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=watchdog
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=xfs
ts=2024-09-04T16:01:28.243Z caller=node_exporter.go:118 level=info collector=zfs
ts=2024-09-04T16:01:28.244Z caller=tls_config.go:313 level=info msg="Listening on" address=[::]:9100
ts=2024-09-04T16:01:28.244Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=[::]:9100

The output shows that Node Exporter is running and exposing metrics on port 9100.

Node Exporter metrics

Confirm the metrics are exposed by requesting the /metrics endpoint:

# curl http://localhost:9100/metrics

Output similar to the following indicates the metrics are exposed successfully.

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
...

Node Exporter now exposes metrics that Prometheus can scrape, including a wide range of system metrics further down in the output (prefixed with node_). To view them:

# curl http://localhost:9100/metrics | grep "node_"

Reference: https://prometheus.io/docs/guides/node-exporter/#monitoring-linux-host-metrics-with-the-node-exporter

Create a systemd service

# vi /etc/systemd/system/node_exporter.service
[Unit]
Description=node exporter service
After=network.target

[Service]
Type=simple
TimeoutStartSec=30
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/node_exporter
ExecStop=/bin/kill $MAINPID
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Installing Consul

Official installation method:

# yum install -y yum-utils
# yum-config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
# yum -y install consul
# consul -v
Consul v1.9.5

Note: this installation method did not succeed in the test environment.

Working installation method:

# mkdir /usr/local/consul
# mv consul_1.19.2_linux_amd64.zip /usr/local/consul/
# cd /usr/local/consul/
# unzip consul_1.19.2_linux_amd64.zip 
# ./consul -v
Consul v1.19.2
Revision 048f1936
Build Date 2024-08-27T16:06:44Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
# ln -s /usr/local/consul/consul /usr/bin/consul

Start the agents

Server node 1 (192.168.88.133):

# mkdir -p /data/consul
# consul agent -server -bootstrap -datacenter=testDC -ui=true -data-dir=/data/consul -node=server1 -bind=192.168.88.133 -client=0.0.0.0 -serf-lan-port=8303 -serf-wan-port=8305 -dns-port=8601 -http-port=8603 -syslog -log-level=INFO

Server node 2 (192.168.88.134):

# mkdir -p /data/consul
# consul agent -server=true -datacenter=testDC -data-dir=/data/consul --node=server2 -bind=192.168.88.134 -client=0.0.0.0 -retry-join=192.168.88.133 -serf-lan-port=8303 -serf-wan-port=8305 -dns-port=8601 -http-port=8603 -syslog -log-level=INFO

Note: after startup the node joins the cluster (192.168.88.134:8301).
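Cluster membership can then be checked from either node. A small sketch; note that the agents above use a non-default HTTP port (8603), so it has to be passed explicitly:

```shell
# Consul agent HTTP endpoint from the commands above (non-default port 8603).
CONSUL_HTTP="http://192.168.88.133:8603"

# Both server nodes should be listed as "alive":
consul members -http-addr="$CONSUL_HTTP" || true

# The current Raft leader can also be checked via the HTTP API:
curl -s "$CONSUL_HTTP/v1/status/leader" || true
```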

Verify in a browser

Create systemd services

Server node 1 (192.168.88.133):

# vi /etc/systemd/system/consul.service
[Unit]
Description=Consul service
After=network.target

[Service]
Type=simple
TimeoutStartSec=30
Restart=on-failure
RestartSec=5s
ExecStart=/usr/bin/consul agent -server -bootstrap -datacenter=testDC -ui=true -data-dir=/data/consul -node=server1 -bind=192.168.88.133 -client=0.0.0.0 -serf-lan-port=8303 -serf-wan-port=8305 -dns-port=8601 -http-port=8603 -syslog -log-level=INFO
ExecStop=/bin/kill $MAINPID
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Server node 2 (192.168.88.134):

# vi /etc/systemd/system/consul.service
[Unit]
Description=Consul service
After=network.target

[Service]
Type=simple
TimeoutStartSec=30
Restart=on-failure
RestartSec=5s
ExecStart=/usr/bin/consul agent -server=true -datacenter=testDC -data-dir=/data/consul --node=server2 -bind=192.168.88.134 -client=0.0.0.0 -retry-join=192.168.88.133 -serf-lan-port=8303 -serf-wan-port=8305 -dns-port=8601 -http-port=8603 -syslog -log-level=INFO 
ExecStop=/bin/kill $MAINPID
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

First manually stop the consul agents running in the foreground above, then run the following on both machines:

# systemctl daemon-reload && systemctl enable --now consul
# systemctl status consul

References:

https://developer.hashicorp.com/consul/install

https://developer.hashicorp.com/consul/docs/agent

Register services

# curl -X PUT -d '                                      
{                                                    
    "id": "redis-node-exporter",                                   
    "name": "redis-node-exporter",   
    "Tags": ["primary"],
    "address": "192.168.88.131",                      
    "port": 9100,                       
    "EnableTagOverride": false
}' http://192.168.88.133:8603/v1/agent/service/register

# curl -X PUT -d '                                      
{                                                    
    "id": "pushgateway-node-exporter",                                   
    "name": "pushgateway-node-exporter",   
    "Tags": ["pushgateway"],
    "address": "192.168.88.132",                      
    "port": 9091,                       
    "EnableTagOverride": false
}' http://192.168.88.133:8603/v1/agent/service/register
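Before wiring up Prometheus, the two registrations can be verified through the agent's HTTP API, using the same agent address as the registration calls above:

```shell
CONSUL_HTTP="http://192.168.88.133:8603"

# List every service registered with this agent:
curl -s "$CONSUL_HTTP/v1/agent/services" || true

# Or look up one service in the catalog (includes its address and port):
curl -s "$CONSUL_HTTP/v1/catalog/service/redis-node-exporter" || true
```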

Configure the Prometheus instance

To scrape metrics from the services registered in the Consul cluster, the locally running Prometheus instance must be configured accordingly.

Modify the prometheus configuration file:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_discovery_by_consul'
    scrape_interval: 5s
    consul_sd_configs:
      - server: '192.168.88.133:8603'
        services: ['redis-node-exporter', 'pushgateway-node-exporter']

    relabel_configs:
      - source_labels: [__meta_consul_service]
        action: keep
        regex: .*  
      
remote_write:
  - url: http://192.168.88.132:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000
      capacity: 20000
      max_shards: 30

Restart Prometheus

# systemctl restart prometheus
# systemctl status prometheus

References:

https://prometheus.io/blog/2015/06/01/advanced-service-discovery/#discovery-with-consul

Check that service discovery works

Status -> Targets

Click the corresponding Endpoint; it opens and displays that endpoint's metric data.

Viewing Node Exporter metrics in the Prometheus expression browser

Metrics specific to the Node Exporter are prefixed with node_ and include metrics such as node_cpu_seconds_total and node_exporter_build_info.

Some example metrics:

  • rate(node_cpu_seconds_total{mode="system"}[1m]): average amount of CPU time spent in system mode per second over the last minute (in seconds)
  • node_filesystem_avail_bytes: filesystem space available to non-root users (in bytes)
  • rate(node_network_receive_bytes_total[1m]): average network traffic received per second over the last minute (in bytes)
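The same expressions can also be evaluated through the Prometheus HTTP API instead of the expression browser. A sketch, assuming Prometheus listens on localhost:9090:

```shell
PROM="http://localhost:9090"

# Per-second system-mode CPU time averaged over the last minute:
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=rate(node_cpu_seconds_total{mode="system"}[1m])' || true

# Filesystem space available to non-root users, in bytes:
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=node_filesystem_avail_bytes' || true
```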

Verify that Grafana can display the data

Prometheus configuration

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['10.118.71.170:9090']

  - job_name: 'node'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['10.118.32.92:9100']
        labels:
          group: 'k8snode1'
remote_write:
  - url: http://10.118.71.170:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000
      capacity: 20000
      max_shards: 30
