Grafana Series (14): Installing Loki with Helm

Published by 東風微鳴 on 2023-02-11

Preface

Having written and translated so many Loki-related articles, I realized I still have not covered how to install it.

This article shows how to install Loki with Helm.

Prerequisites

Helm is installed, and the official Grafana chart repository has been added:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

⚠️ Warning:

If your network access is restricted, make sure you can reach the chart repository before proceeding.
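To confirm the repository was added and the chart is visible, a quick check (the chart used later in this article is grafana/loki-stack):

helm search repo grafana/loki-stack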

Deployment

Architecture

Promtail (collection) + Loki (storage and processing) + Grafana (visualization)
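In the loki-stack chart these three components are wired together out of the box: Promtail runs as a DaemonSet on every node and pushes the logs it collects to Loki's HTTP push API, while Grafana queries Loki as a data source. Just to make that data flow explicit, the rendered Promtail configuration ends up containing a client entry roughly like the following (the service name and port assume this article's defaults; you normally do not need to set this yourself):

clients:
  # Loki push API endpoint (the in-cluster service of the bundled Loki)
  - url: http://loki:3100/loki/api/v1/push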

Loki architecture diagram

Promtail

  1. Enable the Prometheus Operator ServiceMonitor for monitoring;
  2. Add an external_labels entry, cluster, to identify which K8s cluster the logs come from;
  3. Change pipeline_stages to cri so CRI-format logs are parsed correctly (my cluster's container runtime writes CRI-format logs, while the Loki Helm chart's default pipeline assumes docker);
  4. Add log collection for systemd-journal:
promtail:
  config:
    snippets:
      pipelineStages:
        - cri: {}

  extraArgs:
    - -client.external-labels=cluster=ctyun
  # Extra configuration for systemd-journal:
  # Add additional scrape config
  extraScrapeConfigs:
    - job_name: journal
      journal:
        path: /var/log/journal
        max_age: 12h
        labels:
          job: systemd-journal
      relabel_configs:
        - source_labels: ['__journal__systemd_unit']
          target_label: 'unit'
        - source_labels: ['__journal__hostname']
          target_label: 'hostname'

  # Mount journal directory into Promtail pods
  extraVolumes:
    - name: journal
      hostPath:
        path: /var/log/journal

  extraVolumeMounts:
    - name: journal
      mountPath: /var/log/journal
      readOnly: true

Loki

  1. Enable persistent storage;
  2. Enable the Prometheus Operator ServiceMonitor for monitoring;
    1. and configure Loki-related Prometheus rules for alerting;
  3. Since my personal cluster has a small log volume, increase the ingester-related settings accordingly.

Grafana

  1. Enable persistent storage;
  2. Enable the Prometheus Operator ServiceMonitor for monitoring;
  3. Enable all the sidecars, so that dashboards/datasources/plugins/notifiers can be updated dynamically.

Installing with Helm

Install with the following command:

helm upgrade --install loki --namespace=loki --create-namespace grafana/loki-stack -f values.yaml

The customized values.yaml is as follows:

loki:
  enabled: true
  persistence:
    enabled: true
    storageClassName: local-path
    size: 20Gi
  serviceScheme: https
  user: admin
  password: changit!
  config:
    ingester:
      chunk_idle_period: 1h
      max_chunk_age: 4h
    compactor:
      retention_enabled: true
  serviceMonitor:
    enabled: true
    prometheusRule:
      enabled: true
      rules:
        #  Some examples from https://awesome-prometheus-alerts.grep.to/rules.html#loki
        - alert: LokiProcessTooManyRestarts
          expr: changes(process_start_time_seconds{job=~"loki"}[15m]) > 2
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: Loki process too many restarts (instance {{ $labels.instance }})
            description: "A loki process had too many restarts (target {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: LokiRequestErrors
          expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: Loki request errors (instance {{ $labels.instance }})
            description: "The {{ $labels.job }} and {{ $labels.route }} are experiencing errors\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: LokiRequestPanic
          expr: sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Loki request panic (instance {{ $labels.instance }})
            description: "The {{ $labels.job }} is experiencing {{ printf \"%.2f\" $value }}% increase of panics\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
        - alert: LokiRequestLatency
          expr: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le)))  > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Loki request latency (instance {{ $labels.instance }})
            description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

promtail:
  enabled: true
  config:
    snippets:
      pipelineStages:
        - cri: {}  
  extraArgs:
    - -client.external-labels=cluster=ctyun        
  serviceMonitor:
    # -- If enabled, ServiceMonitor resources for Prometheus Operator are created
    enabled: true

  # Extra configuration for systemd-journal:
  # Add additional scrape config
  extraScrapeConfigs:
    - job_name: journal
      journal:
        path: /var/log/journal
        max_age: 12h
        labels:
          job: systemd-journal
      relabel_configs:
        - source_labels: ['__journal__systemd_unit']
          target_label: 'unit'
        - source_labels: ['__journal__hostname']
          target_label: 'hostname'

  # Mount journal directory into Promtail pods
  extraVolumes:
    - name: journal
      hostPath:
        path: /var/log/journal

  extraVolumeMounts:
    - name: journal
      mountPath: /var/log/journal
      readOnly: true

fluent-bit:
  enabled: false

grafana:
  enabled: true
  adminUser: caseycui
  adminPassword: changit!
  ## Sidecars that collect ConfigMaps with the specified label and store the included files in the respective folders
  ## Requires at least Grafana 5 to work and can't be used together with parameters dashboardProviders, datasources and dashboards
  sidecar:
    image:
      repository: quay.io/kiwigrid/k8s-sidecar
      tag: 1.15.6
      sha: ''
    dashboards:
      enabled: true
      SCProvider: true
      label: grafana_dashboard
    datasources:
      enabled: true
      # label that the configmaps with datasources are marked with
      label: grafana_datasource
    plugins:
      enabled: true
      # label that the configmaps with plugins are marked with
      label: grafana_plugin
    notifiers:
      enabled: true
      # label that the configmaps with notifiers are marked with
      label: grafana_notifier
  image:
    tag: 8.3.5
  persistence:
    enabled: true
    size: 2Gi
    storageClassName: local-path
  serviceMonitor:
    enabled: true
  imageRenderer:
    enabled: false

filebeat:
  enabled: false

logstash:
  enabled: false
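One note on compactor.retention_enabled: true above: with Loki's compactor-based retention, how long logs are actually kept is governed by retention_period under limits_config. A minimal sketch, assuming a 31-day retention target (744h is only an illustrative value):

loki:
  config:
    compactor:
      retention_enabled: true
    limits_config:
      # Chunks older than this are deleted by the compactor (illustrative value)
      retention_period: 744h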

The resource topology after installation looks like this:

Loki K8s resource topology
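To confirm the release came up cleanly, a quick check (release name and namespace assume the helm command above):

helm status loki -n loki
kubectl get pods -n loki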

Day-2 Configuration (as needed)

Adding Dashboards to Grafana

Create a ConfigMap like the following in the same namespace (anything labeled grafana_dashboard is imported automatically by the Grafana sidecar):

apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-grafana-dashboard
  labels:
    grafana_dashboard: "1"
data:
  k8s-dashboard.json: |-
    [...]
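If the dashboard already exists as a JSON file, the ConfigMap can also be generated and labeled directly with kubectl. A sketch, assuming the stack was installed into the loki namespace and the dashboard was exported to k8s-dashboard.json:

kubectl -n loki create configmap sample-grafana-dashboard \
  --from-file=k8s-dashboard.json \
  --dry-run=client -o yaml \
  | kubectl label --local -f - grafana_dashboard=1 -o yaml \
  | kubectl apply -n loki -f -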

Adding a DataSource to Grafana

Create a ConfigMap like the following in the same namespace (anything labeled grafana_datasource is imported automatically by the Grafana sidecar):

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-loki-stack
  labels:
    grafana_datasource: '1'
data:
  loki-stack-datasource.yaml: |-
    apiVersion: 1
    datasources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki:3100
      version: 1
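Whether the sidecar has picked the ConfigMap up can be checked with a label-selector query and, if needed, the sidecar's own logs. A sketch, assuming the loki namespace and the default loki-grafana deployment name (the sidecar container name can differ between chart versions):

# ConfigMaps the datasource sidecar will watch (matches the label configured above)
kubectl -n loki get configmap -l grafana_datasource=1
# Sidecar logs show whether the datasource file was written
kubectl -n loki logs deployment/loki-grafana -c grafana-sc-datasources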

Configuring a Grafana IngressRoute with Traefik

Since I am using Traefik 2, ingress is configured through the IngressRoute CRD, as follows:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
spec:
  entryPoints:
    - web
    - websecure
  routes:
    - kind: Rule
      match: Host(`grafana.ewhisper.cn`)
      middlewares:
        - name: hsts-header
          namespace: kube-system
        - name: redirectshttps
          namespace: kube-system
      services:
        - name: loki-grafana
          namespace: monitoring
          port: 80
  tls: {}
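The route above references two middlewares in kube-system (hsts-header and redirectshttps) that already exist in my cluster. If you do not have an equivalent, a minimal sketch of the HTTPS-redirect middleware looks like this (only the name has to match what the IngressRoute references):

apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: redirectshttps
  namespace: kube-system
spec:
  # Redirect plain-HTTP requests to HTTPS
  redirectScheme:
    scheme: https
    permanent: true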

Final Result

As shown below:

Grafana Explore Logs


Where three walk together, there is always one I can learn from; knowledge shared belongs to all. Written by the 東風微鳴 technical blog, EWhisper.cn.
