Deploying an HBase Cluster Quickly with Ansible

Posted by buttercup on 2022-04-30

Background

For data-security reasons, we built an in-house, low-cost time-series storage system to hold historical market data.

The system borrows InfluxDB's columnar storage and compression strategy, and relies on HBase for large-scale storage.

Since the ops team has no experience running the Hadoop stack, I, the developer, had to step in and handle the deployment myself.

Choosing a Hadoop Distribution

There are not many options to choose from:

  • CDH: currently the default choice for small and mid-sized companies
  • Ambari: the most flexible and customizable distribution
  • Apache: the vanilla upstream distribution

Drawbacks of CDH:

  • Bundled Hadoop components are outdated and lack newer APIs
  • JDK versions are restricted, so you cannot benefit from the performance gains of newer JDKs
  • It ships a number of known but unfixed bugs, which become operational landmines later
  • Newer CDH releases are no longer free, so free upgrades are off the table

Drawbacks of Ambari:

  • Sparse documentation and a painful build (outdated front-end components make the build fail outright)
  • The project has been retired and will no longer be maintained

Drawbacks of Apache:

  • The deployment process is involved, and version compatibility can be a minefield
  • Monitoring is not built in; rolling your own takes some hands-on effort

Final Decision

Current constraints:

  • Strict compliance requirements: licensing disputes must be avoided
  • The cluster is small, fewer than 50 nodes
  • No in-house Hadoop development capability, so we cannot patch bugs ourselves
  • Query performance matters, so ideally we can run ZGC or ShenandoahGC

The final decision was to build the HBase cluster on the vanilla Apache distribution.

Version Selection

HBase components

The chosen versions are:

  • Adoptium JDK
  • HBase 2.4.11 (JDK 17)
  • Hadoop 3.2.3 (JDK 8)
  • Zookeeper 3.6.3 (JDK 17)

Hadoop version

Hadoop 3.3.x and later no longer use the native snappy and lz4 codecs (related link), and the latest stable HBase 2.4.x has not yet adapted to that change, so we stay on the 3.2.x line.

Hadoop 3.2.x in turn depends on the Zookeeper 3.4.14 client, which cannot run on JDK 14 or newer (see the referenced case), so it is deployed on JDK 8.

Zookeeper version

Zookeeper 3.6.x is the lowest release line with built-in Prometheus metrics, and newer Zookeeper servers remain compatible with older clients, so that is the version we pick. It also officially supports running on JDK 11, so we can comfortably run it on JDK 17.
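As a quick sanity check after deployment (a sketch, assuming the metricsProvider.httpPort of 7000 configured in zoo.cfg below), the Prometheus endpoint can be probed with curl:

# Probe the ZooKeeper Prometheus endpoint on any ZK node (port 7000, see zoo.cfg below)
curl -s http://my.zk1:7000/metrics | head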

JDK distribution

JDK 17 is the first LTS release in which ZGC is production-ready. Since Oracle JDK 17 does not ship ShenandoahGC, we settled on Adoptium JDK. Some people have shared their experience of running the CDH build of HBase on JDK 15, but it requires a patch; the concrete steps are in the appendix.

Operations Tooling

To offset the operational pain of the vanilla Apache distribution, we lean on two effective open-source tools:

Ansible

A simple, easy-to-use automation and deployment tool:

  • Idempotent deployments reduce the chance of operator error
  • Communicates over plain SSH, so it is low-intrusion and agentless
  • Playbooks turn operational procedures into documentation, which makes handover easier

Ansible 2.9.x is the dividing line: it is the last series that supports Python 2.x. To fit the existing operations environment, that is the series we use.

That said, if you can, upgrade to Python 3.x and a newer Ansible; some bugs are only fixed in newer releases and are never backported.
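For reference, a minimal sketch of pinning the 2.9 series with pip (assuming pip is already available for the target Python; the appendix instead builds 2.9.27 from source):

# Keep Ansible on the last series that still supports Python 2.x
pip install --user 'ansible>=2.9,<2.10'
ansible --version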

Prometheus

A new-generation monitoring and alerting platform:

  • PromQL offers a flexible and efficient query model
  • It ships with its own TSDB and AlertManager, keeping the deployment architecture simple
  • Rich ecosystem
    • JMX Exporter exposes the monitoring metrics
    • Grafana visualizes the collected metrics

With no legacy baggage here, we simply use the latest release.
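Concretely, the JMX Exporter is attached to each JVM as a -javaagent; the flag rendered later by config-metrics.yml for the HMaster looks roughly like this (assuming deploy_dir=/opt and the port assignments used in that playbook):

# Example: expose HMaster metrics on port 7027 using the hmaster.yml rules
export HBASE_MASTER_OPTS="-javaagent:/opt/prometheus/jmx_prometheus_javaagent-0.16.1.jar=7027:/opt/prometheus/hmaster.yml $HBASE_MASTER_OPTS"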

Configuration Walkthrough

To keep configuration changes traceable, the deployment scripts live in a dedicated Git repository with the following layout:

.
├── hosts
├── ansible.cfg
├── book
│   ├── config-hadoop.yml
│   ├── config-hbase.yml
│   ├── config-metrics.yml
│   ├── config-zk.yml
│   ├── install-hadoop.yml
│   ├── sync-host.yml
│   └── vars.yml
├── conf
│   ├── hadoop
│   │   ├── core-site.xml
│   │   ├── hdfs-site.xml
│   │   ├── mapred-site.xml
│   │   ├── workers
│   │   └── yarn-site.xml
│   ├── hbase
│   │   ├── backup-masters
│   │   ├── hbase-site.xml
│   │   └── regionservers
│   ├── metrics
│   │   ├── exports
│   │   │   ├── hmaster.yml
│   │   │   ├── jmx_exporter.yml
│   │   │   └── regionserver.yml
│   │   └── targets
│   │       ├── hadoop-cluster.yml
│   │       ├── hbase-cluster.yml
│   │       └── zk-cluster.yml
│   └── zk
│       ├── myid
│       └── zoo.cfg
└── repo
    ├── hadoop
    │   ├── apache-zookeeper-3.6.3-bin.tar.gz
    │   ├── hadoop-3.2.3.tar.gz
    │   ├── hbase-2.4.11-bin.tar.gz
    │   ├── hbase-2.4.11-src.tar.gz
    │   ├── hbase-server-2.4.11.jar
    │   ├── OpenJDK17U-jdk_x64_linux_hotspot_17.0.2_8.tar.gz
    │   ├── OpenJDK8U-jdk_x64_linux_hotspot_8u322b06.tar.gz
    │   └── repo.md5
    └── metrics
        └── jmx_prometheus_javaagent-0.16.1.jar

Purpose of each directory

  • repo : binary artifacts used for deployment
  • book : ansible-playbook automation scripts
  • conf : configuration templates for the HBase stack

hosts file

Hosts are grouped so that the cluster layout is easy to plan:

[newborn]

[nodes]
172.20.72.1 hostname='my.hadoop1 my.hbase1 my.zk1'
172.20.72.2 hostname='my.hadoop2 my.hbase2 my.zk2'
172.20.72.3 hostname='my.hadoop3 my.hbase3 my.zk3'
172.20.72.4 hostname='my.hadoop4 my.hbase4'

[zk_nodes]
my.zk1 ansible_host=172.30.73.209 myid=1
my.zk2 ansible_host=172.30.73.210 myid=2
my.zk3 ansible_host=172.30.73.211 myid=3

[hadoop_nodes]
my.hadoop[1:4]

[namenodes]
my.hadoop1 id=nn1 rpc_port=8020 http_port=9870
my.hadoop2 id=nn2 rpc_port=8020 http_port=9870

[datanodes]
my.hadoop[1:4]

[journalnodes]
my.hadoop1 journal_port=8485
my.hadoop2 journal_port=8485
my.hadoop3 journal_port=8485

[resourcemanagers]
my.hadoop3 id=rm1 peer_port=8032 tracker_port=8031 scheduler_port=8030 web_port=8088
my.hadoop4 id=rm2 peer_port=8032 tracker_port=8031 scheduler_port=8030 web_port=8088

[hbase_nodes]
my.hbase[1:4]

[hmasters]
my.hbase[1:2]

[regionservers]
my.hbase[1:4]

[all:vars]
ansible_user=admin
deploy_dir=/opt
data_dir=/data

ansible.cfg file

Ansible's base configuration:

[defaults]
inventory      = ./hosts
host_key_checking = False

conf directory

conf/zk directory

zoo.cfg

# Basic time unit (heartbeat interval) between ZK and clients, in milliseconds
tickTime=2000
# Timeout for Followers to connect and sync with the Leader, in ticks
initLimit=30
# Timeout for requests between Leader and Followers, in ticks
syncLimit=5
# Snapshot directory
dataDir={{ zk_data_dir }}
# Transaction log (WAL) directory; ideally a dedicated, idle device (SSD recommended)
dataLogDir={{ zk_data_log_dir }}
# Use the default client port
clientPort=2181
# Raise the maximum number of client connections
maxClientCnxns=2000
# Enable Prometheus metrics
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpHost={{ ansible_host | default(inventory_hostname) }}
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
# Cluster membership
# server.{myid}={server-address}:{quorum-port}:{election-port}
{% for host in groups['zk_nodes'] %}
server.{{ hostvars[host]['myid'] }}={{ hostvars[host]['ansible_host'] }}:2888:3888
{% endfor %}
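With the [zk_nodes] group from the hosts file above, the loop should render to something like:

server.1=172.30.73.209:2888:3888
server.2=172.30.73.210:2888:3888
server.3=172.30.73.211:2888:3888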

myid

{{ myid }}

conf/hadoop directory

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- NameNode address (the cluster nameservice is used instead of a single host) -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://{{ hdfs_name }}</value>
  </property>
  <!-- Data storage directory -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>{{ hadoop_data_dir }}</value>
  </property>
  <!-- Web UI user (the default dr.who user cannot upload files) -->
  <property>
     <name>hadoop.http.staticuser.user</name>
     <value>{{ ansible_user }}</value>
  </property>
  <!-- ZK quorum required by DFSZKFailoverController -->
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>{{ groups['zk_nodes'] | map('regex_replace','^(.+)$','\\1:2181') | join(',') }}</value>
  </property>
</configuration>
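For reference, with the inventory above the ha.zookeeper.quorum expression should render to:

<value>my.zk1:2181,my.zk2:2181,my.zk3:2181</value>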

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <!-- NameNode data directory -->
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file://${hadoop.tmp.dir}/name</value>
 </property>
 <!-- DataNode data directory -->
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file://${hadoop.tmp.dir}/data</value>
 </property>
 <!-- JournalNode data directory (absolute path, must NOT carry a file:// prefix) -->
 <property>
   <name>dfs.journalnode.edits.dir</name>
   <value>${hadoop.tmp.dir}/journal</value>
 </property>
 <!-- HDFS nameservice (cluster name) -->
 <property>
   <name>dfs.nameservices</name>
   <value>{{ hdfs_name }}</value>
 </property>
 <!-- NameNode IDs in the cluster -->
 <property>
   <name>dfs.ha.namenodes.{{hdfs_name}}</name>
   <value>{{ groups['namenodes'] | map('extract', hostvars) | map(attribute='id') | join(',') }}</value>
 </property>
 <!-- NameNode RPC addresses -->
 {% for host in groups['namenodes'] %}
 <property>
   <name>dfs.namenode.rpc-address.{{hdfs_name}}.{{hostvars[host]['id']}}</name>
   <value>{{host}}:{{hostvars[host]['rpc_port']}}</value>
 </property>
 {% endfor %}
 <!-- NameNode HTTP addresses -->
 {% for host in groups['namenodes'] %}
 <property>
   <name>dfs.namenode.http-address.{{hdfs_name}}.{{hostvars[host]['id']}}</name>
    <value>{{host}}:{{hostvars[host]['http_port']}}</value>
 </property>
 {% endfor %}
 <!-- Location of the NameNode edit log on the JournalNodes -->
 <property>
   <name>dfs.namenode.shared.edits.dir</name>
   <value>qjournal://{{groups['journalnodes'] | zip( groups['journalnodes']|map('extract', hostvars)|map(attribute='journal_port') )| map('join', ':') | join(';') }}/{{hdfs_name}}</value>
 </property>
 <!-- Failover proxy class (clients use it to locate the Active NameNode) -->
 <property>
   <name>dfs.client.failover.proxy.provider.{{ hdfs_name }}</name>
   <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
 </property>
 <!-- Fencing method (ensures only one Active NameNode exists) -->
 <property>
   <name>dfs.ha.fencing.methods</name>
   <value>sshfence</value>
 </property>
 <!-- SSH private key used by the sshfence mechanism -->
 <property>
   <name>dfs.ha.fencing.ssh.private-key-files</name>
   <value>/home/{{ ansible_user }}/.ssh/id_rsa</value>
 </property>
 <!-- Enable automatic failover -->
 <property>
    <name>dfs.ha.automatic-failover.enabled</name>
   <value>true</value>
 </property>
 <!-- NameNode handler (worker) thread count -->
 <property>
   <name>dfs.namenode.handler.count</name>
   <value>21</value>
 </property>
</configuration>
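The dfs.namenode.shared.edits.dir expression zips each JournalNode host with its journal_port; with the inventory and vars above it should render to:

qjournal://my.hadoop1:8485;my.hadoop2:8485;my.hadoop3:8485/my-hdfs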

yarn-site.xml

<?xml version="1.0"?>
<configuration>
 <!-- Enable ResourceManager HA -->
 <property>
   <name>yarn.resourcemanager.ha.enabled</name>
   <value>true</value>
 </property>  
 <!-- YARN cluster id -->
 <property>
   <name>yarn.resourcemanager.cluster-id</name>
   <value>{{yarn_name}}</value>
 </property>  
<!-- ResourceManager id list -->
 <property>
   <name>yarn.resourcemanager.ha.rm-ids</name>
   <value>{{ groups['resourcemanagers'] | map('extract', hostvars) | map(attribute='id') | join(',') }}</value>
 </property>  
 <!-- ResourceManager hostnames -->
 {% for host in groups['resourcemanagers'] %}
 <property>
   <name>yarn.resourcemanager.hostname.{{hostvars[host]['id']}}</name>
   <value>{{host}}</value>
 </property>
 {% endfor %}
 <!-- ResourceManager internal communication addresses -->
 {% for host in groups['resourcemanagers'] %}
 <property>
     <name>yarn.resourcemanager.address.{{hostvars[host]['id']}}</name>
     <value>{{host}}:{{hostvars[host]['peer_port']}}</value>
 </property>
 {% endfor %}
 <!-- Addresses NodeManagers use to reach the ResourceManager -->
 {% for host in groups['resourcemanagers'] %}
 <property>
      <name>yarn.resourcemanager.resource-tracker.address.{{hostvars[host]['id']}}</name>
      <value>{{host}}:{{hostvars[host]['tracker_port']}}</value>
 </property>
 {% endfor %}
 <!-- Addresses ApplicationMasters use to request resources from the ResourceManager -->
 {% for host in groups['resourcemanagers'] %}
 <property>
     <name>yarn.resourcemanager.scheduler.address.{{hostvars[host]['id']}}</name>
     <value>{{host}}:{{hostvars[host]['scheduler_port']}}</value>
 </property>
 {% endfor %}
 <!-- ResourceManager web UI addresses -->
 {% for host in groups['resourcemanagers'] %}
 <property>
     <name>yarn.resourcemanager.webapp.address.{{hostvars[host]['id']}}</name>
     <value>{{host}}:{{hostvars[host]['web_port']}}</value>
 </property>
 {% endfor %}
 <!-- Enable ResourceManager recovery -->
 <property>
   <name>yarn.resourcemanager.recovery.enabled</name>
   <value>true</value>
 </property>
 <!-- Zookeeper quorum -->
 <property>
   <name>yarn.resourcemanager.zk-address</name>
   <value>{{ groups['zk_nodes'] | map('regex_replace','^(.+)$','\\1:2181') | join(',') }}</value>
 </property>
 <!-- Persist ResourceManager state in the Zookeeper cluster -->
 <property>
   <name>yarn.resourcemanager.store.class</name>
   <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
 </property>
 <!-- Reduce the number of ResourceManager threads handling client requests -->
 <property>
   <name>yarn.resourcemanager.scheduler.client.thread-count</name>
   <value>10</value>
 </property>  
 <!-- Disable NodeManager hardware auto-detection (the nodes are not dedicated) -->
 <property>
   <name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
   <value>false</value>
 </property>
 <!-- CPU vcores the NodeManager hands out to containers -->
 <property>
   <name>yarn.nodemanager.resource.cpu-vcores</name>
   <value>4</value>
 </property>
 <!-- Count physical cores rather than logical processors (optional) -->
 <property>
   <name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
   <value>false</value>
 </property>  
 <!-- Limit the memory used by the NodeManager -->
 <property>
   <name>yarn.nodemanager.resource.memory-mb</name>
   <value>4096</value>
 </property>  
 <!-- Container memory lower bound -->
 <property>
   <name>yarn.scheduler.minimum-allocation-mb</name>
   <value>1024</value>
 </property>  
 <!-- Container memory upper bound -->
 <property>
   <name>yarn.scheduler.maximum-allocation-mb</name>
   <value>2048</value>
 </property>  
 <!-- Container vcore lower bound -->
 <property>
   <name>yarn.scheduler.minimum-allocation-vcores</name>
   <value>1</value>
 </property>  
 <!-- Container vcore upper bound -->
 <property>
   <name>yarn.scheduler.maximum-allocation-vcores</name>
   <value>2</value>
 </property>  
 <!-- Disable the virtual memory check -->
 <property>
   <name>yarn.nodemanager.vmem-check-enabled</name>
   <value>false</value>
 </property>
 <!-- Ratio of virtual memory to physical memory -->
 <property>
   <name>yarn.nodemanager.vmem-pmem-ratio</name>
   <value>2.1</value>
 </property>
 <!-- Enable the MapReduce shuffle auxiliary service on NodeManagers (optional) -->
 <property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
 </property>  
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <!-- Run MapReduce on YARN -->
 <property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
 <!-- MapReduce classpath -->
 <property>
   <name>yarn.app.mapreduce.am.env</name>
   <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
 </property>
 <property>
   <name>mapreduce.map.env</name>
   <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
 </property>
 <property>
   <name>mapreduce.reduce.env</name>
   <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
 </property>
 <!-- MapReduce JVM options (must stay on a single line) -->
 <property>
   <name>yarn.app.mapreduce.am.command-opts</name>
   <value>-Xmx1024m --add-opens java.base/java.lang=ALL-UNNAMED</value>
 </property>
 <property>
   <name>mapred.child.java.opts</name>
   <value>--add-opens java.base/java.lang=ALL-UNNAMED -verbose:gc -Xloggc:/tmp/@taskid@.gc</value>
 </property>
</configuration>

workers

{% for host in groups['datanodes'] %}
{{ host }}
{% endfor %}

conf/hbase directory

hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.tmp.dir</name>
    <value>./tmp</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://{{ hdfs_name }}/hbase</value>
  </property>
  <property>
    <name>hbase.master.maxclockskew</name>
    <value>180000</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
      <value>{{ groups['zk_nodes'] | map('regex_replace','^(.+)$','\\1:2181') | join(',') }}</value>
  </property>
</configuration>

regionservers

{% for host in groups['regionservers'] %}
{{ host }}
{% endfor %}

backup-masters

{% for host in groups['hmasters'][1:] %}
{{ host }}
{% endfor %}
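Since the template skips the first entry of [hmasters], with the inventory above backup-masters should render to a single line:

my.hbase2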

conf/metrics/exports directory

jmx_exporter.yml

---
# https://github.com/prometheus/jmx_exporter
startDelaySeconds: 5
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules: 
# ignore service
- pattern: Hadoop<service=(\w+), name=([\w-.]+), sub=(\w+)><>([\w._]+)
  name: $4
  labels:
    name: "$2"
    group: "$3"
  attrNameSnakeCase: true
# ignore service
- pattern: Hadoop<service=(\w+), name=(\w+)-([^<]+)><>([\w._]+)
  name: $4
  labels:
    name: "$2"
    entity: "$3"
  attrNameSnakeCase: true
# ignore service
- pattern: Hadoop<service=(\w+), name=([^<]+)><>([\w._]+)
  name: $3
  labels:
    name: "$2"
  attrNameSnakeCase: true
- pattern: .+

hmaster.yml

---
startDelaySeconds: 5
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
blacklistObjectNames:
- "Hadoop:service=HBase,name=JvmMetrics*"
- "Hadoop:service=HBase,name=RegionServer,*"
rules:
- pattern: Hadoop<service=HBase, name=Master, sub=(\w+)><>([\w._]+)_(num_ops|min|max|mean|median|25th_percentile|75th_percentile|90th_percentile|95th_percentile|98th_percentile|99th_percentile|99.9th_percentile)
  name: $2
  labels:
    group: "$1"
    stat: "$3"
  attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=Master, sub=(\w+)><>([\w._]+)
  name: $2
  labels:
    group: "$1"
  attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=Master><>([\w._]+)
  name: $1
  attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=(\w+), sub=(\w+)><>([\w._]+)
  name: $3
  labels:
    name: "$1"
    group: "$2"
  attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=(\w+)><>([\w._]+)
  name: $2
  labels:
    name: "$1"
  attrNameSnakeCase: true
- pattern: .+

regionserver.yml

---
startDelaySeconds: 5
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
blacklistObjectNames:
- "Hadoop:service=HBase,name=JvmMetrics*"
- "Hadoop:service=HBase,name=Master,*"
rules:
- pattern: Hadoop<service=HBase, name=RegionServer, sub=Regions><>namespace_([\w._]+)_table_([\w._]+)_region_(\w+)_metric_([\w._]+)
  name: $4
  labels:
    group: Regions
    namespace: "$1"
    table: "$2"
    region: "$3"
  attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=Tables><>namespace_([\w._]+)_table_([\w._]+)_columnfamily_([\w._]+)_metric_([\w._]+)
  name: $4
  labels:
    group: Tables
    namespace: "$1"
    table: "$2"
    column_family: "$3"
  attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>namespace_([\w._]+)_table_([\w._]+)_metric_([\w._]+)_(num_ops|min|max|mean|median|25th_percentile|75th_percentile|90th_percentile|95th_percentile|98th_percentile|99th_percentile|99.9th_percentile)
  name: $4
  labels:
    group: "$1"
    namespace: "$2"
    table: "$3"
    stat: "$5"
  attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>namespace_([\w._]+)_table_([\w._]+)_metric_([\w._]+)
  name: $4
  labels:
    group: "$1"
    namespace: "$2"
    table: "$3"
  attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>([\w._]+)_(num_ops|min|max|mean|median|25th_percentile|75th_percentile|90th_percentile|95th_percentile|98th_percentile|99th_percentile|99.9th_percentile)
  name: $2
  labels:
    group: "$1"
    stat: "$3"
  attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>([\w._]+)
  name: $2
  labels:
    group: "$1"
  attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=(\w+), sub=(\w+)><>([\w._]+)
  name: $3
  labels:
    name: "$1"
    group: "$2"
  attrNameSnakeCase: true
- pattern: Hadoop<service=HBase, name=(\w+)><>([\w._]+)
  name: $2
  labels:
    name: "$1"
  attrNameSnakeCase: true
- pattern: .+

conf/metrics/targets directory

zk-cluster.yml

- targets:
{% for host in groups['zk_nodes'] %}
  - {{ host }}:7000
{% endfor %}
  labels:
    service: zookeeper

hadoop-cluster.yml

- targets:
{% for host in groups['namenodes'] %}
  - {{ host }}:{{ namenode_metrics_port }}
{% endfor %}
  labels:
    role: namenode
    service: hdfs

- targets:
{% for host in groups['datanodes'] %}
  - {{ host }}:{{ datanode_metrics_port }}
{% endfor %}
  labels:
    role: datanode
    service: hdfs

- targets:
{% for host in groups['journalnodes'] %}
  - {{ host }}:{{ journalnode_metrics_port }}
{% endfor %}
  labels:
    role: journalnode
    service: hdfs

- targets:
{% for host in groups['resourcemanagers'] %}
  - {{ host }}:{{ resourcemanager_metrics_port }}
{% endfor %}
  labels:
    role: resourcemanager
    service: yarn

- targets:
{% for host in groups['datanodes'] %}
  - {{ host }}:{{ nodemanager_metrics_port }}
{% endfor %}
  labels:
    role: nodemanager
    service: yarn

hbase-cluster.yml

- targets:
{% for host in groups['hmasters'] %}
  - {{ host }}:{{ hmaster_metrics_port }}
{% endfor %}
  labels:
    role: hmaster
    service: hbase

- targets:
{% for host in groups['regionservers'] %}
  - {{ host }}:{{ regionserver_metrics_port }}
{% endfor %}
  labels:
    role: regionserver
    service: hbase

book directory

vars.yml

hdfs_name: my-hdfs
yarn_name: my-yarn

sync-host.yml

---
- name: Config Hostname & SSH Keys
  hosts: nodes  
  connection: local
  gather_facts: no
  any_errors_fatal: true

  vars:
    hostnames: |
      {% for h in groups['nodes'] if hostvars[h].hostname is defined %}{{h}} {{ hostvars[h].hostname }}
      {% endfor %}

  tasks:

    - name: test connectivity
      ping:
      connection: ssh

    - name: change local hostname 
      become: true
      blockinfile:  
        dest: '/etc/hosts'
        marker: "# {mark} ANSIBLE MANAGED HOSTNAME"
        block: '{{ hostnames }}'
      run_once: true

    - name: sync remote hostname 
      become: true
      blockinfile:  
        dest: '/etc/hosts'
        marker: "# {mark} ANSIBLE MANAGED HOSTNAME"
        block: '{{ hostnames }}'
      connection: ssh

    - name: fetch exist status
      stat:
        path: '~/.ssh/id_rsa'
      register: ssh_key_path
      connection: ssh

    - name: generate ssh key
      openssh_keypair:
        path: '~/.ssh/id_rsa'
        comment: '{{ ansible_user }}@{{ inventory_hostname }}'
        type: rsa
        size: 2048
        state: present
        force: no
      connection: ssh
      when: not ssh_key_path.stat.exists

    - name: collect ssh key
      command: ssh {{ansible_user}}@{{ansible_host|default(inventory_hostname)}} 'cat ~/.ssh/id_rsa.pub'
      register: host_keys  # cache data in hostvars[hostname].host_keys
      changed_when: false

    - name: create temp file
      tempfile:
        state: file
        suffix: _keys
      register: temp_ssh_keys
      changed_when: false
      run_once: true

    - name: save ssh key ({{temp_ssh_keys.path}})
      blockinfile:  
        dest: "{{temp_ssh_keys.path}}"  
        block: |  
          {% for h in groups['nodes'] if hostvars[h].host_keys is defined %}  
          {{ hostvars[h].host_keys.stdout }}  
          {% endfor %}  
      changed_when: false
      run_once: true

    - name: deploy ssh key
      vars:
        ssh_keys: "{{ lookup('file', temp_ssh_keys.path).split('\n') | select('match', '^ssh') | join('\n') }}"
      authorized_key:
        user: "{{ ansible_user }}"
        key: "{{ ssh_keys }}"
        state: present
      connection: ssh

install-hadoop.yml

---
- name: Install Hadoop Package
  hosts: newborn
  gather_facts: no
  any_errors_fatal: true

  vars:
    local_repo: '../repo/hadoop'
    remote_repo: '~/repo/hadoop'
    package_info:
      - {src: 'OpenJDK17U-jdk_x64_linux_hotspot_17.0.2_8.tar.gz', dst: 'java/jdk-17.0.2+8', home: 'jdk17'}
      - {src: 'OpenJDK8U-jdk_x64_linux_hotspot_8u322b06.tar.gz', dst: 'java/jdk8u322-b06', home: 'jdk8'}
      - {src: 'apache-zookeeper-3.6.3-bin.tar.gz', dst: 'apache/zookeeper-3.6.3', home: 'zookeeper'}
      - {src: 'hbase-2.4.11-bin.tar.gz', dst: 'apache/hbase-2.4.11',home: 'hbase'}
      - {src: 'hadoop-3.2.3.tar.gz', dst: 'apache/hadoop-3.2.3', home: 'hadoop'}

  tasks:

    - name: test connectivity
      ping:

    - name: copy hadoop package
      copy:
          src: '{{ local_repo }}'
          dest: '~/repo'

    - name: prepare directory
      become: true # become root
      file:
        state: directory
        path: '{{ deploy_dir }}/{{ item.dst }}'
        owner: '{{ ansible_user }}'
        group: '{{ ansible_user }}'
        mode: 0775
        recurse: yes
      with_items: '{{ package_info }}'

    - name: create link
      become: true # become root
      file:
        state: link
        src: '{{ deploy_dir }}/{{ item.dst }}'
        dest: '{{ deploy_dir }}/{{ item.home }}'
        owner: '{{ ansible_user }}'
        group: '{{ ansible_user }}'
      with_items: '{{ package_info }}'

    - name: install package
      unarchive:
        src: '{{ remote_repo }}/{{ item.src }}'
        dest: '{{ deploy_dir }}/{{ item.dst }}'
        remote_src: yes
        extra_opts:
          - --strip-components=1
      with_items: '{{ package_info }}'

    - name: config /etc/profile
      become: true
      blockinfile:  
        dest: '/etc/profile'
        marker: "# {mark} ANSIBLE MANAGED PROFILE"
        block: |
          export JAVA_HOME={{ deploy_dir }}/jdk8
          export HADOOP_HOME={{ deploy_dir }}/hadoop
          export HBASE_HOME={{ deploy_dir }}/hbase
          export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$PATH

    - name: config zkEnv.sh
      lineinfile:
        path: '{{ deploy_dir }}/zookeeper/bin/zkEnv.sh'
        line: 'JAVA_HOME={{ deploy_dir }}/jdk17'
        insertafter: '^#\!\/usr\/bin'
        firstmatch: yes

    - name: config hadoop-env.sh
      blockinfile:
        dest: '{{ deploy_dir }}/hadoop/etc/hadoop/hadoop-env.sh'
        marker: "# {mark} ANSIBLE MANAGED DEFAULT HADOOP ENV"
        block: |
          export JAVA_HOME={{ deploy_dir }}/jdk8

    - name: config hbase-env.sh
      blockinfile:
        dest: '{{ deploy_dir }}/hbase/conf/hbase-env.sh'
        marker: "# {mark} ANSIBLE MANAGED DEFAULT HBASE ENV"
        block: |
          export JAVA_HOME={{ deploy_dir }}/jdk17
          export HBASE_MANAGES_ZK=false
          export HBASE_LIBRARY_PATH={{ deploy_dir }}/hadoop/lib/native
          export HBASE_OPTS="$HBASE_OPTS --add-exports=java.base/jdk.internal.access=ALL-UNNAMED --add-exports=java.base/jdk.internal=ALL-UNNAMED --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED --add-exports=java.base/sun.security.pkcs=ALL-UNNAMED --add-exports=java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.lang.reflect=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/jdk.internal=ALL-UNNAMED --add-opens java.base/jdk.internal.misc=ALL-UNNAMED --add-opens java.base/jdk.internal.access=ALL-UNNAMED"

    - name: patch hbase
      copy:
        src: '{{ local_repo }}/hbase-server-2.4.11.jar'
        dest: '{{ deploy_dir }}/hbase/lib'
        backup: no
        force: yes

    - name: link hadoop config
      file:
        state: link
        src: '{{ deploy_dir }}/hadoop/etc/hadoop/{{ item }}'
        dest: '{{ deploy_dir }}/hbase/conf/{{ item }}'
      with_items: 
        - core-site.xml
        - hdfs-site.xml

    - name: add epel-release repo
      shell: 'sudo yum -y install epel-release && sudo yum makecache'

    - name: install native library
      shell: 'sudo yum -y install snappy snappy-devel lz4 lz4-devel libzstd libzstd-devel'

    - name: check hadoop native
      shell: '{{ deploy_dir }}/hadoop/bin/hadoop checknative -a'
      register: hadoop_checknative
      failed_when: false
      changed_when: false
      ignore_errors: yes
      environment:
        JAVA_HOME: '{{ deploy_dir }}/jdk8'

    - name: hadoop native status
      debug:
        msg: "{{ hadoop_checknative.stdout_lines }}"

    - name: check hbase native
      shell: '{{ deploy_dir }}/hbase/bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker'
      register: hbase_checknative
      failed_when: false
      changed_when: false
      ignore_errors: yes
      environment:
        JAVA_HOME: '{{ deploy_dir }}/jdk17'
        HBASE_LIBRARY_PATH: '{{ deploy_dir }}/hadoop/lib/native'

    - name: hbase native status
      debug:
        msg: "{{ hbase_checknative.stdout_lines|select('match', '^[^0-9]') | list }}"

    - name: test native compression
      shell: '{{ deploy_dir }}/hbase/bin/hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/test {{ item }}'
      register: 'compression'
      failed_when: false
      changed_when: false
      ignore_errors: yes
      environment:
        JAVA_HOME: '{{ deploy_dir }}/jdk17'
        HBASE_LIBRARY_PATH: '{{ deploy_dir }}/hadoop/lib/native'
      with_items:
        - snappy
        - lz4

    - name: native compression status
      vars:
        results: "{{ compression | json_query('results[*].{type:item, result:stdout}') }}"
      debug:
        msg: |
          {% for r in results %} {{ r.type }} => {{ r.result == 'SUCCESS' }} {% endfor %}

config-zk.yml

---
- name: Change Zk Config
  hosts: zk_nodes
  gather_facts: no
  any_errors_fatal: true

  vars:
    template_dir: ../conf/zk
    zk_home: '{{ deploy_dir }}/zookeeper'
    zk_data_dir: '{{ zk_home }}/status/data'
    zk_data_log_dir: '{{ zk_home }}/status/logs'

  tasks:

    - name: Create data directory
      file:
        state: directory
        path: '{{ item }}'
        recurse: yes
      with_items: 
        - '{{ zk_data_dir }}'
        - '{{ zk_data_log_dir }}'

    - name: Init zookeeper myid
      template:
        src: '{{ template_dir }}/myid'
        dest: '{{ zk_data_dir }}'

    - name: Update zookeeper env
      become: true
      blockinfile:
        dest: '{{ zk_home }}/bin/zkEnv.sh'
        marker: "# {mark} ANSIBLE MANAGED ZK ENV"
        block: |
          export SERVER_JVMFLAGS="-Xmx1G -XX:+UseShenandoahGC -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608"
      notify:
        - Restart zookeeper service

    - name: Update zookeeper config
      template:
        src: '{{ template_dir }}/zoo.cfg'
        dest: '{{ zk_home }}/conf'
      notify:
        - Restart zookeeper service

  handlers:
    - name: Restart zookeeper service
      shell:
        cmd: '{{ zk_home }}/bin/zkServer.sh restart'

config-hadoop.yml

---
- name: Change Hadoop Config
  hosts: hadoop_nodes
  gather_facts: no
  any_errors_fatal: true

  vars:
    template_dir: ../conf/hadoop
    hadoop_home: '{{ deploy_dir }}/hadoop'
    hadoop_conf_dir: '{{ hadoop_home }}/etc/hadoop'
    hadoop_data_dir: '{{ data_dir }}/hadoop'

  tasks:

    - name: Include common vars
      include_vars: file=vars.yml

    - name: Create data directory
      become: true
      file:
        state: directory
        path: '{{ hadoop_data_dir }}'
        owner: '{{ ansible_user }}'
        group: '{{ ansible_user }}'
        mode: 0775
        recurse: yes

    - name: Sync hadoop config
      template:
        src: '{{ template_dir }}/{{ item }}'
        dest: '{{ hadoop_conf_dir }}/{{ item }}'
      with_items: 
        - core-site.xml
        - hdfs-site.xml
        - mapred-site.xml
        - yarn-site.xml
        - workers

    - name: Config hadoop env
      blockinfile:
        dest: '{{ hadoop_conf_dir }}/hadoop-env.sh'
        marker: "# {mark} ANSIBLE MANAGED HADOOP ENV"
        block: |
          export HADOOP_PID_DIR={{ hadoop_home }}/pid
          export HADOOP_LOG_DIR={{ hadoop_data_dir }}/logs

          JVM_OPTS="-XX:+AlwaysPreTouch"
          export HDFS_JOURNALNODE_OPTS="-Xmx1G $JVM_OPTS $HDFS_JOURNALNODE_OPTS"
          export HDFS_NAMENODE_OPTS="-Xmx4G $JVM_OPTS $HDFS_NAMENODE_OPTS"
          export HDFS_DATANODE_OPTS="-Xmx8G $JVM_OPTS $HDFS_DATANODE_OPTS"

    - name: Config yarn env
      blockinfile:
        dest: '{{ hadoop_conf_dir }}/yarn-env.sh'
        marker: "# {mark} ANSIBLE MANAGED YARN ENV"
        block: |
          JVM_OPTS=""
          export YARN_RESOURCEMANAGER_OPTS="$JVM_OPTS $YARN_RESOURCEMANAGER_OPTS"
          export YARN_NODEMANAGER_OPTS="$JVM_OPTS $YARN_NODEMANAGER_OPTS"

config-hbase.yml

---
- name: Change HBase Config
  hosts: hbase_nodes
  gather_facts: no
  any_errors_fatal: true

  vars:
    template_dir: ../conf/hbase
    hbase_home: '{{ deploy_dir }}/hbase'
    hbase_conf_dir: '{{ hbase_home }}/conf'
    hbase_data_dir: '{{ data_dir }}/hbase'
    hbase_log_dir: '{{ hbase_data_dir }}/logs'
    hbase_gc_log_dir: '{{ hbase_log_dir }}/gc'

  tasks:

    - name: Include common vars
      include_vars: file=vars.yml

    - name: Create data directory
      become: true
      file:
        state: directory
        path: '{{ item }}'
        owner: '{{ ansible_user }}'
        group: '{{ ansible_user }}'
        mode: 0775
        recurse: yes
      with_items:
        - '{{ hbase_data_dir }}'
        - '{{ hbase_log_dir }}'
        - '{{ hbase_gc_log_dir }}'

    - name: Sync hbase config
      template:
        src: '{{ template_dir }}/{{ item }}'
        dest: '{{ hbase_conf_dir }}/{{ item }}'
      with_items: 
        - hbase-site.xml
        - backup-masters
        - regionservers

    - name: Config hbase env
      blockinfile:
        dest: '{{ hbase_conf_dir }}/hbase-env.sh'
        marker: "# {mark} ANSIBLE MANAGED HBASE ENV"
        block: |
          export HBASE_LOG_DIR={{ hbase_log_dir }}

          export HBASE_OPTS="-Xss256k -XX:+UseShenandoahGC -XX:+AlwaysPreTouch $HBASE_OPTS"
          export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xlog:gc:{{hbase_gc_log_dir}}/gc-hmaster-%p-%t.log"
          export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xlog:gc:{{hbase_gc_log_dir}}/gc-hregion-%p-%t.log"

config-metrics.yml

---
- name: Install Metrics Package
  hosts: "{{ groups['hadoop_nodes'] + groups['hbase_nodes'] }}"
  gather_facts: no
  any_errors_fatal: true

  vars:
    local_repo: '../repo/metrics'
    remote_repo: '~/repo/metrics'
    template_dir: ../conf/metrics
    default_conf: jmx_exporter.yml

    export_tmpl: '{{template_dir}}/exports'
    target_tmpl: '{{template_dir}}/targets'

    metrics_dir: '{{ deploy_dir }}/prometheus'
    hadoop_home: '{{ deploy_dir }}/hadoop'
    hbase_home: '{{ deploy_dir }}/hbase'

    jmx_exporter: 'jmx_prometheus_javaagent-0.16.1.jar'
    agent_path: '{{ metrics_dir }}/{{ jmx_exporter }}'

    namenode_metrics_port: 7021
    datanode_metrics_port: 7022
    journalnode_metrics_port: 7023
    resourcemanager_metrics_port: 7024
    nodemanager_metrics_port: 7025
    historyserver_metrics_port: 7026

    hmaster_metrics_port: 7027
    regionserver_metrics_port: 7028

    host_to_ip: |
      { {% for h in groups['nodes'] %} {% for n in hostvars[h]['hostname'].split() %}
       "{{ n }}" : "{{ h }}" ,
      {% endfor %} {% endfor %} }

    hadoop_metrics:
      - { env: 'HDFS_NAMENODE_OPTS', conf: 'namenode.yml', port: '{{namenode_metrics_port}}',  }
      - { env: 'HDFS_DATANODE_OPTS', conf: 'datanode.yml', port: '{{datanode_metrics_port}}'}
      - { env: 'HDFS_JOURNALNODE_OPTS', conf: 'journalnode.yml', port: '{{journalnode_metrics_port}}' }
      - { env: 'YARN_RESOURCEMANAGER_OPTS', conf: 'resourcemanager.yml', port: '{{resourcemanager_metrics_port}}' }
      - { env: 'YARN_NODEMANAGER_OPTS', conf: 'nodemanager.yml', port: '{{nodemanager_metrics_port}}' }
      - { env: 'MAPRED_HISTORYSERVER_OPTS', conf: 'historyserver.yml', port: '{{historyserver_metrics_port}}' }

    hbase_metrics:
      - { env: 'HBASE_MASTER_OPTS', conf: 'hmaster.yml', port: '{{hmaster_metrics_port}}' }
      - { env: 'HBASE_REGIONSERVER_OPTS', conf: 'regionserver.yml', port: '{{regionserver_metrics_port}}'}

  tasks:

    - name: test connectivity
      ping:

    - name: copy metrics package
      copy:
          src: '{{ local_repo }}'
          dest: '~/repo'

    - name: ensure metrics dir
      become: true
      file: 
        path: '{{ metrics_dir }}'
        owner: '{{ ansible_user }}'
        group: '{{ ansible_user }}'
        state: directory

    - name: install jmx exporter
      copy:
        src: '{{ remote_repo }}/{{ jmx_exporter }}'
        dest: '{{ metrics_dir }}/{{ jmx_exporter }}'
        remote_src: yes

    - name: fetch exist exporter config
      stat:
        path: '{{ export_tmpl }}/{{ item }}'
      with_items: "{{ (hadoop_metrics + hbase_metrics) | map(attribute='conf') | list }}"
      register: metric_tmpl
      run_once: yes
      connection: local

    - name: update hadoop exporter config
      vars:
        metrics_ip: '{{host_to_ip[inventory_hostname]}}'
        metrics_port: '{{ item.port }}'
        custom_tmpl: "{{ item.conf in (metric_tmpl | json_query('results[?stat.exists].item')) }}"
      template:
        src: '{{ export_tmpl }}/{{ item.conf if custom_tmpl else default_conf }}'
        dest: '{{ metrics_dir }}/{{ item.conf }}'
      with_items: '{{ hadoop_metrics }}'
      when: inventory_hostname in groups['hadoop_nodes']

    - name: update hbase exporter config
      vars:
        metrics_ip: '{{host_to_ip[inventory_hostname]}}'
        metrics_port: '{{ item.port }}'
        custom_tmpl: "{{ item.conf in (metric_tmpl | json_query('results[?stat.exists].item')) }}"
      template:
        src: '{{ export_tmpl }}/{{ item.conf if custom_tmpl else default_conf }}'
        dest: '{{ metrics_dir }}/{{ item.conf }}'
      with_items: '{{ hbase_metrics }}'
      when: inventory_hostname in groups['hbase_nodes']


    - name: config hadoop-env.sh
      blockinfile:
        dest: '{{ deploy_dir }}/hadoop/etc/hadoop/hadoop-env.sh'
        marker: "# {mark} ANSIBLE MANAGED DEFAULT HADOOP METRIC ENV"
        block: |
          {% for m in hadoop_metrics %}
          export {{m.env}}="-javaagent:{{agent_path}}={{m.port}}:{{metrics_dir}}/{{m.conf}} ${{m.env}}"
          {% endfor %}
      when: inventory_hostname in groups['hadoop_nodes']


    - name: config hbase-env.sh
      blockinfile:
        dest: '{{ deploy_dir }}/hbase/conf/hbase-env.sh'
        marker: "# {mark} ANSIBLE MANAGED DEFAULT HBASE METRIC ENV"
        block: |
          {% for m in hbase_metrics %}
          export {{m.env}}="-javaagent:{{agent_path}}={{m.port}}:{{metrics_dir}}/{{m.conf}} ${{m.env}}"
          {% endfor %}
      when: inventory_hostname in groups['hbase_nodes']

    - name: ensure generated target dir
      file: 
        path: '/tmp/gen-prometheus-targets'
        state: directory
      run_once: yes
      connection: local

    - name: generate target config to /tmp/gen-prometheus-targets
      template:
        src: '{{ target_tmpl }}/{{ item }}'
        dest: '/tmp/gen-prometheus-targets/{{ item }}'
      with_items: 
        - hadoop-cluster.yml
        - hbase-cluster.yml
        - zk-cluster.yml
      run_once: yes
      connection: local

Operation Steps

Configure the control machine

SSH host-key prompting must be disabled, otherwise later installation steps may hang.
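The essential client-side setting is the following (a sketch; the appendix walks through the full SSH setup):

# /etc/ssh/ssh_config (or ~/.ssh/config) on the control machine
Host *
        StrictHostKeyChecking no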

Initialize the machines

  1. Edit the hosts file (entries must be IP addresses)
  • [nodes] lists every node in the cluster
  • [newborn] lists the nodes that do not yet have the packages installed
  2. Run ansible-playbook book/sync-host.yml
  3. Run ansible-playbook book/install-hadoop.yml
  4. Edit the hosts file again
  • [newborn] empty this group

Configure and start Zookeeper

  1. Edit the hosts file (ansible_host and myid must be set)
  • [zk_nodes] lists all ZK nodes in the cluster
  2. Adjust the JVM options in book/config-zk.yml
  3. Run ansible-playbook book/config-zk.yml (a quick status check is shown below)
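A quick status check after the playbook has run (a sketch, assuming deploy_dir=/opt as set in the hosts file):

ansible zk_nodes -m shell -a '/opt/zookeeper/bin/zkServer.sh status'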

Configure Hadoop

  1. Edit the hosts file
  • [hadoop_nodes] lists all Hadoop nodes in the cluster
  • [namenodes] all NameNodes (id, rpc_port and http_port must be set)
  • [datanodes] all DataNodes
  • [journalnodes] all JournalNodes (journal_port must be set)
  • [resourcemanagers] all ResourceManagers (id, peer_port, tracker_port, scheduler_port and web_port must be set)
  2. Adjust the JVM options in book/config-hadoop.yml
  3. Run ansible-playbook book/config-hadoop.yml

Start HDFS

  1. Start the journalnode service on every JournalNode
ansible journalnodes -m shell -a 'source /etc/profile && nohup hdfs --daemon start journalnode'

# Check that the JournalNode process is running
ansible journalnodes -m shell -a 'source /etc/profile && jps | grep JournalNode'
  2. On the nn1 node, format the NameNode and start the namenode service
ansible 'namenodes[0]' -m shell -a 'source /etc/profile && hdfs namenode -format'

ansible 'namenodes[0]' -m shell -a 'source /etc/profile && nohup hdfs --daemon start namenode'

ansible 'namenodes[0]' -m shell -a 'source /etc/profile && jps | grep NameNode'
  3. Sync nn1's metadata to the remaining NameNodes and start their namenode services
ansible 'namenodes[1:]' -m shell -a 'source /etc/profile && hdfs namenode -bootstrapStandby'

ansible 'namenodes[1:]' -m shell -a 'source /etc/profile && nohup hdfs --daemon start namenode'

ansible 'namenodes[1:]' -m shell -a 'source /etc/profile && jps | grep NameNode'
  4. Start the datanode service on every DataNode (check the DataNode configuration beforehand)
ansible datanodes -m shell -a 'source /etc/profile && nohup hdfs --daemon start datanode'

ansible datanodes -m shell -a 'source /etc/profile && jps | grep DataNode'
  5. Check that both NameNodes are in Standby state
ansible 'namenodes[0]' -m shell -a 'source /etc/profile && hdfs haadmin -getServiceState nn1'
ansible 'namenodes[0]' -m shell -a 'source /etc/profile && hdfs haadmin -getServiceState nn2'
  6. Initialize the DFSZKFailoverController state in Zookeeper
ansible 'namenodes[0]' -m shell -a 'source /etc/profile && hdfs zkfc -formatZK'
  7. Restart the HDFS cluster
ansible 'namenodes[0]' -m shell -a 'source /etc/profile && stop-dfs.sh'

ansible 'namenodes[0]' -m shell -a 'source /etc/profile && start-dfs.sh'

# Check that the DFSZKFailoverController process is running
ansible 'namenodes' -m shell -a 'source /etc/profile && jps | grep FailoverController'
  8. Check that one NameNode is now Active
ansible 'namenodes[0]' -m shell -a 'source /etc/profile && hdfs haadmin -getServiceState nn1'
ansible 'namenodes[0]' -m shell -a 'source /etc/profile && hdfs haadmin -getServiceState nn2'

Start YARN

ansible 'namenodes[0]' -m shell -a 'source /etc/profile && start-yarn.sh'

# Check that the ResourceManager and NodeManager processes are running
ansible 'hadoop_nodes' -m shell -a 'source /etc/profile && jps | grep Manager'

Check the state of each ResourceManager and find the Active RM:

ansible 'namenodes[0]' -m shell -a 'source /etc/profile && yarn rmadmin -getServiceState rm1'

ansible 'namenodes[0]' -m shell -a 'source /etc/profile && yarn rmadmin -getServiceState rm2'

Configure HBase

  1. Edit the hosts file
  • [hbase_nodes] lists all HBase nodes in the cluster
  • [hmasters] all HMasters
  • [regionservers] all RegionServers
  2. Adjust the JVM options in book/config-hbase.yml
  3. Run ansible-playbook book/config-hbase.yml

Start HBase

ansible 'hmasters[0]' -m shell -a 'source /etc/profile && nohup start-hbase.sh'

# Check that the HMaster and RegionServer processes are running
ansible 'hbase_nodes' -m shell -a 'source /etc/profile && jps | grep H'

Configure monitoring

  1. Adjust the JVM options in book/config-metrics.yml
  2. Customize per-node exporter configs in book/config-metrics.yml
  3. Run ansible-playbook book/config-metrics.yml
  4. Restart the services
# Stop HBase
ansible 'hmasters[0]' -m shell -a 'source /etc/profile && stop-hbase.sh'

ansible 'hbase_nodes' -m shell -a 'source /etc/profile && jps | grep H'

# Stop Hadoop

ansible 'resourcemanagers[0]' -m shell -a 'source /etc/profile && stop-yarn.sh'

ansible 'namenodes[0]' -m shell -a 'source /etc/profile && stop-dfs.sh'

ansible 'hadoop_nodes' -m shell -a 'source /etc/profile && jps | grep -v "Jps\|QuorumPeerMain"'

# Start HDFS

ansible 'namenodes[0]' -m shell -a 'source /etc/profile && start-dfs.sh'

# Check HDFS
# curl my.hadoop1:7021/metrics
# curl my.hadoop1:7022/metrics
# curl my.hadoop1:7023/metrics

# Start YARN
ansible 'namenodes[0]' -m shell -a 'source /etc/profile && start-yarn.sh'

# Check YARN
# curl my.hadoop3:7024/metrics
# curl my.hadoop3:7025/metrics

# Start HBase
ansible 'hmasters[0]' -m shell -a 'source /etc/profile && nohup start-hbase.sh'

# Check HBase
# curl my.hbase1:7027/metrics
# curl my.hbase1:7028/metrics

Install Prometheus and Grafana

  • Install Prometheus (see appendix)
  • Install Grafana (see appendix)

Appendix

Install Ansible

Install dependencies

  • Install pip (for Python 2.7)
curl https://bootstrap.pypa.io/pip/2.7/get-pip.py -o get-pip.py

python get-pip.py --user

pip -V
  • Install the dependency libraries
sudo yum install -y gcc glibc-devel zlib-devel rpm-build openssl-devel
sudo yum install -y python-devel python-yaml python-jinja2 python2-jmespath

Build and install

Since only the Ansible 2.9 series still supports Python 2, it cannot be installed via yum here.

Download the ansible 2.9.27 source and build it locally:

wget https://releases.ansible.com/ansible/ansible-2.9.27.tar.gz

tar -xf ansible-2.9.27.tar.gz

pushd ansible-2.9.27/

python setup.py build

sudo python setup.py install

popd

ansible --version

Configure passwordless SSH

  • Generate a key pair on the control machine
ssh-keygen -t rsa -b 3072
cat ~/.ssh/id_rsa.pub
  • Authorize access on the managed machines
cat <<EOF >> ~/.ssh/authorized_keys
ssh-rsa XXX
EOF
  • Disable SSH host-key prompting on the managed machines
sudo vim /etc/ssh/ssh_config
# Append the following stanza
Host *
        StrictHostKeyChecking no

Install Prometheus

Create the prometheus user

sudo useradd --no-create-home --shell /bin/false prometheus

# Grant sudo privileges
sudo visudo
prometheus ALL=(ALL) NOPASSWD:ALL

Find the download link on the official site:

wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz

tar -xvf prometheus-2.35.0.linux-amd64.tar.gz && sudo mv prometheus-2.35.0.linux-amd64 /usr/local/prometheus-2.35.0 

sudo mkdir -p /data/prometheus/tsdb
sudo mkdir -p /etc/prometheus

sudo ln -s /usr/local/prometheus-2.35.0 /usr/local/prometheus

sudo mv /usr/local/prometheus/prometheus.yml /etc/prometheus

sudo chown -R prometheus:prometheus /usr/local/prometheus/
sudo chown -R prometheus:prometheus /data/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus

Register it as a systemd service (see the unit file format reference):

sudo vim /etc/systemd/system/prometheus.service

# Add the following content
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/prometheus/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/data/prometheus/tsdb \
    --web.listen-address=:9090

[Install]
WantedBy=multi-user.target
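After creating the unit file, reload systemd so it picks up the new service, and optionally enable it at boot:

sudo systemctl daemon-reload
sudo systemctl enable prometheus.service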

Start the service:

sudo systemctl start prometheus.service

# Check the service status
systemctl status prometheus.service

# Check the logs
sudo journalctl -u prometheus

# Test: curl 127.0.0.1:9090

Edit prometheus.yml:

scrape_configs:

  - job_name: "prometheus"
    file_sd_configs:
      - files:
        - targets/prometheus-*.yml
        refresh_interval: 1m

  - job_name: "zookeeper"
    file_sd_configs:
      - files:
        - targets/zk-cluster.yml
        refresh_interval: 1m
    metric_relabel_configs:
    - action: replace
      source_labels: ["instance"]
      target_label: "instance"
      regex: "([^:]+):.*"
      replacement: "$1"

  - job_name: "hadoop"
    file_sd_configs:
      - files:
        - targets/hadoop-cluster.yml
        refresh_interval: 1m
    metric_relabel_configs:
    - action: replace
      source_labels: ["__name__"]
      target_label: "__name__"
      regex: "Hadoop_[^_]*_(.*)"
      replacement: "$1"
    - action: replace
      source_labels: ["instance"]
      target_label: "instance"
      regex: "([^:]+):.*"
      replacement: "$1"

  - job_name: "hbase"
    file_sd_configs:
      - files:
        - targets/hbase-cluster.yml
        refresh_interval: 1m
    metric_relabel_configs:
    - action: replace
      source_labels: ["instance"]
      target_label: "instance"
      regex: "([^:]+):.*"
      replacement: "$1"
    - action: replace
      source_labels: ["stat"]
      target_label: "stat"
      regex: "(.*)th_percentile"
      replacement: "p$1"

Add targets:

sudo mkdir -p /etc/prometheus/targets
pushd /etc/prometheus/targets

sudo tee -a prometheus-servers.yml <<EOF
- targets:
  - localhost:9090
  labels:
    service: prometheus
EOF

sudo tee -a zk-cluster.yml <<EOF
- targets:
  - my.zk1:7000
  - my.zk2:7000
  - my.zk3:7000
  labels:
    service: zookeeper
EOF

sudo tee -a hadoop-cluster.yml <<EOF
- targets:
  - my.hadoop1:7021
  - my.hadoop2:7021
  labels:
    role: namenode
    service: hdfs
- targets:
  - my.hadoop1:7022
  - my.hadoop2:7022
  - my.hadoop3:7022
  - my.hadoop4:7022
  labels:
    role: datanode
    service: hdfs
- targets:
  - my.hadoop1:7023
  - my.hadoop2:7023
  - my.hadoop3:7023
  labels:
    role: journalnode
    service: hdfs
- targets:
  - my.hadoop3:7024
  - my.hadoop4:7024
  labels:
    role: resourcemanager
    service: yarn
- targets:
  - my.hadoop1:7025
  - my.hadoop2:7025
  - my.hadoop3:7025
  - my.hadoop4:7025
  labels:
    role: nodemanager
    service: yarn
EOF

sudo tee -a hbase-cluster.yml <<EOF
- targets:
  - my.hbase1:7027
  - my.hbase2:7027
  labels:
    role: hmaster
    service: hbase
- targets:
  - my.hbase1:7028
  - my.hbase2:7028
  - my.hbase3:7028
  - my.hbase4:7028
  labels:
    role: regionserver
    service: hbase
EOF
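Prometheus re-reads the file_sd target files on its refresh_interval, but restarting is the simplest way to make sure both the updated prometheus.yml and the new target files are picked up; the targets API then gives a quick health view (a sketch):

sudo systemctl restart prometheus.service

# List active targets and their scrape health
curl -s 'http://127.0.0.1:9090/api/v1/targets?state=active' | head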

Install Grafana

Install the service

Find the download link on the official site (choose the OSS edition):

wget https://dl.grafana.com/oss/release/grafana-8.5.0-1.x86_64.rpm
sudo yum install grafana-8.5.0-1.x86_64.rpm

# List the config files created by the package
rpm -ql grafana

Edit grafana.ini:

sudo vim /etc/grafana/grafana.ini

# Storage paths
[paths]
data = /data/grafana/data
logs = /data/grafana/logs

# Admin account
[security]
admin_user = admin
admin_password = admin

Start the grafana service:

sudo mkdir -p /data/grafana/{data,logs} && sudo chown -R grafana:grafana /data/grafana

sudo systemctl start grafana-server

systemctl status grafana-server

# Test: curl 127.0.0.1:3000

Configure LDAP

Edit the grafana.ini config file:

sudo vim /etc/grafana/grafana.ini

# Enable LDAP
[auth.ldap]
enabled = true

# Raise the log level to debug for easier troubleshooting (optional)
[log]
level = debug

Add the LDAP configuration (reference):

sudo vim /etc/grafana/ldap.toml

[[servers]]
# LDAP server
host = "ldap.service.com"
port = 389

# Bind credentials
bind_dn = "cn=ldap_sync,cn=Users,dc=staff,dc=my,dc=com"
bind_password = """???"""

# Search scope
search_filter = "(sAMAccountName=%s)"
search_base_dns = ["ou=Employees,dc=staff,dc=my,dc=com"]

# User attribute mapping
[servers.attributes]
name = "givenname"
surname = "cn"
username = "cn"
email =  "mail"

# Group/permission mapping omitted here...

Restart the grafana service:

systemctl restart grafana-server

# Log in through the UI and watch the logs (press G to jump to the end of the pager)
sudo journalctl -u grafana-server

Configure Dashboards

Add a data source

Log in with the admin account and add Prometheus as a data source:

Configuration (sidebar)
  -> Data sources (sub-page)
    -> Add data source (blue button)
      -> Prometheus (list entry)
        -> Fill in the HTTP address and click Save & Test (blue button)

Add a Dashboard

ZooKeeper ships an official configuration guide and a default Dashboard.

Create (sidebar)
  -> Import (sub-page)
    -> Fill in the HTTP address and click Load (blue button)
      -> Select the Prometheus data source and click Import (blue button)
        -> Open the Dashboard and set the Cluster (dropdown)
            -> Click Save Dashboard in the top-right corner (file icon)

There is no ready-made HBase Dashboard template, so it has to be built by hand following the official documentation.

There is no ready-made Hadoop Dashboard template either; see the article referenced from the official site.

For convenience, here are a few simple Dashboard templates:

Patch HBase

HBase has compatibility issues on JDK 12+, and the latest fix only exists on the 3.0 branch and has not been backported to 2.x, so the patch has to be applied by hand: rebuild and replace hbase-server.jar as described in the linked workaround. To stay consistent with the release package, compile with JDK 8.

In practice, however, the rebuilt jar turned out noticeably smaller than the hbase-server.jar shipped in the release package.
To play it safe, only the patched class file is replaced, as follows:


# Rename the rebuilt hbase-server-2.4.11.jar to patch.jar
# and copy it next to the official hbase-server-2.4.11.jar
unzip patch.jar -d patch
unzip hbase-server-2.4.11.jar -d hbase-server-2.4.11

# Find the path of the target class inside the jar
jar -tf hbase-server-2.4.11.jar | grep HFileSystem.class

# Check that the class file versions match
file patch/org/apache/hadoop/hbase/fs/HFileSystem.class
file hbase-server-2.4.11/org/apache/hadoop/hbase/fs/HFileSystem.class

# Download cfr to decompile the jars
wget https://www.benf.org/other/cfr/cfr-0.152.jar

# Decompile the class files
java -jar cfr-0.152.jar hbase-server-2.4.11/org/apache/hadoop/hbase/fs/HFileSystem.class > A.java
java -jar cfr-0.152.jar patch/org/apache/hadoop/hbase/fs/HFileSystem.class > B.java

# Verify that the change is present
diff A.java B.java

# Once verified, pack the patched class file back into hbase-server-2.4.11.jar
cd patch
jar -uf ../hbase-server-2.4.11.jar org/apache/hadoop/hbase/fs/HFileSystem.class
