Griffin Environment Setup and Functional Testing

Posted by dabiao_z on 2021-12-27

Note 1: The doc section of the Griffin website is fairly sparse; most of the documentation lives in the griffin-doc/ directory of the Griffin source tree.

Note 2: The griffin-doc/ directory of the Griffin source contains a detailed deployment guide; it is worth skimming it before installing.


1 Preparation

To reduce errors during installation, after downloading Griffin, check the measure/pom.xml and service/pom.xml files to determine which versions of the prerequisite software need to be installed in advance.

Software    Version
JDK         jdk1.8.0_301
Scala       scala-2.11.8
MySQL       mysql-5.6.24
Hadoop      hadoop-2.7.3
Hive        hive-2.3.9
Spark       spark-2.2.1
Livy        livy-0.7.1
ES          es-7.16.2
Maven       maven-3.8.4
Griffin     griffin-0.5.0
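
One way to confirm these versions against the source tree is to grep the Maven properties directly (a sketch; the exact property names can differ between Griffin releases, so verify them in the pom files):

# List version properties declared in the two pom files (property names assumed, adjust the pattern as needed)
grep -E "<(spark|scala|hive|hadoop)[^>]*version>" measure/pom.xml
grep -E "<(hive|hadoop|spring|confluent)[^>]*version>" service/pom.xml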

The official tutorial says that npm can be installed in advance with the following commands; this was not tested here.

$ sudo apt install nodejs
$ sudo apt install npm
$ node -v
$ npm -v

mysql

For the installation procedure, see the 尚矽谷 (Atguigu) MySQL installation guide in the file directory.

Create the quartz database in MySQL, then run the Init_quartz_mysql_innodb.sql script to initialize the tables:

mysql> create database quartz;
Query OK, 1 row affected (0.00 sec)

mysql> use quartz;
Database changed

mysql> source /root/Init_quartz_mysql_innodb.sql;
Query OK, 0 rows affected, 1 warning (0.01 sec)
...

mysql> show tables;
+--------------------------+
| Tables_in_quartz         |
+--------------------------+
| QRTZ_BLOB_TRIGGERS       |
| QRTZ_CALENDARS           |
| QRTZ_CRON_TRIGGERS       |
| QRTZ_FIRED_TRIGGERS      |
| QRTZ_JOB_DETAILS         |
| QRTZ_LOCKS               |
| QRTZ_PAUSED_TRIGGER_GRPS |
| QRTZ_SCHEDULER_STATE     |
| QRTZ_SIMPLE_TRIGGERS     |
| QRTZ_SIMPROP_TRIGGERS    |
| QRTZ_TRIGGERS            |
+--------------------------+
11 rows in set (0.00 sec)

hive

For the Hive installation procedure, see here.

Add the following configuration to hive-site.xml:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://bigdata101:9083</value>
</property>

Start the services

nohup /opt/hive-2.3.9/bin/hive --service metastore &
nohup /opt/hive-2.3.9/bin/hive --service hiveserver2 &

Hive 2.x requires both the metastore and hiveserver2 services to be running; otherwise you will hit the error: Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
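
A quick sanity check that both services are actually up (a sketch, assuming the default ports 9083 for metastore and 10000 for hiveserver2):

# Both Hive services should appear as RunJar processes and listen on their default ports
jps | grep RunJar
ss -lnt | grep -E ':9083|:10000'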

hadoop

For the Hadoop installation procedure, see here.

spark

After downloading and extracting, apply the following configuration.

[root@bigdata101 conf]# cat spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://bigdata101:9000/spark_directory
spark.master yarn-client
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.yarn.jars hdfs://bigdata101:9000/home/spark_lib/*
spark.yarn.dist.files hdfs://bigdata101:9000/home/spark_conf/hive-site.xml
spark.sql.broadcastTimeout  500

[root@bigdata101 conf]# cat spark-env.sh
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_HOST=bigdata101
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8082
SPARK_LOCAL_IP=bigdata101
YARN_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop 
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=30 -Dspark.history.fs.logDirectory=hdfs://bigdata101:9000/spark_directory"

Copy the following configuration files

# Copy the lib/datanucleus-*.jar files from Hive into Spark's jars/ directory.
[root@bigdata101 opt]# cp /opt/hive-2.3.9/lib/datanucleus-*.jar $SPARK_HOME/jars/

# Copy conf/hive-site.xml from Hive into Spark's conf/ directory.
[root@bigdata101 opt]# cp /opt/hive-2.3.9/conf/hive-site.xml $SPARK_HOME/conf/

# Create the /home/spark_conf directory on HDFS and upload hive-site.xml to it
[root@bigdata101 opt]# hadoop fs -mkdir -p /home/spark_conf

[root@bigdata101 opt]# hadoop fs -put /opt/hive-2.3.9/conf/hive-site.xml /home/spark_conf/

[root@bigdata101 opt]# hadoop fs -ls /home/spark_conf
Found 1 items
-rw-r--r--   3 root supergroup       1259 2021-12-25 18:18 /home/spark_conf/hive-site.xml

# Create the /home/spark_lib directory on HDFS and upload the files under Spark's jars/ directory to it
[root@bigdata101 opt]# hadoop fs -mkdir /home/spark_lib

[root@bigdata101 opt]# hadoop fs -put $SPARK_HOME/jars/*  hdfs://bigdata101:9000/home/spark_lib/

If LZO is configured in Hadoop, you also need to copy the hadoop-lzo-0.4.21-SNAPSHOT.jar file into Spark's conf/ directory.
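
For reference, a minimal sketch of that copy; the source path is an assumption and depends on where the hadoop-lzo jar was placed when LZO was set up in Hadoop:

# Source path of the hadoop-lzo jar is assumed; adjust it to your Hadoop LZO installation
cp /opt/hadoop-2.7.3/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar $SPARK_HOME/conf/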

livy

Download: http://archive.apache.org/dist/incubator/livy/

[root@bigdata101 livy-0.7.1]# cat conf/livy.conf 
livy.server.host = bigdata101
livy.spark.master = yarn
livy.spark.deployMode = client
livy.repl.enableHiveContext = true
livy.server.port = 8998

Install Spark before starting Livy; otherwise it will fail with errors.

Start it with the command bin/livy-server start.
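
To verify that Livy is up, you can hit its REST API (a sketch; an empty session list is expected right after a fresh start):

# Livy exposes a REST API on the port configured above (8998)
curl http://bigdata101:8998/sessions
# Expected output similar to: {"from":0,"total":0,"sessions":[]}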

es

For the installation procedure, see here.

After starting ES, create the griffin index in ES and verify it:

[zgg@bigdata101 es-7.16.2]$ curl -k -H "Content-Type: application/json" -X PUT http://bigdata101:9200/griffin?include_type_name=true \
>  -d '{
>     "aliases": {},
>     "mappings": {
>         "accuracy": {
>             "properties": {
>                 "name": {
>                     "fields": {
>                         "keyword": {
>                             "ignore_above": 256,
>                             "type": "keyword"
>                         }
>                     },
>                     "type": "text"
>                 },
>                 "tmst": {
>                     "type": "date"
>                 }
>             }
>         }
>     },
>     "settings": {
>         "index": {
>             "number_of_replicas": "2",
>             "number_of_shards": "5"
>         }
>     }
> }'
{"acknowledged":true,"shards_acknowledged":true,"index":"griffin"}


# List all indices
[zgg@bigdata101 es-7.16.2]$ curl -X GET http://bigdata101:9200/_cat/indices
green open .geoip_databases HDSgUJ23Qwe-TI6KZ4e6Rw 1 1 43 46 224.8mb 111.9mb
green open griffin          h9fUJ3IpT9Sxcr4RbZKEKQ 5 2  0  0   3.3kb   1.1kb

# View the contents of the griffin index
[zgg@bigdata101 es-7.16.2]$ curl -X GET http://bigdata101:9200/griffin/_search?pretty
{
  "took" : 27,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

maven

Download: https://maven.apache.org/download.cgi

[root@bigdata101 maven-3.8.4]# cat conf/settings.xml
<!-- Add the Aliyun mirror -->
<mirror> 
	<id>nexus-aliyun</id> 
	<mirrorOf>central</mirrorOf> 
	<name>Nexus aliyun</name> 
	<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror> 
<mirror> 
	<id>UK</id> 
	<name>UK Central</name> 
	<url>http://uk.maven.org/maven2</url> 
	<mirrorOf>central</mirrorOf> 
</mirror> 
<mirror> 
	<id>repo1</id> 
	<mirrorOf>central</mirrorOf> 
	<name>Human Readable Name for this Mirror.</name> 
	<url>http://repo1.maven.org/maven2/</url> 
</mirror> 
<mirror> 
	<id>repo2</id> 
	<mirrorOf>central</mirrorOf> 
	<name>Human Readable Name for this Mirror.</name> 
	<url>http://repo2.maven.org/maven2/</url> 
</mirror>

Configure the environment variables

[root@bigdata101 opt]# cat /etc/profile   
export JAVA_HOME=/opt/jdk1.8.0_301
export ES_JAVA_HOME=/opt/elasticsearch-7.16.2/jdk

export HADOOP_HOME=/opt/hadoop-2.7.3
export MAVEN_HOME=/opt/maven-3.8.4
export SPARK_HOME=/opt/spark-2.2.1
export SCALA_HOME=/opt/scala-2.11.8
export LIVY_HOME=/opt/livy-0.7.1
export HIVE_HOME=/opt/hive-2.3.9
export GRIFFIN_HOME=/opt/griffin-0.5.0
export HADOOP_COMMON_HOME=/opt/hadoop-2.7.3
export HADOOP_COMMON_LIB_NATIVE_DIR=/opt/hadoop-2.7.3/lib/native
export HADOOP_HDFS_HOME=/opt/hadoop-2.7.3
export HADOOP_INSTALL=/opt/hadoop-2.7.3
export HADOOP_MAPRED_HOME=/opt/hadoop-2.7.3
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=/opt/hadoop-2.7.3
export PATH=.:$LIVY_HOME/bin:$GRIFFIN_HOME/bin:$MAVEN_HOME/bin:$SPARK_HOME/bin:$SCALA_HOME/bin:$HIVE_HOME/bin:$HIVE_HOME/sbin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$PATH

2 Installing Griffin

Download: https://griffin.apache.org/docs/download.html

Edit the configuration files

After downloading and extracting the source, edit the configuration files as follows:

Configure service/src/main/resources/application.properties

# The Griffin server listens on port 8080 by default
server.port=8090
spring.application.name=griffin_service
spring.datasource.url=jdbc:mysql://bigdata101:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username=root
spring.datasource.password=000000
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true
# Hive metastore
hive.metastore.uris=thrift://bigdata101:9083
hive.metastore.dbname=default
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
#Hive jdbc
hive.jdbc.className=org.apache.hive.jdbc.HiveDriver
hive.jdbc.url=jdbc:hive2://localhost:10000/
hive.need.kerberos=false
hive.keytab.user=xxx@xx.com
hive.keytab.path=/path/to/keytab/file
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000
# Kafka schema registry
kafka.schema.registry.url=http://bigdata101:8081
# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
# Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000
# schedule predicate job every 5 minutes and repeat 12 times at most
#interval time unit s:second m:minute h:hour d:day,only support these four units
predicate.job.interval=5m
predicate.job.repeat.count=12
# external properties directory location
external.config.location=
# external BATCH or STREAMING env
external.env.location=
# login strategy ("default" or "ldap")
login.strategy=default
# ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})
# hdfs default name
fs.defaultFS=hdfs://bigdata101:9000
# elasticsearch
elasticsearch.host=bigdata101
elasticsearch.port=9200
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password
# livy
livy.uri=http://bigdata101:8998/batches
livy.need.queue=false
livy.task.max.concurrent.count=20
livy.task.submit.interval.second=3
livy.task.appId.retry.count=3
livy.need.kerberos=false
livy.server.auth.kerberos.principal=livy/kerberos.principal
livy.server.auth.kerberos.keytab=/path/to/livy/keytab/file
# yarn url
yarn.uri=http://bigdata102:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook

logging.file=logs/griffin-service.log

Configure service/src/main/resources/quartz.properties

org.quartz.scheduler.instanceName=spring-boot-quartz
org.quartz.scheduler.instanceId=AUTO
org.quartz.threadPool.threadCount=5
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
# If you use postgresql as your database,set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
# If you use mysql as your database,set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate
# If you use h2 as your database, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.useProperties=true
org.quartz.jobStore.misfireThreshold=60000
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000

Configure service/src/main/resources/sparkProperties.json

{
  "file": "hdfs://bigdata101:9000/griffin/griffin-measure.jar",
  "className": "org.apache.griffin.measure.Application",
  "queue": "default",
  "numExecutors": 2,
  "executorCores": 1,
  "driverMemory": "1g",
  "executorMemory": "1g",
  "conf": {
    "spark.yarn.dist.files": "hdfs://bigdata101:9000/home/spark_conf/hive-site.xml"
  },
  "files": [
  ]
}

Configure service/src/main/resources/env/env_batch.json

You can also create a standalone env file and point the job at it when submitting; the same applies to env_streaming.json.

{
  "spark": {
    "log.level": "INFO"
  },
  "sinks": [
    {
      "name": "console",
      "type": "CONSOLE",
      "config": {
        "max.log.lines": 10
      }
    },
    {
      "name": "hdfs",
      "type": "HDFS",
      "config": {
        "path": "hdfs://bigdata101:9000/griffin/persist",
        "max.persist.lines": 10000,
        "max.lines.per.file": 10000
      }
    },
    {
      "name": "elasticsearch",
      "type": "ELASTICSEARCH",
      "config": {
        "method": "post",
        "api": "hdfs://bigdata101:9000/griffin/accuracy",
        "connection.timeout": "1m",
        "retry": 10
      }
    }
  ],
  "griffin.checkpoint": []
}

Configure service/src/main/resources/env/env_streaming.json

{
  "spark": {
    "log.level": "WARN",
    "checkpoint.dir": "hdfs://bigdata101:9000/griffin/checkpoint/${JOB_NAME}",
    "init.clear": true,
    "batch.interval": "1m",
    "process.interval": "5m",
    "config": {
      "spark.default.parallelism": 4,
      "spark.task.maxFailures": 5,
      "spark.streaming.kafkaMaxRatePerPartition": 1000,
      "spark.streaming.concurrentJobs": 4,
      "spark.yarn.maxAppAttempts": 5,
      "spark.yarn.am.attemptFailuresValidityInterval": "1h",
      "spark.yarn.max.executor.failures": 120,
      "spark.yarn.executor.failuresValidityInterval": "1h",
      "spark.hadoop.fs.hdfs.impl.disable.cache": true
    }
  },
  "sinks": [
    {
      "type": "CONSOLE",
      "config": {
        "max.log.lines": 100
      }
    },
    {
      "type": "HDFS",
      "config": {
        "path": "hdfs://bigdata101:9000/griffin/persist",
        "max.persist.lines": 10000,
        "max.lines.per.file": 10000
      }
    },
    {
      "type": "ELASTICSEARCH",
      "config": {
        "method": "post",
        "api": "hdfs://bigdata101:9000/griffin/accuracy"
      }
    }
  ],
  "griffin.checkpoint": [
    {
      "type": "zk",
      "config": {
        "hosts": "zk:2181",
        "namespace": "griffin/infocache",
        "lock.path": "lock",
        "mode": "persist",
        "init.clear": true,
        "close.clear": false
      }
    }
  ]
}

Modify service/pom.xml

<!-- <dependency> 
	<groupId>org.postgresql</groupId> 
	<artifactId>postgresql</artifactId> 
	<version>${postgresql.version}</version> 
</dependency> --> 
<dependency> 
	<groupId>mysql</groupId> 
	<artifactId>mysql-connector-java</artifactId> 
</dependency> 
<!-- Note: leave out the version tag here -->

Build

[root@bigdata101 griffin-0.5.0-src]# mvn clean
[root@bigdata101 griffin-0.5.0-src]# mvn -T2C install -DskipTests

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Griffin 0.5.0 0.5.0:
[INFO] 
[INFO] Apache Griffin 0.5.0 ............................... SUCCESS [  3.069 s]
[INFO] Apache Griffin :: UI :: Default UI ................. SUCCESS [02:05 min]
[INFO] Apache Griffin :: Web Service ...................... SUCCESS [01:02 min]
[INFO] Apache Griffin :: Measures ......................... SUCCESS [01:39 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  02:08 min (Wall Clock)
[INFO] Finished at: 2021-12-26T12:04:02+08:00
[INFO] ------------------------------------------------------------------------

Deploy the jars

# The environment variables were already configured above
[root@bigdata101 opt]# mkdir griffin-0.5.0

# Copy measure-0.5.0.jar into the griffin-0.5.0 directory and rename it
[root@bigdata101 griffin-0.5.0-src]# cp measure/target/measure-0.5.0.jar $GRIFFIN_HOME/griffin-measure.jar

# Copy service-0.5.0.jar into the griffin-0.5.0 directory and rename it
[root@bigdata101 griffin-0.5.0-src]# cp service/target/service-0.5.0.jar $GRIFFIN_HOME/griffin-service.jar

[root@bigdata101 griffin-0.5.0]# ls
griffin-measure.jar  griffin-service.jar

[root@bigdata101 griffin-0.5.0]# hadoop fs -mkdir -p /griffin/persist

[root@bigdata101 griffin-0.5.0]# hadoop fs -mkdir -p /griffin/checkpoint

# Upload griffin-measure.jar to HDFS
[root@bigdata101 griffin-0.5.0]# hadoop fs -put $GRIFFIN_HOME/griffin-measure.jar /griffin/

[root@bigdata101 griffin-0.5.0]# hadoop fs -ls /griffin
Found 3 items
drwxr-xr-x   - root supergroup          0 2021-12-26 12:27 /griffin/checkpoint
-rw-r--r--   3 root supergroup   46538384 2021-12-26 12:27 /griffin/griffin-measure.jar
drwxr-xr-x   - root supergroup          0 2021-12-26 12:27 /griffin/persist
[root@bigdata101 griffin-0.5.0]# nohup java -jar $GRIFFIN_HOME/griffin-service.jar >$GRIFFIN_HOME/service.out 2>&1 &

[root@bigdata101 griffin-0.5.0]# jps
2098 DataNode
2996 Master
3748 RunJar
2646 -- process information unavailable
2374 NodeManager
5206 griffin-service.jar
5271 Jps
3064 Worker
1994 NameNode
2555 LivyServer
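
Before opening the UI, it can help to confirm that the service actually finished starting (a sketch; port 8090 is the value set in application.properties above):

# Check the service log for startup errors and confirm the port is listening
tail -n 50 $GRIFFIN_HOME/service.out
ss -lnt | grep 8090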

Open http://bigdata101:8090/ in a browser:

[Screenshot: Griffin web UI login page]

There is no default account or password; just click Login.

For examples of using the UI, see here.

3 Batch Test

Get the test data: https://griffin.apache.org/data/batch

[root@bigdata101 griffin-0.5.0]# mkdir data/

wget https://griffin.apache.org/data/batch/create-table.hql
wget https://griffin.apache.org/data/batch/delta_src
wget https://griffin.apache.org/data/batch/delta_tgt
wget https://griffin.apache.org/data/batch/demo_basic
wget https://griffin.apache.org/data/batch/demo_src
wget https://griffin.apache.org/data/batch/demo_tgt
wget https://griffin.apache.org/data/batch/gen-hive-data.sh
wget https://griffin.apache.org/data/batch/gen_delta_src.sh
wget https://griffin.apache.org/data/batch/gen_demo_data.sh
wget https://griffin.apache.org/data/batch/insert-data.hql.template

[root@bigdata101 data]# ls
create-table.hql  delta_tgt   demo_src  gen_delta_src.sh  gen-hive-data.sh
delta_src         demo_basic  demo_tgt  gen_demo_data.sh  insert-data.hql.template

[root@bigdata101 data]# chmod 777 *

# Modify the LOCATION paths in create-table.hql
[root@bigdata101 data]# vi create-table.hql
...
LOCATION
  'hdfs://bigdata101:9000/griffin/data/batch/demo_src';
...
LOCATION
  'hdfs://bigdata101:9000/griffin/data/batch/demo_tgt';


# Generate test data
[root@bigdata101 data]# ./gen-hive-data.sh

hive> select * from demo_src limit 5; 
OK
0       1       1       20211226        13
0       2       2       20211226        13
0       3       3       20211226        13
0       4       4       20211226        13
0       5       5       20211226        13

hive> desc demo_src;
OK
id                      bigint                                      
age                     int                                         
desc                    string                                      
dt                      string                                      
hour                    string                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
dt                      string                                      
hour                    string  

hive> select * from demo_tgt limit 5; 
OK
0       1       1       20211226        13
0       2       2       20211226        13
0       3       3       20211226        13
0       4       4       20211226        13
0       5       5       20211226        13

hive> desc demo_tgt;
OK
id                      bigint                                      
age                     int                                         
desc                    string                                      
dt                      string                                      
hour                    string                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
dt                      string                                      
hour                    string                                      
Time taken: 0.468 seconds, Fetched: 11 row(s) 

Accuracy Measure

Create the rule

[root@bigdata101 griffin-0.5.0]# mkdir measures
[root@bigdata101 measures]# vi dq.json
{
  "name": "batch_accu",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src",
      "baseline": true,
      "connectors": [
        {
          "type": "hive",
          "version": "2.3",
          "config": {
            "database": "default",
            "table.name": "demo_src"
          }
        }
      ]
    }, {
      "name": "tgt",
      "connectors": [
        {

          "type": "hive",
          "version": "2.3",
          "config": {
            "database": "default",
            "table.name": "demo_tgt"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "out.dataframe.name": "accu",
        "rule": "src.id = tgt.id AND src.age = tgt.age AND src.desc = tgt.desc",
        "details": {
          "source": "src",
          "target": "tgt",
          "miss": "miss_count",
          "total": "total_count",
          "matched": "matched_count"
        },
        "out": [
          {
            "type": "metric",
            "name": "accu"
          },
          {
            "type": "record",
            "name": "missRecords"
          }
        ]
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS"]
}

Specify where the results are stored

[root@bigdata101 measures]# vi env.json
{
  "spark": {
    "log.level": "WARN"
  },
  "sinks": [
    {
      "type": "CONSOLE",
      "config": {
        "max.log.lines": 10
      }
    },
    {
      "type": "HDFS",
      "config": {
        "path": "hdfs://bigdata101:9000/griffin/persist",
        "max.persist.lines": 10000,
        "max.lines.per.file": 10000
      }
    },
    {
      "type": "ELASTICSEARCH",
      "config": {
        "method": "post",
        "api": "http://bigdata101:9200/griffin/accuracy",
        "connection.timeout": "1m",
        "retry": 10
      }
    }
  ],
  "griffin.checkpoint": []
}

Submit the job

[root@bigdata101 griffin-0.5.0]# ls
data  griffin-measure.jar  griffin-service.jar  measures  service.out
[root@bigdata101 griffin-0.5.0]# ls measures/
dq.json  env.json

[root@bigdata101 griffin-0.5.0]# spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default --driver-memory 1g --executor-memory 1g --num-executors 2 griffin-measure.jar measures/env.json measures/dq.json
......
data source timeRanges: src -> (1640586475986, 1640586475986], tgt -> (1640586475986, 1640586475986]
[1640586475986] batch_accu start: application_1640586390189_0001
batch_accu [1640586475986] metrics: 
{"name":"batch_accu","tmst":1640586475986,"value":{"total_count":375000,"miss_count":1475,"matched_count":373525,"matchedFraction":0.9960666666666667},"applicationId":"application_1640586390189_0001"}
[1640586475986] 1640586626794: process using time: 150808 ms
[1640586475986] batch_accu finish

Check the /griffin/persist/ directory on HDFS

[root@bigdata101 opt]# hadoop fs -ls /griffin/persist/batch_accu/1640586475986
Found 5 items
-rw-r--r--   1 root supergroup          0 2021-12-27 14:30 /griffin/persist/batch_accu/1640586475986/_FINISH
-rw-r--r--   1 root supergroup        138 2021-12-27 14:30 /griffin/persist/batch_accu/1640586475986/_LOG
-rw-r--r--   1 root supergroup        200 2021-12-27 14:30 /griffin/persist/batch_accu/1640586475986/_METRICS
-rw-r--r--   1 root supergroup         30 2021-12-27 14:28 /griffin/persist/batch_accu/1640586475986/_START
-rw-r--r--   1 root supergroup     128324 2021-12-27 14:30 /griffin/persist/batch_accu/1640586475986/missRecords


[root@bigdata101 opt]# hadoop fs -cat /griffin/persist/batch_accu/1640586475986/_START
application_1640586390189_0001


[root@bigdata101 opt]# hadoop fs -cat /griffin/persist/batch_accu/1640586475986/_LOG
================ log of Mon Dec 27 14:27:55 CST 2021 ================
--- Mon Dec 27 14:30:26 CST 2021 ---
process using time: 150808 ms


[root@bigdata101 opt]# hadoop fs -cat /griffin/persist/batch_accu/1640586475986/_METRICS
{"name":"batch_accu","tmst":1640586475986,"value":{"total_count":375000,"miss_count":1475,"matched_count":373525,"matchedFraction":0.9960666666666667},"applicationId":"application_1640586390189_0001"}


[root@bigdata101 opt]# hadoop fs -cat /griffin/persist/batch_accu/1640586475986/missRecords
....
{"id":124,"age":1628,"desc":"1628","dt":"20211227","hour":"13","__tmst":1640586475986}
{"id":124,"age":1628,"desc":"1628","dt":"20211227","hour":"14","__tmst":1640586475986}
{"id":124,"age":1479,"desc":"1479","dt":"20211227","hour":"15","__tmst":1640586475986}
{"id":124,"age":1479,"desc":"1479","dt":"20211227","hour":"15","__tmst":1640586475986}
{"id":124,"age":1555,"desc":"1555","dt":"20211227","hour":"14","__tmst":1640586475986}
{"id":124,"age":1555,"desc":"1555","dt":"20211227","hour":"13","__tmst":1640586475986}

Profiling Measure

Profile the demo_src table: total id count, maximum age, and maximum desc length.

In expressions such as src.id.count(), "src" refers to the data source's "name": "src" defined below, not the Hive table name.

Create the rule

[root@bigdata101 measures]# vi dq2.json
{
  "name": "batch_Prof",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src",
      "baseline": true,
      "connectors": [
        {
          "type": "hive",
          "version": "2.3",
          "config": {
            "database": "default",
            "table.name": "demo_src"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "profiling",
        "out.dataframe.name": "prof",
        "rule": "src.id.count() as id_count,src.age.max() as age_max,src.desc.length().max() as desc_length_max",
        "out": [
          {
            "type": "metric",
            "name": "prof"
          }
        ]
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS"]
}

Submit the job using the env.json from the first test:

[root@bigdata101 griffin-0.5.0]# spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default --driver-memory 1g --executor-memory 1g --num-executors 2 griffin-measure.jar measures/env.json measures/dq2.json
...
{"name":"batch_prof","tmst":1640589832672,"value":{"id_count":375000,"age_max":2000,"desc_length_max":4},"applicationId":"application_1640586390189_0004"}

Writing rules

The rule can also be plain SQL, for example outputting the demo_src table sorted by age in descending order.

Note that the query refers to the data source name, not the Hive table name.

Create the rule

[root@bigdata101 measures]# vi dq3.json
{
  "name": "batch_prof",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src",
      "baseline": true,
      "connectors": [
        {
          "type": "hive",
          "version": "2.3",
          "config": {
            "database": "default",
            "table.name": "demo_src"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "profiling",
        "out.dataframe.name": "prof",
        "rule": "select age from src order by age desc limit 5",
        "out": [
          {
            "type": "metric",
            "name": "prof"
          }
        ]
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS"]
}

Submit the job using the env.json from the first test:

[root@bigdata101 griffin-0.5.0]# spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default --driver-memory 1g --executor-memory 1g --num-executors 2 griffin-measure.jar measures/env.json measures/dq3.json
...
{"name":"batch_prof","tmst":1640590110604,"value":{"prof":[{"age":2000},{"age":2000},{"age":2000},{"age":1999},{"age":1999}]},"applicationId":"application_1640586390189_0005"}

4 Streaming Test [has problems, unresolved]

Get the test data: https://griffin.apache.org/data/streaming/

wget https://griffin.apache.org/data/streaming/gen-data.sh --no-check-certificate
wget https://griffin.apache.org/data/streaming/source.temp --no-check-certificate
wget https://griffin.apache.org/data/streaming/streaming-data.sh --no-check-certificate
wget https://griffin.apache.org/data/streaming/target.temp --no-check-certificate

[root@bigdata101 data]# chmod 777 *

# First review streaming-data.sh and gen-data.sh and adjust them to match your environment

# execute `./streaming-data.sh` to generate json string data file and produce them into kafka topics minutely.
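
If the script does not create the Kafka topics itself, they can be created beforehand. This is a sketch: the topic names source and target match the connectors used in dq4.json below, and the --zookeeper flag assumes an older Kafka CLI (newer versions use --bootstrap-server instead).

# Create the source and target topics used by the streaming test (flags depend on your Kafka version)
kafka-topics.sh --create --zookeeper bigdata101:2181 --replication-factor 1 --partitions 1 --topic source
kafka-topics.sh --create --zookeeper bigdata101:2181 --replication-factor 1 --partitions 1 --topic target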

Create the rule

[root@bigdata101 measures]# vi dq4.json
{
  "name": "streaming_accu",
  "process.type": "streaming",
  "data.sources": [
    {
      "name": "src",
      "baseline": true,
      "connectors": [
        {
          "type": "kafka",
          "version": "2.3",
          "config": {
            "kafka.config": {
              "bootstrap.servers": "bigdata101:9092,bigdata102:9092,bigdata103:9092",
              "group.id": "griffin",
              "auto.offset.reset": "largest",
              "auto.commit.enable": "false"
            },
            "topics": "source",
            "key.type": "java.lang.String",
            "value.type": "java.lang.String"
          },
          "pre.proc": [
            {
              "dsl.type": "df-opr",
              "rule": "from_json"
            }
          ]
        }
      ],
      "checkpoint": {
        "type": "json",
        "file.path": "hdfs://bigdata101:9000/griffin/streaming/dump/source",
        "info.path": "source",
        "ready.time.interval": "10s",
        "ready.time.delay": "0",
        "time.range": ["-5m", "0"],
        "updatable": true
      }
    }, {
      "name": "tgt",
      "connectors": [
        {
          "type": "kafka",
          "version": "2.3",
          "config": {
            "kafka.config": {
              "bootstrap.servers": "bigdata101:9092,bigdata102:9092,bigdata103:9092",
              "group.id": "griffin",
              "auto.offset.reset": "largest",
              "auto.commit.enable": "false"
            },
            "topics": "target",
            "key.type": "java.lang.String",
            "value.type": "java.lang.String"
          },
          "pre.proc": [
            {
              "dsl.type": "df-opr",
              "rule": "from_json"
            }
          ]
        }
      ],
      "checkpoint": {
        "type": "json",
        "file.path": "hdfs://bigdata101:9000/griffin/streaming/dump/target",
        "info.path": "target",
        "ready.time.interval": "10s",
        "ready.time.delay": "0",
        "time.range": ["-1m", "0"]
      }
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "out.dataframe.name": "accu",
        "rule": "src.id = tgt.id AND src.name = tgt.name AND src.color = tgt.color AND src.time = tgt.time",
        "details": {
          "source": "src",
          "target": "tgt",
          "miss": "miss_count",
          "total": "total_count",
          "matched": "matched_count"
        },
        "out":[
          {
            "type":"metric",
            "name": "accu"
          },
          {
            "type":"record",
            "name": "missRecords"
          }
        ]
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS"]
}

env.json

[root@bigdata101 measures]# vi env.json
{
  "spark": {
    "log.level": "WARN",
    "checkpoint.dir": "hdfs://bigdata101:9000/griffin/checkpoint",
    "batch.interval": "20s",
    "process.interval": "1m",
    "init.clear": true,
    "config": {
      "spark.default.parallelism": 4,
      "spark.task.maxFailures": 5,
      "spark.streaming.kafkaMaxRatePerPartition": 1000,
      "spark.streaming.concurrentJobs": 4,
      "spark.yarn.maxAppAttempts": 5,
      "spark.yarn.am.attemptFailuresValidityInterval": "1h",
      "spark.yarn.max.executor.failures": 120,
      "spark.yarn.executor.failuresValidityInterval": "1h",
      "spark.hadoop.fs.hdfs.impl.disable.cache": true
    }
  },
  "sinks": [
    {
      "type": "console"
    },
    {
      "type": "hdfs",
      "config": {
        "path": "hdfs://bigdata101:9000/griffin/persist"
      }
    },
    {
      "type": "elasticsearch",
      "config": {
        "method": "post",
        "api": "http://bigdata101:9200/griffin/accuracy"
      }
    }
  ],
  "griffin.checkpoint": [
    {
      "type": "zk",
      "config": {
        "hosts": "bigdata101:2181,bigdata102:2182,bigdata103:2181",
        "namespace": "griffin/infocache",
        "lock.path": "lock",
        "mode": "persist",
        "init.clear": true,
        "close.clear": false
      }
    }
  ]
}

Start Kafka and run ./streaming-data.sh to produce data.

[root@bigdata101 data]# ./streaming-data.sh
>>>>>>>>>>>>>>>>>>>>>>insert data at 2021-12-27_16:15:54
>>>>>>>>>>>>>>>>>>>>>>insert data at 2021-12-27_16:16:57
....
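
Before submitting the job, it is worth confirming that data is actually arriving in Kafka (a sketch; the bootstrap server address is taken from the dq4.json config above):

# Consume a few records from the source topic to confirm the generator is producing data
kafka-console-consumer.sh --bootstrap-server bigdata101:9092 --topic source --from-beginning --max-messages 5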

Submit the job [possibly failing due to insufficient memory; still to be verified]

[root@bigdata101 griffin-0.5.0]# spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default --driver-memory 1g --executor-memory 1g --num-executors 2 griffin-measure.jar measures/env.json measures/dq4.json
....
21/12/27 21:38:41 ERROR offset.OffsetCheckpointInZK: delete /lock error: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/lock
21/12/27 21:38:41 ERROR offset.OffsetCheckpointInZK: delete /info.final error: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info.final
21/12/27 21:38:41 ERROR offset.OffsetCheckpointInZK: delete /info error: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info
21/12/27 21:38:49 WARN streaming.Checkpoint: Checkpoint directory hdfs://bigdata101:9000/griffin/checkpoint does not exist
21/12/27 21:39:10 WARN offset.OffsetCheckpointInZK: read /info.final/last.proc.time warn: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info.final/last.proc.time
21/12/27 21:39:10 WARN offset.OffsetCheckpointInZK: read /info.final/ready.time warn: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info.final/ready.time
21/12/27 21:39:14 WARN offset.OffsetCheckpointInZK: read /info/source/old.cache.index warn: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info/source/old.cache.index
21/12/27 21:39:14 WARN offset.OffsetCheckpointInZK: read /info.final/last.proc.time warn: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info.final/last.proc.time
21/12/27 21:39:14 WARN offset.OffsetCheckpointInZK: read /info.final/ready.time warn: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info.final/ready.time
data source timeRanges: src -> (-300001, -1], tgt -> (-60001, -1]
[1640612329475] streaming_accu start: application_1640606908996_0015
Killed
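
A first debugging step (a sketch; the application id comes from the output above) is to pull the YARN container logs, which usually show whether executors or the driver were killed for exceeding memory limits:

# Fetch the aggregated logs for the failed streaming application
yarn logs -applicationId application_1640606908996_0015 | less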

5 Issues

Issue 1

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Griffin 0.6.0 0.6.0:
[INFO] 
[INFO] Apache Griffin 0.6.0 ............................... SUCCESS [  6.164 s]
[INFO] Apache Griffin :: UI :: Default UI ................. FAILURE [06:54 min]
[INFO] Apache Griffin :: Web Service ...................... SUCCESS [02:10 min]
[INFO] Apache Griffin :: Measures ......................... SUCCESS [02:44 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  07:01 min (Wall Clock)
[INFO] Finished at: 2021-12-26T11:55:52+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.6:npm (npm install) on project ui: Failed to run task: 'npm install' failed. org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]

Modify the following in ui/pom.xml.

Original:

<!-- It will install nodejs and npm -->
<execution>
    <id>install node and npm</id>
    <goals>
        <goal>install-node-and-npm</goal>
    </goals>
    <configuration>
		<nodeVersion>${node.version}</nodeVersion>
        <npmVersion>${npm.version}</npmVersion>
    </configuration>
</execution>

After modification:

<!-- It will install nodejs and npm -->
<execution>
    <id>install node and npm</id>
    <goals>
        <goal>install-node-and-npm</goal>
    </goals>
    <configuration>
        <nodeVersion>v10.16.3</nodeVersion>
        <npmVersion>6.11.3</npmVersion>
        <nodeDownloadRoot>http://nodejs.org/dist/</nodeDownloadRoot>
        <!-- <npmDownloadRoot>http://registry.npmjs.org/npm/</npmDownloadRoot> -->
        <npmDownloadRoot>http://registry.npm.taobao.org/npm/</npmDownloadRoot>
    </configuration>
</execution>

If the following problem occurs, delete the <nodeDownloadRoot> and <npmDownloadRoot> lines.

[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.6:install-node-and-npm (install node and npm) on project ui: Could not download npm: Got error code 404 from the server. -> [Help 1]

Issue 2

[ERROR]
[ERROR] ERROR in /opt/soft/griffin-0.5.0/ui/angular/node_modules/@types/jquery/JQuery.d.ts (4137,26): Cannot find name 'SVGElementTagNameMap'.
[ERROR] ERROR in /opt/soft/griffin-0.5.0/ui/angular/node_modules/@types/jquery/JQuery.d.ts (4137,89): Cannot find name 'SVGElementTagNameMap'.
[ERROR]
[ERROR] npm ERR! Linux 3.10.0-514.el7.x86_64
[ERROR] npm ERR! argv "/opt/soft/griffin-0.5.0/ui/.tmp/node/node" "/opt/soft/griffin-0.5.0/ui/.tmp/node/node_modules/npm/bin/npm-cli.js" "run" "build         "
[ERROR] npm ERR! node v6.11.3
[ERROR] npm ERR! npm  v3.10.10
[ERROR] npm ERR! code ELIFECYCLE
[ERROR] npm ERR! griffin@0.0.0 build: `ng build`
[ERROR] npm ERR! Exit status 1
[ERROR] npm ERR!
[ERROR] npm ERR! Failed at the griffin@0.0.0 build script 'ng build'.
[ERROR] npm ERR! Make sure you have the latest version of node.js and npm installed.
[ERROR] npm ERR! If you do, this is most likely a problem with the griffin package,
[ERROR] npm ERR! not with npm itself.
[ERROR] npm ERR! Tell the author that this fails on your system:
[ERROR] npm ERR!     ng build
[ERROR] npm ERR! You can get information on how to open an issue for this project with:
[ERROR] npm ERR!     npm bugs griffin
[ERROR] npm ERR! Or if that isn't available, you can get their info via:
[ERROR] npm ERR!     npm owner ls griffin
[ERROR] npm ERR! There is likely additional logging output above.
[ERROR]
[ERROR] npm ERR! Please include the following file with any support request:
[ERROR] npm ERR!     /opt/soft/griffin-0.5.0/ui/angular/npm-debug.log
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Griffin 0.5.0 0.5.0:
[INFO]
[INFO] Apache Griffin 0.5.0 ............................... SUCCESS [ 16.497 s]
[INFO] Apache Griffin :: UI :: Default UI ................. FAILURE [24:12 min]
[INFO] Apache Griffin :: Web Service ...................... SKIPPED
[INFO] Apache Griffin :: Measures ......................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  24:28 min
[INFO] Finished at: 2020-06-30T15:04:04+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.6:npm (npm build) on project ui: Failed to run task: 'npm run build' faile         d. org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command

vi /opt/griffin-0.5.0-source/ui/angular/node_modules/@types/jquery/JQuery.d.ts

Delete line 4137:

find<K extends keyof SVGElementTagNameMap>(selector_element: K | JQuery<K>): JQuery<SVGElementTagNameMap[K]>;
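
The same fix can be applied non-interactively (a sketch; it deletes every line mentioning SVGElementTagNameMap, so check first that only the declaration shown above matches):

# Check which lines mention SVGElementTagNameMap, then remove them from the bundled jquery type definitions
grep -n "SVGElementTagNameMap" /opt/griffin-0.5.0-source/ui/angular/node_modules/@types/jquery/JQuery.d.ts
sed -i '/SVGElementTagNameMap/d' /opt/griffin-0.5.0-source/ui/angular/node_modules/@types/jquery/JQuery.d.ts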

Issue 3

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Griffin 0.5.0 0.5.0:
[INFO]
[INFO] Apache Griffin 0.5.0 ............................... SUCCESS [  9.715 s]
[INFO] Apache Griffin :: UI :: Default UI ................. SKIPPED
[INFO] Apache Griffin :: Web Service ...................... FAILURE [02:55 min]
[INFO] Apache Griffin :: Measures ......................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  03:08 min (Wall Clock)
[INFO] Finished at: 2019-05-27T14:27:07+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project service: Could not resolve dependencies for project org.apache.griffin:service:jar:0.5.0: Could not find artifact io.confluent:kafka-schema-registry-client:jar:3.2.0 in nexus-aliyun (http://maven.aliyun.com/nexus/content/groups/public) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :service

Download kafka-schema-registry-client and install it into the local repository manually:

mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-schema-registry-client -Dversion=3.2.0 -Dpackaging=jar -Dfile=kafka-schema-registry-client-3.2.0.jar
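
If the jar is not already available locally, it can usually be fetched from the Confluent Maven repository (a sketch; the repository layout is an assumption, so verify the URL before relying on it):

# Download the missing artifact, then run the install-file command above
wget http://packages.confluent.io/maven/io/confluent/kafka-schema-registry-client/3.2.0/kafka-schema-registry-client-3.2.0.jar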

Issue 4 [unresolved]

griffin-0.6.0 built successfully, but running nohup java -jar $GRIFFIN_HOME/griffin-service.jar >$GRIFFIN_HOME/service.out 2>&1 & produced the following errors in turn:

no main manifest attribute, in /opt/griffin-0.6.0/griffin-service.jar

This can reportedly be fixed by adding the Spring Boot build plugin to the pom and re-running mvn install. (Not tested; from here.)

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: javax/servlet/Filter

Caused by: java.lang.ClassNotFoundException: javax.servlet.Filter

This problem did not occur with version 0.5.0.

Issue 5

When submitting the Spark job during testing, the following error appeared:

NoSuchMethodError: org.apache.spark.SparkConf.setAll(Lscala/collection/Traversable;)Lorg/apache/spark/SparkConf

The installed Spark version does not match the version Griffin was built against; check measure/pom.xml and switch to the corresponding Spark version.
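
A quick way to see which Spark version the measure module was built against (a sketch; the exact property name in the pom may vary by Griffin release):

# Look for the Spark version declared in the measure module's pom
grep -n "spark" measure/pom.xml | grep -i version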

Issue 6

When submitting the Spark job during testing, the following error appeared:

ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found

hadoop-lzo-0.4.21-SNAPSHOT.jar 檔案複製到 spark 的 conf/ 目錄中。

Issue 7

When testing streaming processing, the exception java.io.CharConversionException: Invalid UTF-32 character 0x8000000 (above 10ffff) at char #1, byte #7 is caused by Spark not being configured with LZO support.


References:

A summary of many common errors: https://blog.csdn.net/weixin_43160819/article/details/118298681

https://blog.csdn.net/weixin_40004348/article/details/107191430

https://blog.csdn.net/github_39577257/article/details/90607081

尚矽谷 (Atguigu) Griffin tutorial

https://blog.csdn.net/guanghuotainkong/article/details/112168871
