Note 1: The tutorials in the official Griffin website docs are fairly thin; most of the documentation lives in the griffin-doc/ directory of the Griffin source tree.
Note 2: The griffin-doc/ directory of the Griffin source contains detailed deployment guides; it is worth skimming them before installing.
1 Preparation
To reduce errors during installation, check the measure/pom.xml and service/pom.xml files after downloading Griffin to determine which software versions need to be installed in advance.
Software | Version |
---|---|
JDK | jdk1.8.0_301 |
Scala | scala2.11.8 |
MySQL | mysql5.6.24 |
Hadoop | hadoop-2.7.3 |
Hive | hive-2.3.9 |
Spark | spark-2.2.1 |
Livy | livy-0.7.1 |
ES | es-7.16.2 |
MAVEN | maven-3.8.4 |
griffin | griffin0.5.0 |
According to the official tutorial, npm can be installed in advance with the following commands (not tested here):
$ sudo apt install nodejs
$ sudo apt install npm
$ node -v
$ npm -v
mysql
For installation instructions, see the 尚矽谷MySQL安裝 document in the file directory.
Create a quartz database in MySQL, then run the Init_quartz_mysql_innodb.sql script to initialize the tables:
mysql> create database quartz;
Query OK, 1 row affected (0.00 sec)
mysql> use quartz;
Database changed
mysql> source /root/Init_quartz_mysql_innodb.sql;
Query OK, 0 rows affected, 1 warning (0.01 sec)
...
mysql> show tables;
+--------------------------+
| Tables_in_quartz |
+--------------------------+
| QRTZ_BLOB_TRIGGERS |
| QRTZ_CALENDARS |
| QRTZ_CRON_TRIGGERS |
| QRTZ_FIRED_TRIGGERS |
| QRTZ_JOB_DETAILS |
| QRTZ_LOCKS |
| QRTZ_PAUSED_TRIGGER_GRPS |
| QRTZ_SCHEDULER_STATE |
| QRTZ_SIMPLE_TRIGGERS |
| QRTZ_SIMPROP_TRIGGERS |
| QRTZ_TRIGGERS |
+--------------------------+
11 rows in set (0.00 sec)
hive
For the Hive installation guide, see here.
Add the following configuration to hive-site.xml:
<property>
<name>hive.metastore.uris</name>
<value>thrift://bigdata101:9083</value>
</property>
Start the services
nohup /opt/hive-2.3.9/bin/hive --service metastore &
nohup /opt/hive-2.3.9/bin/hive --service hiveserver2 &
Hive 2.x requires both the metastore and hiveserver2 services to be running; otherwise you will hit: Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
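As a quick sanity check (assuming the default ports implied by the configuration above), confirm that the metastore (9083) and HiveServer2 (10000) are listening:
# check that the metastore and HiveServer2 ports are open
ss -tlnp | grep -E '9083|10000'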
hadoop
For the Hadoop installation guide, see here.
spark
After downloading and extracting Spark, apply the following configuration.
[root@bigdata101 conf]# cat spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://bigdata101:9000/spark_directory
spark.master yarn-client
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.yarn.jars hdfs://bigdata101:9000/home/spark_lib/*
spark.yarn.dist.files hdfs://bigdata101:9000/home/spark_conf/hive-site.xml
spark.sql.broadcastTimeout 500
[root@bigdata101 conf]# cat spark-env.sh
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_HOST=bigdata101
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8082
SPARK_LOCAL_IP=bigdata101
YARN_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=30 -Dspark.history.fs.logDirectory=hdfs://bigdata101:9000/spark_directory"
Copy the following configuration files
# Copy the datanucleus-*.jar files from Hive's lib/ directory into Spark's jars/ directory.
[root@bigdata101 opt]# cp /opt/hive-2.3.9/lib/datanucleus-*.jar $SPARK_HOME/jars/
# Copy hive-site.xml from Hive's conf/ directory into Spark's conf/ directory.
[root@bigdata101 opt]# cp /opt/hive-2.3.9/conf/hive-site.xml $SPARK_HOME/conf/
# Create the /home/spark_conf directory on HDFS and upload hive-site.xml to it
[root@bigdata101 opt]# hadoop fs -mkdir -p /home/spark_conf
[root@bigdata101 opt]# hadoop fs -put /opt/hive-2.3.9/conf/hive-site.xml /home/spark_conf/
[root@bigdata101 opt]# hadoop fs -ls /home/spark_conf
Found 1 items
-rw-r--r-- 3 root supergroup 1259 2021-12-25 18:18 /home/spark_conf/hive-site.xml
# Create the /home/spark_lib directory on HDFS and upload the files in Spark's jars/ directory to it
[root@bigdata101 opt]# hadoop fs -mkdir /home/spark_lib
[root@bigdata101 opt]# hadoop fs -put $SPARK_HOME/jars/* hdfs://bigdata101:9000/home/spark_lib/
If LZO is configured in Hadoop, you also need to copy the hadoop-lzo-0.4.21-SNAPSHOT.jar file into Spark's conf/ directory.
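To confirm that Spark can reach the Hive metastore after copying these files, a quick check (assuming the metastore service started earlier is still running) is:
# list Hive databases through Spark SQL
spark-sql -e "show databases;"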
livy
Download: http://archive.apache.org/dist/incubator/livy/
[root@bigdata101 livy-0.7.1]# cat conf/livy.conf
livy.server.host = bigdata101
livy.spark.master = yarn
livy.spark.deployMode = client
livy.repl.enableHiveContext = true
livy.server.port = 8998
Install Spark before starting Livy, otherwise Livy will fail to start.
Start it with bin/livy-server start.
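To verify that Livy is up, query its REST API; /sessions and /batches are standard Livy endpoints, and the host and port come from the livy.conf above:
curl http://bigdata101:8998/sessions
curl http://bigdata101:8998/batches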
es
For the installation guide, see here.
After starting Elasticsearch, create the griffin index and verify it:
[zgg@bigdata101 es-7.16.2]$ curl -k -H "Content-Type: application/json" -X PUT http://bigdata101:9200/griffin?include_type_name=true \
> -d '{
> "aliases": {},
> "mappings": {
> "accuracy": {
> "properties": {
> "name": {
> "fields": {
> "keyword": {
> "ignore_above": 256,
> "type": "keyword"
> }
> },
> "type": "text"
> },
> "tmst": {
> "type": "date"
> }
> }
> }
> },
> "settings": {
> "index": {
> "number_of_replicas": "2",
> "number_of_shards": "5"
> }
> }
> }'
{"acknowledged":true,"shards_acknowledged":true,"index":"griffin"}
# list all indices
[zgg@bigdata101 es-7.16.2]$ curl -X GET http://bigdata101:9200/_cat/indices
green open .geoip_databases HDSgUJ23Qwe-TI6KZ4e6Rw 1 1 43 46 224.8mb 111.9mb
green open griffin h9fUJ3IpT9Sxcr4RbZKEKQ 5 2 0 0 3.3kb 1.1kb
# view the contents of the griffin index
[zgg@bigdata101 es-7.16.2]$ curl -X GET http://bigdata101:9200/griffin/_search?pretty
{
"took" : 27,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
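As an extra check, cluster and index health can be queried with the standard _cluster/health API; note that with number_of_replicas set to 2, the index only stays fully allocated when at least three data nodes are available:
curl -X GET http://bigdata101:9200/_cluster/health?pretty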
maven
Download: https://maven.apache.org/download.cgi
[root@bigdata101 maven-3.8.4]# cat conf/settings.xml
<!-- Add the Aliyun mirror -->
<mirror>
<id>nexus-aliyun</id>
<mirrorOf>central</mirrorOf>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
<mirror>
<id>UK</id>
<name>UK Central</name>
<url>http://uk.maven.org/maven2</url>
<mirrorOf>central</mirrorOf>
</mirror>
<mirror>
<id>repo1</id>
<mirrorOf>central</mirrorOf>
<name>Human Readable Name for this Mirror.</name>
<url>http://repo1.maven.org/maven2/</url>
</mirror>
<mirror>
<id>repo2</id>
<mirrorOf>central</mirrorOf>
<name>Human Readable Name for this Mirror.</name>
<url>http://repo2.maven.org/maven2/</url>
</mirror>
Configure the environment variables
[root@bigdata101 opt]# cat /etc/profile
export JAVA_HOME=/opt/jdk1.8.0_301
export ES_JAVA_HOME=/opt/elasticsearch-7.16.2/jdk
export HADOOP_HOME=/opt/hadoop-2.7.3
export MAVEN_HOME=/opt/maven-3.8.4
export SPARK_HOME=/opt/spark-2.2.1
export SCALA_HOME=/opt/scala-2.11.8
export LIVY_HOME=/opt/livy-0.7.1
export HIVE_HOME=/opt/hive-2.3.9
export GRIFFIN_HOME=/opt/griffin-0.5.0
export HADOOP_COMMON_HOME=/opt/hadoop-2.7.3
export HADOOP_COMMON_LIB_NATIVE_DIR=/opt/hadoop-2.7.3/lib/native
export HADOOP_HDFS_HOME=/opt/hadoop-2.7.3
export HADOOP_INSTALL=/opt/hadoop-2.7.3
export HADOOP_MAPRED_HOME=/opt/hadoop-2.7.3
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=/opt/hadoop-2.7.3
export PATH=.:$LIVY_HOME/bin:$GRIFFIN_HOME/bin:$MAVEN_HOME/bin:$SPARK_HOME/bin:$SCALA_HOME/bin:$HIVE_HOME/bin:$HIVE_HOME/sbin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$PATH
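After editing /etc/profile, apply the changes to the current shell:
source /etc/profile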
2 Installing Griffin
Download: https://griffin.apache.org/docs/download.html
Configuration files
After downloading and extracting, edit the configuration files as follows:
Configure service/src/main/resources/application.properties
# the Griffin server port defaults to 8080
server.port=8090
spring.application.name=griffin_service
spring.datasource.url=jdbc:mysql://bigdata101:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username=root
spring.datasource.password=000000
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true
# Hive metastore
hive.metastore.uris=thrift://bigdata101:9083
hive.metastore.dbname=default
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
#Hive jdbc
hive.jdbc.className=org.apache.hive.jdbc.HiveDriver
hive.jdbc.url=jdbc:hive2://localhost:10000/
hive.need.kerberos=false
hive.keytab.user=xxx@xx.com
hive.keytab.path=/path/to/keytab/file
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000
# Kafka schema registry
kafka.schema.registry.url=http://bigdata101:8081
# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
# Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000
# schedule predicate job every 5 minutes and repeat 12 times at most
#interval time unit s:second m:minute h:hour d:day,only support these four units
predicate.job.interval=5m
predicate.job.repeat.count=12
# external properties directory location
external.config.location=
# external BATCH or STREAMING env
external.env.location=
# login strategy ("default" or "ldap")
login.strategy=default
# ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})
# hdfs default name
fs.defaultFS=hdfs://bigdata101:9000
# elasticsearch
elasticsearch.host=bigdata101
elasticsearch.port=9200
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password
# livy
livy.uri=http://bigdata101:8998/batches
livy.need.queue=false
livy.task.max.concurrent.count=20
livy.task.submit.interval.second=3
livy.task.appId.retry.count=3
livy.need.kerberos=false
livy.server.auth.kerberos.principal=livy/kerberos.principal
livy.server.auth.kerberos.keytab=/path/to/livy/keytab/file
# yarn url
yarn.uri=http://bigdata102:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook
logging.file=logs/griffin-service.log
Configure service/src/main/resources/quartz.properties
org.quartz.scheduler.instanceName=spring-boot-quartz
org.quartz.scheduler.instanceId=AUTO
org.quartz.threadPool.threadCount=5
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
# If you use postgresql as your database,set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
# If you use mysql as your database,set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate
# If you use h2 as your database, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.useProperties=true
org.quartz.jobStore.misfireThreshold=60000
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000
Configure service/src/main/resources/sparkProperties.json
{
"file": "hdfs://bigdata101:9000/griffin/griffin-measure.jar",
"className": "org.apache.griffin.measure.Application",
"queue": "default",
"numExecutors": 2,
"executorCores": 1,
"driverMemory": "1g",
"executorMemory": "1g",
"conf": {
"spark.yarn.dist.files": "hdfs://bigdata101:9000/home/spark_conf/hive-site.xml"
},
"files": [
]
}
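Note that the "file" path must match the HDFS location where griffin-measure.jar is uploaded in the jar deployment step below (hdfs://bigdata101:9000/griffin/griffin-measure.jar).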
Configure service/src/main/resources/env/env_batch.json
You can also create a separate configuration and point a job at that file when it runs; the same applies to env_streaming.json.
{
"spark": {
"log.level": "INFO"
},
"sinks": [
{
"name": "console",
"type": "CONSOLE",
"config": {
"max.log.lines": 10
}
},
{
"name": "hdfs",
"type": "HDFS",
"config": {
"path": "hdfs://bigdata101:9000/griffin/persist",
"max.persist.lines": 10000,
"max.lines.per.file": 10000
}
},
{
"name": "elasticsearch",
"type": "ELASTICSEARCH",
"config": {
"method": "post",
"api": "hdfs://bigdata101:9000/griffin/accuracy",
"connection.timeout": "1m",
"retry": 10
}
}
],
"griffin.checkpoint": []
}
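Note: the ELASTICSEARCH sink's "api" field is normally the HTTP endpoint of the griffin index (the env.json used for the batch test below sets it to http://bigdata101:9200/griffin/accuracy); the hdfs:// value shown here appears to be a leftover and likely needs to be changed for metrics to reach Elasticsearch.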
Configure service/src/main/resources/env/env_streaming.json
{
"spark": {
"log.level": "WARN",
"checkpoint.dir": "hdfs://bigdata101:9000/griffin/checkpoint/${JOB_NAME}",
"init.clear": true,
"batch.interval": "1m",
"process.interval": "5m",
"config": {
"spark.default.parallelism": 4,
"spark.task.maxFailures": 5,
"spark.streaming.kafkaMaxRatePerPartition": 1000,
"spark.streaming.concurrentJobs": 4,
"spark.yarn.maxAppAttempts": 5,
"spark.yarn.am.attemptFailuresValidityInterval": "1h",
"spark.yarn.max.executor.failures": 120,
"spark.yarn.executor.failuresValidityInterval": "1h",
"spark.hadoop.fs.hdfs.impl.disable.cache": true
}
},
"sinks": [
{
"type": "CONSOLE",
"config": {
"max.log.lines": 100
}
},
{
"type": "HDFS",
"config": {
"path": "hdfs://bigdata101:9000/griffin/persist",
"max.persist.lines": 10000,
"max.lines.per.file": 10000
}
},
{
"type": "ELASTICSEARCH",
"config": {
"method": "post",
"api": "hdfs://bigdata101:9000/griffin/accuracy"
}
}
],
"griffin.checkpoint": [
{
"type": "zk",
"config": {
"hosts": "zk:2181",
"namespace": "griffin/infocache",
"lock.path": "lock",
"mode": "persist",
"init.clear": true,
"close.clear": false
}
}
]
}
Edit service/pom.xml: comment out the PostgreSQL dependency and enable the mysql-connector-java dependency:
<!-- <dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>${postgresql.version}</version>
</dependency> -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
</dependency>
<!-- Note: remove the version tag -->
Build
[root@bigdata101 griffin-0.5.0-src]# mvn clean
[root@bigdata101 griffin-0.5.0-src]# mvn -T2C install -DskipTests
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Griffin 0.5.0 0.5.0:
[INFO]
[INFO] Apache Griffin 0.5.0 ............................... SUCCESS [ 3.069 s]
[INFO] Apache Griffin :: UI :: Default UI ................. SUCCESS [02:05 min]
[INFO] Apache Griffin :: Web Service ...................... SUCCESS [01:02 min]
[INFO] Apache Griffin :: Measures ......................... SUCCESS [01:39 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:08 min (Wall Clock)
[INFO] Finished at: 2021-12-26T12:04:02+08:00
[INFO] ------------------------------------------------------------------------
Deploy the jars
# the environment variables were configured above
[root@bigdata101 opt]# mkdir griffin-0.5.0
# copy measure-0.5.0.jar into the griffin-0.5.0 directory and rename it
[root@bigdata101 griffin-0.5.0-src]# cp measure/target/measure-0.5.0.jar $GRIFFIN_HOME/griffin-measure.jar
# copy service-0.5.0.jar into the griffin-0.5.0 directory and rename it
[root@bigdata101 griffin-0.5.0-src]# cp service/target/service-0.5.0.jar $GRIFFIN_HOME/griffin-service.jar
[root@bigdata101 griffin-0.5.0]# ls
griffin-measure.jar griffin-service.jar
[root@bigdata101 griffin-0.5.0]# hadoop fs -mkdir -p /griffin/persist
[root@bigdata101 griffin-0.5.0]# hadoop fs -mkdir -p /griffin/checkpoint
# upload griffin-measure.jar to HDFS
[root@bigdata101 griffin-0.5.0]# hadoop fs -put $GRIFFIN_HOME/griffin-measure.jar /griffin/
[root@bigdata101 griffin-0.5.0]# hadoop fs -ls /griffin
Found 3 items
drwxr-xr-x - root supergroup 0 2021-12-26 12:27 /griffin/checkpoint
-rw-r--r-- 3 root supergroup 46538384 2021-12-26 12:27 /griffin/griffin-measure.jar
drwxr-xr-x - root supergroup 0 2021-12-26 12:27 /griffin/persist
[root@bigdata101 griffin-0.5.0]# nohup java -jar $GRIFFIN_HOME/griffin-service.jar >$GRIFFIN_HOME/service.out 2>&1 &
[root@bigdata101 griffin-0.5.0]# jps
2098 DataNode
2996 Master
3748 RunJar
2646 -- process information unavailable
2374 NodeManager
5206 griffin-service.jar
5271 Jps
3064 Worker
1994 NameNode
2555 LivyServer
Open http://bigdata101:8090/ in a browser to check.
There is no default username or password; just click Login.
For UI usage examples, see here.
3 Batch Processing Test
Get the test data from https://griffin.apache.org/data/batch
[root@bigdata101 griffin-0.5.0]# mkdir data/
wget https://griffin.apache.org/data/batch/create-table.hql
wget https://griffin.apache.org/data/batch/delta_src
wget https://griffin.apache.org/data/batch/delta_tgt
wget https://griffin.apache.org/data/batch/demo_basic
wget https://griffin.apache.org/data/batch/demo_src
wget https://griffin.apache.org/data/batch/demo_tgt
wget https://griffin.apache.org/data/batch/gen-hive-data.sh
wget https://griffin.apache.org/data/batch/gen_delta_src.sh
wget https://griffin.apache.org/data/batch/gen_demo_data.sh
wget https://griffin.apache.org/data/batch/insert-data.hql.template
[root@bigdata101 data]# ls
create-table.hql delta_tgt demo_src gen_delta_src.sh gen-hive-data.sh
delta_src demo_basic demo_tgt gen_demo_data.sh insert-data.hql.template
[root@bigdata101 data]# chmod 777 *
# modify the LOCATION paths in create-table.hql
[root@bigdata101 data]# vi create-table.hql
...
LOCATION
'hdfs://bigdata101:9000/griffin/data/batch/demo_src';
...
LOCATION
'hdfs://bigdata101:9000/griffin/data/batch/demo_tgt';
# generate the test data
[root@bigdata101 data]# ./gen-hive-data.sh
hive> select * from demo_src limit 5;
OK
0 1 1 20211226 13
0 2 2 20211226 13
0 3 3 20211226 13
0 4 4 20211226 13
0 5 5 20211226 13
hive> desc demo_src;
OK
id bigint
age int
desc string
dt string
hour string
# Partition Information
# col_name data_type comment
dt string
hour string
hive> select * from demo_tgt limit 5;
OK
0 1 1 20211226 13
0 2 2 20211226 13
0 3 3 20211226 13
0 4 4 20211226 13
0 5 5 20211226 13
hive> desc demo_tgt;
OK
id bigint
age int
desc string
dt string
hour string
# Partition Information
# col_name data_type comment
dt string
hour string
Time taken: 0.468 seconds, Fetched: 11 row(s)
Accuracy Measure
Create the rule
[root@bigdata101 griffin-0.5.0]# mkdir measures
[root@bigdata101 measures]# vi dq.json
{
"name": "batch_accu",
"process.type": "batch",
"data.sources": [
{
"name": "src",
"baseline": true,
"connectors": [
{
"type": "hive",
"version": "2.3",
"config": {
"database": "default",
"table.name": "demo_src"
}
}
]
}, {
"name": "tgt",
"connectors": [
{
"type": "hive",
"version": "2.3",
"config": {
"database": "default",
"table.name": "demo_tgt"
}
}
]
}
],
"evaluate.rule": {
"rules": [
{
"dsl.type": "griffin-dsl",
"dq.type": "accuracy",
"out.dataframe.name": "accu",
"rule": "src.id = tgt.id AND src.age = tgt.age AND src.desc = tgt.desc",
"details": {
"source": "src",
"target": "tgt",
"miss": "miss_count",
"total": "total_count",
"matched": "matched_count"
},
"out": [
{
"type": "metric",
"name": "accu"
},
{
"type": "record",
"name": "missRecords"
}
]
}
]
},
"sinks": ["CONSOLE", "HDFS"]
}
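The rule is evaluated as a join condition: every record in src (the baseline) for which no tgt record satisfies src.id = tgt.id AND src.age = tgt.age AND src.desc = tgt.desc is counted in miss_count, matched_count is total_count minus miss_count, and the missing records themselves are emitted through the missRecords output.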
Specify where the results are stored
[root@bigdata101 measures]# vi env.json
{
"spark": {
"log.level": "WARN"
},
"sinks": [
{
"type": "CONSOLE",
"config": {
"max.log.lines": 10
}
},
{
"type": "HDFS",
"config": {
"path": "hdfs://bigdata101:9000/griffin/persist",
"max.persist.lines": 10000,
"max.lines.per.file": 10000
}
},
{
"type": "ELASTICSEARCH",
"config": {
"method": "post",
"api": "http://bigdata101:9200/griffin/accuracy",
"connection.timeout": "1m",
"retry": 10
}
}
],
"griffin.checkpoint": []
}
Submit the job
[root@bigdata101 griffin-0.5.0]# ls
data griffin-measure.jar griffin-service.jar measures service.out
[root@bigdata101 griffin-0.5.0]# ls measures/
dq.json env.json
[root@bigdata101 griffin-0.5.0]# spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default --driver-memory 1g --executor-memory 1g --num-executors 2 griffin-measure.jar measures/env.json measures/dq.json
......
data source timeRanges: src -> (1640586475986, 1640586475986], tgt -> (1640586475986, 1640586475986]
[1640586475986] batch_accu start: application_1640586390189_0001
batch_accu [1640586475986] metrics:
{"name":"batch_accu","tmst":1640586475986,"value":{"total_count":375000,"miss_count":1475,"matched_count":373525,"matchedFraction":0.9960666666666667},"applicationId":"application_1640586390189_0001"}
[1640586475986] 1640586626794: process using time: 150808 ms
[1640586475986] batch_accu finish
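Reading the metric: matchedFraction = matched_count / total_count = 373525 / 375000 ≈ 0.99607, i.e. 1475 source records (about 0.39%) have no matching record in demo_tgt.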
Check the /griffin/persist/ directory on HDFS
[root@bigdata101 opt]# hadoop fs -ls /griffin/persist/batch_accu/1640586475986
Found 5 items
-rw-r--r-- 1 root supergroup 0 2021-12-27 14:30 /griffin/persist/batch_accu/1640586475986/_FINISH
-rw-r--r-- 1 root supergroup 138 2021-12-27 14:30 /griffin/persist/batch_accu/1640586475986/_LOG
-rw-r--r-- 1 root supergroup 200 2021-12-27 14:30 /griffin/persist/batch_accu/1640586475986/_METRICS
-rw-r--r-- 1 root supergroup 30 2021-12-27 14:28 /griffin/persist/batch_accu/1640586475986/_START
-rw-r--r-- 1 root supergroup 128324 2021-12-27 14:30 /griffin/persist/batch_accu/1640586475986/missRecords
[root@bigdata101 opt]# hadoop fs -cat /griffin/persist/batch_accu/1640586475986/_START
application_1640586390189_0001
[root@bigdata101 opt]# hadoop fs -cat /griffin/persist/batch_accu/1640586475986/_LOG
================ log of Mon Dec 27 14:27:55 CST 2021 ================
--- Mon Dec 27 14:30:26 CST 2021 ---
process using time: 150808 ms
[root@bigdata101 opt]# hadoop fs -cat /griffin/persist/batch_accu/1640586475986/_METRICS
{"name":"batch_accu","tmst":1640586475986,"value":{"total_count":375000,"miss_count":1475,"matched_count":373525,"matchedFraction":0.9960666666666667},"applicationId":"application_1640586390189_0001"}
[root@bigdata101 opt]# hadoop fs -cat /griffin/persist/batch_accu/1640586475986/missRecords
....
{"id":124,"age":1628,"desc":"1628","dt":"20211227","hour":"13","__tmst":1640586475986}
{"id":124,"age":1628,"desc":"1628","dt":"20211227","hour":"14","__tmst":1640586475986}
{"id":124,"age":1479,"desc":"1479","dt":"20211227","hour":"15","__tmst":1640586475986}
{"id":124,"age":1479,"desc":"1479","dt":"20211227","hour":"15","__tmst":1640586475986}
{"id":124,"age":1555,"desc":"1555","dt":"20211227","hour":"14","__tmst":1640586475986}
{"id":124,"age":1555,"desc":"1555","dt":"20211227","hour":"13","__tmst":1640586475986}
Profiling Measure
Profile the demo_src table, for example the total record count, the maximum age, and the maximum length of the desc field.
In an expression such as src.id.count(), src refers to the data source name ("name": "src"), not the Hive table name.
Create the rule
[root@bigdata101 measures]# vi dq2.json
{
"name": "batch_Prof",
"process.type": "batch",
"data.sources": [
{
"name": "src",
"baseline": true,
"connectors": [
{
"type": "hive",
"version": "2.3",
"config": {
"database": "default",
"table.name": "demo_src"
}
}
]
}
],
"evaluate.rule": {
"rules": [
{
"dsl.type": "griffin-dsl",
"dq.type": "profiling",
"out.dataframe.name": "prof",
"rule": "src.id.count() as id_count,src.age.max() as age_max,src.desc.length().max() as desc_length_max",
"out": [
{
"type": "metric",
"name": "prof"
}
]
}
]
},
"sinks": ["CONSOLE", "HDFS"]
}
Submit the job using the env.json from test 1:
[root@bigdata101 griffin-0.5.0]# spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default --driver-memory 1g --executor-memory 1g --num-executors 2 griffin-measure.jar measures/env.json measures/dq2.json
...
{"name":"batch_prof","tmst":1640589832672,"value":{"id_count":375000,"age_max":2000,"desc_length_max":4},"applicationId":"application_1640586390189_0004"}
Writing Rules
The rule can also be plain SQL, for example to output the demo_src table sorted by age in descending order.
Note that the query references the data source name, not the Hive table name.
Create the rule
[root@bigdata101 measures]# vi dq3.json
{
"name": "batch_prof",
"process.type": "batch",
"data.sources": [
{
"name": "src",
"baseline": true,
"connectors": [
{
"type": "hive",
"version": "2.3",
"config": {
"database": "default",
"table.name": "demo_src"
}
}
]
}
],
"evaluate.rule": {
"rules": [
{
"dsl.type": "griffin-dsl",
"dq.type": "profiling",
"out.dataframe.name": "prof",
"rule": "select age from src order by age desc limit 5",
"out": [
{
"type": "metric",
"name": "prof"
}
]
}
]
},
"sinks": ["CONSOLE", "HDFS"]
}
Submit the job using the env.json from test 1:
[root@bigdata101 griffin-0.5.0]# spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default --driver-memory 1g --executor-memory 1g --num-executors 2 griffin-measure.jar measures/env.json measures/dq3.json
...
{"name":"batch_prof","tmst":1640590110604,"value":{"prof":[{"age":2000},{"age":2000},{"age":2000},{"age":1999},{"age":1999}]},"applicationId":"application_1640586390189_0005"}
4 Streaming Test [has issues, unresolved]
Get the test data from https://griffin.apache.org/data/streaming/
wget https://griffin.apache.org/data/streaming/gen-data.sh --no-check-certificate
wget https://griffin.apache.org/data/streaming/source.temp --no-check-certificate
wget https://griffin.apache.org/data/streaming/streaming-data.sh --no-check-certificate
wget https://griffin.apache.org/data/streaming/target.temp --no-check-certificate
[root@bigdata101 data]# chmod 777 *
# first review streaming-data.sh and gen-data.sh and adjust them to your environment
# execute `./streaming-data.sh` to generate json string data file and produce them into kafka topics minutely.
Create the rule
[root@bigdata101 measures]# vi dq4.json
{
"name": "streaming_accu",
"process.type": "streaming",
"data.sources": [
{
"name": "src",
"baseline": true,
"connectors": [
{
"type": "kafka",
"version": "2.3",
"config": {
"kafka.config": {
"bootstrap.servers": "bigdata101:9092,bigdata102:9092,bigdata103:9092",
"group.id": "griffin",
"auto.offset.reset": "largest",
"auto.commit.enable": "false"
},
"topics": "source",
"key.type": "java.lang.String",
"value.type": "java.lang.String"
},
"pre.proc": [
{
"dsl.type": "df-opr",
"rule": "from_json"
}
]
}
],
"checkpoint": {
"type": "json",
"file.path": "hdfs://bigdata101:9000/griffin/streaming/dump/source",
"info.path": "source",
"ready.time.interval": "10s",
"ready.time.delay": "0",
"time.range": ["-5m", "0"],
"updatable": true
}
}, {
"name": "tgt",
"connectors": [
{
"type": "kafka",
"version": "2.3",
"config": {
"kafka.config": {
"bootstrap.servers": "bigdata101:9092,bigdata102:9092,bigdata103:9092",
"group.id": "griffin",
"auto.offset.reset": "largest",
"auto.commit.enable": "false"
},
"topics": "target",
"key.type": "java.lang.String",
"value.type": "java.lang.String"
},
"pre.proc": [
{
"dsl.type": "df-opr",
"rule": "from_json"
}
]
}
],
"checkpoint": {
"type": "json",
"file.path": "hdfs://bigdata101:9000/griffin/streaming/dump/target",
"info.path": "target",
"ready.time.interval": "10s",
"ready.time.delay": "0",
"time.range": ["-1m", "0"]
}
}
],
"evaluate.rule": {
"rules": [
{
"dsl.type": "griffin-dsl",
"dq.type": "accuracy",
"out.dataframe.name": "accu",
"rule": "src.id = tgt.id AND src.name = tgt.name AND src.color = tgt.color AND src.time = tgt.time",
"details": {
"source": "src",
"target": "tgt",
"miss": "miss_count",
"total": "total_count",
"matched": "matched_count"
},
"out":[
{
"type":"metric",
"name": "accu"
},
{
"type":"record",
"name": "missRecords"
}
]
}
]
},
"sinks": ["CONSOLE", "HDFS"]
}
env.json
[root@bigdata101 measures]# vi env.json
{
"spark": {
"log.level": "WARN",
"checkpoint.dir": "hdfs://bigdata101:9000/griffin/checkpoint",
"batch.interval": "20s",
"process.interval": "1m",
"init.clear": true,
"config": {
"spark.default.parallelism": 4,
"spark.task.maxFailures": 5,
"spark.streaming.kafkaMaxRatePerPartition": 1000,
"spark.streaming.concurrentJobs": 4,
"spark.yarn.maxAppAttempts": 5,
"spark.yarn.am.attemptFailuresValidityInterval": "1h",
"spark.yarn.max.executor.failures": 120,
"spark.yarn.executor.failuresValidityInterval": "1h",
"spark.hadoop.fs.hdfs.impl.disable.cache": true
}
},
"sinks": [
{
"type": "console"
},
{
"type": "hdfs",
"config": {
"path": "hdfs://bigdata101:9000/griffin/persist"
}
},
{
"type": "elasticsearch",
"config": {
"method": "post",
"api": "http://bigdata101:9200/griffin/accuracy"
}
}
],
"griffin.checkpoint": [
{
"type": "zk",
"config": {
"hosts": "bigdata101:2181,bigdata102:2182,bigdata103:2181",
"namespace": "griffin/infocache",
"lock.path": "lock",
"mode": "persist",
"init.clear": true,
"close.clear": false
}
}
]
}
Start Kafka, then run ./streaming-data.sh to produce data.
[root@bigdata101 data]# ./streaming-data.sh
>>>>>>>>>>>>>>>>>>>>>>insert data at 2021-12-27_16:15:54
>>>>>>>>>>>>>>>>>>>>>>insert data at 2021-12-27_16:16:57
....
Submit the job [possibly fails due to insufficient memory, to be verified]
[root@bigdata101 griffin-0.5.0]# spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default --driver-memory 1g --executor-memory 1g --num-executors 2 griffin-measure.jar measures/env.json measures/dq4.json
....
21/12/27 21:38:41 ERROR offset.OffsetCheckpointInZK: delete /lock error: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/lock
21/12/27 21:38:41 ERROR offset.OffsetCheckpointInZK: delete /info.final error: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info.final
21/12/27 21:38:41 ERROR offset.OffsetCheckpointInZK: delete /info error: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info
21/12/27 21:38:49 WARN streaming.Checkpoint: Checkpoint directory hdfs://bigdata101:9000/griffin/checkpoint does not exist
21/12/27 21:39:10 WARN offset.OffsetCheckpointInZK: read /info.final/last.proc.time warn: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info.final/last.proc.time
21/12/27 21:39:10 WARN offset.OffsetCheckpointInZK: read /info.final/ready.time warn: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info.final/ready.time
21/12/27 21:39:14 WARN offset.OffsetCheckpointInZK: read /info/source/old.cache.index warn: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info/source/old.cache.index
21/12/27 21:39:14 WARN offset.OffsetCheckpointInZK: read /info.final/last.proc.time warn: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info.final/last.proc.time
21/12/27 21:39:14 WARN offset.OffsetCheckpointInZK: read /info.final/ready.time warn: KeeperErrorCode = NoNode for /griffin/infocache/streaming_accu/info.final/ready.time
data source timeRanges: src -> (-300001, -1], tgt -> (-60001, -1]
[1640612329475] streaming_accu start: application_1640606908996_0015
Killed
5 Problems
Problem 1
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Griffin 0.6.0 0.6.0:
[INFO]
[INFO] Apache Griffin 0.6.0 ............................... SUCCESS [ 6.164 s]
[INFO] Apache Griffin :: UI :: Default UI ................. FAILURE [06:54 min]
[INFO] Apache Griffin :: Web Service ...................... SUCCESS [02:10 min]
[INFO] Apache Griffin :: Measures ......................... SUCCESS [02:44 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 07:01 min (Wall Clock)
[INFO] Finished at: 2021-12-26T11:55:52+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.6:npm (npm install) on project ui: Failed to run task: 'npm install' failed. org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
Modify the following in ui/pom.xml:
Original:
<!-- It will install nodejs and npm -->
<execution>
<id>install node and npm</id>
<goals>
<goal>install-node-and-npm</goal>
</goals>
<configuration>
<nodeVersion>${node.version}</nodeVersion>
<npmVersion>${npm.version}</npmVersion>
</configuration>
</execution>
After the change:
<!-- It will install nodejs and npm -->
<execution>
<id>install node and npm</id>
<goals>
<goal>install-node-and-npm</goal>
</goals>
<configuration>
<nodeVersion>v10.16.3</nodeVersion>
<npmVersion>6.11.3</npmVersion>
<nodeDownloadRoot>http://nodejs.org/dist/</nodeDownloadRoot>
<!-- <npmDownloadRoot>http://registry.npmjs.org/npm/</npmDownloadRoot> -->
<npmDownloadRoot>http://registry.npm.taobao.org/npm/</npmDownloadRoot>
</configuration>
</execution>
If the following error appears, remove the <nodeDownloadRoot> and <npmDownloadRoot> lines.
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.6:install-node-and-npm (install node and npm) on project ui: Could not download npm: Got error code 404 from the server. -> [Help 1]
Problem 2
[ERROR]
[ERROR] ERROR in /opt/soft/griffin-0.5.0/ui/angular/node_modules/@types/jquery/JQuery.d.ts (4137,26): Cannot find name 'SVGElementTagNameMap'.
[ERROR] ERROR in /opt/soft/griffin-0.5.0/ui/angular/node_modules/@types/jquery/JQuery.d.ts (4137,89): Cannot find name 'SVGElementTagNameMap'.
[ERROR]
[ERROR] npm ERR! Linux 3.10.0-514.el7.x86_64
[ERROR] npm ERR! argv "/opt/soft/griffin-0.5.0/ui/.tmp/node/node" "/opt/soft/griffin-0.5.0/ui/.tmp/node/node_modules/npm/bin/npm-cli.js" "run" "build "
[ERROR] npm ERR! node v6.11.3
[ERROR] npm ERR! npm v3.10.10
[ERROR] npm ERR! code ELIFECYCLE
[ERROR] npm ERR! griffin@0.0.0 build: `ng build`
[ERROR] npm ERR! Exit status 1
[ERROR] npm ERR!
[ERROR] npm ERR! Failed at the griffin@0.0.0 build script 'ng build'.
[ERROR] npm ERR! Make sure you have the latest version of node.js and npm installed.
[ERROR] npm ERR! If you do, this is most likely a problem with the griffin package,
[ERROR] npm ERR! not with npm itself.
[ERROR] npm ERR! Tell the author that this fails on your system:
[ERROR] npm ERR! ng build
[ERROR] npm ERR! You can get information on how to open an issue for this project with:
[ERROR] npm ERR! npm bugs griffin
[ERROR] npm ERR! Or if that isn't available, you can get their info via:
[ERROR] npm ERR! npm owner ls griffin
[ERROR] npm ERR! There is likely additional logging output above.
[ERROR]
[ERROR] npm ERR! Please include the following file with any support request:
[ERROR] npm ERR! /opt/soft/griffin-0.5.0/ui/angular/npm-debug.log
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Griffin 0.5.0 0.5.0:
[INFO]
[INFO] Apache Griffin 0.5.0 ............................... SUCCESS [ 16.497 s]
[INFO] Apache Griffin :: UI :: Default UI ................. FAILURE [24:12 min]
[INFO] Apache Griffin :: Web Service ...................... SKIPPED
[INFO] Apache Griffin :: Measures ......................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 24:28 min
[INFO] Finished at: 2020-06-30T15:04:04+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.6:npm (npm build) on project ui: Failed to run task: 'npm run build' faile d. org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
vi /opt/griffin-0.5.0-source/ui/angular/node_modules/@types/jquery/JQuery.d.ts
Delete line 4137:
find<K extends keyof SVGElementTagNameMap>(selector_element: K | JQuery<K>): JQuery<SVGElementTagNameMap[K]>;
Problem 3
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Griffin 0.5.0 0.5.0:
[INFO]
[INFO] Apache Griffin 0.5.0 ............................... SUCCESS [ 9.715 s]
[INFO] Apache Griffin :: UI :: Default UI ................. SKIPPED
[INFO] Apache Griffin :: Web Service ...................... FAILURE [02:55 min]
[INFO] Apache Griffin :: Measures ......................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:08 min (Wall Clock)
[INFO] Finished at: 2019-05-27T14:27:07+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project service: Could not resolve dependencies for project org.apache.griffin:service:jar:0.5.0: Could not find artifact io.confluent:kafka-schema-registry-client:jar:3.2.0 in nexus-aliyun (http://maven.aliyun.com/nexus/content/groups/public) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :service
Download kafka-schema-registry-client and install it into the local repository manually:
mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-schema-registry-client -Dversion=3.2.0 -Dpackaging=jar -Dfile=kafka-schema-registry-client-3.2.0.jar
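Alternatively (untested here), the artifact is published in the Confluent Maven repository, so adding that repository to the pom or settings.xml should also let Maven resolve it:
<repository>
    <id>confluent</id>
    <url>https://packages.confluent.io/maven/</url>
</repository>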
Problem 4 [unresolved]
griffin-0.6.0 builds successfully, but running nohup java -jar $GRIFFIN_HOME/griffin-service.jar >$GRIFFIN_HOME/service.out 2>&1 & produces the following errors in sequence:
no main manifest attribute, in /opt/griffin-0.6.0/griffin-service.jar
Adding the Spring Boot build plugin to the pom and re-running mvn install should fix this. (Untested; taken from here)
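A minimal sketch of that plugin configuration (the repackage goal is what writes the main-class entry into the executable jar's manifest):
<plugin>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-maven-plugin</artifactId>
    <executions>
        <execution>
            <goals>
                <goal>repackage</goal>
            </goals>
        </execution>
    </executions>
</plugin>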
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: javax/servlet/Filter
Caused by: java.lang.ClassNotFoundException: javax.servlet.Filter
This problem did not occur with version 0.5.0.
Problem 5
When submitting the Spark test job, the following error appeared:
NoSuchMethodError: org.apache.spark.SparkConf.setAll(Lscala/collection/Traversable;)Lorg/apache/spark/SparkConf
The installed Spark version does not match the version Griffin was built against; check measure/pom.xml and switch to the corresponding Spark version.
Problem 6
When submitting the Spark test job, the following error appeared:
ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
Copy the hadoop-lzo-0.4.21-SNAPSHOT.jar file into Spark's conf/ directory.
Problem 7
When testing stream processing, the error java.io.CharConversionException: Invalid UTF-32 character 0x8000000 (above 10ffff) at char #1, byte #7 appeared.
The cause of this exception is that LZO support was not configured in Spark.
References:
A write-up covering many of the build errors: https://blog.csdn.net/weixin_43160819/article/details/118298681
https://blog.csdn.net/weixin_40004348/article/details/107191430
https://blog.csdn.net/github_39577257/article/details/90607081
https://blog.csdn.net/guanghuotainkong/article/details/112168871