Spark direct Kafka: pushing offsets to ZooKeeper
1. Reading Kafka with Spark Streaming's direct approach performs much better, but its drawback is that it does not update offsets in ZooKeeper, which breaks monitoring with the ZooKeeper-based KafkaOffsetMonitor. Because our streaming jobs handle large data volumes and have strict robustness requirements, we need KafkaOffsetMonitor to track consumption in real time, so we push the offsets to ZooKeeper ourselves, as shown in the sketch below and in the full job in section 2.
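With the direct stream, each batch's RDD implements HasOffsetRanges, so once a batch has been processed its per-partition untilOffset can be written to the consumer-offset path that KafkaOffsetMonitor reads. The following is a minimal sketch of just that pattern; the broker and ZooKeeper addresses, topic and group name are placeholders, not values from the original job:

import kafka.serializer.StringDecoder
import kafka.utils.{ZKGroupTopicDirs, ZKStringSerializer, ZkUtils}
import org.I0Itec.zkclient.ZkClient
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// Minimal sketch: consume with the direct API, then push each batch's end
// offsets to the ZooKeeper path that KafkaOffsetMonitor watches.
object OffsetToZkSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("OffsetToZkSketch"), Seconds(10))
    val group = "demo-group"   // placeholder
    val topic = "demo-topic"   // placeholder
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092", "group.id" -> group)
    val topicDirs = new ZKGroupTopicDirs(group, topic)
    val zkClient = new ZkClient("zk1:2181", 30000, 30000, ZKStringSerializer)

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set(topic))

    stream.foreachRDD { rdd =>
      // Only the RDDs produced directly by the stream carry offset metadata,
      // so capture the ranges before any map/filter.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd here ...
      // Then record each partition's end offset under
      // /consumers/<group>/offsets/<topic>/<partition>, the path KafkaOffsetMonitor reads.
      offsetRanges.foreach { o =>
        ZkUtils.updatePersistentPath(zkClient,
          s"${topicDirs.consumerOffsetDir}/${o.partition}", o.untilOffset.toString)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

The full job below does the same thing, but additionally reads any saved offsets back from ZooKeeper at startup, so a restarted job resumes where the previous run stopped.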
2. Spark code:
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import kafka.utils.{ZKGroupTopicDirs, ZKStringSerializer, ZkUtils}
import org.I0Itec.zkclient.ZkClient
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

import scala.collection.mutable.ArrayBuffer

object KSConvertStreaming {

  // Output path template: message type / year / month / day
  val savaLocal = "/xxx/parquet/%s/year=%s/month=%s/day=%s"

  // Parse a partition of raw JSON strings and group them by message type.
  // NOTE: StreamingUtils is the author's own helper (its import and definition
  // are not shown in the original post); mapToStr is expected to return the
  // message type at index 0 and the converted JSON string at index 1.
  def jsonConvert(jsonStrs: Iterator[String]): Iterator[(String, ArrayBuffer[String])] = {
    StreamingUtils.init
    val typeMap = scala.collection.mutable.Map[String, ArrayBuffer[String]]()
    jsonStrs.foreach { x =>
      val res = StreamingUtils.mapToStr(x)
      if (null != res) {
        val msgType = res.get(0)
        if (!typeMap.contains(msgType)) {
          typeMap += (msgType -> new ArrayBuffer[String]())
        }
        typeMap(msgType) += res.get(1)
      }
    }
    typeMap.iterator
  }

  def main(args: Array[String]): Unit = {
    val Array(maxPartition, maxNumber, windownsS, groupName, maxRatePerPartition) = args

    val topicName = "xxx"
    val kafkaAddr = "xxx:9092,xxx:9092,xxx:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> kafkaAddr,
      "group.id" -> groupName,
      "auto.offset.reset" -> "largest"
    )
    val topics = Set(topicName)

    println(s"maxPartition -------- $maxPartition")
    println(s"maxNumber -------- $maxNumber")
    println(s"windownsS -------- $windownsS")
    println(s"groupName -------- $groupName")

    val sparkConf = new SparkConf().setAppName("Streaming_Convert")
      .set("spark.yarn.executor.memoryOverhead", "1024")
      .set("spark.streaming.kafka.maxRatePerPartition", maxRatePerPartition) // max records per second per partition
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.reducer.maxSizeInFlight", "1m")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(windownsS.toInt)) // batch interval in seconds

    // ZooKeeper paths under which this consumer group's offsets are stored
    val topicDirs = new ZKGroupTopicDirs(groupName, topicName)
    val zkClient = new ZkClient("xxx:2181,xxx:2181,xxx:2181", Integer.MAX_VALUE, 100000, ZKStringSerializer)
    val children = zkClient.countChildren(topicDirs.consumerOffsetDir)

    var kafkaStream: InputDStream[(String, String)] = null
    var fromOffsets: Map[TopicAndPartition, Long] = Map()
    val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message())

    if (children > 0) {
      // Offsets already saved in ZooKeeper: resume from them
      for (i <- 0 until children) {
        val partitionOffset = zkClient.readData[String](s"${topicDirs.consumerOffsetDir}/$i")
        val tp = TopicAndPartition(topicName, i)
        fromOffsets += (tp -> partitionOffset.toLong)
      }
      kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)
    } else {
      // First run: no saved offsets, start according to auto.offset.reset
      kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    }

    var offsetRanges = Array[OffsetRange]()

    kafkaStream.transform { rdd =>
      // Capture each batch's offset ranges before any transformation
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.map(_._2).foreachRDD { rdd =>
      val xRDD = rdd.flatMap(_.split("\n")).mapPartitions(x => jsonConvert(x))
      xRDD.persist(StorageLevel.MEMORY_ONLY)

      // Count records per message type
      val typeCountMap = xRDD.map(x => (x._1, x._2.size)).collect()
      val typeMap = scala.collection.mutable.Map[String, Long]()
      typeCountMap.foreach { x =>
        if (!typeMap.contains(x._1)) {
          typeMap(x._1) = x._2
        } else {
          typeMap(x._1) = typeMap(x._1) + x._2
        }
      }

      var totalCount: Long = 0
      typeMap.foreach { x =>
        println(s"${x._1}:${x._2}")
        totalCount += x._2
      }
      println(s"total : $totalCount")

      // Write each type, largest first
      val sortedMap = collection.mutable.LinkedHashMap(typeMap.toSeq.sortWith(_._2 > _._2): _*)
      sortedMap.foreach { x =>
        val pointType = x._1
        val count = x._2
        println(s"save type( $pointType ) count( $count )")
        val jsonRDD = xRDD.filter(x => x._1 == pointType).flatMap(x => x._2)

        // Partition count proportional to record count, bounded by [1, maxPartition]
        var partitionNum = count / maxNumber.toLong
        if (partitionNum == 0) {
          partitionNum = 1
        }
        if (partitionNum > maxPartition.toLong) {
          partitionNum = maxPartition.toLong
        }
        println(s"\trepartition ( $partitionNum )")

        val arrType = pointType.split('-')
        sqlContext.read.json(jsonRDD)
          .repartition(partitionNum.toInt)
          .write.mode(SaveMode.Append)
          .parquet(savaLocal.format(arrType(0), arrType(1), arrType(2), arrType(3)))
      }
      xRDD.unpersist()

      // After the batch has been written, push its end offsets to ZooKeeper so
      // KafkaOffsetMonitor sees this consumer group's progress
      for (o <- offsetRanges) {
        val zkPath = s"${topicDirs.consumerOffsetDir}/${o.partition}"
        ZkUtils.updatePersistentPath(zkClient, zkPath, o.untilOffset.toString)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
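For reference, the job reads five positional arguments in this order: maxPartition, maxNumber, windownsS (batch interval in seconds), groupName and maxRatePerPartition. A hypothetical submission (the jar name, master and argument values below are placeholders, not from the original post) could look like:

spark-submit \
  --master yarn \
  --class KSConvertStreaming \
  ks-convert-streaming.jar \
  200 1000000 60 ks_convert_group 10000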
From the ITPUB blog, link: http://blog.itpub.net/29754888/viewspace-2125804/. Please credit the source when reprinting; otherwise legal liability may be pursued.