This Spark Streaming use case is a great example of bringing near-real-time sessionization to Hadoop.
Spark Streaming is one of the most interesting components within Apache Spark. With Spark Streaming, you can build data pipelines that process streamed data using the same API you use for processing batch-loaded data. Furthermore, Spark Streaming's "micro-batching" approach provides decent resilience should a job fail for some reason.
In this post, I will demonstrate near-real-time sessionization of website events to walk you through some common and advanced Spark Streaming functionality, then load statistics about those events into Apache HBase, and finally graph the results for analysis with your favorite BI tool. (Sessionization refers to capturing all the clickstream events for a single visitor within the time span of one website session.) You can find the code for this demo here.
A system like this can be super useful for understanding visitor behavior (whether human or machine). With some additional work, it could also be designed to include windowing patterns for detecting possible fraud in an asynchronous way.
Spark Streaming Code
The main class in our example is:
Let's walk through this code section by section (ignoring lines 1-59, which contain imports and other boring stuff).
Lines 60 through 112: Setting Up Spark Streaming
This code is the most basic setup for Spark Streaming, with the option of receiving the data stream from either HDFS or a socket. If you are new to Spark Streaming, I have added some detailed comments to help you make sense of the code. (I won't go into deep detail here, because we still have plenty of code to discuss.)
//This is just creating a Spark Config object. I don't do much here but
//add the app name. There are tons of options to put into the Spark config,
//but none are needed for this simple example.
val sparkConf = new SparkConf().
  setAppName("SessionizeData " + args(0)).
  set("spark.cleaner.ttl", "120000")

//These two lines will get us our SparkContext and our StreamingContext.
//These objects have all the root functionality we need to get started.
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(10))

//Here we are loading our HBase Configuration object. This will have
//all the information needed to connect to our HBase cluster.
//There is nothing different here from when you normally interact with HBase.
val conf = HBaseConfiguration.create();
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"));
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));

//This is a HBaseContext object. This is a nice abstraction that will hide
//any complex HBase stuff from us so we can focus on our business case.
//HBaseContext is from the SparkOnHBase project which can be found at
//https://github.com/tmalaska/SparkOnHBase
val hbaseContext = new HBaseContext(sc, conf);

//This creates a reference to our root DStream. DStreams are like RDDs but
//with the context of being in the micro-batch world. I set this to null now
//because I later give the option of populating this data from HDFS or from
//a socket. There is no reason this could not also be populated by Kafka,
//Flume, an MQ system, or anything else. I just focused on these because
//they are the easiest to set up.
var lines: DStream[String] = null

//Options for data load. Will be adding Kafka and Flume at some point
if (args(0).equals("socket")) {
  val host = args(FIXED_ARGS);
  val port = args(FIXED_ARGS + 1);

  println("host:" + host)
  println("port:" + Integer.parseInt(port))

  //Simple example of how you set up a receiver from a socket stream
  lines = ssc.socketTextStream(host, port.toInt)
} else if (args(0).equals("newFile")) {

  val directory = args(FIXED_ARGS)
  println("directory:" + directory)

  //Simple example of how you set up a receiver from an HDFS folder
  lines = ssc.fileStream[LongWritable, Text, TextInputFormat](directory,
    (t: Path) => true, true).map(_._2.toString)
} else {
  throw new RuntimeException("bad input type")
}
Lines 114 through 124: String Parsing
This is where Spark Streaming starts. Take a look at the following lines:
val ipKeyLines = lines.map[(String, (Long, Long, String))](eventRecord => {
  //Get the time and ip address out of the original event
  val time = dateFormat.parse(
    eventRecord.substring(eventRecord.indexOf('[') + 1, eventRecord.indexOf(']'))).
    getTime()
  val ipAddress = eventRecord.substring(0, eventRecord.indexOf(' '))

  //We return the time twice because we will use the first as the start time
  //and the second as the end time
  (ipAddress, (time, time, eventRecord))
})
The first line of code above runs a map function over the "lines" DStream object, parsing the original event to separate out the IP address, timestamp, and event body. For those new to Spark Streaming, a DStream holds a batch of records to be processed. Those records are populated by the receiver object defined earlier, and this map function produces another DStream within the same micro-batch that holds the transformed records for additional processing.
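To make the parsing concrete, here is a minimal, self-contained sketch of what that map body does to a single record. The sample log line and the SimpleDateFormat pattern are assumptions on my part (the demo's real dateFormat is defined up in the lines 1-59 we skipped), but a standard Apache access log line fits the substring logic above.

import java.text.SimpleDateFormat
import java.util.Locale

//Assumed date pattern; the demo's real dateFormat is defined earlier in the file
val dateFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)

//A made-up Apache-style access log line: IP first, timestamp inside brackets
val eventRecord =
  """127.0.0.1 - - [10/Oct/2014:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326"""

val time = dateFormat.parse(
  eventRecord.substring(eventRecord.indexOf('[') + 1, eventRecord.indexOf(']'))).getTime()
val ipAddress = eventRecord.substring(0, eventRecord.indexOf(' '))

//Prints the (ipAddress, (time, time, eventRecord)) tuple, e.g.
//("127.0.0.1", (1412974536000, 1412974536000, <the full log line>))
println((ipAddress, (time, time, eventRecord)))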
When looking at a Spark Streaming diagram like the one above, there are a few things to note:
- Each micro-batch fires at the interval, in seconds, that you set when constructing your StreamingContext
- The receiver is always being filled with the RDDs for the next, upcoming micro-batch
- The old RDDs from previous micro-batches are cleaned up and discarded
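Of course, none of those micro-batches fire until the StreamingContext is actually started. That boilerplate sits outside the line ranges discussed in this post, but in any Spark Streaming job it looks like this:

//Start the receivers and the micro-batch scheduler, then block so the
//driver keeps running until the job is killed
ssc.start()
ssc.awaitTermination()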
Lines 126 through 135: Producing Sessions
Now that we have the IP address and time from the web log, it is time to build sessions. The code below builds up the sessions by first aggregating the events within the micro-batch, and then reducing those chunks against the sessions held in the stateful DStream.
val latestSessionInfo = ipKeyLines.
  map[(String, (Long, Long, Long))](a => {
    //transform to (ipAddress, (time, time, counter))
    (a._1, (a._2._1, a._2._2, 1))
  }).
  reduceByKey((a, b) => {
    //transform to (ipAddress, (lowestStartTime, MaxFinishTime, sumOfCounter))
    (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3)
  }).
  updateStateByKey(updateStatbyOfSessions)
Here is an example of how the records are reduced within the micro-batch:
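The IP address and timestamps below are made up, but the per-key reduce behaves like this (sketched with plain Scala collections rather than a DStream):

//Three events for the same IP arriving in one micro-batch
val events = Seq(
  ("10.0.0.1", (1000L, 1000L, 1L)),
  ("10.0.0.1", (4000L, 4000L, 1L)),
  ("10.0.0.1", (9000L, 9000L, 1L)))

//The same reduce logic as above: lowest start time, highest finish time,
//and the sum of the event counters, per ipAddress
val reduced = events
  .groupBy(_._1)
  .map { case (ip, evs) =>
    (ip, evs.map(_._2).reduce((a, b) =>
      (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3)))
  }

println(reduced) //Map(10.0.0.1 -> (1000,9000,3))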
To join sessions across micro-batches, we can use the super-cool updateStateByKey functionality, which lets us do a join/reduce-like operation against a DStream carrying the active sessions from the previous micro-batch. The diagram below illustrates, in terms of DStreams, what this process looks like over time.
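One practical note before digging into the update function: stateful transformations like updateStateByKey require checkpointing to be enabled on the StreamingContext, so somewhere in the setup you need a line like the one below (the directory is just a placeholder; point it at a real HDFS path in your cluster).

//updateStateByKey carries state from one micro-batch to the next, so Spark
//Streaming needs a checkpoint directory to periodically persist that state.
//The path here is only a placeholder.
ssc.checkpoint("hdfs:///tmp/sessionize-checkpoint")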
Now, let's dig into the updateStatbyOfSessions function, which is defined at the bottom of the file. This code (note the detailed comments) holds much of the magic that makes sessionization happen in a continuous, micro-batch fashion.
/**
 * This function will be called on the union of keys from the reduced DStream
 * and the active sessions from the last micro-batch, with the ipAddress
 * being the key
 *
 * The goal is for this to produce a stateful RDD that has all the active
 * sessions. So we add new sessions, remove sessions that have timed
 * out, and extend sessions that are still going
 */
def updateStatbyOfSessions(
    //(sessionStartTime, sessionFinishTime, countOfEvents)
    a: Seq[(Long, Long, Long)],
    //(sessionStartTime, sessionFinishTime, countOfEvents, isNewSession)
    b: Option[(Long, Long, Long, Boolean)]
  ): Option[(Long, Long, Long, Boolean)] = {

  //This function will return an Option value.
  //If we want to delete the value we can return None.
  //The value contains four parts:
  //(startTime, endTime, countOfEvents, isNewSession)
  var result: Option[(Long, Long, Long, Boolean)] = null

  //These if statements are saying that if we didn't get a new event for
  //this session's ip address for longer than the session
  //timeout + the batch time, then it is safe to remove this key value
  //from the future stateful DStream
  if (a.size == 0) {
    if (System.currentTimeMillis() - b.get._2 > SESSION_TIMEOUT + 11000) {
      result = None
    } else {
      if (b.get._4 == true) {
        result = b
      } else {
        result = Some((b.get._1, b.get._2, b.get._3, true))
      }
    }
  }

  //Now, because we used the reduce function before this function, we are
  //only ever going to get at most one event in the Sequence.
  a.foreach(c => {
    if (b.isEmpty) {
      //If there was no value in the stateful DStream then just add it
      //as new, with true for being a new session
      result = Some((c._1, c._2, c._3, true))
    } else {
      if (c._1 - b.get._2 < SESSION_TIMEOUT) {
        //If the session from the stateful DStream has not timed out
        //then extend the session
        result = Some((
          Math.min(c._1, b.get._1), //newStartTime
          Math.max(c._2, b.get._2), //newFinishTime
          b.get._3 + c._3,          //newSumOfEvents
          false                     //This is not a new session
        ))
      } else {
        //Otherwise replace the old session with a new one
        result = Some((
          c._1,     //newStartTime
          c._2,     //newFinishTime
          b.get._3, //newSumOfEvents
          true      //new session
        ))
      }
    }
  })
  result
}
This code does a lot in a small space, and it is the most complicated part of the whole job. In summary, it keeps track of the active sessions, so you know whether you are continuing an existing session or starting a new one.
Lines 126 through 207: Counting and HBase
This part does most of the counting work. There is a lot of repetition here, so let's look at just one count example, and then step through how the counts produced over the same set of records get stored together in HBase.
val onlyActiveSessions = latestSessionInfo.filter(t =>
  System.currentTimeMillis() - t._2._2 < SESSION_TIMEOUT)
…
val newSessionCount = onlyActiveSessions.filter(t => {
    //is the session newer than the last micro-batch,
    //and is the boolean saying this is a new session true
    (System.currentTimeMillis() - t._2._2 > 11000 && t._2._4)
  }).
  count.
  map[HashMap[String, Long]](t => HashMap((NEW_SESSION_COUNTS, t)))
In summary, the code above filters out everything except the active sessions, counts them, and puts that final count record into a HashMap instance. It uses HashMap as the container so that, after all the counts are done, we can call the reduce function below to roll them all up into a single record. (I'm sure there are better ways to do this, but this approach works well enough.)
Next, the following code takes all of those HashMaps and folds their values into a single HashMap.
val allCounts = newSessionCount.
  union(totalSessionCount).
  union(totals).
  union(totalEventsCount).
  union(deadSessionsCount).
  union(totalSessionEventCount).
  reduce((a, b) => b ++ a)
Interacting with HBase from Spark Streaming is super simple with HBaseContext. All you need to do is supply the DStream of HashMaps along with a function that converts each one into a put object.
hbaseContext.streamBulkPut[HashMap[String, Long]](
  allCounts,  //The input RDD
  hTableName, //The name of the table we want to put to
  (t) => {
    //Here we are converting our input record into a put
    //The rowKey is C for Count plus a backward-counting time, so the newest
    //count shows up first in HBase's sorted order
    val put = new Put(Bytes.toBytes("C." + (Long.MaxValue - System.currentTimeMillis())))
    //We iterate through the HashMap to make all the columns with their counts
    t.foreach(kv => put.add(
      Bytes.toBytes(hFamily),
      Bytes.toBytes(kv._1),
      Bytes.toBytes(kv._2.toString)))
    put
  },
  true)
Now, this information in HBase can be wrapped up with an Apache Hive table, and then a query can be run through your favorite BI tool to produce a graph like the following, which will refresh on every micro-batch.
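If you just want to eyeball those counts without wiring up Hive, a quick read-back directly against HBase could look like the sketch below. This is not part of the original demo; it assumes the same conf and hTableName values used above and the same-era HBase client API as the rest of the code.

import org.apache.hadoop.hbase.client.{HTable, Scan}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.CellUtil
import scala.collection.JavaConverters._

//Scan only the "C." count rows; because the rowKey embeds a backward-counting
//timestamp, the newest counts come back first
val countTable = new HTable(conf, hTableName)
val scan = new Scan()
scan.setStartRow(Bytes.toBytes("C."))
scan.setStopRow(Bytes.toBytes("C/")) //'/' is the byte after '.', so this bounds the prefix

val scanner = countTable.getScanner(scan)
scanner.asScala.take(5).foreach(result =>
  result.rawCells().foreach(cell =>
    println(Bytes.toString(CellUtil.cloneQualifier(cell)) + " = " +
      Bytes.toString(CellUtil.cloneValue(cell)))))
scanner.close()
countTable.close()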
Lines 209 through 215: Writing to HDFS
The final task is to join the event data with the active session information, and then persist the events to HDFS keyed by the session's start time.
//Persist to HDFS
ipKeyLines.join(onlyActiveSessions).
  map(t => {
    //Session root start time | Event message
    dateFormat.format(new Date(t._2._2._1)) + "\t" + t._2._1._3
  }).
  saveAsTextFiles(outputDir + "/session", "txt")
Conclusion
I hope you come away from this example feeling like a lot of work got done with only a small amount of code, because it did. Imagine what else you could build with this pattern, given how easily Spark Streaming interacts with HBase and HDFS.
Ted Malaska is a Solutions Architect at Cloudera, a contributor to Apache Spark, and a co-author of the O'Reilly book Hadoop Application Architectures.