This Spark Streaming use case is a great example of bringing near-real-time sessionization to Hadoop.
Spark Streaming is one of the most interesting components within Apache Spark. With Spark Streaming, you can build data pipelines that process streamed data using the same API you use for processing batch-loaded data. Furthermore, Spark Streaming's "micro-batching" approach provides decent resilience should a job fail for some reason.
In this post, I will demonstrate near-real-time sessionization of website events to walk you through some common and advanced Spark Streaming functionality, then load statistics about those events into Apache HBase, and finally graph the results for analysis with your favorite BI tool. (Sessionization refers to capturing all the clickstream events for a single visitor within the time span of one website session.) You can find the code for this demo here.
A system like this can be super useful for understanding visitor behavior (whether human or machine). With some additional work, it could also be designed to include windowing patterns for detecting possible fraud in an asynchronous way.
Spark Streaming Code
The main class in our example is:
Let's walk through this code section by section (ignoring lines 1-59, which contain imports and other boring stuff).
Lines 60 through 112: Setting Up Spark Streaming
This code is the most basic setup for Spark Streaming, with the option of receiving the data stream from either HDFS or a socket. If you are new to Spark Streaming, I have added some detailed comments to help you make sense of the code. (I won't go into deep detail here, because we still have plenty of code to discuss.)
//This is just creating a Spark Config object. I don't do much here but
//add the app name. There are tons of options to put into the Spark config,
//but none are needed for this simple example.
val sparkConf = new SparkConf().
  setAppName("SessionizeData " + args(0)).
  set("spark.cleaner.ttl", "120000")

//These two lines will get us our SparkContext and our StreamingContext.
//These objects have all the root functionality we need to get started.
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(10))

//Here we are loading our HBase Configuration object. This will have
//all the information needed to connect to our HBase cluster.
//There is nothing different here from when you normally interact with HBase.
val conf = HBaseConfiguration.create();
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"));
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));

//This is a HBaseContext object. This is a nice abstraction that will hide
//any complex HBase stuff from us so we can focus on our business case.
//HBaseContext is from the SparkOnHBase project which can be found at
//https://github.com/tmalaska/SparkOnHBase
val hbaseContext = new HBaseContext(sc, conf);

//This creates a reference to our root DStream. DStreams are like RDDs but
//with the context of being in the micro-batch world. I set this to null now
//because I later give the option of populating this data from HDFS or from
//a socket. There is no reason this could not also be populated by Kafka,
//Flume, an MQ system, or anything else. I just focused on these because
//they are the easiest to set up.
var lines: DStream[String] = null

//Options for data load. Will be adding Kafka and Flume at some point
if (args(0).equals("socket")) {
  val host = args(FIXED_ARGS);
  val port = args(FIXED_ARGS + 1);

  println("host:" + host)
  println("port:" + Integer.parseInt(port))

  //Simple example of how you set up a receiver from a socket stream
  lines = ssc.socketTextStream(host, port.toInt)
} else if (args(0).equals("newFile")) {

  val directory = args(FIXED_ARGS)
  println("directory:" + directory)

  //Simple example of how you set up a receiver from an HDFS folder
  lines = ssc.fileStream[LongWritable, Text, TextInputFormat](directory,
    (t: Path) => true, true).map(_._2.toString)
} else {
  throw new RuntimeException("bad input type")
}
Lines 114 through 124: String Parsing
This is where Spark Streaming starts. Take a look at the following lines:
val ipKeyLines = lines.map[(String, (Long, Long, String))](eventRecord => {
  //Get the time and ip address out of the original event
  val time = dateFormat.parse(
    eventRecord.substring(eventRecord.indexOf('[') + 1, eventRecord.indexOf(']'))).
    getTime()
  val ipAddress = eventRecord.substring(0, eventRecord.indexOf(' '))

  //We return the time twice because we will use the first as the start time
  //and the second as the end time
  (ipAddress, (time, time, eventRecord))
})
The first line of code above runs a map function over the "lines" DStream object, parsing the original event to separate out the IP address, timestamp, and event body. For those new to Spark Streaming, a DStream holds a batch of records to be processed. Those records are populated by the receiver object defined earlier, and this map function produces another DStream within the same micro-batch that holds the transformed records for additional processing.
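To make the parsing concrete, here is a minimal, self-contained sketch of what that map body does to a single record. The sample log line and the SimpleDateFormat pattern are assumptions on my part (the demo's real dateFormat is defined up in the lines 1-59 we skipped), but a standard Apache access log line fits the substring logic above.

import java.text.SimpleDateFormat
import java.util.Locale

//Assumed date pattern; the demo's real dateFormat is defined earlier in the file
val dateFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)

//A made-up Apache-style access log line: IP first, timestamp inside brackets
val eventRecord =
  """127.0.0.1 - - [10/Oct/2014:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326"""

val time = dateFormat.parse(
  eventRecord.substring(eventRecord.indexOf('[') + 1, eventRecord.indexOf(']'))).getTime()
val ipAddress = eventRecord.substring(0, eventRecord.indexOf(' '))

//Prints the (ipAddress, (time, time, eventRecord)) tuple, e.g.
//("127.0.0.1", (1412974536000, 1412974536000, <the full log line>))
println((ipAddress, (time, time, eventRecord)))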
When looking at a Spark Streaming diagram like the one above, there are a few things to note:
- Each micro-batch fires at the interval, in seconds, that you set when constructing your StreamingContext
- The receiver is always being filled with the RDDs for the next, upcoming micro-batch
- The old RDDs from previous micro-batches are cleaned up and discarded
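Of course, none of those micro-batches fire until the StreamingContext is actually started. That boilerplate sits outside the line ranges discussed in this post, but in any Spark Streaming job it looks like this:

//Start the receivers and the micro-batch scheduler, then block so the
//driver keeps running until the job is killed
ssc.start()
ssc.awaitTermination()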
Lines 126 through 135: Producing Sessions
Now that we have the IP address and time from the web log, it is time to build sessions. The code below builds up the sessions by first aggregating the events within the micro-batch, and then reducing those chunks against the sessions held in the stateful DStream.
val latestSessionInfo = ipKeyLines.
  map[(String, (Long, Long, Long))](a => {
    //transform to (ipAddress, (time, time, counter))
    (a._1, (a._2._1, a._2._2, 1))
  }).
  reduceByKey((a, b) => {
    //transform to (ipAddress, (lowestStartTime, MaxFinishTime, sumOfCounter))
    (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3)
  }).
  updateStateByKey(updateStatbyOfSessions)
Here is an example of how the records are reduced within the micro-batch:
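The IP address and timestamps below are made up, but the per-key reduce behaves like this (sketched with plain Scala collections rather than a DStream):

//Three events for the same IP arriving in one micro-batch
val events = Seq(
  ("10.0.0.1", (1000L, 1000L, 1L)),
  ("10.0.0.1", (4000L, 4000L, 1L)),
  ("10.0.0.1", (9000L, 9000L, 1L)))

//The same reduce logic as above: lowest start time, highest finish time,
//and the sum of the event counters, per ipAddress
val reduced = events
  .groupBy(_._1)
  .map { case (ip, evs) =>
    (ip, evs.map(_._2).reduce((a, b) =>
      (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3)))
  }

println(reduced) //Map(10.0.0.1 -> (1000,9000,3))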
To join sessions across micro-batches, we can use the super-cool updateStateByKey functionality, which lets us do a join/reduce-like operation against a DStream carrying the active sessions from the previous micro-batch. The diagram below illustrates, in terms of DStreams, what this process looks like over time.
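One practical note before digging into the update function: stateful transformations like updateStateByKey require checkpointing to be enabled on the StreamingContext, so somewhere in the setup you need a line like the one below (the directory is just a placeholder; point it at a real HDFS path in your cluster).

//updateStateByKey carries state from one micro-batch to the next, so Spark
//Streaming needs a checkpoint directory to periodically persist that state.
//The path here is only a placeholder.
ssc.checkpoint("hdfs:///tmp/sessionize-checkpoint")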
Now, let's dig into the updateStatbyOfSessions function, which is defined at the bottom of the file. This code (note the detailed comments) holds much of the magic that makes sessionization happen in a continuous, micro-batch fashion.
/**
 * This function will be called on the union of keys from the reduced DStream
 * and the active sessions from the last micro-batch, with the ipAddress
 * being the key
 *
 * The goal is for this to produce a stateful RDD that has all the active
 * sessions. So we add new sessions, remove sessions that have timed
 * out, and extend sessions that are still going
 */
def updateStatbyOfSessions(
    //(sessionStartTime, sessionFinishTime, countOfEvents)
    a: Seq[(Long, Long, Long)],
    //(sessionStartTime, sessionFinishTime, countOfEvents, isNewSession)
    b: Option[(Long, Long, Long, Boolean)]
  ): Option[(Long, Long, Long, Boolean)] = {

  //This function will return an Option value.
  //If we want to delete the value we can return None.
  //The value contains four parts:
  //(startTime, endTime, countOfEvents, isNewSession)
  var result: Option[(Long, Long, Long, Boolean)] = null

  //These if statements are saying that if we didn't get a new event for
  //this session's ip address for longer than the session
  //timeout + the batch time, then it is safe to remove this key value
  //from the future stateful DStream
  if (a.size == 0) {
    if (System.currentTimeMillis() - b.get._2 > SESSION_TIMEOUT + 11000) {
      result = None
    } else {
      if (b.get._4 == true) {
        result = b
      } else {
        result = Some((b.get._1, b.get._2, b.get._3, true))
      }
    }
  }

  //Now, because we used the reduce function before this function, we are
  //only ever going to get at most one event in the Sequence.
  a.foreach(c => {
    if (b.isEmpty) {
      //If there was no value in the stateful DStream then just add it
      //as new, with true for being a new session
      result = Some((c._1, c._2, c._3, true))
    } else {
      if (c._1 - b.get._2 < SESSION_TIMEOUT) {
        //If the session from the stateful DStream has not timed out
        //then extend the session
        result = Some((
          Math.min(c._1, b.get._1), //newStartTime
          Math.max(c._2, b.get._2), //newFinishTime
          b.get._3 + c._3,          //newSumOfEvents
          false                     //This is not a new session
        ))
      } else {
        //Otherwise replace the old session with a new one
        result = Some((
          c._1,     //newStartTime
          c._2,     //newFinishTime
          b.get._3, //newSumOfEvents
          true      //new session
        ))
      }
    }
  })
  result
}
This code does a lot in a small space, and it is the most complicated part of the whole job. In summary, it keeps track of the active sessions, so you know whether you are continuing an existing session or starting a new one.
Lines 126 through 207: Counting and HBase
This part does most of the counting work. There is a lot of repetition here, so let's look at just one count example, and then step through how the counts produced over the same set of records get stored together in HBase.
val onlyActiveSessions = latestSessionInfo.filter(t =>
  System.currentTimeMillis() - t._2._2 < SESSION_TIMEOUT)
…
val newSessionCount = onlyActiveSessions.filter(t => {
    //is the session newer than the last micro-batch,
    //and is the boolean saying this is a new session true
    (System.currentTimeMillis() - t._2._2 > 11000 && t._2._4)
  }).
  count.
  map[HashMap[String, Long]](t => HashMap((NEW_SESSION_COUNTS, t)))
In summary, the code above filters out everything except the active sessions, counts them, and puts that final count record into a HashMap instance. It uses HashMap as the container so that, after all the counts are done, we can call the reduce function below to roll them all up into a single record. (I'm sure there are better ways to do this, but this approach works well enough.)
Next, the following code takes all of those HashMaps and folds their values into a single HashMap.
val allCounts = newSessionCount.
  union(totalSessionCount).
  union(totals).
  union(totalEventsCount).
  union(deadSessionsCount).
  union(totalSessionEventCount).
  reduce((a, b) => b ++ a)
Interacting with HBase from Spark Streaming is super simple with HBaseContext. All you need to do is supply the DStream of HashMaps along with a function that converts each one into a put object.
hbaseContext.streamBulkPut[HashMap[String, Long]](
  allCounts,  //The input RDD
  hTableName, //The name of the table we want to put to
  (t) => {
    //Here we are converting our input record into a put
    //The rowKey is C for Count plus a backward-counting time, so the newest
    //count shows up first in HBase's sorted order
    val put = new Put(Bytes.toBytes("C." + (Long.MaxValue - System.currentTimeMillis())))
    //We iterate through the HashMap to make all the columns with their counts
    t.foreach(kv => put.add(
      Bytes.toBytes(hFamily),
      Bytes.toBytes(kv._1),
      Bytes.toBytes(kv._2.toString)))
    put
  },
  true)
Now, this information in HBase can be wrapped up with an Apache Hive table, and then a query can be run through your favorite BI tool to produce a graph like the following, which will refresh on every micro-batch.
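If you just want to eyeball those counts without wiring up Hive, a quick read-back directly against HBase could look like the sketch below. This is not part of the original demo; it assumes the same conf and hTableName values used above and the same-era HBase client API as the rest of the code.

import org.apache.hadoop.hbase.client.{HTable, Scan}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.CellUtil
import scala.collection.JavaConverters._

//Scan only the "C." count rows; because the rowKey embeds a backward-counting
//timestamp, the newest counts come back first
val countTable = new HTable(conf, hTableName)
val scan = new Scan()
scan.setStartRow(Bytes.toBytes("C."))
scan.setStopRow(Bytes.toBytes("C/")) //'/' is the byte after '.', so this bounds the prefix

val scanner = countTable.getScanner(scan)
scanner.asScala.take(5).foreach(result =>
  result.rawCells().foreach(cell =>
    println(Bytes.toString(CellUtil.cloneQualifier(cell)) + " = " +
      Bytes.toString(CellUtil.cloneValue(cell)))))
scanner.close()
countTable.close()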
Lines 209 through 215: Writing to HDFS
The final task is to join the event data with the active session information, and then persist the events to HDFS keyed by the session's start time.
//Persist to HDFS
ipKeyLines.join(onlyActiveSessions).
  map(t => {
    //Session root start time | Event message
    dateFormat.format(new Date(t._2._2._1)) + "\t" + t._2._1._3
  }).
  saveAsTextFiles(outputDir + "/session", "txt")
Conclusion
I hope you come away from this example feeling like a lot of work got done with only a small amount of code, because it did. Imagine what else you could build with this pattern, given how easily Spark Streaming interacts with HBase and HDFS.
Ted Malaska is a Solutions Architect at Cloudera, a contributor to Apache Spark, and a co-author of the O'Reilly book Hadoop Application Architectures.