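Learning from Klout How to Process Big Data with Iteratees

Posted by 海興 on 2013-02-02

Klout draws on several social networks as sources, collects each user's data from them within a limited time window, aggregates it, computes the user's social influence from the result, and then presents these metrics to the user in visual form.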

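Presenting this data promptly and accurately puts Klout under considerable technical pressure. This article describes how Klout applied Play! Iteratees in its redesigned data collection pipeline to meet that goal. It does not explain the basic concepts of Iteratees, because many excellent articles already cover them, such as those by James Roper and Josh Suereth. The focus here is the practical use of Iteratees in large-scale data collection and Klout's reasons for choosing them for this problem. A follow-up article will introduce Klout's Akka-based distributed messaging infrastructure, which lets Klout spread Iteratee-based collection across the machines of a cluster and scale well.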

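Iteratees in three sentences

First, Iteratees are a functional way to model data flowing between producers and consumers. An Enumerator produces data chunk by chunk, an optional Enumeratee maps or adapts those chunks, and an Iteratee consumes them. The stages can be composed so that they work together like a pipeline.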

Enumerator (produce data) → Enumeratee (map data) → Iteratee (consume data)
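
The shape of this pipeline can be sketched with nothing but the Scala standard library. The sketch below is only an analogy: it uses a synchronous Iterator instead of Play's asynchronous types, and every name in it (PipelineSketch, producer, adapter, consumer) is hypothetical.

```scala
object PipelineSketch {
  // Producer (the Enumerator role): emits chunks of data lazily, one at a time.
  def producer: Iterator[String] = Iterator("1", "2", "3")

  // Adapter (the Enumeratee role): converts each chunk from the produced type
  // to the type the consumer expects.
  def adapter(chunks: Iterator[String]): Iterator[Int] = chunks.map(_.toInt)

  // Consumer (the Iteratee role): folds the incoming chunks into one result.
  def consumer(chunks: Iterator[Int]): Int = chunks.foldLeft(0)(_ + _)

  // The stages compose like sections of a pipe.
  def total: Int = consumer(adapter(producer))
}
```

Each stage only knows the type it receives and the type it emits, which is what makes the stages interchangeable.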

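Play! Iteratees offer other kinds of composition as well, such as interleaving Enumerators (several concurrent enumerators), chaining Enumeratees, and folding with Iteratees (grouping enumerated data into larger chunks).

The original data collection

Klout's original collection framework was written in Java and built on the java.util.concurrent library. Collection was therefore carried out by isolated nodes that fetched user data sequentially, using blocking threads to implement convoluted success and retry semantics. As the user base grew, the data set grew quickly with it, and the system's inefficiency became more and more apparent.

Data was fetched and written to disk in much the same way that the Apache web server serves requests: for each user collection request, a single thread handled every possible IO operation along the code path. That IO consisted of roughly the following three stages (for each user/network/data-type combination):

Dispatch the appropriate API calls and parse the JSON responses
Write the data to disk / HDFS
Update metrics

These three stages must run in order, but the work inside a single stage is highly parallelizable and, perhaps more importantly, can run asynchronously. In the first stage, for example, we issue several parallel API calls to build one activity for a user. Suppose John Doe posts "I like fried dough sticks for breakfast!" on Facebook and receives 20 comments and 200 likes. The activity made up of that status message, all 20 comments, and all 200 likes can be built asynchronously and in parallel.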
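
The ordering constraint can be illustrated with standard-library Futures. This is a minimal sketch, not Klout's implementation: the fetch, write, and metrics functions are hypothetical stand-ins that return canned values instead of calling a real API, disk, or metrics system.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object StagesSketch {
  final case class Activity(status: String, comments: List[String], likes: List[String])

  // Hypothetical stand-ins for the API calls of stage 1.
  def fetchStatus(): Future[String] = Future("status message")
  def fetchComments(): Future[List[String]] = Future(List.tabulate(20)(i => s"comment-$i"))
  def fetchLikes(): Future[List[String]] = Future(List.tabulate(200)(i => s"like-$i"))

  // Stage 1: the three futures are started before they are combined,
  // so the calls run concurrently rather than one after another.
  def buildActivity(): Future[Activity] = {
    val status = fetchStatus()
    val comments = fetchComments()
    val likes = fetchLikes()
    for { s <- status; c <- comments; l <- likes } yield Activity(s, c, l)
  }

  // Stage 2 (hypothetical): pretend to persist the activity.
  def writeToDisk(activity: Activity): Future[Activity] = Future(activity)

  // Stage 3 (hypothetical): pretend to report the number of interactions.
  def updateMetrics(activity: Activity): Future[Int] =
    Future(activity.comments.size + activity.likes.size)

  // The stages themselves are chained with flatMap, so they run strictly in order.
  def collect(): Future[Int] =
    buildActivity().flatMap(writeToDisk).flatMap(updateMetrics)

  // Blocking here is only for demonstration purposes.
  def interactions: Int = Await.result(collect(), 10.seconds)
}
```

Nothing blocks a thread while the calls are in flight; the only blocking call is the final Await, which exists purely so the sketch can return a plain value.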

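The new Iteratee-based collection framework

Recognizing the benefits of a non-blocking, concurrent implementation, Klout decided to re-architect the collection framework on the Play! + Scala + Akka stack. The stack comes with many excellent features and libraries, and the one that interested Klout most was Typesafe's implementation of Iteratees. It includes plenty of useful pieces, such as Iteratee.foreach and a rich set of Enumeratee implementations. Klout also makes heavy use of the Play! WebServices library, a thin Scala wrapper around the Ning asynchronous HTTP client that integrates neatly with the iteratee library through Play Promises (full integration with Akka promises has to wait for the Play 2.1 release).

The paging Enumerator

Data in social networks is available through API calls, which generally return paginated JSON. To handle it, we need a generic, abstract paging mechanism that works for every kind of data we fetch, such as posts, likes, or comments. Klout took advantage of the next link present in every page of data and handled this elegantly with a fromCallback Enumerator: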

import play.api.libs.json._
import play.api.libs.ws._
import play.api.libs.iteratee._

def pagingEnumerator(url: String): Enumerator[JsValue] = {
    // Next url to fetch. The explicit Option[String] type allows reassignment to None.
    var maybeNextUrl: Option[String] = Some(url)
    Enumerator.fromCallback[JsValue] ( retriever = {
        val maybeResponsePromise =
            maybeNextUrl map { nextUrl =>
                WS.url(nextUrl).get.map { response =>
                    val json = response.json
                    maybeNextUrl = (json \ "next_url").asOpt[String]
                    val code = response.status //Potential error handling here
                    json
                }
            }

        /* maybeResponsePromise will be an Option[Promise[JsValue]]. 
         * Need to 'flip' it, to make it a Promise[Option[JsValue]] to 
         * conform to the fromCallback constraints.
         * PlayPromise is assumed to be an alias for Play's Promise. */
        maybeResponsePromise match {
            case Some(responsePromise) => responsePromise map Some.apply
            case None                  => PlayPromise pure None
        }
    })
}

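The code above has no error handling, retry, or backoff logic, but it gives a good first impression. For a given starting link, the enumerator only has to track one piece of mutable state, maybeNextUrl, which is updated on every call to the retriever function. The paging enumerator fetches data on demand, never retrieving more than is needed. For example, you can apply it to a file-writing Enumeratee to make sure the disk is not overwhelmed, or use a 'take' Enumeratee to limit the number of WS calls. Then attach a status-updating iteratee to the chain to make sure the database is not overwhelmed. If this is still unclear, think of an Enumeratee as an adapter that converts the data type produced by an Enumerator into the data type consumed by an Iteratee.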
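
The on-demand behaviour can be demonstrated without Play at all. The sketch below is a standard-library analogy (Scala 2.13 or later, for Iterator.unfold), not the Play API: Page, FakeApi, and the URLs are hypothetical, and the "API" is an in-memory map that counts how often it is called.

```scala
object PagingSketch {
  // Hypothetical page: some items plus an optional link to the next page.
  final case class Page(items: List[String], next: Option[String])

  // Hypothetical in-memory data; a real collector would issue HTTP calls instead.
  val pages: Map[String, Page] = Map(
    "/posts?page=1" -> Page(List("a", "b"), Some("/posts?page=2")),
    "/posts?page=2" -> Page(List("c", "d"), Some("/posts?page=3")),
    "/posts?page=3" -> Page(List("e"), None)
  )

  // Fake API that records how many pages were actually requested.
  final class FakeApi {
    var fetchCount = 0
    def fetch(url: String): Page = { fetchCount += 1; pages(url) }
  }

  // Lazily follows the `next` links: a page is fetched only when the
  // consumer asks for it, and iteration stops when there is no next link.
  def pagingIterator(api: FakeApi, startUrl: String): Iterator[Page] =
    Iterator.unfold(Option(startUrl)) { maybeUrl =>
      maybeUrl.map { url =>
        val page = api.fetch(url)
        (page, page.next)
      }
    }
}
```

Calling `pagingIterator(api, "/posts?page=1").take(2).toList` consumes two pages, and `api.fetchCount` afterwards reflects only the pages that were pulled, which is the same idea as limiting WS calls with a 'take' Enumeratee.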

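Enumerator of Enumerators

The paging enumerator works well for keeping track of the next page's URL, but the JSON in each page is usually a list of posts that need to be processed individually. Each post typically has a set of likes and comments, each with its own fetch URL, which must also be paged through and merged into the original post's JSON to build a complete activity ready for final processing. Klout wanted to treat each activity as a standalone JSON document that includes its like and comment metadata, while issuing as many concurrent API calls as possible without overwhelming its own systems. Thanks to the highly composable nature of the Iteratee library, the likes and comments can be fetched while the stream of posts is being processed, and each activity can be built in parallel with a combination of Enumerator.interleave and Iteratee.fold: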

type CommentsOrLikes = Either[Comments, Likes]

def buildActivity(post: Post): Promise[Activity] = {

    val likeUrl = LIKE_URL.format(post.id, target.token)
    val commentsUrl = COMMENT_URL.format(post.id, target.token)

    /*Construct paging enumerators, mapping each value to either Left or Right*/
    val comments = pagingEnumerator(commentsUrl).map(Left.apply) 
    val likes = pagingEnumerator(likeUrl).map(Right.apply) 

    //Enumerate likes and comments in parallel with the 'interleave' function
    val content:Enumerator[CommentsOrLikes] = likes interleave comments

    /*Initial value for fold*/
    val activity = Activity(Nil, post, Nil)

    /*Fold over each enumerated value, building the activity as we go*/
    val activityIterateePromise = 
        content |>> Iteratee.fold[CommentsOrLikes, Activity](activity) {        
            // The fold function receives the running accumulator along with each element
            case (current, Left(comments)) => current copy (comments = current.comments ++ comments)
            case (current, Right(likes))   => current copy (likes = current.likes ++ likes)
        }

    /* Finally, activityIterateePromise will be a Promise[Iteratee[Activity]], 
     * which we need to turn into an Iteratee[Activity] and then run it to 
     * actually build our activity */
    Iteratee.flatten(activityIterateePromise).run    
}
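
The folding step is easier to see in isolation. The following is a minimal sketch using only the Scala standard library, with a List standing in for the interleaved Enumerator and a simplified, hypothetical Activity whose fields are plain strings; it is not Klout's code. The point it illustrates is that the fold function receives the running accumulator together with each new element, so every page of comments or likes is appended to what has been gathered so far.

```scala
object FoldSketch {
  // Simplified, hypothetical stand-in for the article's Activity type.
  final case class Activity(comments: List[String], post: String, likes: List[String])

  // Left carries a page of comments, Right carries a page of likes.
  type CommentsOrLikes = Either[List[String], List[String]]

  // Pages in the order they might arrive from two interleaved sources.
  val content: List[CommentsOrLikes] = List(
    Left(List("c1", "c2")),
    Right(List("l1")),
    Left(List("c3")),
    Right(List("l2", "l3"))
  )

  // Start from an empty activity and extend the accumulator at every step.
  def activity: Activity =
    content.foldLeft(Activity(Nil, "post-1", Nil)) {
      case (acc, Left(comments)) => acc.copy(comments = acc.comments ++ comments)
      case (acc, Right(likes))   => acc.copy(likes = acc.likes ++ likes)
    }
}
```

Because the two kinds of pages are tagged with Left and Right, the order in which they arrive does not matter to the final result.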

The buildActivity method can now be applied to every post in every post list:

val posts:Enumerator[List[Post]] = pagingEnumerator(postsUrl) map parseToPostList
/* parseToPostList does exactly that. Creates a list of Post objects from json*/

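/* buildActivity returns a Promise[Activity], so a plain map would yield an
 * Enumerator[Promise[Activity]]. KloutEnumeratee.flatMap (defined later in this
 * article) unwraps each Promise to give the Enumerator[Activity] the type requires. */
val activities:Enumerator[Enumerator[Activity]] = posts.map{ 
    postList =>    
        Enumerator.apply(postList:_*) &> KloutEnumeratee.flatMap[Post](buildActivity)
}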

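Finally, we need to flatten the Enumerator of Enumerators to create an Enumerator of activities. At the time of writing, flattening Enumerators is not a standard operation in the Play! Iteratee library, so we have to write it ourselves: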

/*
* Flatten an enumerator of enumerators of some type into an enumerator of some type
*/
def flatten[T](enumerator: Enumerator[Enumerator[T]]): Enumerator[T] = new Enumerator[T] {
    def step[A](it: Iteratee[T, A])(in: Input[Enumerator[T]]): Iteratee[Enumerator[T], Iteratee[T, A]] = {
        in match {
            case Input.EOF   => Done(Iteratee.flatten(it.feed(Input.EOF)), Input.EOF)
            case Input.Empty => Cont(step(it))
            case Input.El(e) => {
                val promise = e |>> Cont(removeEof(it))
                val next = Iteratee.flatten(promise.flatMap(_.run))
                next.pureFlatFold(
                    (v, l) => Done(next, in),
                    (_) => Cont(step(next)),
                    (msg, input) => Error(msg, in))
            }
        }
    }

    def apply[A](it: Iteratee[T, A]): PlayPromise[Iteratee[T, A]] = {
        it.fold(
            (v, l) => PlayPromise pure it,
            (_) => enumerator |>> Cont(step(it)) flatMap (_.run),
            (msg, input) => PlayPromise pure it
        )
    }
}

/*Wrap the iteratee with an outer feeding iteratee, which does not feed EOF*/
def removeEof[A, T](inner: Iteratee[T, A])(el: Input[T]): Iteratee[T, Iteratee[T, A]] = {
    el match {
        case Input.Empty | Input.El(_) =>
            inner.pureFlatFold (
                (n, i) => Done(inner, Input.Empty),
                k => Cont(removeEof(k(el))),
                (m, i) => Error(m, i))
        case Input.EOF => Done(inner, Input.Empty)
    }
}
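
What flatten achieves can be pictured with a synchronous analogy. The sketch below uses the standard library's Iterator, whose flatten is already built in; it says nothing about how the asynchronous Play version works internally, and the names and sample data are hypothetical.

```scala
object FlattenSketch {
  // Each outer element stands for one page of posts; each inner iterator
  // yields the activities built from that page.
  def activitiesFor(page: Int): Iterator[String] =
    Iterator(s"p$page-a1", s"p$page-a2")

  // The nested structure: an iterator of iterators.
  def nested: Iterator[Iterator[String]] = Iterator(1, 2, 3).map(activitiesFor)

  // Flattening exposes the inner elements as one continuous stream,
  // moving to the next inner iterator only when the current one is exhausted.
  def flat: Iterator[String] = nested.flatten
}
```

Downstream stages then see a single stream of activities and do not need to know that it was assembled page by page.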

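The file-writing Enumeratee

Armed with this on-demand, paginated, and parallel Enumerator of activities, we need to hook it up to the file-writing logic. To keep things simple, we will not go into the internals of file writing and will assume it is all done by the following function: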

writeToFile(json: JsValue): Promise[Either[Failure, Success]]

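Judging by its type signature, writeToFile runs asynchronously and eventually yields either a Failure or a Success object. We can use it to build an Enumeratee and attach that to the activity Enumerator (as part of the Iteratee pipeline):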

/*Enumeratee to manage writing to the file writer. Mapping any errors to Left*/
type ErrorOrActivity = Either[Error,Activity]

def fileWriting: Enumeratee[Activity, ErrorOrActivity] = {        

    /* writeToFile returns a Promise, but the Enumeratee type constraint 
     * does not expect a Promise. flatMap will return an 
     * Enumeratee[Activity,ErrorOrActivity] given a function from Activity 
     * to Promise[ErrorOrActivity].
     */
    // The type parameter is the input type of the Enumeratee, i.e. Activity
    KloutEnumeratee.flatMap[Activity] { activity=>            
        writeToFile(activity.json).map{
            case e @ Failure(_)     => Left(e)
            case _                  => Right(activity) 
        }            
    }
}

flatMap is not part of the standard Iteratee library either:

object KloutEnumeratee {
    def flatMapInput[From] = new {
        def apply[To](f: Input[From] => PlayPromise[Input[To]]) = 
            new Enumeratee.CheckDone[From, To] { //Checkdone is part of the Play Iteratee library
                def step[A](k: K[To, A]): K[From, Iteratee[To, A]] = {
                    case in @ (Input.El(_) | Input.Empty) =>
                        val promiseOfInput = f(in)
                        Iteratee.flatten(promiseOfInput map { input =>
                            new CheckDone[From, To] {
                                def continue[A](k: K[To, A]) = Cont(step(k))
                            } &> k(input)
                        })

                    case Input.EOF => Done(k(Input.EOF), Input.EOF)
                }

                def continue[A](k: K[To, A]) = Cont(step(k))
            }
    }

    def flatMap[E] = new {
        def apply[NE](f: E => Promise[NE]): Enumeratee[E, NE] = flatMapInput[E]{
            case Input.El(e) => f(e) map (Input.El(_))
            case Input.Empty => Promise pure Input.Empty
            case Input.EOF   => Promise pure Input.EOF
        }

    }
}

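The file-writing Enumeratee simply maps each Activity to an Either: if writeToFile fails, it holds the Failure; if it succeeds, it holds the Activity that needs further processing. Note that although file writing is conceptually more of an Iteratee task, we do not want to "consume" the input from the Enumerator, only map it for later processing, so an Enumeratee is the better fit. Stage 2 of the 3-stage pipeline is now complete.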
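
The mapping described here can be sketched with standard-library Futures in place of Play Promises. Everything in the sketch is an assumption made for illustration: WriteError, the simplified Activity, and a writeToFile that fails for odd ids are hypothetical, and activities are written one at a time to mimic a stage that waits for each write before accepting the next input.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FileWritingSketch {
  final case class Activity(id: Int)
  final case class WriteError(message: String)
  type ErrorOrActivity = Either[WriteError, Activity]

  // Hypothetical asynchronous writer: fails for odd ids, succeeds otherwise.
  def writeToFile(activity: Activity): Future[Either[WriteError, Unit]] =
    Future {
      if (activity.id % 2 == 0) Right(())
      else Left(WriteError(s"cannot write ${activity.id}"))
    }

  // Map one activity to an Either: the error on failure,
  // the untouched activity on success, so later stages can still use it.
  def fileWriting(activity: Activity): Future[ErrorOrActivity] =
    writeToFile(activity).map {
      case Left(error) => Left(error)
      case Right(_)    => Right(activity)
    }

  // Write activities one after another; each write starts only after
  // the previous one has completed.
  def runAll(activities: List[Activity]): Future[List[ErrorOrActivity]] =
    activities.foldLeft(Future.successful(List.empty[ErrorOrActivity])) { (accF, activity) =>
      accF.flatMap(acc => fileWriting(activity).map(result => acc :+ result))
    }

  // Blocking here is only for demonstration purposes.
  def results: List[ErrorOrActivity] =
    Await.result(runAll(List(Activity(1), Activity(2), Activity(3))), 10.seconds)
}
```

The stage does not swallow its input: a successful write passes the Activity along unchanged, which is why it behaves like an adapter rather than a sink.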

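The status-updating Iteratee

While each Activity is processed, we need to continuously collect and report status, cursor information, errors, and other metadata. Since this is the last stage, it should be the sink of the pipeline, that is, an Iteratee. To keep the focus on the main point, the Iteratee below is a simplified version, but it is enough to illustrate the idea: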

/* Status updating and reporting iteratee*/
def updatingStatus:Iteratee[ErrorOrActivity,Unit] = Iteratee.foreach[ErrorOrActivity] {
    case Left(error)        => 
        reportError(error)
        statsd("collector.error")
    case Right(activity)    => 
        reportSuccess(activity) 
        statsd("collector.success")
}

Putting it all together

The last step is to bring these pieces together to do something useful:

//The collect function below returns an Enumerator[Activity], given some target meta-data
val iterateePromise = collect(target) &> fileWriting |>> updatingStatus
iterateePromise.flatMap(_.run)

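The beauty of this framework lies in its simplicity and, even more, in its composability. A new stage can be added to the pipeline simply by implementing an Enumeratee or Iteratee of the appropriate type, and the other benefits come for free.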
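
The composability claim can be made concrete with a small synchronous analogy. This sketch models a stage as a function between Iterators, which is an assumption made for illustration and not the Play API; the Stage alias, the onlyValid stage, and the sample data are all hypothetical. It shows an extra stage being slotted in without touching the existing ones.

```scala
object ComposeSketch {
  // A stage transforms one stream into another (the Enumeratee role).
  type Stage[A, B] = Iterator[A] => Iterator[B]

  final case class Activity(id: Int)
  final case class Stats(errors: Int, successes: Int)

  // Source (the Enumerator role).
  def collect: Iterator[Activity] =
    Iterator(Activity(0), Activity(1), Activity(2), Activity(3), Activity(4))

  // Existing stage: pretend that writing fails for odd ids.
  val fileWriting: Stage[Activity, Either[String, Activity]] =
    _.map(a => if (a.id % 2 == 0) Right(a) else Left(s"write failed for ${a.id}"))

  // New stage added later: drops activities with non-positive ids.
  // Its types line up with fileWriting, so no other stage has to change.
  val onlyValid: Stage[Activity, Activity] = _.filter(_.id > 0)

  // Sink (the Iteratee role): tallies errors and successes.
  def updatingStatus(results: Iterator[Either[String, Activity]]): Stats =
    results.foldLeft(Stats(0, 0)) {
      case (stats, Left(_))  => stats.copy(errors = stats.errors + 1)
      case (stats, Right(_)) => stats.copy(successes = stats.successes + 1)
    }

  // Source, then the new stage, then the existing stage, then the sink.
  def run(): Stats = updatingStatus(onlyValid.andThen(fileWriting)(collect))
}
```

The compiler checks that each stage's output type matches the next stage's input type, so a stage of the wrong shape cannot be inserted by accident.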

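Data collection is the foundation of the Klout experience and a prerequisite for aggregating, analyzing, and tracking the influence of our social lives. It is thanks to an excellent collection framework that Klout can highlight our most influential moments.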