Study notes on "Persistent Actor State" from "Principles of Reactive Programming"

Posted by devos on 2014-12-12

This is the last lecture of week 6 of "Principles of Reactive Programming".

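Why persist an actor's state?

If an actor has no state, it behaves the same way at any point in time. A stateful actor's behavior, however, depends on its current state. So when the system goes down unexpectedly, recovering the system means recovering the state of its actors.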

Actors representing a stateful resource

  • shall not lose important state due to (system) failure
  • must persist state as needed
  • must recover state at (re)start

How can an actor's state be recorded?

There are two possibilities for persisting an actor's state:

  • in-place updates
  • persist changes in append-only fashion

 

The first approach

The first is to have the actor mirror a persistent storage location and do in-place updates of both. So when the actor's state changes, the persistent location is also updated. This persistent location could be files, in the file system. Or it could also be a record in a relational database.

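In other words, the first approach persists the actor's current state directly, as a mirror image of that state. Whenever the actor's state is updated, the mirror is updated too. For example, the state of an actor that consumes Kafka messages is the topic, partition, and offset it has processed up to. It can record this state in ZooKeeper and update it each time it finishes processing a message. This is what Storm's Kafka spout does, except that the spout is not an actor.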

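The second approach

The other way is to not persist the state itself and update it, but to persist the changes which are applied to the state. And this is done in an append-only fashion, meaning that these change records will never be deleted. They will only be added to. The current state can then be derived by reapplying all changes from the beginning.

The second approach does not save the current state in place of the previous one; it records the changes to the state, that is, the action that takes the state from A to B. On recovery, replaying all the recorded changes yields the state to be restored. For example, if the actor is a counter, then with the first approach it updates the counter value stored in a database or file after processing each message; with the second approach it records, after each message, that the counter was incremented by 1.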
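The counter example can be sketched in plain Scala. This is a minimal illustration, not code from the lecture: `InPlaceStore`, `ChangeLog`, `Incremented`, and `recover` are hypothetical names, and both stores live in memory rather than in a database or file.

```scala
object CounterPersistenceDemo {
  // Approach 1: in-place update. The store holds only the latest value.
  final class InPlaceStore {
    private var value = 0
    def save(newValue: Int): Unit = { value = newValue }
    def load(): Int = value
  }

  // Approach 2: append-only log. The store holds every change ever made.
  sealed trait Change
  case object Incremented extends Change

  final class ChangeLog {
    private var entries = Vector.empty[Change]
    def append(c: Change): Unit = { entries = entries :+ c }
    def all: Vector[Change] = entries
  }

  // Recovery for approach 2: fold every recorded change over the initial state.
  def recover(log: ChangeLog): Int =
    log.all.foldLeft(0) { case (count, Incremented) => count + 1 }

  // Processes n messages and returns what each store recovers.
  def run(n: Int): (Int, Int) = {
    val store = new InPlaceStore
    val log = new ChangeLog
    var counter = 0
    for (_ <- 1 to n) {
      counter += 1
      store.save(counter)     // overwrite the mirrored value
      log.append(Incremented) // record the change itself
    }
    (store.load(), recover(log))
  }

  def main(args: Array[String]): Unit = println(run(3))
}
```

Both stores recover the same counter value; the difference is that the log grows by one entry per message while the in-place store stays at a single value.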

Each approach has its strengths and weaknesses.

For the first approach:

There are obvious benefits to persisting the state and doing in-place updates.

The first is that recovery of the latest state can be done in constant time, because you just need to go to that one memory location and read it back.

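The other advantage is that the data volume you need for storage depends only on the number of records and not on the rate of change.

The first approach can recover the actor's state in constant time, because the state only has to be read back from storage. Another benefit is that the amount of data to store depends only on the number of records, not on how fast the state changes. With the second approach the stored data keeps growing as the state changes, whereas with the first it need not.

For the second approach: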

But there are also benefits to persisting the changes.

For example, if you do that you can go back to any point in time and replay history, audit what happened in which order or restore a certain state. Say from last Thursday, because you need to either rerun what has happened or you need to discard all the changes which have been done since then.

During a replay, the code which handles the processing could also have, for example, been fixed, because it had a bug previously. And that means that errors which crept into the current state can be corrected retroactively. This is not possible if you only store the current state, because it will have the bug in it.

You all have seen the third advantage at work. For example, if you were shopping at a large shopping site on the Internet, which we all know well, if you look at the shopping cart and you put an item in, it is in the shopping cart. You might continue shopping, take it out, replace it by another one, and finally, once you go to the checkout, the current contents of the shopping cart is what you actually buy. If you only persist that, then the whole history is lost. But it might be very interesting to keep statistics. For example, this refrigerator has been replaced in 50% of the cases by that other one, and people can then learn from other people's decisions. Of course, these insights can also be used inside the company itself to organize their logistics processes. Storing all these events takes a lot of space, but space is comparatively cheap nowadays. And therefore, if profit can be made from analyzing these data, then it's well worth it.

The fourth advantage has to do with hardware and how that works. If you write to an append-only stream, you can write at a much higher bandwidth to IO, to network devices and also to hard disks. The reason is that in-place updates need to at least appear to occur in exactly the order in which they were given, which limits the possibilities for optimization.

Finally, persisting immutable data has the advantages we have seen throughout the functional programming course. Anything which cannot possibly change can be freely shared and replicated. There is no need to synchronize access, and whether you store an event stream to one, two, or three locations does not make a difference.

To sum up, the second approach has the following advantages:

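  • History can be replayed, audited or restored. The whole course of state changes is visible, not just the result, so a particular state can be restored and the change history can be audited.
  • Some processing errors can be corrected retroactively.
  • Additional insight can be gained on business processes. Compared with in-place updates, storing the state changes reveals more: some data exists only during the course of the changes and is not reflected in the final state, so it cannot be obtained when only the current state is stored.
  • Writing an append-only stream optimizes IO bandwidth. Writing to an append-only output stream is faster. If the state is written to a file, for example, every state update has to rewrite the content at a specific position in that file. If changes are recorded instead, each change can simply be appended to the log file, which is faster. (Note: I don't think this holds unconditionally; it depends on the storage medium and on how the state is stored.)
  • Changes are immutable and can freely be replicated. A recorded change never changes again, so it can be copied freely without worrying about conflicts. (This also depends on the situation.)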

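The two approaches can of course be combined by using snapshots, as the HDFS secondary namenode does. Besides keeping a change log, it periodically uses the change log to produce a snapshot of the namenode's state at that moment. This guarantees that recovery completes within a bounded time. A snapshot is also immutable and can be written in append fashion (taking a snapshot writes the current state sequentially to disk or other storage, and the snapshot is never updated afterwards, so it is append-only), which makes it efficient.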
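How a snapshot bounds the replay cost can be sketched as follows. This is an illustrative model only, not HDFS or Akka code: `Added`, `Snapshot`, and the helper functions are hypothetical names, and the state is a single `Int`.

```scala
object SnapshotDemo {
  sealed trait Change
  case class Added(amount: Int) extends Change

  // A snapshot records the state together with how many log entries it covers.
  case class Snapshot(state: Int, upTo: Int)

  def applyChange(state: Int, c: Change): Int = c match {
    case Added(n) => state + n
  }

  // Full replay: the cost grows with the length of the log.
  def recoverFromLog(log: Vector[Change]): Int =
    log.foldLeft(0)(applyChange)

  // Snapshot recovery: start from the snapshot and replay only the tail.
  def recoverFromSnapshot(snap: Snapshot, log: Vector[Change]): Int =
    log.drop(snap.upTo).foldLeft(snap.state)(applyChange)

  def takeSnapshot(log: Vector[Change]): Snapshot =
    Snapshot(recoverFromLog(log), log.length)

  def main(args: Array[String]): Unit = {
    val early = Vector[Change](Added(1), Added(2), Added(3))
    val snap = takeSnapshot(early)
    val log = early ++ Vector(Added(4), Added(5))
    println(recoverFromLog(log))            // replays all five changes
    println(recoverFromSnapshot(snap, log)) // replays only the last two
  }
}
```

Both recovery paths yield the same state; the snapshot path only has to process the entries appended after the snapshot was taken.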

How should the state changes be persisted?

There are two ways.

Command-Sourcing:

Persist the command before processing it, persist acknowledgement when processed

 

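This approach persists the message sent to the actor, the command, before it is actually delivered to the actor. Recovery then simply replays the persisted commands. The problem is that replaying a command means processing the message again, so any effect the actor has on the outside world is repeated as a side effect of recovery. For example, if the actor sends messages to other actors while processing a command, it sends those messages again during recovery. Akka has a solution for such re-sent messages: channels (in 2.3.4, Channel and PersistentChannel were replaced by AtLeastOnceDelivery). A channel keeps track of the messages that have already been sent and so avoids sending them again.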

 Event-Sourcing: 

Generate change requests ("events") instead of modifying local state; persist and apply them.

 

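This approach does not store the message itself but the state change that the message causes, and nothing more than that change. When the actor's state is recovered, the changes are read from the log and applied directly, so there are no side effects such as re-sent messages.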
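The difference between the two recovery styles can be sketched without Akka. In this hypothetical model, handling a `Deposit` command yields a `Deposited` event plus an outgoing notification that stands in for a message sent to another actor; all names are made up for illustration.

```scala
object SourcingDemo {
  case class Deposit(amount: Int)   // command: a request from outside
  case class Deposited(amount: Int) // event: the resulting state change

  // Handling a command yields an event plus an outgoing message (the side effect).
  def handle(cmd: Deposit): (Deposited, String) =
    (Deposited(cmd.amount), s"notify: deposited ${cmd.amount}")

  // Applying an event only changes state.
  def applyEvent(balance: Int, e: Deposited): Int = balance + e.amount

  // Command-sourcing recovery re-runs the handler, so the messages go out again.
  def recoverFromCommands(journal: Seq[Deposit]): (Int, Vector[String]) =
    journal.foldLeft((0, Vector.empty[String])) { case ((balance, sent), cmd) =>
      val (event, message) = handle(cmd)
      (applyEvent(balance, event), sent :+ message)
    }

  // Event-sourcing recovery only re-applies state changes; nothing is sent.
  def recoverFromEvents(journal: Seq[Deposited]): (Int, Vector[String]) =
    (journal.foldLeft(0)(applyEvent), Vector.empty[String])

  def main(args: Array[String]): Unit = {
    println(recoverFromCommands(Seq(Deposit(5), Deposit(7))))
    println(recoverFromEvents(Seq(Deposited(5), Deposited(7))))
  }
}
```

Both journals restore the same balance, but only the command journal produces outgoing messages during recovery.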

When to Apply the Events?

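In the scheme shown in the lecture's event-sourcing diagram, events are first sent to the log, which is usually an actor. After the log has persisted the events, it replays them to the actor; only then does the actor apply them, and only then does its state change.

This too has a problem, as the following example shows.

The code below models a blog site that limits each user to a fixed number of posts; in the code the limit is set to 1.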

import akka.actor.Actor

// Events describe state changes; they are what gets written to the log.
sealed trait Event
case class PostCreated(text: String) extends Event
case object QuotaReached extends Event

case class State(posts: Vector[String], disabled: Boolean) {
  // Returns the state that results from applying one event.
  def updated(e: Event): State = e match {
    case PostCreated(text) => copy(posts = posts :+ text)
    case QuotaReached      => copy(disabled = true)
  }
}

class UserProcessor extends Actor {
  var state = State(Vector.empty[String], false)
  def receive = {
    case UserProcessor.NewPost(text) =>
      if (!state.disabled)
        emit(PostCreated(text), QuotaReached)
    // Events come back from the log once they have been persisted.
    case e: Event =>
      state = state.updated(e)
  }

  def emit(events: Event*) = ... // send to log
}

object UserProcessor {
  case class NewPost(text: String)
}

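In the code above, an Event represents a state change; it is the "event" of event sourcing. UserProcessor handles the user's request to submit a post (NewPost). It first checks whether the user has already reached the post limit (if(!state.disabled)) and, if not, sends the events to the log. When the events come back from the log, UserProcessor applies them to change its own state.

The catch is that this logic is flawed. After receiving NewPost, UserProcessor does not change its state immediately; the state changes only once the events have come back from the log. If the user sends another NewPost after the events were emitted but before they come back, that NewPost is accepted too, even though the limit is 1, because UserProcessor's state has not changed yet.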
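The race can be reproduced deterministically with a small simulation in plain Scala. This is a sketch under stated assumptions, not Akka code: the mailbox is modeled as an immutable `Queue`, and the log is assumed to echo persisted events back by enqueueing them behind whatever is already waiting. The object name and the `demo` helper are hypothetical.

```scala
import scala.collection.immutable.Queue

object ApplyAfterPersistRace {
  sealed trait Msg
  case class NewPost(text: String) extends Msg
  sealed trait Event extends Msg
  case class PostCreated(text: String) extends Event
  case object QuotaReached extends Event

  case class State(posts: Vector[String], disabled: Boolean) {
    def updated(e: Event): State = e match {
      case PostCreated(text) => copy(posts = posts :+ text)
      case QuotaReached      => copy(disabled = true)
    }
  }

  // Processes the mailbox front to back. Emitted events are echoed back by
  // enqueueing them at the end, behind any message that was already waiting.
  @annotation.tailrec
  def run(mailbox: Queue[Msg], state: State): State =
    mailbox.dequeueOption match {
      case None => state
      case Some((NewPost(text), rest)) if !state.disabled =>
        run(rest.enqueue(PostCreated(text)).enqueue(QuotaReached), state)
      case Some((NewPost(_), rest)) => run(rest, state) // quota reached: dropped
      case Some((e: Event, rest))   => run(rest, state.updated(e))
    }

  // Two posts arrive back to back, before the log has replied to the first.
  def demo(): State =
    run(Queue(NewPost("first"), NewPost("second")), State(Vector.empty, false))

  def main(args: Array[String]): Unit = println(demo().posts)
}
```

Because the second NewPost is handled while `disabled` is still false, both posts end up in the state even though the quota is 1.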

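One remedy is to apply the event before persisting it. So there are two choices for when to apply events: apply after persisting, or apply before persisting.

At first glance the second choice looks better. But consider it from another angle. In the example above, State's updated method appends the new post text to a Vector when it receives PostCreated, so the vector gains an element; we regard this as a change to the actual state. Now compare how the two choices behave when the actor fails.

With the first choice (apply after persisting), once a post has been applied to the state it can always be recovered, because the event that caused the change has already been persisted.

With the second choice (apply before persisting), the event has already changed the actor's state before it is persisted. So between the moment the event is sent to the log and the moment the log persists it, UserProcessor is in a state that may be lost: if the log fails while persisting the event, UserProcessor's current state cannot be recovered from the log.

It looks as if we must choose between correct behavior (being able to tell whether the user has too many posts) and correct persistence. For this example, however, there is a third option.

After handling a command, the actor does not apply the events; it sends them to the log and enters a waiting state. While waiting, it buffers newly arriving commands instead of processing them. When the log replies, the actor applies the returned events to its state and then processes the buffered commands. The drawback is that the waiting reduces performance; the benefit is that consistency is preserved.

Akka has built-in support for this kind of buffering, called Stash.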

The Stash Trait

 

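import akka.actor.{Actor, Stash}

class UserProcessor extends Actor with Stash {
  var state = State(Vector.empty[String], false)
  def receive = {
    case UserProcessor.NewPost(text) if !state.disabled =>
      emit(PostCreated(text), QuotaReached)
      // Wait for both events to come back from the log.
      context.become(waiting(2), discardOld = false)
  }

  def waiting(n: Int): Receive = {
    case e: Event =>
      state = state.updated(e)
      if (n == 1) { context.unbecome(); unstashAll() } // last event: resume
      else context.become(waiting(n - 1))               // replaces waiting(n)
    case _ => stash() // defer everything else until the events are applied
  }

  // emit(...) as in the previous listing (sends the events to the log)
}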

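Once an actor mixes in the Stash trait, it can call stash() to set aside the current message and unstashAll() to bring the stashed messages back. Unstashed messages are not appended to the end of the mailbox but prepended to its front, so messages stay in the order in which they arrived.

In the example above, after receiving NewPost, UserProcessor enters the waiting state and waits for two events to arrive. Other messages go into the stash in the meantime. Once the two events returned by the log have been applied to the state, UserProcessor returns to the behavior that handles NewPost and calls unstashAll() to release the stashed messages.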
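The effect of stashing can be checked with a plain Scala simulation that needs no Akka dependency. It is only a model of the behavior described above: the mailbox and the stash are `Vector`s, `waiting` counts the events still expected from the log, and unstashing is modeled by prepending the stash to the mailbox. The object name and the `demo` helper are hypothetical.

```scala
object StashSimulation {
  sealed trait Msg
  case class NewPost(text: String) extends Msg
  sealed trait Event extends Msg
  case class PostCreated(text: String) extends Event
  case object QuotaReached extends Event

  case class State(posts: Vector[String], disabled: Boolean) {
    def updated(e: Event): State = e match {
      case PostCreated(text) => copy(posts = posts :+ text)
      case QuotaReached      => copy(disabled = true)
    }
  }

  // mailbox: pending messages; stash: messages set aside while waiting;
  // waiting: number of events still expected back from the log.
  @annotation.tailrec
  def run(mailbox: Vector[Msg], stash: Vector[Msg], waiting: Int, state: State): State =
    mailbox match {
      case (e: Event) +: rest if waiting > 0 =>
        val next = state.updated(e)
        if (waiting == 1) run(stash ++ rest, Vector.empty, 0, next) // unstashAll: prepend
        else run(rest, stash, waiting - 1, next)
      case msg +: rest if waiting > 0 =>
        run(rest, stash :+ msg, waiting, state) // stash()
      case NewPost(text) +: rest if !state.disabled =>
        // emit: the log echoes both events to the end of the mailbox
        run(rest :+ PostCreated(text) :+ QuotaReached, stash, 2, state)
      case _ +: rest => run(rest, stash, waiting, state) // unhandled message
      case _ => state                                    // mailbox empty
    }

  // Two posts arrive back to back, before the log has replied to the first.
  def demo(): State =
    run(Vector(NewPost("first"), NewPost("second")), Vector.empty, 0,
        State(Vector.empty, false))

  def main(args: Array[String]): Unit = println(demo().posts)
}
```

The second NewPost is stashed while the actor waits, so it is only looked at after QuotaReached has been applied, and the quota of 1 holds.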

 

When to Perform External Effects?

Performing the effect and persisting that it was done cannot be atomic.

  • Perform it before persisting for at-least-once semantics.
  • Perform it after persisting for at-most-once semantics.

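This choice needs to be made based on the underlying business model.

As mentioned earlier, recovering an actor's state becomes more complicated when the actor interacts with external resources. The root cause is that an external effect and the corresponding log entry inside the actor system cannot be made atomic. Suppose that, in the earlier example, publishing a post requires sending a charge request to a bank. Sending the charge request and recording it in the log are not atomic, so it is possible for only one of the two to succeed. The question then is: should the actor first log that the charge was made and then charge, or charge first and then log it? That depends on the business model.

(Note: in fact a failure can also occur between "writing the events to the log" and "applying the events to the state", so there is no guarantee that an actor can always be restored to its exact previous state after it goes down. The only way to keep the system unaffected by failures is to rely on idempotence plus at-least-once semantics.)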
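Why idempotence makes at-least-once delivery safe can be shown with a small sketch. It assumes that each charge request carries a unique id and that the receiver remembers the ids it has already processed; `Charge`, `Bank`, and `demo` are hypothetical names, not part of the lecture or of Akka.

```scala
object IdempotentChargeDemo {
  case class Charge(id: Long, amount: Int)

  // Hypothetical receiver (the "bank"): it remembers processed ids so that a
  // redelivered request does not charge twice.
  final class Bank {
    private var seen = Set.empty[Long]
    private var total = 0
    def receive(c: Charge): Unit =
      if (!seen.contains(c.id)) { seen += c.id; total += c.amount }
    def charged: Int = total
  }

  def demo(): Int = {
    val bank = new Bank
    val charge = Charge(id = 1L, amount = 10)
    bank.receive(charge) // effect performed, but the sender crashes before logging it
    bank.receive(charge) // after recovery the sender cannot tell, so it sends again
    bank.charged
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

The duplicate delivery is harmless because the second request with the same id is ignored, so the total reflects a single charge.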

 

Summary

  • Actors can persist incoming messages or generated events.
  • Events can be replicated and used to inform other components.
  • Recovery replays past commands or events; snapshots reduce this cost
  • Actors can defer handling certain messages by using the Stash trait.
