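In RAC, one slow instance drags down the other instances. RAC really is "one for all, all for one".

For a RAC overview, start with the report summary, then check whether the waits in the top events are RAC-related. If they are, continue to the RAC Statistics section (More RAC Statistics).

RAC tolerates poor hardware far less well than a single node does, so a RAC built on poor hardware can perform worse than a single node.

Network: if the interconnect latency is higher than the IO subsystem latency, RAC itself becomes the performance bottleneck.

Storage: if IO response time is slow, log file sync is slow, which causes gc buffer busy (in RAC, a slow redo flush causes gc buffer busy release/acquire waits).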

Two dimensions reflect private network performance:

a) Available bandwidth: Estd Interconnect traffic (KB)

b) Network latency: Avg message sent queue time on ksxp (ms)

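RAC Cache Fusion

Cache Fusion transfers blocks between the SGAs of the cluster nodes over the interconnect, which avoids the inefficient approach of first writing a block to disk and then reading it back into another instance's cache.

The main RAC cache wait events are the ones in the figure below. (gc means global cache; gc buffer busy in 11g was called global cache buffer busy in 10g. Its cause is similar to single-instance buffer busy waits: at a given moment, the instance on node a is waiting for a block it requested from node b. cr means consistent read.)

Cluster wait events do not occur in isolation: they make enq: TX - row lock contention waits more frequent and make each individual wait longer.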

Global Cache Load Profile

Global Cache blocks served are the blocks sent.

We can roughly estimate the actual transfer bandwidth = (received + sent) * db_block_size = (Global Cache blocks received + Global Cache blocks served) * 8K = 901 * 8K = 7.04MB/s = 56.32Mb/s

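Actual transfer bandwidth as reported by AWR = Estd Interconnect traffic (KB) = 7744.30KB (this estimate also counts message traffic, so it is somewhat higher than the block-only figure above)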

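So the interconnect only needs more than 56.32Mb/s of bandwidth. NICs today are generally gigabit or 10 gigabit. On the cabling side, Cat5e and Cat6 carry gigabit, and Cat6a and Cat7 carry 10 gigabit.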
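The bandwidth arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and parameters are made up for this note, and the 8K block size is the one used in the example, not something read from the database.

```python
def block_traffic(blocks_per_second, block_size_bytes=8192):
    """Estimate interconnect block traffic.

    blocks_per_second: Global Cache blocks received + served, per second.
    block_size_bytes:  db_block_size (assumed 8K, as in the example above).
    Returns (megabytes per second, megabits per second).
    """
    bytes_per_second = blocks_per_second * block_size_bytes
    mbytes_per_second = bytes_per_second / (1024 * 1024)
    mbits_per_second = mbytes_per_second * 8  # 8 bits per byte
    return mbytes_per_second, mbits_per_second


if __name__ == "__main__":
    mbytes, mbits = block_traffic(901)
    print("%.2f MB/s, %.2f Mb/s" % (mbytes, mbits))
```

With 901 blocks per second this gives roughly 7.04 MB/s, or a little over 56 Mb/s, which is far below what a gigabit link can carry.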

Global Cache Efficiency Percentages (Target local+remote 100%)

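Ideally Buffer access - local cache % is close to 100% and the other two are as small as possible. We want buffer access to be satisfied locally, because no matter how fast the network is, it is slower than local memory.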

Global Cache and Enqueue Services - Workload Characteristics

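The two critical metrics are Avg global cache cr block receive time (ms) and Avg global cache current block receive time (ms). Analyze them together with the AWR reports of the other nodes: a small receive time on one node does not mean it is small on the others, where receiving may be slow and the receive time large. Both should generally be below 2 ms. If these two metrics differ widely between RAC instances, the interconnect problem usually lies in the OS buffer layer or in the NIC.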
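A small helper can make the cross-node comparison less error-prone. The sketch below assumes you have already copied the receive time from each node's AWR report into a dict. The function name, the instance names, and especially `spread_ratio` (the cut-off for "differs widely") are hypothetical; only the 2 ms limit comes from the text.

```python
def review_receive_times(receive_ms_by_instance, limit_ms=2.0, spread_ratio=3.0):
    """Compare one receive-time metric across RAC instances.

    receive_ms_by_instance: {instance name: avg receive time in ms}.
    limit_ms:     the "generally below 2 ms" guideline.
    spread_ratio: hypothetical threshold for a wide gap between instances.
    Returns (sorted names at or over the limit, True if values diverge widely).
    """
    if not receive_ms_by_instance:
        return [], False
    slow = sorted(name for name, ms in receive_ms_by_instance.items()
                  if ms >= limit_ms)
    fastest = min(receive_ms_by_instance.values())
    slowest = max(receive_ms_by_instance.values())
    diverges = fastest > 0 and slowest / fastest >= spread_ratio
    return slow, diverges


if __name__ == "__main__":
    # Made-up numbers for two instances.
    print(review_receive_times({"rac1": 0.9, "rac2": 6.5}))
```

If the second value comes back true, that is the case described above where the OS buffer layer or the NIC deserves a closer look.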

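These are the average times from requesting a cr or current block to receiving it.

For example, if both are 10, receiving one block takes 10 ms and receiving 100 blocks takes 1 second, so an operation that reads about 1M (roughly 100 blocks) takes about 1 second. (For comparison, disk physical read throughput is typically 100MB/s, so physically reading one block takes about 0.1 ms.)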

Avg global cache cr block receive time (ms) = 10 * gc cr block receive time / gc cr blocks received = 10 * 458 / 4995 = 0.92

Avg global cache current block receive time (ms) = 10 * gc current block receive time / gc current blocks received = 10 * 740 / 4005 = 1.85

The data in the figure above comes from Instance Activity Stats. The 458 is in centiseconds (hundredths of a second) while the 0.9 is in milliseconds, hence the *10.
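The same calculation as a minimal Python sketch. The function name is made up here; the inputs are the two Instance Activity Stats values used above.

```python
def avg_receive_time_ms(receive_time_cs, blocks_received):
    """Average block receive time in milliseconds.

    receive_time_cs: 'gc cr block receive time' or 'gc current block
                     receive time', in centiseconds.
    blocks_received: matching 'gc cr blocks received' or
                     'gc current blocks received'.
    """
    if blocks_received == 0:
        return 0.0  # nothing received, so there is no average to report
    return 10.0 * receive_time_cs / blocks_received  # 1 cs = 10 ms


if __name__ == "__main__":
    print("cr:      %.2f ms" % avg_receive_time_ms(458, 4995))
    print("current: %.2f ms" % avg_receive_time_ms(740, 4005))
```

For the numbers above this comes out at about 0.9 ms for cr blocks and about 1.8 ms for current blocks.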

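The gc cr [multi block] request wait event moves together with the local Avg global cache cr block receive time (ms): when the wait event shows up, receive time is certainly large as well. Both depend on three metrics on the remote instance: Avg global cache cr block flush time (ms), Avg global cache cr block build time (ms), and Avg global cache cr block send time (ms). Check which of these is slow to see what is driving the large receive time or the wait event.

The gc current [multi block] request wait event moves together with the local Avg global cache current block receive time (ms): when the wait event shows up, receive time is certainly large as well. Both depend on three metrics on the remote instance: Avg global cache current block flush time (ms), Avg global cache current block pin time (ms), and Avg global cache current block send time (ms). Check which of these is slow to see what is driving the large receive time or the wait event.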

Local time to process a CR block request in the cache = remote (build time + flush time + send time)

Local time to process a current block request in the cache = remote (pin time + flush time + send time)

Receive time is not literally the simple sum of these three values, but for now we can treat it as the result of adding them up.
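To see which component is driving a large receive time, you can line up the three remote-side values and pick the largest. A rough sketch follows; the function name and the sample numbers are invented for illustration, and as noted above the sum is only an approximation of receive time.

```python
def slowest_serve_component(components_ms):
    """Find the dominant serving-side component.

    components_ms: {component name: avg time in ms} taken from the REMOTE
                   instance's AWR, e.g. build/flush/send for CR blocks or
                   pin/flush/send for current blocks.
    Returns (name of the largest component, sum of all components in ms).
    """
    total_ms = sum(components_ms.values())
    slowest = max(components_ms, key=components_ms.get)
    return slowest, total_ms


if __name__ == "__main__":
    # Hypothetical values, not taken from any report.
    name, total = slowest_serve_component(
        {"build": 0.1, "flush": 4.2, "send": 0.1})
    print("slowest: %s, approximate serve time: %.1f ms" % (name, total))
```

In this made-up case flush dominates, which would point at redo write speed on the serving node rather than at the network.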

If pin time, build time, flush time, or send time is slow, the SGA and GRD on the local (serving) instance are slow, and receive time on the remote instance is slow.

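Flush time is spent on redo. To guarantee instance recovery, Oracle requires that after a current block is modified (modify/update) on the local instance, the redo for that block must be written to the log file (LGWR must complete the write before returning) before the LMS process can ship the block to other nodes. In RAC, a slow redo flush causes gc buffer busy release/acquire waits.

In the figure two above, both the CR and Current flush metrics should generally be < 5 ms. Global CURRENT Served Stats shows how the flush times of current blocks are distributed across time ranges.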

Global Cache and Enqueue Services - Messaging Statistics

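Avg message sent queue time -> time from a message entering the queue until it is sent (wait time in the send queue)

Avg message sent queue time on ksxp -> time until the other side receives the message and returns an ACK. This metric is important because it directly reflects network latency. It should generally be below 1 ms

Avg message received queue time -> time from a message entering the queue until it is received (wait time in the receive queue)

% of direct sent messages -> messages sent directly. Of the three percentages, the higher this one the better

% of indirect sent messages -> messages sent indirectly, usually sorted or large messages

% of flow controlled messages -> the most common cause of flow control is poor network conditions. % of flow controlled messages should be below 1%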
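The two thresholds above lend themselves to a quick sanity check. This is a minimal sketch with a made-up function name; the limits are the rules of thumb from this section, not Oracle-defined constants.

```python
def check_messaging(ksxp_queue_ms, flow_controlled_pct):
    """Check two messaging statistics against the rules of thumb above.

    ksxp_queue_ms:       Avg message sent queue time on ksxp (ms).
    flow_controlled_pct: % of flow controlled messages.
    Returns a list of findings; an empty list means both look fine.
    """
    findings = []
    if ksxp_queue_ms >= 1.0:
        findings.append(
            "ksxp queue time %.2f ms is not below 1 ms" % ksxp_queue_ms)
    if flow_controlled_pct >= 1.0:
        findings.append(
            "flow controlled messages %.2f%% is not below 1%%"
            % flow_controlled_pct)
    return findings


if __name__ == "__main__":
    # Hypothetical values for illustration.
    for line in check_messaging(1.5, 2.0):
        print(line)
```

Either finding is a reason to look at the network before tuning anything inside the database.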

Global Cache Transfer Stats\Global Cache Transfer Times (ms)

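The blocks in the busy column of the figure are the gc buffer busy blocks. The figure shows both the share of gc buffer busy blocks in the total and the latency of those blocks.

The gc buffer busy share is 0.42% for cr and 3.39% for current.

As for latency, cr blocks are fine at only 2 ms, but current blocks are slow at 50 ms on average. Think about it: if receiving one gc buffer busy current block takes 50 ms, receiving 10,000 of them takes 500 seconds. Fortunately, the gc buffer busy share in the figure above is not especially high, meaning the total number of gc buffer busy blocks is small.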
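The back-of-the-envelope estimate is simple enough to script when you want to try other block counts. The function name is made up, and the model is the same naive one used in the text: blocks are received one after another at the average latency.

```python
def total_wait_seconds(block_count, avg_ms_per_block):
    """Total time to receive block_count blocks serially, in seconds."""
    return block_count * avg_ms_per_block / 1000.0


if __name__ == "__main__":
    # The two cases worked through in this article.
    print(total_wait_seconds(10000, 50))  # busy current blocks at 50 ms each
    print(total_wait_seconds(100, 10))    # 100 blocks at 10 ms each
```

This reproduces the 500 seconds for 10,000 busy current blocks and the 1 second for 100 blocks at 10 ms.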

Interpreting some AWR RAC metrics
