[Learning Design from Source Code] Ant Financial SOFARegistry: Lease Renewal and Eviction
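0x00 Summary
SOFARegistry is a production-grade, low-latency, highly available service registry open-sourced by Ant Financial.
This series focuses on design and architecture: across several articles, it works backward from the code and, from several angles, summarizes the implementation mechanisms and architectural thinking behind DataServer and SOFARegistry, so that readers can learn how Alibaba approaches design.
This is the fifteenth article in the series. It covers lease renewal and eviction.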
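0x01 Business Scope
Renewal and eviction are core features of service registration and discovery:
1.1 Eviction of Failed Instances
Service instances do not always go offline cleanly. A memory overflow, a network failure, or a similar problem can leave an instance unable to serve while the registry never receives an "instance offline" request.
To remove such instances from the service list, the server creates a scheduled task at startup. At a fixed interval (60s by default in Eureka), the task evicts every instance in the current list that has not renewed within the timeout (90s by default in Eureka).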
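1.2 Lease Renewal
After registering, a service provider keeps sending heartbeats to tell the server "I am still alive", so that the server's eviction task does not remove it from the service list. This operation is called service renewal (Renew).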
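Both mechanisms boil down to a lease table: each renewal overwrites an instance's last-heartbeat timestamp, and the periodic eviction task removes every entry whose last renewal is older than the lease duration. Below is a minimal sketch of that idea. LeaseTable and its methods are hypothetical names, not a SOFARegistry or Eureka API, and the current time is passed in explicitly so the logic is easy to test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical lease table: renew() records heartbeats, evictExpired() is the periodic sweep. */
class LeaseTable {
    // instanceId -> timestamp (ms) of its last renewal
    private final Map<String, Long> lastRenew = new ConcurrentHashMap<>();
    private final long leaseMillis;

    LeaseTable(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** A heartbeat from the instance: "I am still alive". */
    void renew(String instanceId, long nowMillis) {
        lastRenew.put(instanceId, nowMillis);
    }

    boolean contains(String instanceId) {
        return lastRenew.containsKey(instanceId);
    }

    /** Removes and returns every instance whose last renewal is older than the lease. */
    List<String> evictExpired(long nowMillis) {
        List<String> evicted = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastRenew.entrySet()) {
            long last = e.getValue();
            // remove(key, value) fails if a renew slipped in after the timestamp was read
            if (nowMillis - last > leaseMillis && lastRenew.remove(e.getKey(), last)) {
                evicted.add(e.getKey());
            }
        }
        return evicted;
    }
}
```

Using the two-argument remove(key, value) means an instance that renews between the expiry check and the removal is not evicted by mistake; DatumLeaseManager uses the same idiom on connectIdRenewTimestampMap.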
0x02 DatumLeaseManager
On the Data Server side, DatumLeaseManager implements both "eviction of failed instances" and "lease renewal".
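2.1 Definition
The main fields of DatumLeaseManager are:
- connectIdRenewTimestampMap: records, for each connectId (client connection), the time of its most recent heartbeat. Eureka has a similar data structure.
- locksForConnectId: guarantees that at any time at most one eviction task exists per connectId.
The definition is as follows: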
public class DatumLeaseManager implements AfterWorkingProcess {
    /** record the latest heartbeat time for each connectId, format: connectId -> lastRenewTimestamp */
    private final Map<String, Long> connectIdRenewTimestampMap = new ConcurrentHashMap<>();

    /** lock for connectId , format: connectId -> true */
    private ConcurrentHashMap<String, Boolean> locksForConnectId = new ConcurrentHashMap<>();

    private volatile boolean serverWorking = false;
    private volatile boolean renewEnable = true;

    private AsyncHashedWheelTimer datumAsyncHashedWheelTimer;

    @Autowired
    private DataServerConfig dataServerConfig;
    @Autowired
    private DisconnectEventHandler disconnectEventHandler;
    @Autowired
    private DatumCache datumCache;
    @Autowired
    private DataNodeStatus dataNodeStatus;

    private ScheduledThreadPoolExecutor executorForHeartbeatLess;
    private ScheduledFuture<?> futureForHeartbeatLess;
}
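locksForConnectId is a lock only in a loose sense: ConcurrentHashMap.putIfAbsent is atomic, so among concurrent callers for the same connectId exactly one sees null and gets to create the eviction task, while the others return at once. A minimal sketch of the idiom; PerKeyGuard, tryAcquire, and release are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical per-key guard built on putIfAbsent, the way locksForConnectId is used. */
class PerKeyGuard {
    private final ConcurrentMap<String, Boolean> held = new ConcurrentHashMap<>();

    /** Returns true only for the caller that inserted the key (the "winner"). */
    boolean tryAcquire(String key) {
        return held.putIfAbsent(key, Boolean.TRUE) == null;
    }

    /** Releasing lets the next caller create a new task for this key. */
    void release(String key) {
        held.remove(key);
    }
}
```

Unlike a ReentrantLock, nobody blocks here: the losers simply skip, which is what is wanted when the only goal is to avoid duplicate timer tasks.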
2.2 Renewal
2.2.1 Data Structures
In DatumLeaseManager, besides connectIdRenewTimestampMap, the following fields are mainly involved in renewal:
private ConcurrentHashMap<String, Boolean> locksForConnectId = new ConcurrentHashMap<>();
private AsyncHashedWheelTimer datumAsyncHashedWheelTimer;
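As the name suggests, AsyncHashedWheelTimer is a hashed wheel timer (the structure behind Netty's HashedWheelTimer): each timeout is hashed into a ring of buckets by its deadline, and a worker advances one bucket per tick, so adding a timeout costs O(1) no matter how many connectIds are pending. The sketch below is a simplified, single-threaded version driven by explicit tick() calls. It is not SOFARegistry's or Netty's implementation, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Simplified, single-threaded hashed wheel timer advanced by explicit tick() calls (hypothetical). */
class SimpleWheelTimer {
    private static final class Timeout {
        final Runnable task;
        long remainingRounds; // full revolutions of the wheel still to wait

        Timeout(Runnable task, long remainingRounds) {
            this.task = task;
            this.remainingRounds = remainingRounds;
        }
    }

    private final List<List<Timeout>> wheel = new ArrayList<>();
    private long currentTick = 0;

    SimpleWheelTimer(int wheelSize) {
        for (int i = 0; i < wheelSize; i++) {
            wheel.add(new ArrayList<>());
        }
    }

    /** Runs task after delayTicks ticks; O(1): hash the deadline into a bucket. */
    void newTimeout(Runnable task, long delayTicks) {
        long delay = Math.max(1, delayTicks);
        long deadline = currentTick + delay;
        int bucket = (int) (deadline % wheel.size());
        long rounds = (delay - 1) / wheel.size();
        wheel.get(bucket).add(new Timeout(task, rounds));
    }

    /** Advances one tick and fires the due timeouts of the bucket it lands on. */
    void tick() {
        currentTick++;
        List<Timeout> bucket = wheel.get((int) (currentTick % wheel.size()));
        List<Runnable> due = new ArrayList<>();
        for (Iterator<Timeout> it = bucket.iterator(); it.hasNext(); ) {
            Timeout t = it.next();
            if (t.remainingRounds == 0) {
                it.remove();
                due.add(t.task);
            } else {
                t.remainingRounds--; // not on this revolution yet
            }
        }
        // run after the scan so a task may safely schedule new timeouts
        due.forEach(Runnable::run);
    }
}
```

With a wheel of 8 buckets, a delay of 9 ticks lands in bucket 1 with one extra round, so it is skipped at tick 1 and fires at tick 9.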
2.2.2 Callers
renew is called from the following handlers, all of which extend AbstractServerHandler:
public class PublishDataHandler extends AbstractServerHandler<PublishDataRequest>
public class DatumSnapshotHandler extends AbstractServerHandler<DatumSnapshotRequest>
public class RenewDatumHandler extends AbstractServerHandler<RenewDatumRequest> implements AfterWorkingProcess
public class UnPublishDataHandler extends AbstractServerHandler<UnPublishDataRequest>
2.2.3 Renewing
DatumLeaseManager.renew records the latest timestamp for the connectId and then calls scheduleEvictTask:
public void renew(String connectId) {
    // record the renew timestamp
    connectIdRenewTimestampMap.put(connectId, System.currentTimeMillis());

    // try to trigger evict task
    scheduleEvictTask(connectId, 0);
}
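scheduleEvictTask works as follows:
- If a task already exists for this connectId (its lock is held), return immediately.
- Otherwise, add a timeout to the timer wheel (after datumTimeToLiveSec seconds when delaySec <= 0). When it fires:
  - release the lock for this connectId;
  - look up the last renew time of this connectId; if there is none, the connectId has already been removed (client off), so return;
  - if renewal is currently disabled, schedule the next check after datumTimeToLiveSec, because in a non-working state renew requests cannot be received, so nothing can be cleaned up;
  - if the last renewal has expired, evict the connectId (only if it still owns publishers), remove its timestamp, and stop rescheduling;
  - if it has not expired, call scheduleEvictTask(connectId, nextDelaySec) to check again once the remaining time has passed.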
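The decision made when a timeout fires can be isolated as a pure function, which makes the outcomes in the list above easy to see and test. This is a sketch, not SOFARegistry code: EvictDecision and its Action values are hypothetical, and the ownPubSize parameter stands in for getOwnPubSize(connectId).

```java
/** Hypothetical, side-effect-free version of the decision taken when an evict timeout fires. */
final class EvictDecision {
    enum Action { STOP, EVICT_AND_STOP, RESCHEDULE }

    final Action action;
    final long nextDelaySec; // only meaningful for RESCHEDULE

    private EvictDecision(Action action, long nextDelaySec) {
        this.action = action;
        this.nextDelaySec = nextDelaySec;
    }

    static EvictDecision decide(Long lastRenewTime, long nowMillis, long ttlSec,
                                boolean renewEnable, int ownPubSize) {
        if (lastRenewTime == null) {
            // connectId already went through client-off: nothing left to do
            return new EvictDecision(Action.STOP, 0);
        }
        if (!renewEnable) {
            // renew requests cannot arrive now, so do not judge expiry; look again after a TTL
            return new EvictDecision(Action.RESCHEDULE, ttlSec);
        }
        long elapsedMillis = nowMillis - lastRenewTime;
        if (elapsedMillis > ttlSec * 1000L) {
            // expired: evict only if publishers remain; stop rescheduling either way
            return new EvictDecision(ownPubSize > 0 ? Action.EVICT_AND_STOP : Action.STOP, 0);
        }
        // not expired: wait for the remaining part of the TTL, but at least one second
        long next = ttlSec - elapsedMillis / 1000L;
        return new EvictDecision(Action.RESCHEDULE, next <= 0 ? 1 : next);
    }
}
```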
The code is as follows:
/**
 * trigger evict task: if connectId expired, create ClientDisconnectEvent to cleanup datums bind to the connectId
 * PS: every connectId allows only one task to be created
 */
private void scheduleEvictTask(String connectId, long delaySec) {
    delaySec = (delaySec <= 0) ? dataServerConfig.getDatumTimeToLiveSec() : delaySec;

    // lock for connectId: every connectId allows only one task to be created
    Boolean ifAbsent = locksForConnectId.putIfAbsent(connectId, true);
    if (ifAbsent != null) {
        return;
    }

    datumAsyncHashedWheelTimer.newTimeout(_timeout -> {
        boolean continued = true;
        long nextDelaySec = 0;
        try {
            // release lock
            locksForConnectId.remove(connectId);

            // get lastRenewTime of this connectId
            Long lastRenewTime = connectIdRenewTimestampMap.get(connectId);
            if (lastRenewTime == null) {
                // connectId is already clientOff: stop, do not reschedule
                continued = false;
                return;
            }

            /*
             * 1. lastRenewTime expires, then:
             *  - build ClientOffEvent and hand it to DataChangeEventCenter.
             *  - It will not be scheduled next time, so terminated.
             * 2. lastRenewTime not expires, then:
             *  - trigger the next schedule
             */
            boolean isExpired =
                    System.currentTimeMillis() - lastRenewTime > dataServerConfig.getDatumTimeToLiveSec() * 1000L;
            if (!isRenewEnable()) {
                // non-working state: renew requests cannot arrive, so only check again later
                nextDelaySec = dataServerConfig.getDatumTimeToLiveSec();
            } else if (isExpired) {
                int ownPubSize = getOwnPubSize(connectId);
                if (ownPubSize > 0) {
                    evict(connectId);
                }
                connectIdRenewTimestampMap.remove(connectId, lastRenewTime);
                continued = false;
            } else {
                // wait only for the remaining part of the TTL
                nextDelaySec = dataServerConfig.getDatumTimeToLiveSec()
                        - (System.currentTimeMillis() - lastRenewTime) / 1000L;
                nextDelaySec = nextDelaySec <= 0 ? 1 : nextDelaySec;
            }
        } finally {
            // runs on every path, including the early return above
            if (continued) {
                scheduleEvictTask(connectId, nextDelaySec);
            }
        }
    }, delaySec, TimeUnit.SECONDS);
}
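To see how the per-connectId lock and the rescheduling interact, the sketch below re-creates the renew/scheduleEvictTask loop against a manually advanced timer instead of AsyncHashedWheelTimer. ManualTimer and MiniLeaseManager are hypothetical; the renew-disabled branch is left out for brevity, and times are milliseconds on the fake clock.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.ToIntFunction;

/** Hypothetical fake clock + timer standing in for AsyncHashedWheelTimer. */
class ManualTimer {
    private static final class Entry {
        final long deadline;
        final Runnable task;

        Entry(long deadline, Runnable task) {
            this.deadline = deadline;
            this.task = task;
        }
    }

    private final PriorityQueue<Entry> queue =
            new PriorityQueue<>(Comparator.comparingLong((Entry e) -> e.deadline));
    private long now = 0;

    long now() {
        return now;
    }

    void schedule(Runnable task, long delayMillis) {
        queue.add(new Entry(now + delayMillis, task));
    }

    /** Moves the clock forward, running due tasks in deadline order. */
    void advanceTo(long targetMillis) {
        while (!queue.isEmpty() && queue.peek().deadline <= targetMillis) {
            Entry e = queue.poll();
            now = e.deadline;
            e.task.run();
        }
        now = targetMillis;
    }
}

/** Hypothetical re-creation of renew() + scheduleEvictTask(); renew-disabled branch omitted. */
class MiniLeaseManager {
    private final Map<String, Long> connectIdRenewTimestampMap = new ConcurrentHashMap<>();
    private final Map<String, Boolean> locksForConnectId = new ConcurrentHashMap<>();
    private final ManualTimer timer;
    private final long ttlSec;
    private final ToIntFunction<String> ownPubSize;
    private final Consumer<String> evict;
    int createdTasks = 0; // timeouts actually created

    MiniLeaseManager(ManualTimer timer, long ttlSec,
                     ToIntFunction<String> ownPubSize, Consumer<String> evict) {
        this.timer = timer;
        this.ttlSec = ttlSec;
        this.ownPubSize = ownPubSize;
        this.evict = evict;
    }

    void renew(String connectId) {
        connectIdRenewTimestampMap.put(connectId, timer.now());
        scheduleEvictTask(connectId, 0);
    }

    private void scheduleEvictTask(String connectId, long delaySec) {
        long delay = delaySec <= 0 ? ttlSec : delaySec;
        if (locksForConnectId.putIfAbsent(connectId, true) != null) {
            return; // this connectId already has a pending task
        }
        createdTasks++;
        timer.schedule(() -> {
            boolean continued = true;
            long nextDelaySec = 0;
            try {
                locksForConnectId.remove(connectId);
                Long last = connectIdRenewTimestampMap.get(connectId);
                if (last == null) {
                    continued = false; // already cleaned up
                    return;
                }
                long elapsedMillis = timer.now() - last;
                if (elapsedMillis > ttlSec * 1000L) {
                    if (ownPubSize.applyAsInt(connectId) > 0) {
                        evict.accept(connectId);
                    }
                    connectIdRenewTimestampMap.remove(connectId, last);
                    continued = false;
                } else {
                    // re-arm only for the remaining part of the TTL
                    nextDelaySec = Math.max(1, ttlSec - elapsedMillis / 1000L);
                }
            } finally {
                if (continued) {
                    scheduleEvictTask(connectId, nextDelaySec);
                }
            }
        }, delay * 1000L);
    }
}
```

Repeated renewals never create a second timeout; the single task re-arms itself for the remaining part of the TTL. With a 30s TTL and a renewal at 10s, the task checks at 30s and 40s and evicts at 41s.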
2.2.4 Diagram
The renewal flow is shown below:
PublishDataHandler.doHandle
  -> DatumLeaseManager.renew(connectId)
       -> connectIdRenewTimestampMap.put(connectId, now)
       -> scheduleEvictTask(connectId, 0)
            -> lock for connectId already held: return
            -> otherwise AsyncHashedWheelTimer.newTimeout(task, datumTimeToLiveSec)
                 when the timeout fires:
                   no timestamp       -> stop
                   renew disabled     -> scheduleEvictTask(connectId, datumTimeToLiveSec)
                   expired            -> evict(connectId) if ownPubSize > 0, then stop
                   not expired        -> scheduleEvictTask(connectId, remaining seconds)
2.3 Eviction
2.3.1 Data Structures
In DatumLeaseManager, the following fields are mainly involved in eviction:
private ScheduledThreadPoolExecutor executorForHeartbeatLess;
private ScheduledFuture<?> futureForHeartbeatLess;
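The eviction task is started along two paths; thanks to the second one, whenever the local data server list changes, the manager rechecks what can be evicted:
- at startup;
- explicitly, on a local data server change.
2.3.2 Explicit Invocation
In LocalDataServerChangeEventHandler, doHandle calls datumLeaseManager.reset(), which restarts the heartbeat-less eviction task (the task that eventually calls evict):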
@Override
public void doHandle(LocalDataServerChangeEvent localDataServerChangeEvent) {
    isChanged.set(true);

    // Better change to Listener pattern
    localDataServerCleanHandler.reset();
    datumLeaseManager.reset();

    events.offer(localDataServerChangeEvent);
}
DatumLeaseManager.reset cancels the pending task, if any, and calls scheduleEvictTaskForHeartbeatLess to schedule the eviction task again:
public synchronized void reset() {
    if (futureForHeartbeatLess != null) {
        futureForHeartbeatLess.cancel(false);
    }
    scheduleEvictTaskForHeartbeatLess();
}
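reset() combines two JDK features: ScheduledFuture.cancel(false) stops the pending periodic task without interrupting a sweep that is already running, and scheduleWithFixedDelay starts a fresh one, so the next sweep happens a full period after the data server change. A sketch of that pattern; PeriodicSweeper is a hypothetical class that uses a daemon thread so it does not keep the JVM alive.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of DatumLeaseManager's init()/reset() scheduling pattern. */
class PeriodicSweeper {
    private final ScheduledThreadPoolExecutor executor;
    private final Runnable sweep;
    private final long periodSec;
    private ScheduledFuture<?> future;

    PeriodicSweeper(Runnable sweep, long periodSec) {
        this.sweep = sweep;
        this.periodSec = periodSec;
        this.executor = new ScheduledThreadPoolExecutor(1, r -> {
            Thread t = new Thread(r, "sweeper");
            t.setDaemon(true); // do not keep the JVM alive just for this sketch
            return t;
        });
        // drop cancelled tasks from the queue right away instead of at their due time
        this.executor.setRemoveOnCancelPolicy(true);
    }

    /** Like init(): first run after one period, then one period after each run ends. */
    synchronized void start() {
        future = executor.scheduleWithFixedDelay(sweep, periodSec, periodSec, TimeUnit.SECONDS);
    }

    /** Like reset(): cancel without interrupting a running sweep, then schedule afresh. */
    synchronized ScheduledFuture<?> reset() {
        ScheduledFuture<?> old = future;
        if (old != null) {
            old.cancel(false);
        }
        start();
        return old;
    }

    synchronized ScheduledFuture<?> current() {
        return future;
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

setRemoveOnCancelPolicy(true) is optional; without it, a cancelled task stays in the executor's queue until its delay elapses.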
2.3.3 Invocation at Startup
At startup, init creates a single-threaded scheduled executor and schedules the eviction task:
@PostConstruct
public void init() {
    ......
    executorForHeartbeatLess = new ScheduledThreadPoolExecutor(1, threadFactoryBuilder
            .setNameFormat("Registry-DatumLeaseManager-ExecutorForHeartbeatLess").build());
    scheduleEvictTaskForHeartbeatLess();
}
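2.3.4 The Eviction Task
Eviction itself is done by EvictTaskForHeartbeatLess, a periodic task scheduled with a fixed delay of datumTimeToLiveSec seconds (used both as the initial delay and as the delay between runs):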
private void scheduleEvictTaskForHeartbeatLess() {
    futureForHeartbeatLess = executorForHeartbeatLess.scheduleWithFixedDelay(
            new EvictTaskForHeartbeatLess(), dataServerConfig.getDatumTimeToLiveSec(),
            dataServerConfig.getDatumTimeToLiveSec(), TimeUnit.SECONDS);
}
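Each time it runs (unless renewal is disabled), the task takes all connectIds from datumCache and iterates over them. A connectId that has no renew timestamp at all, i.e. no recorded heartbeat, is evicted if it still owns publishers. connectIds that did renew but then let their lease lapse are handled by the timer-wheel task described in section 2.2.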
/**
 * evict own connectIds with heartbeat less
 */
private class EvictTaskForHeartbeatLess implements Runnable {
    @Override
    public void run() {
        // If in a non-working state, cannot clean up because the renew request cannot be received at this time.
        if (!isRenewEnable()) {
            return;
        }

        Set<String> allConnectIds = datumCache.getAllConnectIds();
        for (String connectId : allConnectIds) {
            Long timestamp = connectIdRenewTimestampMap.get(connectId);
            // no heartbeat
            if (timestamp == null) {
                int ownPubSize = getOwnPubSize(connectId);
                if (ownPubSize > 0) {
                    evict(connectId);
                }
            }
        }
    }
}
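Written as a pure function, the sweep makes explicit what it does and does not check: it never compares timestamps, it only looks for connectIds without any renew timestamp that still own publishers. HeartbeatLessSweep is hypothetical; the ownPubSize parameter stands in for getOwnPubSize(connectId).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToIntFunction;

/** Hypothetical, side-effect-free version of EvictTaskForHeartbeatLess.run(). */
final class HeartbeatLessSweep {
    private HeartbeatLessSweep() {
    }

    static List<String> connectIdsToEvict(boolean renewEnable, Set<String> allConnectIds,
                                          Map<String, Long> renewTimestamps,
                                          ToIntFunction<String> ownPubSize) {
        List<String> toEvict = new ArrayList<>();
        if (!renewEnable) {
            // renews cannot arrive right now, so a missing timestamp proves nothing
            return toEvict;
        }
        for (String connectId : allConnectIds) {
            // no timestamp means no heartbeat was recorded for this connectId;
            // connectIds that did renew are left to the timer-wheel task
            if (renewTimestamps.get(connectId) == null && ownPubSize.applyAsInt(connectId) > 0) {
                toEvict.add(connectId);
            }
        }
        return toEvict;
    }
}
```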
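evict itself just wraps the connectId in a ClientDisconnectEvent and hands it to DisconnectEventHandler: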
private void evict(String connectId) {
    disconnectEventHandler.receive(new ClientDisconnectEvent(connectId, System
            .currentTimeMillis(), 0));
}
As shown below:
DatumLeaseManager
  EvictTaskForHeartbeatLess.run
    -> allConnectIds = datumCache.getAllConnectIds()
    -> for each connectId in allConnectIds:
         connectIdRenewTimestampMap.get(connectId)
           -> no heartbeat -> evict(connectId)
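2.3.5 Processing an Eviction
2.3.5.1 Forwarding the Eviction Message
The eviction message then has to be forwarded. This happens in DisconnectEventHandler.receive, which adds the event to EVENT_QUEUE, or to noWorkQueue while the node is not in the WORKING state: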
public class DisconnectEventHandler implements InitializingBean, AfterWorkingProcess {
    /**
     * a DelayQueue that contains client disconnect events
     */
    private final DelayQueue<DisconnectEvent> EVENT_QUEUE = new DelayQueue<>();

    @Autowired
    private SessionServerConnectionFactory sessionServerConnectionFactory;
    @Autowired
    private DataChangeEventCenter dataChangeEventCenter;
    @Autowired
    private DataNodeStatus dataNodeStatus;

    private static final BlockingQueue<DisconnectEvent> noWorkQueue = new LinkedBlockingQueue<>();

    public void receive(DisconnectEvent event) {
        if (dataNodeStatus.getStatus() != LocalServerStatusEnum.WORKING) {
            noWorkQueue.add(event);
            return;
        }
        EVENT_QUEUE.add(event);
    }
}
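A DelayQueue only hands out an element once its getDelay() is zero or negative, so every DisconnectEvent must implement java.util.concurrent.Delayed. How SOFARegistry's DisconnectEvent computes its delay is not shown in the excerpt; the sketch below assumes a fixed delay chosen at construction and mirrors the routing in receive(). SimpleDisconnectEvent and DisconnectRouter are hypothetical names.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical disconnect event usable in a DelayQueue: it becomes available at fireAtMillis. */
class SimpleDisconnectEvent implements Delayed {
    final String connectId;
    private final long fireAtMillis;

    SimpleDisconnectEvent(String connectId, long delayMillis) {
        this.connectId = connectId;
        this.fireAtMillis = System.currentTimeMillis() + delayMillis;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(fireAtMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }
}

/** Hypothetical mirror of DisconnectEventHandler.receive(): park events while not WORKING. */
class DisconnectRouter {
    final DelayQueue<SimpleDisconnectEvent> eventQueue = new DelayQueue<>();
    final BlockingQueue<SimpleDisconnectEvent> noWorkQueue = new LinkedBlockingQueue<>();
    volatile boolean working;

    void receive(SimpleDisconnectEvent event) {
        if (!working) {
            noWorkQueue.add(event);
            return;
        }
        eventQueue.add(event);
    }
}
```

poll() on the DelayQueue returns null while the head is not yet due, and take() blocks until it is, which is why the consumer loop below can simply call take().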
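afterPropertiesSet starts a single thread that loops, takes events from EVENT_QUEUE, and processes them:
- for a session server disconnect, it removes the matching process and its connectIds from sessionServerConnectionFactory, then handles each of those connectIds as below;
- for each affected connectId (or the single connectId of a client disconnect), it sends a ClientChangeEvent to dataChangeEventCenter.
The code is as follows: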
@Override
public void afterPropertiesSet() {
    Executor executor = ExecutorFactory
            .newSingleThreadExecutor(DisconnectEventHandler.class.getSimpleName());
    executor.execute(() -> {
        while (true) {
            try {
                DisconnectEvent disconnectEvent = EVENT_QUEUE.take();
                if (disconnectEvent.getType() == DisconnectTypeEnum.SESSION_SERVER) {
                    SessionServerDisconnectEvent event = (SessionServerDisconnectEvent) disconnectEvent;
                    String processId = event.getProcessId();
                    //check processId confirm remove,and not be registered again when delay time
                    String sessionServerHost = event.getSessionServerHost();
                    if (sessionServerConnectionFactory
                            .removeProcessIfMatch(processId, sessionServerHost)) {
                        Set<String> connectIds = sessionServerConnectionFactory
                                .removeConnectIds(processId);
                        if (connectIds != null && !connectIds.isEmpty()) {
                            for (String connectId : connectIds) {
                                unPub(connectId, event.getRegisterTimestamp());
                            }
                        }
                    }
                } else {
                    ClientDisconnectEvent event = (ClientDisconnectEvent) disconnectEvent;
                    unPub(event.getConnectId(), event.getRegisterTimestamp());
                }
            } catch (Throwable e) {
                // take() throws InterruptedException, so the try needs a handler;
                // handle (e.g. log) and keep consuming so one bad event cannot stop the loop
            }
        }
    });
}

/**
 *
 * @param connectId
 * @param registerTimestamp
 */
private void unPub(String connectId, long registerTimestamp) {
    dataChangeEventCenter.onChange(new ClientChangeEvent(connectId, dataServerConfig
            .getLocalDataCenter(), registerTimestamp));
}
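The consumer loop is essentially a dispatch on the event type. The sketch below keeps only that dispatch and replaces sessionServerConnectionFactory with a toy map from session-server processId to connectIds; it skips the removeProcessIfMatch host check. DisconnectDispatcher is hypothetical, and the returned list stands in for the unPub calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the consumer loop's dispatch, with a toy connection registry. */
class DisconnectDispatcher {
    enum Type { SESSION_SERVER, CLIENT }

    // processId of a session server -> connectIds registered through it
    private final Map<String, Set<String>> connectIdsByProcess = new HashMap<>();

    void register(String processId, String connectId) {
        connectIdsByProcess.computeIfAbsent(processId, p -> new HashSet<>()).add(connectId);
    }

    /** Returns the connectIds for which a ClientChangeEvent (unPub) would be emitted. */
    List<String> dispatch(Type type, String key) {
        List<String> toUnPub = new ArrayList<>();
        if (type == Type.SESSION_SERVER) {
            // a whole session server went away: release every connectId it carried
            Set<String> ids = connectIdsByProcess.remove(key);
            if (ids != null) {
                toUnPub.addAll(ids);
            }
        } else {
            // a single client connection was evicted or disconnected
            toUnPub.add(key);
        }
        return toUnPub;
    }
}
```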
As shown below:
DatumLeaseManager
  EvictTaskForHeartbeatLess.run
    -> allConnectIds = datumCache.getAllConnectIds()
    -> for each connectId in allConnectIds:
         connectIdRenewTimestampMap.get(connectId)
           -> no heartbeat -> evict(connectId)
                -> DisconnectEventHandler.receive(event)
                     -> noWorkQueue   (node not WORKING)
                     -> EVENT_QUEUE   (node WORKING)
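2.3.5.2 Forwarding by DataChangeEventCenter
The flow then reaches DataChangeEventCenter, which again only forwards. Note that the method below is onChange(Publisher, String); unPub passes a single ClientChangeEvent, so it goes through a one-argument onChange overload that the excerpt does not show, and in the queue such an event is handled under DataChangeScopeEnum.CLIENT by handleClientOff.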
public class DataChangeEventCenter {
    /**
     * queues of DataChangeEvent
     */
    private DataChangeEventQueue[] dataChangeEventQueues;

    /**
     * receive changed publisher, then wrap it into the DataChangeEvent and put it into dataChangeEventQueue
     *
     * @param publisher
     * @param dataCenter
     */
    public void onChange(Publisher publisher, String dataCenter) {
        int idx = hash(publisher.getDataInfoId());
        Datum datum = new Datum(publisher, dataCenter);
        if (publisher instanceof UnPublisher) {
            datum.setContainsUnPub(true);
        }
        if (publisher.getPublishType() != PublishType.TEMPORARY) {
            dataChangeEventQueues[idx].onChange(new DataChangeEvent(DataChangeTypeEnum.MERGE,
                    DataSourceTypeEnum.PUB, datum));
        } else {
            dataChangeEventQueues[idx].onChange(new DataChangeEvent(DataChangeTypeEnum.MERGE,
                    DataSourceTypeEnum.PUB_TEMP, datum));
        }
    }
}
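onChange picks a queue by hashing the dataInfoId, so all changes to the same data go to the same DataChangeEventQueue (presumably to keep them ordered), and it tags temporary publishes as PUB_TEMP. The hash() used by SOFARegistry is not shown in the excerpt; the sketch assumes Math.floorMod over String.hashCode(), which stays non-negative even for negative hash codes. ChangeRouting is a hypothetical helper.

```java
/** Hypothetical routing helper mirroring DataChangeEventCenter.onChange(Publisher, String). */
final class ChangeRouting {
    enum SourceType { PUB, PUB_TEMP }

    private ChangeRouting() {
    }

    /** The same dataInfoId always maps to the same queue index (assumed hash, not SOFARegistry's). */
    static int queueIndex(String dataInfoId, int queueCount) {
        return Math.floorMod(dataInfoId.hashCode(), queueCount);
    }

    /** Temporary publishes are tagged PUB_TEMP, everything else PUB. */
    static SourceType sourceType(boolean temporaryPublish) {
        return temporaryPublish ? SourceType.PUB_TEMP : SourceType.PUB;
    }
}
```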
2.3.5.3 Processing in DataChangeEventQueue
The actual work is done in DataChangeEventQueue: the ClientChangeEvent created by an eviction is dispatched to handleClientOff, and data changes in general are applied through methods such as addTempChangeData and handleDatum.
After an event is taken from the queue, it is handled according to its DataChangeScopeEnum scope:
- DataChangeScopeEnum.DATUM: check dataChangeEvent.getSourceType():
  - DataSourceTypeEnum.PUB_TEMP: call addTempChangeData, which adds a ChangeData to CHANGE_QUEUE;
  - otherwise: call handleDatum;
- DataChangeScopeEnum.CLIENT: call handleClientOff((ClientChangeEvent) event);
- DataChangeScopeEnum.SNAPSHOT: call handleSnapshot((DatumSnapshotEvent) event).
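For details, see the earlier article in this series, [Learning Design from Source Code] Ant Financial SOFARegistry: Asynchronous Processing on the Message Bus.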
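The routing rules listed above fit in a small switch. ScopeRouting is hypothetical: it returns the name of the handler that would be called rather than calling SOFARegistry's methods, and it reduces the source type to a single "is PUB_TEMP" flag.

```java
/** Hypothetical sketch of how a dequeued event is routed inside DataChangeEventQueue. */
final class ScopeRouting {
    enum Scope { DATUM, CLIENT, SNAPSHOT }

    private ScopeRouting() {
    }

    /** Returns the name of the handler that the event would be given to. */
    static String route(Scope scope, boolean pubTemp) {
        switch (scope) {
            case DATUM:
                // temporary publishes bypass normal datum merging
                return pubTemp ? "addTempChangeData" : "handleDatum";
            case CLIENT:
                // e.g. the ClientChangeEvent produced by an eviction
                return "handleClientOff";
            case SNAPSHOT:
                return "handleSnapshot";
            default:
                throw new IllegalArgumentException("unknown scope: " + scope);
        }
    }
}
```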
DatumLeaseManager
  EvictTaskForHeartbeatLess.run
    -> allConnectIds = datumCache.getAllConnectIds()
    -> for each connectId in allConnectIds:
         connectIdRenewTimestampMap.get(connectId)
           -> no heartbeat -> evict(connectId)
                -> DisconnectEventHandler.receive -> EVENT_QUEUE (noWorkQueue while not WORKING)
                     -> onChange -> DataChangeEventCenter (dataChangeEventQueues)
                          -> add DataChangeEvent -> DataChangeEventQueue (eventQueue)
                               -> addTempChangeData / handleDatum