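[Source Code Analysis] TensorFlow Distributed Environment (2) --- Master Static Logic

Before looking at the individual TensorFlow distribution Strategies, we first need to understand what they are built on: the distributed environment. A solid grasp of these foundations removes most of the obstacles in later analysis. This article walks through the static logic of the Master.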

Other articles in this series:

[Translation] TensorFlow distributed papers: "TensorFlow : Large-Scale Machine Learning on Heterogeneous Distributed Systems"

[Translation] TensorFlow distributed papers: "Implementation of Control Flow in TensorFlow"

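1. Overview

A Server runs two RPC services: MasterService and WorkerService. When a Client connects to a Server, that Server takes the Master role and the Client talks to its MasterService, which also coordinates and controls the execution of multiple WorkerService instances.

The Master role is implemented by the Master Service, a gRPC service that interacts with a set of remote distributed devices and coordinates multiple worker services.

Master Service corresponds to "//tensorflow/core/protobuf/master_service.proto", which declares interfaces such as CreateSession and RunStep. Every TensorFlow Server implements the Master Service.
Clients interact with the Master Service to run distributed TensorFlow computations.
A Master Service tracks multiple "master sessions". Each master session encapsulates a computation graph and its associated state.
A master session runs on the Master. Once the session is created, the master returns a handle to the client, and that handle associates the client with the master session.
Each master session usually corresponds to one "client session". A client sends an initial graph to the master by calling CreateSession and adds nodes to the graph by calling ExtendSession.
Note that "Master" is both a conceptual role (for example, the Master node) and a concrete class named Master.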

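2. Interfaces

2.1 Interface Specification

The Client calls the Master Service through GrpcSession. Because this is an RPC service, the Client and MasterService need an agreed interface specification. It is defined in master_service.proto, which declares each RPC together with its request and response messages.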

service MasterService {
  // Creates a session.
  rpc CreateSession(CreateSessionRequest) returns (CreateSessionResponse);

  // Extends a session.
  rpc ExtendSession(ExtendSessionRequest) returns (ExtendSessionResponse);

  // Prepares future partial run calls.
  rpc PartialRunSetup(PartialRunSetupRequest) returns (PartialRunSetupResponse);

  // Drives the graph computation.
  rpc RunStep(RunStepRequest) returns (RunStepResponse);

  // Closes a session.
  rpc CloseSession(CloseSessionRequest) returns (CloseSessionResponse);

  // List the devices usable by the master.
  rpc ListDevices(ListDevicesRequest) returns (ListDevicesResponse);

  // Close and abandon all existing sessions.  Ongoing computations
  // will no longer affect fresh ones via the resources in containers listed in
  // the ResetRequest.  See ResetRequest for more details.
  rpc Reset(ResetRequest) returns (ResetResponse);

  // Registers a callable for execution with RunCallable.
  rpc MakeCallable(MakeCallableRequest) returns (MakeCallableResponse);

  // Executes a callable registered with MakeCallable.
  rpc RunCallable(RunCallableRequest) returns (RunCallableResponse);

  // Frees resources associated with a callable registered with MakeCallable.
  rpc ReleaseCallable(ReleaseCallableRequest) returns (ReleaseCallableResponse);
}

2.2 MasterInterface

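The Client uses MasterInterface to reach the remote MasterService. MasterInterface is an abstract interface class for communication between the Client and the TensorFlow Master service. It supports both RPC-based master implementations and in-process master implementations that need no RPC round trip. All of its methods are synchronous, so the Client calls the remote MasterService as if it were calling a local function.

MasterInterface has two implementations, both used to communicate with the Master service:

LocalMaster is used for direct in-process communication, when the Client and the Master live in the same process.
GrpcRemoteMaster uses gRPC to communicate with the Master service, when the Client and the Master are deployed in two different processes.
- The factory function NewGrpcMaster creates a GrpcRemoteMaster instance.
- GrpcRemoteMaster is a gRPC client: through a Stub it accesses the MasterService on the remote Master, whose concrete implementation is GrpcMasterService.
- Because every MasterInterface method is synchronous, the Client accesses MasterService as if it were a local function.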

class MasterInterface {
 public:
  virtual ~MasterInterface() {}
  virtual Status CreateSession(CallOptions* call_options,
                               const CreateSessionRequest* request,
                               CreateSessionResponse* response) = 0;

  virtual Status ExtendSession(CallOptions* call_options,
                               const ExtendSessionRequest* request,
                               ExtendSessionResponse* response) = 0;

  virtual Status PartialRunSetup(CallOptions* call_options,
                                 const PartialRunSetupRequest* request,
                                 PartialRunSetupResponse* response) {
    return errors::Unimplemented("Partial run not implemented for this master");
  }

  virtual Status RunStep(CallOptions* call_options,
                         RunStepRequestWrapper* request,
                         MutableRunStepResponseWrapper* response) = 0;

  virtual Status RunStep(CallOptions* call_options,
                         const RunStepRequest* request,
                         RunStepResponse* response) {
    std::unique_ptr<RunStepRequestWrapper> wrapped_request(
        new ProtoRunStepRequest(request));
    std::unique_ptr<MutableRunStepResponseWrapper> wrapped_response(
        new NonOwnedProtoRunStepResponse(response));
    return RunStep(call_options, wrapped_request.get(), wrapped_response.get());
  }

  virtual MutableRunStepRequestWrapper* CreateRunStepRequest() {
    MutableProtoRunStepRequest* ret = new MutableProtoRunStepRequest;
    ret->request_.set_request_id(GetUniqueRequestId());
    return ret;
  }

  virtual MutableRunStepResponseWrapper* CreateRunStepResponse() {
    return new OwnedProtoRunStepResponse;
  }

  virtual Status CloseSession(CallOptions* call_options,
                              const CloseSessionRequest* request,
                              CloseSessionResponse* response) = 0;

  virtual Status ListDevices(CallOptions* call_options,
                             const ListDevicesRequest* request,
                             ListDevicesResponse* response) = 0;

  virtual Status Reset(CallOptions* call_options, const ResetRequest* request,
                       ResetResponse* response) = 0;

  virtual Status MakeCallable(CallOptions* call_options,
                              const MakeCallableRequest* request,
                              MakeCallableResponse* response) = 0;
  virtual Status RunCallable(CallOptions* call_options,
                             const RunCallableRequest* request,
                             RunCallableResponse* response) = 0;
  virtual Status ReleaseCallable(CallOptions* call_options,
                                 const ReleaseCallableRequest* request,
                                 ReleaseCallableResponse* response) = 0;

 protected:
  // NOTE: This should only be called by implementations of this
  // interface whose CreateRunStepResponse() method returns a
  // proto-based wrappers for the RunStepResponse message.
  RunStepResponse* get_proto_from_wrapper(
      MutableRunStepResponseWrapper* wrapper) {
    return wrapper->get_proto();
  }
};

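Usage is as follows: if the Client and the Master are in the same process, LocalMaster is used directly; otherwise GrpcRemoteMaster is used to reach the remote GrpcMasterService over gRPC. The two rectangles labeled Master in the figure represent the actual Master class, which implements the concrete Master functionality.

Figure 1 Master logical structure

2.3 Invocation

The pseudocode below shows how a client interacts with the master. In distributed mode this is the process by which GrpcRemoteMaster talks to the remote MasterService over gRPC.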

stub = NewStub("/job:mnist/replica:0/task:0")
{handle} = stub->CreateSession({graph_def})
  
do {
   stub->RunStep({handle, {feeds}, {fetches}})
   // The client can evaluate a predicate locally, based on the
   // result of fetches, to determine whether to terminate. For
   // example, it might fetch the loss and evaluate whether it is less
   // than some threshold.
} while (!should_stop({fetches}));

stub->CloseSession({handle})
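
The same create / step / close cycle can be sketched in C++ against a toy in-memory master. Everything here (ToyMaster, TrainUntil, the halving "loss") is hypothetical and exists only to make the control flow concrete; it is not the TensorFlow API.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for a master; not the TensorFlow API.
class ToyMaster {
 public:
  // Stores the "graph" and returns a handle identifying the new session.
  std::string CreateSession(const std::string& graph_def) {
    std::string handle = "session_" + std::to_string(next_id_++);
    sessions_[handle] = graph_def;
    return handle;
  }

  // Pretends to run one step and returns a "loss" that halves each time.
  // Returns a negative value if the handle is unknown.
  double RunStep(const std::string& handle) {
    if (sessions_.count(handle) == 0) return -1.0;
    loss_ *= 0.5;
    return loss_;
  }

  void CloseSession(const std::string& handle) { sessions_.erase(handle); }

  bool HasSession(const std::string& handle) const {
    return sessions_.count(handle) > 0;
  }

 private:
  int next_id_ = 0;
  double loss_ = 1.0;
  std::map<std::string, std::string> sessions_;
};

// Mirrors the pseudocode: create a session, step until the locally
// evaluated predicate (loss below threshold) holds, then close.
// Returns the number of steps that were run.
int TrainUntil(ToyMaster& master, double threshold) {
  const std::string handle = master.CreateSession("graph_def");
  int steps = 0;
  double loss = 0.0;
  do {
    loss = master.RunStep(handle);
    ++steps;
  } while (loss >= threshold);
  master.CloseSession(handle);
  return steps;
}
```

The stop condition is evaluated on the client side from the fetched value, which is the point the comment inside the pseudocode makes.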

3. LocalMaster

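When the Client makes a call, GrpcSession first tries to obtain a local master through LocalMaster and only falls back to GrpcRemoteMaster if none is found. In the local case the Client and the master are not on different nodes, so LocalMaster lets them communicate directly within the process, which gives an in-process Client a more efficient Master service.

3.1 Definition

LocalMaster is defined as follows. Its main member is master_impl_. LocalMaster is just a thin shell that forwards every call to master_impl_, the class invoked directly when the Client and the master are in the same process.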

class LocalMaster : public MasterInterface {
 private:
  Master* master_impl_;  // Not owned.
  const int64 default_timeout_in_ms_;

  // See LocalMaster::Lookup for the factory function that creates
  // objects of this type.
  LocalMaster(Master* master_impl, const int64 default_timeout_in_ms);

  TF_DISALLOW_COPY_AND_ASSIGN(LocalMaster);
};

3.2 Registration

LocalMaster uses a static variable, local_master_registry_, as its registry.

typedef std::unordered_map<string, MasterInfo> LocalMasterRegistry;

LocalMasterRegistry* local_master_registry() {
  static LocalMasterRegistry* local_master_registry_ = new LocalMasterRegistry;
  return local_master_registry_;
}

When GrpcServer is initialized, the following call registers the Master created for a "grpc://" target with the LocalMaster registry.

LocalMaster::Register(target(), master_impl_.get(), config.operation_timeout_in_ms());

This simply inserts the master into the static variable local_master_registry_.

/* static */
void LocalMaster::Register(const string& target, Master* master,
                           int64 default_timeout_in_ms) {
  mutex_lock l(*get_local_master_registry_lock());
  local_master_registry()->insert(
      {target, MasterInfo(master, default_timeout_in_ms)});
}

3.3 Lookup

When GrpcSession::Create is called and the Client and Master are in the same process, Lookup finds the registered Master locally, creates a LocalMaster, and sets its master_impl_ to the Master it found. If nothing is found, Lookup returns null and GrpcSession::Create creates a GrpcRemoteMaster to interact with the remote Master.

/* static */
std::unique_ptr<LocalMaster> LocalMaster::Lookup(const string& target) {
  std::unique_ptr<LocalMaster> ret;
  mutex_lock l(*get_local_master_registry_lock());
  auto iter = local_master_registry()->find(target);
  if (iter != local_master_registry()->end()) {
    ret.reset(new LocalMaster(iter->second.master,
                              iter->second.default_timeout_in_ms));
  }
  return ret;
}

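The figure below shows the same-process case, where Lookup succeeds and a LocalMaster is created for local operation.

Figure 2 In-process master operation

Now consider the cross-process case. In process 1 the LocalMaster does not point to any Master because no Server was started locally, so the first step of GrpcSession::Create, the Lookup call, fails and returns null. GrpcSession::Create then moves to its second step and creates a GrpcRemoteMaster for remote interaction. In process 2 no client calls GrpcSession::Create, so the LocalMaster there does not point to any Master either.

Figure 3 Cross-process master operation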
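
The register / lookup / fall-back pattern described in this section can be condensed into a small self-contained sketch. All type and function names below (ToyRegistry, ToyLocalMaster, ToyRemoteMaster, ConnectToMaster) are hypothetical stand-ins chosen for illustration; the sketch assumes only what the text states, namely that a successful lookup yields a local wrapper and a failed one leads to a remote master.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the real Master class.
struct ToyMasterImpl {
  std::string name;
};

class ToyMasterInterface {
 public:
  virtual ~ToyMasterInterface() = default;
  virtual bool is_local() const = 0;
};

// Thin shell around a registered master; does not own it.
class ToyLocalMaster : public ToyMasterInterface {
 public:
  explicit ToyLocalMaster(ToyMasterImpl* impl) : impl_(impl) {}
  bool is_local() const override { return true; }

 private:
  ToyMasterImpl* impl_;  // Not owned.
};

// Placeholder for a master reached over the network.
class ToyRemoteMaster : public ToyMasterInterface {
 public:
  explicit ToyRemoteMaster(std::string target) : target_(std::move(target)) {}
  bool is_local() const override { return false; }

 private:
  std::string target_;
};

// Mutex-protected map from target string to the master serving it.
class ToyRegistry {
 public:
  void Register(const std::string& target, ToyMasterImpl* impl) {
    std::lock_guard<std::mutex> lock(mu_);
    masters_.insert({target, impl});
  }

  // Returns nullptr when no master is registered for the target.
  std::unique_ptr<ToyLocalMaster> Lookup(const std::string& target) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = masters_.find(target);
    if (it == masters_.end()) return nullptr;
    return std::make_unique<ToyLocalMaster>(it->second);
  }

 private:
  std::mutex mu_;
  std::unordered_map<std::string, ToyMasterImpl*> masters_;
};

// Try the in-process registry first, then fall back to a remote master.
std::unique_ptr<ToyMasterInterface> ConnectToMaster(ToyRegistry& registry,
                                                    const std::string& target) {
  std::unique_ptr<ToyMasterInterface> master = registry.Lookup(target);
  if (!master) master = std::make_unique<ToyRemoteMaster>(target);
  return master;
}
```

Because both wrappers share one interface, the caller does not need to know which path was taken.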

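3.4 Functionality

LocalMaster delegates to its member master_impl_ to do the actual work. Each method starts the asynchronous Master call and then blocks on a Notification until the callback fires or the wait times out.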

Status LocalMaster::CreateSession(CallOptions* call_options,
                                  const CreateSessionRequest* request,
                                  CreateSessionResponse* response) {
  Notification n;
  Status ret;
  master_impl_->CreateSession(request, response, [&n, &ret](const Status& s) {
    ret.Update(s);
    n.Notify();
  });
  TF_RETURN_IF_ERROR(
      WaitForNotification(call_options, default_timeout_in_ms_, &n));
  return ret;
}

Status LocalMaster::ExtendSession(CallOptions* call_options,
                                  const ExtendSessionRequest* request,
                                  ExtendSessionResponse* response) {
  Notification n;
  Status ret;
  master_impl_->ExtendSession(request, response, [&n, &ret](const Status& s) {
    ret.Update(s);
    n.Notify();
  });
  TF_RETURN_IF_ERROR(
      WaitForNotification(call_options, default_timeout_in_ms_, &n));
  return ret;
}

Status LocalMaster::RunStep(CallOptions* call_options,
                            RunStepRequestWrapper* request,
                            MutableRunStepResponseWrapper* response) {
  Notification n;
  Status ret;
  master_impl_->RunStep(call_options, request, response,
                        [&n, &ret](const Status& s) {
                          ret.Update(s);
                          n.Notify();
                        });
  TF_RETURN_IF_ERROR(
      WaitForNotification(call_options, default_timeout_in_ms_, &n));
  return ret;
}
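
All three methods above follow one pattern: start an asynchronous call, let its callback record the status and signal a notification, and block the caller until then. Below is a minimal sketch of that pattern using only the standard library. ToyNotification, CallSynchronously and the integer status codes are assumptions made for this example; the real Notification and WaitForNotification helpers may behave differently, for instance in how a timeout is handled.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>

constexpr int kOk = 0;
constexpr int kDeadlineExceeded = 4;  // Illustrative code, chosen for this sketch.

// Minimal one-shot notification.
class ToyNotification {
 public:
  void Notify() {
    std::lock_guard<std::mutex> lock(mu_);
    notified_ = true;
    cv_.notify_all();
  }

  // Returns true if Notify() was called before the timeout expired.
  bool WaitFor(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    return cv_.wait_for(lock, timeout, [this] { return notified_; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool notified_ = false;
};

using DoneCallback = std::function<void(int)>;
using AsyncOp = std::function<void(DoneCallback)>;

// State shared with the callback so that a late callback stays safe
// even after the caller has given up waiting.
struct CallState {
  ToyNotification done;
  int status = kOk;
};

// Turns a callback-style operation into a blocking call with a timeout.
int CallSynchronously(const AsyncOp& async_op,
                      std::chrono::milliseconds timeout) {
  auto state = std::make_shared<CallState>();
  async_op([state](int status) {
    state->status = status;
    state->done.Notify();
  });
  if (!state->done.WaitFor(timeout)) return kDeadlineExceeded;
  return state->status;
}
```

The sketch keeps the shared state in a shared_ptr, a deliberate difference from capturing stack variables by reference, so that a callback arriving after a timeout cannot touch destroyed objects.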

4. GrpcRemoteMaster

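GrpcRemoteMaster is a gRPC client implementation. It calls the GrpcMasterService on the remote Master through a Stub, so the call behaves like a local function call. The remote GrpcMasterService implements every interface defined by the MasterService service and is the real entity behind it. Creating a GrpcRemoteMaster instance requires a target that specifies the address and port of the Master service, and the corresponding RPC channel is created for it. Strictly speaking, GrpcSession and GrpcRemoteMaster are both part of the Client implementation.

4.1 Definition

GrpcRemoteMaster is defined as follows. It is built mainly around MasterServiceStub.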

// GrpcRemoteMaster is an implementation of the MasterInterface
// that uses gRPC to talk to the Master service.
class GrpcRemoteMaster : public MasterInterface {
  using MasterServiceStub = grpc::MasterService::Stub;

 public:
  explicit GrpcRemoteMaster(const SharedGrpcChannelPtr& client_channel)
      : stub_(grpc::MasterService::NewStub(client_channel)) {}

  ~GrpcRemoteMaster() override {}

  std::unique_ptr<MasterServiceStub> stub_;
};

4.2 Functionality

GrpcRemoteMaster's job is simple: it calls the corresponding interface of the remote Master service through a gRPC stub.

4.2.1 CreateSession

Taking CreateSession as an example, the work is done by CallWithRetry.

Status CreateSession(CallOptions* call_options,
                     const CreateSessionRequest* request,
                     CreateSessionResponse* response) override {
  return CallWithRetry(call_options, request, response,
                       &MasterServiceStub::CreateSession);
}

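CallWithRetry is shown below. It invokes the Stub through s = FromGrpcStatus((stub_.get()->*pfunc)(&ctx, *request, response)) and, while the result is Unavailable, retries with a backoff up to kMaxRetries times.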

template <typename Request, typename Response>
Status CallWithRetry(CallOptions* call_options, const Request* request,
                     Response* response,
                     ::grpc::Status (MasterServiceStub::*pfunc)(
                         ::grpc::ClientContext*, const Request&, Response*),
                     string trace_string = {}) {
  absl::Duration timeout = absl::Milliseconds(call_options->GetTimeout());
  absl::Time expired_time = absl::FromUnixMicros(Env::Default()->NowMicros());
  if (timeout > absl::ZeroDuration()) {
    expired_time += timeout;
  }
  Status s;
  for (int num_retries = 0;; ++num_retries) {
    ::grpc::ClientContext ctx;
    std::unique_ptr<profiler::TraceMe> trace;
    if (!trace_string.empty()) {
      trace.reset(NewTraceRpc(trace_string, &ctx));
    }
    ctx.set_fail_fast(false);
    if (timeout > absl::ZeroDuration()) {
      // We do not modify the timeout here to match legacy behavior. However,
      // this could violate the contract of tensorflow::Session. If we retry
      // an RPC just before the deadline is exceeded, we will still set the
      // timeout to the original value. This leads to the overall timeout
      // being double what was expected.
      ctx.set_deadline(absl::ToChronoTime(absl::Now() + timeout));
    }
    s = FromGrpcStatus((stub_.get()->*pfunc)(&ctx, *request, response));
    if (!errors::IsUnavailable(s)) {
      return s;
    }
    // TODO(b/117162170): we may want to make this configurable.
    constexpr int kMaxRetries = 10;
    if (num_retries >= kMaxRetries) {
      return s;
    }
    absl::Time now = absl::FromUnixMicros(Env::Default()->NowMicros());
    const absl::Time deadline_with_backoff =
        now + absl::Microseconds(ComputeBackoffMicroseconds(num_retries));
    // Wait for a short period of time before retrying the RPC.  If our
    // backoff would put us past the RPC deadline, we truncate it to ensure
    // our RPC starts before the deadline.
    const auto backoff_until = (timeout <= absl::ZeroDuration() ||
                                expired_time > deadline_with_backoff)
                                   ? deadline_with_backoff
                                   : expired_time;
    Env::Default()->SleepForMicroseconds(
        absl::ToInt64Microseconds(backoff_until - now));
    now = absl::FromUnixMicros(Env::Default()->NowMicros());
    if (now > expired_time && timeout > absl::ZeroDuration()) {
      // If timeout_in_ms is set, exit the retry loop on timeout.
      return errors::DeadlineExceeded(ctx.debug_error_string());
    }
  }
}
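
Stripped of the deadline handling, the retry loop above reduces to "retry only on Unavailable, sleep a growing backoff between attempts, give up after a fixed number of retries". The sketch below shows that core. ToyCode, ToyBackoffMicros and CallWithRetrySketch are hypothetical names, and the doubling backoff is an assumption of this example, not necessarily what ComputeBackoffMicroseconds does. The sleep function is injected so the loop can be exercised without waiting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>

enum class ToyCode { kOk, kUnavailable, kInvalidArgument };

// Assumed backoff policy for this sketch: 1 ms, doubling, capped at 1 s.
int64_t ToyBackoffMicros(int num_retries) {
  const int64_t kBase = 1000;
  const int64_t kMax = 1000000;
  int64_t backoff = kBase;
  for (int i = 0; i < num_retries && backoff < kMax; ++i) backoff *= 2;
  return std::min(backoff, kMax);
}

struct RetryResult {
  ToyCode code;          // Status of the last attempt.
  int attempts;          // Total number of times the RPC was issued.
  int64_t slept_micros;  // Total backoff requested between attempts.
};

// Retries `rpc` only while it reports kUnavailable, at most `max_retries`
// times after the first attempt. Any other status is returned immediately.
RetryResult CallWithRetrySketch(const std::function<ToyCode()>& rpc,
                                const std::function<void(int64_t)>& sleep_micros,
                                int max_retries) {
  int64_t slept = 0;
  for (int num_retries = 0;; ++num_retries) {
    const ToyCode code = rpc();
    if (code != ToyCode::kUnavailable || num_retries >= max_retries) {
      return {code, num_retries + 1, slept};
    }
    const int64_t backoff = ToyBackoffMicros(num_retries);
    sleep_micros(backoff);
    slept += backoff;
  }
}
```

Retrying only on Unavailable matters: errors such as an invalid argument would fail again in exactly the same way, so repeating them just delays the error report.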

4.2.2 Master Service Stub

Next comes the Stub, which is implemented with gRPC according to "//tensorflow/core/protobuf/master_service.proto".

class Stub final : public StubInterface {
 public:
  Stub(const std::shared_ptr< ::grpc::ChannelInterface>& channel);
  ::grpc::Status CreateSession(::grpc::ClientContext* context,
                               const CreateSessionRequest& request,
                               CreateSessionResponse* response) override;
  ::grpc::Status ExtendSession(::grpc::ClientContext* context,
                               const ExtendSessionRequest& request,
                               ExtendSessionResponse* response) override;
  ::grpc::Status PartialRunSetup(::grpc::ClientContext* context,
                                 const PartialRunSetupRequest& request,
                                 PartialRunSetupResponse* response) override;
  ::grpc::Status RunStep(::grpc::ClientContext* context,
                         const RunStepRequest& request,
                         RunStepResponse* response) override;
  ::grpc::Status CloseSession(::grpc::ClientContext* context,
                              const CloseSessionRequest& request,
                              CloseSessionResponse* response) override;
  ::grpc::Status ListDevices(::grpc::ClientContext* context,
                             const ListDevicesRequest& request,
                             ListDevicesResponse* response) override;
  ::grpc::Status Reset(::grpc::ClientContext* context,
                       const ResetRequest& request,
                       ResetResponse* response) override;
  ::grpc::Status MakeCallable(::grpc::ClientContext* context,
                              const MakeCallableRequest& request,
                              MakeCallableResponse* response) override;
  ::grpc::Status RunCallable(::grpc::ClientContext* context,
                             const RunCallableRequest& request,
                             RunCallableResponse* response) override;
  ::grpc::Status ReleaseCallable(::grpc::ClientContext* context,
                                 const ReleaseCallableRequest& request,
                                 ReleaseCallableResponse* response) override;

 private:
  std::shared_ptr< ::grpc::ChannelInterface> channel_;
  const ::grpc::internal::RpcMethod rpcmethod_CreateSession_;
  const ::grpc::internal::RpcMethod rpcmethod_ExtendSession_;
  const ::grpc::internal::RpcMethod rpcmethod_PartialRunSetup_;
  const ::grpc::internal::RpcMethod rpcmethod_RunStep_;
  const ::grpc::internal::RpcMethod rpcmethod_CloseSession_;
  const ::grpc::internal::RpcMethod rpcmethod_ListDevices_;
  const ::grpc::internal::RpcMethod rpcmethod_Reset_;
  const ::grpc::internal::RpcMethod rpcmethod_MakeCallable_;
  const ::grpc::internal::RpcMethod rpcmethod_RunCallable_;
  const ::grpc::internal::RpcMethod rpcmethod_ReleaseCallable_;
};

The corresponding remote methods are:

static const char* grpcMasterService_method_names[] = {
    "/tensorflow.MasterService/CreateSession",
    "/tensorflow.MasterService/ExtendSession",
    "/tensorflow.MasterService/PartialRunSetup",
    "/tensorflow.MasterService/RunStep",
    "/tensorflow.MasterService/CloseSession",
    "/tensorflow.MasterService/ListDevices",
    "/tensorflow.MasterService/Reset",
    "/tensorflow.MasterService/MakeCallable",
    "/tensorflow.MasterService/RunCallable",
    "/tensorflow.MasterService/ReleaseCallable",
};

std::unique_ptr<MasterService::Stub> MasterService::NewStub(
    const std::shared_ptr< ::grpc::ChannelInterface>& channel,
    const ::grpc::StubOptions& options) {
  std::unique_ptr<MasterService::Stub> stub(new MasterService::Stub(channel));
  return stub;
}

Internally, the Stub calls gRPC to send the request.

::grpc::Status MasterService::Stub::CreateSession(
    ::grpc::ClientContext* context, const CreateSessionRequest& request,
    CreateSessionResponse* response) {
  return ::grpc::internal::BlockingUnaryCall(
      channel_.get(), rpcmethod_CreateSession_, context, request, response);
}

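So with GrpcRemoteMaster the call flow is: GrpcRemoteMaster receives the request from the gRPC session and hands it to the gRPC master service. Along the way the request passes through GrpcSession -> GrpcRemoteMaster -> GrpcMasterService -> Master -> MasterSession.

4.3 Creation

When a GrpcSession is created, the Create method first looks for a local Master. If one is found, a LocalMaster is returned directly, as described earlier. If Lookup finds nothing, NewGrpcMaster is called to create a GrpcRemoteMaster.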

/* static */
Status GrpcSession::Create(const SessionOptions& options,
                           std::unique_ptr<GrpcSession>* out_session) {
  std::unique_ptr<GrpcSession> session(new GrpcSession(options));
  std::unique_ptr<MasterInterface> master;
  // For testing, we enable the client to disable the use of the local
  // master registry, so that the RPC stack is exercised.
  if (!options.config.rpc_options().use_rpc_for_inprocess_master()) {
    master = LocalMaster::Lookup(options.target); 
  }
  if (!master) {
    SharedGrpcChannelPtr master_channel;
    TF_RETURN_IF_ERROR(
        NewHostPortGrpcChannel(options.target.substr(kSchemePrefixLength),
                               &options.config.rpc_options(), &master_channel));
    // Create a GrpcRemoteMaster to interact with the remote Master.
    master.reset(NewGrpcMaster(master_channel));
  } else {
    session->is_local_ = true;
  }
  session->SetRemoteMaster(std::move(master));
  *out_session = std::move(session);
  return Status::OK();
}

NewGrpcMaster is defined as follows:

MasterInterface* NewGrpcMaster(const SharedGrpcChannelPtr& channel) {
  return new GrpcRemoteMaster(channel);
}

5. GrpcMasterService

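GrpcMasterService implements the RPC-facing MasterService. GrpcMasterService:

Knows in advance which local devices are available to clients, discovers remote devices, and tracks their statistics.
Maintains and manages live graph computation sessions (MasterSession), which invoke local or remote devices to compute the graphs they receive.
Each session analyzes the received graph, prunes it, places nodes on available devices, and runs the graph on workers by calling RunGraph.

5.1 Creation

In GrpcServer, master_service_ is a variable of type GrpcMasterService.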

  // Create the Master and its corresponding GrpcMasterService.
  master_impl_ = CreateMaster(&master_env_);
  master_service_ = NewGrpcMasterService(master_impl_.get(), config, &builder);

GrpcServer uses the master_thread_ thread to run GrpcMasterService's HandleRPCsLoop method.

master_thread_.reset(
    env_->StartThread(ThreadOptions(), "TF_master_service",
                      [this] { master_service_->HandleRPCsLoop(); }));

5.2 Definition

GrpcMasterService is defined as follows. master_impl_ is the master pointer passed in by the Server, an instance of the Master class:

class GrpcMasterService : public AsyncServiceInterface {
  Master* master_impl_ = nullptr;  // Not owned.
  std::unique_ptr<::grpc::ServerCompletionQueue> cq_;
  grpc::MasterService::AsyncService master_service_;

  mutex mu_;
  bool is_shutdown_ TF_GUARDED_BY(mu_);
  const ConfigProto default_session_config_;
  ::grpc::Alarm* shutdown_alarm_ = nullptr;

  template <class RequestMessage, class ResponseMessage>
  using MasterCall = Call<GrpcMasterService, grpc::MasterService::AsyncService,
                          RequestMessage, ResponseMessage>;
};

When GrpcMasterService is initialized, it obtains the gRPC completion queue cq_.

GrpcMasterService(Master* master, const ConfigProto& default_session_config,
                  ::grpc::ServerBuilder* builder)
    : master_impl_(master),
      is_shutdown_(false),
      default_session_config_(default_session_config) {
  builder->RegisterService(&master_service_);
  cq_ = builder->AddCompletionQueue();
}

5.3 Main Loop

As mentioned above, the master_thread_ thread runs GrpcMasterService's HandleRPCsLoop method, which dispatches incoming RPC messages to GrpcMasterService's internal handler functions. The main loop is shown below:

void HandleRPCsLoop() override {
  ENQUEUE_REQUEST(CreateSession, true);
  ENQUEUE_REQUEST(ExtendSession, false);
  for (int i = 0; i < 100; ++i) {
    ENQUEUE_REQUEST(PartialRunSetup, false);
    ENQUEUE_REQUEST(RunStep, true);
  }
  ENQUEUE_REQUEST(CloseSession, false);
  ENQUEUE_REQUEST(ListDevices, false);
  ENQUEUE_REQUEST(Reset, false);
  ENQUEUE_REQUEST(MakeCallable, false);
  for (int i = 0; i < 100; ++i) {
    ENQUEUE_REQUEST(RunCallable, true);
  }
  ENQUEUE_REQUEST(ReleaseCallable, false);

  void* tag;
  bool ok;
  while (cq_->Next(&tag, &ok)) {
    UntypedCall<GrpcMasterService>::Tag* callback_tag =
        static_cast<UntypedCall<GrpcMasterService>::Tag*>(tag);
    if (callback_tag) {
      callback_tag->OnCompleted(this, ok);
    } else {
      // NOTE(mrry): A null callback_tag indicates that this is
      // the shutdown alarm.
      cq_->Shutdown();
    }
  }
}

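The code above follows a few best practices centered on ENQUEUE_REQUEST:

this->cq_ is the gRPC completion queue.
The ENQUEUE_REQUEST macro creates a new request for the given RPC method name (for example, ENQUEUE_REQUEST(ListDevices, false) creates a ListDevices request), and these requests are queued on this->cq_.
A number of requests are placed on cq_ in advance. Whenever a handler picks up a request, it calls ENQUEUE_REQUEST() to put an identical request back, so the completion queue cq_ always has enough outstanding requests to accept incoming calls. Processing therefore does not block and overall throughput improves.
The while loop at the end reads events from the gRPC queue and performs the follow-up work for each gRPC call.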

#define ENQUEUE_REQUEST(method, supports_cancel)                              \
  do {                                                                        \
    mutex_lock l(mu_);                                                        \
    if (!is_shutdown_) {                                                      \
      Call<GrpcMasterService, grpc::MasterService::AsyncService,              \
           method##Request, method##Response>::                               \
          EnqueueRequest(&master_service_, cq_.get(),                         \
                         &grpc::MasterService::AsyncService::Request##method, \
                         &GrpcMasterService::method##Handler,                 \
                         (supports_cancel));                                  \
    }                                                                         \
  } while (0)
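
The "pre-post, then replenish" idea can be modelled without gRPC at all. The toy below counts outstanding request slots per method: a call is accepted only when a slot is waiting, and handling a call posts a fresh slot. ToyService and its methods are hypothetical, and returning false from Accept is a simplification made for this sketch; with a real completion queue a call without an outstanding request is not matched until one is posted.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>

// Hypothetical model: one "slot" is one pre-posted request waiting
// for a client call of the given method.
class ToyService {
 public:
  // Pre-posts `count` slots for `method`, like repeated ENQUEUE_REQUEST calls.
  void Enqueue(const std::string& method, int count = 1) {
    slots_[method] += count;
  }

  // Simulates an incoming RPC. It is matched only if a slot is waiting.
  bool Accept(const std::string& method) {
    auto it = slots_.find(method);
    if (it == slots_.end() || it->second == 0) return false;
    --it->second;
    pending_.push_back(method);
    return true;
  }

  // Event-loop body: handles every accepted call. Each handler ends by
  // posting a new slot for the same method, keeping the pool size constant.
  int HandleAll() {
    int handled = 0;
    while (!pending_.empty()) {
      const std::string method = pending_.front();
      pending_.pop_front();
      ++handled;        // The real handler work would happen here.
      Enqueue(method);  // Replenish the slot that was consumed.
    }
    return handled;
  }

  int SlotsFor(const std::string& method) const {
    auto it = slots_.find(method);
    return it == slots_.end() ? 0 : it->second;
  }

 private:
  std::map<std::string, int> slots_;
  std::deque<std::string> pending_;
};
```

Posting 100 slots for RunStep but only one for CreateSession, as HandleRPCsLoop does, reflects how much more often the step RPC is expected to be in flight.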

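5.4 Message Handling

Each handler calls master_impl_ to do the work. When the Master finishes, it invokes a lambda callback that sends the response message to the Client; in the example below the Client eventually receives a CreateSessionResponse. Note that the handler ends by using ENQUEUE_REQUEST to enqueue another request of the same type.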

// RPC handler for creating a session.
void CreateSessionHandler(
    MasterCall<CreateSessionRequest, CreateSessionResponse>* call) {
  CreateSessionRequest* rewritten_req = new CreateSessionRequest;
  rewritten_req->mutable_config()->MergeFrom(default_session_config_);
  rewritten_req->MergeFrom(call->request);
  master_impl_->CreateSession(rewritten_req, &call->response,
                              [call, rewritten_req](const Status& status) {
                                call->SendResponse(ToGrpcStatus(status));
                                delete rewritten_req;
                              });
  ENQUEUE_REQUEST(CreateSession, true);
}

5.5 Functionality

GrpcMasterService provides the following API:

static const char* grpcMasterService_method_names[] = {
    "/tensorflow.MasterService/CreateSession",
    "/tensorflow.MasterService/ExtendSession",
    "/tensorflow.MasterService/PartialRunSetup",
    "/tensorflow.MasterService/RunStep",
    "/tensorflow.MasterService/CloseSession",
    "/tensorflow.MasterService/ListDevices",
    "/tensorflow.MasterService/Reset",
    "/tensorflow.MasterService/MakeCallable",
    "/tensorflow.MasterService/RunCallable",
    "/tensorflow.MasterService/ReleaseCallable",
};

Let us look at three of these in detail:

5.5.1 CreateSession

The CreateSessionRequest message carries the computation graph and the configuration set by the Client. On receiving the request, the Master creates a MasterSession instance for this Client together with a session_handle that uniquely identifies it. The mapping is kept in the Master member std::unordered_map<string, MasterSession*> sessions_, so session_handle is a string.

The Master returns a CreateSessionResponse message to the Client. It carries:

session_handle. The Client's GrpcSession uses it to associate itself with the MasterSession on the Master side. Every subsequent request from the Client carries this session_handle, and the Master uses it to find the corresponding MasterSession instance in std::unordered_map<string, MasterSession*> sessions_.
The initial graph_version, used by later ExtendSession calls that append new nodes to the original graph.

Figure 4 CreateSession
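
The handle bookkeeping described above is essentially a string-keyed map. A minimal sketch follows; ToySessionTable, ToySession and the "handle_N" naming scheme are made up for illustration, and the real Master generates its handles differently.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for MasterSession.
struct ToySession {
  std::string graph_def;
  int64_t graph_version = 0;
};

// What the client gets back: the handle plus the initial graph version.
struct ToyCreateResponse {
  std::string session_handle;
  int64_t graph_version;
};

class ToySessionTable {
 public:
  // Creates a session and registers it under a fresh, unique handle.
  ToyCreateResponse CreateSession(const std::string& graph_def) {
    const std::string handle = "handle_" + std::to_string(next_id_++);
    auto session = std::make_unique<ToySession>();
    session->graph_def = graph_def;
    const int64_t version = session->graph_version;
    sessions_[handle] = std::move(session);
    return {handle, version};
  }

  // Resolves the handle carried by a later request.
  // Returns nullptr when the handle is unknown.
  ToySession* Find(const std::string& handle) {
    auto it = sessions_.find(handle);
    return it == sessions_.end() ? nullptr : it->second.get();
  }

 private:
  int next_id_ = 0;
  std::unordered_map<std::string, std::unique_ptr<ToySession>> sessions_;
};
```

Every later request only needs to carry the handle string; the server side resolves it to the session state.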

The handler code is:

// RPC handler for creating a session.
void CreateSessionHandler(
    MasterCall<CreateSessionRequest, CreateSessionResponse>* call) {
  CreateSessionRequest* rewritten_req = new CreateSessionRequest;
  rewritten_req->mutable_config()->MergeFrom(default_session_config_);
  rewritten_req->MergeFrom(call->request);
  master_impl_->CreateSession(rewritten_req, &call->response,
                              [call, rewritten_req](const Status& status) {
                                call->SendResponse(ToGrpcStatus(status));
                                delete rewritten_req;
                              });
  ENQUEUE_REQUEST(CreateSession, true);
}

5.5.2 ExtendSession

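After a Session has been created, the Client can use ExtendSession to tell the Master that it wants to grow the existing computation graph. Subgraphs can only be appended; existing nodes cannot be modified or deleted.

The request message ExtendSessionRequest contains:

session_handle: identifies the MasterSession instance to look up;
graph_def: the nodes to add to the computation graph;
current_graph_version: the version number of the graph being extended;

The response message ExtendSessionResponse returns new_graph_version, which is used for the next ExtendSession call.

Figure 5 ExtendSession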
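
The version fields act as an optimistic-concurrency check: an extension is applied only if the client's view of the graph is current. The sketch below illustrates this with a hypothetical ToyGraph and ExtendGraph. Bumping the version by exactly one per extension is an assumption of this example, not something the messages above specify.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical append-only graph with a version counter.
struct ToyGraph {
  std::vector<std::string> nodes;
  int64_t version = 0;
};

// Appends `new_nodes` only if the caller's `current_graph_version` matches
// the graph's version. On success writes the new version and returns true;
// on a stale version leaves the graph untouched and returns false.
bool ExtendGraph(ToyGraph& graph, const std::vector<std::string>& new_nodes,
                 int64_t current_graph_version, int64_t* new_graph_version) {
  if (current_graph_version != graph.version) return false;
  graph.nodes.insert(graph.nodes.end(), new_nodes.begin(), new_nodes.end());
  graph.version += 1;  // Assumed increment for this sketch.
  *new_graph_version = graph.version;
  return true;
}
```

A client that remembers the returned version and passes it to the next call can extend repeatedly, while a stale client is rejected instead of silently extending the wrong graph.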

The code is:

// RPC handler for extending a session.
void ExtendSessionHandler(
    MasterCall<ExtendSessionRequest, ExtendSessionResponse>* call) {
  master_impl_->ExtendSession(&call->request, &call->response,
                              [call](const Status& status) {
                                call->SendResponse(ToGrpcStatus(status));
                              });
  ENQUEUE_REQUEST(ExtendSession, false);
}

5.5.3 RunStep

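The client calls RunStep iteratively. The request message RunStepRequest has many fields, for example:

session_handle: identifies the MasterSession instance to look up;
feed: the list of input NamedTensors;
fetch: the list of names of the tensors to output;
target: the list of nodes to run;

The response message RunStepResponse mainly carries:

tensor: the list of output tensors;

Figure 6 RunStep

The messages are defined as follows: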

message RunStepRequest {
  // REQUIRED: session_handle must be returned by a CreateSession call
  // to the same master service.
  string session_handle = 1;

  // Tensors to be fed in the step. Each feed is a named tensor.
  repeated NamedTensorProto feed = 2;

  // Fetches. A list of tensor names. The caller expects a tensor to
  // be returned for each fetch[i] (see RunStepResponse.tensor). The
  // order of specified fetches does not change the execution order.
  repeated string fetch = 3;

  // Target Nodes. A list of node names. The named nodes will be run
  // to but their outputs will not be fetched.
  repeated string target = 4;

  // Options for the run call.
  RunOptions options = 5;

  // Partial run handle (optional). If specified, this will be a partial run
  // execution, run up to the specified fetches.
  string partial_run_handle = 6;

  // If true then some errors, e.g., execution errors that have long
  // error messages, may return an OK RunStepResponse with the actual
  // error saved in the status_code/status_error_message fields of the
  // response body. This is a workaround since the RPC subsystem may
  // truncate long metadata messages.
  bool store_errors_in_response_body = 7;

  // Unique identifier for this request. Every RunStepRequest must
  // have a unique request_id, and retried RunStepRequest must have
  // the same request_id. If request_id is zero, retry detection is disabled.
  int64 request_id = 8;
}

message RunStepResponse {
  // NOTE: The order of the returned tensors may or may not match
  // the fetch order specified in RunStepRequest.
  repeated NamedTensorProto tensor = 1;

  // Returned metadata if requested in the options.
  RunMetadata metadata = 2;

  // If store_errors_in_response_body is true in the request, then
  // optionally the server may return an OK status for the RPC and
  // fill the true status into the fields below, to allow for messages
  // that are too long to fit in metadata.
  error.Code status_code = 3;
  string status_error_message = 4;
}

The code is:

// RPC handler for running one step in a session.
void RunStepHandler(MasterCall<RunStepRequest, RunStepResponse>* call) {
  auto* trace = TraceRpc("RunStep/Server", call->client_metadata());
  CallOptions* call_opts = new CallOptions;
  if (call->request.options().timeout_in_ms() > 0) {
    call_opts->SetTimeout(call->request.options().timeout_in_ms());
  } else {
    call_opts->SetTimeout(default_session_config_.operation_timeout_in_ms());
  }
  RunStepRequestWrapper* wrapped_request =
      new ProtoRunStepRequest(&call->request);
  MutableRunStepResponseWrapper* wrapped_response =
      new NonOwnedProtoRunStepResponse(&call->response);
  call->SetCancelCallback([call_opts]() { call_opts->StartCancel(); });
  master_impl_->RunStep(
      call_opts, wrapped_request, wrapped_response,
      [call, call_opts, wrapped_request, trace](const Status& status) {
        call->ClearCancelCallback();
        delete call_opts;
        delete wrapped_request;
        delete trace;
        if (call->request.store_errors_in_response_body() && !status.ok()) {
          call->response.set_status_code(status.code());
          call->response.set_status_error_message(status.error_message());
          call->SendResponse(ToGrpcStatus(Status::OK()));
        } else {
          call->SendResponse(ToGrpcStatus(status));
        }
      });
  ENQUEUE_REQUEST(RunStep, true);
}
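
The last branch of the callback decides where an error travels: in the RPC status, or inside the response body behind an OK RPC status. That decision can be isolated in a few lines. ToyStatus, ToyRunStepResponse and FinishRunStep are hypothetical simplifications of the real types, written only to make the branch testable.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified status: code 0 means OK.
struct ToyStatus {
  int code = 0;
  std::string message;
  bool ok() const { return code == 0; }
};

// Only the two error fields of the response matter for this sketch.
struct ToyRunStepResponse {
  int status_code = 0;
  std::string status_error_message;
};

// Returns the status to send at the RPC level. When the client asked for
// errors in the body and the step failed, the error is copied into the
// response and the RPC itself reports OK.
ToyStatus FinishRunStep(const ToyStatus& status,
                        bool store_errors_in_response_body,
                        ToyRunStepResponse* response) {
  if (store_errors_in_response_body && !status.ok()) {
    response->status_code = status.code;
    response->status_error_message = status.message;
    return ToyStatus{};
  }
  return status;
}
```

As the comment in RunStepRequest explains, this exists because the RPC layer may truncate long error messages carried in metadata.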

6. Business Logic: the Master Class

6.1 Creation

As mentioned earlier, GrpcServer creates an instance of the Master class.

std::unique_ptr<Master> GrpcServer::CreateMaster(MasterEnv* master_env) {
  return std::unique_ptr<Master>(new Master(master_env, 0.0));
}

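This way, after a message arrives from the Client, the GrpcMasterService thread calls master_impl_ inside the handler; that is, it delegates the business logic to the Master class. So next we look at how Master handles requests.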

// RPC handler for creating a session.
void CreateSessionHandler(
    MasterCall<CreateSessionRequest, CreateSessionResponse>* call) {
  CreateSessionRequest* rewritten_req = new CreateSessionRequest;
  rewritten_req->mutable_config()->MergeFrom(default_session_config_);
  rewritten_req->MergeFrom(call->request);
  master_impl_->CreateSession(rewritten_req, &call->response,
                              [call, rewritten_req](const Status& status) {
                                call->SendResponse(ToGrpcStatus(status));
                                delete rewritten_req;
                              });
  ENQUEUE_REQUEST(CreateSession, true);
}

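6.2 Definition

Master is not a subclass of MasterInterface. It is declared in tensorflow/core/distributed_runtime/master.h and implemented in tensorflow/core/distributed_runtime/master.cc. As its sessions_ member shows, its main job is to manage MasterSession instances.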

class Master {

 private:
  typedef Master ME;

  // Not owned.
  MasterEnv* env_ = nullptr;

  // Owned.
  mutex mu_;

  // shutdown_ is set to true by the dtor.
  condition_variable shutdown_cv_;
  bool shutdown_ TF_GUARDED_BY(mu_) = false;
  Thread* gc_thread_;

  // Maps session handles to sessions.
  std::unordered_map<string, MasterSession*> sessions_ TF_GUARDED_BY(mu_);

  // Moving average of step times.
  MovingAverage last_1000_steps_ TF_GUARDED_BY(mu_);

  // Cumulative number of steps executed.
  int64 step_count_ TF_GUARDED_BY(mu_);

  // If a session is not active for this many seconds, it will be
  // closed automatically.
  const double session_gc_seconds_;

  // Used to track ids for incoming requests so we can detect duplicates.
  RecentRequestIds recent_request_ids_;
};
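
The comment on session_gc_seconds_ says that idle sessions are closed automatically. A sweep of that kind could look roughly like the sketch below, which tracks only a last-access timestamp per handle. SweepIdleSessions is a hypothetical helper, and treating a non-positive threshold as "garbage collection disabled" is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Removes every session that has been idle for longer than
// `session_gc_seconds` and returns how many were removed.
// Assumption of this sketch: a non-positive threshold disables the sweep.
int SweepIdleSessions(
    std::unordered_map<std::string, double>& last_access_seconds,
    double now_seconds, double session_gc_seconds) {
  if (session_gc_seconds <= 0.0) return 0;
  int closed = 0;
  for (auto it = last_access_seconds.begin();
       it != last_access_seconds.end();) {
    if (now_seconds - it->second > session_gc_seconds) {
      it = last_access_seconds.erase(it);  // erase returns the next iterator.
      ++closed;
    } else {
      ++it;
    }
  }
  return closed;
}
```

A background thread such as gc_thread_ would call a sweep like this periodically while holding the lock that guards the session map.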

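6.3 Functionality

Recall what was said earlier.

The core of distributed execution is how the computation graph is handled, and that work is split among three roles: Client, Master, and Worker.

The Client builds the computation graph and the Worker performs the actual computation, but how does a Worker know what to compute? TensorFlow inserts a Master role between the two to handle coordination and scheduling.

Although Master is not a subclass of MasterInterface, it implements the actual business logic of MasterService. Specifically, Master is responsible for the following:

The Master knows in advance which local devices are available to clients, discovers remote devices, and tracks the statistics of those remote devices.
A Master contains multiple "master sessions". Each master session encapsulates a computation graph and its associated state.
A master session will:
- Simplify and optimize the computation graph, for example by pruning it, partitioning it, and inserting send and receive operators.
- Coordinate and schedule resources, for example deciding which computation runs on which device. Concretely, it follows the graph -> Partition -> Device strategy to assign subgraphs to hardware devices.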
- Send each partitioned subgraph to its worker and finally drive the graph computation by invoking RunGraph on the workers.
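The Master maintains the state of live graph computation sessions.

This completes the description of the Master's static structure. The Master's concrete behavior will be covered in the later Session articles.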

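Finally, two authors are strongly recommended:

[TensorFlow Internals] (https://github.com/horance-liu/tensorflow-internals). It does not analyze the latest code, but anyone interested in TF's internal implementation should read it; it is well worth the time.
https://home.cnblogs.com/u/deep-learning-stacks/ by 西門宇少, who covers not only TensorFlow; his public account also covers many other fields at the industry frontier.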