Dive into TensorFlow Series (3): Unveiling the Mystery of the Tensor

Published by JD Cloud (京東雲) on 2022-11-17

Original article: https://zhuanlan.kanxue.com/article-19841.htm

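A TensorFlow computation graph is made up of ops and tensors. So what do tensors usually represent? Model input data, network weights, and the outputs produced when ops process input data are all expressed as tensors or special tensors. Because tensors are so central to the TensorFlow architecture, this article works through three topics from the basics upward: the tensor as users see it, the tensor inside the TensorFlow system, and an advanced use of tensors, DLPack (cross-framework programming, for example TensorFlow + PyTorch).

Note: this article is based on TensorFlow v1.15.5.

I. Tensor Through a Beginner's Eyes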

1.1 Tensor HelloWorld

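Define two tensors and add them together. The code is as follows:

# segment 1
import tensorflow as tf

a = tf.constant(3.0, dtype=tf.float32)
b = tf.constant(4.0)  # also tf.float32 implicitly
total = a + b
print(a)
print(b)
print(total)

### The three print calls output:
"""
Tensor("Const:0", shape=(), dtype=float32)
Tensor("Const_1:0", shape=(), dtype=float32)
Tensor("add:0", shape=(), dtype=float32)
"""
# Note: at this point the Tensors do not hold actual values. The code above only builds the computation graph; each Tensor merely stands for the result of an op that has not been executed yet.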

To see the final value of total, create a Session object and run the computation graph. The code below is appended to segment 1:

with tf.Session() as sess:
    result = sess.run(total)
    print(result, type(result), type(total))
# Output: 7.0 <class 'numpy.float32'> <class 'tensorflow.python.framework.ops.Tensor'>

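As this shows, a Tensor is a symbolic handle for a result that has not been computed yet. Creating a Session object and running the graph yields the value of total, 7.0, and the returned result is a NumPy type (numpy.float32). Finally, note that the Tensor printed in this subsection is tf.Tensor, which is implemented by tensorflow.python.framework.ops.Tensor.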
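To make the build-then-run split concrete, here is a minimal pure-Python sketch of deferred evaluation. The names Node, constant, and run are hypothetical and are not TensorFlow APIs; the sketch only mimics the idea that building the graph records operations while a separate run step computes the values.

```python
# Toy illustration of deferred execution; this is not TensorFlow code.
class Node:
    """A symbolic result: it remembers how to compute a value but holds none."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "Const" or "add"
        self.inputs = inputs  # upstream Node objects
        self.value = value    # only set for Const nodes

    def __add__(self, other):
        # Building the graph: record the addition, do not perform it.
        return Node("add", (self, other))

    def __repr__(self):
        return f"Node(op={self.op!r})"


def constant(value):
    return Node("Const", value=value)


def run(node):
    """Plays the role of Session.run: walk the graph and compute the value."""
    if node.op == "Const":
        return node.value
    if node.op == "add":
        left, right = node.inputs
        return run(left) + run(right)
    raise ValueError(f"unknown op: {node.op}")


a = constant(3.0)
b = constant(4.0)
total = a + b        # still symbolic, nothing has been added yet
print(total)         # Node(op='add')
print(run(total))    # 7.0
```

Printing total shows only a description of the pending operation, just as print(total) in segment 1 shows Tensor("add:0", ...) rather than 7.0.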

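1.2 Tensor Attributes and Special Tensors

From the user's point of view, tf.Tensor has three main attributes: name, dtype, and shape. Three more attributes matter but are used less often or are not directly visible: op, graph, and device. The op attribute records the operation that produces this Tensor, the graph attribute records the computation graph that contains it, and the device attribute records the name of the device on which it is produced.

TensorFlow has four special kinds of tensors (for now we do not strictly distinguish a Tensor from the op that produces it):

•tf.Variable: defines a tensor whose contents can change; typically used for model weights.

•tf.constant: tensor contents are generally immutable, and this API defines such ordinary constant tensors.

•tf.placeholder: a placeholder tensor that describes the input specification of a static graph. A static graph is built first and executed later, so the input specification must be known when the graph is defined.

•tf.SparseTensor: a tensor structure tailored to sparse data.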
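As an illustration of the last item, tf.SparseTensor describes a tensor by three components: indices, values, and dense_shape. The pure-Python sketch below shows how those three pieces determine a dense 2-D tensor. The helper to_dense is hypothetical and written only for this example; it is not a TensorFlow function.

```python
# Toy sketch of the (indices, values, dense_shape) layout for a 2-D tensor.
# `to_dense` is a hypothetical helper, not part of TensorFlow.
def to_dense(indices, values, dense_shape, default=0):
    rows, cols = dense_shape
    dense = [[default] * cols for _ in range(rows)]
    # Each index pair says where the matching value lives.
    for (r, c), v in zip(indices, values):
        dense[r][c] = v
    return dense


indices = [(0, 0), (1, 2)]   # positions of the non-zero entries
values = [1, 2]              # the non-zero entries themselves
dense_shape = (3, 4)         # shape of the full tensor

dense = to_dense(indices, values, dense_shape)
for row in dense:
    print(row)
# [1, 0, 0, 0]
# [0, 0, 2, 0]
# [0, 0, 0, 0]
```

Only two values and two index pairs are stored for a tensor with twelve elements, which is the point of a sparse representation.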

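1.3 The Relationship Between Tensor and op

As mentioned several times, a Tensor can serve as the input of an op, and the op's processing produces new Tensors as output. To understand this more deeply, let us revisit the code in segment 1 (pay attention to how the Tensors are named):

# segment 1
import tensorflow as tf

a = tf.constant(3.0, dtype=tf.float32)
b = tf.constant(4.0)  # also tf.float32 implicitly
total = a + b
print(a)
print(b)
print(total)

### The three print calls output:
"""
Tensor("Const:0", shape=(), dtype=float32)
Tensor("Const_1:0", shape=(), dtype=float32)
Tensor("add:0", shape=(), dtype=float32)
"""
# Note: at this point the Tensors do not hold actual values. The code above only builds the computation graph; each Tensor merely stands for the result of an op that has not been executed yet.

For the code above, we first identify which items are Tensors and which are ops, and then describe what each operation does. To answer the first question, consider this comment from the official TensorFlow source: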

"""
`tf.constant` creates a `Const` node in the computation graph with the
exact value at graph construction time.
"""

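So the code in segment 1 contains two kinds of op, Const and add; the former appears twice and the latter once. It follows that segment 1 adds three ops to the computation graph in order, which also answers the second question, namely what each operation does. In detail:

### The three print calls output (a, b, total):
"""
Tensor("Const:0", shape=(), dtype=float32)
Tensor("Const_1:0", shape=(), dtype=float32)
Tensor("add:0", shape=(), dtype=float32)
"""
# Add the first op (Const) to the graph. Its input is a scalar and its output is Tensor a, whose name has two parts: <op name>:<index of a among the op's outputs>.
# Add the second op (Const_1, because op names must be unique). Its input is a scalar and its output is Tensor b, named by the same rule.
# Add the third op (add). Its inputs are Tensor a and Tensor b and its output is Tensor total, named by the same rule.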

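The naming rule can be sketched in a few lines of pure Python. ToyGraph and split_tensor_name are hypothetical names invented for this illustration; they only imitate the observable behavior (Const, Const_1, and the op_name:output_index form) and are not how TensorFlow implements it.

```python
# Toy model of the tensor naming rule; `ToyGraph` is not a TensorFlow class.
class ToyGraph:
    def __init__(self):
        self._name_counts = {}

    def unique_name(self, op_type):
        """Return op_type the first time, then op_type_1, op_type_2, ..."""
        count = self._name_counts.get(op_type, 0)
        self._name_counts[op_type] = count + 1
        return op_type if count == 0 else f"{op_type}_{count}"

    def add_op(self, op_type, num_outputs=1):
        """Add an op and return the names of its output tensors."""
        op_name = self.unique_name(op_type)
        return [f"{op_name}:{i}" for i in range(num_outputs)]


def split_tensor_name(tensor_name):
    """Split a name such as "add:0" into the op name and the output index."""
    op_name, index = tensor_name.rsplit(":", 1)
    return op_name, int(index)


g = ToyGraph()
a = g.add_op("Const")[0]
b = g.add_op("Const")[0]
total = g.add_op("add")[0]
print(a, b, total)                # Const:0 Const_1:0 add:0
print(split_tensor_name(total))   # ('add', 0)
```

An op with several outputs would yield names ending in :0, :1, and so on, which is why the index is part of every tensor name even when an op has a single output.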
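II. A Closer Look at the Tensor

2.1 Mapping Tensors Between Frontend and Backend

The TensorFlow white paper [7] describes the C API as the bridge between frontend user code and the backend execution engine. To understand this properly, readers are encouraged to build TensorFlow from source by following the official website. TensorFlow v1.15.5 is built with Bazel, and the Python frontend talks to the C++ backend through SWIG. Before the actual compilation, the build first runs SWIG code generation, which parses tensorflow.i and generates two wrapper files: pywrap_tensorflow_internal.py and pywrap_tensorflow_internal.cc. The former serves calls from the Python frontend and the latter forwards them to the backend C API. After installing the official TensorFlow binary package you will see only the py file, not the cc file. If you build TensorFlow from source, both files can be found under bazel-bin in the project root, as shown in the figure below:

The .so file in the red box of the figure above is compiled from the cc file. When the py module in the yellow box is imported for the first time, it automatically loads the .so shared library. The cc file behind the .so statically registers a function table that maps Python functions to C functions. The table looks roughly like this: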

static PyMethodDef SwigMethods[] = {
          { (char *)"SWIG_PyInstanceMethod_New", (PyCFunction)SWIG_PyInstanceMethod_New, METH_O, NULL},
          { (char *)"TF_OK_swigconstant", TF_OK_swigconstant, METH_VARARGS, NULL},
          { (char *)"TF_CANCELLED_swigconstant", TF_CANCELLED_swigconstant, METH_VARARGS, NULL},
          { (char *)"TF_UNKNOWN_swigconstant", TF_UNKNOWN_swigconstant, METH_VARARGS, NULL},
          { (char *)"TF_INVALID_ARGUMENT_swigconstant", TF_INVALID_ARGUMENT_swigconstant, METH_VARARGS, NULL},
          // many entries omitted here
};

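Without hands-on experience, the text above can be hard to follow. To make it easier, the following diagram summarizes it:

Curious readers may object that this is all too high-level: it seems to make sense, yet not quite. No problem. Next we take session.run(), the execution interface for static graphs, as an example and trace the frontend-to-backend mapping through the TensorFlow source, as shown in the figure below:

The figure above shows clearly that the C API layer separates the frontend from the backend. The C API layer here includes pywrap_tensorflow_internal.h/cc, tf_session_helper.h/cc, and c_api.h/cc. That completes the path of session.run() from frontend to backend. The next question is how a frontend tensor is mapped to a backend Tensor. See the following code:

// tf_session_helper.cc, line 351
void TF_SessionRun_wrapper_helper(TF_Session* session, const char* handle,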
                                  const TF_Buffer* run_options,
                                  const std::vector<TF_Output>& inputs,
                                  const std::vector<PyObject*>& input_ndarrays,
                                  const std::vector<TF_Output>& outputs,
                                  const std::vector<TF_Operation*>& targets,
                                  TF_Buffer* run_metadata,
                                  TF_Status* out_status,
                                  std::vector<PyObject*>* py_outputs) {
  DCHECK_EQ(inputs.size(), input_ndarrays.size());
  DCHECK(py_outputs != nullptr);
  DCHECK(py_outputs->empty());
  Status s;

  // Convert input ndarray PyObjects to TF_Tensors. We maintain a continuous
  // array of TF_Tensor*s as well as scoped containers to make sure they're
  // cleaned up properly.
  // A lot of code is omitted here. This is where the frontend ndarray-like objects are converted into TF_Tensors.
}

// c_api.cc, line 2274
void TF_SessionRun(TF_Session* session, const TF_Buffer* run_options,
                   const TF_Output* inputs, TF_Tensor* const* input_values,
                   int ninputs, const TF_Output* outputs,
                   TF_Tensor** output_values, int noutputs,
                   const TF_Operation* const* target_opers, int ntargets,
                   TF_Buffer* run_metadata, TF_Status* status) {
  // TODO(josh11b,mrry): Change Session to be able to use a Graph*
  // directly, instead of requiring us to serialize to a GraphDef and
  // call Session::Extend().
  if (session->extend_before_run &&
      !ExtendSessionGraphHelper(session, status)) {
    return;
  }

  TF_Run_Setup(noutputs, output_values, status);

  // Convert from TF_Output and TF_Tensor to a string and Tensor.
  // Look here: this is where TensorFlow converts each TF_Tensor into a C++ Tensor.
  std::vector<std::pair<string, Tensor>> input_pairs(ninputs);
  if (!TF_Run_Inputs(input_values, &input_pairs, status)) return;
  for (int i = 0; i < ninputs; ++i) {
    input_pairs[i].first = OutputName(inputs[i]);
  }

  // Convert from TF_Output to string names.
  std::vector<string> output_names(noutputs);
  for (int i = 0; i < noutputs; ++i) {
    output_names[i] = OutputName(outputs[i]);
  }
  // ... rest of the function omitted
}

2.2 The C++ Tensor Class

Reference 5 contains the definition of the C++ Tensor class. Its key parts (seg1) are:

class Tensor {
  public:
    // Tensor serialization/deserialization, covered in detail in section 2.3
    bool FromProto(const TensorProto& other) TF_MUST_USE_RESULT;
    void AsProtoField(TensorProto* proto) const;
    void AsProtoTensorContent(TensorProto* proto) const;

    // A Tensor is really a view over the underlying data, which can be presented as a vec or a matrix
    template <typename T>
    typename TTypes<T>::Vec vec() {
      return tensor<T, 1>();
    }

    template <typename T>
    typename TTypes<T>::Matrix matrix() {
      return tensor<T, 2>();
    }

    template <typename T, size_t NDIMS>
    typename TTypes<T, NDIMS>::Tensor tensor();

  private:
    TensorShape shape_;    // holds the Tensor's shape and data type
    TensorBuffer* buf_;    // pointer to the underlying data buffer
};

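Let us start with the two private members. First, the TensorBuffer class: it is an abstract class that inherits from a reference-counting class and contains no implementation. From reference 6 we learn that BufferBase inherits from TensorBuffer and holds a pointer to a memory allocator, while Buffer inherits from BufferBase and holds data_, a pointer to the actual data, and elem_, the number of elements. The inheritance hierarchy is shown in the figure below (member definitions are included to aid understanding, so it is not a standard UML diagram):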
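The same three-level hierarchy can be sketched in Python. This is a loose analogy, not a port of the C++ code: RefCounted, ListAllocator, and every method name below are assumptions made for illustration, and only the member names alloc_, data_, and elem_ echo the description above.

```python
from abc import ABC, abstractmethod


class RefCounted:
    """Minimal reference counter; release() runs when the count reaches 0."""

    def __init__(self):
        self._refs = 1

    def ref(self):
        self._refs += 1

    def unref(self):
        """Drop one reference and return True if it was the last one."""
        self._refs -= 1
        if self._refs == 0:
            self.release()
            return True
        return False

    def release(self):
        pass


class TensorBuffer(RefCounted, ABC):
    """Pure interface: declares accessors but stores nothing itself."""

    @abstractmethod
    def data(self): ...

    @abstractmethod
    def size(self): ...


class BufferBase(TensorBuffer):
    """Adds a handle to the allocator that owns the memory."""

    def __init__(self, allocator):
        super().__init__()
        self.alloc_ = allocator


class Buffer(BufferBase):
    """Holds the actual elements (data_) and the element count (elem_)."""

    def __init__(self, allocator, n):
        super().__init__(allocator)
        self.elem_ = n
        self.data_ = allocator.allocate(n)

    def data(self):
        return self.data_

    def size(self):
        return self.elem_

    def release(self):
        self.alloc_.deallocate(self.data_)
        self.data_ = None


class ListAllocator:
    """Hypothetical allocator backed by Python lists."""

    def __init__(self):
        self.live = 0

    def allocate(self, n):
        self.live += 1
        return [0.0] * n

    def deallocate(self, block):
        self.live -= 1


alloc = ListAllocator()
buf = Buffer(alloc, 4)   # one owner
buf.ref()                # a second tensor now shares the same buffer
print(buf.unref())       # False: one owner is still alive
print(buf.unref())       # True: last owner gone, memory returned
print(alloc.live)        # 0
```

Reference counting is what lets several Tensor objects share one buffer safely: the memory goes back to the allocator only when the last owner lets go.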

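Next we look at the TensorShape class. It has its own inheritance hierarchy, with the core logic defined in its parent class TensorShapeRep, as shown in the figure below:

To understand what TensorShape does, we analyze part of the TensorShapeRep code (seg2):

class TensorShapeRep {
  private:
    // The 16-byte buf below represents a TensorShape. The first 12 bytes store the shape (Rep16, Rep32, or Rep64).
    // The purpose of byte 13 is unclear; bytes 14, 15, and 16 hold the data type number, the number of dimensions, and the tag that says which representation is in use.
    union {
      uint8 buf[16];

      Rep64* unused_aligner;   // Force data to be aligned enough for a pointer.
    } u_;

  public:
    // In theory a tensor can have any number of dimensions, but 1-D, 2-D, and 3-D tensors are the most common, hence the three representations below (stored in the 12 bytes).
    struct Rep16 {
      uint16 dims_[6];    // up to 6 dimensions, each of length at most 2^16-1
    };
    struct Rep32 {
      uint32 dims_[3];    // up to 3 dimensions, each of length at most 2^32-1
    };
    struct Rep64 {
      gtl::InlinedVector<int64, 4>* dims_;  // supports tensors of any rank
    };
};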
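The 16-byte layout can be imitated with Python's struct module. The sketch below is a simplified model under stated assumptions: the tag values 0 and 1, the per-dimension limits, little-endian byte order, and the function names pack_shape and unpack_shape are all choices made for this example, and the out-of-line Rep64 case is not modeled.

```python
import struct

# Assumed per-dimension limits, following the comments in seg2.
MAX_REP16 = 2**16 - 1
MAX_REP32 = 2**32 - 1


def pack_shape(dims, data_type=1):
    """Pack a shape into 16 bytes: 12 bytes of dims, 1 spare, dtype, ndims, tag."""
    ndims = len(dims)
    if ndims <= 6 and all(0 <= d <= MAX_REP16 for d in dims):
        tag = 0  # Rep16-style: six uint16 slots
        body = struct.pack("<6H", *(list(dims) + [0] * (6 - ndims)))
    elif ndims <= 3 and all(0 <= d <= MAX_REP32 for d in dims):
        tag = 1  # Rep32-style: three uint32 slots
        body = struct.pack("<3I", *(list(dims) + [0] * (3 - ndims)))
    else:
        raise ValueError("this shape needs the out-of-line Rep64 form")
    # Byte 13 is left as zero; bytes 14 to 16 hold dtype, ndims, and tag.
    return body + struct.pack("<4B", 0, data_type, ndims, tag)


def unpack_shape(buf):
    """Recover (dims, data_type) from a 16-byte buffer made by pack_shape."""
    _, data_type, ndims, tag = struct.unpack("<4B", buf[12:])
    fmt = "<6H" if tag == 0 else "<3I"
    return list(struct.unpack(fmt, buf[:12])[:ndims]), data_type


small = pack_shape([224, 224, 3])   # every dim fits in 16 bits
large = pack_shape([100000, 128])   # 100000 needs 32 bits
print(len(small), unpack_shape(small))   # 16 ([224, 224, 3], 1)
print(len(large), unpack_shape(large))   # 16 ([100000, 128], 1)
```

Both shapes occupy exactly 16 bytes; only the interpretation of the first 12 bytes changes, which is the purpose of the tag byte.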

To close this subsection, we return to vec() and matrix() in the Tensor class definition. Both are implemented by calling the same method, tensor(), whose return type is TTypes<T, NDIMS>::Tensor, and TTypes is exactly what connects the TF Tensor to the Eigen library. See the following code (seg3):

// tensorflow1.15.5\tensorflow\core\framework\tensor.h
class Tensor {
  public:
    // Returns the shape of the tensor.
    const TensorShape& shape() const { return shape_; }

    template <typename T>
    typename TTypes<T>::Vec vec() {
      return tensor<T, 1>();
    }

    template <typename T>
    typename TTypes<T>::Matrix matrix() {
      return tensor<T, 2>();
    }

    template <typename T, size_t NDIMS>
    typename TTypes<T, NDIMS>::Tensor tensor();
};

// tensorflow1.15.5\tensorflow\core\framework\tensor_types.h
template <typename T, int NDIMS = 1, typename IndexType = Eigen::DenseIndex>
struct TTypes {
  // Rank-<NDIMS> tensor of scalar type T.
  typedef Eigen::TensorMap<Eigen::Tensor<T, NDIMS, Eigen::RowMajor, IndexType>,
                           Eigen::Aligned>
      Tensor;
  // much code omitted
};

// tensorflow1.15.5\tensorflow\core\framework\tensor.h
// shape() of a TF Tensor returns its TensorShape; base() returns a pointer to the actual data.
template <typename T, size_t NDIMS>
typename TTypes<T, NDIMS>::Tensor Tensor::tensor() {
  CheckTypeAndIsAligned(DataTypeToEnum<T>::v());
  return typename TTypes<T, NDIMS>::Tensor(base<T>(),
                                           shape().AsEigenDSizes<NDIMS>());
}

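As the code shows, calling tensor() converts a TF Tensor into TTypes<T, NDIMS>::Tensor, which is essentially an Eigen::TensorMap. This clarifies the relationship between the TF Tensor and the Eigen library: the TF C++ Tensor can be regarded as a wrapper around Eigen::TensorMap, because the arguments to the Eigen::TensorMap constructor come from information stored in the TF Tensor (what base() and shape() return).

2.3 Serializing the C++ Tensor

Distributed training in TensorFlow involves a great deal of cross-machine communication, and what is communicated is serialized tensors (send/recv op pairs work together to do this). In this subsection we study how a Tensor is serialized and how to convert between a Tensor and its serialized form. The serialized counterpart of Tensor in TensorFlow is TensorProto, which is generated from the corresponding proto file, shown below (seg4):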
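The idea of a typed, shaped view over a flat buffer (a data pointer plus a shape, in row-major order) can be shown with Python's standard library alone. This is only an analogy for what Eigen::TensorMap does; the variable names are arbitrary and nothing here is TensorFlow or Eigen code.

```python
from array import array

# The flat buffer plays the role of base<T>(): six float32 values in one block.
data = array("f", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# Reinterpret the same memory as a 2x3 row-major matrix without copying.
# memoryview.cast requires going through a byte view as an intermediate step.
matrix = memoryview(data).cast("B").cast("f", shape=[2, 3])

print(matrix.tolist())   # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
print(matrix[1, 2])      # 5.0

# Row-major layout: element (i, j) of a 2x3 matrix sits at flat offset i * 3 + j.
print(data[1 * 3 + 2])   # 5.0

# A write through the flat buffer is visible through the view: same memory.
data[5] = 42.0
print(matrix[1, 2])      # 42.0
```

The view owns no data of its own, which mirrors why vec() and matrix() are cheap: they describe how to index existing memory rather than allocate new memory.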

// tensorflow1.15.5\tensorflow\core\framework\tensor.proto
syntax = "proto3";

message TensorProto {
  DataType dtype = 1;

  TensorShapeProto tensor_shape = 2;

  int32 version_number = 3;

  bytes tensor_content = 4;

  repeated int32 half_val = 13 [packed = true];

  // DT_FLOAT.
  repeated float float_val = 5 [packed = true];

  // DT_DOUBLE.
  repeated double double_val = 6 [packed = true];

  // DT_INT32, DT_INT16, DT_INT8, DT_UINT8.
  repeated int32 int_val = 7 [packed = true];

  // DT_STRING
  repeated bytes string_val = 8;

  // DT_COMPLEX64. scomplex_val(2*i) and scomplex_val(2*i+1) are real
  // and imaginary parts of i-th single precision complex.
  repeated float scomplex_val = 9 [packed = true];

  // DT_INT64
  repeated int64 int64_val = 10 [packed = true];

  // DT_BOOL
  repeated bool bool_val = 11 [packed = true];

  // DT_COMPLEX128. dcomplex_val(2*i) and dcomplex_val(2*i+1) are real
  // and imaginary parts of i-th double precision complex.
  repeated double dcomplex_val = 12 [packed = true];

  // DT_RESOURCE
  repeated ResourceHandleProto resource_handle_val = 14;

  // DT_VARIANT
  repeated VariantTensorDataProto variant_val = 15;

  // DT_UINT32
  repeated uint32 uint32_val = 16 [packed = true];

  // DT_UINT64
  repeated uint64 uint64_val = 17 [packed = true];
};

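You can compile tensor.proto with the protoc compiler, which generates two files, tensor.pb.h and tensor.pb.cc. They contain the declaration of the TensorProto class and the implementation of its member methods, respectively. We can loosely think of TensorProto as the binary form of a Tensor. On that basis, the code for converting between the two is as follows (seg5):

// Serializing a Tensor (`tensor` is a pointer to an existing Tensor)
auto tensor_proto = new TensorProto();
// Fills in `proto` with `*this` tensor's content.
// `AsProtoField()` fills in the repeated field for `proto.dtype()`,
// while `AsProtoTensorContent()` encodes the content in
// `proto.tensor_content()` in a compact form.
// The two are alternative encodings, so one call is enough.
tensor->AsProtoField(tensor_proto);
// tensor->AsProtoTensorContent(tensor_proto);

// Deserializing a Tensor
Tensor restored;
bool ok = restored.FromProto(*tensor_proto);  // the return value must be checked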
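The difference between the two encodings can be illustrated with a small pure-Python round trip. The dictionaries below merely stand in for a TensorProto holding float data, and the function names are borrowed from the C++ methods for readability; this is a sketch of the idea, not of the protobuf wire format.

```python
import struct


def as_proto_field(values, shape):
    """Typed encoding: one entry per element in a repeated field."""
    return {
        "dtype": "DT_FLOAT",
        "tensor_shape": list(shape),
        "float_val": list(values),
    }


def as_proto_tensor_content(values, shape):
    """Compact encoding: the raw bytes of all elements in one blob."""
    content = struct.pack(f"<{len(values)}f", *values)
    return {
        "dtype": "DT_FLOAT",
        "tensor_shape": list(shape),
        "tensor_content": content,
    }


def from_proto(proto):
    """Rebuild (values, shape) from either encoding."""
    shape = proto["tensor_shape"]
    if "tensor_content" in proto:
        raw = proto["tensor_content"]
        values = list(struct.unpack(f"<{len(raw) // 4}f", raw))
    else:
        values = list(proto["float_val"])
    return values, shape


values, shape = [1.0, 2.0, 3.0, 4.0], (2, 2)
by_field = as_proto_field(values, shape)
by_content = as_proto_tensor_content(values, shape)

print(from_proto(by_field))                # ([1.0, 2.0, 3.0, 4.0], [2, 2])
print(from_proto(by_content))              # ([1.0, 2.0, 3.0, 4.0], [2, 2])
print(len(by_content["tensor_content"]))   # 16
```

Both paths recover the same tensor; the compact form simply stores four 4-byte floats back to back instead of one typed entry per element.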

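III. Cross-Framework Programming: DLPack, a Common In-Memory Tensor

3.1 What Is DLPack

DLPack is an open in-memory tensor structure for sharing tensors among AI frameworks. Combining several frameworks to solve an AI problem lets each one play to its strengths (some operations are better supported in a particular framework) and can achieve the best overall performance. One key problem must be solved, though: how can a tensor in memory be passed from one framework to another without copying any data? Fortunately, Tianqi Chen's team provided DLPack as the answer.

DLPack is designed to be as lightweight as possible. It does not deal with memory allocation or device APIs and focuses only on the tensor data structure. It works across multiple hardware platforms, and the frameworks that currently support it include NumPy, CuPy, PyTorch, TensorFlow, MXNet, TVM, and mpi4py. The DLPack developers do not intend to implement Tensor and Ops; instead, DLPack serves as a common bridge for reusing tensors and operations across frameworks. To understand DLPack in depth, you need to master two parts: the C API and the Python API. The architecture of the DLPack C API is shown below:

The structs in dark blue in the figure above are all defined in [13]. DLTensor represents a plain C tensor object but does not manage memory. DLManagedTensor is also a C tensor object; it manages the memory of a DLTensor and is designed to help other frameworks borrow that DLTensor. Next, we turn to the DLPack Python API.

The DLPack Python interface is a standard API for Python arrays. It exposes two interfaces for data exchange:

•from_dlpack(x): takes an array object that has a __dlpack__ method and uses that method to construct a new array object containing x's data.

•__dlpack__(self, stream=None) and __dlpack_device__(): from_dlpack(x) calls these two methods of x internally, to obtain x's data and to determine which device the array x resides on, respectively.

Semantically, in y = from_dlpack(x), the library that created x is the producer and the library that provides from_dlpack() is the consumer. The producer provides a way to access x's data, and the exchange between producer and consumer is normally zero-copy, so y can be regarded as a view of x. Inside from_dlpack(x), x.__dlpack__ produces a PyCapsule object (a capsule) containing a DLManagedTensor, and this capsule can be consumed only once. The producer must name the PyCapsule "dltensor" so that it can be retrieved by name, and must also register a PyCapsule_Destructor that calls the DLManagedTensor's deleter; this destructor takes effect when a capsule still named "dltensor" is no longer needed. The consumer transfers ownership of the DLManagedTensor from the capsule to itself by renaming the capsule to "used_dltensor", which ensures that the PyCapsule_Destructor will not call the deleter. Once ownership has moved to the consumer object, the consumer's own destructor can still call the DLManagedTensor's deleter.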
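The ownership hand-off described above can be simulated in plain Python. All classes and functions below (ToyManagedTensor, ToyCapsule, ToyArray, ConsumerArray, consume) are hypothetical stand-ins invented for this sketch; real capsules are created through the CPython C API, which is not used here.

```python
class ToyManagedTensor:
    """Stands in for DLManagedTensor: the data plus a deleter callback."""

    def __init__(self, data):
        self.data = data
        self.deleter_calls = 0

    def deleter(self):
        self.deleter_calls += 1
        self.data = None


class ToyCapsule:
    """Stands in for the PyCapsule returned by __dlpack__."""

    def __init__(self, managed):
        self.name = "dltensor"
        self.managed = managed

    def destructor(self):
        # Frees the tensor only if no consumer has taken ownership.
        if self.name == "dltensor":
            self.managed.deleter()


class ToyArray:
    """A producer-side array exposing a toy __dlpack__ method."""

    def __init__(self, data):
        self.data = data

    def __dlpack__(self):
        return ToyCapsule(ToyManagedTensor(self.data))


class ConsumerArray:
    """A consumer-side array that owns the managed tensor it was given."""

    def __init__(self, managed):
        self.managed = managed

    def release(self):
        self.managed.deleter()


def consume(capsule):
    """What a consumer's from_dlpack does with the capsule, in miniature."""
    if capsule.name != "dltensor":
        raise ValueError("capsule has already been consumed")
    capsule.name = "used_dltensor"  # ownership moves to the consumer
    return ConsumerArray(capsule.managed)


x = ToyArray([1.0, 2.0, 3.0])
capsule = x.__dlpack__()
y = consume(capsule)
print(y.managed.data is x.data)   # True: same list object, nothing copied

capsule.destructor()              # renamed capsule, so the deleter is skipped
print(y.managed.deleter_calls)    # 0

y.release()                       # the consumer frees the tensor itself
print(y.managed.deleter_calls)    # 1

try:
    consume(capsule)
except ValueError as err:
    print(err)                    # capsule has already been consumed
```

Renaming the capsule is what guarantees that the deleter runs exactly once: either the capsule's destructor calls it (never consumed) or the consumer does (consumed), but never both.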

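3.2 DLPack in TensorFlow

As far as the author can tell, TensorFlow has supported DLPack since v2.2.0; earlier versions have no dlpack module. TensorFlow's dlpack interface follows the same semantics described in 3.1. The API can be exercised as follows: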

import tensorflow as tf

x = tf.constant(5)
x                     # <tf.Tensor: shape=(), dtype=int32, numpy=5>
r = tf.experimental.dlpack.to_dlpack(x)
print(r, type(r))     # <capsule object "dltensor" at 0x7f55a0431c30> <class 'PyCapsule'>
x_other = tf.experimental.dlpack.from_dlpack(r)
x_other               # <tf.Tensor: shape=(), dtype=int32, numpy=5>

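3.3 How TVM Relates to DLPack

If you want to build a deep learning compiler that works across AI frameworks, DLPack is a viable approach (this is the route TVM takes). For example, we can declare and compile a matrix multiplication operator in TVM and then build a wrapper based on the DLPack representation so that the operator accepts PyTorch Tensors. MXNet can be handled in a similar way. The figure below shows how DLPack provides an intermediate wrapper shared between AI frameworks and TVM:

The following code illustrates the idea: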

# Preliminary: compute a matrix multiplication in PyTorch
import numpy as np
import torch
import tvm

x = torch.rand(56, 56)
y = torch.rand(56, 56)
z = x.mm(y)

# Step 1: define and build a TVM matrix multiplication operator
n = tvm.convert(56)
X = tvm.placeholder((n, n), name='X')
Y = tvm.placeholder((n, n), name='Y')
k = tvm.reduce_axis((0, n), name='k')
Z = tvm.compute((n, n), lambda i, j: tvm.sum(X[i, k] * Y[k, j], axis=k))
s = tvm.create_schedule(Z.op)
fmm = tvm.build(s, [X, Y, Z], target_host='llvm', name='fmm')

# Step 2: wrap the TVM function so that it accepts PyTorch Tensors, then verify the result
from tvm.contrib.dlpack import to_pytorch_func
# fmm is the previously built TVM function (Python function)
# fmm_pytorch is the wrapped TVM function (Python function)
fmm_pytorch = to_pytorch_func(fmm)
z2 = torch.empty(56, 56)
fmm_pytorch(x, y, z2)
np.testing.assert_allclose(z.numpy(), z2.numpy())

# Step 3: wrap the function for MXNet in the same way as step 2
import mxnet
from tvm.contrib.mxnet import to_mxnet_func

ctx = mxnet.cpu(0)
x = mxnet.nd.uniform(shape=(56, 56), ctx=ctx)
y = mxnet.nd.uniform(shape=(56, 56), ctx=ctx)
z = mxnet.nd.empty(shape=(56, 56), ctx=ctx)
f = tvm.build(s, [X, Y, Z], target_host='llvm', name='f')
f_mxnet = to_mxnet_func(f)
f_mxnet(x, y, z)
np.testing.assert_allclose(z.asnumpy(), x.asnumpy().dot(y.asnumpy()))

# Step 4: the definition of to_pytorch_func()
# TVM provides functions that convert between a dlpack tensor and a TVM NDArray. At the lowest level, TVM functions operate on TVM NDArrays.
# The wrapper's flow is roughly: AI Tensor -> dlpack tensor -> TVM NDArray -> call TVM function
# `ndarray` below refers to TVM's NDArray module, which the TVM source file imports at its top.
def convert_func(tvm_func, tensor_type, to_dlpack_func):
    assert callable(tvm_func)

    def _wrapper(*args):
        args = tuple(ndarray.from_dlpack(to_dlpack_func(arg))
                     if isinstance(arg, tensor_type) else arg for arg in args)
        return tvm_func(*args)

    return _wrapper


def to_pytorch_func(tvm_func):
    import torch
    import torch.utils.dlpack
    return convert_func(tvm_func, torch.Tensor, torch.utils.dlpack.to_dlpack)