BERT Training Process, Part 3

Posted by ywm-pku on 2019-01-04

Logged parameters

INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (8, 128)
INFO:tensorflow:  name = input_mask, shape = (8, 128)
INFO:tensorflow:  name = masked_lm_ids, shape = (8, 20)
INFO:tensorflow:  name = masked_lm_positions, shape = (8, 20)
INFO:tensorflow:  name = masked_lm_weights, shape = (8, 20)
INFO:tensorflow:  name = next_sentence_labels, shape = (8, 1)
INFO:tensorflow:  name = segment_ids, shape = (8, 128)

INFO:tensorflow:**** Trainable Variables ****

INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (30522, 768)
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 768)
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (512, 768)
INFO:tensorflow:  name = bert/embeddings/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/embeddings/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_0/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_0/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_0/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_0/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_1/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_1/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_1/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_1/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_1/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_1/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_1/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_2/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_2/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_2/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_2/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_2/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_2/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_2/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_2/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_2/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_2/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_2/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_2/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_2/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_2/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_2/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_2/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_3/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_3/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_3/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_3/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_3/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_3/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_3/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_3/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_3/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_3/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_3/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_3/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_3/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_3/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_3/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_3/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_4/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_4/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_4/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_4/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_4/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_4/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_4/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_4/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_4/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_4/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_5/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_5/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_5/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_5/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_5/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_5/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_5/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_5/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_5/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_5/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_5/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_5/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_5/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_5/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_5/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_5/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_6/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_6/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_6/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_6/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_6/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_6/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_6/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_6/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_6/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_6/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_6/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_6/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_6/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_6/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_6/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_6/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_7/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_7/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_7/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_7/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_7/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_7/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_7/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_7/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_7/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_7/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_8/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_8/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_8/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_8/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_8/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_8/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_8/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_8/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_8/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_8/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_8/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_8/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_8/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_8/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_8/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_8/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_9/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_9/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_9/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_9/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_9/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_9/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_9/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_9/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_9/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_9/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_10/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_10/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_10/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_10/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_10/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_10/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_10/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_10/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_10/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_10/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_10/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_10/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_10/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_10/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_10/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_10/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/query/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/query/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/key/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/key/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/value/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/value/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_11/intermediate/dense/kernel:0, shape = (768, 3072)
INFO:tensorflow:  name = bert/encoder/layer_11/intermediate/dense/bias:0, shape = (3072,)
INFO:tensorflow:  name = bert/encoder/layer_11/output/dense/kernel:0, shape = (3072, 768)
INFO:tensorflow:  name = bert/encoder/layer_11/output/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_11/output/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = bert/encoder/layer_11/output/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = bert/pooler/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = bert/pooler/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = cls/predictions/transform/dense/kernel:0, shape = (768, 768)
INFO:tensorflow:  name = cls/predictions/transform/dense/bias:0, shape = (768,)
INFO:tensorflow:  name = cls/predictions/transform/LayerNorm/beta:0, shape = (768,)
INFO:tensorflow:  name = cls/predictions/transform/LayerNorm/gamma:0, shape = (768,)
INFO:tensorflow:  name = cls/predictions/output_bias:0, shape = (30522,)
INFO:tensorflow:  name = cls/seq_relationship/output_weights:0, shape = (2, 768)
INFO:tensorflow:  name = cls/seq_relationship/output_bias:0, shape = (2,)
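
As a sanity check on the variable list above, the shapes can be summed to get the total number of trainable parameters. The sketch below hard-codes the shapes from this log (12 layers, hidden size 768, intermediate size 3072, vocabulary size 30522); the helper name `count` is made up for illustration.

```python
from math import prod

# Sizes read off the log above: hidden, intermediate, vocabulary, layer count.
H, I, V, L = 768, 3072, 30522, 12

embeddings = [(V, H), (2, H), (512, H), (H,), (H,)]
one_layer = (
    [(H, H), (H,)] * 4      # query, key, value, attention output dense
    + [(H,), (H,)]          # attention output LayerNorm (beta, gamma)
    + [(H, I), (I,)]        # intermediate dense
    + [(I, H), (H,)]        # output dense
    + [(H,), (H,)]          # output LayerNorm (beta, gamma)
)
pooler = [(H, H), (H,)]
cls_heads = [(H, H), (H,), (H,), (H,), (V,), (2, H), (2,)]


def count(shapes):
    """Total number of scalars in a list of variable shapes."""
    return sum(prod(shape) for shape in shapes)


total = count(embeddings) + L * count(one_layer) + count(pooler) + count(cls_heads)
print(total)
```

The total comes to roughly 110 million. The log contains no separate output weight matrix for the masked LM head, only `cls/predictions/output_bias`, because that head reuses the word embedding table.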

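Normalizing the dataset
Estimator requires the model input in a specific format (a dataset, built here via from_tensor_slices), so the data has to be wrapped accordingly.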
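
def input_fn_builder(input_files, max_seq_length, max_predictions_per_seq,
                     is_training, num_cpu_threads=4):
  """Creates an `input_fn` closure to be passed to TPUEstimator."""

  def input_fn(params):
    """The actual input function."""
    batch_size = params["batch_size"]  # e.g. 32
    # tf.FixedLenFeature describes a feature that parses to a fixed-length tensor.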
    name_to_features = {
        "input_ids":
            tf.FixedLenFeature([max_seq_length], tf.int64),
        "input_mask":
            tf.FixedLenFeature([max_seq_length], tf.int64),
        "segment_ids":
            tf.FixedLenFeature([max_seq_length], tf.int64),
        "masked_lm_positions":
            tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
        "masked_lm_ids":
            tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
        "masked_lm_weights":
            tf.FixedLenFeature([max_predictions_per_seq], tf.float32),
        "next_sentence_labels":
            tf.FixedLenFeature([1], tf.int64),
    }

    # For training, we want a lot of parallel reading and shuffling.
    # For eval, we want no shuffling and parallel reading doesn't matter.
    if is_training:
      # from_tensor_slices splits its input along the first dimension and
      # builds a dataset with one element per slice. For example:
      #   dataset = tf.data.Dataset.from_tensor_slices(np.random.uniform(size=(5, 2)))
      # The input is a matrix of shape (5, 2), so the resulting dataset has
      # 5 elements, each of shape (2,), i.e. one row of the matrix.
      #
      # Elements can also be Python tuples or dicts. In image recognition an
      # element might look like {"image": image_tensor, "label": label_tensor}:
      #   dataset = tf.data.Dataset.from_tensor_slices(
      #       {"a": np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
      #        "b": np.random.uniform(size=(5, 2))})
      # "a" and "b" are sliced separately, so one element looks like
      # {"a": 1.0, "b": [0.9, 0.1]}. The first dimension therefore gives the
      # number of examples, and later steps such as batching operate on it.
      # See http://www.cnblogs.com/hellcat/p/8569651.html
      #
      # repeat() repeats the whole sequence and is mainly used for epochs:
      # if the original data is one epoch, repeat(2) yields two epochs.
      # With no argument, as here, the dataset repeats indefinitely.
      d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
      d = d.repeat()
      d = d.shuffle(buffer_size=len(input_files))
      # `cycle_length` is the number of parallel files that get read.
      cycle_length = min(num_cpu_threads, len(input_files))
      # `sloppy` mode means that the interleaving is not exact. This adds
      # even more randomness to the training pipeline.
      d = d.apply(
          tf.contrib.data.parallel_interleave(
              tf.data.TFRecordDataset,
              sloppy=is_training,
              cycle_length=cycle_length))
      d = d.shuffle(buffer_size=100)
    else:
      d = tf.data.TFRecordDataset(input_files)
      # Since we evaluate for a fixed number of steps we don't want to encounter
      # out-of-range exceptions.
      d = d.repeat()
    # We must `drop_remainder` on training because the TPU requires fixed
    # size dimensions. For eval, we assume we are evaluating on the CPU or GPU
    # and we *don't* want to drop the remainder, otherwise we wont cover
    # every sample.
    d = d.apply(
        tf.contrib.data.map_and_batch(
            lambda record: _decode_record(record, name_to_features),  # map_func: maps one nested structure of tensors to another
            batch_size=batch_size,
            num_parallel_batches=num_cpu_threads,
            drop_remainder=True))
    return d
  return input_fn
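
The comments in the function above describe how from_tensor_slices splits its input along the first dimension and how repeat chains epochs. A minimal pure-Python sketch of those semantics follows; it is not TensorFlow itself, and `slice_first_dim` and `repeat` are hypothetical helper names.

```python
from itertools import chain


def slice_first_dim(data):
    """Yield one element per index along the first dimension.

    `data` is either a list of rows or a dict of equally long lists,
    mirroring the two cases described in the comments above.
    """
    if isinstance(data, dict):
        n = len(next(iter(data.values())))
        for i in range(n):
            yield {key: value[i] for key, value in data.items()}
    else:
        yield from data


def repeat(make_iter, count):
    """Chain `count` passes (epochs) over a freshly created iterator."""
    return chain.from_iterable(make_iter() for _ in range(count))


matrix = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6], [0.5, 0.5]]  # shape (5, 2)
rows = list(slice_first_dim(matrix))            # 5 elements of shape (2,)

features = {"a": [1.0, 2.0, 3.0, 4.0, 5.0], "b": matrix}
elements = list(slice_first_dim(features))      # 5 dicts, one per index

two_epochs = list(repeat(lambda: slice_first_dim(matrix), 2))
```

In the real pipeline the sliced tensor is the list of input file names, so each dataset element is one TFRecord path.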

tf.contrib.data.map_and_batch(
map_func,
batch_size,
num_parallel_batches=None,
drop_remainder=False,
num_parallel_calls=None
)
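Defined in: tensorflow/contrib/data/python/ops/batching.py.
Fused implementation of map and batch.
Maps map_func across batch_size consecutive elements of the dataset and then combines them into a batch. Functionally, it is equivalent to map followed by batch. However, by fusing the two transformations together, the implementation can be more efficient. Surfacing this transformation in the API is temporary. Once automatic input pipeline optimization is implemented, the fusing of map and batch will happen automatically and this API will be deprecated.
Args:
map_func: A function mapping a nested structure of tensors to another nested structure of tensors.
batch_size: A tf.int64 scalar tf.Tensor, representing the number of consecutive elements of this dataset to combine in a single batch.
num_parallel_batches: (Optional.) A tf.int64 scalar tf.Tensor, representing the number of batches to create in parallel. On one hand, higher values can help mitigate the effect of stragglers. On the other hand, higher values can increase contention if CPU is scarce.
drop_remainder: (Optional.) A tf.bool scalar tf.Tensor, representing whether the last batch should be dropped in case its size is smaller than desired; the default behavior is not to drop the smaller batch.
num_parallel_calls: (Optional.) A tf.int32 scalar tf.Tensor, representing the number of elements to process in parallel. If not specified, batch_size * num_parallel_batches elements will be processed in parallel.
Returns:
A Dataset transformation function, which can be passed to tf.data.Dataset.apply.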
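
To make the "map followed by batch" behavior and the effect of drop_remainder concrete, here is a small pure-Python sketch. It only imitates the semantics described above (no parallelism, no TensorFlow), and `map_then_batch` is a hypothetical name.

```python
def map_then_batch(elements, map_func, batch_size, drop_remainder=False):
    """Apply map_func to each element, then group consecutive results."""
    batch = []
    for element in elements:
        batch.append(map_func(element))
        if len(batch) == batch_size:
            yield batch
            batch = []
    # A final, smaller batch is emitted only when drop_remainder is False.
    if batch and not drop_remainder:
        yield batch


records = list(range(10))
kept = list(map_then_batch(records, lambda x: x * x, batch_size=4))
dropped = list(map_then_batch(records, lambda x: x * x, batch_size=4,
                              drop_remainder=True))
```

With 10 records and a batch size of 4, `kept` ends with a batch of 2 elements while `dropped` contains only the two full batches. The input function above passes drop_remainder=True, so every batch has a fixed size.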

def _decode_record(record, name_to_features):
  """Decodes a record to a TensorFlow example."""
  example = tf.parse_single_example(record, name_to_features)
  # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
  # So cast all int64 to int32.
  for name in list(example.keys()):
    t = example[name]
    if t.dtype == tf.int64:
      t = tf.to_int32(t)
    example[name] = t
  #print(example)
  return example

Output of print(example):

{'masked_lm_weights': <tf.Tensor 'ParseSingleExample/ParseSingleExample:4' shape=(20,) dtype=float32>, 'segment_ids': <tf.Tensor 'ToInt32:0' shape=(128,) dtype=int32>, 'masked_lm_positions': <tf.Tensor 'ToInt32_1:0' shape=(20,) dtype=int32>, 'masked_lm_ids': <tf.Tensor 'ToInt32_2:0' shape=(20,) dtype=int32>, 'next_sentence_labels': <tf.Tensor 'ToInt32_3:0' shape=(1,) dtype=int32>, 'input_ids': <tf.Tensor 'ToInt32_4:0' shape=(128,) dtype=int32>, 'input_mask': <tf.Tensor 'ToInt32_5:0' shape=(128,) dtype=int32>}

Reference: 『TensorFlow』 Data-reading class data.Dataset

# Takes the output of the last BERT encoder layer and returns the loss and the probability matrix of the masked-word prediction task.
def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids, label_weights):
  """Get loss and log probs for the masked LM."""
  # input_tensor = model.get_sequence_output()
  # output_weights = model.get_embedding_table()
  # positions = masked_lm_positions
  # label_ids = masked_lm_ids
  # label_weights = masked_lm_weights
  # input_tensor is the final layer output of the model: [batch_size, seq_length, hidden_size].
  # output_weights is the word embedding table: [vocab_size, embedding_size].
  # Gather the encoder outputs at `positions`, i.e. at the positions to be predicted.
  input_tensor = gather_indexes(input_tensor, positions)  # [batch_size * max_pred_per_seq, hidden_size]
  #print("input_tensor",input_tensor) #shape=(640, 768): 20 masked positions for each of 32 examples

  with tf.variable_scope("cls/predictions"):
    # We apply one more non-linear transformation before the output layer.
    # This matrix is not used after pre-training.
    with tf.variable_scope("transform"):
      input_tensor = tf.layers.dense(  # dense layer; output shape [batch_size * max_pred_per_seq, hidden_size]
          input_tensor,
          units=bert_config.hidden_size,
          activation=modeling.get_activation(bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              bert_config.initializer_range))
      input_tensor = modeling.layer_norm(input_tensor)

    # The output weights are the same as the input embeddings, but there is
    # an output-only bias for each token.
    output_bias = tf.get_variable(
        "output_bias",
        shape=[bert_config.vocab_size],
        initializer=tf.zeros_initializer())
    # output_weights is the embedding table; transpose_b=True multiplies by its transpose.
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)  # [batch_size * max_pred_per_seq, vocab_size]
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    label_ids = tf.reshape(label_ids, [-1])
    label_weights = tf.reshape(label_weights, [-1])

    one_hot_labels = tf.one_hot(
        label_ids, depth=bert_config.vocab_size, dtype=tf.float32)
    #print(one_hot_labels) #bert-master/run_pretraining.py:284
    # The `positions` tensor might be zero-padded (if the sequence is too
    # short to have the maximum number of predictions). The `label_weights`
    # tensor has a value of 1.0 for every real prediction and 0.0 for the
    # padding predictions.
    per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
    numerator = tf.reduce_sum(label_weights * per_example_loss)
    denominator = tf.reduce_sum(label_weights) + 1e-5
    loss = numerator / denominator
  return (loss, per_example_loss, log_probs)

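Input: model.get_sequence_output(), the final layer output of the model, [batch_size, seq_length, hidden_size] = [32, 128, 768]
Labels: label_ids
output_weights is the embedding table.
Comparison with the labels: per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1]) # cross-entropy against the labels; one_hot_labels has shape (640, 30522)
Why does masking use [MASK] 80% of the time, the correct word 10% of the time, and a wrong word 10% of the time?
Why not replace every selected word with [MASK]? Do the 10% wrong words have any effect?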
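Google finally open-sources the BERT code: 300 million parameters, a full walkthrough by Synced (機器之心)
Pre-training is the core of BERT and the highlight of the paper. In short, the model draws two sentences from the dataset, where sentence B has a 50% probability of being the sentence that follows A, and converts them into the input representation described earlier. We then randomly mask 15% of the tokens in the input sequence and ask the Transformer to solve two tasks: predict the masked tokens, and predict the probability that B is the sentence following A.
For the binary classification task, in each sampled sequence (A+B), B has a 50% probability of being the next sentence of A. If it is, the label "IsNext" is generated; otherwise the label "NotNext" is generated. These labels serve as the ground truth for judging the model's predictions in the binary classification task.
For the mask prediction task, 15% of the tokens in the sequence are first selected at random. Masking here does not simply mean replacing the selected tokens with the "[MASK]" symbol, because that would create a mismatch between pre-training and fine-tuning. So once the tokens to mask are chosen, Google replaces them with "[MASK]" 80% of the time, replaces them with some other random word 10% of the time, and keeps the original word in the remaining 10% of cases.
Original sentence: my dog is hairy
80%: my dog is [MASK]
10%: my dog is apple
10%: my dog is hairy
Note that keeping the original word in the last 10% of cases biases the representation toward the actually observed word, while replacing the word with another one in the other 10% of cases does not hurt the model's language understanding, because it affects only 1.5% of all tokens (0.1 × 0.15). The authors also note in the paper that because only 15% of the tokens are predicted each time, the model converges more slowly.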
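
The 80% / 10% / 10% rule can be sketched in a few lines of pure Python. This is a simplified illustration, not the repository's data-generation code; `mask_tokens`, the toy vocabulary and the fixed seed are assumptions made for the example.

```python
import random


def mask_tokens(tokens, vocab, rng, mask_prob=0.15, max_predictions=20):
    """Return (masked tokens, masked positions, original tokens at those positions)."""
    output = list(tokens)
    # Special tokens are never selected for prediction.
    candidates = [i for i, tok in enumerate(tokens)
                  if tok not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    num_to_predict = min(max_predictions,
                         max(1, int(round(len(tokens) * mask_prob))))
    chosen = sorted(candidates[:num_to_predict])
    labels = [tokens[i] for i in chosen]
    for i in chosen:
        draw = rng.random()
        if draw < 0.8:
            output[i] = "[MASK]"            # 80%: replace with [MASK]
        elif draw < 0.9:
            output[i] = rng.choice(vocab)   # 10%: replace with a random word
        # remaining 10%: keep the original word
    return output, chosen, labels


rng = random.Random(0)
tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
vocab = ["apple", "my", "dog", "is", "hairy"]
masked, positions, labels = mask_tokens(tokens, vocab, rng)
```

Whatever replacement is drawn, the label stored for a chosen position is always the original token, which is what masked_lm_ids holds after conversion to vocabulary indices.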
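Next sentence prediction
Takes the encoder output at the [CLS] position and returns the loss and the probability matrix of the next-sentence prediction task; the input is model.get_pooled_output().
Labels: 0 means B is the actual next sentence, 1 means B is a random sentence.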
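
A minimal sketch of this two-class loss, assuming it is a standard softmax cross-entropy over the two logits averaged over the batch. The logits below are made-up numbers, and `log_softmax` and `next_sentence_loss` are hypothetical helper names written in pure Python.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_sentence_loss(batch_logits, labels):
    """Mean negative log-likelihood over a batch of 2-way logits."""
    per_example = [-log_softmax(logits)[label]
                   for logits, label in zip(batch_logits, labels)]
    return sum(per_example) / len(per_example), per_example


batch_logits = [[2.0, 0.0], [0.0, 0.0]]   # one row of 2 logits per example
labels = [0, 1]                            # 0 = actual next sentence, 1 = random sentence
loss, per_example = next_sentence_loss(batch_logits, labels)
```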
Open question: what data go into the Transformer's encoder (input side) and decoder (output side)?
Code


def gather_indexes(sequence_tensor, positions):
  """Gathers the vectors at the specific positions over a minibatch."""
  sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3)
  batch_size = sequence_shape[0] #32
  seq_length = sequence_shape[1] #128
  width = sequence_shape[2] #768
  # tf.range(start, limit, delta), e.g. tf.range(3, 18, 3) gives [3, 6, 9, 12, 15]
  flat_offsets = tf.reshape(
      tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])  # row offset of each example in the flattened tensor
  #print(tf.Session().run(flat_offsets))
  #print(seq_length) #128
  #print('flat_offsets',flat_offsets) #flat_offsets Tensor("Reshape:0", shape=(32, 1), dtype=int32)
  flat_positions = tf.reshape(positions + flat_offsets, [-1])  # add i * 128 to all 20 positions of example i, then flatten
  #print((positions + flat_offsets)) #Tensor("add_1:0", shape=(32, 20), dtype=int32)
  #print(positions) #Tensor("IteratorGetNext:3", shape=(32, 20), dtype=int32)
  #print('flat_positions',flat_positions) #flat_positions Tensor("Reshape_1:0", shape=(640,), dtype=int32)
  flat_sequence_tensor = tf.reshape(sequence_tensor,
                                    [batch_size * seq_length, width])  #[32*128,768]
  #print(sequence_tensor) #shape=(32, 128, 768)
  #print(width) #hidden 768
  #print('flat_sequence_tensor',flat_sequence_tensor) #flat_sequence_tensor Tensor("Reshape_2:0", shape=(4096, 768), dtype=float32)
  # tf.gather collects slices from the params axis according to indices.
  # indices must be an integer tensor of any dimension (usually 0-D or 1-D). The output has shape
  # params.shape[:axis] + indices.shape + params.shape[axis + 1:]
  output_tensor = tf.gather(flat_sequence_tensor, flat_positions)
  #print('output_tensor',output_tensor) #output_tensor Tensor("GatherV2:0", shape=(640, 768), dtype=float32)
  # In effect this extracts the encoder output vectors at the 20 masked positions of each example: (32*20, 768)
  return output_tensor

flat_offsets = tf.reshape(
tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])

[[   0]
 [ 128]
 [ 256]
 [ 384]
 [ 512]
 [ 640]
 [ 768]
 [ 896]
 [1024]
 [1152]
 [1280]
 [1408]
 [1536]
 [1664]
 [1792]
 [1920]
 [2048]
 [2176]
 [2304]
 [2432]
 [2560]
 [2688]
 [2816]
 [2944]
 [3072]
 [3200]
 [3328]
 [3456]
 [3584]
 [3712]
 [3840]
 [3968]]

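positions = masked_lm_positions (20 per example)
shape = [32, 20]
[128*0 + positions[0]]
[128*1 + positions[1]]
[128*2 + positions[2]]
.
.
[128*31 + positions[31]]
sequence_tensor = model.get_sequence_output(), flattened to shape [32*128, 768]
Comparing with the flattened sequence_tensor: the 20 masked positions of the first example index into its first 128 rows, those of the second example into rows 128 to 255, and so on.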
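
The offset arithmetic can be checked on a toy example. The following pure-Python sketch mirrors the steps of gather_indexes with nested lists instead of tensors; `gather_indexes_py` is a hypothetical name, and the sizes (batch 2, sequence length 4) are chosen only to keep the example small.

```python
def gather_indexes_py(sequence, positions):
    """sequence: [batch][seq_length][width]; positions: [batch][num_positions]."""
    batch_size = len(sequence)
    seq_length = len(sequence[0])
    # Row offset of each example once the batch is flattened.
    flat_offsets = [i * seq_length for i in range(batch_size)]
    flat_positions = [offset + p
                      for offset, row in zip(flat_offsets, positions)
                      for p in row]
    # Flatten [batch, seq_length, width] to [batch * seq_length, width].
    flat_sequence = [vector for example in sequence for vector in example]
    return [flat_sequence[idx] for idx in flat_positions]


# Each "vector" has width 1 and simply records (example index, token index).
sequence = [[[(b, t)] for t in range(4)] for b in range(2)]
positions = [[1, 3], [0, 2]]
gathered = gather_indexes_py(sequence, positions)
```

Here the second example's positions 0 and 2 become flat indices 4 and 6, so the gathered vectors come from the correct example.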

import tensorflow as tf  # TensorFlow 1.x API (tf.Session)

temp = tf.range(0, 10) * 10 + tf.constant(1, shape=[10])
temp2 = tf.gather(temp, [1, 5, 9])
with tf.Session() as sess:
  print(sess.run(temp))
  print(sess.run(temp2))

[ 1 11 21 31 41 51 61 71 81 91]
[11 51 91]
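get_masked_lm_output() works as follows. Its input is the final output of the Transformer. From that output it takes the vectors at the 20 masked positions of each example, (32*20, 768) in total, to form input_tensor. After one more dense layer and layer normalization, input_tensor is multiplied by the transposed embedding table of the Transformer.
The result, log_probs, has shape (640, 30522) = (32*20, 30522): each masked token gets a vector of length 30522, the vocabulary size. This is then compared with the labels, shape (640,), to compute the loss.
What does label_weights do?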

INFO:tensorflow:masked_lm_ids: 1011 1011 2171 2003 6442 1010 6697 1998 2015 8835 1010 2909 25636 4308 1011 1997 2015 1011 13610 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0

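When an example has fewer than 20 masked words, the remaining slots are padded with 0, and label_weights is 0 at those padded slots, so they contribute nothing to the loss. In my personal understanding this may also save some computation and reduce the influence of the varying number of predictions, which might help accuracy a little.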
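
The role of label_weights can be seen by repeating the numerator / denominator computation from get_masked_lm_output on toy numbers. The sketch below is pure Python; `masked_lm_loss` and the probabilities are made up for illustration.

```python
import math


def masked_lm_loss(log_probs, label_ids, label_weights):
    """Weighted mean of the per-slot cross-entropy, as in get_masked_lm_output."""
    # Picking the log-probability of the label equals
    # -sum(log_probs * one_hot_labels) over the vocabulary axis.
    per_example = [-row[label] for row, label in zip(log_probs, label_ids)]
    numerator = sum(w * l for w, l in zip(label_weights, per_example))
    denominator = sum(label_weights) + 1e-5
    return numerator / denominator, per_example


# 3 prediction slots over a 4-word vocabulary; the last slot is padding.
log_probs = [
    [math.log(0.7), math.log(0.1), math.log(0.1), math.log(0.1)],
    [math.log(0.25)] * 4,
    [math.log(0.97), math.log(0.01), math.log(0.01), math.log(0.01)],
]
label_ids = [0, 2, 0]
label_weights = [1.0, 1.0, 0.0]
loss, per_example = masked_lm_loss(log_probs, label_ids, label_weights)

# Changing the label of the padded slot leaves the loss unchanged.
loss_other_pad, _ = masked_lm_loss(log_probs, [0, 2, 3], label_weights)
```

The padded slot still gets a per-slot loss value, but its weight of 0.0 removes it from both the numerator and the denominator.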
1) tokens: the actual word pieces
2) input_ids: the tokens converted to their indices in the vocabulary
3) input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (length 128: 1 where there is a real token, 0 for padding when the sequence is shorter than 128)
4) segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (the leading 0s are sentence A, which can be seen as the question; the 1s in the middle are sentence B, which can be seen as the answer; the trailing 0s are padding up to 128)
5) masked_lm_positions: the positions of the masked tokens in the sequence
6) masked_lm_ids: the vocabulary indices of the masked tokens
7) masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 (slots beyond the actual number of masked tokens get weight 0)
8) next_sentence_labels: indicates whether the two sentences form a correct sentence pair.
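
How input_ids, input_mask and segment_ids relate to each other can be sketched as follows. This is a simplified pure-Python illustration that assumes the pair already fits into max_seq_length; `build_inputs` and the toy vocabulary are hypothetical.

```python
def build_inputs(tokens_a, tokens_b, vocab, max_seq_length):
    """Build zero-padded input_ids, input_mask and segment_ids for one pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Sentence A (with [CLS] and its [SEP]) is segment 0, sentence B is segment 1.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab[token] for token in tokens]
    input_mask = [1] * len(input_ids)          # 1 for every real token
    pad = max_seq_length - len(input_ids)
    input_ids += [0] * pad
    input_mask += [0] * pad                    # 0 for padding
    segment_ids += [0] * pad
    return input_ids, input_mask, segment_ids


toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "my": 3, "dog": 4, "barks": 5}
input_ids, input_mask, segment_ids = build_inputs(
    ["my", "dog"], ["barks"], toy_vocab, max_seq_length=8)
```

Because padding is 0 in segment_ids as well, only input_mask distinguishes the padded tail from sentence A.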
