On a Certain Ad About Reproducing XLNet

Posted by 雷姆是我的 on 2021-01-01

Among the ads run by a certain training company, the most common pitch is the so-called "reproduce XLNet" one. The claim is that if you cannot hand-write XLNet within a one-hour interview, you have not even mastered the basics; in other words, no company would take you even if you paid them to.

This ad has done real damage. And never mind that the course it funnels you into is of no help for the task anyway; it is really just elementary optimization theory.

Let me first explain why reproducing XLNet in an hour is impossible. Start with the XLNet source code, and look at just this one file. See for yourself how long it is; as I recall it runs to forty pages printed out, which by rough count is well over a thousand lines, or twenty-some lines a minute nonstop. I do not believe even the instructor could type that in an hour, let alone reproduce it from scratch.

A more important point: most top-conference papers do not release every detail. Take this block of code, for example:

flags.DEFINE_string("master", default=None,
      help="master")
flags.DEFINE_string("tpu", default=None,
      help="The Cloud TPU to use for training. This should be either the name "
      "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 url.")
flags.DEFINE_string("gcp_project", default=None,
      help="Project name for the Cloud TPU-enabled project. If not specified, "
      "we will attempt to automatically detect the GCE project from metadata.")
flags.DEFINE_string("tpu_zone",default=None,
      help="GCE zone where the Cloud TPU is located in. If not specified, we "
      "will attempt to automatically detect the GCE project from metadata.")
flags.DEFINE_bool("use_tpu", default=True,
      help="Use TPUs rather than plain CPUs.")
flags.DEFINE_integer("num_hosts", default=1,
      help="number of TPU hosts")
flags.DEFINE_integer("num_core_per_host", default=8,
      help="number of cores per host")
flags.DEFINE_bool("track_mean", default=False,
      help="Whether to track mean loss.")

# Experiment (data/checkpoint/directory) config
flags.DEFINE_integer("num_passes", default=1,
      help="Number of passed used for training.")
flags.DEFINE_string("record_info_dir", default=None,
      help="Path to local directory containing `record_info-lm.json`.")
flags.DEFINE_string("model_dir", default=None,
      help="Estimator model_dir.")
flags.DEFINE_string("init_checkpoint", default=None,
      help="Checkpoint path for initializing the model.")

# Optimization config
flags.DEFINE_float("learning_rate", default=1e-4,
      help="Maximum learning rate.")
flags.DEFINE_float("clip", default=1.0,
      help="Gradient clipping value.")
# lr decay
flags.DEFINE_float("min_lr_ratio", default=0.001,
      help="Minimum ratio learning rate.")
flags.DEFINE_integer("warmup_steps", default=0,
      help="Number of steps for linear lr warmup.")
flags.DEFINE_float("adam_epsilon", default=1e-8,
      help="Adam epsilon.")
flags.DEFINE_string("decay_method", default="poly",
      help="Poly or cos.")
flags.DEFINE_float("weight_decay", default=0.0,
      help="Weight decay rate.")

# Training config
flags.DEFINE_integer("train_batch_size", default=16,
      help="Size of the train batch across all hosts.")
flags.DEFINE_integer("train_steps", default=100000,
      help="Total number of training steps.")
flags.DEFINE_integer("iterations", default=1000,
      help="Number of iterations per repeat loop.")
flags.DEFINE_integer("save_steps", default=None,
      help="Number of steps for model checkpointing. "
      "None for not saving checkpoints")
flags.DEFINE_integer("max_save", default=100000,
      help="Maximum number of checkpoints to save.")

# Data config
flags.DEFINE_integer("seq_len", default=0,
      help="Sequence length for pretraining.")
flags.DEFINE_integer("reuse_len", default=0,
      help="How many tokens to be reused in the next batch. "
      "Could be half of `seq_len`.")
flags.DEFINE_bool("uncased", False,
      help="Use uncased inputs or not.")
flags.DEFINE_integer("perm_size", 0,
      help="Window size of permutation.")
flags.DEFINE_bool("bi_data", default=True,
      help="Use bidirectional data streams, i.e., forward & backward.")
flags.DEFINE_integer("mask_alpha", default=6,
      help="How many tokens to form a group.")
flags.DEFINE_integer("mask_beta", default=1,
      help="How many tokens to mask within each group.")
flags.DEFINE_integer("num_predict", default=None,
      help="Number of tokens to predict in partial prediction.")
flags.DEFINE_integer("n_token", 32000, help="Vocab size")

# Model config
flags.DEFINE_integer("mem_len", default=0,
      help="Number of steps to cache")
flags.DEFINE_bool("same_length", default=False,
      help="Same length attention")
flags.DEFINE_integer("clamp_len", default=-1,
      help="Clamp length")

flags.DEFINE_integer("n_layer", default=6,
      help="Number of layers.")
flags.DEFINE_integer("d_model", default=32,
      help="Dimension of the model.")
flags.DEFINE_integer("d_embed", default=32,
      help="Dimension of the embeddings.")
flags.DEFINE_integer("n_head", default=4,
      help="Number of attention heads.")
flags.DEFINE_integer("d_head", default=8,
      help="Dimension of each attention head.")
flags.DEFINE_integer("d_inner", default=32,
      help="Dimension of inner hidden size in positionwise feed-forward.")
flags.DEFINE_float("dropout", default=0.0,
      help="Dropout rate.")
flags.DEFINE_float("dropatt", default=0.0,
      help="Attention dropout rate.")
flags.DEFINE_bool("untie_r", default=False,
      help="Untie r_w_bias and r_r_bias")
flags.DEFINE_string("summary_type", default="last",
      help="Method used to summarize a sequence into a compact vector.")
flags.DEFINE_string("ff_activation", default="relu",
      help="Activation type used in position-wise feed-forward.")
flags.DEFINE_bool("use_bfloat16", False,
      help="Whether to use bfloat16.")

Every flag above is at the core of pretraining. Some are easy to understand; others are barely mentioned in the paper, yet I used them myself when training XLNet (yes, I have actually trained it, on a TPU pod), and they matter enormously. If you do not know how to tune these parameters, whatever you train will be useless.
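
One concrete example of a detail the paper barely spells out: the learning-rate schedule. Reading the optimization flags above, the schedule appears to be a linear warmup over warmup_steps followed by polynomial or cosine decay down to learning_rate * min_lr_ratio. Here is a minimal sketch of that reading; it is my reconstruction from the flag descriptions, not the official implementation.

import math

def lr_at_step(step, train_steps=100000, learning_rate=1e-4,
               warmup_steps=0, min_lr_ratio=0.001, decay_method="poly"):
    # Hypothetical reconstruction: linear warmup to the peak rate,
    # then decay ("poly" or "cos") to learning_rate * min_lr_ratio.
    if warmup_steps > 0 and step < warmup_steps:
        return learning_rate * step / warmup_steps
    min_lr = learning_rate * min_lr_ratio
    # Progress through the decay phase, clamped to [0, 1].
    t = min((step - warmup_steps) / max(1, train_steps - warmup_steps), 1.0)
    if decay_method == "poly":
        # Degree-1 polynomial (linear) decay to min_lr.
        return min_lr + (learning_rate - min_lr) * (1.0 - t)
    if decay_method == "cos":
        # Cosine decay to min_lr.
        return min_lr + (learning_rate - min_lr) * 0.5 * (1.0 + math.cos(math.pi * t))
    raise ValueError("unknown decay_method: %s" % decay_method)

Even this small corner has three knobs (warmup_steps, decay_method, min_lr_ratio) that the paper does not tell you how to set.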

One last question: companies in China that can actually afford to train an XLNet, raise your hands.

I am calling this out because I do not want this kind of manufactured anxiety to spread. I have never taken a 貪心學院 course, so I will not comment on the courses themselves. But this particular ad copy is disgusting and cheap; please stop running it.
