[Paper Notes] Generating Sentences from a Continuous Space: a VAE with a continuous latent variable

Published by 胡得偉 on 2020-11-20

Abstract

  • The standard RNN: doesn’t work for global sentence representation;

  • The work of this paper: RNN-based variational autoencoder generative model;

  • The model: incorporates distributed latent representations of entire sentences; good at modeling holistic properties;

  • the model can generate coherent novel sentences that interpolate between known sentences;

Introduction

  • RNN language models work well for unsupervised generative modeling of natural language sentences, and as components of supervised systems such as machine translation and image captioning;
  • Shortcoming of the RNN: it cannot capture global information such as topic or high-level syntactic properties.
  • What this paper did: an extension of the RNN language model that captures global features in a continuous latent variable;
  • Inspiration: the architecture of a variational autoencoder, taking advantage of recent advances in variational inference (Kingma and Welling, 2015; Rezende et al., 2014); these papers introduce generative models with latent variables;
  • The contributions of the paper:
    • a VAE architecture for text and an analysis of the problems that arise when training it;
    • the model yields performance similar to an existing RNN language model when the global latent variable is not explicitly needed;
    • for the task of imputing missing words, the paper introduces a novel evaluation strategy using an adversarial classifier, sidestepping intractable likelihood computations by drawing on work on non-parametric two-sample tests and adversarial training;
    • in this setting, the global latent variable allows the model to do well;
    • several qualitative techniques for evaluating the ability to learn high-level features of sentences;
    • the model can produce diverse, coherent sentences through purely deterministic decoding and can interpolate smoothly between sentences.

Background

Unsupervised sentence encoding

  • Standard RNN doesn’t learn a vector representation of the full sentence;

  • In order to incorporate a continuous latent sentence representation, we need a method to map between sentences and distributed representations, which can be trained in an unsupervised setting;

  • Sequence autoencoders consist of an encoder function $\phi_{enc}$ and a probabilistic decoder $p(x|\vec z=\phi_{enc}(x))$, where $\vec z$ is the learned code and $x$ is a given example; the decoder maximizes the likelihood of an example $x$ conditioned on $\vec z$; both encoder and decoder are RNNs and examples are token sequences (see the sketch after this list);

  • But standard autoencoders are not effective at extracting global semantic features; they do not learn smooth, interpretable features for sentence encoding, because the model does not incorporate a prior over $\vec z$;

  • Skip-thought models: same structure as the autoencoder, but generate text conditioned on a neighboring sentence from the target text, instead of on the target sentence itself;

  • Paragraph vector models: non-recurrent sentence representation models
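
A minimal PyTorch sketch of the sequence-autoencoder setup described above, assuming an LSTM encoder/decoder and teacher forcing; the class and argument names are illustrative, not from the paper's code:

```python
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    """Deterministic sequence autoencoder: encode a token sequence into a
    single code vector, then decode it back with an RNN conditioned on it."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        emb = self.embed(tokens)
        _, (h, c) = self.encoder(emb)               # h plays the role of phi_enc(x)
        # decode x conditioned on the code by reusing it as the initial state
        dec_out, _ = self.decoder(emb, (h, c))      # teacher forcing on the inputs
        return self.out(dec_out)                    # logits for p(x | z)
```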

The variational autoencoder

  • Based on a regularized version of the standard autoencoder

A VAE for sentences

[Figure (https://i.loli.net/2020/11/17/C1HLey89agIG4hl.png): the VAE language model architecture, with an RNN encoder, a Gaussian latent code $\vec z$, and an RNN decoder conditioned on $\vec z$]

  • the hidden code: given a Gaussian prior that acts as a regularizer; if it incorporates no useful information, it is effectively noise;

  • decoder: serves as a special RNN language model conditioned on the hidden code;

  • Explore several variations on the architecture (a sketch of two of them follows this list):

    • concatenating the sampled $\vec z$ to the decoder input at every time step

    • using a soft-plus parametrization for the variance

    • using deep feedforward networks between the encoder and latent variable and the decoder and latent variable

    • Result: little difference in model’s performance;

  • Some work about VAE on discrete sequences has been done:

    • VRAE(Variational Recurrent Autoencoder) for modeling music;
  • Some work on including continuous latent variables in RNN-style models:

    • Modeling speech, handwriting and music;
    • these models include separate latent variables per timestep; unsuitable for our task;
  • Similar work to this paper:

    • VAE-based document-level language model that models texts as bags of words, rather than sequences;
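
A sketch of two of the architectural variations listed above (a softplus parametrization of the posterior variance, and concatenating the sampled $\vec z$ to the decoder input at every time step); module names and dimensions are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceVAEDecoderStep(nn.Module):
    """Illustrates a softplus variance parametrization and feeding the code
    to the decoder at every time step."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, z_dim=32):
        super().__init__()
        self.to_mu = nn.Linear(hid_dim, z_dim)
        self.to_var = nn.Linear(hid_dim, z_dim)
        self.decoder = nn.LSTM(emb_dim + z_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def sample_z(self, enc_h):                      # enc_h: (batch, hid_dim)
        mu = self.to_mu(enc_h)
        var = F.softplus(self.to_var(enc_h))        # softplus keeps the variance positive
        return mu + var.sqrt() * torch.randn_like(mu)

    def decode(self, emb, z):                       # emb: (B, T, emb_dim), z: (B, z_dim)
        z_rep = z.unsqueeze(1).expand(-1, emb.size(1), -1)
        dec_in = torch.cat([emb, z_rep], dim=-1)    # z concatenated at every step
        h, _ = self.decoder(dec_in)
        return self.out(h)                          # logits for p(x | z)
```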

Optimization challenges

The goal of this paper is to learn global latent representations of sentence content.

Q: How can we estimate the global features learned?

A: By looking at the variational lower bound objective.

  • The objective:
    $\mathcal{L}(\theta;x) = -\mathrm{KL}\big(q_\theta(\vec z\,|\,x)\,\|\,p(\vec z)\big) + \mathbb{E}_{q_\theta(\vec z|x)}\big[\log p_\theta(x\,|\,\vec z)\big] \le \log p(x)$
  • It encourages the model to keep its posterior distributions close to a prior $p(\vec z)$, generally a standard Gaussian ($\mu=\vec 0$, $\sigma=\vec 1$);

  • It’s a valid lower bound on the true log likelihood of the data, making the VAE a generative model.

  • KL divergence of the posterior from the prior: here $P$ represents the data (the observations, a probability distribution we can measure) and $Q$ represents a model that approximates $P$; the closer $P$ and $Q$ are to each other, the closer the value is to zero;

    $D_{\mathrm{KL}}(P\,\|\,Q)=\sum_{x} P(x)\log\frac{P(x)}{Q(x)}$
  • The data likelihood under the posterior (expressed as cross entropy): measures the difference between the predicted probability distribution and the ground-truth probability distribution; for a binary target,

    $-\big(y\log(\hat y)+(1-y)\log(1-\hat y)\big)$

  • This objective function forces the model to be able to decode plausible sentences from every point in the latent space that has a reasonable probability under the prior.

  • A good model will have a non-zero KL divergence term and a relatively small cross entropy;
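
A minimal sketch of the objective above for a diagonal-Gaussian posterior and a standard-Gaussian prior; tensor shapes and the weighting argument are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def vae_objective(logits, targets, mu, logvar, kl_weight=1.0):
    """Negative evidence lower bound: reconstruction cross entropy plus the KL
    divergence of the diagonal-Gaussian posterior N(mu, sigma^2) from N(0, I)."""
    # reconstruction term: cross entropy of the decoder's token predictions
    recon = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                            targets.reshape(-1), reduction="sum")
    # closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl, recon, kl
```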

Fail in learning

Straightforward implementations fail: most training runs yield models that set $q(\vec z|x)$ equal to the prior $p(\vec z)$, bringing the KL divergence term to zero. In other words, the posterior and the prior have the same distribution regardless of the input sentence, so $q(\vec z|x)$ carries no information about $x$. Once this happens, the decoder ignores the encoder and behaves as a plain RNN language model.

KL cost annealing

[Figure: the sigmoid KL-annealing schedule plotted alongside the (unweighted) value of the KL term during training]

Give the KL term a variable weight: initialize it at zero and gradually increase it to one as training progresses.

  • Why?

    The reason learning fails is that the effect of the KL term is too strong at the start. By reducing its weight early in training, we let the model first learn to condense as much information as possible into $\vec z$; once the weight reaches one, the objective again pushes the posterior toward the prior.
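
A sketch of such a schedule; the paper describes a sigmoid-shaped anneal, but the slope `k` and midpoint `x0` below are placeholder values:

```python
import math

def kl_weight(step, k=0.0025, x0=2500):
    """Sigmoid KL-annealing schedule: close to 0 at the start of training,
    approaching 1 as `step` grows past the midpoint x0."""
    return 1.0 / (1.0 + math.exp(-k * (step - x0)))

# usage inside the training loop (see the objective sketch above):
# loss, recon, kl = vae_objective(logits, targets, mu, logvar,
#                                 kl_weight=kl_weight(global_step))
```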

Word dropout and historyless decoding

  • Randomly replace some of the conditioned-on words with the generic unknown-word token (UNK); this weakens the effect of the conditioned-on words and makes the model rely more on the latent variable $\vec z$ to make good predictions.

  • This technique is parameterized by a keep rate $k\in[0,1]$.
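
A sketch of word dropout on the decoder inputs, keeping each conditioned-on token with probability `keep_rate` and otherwise replacing it with an assumed unknown-token id:

```python
import torch

def word_dropout(decoder_inputs, keep_rate, unk_id, pad_id=0):
    """Randomly replace conditioned-on tokens with UNK so the decoder must rely
    more on the latent code; keep_rate=1.0 leaves inputs untouched, keep_rate=0.0
    gives a historyless (inputless) decoder."""
    keep = torch.rand_like(decoder_inputs, dtype=torch.float) < keep_rate
    keep = keep | (decoder_inputs == pad_id)        # never corrupt padding
    return torch.where(keep, decoder_inputs,
                       torch.full_like(decoder_inputs, unk_id))
```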

Details of the prior and posterior distributions

Results: Language modeling

  • Penn Treebank

  • to discover whether the inclusion of a global latent variable is helpful for this standard task.

  • restrict our VAE hyperparameter search to those models which encode a non-trivial amount of information in the latent variable

  • measured by KL divergence

Results:

  • In the standard setting, the VAE performs slightly worse than the RNN baseline; reconstruction cost is better in VAE, but it has a KL divergence cost of 2.

  • Without word dropout and KL annealing, training results in models with performance equivalent to the baseline RNN, with KL = 0.

  • Inputless decoder: test the ability of the latent variable to encode the full content of sentence; this time VAE is better. (Sentence generating process is fully differentiable. This property is required by adversarial training methods.)

  • Even with the tricks above, we still cannot train models for which the KL divergence term of the cost function dominates the reconstruction term;

  • This suggests that it is easier to learn to factor the data distribution using simple local statistics (the global latent variable is not essential here); the encoder only learns to encode information in $\vec z$ when it cannot be captured by local statistics.

Results: Imputing missing words

A technique for imputation and a novel evaluation strategy inspired by adversarial training.

  • Why novel evaluation strategy

    precise quantitative evaluation requires intractable likelihood computations;

  • Disadvantages of standard RNN

    • the sequential nature of likelihood computation and decoding makes it unsuitable for performing inference over unknown words given some known words;

    • requires $O(V)$ runs of full RNN inference per step of Gibbs sampling or iterated conditional modes

    • many steps of sampling need to propagate information between unknown variables and the known downstream variables

  • advantages of VAE:

    • more easily propagate information between all variables, by virtue of having a global latent variable and a tractable recognition model.

Training

  • Books Corpus introduced in Kiros et al. (2015)

  • 80m sentences

  • fixed word dropout rate of 75%

  • 512 hidden units

Inference method

  • Beam search with size 15 for RNN

  • Approximate iterated conditional modes with 3 steps of a beam size 5 search for VAE

  • Why?

    • Allow to compare the same amount of computation for both models;
  • breaking decoding for the VAE into several sequential steps is necessary to propagate information among the variables;

  • For approximate iterated conditional modes:

    • First initialize the unknown words to the “unk” token;
    • alternate assigning the latent variable to its mode from the recognition model, and performing constrained beam search to assign the unknown words;
  • Both of our generative models are trained to decode sentences from right-to-left, which shortens the dependencies involved in learning for the VAE;

  • impute the final 20% of each sentence.

Now we can demonstrate the advantages of the global latent variable in the regime where the RNN suffers the most from its inductive bias.

Adversarial evaluation

Drawing inspiration from adversarial training methods for generative models as well as non-parametric two-sample tests

  • How to evaluate the imputed sentence completions:
    • examining their distinguishability from the true sentence endings.
  • the non-differentiability of the discrete RNN decoder prevents us from easily applying the adversarial criterion at train time
    • instead, we use an adversarial error at evaluation time
  • Train two classifiers:
    • a bag-of-unigrams logistic regression classifier;
    • LSTM logistic regression classifier that reads the input sentence and produces a binary prediction after seeing the final “eos” token.
  • define the adversarial error as the gap between the ideal accuracy of the discriminator (50%, i.e. indistinguishable samples), and the actual accuracy attained
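
A sketch of the adversarial-error computation for the bag-of-unigrams discriminator, using scikit-learn as an assumed stand-in for the classifier described in the paper:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def adversarial_error(true_endings, imputed_endings):
    """Train a discriminator to tell true sentence endings from imputed ones
    (both given as lists of strings); the adversarial error is the gap between
    chance accuracy (50%) and the accuracy attained: 0.0 means indistinguishable
    completions, 0.5 means trivially separable."""
    texts = true_endings + imputed_endings
    labels = [1] * len(true_endings) + [0] * len(imputed_endings)
    x_tr, x_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.2,
                                              random_state=0, stratify=labels)
    features = CountVectorizer()                    # bag of unigrams
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features.fit_transform(x_tr), y_tr)
    accuracy = clf.score(features.transform(x_te), y_te)
    return abs(accuracy - 0.5)
```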

Results

  • When producing the final token of the sentence (the first token generated, since decoding is right-to-left), the RNN cannot choose anything outside the top 15 tokens given by its initial unconditional distribution, since it has not yet generated anything to condition on.

  • VAE substantially outperforms the baseline since it is able to efficiently propagate information bidirectionally through the latent variable.

Analyzing variational models

  • sample from the Gaussian prior, but use a greedy deterministic decoder for $p(x|\vec z)$, the RNN conditioned on $\vec z$;

  • this allows us to get a sense of how much of the variance in the data distribution is being captured by the distributed vector $\vec z$ as opposed to by the decoder;

  • these results show that large amounts of variation in the generated language can be achieved by following this procedure.

Analyzing the impact of word dropout

[Figure: results at different word dropout keep rates]
  • train and test set performance are very similar;

  • drop out words with the specified keep rate at training time, but supply all words as inputs at test time except in the 0% setting.

  • don’t retune the hyperparameters for each keep rate;

  • as we lower the keep rate for word dropout, the amount of information stored in the latent variable increases, and the overall likelihood of the model degrades somewhat.

  • a model with no latent variable would degrade in performance significantly more in the presence of heavy word dropout.

  • Evaluate samples: simply sampling from the RNN decoder would not tell us how much of the data is being explained;

    • instead, we sample from the Gaussian prior, but use a greedy deterministic decoder for $x$, taking each token $x_t=\mathrm{argmax}_{x_t}\,p(x_t|x_0,\ldots,x_{t-1},\vec z)$ (see the decoding sketch after this list)
    • this allows us to get a sense of how much of the variance in the data distribution is being captured by the distributed vector $\vec z$ as opposed to by local language model dependencies
    [Table: sentences obtained by greedy decoding from prior samples at different word dropout keep rates]
    • greedy decoding applied to a Gaussian sample does not produce diverse sentences;
    • as we increase the amount of word dropout and force $\vec z$ to encode more information, we see the sentences become more varied;
    • but: begin to repeat words and show other signs of ungrammaticality;
    • in the fully dropped-out decoder, the model is able to capture higher-order statistics not present in the unigram distribution.
  • examine the effect of using lower-probability samples from the latent Gaussian space

    [Table: sentences decoded from lower-probability samples of the latent Gaussian space]
    • find lower-probability samples by applying an approximately volume-preserving transformation to the Gaussian samples that stretches some eigenspaces by up to a factor of 4;
    • use a random linear transformation, with matrix elements drawn from a uniform distribution on $[-c,c]$;
    • results: the sentences are far less typical, but for the most part are grammatical and maintain a clear topic, indicating that the latent variable is capturing a rich variety of global features even for rare sentences;
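
A sketch of the greedy deterministic decoding used for these samples, taking the argmax token at each step; the `decoder.step(prev_token_id, z, state)` interface is an assumption for illustration:

```python
import torch

@torch.no_grad()
def greedy_decode(decoder, z, bos_id, eos_id, max_len=60):
    """Decode one sentence deterministically from a latent code z by taking the
    argmax token at every step; all variation must then come from z."""
    tokens = [bos_id]
    state = None
    for _ in range(max_len):
        # assumed API: returns logits over the vocabulary and the new RNN state
        logits, state = decoder.step(tokens[-1], z, state)
        next_id = int(torch.argmax(logits, dim=-1))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```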

Sampling from the posterior

  • examine the sentences decoded from the posterior vectors $p(\vec z|x)$

  • see what the model considers to be similar sentences by examining the posterior samples;

  • the codes capture information about the number of tokens and the part of speech of each token, as well as topic information;

  • the longer the sentences, the lower the fidelity.

Homotopies

[Table 8: sentences produced by homotopies (linear interpolations) between pairs of sentence codes]
  • the volume-filling and smooth nature of the code space allows us to examine the homotopy between sentences;

  • a homotopy: a homotopy between two codes $\vec z_1$ and $\vec z_2$ is the set of points on the line between them, inclusive, $\vec z(t)=\vec z_1\cdot(1-t)+\vec z_2\cdot t$ for $t\in[0,1]$ (see the sketch after this list).

  • we can infer what neighborhoods in code space look like;

  • the traditional RNN does not have a way to do so;

  • VAE learns representations that are smooth and “fill up” the space;

  • In Table 8, we can see that the codes mostly contain:

    • the number of words;
    • the parts of speech of tokens;
    • all intermediate sentences are grammatical;
    • some topic information also remains consistent in neighborhoods;
    • sentences with opposite sentiment may have similar embeddings; this phenomenon can also be seen in word embeddings (e.g. "bad" and "good"), due to their similar distributional characteristics;
  • greedy decoder: Given a state vector we can recursively decode a sequence in a greedy manner by generating each output successively, where each prediction is conditioned on the previous output.

  • deterministic decoding: deterministic decoding relies solely on the latent code to produce diverse outputs.
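
A sketch of the homotopy construction $\vec z(t)=\vec z_1\cdot(1-t)+\vec z_2\cdot t$, decoding a few intermediate codes with the greedy decoder sketched earlier (same assumed interface):

```python
import torch

@torch.no_grad()
def homotopy(z1, z2, decoder, bos_id, eos_id, steps=5):
    """Decode the sentences along the straight line between two codes,
    z(t) = z1 * (1 - t) + z2 * t, for t in [0, 1]."""
    sentences = []
    for i in range(steps):
        t = i / (steps - 1)
        z_t = z1 * (1.0 - t) + z2 * t
        sentences.append(greedy_decode(decoder, z_t, bos_id, eos_id))
    return sentences
```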
