[Paper Notes] Generating Sentences from a Continuous Space: a VAE that produces continuous latent-space variables
Abstract
- The standard RNN: doesn't work for global sentence representation;
- The work of this paper: an RNN-based variational autoencoder generative model;
- The model: incorporates distributed latent representations of entire sentences; good at modeling holistic properties;
- The model can generate coherent novel sentences that interpolate between known sentences.
Introduction
- RNNs work well for unsupervised generative modeling of natural language sentences and are core components of systems for machine translation and image captioning;
- Shortcoming of the RNN: it cannot capture global information such as topic or high-level syntactic properties.
- What this paper does: an extension of the RNN that captures global features in a continuous latent variable;
- Inspiration: the architecture of a variational autoencoder, taking advantage of recent advances in variational inference (Kingma and Welling, 2015; Rezende et al., 2014); these papers introduce generative models with latent variables;
- The contributions of the paper:
  - a VAE architecture for text and the problems that arise while training it;
  - the model yields performance similar to existing RNNs when the global variable is not explicitly needed;
  - for the task of imputing missing words, the paper introduces a novel evaluation strategy using an adversarial classifier, sidestepping intractable likelihood computations by drawing inspiration from work on non-parametric two-sample tests and adversarial training;
  - in this setting, the global latent variable allows the model to do well;
  - several qualitative techniques for evaluating the ability to learn high-level features of sentences;
  - the model can produce diverse, coherent sentences through purely deterministic decoding and can interpolate smoothly between sentences.
Background
Unsupervised sentence encoding
- A standard RNN does not learn a vector representation of the full sentence;
- In order to incorporate a continuous latent sentence representation, we need a method to map between sentences and distributed representations that can be trained in an unsupervised setting;
- Sequence autoencoders consist of an encoder function $\phi_{enc}$ and a probabilistic decoder $p(x|\vec z=\phi_{enc}(x))$, where $\vec z$ is the learned code and $x$ is a given example; the decoder maximizes the likelihood of an example $x$ conditioned on $\vec z$; both encoder and decoder are RNNs and examples are token sequences (a minimal sketch follows this list);
- But standard autoencoders are not effective at extracting global semantic features; they do not learn smooth, interpretable features for sentence encoding, and the model does not incorporate a prior over $\vec z$;
- Skip-thought models: same structure as the autoencoder, but generate text conditioned on a neighboring sentence from the target text instead of the target sentence itself;
- Paragraph vector models: non-recurrent sentence representation models.
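As a concrete reference, here is a minimal sequence-autoencoder sketch in PyTorch; the layer sizes, single-layer LSTMs, and unshifted teacher forcing are simplifying assumptions, not the setup used in the paper.

```python
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens):
        emb = self.embed(tokens)
        # phi_enc(x): the encoder's final state is the learned code z
        _, (h, c) = self.encoder(emb)
        # the decoder is conditioned on z by starting from the encoder state
        dec_out, _ = self.decoder(emb, (h, c))
        return self.out(dec_out)  # logits used to maximize p(x | z)

x = torch.randint(0, 1000, (4, 12))                   # a batch of token sequences
logits = SeqAutoencoder()(x)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), x.reshape(-1))
```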
The variational autoencoder
- Based on a regularized version of the standard autoencoder
A VAE for sentences
(Figure: the VAE model architecture; original image: https://i.loli.net/2020/11/17/C1HLey89agIG4hl.png)
- The hidden code: has a Gaussian prior acting as a regularizer; if it incorporates no useful information, it is effectively noise;
- Decoder: a special RNN conditioned on the hidden code;
- Explore several variations on the architecture (sketched below):
  - concatenating the sampled $\vec z$ to the decoder input at every time step;
  - using a soft-plus parametrization for the variance;
  - using deep feedforward networks between the encoder and latent variable and between the decoder and latent variable;
  - Result: little difference in the model's performance.
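A minimal sketch (assumed sizes, PyTorch) of how those variations might look: a soft-plus parametrized variance and the sampled $\vec z$ concatenated to the decoder input at every time step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceVAE(nn.Module):
    def __init__(self, vocab=1000, emb=128, hid=256, zdim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.to_mu = nn.Linear(hid, zdim)
        self.to_var = nn.Linear(hid, zdim)
        self.decoder = nn.LSTM(emb + zdim, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, tokens):
        emb = self.embed(tokens)
        _, (h, _) = self.encoder(emb)
        mu = self.to_mu(h[-1])
        var = F.softplus(self.to_var(h[-1]))          # soft-plus keeps the variance positive
        z = mu + var.sqrt() * torch.randn_like(mu)    # reparameterized sample of z
        # concatenate the sampled z to the decoder input at every time step
        z_rep = z.unsqueeze(1).expand(-1, tokens.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([emb, z_rep], dim=-1))
        return self.out(dec_out), mu, var

logits, mu, var = SentenceVAE()(torch.randint(0, 1000, (4, 12)))
```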
- Some prior work applies VAEs to discrete sequences:
  - VRAE (Variational Recurrent Autoencoder) for modeling music;
- Some work includes continuous latent variables in RNN-style models:
  - modeling speech, handwriting, and music;
  - these models include separate latent variables per timestep, which is unsuitable for our task;
- Work most similar to this paper:
  - a VAE-based document-level language model that models texts as bags of words rather than sequences.
Optimization challenges
The goal of this paper is to learn global latent representations of sentence content.
Q: How can we estimate the global features learned?
A: By looking at the variational lower bound objective.
- The objective (the variational lower bound / ELBO):
  $\mathcal{L}(\theta; x) = -\mathrm{KL}\big(q_\theta(\vec z|x)\,\|\,p(\vec z)\big) + \mathbb{E}_{q_\theta(\vec z|x)}\big[\log p_\theta(x|\vec z)\big] \le \log p(x)$
  - It encourages the model to keep its posterior distributions close to a prior $p(\vec z)$, generally a standard Gaussian ($\mu=\vec 0$, $\sigma=\vec 1$);
  - It is a valid lower bound on the true log likelihood of the data, making the VAE a generative model.
  - KL divergence of the posterior from the prior: $P$ represents the data (the observed distribution we can measure) and $Q$ represents a model that approximates $P$; the closer $P$ and $Q$ are to each other, the closer the value is to zero;
  - The data likelihood under the posterior (expressed as cross entropy) measures the difference between the predicted probability distribution and the ground-truth distribution; for a binary target, $-\big[y\log(\hat y)+(1-y)\log(1-\hat y)\big]$.
- This objective function forces the model to be able to decode plausible sentences from every point in the latent space that has a reasonable probability under the prior.
- A good model will have a non-zero KL divergence term and a relatively small cross entropy (a loss sketch follows this list).
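A minimal sketch of this objective, assuming a diagonal-Gaussian posterior $\mathcal N(\mu,\sigma^2)$ and a standard-normal prior, with the reconstruction term written as token-level cross entropy; the `kl_weight` argument anticipates the annealing trick described next.

```python
import torch
import torch.nn.functional as F

def neg_elbo(logits, targets, mu, var, kl_weight=1.0):
    # reconstruction term: cross entropy of the data under the decoder
    rec = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1), reduction="sum")
    # closed-form KL( N(mu, var) || N(0, I) ) for a diagonal Gaussian posterior
    kl = 0.5 * torch.sum(var + mu.pow(2) - 1.0 - torch.log(var))
    return rec + kl_weight * kl, rec, kl
```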
Failure to learn
Straightforward implementations fail: most training runs yield models that set $q(\vec z|x)$ equal to the prior $p(\vec z)$, bringing the KL divergence term to zero. In that case the posterior and the prior are the same distribution regardless of the input, so the code carries no information about the sentence: the decoder ignores the encoder, and $q(\vec z|x)$ is unrelated to $x$.
KL cost annealing
Give the KL term a variable weight: initialize it at zero and gradually increase it to one as training progresses.
- Why?
  The reason learning fails is that the effect of the KL term is too strong early on. At the start we reduce its effect and let the model learn to condense more information into $\vec z$; once the weight reaches one, the objective again pushes the posterior and prior distributions to be similar. A possible schedule is sketched below.
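One possible annealing schedule following the description above (a logistic ramp from roughly 0 to 1); the midpoint and slope values below are illustrative assumptions.

```python
import math

def kl_weight(step, midpoint=2500, slope=0.0025):
    # weight starts near 0, reaches 0.5 at `midpoint`, and saturates at 1
    return 1.0 / (1.0 + math.exp(-slope * (step - midpoint)))

# kl_weight(0) ≈ 0.002, kl_weight(2500) = 0.5, kl_weight(10000) ≈ 1.0
```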
Word dropout and historyless decoding
- Randomly replace some of the conditioned-on words with the generic unknown-word token UNK; this weakens the effect of the conditioned-on words and makes the model rely more on the latent variable $\vec z$ to make good predictions.
- This technique is parameterized by a keep rate $k\in[0,1]$ (a minimal sketch follows).
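A minimal word-dropout sketch: each conditioned-on token is kept with probability $k$ and otherwise replaced by the unknown-word token (the `UNK_ID` constant is an assumption about the vocabulary).

```python
import torch

UNK_ID = 3  # assumed id of the generic unknown-word token

def word_dropout(decoder_inputs, keep_rate):
    # keep each input token with probability `keep_rate`, else replace with UNK
    keep = torch.rand(decoder_inputs.shape) < keep_rate
    return torch.where(keep, decoder_inputs, torch.full_like(decoder_inputs, UNK_ID))
```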
Details of the prior and posterior distributions are given in the original paper.
Results: Language modeling
- Dataset: Penn Treebank;
- Goal: to discover whether the inclusion of a global latent variable is helpful for this standard task;
- The VAE hyperparameter search is restricted to models that encode a non-trivial amount of information in the latent variable, as measured by the KL divergence.
Results:
- In the standard setting, the VAE performs slightly worse than the RNN baseline; the reconstruction cost is better for the VAE, but it has a KL divergence cost of 2.
- Without word dropout and KL cost annealing, training yields models with performance equivalent to the baseline RNN and a KL term of zero.
- Inputless decoder: tests the ability of the latent variable to encode the full content of a sentence; in this setting the VAE is better. (The sentence-generating process is then fully differentiable, a property required by adversarial training methods.)
- Even with the tricks above, it is still not possible to train models for which the KL divergence term of the cost function dominates the reconstruction term;
- This suggests that it is easier to learn to factor the data distribution using simple local statistics (the global latent variable is not the most important factor); the encoder only learns to encode information in $\vec z$ that cannot be described by local statistics.
Results: Imputing missing words
A technique for imputation and a novel evaluation strategy inspired by adversarial training.
- Why a novel evaluation strategy: precise quantitative evaluation would require intractable likelihood computations;
- Disadvantages of the standard RNN:
  - the sequential nature of likelihood computation and decoding makes it unsuitable for performing inference over unknown words given some known words;
  - it requires $O(V)$ runs of full RNN inference per step of Gibbs sampling or iterated conditional modes;
  - many steps of sampling are needed to propagate information between unknown variables and the known downstream variables;
- Advantage of the VAE:
  - it can more easily propagate information between all variables, by virtue of having a global latent variable and a tractable recognition model.
Training
- Books Corpus introduced in Kiros et al. (2015);
- 80M sentences;
- fixed word dropout rate of 75%;
- 512 hidden units.
Inference method
- Beam search with beam size 15 for the RNN;
- Approximate iterated conditional modes with 3 steps of beam-size-5 search for the VAE;
- Why?
  - This allows both models to use the same amount of computation;
  - Breaking decoding for the VAE into several sequential steps is necessary to propagate information among the variables;
- Approximate iterated conditional modes (a sketch follows this list):
  - first initialize the unknown words to the "unk" token;
  - then alternate between assigning the latent variable to its mode from the recognition model and performing constrained beam search to assign the unknown words;
- Both generative models are trained to decode sentences from right to left, which shortens the dependencies involved in learning for the VAE;
- The final 20% of each sentence is imputed.
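A high-level sketch of the approximate iterated-conditional-modes loop described above; `posterior_mode` (returning the mode of the recognition model $q(\vec z|x)$) and `constrained_beam_search` are hypothetical callables supplied by the caller, not functions from the paper's code.

```python
def impute_suffix(known_prefix, n_unknown, posterior_mode, constrained_beam_search,
                  steps=3, beam_size=5, unk="unk"):
    unknown = [unk] * n_unknown                        # initialize unknown words to "unk"
    for _ in range(steps):
        z = posterior_mode(known_prefix + unknown)     # set z to the mode of q(z | x)
        # re-decode only the unknown positions, keeping the known words fixed
        unknown = constrained_beam_search(z, known_prefix, n_unknown, beam_size)
    return known_prefix + unknown
```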
Now we can demonstrate the advantages of the global latent variable in the regime where the RNN suffers the most from its inductive bias.
Adversarial evaluation
The evaluation draws inspiration from adversarial training methods for generative models as well as non-parametric two-sample tests.
- How to evaluate the imputed sentence completions:
  - by examining their distinguishability from the true sentence endings;
  - the non-differentiability of the discrete RNN decoder prevents easily applying the adversarial criterion at training time, so it is used as an evaluation metric: the adversarial error.
- Train two classifiers:
  - a bag-of-unigrams logistic regression classifier;
  - an LSTM logistic regression classifier that reads the input sentence and produces a binary prediction after seeing the final "eos" token.
- Define the adversarial error as the gap between the ideal accuracy of the discriminator (50%, i.e., indistinguishable samples) and the actual accuracy attained (a sketch follows this list).
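A sketch of the bag-of-unigrams variant of this evaluation using scikit-learn; `true_endings` and `model_endings` are assumed to be lists of strings, and the train/test split details are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def adversarial_error(true_endings, model_endings):
    texts = true_endings + model_endings
    labels = [1] * len(true_endings) + [0] * len(model_endings)
    X = CountVectorizer().fit_transform(texts)         # bag-of-unigrams features
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    return abs(acc - 0.5)   # 0 means completions are indistinguishable from real endings
```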
Results
- When producing the final token of the sentence, the RNN cannot choose anything outside of the top 15 tokens given by its initial unconditional distribution, since it has not yet generated anything to condition on.
- The VAE substantially outperforms the baseline, since it can efficiently propagate information bidirectionally through the latent variable.
Analyzing variational models
- Sample from the Gaussian prior, but use a greedy deterministic decoder for $p(x|\vec z)$, the RNN conditioned on $\vec z$ (a sketch follows this list);
- This gives a sense of how much of the variance in the data distribution is captured by the distributed vector $\vec z$ as opposed to by the decoder;
- The results show that large amounts of variation in the generated language can be achieved by following this procedure.
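A sketch of greedy deterministic decoding from a prior sample; `decoder_step` (mapping the previous token, $\vec z$, and the RNN state to next-token logits) is a hypothetical callable, and the special token ids are assumptions.

```python
import torch

def greedy_decode(decoder_step, z, bos_id=1, eos_id=2, max_len=30):
    tokens, state = [bos_id], None
    for _ in range(max_len):
        logits, state = decoder_step(tokens[-1], z, state)
        nxt = int(torch.argmax(logits))   # x_t = argmax p(x_t | x_<t, z): deterministic
        tokens.append(nxt)
        if nxt == eos_id:
            break
    return tokens

z = torch.randn(32)   # one sample from the Gaussian prior p(z) = N(0, I)
```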
Analyzing the impact of word dropout
- Train and test set performance are very similar;
- Words are dropped out with the specified keep rate at training time, but all words are supplied as inputs at test time, except in the 0% setting;
- The hyperparameters are not retuned for each keep rate;
- As the keep rate for word dropout is lowered, the amount of information stored in the latent variable increases and the overall likelihood of the model degrades somewhat;
- A model with no latent variable would degrade in performance significantly more in the presence of heavy word dropout;
- Evaluating samples: simply sampling from the RNN would not tell us how much of the data is being explained;
  - instead, sample from the Gaussian prior but use a greedy deterministic decoder for $x$, taking each token $x_t = \operatorname{argmax}_{x_t} p(x_t \mid x_0, \dots, x_{t-1}, \vec z)$;
  - this gives a sense of how much of the variance in the data distribution is captured by the distributed vector $\vec z$ as opposed to by local language-model dependencies;
  - greedy decoding applied to a Gaussian sample does not produce diverse sentences;
  - increasing the amount of word dropout forces $\vec z$ to encode more information, and the sentences become more varied;
  - but they begin to repeat words and show other signs of ungrammaticality;
  - with the fully dropped-out decoder, the model is still able to capture higher-order statistics not present in the unigram distribution.
- Examine the effect of using lower-probability samples from the latent Gaussian space:
  - lower-probability samples are found by applying an approximately volume-preserving transformation to the Gaussian samples that stretches some eigenspaces by up to a factor of 4;
  - a random linear transformation is used, with matrix elements drawn from a uniform distribution on $[-c, c]$ (a sketch follows this list);
  - results: the sentences are far less typical, but for the most part are grammatical and maintain a clear topic, indicating that the latent variable captures a rich variety of global features even for rare sentences.
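A sketch of how such lower-probability samples could be drawn (NumPy); the value of $c$ here is an illustrative assumption, chosen only to show the shape of the computation.

```python
import numpy as np

def stretched_samples(n, zdim, c=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # random linear map with entries uniform in [-c, c]; c is tuned so the map
    # is roughly volume-preserving while stretching some eigenspaces
    A = rng.uniform(-c, c, size=(zdim, zdim))
    z = rng.standard_normal((n, zdim))      # ordinary samples from the prior
    return z @ A.T                          # lower-probability latent samples
```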
Sampling from the posterior
- Examine the sentences decoded from posterior samples $p(\vec z|x)$;
- This shows what the model considers to be similar sentences;
- The samples capture information about the number of tokens and the part of speech of each token, as well as topic information;
- The longer the sentence, the lower the fidelity.
Homotopies
- The volume-filling and smooth nature of the code space allows us to examine the homotopy between sentences;
- A homotopy between two codes $\vec z_1$ and $\vec z_2$ is the set of points on the line between them, inclusive: $\vec z(t)=\vec z_1\cdot(1-t)+\vec z_2\cdot t$ for $t\in[0,1]$ (a sketch follows this list);
- This lets us infer what neighborhoods in code space look like;
- The traditional RNN does not have a way to do this;
- The VAE learns representations that are smooth and "fill up" the space;
- In Table 8, the codes mostly contain:
  - the number of words;
  - the parts of speech of the tokens;
  - all intermediate sentences are grammatical;
  - some topic information also remains consistent in neighborhoods;
  - sentences with opposite sentiment may have similar embeddings; this phenomenon can also be seen in word embeddings (e.g., "bad" and "good"), due to their similar distributional characteristics;
- Greedy decoder: given a state vector, a sequence is decoded recursively in a greedy manner by generating each output successively, with each prediction conditioned on the previous output;
- Deterministic decoding: deterministic decoding relies solely on the latent code as the only source of diversity in the produced objects.
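A sketch of the homotopy itself: linearly interpolate between two codes and decode every intermediate point with a deterministic greedy decoder (reusing the hypothetical `greedy_decode` / `decoder_step` helpers from the earlier sketch).

```python
import torch

def homotopy(z1, z2, steps=5):
    # points on the line between z1 and z2, inclusive: z(t) = z1*(1-t) + z2*t
    return [z1 * (1 - t) + z2 * t for t in torch.linspace(0, 1, steps)]

# for z in homotopy(z1, z2):
#     print(greedy_decode(decoder_step, z))
```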