Momentum Contrast (MoCo) for Unsupervised Visual Representation Learning

Posted by ForHHeart on 2024-05-03

1 Introduction

1.1 Instance Discrimination

Instance discrimination defines a rule for splitting data into positive and negative samples.

Given a dataset of N images, randomly pick one image \(x_1\); applying different data transformations to it yields a positive pair, while the remaining images \(x_2, x_3, ..., x_n\) serve as negative samples. All images are passed through an Encoder (also called a Feature Extractor) to obtain their corresponding features.

  • \(x_i\): the \(i\)-th image
  • \(T_j\): a data transformation (data augmentation)
  • \(x_1^{(1)}\): the anchor (reference point)
  • \(x_1^{(2)}\): the positive sample with respect to the anchor
  • \(x_2, x_3, ..., x_n\): the negative samples with respect to the anchor
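
As a minimal sketch of how such a positive pair might be constructed (assuming a PyTorch/torchvision setup; the specific augmentations and the 128-d encoder head are illustrative choices, not the paper's exact configuration):

```python
import torch
from torchvision import transforms as T
from torchvision.models import resnet50

# Two independent draws from the same augmentation pipeline give two "views"
# of one image: x1^(1) (the anchor) and x1^(2) (its positive sample).
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4),
    T.ToTensor(),
])

encoder = resnet50(num_classes=128)  # encoder f: image -> 128-d feature

def make_positive_pair(pil_image):
    v1 = augment(pil_image).unsqueeze(0)  # anchor view  x1^(1)
    v2 = augment(pil_image).unsqueeze(0)  # positive view x1^(2)
    q = encoder(v1)        # feature of the anchor
    k_pos = encoder(v2)    # feature of the positive sample
    return q, k_pos

# The features of the other images x2, ..., xn play the role of negative samples.
```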

1.2 Momentum

Mathematically, momentum can be understood as an exponential moving average (EMA).

\(m\) is the momentum coefficient; its purpose is to keep \(Y_t\) from depending entirely on the current input \(X_t\):

$Y_t = mY_{t-1} + (1-m)X_t$
  • \(Y_t\): the output at the current time step
  • \(Y_{t-1}\): the output at the previous time step
  • \(X_t\): the input at the current time step

The larger \(m\) is, the less \(Y_t\) depends on the current input \(X_t\).
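
A minimal sketch of this update rule, written the way MoCo applies it to the key encoder's parameters (PyTorch assumed; the function name is mine):

```python
import torch

@torch.no_grad()
def momentum_update(key_encoder, query_encoder, m=0.999):
    # Y_t = m * Y_{t-1} + (1 - m) * X_t, applied parameter-wise:
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_k, p_q in zip(key_encoder.parameters(), query_encoder.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
```

With \(m = 0.999\), the key encoder evolves very slowly, which is exactly the "large \(m\), weak dependence on \(X_t\)" behaviour described above.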

1.3 Momentum Contrast (MoCo)

MoCo frames contrastive learning as building a large and consistent dictionary: the keys are kept in a queue of encoded samples, and the key encoder is a slowly progressing, momentum-updated (EMA) copy of the query encoder.

2 Related Work

[Figure: the relationships between Supervised Learning and Unsupervised Learning / Self-supervised Learning]

2.1 Loss Function

2.2 Pretext Tasks

3 Method

3.1 InfoNCE Loss

CrossEntropyLoss

With the softmax formula, the probability that the model assigns to the true sample (class) is:

$\hat{y}_+ = \mathrm{softmax}(z_+) = \dfrac{\exp(z_+)}{\sum_{i=1}^{K} \exp(z_i)}$

In supervised learning, the ground truth is a one-hot vector (for instance \([0,1,0,0]\), where \(K=4\)), so the cross-entropy loss is:

$\begin{align*} \mathrm{CrossEntropyLoss} &= -\sum_{i} y_i\log(\hat{y}_i) \\ &= -\log \hat{y}_+ \\ &= -\log \mathrm{softmax}(z_+) \\ &= -\log\dfrac{\exp(z_+)}{\sum_{i=1}^{K} \exp(z_i)} \end{align*}$
  • \(K\) is num_labels (the number of classes)
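
A small numeric check of this identity (PyTorch assumed; the logits below are made-up values for illustration):

```python
import torch
import torch.nn.functional as F

z = torch.tensor([[1.0, 3.0, 0.5, -1.0]])  # logits for K = 4 classes
y = torch.tensor([1])                      # ground truth one-hot [0,1,0,0] -> class index 1

manual = -torch.log(torch.softmax(z, dim=1)[0, 1])  # -log softmax(z_+), computed explicitly
library = F.cross_entropy(z, y)                     # PyTorch's CrossEntropyLoss
print(manual.item(), library.item())                # both ~0.21, i.e. the same value
```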

InfoNCE Loss

So why can't CrossEntropyLoss be used directly as the loss function for contrastive learning? Because in contrastive learning \(K\) is enormous (for example, ImageNet has 1.28 million images, which under instance discrimination means 1.28 million classes): softmax cannot handle that many classes, and the exponential operation over such a high-dimensional output makes the computation very expensive.

\(\quad\) Contrastive Learning can be thought of as training an Encoder for a dictionary look-up task. Consider an encoded query \(q\) and a set of encoded samples {\(k_0, k_1, k_2, ...\)} that are the keys of a dictionary. Assume that there is a single key (denoted as \(k_+\)) in the dictionary that \(q\) matches.
\(\quad\)A contrastive loss is a function whose value is low when \(q\) is similar to its positive key \(k_+\) and dissimilar to all other keys (considered negative keys for \(q\)). With similarity measured by dot product, a form of contrastive loss function, called InfoNCE, is used:

$\mathcal{L}_q = -\log \dfrac{\exp(q \cdot k_+/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i/\tau)}$
  • \(q\): the anchor's feature, \(feature_{anchor}\)
  • \(k_i\ (i=0,...,K)\): one \(feature_{positive}\) plus \(K\) \(feature_{negative}\)
  • \(K\): the number of negative samples (after negative sampling)
  • \(\tau\): the temperature, a hyper-parameter
  • The sum is over one positive and \(K\) negative samples. Intuitively, this loss is the log loss of a (\(K+1\))-way softmax-based classifier that tries to classify \(q\) as \(k_+\) (see the code sketch after this list).
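
Sketched in code, this (\(K+1\))-way classification view of InfoNCE looks roughly as follows (a PyTorch sketch in the spirit of the paper's pseudocode; the random features and queue here are stand-ins for real encoder outputs):

```python
import torch
import torch.nn.functional as F

tau = 0.07                    # temperature hyper-parameter
N, C, K = 32, 128, 65536      # batch size, feature dim, number of negatives

q     = F.normalize(torch.randn(N, C), dim=1)  # query (anchor) features
k_pos = F.normalize(torch.randn(N, C), dim=1)  # positive key features
queue = F.normalize(torch.randn(C, K), dim=0)  # K negative keys (the dictionary)

l_pos = torch.einsum('nc,nc->n', q, k_pos).unsqueeze(-1)  # N x 1: q . k_+
l_neg = torch.einsum('nc,ck->nk', q, queue)               # N x K: q . k_i
logits = torch.cat([l_pos, l_neg], dim=1) / tau           # N x (K+1)

# InfoNCE = cross entropy of a (K+1)-way classifier whose true class is the positive (index 0)
labels = torch.zeros(N, dtype=torch.long)
loss = F.cross_entropy(logits, labels)
```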

In general, the query representation is \(q = f_q(x^{query})\), where \(f_q\) is an encoder network and \(x^{query}\) is a query sample (likewise, \(k = f_k(x^{key})\)). Their instantiations depend on the specific pretext task. The inputs \(x^{query}\) and \(x^{key}\) can be images, patches, or context consisting of a set of patches. The networks \(f_q\) and \(f_k\) can be identical, partially shared, or different.
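
Putting the pieces together, one training step in the spirit of the paper's Algorithm 1 might look like the sketch below; `f_q`, `f_k`, `queue`, `aug`, and `momentum_update` (from Section 1.2) are placeholders, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def moco_train_step(x, f_q, f_k, queue, optimizer, aug, m=0.999, tau=0.07):
    q = F.normalize(f_q(aug(x)), dim=1)            # query from one augmented view
    with torch.no_grad():                          # no gradient flows to the key encoder
        momentum_update(f_k, f_q, m)               # EMA update of f_k (see Section 1.2)
        k = F.normalize(f_k(aug(x)), dim=1)        # key from another augmented view

    l_pos = torch.einsum('nc,nc->n', q, k).unsqueeze(-1)       # positive logits
    l_neg = torch.einsum('nc,ck->nk', q, queue.clone().detach())  # negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)           # positives are class 0

    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss, k   # the new keys are then enqueued and the oldest batch dequeued
```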
