1 Introduction
1.1 Instance discrimination
Instance discrimination defines a rule for splitting samples into positives and negatives.
Given a dataset of N images, randomly pick one image \(x_1\) and apply different Data Transformations to it to obtain a positive pair; all the other images \(x_2, x_3, ..., x_n\) are negative samples. Every sample is passed through an Encoder (also called a Feature Extractor) to obtain its corresponding feature (a code sketch follows the list below).
- \(x_i\): the \(i\)-th image
- \(T_j\): a Data Transformation (Data Augmentation)
- \(x_1^{(1)}\): the anchor (reference point)
- \(x_1^{(2)}\): the positive sample of the anchor
- \(x_2, x_3, ..., x_n\): negative samples with respect to the anchor
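A minimal sketch of this sampling rule, assuming PyTorch/torchvision; `encoder` and the list of input images are hypothetical placeholders, and the specific augmentations are just an illustrative choice. Two augmented views of \(x_1\) form the positive pair, while views of every other image serve as negatives:

```python
import torch
from torchvision import transforms

# One augmentation pipeline T_j; each call draws different random parameters.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.ToTensor(),
])

def instance_discrimination_features(images, encoder):
    """images: [x_1, x_2, ..., x_n] as PIL images; x_1 is the anchor."""
    anchor = images[0]
    q = encoder(augment(anchor).unsqueeze(0))       # x_1^(1): anchor view
    k_pos = encoder(augment(anchor).unsqueeze(0))   # x_1^(2): positive view
    k_neg = torch.cat(
        [encoder(augment(img).unsqueeze(0)) for img in images[1:]]
    )                                               # x_2 ... x_n: negatives
    return q, k_pos, k_neg
```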
1.2 Momentum
Mathematically, momentum can be understood as an Exponential Moving Average (EMA):

\[
Y_t = m \cdot Y_{t-1} + (1 - m) \cdot X_t
\]

\(m\) is the momentum coefficient; it keeps \(Y_t\) from depending entirely on the current input \(X_t\).
- \(Y_t\): the output at the current time step
- \(Y_{t-1}\): the output at the previous time step
- \(X_t\): the input at the current time step
The larger \(m\) is, the less the output depends on the current input \(X_t\).
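A tiny numerical sketch of this update rule (the input stream and the value of \(m\) are made up purely for illustration):

```python
def ema_update(y_prev, x_t, m=0.9):
    """Y_t = m * Y_{t-1} + (1 - m) * X_t"""
    return m * y_prev + (1 - m) * x_t

y = 0.0
for x in [1.0, 1.0, 1.0]:        # a constant input stream
    y = ema_update(y, x, m=0.9)
    print(round(y, 3))           # 0.1, 0.19, 0.271 -- the output drifts toward 1.0 only slowly
```

With a larger \(m\) (MoCo uses values such as 0.999 for its key encoder) the output changes even more slowly.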
1.3 Momentum Contrast (MoCo)
2 Related Work
The relationship between Supervised Learning and Unsupervised/Self-supervised Learning.
2.1 Loss Function
2.2 Pretext tasks
3 Method
3.1 InfoNCE Loss
CrossEntropyLoss
In the \(softmax\) formula, the probability that the model assigns to the true class \(i\) is computed as:

\[
p_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}
\]

In Supervised Learning, the ground truth is a one-hot vector (for instance \([0, 1, 0, 0]\), where \(K = 4\)), so the cross-entropy loss reduces to \(-\log\) of that single probability (a small numerical check follows the list below):

\[
\mathcal{L}_{CE} = -\log \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}
\]

- \(K\) is num_labels (the number of classes in the sample set)
- \(z_j\) is the logit the model outputs for class \(j\)
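A minimal check of this correspondence, assuming PyTorch; the logits are arbitrary example values:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 3.0, 0.5, -1.0]])   # K = 4 classes
target = torch.tensor([1])                        # one-hot ground truth [0, 1, 0, 0]

# -log(softmax probability of the true class), computed by hand ...
manual = -F.log_softmax(logits, dim=1)[0, 1]
# ... equals the library's cross-entropy loss
library = F.cross_entropy(logits, target)
assert torch.allclose(manual, library)
```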
InfoNCE Loss
So why can't CrossEntropyLoss be used directly as the loss function for Contrastive Learning? Because in Contrastive Learning \(K\) is enormous (ImageNet, for example, has 1.28 million images, i.e. 1.28 million classes under instance discrimination): a softmax over that many classes is impractical, and the exponential operation over such a high-dimensional output makes the computation very expensive.
\(\quad\) Contrastive Learning can be thought of as training an Encoder for a dictionary look-up task. Consider an encoded query \(q\) and a set of encoded samples {\(k_0, k_1, k_2, ...\)} that are the keys of a dictionary. Assume that there is a single key (denoted as \(k_+\)) in the dictionary that \(q\) matches.
\(\quad\)A contrastive loss is a function whose value is low when \(q\) is similar to its positive key \(k_+\) and dissimilar to all other keys (considered negative keys for \(q\)). With similarity measured by dot product, a form of contrastive loss function, called InfoNCE, is (a code sketch follows the symbol list below):

\[
\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}
\]
- \(q\) is \(feature_{anchor}\) (the encoded query)
- \(k_i\ (i = 0, ..., K)\) are 1 \(feature_{positive}\) plus \(K\) \(feature_{negative}\)
- \(K\) is the number of negative samples after negative sampling
- \(\tau\) is the temperature, a hyper-parameter
- The sum is over one positive and \(K\) negative samples. Intuitively, this loss is the log loss of a (\(K+1\))-way softmax-based classifier that tries to classify \(q\) as \(k_+\).
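A sketch of this (\(K+1\))-way classifier view, assuming PyTorch; it is close to the pseudocode in the MoCo paper, with the positive logit placed at index 0 so the target label is always 0 (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_neg, tau=0.07):
    """q: (N, C) queries, k_pos: (N, C) positive keys, k_neg: (K, C) negative keys."""
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(1)  # (N, 1) positive logits
    l_neg = torch.einsum("nc,kc->nk", q, k_neg)              # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau          # (N, K+1)
    labels = torch.zeros(q.size(0), dtype=torch.long)        # the positive key is class 0
    return F.cross_entropy(logits, labels)
```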
In general, the query representation is \(q = f_q(x^{query})\), where \(f_q\) is an encoder network and \(x^{query}\) is a query sample (likewise, \(k = f_k(x^{key})\)). Their instantiations depend on the specific pretext task. The inputs \(x^{query}\) and \(x^{key}\) can be images, patches, or context consisting of a set of patches. The networks \(f_q\) and \(f_k\) can be identical, partially shared, or different.
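One possible instantiation of the two encoders, sketched under the assumption that MoCo's choice is followed: \(f_k\) has the same architecture as \(f_q\) but separate weights, and is updated with the momentum rule from Section 1.2 (the ResNet-50 backbone and 128-d output here are illustrative choices):

```python
import copy
import torch
import torchvision

f_q = torchvision.models.resnet50(num_classes=128)  # query encoder f_q
f_k = copy.deepcopy(f_q)                             # key encoder f_k: same architecture, separate weights

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
```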