
arXiv-2019
Code: https://github.com/yinghdb/EmbedMask
An improvement built on top of FCOS.


1 Background and Motivation

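With the rapid development of deep learning, the use of CNNs in computer vision has extended from the image level to the pixel level. For example, instance segmentation is an extension of object detection that takes detected objects from the instance level to the pixel level.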

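Current CNN-based instance segmentation methods fall into two categories.

Proposal-based methods: first detect bounding boxes with an object detector, then classify the pixels inside each box. A representative method is Mask R-CNN (see 【Mask RCNN】《Mask R-CNN》). One-stage detectors are generally not as strong as two-stage ones, while the RoI pooling operation used by two-stage methods has the following two drawbacks:
- results in the loss of features and the distortion to the aspect ratios
- complex to adjust too many parameters
Segmentation-based methods: segment first, then do clustering. These methods have no re-pooling operation (RoI pooling). The difficulty lies in the clustering step, where it is hard to determine the number of clusters or the positions of the cluster centers.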

The authors combine the strengths of both families (it preserves strong detection capabilities as the proposal-based methods, and meanwhile keeps the details of images as the segmentation-based methods) and propose the EmbedMask instance segmentation method, which uses embeddings to simplify the clustering procedure of segmentation-based methods and to avoid the re-pooling procedure of Mask R-CNN.

2 Related Work

Proposal-based methods：detection and segmentation
- Two-stage methods, e.g. Mask R-CNN
- One-stage methods, e.g. YOLACT, TensorMask
Segmentation-based methods (bottom-up methods): first segmenting and then clustering (e.g. making the pixels of the same instance lie as close together as possible in the embedding space)

3 Advantages / Contributions

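Proposes a framework that unites the proposal-based and segmentation-based methods through pixel embeddings and proposal embeddings.
A one-stage instance segmentation method that nevertheless claims higher quality (judging from a few hand-picked images) and higher speed than the two-stage Mask R-CNN (although its AP is lower).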

4 Method

pixel embeddings, proposal embeddings, and proposal margins to extract the instance masks

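$d = 32$ is the embedding dimension.

At location $x_j$, the full set of parameters of $proposal_j$ is $\{class_j, box_j, center_j, q_j, \sigma_j\}$,

where $q_j$ is the proposal embedding, which is regarded as the cluster center,

and $\sigma_j$ is the proposal margin.

The similarity between the pixel embeddings and the proposal embedding is used to generate the mask inside each proposal. The proposal margin $\sigma_j$ acts as a similarity threshold that determines the final mask.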

4.1 Embedding Definition

The authors propose the following two new kinds of embedding:

proposal embedding, which is a good representation of entire instance
pixel embedding, which learns the relation between each pixel with corresponding instance

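In the embedding space, the proposal embedding acts as a cluster center, and the pixel embeddings of the same instance lie close to that center.

Compared with other methods, this design avoids the problem of finding the number and the positions of the cluster centers.

The conventional approach is:

[Image: Equation (1)]

$p_i$ is the pixel embedding
$q_j$ is the proposal embedding
$\delta$ is the proposal margin
$Q_k$ is the mean of $q_j$ over the positive-sample region inside the instance proposal $S_k$ (the GT mask), i.e. the cluster center

During training, $S_k$ is the GT mask and $Q_k$ is the mean of all positive proposal embeddings. The objective is to pull the pixel embeddings of an instance as close as possible to its proposal embedding, and to push the embeddings of background pixels as far away as possible.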
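The thresholding idea behind Equation (1) can be sketched in a few lines of NumPy. This is only an illustration and not the authors' code: the helper name `fixed_margin_mask`, the Euclidean distance, and the `<=` comparison are assumptions made for the example.

```python
import numpy as np

def fixed_margin_mask(pixel_emb, center, delta):
    """Mark a pixel as part of the instance when its embedding lies within
    `delta` of the cluster center (hypothetical helper, Euclidean distance assumed).

    pixel_emb: (H, W, d) pixel embeddings p_i
    center:    (d,) cluster center Q_k
    delta:     scalar margin
    """
    dist = np.linalg.norm(pixel_emb - center, axis=-1)  # (H, W) distances
    return dist <= delta

# Toy 2x2 image with d = 2: the top row is near the center, the bottom row is far.
p = np.array([[[0.0, 0.1], [0.2, 0.0]],
              [[3.0, 3.0], [0.0, 5.0]]])
Q = np.zeros(2)
mask = fixed_margin_mask(p, Q, delta=1.0)
print(mask)
```

Only the two top-row pixels end up in the mask, since their embeddings are within the margin of the center.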

Building on Equation (1), the embeddings can be trained with a hinge loss:

[Image: Equation (2), the hinge loss]

$K$: the number of GT instances
$B_k$: represents the set of pixel embeddings that need to be supervised for the instance $S_k$, i.e. those inside the bbox of the GT mask
$N_k$: the number of pixel embeddings in $B_k$
$\mathbb{I}(i \in S_k)$: indicator function
$S_k$: the mask of the GT instance
$Q_k$: the mean of all positive proposal embeddings, where the positive locations are those whose predicted bbox has IoU greater than 0.5 with the bbox of $S_k$
$[x]_+$: denotes $\max(0, x)$
$\delta_a$ and $\delta_b$: two margins designed for the pull and push strategies, respectively

The first term pulls embeddings to within the margin $\delta_a$; the second term pushes them beyond the margin $\delta_b$.
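A rough NumPy sketch of such a pull/push hinge loss, built only from the symbol definitions above, might look like the following. The function name `hinge_embedding_loss` and the list-of-arrays input layout are assumptions for illustration; the squared hinge terms mirror the plotting snippet in this post.

```python
import numpy as np

def hinge_embedding_loss(pixel_embs, centers, in_masks, delta_a, delta_b):
    """Average pull/push hinge loss over K instances (illustrative sketch).

    pixel_embs: list of (N_k, d) arrays, the pixel embeddings in B_k
    centers:    list of (d,) arrays, the cluster centers Q_k
    in_masks:   list of (N_k,) bool arrays, True where the pixel is in S_k
    """
    total = 0.0
    for p, q, inside in zip(pixel_embs, centers, in_masks):
        dist = np.linalg.norm(p - q, axis=1)
        pull = np.maximum(dist - delta_a, 0.0) ** 2  # penalise in-mask pixels that are too far
        push = np.maximum(delta_b - dist, 0.0) ** 2  # penalise out-of-mask pixels that are too close
        total += np.where(inside, pull, push).mean()  # average over the N_k pixels
    return total / len(centers)  # average over the K instances

# One instance, four pixels at distances 0, 2, 0.5 and 3 from the center.
p = np.array([[0.0, 0.0], [2.0, 0.0], [0.5, 0.0], [3.0, 0.0]])
inside = np.array([True, True, False, False])
loss = hinge_embedding_loss([p], [np.zeros(2)], [inside], delta_a=1.0, delta_b=1.0)
print(loss)
```

In this toy case only the in-mask pixel at distance 2 and the out-of-mask pixel at distance 0.5 contribute to the loss.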

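A plot makes the relationship clear. The horizontal axis is the distance between $p$ and $q$, and the vertical axis shows the two loss terms.

import numpy as np
import matplotlib.pyplot as plt

# Embedding distance on the x-axis; both margins are set to 1 for illustration
x = np.linspace(0, 5, 100)
pull = np.maximum(x - 1, 0) ** 2  # first term: grows once the distance exceeds the margin
push = np.maximum(1 - x, 0) ** 2  # second term: grows once the distance falls below the margin

plt.plot(x, pull)
plt.plot(x, push)
plt.legend(['pull term', 'push term'], loc="upper center")

# gca = get current axes
ax = plt.gca()

# spines = the four border lines of the plot
ax.spines['right'].set_color('none')  # hide the right border
ax.spines['top'].set_color('none')    # hide the top border

ax.xaxis.set_ticks_position('bottom')  # draw the x ticks on the bottom spine
ax.yaxis.set_ticks_position('left')    # draw the y ticks on the left spine

ax.spines['bottom'].set_position(('data', 0))  # move the x-axis to y = 0
ax.spines['left'].set_position(('data', 0))    # move the y-axis to x = 0

plt.show()

For the first term, the loss keeps growing once x exceeds the margin. For the second term, the loss keeps growing as x falls further below the margin.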

The authors find that with fixed margins it is difficult to find the optimal values, so they let the network learn the margin instead.

4.2 Learnable Margin

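A Gaussian function is used to decide whether a pixel belongs to an instance, replacing Equation (1) of Section 4.1:

[Image: Equation (3)]

It maps the distance between the pixel embedding $p_i$ of the pixel $x_i$ and the proposal embedding $Q_k$ of the instance $S_k$ into a value in $(0, 1]$: the value is 1 when the two embeddings coincide and decays toward 0 as the distance grows.

$\Sigma_k$ is the mean of $\sigma_j$ over the positive region, analogous to the relation between $q_j$ and $Q_k$. A location is positive when its predicted bbox has IoU greater than 0.5 with the bbox of $S_k$
$\phi(x_i,S_k)$ is the probability that pixel $x_i$ belongs to the GT mask $S_k$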
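As a hedged sketch, the Gaussian membership function can be written as below. The exact form $\exp(-\|p_i - Q_k\|^2 / (2\Sigma_k^2))$ is inferred from the notes in this post (a Gaussian whose learned quantity is $\frac{1}{2\sigma^2}$); the function name `gaussian_membership` is hypothetical.

```python
import numpy as np

def gaussian_membership(pixel_emb, center, sigma):
    """Probability that each pixel belongs to the instance, assuming the
    Gaussian form exp(-||p - Q||^2 / (2 * sigma^2)).

    pixel_emb: (N, d) pixel embeddings p_i
    center:    (d,) cluster center Q_k
    sigma:     scalar margin Sigma_k
    """
    sq_dist = np.sum((pixel_emb - center) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Pixels at distance 0, 1 and 5 from the center.
p = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
phi = gaussian_membership(p, np.zeros(2), sigma=1.0)
print(phi)
```

A pixel whose embedding coincides with the center gets probability 1, and the probability falls off smoothly with distance, with `sigma` controlling how quickly.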

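The overall mask loss no longer uses the hinge loss of Equation (2). It takes the following form instead:

[Image: Equation (4)]

$L(\cdot)$ is a binary classification loss function
$\phi(x_i,S_k)$ is the probability that pixel $x_i$ belongs to the GT mask $S_k$
$\mathbb{G}(x_i,S_k)$ represents the ground truth label for pixel $x_i$ to judge whether it is in the mask of the proposal $S_k$, which is a binary value

In other words, the network has to learn $\Sigma$ rather than use a fixed margin. In practice it learns $\frac{1}{2\sigma^2}$, and an exponential function ensures that the predicted value is always positive.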
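One way to picture this parameterisation is shown below. It is an assumption-laden sketch: `raw_margin` stands for an unconstrained network output and `membership_from_raw_margin` is a hypothetical helper, not a function from the EmbedMask repository.

```python
import numpy as np

def membership_from_raw_margin(pixel_emb, center, raw_margin):
    """Gaussian membership where the network predicts an unconstrained value
    and exp() turns it into a positive 1 / (2 * sigma^2)."""
    inv_two_sigma_sq = np.exp(raw_margin)  # always > 0
    sq_dist = np.sum((pixel_emb - center) ** 2, axis=-1)
    return np.exp(-sq_dist * inv_two_sigma_sq)

p = np.array([[1.0, 0.0]])  # one pixel at distance 1 from the center
wide = membership_from_raw_margin(p, np.zeros(2), raw_margin=-2.0)   # large sigma
narrow = membership_from_raw_margin(p, np.zeros(2), raw_margin=2.0)  # small sigma
print(wide, narrow)
```

A smaller raw output corresponds to a wider margin, so the same pixel receives a higher membership probability.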

4.3 Smooth Loss

$Q_k$ and $\Sigma_k$ are computed as follows:

[Image: definitions of $Q_k$ and $\Sigma_k$]

$M_k$ is the set of positive pixels: those whose predicted bbox has IoU > 0.5 with the GT bbox
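Since $Q_k$ and $\Sigma_k$ are described as means over the positive set $M_k$, a minimal sketch of the computation could be the following. The array layout and the helper name `instance_center_and_margin` are assumptions.

```python
import numpy as np

def instance_center_and_margin(prop_emb, prop_sigma, positive):
    """Average the proposal embeddings and margins over the positive set M_k.

    prop_emb:   (N, d) proposal embeddings q_j
    prop_sigma: (N,) proposal margins sigma_j
    positive:   (N,) bool array, True for locations in M_k
    """
    Q_k = prop_emb[positive].mean(axis=0)
    Sigma_k = prop_sigma[positive].mean()
    return Q_k, Sigma_k

q = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]])
sigma = np.array([0.5, 1.5, 9.0])
positive = np.array([True, True, False])  # the third location is not in M_k
Q_k, Sigma_k = instance_center_and_margin(q, sigma, positive)
print(Q_k, Sigma_k)
```

The third location is ignored because it is not a positive sample, so it does not distort the cluster center.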

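Note that during training $S_k$ is the ground truth, and it is used in two places:

when selecting the positive region for computing $Q_k$ and $\Sigma_k$ (locations inside the GT mask with IoU > 0.5)
when computing the binary loss of Equation (4)

At test time $S_k$ is unknown, so $Q_k$ and $\Sigma_k$ cannot be computed. The authors therefore replace $Q_k$ and $\Sigma_k$ in Equation (3) with the $q_j$ and $\sigma_j$ of the current location.

As a result, $Q_k$ and $\Sigma_k$ differ between training and testing (the mean embedding over a region during training, the embedding at a single location at test time). The authors use the following loss to mitigate this mismatch:

[Image: the smooth loss]

It encourages the embedding at each location to stay close to its cluster center.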
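The idea can be sketched for a single instance as follows. The exact form used in the paper is not visible in this post, so the squared-distance penalty and the equal weighting of the two terms are assumptions, as is the name `smooth_loss`.

```python
import numpy as np

def smooth_loss(prop_emb, prop_sigma, positive):
    """Penalise the spread of q_j and sigma_j around their means over M_k
    (single-instance sketch; squared distances are an assumption)."""
    q = prop_emb[positive]
    s = prop_sigma[positive]
    Q_k = q.mean(axis=0)
    Sigma_k = s.mean()
    emb_term = np.sum((q - Q_k) ** 2, axis=1).mean()  # spread of the embeddings
    margin_term = ((s - Sigma_k) ** 2).mean()         # spread of the margins
    return emb_term + margin_term

q = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]])
sigma = np.array([0.5, 1.5, 9.0])
positive = np.array([True, True, False])
loss = smooth_loss(q, sigma, positive)
print(loss)
```

The loss reaches zero when every positive location predicts the same embedding and margin, which is exactly the situation in which using a single location at test time matches the training-time mean.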

4.4 Training

When computing the loss, the feature maps and embeddings are resized to 1/4 of the input image's height and width.
[Image: the overall loss]

where $\lambda_1 = 0.5$ and $\lambda_2 = 0.1$.

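1) Training Samples for Box and Classification

Based on FCOS.

For $\{box_j, class_j, center_j\}$, positive samples are defined as locations that lie on the center region of the ground-truth bounding box and also fall inside the GT mask.

2) Training Samples for Proposal Embedding and Margin

Positive samples are defined as locations whose predicted bbox has IoU > 0.5 with the GT bbox (and that lie inside the GT mask).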
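The selection rule for proposal embedding and margin samples can be illustrated with a small IoU computation. This is a sketch under assumptions: boxes are taken to be `(x1, y1, x2, y2)` arrays, and `box_iou` and `positive_proposals` are hypothetical helper names.

```python
import numpy as np

def box_iou(boxes, gt_box):
    """IoU between each predicted box (N, 4) and one GT box (4,), format x1, y1, x2, y2."""
    x1 = np.maximum(boxes[:, 0], gt_box[0])
    y1 = np.maximum(boxes[:, 1], gt_box[1])
    x2 = np.minimum(boxes[:, 2], gt_box[2])
    y2 = np.minimum(boxes[:, 3], gt_box[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return inter / (area + gt_area - inter)

def positive_proposals(pred_boxes, gt_box, inside_gt_mask, iou_thresh=0.5):
    """A location is positive when its predicted box has IoU above the
    threshold and the location itself lies inside the GT mask."""
    return (box_iou(pred_boxes, gt_box) > iou_thresh) & inside_gt_mask

gt = np.array([0.0, 0.0, 10.0, 10.0])
pred = np.array([[0.0, 0.0, 10.0, 10.0],    # IoU 1.0
                 [0.0, 0.0, 10.0, 5.0],     # IoU 0.5, not strictly above the threshold
                 [20.0, 20.0, 30.0, 30.0],  # no overlap
                 [1.0, 1.0, 9.0, 9.0]])     # IoU 0.64, but the location is outside the mask
inside = np.array([True, True, False, False])
positives = positive_proposals(pred, gt, inside)
print(positives)
```

Only the first location satisfies both conditions in this toy example.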

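3) Training Samples for Pixel Embedding

The supervised samples are the pixels that fall inside the GT bbox, with those inside the GT mask being the positives. The experiments show that expanding the bbox, which adds more training samples (negatives), works better.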

4.5 Inference

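For each bbox kept after NMS (all of them are treated as positive by default), the $q_j$ and $\sigma_j$ at its location are used in place of $Q$ and $\Sigma$, and the probability that $x_i$ belongs to $S_k$ is computed with the following formula:

[Image: Equation (3) with $q_j$ and $\sigma_j$ substituted]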
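Putting the inference step together, a minimal sketch might look like this. The probability threshold of 0.5, the restriction of the mask to the proposal's box, and the name `predict_mask` are all assumptions made for illustration; the Gaussian form is inferred from the notes above.

```python
import numpy as np

def predict_mask(pixel_emb, q_j, sigma_j, box, prob_thresh=0.5):
    """Binary mask for one proposal kept after NMS (illustrative sketch).

    pixel_emb: (H, W, d) pixel embeddings
    q_j:       (d,) proposal embedding at the proposal's location
    sigma_j:   scalar proposal margin at the same location
    box:       (x1, y1, x2, y2) integer pixel coordinates, end-exclusive
    """
    height, width, _ = pixel_emb.shape
    sq_dist = np.sum((pixel_emb - q_j) ** 2, axis=-1)
    prob = np.exp(-sq_dist / (2.0 * sigma_j ** 2))
    x1, y1, x2, y2 = box
    inside_box = np.zeros((height, width), dtype=bool)
    inside_box[y1:y2, x1:x2] = True  # keep only pixels inside the proposal box
    return (prob > prob_thresh) & inside_box

# 4x4 toy image: a 2x2 patch shares the proposal's embedding, the rest does not.
emb = np.zeros((4, 4, 2))
emb[1:3, 1:3] = 1.0
mask = predict_mask(emb, q_j=np.ones(2), sigma_j=1.0, box=(0, 0, 3, 3))
print(mask.astype(int))
```

No clustering is needed at this stage: each surviving proposal directly supplies its own center and margin.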

5 Experiments

5.1 Datasets

MS COCO
- trainval35k split (115K images) for training,
- minival split (5K images) for ablation study
- test-dev (20K images) for reporting the main results

5.2 Main Results

1) Quantitative Results

[Image: comparison table]
The best among one-stage methods.

2) Qualitative Results
[Image: qualitative comparison]
Left: Mask R-CNN. Right: EmbedMask, which can provide more detailed masks than Mask R-CNN with sharper edges (there is no re-pooling to cause missing detail).

5.3 Ablation Study

1) Fixed vs. Learnable Margin

[Image: ablation table]

Fixed-margin settings: $\delta_a = 0.5$, $\delta_b = 0.8$, $\delta = 1.5$

The learned margin performs better.

2) The Choice of Cluster Centers

Using $p_j$ as the cluster center instead of $Q_k$ gives the following results:

[Image: ablation table]

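3) Sampling Strategy

That is, the strategy for sampling positives.

$\{box_j, class_j, center_j\}$: whether the location falls in the center region

$\{q_j,\sigma_j\}$: IoU > 0.5
[Image: ablation table]
Combining the two conditions (inside the mask and IoU > 0.5) works better.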

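4) Training Samples for Pixel Embedding
[Image: ablation table]
Positive samples are the pixels inside the GT mask. Enlarging the bbox by a factor of 1.2 adds more training samples (negatives) and works better.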
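The 1.2x enlargement can be sketched as below. Expanding about the box center and omitting any clipping to the image bounds are assumptions of this example, and `expand_box` is a hypothetical helper.

```python
def expand_box(box, scale=1.2):
    """Enlarge a (x1, y1, x2, y2) box about its center by `scale`
    (center-preserving expansion is an assumption, no clipping applied)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

# A 10 x 20 box becomes 12 x 24 around the same center.
expanded = expand_box((10.0, 10.0, 20.0, 30.0))
print(expanded)
```

The extra border pixels lie outside the GT mask, so they act as additional negatives for the pixel embedding.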

5) Embedding Dimension
[Image: ablation table]

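6 Conclusion (own)

For the proposal margin, what is actually learned is $\frac{1}{2\sigma^2}$, and an exponential function ensures that the predicted value is always positive.
On lowercase $q$ versus uppercase $Q$, and lowercase $\sigma$ versus uppercase $\Sigma$: lowercase denotes the embedding at each location, while uppercase denotes the mean embedding over the positive region (bbox IoU > 0.5).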

【EmbedMask】《EmbedMask：Embedding Coupling for One-stage Instance Segmentation》
