MyDLNote-360Camera: Panoramic Image Depth Estimation Combining the Equirectangular and Cubemap Projections (CVPR 2020, BiFuse)
BiFuse: Monocular 360◦ Depth Estimation via Bi-Projection Fusion
Fig. 1. Our BiFuse network estimates the 360◦ depth from a monocular image using both equirectangular and cubemap projections. A bi-projection fusion component is proposed to leverage both projections inspired by both peripheral and foveal vision of the human eye. Given the estimated 360◦ depth, a complete 3D point cloud surrounding the camera can be generated to serve downstream applications.
Abstract
Depth estimation from a monocular 360◦ image is an emerging problem that gains popularity due to the availability of consumer-level 360◦ cameras and the complete surrounding sensing capability.
While the standard of 360◦ imaging is under rapid development, we propose to predict the depth map of a monocular 360◦ image by mimicking both peripheral and foveal vision of the human eye.
To this end, we adopt a two-branch neural network leveraging two common projections: equirectangular and cubemap projections.
In particular, equirectangular projection incorporates a complete field-of-view but introduces distortion, whereas cubemap projection avoids distortion but introduces discontinuity at the boundary of the cube. Thus we propose a bi-projection fusion scheme along with learnable masks to balance the feature map from the two projections. Moreover, for the cubemap projection, we propose a spherical padding procedure which mitigates discontinuity at the boundary of each face.
We apply our method to four panorama datasets and show favorable results against the existing state-of-the-art methods.
Sentence 1, background and significance: depth estimation from a monocular 360° image is an emerging problem, driven by consumer-level 360° cameras and their ability to sense the complete surroundings.
Sentence 2, motivation: by mimicking the peripheral and foveal vision of the human eye, the paper proposes a method for depth prediction from a monocular 360° image.
Sentence 3, strategy: depth prediction is achieved by leveraging two common projections, the equirectangular projection and the cubemap projection.
Sentences 4-6, the concrete method: the equirectangular projection covers a complete field-of-view but introduces distortion, whereas the cubemap projection avoids distortion but introduces discontinuity at the cube boundaries. The paper therefore proposes a bi-projection fusion scheme with learnable masks to balance the feature maps from the two projections, and, for the cubemap projection, a spherical padding procedure to mitigate the discontinuity at the boundary of each face. (Algorithmic contributions: bi-projection fusion and spherical padding.)
Sentence 7, experimental results.
Introduction
Inferring 3D structure from 2D images has been widely studied due to numerous practical applications. For instance, it is crucial for autonomous systems like self-driving cars and indoor robots to sense the 3D environment since they need to navigate safely in 3D. Among several techniques for 3D reconstruction, significant improvement has been achieved in monocular depth estimation due to the advance of deep learning and availability of large-scale 3D training data. For example, FCRN [16] achieves monocular depth estimation by their proposed up-projection module. However, most of the existing methods are designed for a camera with normal field-of-view (FoV). As 360◦ camera becomes more and more popular in recent years, the ability to infer the 3D structure of a camera’s complete surrounding has motivated the study of monocular 360◦ depth estimation.
Background: this background paragraph is well written. It builds the story from practical needs for 3D applications, introduces depth estimation as an important technique for 3D reconstruction, and then arrives at the point: monocular 360° depth estimation matters. The line of reasoning is very clear.
In this paper, we propose an end-to-end trainable neural network leveraging two common projections – equirectangular and cubemap projection – as inputs to predict the depth map of a monocular 360◦ image. Our main motivation is to combine the capability from both peripheral and foveal vision like the human eye (see Fig. 1 for the illustration). Note that, equirectangular projection provides a wide field-of-view mimicking a peripheral vision, whereas cubemap projection provides a smaller but non-distorted field-of-view mimicking the foveal vision.
On the one hand, equirectangular projection allows all surrounding information to be observed from a single 2D image but introduces distortion. On the other hand, cubemap projection avoids distortion but introduces discontinuity at the boundary of the cube. Considering that the two projections are complementary to each other, we refer to our method as BiFuse.
This paper uses the equirectangular and cubemap projections as inputs to predict the depth map of a monocular 360° image. The main motivation is to combine peripheral and foveal vision the way the human eye does (see Fig. 1). The equirectangular projection provides a panoramic field-of-view that mimics peripheral vision, while the cubemap projection provides a distortion-free view that mimics foveal vision. (This is the core idea of the paper.)
On the one hand, the equirectangular projection lets all surrounding information be observed in a single 2D image, but introduces distortion. On the other hand, the cubemap projection avoids distortion but introduces discontinuity at the cube boundaries. Since the two projections complement each other, the paper proposes BiFuse.
However, the FoV of the foveal vision could be too small, which degrades the effectiveness of our fusion scheme (Fig. 2). To tackle this issue, cube padding (CP) methods [26, 4] have been proposed to expand field-of-view from neighboring faces on the cube. Nevertheless, using cube padding may result in geometric inconsistency at the boundary that introduces non-negligible distortion effect. Therefore, we propose spherical padding (SP) which pads the boundary by considering the spherical geometry and reduces the boundary inconsistency. Finally, instead of naively combining features of both branches (e.g., [31]), we propose a bi-projection fusion procedure with learnable masks to balance the information shared between two projections. The source code and pretrained models are available to the public.
Fig.2. Field-of-view (FoV) comparison.
Equirectangular projection has the largest FoV compared to each face on the cubemap projection with (solid-line) or without (dash-line) the proposed spherical padding.
As an additional refinement, the paper proposes spherical padding to alleviate the discontinuities that cube padding leaves in cubemap feature maps.
For fusing the equirectangular and cubemap features, a mask is learned so that the two kinds of features are fused selectively.
We apply our method to four panorama datasets: Matterport3D [3], PanoSUNCG [26], 360D [38] and Stanford2D3D [1]. Our experimental results show that the proposed method performs favorably against the current state-of-the-art (SOTA) methods. In addition, we present extensive ablation study for each of the proposed modules, including the spherical padding and fusion schemes.
Datasets and experimental results. The datasets are worth noting. An earlier post of mine describes how to download Matterport3D [3], PanoSUNCG [26] and Stanford2D3D [1]; see the post "Panoramic (360° camera) image datasets: 3D60 Dataset download steps (detailed)".
[3] Matterport3D: Learning from rgb-d data in indoor environments. In International Conference on 3D Vision (3DV), 2017
[26] Self-supervised learning of depth and camera motion from 360◦ videos. In Asian Conference on Computer Vision (ACCV), 2018
[38] Omnidepth: Dense depth estimation for indoors spherical panoramas. In European Conference on Computer Vision (ECCV), 2018
[1] Joint 2d-3d-semantic data for indoor scene understanding. CoRR, 2017
E2C and C2E
For a cubemap representation with sides of equal length w, we denote its six faces as f_i, i ∈ {B, D, F, L, R, U}, corresponding to the ones on the back, down, front, left, right and up, respectively. Each face can be treated as the image plane of an independent camera with focal length w/2, in which all these cameras share the same center of projection (i.e., the center of the cube) but with different poses. When we set the origin of the world coordinate system to the center of the cube, the extrinsic matrix of each camera coordinate system can be simply defined by a rotation matrix and zero translation. Given a pixel p_i on the image plane f_i with its coordinate (x, y, z) on the corresponding camera system, where 0 ≤ x, y ≤ w−1 and z = w/2, we can transform it into the equirectangular representation by a simple mapping:

θ = arctan(q_x / q_z), φ = arcsin(q_y / ‖q‖), with q = R_i p_i,    (1)

where θ and φ are the longitude and latitude in equirectangular projection; R_i is the rotation matrix of the camera corresponding to face f_i; and q_x, q_y, q_z are the x, y, z components of q, respectively. As this mapping is reversible, we are able to easily perform both equirectangular-to-cube and cube-to-equirectangular transformations, which are denoted as E2C and C2E, respectively. A more detailed illustration is shown in the supplementary material.
Only the formula for mapping a single face image to the equirectangular image is given here; the inverse transform is not written out. The paper says it is in the supplementary material, but neither CVF Open Access nor the project page actually provides that supplementary material.
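To make the mapping concrete, here is a minimal NumPy sketch of the face-pixel to equirectangular-angle computation described above; it is a sketch under assumptions, not the authors' code. The face rotation matrices, the centering of the pixel coordinates, and the function name are mine.

```python
import numpy as np

# Hypothetical world-from-camera rotations for two of the six faces; the exact
# face orientations used in BiFuse are an assumption here.
FACE_ROTATIONS = {
    "F": np.eye(3),                                            # front: camera looks along +z
    "R": np.array([[0, 0, 1], [0, 1, 0], [-1, 0, 0]], float),  # right: rotated 90° about y
}

def face_pixel_to_lonlat(x, y, w, face="F"):
    """Map pixel (x, y) on a cube face of side length w to (longitude, latitude).

    Follows Eq. (1): lift the pixel to a 3D ray in the face camera frame
    (z = w/2), rotate it into the world frame, and convert to spherical angles.
    """
    # Ray in the face camera frame; centering the pixel is an assumption.
    p = np.array([x - (w - 1) / 2.0, y - (w - 1) / 2.0, w / 2.0])
    q = FACE_ROTATIONS[face] @ p
    lon = np.arctan2(q[0], q[2])               # theta (longitude)
    lat = np.arcsin(q[1] / np.linalg.norm(q))  # phi (latitude)
    return lon, lat

print(face_pixel_to_lonlat(128, 128, 256, "F"))  # center pixel of the front face -> (~0, ~0)
```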
Spherical Padding
Due to the distortion in the equirectangular projection, directly learning a typical convolutional neural network to perform monocular depth estimation on equirectangular images would lead to unstable training process and unsatisfying prediction [4]. In contrast, the cubemap representation suffers less from distortion but instead produces large errors due to the discontinuity across the boundaries of each face [4, 26]. In order to resolve this issue for cubemap projection, Cheng et al. [4] propose the cube padding (CP) approach to utilize the connectivity between faces on the cube for image padding. However, solely padding the feature map of a face by using the features from its neighboring faces does not follow the characteristic of perspective projection. Therefore, here we propose the spherical padding (SP) method, which pads the feature according to spherical projection. As such, we can connect each face with the geometric relationship. A comparison between the cube padding [4] and our proposed spherical padding is illustrated in Fig. 3.
Figure 3. Spherical padding vs. cube padding. Cube padding directly pads the feature of the connected faces. In addition to obvious inconsistency at the boundary, the values of four corners are undefined. In [4], the values are only chosen by the closest side. In our proposed spherical padding, the padding area is calculated with spherical projection. As a result, both the missing corner and inconsistency at the boundary can be addressed.
This explains why SP is proposed and gives the intuition behind it. To make the edges of each cubemap face continuous, CP directly copies a strip from the neighboring faces, but it ignores the geometry: the four corner regions are left empty and inconsistent. SP fills in those corners as well. How is that done, and what are the benefits? The following text explains.
The most straightforward way to apply spherical padding for cubemap is to first transform all the faces into a unified equirectangular image by C2E. Then, we extend the original FoV σ = 90◦ to σ′, and map it back to the cubemap by E2C. As a result, we can pad them on each face completely without missing parts (i.e., undefined areas in cube padding of Fig. 3) and with consistent geometry. Specifically, given a cubemap with side length w and FoV σ = 90◦, the C2E transformation is identical to the inverse calculation of (1). When we apply spherical padding with padding size γ, which is determined by the padding size in the convolution layer (e.g., γ = 1 for a 3×3 convolution layer), we update the side length of a cube face to w′ = w + 2γ, and the corresponding FoV becomes σ′ = 2 arctan((w/2 + γ)/(w/2)) after padding, as illustrated in Fig. 5. Hence, for mapping from equirectangular image back to the padded cubemap, we should use both w′ and σ′ to derive the correct E2C transformation for spherical padding.
Figure 5. The cubemap with side length w and padding size γ. We keep the focal length the same (0.5w) and calculate a new FoV σ′.
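As a quick sanity check of the padded side length and FoV formulas above, here is a tiny Python sketch (the function name and example values are mine):

```python
import math

def padded_cube_face(w, gamma):
    """Return the padded side length w' and FoV sigma' (in degrees) of a cube face,
    following w' = w + 2*gamma and sigma' = 2*arctan((w/2 + gamma)/(w/2)),
    with the focal length kept at w/2."""
    w_padded = w + 2 * gamma
    fov_padded = 2 * math.degrees(math.atan((w / 2 + gamma) / (w / 2)))
    return w_padded, fov_padded

# Example: a 256x256 face padded for a 3x3 convolution (gamma = 1).
print(padded_cube_face(256, 1))  # -> (258, ~90.45 degrees); gamma = 0 gives exactly 90.
```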
Efficient Transformation. We have described the overall concept of our spherical padding. However, the above procedure consists of both C2E and E2C transformations, which could require heavy computational cost. Therefore, we simplify this procedure by deriving a direct mapping function between two cube faces. Given two cube faces f_i and f_j, we first denote the geometric transformation between their camera coordinate systems as a rotation matrix R_ij. Then the mapping from a pixel p_i in f_i to f_j can be established upon the typical projection model of pinhole cameras:

(x, y) = (w/2) · (q_x / q_z, q_y / q_z), with q = R_ij p_i,    (2)

where (x, y) represents the 2D location of p_i after being mapped onto the image plane of f_j. Since this mapping only needs to be computed once for all the pixels on the padding region, the computational cost of applying spherical padding is comparable with cube padding, without any E2C or C2E transformation included.
First, the paper describes the most straightforward way to do SP: convert the cubemap to an equirectangular image, and then, when mapping the equirectangular image back to a cubemap, enlarge σ appropriately, as in Fig. 5. How much σ is enlarged depends on the padding size of the convolution kernel, and the authors give the formula for this angle. The drawback is that it requires both the C2E and E2C transforms, which is costly. The authors therefore derive a direct transformation that avoids C2E and E2C entirely (see the sketch below). I will not dig into the details here; they can be studied when actually needed.
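Below is a minimal NumPy sketch of that direct face-to-face mapping, i.e., Eq. (2) as I read it. The rotation convention, the pixel centering, and the example face orientation are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def map_pixel_between_faces(u, v, R_ij, w):
    """Directly map pixel (u, v) on face f_i to the image plane of face f_j,
    skipping the C2E/E2C round trip (cf. Eq. (2)).

    R_ij is the rotation from the f_i camera frame to the f_j camera frame; the
    convention and the pixel centering below are assumptions for illustration.
    """
    # Lift the pixel to a 3D ray in the f_i frame (focal length w/2).
    p = np.array([u - (w - 1) / 2.0, v - (w - 1) / 2.0, w / 2.0])
    q = R_ij @ p
    # Pinhole projection onto the f_j image plane, then back to pixel coordinates.
    x = (w / 2.0) * q[0] / q[2] + (w - 1) / 2.0
    y = (w / 2.0) * q[1] / q[2] + (w - 1) / 2.0
    return x, y

# Example (assumed face orientation): a pixel on the right edge of the front face
# lands just outside the left edge of the right face, i.e., in its padding band.
R_front_to_right = np.array([[0, 0, -1],
                             [0, 1,  0],
                             [1, 0,  0]], dtype=float)
print(map_pixel_between_faces(255, 127, R_front_to_right, 256))  # x close to -1
```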
BiFuse Network
Overview
Overall, our model consists of two encoder-decoder branches which take the equirectangular image and cubemap as input, respectively, where we denote the equirectangular branch as Be and the cubemap one as Bc. As mentioned in Sec. 1, each branch has its benefit but also suffers from some limitations. To jointly learn a better model while sharing both advantages, we utilize a bi-projection fusion block that bridges the information across two branches, which will be described in the following. To generate the final prediction, we first convert the prediction of cubemap to the equirectangular view and adopt a convolution module to combine both predictions.
Figure 4. The proposed BiFuse Network. Our network consists of two branches Be and Bc. The input of Be is an RGB equirectangular image, while Bc takes the corresponding cubemap as input. We replace the first convolution layer in Be with a Pre-Block [38, 23]. For the decoder, we adopt up-projection [16] modules. For each convolution and up-projection layer in Bc, we apply our spherical padding to connect feature maps of six faces. Most importantly, between feature maps from Be and Bc, we use the proposed bi-projection fusion module to share information between two feature representations. Finally, we add a Conv module [24] to unify two depth predictions from Be and Bc.
Viewed as a whole, the BiFuse network has two branches: one estimates depth from the equirectangular image and the other from the cubemap. Each has its own strengths and weaknesses; to let the strengths of one branch benefit the other, a Bi-Projection Fusion module passes information between the corresponding encoder/decoder blocks of the two branches. Finally, the depth map estimated by the cubemap branch is converted to the equirectangular domain, concatenated with the depth map from the equirectangular branch, and passed through a convolution to produce the final result (a minimal sketch of this output step follows below).
Bi-Projection Fusion is therefore the key component of the network.
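Here is a minimal PyTorch-style sketch of that final combination step, purely to illustrate the data flow; the c2e operator, channel counts, and kernel size are my assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FinalFusionHead(nn.Module):
    """Combine the two depth predictions into one equirectangular depth map.

    Sketch under assumptions: the cubemap prediction is first converted to the
    equirectangular view via a user-supplied `c2e` callable, then both maps are
    concatenated and passed through a small convolution.
    """
    def __init__(self, c2e):
        super().__init__()
        self.c2e = c2e                                   # cubemap -> equirectangular resampling
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, depth_equi, depth_cube):
        depth_cube_equi = self.c2e(depth_cube)           # bring both maps into the same view
        fused = torch.cat([depth_equi, depth_cube_equi], dim=1)
        return self.conv(fused)                          # final 360° depth prediction
```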
Bi-Projection Fusion
How do we fuse features from two different domains? The simplest option is of course a plain concatenation. But that is not optimal; the network's effectiveness still has room for improvement. The fusion module proposed here is quite clever, and the idea is worth borrowing.
To encourage the information shared across two branches, we empirically find that directly combining feature maps [31] from Be and Bc would result in unstable gradients and training procedure, and thus it is key to develop a fusion scheme to balance two branches. Inspired by the recent works in multi-tasking [5, 36], we focus on balancing the feature map from two different representations. To achieve this goal, we propose a bi-projection fusion module H: given feature maps he and hc from Be and Bc in each layer respectively, we estimate the corresponding feature maps h′e = He(he) and h′c = Hc(C2E(hc)), where He and Hc indicate a convolution layer.
To produce feature maps that benefit both branches, we first concatenate h′e and h′c , and then pass it to a convolution layer with the sigmoid activation to estimate a mask M to balance the fusion procedure. Finally, we generate feature maps h¯e and h¯c as the input to the next layer as:
Note that we use C2E and E2C operations in the fusion procedure to ensure that features and the mask M are in the same projection space.
First, when fusing features from two domains, they must be brought into a single common domain, either some intermediate domain or one of the two; only once the domains agree is the fusion effective. Here, the cubemap features are converted into the equirectangular domain.
Then, with both features in the same domain, they can legitimately be concatenated.
Next, the concatenated features pass through a convolution layer and a sigmoid to learn a mask M. This is clearly an attention mechanism.
Finally, following the attention formula (3), new, enhanced features are obtained for each domain. Note that the learned feature component lives in the equirectangular domain; for the cubemap branch it has to be transformed back to the cubemap domain. (A hedged sketch of such a fusion block is given below.)
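To make the data flow concrete, here is a minimal PyTorch-style sketch of such a bi-projection fusion block. The c2e/e2c operators are assumed to be provided, and the mask-weighted residual form used below is my reading of the description (the exact Eq. (3) is not reproduced in this post), so treat it as a sketch rather than the official implementation.

```python
import torch
import torch.nn as nn

class BiProjectionFusion(nn.Module):
    """Exchange information between equirectangular and cubemap feature maps.

    Sketch under assumptions: `c2e` / `e2c` convert feature maps between the two
    projections; the mask-weighted residual exchange below is one plausible
    reading of Eq. (3), not a verified reproduction of the paper's formula.
    """
    def __init__(self, channels, c2e, e2c):
        super().__init__()
        self.c2e, self.e2c = c2e, e2c
        self.proj_e = nn.Conv2d(channels, channels, 3, padding=1)   # H_e
        self.proj_c = nn.Conv2d(channels, channels, 3, padding=1)   # H_c
        self.mask_conv = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.Sigmoid(),                                           # mask M in [0, 1]
        )

    def forward(self, h_e, h_c):
        # Bring both features into the equirectangular domain before mixing.
        hp_e = self.proj_e(h_e)                  # h'_e
        hp_c = self.proj_c(self.c2e(h_c))        # h'_c
        m = self.mask_conv(torch.cat([hp_e, hp_c], dim=1))
        # Mask-weighted residual exchange (assumed form).
        h_e_new = h_e + m * hp_c
        h_c_new = h_c + self.e2c(m * hp_e)       # convert back to the cubemap domain
        return h_e_new, h_c_new
```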
[5] Segflow: Joint learning for video object segmentation and optical flow. In IEEE International Conference on Computer Vision (ICCV), 2017 [github]
[36] Joint task-recursive learning for semantic segmentation and depth estimation. In European Conference on Computer Vision (ECCV), 2018
Loss Function
We adopt the reverse Huber loss [16] as the objective function for optimizing predictions from both Be and Bc:
The overall objective function is then written as:
where De and Dc are the predictions produced by Be and Bc, respectively; DGT is the ground truth depth in the equirectangular representation; and P indicates all pixels with a valid depth value in the ground truth map. We note that the C2E operation is required to convert Dc into the equirectangular form before computing the loss.
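For reference, here is a minimal PyTorch sketch of a reverse Huber (berHu) loss in the spirit of [16]; the threshold choice (20% of the maximum absolute residual, a common convention) and the equal weighting of the two branches are assumptions, since the exact equations are not reproduced in this post.

```python
import torch

def berhu_loss(pred, target, valid_mask):
    """Reverse Huber (berHu) loss over pixels with valid ground truth:
    L1 below a threshold c, quadratic above it. The choice c = 0.2 * max
    residual is a common convention, assumed here."""
    diff = (pred - target).abs()[valid_mask]
    c = 0.2 * diff.max().clamp(min=1e-6)
    return torch.where(diff <= c, diff, (diff ** 2 + c ** 2) / (2 * c)).mean()

def total_loss(d_e, d_c, d_gt, valid_mask, c2e):
    """Assumed overall objective: equal-weight sum over both branches, with the
    cubemap prediction first converted to the equirectangular view via `c2e`."""
    return berhu_loss(d_e, d_gt, valid_mask) + berhu_loss(c2e(d_c), d_gt, valid_mask)
```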
This is straightforward: the reverse Huber loss is defined first, and the losses of the two branches with respect to the ground-truth depth DGT are then computed. I will not describe it further here; see the references:
[16] Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision (3DV), 2016
[36] Joint task-recursive learning for semantic segmentation and depth estimation. In European Conference on Computer Vision (ECCV), 2018