2024-08-29-SEA-RAFT-中英對照

cold_moon發表於2024-09-20

SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow

SEA-RAFT:簡單、高效、準確的光流RAFT演算法

Yihan Wang, Lahav Lipson, and Jia Deng

王一涵,Lahav Lipson,和Jia Deng

Department of Computer Science, Princeton University {yw7685, llipson, jiadeng}@princeton.edu

普林斯頓大學電腦科學系 {yw7685, llipson, jiadeng}@princeton.edu

Abstract. We introduce SEA-RAFT, a more simple, efficient, and accurate RAFT for optical flow. Compared with RAFT, SEA-RAFT is trained with a new loss (mixture of Laplace). It directly regresses an initial flow for faster convergence in iterative refinements and introduces rigid-motion pre-training to improve generalization. SEA-RAFT achieves state-of-the-art accuracy on the Spring benchmark with a 0.36 endpoint-error (EPE) and a 3.69 1-pixel outlier rate (1px), representing \({22.9}\%\) and \({17.8}\%\) error reduction from the best published results. In addition, SEA-RAFT obtains the best cross-dataset generalization on KITTI and Spring. With its high efficiency, SEA-RAFT operates at least \({2.3} \times\) faster than existing methods while maintaining competitive performance. The code is publicly available at https://github.com/princeton-vl/SEA-RAFT.

摘要。我們介紹了SEA-RAFT,一種更簡單、高效和準確的光流RAFT演算法。與RAFT相比,SEA-RAFT採用了一種新的損失函式(拉普拉斯混合)進行訓練。它直接回歸初始流,以加快迭代細化中的收斂速度,並引入剛體運動預訓練以提高泛化能力。SEA-RAFT在Spring基準測試中達到了最先進的準確性,端點誤差(EPE)為0.36,1畫素異常率(1px)為3.69,相比已發表的最佳結果分別減少了\({22.9}\%\)和\({17.8}\%\)的誤差。此外,SEA-RAFT在KITTI和Spring資料集上獲得了最佳的跨資料集泛化效能。憑藉其高效率,SEA-RAFT比現有方法至少快\({2.3} \times\),同時保持了具有競爭力的效能。程式碼已公開發布在https://github.com/princeton-vl/SEA-RAFT。

Introduction

引言

Optical flow is a fundamental task in low-level vision and aims to estimate per-pixel 2D motion between video frames. It is useful for various downstream tasks including action recognition [39,49,67], video in-painting [10,22,60], frame interpolation [15,27,61], 3D reconstruction and synthesis [33,69].

光流是低階視覺中的一個基本任務,旨在估計影片幀之間的每畫素2D運動。它在包括動作識別 [39,49,67]、影片修復 [10,22,60]、幀插值 [15,27,61]、3D重建與合成 [33,69] 等多種下游任務中非常有用。

Although traditionally formulated as an optimization problem [5,13,62], almost all recent methods are based on deep learning [6,8,11,14,24,29,42-45,48,50,54-57,63,66,68]. In particular, many state-of-the-art methods [14,29,43,44,50,66] have adopted architectures based on RAFT [50], which uses a recurrent network to iteratively refine a flow field.

儘管傳統上被表述為一個最佳化問題 [5,13,62],但幾乎所有最近的方法都基於深度學習 [6,8,11,14,24,29,42-45,48,50,54-57,63,66,68]。特別是,許多最先進的方法 [14,29,43,44,50,66] 採用了基於RAFT [50] 的架構,該架構使用迴圈網路來迭代細化流場。

In this paper, we introduce SEA-RAFT, a new variant of RAFT that is more efficient and accurate. When compared against all existing approaches, SEA-RAFT has the best accuracy-efficiency Pareto frontier (Fig. 1):

在本文中,我們介紹了 SEA-RAFT,這是一種 RAFT 的新變體,效率更高且更準確。與所有現有方法相比,SEA-RAFT 具有最佳的準確性-效率帕累託前沿(圖 1):

  • Accuracy: On Spring [35], SEA-RAFT achieves a new state of the art, outperforming the next best by a large margin: \({18}\%\) error reduction on \(1\mathrm{{px}}\) -outlier rate (3.686 vs. 4.482) and \({24}\%\) error reduction on endpoint-error (0.363 vs. 0.471). On Sintel [3] and KITTI [36], it outperforms all other methods that have similar computational costs.

  • 準確性:在 Spring [35] 上,SEA-RAFT 達到了新的技術水平,大幅超越次優方法:在 \(1\mathrm{{px}}\) 異常率上減少 \({18}\%\) 的誤差(3.686 對比 4.482),在端點誤差上減少 \({24}\%\)(0.363 對比 0.471)。在 Sintel [3] 和 KITTI [36] 上,它優於所有具有相似計算成本的其他方法。

  • Efficiency: On each benchmark tested, SEA-RAFT runs at least \({2.3} \times\) faster than existing methods that have comparable accuracy. Our smallest model, which still outperforms all other methods on Spring, can run at 21fps when processing 1080p images on an RTX3090, \(3 \times\) faster than the original RAFT.

  • 效率:在每個測試基準上,SEA-RAFT 的執行速度至少比具有可比準確性的現有方法快 \({2.3} \times\)。我們最小的模型在 Spring 上仍然優於所有其他方法,在 RTX3090 上處理 1080p 影像時可以達到 21fps,比原始 RAFT 快 \(3 \times\)。

Fig. 1: Zero-shot performance of SEA-RAFT and existing methods on the Spring [35] training split. Latency is measured on an RTX3090 with a batch size of 1 and input resolution \({540} \times {960}\). SEA-RAFT has an accuracy close to the best one achieved by MS-RAFT+ [19] but is \({11} \times\) smaller and \({24} \times\) faster.

圖 1:SEA-RAFT 和現有方法在 Spring [35] 訓練集上的零樣本效能。延遲是在 RTX3090 上以批次大小 1 和輸入解析度 \({540} \times {960}\) 測量的。SEA-RAFT 的準確性接近 MS-RAFT+ [19] 所達到的最佳水平,但模型小 \({11} \times\)、速度快 \({24} \times\)。

We achieve this by introducing a combination of improvements over the original RAFT:

我們透過在原始 RAFT 基礎上引入一系列改進來實現這一點:

  • Mixture of Laplace Loss: Instead of the standard \({L}_{1}\) loss,we train the network to predict parameters of a mixture of Laplace distributions to maximize the log-likelihood of the ground truth flow. As we will demonstrate, this new loss reduces overfitting to ambiguous cases and improves generalization.

  • 混合拉普拉斯損失:我們不使用標準的 \({L}_{1}\) 損失,而是訓練網路預測混合拉普拉斯分佈的引數,以最大化真實流的對數似然。正如我們將展示的,這種新損失減少了對模糊情況的過擬合,並提高了泛化能力。

  • Directly Regressed Initial Flow: Instead of initializing the flow field to zero before iterative refinement, we directly predict the initial flow by reusing the existing context encoder and feeding it the stacked input frames. This simple change introduces minimal overhead but is surprisingly effective in reducing the number of iterations and improving efficiency.

  • 直接回歸初始流:在迭代細化之前,我們不將流場初始化為零,而是透過重用現有的上下文編碼器、並向其輸入堆疊在一起的兩幀影像來直接預測初始流。這一簡單的改變引入的額外開銷極小,卻能出人意料地有效減少迭代次數並提高效率。

  • Rigid-Flow Pre-Training: We find that pre-training on TartanAir [52] can significantly improve generalization, despite the limited diversity of its flow, which is induced purely by camera motion in static scenes.

  • 剛性流預訓練:我們發現,儘管流的變化僅由靜態場景中的相機運動引起,多樣性有限,但在TartanAir [52]上進行預訓練可以顯著提高泛化能力。

These improvements are novel in the context of RAFT-style methods for optical flow. Moreover, they are orthogonal to the improvements proposed in existing RAFT-style methods, which focus on replacing certain blocks with newer designs, such as replacing convolutional blocks with transformers.

這些改進在RAFT風格的光流方法背景下是新穎的。此外,它們與現有RAFT風格方法中提出的改進是正交的,後者主要關注用更新的設計替換某些模組,例如用變換器替換卷積塊。

Besides the main improvements above, SEA-RAFT also incorporates architectural changes that greatly simplify the original RAFT. In particular, we find that certain custom designs of the original RAFT are unnecessary and can be replaced with standard off-the-shelf modules. For example, the original feature encoder and context encoder were custom-designed and must use different normalization layers for stable training; we replaced each with a standard ResNet. In addition, we replace the original convolutional GRU with a simple RNN consisting entirely of ConvNext blocks. Such simplifications make it easy for SEA-RAFT to incorporate new neural building blocks and scale to larger datasets.

除了上述主要改進之外,SEA-RAFT還結合了架構上的變化,這些變化極大地簡化了原始的RAFT。特別是,我們發現原始RAFT的某些定製設計是不必要的,可以用標準的現成模組替換。例如,原始的特徵編碼器和上下文編碼器是定製設計的,必須使用不同的歸一化層以確保穩定訓練;我們用標準的ResNet替換了它們。此外,我們用完全由ConvNext塊組成的簡單RNN替換了原始的卷積GRU。這些簡化使得SEA-RAFT易於整合新的神經構建塊,並擴充套件到更大的資料集。

We perform extensive experiments to evaluate SEA-RAFT on standard benchmarks including Spring, Sintel, and KITTI. We also validate the effectiveness of our improvements through ablation studies.

我們在包括Spring、Sintel和KITTI在內的標準基準上進行了廣泛的實驗,以評估SEA-RAFT。我們還透過消融研究驗證了我們改進的有效性。

2 Related Work

2 相關工作

Estimating Optical Flow Classical approaches treated optical flow as an optimization problem that maximizes visual similarity between corresponding pixels, with strong regularization [5,13,62]. Current methods [6,9,14,16-19,24,30,31,42-45,48,50,54-57,65,66,68] are mostly based on deep learning. FlowNets [9,17] regarded optical flow as a dense regression problem and used stacked convolution blocks for prediction. DCNet [58] and PWC-Net [45] introduced the 4D cost volume to explicitly model pixel correspondence. RAFT [50] further combined a multi-scale 4D cost volume with recurrent iterative refinements, achieving large improvements and spawning many follow-ups [14,19,30,31,43,44,48,66,68].

估計光流 傳統方法將光流視為一個最佳化問題,在強正則化下最大化對應畫素之間的視覺相似性 [5,13,62]。當前方法 [6,9,14,16-19,24,30,31,42-45,48,50,54-57,65,66,68] 主要基於深度學習。FlowNets [9,17] 將光流視為密集迴歸問題,並使用堆疊的卷積塊進行預測。DCNet [58] 和 PWC-Net [45] 引入了4D成本體積以顯式建模畫素對應關係。RAFT [50] 進一步將多尺度4D成本體積與迴圈迭代細化相結合,實現了大幅改進並催生了許多後續工作 [14,19,30,31,43,44,48,66,68]。

Our method is a new variant of RAFT [50] with several improvements including a new loss function, direct regression of initial flow, rigid-flow pre-training, and architectural simplifications. All of these improvements are new compared to existing RAFT variants. In particular, our direct regression of initial flow is new compared to existing efficient RAFT variants \(\left\lbrack {6,{11},{37}}\right\rbrack\) ,which mainly focus on efficient implementations of RAFT modules. This direct regression is a simple change with minimal overhead, but substantially reduces the number of RAFT iterations needed.

我們的方法是RAFT[50]的一個新變體,包含多項改進,包括新的損失函式、初始流直接回歸、剛性流預訓練和架構簡化。與現有RAFT變體相比,所有這些改進都是新的。特別是,我們的初始流直接回歸與現有高效RAFT變體\(\left\lbrack {6,{11},{37}}\right\rbrack\)相比是新的,後者主要關注RAFT模組的高效實現。這種直接回歸是一個簡單的改變,開銷最小,但顯著減少了所需的RAFT迭代次數。

Data for Optical Flow FlyingChairs and FlyingThings3D [9,34] are commonly used datasets for optical flow. They provide a large amount of synthetic data but have limited realism. Sintel [3], VIPER [41], Infinigen [40], and Spring [35] are more realistic, using open-source 3D animations, games or procedurally generated scenes. Besides synthetic data, Middlebury, KITTI, and HD1K [1,12,23,36] provide annotations for real-world image pairs. These datasets are limited in both quantity and diversity due to the difficulty of accurately annotating optical flow in the real world. To leverage more data,several methods \(\left\lbrack {8,{42},{54},{55}}\right\rbrack\) pre-train their models on different tasks. MatchFlow [8] pre-trains on geometric image matching (GIM) using MegaDepth [26]. Croco-Flow [54, 55], DDVM [42], and Flowformer++ [43] pre-train on unlabeled data. We pre-train SEA-RAFT on rigid flow using TartanAir [52]. Though TartanAir [52] has been used in other methods such as DDVM [42] and CroCo-Flow [54,55], our adoption of rigid-flow pre-training is new in the context of RAFT-style methods.

光流資料集 FlyingChairs 和 FlyingThings3D [9,34] 是常用的資料集。它們提供了大量的合成資料,但真實性有限。Sintel [3]、VIPER [41]、Infinigen [40] 和 Spring [35] 使用開源的 3D 動畫、遊戲或程式生成的場景,更加真實。除了合成資料外,Middlebury、KITTI 和 HD1K [1,12,23,36] 為真實世界的影像對提供註釋。由於在現實世界中準確標註光流的難度,這些資料集在數量和多樣性上都有限。為了利用更多資料,幾種方法 \(\left\lbrack {8,{42},{54},{55}}\right\rbrack\) 在不同任務上預訓練其模型。MatchFlow [8] 使用 MegaDepth [26] 在幾何影像匹配(GIM)上預訓練。Croco-Flow [54, 55]、DDVM [42] 和 Flowformer++ [43] 在未標註資料上預訓練。我們在 TartanAir [52] 上使用剛性流預訓練 SEA-RAFT。儘管 TartanAir [52] 已被其他方法如 DDVM [42] 和 CroCo-Flow [54,55] 使用,但我們在 RAFT 風格方法的背景下采用剛性流預訓練是新穎的。

Predicting Probability Distributions Predicting probability distributions is a common practice in computer vision \(\left\lbrack {2,4,{25},{32},{47},{51},{53},{64}}\right\rbrack\) . In tasks closely related to optical flow such as keypoint matching \(\left\lbrack {4,{47},{51},{64}}\right\rbrack\) ,the variance of the probability distribution reflects uncertainty of predictions and therefore is useful for many applications. For example, LoFTR [47] filters out uncertain matching pairs. Aspanformer [4] adjusts the look-up radius based on uncertainty.

預測機率分佈 預測機率分佈是計算機視覺中的一種常見做法 \(\left\lbrack {2,4,{25},{32},{47},{51},{53},{64}}\right\rbrack\)。在諸如關鍵點匹配 \(\left\lbrack {4,{47},{51},{64}}\right\rbrack\) 等與光流密切相關的任務中,機率分佈的方差反映了預測的不確定性,因此對許多應用都很有用。例如,LoFTR [47] 過濾掉不確定的匹配對。Aspanformer [4] 根據不確定性調整查詢半徑。


Fig. 2: Compared with RAFT [50], SEA-RAFT introduces (1) rigid-flow pre-training, (2) mixture of Laplace loss, and (3) direct regression of initial flow.

圖 2:與 RAFT [50] 相比,SEA-RAFT 引入了(1)剛性流預訓練,(2)拉普拉斯混合損失,以及(3)初始流的直接回歸。

To handle the ambiguity caused by heavy occlusion, SEA-RAFT predicts a mixture of Laplace (MoL) distribution. Although MoL has been used in keypoint matching methods such as PDC-Net + [51], our use of MoL is new in the context of RAFT-style methods. In addition, our formulation is different in that we require one mixture component to have a constant variance, making it equivalent to the \({L}_{1}\) loss that aligns better with the optical flow evaluation metrics. This difference is crucial for achieving competitive performance in optical flow, where every pixel needs accurate correspondence, unlike keypoint matching, where a subset of reliable matches suffices.

為了處理由嚴重遮擋引起的模糊性,SEA-RAFT 預測了一個拉普拉斯混合(MoL)分佈。儘管 MoL 已在關鍵點匹配方法中使用,如 PDC-Net + [51],但我們在 RAFT 風格方法中使用 MoL 是新穎的。此外,我們的公式不同之處在於我們要求一個混合分量具有恆定方差,使其等同於與光流評估指標更好地對齊的 \({L}_{1}\) 損失。這一差異對於實現光流中的競爭效能至關重要,其中每個畫素都需要精確對應,而不像關鍵點匹配,其中一組可靠匹配就足夠了。

3 Method

3 方法

In this section, we first describe the iterative refinement in RAFT and then introduce the improvements that lead to SEA-RAFT.

在本節中,我們首先描述 RAFT 中的迭代細化,然後介紹導致 SEA-RAFT 的改進。

3.1 Iterative refinement

3.1 迭代細化

Given two adjacent RGB frames, RAFT predicts a field of pixel-wise 2D vectors through iterative refinement that consists of two parts: (1) feature and context encoders, which transform images into lower-resolution dense features, and (2) an RNN unit, which iteratively refines the predictions.

給定兩個相鄰的 RGB 幀,RAFT 透過迭代細化預測畫素級的 2D 向量場,該細化包括兩部分:(1)特徵和上下文編碼器,將影像轉換為低解析度的密集特徵,以及(2)一個 RNN 單元,迭代地細化預測。

Given two images \({I}_{1},{I}_{2} \in {\mathbb{R}}^{H \times W \times 3}\) ,the feature encoder \(F\) takes \({I}_{1},{I}_{2}\) as inputs separately and outputs a lower-resolution feature \(F\left( {I}_{1}\right) ,F\left( {I}_{2}\right) \in {\mathbb{R}}^{h \times w \times D}\) . The context encoder \(C\) takes source image \({I}_{1}\) as input and outputs a context feature \(C\left( {I}_{1}\right) \in {\mathbb{R}}^{h \times w \times D}\) . A multi-scale \(4\mathrm{D}\) correlation volume \(\left\{ {V}_{k}\right\}\) is then built with the features from feature encoder \(F\) :

給定兩幅影像 \({I}_{1},{I}_{2} \in {\mathbb{R}}^{H \times W \times 3}\),特徵編碼器 \(F\) 分別以 \({I}_{1},{I}_{2}\) 作為輸入並輸出一個低解析度特徵 \(F\left( {I}_{1}\right) ,F\left( {I}_{2}\right) \in {\mathbb{R}}^{h \times w \times D}\)。上下文編碼器 \(C\) 以源影像 \({I}_{1}\) 作為輸入並輸出一個上下文特徵 \(C\left( {I}_{1}\right) \in {\mathbb{R}}^{h \times w \times D}\)。然後,使用來自特徵編碼器 \(F\) 的特徵構建一個多尺度的 \(4\mathrm{D}\) 相關體積 \(\left\{ {V}_{k}\right\}\)

\[{V}_{k} = F\left( {I}_{1}\right) \circ \operatorname{AvgPool}{\left( F\left( {I}_{2}\right) ,{2}^{k}\right) }^{\top } \in {\mathbb{R}}^{h \times w \times \frac{h}{{2}^{k}} \times \frac{w}{{2}^{k}}}, \]

where \(\circ\) represents the correlation operator,which computes similarities (as dot products of feature vectors) between all pairs of pixels across two feature maps.

其中 \(\circ\) 表示相關運算子,該運算子計算兩個特徵圖之間所有畫素對之間的相似性(作為特徵向量的點積)。

Several works \(\left\lbrack {{18},{19}}\right\rbrack\) have explored the optimal choices of the number of levels in the cost volume \(\left( k\right)\) and the feature resolution \(\left( {h,w}\right)\) . In SEA-RAFT, we simply follow the original setting in RAFT [50]: \(\left( {h,w}\right) = \frac{1}{8}\left( {H,W}\right) ,k = 4\) .

多項工作 \(\left\lbrack {{18},{19}}\right\rbrack\) 已經探討了成本體積 \(\left( k\right)\) 中層數和特徵解析度 \(\left( {h,w}\right)\) 的最佳選擇。在 SEA-RAFT 中,我們簡單地遵循 RAFT [50] 中的原始設定:\(\left( {h,w}\right) = \frac{1}{8}\left( {H,W}\right) ,k = 4\)
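
For concreteness, the following is a minimal PyTorch sketch of the multi-scale correlation volume defined above (not the released SEA-RAFT code). It assumes feature maps `fmap1`, `fmap2` of shape (B, D, h, w) from the feature encoder \(F\); the \(\sqrt{D}\) normalization is a common convention inherited from RAFT and is an assumption here.

```python
import torch
import torch.nn.functional as F_nn

def build_corr_pyramid(fmap1, fmap2, num_levels=4):
    """All-pairs correlation between F(I1) and F(I2), then average-pool
    the I2 dimensions to form the multi-scale volume {V_k}."""
    B, D, h, w = fmap1.shape
    corr = torch.einsum('bdhw,bdij->bhwij', fmap1, fmap2) / D ** 0.5
    corr = corr.reshape(B * h * w, 1, h, w)       # treat each I1 pixel as a batch item
    pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F_nn.avg_pool2d(corr, kernel_size=2, stride=2)  # AvgPool(F(I2), 2^k)
        pyramid.append(corr)
    return pyramid   # level k has shape (B*h*w, 1, h/2^k, w/2^k)
```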

RAFT iteratively refines a flow prediction \(\mu\) . Initially, \(\mu\) is set to be all zeros. Each refinement step uses the current flow prediction \(\mu\) to fetch a \({D}_{M}\) -dim motion feature \(M\) from the multi-scale correlation volume \(\left\{ {V}_{k}\right\}\) with a look-up radius \(r\) :

RAFT 迭代地細化流預測 \(\mu\)。最初,\(\mu\) 被設定為全零。每個細化步驟使用當前的流預測 \(\mu\) 從多尺度相關體積 \(\left\{ {V}_{k}\right\}\) 中獲取一個 \({D}_{M}\) 維的運動特徵 \(M\),查詢半徑為 \(r\)

\[M = \operatorname{MotionEncoder}\left( {\operatorname{LookUp}\left( {\left\{ {V}_{k}\right\} ,\mu ,r}\right) }\right) \in {\mathbb{R}}^{h \times w \times {D}_{M}}, \]

where the Lookup operator returns a motion feature vector for each pixel in \({I}_{1}\) , consisting of similarities between the pixel in \({I}_{1}\) and its current correspondence’s neighboring pixels in \({I}_{2}\) within the radius \(r\) . The motion feature vector is further transformed by a motion encoder.

其中查詢運算子為 \({I}_{1}\) 中的每個畫素返回一個運動特徵向量,該向量由 \({I}_{1}\) 中的畫素與其在 \({I}_{2}\) 中當前對應畫素的鄰近畫素在半徑 \(r\) 內的相似性組成。運動特徵向量進一步透過運動編碼器進行變換。
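
Below is a sketch of how such a lookup can be implemented with bilinear sampling. It is consistent with the description above but is not taken from the released code; the variable names and the use of `grid_sample` are assumptions.

```python
import torch
import torch.nn.functional as F_nn

def lookup(pyramid, flow, radius=4):
    """For every pixel of I1, sample a (2r+1)x(2r+1) window of correlation
    values around its current correspondence, at every pyramid level."""
    B, _, h, w = flow.shape
    xs, ys = torch.meshgrid(torch.arange(w), torch.arange(h), indexing='xy')
    base = torch.stack([xs, ys], dim=-1).to(flow)          # (h, w, 2) pixel coords (x, y)
    coords = base[None] + flow.permute(0, 2, 3, 1)         # current correspondences in I2
    d = torch.linspace(-radius, radius, 2 * radius + 1, device=flow.device)
    dxs, dys = torch.meshgrid(d, d, indexing='xy')
    delta = torch.stack([dxs, dys], dim=-1)                # (2r+1, 2r+1, 2) window offsets
    out = []
    for k, corr in enumerate(pyramid):                     # corr: (B*h*w, 1, h_k, w_k)
        centroid = coords.reshape(B * h * w, 1, 1, 2) / 2 ** k
        window = centroid + delta[None]
        _, _, h_k, w_k = corr.shape
        grid = torch.stack([2 * window[..., 0] / (w_k - 1) - 1,   # normalize for grid_sample
                            2 * window[..., 1] / (h_k - 1) - 1], dim=-1)
        sampled = F_nn.grid_sample(corr, grid, align_corners=True)
        out.append(sampled.reshape(B, h, w, -1).permute(0, 3, 1, 2))
    return torch.cat(out, dim=1)   # (B, num_levels*(2r+1)^2, h, w)
```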

Existing works \(\left\lbrack {4,{11},{21}}\right\rbrack\) have explored dynamic radius and look-up when obtaining the motion features from \(\left\{ {V}_{k}\right\}\) . For simplicity of design,SEA-RAFT follows the original RAFT and sets the look-up radius \(r = 4\) to a fixed constant. The motion feature \(M\) is fed into the RNN cell along with hidden state \(h\) and context feature \(C\left( {I}_{1}\right)\) . From the new hidden state \({h}^{\prime }\) ,the residual flow \({\Delta \mu }\) is regressed by a 2-layer FlowHead:

現有工作 \(\left\lbrack {4,{11},{21}}\right\rbrack\) 已經探討了在從 \(\left\{ {V}_{k}\right\}\) 獲取運動特徵時使用動態半徑和查詢。為了設計的簡單性,SEA-RAFT 遵循原始的 RAFT 並將查詢半徑 \(r = 4\) 設定為一個固定的常數。運動特徵 \(M\) 與隱藏狀態 \(h\) 和上下文特徵 \(C\left( {I}_{1}\right)\) 一起輸入到 RNN 單元中。從新的隱藏狀態 \({h}^{\prime }\) 中,殘差流 \({\Delta \mu }\) 透過一個兩層的 FlowHead 進行迴歸:

\[{h}^{\prime } = \operatorname{RNN}\left( {h,M,C\left( {I}_{1}\right) }\right) \]

\[\Delta \mu = \mathrm{FlowHead}\left( {h}^{\prime }\right) \]

Methods using RAFT-Style iterative refinement [14,50] usually need many iterations: 12 in training and as many as 32 in inference. As a result, RNN-based iterative refinement is a significant bottleneck in latency. Though there have been attempts \(\left\lbrack {6,{11}}\right\rbrack\) to reduce the number of iterations,the performance drastically drops with fewer iterations. In contrast, SEA-RAFT only needs 4 iterations in training and up to 12 iterations in inference to achieve competitive performance.

使用 RAFT 風格迭代細化的方法 [14,50] 通常需要多次迭代:訓練中需要 12 次,推理中多達 32 次。因此,基於 RNN 的迭代細化是延遲的一個顯著瓶頸。儘管有嘗試 \(\left\lbrack {6,{11}}\right\rbrack\) 減少迭代次數,但效能會隨著迭代次數減少而急劇下降。相比之下,SEA-RAFT 在訓練中僅需要 4 次迭代,在推理中最多需要 12 次迭代即可達到競爭性效能。
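
The control flow of the refinement, including the gradient stopping on \(\mu\) mentioned later in Sec. 4, can be summarized with the toy sketch below; the placeholder conv layers stand in for the real MotionEncoder, RNN, and FlowHead, whose exact architectures are not reproduced here.

```python
import torch
import torch.nn as nn

class ToyRefiner(nn.Module):
    """Illustrative iterative refinement: only the loop structure mirrors the text."""
    def __init__(self, corr_dim=4 * 81, ctx_dim=128, hidden_dim=128):
        super().__init__()
        self.motion_encoder = nn.Conv2d(corr_dim, 128, 3, padding=1)
        self.rnn = nn.Conv2d(hidden_dim + 128 + ctx_dim, hidden_dim, 3, padding=1)
        self.flow_head = nn.Sequential(nn.Conv2d(hidden_dim, 128, 3, padding=1),
                                       nn.ReLU(), nn.Conv2d(128, 2, 3, padding=1))

    def forward(self, lookup_fn, context, hidden, flow, iters=4):
        predictions = []
        for _ in range(iters):
            flow = flow.detach()                      # gradient only through the residual
            motion = self.motion_encoder(lookup_fn(flow))
            hidden = torch.tanh(self.rnn(torch.cat([hidden, motion, context], dim=1)))
            flow = flow + self.flow_head(hidden)      # mu' = mu + delta mu
            predictions.append(flow)
        return predictions

# usage sketch: lookup_fn = lambda f: lookup(pyramid, f, radius=4);
# hidden = torch.zeros(B, 128, h, w); flow starts at zeros (RAFT) or at the
# regressed initial flow of Sec. 3.3 (SEA-RAFT).
```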

3.2 Mixture-of-Laplace Loss

3.2 拉普拉斯混合損失

Most prior works are supervised using an endpoint-error loss on all pixels. However, optical flow training data often contains ambiguous, unpredictable samples, which can dominate this loss empirically.

大多數先前的工作使用所有畫素的端點誤差損失進行監督。然而,光流訓練資料通常包含模糊、不可預測的樣本,這些樣本在經驗上可以主導這種損失。

Ambiguous Cases Ambiguous cases of optical flow can arise with heavy occlusion (Fig. 3). While in many cases the motion of occluded pixels can be predicted, sometimes the ambiguity can be too large to predict a single outcome. We examined 10 samples with the highest endpoint-error in the training and validation sets of FlyingChairs [9] and found that ambiguous cases dominate the error.

模糊情況 光流的模糊情況可能因嚴重遮擋而出現,如圖3所示。雖然在許多情況下被遮擋畫素的運動可以預測,但有時模糊性可能太大以至於無法預測單一結果。我們檢查了FlyingChairs [9]訓練和驗證集中端點誤差最高的10個樣本,發現模糊情況主導了誤差。

Review of Probabilistic Regression Prior works for image matching have proposed probabilistic losses to enable their models to express aleatoric or epistemic uncertainty [4,47,51,53,55,64]. These approaches regress the parameters of the probabilistic model and maximize the log-likelihood of the ground truth during training.

機率迴歸回顧 先前用於影像匹配的工作提出了機率損失,以使模型能夠表達偶然或認知不確定性 [4,47,51,53,55,64]。這些方法迴歸機率模型的引數,並在訓練過程中最大化真實值的對數似然。


Fig. 3: Ambiguous cases can occur frequently in training data where flow is unpredictable due to occlusion. Such cases can dominate the \({L}_{1}\) loss (shown as an error map) used by current methods \(\left\lbrack {{50},{56}}\right\rbrack\) . Our new training loss allows the model to account for such uncertainty.

圖3:模糊情況在訓練資料中可能經常發生,由於遮擋導致流不可預測。此類情況可以主導當前方法 \(\left\lbrack {{50},{56}}\right\rbrack\) 使用的 \({L}_{1}\) 損失(顯示為誤差圖)。我們的新訓練損失允許模型考慮這種不確定性。


Given an image pair \(\left\{ {{I}_{1},{I}_{2}}\right\}\) and the flow ground truth \({\mu }_{gt}\) ,the training loss is

給定一對影像 \(\left\{ {{I}_{1},{I}_{2}}\right\}\) 和流的真實值 \({\mu }_{gt}\),訓練損失為

\[{\mathcal{L}}_{\text{prob }} = - \log {p}_{\theta }\left( {\mu = {\mu }_{gt} \mid {I}_{1},{I}_{2}}\right) \]

where the probability density function \({p}_{\theta }\) is parameterized by the network. Prior work has formulated \({p}_{\theta }\) as a Gaussian or a Laplace distribution with a predicted mean and variance. For example, we can formulate a naive version of probabilistic regression by assuming: (1) \({p}_{\theta }\) is Laplace with mean \(\mu \in {\mathbb{R}}^{H \times W \times 2}\) and scale \(b \in {\mathbb{R}}^{H \times W \times 1}\) predicted by the network, (2) the flow distribution is pixel-wise independent, and (3) the x-direction flow and the y-direction flow are independent but share the same scale parameter \(b\):

其中機率密度函式 \({p}_{\theta }\) 由網路引數化。先前的工作將 \({p}_{\theta }\) 形式化為具有預測均值和方差的高斯分佈或拉普拉斯分佈。例如,我們可以透過假設:(1) \({p}_{\theta }\) 是具有網路預測的均值 \(\mu \in {\mathbb{R}}^{H \times W \times 2}\) 和尺度 \(b \in {\mathbb{R}}^{H \times W \times 1}\) 的拉普拉斯分佈,(2) 流分佈是畫素獨立的,以及 (3) x方向流和y方向流獨立但共享相同的尺度引數 \(b\),來形式化機率迴歸的簡單版本:

\[{\mathcal{L}}_{Lap} = \frac{1}{HW}\mathop{\sum }\limits_{u}\mathop{\sum }\limits_{v}\left( {\log {2b}\left( {u,v}\right) + \frac{{\begin{Vmatrix}{\mu }_{gt}\left( u,v\right) - \mu \left( u,v\right) \end{Vmatrix}}_{1}}{{2b}\left( {u,v}\right) }}\right) \tag{1} \]

where \(u,v\) are indices to the pixels. The Laplace loss can be regarded as an extended version of \({L}_{1}\) loss with an extra penalty term \(b\) . During inference, \(\mu\) represents the flow prediction,and the scale factor \(b\) provides an estimation of uncertainty. However, we find this naive probabilistic regression does not work well on optical flow, which has also been pointed out by prior work [64].

其中 \(u,v\) 是畫素的索引。拉普拉斯損失可以視為 \({L}_{1}\) 損失的擴充套件版本,帶有一個額外的懲罰項 \(b\)。在推理過程中,\(\mu\) 表示流預測,而尺度因子 \(b\) 提供了不確定性的估計。然而,我們發現這種樸素機率迴歸在光流上效果不佳,這一點也已被先前的工作 [64] 指出。
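
As a reference point for the discussion that follows, Eq. (1) can be transcribed directly into PyTorch as below (a sketch; the clamping constant is illustrative).

```python
import torch

def naive_laplace_loss(flow_pred, b, flow_gt, eps=1e-3):
    """Eq. (1): flow_pred, flow_gt are (B, 2, H, W); b is the predicted
    per-pixel scale (B, 1, H, W), clamped to [eps, inf) as in prior work."""
    b = b.clamp(min=eps)
    l1 = (flow_gt - flow_pred).abs().sum(dim=1, keepdim=True)  # ||mu_gt - mu||_1 over (x, y)
    return (torch.log(2 * b) + l1 / (2 * b)).mean()
```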

Mixture of Laplace One reason that naive probabilistic regression performs poorly is numerical instability as the loss contains a log term. To address this issue,we regress \(b\left( {u,v}\right)\) directly in log-space. This approach makes training more stable compared to previous approaches which clamp \(b\) to \(\lbrack \epsilon ,\infty )\) ,where \(\epsilon\) is a small positive number.

拉普拉斯混合分佈 樸素機率迴歸表現不佳的一個原因是數值不穩定性,因為損失包含一個對數項。為了解決這個問題,我們在對數空間中直接回歸 \(b\left( {u,v}\right)\)。與之前將 \(b\) 限制在 \(\lbrack \epsilon ,\infty )\) 的方法相比,這種方法使得訓練更加穩定,其中 \(\epsilon\) 是一個小的正數。

Another reason that naive probabilistic regression performs poorly is that it deviates from the standard endpoint-error metric, which only cares about the \({L}_{1}\) difference, but not the uncertainty estimation.

樸素機率迴歸表現不佳的另一個原因是它偏離了標準的端點誤差度量,該度量只關心 \({L}_{1}\) 差異,而不關心不確定性估計。


Fig. 4: Visualization on Spring [35] test set.

圖 4:Spring [35] 測試集的視覺化。

Thus, we propose to use a mixture of two Laplace distributions: one for ordinary cases, and the other for ambiguous cases, with mixing coefficient \(\alpha \in \left\lbrack {0,1}\right\rbrack\):

因此,我們提出使用兩個拉普拉斯分佈的混合:一個用於普通情況,另一個用於模糊情況,混合係數為 \(\alpha \in \left\lbrack {0,1}\right\rbrack\):

\[\operatorname{MixLap}\left( {x;\alpha ,{\beta }_{1},{\beta }_{2},\mu }\right) = \alpha \cdot \frac{{e}^{-\frac{\left| x - \mu \right| }{{e}^{{\beta }_{1}}}}}{2{e}^{{\beta }_{1}}} + \left( {1 - \alpha }\right) \cdot \frac{{e}^{-\frac{\left| x - \mu \right| }{{e}^{{\beta }_{2}}}}}{2{e}^{{\beta }_{2}}} \]

Intuitively, at each pixel, we want the first component of the mixture to be aligned with the endpoint-error metric, and the second component to account for ambiguous cases. To explicitly enforce this,we fix \({\beta }_{1} = 0\) ,such that the network is encouraged to optimize for the L1 loss when possible. This leads to the following Mixture-of-Laplace (MoL) loss:

直觀上,在每個畫素上,我們希望混合的第一個分量與端點誤差度量對齊,第二個分量用於處理模糊情況。為了明確強化這一點,我們固定 \({\beta }_{1} = 0\),從而鼓勵網路在可能的情況下朝 L1 損失最佳化。這導致了以下混合拉普拉斯(MoL)損失:

\[{\mathcal{L}}_{MoL} = - \frac{1}{2HW}\mathop{\sum }\limits_{u}\mathop{\sum }\limits_{v}\mathop{\sum }\limits_{{d \in \{ x,y\} }}\log \left\lbrack {\operatorname{MixLap}\left( {{\mu }_{gt}{\left( u,v\right) }_{d};\alpha \left( {u,v}\right) ,0,{\beta }_{2}\left( {u,v}\right) ,\mu {\left( u,v\right) }_{d}}\right) }\right\rbrack \tag{2} \]

where \(d\) indexes the axis of the flow vector (the \(x\) direction or the \(y\) direction).

其中 \(d\) 索引了流向量的軸(\(x\) 方向或 \(y\) 方向)。

The free parameters \(\alpha ,{\beta }_{2},\mu\) of \({\mathcal{L}}_{MoL}\) are predicted by the network. Intuitively, a higher \(\alpha\) means the flow prediction of this pixel is more "ordinary" instead of "ambiguous". Mathematically, a higher \(\alpha\) makes \({\mathcal{L}}_{MoL}\) behave like an \({L}_{1}\) loss. In Sec. 4.3, we show that this property leads to better accuracy.

網路預測了 \({\mathcal{L}}_{MoL}\) 的自由引數 \(\alpha ,{\beta }_{2},\mu\)。直觀上,較高的 \(\alpha\) 意味著該畫素的流預測更“普通”而非“模糊”。從數學上講,較高的 \(\alpha\) 使得 \({\mathcal{L}}_{MoL}\) 表現得像一個 \({L}_{1}\) 損失。在第 4.3 節中,我們展示了這一特性帶來更高的準確性。
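
A minimal sketch of Eq. (2) is given below, assuming per-pixel predictions `alpha` (mixing weight) and `beta2` (log-scale of the second component); the logsumexp formulation and the small clamps are implementation choices of this sketch, not necessarily those of the released code.

```python
import math
import torch

def laplace_log_density(err, beta):
    # log of Laplace(err; scale = e^beta) = -|err|/e^beta - beta - log 2
    return -err.abs() / torch.exp(beta) - beta - math.log(2.0)

def mol_loss(flow_pred, alpha, beta2, flow_gt):
    """Eq. (2): the first component is pinned to beta1 = 0 so that it matches
    an L1-style term; the second component absorbs ambiguous pixels."""
    err = flow_gt - flow_pred                                    # (B, 2, H, W)
    log_p1 = laplace_log_density(err, torch.zeros_like(beta2))   # beta1 = 0
    log_p2 = laplace_log_density(err, beta2)
    stacked = torch.stack([torch.log(alpha.clamp(min=1e-6)) + log_p1,
                           torch.log((1 - alpha).clamp(min=1e-6)) + log_p2], dim=0)
    return -torch.logsumexp(stacked, dim=0).mean()               # averages the 1/(2HW) sum
```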

Note that though the mixture model has been used in keypoint matching [4,47,51], its application to optical flow requires a different formulation because the goal is substantially different. In keypoint matching, the goal is to identify a subset of reliable matches for downstream applications such as camera pose estimation. Predicting uncertainty serves to filter out unreliable matches, and there is no explicit penalty for predicting few correspondences. As a result, it is not essential for them to align a mixing component to the \({L}_{1}\) loss. In optical flow, we are evaluated on the flow prediction for every pixel.

請注意,儘管混合模型已用於關鍵點匹配 [4,47,51],但將其應用於光流需要不同的表述,因為目標本質上不同。在關鍵點匹配中,目標是識別一組可靠的匹配,用於相機姿態估計等下游應用。預測不確定性用於過濾不可靠的匹配,而預測較少的對應關係不會受到明確的懲罰。因此,對它們而言,將某個混合分量對齊到 \({L}_{1}\) 損失並非必要。而在光流中,每個畫素的流預測都會被評估。

Implementation Details We set an upper bound for \(\beta\) to 10 in the loss to make the training more stable. We also re-predict \(\alpha\) and \(\beta\) every update iteration. We can similarly define the probabilistic sequence loss as:

實現細節 我們在損失中為 \(\beta\) 設定了一個上限 10,以使訓練更加穩定。我們還每更新迭代重新預測 \(\alpha\)\(\beta\)。我們可以類似地定義機率序列損失為:

\[{\mathcal{L}}_{all} = \mathop{\sum }\limits_{{i = 1}}^{N}{\gamma }^{N - i}{\mathcal{L}}_{MoL}^{i} \tag{3} \]


Fig. 5: Visualization on Sintel [3], KITTI [36], and Middlebury [1].

圖 5:在 Sintel [3]、KITTI [36] 和 Middlebury [1] 上的視覺化。

where \({\mathcal{L}}_{MoL}^{i}\) denotes the probabilistic loss at iteration \(i\), \(N\) denotes the number of iterations, and \(\gamma < 1\) exponentially downweights the early iterations. We empirically observe that our method significantly reduces the number of update iterations needed in inference. In fact, \(N = 4\) is sufficient for SEA-RAFT to take first place on the Spring [35] benchmark. We provide detailed ablations in Tab. 4.

其中 \({\mathcal{L}}_{MoL}^{i}\) 表示第 \(i\) 次迭代中的機率損失,\(N\) 表示迭代次數,而 \(\gamma < 1\) 以指數方式降低早期迭代的權重。我們實證觀察到,我們的方法顯著減少了推理中所需的更新迭代次數。事實上,\(N = 4\) 足以使 SEA-RAFT 在 Spring [35] 基準測試中名列第一。我們在表 4 中提供了詳細的消融研究。
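
Eq. (3) amounts to a simple exponentially weighted sum over the per-iteration MoL losses, e.g. (the value of \(\gamma\) is illustrative; the text only requires \(\gamma < 1\)):

```python
def sequence_loss(per_iter_losses, gamma=0.8):
    """Eq. (3): exponentially down-weight early iterations."""
    N = len(per_iter_losses)
    return sum(gamma ** (N - i) * loss for i, loss in enumerate(per_iter_losses, start=1))
```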

3.3 Direct Regression of Initial Flow

3.3 初始流直接回歸

RAFT-style iterative refinements \(\left\lbrack {8,{14},{31},{37},{48},{66}}\right\rbrack\) typically zero-initialize the flow field. However, zero-initialization may deviate substantially from the ground truth, thus needing many iterations. In SEA-RAFT, we borrow an idea from the FlowNet family of methods \(\left\lbrack {9,{17}}\right\rbrack\) to predict an initial estimate of optical flow from the context encoder, given both frames as input. We also predict an associated MoL (see Sec. 3.2).

RAFT 風格的迭代細化 \(\left\lbrack {8,{14},{31},{37},{48},{66}}\right\rbrack\) 通常將流場初始化為零。然而,零初始化可能與真實情況相差甚遠,因此需要多次迭代。在 SEA-RAFT 中,我們從 FlowNet 系列方法 \(\left\lbrack {9,{17}}\right\rbrack\) 中借鑑了一個想法,即在以兩幀影像作為輸入的情況下,從上下文編碼器預測光流的初始估計值。我們還預測了相關的 MoL(見第 3.2 節)。

This simple modification also significantly improves the convergence speed of the iterative refinement framework, allowing one to use fewer iterations during inference. Detailed ablations are shown in Tab. 4.

這一簡單的修改也顯著提高了迭代改進框架的收斂速度,使得在推理過程中可以使用更少的迭代次數。詳細的消融實驗結果見表 4。
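
A sketch of this idea is shown below: a small head on top of the context encoder (which is fed both frames stacked along the channel dimension) regresses the initial flow together with its MoL parameters. The module names and channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn

class InitFlowHead(nn.Module):
    """Regress an initial flow (and MoL parameters) from the context feature."""
    def __init__(self, ctx_dim=128):
        super().__init__()
        # 2 flow channels + 1 mixing weight alpha + 1 log-scale beta2
        self.head = nn.Conv2d(ctx_dim, 4, kernel_size=3, padding=1)

    def forward(self, context_feat):
        out = self.head(context_feat)
        flow_init = out[:, :2]                 # at feature (1/8) resolution
        alpha = torch.sigmoid(out[:, 2:3])     # keep the mixing weight in [0, 1]
        beta2 = out[:, 3:4]
        return flow_init, alpha, beta2

# usage sketch: the context encoder sees both frames stacked along channels,
# context_feat = context_net(torch.cat([img1, img2], dim=1)),
# and the refinement of Sec. 3.1 starts from flow_init instead of zeros.
```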

3.4 Large-Scale Rigid-Flow Pre-Training

3.4 大規模剛性流預訓練

Most prior works train on a small number of datasets with limited size, diversity and realism [9,34]. To improve generalization, we pre-train SEA-RAFT on TartanAir [52], which provides optical flow annotations between a pair of (non-rectified) stereo cameras. This type of motion field is a special case of optical flow, induced by viewpoint change in a rigid static scene. Despite its limited motion diversity, it enables SEA-RAFT to train on data with higher realism and scene diversity, leading to better generalization.

大多數先前的工作在數量、多樣性和真實性都有限的資料集 [9,34] 上進行訓練。為了提高泛化能力,我們在 TartanAir [52] 上預訓練 SEA-RAFT,該資料集提供了一對(未校正的)立體相機之間的光流標註。這種運動場是光流的一個特例,源於剛性靜態場景中的視角變化。儘管其運動多樣性有限,但它使 SEA-RAFT 能夠在更具真實性和場景多樣性的資料上進行訓練,從而獲得更好的泛化能力。

3.5 Simplifications

3.5 簡化

We also provide a few architecture changes that greatly simplify the original RAFT [50]. First, we adopt truncated, ImageNet [7] pre-trained ResNets for the backbones. We also substitute the ConvGRU in RAFT with two ConvNeXt [28] blocks, which we show provides better efficiency and training stability. The detailed ablations of these changes are shown in Tab. 4.

我們還提供了一些架構上的改變,這些改變極大地簡化了原始的 RAFT [50]。首先,我們採用截斷的、在 ImageNet [7] 上預訓練的 ResNets 作為主幹網路。我們還用兩個 ConvNeXt [28] 塊替換了 RAFT 中的 ConvGRU,我們表明這提供了更好的效率和訓練穩定性。這些改變的詳細消融實驗結果見表 4。
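
The sketch below illustrates the kind of replacement meant here: two ConvNeXt-style blocks acting as the recurrent update instead of a ConvGRU. Block details (kernel size, normalization, channel widths) follow the public ConvNeXt recipe approximately and are not the exact released architecture.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.GroupNorm(1, dim)               # LayerNorm-like approximation
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)

    def forward(self, x):
        return x + self.pwconv2(self.act(self.pwconv1(self.norm(self.dwconv(x)))))

class ConvNeXtRNNCell(nn.Module):
    """Simple RNN update: project [hidden, inputs] and run two ConvNeXt blocks."""
    def __init__(self, hidden_dim=128, input_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(hidden_dim + input_dim, hidden_dim, kernel_size=1)
        self.blocks = nn.Sequential(ConvNeXtBlock(hidden_dim), ConvNeXtBlock(hidden_dim))

    def forward(self, hidden, inputs):                 # inputs = motion + context features
        return self.blocks(self.proj(torch.cat([hidden, inputs], dim=1)))
```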

4 Experiments

4 實驗

We evaluate SEA-RAFT on Spring [35], KITTI [12], and Sintel [3]. Following previous works, we also incorporate FlyingChairs [9], FlyingThings [34], and HD1K [23] into our training pipeline. To verify the effectiveness of TartanAir [52] rigid-flow pre-training, we provide the performance gain from it in different settings.

我們在Spring [35]、KITTI [12]和Sintel [3]上評估SEA-RAFT。按照先前的工作,我們還將FlyingChairs [9]、FlyingThings [34]和HD1K [23]納入我們的訓練流程。為了驗證TartanAir [52]剛性流預訓練的有效性,我們在不同設定下給出了它帶來的效能提升。

Model Details SEA-RAFT is implemented in PyTorch [38]. There are three different types of SEA-RAFT and we denote them as SEA-RAFT(S/M/L). The only differences among them are the backbone choices and the number of iterations in inference. Specifically, SEA-RAFT(S) uses the first 6 layers of ResNet-18 as the feature/context encoder, and SEA-RAFT(M) uses the first 13 layers of ResNet-34. The pre-trained weights we use are downloaded from torchvision. SEA-RAFT(S) and SEA-RAFT(M) use the same architecture for the recurrent units and keep the number of iterations \(N = 4\) in both training and inference. SEA-RAFT(L) can be regarded as an extension based on SEA-RAFT(M): they share the same weights,but SEA-RAFT(L) uses \(N = {12}\) iterations in inference. Following RAFT [50],we stop the gradient for \(\mu\) when computing \({\mu }^{\prime } = \mu + {\Delta \mu }\) and only propagate the gradient for residual flow \({\Delta \mu }\) .

模型細節 SEA-RAFT在PyTorch [38]中實現。有三種不同型別的SEA-RAFT,我們分別表示為SEA-RAFT(S/M/L)。它們之間的唯一區別在於骨幹選擇和推理中的迭代次數。具體來說,SEA-RAFT(S)使用ResNet-18的前6層作為特徵/上下文編碼器,而SEA-RAFT(M)使用ResNet-34的前13層。我們使用的預訓練權重從torchvision下載。SEA-RAFT(S)和SEA-RAFT(M)在迴圈單元中使用相同的架構,並在訓練和推理中保持迭代次數\(N = 4\)。SEA-RAFT(L)可以視為基於SEA-RAFT(M)的擴充套件:它們共享相同的權重,但SEA-RAFT(L)在推理中使用\(N = {12}\)次迭代。按照RAFT [50],我們在計算\({\mu }^{\prime } = \mu + {\Delta \mu }\)時停止\(\mu\)的梯度,並且僅傳播殘餘流\({\Delta \mu }\)的梯度。
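
The "first layers of a torchvision ResNet" idea can be sketched as follows; the exact truncation point ("first 6 layers" / "first 13 layers") in the released model may differ from this stride-8 cut, which is an assumption here.

```python
import torch
import torch.nn as nn
import torchvision

def truncated_resnet(name="resnet18", pretrained=True):
    weights = "IMAGENET1K_V1" if pretrained else None
    net = getattr(torchvision.models, name)(weights=weights)
    # stem (stride 4) + layer1 (stride 4) + layer2 (stride 8)
    return nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                         net.layer1, net.layer2)

feats = truncated_resnet()(torch.randn(1, 3, 256, 256))
print(feats.shape)   # torch.Size([1, 128, 32, 32]) -> 1/8 resolution features
```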

Training Details As mentioned in Sec. 3.4, we pre-train SEA-RAFT on TartanAir [52] for 300k steps with a batch size of 32, input resolution \({480} \times {640}\) and learning rate \(4 \times {10}^{-4}\). Similar to RAFT [50], MaskFlowNet [65] and PWC-Net+ [45], we then train our models on FlyingChairs [9] for 100k steps with a batch size of 16, input resolution \({368} \times {496}\), learning rate \({2.5} \times {10}^{-4}\) and FlyingThings3D [34] for 120k steps with a batch size of 32, input resolution \({432} \times {960}\), learning rate \(4 \times {10}^{-4}\) (denoted as "C+T" following previous works). For the submissions on the Sintel [3] benchmark, we fine-tune the model from "C+T" on a mixture of Sintel [3], FlyingThings3D clean pass [34], KITTI [12] and HD1K [23] for \({300}\mathrm{k}\) steps with a batch size of 32, input resolution \({432} \times {960}\) and learning rate \(4 \times {10}^{-4}\) (denoted as "C+T+S+K+H" following previous works). Different from previous methods, we reduce the percentage of Sintel [3] in the mixture dataset, which is usually more than \({70}\%\) in previous papers. Details will be mentioned in the supplementary material. For KITTI [12] submissions, we fine-tune our models from "C+T+S+K+H" on the KITTI training set for an extra 10k steps with a batch size of 16, input resolution \({432} \times {960}\) and learning rate \({10}^{-4}\). For Spring [35] submissions, we fine-tune our models from "C+T+S+K+H" on the Spring training set for an extra 120k steps with a batch size of 32, input resolution \({540} \times {960}\) and learning rate \(4 \times {10}^{-4}\).

訓練細節 如第3.4節所述,我們在TartanAir [52]上對SEA-RAFT進行30萬步的預訓練,批次大小為32,輸入解析度\({480} \times {640}\),學習率\(4 \times {10}^{-4}\)。與RAFT [50]、MaskFlowNet [65]和PWC-Net+ [45]類似,我們隨後在FlyingChairs [9]上訓練10萬步,批次大小為16,輸入解析度\({368} \times {496}\),學習率\({2.5} \times {10}^{-4}\);並在FlyingThings3D [34]上訓練12萬步,批次大小為32,輸入解析度\({432} \times {960}\),學習率\(4 \times {10}^{-4}\)(按照先前的工作記為“C+T”)。對於Sintel [3]基準的提交,我們從“C+T”出發,在Sintel [3]、FlyingThings3D clean pass [34]、KITTI [12]和HD1K [23]的混合資料集上微調\({300}\mathrm{k}\)步,批次大小為32,輸入解析度\({432} \times {960}\),學習率\(4 \times {10}^{-4}\)(按照先前的工作記為“C+T+S+K+H”)。與先前的方法不同,我們降低了混合資料集中Sintel [3]的比例,而在先前的論文中該比例通常超過\({70}\%\)。細節將在補充材料中說明。對於KITTI [12]的提交,我們從“C+T+S+K+H”出發,在KITTI訓練集上額外微調1萬步,批次大小為16,輸入解析度\({432} \times {960}\),學習率\({10}^{-4}\)。對於Spring [35]的提交,我們從“C+T+S+K+H”出發,在Spring訓練集上額外微調12萬步,批次大小為32,輸入解析度\({540} \times {960}\),學習率\(4 \times {10}^{-4}\)。

| Extra Data | Method | Fine-tune | Spring (train) 1px ↓ | Spring (train) EPE ↓ | Spring (test) 1px ↓ | Spring (test) EPE ↓ | Spring (test) Fl ↓ | Spring (test) WAUC ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| - | PWC-Net [45] | ✗ | - | - | 82.27* | 2.288* | 4.889* | 45.670* |
| - | FlowNet2 [17] | ✗ | 1 | - | 6.710* | 1.040* | 2.823* | 90.907* |
| - | RAFT [50] | ✗ | 4.788 | 0.448 | 6.790* | 1.476* | 3.198* | 90.920* |
| - | GMA [20] | ✗ | 4.763 | 0.443 | 7.074* | 0.914* | 3.079* | 90.722* |
| - | RPKNet [37] | ✗ | 4.472 | 0.416 | 4.809 | 0.657 | 1.756 | 92.638 |
| - | DIP [68] | ✗ | 4.273 | 0.463 | - | - | - | - |
| - | SKFlow [48] | ✗ | 4.521 | 0.408 | - | - | - | - |
| - | GMFlow [56] | ✗ | 29.49 | 0.930 | 10.355* | 0.945* | 2.952* | 82.337* |
| - | GMFlow+ [57] | ✗ | 4.292 | 0.433 | - | - | - | - |
| - | Flowformer [14] | ✗ | 4.508 | 0.470 | 6.510* | 0.723* | 2.384* | 91.679* |
| - | CRAFT [44] | ✗ | 4.803 | 0.448 | - | - | - | - |
| - | SEA-RAFT(S) | ✗ | 4.077 | 0.415 | - | - | - | - |
| - | SEA-RAFT(M) | ✗ | 4.060 | 0.406 | - | - | - | - |
| MegaDepth [26] | MatchFlow(G) [8] | ✗ | 4.504 | 0.407 | - | - | - | - |
| YouTube-VOS [59] | Flowformer++ [43] | ✗ | 4.482 | 0.447 | - | - | - | - |
| VIPER [41] | MS-RAFT+ [19] | ✗ | 3.577 | 0.397 | 5.724* | 0.643* | 2.189* | 92.888* |
| TartanAir [52] | SEA-RAFT(S) | ✗ | 4.161 | 0.410 | - | - | - | - |
| TartanAir [52] | SEA-RAFT(M) | ✗ | 3.888 | 0.406 | - | - | - | - |
| CroCo-Pretrain | CroCoFlow [55] | ✓ | - | - | 4.565 | 0.498 | 1.508 | 93.660 |
| CroCo-Pretrain | Win-Win [24] | ✓ | - | - | 5.371 | 0.475 | 1.621 | 92.270 |
| TartanAir [52] | SEA-RAFT(S) | ✓ | - | - | 3.904 | 0.377 | 1.389 | 94.182 |
| TartanAir [52] | SEA-RAFT(M) | ✓ | - | - | 3.686 | 0.363 | 1.347 | 94.534 |
Table 1: SEA-RAFT outperforms existing methods on Spring [35] in different settings. * denotes the results submitted by the Spring [35] team. By default, all methods have undergone "C+T+S+K+H" training. We list the data used by each method beyond the default in the "Extra Data" column. On Spring (test), even our smallest model SEA-RAFT(S) surpasses existing methods by a significant margin. Without fine-tuning on Spring (train), SEA-RAFT outperforms all other methods that do not use extra data.

表1:在不同設定下,SEA-RAFT在Spring [35]上優於現有方法。*表示由Spring [35]團隊提交的結果。預設情況下,所有方法都經過了“C+T+S+K+H”訓練。我們在“額外資料”列中列出了每種方法使用的超出預設的資料。在Spring(測試)上,即使是我們最小的模型SEA-RAFT(S)也以顯著優勢超過現有方法。在沒有對Spring(訓練)進行微調的情況下,SEA-RAFT優於所有不使用額外資料的其他方法。


Metrics We adopt the widely used metrics in this study: endpoint-error (EPE), 1-pixel outlier rate (1px), Fl-score and WAUC error. Definitions can be found in \(\left\lbrack {{12},{35},{41}}\right\rbrack\) .

指標 我們採用本研究中廣泛使用的指標:端點誤差(EPE)、1畫素異常率(1px)、Fl-score和WAUC誤差。其定義可參見 \(\left\lbrack {{12},{35},{41}}\right\rbrack\)。
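
For reference, the two primary metrics can be computed as below (a simplified sketch that ignores benchmark-specific masking; Fl and WAUC follow the cited benchmark definitions and are omitted).

```python
import torch

def endpoint_error(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    return torch.linalg.norm(flow_pred - flow_gt, dim=1).mean()

def outlier_rate_1px(flow_pred, flow_gt):
    """Fraction of pixels whose endpoint error exceeds 1 pixel."""
    err = torch.linalg.norm(flow_pred - flow_gt, dim=1)
    return (err > 1.0).float().mean()
```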

4.1 Results on Spring

4.1 在Spring上的結果

Zero-Shot Evaluation We compare several representative existing methods with SEA-RAFT using the checkpoints and configurations for the Sintel [3] submission on the Spring [35] training split. For fair comparison, we remove test-time optimizations such as tiling in this setting, since they would significantly slow down inference. All experiments follow the same downsample-upsample protocol: we first downsample the 1080p images by \(2\times\), do inference, and then bilinearly upsample the flow field back to 1080p, which ensures the input resolution in inference is similar to the training resolution in "C+T+S+K+H".

零樣本評估 我們使用Sintel [3]提交所用的檢查點和配置,在Spring [35]訓練分割上將幾個具有代表性的現有方法與SEA-RAFT進行比較。為了公平比較,我們在此設定中移除了平鋪等測試時最佳化,因為它們會顯著降低推理速度。所有實驗都遵循相同的下采樣-上取樣協議:我們首先將1080p影像下采樣2倍,進行推理,然後將流場雙線性上取樣回1080p,以確保推理時的輸入解析度與其在“C+T+S+K+H”中的訓練解析度相近。
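
The downsample-upsample protocol amounts to the following sketch (`model` is any flow estimator; note that the flow values must be rescaled together with the spatial resolution):

```python
import torch.nn.functional as F_nn

def infer_downsample_upsample(model, img1, img2):
    """Run at half resolution, then bilinearly upsample and rescale the flow."""
    small1 = F_nn.interpolate(img1, scale_factor=0.5, mode='bilinear', align_corners=False)
    small2 = F_nn.interpolate(img2, scale_factor=0.5, mode='bilinear', align_corners=False)
    flow = model(small1, small2)                           # (B, 2, H/2, W/2)
    flow = F_nn.interpolate(flow, size=img1.shape[-2:], mode='bilinear', align_corners=False)
    return flow * 2.0                                      # vectors scale with resolution
```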


| Extra Data | Method | Sintel Clean ↓ | Sintel Final ↓ | KITTI Fl-epe ↓ | KITTI Fl-all ↓ |
| --- | --- | --- | --- | --- | --- |
| - | PWC-Net [45] | 2.55 | 3.93 | 10.4 | 33.7 |
| - | RAFT [50] | 1.43 | 2.71 | 5.04 | 17.4 |
| - | GMA [20] | 1.30 | 2.74 | 4.69 | 17.1 |
| - | SKFlow [48] | 1.22 | 2.46 | 4.27 | 15.5 |
| - | FlowFormer [14] | 1.01 | 2.40 | 4.09† | 14.7† |
| - | DIP [68] | 1.30 | 2.82 | 4.29 | 13.7 |
| - | EMD-L [6] | 0.88 | 2.55 | 4.12 | 13.5 |
| - | CRAFT [44] | 1.27 | 2.79 | 4.88 | 17.5 |
| - | RPKNet [37] | 1.12 | 2.45 | - | 13.0 |
| - | GMFlowNet [66] | 1.14 | 2.71 | 4.24 | 15.4 |
| - | SEA-RAFT(M) | 1.21 | 4.04 | 4.29 | 14.2 |
| - | SEA-RAFT(L) | 1.19 | 4.11 | 3.62 | 12.9 |
| - | GMFlow [56] | 1.08 | 2.48 | 11.2* | 28.7* |
| Tartan | GMFlow [56] | - | - | 8.70 (-22%)* | 24.4 (-15%)* |
| - | SEA-RAFT(S) | 1.27 | 4.32 | 4.61 | 15.8 |
| Tartan | SEA-RAFT(S) | 1.27 | 3.74 (-13%) | 4.43 | 15.1 |
| K+H | SEA-RAFT(S) | 1.32 | 2.95 (-32%) | - | - |
| Tartan+K+H | SEA-RAFT(S) | 1.30 | 2.79 (-35%) | - | - |

Table 2: SEA-RAFT achieves the best zero-shot performance on KITTI(train). By default,all methods are trained with "C+T". We list the extra data in the first column. \({}^{ \dagger }\) denotes the method uses tiling in inference. * denotes the GMFlow [56] ablation with 200k training steps.

表2:SEA-RAFT在KITTI(訓練集)上實現了最佳的零樣本效能。預設情況下,所有方法都使用“C+T”進行訓練。我們在第一列列出了額外資料。\({}^{ \dagger }\)表示該方法在推理中使用了平鋪技術。*表示使用20萬訓練步數的GMFlow [56]消融實驗。

As shown in Tab. 1, SEA-RAFT achieves the best results among representative existing methods without using extra data, which demonstrates the superiority of our mixture loss and architecture design. When allowed to use extra data, SEA-RAFT falls slightly behind MS-RAFT+ [19] but is \({24} \times\) faster and \({11} \times\) smaller as mentioned in Fig. 1.

如表1所示,SEA-RAFT在不使用額外資料的情況下,在現有代表性方法中取得了最佳結果,這證明了我們的混合損失和架構設計的優越性。當允許使用額外資料時,SEA-RAFT略遜於MS-RAFT+ [19],但如圖1所示,速度快 \({24} \times\)、模型小 \({11} \times\)。

Fine-Tuning Test SEA-RAFT ranks 1st on the public test benchmark: SEA-RAFT(M) outperforms all other methods by at least 22.9% on average EPE(endpoint-error) and 17.8% on 1px (1-pixel outlier rate), and SEA-RAFT(S) outperforms other methods by at least \({20.0}\%\) on EPE and \({12.8}\%\) on \(1\mathrm{{px}}\) . Besides the strong performance,our method is notably fast. SEA-RAFT(S) is at least \({2.3} \times\) faster than existing methods which can achieve similar performance. As we still follow the downsample-upsample protocol without using any test-time optimizations in submissions, the inference latency directly reflects our speed in handling 1080p images, which means over 20fps on a single RTX3090.

微調測試 SEA-RAFT在公共測試基準上排名第一:SEA-RAFT(M)在平均端點誤差(EPE)上比所有其他方法至少降低22.9%,在1畫素異常率(1px)上至少降低17.8%;SEA-RAFT(S)在EPE上至少降低\({20.0}\%\),在\(1\mathrm{{px}}\)上至少降低\({12.8}\%\)。除了強大的效能外,我們的方法還非常快速。SEA-RAFT(S)比能夠達到類似效能的現有方法至少快\({2.3} \times\)。由於我們仍然遵循下采樣-上取樣協議,且在提交中未使用任何測試時最佳化,因此推理延遲直接反映了我們處理1080p影像的速度,即在單個RTX3090上超過20fps。

4.2 Results on Sintel and KITTI

4.2 在Sintel和KITTI上的結果

Zero-Shot Evaluation Following previous works, we evaluate the zero-shot performance of SEA-RAFT given the training schedule "C+T" on Sintel (train) [3] and KITTI (train) [36]. The results are provided in Tab. 2.

零樣本評估 遵循先前的工作,我們在Sintel(訓練集)[3]和KITTI(訓練集)[36]上,評估使用“C+T”訓練計劃的SEA-RAFT的零樣本效能。結果見表2。

| Extra Data | Method | Sintel Clean ↓ | Sintel Final ↓ | KITTI Fl-all ↓ | KITTI Fl-bg ↓ | KITTI Fl-fg ↓ | #MACs | Latency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| - | PWC-Net+ [46] | 3.45 | 4.60 | 7.72 | 7.69 | 7.88 | 101.3G | 23.82ms |
| - | RAFT [50] | 1.61* | 2.86* | 5.10 | 4.74 | 6.87 | 938.2G | 140.7ms |
| - | GMA [20] | 1.39* | 2.47* | 5.15 | - | - | 1352G | 183.3ms |
| - | DIP [68] | 1.44* | 2.83* | 4.21 | 3.86 | 5.96 | 3068G | 498.9ms |
| - | GMFlowNet [66] | 1.39 | 2.65 | 4.79 | 4.39 | 6.84 | 1094G | 244.3ms |
| - | GMFlow [56] | 1.74 | 2.90 | 9.32 | 9.67 | 7.57 | 602.6G | 138.5ms |
| - | CRAFT [44] | 1.45* | 2.42* | 4.79 | 4.58 | 5.85 | 2274G | 483.4ms |
| - | FlowFormer [14] | 1.20 | 2.12 | 4.68† | 4.37† | 6.18† | 1715G | 335.6ms |
| - | SKFlow [48] | 1.28* | 2.23* | 4.85 | 4.55 | 6.39 | 1453G | 331.9ms |
| - | GMFlow+ [57] | 1.03 | 2.37 | 4.49 | 4.27 | 5.60 | 1177G | 249.6ms |
| - | EMD-L [6] | 1.32 | 2.51 | 4.49 | 4.16 | 6.15 | 1755G | OOM |
| - | RPKNet [37] | 1.31 | 2.65 | 4.64 | 4.63 | 4.69 | 137.0G | 183.3ms |
| VIPER [41] | CCMR+ [18] | 1.07 | 2.10 | 3.86 | 3.39 | 6.21 | 12653G | OOM |
| MegaDepth [26] | MatchFlow(G) [8] | 1.16* | 2.37* | 4.63† | 4.33† | 6.11† | 1669G | 290.6ms |
| YouTube-VOS [59] | Flowformer++ [43] | 1.07 | 1.94 | 4.52† | - | - | 1713G | 373.4ms |
| CroCo-Pretrain | CroCoFlow [55] | 1.09† | 2.44† | 3.64† | 3.18† | 5.94† | 57343G† | 6422ms† |
| DDVM-Pretrain | DDVM [42] | 1.75† | 2.48† | 3.26† | 2.90† | 5.05† | - | - |
| TartanAir [52] | SEA-RAFT(M) | 1.44 | 2.86 | 4.64 | 4.47 | 5.49 | 486.9G | 70.96ms |
| TartanAir [52] | SEA-RAFT(L) | 1.31 | 2.60 | 4.30 | 4.08 | 5.37 | 655.1G | 108.0ms |

Table 3: Compared with other methods that achieve competitive performance, SEA-RAFT is at least \({1.8} \times\) faster on Sintel (test) [3] and \({4.6} \times\) faster on KITTI (test) [36]. All methods have "C+T+S+K+H" training by default and we list the extra data each method uses in the first column. We measure latency on an RTX3090 with a batch size of 1 and input resolution \({540} \times {960}\). * denotes the method uses the warm-start [50] strategy. † denotes that the corresponding methods use tiling-based test-time optimizations.

表3:與其他達到相近效能的方法相比,SEA-RAFT 在 Sintel(測試)[3] 上至少快 \({1.8} \times\),在 KITTI(測試)[36] 上至少快 \({4.6} \times\)。所有方法預設採用“C+T+S+K+H”訓練,我們在第一列列出了每種方法使用的額外資料。我們在 RTX3090 上以批次大小 1 和輸入解析度 \({540} \times {960}\) 測量延遲。* 表示該方法使用了熱啟動 [50] 策略。† 表示相應的方法使用了基於平鋪的測試時最佳化。

On KITTI (train), SEA-RAFT outperforms all prior works by a large margin, improving Fl-epe from 4.09 to 3.62 and Fl-all from 13.7 to 12.9. On Sintel (train), SEA-RAFT achieves competitive results on the clean pass but, for reasons unclear to us, underperforms existing methods on the final pass. Note that although this "C+T" zero-shot setting is standard, it is of limited relevance to real-world applications, which do not need to restrict the training data to only C+T. Indeed, we show that by adding a small amount of high-quality real-world data (KITTI + HD1K, about 1.2k image pairs compared with 80k image pairs in FlyingThings3D [34]), the performance gap on the Sintel (train) final pass can be remarkably reduced.

在KITTI(訓練集)上,SEA-RAFT大幅超越所有先前的工作,將Fl-epe從4.09降至3.62,Fl-all從13.7降至12.9。在Sintel(訓練集)上,SEA-RAFT在乾淨通道上取得了有競爭力的結果,但由於我們尚不清楚的原因,在最終通道上表現不如現有方法。請注意,儘管這種“C+T”零樣本設定是標準做法,但它與實際應用的相關性有限,因為實際應用不需要將訓練資料限制為僅C+T。事實上,我們表明,透過新增少量高質量的真實世界資料(KITTI + HD1K,約1.2k影像對,而FlyingThings3D [34]中有80k影像對),可以顯著縮小Sintel(訓練集)最終通道上的效能差距。

Fine-Tuning Test Results are shown in Tab. 3. Compared with RAFT [50], SEA-RAFT achieves \({19.9}\%\) improvements on the Sintel clean pass, \({4.2}\%\) improvements on the Sintel final pass,and \({15.7}\%\) improvements on KITTI F1-all score. SEA-RAFT is also competitive among all existing methods in terms of performance-speed trade-off: It is the only method that can achieve results better than RAFT [50] with latency around 70ms. On Sintel(test), methods with similar performance are at least \({1.8} \times\) slower than us. On KITTI(test),methods with similar performance are at least \({4.6} \times\) slower than us.

微調測試結果如表 3 所示。與 RAFT [50] 相比,SEA-RAFT 在 Sintel 乾淨通道上實現了 \({19.9}\%\) 的改進,在 Sintel 最終通道上實現了 \({4.2}\%\) 的改進,在 KITTI F1-all 評分上實現了 \({15.7}\%\) 的改進。在效能-速度權衡方面,SEA-RAFT 在所有現有方法中也具有競爭力:它是唯一一種能夠在約 70ms 延遲下實現比 RAFT [50] 更好結果的方法。在 Sintel(測試集)上,效能相似的方法至少比我們慢 \({1.8} \times\)。在 KITTI(測試集)上,效能相似的方法至少比我們慢 \({4.6} \times\)


| Experiment | Init. | Pre-train Img [7] | Pre-train Tar [52] | RNN: GRU | RNN: #blocks | Loss Type | Loss Params | #MACs | EPE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SEA-RAFT (w/o Tar.) | ✓ | ✓ | - | - | 2 | Mixture-of-Laplace | \({\beta }_{1} = 0, {\beta }_{2} \in [0, 10]\) | 284.7G | 0.187 |
| SEA-RAFT (w/ Tar.) | ✓ | ✓ | ✓ | - | 2 | Mixture-of-Laplace | \({\beta }_{1} = 0, {\beta }_{2} \in [0, 10]\) | 284.7G | 0.179 |
| w/o Img. | ✓ | - | - | - | 2 | Mixture-of-Laplace | \({\beta }_{1} = 0, {\beta }_{2} \in [0, 10]\) | 284.7G | 0.194 |
| w/o Direct Reg. | ✗ | ✓ | - | - | 2 | Mixture-of-Laplace | \({\beta }_{1} = 0, {\beta }_{2} \in [0, 10]\) | 277.3G | 0.201 |
| RAFT GRU | ✓ | ✓ | - | ✓ | - | Mixture-of-Laplace | \({\beta }_{1} = 0, {\beta }_{2} \in [0, 10]\) | 297.9G | 0.189 |
| More ConvNeXt Blocks | ✓ | ✓ | - | - | 4 | Mixture-of-Laplace | \({\beta }_{1} = 0, {\beta }_{2} \in [0, 10]\) | 314.7G | 0.189 |
| Naive Laplace | ✓ | ✓ | - | - | 2 | Naive Single Laplace | \(\beta \in [-10, 10]\) | 284.7G | 0.217 |
| Naive Laplace | ✓ | ✓ | - | - | 2 | Naive Mixture-of-Laplace | \({\beta }_{1}, {\beta }_{2} \in [-10, 10]\) | 284.7G | 0.248 |
| L1 | ✓ | ✓ | - | - | 2 | \({L}_{1}\) | ✗ | 284.7G | 0.206 |
| Gaussian | ✓ | ✓ | - | - | 2 | Mixture-of-Gaussian | \({\sigma }_{1} = 1, {\sigma }_{2} = {e}^{{\beta }_{2}}, {\beta }_{2} \in [0, 10]\) | 284.7G | 0.210 |

Table 4: We ablate pretraining, direct regression, RNN design, and loss designs on Spring [35] subval. The effect of changes can be identified through comparisons with the first row. See Sec. 4.3 for details.

表 4:我們在 Spring [35] 子驗證集上對預訓練、直接回歸、RNN 設計和損失設計進行了消融實驗。透過與第一行的比較可以識別各項變化的影響。詳見第 4.3 節。



Fig. 6: More iterations produce lower variance in the Mixture of Laplace, indicating that the model becomes more confident after each iteration.

圖 6:更多迭代次數會降低拉普拉斯混合分佈的方差,表明模型在每次迭代後變得更加自信。

4.3 Ablations and Analysis

4.3 消融實驗與分析

Ablation experiments are conducted on the Spring [35] dataset based on SEA-RAFT(S). We separate a subval set (sequence 0045 and 0047) from the original training set, train our model on the remaining training data and evaluate the performance on subval. The model is trained with a batch size of 32 , input resolution \({540} \times {960}\) ,and tested following "downsample-upsample" protocol mentioned in Sec. 4.1. We describe the details of ablation studies in the following and show the results in Tab. 4:

基於 SEA-RAFT(S) 在 Spring [35] 資料集上進行了消融實驗。我們從原始訓練集中分離出一個子驗證集(序列 0045 和 0047),使用剩餘的訓練資料訓練我們的模型,並在子驗證集上評估效能。模型以 32 的批次大小、輸入解析度 \({540} \times {960}\) 進行訓練,並按照第 4.1 節提到的“下采樣-上取樣”協議進行測試。我們在下文中詳細描述了消融研究的細節,並在表 4 中展示了結果:

Pretraining We test the performance of TartanAir [52] rigid-flow pre-training on different datasets (see Tabs. 1, 2 and 4 for details). Without TartanAir, SEA-RAFT already provides strong performance, and the rigid-flow pre-training makes it better. We also show that ImageNet pre-trained weights are effective.

預訓練 我們測試了TartanAir [52] 剛性流預訓練在不同資料集上的表現(詳情見表1、2和4)。沒有TartanAir,SEA-RAFT已經提供了強大的效能,而剛性流預訓練使其更上一層樓。我們還展示了ImageNet預訓練權重是有效的。

RNN Design Our new RNN designs can reduce the computation without performance loss compared with the GRU used in RAFT [50]. We also show that on Spring subval, 4 ConvNeXt blocks do not work better than 2 ConvNeXt blocks.

RNN設計 我們的新RNN設計可以在不損失效能的情況下減少計算量,相比於RAFT [50]中使用的GRU。我們還展示了在Spring子集上,4個ConvNeXt塊並不比2個ConvNeXt塊效果更好。


Fig. 7: Iterative refinements are not hardware-friendly: The latency almost linearly increases with the number of iterations.

圖7:迭代細化對硬體不友好:延遲幾乎線性地隨著迭代次數增加。

| Method | #Iters | Latency (ms): Total | Latency (ms): Iter. |
| --- | --- | --- | --- |
| RAFT [50] | 24 (K) | 111 | 90.3 (82%) |
| RAFT [50] | 32 (S) | 141 | 120 (86%) |
| SEA-RAFT | 4 (S) | 47.5 | 18.5 (39%) |
| SEA-RAFT | 4 (M) | 70.9 | 18.5 (26%) |
| SEA-RAFT | 12 (L) | 108 | 55.5 (51%) |

Table 5: Compared with RAFT, SEA-RAFT significantly reduces the cost of iterative refinements, which allows larger backbones while still being faster. We use \(\mathrm{K}\) and \(\mathrm{S}\) to denote RAFT submissions on KITTI and Sintel respectively.

表5:與RAFT相比,SEA-RAFT顯著降低了迭代細化的成本,這使得在保持更快速度的同時可以使用更大的主幹網路。我們用\(\mathrm{K}\)\(\mathrm{S}\)分別表示在KITTI和Sintel上的RAFT提交。

Loss Design We see that naive Laplace regression does worse than the original \({L}_{1}\) loss. We also see that it is important to set \({\beta }_{1}\) to 0 in the MoL loss,which aligns the MoL loss to \({L}_{1}\) for ordinary cases. Besides,we find that the mixture of Gaussian loss does not work well for optical flow, even though it has been found to be useful for image matching [4].

損失設計 我們發現樸素的拉普拉斯迴歸不如原始的\({L}_{1}\)損失。我們還發現,在MoL損失中將\({\beta }_{1}\)設定為0很重要,這使得MoL損失在普通情況下與\({L}_{1}\)對齊。此外,我們發現高斯混合損失對於光流效果不佳,儘管它已被發現對影像匹配有用[4]。

Direct Regression of Initial Flow We see that the regressed flow initialization significantly improves accuracy without introducing much overhead.

初始流直接回歸 我們發現迴歸的流初始化顯著提高了準確性,而沒有引入太多開銷。

Inference Time Breakdown In Fig. 7, we show how the computational cost increases when we add more refinements. The cost bottleneck for SEA-RAFT is no longer iterative refinements ( Tab. 5), which allows us to use larger backbones given the same computational cost constraint as RAFT [50].

推理時間分解 在圖7中,我們展示了增加更多細化步驟時計算成本的增長情況。SEA-RAFT的成本瓶頸不再是迭代細化(見表5),這使得在與RAFT [50]相同的計算成本約束下可以使用更大的主幹網路。

5 Conclusion

5 結論

We have introduced SEA-RAFT, a simpler, more efficient and accurate variant of RAFT. It achieves high accuracy across a diverse range of datasets, strong cross-dataset generalization, and state-of-the-art accuracy-speed trade-offs, making it useful for real-world high-resolution optical flow.

我們介紹了 SEA-RAFT,這是 RAFT 的一個更簡單、更高效且更準確的變體。它在多種資料集上實現了高精度,具有強大的跨資料集泛化能力,並且在精度和速度之間達到了最先進的平衡,使其適用於現實世界的高解析度光流計算。

Acknowledgements

致謝

This work was partially supported by the National Science Foundation.

這項工作部分得到了國家科學基金會的支援。

References

參考文獻

  1. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. International journal of computer vision 92, 1-31 (2011) 3, 8

  2. Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in neural network. In: International conference on machine learning. pp. 1613-1622. PMLR (2015) 3

  3. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: European conference on computer vision. pp. 611- 625. Springer (2012) 1, 3, 8, 9, 10, 11, 12

  4. Chen, H., Luo, Z., Zhou, L., Tian, Y., Zhen, M., Fang, T., Mckinnon, D., Tsin, Y., Quan, L.: Aspanformer: Detector-free image matching with adaptive span transformer. In: European Conference on Computer Vision. pp. 20-36. Springer (2022) 3, 4, 5, 7, 14

  5. Chen, Q., Koltun, V.: Full flow: Optical flow estimation by global optimization over regular grids. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4706-4714 (2016) 1,3

  6. Deng, C., Luo, A., Huang, H., Ma, S., Liu, J., Liu, S.: Explicit motion disentangling for efficient optical flow estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9521-9530 (2023) 1, 3, 5, 11, 12

  7. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248-255. Ieee (2009) 9, 13

  8. Dong, Q., Cao, C., Fu, Y.: Rethinking optical flow from geometric matching consistent perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1337-1347 (2023) 1, 3, 8, 10, 12

  9. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2758-2766 (2015) 3, 5, 8, 9

  10. Gao, C., Saraf, A., Huang, J.B., Kopf, J.: Flow-edge guided video completion. In: European Conference on Computer Vision. pp. 713-729. Springer (2020) 1

  11. Garrepalli, R., Jeong, J., Ravindran, R.C., Lin, J.M., Porikli, F.: Dift: Dynamic iterative field transforms for memory efficient optical flow. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2219-2228 (2023) 1, 3, 5

  12. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. The International Journal of Robotics Research 32(11), 1231-1237 (2013) 3, 9, 10

  13. Horn, B.K., Schunck, B.G.: Determining optical flow. Artificial intelligence 17(1-3), 185-203 (1981) 1,3

  14. Huang, Z., Shi, X., Zhang, C., Wang, Q., Cheung, K.C., Qin, H., Dai, J., Li, H.: Flowformer: A transformer architecture for optical flow. In: European Conference on Computer Vision. pp. 668-685. Springer (2022) 1, 3, 5, 8, 10, 11, 12

  15. Huang, Z., Zhang, T., Heng, W., Shi, B., Zhou, S.: Rife: Real-time intermediate flow estimation for video frame interpolation. arXiv preprint arXiv:2011.06294 (2020) 1

  16. Hui, T.W., Tang, X., Loy, C.C.: Liteflownet: A lightweight convolutional neural network for optical flow estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8981-8989 (2018) 3

  17. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2462-2470 (2017) 3, 8, 10

  18. Jahedi, A., Luz, M., Rivinius, M., Bruhn, A.: Ccmr: High resolution optical flow estimation via coarse-to-fine context-guided motion reasoning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6899- 6908 (2024) 3, 5, 12

  19. Jahedi, A., Luz, M., Rivinius, M., Mehl, L., Bruhn, A.: Ms-raft+: High resolution multi-scale raft. International Journal of Computer Vision pp. 1-22 (2023) 2, 3, 5, 10,11

  20. Jiang, S., Campbell, D., Lu, Y., Li, H., Hartley, R.: Learning to estimate hidden motions with global motion aggregation. arXiv preprint arXiv:2104.02409 (2021) 10,11,12

  21. Jung, H., Hui, Z., Luo, L., Yang, H., Liu, F., Yoo, S., Ranjan, R., Demandolx, D.: Anyflow: Arbitrary scale optical flow with implicit neural representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5455-5465 (2023) 5

  22. Kim, D., Woo, S., Lee, J.Y., Kweon, I.S.: Deep video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5792-5801 (2019) 1

  23. Kondermann, D., Nair, R., Honauer, K., Krispin, K., Andrulis, J., Brock, A., Gusse-feld, B., Rahimimoghaddam, M., Hofmann, S., Brenner, C., et al.: The hci benchmark suite: Stereo and flow ground truth with uncertainties for urban autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 19-28 (2016) 3, 9

  24. Leroy, V., Revaud, J., Lucas, T., Weinzaepfel, P.: Win-win: Training high-resolution vision transformers from two windows. arXiv preprint arXiv:2310.00632 (2023) 1, 3, 10

  25. Li, J., Bian, S., Zeng, A., Wang, C., Pang, B., Liu, W., Lu, C.: Human pose regression with residual log-likelihood estimation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11025-11034 (2021) 3

  26. Li, Z., Snavely, N.: Megadepth: Learning single-view depth prediction from internet photos. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2041-2050 (2018) 3, 10, 12

  27. Liu, X., Liu, H., Lin, Y.: Video frame interpolation via optical flow estimation with image inpainting. International Journal of Intelligent Systems 35(12), 2087-2102 (2020) 1

  28. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11976-11986 (2022) 9

  29. Lu, Y., Wang, Q., Ma, S., Geng, T., Chen, Y.V., Chen, H., Liu, D.: Transflow: Transformer as flow learner. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18063-18073 (2023) 1

  30. Luo, A., Yang, F., Li, X., Liu, S.: Learning optical flow with kernel patch attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8906-8915 (2022) 3

  31. Luo, A., Yang, F., Li, X., Nie, L., Lin, C., Fan, H., Liu, S.: Gaflow: Incorporating gaussian attention into optical flow. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9642-9651 (2023) 3, 8

  32. Luo, A., Yang, F., Luo, K., Li, X., Fan, H., Liu, S.: Learning optical flow with adaptive graph reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) (2022) 3

  33. Ma, Z., Teed, Z., Deng, J.: Multiview stereo with cascaded epipolar raft. In: European Conference on Computer Vision. pp. 734-750. Springer (2022) 1

  34. Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4040-4048 (2016) 3, 8, 9, 12

  35. Mehl, L., Schmalfuss, J., Jahedi, A., Nalivayko, Y., Bruhn, A.: Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4981-4991 (2023) 1, 2, 3, 7, 8, 9, 10, 13

  36. Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3061-3070 (2015) 1, 3, 8, 11, 12

  37. Morimitsu, H., Zhu, X., Ji, X., Yin, X.C.: Recurrent partial kernel network for efficient optical flow estimation (2024) 3, 8, 10, 11, 12

  38. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019) 9

  39. Piergiovanni, A., Ryoo, M.S.: Representation flow for action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9945-9953 (2019) 1

  40. Raistrick, A., Lipson, L., Ma, Z., Mei, L., Wang, M., Zuo, Y., Kayan, K., Wen, H., Han, B., Wang, Y., et al.: Infinite photorealistic worlds using procedural generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12630-12641 (2023) 3

  41. Richter, S.R., Hayder, Z., Koltun, V.: Playing for benchmarks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2213-2222 (2017) 3, 10, 12

  42. Saxena, S., Herrmann, C., Hur, J., Kar, A., Norouzi, M., Sun, D., Fleet, D.J.: The surprising effectiveness of diffusion models for optical flow and monocular depth estimation. Advances in Neural Information Processing Systems 36 (2024) 1, 3, 12

  43. Shi, X., Huang, Z., Li, D., Zhang, M., Cheung, K.C., See, S., Qin, H., Dai, J., Li, H.: Flowformer++: Masked cost volume autoencoding for pretraining optical flow estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1599-1610 (2023) 1, 3, 10, 12

  44. Sui, X., Li, S., Geng, X., Wu, Y., Xu, X., Liu, Y., Goh, R., Zhu, H.: Craft: Cross-attentional flow transformer for robust optical flow. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17602-17611 (2022) 1, 3, 10, 11, 12

  45. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8934-8943 (2018) 1, 3, 9, 10, 11

  46. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Models matter, so does training: An empirical study of cnns for optical flow estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(6), 1408-1423 (2019) 12

  47. Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: Loftr: Detector-free local feature matching with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8922-8931 (2021) 3, 4, 5, 7

  48. Sun, S., Chen, Y., Zhu, Y., Guo, G., Li, G.: Skflow: Learning optical flow with super kernels. Advances in Neural Information Processing Systems 35, 11313-11326 (2022) 1, 3, 8, 10, 11, 12

  49. Sun, S., Kuang, Z., Sheng, L., Ouyang, W., Zhang, W.: Optical flow guided feature: A fast and robust motion representation for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1390-1399 (2018) 1

  50. Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: European conference on computer vision. pp. 402-419. Springer (2020) 1, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14

  51. Truong, P., Danelljan, M., Timofte, R., Van Gool, L.: Pdc-net+: Enhanced probabilistic dense correspondence network. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) 3, 4, 5, 7

  52. Wang, W., Zhu, D., Wang, X., Hu, Y., Qiu, Y., Wang, C., Hu, Y., Kapoor, A., Scherer, S.: Tartanair: A dataset to push the limits of visual slam. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4909-4916. IEEE (2020) 2, 3, 8, 9, 10, 12, 13

  53. Wannenwetsch, A.S., Keuper, M., Roth, S.: Probflow: Joint optical flow and uncertainty estimation. In: Proceedings of the IEEE international conference on computer vision. pp. 1173-1182 (2017) 3, 5

  54. Weinzaepfel, P., Leroy, V., Lucas, T., Brégier, R., Cabon, Y., Arora, V., Antsfeld, L., Chidlovskii, B., Csurka, G., Revaud, J.: Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion. Advances in Neural Information Processing Systems 35, 3502-3516 (2022) 1, 3

  55. Weinzaepfel, P., Lucas, T., Leroy, V., Cabon, Y., Arora, V., Brégier, R., Csurka, G., Antsfeld, L., Chidlovskii, B., Revaud, J.: Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17969-17980 (2023) 1, 3, 5, 10, 12

  56. Xu, H., Zhang, J., Cai, J., Rezatofighi, H., Tao, D.: Gmflow: Learning optical flow via global matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8121-8130 (2022) 1, 3, 6, 10, 11, 12

  57. Xu, H., Zhang, J., Cai, J., Rezatofighi, H., Yu, F., Tao, D., Geiger, A.: Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) 1, 3, 10, 12

  58. Xu, J., Ranftl, R., Koltun, V.: Accurate optical flow via direct cost volume processing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1289-1297 (2017) 3

  59. Xu, N., Yang, L., Fan, Y., Yue, D., Liang, Y., Yang, J., Huang, T.: Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327 (2018) 10, 12

  60. Xu, R., Li, X., Zhou, B., Loy, C.C.: Deep flow-guided video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3723-3732 (2019) 1

  61. Xu, X., Siyao, L., Sun, W., Yin, Q., Yang, M.H.: Quadratic video interpolation. Advances in Neural Information Processing Systems 32 (2019) 1

  62. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime tv-l1 optical flow. In: Pattern Recognition: 29th DAGM Symposium, Heidelberg, Germany, September 12-14, 2007. Proceedings 29. pp. 214-223. Springer (2007) 1, 3

  63. Zhai, M., Xiang, X., Lv, N., Ali, S.M., El Saddik, A.: Skflow: Optical flow estimation using selective kernel networks. IEEE Access 7, 98854-98865 (2019) 1

  64. Zhang, S., Sun, X., Chen, H., Li, B., Shen, C.: Rgm: A robust generalist matching model. arXiv preprint arXiv:2310.11755 (2023) 3, 5, 6

  65. Zhao, S., Sheng, Y., Dong, Y., Chang, E.I., Xu, Y., et al.: Maskflownet: Asymmetric feature matching with learnable occlusion mask. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6278-6287 (2020) 3, 9

  66. Zhao, S., Zhao, L., Zhang, Z., Zhou, E., Metaxas, D.: Global matching with overlapping attention for optical flow estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17592-17601 (2022) 1, 3, 8, 11, 12

  67. Zhao, Y., Man, K.L., Smith, J., Siddique, K., Guan, S.U.: Improved two-stream model for human action recognition. EURASIP Journal on Image and Video Processing 2020(1), 1-9 (2020) 1

  68. Zheng, Z., Nie, N., Ling, Z., Xiong, P., Liu, J., Wang, H., Li, J.: Dip: Deep inverse patchmatch for high-resolution optical flow. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8925-8934 (2022) 1, 3, 10, 11, 12

  69. Zuo, Y., Deng, J.: View synthesis with sculpted neural points. arXiv preprint arXiv:2205.05869 (2022) 1