[Paper Reading] ICLR 2022: Scene Transformer: A unified architecture for predicting future trajectories of multiple agents

Posted by Kin_Zhang on 2022-04-05

Original URL: https://www.cnblogs.com/kin-zhang/p/16104211.html

ICLR 2022: Scene Transformer: A unified architecture for predicting future trajectories of multiple agents

Type: ICLR
Year: 2022
Organization: Waymo

References and preface
- OpenReview
https://openreview.net/forum?id=Wm3EA5OlHsG
- PDF
Scene Transformer: A unified architecture for predicting multiple agent trajectories

1. Motivation

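The work is mainly inspired by language modeling approaches.

Problem setting

Task: trajectory prediction for multiple agents

Difficulty: each agent's behavior is diverse, and agents also influence one another's trajectories.

Previous work mostly predicts the future trajectory of each agent independently from its past motion, then plans against those separate predictions. Independent predictions, however, are poor at representing how different agents interact in the future, so the trajectories planned from them are also sub-optimal.

marginal prediction: the trajectories predicted for different agents may conflict at a future time, i.e. they can intersect.
joint prediction: at any given future time, the predicted trajectories of different agents do not conflict; they respect each other's predictions.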

Contribution

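Formulate a model that jointly predicts the behavior of all agents, producing consistent futures that account for the interactions between agents.

The contributions below are quoted from the paper. The format looks a lot like the TRO format that jjh described, with a noun phrase naming the method as the subject.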

A novel, scene-centric approach that allows us to gracefully switch training the model to produce either marginal (independent) and joint agent predictions in a single feed-forward pass.

Switching between marginal and joint prediction within a single feed-forward pass.
A permutation equivariant Transformer-based architecture factored over agents, time, and road graph elements that exploits the inherent symmetries of the problem.

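A Transformer-based architecture that is permutation equivariant (reordering the inputs reorders the outputs in the same way) and that takes agents, time, and road graph elements into account together.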
A masked sequence modeling approach that enables us to condition on hypothetical agent futures at inference time, enabling conditional motion prediction or goal conditioned prediction.

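Masked sequence modeling lets us condition on hypothetical futures, in the temporal sense.

Questions:

I did not understand the method summary in the abstract, not a single one of the three parts... TBD, I will come back to this after reading further.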

Through combining a scene-centric approach, agent permutation equivariant model, and a sequence masking strategy
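- The introduction says scene-centric is introduced for scaling to large numbers of agents, yet the contribution talks about switching? Hmm, does it switch when there are many agents and go joint when there are few?
Why is the evaluation on both marginal and joint motion predictions? The latter I understand, but what is being predicted in the marginal case? Each single agent's prediction compared against the ground truth?

The introduction explains this later; see the explanation above.
Why switch at all? Wouldn't it be better to do joint prediction throughout?

The method section explains that one network can serve several tasks, the main ones being motion prediction, conditional motion prediction, and goal-conditioned prediction.
Transformer? Attention mechanism? Are the inputs represented as vectors?

The method section covers this: the static road graph is represented as feature vectors, and dynamic elements such as traffic lights use one feature vector per object.
I did not understand the last contribution. The second one already says a Transformer-like mechanism takes time into account, so isn't the motivation for masked sequence modeling redundant?
- Does it directly hypothesize the agents' futures? I may be missing a lot of background here and will probably have to chase references recursively.
The real reason for masking is the task switching... The approach is flexible, enabling us to simultaneously train a single model for MP, CMP, GCP.
If there is a leaderboard and you are not first, can you still call yourself state-of-the-art? After all, this paper ranks fairly low on the Waymo online leaderboard.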

2. Method

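The related work revolves around the topics below. Only a brief summary is given here, mostly to fill in background that may answer the questions above.

Motion prediction frameworks: successful models mostly take into account the agents' motion history and the road structure (lanes, stop lines, traffic lights, and so on).

Related approaches:
- Render the inputs directly as a multi-channel top-down (bird's-eye view) image and apply convolutions; the limited receptive field, however, is poor at capturing spatially-distant interactions
- entity-centric approach: encode each agent's history with sequence modeling such as an RNN, and encode the pose information and semantic type of the road elements (for example as piecewise-linear segments); the information is then aggregated with pooling, soft-attention, or graph neural networks
scene-centric vs. agent-centric representation: this discusses the frame used for representation encoding
- A scene-level coordinate frame with a rasterized top-down image represents the world state efficiently in a common frame, but loses some potential pose information
- An agent-centric coordinate frame, but the number of interactions grows quadratically as the number of agents increases.
The paper then notes that LaneGCN (from Uber ATG) is agent-centric but works in a global frame, and it does not need the scene rendered as an image either.
Representing multi-agent futures: how to represent the future states of multiple agents; a common choice is a weighted set of trajectories per agent

Questions:

Isn't the representation in the second point the same as the "related approaches" in the first? Many parts of this paper feel coupled, with similar reasoning behind different methods; why not merge them into one?

One is about the representation; the other is about what the encoding is centered on.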

2.1 Inputs and outputs

Input

a feature for every agent at every time step

In the model this is a 3D tensor: A agents, each with D feature dimensions, over T time steps. We want to keep this size, \([A,T,D]\), in every layer.

Note that the decoder has one additional dimension: F potential futures
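
To make the shapes concrete, here is a minimal NumPy sketch of going from the \([A,T,D]\) encoder tensor to a decoder input with an extra leading F dimension. The helper name `tile_for_futures` and the one-hot future index appended to the features are assumptions made for this sketch (the notes only state that the decoder carries F futures), not the authors' code.

```python
import numpy as np

def tile_for_futures(encoded, num_futures):
    """Copy an [A, T, D] tensor num_futures times and tag each copy.

    Returns an array of shape [F, A, T, D + F]. The last F channels are a
    one-hot future index, added here (as an assumption) so that the copies
    are distinguishable from one another.
    """
    num_agents, num_steps, _ = encoded.shape
    tiled = np.broadcast_to(encoded, (num_futures,) + encoded.shape)
    one_hot = np.eye(num_futures)[:, None, None, :]          # [F, 1, 1, F]
    one_hot = np.broadcast_to(
        one_hot, (num_futures, num_agents, num_steps, num_futures))
    return np.concatenate([tiled, one_hot], axis=-1)

encoded = np.zeros((4, 10, 8))   # A=4 agents, T=10 steps, D=8 features
decoder_input = tile_for_futures(encoded, num_futures=3)
print(decoder_input.shape)
```

In a real model a small learned layer would presumably project the last dimension back to D so that later layers keep the `[F, A, T, D]` size.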

Output

an output for every agent at every time step

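2.2 Architecture

The overall model is called Scene Transformer and has three stages:

- Embed the agents and the road graph into a high-dimensional space
- Employ an attention-based network to encode the interactions between the agents and the road graph
- Use an attention-based network to decode multiple futures

mask

Switching between tasks is done mainly through masking (see the masking figure in the paper). For MP, the future part of the time dimension is masked out. For CMP, the AV's own motion over the future horizon is additionally provided. For GCP, only the AV's motion at the final time step T is additionally provided.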
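
The three masking patterns above can be sketched as a boolean visibility mask of shape `[A, T]`. This is an illustrative sketch only: the function name, the `av_index` argument, and the convention that `True` means "visible to the model" are assumptions, not taken from the paper.

```python
import numpy as np

def build_visibility_mask(task, num_agents, num_steps, current_step, av_index=0):
    """Return a [A, T] boolean mask; True marks time steps the model may see."""
    visible = np.zeros((num_agents, num_steps), dtype=bool)
    # History up to and including the current step is visible for every agent.
    visible[:, :current_step + 1] = True
    if task == "MP":
        pass                               # the whole future stays hidden
    elif task == "CMP":
        visible[av_index, :] = True        # reveal the AV's full future
    elif task == "GCP":
        visible[av_index, -1] = True       # reveal only the AV's final step
    else:
        raise ValueError(f"unknown task: {task}")
    return visible

for task in ("MP", "CMP", "GCP"):
    mask = build_visibility_mask(task, num_agents=3, num_steps=6, current_step=2)
    print(task, mask.astype(int).tolist())
```

Because only the mask changes between tasks, the same network weights can be trained on all three by sampling a task per batch.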

A. Scene-Centric Representation

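This part is about what the surrounding scene information is centered on. As mentioned in the related work above, scene-centric here means using an agent of interest's position as the origin and encoding all road graph elements and agents relative to it. Agent-centric would instead mean repeating the computation for each agent with that agent as the origin.

The detailed steps are:

- Generate a feature for each agent at each time step, if that time step is visible
- Use PointNet to learn one feature vector per polyline for the static road graph and the other elements, where traffic signs are polylines of length 1
- For the dynamic road graph, such as traffic lights that are static in space but change over time, generate one feature vector per object

All of the above categories carry xyz position information. Everything is centered on the selected agent, with the remaining elements translated and rotated accordingly, and then sinusoidal position embeddings are applied.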
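
The centering, rotation, and sinusoidal embedding step can be sketched as follows. For brevity the sketch works in 2D (x, y) rather than xyz, and the frequency schedule (`num_freqs`, `base`) is an assumption borrowed from the usual Transformer positional encoding, not a value from the paper.

```python
import numpy as np

def to_scene_frame(points_xy, origin_xy, heading):
    """Express [N, 2] points in a frame centered on the agent of interest,
    rotated so that the agent's heading points along +x."""
    c, s = np.cos(-heading), np.sin(-heading)
    rotation = np.array([[c, -s], [s, c]])          # rotate by -heading
    return (np.asarray(points_xy) - np.asarray(origin_xy)) @ rotation.T

def sinusoidal_embedding(values, num_freqs=4, base=10000.0):
    """Map each scalar to [sin(v * f_i), cos(v * f_i)] over num_freqs frequencies."""
    values = np.asarray(values, dtype=float)
    freqs = base ** (-np.arange(num_freqs) / num_freqs)
    angles = values[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# An agent at (1, 1) facing +y; a point one metre ahead of it lands at (1, 0).
local = to_scene_frame([[1.0, 2.0]], origin_xy=[1.0, 1.0], heading=np.pi / 2)
embedded = sinusoidal_embedding(local)               # [N, 2, 2 * num_freqs]
print(np.round(local, 6), embedded.shape)
```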

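B. Encoding Transformer

This is not much different from standard attention: query, key, and value come from learned linear layers, each applied to the input x, e.g. \(Q=W_qx\). See the encoder and decoder block diagram in the paper. The decoder ends with a two-layer MLP that predicts 7 outputs. The first six are: three for the agent's 3D position at the given time step, in absolute coordinates relative to the agent of interest, and three for the corresponding uncertainty, parameterized as a Laplace distribution. The last one is the heading.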
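
As a rough illustration of the 7-value output head, the sketch below splits a raw `[..., 7]` prediction into position, Laplace scale, and heading, and evaluates a Laplace negative log-likelihood. The channel order follows the description above; the softplus used to keep the scale positive and the function names are assumptions of this sketch.

```python
import numpy as np

def split_outputs(raw):
    """Split raw [..., 7] outputs into position, Laplace scale, and heading."""
    position = raw[..., 0:3]                     # x, y, z
    scale = np.logaddexp(0.0, raw[..., 3:6])     # softplus keeps the scale > 0
    heading = raw[..., 6]
    return position, scale, heading

def laplace_nll(target, position, scale):
    """Negative log-likelihood of target under independent Laplace(position, scale)."""
    per_axis = np.log(2.0 * scale) + np.abs(target - position) / scale
    return per_axis.sum(axis=-1)

raw = np.zeros((2, 5, 7))                        # [A, T, 7]
position, scale, heading = split_outputs(raw)
nll = laplace_nll(np.ones((2, 5, 3)), position, scale)
print(position.shape, scale.shape, heading.shape, nll.shape)
```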

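For more efficient self-attention, the attention is factorized. Attention across time treats each agent independently and learns smooth trajectories, while attention across agents treats each time step independently and learns the interactions. This is a form of decoupling; as the lower part of the decoder diagram shows, the two alternate twice.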
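
A minimal single-head NumPy sketch of this factorized attention: the same scaled dot-product attention is applied along the time axis (agents act as the batch) and then along the agent axis (time steps act as the batch). It uses the row-vector convention `x @ W` instead of \(W_qx\), and omits multi-head splitting, masking, residuals, and layer norm; all names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention over the second-to-last axis of x [..., N, D]."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def attend_across_time(x, params):
    """x is [A, T, D]; each agent only attends to its own time steps."""
    return self_attention(x, *params)

def attend_across_agents(x, params):
    """x is [A, T, D]; at each time step, agents attend to one another."""
    swapped = np.swapaxes(x, 0, 1)                       # [T, A, D]
    return np.swapaxes(self_attention(swapped, *params), 0, 1)

rng = np.random.default_rng(0)
time_params = tuple(0.1 * rng.normal(size=(8, 8)) for _ in range(3))
agent_params = tuple(0.1 * rng.normal(size=(8, 8)) for _ in range(3))
x = rng.normal(size=(3, 4, 8))                           # A=3, T=4, D=8
out = attend_across_agents(attend_across_time(x, time_params), agent_params)
print(out.shape)
```

Each factorized layer costs on the order of A·T² or T·A² score entries, compared with (A·T)² for attention over all agent-time pairs at once.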

The agents attend to the road graph through cross-attention.

C. Predicting Probabilities for each Futures

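What is predicted here is a probability score, whether that means scoring each future scenario in the joint model or scoring each trajectory in the marginal model. So we need a feature representation that summarizes the scene and each agent.

The agent feature tensor is summed separately along the agent and time dimensions, and the results are appended as an additional artificial agent and time step, so the internal representation becomes \([A+1,T+1,D]\).

This is then fed through the decoder, and a two-layer MLP + softmax turns the result into the probabilities for each future.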
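
The two steps above can be sketched in NumPy as follows. The sketch follows these notes in using a sum as the reduction; the decoder itself is replaced by a random stand-in tensor, and the MLP sizes, ReLU activation, and function names are assumptions.

```python
import numpy as np

def add_summary_slots(x):
    """Extend [A, T, D] to [A+1, T+1, D] with an artificial agent and time step."""
    agent_summary = x.sum(axis=0, keepdims=True)                # [1, T, D]
    with_agent = np.concatenate([x, agent_summary], axis=0)     # [A+1, T, D]
    time_summary = with_agent.sum(axis=1, keepdims=True)        # [A+1, 1, D]
    return np.concatenate([with_agent, time_summary], axis=1)   # [A+1, T+1, D]

def future_probabilities(summary, w1, b1, w2, b2):
    """Two-layer MLP + softmax: summary [F, D] -> probabilities [F]."""
    hidden = np.maximum(summary @ w1 + b1, 0.0)                 # ReLU
    logits = (hidden @ w2 + b2)[:, 0]
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 5, 8))                    # A=3, T=5, D=8
extended = add_summary_slots(x)

# Stand-in for the decoder output with F=6 futures: [F, A+1, T+1, D].
decoded = rng.normal(size=(6,) + extended.shape)
scene_summary = decoded[:, -1, -1, :]             # artificial agent and time slot
probs = future_probabilities(
    scene_summary,
    rng.normal(size=(8, 16)), np.zeros(16),
    rng.normal(size=(16, 1)), np.zeros(1))
print(extended.shape, probs.round(3))
```

For marginal scoring one would presumably read the artificial time slot of every real agent (`decoded[:, :-1, -1, :]`) and apply the softmax over F per agent.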

D. Joint and Marginal Loss Formulation

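In the joint case, a displacement loss is aggregated over all agents and time steps to build a loss tensor of shape \([F]\), and only the future closest to the ground truth is back-propagated. For marginal prediction each agent is treated separately: the displacement loss has shape \([F,A]\), it is not aggregated across agents, and the minimum loss is instead selected per agent and back-propagated.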
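
The difference between the two formulations is only where the minimum over futures is taken. Below is a small NumPy sketch that uses a plain L2 displacement averaged over time as a stand-in for the paper's actual loss terms (an assumption), with illustrative function names.

```python
import numpy as np

def displacement(pred, gt):
    """pred [F, A, T, 2], gt [A, T, 2] -> per-step L2 distance [F, A, T]."""
    return np.linalg.norm(pred - gt[None], axis=-1)

def joint_loss(pred, gt):
    """Aggregate over agents and time first ([F]), then keep the best future."""
    per_future = displacement(pred, gt).mean(axis=(1, 2))      # [F]
    return per_future.min()

def marginal_loss(pred, gt):
    """Aggregate over time only ([F, A]), then keep the best future per agent."""
    per_future_agent = displacement(pred, gt).mean(axis=2)     # [F, A]
    return per_future_agent.min(axis=0).mean()

gt = np.zeros((2, 3, 2))                     # A=2 agents, T=3 steps
pred = np.zeros((2, 2, 3, 2))                # F=2 futures
pred[0, 1, :, 0] = 1.0                       # future 0: agent 1 is off by 1 m
pred[1, 0, :, 0] = 1.0                       # future 1: agent 0 is off by 1 m
print(joint_loss(pred, gt), marginal_loss(pred, gt))
```

In this toy case no single future is right for both agents, so the joint loss stays at 0.5, while the marginal loss is 0 because each agent may pick a different future.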

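Questions:

The encoder and the decoder are both attention-based networks... so...

The block diagram explains how the two are designed.
Is the expected motion here obtained from a planner? Is the planning deterministic? Or does it come straight from the dataset?

Probably the dataset, so the future motion can be read directly from it for this task.
"an agent of interest's position" just means the position of the agent we care about, right... why phrase it so awkwardly... wouldn't "select an interest agent's position" do...
- What is the selection criterion?

  A reviewer on OpenReview asked the same thing, haha, and a footnote explains it: for Waymo it is the ego vehicle, for Argoverse it is the vehicle to be predicted.
What does "all" mean here? The road structure of the whole map? Or only a box drawn around the selected agent?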

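3. Experiments

The metrics are minADE, minFDE, miss rate, and mAP over the predicted scene. They basically all measure how close the top k trajectories are to ground truth observation, i.e. how near the predicted trajectories are to the ground truth.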

L2: A simple and common distance-based metric is to measure the L2 norm between a given trajectory and the ground truth
minADE: reports the L2 norm of the trajectory with the minimal distance
minFDE: reports the L2 norm of the trajectory with the smallest distance only evaluated at the final location of the trajectory.
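
A small NumPy sketch of the two distance metrics for a single agent, assuming `pred` holds K candidate trajectories of shape `[K, T, 2]` and `gt` is the `[T, 2]` ground truth (function names are illustrative):

```python
import numpy as np

def min_ade(pred, gt):
    """Smallest average L2 displacement over the K candidate trajectories."""
    distances = np.linalg.norm(pred - gt[None], axis=-1)   # [K, T]
    return distances.mean(axis=1).min()

def min_fde(pred, gt):
    """Smallest L2 displacement at the final time step over the K candidates."""
    final = np.linalg.norm(pred[:, -1] - gt[-1], axis=-1)  # [K]
    return final.min()

gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = np.stack([gt, gt])
pred[0, :2, 1] = 0.5      # candidate 0: off by 0.5 early, exact at the end
pred[1, -1, 1] = 0.6      # candidate 1: exact early, off by 0.6 at the end
print(min_ade(pred, gt), min_fde(pred, gt))
```

Here the two metrics are won by different candidates: candidate 1 has the lower average error, candidate 0 the lower final error.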

This paper uses MR and mAP; for joint futures it uses the scene-level counterparts minSADE, minSFDE, and SMR.

miss rate (MR) and mean average precision (mAP) to capture how well a model predicts all of the future trajectories of agents probabilistically

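This part mainly reproduces the experiment tables from the paper.

Scenario analysis figure:

Specifying different goal points changes the predictions accordingly, which matches the GCP task switch introduced earlier.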

4. Conclusion

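Ramblings

As CJ put it: Waymo will certainly not open-source this. Still, the appendix of each of their papers is so detailed that a beginner like me thinks, wow, hmm, maybe I could reproduce it. For this paper, though, perhaps because I did not read the appendix closely, several points still leave me in doubt, haha. So the main takeaway is how they build the architecture. The three Waymo papers basically all use networks of their own design rather than ResNet or RegNet with pretrained weights. If you want more detail, read the appendix of the paper, which describes the network parameters and so on fairly thoroughly.

This paper is not as stunning as MP3, but it seems to establish that prediction should be done on vector representations, similar to what CJ mentioned in his MultiPath++ notes: VectorNet is on track to unify the field. PointNet and its relatives were proposed back in 2017, and the line from PointNet → VectorNet → the current wave of work is basically variations on attention.

The OpenReview thread is worth a look; open reviewing really is more fun. One reviewer questioned how the GCP results were presented and roughly suggested the authors run it in CARLA, since prediction conditioned on a goal point is already very close to planning and basically only needs a controller added. The authors replied along the lines of: thanks for the reminder, we know (inner monologue: but we are not doing it, haha).

I am also including the online leaderboard I mentioned earlier; the ranking is indeed not high, although it is a different story if you sort by submission time.

Leaving a like costs nothing and is much appreciated; positive feedback is what keeps these open notes going, haha.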
