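Another major work from the ShareGPT4V team: a million high-quality video-caption pairs to help the community improve video understanding and generation in large multimodal models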

Published by 机器之心 (Jiqizhixin) on 2024-06-20

Original URL: https://www.jiqizhixin.com/articles/2024-06-20-4

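AIxiv is the 机器之心 column for academic and technical content. Over the past several years, the column has received and covered more than 2,000 pieces from top laboratories at major universities and companies worldwide, helping to promote academic exchange and dissemination. If you have outstanding work to share, you are welcome to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com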

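Since Sora was announced, applications of large multimodal models to video generation have surged. Video generation models such as LUMA and Gen-3 Alpha show excellent artistic style and fine rendering of scene detail, and the frontier of text-to-video and image-to-video generation keeps being pushed outward, to the delight and anticipation of the community.

Recently, researchers from the University of Science and Technology of China, Peking University, Shanghai AI Lab, and other institutions released the ShareGPT4Video series, which aims to improve both video understanding and video generation.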

Paper: https://arxiv.org/abs/2406.04325v1
Project page: https://sharegpt4video.github.io/
Dataset: https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video
Code: https://github.com/ShareGPT4Omni/ShareGPT4Video
Demo: https://huggingface.co/spaces/Lin-Chen/ShareCaptioner-Video

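Over the past half year, following the release of ShareGPT4V's high-quality image-caption dataset, the image-language multimodal field has gradually recognized how important detailed, accurate image-caption data is for aligning the image and language modalities. Since its release, the ShareGPT4V dataset has received the second-highest number of likes ever on the VQA dataset track of the HuggingFace platform.

Building on the high-quality ShareGPT4V dataset, the image understanding and image generation communities have also made notable advances, for example InternVL-Chat-V1.5 and PixArt-Σ.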

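Encouraged by the success of the ShareGPT4V dataset in the image-text multimodal field, the original authors have now turned their attention to video. In the video multimodal field, closed-source commercial models have long held a commanding lead. On one hand, the two back-to-back launch events recently held by OpenAI and Google pushed AI video reasoning to a new level. On the other hand, OpenAI's Sora did the same for text-to-video generation.

The researchers argue that the large lead of closed-source models in video understanding and generation likewise depends on detailed, high-quality video-caption data. The team therefore set out once again to obtain large numbers of detailed and accurate captions for videos, in order to improve the video understanding of large video-language models and the generation ability of text-to-video models.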

The paper ranked first in HuggingFace's Daily Papers for June 7, and the code gained 500+ stars soon after release, drawing attention both in China and abroad.

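The researchers identify three challenges in generating high-quality video descriptions with existing closed-source models:

Clearly understanding temporal changes between frames.
Describing the content within each frame in detail and accurately.
Scaling to videos of arbitrary length.

To address these, the researchers designed a Differential Sliding-Window Captioning (DiffSW) strategy, which can stably and efficiently generate high-quality descriptions for videos of arbitrary resolution, aspect ratio, and length.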

^{Figure 1: Differential sliding-window video captioning}

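Specifically, each input to GPT4V consists of the current keyframe, the previous keyframe, and the differential description associated with the previous keyframe. The aim is for GPT4V to observe the temporal and spatial changes between the two frames and summarize the important spatial and temporal changes of the current frame relative to the previous one, that is, the differential description for the current frame. Finally, all differential descriptions, together with their timestamps, are fed to GPT4, which summarizes them into the final high-quality caption for the whole video.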
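The loop described above can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the `Keyframe` type and the `describe_diff` and `summarize` callables are hypothetical stand-ins for keyframe extraction and for the GPT4V and GPT4 calls, whose actual prompts and interfaces are not given in this article.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence, Tuple


@dataclass
class Keyframe:
    """A keyframe with its timestamp in seconds (hypothetical container)."""
    timestamp: float
    image: Any  # placeholder for the actual image data


# Hypothetical interfaces standing in for the multimodal model calls.
# describe_diff(current, previous, previous_diff) -> differential caption
DiffCaptioner = Callable[[Keyframe, Optional[Keyframe], Optional[str]], str]
# summarize([(timestamp, differential caption), ...]) -> final video caption
Summarizer = Callable[[Sequence[Tuple[float, str]]], str]


def diff_sliding_window_caption(
    keyframes: Sequence[Keyframe],
    describe_diff: DiffCaptioner,
    summarize: Summarizer,
) -> str:
    """Caption a video by sliding a two-frame window over its keyframes."""
    diffs: List[Tuple[float, str]] = []
    prev_frame: Optional[Keyframe] = None
    prev_diff: Optional[str] = None
    for frame in keyframes:
        # Each step sees the current frame, the previous frame, and the
        # previous differential caption, so the per-step input stays the
        # same size no matter how long the video is.
        diff = describe_diff(frame, prev_frame, prev_diff)
        diffs.append((frame.timestamp, diff))
        prev_frame, prev_diff = frame, diff
    # All differential captions plus timestamps are summarized in one pass.
    return summarize(diffs)
```

Because each step only looks at two frames and one previous description, the cost grows linearly with the number of keyframes, which is one way to read the claim that the strategy scales to videos of any length.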

The research team presents several examples:

Caption 1:

The video segment documented a significant event in Kochi, Kerala, where 2 buildings razed in Kochi. The broadcast began with a split-screen presentation: on one side, thick clouds of dust were seen billowing into the sky, marking the onset of the demolition process, while on the other side, reporter Gopikrishnan provided live coverage, indicated by "BREAKING NEWS" captions and a consistent timestamp of "11:10 AM." The news ticker at the bottom of the screen simultaneously ran other global events, maintaining a flow of information. As the video progresses, the split-screen footage of the razed house turns into a close-up. A notable change in the headline to "KOCHI FLATS RAZED" signaled the demolition's culmination. A brief interlude offered a visual contradiction by showcasing the flats presumably before their demolition, providing a stark before and after comparison. As the video progressed, the left building's collapse initiated a dramatic alteration in the skyline, marked by significant dust plumes. Subsequently, another building was shown partially collapsing amid debris, fully obscured by dust in seconds, with surrounding greenery remaining untouched. This transitioned into a graphic interlude featuring the "India Today" logo, briefly pausing the live footage. Resuming to the aftermath, split imagery displayed the rubble and ongoing smoke. Then, the imagery continued to juxtapose the scenes of destruction against intact high-rise buildings nearby. The narrative was augmented by the revelation that the Supreme Court directed the demolition within a broader national news context. Throughout, the report maintained a real-time approach, threading continuity and urgency across the unfolding event's documentation.

Caption 2:

The video begins with an individual seated on a gray couch in a cozy domestic setting, about to unbox a product from a red CCM-branded box placed on a white table in front of them. Initially, the person is seen drinking from a blue can, indicating a casual atmosphere. Soon after, the individual shifts attention from the can to the red box, signifying the start of the unboxing process. The red box, initially closed, gradually becomes the focal point as the person prepares to open it, conveying a build-up of anticipation. As the video progresses, the box is flipped over and then opened, revealing its content still hidden under white tissue paper adorned with prints, adding to the suspense. The individual’s engagement with the box evolves, from initially preparing to open it, to actively delving into its contents. A momentary pause in activity is captured before the anticipation culminates with the individual lifting an object from the box. This object, identifiable by a yellow label, is then examined closely by the person, indicating a thorough inspection or perusal of the product or its packaging. Throughout the video, the surrounding environment remains consistent and undisturbed, with household items like a potted plant and a wall clock maintaining the setting's homely ambiance. The camera’s perspective remains fixed, focusing on the unfolding unboxing event without any movement, thus allowing the viewer to observe the narrative closely. Another partially open brown box is visible beside the main red box, though its role or contents are not elaborated upon. The video encapsulates the anticipation, action, and reveal inherent to unboxing experiences in a home setting.

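With this approach, the researchers built ShareGPT4Video, a large video-text description dataset that includes 40K videos (291 hours in total) annotated by GPT-4V. The data covers a wide range of categories, and the generated descriptions contain rich world knowledge, object attributes, camera movements, and detailed, precise temporal descriptions of events.

^{Figure 2: (a) The dataset covers a wide range of content, including wildlife, cooking, sports, scenery, first-person human activities, autonomous driving scenes, and more. (c) Caption lengths mostly fall between 200 and 400 words, providing rich temporal information that serves video understanding and generation tasks well.}

Building on the ShareGPT4Video dataset, and in order to scale the dataset further and make it easier for the open-source community to annotate its own data, the researchers designed and developed ShareCaptioner-Video, a versatile large multimodal model that can efficiently generate high-quality descriptions for arbitrary videos.

^{Figure 3: ShareCaptioner-Video is a four-in-one video captioning model with the following capabilities: sliding-window video captioning, fast video captioning, video clip summarization, and generating detailed descriptions from prompts}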

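Specifically, the sliding-window captioning capability can take over every role GPT4V played in collecting the annotated data, producing differential descriptions in a sliding-window manner and summarizing them into the final caption. The fast captioning capability stitches all keyframes vertically into one long image and produces the final caption in a single pass, greatly speeding up annotation at a slight cost in quality. The clip summarization capability, after one sliding-window captioning pass over the full video, can directly summarize a caption for any clip within it without repeating the sliding-window process.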
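Two of these modes come down to simple data preparation before the model is called, which the toy sketch below illustrates. It is an assumption-laden illustration rather than the model's actual preprocessing: frames are represented as plain nested lists of pixel values, and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = List[int]    # one row of pixel values (toy stand-in for image data)
Frame = List[Row]  # a frame as a list of rows


def stack_keyframes_vertically(frames: Sequence[Frame]) -> Frame:
    """Fast captioning input: stitch keyframes top to bottom into one long image."""
    if not frames or not frames[0]:
        raise ValueError("need at least one non-empty frame")
    width = len(frames[0][0])
    long_image: Frame = []
    for index, frame in enumerate(frames):
        for row in frame:
            if len(row) != width:
                raise ValueError(
                    f"frame {index} has a row of width {len(row)}, expected {width}"
                )
            long_image.append(list(row))
    return long_image


def diffs_for_clip(
    diffs: Sequence[Tuple[float, str]], start: float, end: float
) -> List[Tuple[float, str]]:
    """Clip summarization input: reuse the differential captions whose
    timestamps fall inside [start, end], so no new sliding-window pass is needed."""
    return [(t, text) for t, text in diffs if start <= t <= end]
```

The long image would then be captioned in a single model call, and the selected differential captions would be passed to the summarization step for the chosen clip.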

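With this strong video captioning model in hand, the researchers used it to annotate a further 4.8 million videos totaling 3,000 hours. These videos have high aesthetic scores and few scene transitions, and are intended to serve video generation tasks.

^{Table 1: Composition of the 4.8 million videos annotated by ShareCaptioner-Video}

Experiments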

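For video understanding, the researchers first verified the effectiveness of the ShareGPT4Video dataset on several current LVLM architectures through a simple equal-size replacement experiment. Of the 100K video training samples in the VideoChatGPT dataset, they replaced the 28K samples related to detailed captions with an equal-size subset of ShareGPT4Video. As the table below shows, this simple data swap, which improves nothing but caption quality, consistently brings significant performance gains to video-understanding multimodal models of different architectures and scales.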
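The replacement protocol can be expressed as a short data-mixing routine. The sketch below assumes samples are dictionaries and that a caller-supplied predicate marks the detailed-caption samples; the function name, the sample layout, and the random sampling of the replacement subset are illustrative assumptions, since the article does not say how the subset was chosen.

```python
import random
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, object]


def replace_detailed_captions(
    train_data: Sequence[Sample],
    replacement_pool: Sequence[Sample],
    is_detailed_caption: Callable[[Sample], bool],
    seed: int = 0,
) -> List[Sample]:
    """Swap the detailed-caption samples for an equal number drawn from another pool.

    The total training-set size stays unchanged, so any performance
    difference can be attributed to caption quality rather than data volume.
    """
    kept = [s for s in train_data if not is_detailed_caption(s)]
    n_replaced = len(train_data) - len(kept)
    if n_replaced > len(replacement_pool):
        raise ValueError("replacement pool is smaller than the subset to replace")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    return kept + rng.sample(list(replacement_pool), n_replaced)
```

In the setting described above, `train_data` would hold the 100K VideoChatGPT samples, the predicate would select the 28K detailed-caption samples, and `replacement_pool` would be ShareGPT4Video.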

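^{Table 2: The ShareGPT4Video dataset yields performance gains across model architectures}

Next, the researchers collected 153K video VQA samples themselves and combined them with the 28K high-quality video-understanding captions from ShareGPT4Video to propose a new LVLM, ShareGPT4Video-8B. With a training cost of only 8 GPUs and 5 hours, it achieves strong results on multiple benchmarks.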

^{Table 3: Performance comparison on TempCompass}

^{Table 4: Performance comparison on VideoBench}

^{Table 5: Performance comparison on MVBench}

Even on several recently introduced video understanding benchmarks, ShareGPT4Video-8B consistently shows competitive performance at the 7B parameter scale.

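^{Table 6: Performance comparison on LongVideoBench}

^{Table 7: Performance comparison on the Video-MME benchmark}

^{Table 8: Performance comparison on the MMBench-Video benchmark}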

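For video generation, the researchers used the Open-Sora-Plan project to verify, simply and directly, how much detailed caption data helps text-to-video models. In the figure below, the first row shows results from a text-to-video model trained on short captions, and the second row shows results from a model trained on the high-quality captions annotated by ShareCaptioner-Video. Using detailed caption data gives the text-to-video model strong control over camera movement and semantic content.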

^{Reference links:}

^{[1] Open-Sora-Plan: https://github.com/PKU-YuanGroup/Open-Sora-Plan}

^{[2] LongVideoBench: https://longvideobench.github.io/}

^{[3] Video-MME: https://video-mme.github.io/}

^{[4] MMBench-Video: https://github.com/open-compass/MMBench-Video}
