Multimodal Large Models

Posted by lightsong on 2024-09-01

A Survey on Multimodal Large Language Models

https://arxiv.org/pdf/2306.13549

A multimodal large language model (MLLM) is built on top of an LLM and additionally has the ability to receive, reason over, and output multimodal information.

In light of this complementarity, LLM and LVM run towards each other, leading to the new field of Multimodal Large Language Model (MLLM). Formally, it refers to the LLM-based model with the ability to receive, reason, and output with multimodal information.

Image-based reasoning capabilities of the three leading models

https://hiresynth.ai/blog/googleio_puzzle_multimodal_eval.html#introduction-the-models

OpenAI GPT-4V

The multimodal LLM craze started with the release of GPT-4V in September 2023 and the enticing caption:
"ChatGPT can now see, hear, and speak"

Google Gemini Ultra

Next, Google Gemini Ultra was released in December 2023, along with the following press release:
"[Gemini] was built from the ground up to be multimodal, which means it can generalize and seamlessly understand, operate across and combine different types of information including text, code, audio, image and video."

Anthropic Claude 3 Opus

Finally, Anthropic Claude 3 Opus was released in March 2024, with the following caption: "The Claude 3 models have sophisticated vision capabilities on par with other leading models. They can process a wide range of visual formats, including photos, charts, graphs and technical diagrams."
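
For completeness, a comparable sketch against the Anthropic Messages API, which takes images as base64-encoded content blocks. "chart.png" is a placeholder file, ANTHROPIC_API_KEY is assumed to be set in the environment, and "claude-3-opus-20240229" was the Opus model id at release.

```python
import base64
import anthropic

# Assumes ANTHROPIC_API_KEY is set in the environment.
client = anthropic.Anthropic()

# Read a local image and base64-encode it, as the Messages API expects.
# "chart.png" is a placeholder filename.
with open("chart.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-opus-20240229",  # Opus model id at release
    max_tokens=512,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_b64,
                    },
                },
                {"type": "text", "text": "Summarize the trend shown in this chart."},
            ],
        }
    ],
)
print(message.content[0].text)
```

Despite surface differences, all three APIs share the same basic shape: interleave image and text parts in a single user turn and get text back, which is exactly the "receive, reason, and output" pattern in the survey's definition of an MLLM.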

Along with the release of Claude 3, we were provided a handy chart comparing the multimodal capabilities of the three models:

[Figure: multimodal evaluation chart comparing the three models]
