A Survey on Multimodal Large Language Models
https://arxiv.org/pdf/2306.13549
Multimodal Large Language Models (MLLMs) build on LLMs and add the ability to receive, reason about, and output multimodal information.
In light of this complementarity, LLM and LVM run towards each other, leading to the new field of Multimodal Large Language Model (MLLM). Formally, it refers to the LLM-based model with the ability to receive, reason, and output with multimodal information.
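To make the "receive, reason, and output with multimodal information" definition concrete, here is a minimal sketch of sending a mixed image-and-text prompt to a multimodal LLM through the OpenAI Python client. The model name and image URL are placeholders, not values from the sources above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One user turn containing both text and an image: the model "receives"
# multimodal input, "reasons" over it, and "outputs" a text answer.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any multimodal-capable model works
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},  # placeholder URL
            ],
        }
    ],
)

print(response.choices[0].message.content)
```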
Image-based reasoning capabilities of the three major models
https://hiresynth.ai/blog/googleio_puzzle_multimodal_eval.html#introduction-the-models
OpenAI GPT-4V
The multimodal LLM craze started with the release of GPT-4V in September and the enticing caption:
"ChatGPT can now see, hear, and speak"Google Gemini Ultra
Next Google Gemini Ultra was released in December, along with the following press release:
"[Gemini] was built from the ground up to be multimodal, which means it can generalize and seamlessly understand, operate across and combine different types of information including text, code, audio, image and video."Anthropic Claude3 Opus
Finally, Anthropic Claude3 Opus has just been released in February, with the following caption: "The Claude 3 models have sophisticated vision capabilities on par with other leading models. They can process a wide range of visual formats, including photos, charts, graphs and technical diagrams."
Along with the release of Claude3, we were provided a handy chart comparing the multimodal capabilities of the three models: