Open-Source Vision Large Models

Posted by lightsong on 2024-09-13

MiniCPM-V 2.6

https://github.com/OpenBMB/MiniCPM-V

MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take image, video and text as inputs and provide high-quality text outputs. Since February 2024, we have released 5 versions of the model, aiming to achieve strong performance and efficient deployment. The most notable models in this series currently include:

  • MiniCPM-V 2.6: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model surpasses GPT-4V in single image, multi-image and video understanding. It outperforms GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet in single image understanding, and advances MiniCPM-Llama3-V 2.5's features such as strong OCR capability, trustworthy behavior, multilingual support, and end-side deployment. Due to its superior token density, MiniCPM-V 2.6 can for the first time support real-time video understanding on end-side devices such as iPad.

MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:

  • 🔥 Leading Performance. MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet for single image understanding.

  • 🖼️ Multi Image Understanding and In-context Learning. MiniCPM-V 2.6 can also perform conversation and reasoning over multiple images. It achieves state-of-the-art performance on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.

  • 🎬 Video Understanding. MiniCPM-V 2.6 can also accept video inputs, performing conversation and providing dense captions for spatial-temporal information. It outperforms GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B on Video-MME with/without subtitles.

  • 💪 Strong OCR Capability and Others. MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Based on the latest RLAIF-V and VisCPM techniques, it features trustworthy behaviors, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports multilingual capabilities in English, Chinese, German, French, Italian, Korean, etc.

  • 🚀 Superior Efficiency. In addition to its friendly size, MiniCPM-V 2.6 also shows state-of-the-art token density (i.e., number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support real-time video understanding on end-side devices such as iPad.

  • 💫 Easy Usage. MiniCPM-V 2.6 can be easily used in various ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with Gradio, and (6) online web demo. (Two hedged usage sketches follow this list.)
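For context, here is a minimal single-image and multi-image chat sketch using the Hugging Face transformers interface described in the MiniCPM-V repository. The model id openbmb/MiniCPM-V-2_6 and the model.chat(...) call follow the README's example; the image file names are placeholders, and argument names may change between releases, so treat this as a sketch rather than the canonical usage.

```python
# Sketch of MiniCPM-V 2.6 inference via Hugging Face transformers, following
# the pattern shown in the repository's README (verify against the current
# version; argument names may change between releases).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,      # the repository ships custom modeling code
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Single-image question answering.
image = Image.open("example.jpg").convert("RGB")   # placeholder path
msgs = [{"role": "user", "content": [image, "What is in this image?"]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))

# Multi-image reasoning: pass several images in one user turn.
image2 = Image.open("example2.jpg").convert("RGB")  # placeholder path
msgs = [{"role": "user", "content": [image, image2, "Compare these two images."]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```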
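Video understanding uses the same chat interface, with the video converted into a list of sampled frames first. The decord-based frame sampling below mirrors the repository's video example and reuses the model and tokenizer loaded in the previous sketch; MAX_NUM_FRAMES, use_image_id, and max_slice_nums are taken from that example and may need adjustment.

```python
# Sketch of video chat with MiniCPM-V 2.6: uniformly sample frames with decord
# and pass them as a list of images, as in the repository's video example.
# Assumes `model` and `tokenizer` from the previous sketch are already loaded.
from decord import VideoReader, cpu
from PIL import Image

MAX_NUM_FRAMES = 64  # value used in the README example; tune for your memory budget

def encode_video(video_path):
    vr = VideoReader(video_path, ctx=cpu(0))
    # Sample roughly one frame per second, capped at MAX_NUM_FRAMES.
    sample_step = max(1, round(vr.get_avg_fps()))
    frame_idx = list(range(0, len(vr), sample_step))
    if len(frame_idx) > MAX_NUM_FRAMES:
        step = len(frame_idx) / MAX_NUM_FRAMES
        frame_idx = [frame_idx[int(i * step)] for i in range(MAX_NUM_FRAMES)]
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f.astype("uint8")) for f in frames]

frames = encode_video("example.mp4")  # placeholder path
msgs = [{"role": "user", "content": frames + ["Describe what happens in the video."]}]
# use_image_id / max_slice_nums follow the README's recommendation for video input.
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer,
                    use_image_id=False, max_slice_nums=2)
print(answer)
```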

https://www.toutiao.com/article/7399980838420480564/?app=news_article&timestamp=1725840551&use_new_style=1&req_id=202409090809113385223578718DD18BF5&group_id=7399980838420480564&share_token=7F883988-0F83-4351-83EF-A68D1B96AC8C&tt_from=weixin&utm_source=weixin&utm_medium=toutiao_ios&utm_campaign=client_share&wxshare_count=1&source=m_redirect&wid=1725844796084

On the multi-image evaluation benchmark Mantis-Eval, MiniCPM-V 2.6's joint multi-image understanding reaches open-source SOTA and surpasses GPT-4V. On the video evaluation benchmark Video-MME, its video understanding reaches end-side SOTA and surpasses GPT-4V.

[Figure] OpenCompass | Mantis-Eval | Video-MME leaderboard results

  • The first end-side model to comprehensively surpass GPT-4V across the core multimodal capabilities of single-image, multi-image, and video understanding, with single-image understanding punching above its class to rival Gemini 1.5 Pro and GPT-4o mini.
  • Introduces real-time video understanding, joint multi-image reasoning, in-context (ICL) visual learning, OCR, and more, opening up real-world observation and learning for end-side models.
  • MiniCPM-V 2.6 (nicknamed "Little Steel Cannon" 2.6) achieves twice GPT-4o's pixel density per encoded token. Its visual token count is 30% lower than the previous generation and 75% lower than comparable models (a short token-density calculation follows this list).
  • After quantization, on-device memory usage is only 6 GB; on-device inference reaches up to 18 tokens/s, 33% faster than the previous generation. llama.cpp, ollama, and vLLM inference are supported at release, along with multiple languages.
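As a rough illustration of the token density metric (pixels encoded per visual token), the arithmetic below uses the 1.8-megapixel and 640-token figures quoted earlier in this post; the comparison with other models is only the implication of the "75% fewer tokens" claim, not measured data.

```python
# Token density = pixels encoded per visual token (figures from this post).
pixels = 1344 * 1344            # ~1.8M pixels, the maximum input size quoted above
visual_tokens = 640             # tokens MiniCPM-V 2.6 produces for such an image
density = pixels / visual_tokens
print(f"{density:.0f} pixels per visual token")          # ~2822

# "75% fewer tokens than most models" implies comparable models would need
# roughly 4x as many visual tokens for the same image.
typical_tokens = visual_tokens / (1 - 0.75)
print(f"typical models: ~{typical_tokens:.0f} tokens")   # ~2560
```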

CogVLM2 & CogVLM2-Video

https://github.com/THUDM/CogVLM2?tab=readme-ov-file

We launch a new generation of the CogVLM2 series and open-source two models based on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 series brings the following improvements:

  1. Significant improvements on many benchmarks such as TextVQA and DocVQA.
  2. Support for 8K content length.
  3. Support for image resolutions up to 1344 * 1344.
  4. An open-source model version that supports both Chinese and English.

You can see the details of the CogVLM2 family of open-source models in the table below (a hedged inference sketch follows the table):

| Model Name | cogvlm2-llama3-chat-19B | cogvlm2-llama3-chinese-chat-19B | cogvlm2-video-llama3-chat | cogvlm2-video-llama3-base |
| --- | --- | --- | --- | --- |
| Base Model | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct |
| Language | English | Chinese, English | English | English |
| Task | Image Understanding, Multi-turn Dialogue Model | Image Understanding, Multi-turn Dialogue Model | Video Understanding, Single-turn Dialogue Model | Video Understanding, Base Model, No Dialogue |
| Model Link | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
| Experience Link | 📙 Official Page | 📙 Official Page 🤖 ModelScope | 📙 Official Page 🤖 ModelScope | / |
| Int4 Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | / | / |
| Text Length | 8K | 8K | 2K | 2K |
| Image Resolution | 1344 * 1344 | 1344 * 1344 | 224 * 224 (video, take the first 24 frames) | 224 * 224 (video, take the average 24 frames) |
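For reference, here is a hedged single-image inference sketch for cogvlm2-llama3-chat-19B, adapted from the pattern in the repository's basic demo. The build_conversation_input_ids helper and its arguments follow that demo, and the image path is a placeholder; check the repository for the exact, current signature.

```python
# Sketch of single-image chat with CogVLM2 (cogvlm2-llama3-chat-19B), adapted
# from the repository's basic demo; verify helper names against the repo.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE, DTYPE = "cuda", torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=DTYPE, trust_remote_code=True
).to(DEVICE).eval()

image = Image.open("demo.jpg").convert("RGB")  # placeholder path
query = "Describe this image."

# The repo's custom helper builds input_ids, token_type_ids, attention_mask,
# and preprocessed image tensors for one conversation turn.
turn = model.build_conversation_input_ids(
    tokenizer, query=query, history=[], images=[image], template_version="chat"
)
inputs = {
    "input_ids": turn["input_ids"].unsqueeze(0).to(DEVICE),
    "token_type_ids": turn["token_type_ids"].unsqueeze(0).to(DEVICE),
    "attention_mask": turn["attention_mask"].unsqueeze(0).to(DEVICE),
    "images": [[turn["images"][0].to(DEVICE).to(DTYPE)]],
}

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]  # keep only the reply
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```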
