MiniCPM-V 2.6
https://github.com/OpenBMB/MiniCPM-V
MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models accept images, videos, and text as input and produce high-quality text output. Since February 2024, we have released 5 versions of the model, aiming for strong performance and efficient deployment. The most notable models in this series currently include:
MiniCPM-V 2.6: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model surpasses GPT-4V in single-image, multi-image and video understanding. It outperforms GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet in single-image understanding, and advances MiniCPM-Llama3-V 2.5's features such as strong OCR capability, trustworthy behavior, multilingual support, and end-side deployment. Thanks to its superior token density, MiniCPM-V 2.6 can, for the first time, support real-time video understanding on end-side devices such as the iPad.
MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:
🔥 Leading Performance. MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet for single image understanding.
🖼️ Multi Image Understanding and In-context Learning. MiniCPM-V 2.6 can also perform conversation and reasoning over multiple images. It achieves state-of-the-art performance on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.
🎬 Video Understanding. MiniCPM-V 2.6 can also accept video inputs, performing conversation and providing dense captions for spatial-temporal information. It outperforms GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B on Video-MME with/without subtitles.
💪 Strong OCR Capability and Others. MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Based on the latest RLAIF-V and VisCPM techniques, it features trustworthy behaviors, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports multilingual capabilities in English, Chinese, German, French, Italian, Korean, etc.
🚀 Superior Efficiency. In addition to its friendly size, MiniCPM-V 2.6 also shows state-of-the-art token density (i.e., number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support real-time video understanding on end-side devices such as iPad.
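The token-density claim above can be checked with simple arithmetic. A minimal sketch, assuming the 1344x1344 (~1.8M-pixel) maximum image size and the 640 visual tokens stated in this section; the 4x comparison is just the "75% fewer" figure restated:

```python
# Rough token-density check for a 1344x1344 image encoded into 640 visual tokens.
pixels = 1344 * 1344                 # 1,806,336 pixels (~1.8M)
minicpm_tokens = 640
density = pixels / minicpm_tokens    # pixels encoded per visual token
print(f"{density:.0f} pixels per visual token")  # 2822

# "75% fewer tokens than most models" implies a comparable model would
# spend roughly 4x as many tokens on the same image:
other_tokens = minicpm_tokens * 4    # 2560 tokens
```

Fewer visual tokens per image shrink the prefill work and the KV cache, which is why the section ties token density directly to first-token latency and memory use.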
💫 Easy Usage. MiniCPM-V 2.6 can be easily used in various ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with Gradio, and (6) online web demo.
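For path (4)-style local use via Hugging Face Transformers, the repository's published example follows the pattern below. A sketch, assuming the `openbmb/MiniCPM-V-2_6` model ID and the `model.chat(...)` interface shown in the repo's examples; verify both against the current README before use:

```python
# Single-image chat with MiniCPM-V 2.6 via Hugging Face transformers.
# Multi-image input uses the same message structure with several images
# in the `content` list.

MODEL_ID = "openbmb/MiniCPM-V-2_6"  # model ID as published in the repo

def build_msgs(image, question):
    # MiniCPM-V expects a list of role/content messages; `content` mixes
    # PIL images and text strings in order.
    return [{"role": "user", "content": [image, question]}]

def run_chat(image_path, question):
    # Requires a CUDA GPU and downloads ~16 GB of bf16 weights.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        MODEL_ID, trust_remote_code=True,
        attn_implementation="sdpa", torch_dtype=torch.bfloat16,
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    msgs = build_msgs(image, question)
    return model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
```

The llama.cpp/ollama and vLLM paths listed above avoid the GPU requirement entirely and are the intended routes for end-side deployment.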
https://www.toutiao.com/article/7399980838420480564/
On the multi-image benchmark Mantis-Eval, MiniCPM-V 2.6's joint multi-image understanding achieves open-source SOTA and surpasses GPT-4V. On the video benchmark Video-MME, its video understanding reaches end-side SOTA, also surpassing GPT-4V.
OpenCompass | Mantis-Eval | Video-MME leaderboard results
- For the first time on an end-side model, the core multimodal capabilities (single-image, multi-image, and video understanding) comprehensively surpass GPT-4V, with single-image understanding punching above its weight class to rival Gemini 1.5 Pro and GPT-4o mini.
- Introduces real-time video understanding, joint multi-image reasoning, in-context visual learning (ICL), OCR, and more, enabling end-side models to observe and learn from the real world.
- MiniCPM-V 2.6 achieves twice GPT-4o's pixel density per encoded token. Its visual token count drops 30% from the previous generation and is 75% lower than comparable models.
- After quantization, end-side memory usage is only 6 GB; end-side inference reaches 18 tokens/s, 33% faster than the previous generation. llama.cpp, ollama, and vLLM inference are supported from day one, as are multiple languages.
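The 6 GB figure is consistent with a back-of-envelope estimate. A sketch, assuming int4 weights at roughly 0.5 bytes per parameter; the remainder is KV cache, activations, and the vision encoder, so the numbers are illustrative rather than measured:

```python
# Back-of-envelope check of the ~6 GB post-quantization memory figure
# for an 8B-parameter model quantized to int4.
params = 8e9                      # 8B parameters
bytes_per_param_int4 = 0.5        # 4 bits per weight
weight_gb = params * bytes_per_param_int4 / 1e9  # 4.0 GB of weights
# ~4 GB of weights + KV cache + activations + vision tower ≈ 6 GB total.
```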
CogVLM2 & CogVLM2-Video
https://github.com/THUDM/CogVLM2?tab=readme-ov-file
We launch a new generation of the CogVLM2 series and open-source two models based on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 series offers the following improvements:
- Significant improvements in many benchmarks such as TextVQA and DocVQA.
- Support for 8K content length.
- Support for image resolution up to 1344 * 1344.
- An open-source model version that supports both Chinese and English.
You can see the details of the CogVLM2 family of open source models in the table below:
| | cogvlm2-llama3-chat-19B | cogvlm2-llama3-chinese-chat-19B | cogvlm2-video-llama3-chat | cogvlm2-video-llama3-base |
|---|---|---|---|---|
| Base Model | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct |
| Language | English | Chinese, English | English | English |
| Task | Image Understanding, Multi-turn Dialogue Model | Image Understanding, Multi-turn Dialogue Model | Video Understanding, Single-turn Dialogue Model | Video Understanding, Base Model, No Dialogue |
| Model Link | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
| Experience Link | 📙 Official Page | 📙 Official Page 🤖 ModelScope | 📙 Official Page 🤖 ModelScope | / |
| Int4 Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | / | / |
| Text Length | 8K | 8K | 2K | 2K |
| Image Resolution | 1344 * 1344 | 1344 * 1344 | 224 * 224 (video, first 24 frames) | 224 * 224 (video, 24 evenly sampled frames) |
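The two video variants differ in how they pick their 24 input frames: the chat model takes the first 24 frames of the clip, while the base model samples 24 frames evenly across it. A minimal sketch of the index logic; this is an illustrative assumption, not CogVLM2's actual preprocessing code:

```python
# Two frame-selection strategies for a 24-frame video budget.

def first_k_frames(total_frames, k=24):
    # Chat variant: take the first k frames of the clip.
    return list(range(min(k, total_frames)))

def uniform_k_frames(total_frames, k=24):
    # Base variant: sample k frame indices evenly across the whole clip.
    if total_frames <= k:
        return list(range(total_frames))
    step = total_frames / k
    return [int(i * step) for i in range(k)]

# For a 240-frame clip:
#   first_k_frames(240)   -> frames 0..23 (only the opening second or so)
#   uniform_k_frames(240) -> every 10th frame: 0, 10, 20, ..., 230
```

Uniform sampling covers the full temporal extent of the video at the cost of temporal resolution, which suits a base model trained for whole-clip understanding.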