MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

Chaoya Jiang, Hongrui Jia, Haiyang Xu, Wei Ye, Mengfan Dong +4 more
2/3/2026

Abstract

This paper presents MaVEn, an innovative Multi-granularity Visual Encoding framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning. Current MLLMs primarily focus on single-image visual understanding, limiting their ability to interpret and integrate information across multiple images. MaVEn addresses this limitation by combining discrete visual symbol sequences, which abstract coarse-grained semantic concepts, with traditional continuous representation sequences that model fine-grained features. This dual approach bridges the semantic gap between visual and textual data, thereby improving the model's ability to process and interpret information from multiple images effectively. Additionally, we design a dynamic reduction mechanism for long-sequence continuous features to enhance multi-image processing efficiency. Experimental results demonstrate that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
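The abstract describes two ideas: quantizing patch features into discrete symbols for coarse semantics, and pruning the long continuous feature sequence for efficiency. The sketch below is only a rough illustration of those two operations, not the paper's actual method; the codebook lookup, the cosine-similarity top-k pruning rule, and all dimensions are assumptions.

```python
import numpy as np

def quantize_to_symbols(patch_feats, codebook):
    """Map each patch feature to its nearest codebook entry (a discrete visual symbol).
    patch_feats: (N, D), codebook: (K, D). Returns (N,) symbol ids."""
    # squared L2 distance from every patch to every codebook entry
    d = ((patch_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def reduce_continuous(patch_feats, query, keep):
    """Keep only the `keep` patches most similar (cosine) to a query vector,
    a stand-in for dynamically shortening the fine-grained sequence."""
    pf = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = pf @ q
    idx = np.argsort(-scores)[:keep]
    return patch_feats[np.sort(idx)]  # preserve original patch order

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))   # 16 patch features, dim 8 (toy sizes)
codebook = rng.normal(size=(32, 8))  # 32 discrete visual symbols

symbols = quantize_to_symbols(patches, codebook)                # coarse-grained sequence
kept = reduce_continuous(patches, rng.normal(size=8), keep=4)   # reduced fine-grained sequence
print(symbols.shape, kept.shape)  # (16,) (4, 8)
```

In this toy setup a 16-patch image yields 16 discrete symbols plus only 4 continuous vectors, mirroring how a hybrid encoding can keep coarse semantics cheap while retaining a small amount of fine-grained detail.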


Code Implementations (10)

[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization

582 stars · 45 forks · Apr 21, 2024 · updated 1 year ago · Apache-2.0
foundation-models · grounding · large-language-models · llama · llama2 +4 more

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

1,426 stars · 85 forks · Shell, Python · Jun 13, 2024 · updated 6 months ago · Apache-2.0
chatbot · llama3 · multimodal · multimodal-large-language-models · multimodality +3 more

VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model

343 stars · 17 forks · Jan 21, 2025 · updated 1 year ago · Apache-2.0
mllm · unified-model

Open Source Generative Process Automation (i.e. Generative RPA). AI-First Process Automation with Large ([Language (LLMs) / Action (LAMs) / Multimodal (LMMs)] / Visual Language (VLMs)) Models

1,478 stars · 216 forks · Apr 12, 2023 · updated 3 months ago · MIT
agents · ai-agents · ai-agents-framework · anthropic · computer-use +15 more

Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"

13,195 stars · 881 forks · Jun 18, 2021 · updated 1 year ago · MIT
adaptation · deberta · deep-learning · gpt-2 · gpt-3 +5 more
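The loralib entry above implements LoRA, which freezes a pretrained weight W and learns only a low-rank update B·A, with B initialized to zero so training starts from the frozen model's behavior. A minimal NumPy sketch of that forward pass (not the loralib API; names, scaling, and dimensions here are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x W^T + alpha * (x A^T) B^T: frozen weight plus low-rank update B @ A."""
    return x @ W.T + alpha * (x @ A.T) @ B.T

rng = np.random.default_rng(1)
d_in, d_out, r = 8, 6, 2                 # rank r much smaller than d_in, d_out
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero-initialized

x = rng.normal(size=(3, d_in))
y = lora_forward(x, W, A, B)
# with B zero-initialized, the LoRA layer matches the frozen layer exactly
assert np.allclose(y, x @ W.T)
```

Only A and B (r·(d_in + d_out) parameters) would be trained, which is why LoRA adaptation is so much cheaper than full fine-tuning.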

Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLM). It covers datasets, tuning techniques, in-context learning, visual reasoning, foundational models, and more. Stay updated with the latest advancements.

358 stars · 22 forks · Jun 11, 2023 · updated 1 year ago
chatgpt · dataset · gpt · llm · mllm +3 more

This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025

7,164 stars · 537 forks · May 1, 2025 · updated 11 months ago · NOASSERTION license

An AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, web scraping, and large language models. The goal of this repo is to provide the simplest implementation of a deep research agent - e.g. an agent that can refine its research direction over time and deep dive into a topic.

18,367 stars · 1,895 forks · Feb 4, 2025 · updated 7 months ago · MIT
agent · ai · gpt · o3-mini · research

This repository presents a Visual Speech Recognition (VSR) system, commonly known as lip reading, that transcribes spoken language purely from visual input, without using audio. The work focuses on a hybrid CNN + Multi-Branch Transformer architecture designed for sequence-level lip reading in real-world conditions.

1 star · 0 forks · Dec 23, 2025 · updated 3 weeks ago

An intelligent search app that connects text and visual queries using OpenAI's CLIP. Users can search jewelry with natural language ("gold ring with emerald") or upload images to find similar items. Built with FastAPI and Qdrant for fast vector search, plus a Next.js frontend. Includes OCR and hybrid filtering features.

0 stars · 0 forks · Feb 12, 2026 · updated 2 months ago · MIT
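The search app above ranks items by embedding similarity: text and images are mapped into CLIP's shared vector space and matched by cosine similarity. A toy sketch of that ranking step, assuming embeddings have already been computed (the actual repo uses Qdrant for this; the function name and vectors here are illustrative):

```python
import numpy as np

def search(query_vec, item_vecs, top_k=3):
    """Rank items by cosine similarity to the query embedding.
    Returns (item_index, score) pairs, best match first."""
    q = query_vec / np.linalg.norm(query_vec)
    v = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = v @ q
    order = np.argsort(-scores)[:top_k]
    return list(zip(order.tolist(), scores[order].tolist()))

# toy 2-D "embeddings": item 0 points the same way as the query
items = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
results = search(np.array([1.0, 0.0]), items, top_k=2)
print(results)  # item 0 first (cosine 1.0), then item 2 (~0.707)
```

Because both text and image queries land in the same embedding space, the identical ranking function serves natural-language search and image-to-image search.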
