
📊 VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

📖 ArXiv   │   📊 VKnowU   │   📀 VKnowQA   │   🤗 VideoKnow+   │   OpenCompass

While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level, vision-grounded semantics, which we term $\textbf{\textit{visual knowledge}}$, forms a bridge between perception and reasoning, yet remains underexplored in current MLLMs. To systematically evaluate this capability, we present 📊VKnowU, a comprehensive benchmark featuring 1,680 questions across 1,249 videos, covering 8 core types of visual knowledge spanning both $\textit{world-centric}$ (e.g., intuitive physics) and $\textit{human-centric}$ (e.g., subjective intentions) knowledge. Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps on the world-centric types. To bridge this gap, we introduce a new dataset, 📀VKnowQA, and 🤗VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured $\textit{See–Think–Answer}$ paradigm and adopts reinforcement learning with a visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.

✅ Release the training and evaluation code of VideoKnow+

✅ Release the benchmark: 📊VKnowU

✅ Release the model weights of 🤗VideoKnow+

✅ Release the 30K training datasets: 📀VKnowQA-CS-12K and 📀VKnowQA-30K

Requirements

  • Python >= 3.11
  • PyTorch >= 2.5.1
  • transformers == 4.51.3
  • vLLM == 0.7.3
  • trl == 0.16.0

Installation

git clone https://github.com/OpenGVLab/VKnowU
cd VKnowU

# Create and activate environment
conda create -n VKnowU python=3.11 
conda activate VKnowU
bash setup.sh

🚀 Training

Supervised Fine-Tuning (SFT)

We begin with supervised fine-tuning on the 📀VKnowQA-CS-12K dataset for one epoch:

bash ./src/scripts/run_sft_video.sh
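The SFT targets follow the structured See–Think–Answer paradigm described above. As a purely hypothetical illustration of what one such target could look like, here is a minimal formatter; the tag names and layout are our assumptions, and the actual format is defined by the VKnowQA-CS-12K annotations:

```python
# Hypothetical sketch of a See-Think-Answer style training target.
# Tag names (<see>, <think>, <answer>) are assumptions for illustration,
# not the repository's confirmed format.
def format_target(see: str, think: str, answer: str) -> str:
    """Assemble one structured target string for SFT."""
    return (
        f"<see>{see}</see>\n"
        f"<think>{think}</think>\n"
        f"<answer>{answer}</answer>"
    )
```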

Reinforcement Learning (RL)

Next, perform reinforcement learning on the 📀VKnowQA-30K dataset (with vLLM acceleration for faster training):

  1. Employ an external verifier MLLM to calculate the visual knowledge reward, and set the corresponding API here.

  2. Run the RL scripts:

bash ./src/scripts/run_grpo_vllm_qwen25vl.sh
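The verifier step above turns a judgment from an external MLLM into a scalar reward. The sketch below shows one plausible shape for that plumbing; the prompt template, the yes/no scoring rule, and all names here are our assumptions, not the repository's actual implementation:

```python
# Hypothetical sketch of a visual-knowledge reward backed by an external
# verifier MLLM. The prompt wording and the binary 0/1 scoring rule are
# assumptions for illustration only.
VERIFIER_PROMPT = (
    "You are grading a model's reasoning about a video.\n"
    "Reference visual knowledge: {reference}\n"
    "Model reasoning: {reasoning}\n"
    "Does the reasoning correctly apply the reference knowledge? "
    "Answer yes or no."
)

def build_verifier_prompt(reference: str, reasoning: str) -> str:
    """Fill the grading template for one rollout."""
    return VERIFIER_PROMPT.format(reference=reference, reasoning=reasoning)

def parse_verifier_reply(reply: str) -> float:
    """Map the verifier's free-form reply to a scalar reward in [0, 1]."""
    return 1.0 if reply.strip().lower().startswith("yes") else 0.0
```

In a GRPO-style loop, this reward would be summed with the usual format and answer-accuracy rewards for each rollout.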

Note: During training, we adopt the following settings for efficiency:

  • VIDEO PIXELS: 128 ร— 28 ร— 28
  • FPS FRAMES: 16

All frame-related configurations can be adjusted in src/qwen-vl-utils.
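The pixel budgets quoted above are multiples of the model's 28-pixel visual patch. The arithmetic below just makes that relationship explicit; the variable names are ours, and the real constants live in src/qwen-vl-utils:

```python
# Illustrative arithmetic only: relating the quoted pixel budgets to the
# 28-pixel patch size. Names are ours, not the repository's constants.
PATCH = 28                                  # visual patch edge in pixels
TRAIN_VIDEO_PIXELS = 128 * PATCH * PATCH    # training budget: 128 patches/frame
EVAL_VIDEO_PIXELS = 256 * PATCH * PATCH     # inference budget: doubled
TRAIN_MAX_FRAMES = 16                       # frames sampled during training
EVAL_MAX_FRAMES = 32                        # frames sampled at inference
```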

📈 Evaluation

During inference, we increase the maximum frame resolution and length to boost performance:

  • VIDEO PIXELS: 256 ร— 28 ร— 28
  • FPS FRAMES: 32

You can configure these parameters in src/qwen-vl-utils.

Evaluation Procedure

📊 VKnowU

  1. Download the video and JSON data from VKnowU and organize them.

  2. Run the evaluation on VKnowU:

bash ./src/eval_vknowu.sh

  3. Calculate the overall accuracy:

python ./src/eval/calculate_vknowu.py
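As a rough sketch of what such an accuracy script does, the helper below aggregates per-question results into overall and per-type accuracy. This is our assumption about the record schema (a knowledge-type tag plus a correctness flag), not the repository's actual script:

```python
# Hypothetical aggregation of per-question results into overall and
# per-knowledge-type accuracy. The record schema is an assumption.
from collections import defaultdict

def summarize(records):
    """records: iterable of {"type": str, "correct": bool}."""
    per_type = defaultdict(lambda: [0, 0])          # type -> [correct, total]
    for r in records:
        per_type[r["type"]][0] += int(r["correct"])
        per_type[r["type"]][1] += 1
    total = sum(t for _, t in per_type.values())
    overall = sum(c for c, _ in per_type.values()) / total
    return overall, {k: c / t for k, (c, t) in per_type.items()}
```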

📊 Other Video Benchmarks

  1. Download the video data from the official site of each benchmark and organize it as specified by the JSON files in eval_data.

  2. Run the evaluation across other video benchmarks:

bash ./src/eval_bench.sh

  3. Calculate the overall accuracy:

python ./src/eval/calculate_bench.py

๐Ÿ™ Acknowledgements

We gratefully acknowledge the contributions of the open-source community, particularly R1-V and VideoRFT.

📚 Citations

If you find this work helpful, please consider citing:

@article{jiang2025vknowu,
  title={VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs},
  author={Jiang, Tianxiang and Xia, Sheng and Xu, Yicheng and Wu, Linquan and Zeng, Xiangyu and Wang, Limin and Qiao, Yu and Wang, Yi},
  journal={arXiv preprint arXiv:2511.20272},
  year={2025}
}
