
📊 VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

📖 ArXiv   │   📊 VKnowU   │   📀 VKnowQA   │   🤗 VideoKnow+   │   OpenCompass

While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level, vision-grounded semantics, which we term $\textbf{\textit{visual knowledge}}$, forms a bridge between perception and reasoning, yet remains underexplored in current MLLMs. To systematically evaluate this capability, we present 📊VKnowU, a comprehensive benchmark featuring 1,680 questions across 1,249 videos, covering 8 core types of visual knowledge spanning both $\textit{world-centric}$ (e.g., intuitive physics) and $\textit{human-centric}$ (e.g., subjective intentions) knowledge. Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps on the world-centric types. To bridge this gap, we introduce a new dataset, 📀VKnowQA, and 🤗VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured $\textit{See–Think–Answer}$ paradigm and adopts reinforcement learning with a visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.

✅ Release the training and evaluation code of VideoKnow+

✅ Release the benchmark: 📊VKnowU

✅ Release the model weights of 🤗VideoKnow+

✅ Release the 30K training datasets: 📀VKnowQA-CS-12K and 📀VKnowQA-30K

Requirements

  • Python >= 3.11
  • PyTorch >= 2.5.1
  • transformers == 4.51.3
  • vLLM == 0.7.3
  • trl == 0.16.0

Installation

git clone https://github.com/OpenGVLab/VKnowU
cd VKnowU

# Create and activate environment
conda create -n VKnowU python=3.11 
conda activate VKnowU
bash setup.sh

🚀 Training

Supervised Fine-Tuning (SFT)

We begin with supervised fine-tuning on the 📀VKnowQA-CS-12K dataset for one epoch:

bash ./src/scripts/run_sft_video.sh
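The SFT targets follow the structured See–Think–Answer paradigm described above. As a purely hypothetical illustration of what one such target could look like, here is a minimal formatter; the tag names and layout are our assumptions, and the actual format is defined by the VKnowQA-CS-12K annotations:

```python
# Hypothetical sketch of a See-Think-Answer style training target.
# Tag names (<see>, <think>, <answer>) are assumptions for illustration,
# not the repository's confirmed format.
def format_target(see: str, think: str, answer: str) -> str:
    """Assemble one structured target string for SFT."""
    return (
        f"<see>{see}</see>\n"
        f"<think>{think}</think>\n"
        f"<answer>{answer}</answer>"
    )
```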

Reinforcement Learning (RL)

Next, perform reinforcement learning on the 📀VKnowQA-30K dataset (with vLLM acceleration for faster training):

  1. Employ an external verifier MLLM to calculate the visual knowledge reward, and set the corresponding API here.

  2. Run the RL scripts:

bash ./src/scripts/run_grpo_vllm_qwen25vl.sh
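The verifier step above turns a judgment from an external MLLM into a scalar reward. The sketch below shows one plausible shape for that plumbing; the prompt template, the yes/no scoring rule, and all names here are our assumptions, not the repository's actual implementation:

```python
# Hypothetical sketch of a visual-knowledge reward backed by an external
# verifier MLLM. The prompt wording and the binary 0/1 scoring rule are
# assumptions for illustration only.
VERIFIER_PROMPT = (
    "You are grading a model's reasoning about a video.\n"
    "Reference visual knowledge: {reference}\n"
    "Model reasoning: {reasoning}\n"
    "Does the reasoning correctly apply the reference knowledge? "
    "Answer yes or no."
)

def build_verifier_prompt(reference: str, reasoning: str) -> str:
    """Fill the grading template for one rollout."""
    return VERIFIER_PROMPT.format(reference=reference, reasoning=reasoning)

def parse_verifier_reply(reply: str) -> float:
    """Map the verifier's free-form reply to a scalar reward in [0, 1]."""
    return 1.0 if reply.strip().lower().startswith("yes") else 0.0
```

In a GRPO-style loop, this reward would be summed with the usual format and answer-accuracy rewards for each rollout.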

Note: During training, we adopt the following settings for efficiency:

  • VIDEO PIXELS: 128 ร— 28 ร— 28
  • FPS FRAMES: 16

All frame-related configurations can be adjusted in src/qwen-vl-utils.
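The pixel budgets quoted above are multiples of the model's 28-pixel visual patch. The arithmetic below just makes that relationship explicit; the variable names are ours, and the real constants live in src/qwen-vl-utils:

```python
# Illustrative arithmetic only: relating the quoted pixel budgets to the
# 28-pixel patch size. Names are ours, not the repository's constants.
PATCH = 28                                  # visual patch edge in pixels
TRAIN_VIDEO_PIXELS = 128 * PATCH * PATCH    # training budget: 128 patches/frame
EVAL_VIDEO_PIXELS = 256 * PATCH * PATCH     # inference budget: doubled
TRAIN_MAX_FRAMES = 16                       # frames sampled during training
EVAL_MAX_FRAMES = 32                        # frames sampled at inference
```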

📈 Evaluation

During inference, we increase the maximum frame resolution and length to boost performance:

  • VIDEO PIXELS: 256 ร— 28 ร— 28
  • FPS FRAMES: 32

You can configure these parameters in src/qwen-vl-utils.

Evaluation Procedure

📊 VKnowU

  1. Download the video and JSON data from VKnowU and organize them.

  2. Run the evaluation on VKnowU:

bash ./src/eval_vknowu.sh

  3. Calculate the overall accuracy:

python ./src/eval/calculate_vknowu.py
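As a rough sketch of what such an accuracy script does, the helper below aggregates per-question results into overall and per-type accuracy. This is our assumption about the record schema (a knowledge-type tag plus a correctness flag), not the repository's actual script:

```python
# Hypothetical aggregation of per-question results into overall and
# per-knowledge-type accuracy. The record schema is an assumption.
from collections import defaultdict

def summarize(records):
    """records: iterable of {"type": str, "correct": bool}."""
    per_type = defaultdict(lambda: [0, 0])          # type -> [correct, total]
    for r in records:
        per_type[r["type"]][0] += int(r["correct"])
        per_type[r["type"]][1] += 1
    total = sum(t for _, t in per_type.values())
    overall = sum(c for c, _ in per_type.values()) / total
    return overall, {k: c / t for k, (c, t) in per_type.items()}
```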

📊 Other Video Benchmarks

  1. Download the video data from the official site of each benchmark and organize it as specified by the JSON files in eval_data.

  2. Run the evaluation across other video benchmarks:

bash ./src/eval_bench.sh

  3. Calculate the overall accuracy:

python ./src/eval/calculate_bench.py

๐Ÿ™ Acknowledgements

We gratefully acknowledge the contributions of the open-source community, particularly R1-V and VideoRFT.

📚 Citations

If you find this work helpful, please consider citing:

@article{jiang2025vknowu,
  title={VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs},
  author={Jiang, Tianxiang and Xia, Sheng and Xu, Yicheng and Wu, Linquan and Zeng, Xiangyu and Wang, Limin and Qiao, Yu and Wang, Yi},
  journal={arXiv preprint arXiv:2511.20272},
  year={2025}
}
