Why is [person4] pointing at [person1]?

Rationale: I think so because...

Visual Commonsense Reasoning (VCR) is a new task and large-scale dataset for cognition-level visual understanding.

With one glance at an image, we can effortlessly imagine the world beyond the pixels (e.g. that [person1] ordered pancakes). While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. In addition to answering challenging visual questions expressed in natural language, a model must provide a rationale explaining why its answer is true.
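To make the task format concrete, here is a minimal sketch of how one multiple-choice item might be represented and scored. It assumes a prediction counts only when both the answer and the rationale are right. The names `VCRItem`, `is_correct`, and `joint_accuracy` are hypothetical and are not part of any released VCR code, and the choice strings are placeholders rather than real dataset entries.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class VCRItem:
    """One multiple-choice item (hypothetical layout, for illustration only)."""
    question: str
    answer_choices: List[str]
    answer_label: int          # index of the correct answer
    rationale_choices: List[str]
    rationale_label: int       # index of the correct rationale


def is_correct(item: VCRItem, pred_answer: int, pred_rationale: int) -> bool:
    # Assumption: a prediction counts only if the answer AND the rationale match.
    return (pred_answer == item.answer_label
            and pred_rationale == item.rationale_label)


def joint_accuracy(items: Sequence[VCRItem],
                   predictions: Sequence[Tuple[int, int]]) -> float:
    """Fraction of items with both the answer and the rationale correct."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    hits = sum(is_correct(item, a, r) for item, (a, r) in zip(items, predictions))
    return hits / len(items)


# Placeholder item: the choice texts are stand-ins, not real annotations.
example = VCRItem(
    question="Why is [person4] pointing at [person1]?",
    answer_choices=["(placeholder answer 0)", "(placeholder answer 1)"],
    answer_label=0,
    rationale_choices=["(placeholder rationale 0)", "(placeholder rationale 1)"],
    rationale_label=1,
)
```

Scoring the answer and rationale jointly means a model that picks the right answer for the wrong reason gets no credit for that item.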

Overview of VCR

  • 290k multiple choice questions
  • 290k correct answers and rationales: one per question
  • 110k images
  • Counterfactual choices obtained with minimal bias, via our new Adversarial Matching approach
  • Answers are 7.5 words long on average; rationales are 16 words long on average.
  • High human agreement (>90%)
  • Scaffolded on top of 80 object categories from COCO
  • Questions are highly diverse and challenging: browse and see for yourself!

Subscribe to our list

VCR is an ongoing effort. For announcements, follow Rowan on Twitter » and subscribe to our Google group »

From Recognition to Cognition: Visual Commonsense Reasoning

If the paper inspires you, please cite us:

@inproceedings{zellers2019vcr,
  author = {Zellers, Rowan and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
  title = {From Recognition to Cognition: Visual Commonsense Reasoning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}

Authors

VCR is a joint effort between researchers at the University of Washington and AI2, along with a group of fantastic crowd workers who annotated the data. We're also grateful to the following sponsors: