On Agreements in Visual Understanding. 2019 Conference on Neural Information Processing Systems.
Grounding linguistic symbols in digitized images requires a reliable representation of visual concepts. Whether from a cognitive or a computational perspective, these iconic representations are meant to be reused across different linguistic contexts. For visual grounding applications, this poses an important premise question: what are the levels of agreement on purely visual content when no linguistic descriptions are involved? In particular, (i) the agreement between human assessors on purely visual content is a relevant indicator of the scalability of visual grounding as a computational approach to natural language understanding, and (ii) the agreement between computer models and human assessors can give insights into the difficulty of deriving added value from grounding. In this paper, we study these agreements through the design of a new image similarity collection. In particular, we study inter-human, inter-model, and human-model agreements both on open-domain images and on medical images, which involve a more challenging context. Our experiments show that coarse-grained agreement between human assessors, at different expertise levels, can reach 90.1% even when no linguistic descriptions are associated with the images. However, a detailed analysis of deep learning search results on our collection shows that different interpretations of the same neural layer yield markedly different perspectives on visual content, with average correlation ratios ranging between 0.1 and 0.4 for the top 50 results. Although these findings confirm that there is sufficient common ground in cognitive iconic representations to build relevant references for visually grounded language models, they also show that relying on a single model (or layer) for image representation is not suited to grounding applications, and that ensemble representations may be a more viable option.
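The inter-model agreement described in the abstract can be quantified as a rank correlation between the top-k result lists produced by two image representations. The abstract does not specify the exact correlation measure used; the following is a minimal sketch assuming a Kendall-tau-style rank correlation over a shared set of top-ranked images, with hypothetical item identifiers.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same items.

    rank_a and rank_b are sequences of the same item identifiers,
    ordered from most to least similar to the query. Returns a value
    in [-1, 1]: 1 for identical rankings, -1 for fully reversed ones.
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant if both rankings order x and y the same way.
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical top-5 result lists from two layer interpretations
# for the same query image; real experiments would use the top 50.
layer_view_1 = ["img_03", "img_17", "img_42", "img_08", "img_21"]
layer_view_2 = ["img_17", "img_03", "img_21", "img_42", "img_08"]

print(kendall_tau(layer_view_1, layer_view_2))
```

In practice, when the two representations do not return identical result sets, the correlation is computed over the intersection of the two top-k lists (or a set-overlap measure is used instead); the choice affects the absolute values reported.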