Deep Multimodal and Contextual Visual Recognition

April 7, 2017 at 3:30 pm by Luis Herranz

Affiliation: Postdoctoral Researcher at the Computer Vision Center

Place: Large Lecture Room


Visual recognition is often addressed using only RGB images. However, humans perceive and understand the world through multiple senses and leverage other sources of information, such as prior knowledge, context and experience. Mechanisms that incorporate additional contextual cues, prior knowledge and external information can therefore help to simplify difficult recognition problems. In this talk I will introduce some recent work in these directions, focusing on two particular scenarios: scene recognition and food recognition.

Current scene recognition systems use scene-centric convolutional neural networks trained on large RGB scene datasets. One possible way to improve performance is to combine scene-centric and object-centric networks, so that both scene and object knowledge are transferred and integrated. We also explored multimodal scene representations in which RGB images are augmented with depth maps.

Automatic food recognition has multiple applications (e.g., diet monitoring, diabetes control, recipe search, tagging). A common scenario is people eating out at restaurants and using their smartphones to retrieve information about their meals. In this scenario, rich contextual information can be exploited, from geolocation to restaurant and recipe databases available on the web.


Luis Herranz is a P-SPHERE postdoctoral fellow at the Computer Vision Center, Barcelona. He received his Ph.D. in Computer Science and Telecommunication from the Universidad Autónoma de Madrid, Spain, in 2010. He has worked at Mitsubishi Electric R&D Centre Europe, United Kingdom, and at the Institute of Computing Technology of the Chinese Academy of Sciences, Beijing, China. He has worked on diverse topics in multimedia and computer vision. His current research interests include deep learning, visual perception and understanding, and multimodal modeling.