The role of text and language in solving computer vision problems has received considerable attention in the recent past. In this talk, we present some of our attempts at bridging the textual and visual descriptions of the world. By exploiting the relationships between textual and visual descriptions, we are able to build superior solutions for annotation and retrieval of images and videos. We believe that to solve many immediate vision problems, one can avoid explicit recognition and instead use noisy, loosely connected descriptions.
In general, our methods attempt to align text and images that are either domain-specific or obtained from the wild. Our alignment methods use some level of supervision. We demonstrate this alignment at multiple granularities of text, i.e., words, phrases, and sentences. In each of these cases, alignment is formulated as an optimization/learning problem and solved with the help of data and, where applicable, domain knowledge.
Bio: C. V. Jawahar is a Professor at IIIT Hyderabad, India, where he heads the Center for Visual Information Technology (CVIT). Before joining IIIT Hyderabad, he worked with the Center for AI and Robotics, Bangalore. His research interests are in the broad areas of Computer Vision and Machine Learning. He has worked extensively in the area of document image analysis, with a focus on Indian languages.