Simple Inference and Generation Using Multimodal Information

CVC Seminar


Can we make computers understand language from text alone, or do we need further grounding, such as video and sound? This question has long been asked in the NLP community, and much evidence suggests that even very large pre-trained language models, the latest technological gem of NLP, cannot truly understand language without additional input signals.

In this CVC seminar, Dr. Shay Cohen will address several issues in providing such grounding. He will first show how multimodality can help with shallow as well as more complex inference, such as what is needed to understand crime drama. He will also show how such grounding can help with the generation of text, in the context of narrating videos. He will conclude with the notion that multimodal analysis and grounding, while related, are not identical to multiview learning, and will describe work that makes use of multiview learning to solve problems in captioning, parsing and word embedding derivation.

Short Bio:

Dr. Shay Cohen is a Reader at the University of Edinburgh (School of Informatics). Prior to this, he was a postdoctoral research scientist in the Department of Computer Science at Columbia University, and held an NSF/CRA Computing Innovation Fellowship. He received his B.Sc. and M.Sc. from Tel Aviv University in 2000 and 2004, respectively, and his Ph.D. from Carnegie Mellon University in 2011. His research interests span a range of topics in natural language processing and machine learning, with a focus on structured prediction (for example, parsing) and text generation.