Simple Inference and Generation Using Multimodal Information

Abstract: Can we make computers understand language from text alone, or do we need further grounding, such as video and sound? This question has been raised in the NLP community, with much evidence suggesting that even with very large pre-trained language models, the latest technological gem of NLP, we cannot truly …