Learning spatio-temporal video representations

Place: Large Lecture Room

Abstract: The state-of-the-art models for action recognition and other video tasks are currently based on spatio-temporal (3D) convolutions. Although highly effective, such models are computationally expensive and usually require orders of magnitude more FLOPs than their 2D counterparts. In this talk, I will present two recent models that achieve high performance while being more efficient than state-of-the-art approaches. In Multi-Fiber networks (ECCV 2018) we exploit group convolutions to cut the computational cost while keeping the discriminative power and performance of spatio-temporal convolutions. In our Double Attention Networks (NIPS 2018) we propose a novel, factorized self-attention mechanism that is able to gather and distribute information over the full spatio-temporal input tensor while remaining computationally and memory efficient. Both approaches are evaluated on image (ImageNet/classification) and video (Kinetics/action recognition) tasks and achieve state-of-the-art performance in both domains.
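
For readers who want a more concrete picture before the talk, below is a minimal PyTorch-style sketch of the two ideas the abstract mentions. It is an illustration under assumed layer sizes, not the authors' exact Multi-Fiber or Double Attention architectures: part (a) shows how a grouped 3D convolution divides parameters and FLOPs roughly by the number of groups, and part (b) shows a gather-and-distribute attention block over a flattened T×H×W tensor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# (a) Grouped spatio-temporal convolution: with groups=g, parameters and FLOPs
# drop roughly by a factor of g compared to a dense 3x3x3 convolution.
full_conv = nn.Conv3d(128, 128, kernel_size=3, padding=1)               # dense 3D conv
grouped_conv = nn.Conv3d(128, 128, kernel_size=3, padding=1, groups=8)  # grouped ("fiber"-style) split
print(sum(p.numel() for p in full_conv.parameters()),
      sum(p.numel() for p in grouped_conv.parameters()))                # roughly 8x fewer weights

# (b) Gather-and-distribute attention in the spirit of Double Attention Networks.
# c_m / c_n are illustrative bottleneck sizes, not the paper's exact settings.
class DoubleAttention3D(nn.Module):
    def __init__(self, in_channels, c_m, c_n):
        super().__init__()
        self.feats = nn.Conv3d(in_channels, c_m, 1)       # features to be gathered
        self.gather = nn.Conv3d(in_channels, c_n, 1)      # attention used for gathering
        self.distribute = nn.Conv3d(in_channels, c_n, 1)  # attention used for distributing
        self.project = nn.Conv3d(c_m, in_channels, 1)

    def forward(self, x):                                  # x: (N, C, T, H, W)
        n, _, t, h, w = x.shape
        L = t * h * w
        A = self.feats(x).view(n, -1, L)                            # (N, c_m, L)
        B = F.softmax(self.gather(x).view(n, -1, L), dim=-1)        # softmax over all T*H*W positions
        V = F.softmax(self.distribute(x).view(n, -1, L), dim=1)     # softmax over the c_n descriptors
        G = torch.bmm(A, B.transpose(1, 2))                         # gather: (N, c_m, c_n) global descriptors
        Z = torch.bmm(G, V).view(n, -1, t, h, w)                    # distribute back to every position
        return x + self.project(Z)                                  # residual connection (assumed)

block = DoubleAttention3D(in_channels=128, c_m=64, c_n=64)
y = block(torch.randn(2, 128, 8, 14, 14))                           # e.g. 8-frame clip features
```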

Short Bio: Dr. Yannis Kalantidis is a member of the Computer Vision Group at Facebook Research AML in Menlo Park. Before that, he was a scientist at Yahoo Research in San Francisco for two years. He was the principal scientist behind the visual similarity project at Yahoo/Flickr, which enables similarity search over many billions of Flickr images.