Context, regularized expert stacking, and transductive inference for large-scale action recognition in video

March 21, 2014 at 12:30 pm by

Place: Large Lecture Room
Affiliation: Computer Vision Centre, Barcelona, Spain.


In this seminar I will discuss the recent participation by my former research group at the University of Florence in the THUMOS Competition on Action Recognition with a Large Number of Classes. The 2013 THUMOS ECCV Workshop (and associated competition) was created in recognition of the need for the vision community to make a concerted effort to go beyond datasets with a limited number of action classes, such as KTH, Weizmann and IXMAS. The competition was organized around the UCF Action Dataset, which is currently the largest action dataset in terms of both number of categories and number of clips, with more than 2 million frames in 13,000 clips drawn from 101 action classes.


For the 2013 THUMOS Challenge we took a straightforward, recognition-based bag-of-features approach based on a variety of features extracted from both videos and keyframes. To a respectable baseline derived from quantized, organizer-provided features we added a number of novel enhancements developed in my lab in Florence: (i) action-specific scene context inferred from dense local SIFT pyramids on keyframes; (ii) expert fusion by L1-regularized logistic regression for stacking; and (iii) a CRF model for transductive labeling of test clips. Each of these enhancements to the basic bag-of-words pipeline was derived from independent results of (not necessarily recognition-related) research lines in my lab. In the course of describing our approach to THUMOS, I will take the opportunity to comment on these lines and to discuss future research directions.
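
As a rough illustration of the stacking idea in (ii), the sketch below fuses the scores of several "experts" (for example, one classifier per feature type) with an L1-regularized logistic regression for a single action class. It is not the code used for the challenge: the toy data, the one-vs-rest binary setup, the proximal-gradient solver, and all names (`fit_l1_logistic`, `soft_threshold`, `l1`, `step`) are assumptions made for this example.

```python
import math
import random


def soft_threshold(value, amount):
    """Proximal operator of the L1 norm: shrink value toward zero by amount."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logistic(scores, labels, l1=0.05, step=0.5, iters=500):
    """Fit fusion weights w and bias b by proximal gradient descent.

    scores: one vector of expert scores per clip; labels: 0/1 per clip.
    The L1 penalty applies to the weights only, not to the bias.
    """
    n, k = len(scores), len(scores[0])
    w, b = [0.0] * k, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in zip(scores, labels):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            grad_b += err / n
            for j in range(k):
                grad_w[j] += err * x[j] / n
        b -= step * grad_b
        # Gradient step on the logistic loss, then L1 shrinkage of each weight.
        w = [soft_threshold(wj - step * gj, step * l1) for wj, gj in zip(w, grad_w)]
    return w, b


def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical toy data: expert 0 is informative, expert 1 is weaker,
# and expert 2 is pure noise.
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
scores = [
    [
        (1.0 if y else -1.0) + rng.gauss(0, 0.5),
        (0.5 if y else -0.5) + rng.gauss(0, 1.0),
        rng.gauss(0, 1.0),
    ]
    for y in labels
]

weights, bias = fit_l1_logistic(scores, labels)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(scores, labels)) / len(labels)
```

In this toy setting the L1 penalty tends to push the weight of the uninformative expert toward zero while keeping the informative one, which is the general motivation for sparse stacking; with 101 classes one such model per class would be one possible arrangement.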


