Place: Large Lecture Room
Affiliation: Computer Vision Centre and Dept. of Computer Science, UAB.
The increasing ubiquity of digital information in our daily lives has positioned video as a favored information vehicle and given rise to an enormous volume of social media and surveillance footage. This raises a series of technological demands for automatic video understanding and management which, together with the attentional limitations of human operators, have motivated the research community to pursue such capabilities. As a result, current trends in cognitive vision promise to recognize complex events and to self-adapt to different environments, while managing and integrating several types of knowledge. Future directions suggest reinforcing the multimodal fusion of information sources and the communication with end-users. In this thesis we tackle the problem of recognizing and describing meaningful events in video sequences from different domains, and of communicating the resulting knowledge to end-users by means of advanced interfaces for human–computer interaction. This problem is addressed by designing the high-level modules of a cognitive vision framework that exploits ontological knowledge. Ontologies allow us to define the relevant concepts in a domain and the relationships among them; we show that using ontologies to organize, centralize, link, and reuse different types of knowledge is a key factor in achieving our objectives.
The proposed framework contributes to: (i) automatically learning the characteristics of different scenarios in a domain; (ii) reasoning about uncertain, incomplete, or vague information from visual (camera) or linguistic (end-user) inputs; (iii) deriving plausible interpretations of complex events from basic spatiotemporal developments; (iv) facilitating natural interfaces that adapt to the needs of end-users and allow them to communicate efficiently with the system at different levels of interaction; and finally, (v) finding mechanisms to guide modeling processes, to maintain and extend the resulting models, and to exploit multimodal resources synergistically to enhance the former tasks. We describe a holistic methodology to achieve these goals. First, prior taxonomical knowledge proves useful to guide MAP-MRF inference in the automatic identification of semantic regions, independently of the particular scenario. Towards the recognition of complex video events, we combine fuzzy metric-temporal reasoning with Situation Graph Trees (SGTs), thus deriving high-level interpretations from spatiotemporal data. Here, ontological resources such as T-Boxes, onomasticons, and factual databases become useful to provide video indexing and retrieval capabilities, and to forward highlighted content to smart user interfaces. We then explore the application of ontologies to discourse analysis, cognitive linguistic principles, and scene augmentation techniques towards advanced communication by means of natural language dialogues and synthetic visualizations. Ontologies are fundamental to coordinating, adapting, and reusing the different modules in the system. The suitability of our ontological framework is demonstrated by a series of applications that especially benefit the field of smart video surveillance, viz.
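As a rough illustration of the fuzzy metric-temporal reasoning mentioned above, the following minimal Python sketch shows how uncertain spatiotemporal observations can be lifted to a graded event label. All predicate names, membership shapes, and thresholds here are hypothetical placeholders, not the thesis implementation; real SGT-based systems traverse situation graphs rather than a single conjunction.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal fuzzy membership: 0 outside (a, d), 1 on [b, c], linear ramps between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def close_to(p, q):
    """Degree to which two 2-D positions are 'close' (distance units are arbitrary)."""
    dist = ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    return trapezoid(dist, -1.0, 0.0, 1.0, 3.0)

def standing_still(speed):
    """Degree to which a speed value counts as 'standing still'."""
    return trapezoid(speed, -1.0, 0.0, 0.2, 0.5)

# Classic fuzzy conjunction: min over the combined degrees.
def f_and(*degrees):
    return min(degrees)

def meeting_degree(track_a, track_b):
    """Degree to which two synchronized tracks instantiate a 'meeting' situation:
    both agents close together AND both roughly still, sustained over all frames
    (min over time, so the situation must hold throughout)."""
    per_frame = [
        f_and(close_to(a["pos"], b["pos"]),
              standing_still(a["speed"]),
              standing_still(b["speed"]))
        for a, b in zip(track_a, track_b)
    ]
    return min(per_frame) if per_frame else 0.0
```

For example, two nearby, slow-moving tracks yield a degree near 1, while distant tracks yield 0; intermediate distances or speeds fall on the ramps of the membership functions, which is what lets the system rank plausible interpretations instead of making brittle yes/no decisions.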
automatic generation of linguistic reports about the content of video sequences in multiple natural languages; content-based filtering and summarization of these reports; dialogue-based interfaces to query and browse video contents; automatic learning of semantic regions in a scenario; and tools to evaluate the performance of components and models in the system, via simulation and augmented reality.