Towards End-to-End Networks for Visual Tracking in RGB and TIR Videos

October 18, 2019 at 11:30 am by

Place: CVC Sala d’actes

Committee:

  • Dr. Juan Carlos SanMiguel (Electronic Technology and Communications Department, Universidad Autónoma de Madrid)
  • Dr. Daniel Ponsa (Centre de Visió per Computador, Universitat Autònoma de Barcelona)
  • Dr. Michael Felsberg (Department of Electrical Engineering, Linköping University)
  • Dr. Jordi Gonzalez (Centre de Visió per Computador, Universitat Autònoma de Barcelona)
  • Dr. Javier Vázquez Corral (Departament de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra)

Thesis Director:

  • Dr. Joost van de Weijer (Centre de Visió per Computador, Universitat Autònoma de Barcelona)

Thesis Co-Directors:

  • Dr. Abel Gonzalez-Garcia (Centre de Visió per Computador)
  • Dr. Fahad Shahbaz Khan (Department of Electrical Engineering, Linköping University)

Abstract:

In this work, we identify several problems of current tracking systems. The lack of large-scale labeled datasets hampers the use of deep learning, especially end-to-end training, for tracking in thermal infrared (TIR) images. As a result, many methods for tracking on TIR data are still based on hand-crafted features. The same holds for multi-modal tracking, e.g., RGB-T tracking, whose development is further hampered by the scarcity of research on fusion mechanisms for combining information from the RGB and TIR modalities. One of the crucial components of most trackers is the update module. In existing end-to-end tracking architectures, e.g., Siamese trackers, the online model update is still not taken into consideration at the training stage; at inference they use either no update or a linear update strategy. While such a hand-crafted approach to updating has led to improved results, its simplicity limits the potential gain likely to be obtained by learning to update.

To address the data scarcity for TIR and RGB-T tracking, we use image-to-image translation to generate a large-scale synthetic TIR dataset. This dataset allows us to perform end-to-end training for TIR tracking. Furthermore, we investigate several fusion mechanisms for RGB-T tracking. The multi-modal trackers are also trained in an end-to-end manner on the synthetic data. To improve on the standard online update, we pose the updating step as an optimization problem that can be solved by training a neural network. Our approach thereby reduces the hand-crafted components in the tracking pipeline and takes a further step towards a complete end-to-end trained tracking network that also considers updating during optimization.
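
For readers unfamiliar with the linear update strategy mentioned in the abstract, it is commonly a running average of the object template. The following is a minimal sketch under that assumption, not the thesis implementation; the function name, the flat list representation of the template features, and the default rate are all hypothetical and chosen only for illustration.

```python
def linear_update(template, new_template, rate=0.01):
    """Blend the accumulated template with the template from the current frame.

    Hypothetical helper: `template` and `new_template` are flat lists of
    feature values; `rate` is an assumed update rate, not a value from the thesis.
    """
    if len(template) != len(new_template):
        raise ValueError("templates must have the same length")
    # Running average: keep (1 - rate) of the old template, add rate of the new one.
    return [(1.0 - rate) * old + rate * new
            for old, new in zip(template, new_template)]


# Setting rate=0.0 reproduces the "no update" case: the first template is kept unchanged.
updated = linear_update([1.0, 0.0], [0.0, 1.0], rate=0.25)
```

The single fixed rate is the hand-crafted component here: it is applied identically to every frame and every feature, which is the simplicity the abstract proposes to replace with a learned update.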