WORKSHOP | Multimodal Foundation Models: from Research to Innovation

On 1 July 2026, the European AI research community will gather in Barcelona for the Workshop on Multimodal Foundation Models: From Research to Innovation.

This half-day workshop will explore the latest advances in multimodal artificial intelligence, with a focus on how cutting-edge research can be translated into real-world innovation. Bringing together leading researchers, industry representatives and European AI initiatives, the event will create a space for scientific exchange, applied perspectives and discussion on the future of AI-driven innovation.

The workshop is open to the public and will take place at the Institut d’Estudis Catalans (IEC) in Barcelona, with the possibility to attend either in person or online.

  • Wednesday, 1 July 2026
  • 09:30 h – 13:00 h
  • Institut d’Estudis Catalans (Sala Prat de la Riba), Barcelona
  • On-site and online

Multimodal foundation models are becoming central to the next generation of artificial intelligence. By learning from and integrating different types of data — such as text, images, video, audio, 3D, sensors, and other data streams — these models open new possibilities for more generalisable, robust, and adaptable AI systems.

The workshop will feature invited talks by leading researchers in the field, presenting recent developments in multimodal foundation models. These scientific contributions will be complemented by an invited industry talk and a panel discussion on the role of AI-driven innovation in connecting research breakthroughs with practical applications.

The event is hosted by the Computer Vision Center (CVC), co-organised by ELLIOT, ELLIS Unit Barcelona, the ELLIS Programme on Multimodal Learning Systems, ELIAS, and XARXA RDI-IA, and supported by the city council of Barcelona.

(!) This program is subject to updates

9:30 h | Welcome

TBD

9:40 h | Keynote Talk: Dr Ioannis Patras

TBD

10:20 h | Talk: Modelling in an Ego-Sensed World | Dr Dima Damen

Dr Dima Damen, Professor of Computer Vision, University of Bristol, UK

Abstract: (TBD)

Bio: 
Dima Damen is a Professor of Computer Vision at the University of Bristol and Senior Research Scientist at Google DeepMind. Dima is currently an EPSRC Fellow (2020-2026), focusing her research on the automatic understanding of object interactions, actions and activities using wearable visual (and depth) sensors. She is best known for her leading work in Egocentric Vision, and has also contributed to novel research questions including mono-to-3D, video object segmentation, assessing action completion, domain adaptation, skill/expertise determination from video sequences, discovering task-relevant objects, dual-domain and dual-time learning as well as multi-modal fusion using vision, audio and language. She is the project lead for EPIC-KITCHENS, the seminal dataset in egocentric vision, with accompanying open challenges and follow-up works: EPIC-Sounds, VISOR and EPIC Fields, as well as the recent HD-EPIC. She is part of the large-scale consortium efforts Ego4D and Ego-Exo4D. She is an ELLIS Fellow, associate editor (AE) of IJCV, and was a program chair for ICCV 2021 and Associate Editor-in-Chief (AEIC) of IEEE TPAMI (2023-2025). She is frequently a Senior Area Chair and Area Chair at major conferences and was selected as Outstanding SAC at ECCV 2024 and Outstanding Reviewer at CVPR 2021, CVPR 2020, ICCV 2017, CVPR 2013 and CVPR 2012. Dima received her PhD from the University of Leeds (2009) and joined the University of Bristol as a Postdoctoral Researcher (2010-2012), then served as Assistant Professor (2013-2018) and Associate Professor (2018-2021), and was appointed as chair in August 2021. She supervises 9 PhD students and 3 visiting PhD students.

10:40 h | Talk: Towards Accurate and Efficient Model Merging | Dr Joost van de Weijer

Dr Joost van de Weijer, PI of the Learning and Machine Perception (LAMP) team at the Computer Vision Center (CVC/UAB)

Abstract: 

Model merging has emerged as a promising paradigm for combining multiple task-specific models into a single multi-task model without requiring additional training. In this talk, I will present two recent directions that address key challenges in model merging. First, I will discuss isotropic model merging, which analyzes the role of singular component alignment in successful merging and introduces common and task-specific subspaces to improve performance and reduce the gap to single-task models. Second, I will present Core Space merging, an efficient framework for merging LoRA-adapted models directly in a shared low-rank representation, preserving the computational benefits of parameter-efficient adaptation while improving accuracy. 

Bio: 

Joost van de Weijer leads the Learning and Machine Perception (LAMP) team at the Computer Vision Center, Universitat Autònoma de Barcelona. He received his PhD from the University of Amsterdam in 2005, focusing on physics-based computer vision for improved color understanding in images. He was a Marie Curie Intra-European Fellow at INRIA Rhône-Alpes, France, and later held a Ramón y Cajal fellowship at the Universitat Autònoma de Barcelona. His current research centers on lifelong learning in deep neural networks, encompassing areas such as continual learning, transfer learning, and domain adaptation. In addition, he works on generative models for visual data, with recent efforts aimed at improving efficiency and text-based control in diffusion models. He is a Fellow of the ELLIS Society.

11:00 h | Coffee Break

11:20 h | Talk: Robot World Models | Dr Jai Bardhan

Dr Jai Bardhan, Researcher, Czech Institute of Informatics, Robotics and Cybernetics (CIIRC CTU) 

Abstract: 

Robot world models promise a data-driven replacement for hand-built simulators, supporting policy evaluation, improvement, and planning from video alone. Pretrained video diffusion models are a natural starting point — but a strong visual prior is not yet a usable simulator. This talk presents two ways to adapt one, Stable Video Diffusion, into a deployable world model, both evaluated on DROID. PersistWorld tackles temporal stability: it post-trains the model on its own autoregressive rollouts via RL, closing the loop that otherwise makes long-horizon predictions degrade. DepthWorld tackles geometric grounding: it jointly predicts RGB and depth so rollouts compose into a consistent 3D world, with depth supervision improving RGB prediction itself.

Bio: 

Jai Bardhan received his B.Tech. degree in computer science and M.S. degree in computational natural sciences from IIIT Hyderabad, India, in 2023. He is currently a researcher in the Intelligent Machine Perception group at the Czech Institute of Informatics, Robotics and Cybernetics (CIIRC CTU), Prague, Czech Republic. His current research interests include world models for robotic manipulation, 3D geometric learning, vision-language-action (VLA) models and self-supervised representation learning.

11:40 h | Talk: Leveraging CLIP for Medical Anomaly Detection | Dr Vittorio Murino

Dr Vittorio Murino, Professor at the University of Verona and PI of the AI for Good research unit at the Italian Institute of Technology

Abstract: 

This talk will present a novel few-shot anomaly detection approach, leveraging the pre-trained CLIP model for medical data, and adapting it for both image-level anomaly classification (AC) and pixel-level anomaly segmentation (AS). A dual-branch design is proposed to separately capture normal and abnormal features through learnable adapters in the CLIP vision encoder. To improve semantic alignment, learnable text prompts are employed to link visual features. Furthermore, SigLIP loss is applied to effectively handle the many-to-one relationship between images and unpaired text prompts, showcasing its adaptation in the medical field for the first time. Our approach is validated on multiple modalities, demonstrating superior performance over existing methods for AC and AS, in both same-dataset and cross-dataset evaluations. 

Bio: 

Vittorio Murino is a full professor of Computer Vision and Machine Learning at the University of Verona, Italy, and principal investigator of the AIGO – AI for Good research unit at the Italian Institute of Technology in Genova, Italy.

He was chairman of the Department of Computer Science at the University of Verona from 2001, the year of its foundation, to 2007. From 2009 to 2019, he worked at the Italian Institute of Technology as principal investigator of the Pattern Analysis and Computer Vision (PAVIS) research unit. From 2019 to 2021, he worked as Senior Video Intelligence Expert at the Ireland Research Centre of Huawei Technologies (Ireland) Co., Ltd. in Dublin.

His main research interests are computer vision, machine learning and pattern recognition, specifically focusing on deep learning approaches in imperfect data regimes (unsupervised, self-supervised, few/noisy labeling, and biased scenarios), domain adaptation and generalization, and multimodal learning for behavior analysis and related applications.

Prof. Murino is co-author of more than 400 papers published in refereed journals and international conferences, and a member of the technical and organization committees of important conferences (CVPR, ICCV, ECCV, ICLR, ICML, NeurIPS, ICPR).

Prof. Murino is an IEEE Fellow, IAPR Fellow, and ELLIS Fellow.

12:00 h | Talk (TBC)

12:20 h | Panel Discussion

Dr Jordina Torrents Barrena, Senior AI Manager and Principal AI Architect, AI Lab, HP Inc.

Jordina Torrents Barrena received her Computer Engineering BSc degree from Universitat Rovira i Virgili in 2014. She obtained her first MSc degree in Computer Engineering: Computer Security and Intelligent Systems (with honours) at Universitat Rovira i Virgili in 2016 and a second MSc degree in Computer Vision at Universitat Autònoma de Barcelona in 2016. She obtained her PhD degree from Universitat Pompeu Fabra on artificial intelligence, deep learning and medical imaging in December 2019. During her PhD, she completed secondments at King’s College London and Google UK. So far, she has published over 60 articles in peer-reviewed scientific journals and international conferences and 3 book chapters, and holds 6 patents. She has received the Google Women Techmakers 2017, Premi Dona TIC “Revelació” 2018, COE & Oracle Big Data and Artificial Intelligence Talent Awards 2019, McKinsey & Company “Next Generation Women Leaders” 2020, Ada Byron Award (Young) 2021, Society of Women Engineers (SWE) Patent Award 2024, SWE Trailblazer Award 2025, and the Official College of Computer Engineering, La Nit Awards (Talent Jove) 2026, for her strong academic record, passion for increasing the involvement of women in computer science and demonstrated leadership. Jordina is currently the Senior AI Manager and Principal AI Architect at the AI Lab (HP Inc.). She is also an associate professor at Universitat Oberta de Catalunya and Universitat Ramon Llull.

Dr Lukas Fischer, Head of Applied Research, NXAI

Lukas Fischer is Head of Applied Research at NXAI, where he leads the development of efficient foundation models and industrial AI solutions. Previously, he served as Research Manager for Data Science at the Software Competence Center Hagenberg (SCCH), scientifically managing and coordinating the Data Science area with more than 50 researchers across multiple research groups. He holds a PhD in Medical Physics and conducted research at the Computational Imaging Research Lab (CIR) at the Medical University of Vienna, focusing on computer vision methods for medical image analysis and the quantification of trabecular bone microarchitecture. His research interests span foundation models, machine learning, deep learning, multimodal AI, computer vision, industrial AI, and trustworthy AI. Lukas has served as a reviewer and program committee member for leading journals and conferences, including Nature Scientific Reports, IEEE TMI, IEEE JBHI, MICCAI, and MIDL. In addition, he has extensive experience coordinating and managing national and European research and innovation projects, bridging cutting-edge AI research with real-world industrial applications.

13:00 h | Lunch