Enterprise Visual Understanding with Vision-Language Models: From Documents to Intelligent Agents
David Vázquez
Director of Research Programs at ServiceNow Research
David Vázquez is Director of Research Programs at ServiceNow Research, where he leads the Fundamental AI Research group. His current work focuses on multimodal learning, vision–language models, reasoning for enterprise applications, web agents, and data-efficient learning. He has published extensively in top venues such as NeurIPS, ICLR, ICML, CVPR, ICCV, and ACL, contributing to advances in multimodal document understanding (BigDocs), chart reasoning (BigCharts), visual content-to-code generation (StarFlow, StarVector), and alignment techniques for VLMs (AlignVLM).
David holds degrees in Software Engineering from the Universidade da Coruña and in Computer Science from the Universitat Autònoma de Barcelona (UAB), where he also earned a PhD in Computer Vision and AI. He completed postdoctoral fellowships at the Computer Vision Center (CVC) and at MILA under Aaron Courville, funded by a Marie Curie Fellowship. Earlier in his career, he worked on autonomous driving technologies, creating the SYNTHIA dataset and an autonomous driving simulator, and contributing to real vehicle prototypes with a focus on perception (object detection, semantic segmentation, 3D reconstruction, and SLAM). He is also an Adjunct Professor at UAB, where he continues to teach and supervise graduate research.

Vision-Language Models (VLMs) have demonstrated remarkable progress in natural image understanding and creative generation, yet their performance often falls short on enterprise-critical tasks such as document analysis, chart reasoning, workflow automation, and user interface navigation. This talk will present recent advances in adapting multimodal foundation models to enterprise applications, with a focus on text-rich visual understanding, document intelligence, and visual content-to-code generation. It will introduce datasets and benchmarks such as BigDocs, BigCharts, StarFlow, and StarVector, designed to push VLMs toward real-world enterprise use cases, and will discuss AlignVLM, a robust architecture that bridges visual and textual representations to achieve competitive results on challenging document benchmarks. Finally, it will highlight how these models enable the next generation of AI agents, systems capable of reasoning, planning, and acting, by grounding natural language instructions in complex graphical user interfaces. Together, these directions illustrate a path toward enterprise-ready multimodal AI that is accurate, reliable, and adaptable.