Towards better cross-modal learning by Probabilistic embedding and AdamP optimizer

CVC Seminar


Cross-modal retrieval methods build a common representation space for samples from multiple modalities, typically from the vision and the language domains. For images and their captions, the multiplicity of the correspondences makes the task particularly challenging: given an image (respectively a caption), there are multiple captions (respectively images) that make equal sense. This talk introduces Probabilistic Cross-Modal Embedding (PCME) [1], in which samples from the different modalities are represented as probability distributions in the common embedding space. Since common benchmarks such as COCO suffer from non-exhaustive annotations for cross-modal matches, we propose to (1) additionally evaluate retrieval on the CUB dataset, a smaller yet clean database where all possible image-caption pairs are annotated, and (2) use a new evaluation metric for COCO Captions, called plausible match R-Precision (PMRP). The experimental results demonstrate that PCME not only improves retrieval performance over its deterministic counterpart, but also provides uncertainty estimates that make the embeddings more interpretable.
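To make the idea of distribution-valued embeddings concrete, the following is a minimal sketch and not the PCME implementation. It assumes each sample is embedded as a diagonal Gaussian (a mean and a standard deviation per dimension) and estimates a soft match probability by averaging a sigmoid of the negative distance over sampled point pairs. The function names, the 2-D toy embeddings, and the fixed values of `a` and `b` are all hypothetical choices for illustration; in a trained model such parameters would be learned.

```python
import math
import random


def sample_embedding(mean, std, rng):
    """Draw one point from a diagonal Gaussian embedding."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def match_probability(mean_a, std_a, mean_b, std_b,
                      n_samples=50, a=5.0, b=2.0, seed=0):
    """Monte Carlo estimate of a soft match probability between two
    Gaussian embeddings. `a` and `b` are hypothetical fixed constants."""
    rng = random.Random(seed)
    samples_a = [sample_embedding(mean_a, std_a, rng) for _ in range(n_samples)]
    samples_b = [sample_embedding(mean_b, std_b, rng) for _ in range(n_samples)]
    total = 0.0
    for za in samples_a:
        for zb in samples_b:
            dist = math.dist(za, zb)
            # sigmoid(-a * dist + b): close pairs score near 1, far pairs near 0
            total += 1.0 / (1.0 + math.exp(a * dist - b))
    return total / (n_samples * n_samples)


# Toy 2-D embeddings (made up for illustration): one image, two captions.
image = ([0.0, 0.0], [0.1, 0.1])
close_caption = ([0.1, 0.0], [0.1, 0.1])
far_caption = ([2.0, 2.0], [0.1, 0.1])

p_close = match_probability(*image, *close_caption)
p_far = match_probability(*image, *far_caption)
print(p_close, p_far)
```

In this toy setting the nearby caption receives a much higher match probability than the distant one, and the per-dimension standard deviations can be read as a rough uncertainty estimate for each embedding.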

In the second half of this talk, we will show that gradient descent-based optimizers suffer from the monotonic norm increase problem, especially when momentum is used. We verify that the widely adopted combination of the two ingredients (batch normalization and momentum) leads to the premature decay of effective step sizes and to sub-optimal model performance. We propose a simple and effective remedy, SGDP and AdamP [2], which remove the radial component, that is, the norm-increasing direction, at each optimizer step. Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. They range from vision tasks like classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. Our solution brings about uniform gains across those benchmarks.
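The removal of the radial component can be illustrated with a small sketch. This is a simplified illustration, not the SGDP or AdamP implementation: it applies the projection unconditionally to a single weight vector and ignores how one decides which weights are scale-invariant. The weight, the update (standing in for a momentum buffer that has accumulated a radial part), and the learning rate are made-up values.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def remove_radial_component(weight, update):
    """Drop the part of `update` that is parallel to `weight`,
    keeping only the component tangent to the sphere of radius |weight|."""
    norm_sq = sum(w * w for w in weight)
    if norm_sq == 0.0:
        return list(update)
    coeff = sum(w * u for w, u in zip(weight, update)) / norm_sq
    return [u - coeff * w for w, u in zip(weight, update)]


weight = [3.0, 4.0]      # toy weight vector with norm 5
update = [-2.0, -1.0]    # hypothetical momentum update with a radial part
lr = 0.1                 # hypothetical learning rate

projected = remove_radial_component(weight, update)

# Compare the weight norm after a plain step and after a projected step.
plain_step = [w - lr * u for w, u in zip(weight, update)]
projected_step = [w - lr * p for w, p in zip(weight, projected)]

print(norm(weight), norm(plain_step), norm(projected_step))
```

In this example the projected update is orthogonal to the weight, so the weight norm still grows, but by less than under the plain step. For a scale-invariant weight, a slower norm increase means a slower decay of the effective step size.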

[1] Probabilistic Embeddings for Cross-Modal Retrieval, CVPR’21

[2] AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights, ICLR’21

Short bio:

Mr. Chun is a research scientist and tech leader at NAVER AI Lab, working on machine learning and its applications. In particular, his research interests focus on reliable machine learning, including robustness, de-biasing, uncertainty estimation, explainability, and fair evaluation. Prior to joining NAVER, he worked as a research engineer on the advanced recommendation team (ART) at Kakao from 2016 to 2018.

Mr. Chun received a master’s degree in Electrical Engineering from Korea Advanced Institute of Science and Technology (KAIST) in 2016. During his master’s studies, he researched a scalable algorithm for robust subspace clustering (the algorithm is based on robust PCA and k-means clustering). Before his master’s studies, he worked at IUM-SOCIUS in 2012 as a software engineering intern. He also did research internships at the Networked and Distributed Computing System Lab at KAIST and at NAVER Labs during summer 2013 and fall 2015, respectively.