X-CLIP: End-to-End Multi-grained Contrastive Learning
for Video-Text Retrieval

  • 1 MAC-Lab, Xiamen University
  • 2 Institute of Artificial Intelligence, Xiamen University
  • 3 DAMO Academy, Alibaba Group
  • ✉ corresponding author
Accepted to ACM MM 2022 (Main Track)
TL;DR: X-CLIP is a video-text retrieval model with multi-grained contrastive learning based on CLIP.
Abstract
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The development of video-text retrieval has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, i.e., the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrast, cross-grained contrast calculates the correlation between the coarse-grained feature and each fine-grained feature, and can filter out unnecessary fine-grained features under the guidance of the coarse-grained feature during similarity calculation, thus improving retrieval accuracy. To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. A further challenge lies in similarity aggregation: the fine-grained and cross-grained similarity matrices must be reduced to instance-level similarities. To address this challenge, we propose the Attention Over Similarity Matrix (AOSM) module, which makes the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results. With multi-grained contrast and the proposed AOSM module, X-CLIP achieves outstanding performance on five widely-used video-text retrieval datasets: MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1 R@1), DiDeMo (47.8 R@1), and ActivityNet (46.2 R@1). It outperforms the previous state-of-the-art by +6.3%, +6.6%, +11.1%, +6.7%, and +3.8% relative improvements on these benchmarks, demonstrating the superiority of multi-grained contrast and AOSM.
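To make the multi-grained contrast and AOSM concrete, below is a minimal PyTorch sketch of the similarity computation for a single video-caption pair. It is a sketch under simplifying assumptions, not the official implementation: the feature names, the mean-pooling used to build the coarse-grained features, the temperature tau, and the one-sided aosm helper are all illustrative (in the paper, the features come from CLIP's encoders, and AOSM attends over both the row and column axes of the frame-word similarity matrix).

import torch
import torch.nn.functional as F

def aosm(sim: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
    # Softmax-weighted pooling over the last dimension: high-similarity
    # entries (essential frames/words) receive large attention weights,
    # so unnecessary entries contribute little to the pooled score.
    attn = F.softmax(sim / tau, dim=-1)
    return (attn * sim).sum(dim=-1)

# Toy inputs: one video with 12 frames, one caption with 8 words, dim 512.
num_frames, num_words, dim = 12, 8, 512
frames = F.normalize(torch.randn(num_frames, dim), dim=-1)  # fine-grained video
words = F.normalize(torch.randn(num_words, dim), dim=-1)    # fine-grained text
video = F.normalize(frames.mean(dim=0), dim=-1)             # coarse-grained video
sent = F.normalize(words.mean(dim=0), dim=-1)               # coarse-grained text

s_vs = video @ sent                  # coarse-grained: video-sentence, scalar
s_vw = aosm(video @ words.T)         # cross-grained: video-word, (8,) -> scalar
s_fs = aosm(frames @ sent)           # cross-grained: frame-sentence, (12,) -> scalar
s_fw = aosm(aosm(frames @ words.T))  # fine-grained: frame-word, (12, 8) -> scalar

score = (s_vs + s_vw + s_fs + s_fw) / 4  # instance-level similarity

The temperature tau controls how sharply the attention concentrates on the most similar entries: a very small tau approaches max-pooling, while a large tau approaches mean-pooling.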
Bibtex
@inproceedings{ma2022x,
  title={X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval},
  author={Ma, Yiwei and Xu, Guohai and Sun, Xiaoshuai and Yan, Ming and Zhang, Ji and Ji, Rongrong},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  pages={638--647},
  year={2022}
}
Related Works

There are many wonderful works that may interest you.

+ CLIP is a seminal work in large-scale image-text contrastive pre-training.

+ CLIP4Clip is the first work to transfer the knowledge of CLIP to the video-text retrieval task.

Acknowledgement

This work was supported by the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, and No. 62002305), the Guangdong Basic and Applied Basic Research Foundation (No. 2019B1515120049), and the Natural Science Foundation of Fujian Province of China (No. 2021J01002).

This work was also supported by Alibaba Group through the Alibaba Research Intern Program.