Abstract
Video-text retrieval has been a crucial and fundamental task in
multi-modal research. The development of video-text retrieval has
been considerably promoted by large-scale multi-modal contrastive
pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, which is the
contrast between coarse-grained representations and fine-grained
representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrasts, cross-grained
contrast calculates the correlation between coarse-grained features and each fine-grained feature, and is able to filter out unnecessary fine-grained features under the guidance of the coarse-grained feature
during similarity calculation, thus improving the accuracy of retrieval. To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. However,
another challenge lies in the similarity aggregation problem, which
aims to aggregate fine-grained and cross-grained similarity matrices into instance-level similarity. To address this challenge, we
propose the Attention Over Similarity Matrix (AOSM) module to
make the model focus on the contrast between essential frames
and words, thus lowering the impact of unnecessary frames and
words on retrieval results. With multi-grained contrast and the proposed AOSM module, X-CLIP achieves outstanding performance
on five widely-used video-text retrieval datasets, including MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1 R@1), DiDeMo (47.8 R@1), and ActivityNet (46.2 R@1). It outperforms the previous state-of-the-art with relative improvements of +6.3%, +6.6%, +11.1%, +6.7%, and +3.8% on these benchmarks, respectively, demonstrating the superiority of multi-grained contrast and the AOSM module.
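To make the multi-grained contrast and the aggregation idea concrete, the following is a minimal sketch in PyTorch, not the authors' released implementation: it computes a cross-grained (video-to-word) similarity and aggregates a fine-grained frame-word similarity matrix with softmax attention in the spirit of AOSM. The tensor shapes, the temperature value tau, and the function names cross_grained_similarity and fine_grained_similarity are illustrative assumptions.

import torch
import torch.nn.functional as F

def cross_grained_similarity(video_feat, word_feats, tau=0.01):
    # Contrast a coarse-grained video feature (d,) with fine-grained word
    # features (n_words, d); attention over the score vector down-weights
    # words that are irrelevant to the video before aggregation.
    video_feat = F.normalize(video_feat, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    scores = word_feats @ video_feat              # (n_words,) cosine similarities
    attn = torch.softmax(scores / tau, dim=-1)    # focus on essential words
    return (attn * scores).sum()                  # instance-level similarity

def fine_grained_similarity(frame_feats, word_feats, tau=0.01):
    # Contrast frame features (n_frames, d) with word features (n_words, d);
    # attention is applied over both axes of the similarity matrix and the
    # two aggregated scores are averaged.
    frame_feats = F.normalize(frame_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    sim = frame_feats @ word_feats.T                                  # (n_frames, n_words)
    per_frame = (torch.softmax(sim / tau, dim=1) * sim).sum(dim=1)    # aggregate over words
    per_word = (torch.softmax(sim / tau, dim=0) * sim).sum(dim=0)     # aggregate over frames
    return 0.5 * (per_frame.mean() + per_word.mean())

# Toy usage with random features (d = 512, as in CLIP's joint embedding space).
video = torch.randn(512)
frames = torch.randn(12, 512)
words = torch.randn(8, 512)
print(cross_grained_similarity(video, words).item())
print(fine_grained_similarity(frames, words).item())

In this sketch, the low softmax temperature sharpens the attention weights, so unnecessary frames and words contribute little to the aggregated instance-level score, which is the effect the AOSM module is designed to achieve.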
Acknowledgement
This work was supported by the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, and No. 62002305), the Guangdong Basic and Applied Basic Research Foundation (No. 2019B1515120049), and the Natural Science Foundation of Fujian Province of China (No. 2021J01002).
This work was also supported by Alibaba Group through the Alibaba Research Intern Program.