Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval

1Xiamen University, 2Contemporary Amperex Technology
Accepted by ACM MM 2023 (Main Track)

Abstract

Text-based person retrieval (TPR) is a challenging task that involves retrieving a specific individual based on a textual description. Despite considerable efforts to bridge the gap between vision and language, the significant differences between these modalities continue to pose a challenge. Previous methods attempt to align text and image samples in a modal-shared space, but they face uncertain optimization directions because the features of both modalities are movable, and they fail to account for the one-to-many relationships of image-text pairs in TPR datasets. To address these issues, we propose an effective bi-directional one-to-many embedding paradigm that offers a clear optimization direction for each sample, thus mitigating the optimization problem. Additionally, this embedding scheme generates multiple features for each sample without introducing trainable parameters, making it easier to align with several positive samples. Based on this paradigm, we propose a novel Bi-directional one-to-many Embedding Alignment (Beat) model to address the TPR task. Our experimental results demonstrate that the proposed Beat model achieves state-of-the-art performance on three popular TPR datasets: CUHK-PEDES (65.61 R@1), ICFG-PEDES (58.25 R@1), and RSTPReID (48.10 R@1). Furthermore, additional experiments on the MS-COCO, CUB, and Flowers datasets demonstrate the potential of Beat to be applied to other image-text retrieval tasks.
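As a rough illustration of the one-to-many idea (a sketch under our own assumptions, not the authors' exact formulation), an image-text pair can be scored by letting each of the K embeddings of one modality pick its best match among the K embeddings of the other modality, and averaging the two directions:

```python
import torch
import torch.nn.functional as F

def one_to_many_score(img_feats, txt_feats):
    """Hypothetical bi-directional one-to-many matching score.

    img_feats, txt_feats: (K, D) sets of K embeddings per sample.
    Each embedding of one modality is matched to its closest embedding
    of the other modality; the two directions are averaged.
    """
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    sim = img @ txt.t()                     # (K, K) cosine similarities
    i2t = sim.max(dim=1).values.mean()      # image -> text direction
    t2i = sim.max(dim=0).values.mean()      # text -> image direction
    return 0.5 * (i2t + t2i)

# Example with random features (K = 6 embeddings of dimension 512):
score = one_to_many_score(torch.randn(6, 512), torch.randn(6, 512))
```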

Model Architecture

Illustration of the proposed Beat model. Images and texts are first processed by a ConvNet and a SeqModel to obtain visual features 𝑽 and textual features 𝑻, respectively. Global features 𝒗𝑔 and 𝒕𝑔 are extracted through global max pooling (GMP) and a fully connected (FC) layer applied to 𝑽 and 𝑻. Similarly, we adopt GMP and an FC layer to obtain 𝐾 local visual features. Then, we employ a word attention module (WAM), GMP, and an FC layer to obtain 𝐾 local textual features. We then adopt a non-local module (NLM) to get 𝐾 non-local visual and textual features. Afterward, the REM-G is introduced to perform bi-directional one-to-many embedding. Finally, textual and visual samples are aligned in two modal-specific spaces under the guidance of the ID loss and the CR loss.
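For concreteness, the following is a minimal, runnable sketch of the feature-extraction pipeline described in the caption. The tiny CNN, GRU text encoder, stripe-based local features, the value of 𝐾, and the feature dimension are all placeholder assumptions; the WAM, NLM, and REM-G stages are only indicated by comments and are not the authors' implementation.

```python
import torch
import torch.nn as nn

class BeatPipelineSketch(nn.Module):
    """Placeholder sketch of the Beat feature-extraction pipeline."""

    def __init__(self, dim=256, k=6, vocab=1000):
        super().__init__()
        self.conv_net = nn.Sequential(nn.Conv2d(3, dim, 7, stride=4), nn.ReLU())
        self.embed = nn.Embedding(vocab, dim)
        self.seq_model = nn.GRU(dim, dim, batch_first=True)
        self.fc_v = nn.Linear(dim, dim)   # FC after GMP (visual)
        self.fc_t = nn.Linear(dim, dim)   # FC after GMP (textual)
        self.k = k

    def forward(self, images, tokens):
        V = self.conv_net(images)                    # (B, C, H, W)
        T, _ = self.seq_model(self.embed(tokens))    # (B, L, C)
        T = T.transpose(1, 2)                        # (B, C, L)
        v_g = self.fc_v(V.amax(dim=(2, 3)))          # global visual feature
        t_g = self.fc_t(T.amax(dim=2))               # global textual feature
        # K local visual features: horizontal stripes -> GMP -> FC.
        v_loc = torch.stack(
            [self.fc_v(s.amax(dim=(2, 3))) for s in V.chunk(self.k, dim=2)], dim=1)
        # K local textual features: a plain split of the word sequence stands
        # in for the word attention module (WAM) -> GMP -> FC.
        t_loc = torch.stack(
            [self.fc_t(s.amax(dim=2)) for s in T.chunk(self.k, dim=2)], dim=1)
        # The NLM and REM-G (bi-directional one-to-many embedding) stages,
        # plus the ID and CR losses, would follow here.
        return v_g, t_g, v_loc, t_loc

feats = BeatPipelineSketch()(torch.randn(2, 3, 384, 128),
                             torch.randint(0, 1000, (2, 40)))
```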

Visualization results on CUHK-PEDES

Comparison of heatmaps between the base model and Beat. The base model often attends to a large area, while Beat can accurately focus on pedestrian areas.

t-SNE Visualization results on CUHK-PEDES

Feature visualization of the base model and Beat via t-SNE on CUHK-PEDES. We show how the cross-modal feature distributions change over the course of training. The features of each image and text are marked as a circle and a triangle, respectively. Each identity is indicated by a specific color. For clarity, we only visualize the global visual and textual features.