
MUST: An Effective and Scalable Framework for Multimodal Search with Target Modality

1. Introduction

We introduce a new research problem: multimodal search with target modality (MSTM). This problem involves searching for objects in one modality (the target) using inputs from multiple modalities. One of the input modalities is the target modality, and the others are auxiliary modalities that modify or refine some aspects of the target-modality input. For example, we can search for videos using a reference video together with an auxiliary image and text. Our paper, “MUST: An Effective and Scalable Framework for Multimodal Search with Target Modality”, presents an efficient and scalable framework for this problem, called MUST. The evaluation results show that MUST improves search accuracy by about 50%, is more than 10x faster than the baseline methods, and scales to datasets with more than 10 million objects.

(Figure: video search example)

This repo contains the code, datasets, optimal parameters, and other detailed information used for the experiments of our paper.

2. Baselines

  • Multi-streamed retrieval (MR). MR is a traditional strategy for answering hybrid queries in the IR and DB communities [VLDB'20, SIGMOD'21]. We adapt this framework to handle the MSTM problem and enhance it with advanced unimodal and multimodal encoders such as CLIP [CVPR'22].

  • Joint embedding (JE). JE is a mainstream method for addressing multimodal search in the CV community. We use two representative multimodal encoders: TIRG (a pioneering method) [CVPR'19] and CLIP (the current SOTA) [CVPR'22].

3. MUST Overview

In MUST, we use three pluggable components: (1) Embedding; (2) Vector weight learning; (3) Indexing and search.

(Figure: the MUST framework)
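For intuition only, the sketch below shows one plausible way the three pluggable components could compose at query time. This is not MUST's actual API: the encoders, the learned per-modality weights, and the vector index with a `search` method are all hypothetical stand-ins.

```python
import numpy as np

def answer_query(query_inputs, encoders, weights, index, k=10):
    """Illustrative sketch of the three-component pipeline (not MUST's real API)."""
    # (1) Embedding: encode each query modality with its pluggable encoder.
    vectors = {m: encoders[m](x) for m, x in query_inputs.items()}
    # (2) Vector weight learning: scale each modality's vector by its learned weight.
    weighted = [weights[m] * np.asarray(v, dtype=np.float32) for m, v in vectors.items()]
    # (3) Indexing and search: nearest-neighbor search over the fused representation.
    fused = np.concatenate(weighted)
    return index.search(fused, k)
```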

4. Datasets

| Dataset | # Modality | # Object | # Query | Type | Source |
| --- | --- | --- | --- | --- | --- |
| CelebA (link) | 2 | 191,549 | 34,326 | Image; Text | real-world |
| MIT-States (link) | 2 | 53,743 | 72,732 | Image; Text | real-world |
| Shopping* | 2 | 96,009 | 47,658 | Image; Text | real-world |
| CelebA+ (link) | 4 | 191,549 | 34,326 | Image×3; Text | real-world |
| ImageText1M (link) | 2 | 1,000,000 | 1,000 | Image; Text | semi-synthetic |
| AudioText1M (link) | 2 | 992,272 | 200 | Audio; Text | semi-synthetic |
| VideoText1M (link) | 2 | 1,000,000 | 10,000 | Video; Text | semi-synthetic |
| ImageText16M (link) | 2 | 16,000,000 | 10,000 | Image; Text | semi-synthetic |

*Please contact the author of the dataset to get access to the images.

5. Parameters

To obtain embedding vectors, we use the same training hyper-parameters as in the original papers of the encoders. The encoder configuration is the same for all three frameworks. For the vector weight learning module, we set the learning rate to 0.2 and train for 20 iterations by default. The appendix contains the analysis of other parameters and the output weights of the module on different datasets.
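For illustration only, the snippet below sketches what a gradient-based weight learner with these defaults (learning rate 0.2, 20 iterations) could look like. The actual objective is defined in ./vector_weight_learning/main.py; `loss_fn` here is just a placeholder.

```python
import torch

def learn_weights(loss_fn, num_modalities, lr=0.2, iterations=20):
    """Hypothetical sketch: optimize one weight per modality with the default
    hyper-parameters above. `loss_fn` stands in for the real training objective."""
    weights = torch.ones(num_modalities, requires_grad=True)
    optimizer = torch.optim.SGD([weights], lr=lr)
    for _ in range(iterations):
        optimizer.zero_grad()
        loss = loss_fn(weights)  # placeholder objective
        loss.backward()
        optimizer.step()
    return weights.detach()
```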

6. Usage

(1) Prerequisites

PyTorch
Pybind
GCC 4.9+ with OpenMP
CMake 2.8+

(2) Run

(i) Embedding

Refer to TIRG and CLIP.
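As a minimal example of the encoding step, the snippet below embeds one image and one text with OpenAI's CLIP; the file name and the prompt are placeholders, and TIRG follows its own training pipeline described in its repository.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: a reference image and an auxiliary text.
image = preprocess(Image.open("reference.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["is smiling"]).to(device)

with torch.no_grad():
    image_vec = model.encode_image(image)  # shape (1, 512) for ViT-B/32
    text_vec = model.encode_text(text)     # shape (1, 512)
```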

We convert the vectors of all objects and query inputs to fvecs or ivecs format, and the ground-truth data to ivecs format. For a description of the fvecs and ivecs formats, see here.
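For reference, a minimal NumPy sketch for writing vectors in these formats is shown below (each record is a 4-byte little-endian integer dimension followed by the vector payload); the function names are ours for illustration and are not part of this repo.

```python
import numpy as np

def write_fvecs(path, vectors):
    # fvecs: for each vector, an int32 dimension d, then d float32 values.
    vectors = np.asarray(vectors, dtype=np.float32)
    dim = np.int32(vectors.shape[1])
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(dim.tobytes())
            f.write(vec.tobytes())

def write_ivecs(path, vectors):
    # ivecs: same layout, but the payload is int32 (e.g., ground-truth neighbor ids).
    vectors = np.asarray(vectors, dtype=np.int32)
    dim = np.int32(vectors.shape[1])
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(dim.tobytes())
            f.write(vec.tobytes())
```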

(ii) Vector weight learning

cd ./vector_weight_learning
python setup.py install
python main.py

(iii) Indexing and search

cd ./scripts
./run release build_<framework> # index build
./run release search_<framework> # search

7. Demo

(Figure: search demo on the MIT-States dataset)

8. Acknowledgements

Our embedding implementations are based on TIRG and CLIP, and our indexing and search components are built on CGraph. We appreciate the inspiration and the reference implementations these projects provide.