add multi30k for eval #46

Open
elliottd opened this issue Nov 28, 2022 · 1 comment

Comments

@elliottd

Add the Multi30K datasets for multilingual image-sentence retrieval evaluation. The evaluation data is available in English, French, Czech, and German. The sentence data can be found on GitHub at https://github.com/multi30k/dataset/tree/master/data/task1/raw.

The raw untokenized sentence data can be found in the following files, where LANG = (en, cs, de, fr):

test_2016_flickr.LANG.gz
test_2017_flickr.LANG.gz
test_2018_flickr.LANG.gz
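
A minimal sketch of how these files could be fetched and decoded, using only the Python standard library. The file names and language codes are the ones listed above; `RAW_BASE` assumes the usual `raw.githubusercontent.com` URL form for the `master` branch, and the helper names (`sentence_url`, `parse_sentences`, `fetch_sentences`) are hypothetical. It also assumes the files are UTF-8 text with one sentence per line.

```python
import gzip
import io
import urllib.request

# Assumption: standard raw-content URL form for the repository's master branch.
RAW_BASE = "https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw"
SPLITS = ("test_2016_flickr", "test_2017_flickr", "test_2018_flickr")
LANGS = ("en", "cs", "de", "fr")


def sentence_url(split: str, lang: str) -> str:
    """Build the URL of one gzipped sentence file, e.g. test_2016_flickr.de.gz."""
    return f"{RAW_BASE}/{split}.{lang}.gz"


def parse_sentences(gz_bytes: bytes) -> list:
    """Decompress a gzipped text file and return one sentence per line."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def fetch_sentences(split: str, lang: str) -> list:
    """Download and decode one sentence file (requires network access)."""
    with urllib.request.urlopen(sentence_url(split, lang)) as resp:
        return parse_sentences(resp.read())
```

For example, `fetch_sentences("test_2016_flickr", "de")` would return the German test sentences as a list of strings, provided that file exists at the assumed URL.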

The corresponding image information can be found in https://github.com/multi30k/dataset/tree/master/data/task1/image_splits:

test_2016_flickr.txt uses the test set images from the original Flickr30K dataset.
test_2017_flickr.txt uses newly collected images.
test_2018_flickr.txt uses newly collected images.
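
A rough sketch of how a split file could be paired with a sentence file for retrieval evaluation. It assumes each split file lists one image file name per line and that line i of the split file corresponds to line i of the matching sentence file; both are assumptions to verify against the repository. The function names are hypothetical.

```python
def parse_image_split(text: str) -> list:
    """Return the image file names listed in a split file, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def pair_images_with_captions(image_names: list, sentences: list) -> list:
    """Zip image names with captions, assuming the two files are line-aligned."""
    if len(image_names) != len(sentences):
        raise ValueError(
            f"{len(image_names)} images but {len(sentences)} sentences; "
            "the line-alignment assumption does not hold"
        )
    return list(zip(image_names, sentences))
```

Failing loudly on a length mismatch is deliberate: a silent misalignment would pair images with the wrong captions and distort the retrieval scores.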

The newly collected images are available to download via Google Drive. I'm not sure whether they are easy to download automatically, so re-hosting them elsewhere might be an option.

@rom1504
Contributor

rom1504 commented Nov 29, 2022

seems good

we would need something similar to https://github.com/LAION-AI/CLIP_benchmark/blob/main/clip_benchmark/datasets/multilingual_mscoco.py

auto-downloading from gdrive is possible, it's already done for a few datasets here
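
A minimal sketch of what such a dataset wrapper might look like. This is not the interface of `multilingual_mscoco.py` or of CLIP_benchmark; the class name `Multi30kRetrieval`, its arguments, and the `(image, [caption])` return shape are all assumptions made for illustration. Image loading is injected as a callable so the sketch stays free of any imaging dependency.

```python
from typing import Any, Callable, Optional, Sequence


class Multi30kRetrieval:
    """Hypothetical image/caption dataset for retrieval evaluation."""

    def __init__(
        self,
        image_names: Sequence[str],
        captions: Sequence[str],
        load_image: Callable[[str], Any],
        transform: Optional[Callable[[Any], Any]] = None,
    ) -> None:
        if len(image_names) != len(captions):
            raise ValueError("image_names and captions must have the same length")
        self.image_names = list(image_names)
        self.captions = list(captions)
        self.load_image = load_image  # e.g. a function that opens the file from disk
        self.transform = transform    # e.g. a model-specific preprocessing function

    def __len__(self) -> int:
        return len(self.image_names)

    def __getitem__(self, index: int):
        image = self.load_image(self.image_names[index])
        if self.transform is not None:
            image = self.transform(image)
        # One caption per image and language; wrapped in a list so that
        # datasets with several captions per image could share the shape.
        return image, [self.captions[index]]
```

One instance per split and language (for example `test_2016_flickr` with `de`) would keep the evaluation loop unaware of which language it is scoring.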
