
Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition

🌈 Overview


  • We establish a CROSS-domain Open-Vocabulary Action recognition (XOV-Action) benchmark, and our evaluation reveals that existing CLIP-based video learners exhibit limited performance when recognizing actions in unseen test domains. (A generic sketch of the CLIP-style inference pipeline being evaluated follows this list.)
  • To address the cross-domain open-vocabulary action recognition task, our work focuses on a critical challenge, namely scene bias, and accordingly we contribute a novel scene-aware video-text alignment method.
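As a rough illustration (not the authors' proposed method, whose code is not yet released), the sketch below shows how a generic CLIP-style video learner scores an open-vocabulary label set at inference time. The encoders are random stand-ins for CLIP's pretrained towers, and all names here are hypothetical.

```python
# Illustrative sketch of generic CLIP-style open-vocabulary action recognition.
# The two encoders are random stand-ins for CLIP's image and text towers.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
EMBED_DIM = 512

def encode_frames(frames):            # frames: (T, 3, H, W) sampled from a video
    return torch.randn(frames.shape[0], EMBED_DIM)

def encode_texts(prompts):            # prompts: one string per candidate action
    return torch.randn(len(prompts), EMBED_DIM)

labels = ["archery", "push up", "playing guitar"]     # open-vocabulary label set
frames = torch.zeros(8, 3, 224, 224)                  # 8 frames from one video

video_emb = encode_frames(frames).mean(dim=0, keepdim=True)   # temporal pooling
text_emb = encode_texts([f"a video of a person doing {l}" for l in labels])

# Cosine similarity between the video embedding and each action prompt.
sims = F.normalize(video_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
print(labels[sims.argmax().item()])   # predicted action name
```

Because the label set is just a list of strings, unseen categories can be added at test time; the benchmark asks how well such similarity scores hold up when the test videos also come from an unseen domain.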

📚 XOV-Action Benchmark

New Features 🔥

  • XOV-Action is the first benchmark, composed of two training datasets and four test datasets, for evaluating models in the cross-domain open-vocabulary action recognition task.
  • We identify closed-set and open-set categories for each test domain, providing a comprehensive way to evaluate models across various situations. This differs from existing open-vocabulary action recognition works, which treat all the categories of another dataset as open-set. (A toy illustration of the split appears below.)
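As a toy illustration of this split (category names below are made up; the actual splits are defined in the benchmark release and the paper), closed-set categories are the test categories shared with the training label space, and open-set categories are the rest:

```python
# Hypothetical sketch of the closed-set / open-set category split per test domain.
train_categories = {"archery", "push up", "playing guitar"}   # e.g., Kinetics150
test_categories = {"push up", "playing guitar", "kick ball"}  # e.g., one test domain

closed_set = sorted(test_categories & train_categories)  # seen during training
open_set = sorted(test_categories - train_categories)    # unseen during training
print(closed_set)  # ['playing guitar', 'push up']
print(open_set)    # ['kick ball']
```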

Benchmark Components

Training Datasets

  • Kinetics400: One of the most widely-used action recognition datasets, consisting of 400 action categories.
  • Kinetics150: A subset of Kinetics400, composed of 150 action categories selected from the full set of 400.

Test Datasets

  • UCF101: One of the most widely-used action recognition datasets, consisting of 101 action categories.
  • HMDB51: A widely-used action recognition dataset consisting of 51 action categories.
  • ARID: A dataset consisting of 11 categories of action videos recorded in dark environments.
  • NEC-Drone: A dataset consisting of 16 categories of action videos recorded by drones, all in the same basketball court.

Evaluation Metrics

  • The closed-set accuracy measures recognition performance on closed-set categories, which primarily evaluates a model's ability to tackle domain gaps when fitting the training videos.
  • The open-set accuracy measures recognition performance on open-set categories, which evaluates generalization across both video domains and action categories.
  • The overall accuracy measures recognition performance over all categories, which provides a holistic view of model effectiveness across various situations. (A minimal sketch of the three metrics follows this list.)
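The snippet below is a minimal sketch of the three accuracies, assuming integer predictions and labels plus a set of closed-set category indices (all names are hypothetical). It averages per sample for simplicity; refer to the paper for the exact averaging protocol.

```python
# Minimal sketch of the three evaluation metrics described above.
import numpy as np

def xov_accuracies(preds, labels, closed_ids):
    preds, labels = np.asarray(preds), np.asarray(labels)
    correct = preds == labels
    is_closed = np.isin(labels, list(closed_ids))      # ground truth in closed set?
    return {
        "closed_set_acc": correct[is_closed].mean(),   # closed-set categories only
        "open_set_acc":   correct[~is_closed].mean(),  # open-set categories only
        "overall_acc":    correct.mean(),              # all categories
    }

print(xov_accuracies(preds=[0, 2, 1, 3], labels=[0, 2, 2, 3], closed_ids={0, 1}))
# {'closed_set_acc': 1.0, 'open_set_acc': 0.666..., 'overall_acc': 0.75}
```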

Please refer to our paper for more details.

🚀 Methodology

  • The code for our proposed method is coming soon.

📌 Acknowledgement

  • Our benchmark is established based on Kinetics400, UCF101, HMDB51, ARID and NEC-Drone. We thank the authors for their diligent efforts and significant contributions.

  • Our method is implemented based on the codebases ViFi-CLIP and Open-VCLIP. We thank the authors for their high-quality codebases.

  • If you find our paper/code/benchmark useful, please consider citing our paper:

@misc{lin2024xovaction,
  author       = {Kun-Yu Lin and Henghui Ding and Jiaming Zhou and Yi-Xing Peng and Zhilin Zhao and Chen Change Loy and Wei-Shi Zheng},
  title        = {Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition},
  year         = {2024},
  eprint       = {2403.01560},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV}
}
