OpenDataLab

OpenDataLab provides access to numerous significant open-source datasets.

🔥🔥🔥OpenDataLab builds an ecosystem of high-quality datasets for the community. It provides:

🌟Extensive open data resources for AI models

● A fast, simple way to access open datasets
● 7700+ large-scale, high-quality open datasets for large models
● 1200+ open datasets for computer vision
● 200+ open datasets from CVPR
● Categorized datasets for hot topics

✨Open-source data processing toolkits

● Data acquisition toolkits supporting large datasets
● Data acquisition toolkits supporting various kinds of tasks
● Open-source intelligent toolbox for labeling

💫Dataset description language

● Format standardization
● DSDL: Dataset Description Language
● Define a CV dataset by DSDL
● 100+ CV datasets standardized by OpenDataLab

Check out our tutorial videos (in Chinese) to get started.


📣 We have launched an upgraded feature that lets authors upload datasets themselves. We invite you to use it to promote your open-source datasets, AI research results, and more, so that more people can find, obtain, and use your datasets.

For an introduction to the self-service dataset upload feature, see the help doc. You can create and share your dataset by following our guidelines. 💪

If you have any questions or run into problems, feel free to contact us at OpenDataLab@pjlab.org.cn.

Popular repositories

  1. MinerU Public

    A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.

    Python · 11.9k stars · 900 forks

  2. PDF-Extract-Kit Public

    A Comprehensive Toolkit for High-Quality PDF Content Extraction

    Python · 5k stars · 334 forks

  3. labelU Public

    A data annotation toolbox that supports image, audio, and video data.

    Python · 800 stars · 72 forks

  4. WanJuan1.0 Public

    WanJuan 1.0 multimodal corpus

    536 stars · 26 forks

  5. LabelLLM Public

    The Open-Source Data Annotation Platform

    TypeScript · 514 stars · 42 forks

  6. magic-doc Public

    Python · 318 stars · 24 forks

Repositories

Showing 10 of 32 repositories
  • PDF-Extract-Kit Public

    A Comprehensive Toolkit for High-Quality PDF Content Extraction

    Python · 4,961 stars · AGPL-3.0 · 334 forks · 43 issues · 4 pull requests · Updated Oct 4, 2024
  • MinerU Public

    A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.

    Python · 11,927 stars · AGPL-3.0 · 900 forks · 143 issues · 5 pull requests · Updated Sep 30, 2024
  • UniMERNet Public

    UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition

    Python · 175 stars · Apache-2.0 · 16 forks · 5 issues · 0 pull requests · Updated Sep 30, 2024
  • .github Public
    0 stars · 2 forks · 0 issues · 0 pull requests · Updated Sep 12, 2024
  • CLIP-Parrot-Bias Public

    ECCV2024_Parrot Captions Teach CLIP to Spot Text

    Python · 58 stars · Apache-2.0 · 2 forks · 3 issues · 0 pull requests · Updated Sep 6, 2024
  • CrossViewDiff
    JavaScript · 4 stars · 1 fork · 0 issues · 0 pull requests · Updated Sep 2, 2024
  • MLS-BRN Public

    [CVPR 2024] 3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions

    Python · 31 stars · 2 forks · 1 issue · 0 pull requests · Updated Aug 30, 2024
  • UrBench Public
    JavaScript · 0 stars · Apache-2.0 · 0 forks · 0 issues · 0 pull requests · Updated Aug 30, 2024
  • labelU Public

    A data annotation toolbox that supports image, audio, and video data.

    Python · 800 stars · 72 forks · 10 issues · 0 pull requests · Updated Aug 29, 2024
  • skydiffusion
    JavaScript · 8 stars · Apache-2.0 · 1 fork · 0 issues · 0 pull requests · Updated Aug 18, 2024