Add MetaShift dataset #3900

Merged (57 commits) on Apr 1, 2022

@dnaveenr (Contributor) commented Mar 12, 2022

This PR adds the MetaShift dataset.

Dataset request: Add MetaShift dataset #3813

@lhoestq As discussed,

  • I copied the preprocessing script and modified it so that it yields the images instead of creating new directories and folders.
  • The preprocessing runs in _split_generators, and the resulting data is then passed to _generate_examples.
  • Beyond the generated MetaShift dataset, the original preprocessing script also generates the meta-graphs for each class. I have not included this part yet. [ Ref : Link ]
  • The authors also share a Bonus section. I have not included this part yet. [ Ref : Link ]
  • I wrote a basic test script that downloads the dataset and checks the basic functionality. Things seem fine.
    For real data, I ran the following test:
RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_real_dataset_metashift
============================================== test session starts ===============================================
platform linux -- Python 3.7.11, pytest-7.0.1, pluggy-1.0.0
rootdir: ./datasets
plugins: hydra-core-1.1.1, datadir-1.3.1, forked-1.4.0, xdist-2.5.0
collected 1 item                                                                                                 

tests/test_dataset_common.py .                                                                             [100%]

========================================= 1 passed in 4821.25s (1:20:21) =========================================
  • I couldn't generate the dummy dataset and need some input here.
    The error is as follows:
Using custom data configuration default
Dataset metashift with config None seems to already open files in the method `_split_generators(...)`. You might consider to instead only open files in the method `_generate_examples(...)` instead. If this is not possible the dummy data has to be created with less guidance. Make sure you create the file dummy_data/full-candidate-subsets.pkl.
    for split in generator_splits:
UnboundLocalError: local variable 'generator_splits' referenced before assignment
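
The warning above suggests opening files only in `_generate_examples(...)`. A minimal sketch of that pattern follows, using only the standard library. The class name `LazyLoaderSketch`, the `demo` helper, and the assumed pickle layout (a dict mapping subset names to sets of image IDs) are hypothetical and are not taken from the MetaShift script.

```python
import os
import pickle
import tempfile


class LazyLoaderSketch:
    """Hypothetical stand-in for a dataset builder; not the real MetaShift script."""

    def _split_generators(self, data_dir):
        # Only build paths here; no file is opened at this stage.
        pkl_path = os.path.join(data_dir, "full-candidate-subsets.pkl")
        return [{"name": "train", "gen_kwargs": {"pkl_path": pkl_path}}]

    def _generate_examples(self, pkl_path):
        # The pickle is opened only once examples are actually requested.
        with open(pkl_path, "rb") as f:
            subsets = pickle.load(f)  # assumed layout: {subset_name: set(image_ids)}
        idx = 0
        for subset_name in sorted(subsets):
            for image_id in sorted(subsets[subset_name]):
                # Yield (key, example) pairs instead of writing files to disk.
                yield idx, {"image_id": image_id, "subset": subset_name}
                idx += 1


def demo():
    """Write a tiny pickle to a temporary directory and read it back lazily."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "full-candidate-subsets.pkl")
        with open(path, "wb") as f:
            pickle.dump({"cat(sofa)": {"2", "1"}, "dog(grass)": {"3"}}, f)
        builder = LazyLoaderSketch()
        (split,) = builder._split_generators(tmp)
        return list(builder._generate_examples(**split["gen_kwargs"]))
```

Whether this split is feasible depends on how much of the preprocessing needs the file contents up front; the warning itself notes that the dummy data may otherwise have to be created with less guidance.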

To-Do :

  • Currently I am using the default _SELECTED_CLASSES. I need to expose this as a config option, as suggested.
  • Complete the fields in the Dataset Card.
  • Tag the dataset using the Datasets Tagging app.

I would appreciate your help and suggestions for improvement. Thank you.

@dnaveenr (Contributor Author) commented

@lhoestq Could you please review this when you get time? Thank you.

@mariosasko (Collaborator) left a comment

Hi! Thanks for working on this!

> Beyond the generated MetaShift dataset, the original preprocess script also generates the meta-graphs for each class, I have currently not included this part. [ Ref : Link ]

Maybe we can add the generated meta-graphs to the card as images (with attributions)?

> There is a Bonus section, the authors share. I have currently not included this part. [ Ref : Link ]

Would be cool if we could have them as additional configs. Also, maybe we could have configs that expose image metadata from the https://nlp.stanford.edu/data/gqa/sceneGraphs.zip file (this file is downloaded in the script but not used).

> I couldn't get the dummy dataset. Need some inputs here.

I suggest you try to generate the dataset_infos.json file first, and then I can help with the dummy data.

- **Leaderboard:** [More Information Needed]
- **Point of Contact:** [More Information Needed]

### Dataset Summary

This dataset was used to investigate the modality gap phenomenon, so maybe we can mention/explain that here?

dnaveenr and others added 12 commits March 15, 2022 22:27 (the commits listed below are co-authored by Mario Šaško <mario@huggingface.co>):
  • Rename card name.
  • Naming for links and add point of contact info.
  • Fix extra whitespace.
  • Extra full stop removed.
  • Add bibtex tag.
  • Cleaner code changes.
  • Use os.path.join instead.
  • Use staticmethod, remove print statements.
  • Add task template.
  • add static method.

@dnaveenr (Contributor Author) commented

Thanks a lot for your input, @mariosasko.

> Maybe we can add the generated meta-graphs to the card as images (with attributions)?

Yes, we can do this for the default set of classes. I will add this.

> Would be cool if we could have them as additional configs. Also, maybe we could have configs that expose image metadata from the https://nlp.stanford.edu/data/gqa/sceneGraphs.zip file (this file is downloaded in the script but not used).

I'll try adding the Bonus section as an additional config.
Regarding exposing the image metadata with a config parameter, how will we showcase/display this information?

@mariosasko (Collaborator) commented

> Regarding exposing the image metadata with a config parameter, how will we showcase/display this information?

Oh, I forgot to mention that. Let's add a Dataset Usage section to the card to document the params (similar to this: https://huggingface.co/datasets/electricity_load_diagrams#dataset-usage). Also, feel free to add the constants that can be tuned as config params (e.g. IMAGE_SUBSET_SIZE_THRESHOLD or the 5 in len(subject_data) <= 5).
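
As a rough illustration of turning such constants into config parameters, here is a minimal sketch using a plain dataclass. In an actual loading script the config would typically subclass `datasets.BuilderConfig`; the class name `MetaShiftConfigSketch`, the field names, the helper `select_subsets`, and the filtering semantics are all assumptions made for illustration, not the merged implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class MetaShiftConfigSketch:
    """Hypothetical config object; names are illustrative, not the merged API."""

    selected_classes: List[str]
    image_subset_size_threshold: int  # stands in for IMAGE_SUBSET_SIZE_THRESHOLD
    min_subject_entries: int = 5  # stands in for the 5 in `len(subject_data) <= 5`


def select_subsets(
    subsets: Dict[str, Set[str]], config: MetaShiftConfigSketch
) -> Dict[str, Set[str]]:
    """Keep subsets of the selected classes that pass both thresholds.

    The semantics are assumed: a subset needs at least
    `image_subset_size_threshold` images, and a subject is skipped when it has
    `min_subject_entries` or fewer remaining subsets.
    """
    selected: Dict[str, Set[str]] = {}
    for subject in config.selected_classes:
        # Subset names are assumed to look like "cat(sofa)".
        subject_data = {
            name: ids
            for name, ids in subsets.items()
            if name.startswith(subject + "(")
            and len(ids) >= config.image_subset_size_threshold
        }
        if len(subject_data) <= config.min_subject_entries:
            continue  # too few subsets for this subject
        selected.update(subject_data)
    return selected
```

Keeping the tunable values on one config object means each of them can be documented in a single place in the card's Dataset Usage section.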

@dnaveenr (Contributor Author) commented Mar 16, 2022

Okay, got it. I will add these, along with the constants, as config parameters.

The image metadata from scene graphs looks like this :

{
    "2407890": {
        "width": 640,
        "height": 480,
        "location": "living room",
        "weather": null,
        "objects": {
            "271881": {
                "name": "chair",
                "x": 220,
                "y": 310,
                "w": 50,
                "h": 80,
                "attributes": ["brown", "wooden", "small"],
                "relations": {
                    "32452": {
                        "name": "on",
                        "object": "275312"
                    },
                    "32452": {
                        "name": "near",
                        "object": "279472"
                    }                    
                }
            }
        }
    }
}

load_dataset("metashift", selected_classes=["cat", "dog", ...], image_metadata=True)
How should we showcase/display the image metadata (JSON) information?
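
One possible answer, sketched under assumptions: flatten each scene-graph entry into a record with a fixed set of fields, which could then be declared as dataset features. The function name `flatten_scene_graph` and the output layout are hypothetical, and the sketch assumes `relations` is a mapping, as in the example above.

```python
from typing import Any, Dict, List


def flatten_scene_graph(image_id: str, graph: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one scene-graph entry into a flat record (hypothetical layout)."""
    objects: List[Dict[str, Any]] = []
    for object_id, obj in graph.get("objects", {}).items():
        # Assumes `relations` maps relation IDs to {"name", "object"} dicts.
        relations = [
            {"name": rel["name"], "object": rel["object"]}
            for rel in obj.get("relations", {}).values()
        ]
        objects.append(
            {
                "object_id": object_id,
                "name": obj["name"],
                "bbox": [obj["x"], obj["y"], obj["w"], obj["h"]],
                "attributes": list(obj.get("attributes", [])),
                "relations": relations,
            }
        )
    return {
        "image_id": image_id,
        "width": graph["width"],
        "height": graph["height"],
        "location": graph.get("location"),
        "weather": graph.get("weather"),
        "objects": objects,
    }


# Abbreviated version of the entry shown above, with a single relation.
example = {
    "2407890": {
        "width": 640,
        "height": 480,
        "location": "living room",
        "weather": None,
        "objects": {
            "271881": {
                "name": "chair",
                "x": 220,
                "y": 310,
                "w": 50,
                "h": 80,
                "attributes": ["brown", "wooden", "small"],
                "relations": {"32452": {"name": "on", "object": "275312"}},
            }
        },
    }
}

records = [flatten_scene_graph(image_id, graph) for image_id, graph in example.items()]
```

Turning the ID-keyed objects into a list gives every example the same nested shape, which is easier to describe as a schema than a dict whose keys change per image.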

@lhoestq (Member) left a comment

Thanks, here are a few suggestions to fix the CI :)

dnaveenr and others added 4 commits March 28, 2022 22:40 (the commits listed below are co-authored by Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>):
  • CI fixes.
  • Correct task categories.
  • Add encoding.

@mariosasko (Collaborator) left a comment

I've left a few comments, but other than that looks great.

dnaveenr and others added 13 commits March 31, 2022 23:04 (co-authored by Mario Šaško <mario@huggingface.co>), including:
  • Add paperswithcode id.
  • Correct sentence.
  • add default classes info.

@dnaveenr (Contributor Author) commented

Thanks a lot for your suggestions, Mario. One thing I learnt from the review is that I need to construct my sentences better. I will keep this in mind. :)

@mariosasko (Collaborator) left a comment

Not an easy dataset to add, but you did a great job! And it can even be streamed!

Pinging @lhoestq for the final review

@lhoestq (Member) left a comment

It all looks good, thank you! I fixed minor issues with the tags and the license.

Super impressed by your work on this, congrats :)

@lhoestq lhoestq merged commit 92da6d5 into huggingface:master Apr 1, 2022
@dnaveenr (Contributor Author) commented Apr 1, 2022

Thanks a lot for your support, @mariosasko and @lhoestq.

> Super impressed by your work on this, congrats :)

It's my first dataset contribution to the 🤗 Datasets library, and I'm super excited. Thank you. :)

Also, I think we can now close the request issue, #3813.

@dnaveenr dnaveenr deleted the add_metashift_dataset branch April 1, 2022 16:59