Adding hiera (#30356)

* initialized Structure
* Updated variable names
* Added Config class, basic HF setup, convert_to_hf
* Fixed Convert function, added hiera to HF files, Initilized test files
* better naming for x in forward pass
* Moved utils to hiera
* Change hiera -> hiera_model
* Fixed integration into tranformers
* Fix: Convert Checkpoint
* added documentation for hiera
* added documentation for hiera
* added Docstings to models, Transformers based changes
* make style and quality
* make style and quality
* Integration & Block tests running
* Fixed bugs
* initialized Structure
* Updated variable names
* Added Config class, basic HF setup, convert_to_hf
* Fixed Convert function, added hiera to HF files, Initilized test files
* better naming for x in forward pass
* Moved utils to hiera
* Change hiera -> hiera_model
* Fixed integration into tranformers
* Fix: Convert Checkpoint
* added documentation for hiera
* added documentation for hiera
* added Docstings to models, Transformers based changes
* make style and quality
* make style and quality
* Integration & Block tests running
* Fixed bugs
* Removed tim dependency
* added HieraBlock
* fixed: Model name
* added tests for HieraModel, HieraBlock
* fixed imports
* fixed quality & copies
* Fixes
* Update docs/source/en/model_doc/hiera.md Fix name Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/hiera.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/hiera.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update src/transformers/models/hiera/configuration_hiera.py Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update src/transformers/models/hiera/configuration_hiera.py Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update src/transformers/models/hiera/modeling_hiera.py Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update src/transformers/models/hiera/modeling_hiera.py Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Fixed formatting
* Code quality & Import differences
* quality and repo-consistency fix
* fixed no torch error
* Docstring fix
* Docstring fix
* doc string fix
* fixed example usage
* Resolved issues in modeling_hiera
* Removed Hiera MAE
* Added test and resolved bug
* fixed doc string
* First commit
* Finished conversion script and model forward working
* Resolved all issues
* nits
* Improving tests
* Nits
* More nits
* Improving HieraForMaskedImageModeling
* More improvements and nits
* Fixed docstrings of outputs
* More fixes
* More imrpovments
* Updated conversion script
* Fixed docstrings
* Improved tests
* Fixed attentou outputs test
* All tests green
* Removed unnecessary file
* contribution attribution
* Resolved a few issues
* Resolved Comments
* Updated model repo id and fixed bugs
* Removed loss print
* Make tests green
* Updated docstrings
* Fix style
* Fixed num_heads in config
* Removed unnecessary video checkpoint related code in the conversion script
* Fix style
* Changed atol in conversion script
* HieraConfig
* Fix copies
* Fixed typo
* Resolved few issues
* make
* converted conv_nd -> nn.Module
* Removed video complexities
* Removed video complexities
* fix style
* Addressing comments
* Update src/transformers/models/hiera/modeling_hiera.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/hiera/modeling_hiera.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/hiera/modeling_hiera.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Fix style
* Fixed tests
* Fixed typo
* Fixed interpolate test
* Made torch fx compatible
* Made sure imageprocesor is correct
* Addressed comments
* Noise directly as torch
* Remove unnecesary attr
* Added return_dit
* Update src/transformers/models/hiera/__init__.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Updated checkpoints
* [run_slow] hiera
* Fixed device mismatch
* [run_slow] hiera
* Fixed GPU tests
* [run_slow] hiera

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-29-50.us-east-2.compute.internal>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Eduardo Pacheco <eduardo.pach@hotmail.com>
Co-authored-by: Eduardo Pacheco <69953243+EduardoPach@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
huggingface · Jul 11, 2024 · c1e139c
1 parent 574e68d
commit c1e139c
Showing 18 changed files with 2,945 additions and 0 deletions.
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
@@ -603,6 +603,8 @@
         title: FocalNet
       - local: model_doc/glpn
         title: GLPN
+      - local: model_doc/hiera
+        title: Hiera
       - local: model_doc/imagegpt
         title: ImageGPT
       - local: model_doc/levit
@@ -680,6 +682,8 @@
         title: CLAP
       - local: model_doc/encodec
         title: EnCodec
+      - local: model_doc/hiera
+        title: Hiera
       - local: model_doc/hubert
         title: Hubert
       - local: model_doc/mctct

diff --git a/docs/source/en/index.md b/docs/source/en/index.md
@@ -159,6 +159,7 @@ Flax), PyTorch, and/or TensorFlow.
 |                [Grounding DINO](model_doc/grounding-dino)                |       ✅        |         ❌         |      ❌      |
 |                      [GroupViT](model_doc/groupvit)                      |       ✅        |         ✅         |      ❌      |
 |                       [HerBERT](model_doc/herbert)                       |       ✅        |         ✅         |      ✅      |
+|                         [Hiera](model_doc/hiera)                         |       ✅        |         ❌         |      ❌      |
 |                        [Hubert](model_doc/hubert)                        |       ✅        |         ✅         |      ❌      |
 |                        [I-BERT](model_doc/ibert)                         |       ✅        |         ❌         |      ❌      |
 |                       [IDEFICS](model_doc/idefics)                       |       ✅        |         ✅         |      ❌      |

diff --git a/docs/source/en/model_doc/hiera.md b/docs/source/en/model_doc/hiera.md
@@ -0,0 +1,48 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Hiera
+
+## Overview
+
+Hiera was proposed in [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/abs/2306.00989) by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer.
+
+The paper introduces Hiera, a hierarchical Vision Transformer that simplifies modern hierarchical vision transformers by removing unnecessary components without sacrificing accuracy or efficiency. Earlier hierarchical models add complex vision-specific components to improve supervised classification performance; Hiera shows that these "bells-and-whistles" are not needed for high accuracy. By pretraining with a strong visual pretext task (MAE), Hiera stays simple while being more accurate and faster, at both inference and training time, across a range of image and video recognition tasks. The results suggest that the spatial biases vision tasks require can be learned through pretraining rather than built in through added architectural complexity.
+
+The abstract from the paper is the following:
+
+*Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.*
+
+This model was a joint contribution by [EduardoPacheco](https://huggingface.co/EduardoPacheco) and [namangarg110](https://huggingface.co/namangarg110). The original code can be found [here](https://github.com/facebookresearch/hiera).
+
+## HieraConfig
+
+[[autodoc]] HieraConfig
+
+## HieraModel
+
+[[autodoc]] HieraModel
+    - forward
+
+## HieraForPreTraining
+
+[[autodoc]] HieraForPreTraining
+    - forward
+
+## HieraForImageClassification
+
+[[autodoc]] HieraForImageClassification
+    - forward
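Note on the overview in `hiera.md`: the "hierarchical" structure it describes can be pictured with a little arithmetic. The sketch below is illustrative only. It assumes a 224x224 input, a patch stride of 4, 2x2 pooling between stages, and a channel width that starts at 96 and doubles per stage; none of these numbers appear in this diff, and `stage_shapes` is a hypothetical helper, not part of transformers.

```python
# Illustrative arithmetic only: every default below is an assumption,
# not a value taken from this diff or read from HieraConfig.
def stage_shapes(image_size=224, patch_stride=4, pool_stride=2,
                 embed_dim=96, dim_multiplier=2, num_stages=4):
    """Return (tokens_per_side, num_tokens, channels) for each stage."""
    side = image_size // patch_stride  # token grid after patch embedding
    dim = embed_dim
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            side //= pool_stride   # spatial resolution shrinks between stages
            dim *= dim_multiplier  # channel width grows between stages
        shapes.append((side, side * side, dim))
    return shapes


for index, (side, tokens, dim) in enumerate(stage_shapes(), start=1):
    print(f"stage {index}: {side}x{side} = {tokens} tokens, {dim} channels")
```

Under these assumptions the token count drops by a factor of four at each stage while the channel width doubles, which is the usual trade-off in a hierarchical vision backbone.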
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
@@ -462,6 +462,7 @@
         "GroupViTVisionConfig",
     ],
     "models.herbert": ["HerbertTokenizer"],
+    "models.hiera": ["HieraConfig"],
     "models.hubert": ["HubertConfig"],
     "models.ibert": ["IBertConfig"],
     "models.idefics": ["IdeficsConfig"],
@@ -2285,6 +2286,15 @@
             "GroupViTVisionModel",
         ]
     )
+    _import_structure["models.hiera"].extend(
+        [
+            "HieraBackbone",
+            "HieraForImageClassification",
+            "HieraForPreTraining",
+            "HieraModel",
+            "HieraPreTrainedModel",
+        ]
+    )
     _import_structure["models.hubert"].extend(
         [
             "HubertForCTC",
@@ -5112,6 +5122,7 @@
         GroupViTVisionConfig,
     )
     from .models.herbert import HerbertTokenizer
+    from .models.hiera import HieraConfig
     from .models.hubert import HubertConfig
     from .models.ibert import IBertConfig
     from .models.idefics import (
@@ -6795,6 +6806,13 @@
             GroupViTTextModel,
             GroupViTVisionModel,
         )
+        from .models.hiera import (
+            HieraBackbone,
+            HieraForImageClassification,
+            HieraForPreTraining,
+            HieraModel,
+            HieraPreTrainedModel,
+        )
         from .models.hubert import (
             HubertForCTC,
             HubertForSequenceClassification,

diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py
@@ -105,6 +105,7 @@
     grounding_dino,
     groupvit,
     herbert,
+    hiera,
     hubert,
     ibert,
     idefics,

diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py
@@ -122,6 +122,7 @@
         ("graphormer", "GraphormerConfig"),
         ("grounding-dino", "GroundingDinoConfig"),
         ("groupvit", "GroupViTConfig"),
+        ("hiera", "HieraConfig"),
         ("hubert", "HubertConfig"),
         ("ibert", "IBertConfig"),
         ("idefics", "IdeficsConfig"),
@@ -403,6 +404,7 @@
         ("grounding-dino", "Grounding DINO"),
         ("groupvit", "GroupViT"),
         ("herbert", "HerBERT"),
+        ("hiera", "Hiera"),
         ("hubert", "Hubert"),
         ("ibert", "I-BERT"),
         ("idefics", "IDEFICS"),

diff --git a/src/transformers/models/auto/image_processing_auto.py b/src/transformers/models/auto/image_processing_auto.py
@@ -85,6 +85,7 @@
             ("glpn", ("GLPNImageProcessor",)),
             ("grounding-dino", ("GroundingDinoImageProcessor",)),
             ("groupvit", ("CLIPImageProcessor",)),
+            ("hiera", ("BitImageProcessor",)),
             ("idefics", ("IdeficsImageProcessor",)),
             ("idefics2", ("Idefics2ImageProcessor",)),
             ("imagegpt", ("ImageGPTImageProcessor",)),

diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py
@@ -119,6 +119,7 @@
         ("graphormer", "GraphormerModel"),
         ("grounding-dino", "GroundingDinoModel"),
         ("groupvit", "GroupViTModel"),
+        ("hiera", "HieraModel"),
         ("hubert", "HubertModel"),
         ("ibert", "IBertModel"),
         ("idefics", "IdeficsModel"),
@@ -295,6 +296,7 @@
         ("gpt2", "GPT2LMHeadModel"),
         ("gpt_bigcode", "GPTBigCodeForCausalLM"),
         ("gptsan-japanese", "GPTSanJapaneseForConditionalGeneration"),
+        ("hiera", "HieraForPreTraining"),
         ("ibert", "IBertForMaskedLM"),
         ("idefics", "IdeficsForVisionText2Text"),
         ("idefics2", "Idefics2ForConditionalGeneration"),
@@ -535,6 +537,7 @@
         ("efficientnet", "EfficientNetModel"),
         ("focalnet", "FocalNetModel"),
         ("glpn", "GLPNModel"),
+        ("hiera", "HieraModel"),
         ("imagegpt", "ImageGPTModel"),
         ("levit", "LevitModel"),
         ("mobilenet_v1", "MobileNetV1Model"),
@@ -610,6 +613,7 @@
         ),
         ("efficientnet", "EfficientNetForImageClassification"),
         ("focalnet", "FocalNetForImageClassification"),
+        ("hiera", "HieraForImageClassification"),
         ("imagegpt", "ImageGPTForImageClassification"),
         (
             "levit",
@@ -1258,6 +1262,7 @@
         ("dinat", "DinatBackbone"),
         ("dinov2", "Dinov2Backbone"),
         ("focalnet", "FocalNetBackbone"),
+        ("hiera", "HieraBackbone"),
         ("maskformer-swin", "MaskFormerSwinBackbone"),
         ("nat", "NatBackbone"),
         ("pvt_v2", "PvtV2Backbone"),
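Note on the auto-mapping entries above: each hunk adds one `("hiera", ...)` pair to a table keyed by model type, so the same key can resolve to a different class per task. A minimal sketch of that lookup follows. The table names and `resolve_class_name` are hypothetical stand-ins for illustration, and the tables are tiny excerpts built from pairs shown in this diff; this is not the transformers auto-class implementation.

```python
from collections import OrderedDict

# Hypothetical, heavily trimmed tables; the pairs are copied from this diff.
BASE_MODELS = OrderedDict([
    ("groupvit", "GroupViTModel"),
    ("hiera", "HieraModel"),
    ("hubert", "HubertModel"),
])
IMAGE_CLASSIFIERS = OrderedDict([
    ("focalnet", "FocalNetForImageClassification"),
    ("hiera", "HieraForImageClassification"),
])
BACKBONES = OrderedDict([
    ("focalnet", "FocalNetBackbone"),
    ("hiera", "HieraBackbone"),
])


def resolve_class_name(model_type, table):
    """Look up the class name registered for model_type in one task table."""
    try:
        return table[model_type]
    except KeyError:
        known = ", ".join(table)
        raise ValueError(
            f"Unrecognized model type {model_type!r}; known types: {known}"
        ) from None


# The same model type maps to a different class depending on the task table.
for table in (BASE_MODELS, IMAGE_CLASSIFIERS, BACKBONES):
    print(resolve_class_name("hiera", table))
```

Keeping one table per task means that supporting a new architecture is a matter of adding a row to each relevant table, which is what these hunks do.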

diff --git a/src/transformers/models/hiera/__init__.py b/src/transformers/models/hiera/__init__.py
@@ -0,0 +1,59 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_torch_available,
+)
+
+
+_import_structure = {"configuration_hiera": ["HieraConfig"]}
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_hiera"] = [
+        "HieraForImageClassification",
+        "HieraForPreTraining",
+        "HieraBackbone",
+        "HieraModel",
+        "HieraPreTrainedModel",
+    ]
+
+if TYPE_CHECKING:
+    from .configuration_hiera import HieraConfig
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_hiera import (
+            HieraBackbone,
+            HieraForImageClassification,
+            HieraForPreTraining,
+            HieraModel,
+            HieraPreTrainedModel,
+        )
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
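Note on the `__init__.py` above: `_import_structure` maps each submodule to the names it exports, and the lazy module defers the actual import until a name is first accessed. The sketch below shows that general pattern using only the standard library. `LazyNamespace` is a hypothetical, simplified stand-in and not the transformers `_LazyModule`; the `find_spec` check is an assumed analogue of the `is_torch_available()` guard, and stdlib modules stand in for `configuration_hiera` and `modeling_hiera`.

```python
import importlib
import importlib.util
import types


class LazyNamespace(types.ModuleType):
    """Resolve attributes on first access from a {module: [names]} table."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert the table: exported name -> module that defines it.
        self._name_to_module = {
            attr: module
            for module, attrs in import_structure.items()
            for attr in attrs
        }
        self.loaded = []  # modules imported so far, kept for demonstration

    def __getattr__(self, attr):
        # Only reached when normal lookup fails, i.e. before the name is cached.
        table = self.__dict__.get("_name_to_module", {})
        if attr not in table:
            raise AttributeError(
                f"module {self.__name__!r} has no attribute {attr!r}"
            )
        module = importlib.import_module(table[attr])
        self.loaded.append(table[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Always-available exports, analogous to the configuration entry.
structure = {"json": ["dumps"]}
# Optional exports are registered only when their dependency can be found.
if importlib.util.find_spec("fractions") is not None:
    structure["fractions"] = ["Fraction"]

lazy = LazyNamespace("demo", structure)
print(lazy.loaded)             # nothing has been imported through it yet
print(lazy.dumps({"a": 1}))    # first access triggers the import of json
print(lazy.loaded)
```

The practical effect of this design is that importing the package stays cheap: a heavy optional dependency is only touched when one of the names that needs it is actually used.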