This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

Vision transformer #100

Closed
wants to merge 205 commits into from

@growlix commented Dec 3, 2020

Vision transformer trunk and head, and functionality to run on slurm/FAIR cluster.

facebook-github-bot and others added 30 commits June 22, 2020 05:47
fbshipit-source-id: bad295c2b54e5d8258176d45951637725dd771bf
Summary: Bringing full parity with the torchvision models. A user can now use any torchvision resnet model; all they have to do is append `trunk.base_model._feature_blocks` from the config file.

Reviewed By: imisra

Differential Revision: D22116278

fbshipit-source-id: e8eafd4e0de61351956608b1b02096a81ebfaa9e
Summary:
properly document and clean up the logic for loading a model file specified by the user. The logic is now deduplicated and robust (it throws errors and checks for compatibility)
1. remove the need for `LOAD_TRUNK_AND_HEADS`. this wasn't needed and could be confusing
2. if a vissl-compatible model is passed, we automatically figure out what prefixes should be appended to the model. No need for the user to pass arguments like `APPEND_PREFIX`
3. if a non-vissl-compatible model is passed, the model compatibility is checked and a suggestion is given to the user on how to make it compatible.
4. rename `APPEND_SUFFIX` -> `APPEND_PREFIX` and `REMOVE_SUFFIX` -> `REMOVE_PREFIX`
5. add `check_model_compatibility` to make sure that the model is compatible to load. previously, if the checkpoint was not compatible, it would not load the layers and would also NOT fail. now it will fail.
6. combine some of the model checkpoint logic from `base_ssl_model` and `checkpoint`. it is all centralized in the latter.
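
The prefix handling and the fail-loudly compatibility check described in the list above can be illustrated with plain dictionaries. This is only a minimal sketch, not the code from this commit: the helper names `remove_prefix`, `append_prefix` and `check_compatibility` are hypothetical, and a checkpoint is assumed to be a flat mapping from parameter names to values.

```python
from typing import Any, Dict, Iterable


def remove_prefix(state: Dict[str, Any], prefix: str) -> Dict[str, Any]:
    """Strip `prefix` from every key that starts with it (hypothetical helper)."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state.items()
    }


def append_prefix(state: Dict[str, Any], prefix: str) -> Dict[str, Any]:
    """Prepend `prefix` to every key (hypothetical helper)."""
    return {prefix + key: value for key, value in state.items()}


def check_compatibility(state: Dict[str, Any], expected_keys: Iterable[str]) -> None:
    """Raise instead of silently skipping layers when the key sets differ."""
    expected = set(expected_keys)
    missing = sorted(expected - set(state))
    unexpected = sorted(set(state) - expected)
    if missing or unexpected:
        raise ValueError(
            f"Incompatible checkpoint. Missing keys: {missing}. "
            f"Unexpected keys: {unexpected}."
        )
```

For instance, a key such as `module.conv1.weight` could be renamed with `append_prefix(remove_prefix(state, "module."), "trunk.")` before calling `check_compatibility` against the names the model expects.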

Reviewed By: blefaudeux

Differential Revision: D22164172

fbshipit-source-id: 4c45682f9da679df0c91ee5b301e3551c9e7d349
#36)

Summary:
… my FAIR cluster devfairs
Pull Request resolved: fairinternal/ssl_scaling#36

Reviewed By: blefaudeux

Differential Revision: D22164969

Pulled By: prigoyal

fbshipit-source-id: 1ccc44af359fcf06f174f24a20a7d277153bd65b
Summary: renaming from ssl_framework_plugin to vissl_plugin

Reviewed By: blefaudeux

Differential Revision: D22165297

fbshipit-source-id: 0973b007a1504d55b3bd098824888640397e4cb9
Summary: for the console handlers, the logging level was set to info, which means the debug messages were skipped. setting the correct level now
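
As an illustration of why the handler level matters, here is a minimal sketch using the standard `logging` module. It assumes the intended level is `DEBUG` (the commit message does not name it), and `make_console_logger` is a hypothetical helper, not a function from this repository. A record is emitted only if it passes both the logger level and the handler level.

```python
import logging


def make_console_logger(name: str, stream=None) -> logging.Logger:
    """Build a logger whose console handler lets DEBUG records through."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # the logger itself must accept DEBUG
    handler = logging.StreamHandler(stream)  # defaults to sys.stderr
    handler.setLevel(logging.DEBUG)  # an INFO level here would drop DEBUG records
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger
```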

Reviewed By: blefaudeux

Differential Revision: D22165926

fbshipit-source-id: 4246c7592e95eb7b319cfabdc23c8442451f4bd1
Summary:
- Trunk declaration around ModuleDict: makes it trivial to index the features you want to pull, makes sure that names and modules are in sync, by design, and makes it possible to have the same forward for most trunks.
- Tentatively fix EfficientNet, which I believe was buggy around the drop connection rate, and refactor a bit
- Simplify and try to make the code easier

Reviewed By: prigoyal

Differential Revision: D22078597

fbshipit-source-id: 26d3e50469107a53cb3cb597d9d16eb59cbe51ec
Summary:
Pull Request resolved: fairinternal/ssl_scaling#37

Up to now these configs would not actually run if the machine where they were scheduled had already gone through a test and had not been wiped

Reviewed By: prigoyal

Differential Revision: D22169259

fbshipit-source-id: 1ece8ac5a2c4f866f54eb7f5d97728ab5e3a365b
Summary:
logging the metrics to a `metric.json` file.

also used the opportunity to rename `tasks` folder to `ssl_tasks` and extract the `accuracy_list_meter.py` from the `__init__.py` for better code readability
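
A metrics file like the `metric.json` mentioned above could be written as one JSON object per line. This is a sketch under assumptions: the actual file layout is not described in the commit message, and `log_metrics` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def log_metrics(path: str, metrics: Dict[str, Any]) -> None:
    """Append one JSON object per call, so earlier entries are preserved."""
    with open(path, "a") as f:
        f.write(json.dumps(metrics, sort_keys=True) + "\n")
```

Appending line by line keeps the file readable even if a job is interrupted partway through training.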

Reviewed By: blefaudeux

Differential Revision: D22166885

fbshipit-source-id: d52728616a2f54223994d64267b4b9d0017d33cb
Summary: I noticed in the code that at several places we use local rank and get it from the env. given this is a helpful thing, I am creating a common utility function
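
Such a utility might look like the sketch below. It assumes the rank is read from the `LOCAL_RANK` environment variable, which PyTorch's distributed launcher sets; the variable this commit actually reads is not stated, and `get_local_rank` is a hypothetical name.

```python
import os


def get_local_rank(default: int = 0) -> int:
    """Return the local rank from the environment, or `default` if unset."""
    return int(os.environ.get("LOCAL_RANK", default))
```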

Reviewed By: blefaudeux

Differential Revision: D22170823

fbshipit-source-id: 1da72d69235ac6da35287797eb50f026540e730c
Summary:
Pull Request resolved: fairinternal/ssl_scaling#40

Unblocking master, moving all syncBNs to pytorch until we properly solve that

Reviewed By: mannatsingh

Differential Revision: D22196409

fbshipit-source-id: 0cae37bae5efcca5ffc17f9ca9d7982d2a3f0e55
Summary: while looking at the CircleCI setup, I ran into the problem of getting a dataset to run integration tests. The problem is easily solved by adding a minimal synthetic dataset class that returns a mean image. the dataset size is set to 500 max by default and the user can control (increase) it from the yaml config
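
The idea can be sketched without any dependency as a map-style dataset that implements `__len__` and `__getitem__`. Everything here is illustrative: the class name, the nested-list image representation, the image size and the ImageNet channel means are assumptions, not the implementation added in this commit.

```python
from typing import List, Tuple


class SyntheticMeanImageDataset:
    """Return the same constant 'mean' image for every index."""

    def __init__(
        self,
        size: int = 500,  # default size taken from the commit message
        image_hw: Tuple[int, int] = (224, 224),  # assumed image size
        mean_rgb: Tuple[float, float, float] = (0.485, 0.456, 0.406),  # assumed
    ) -> None:
        self.size = size
        self.image_hw = image_hw
        self.mean_rgb = mean_rgb

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, idx: int) -> List[List[List[float]]]:
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        height, width = self.image_hw
        # Channels-first layout: one constant plane per channel.
        return [[[c] * width for _ in range(height)] for c in self.mean_rgb]
```

Because no files are read, a dataset like this lets integration tests run on CI machines that have no real data available.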

Reviewed By: mannatsingh

Differential Revision: D22186576

fbshipit-source-id: 06a1562abf2d7f1849ff2d83d3b9cc2849641205
Summary:
Pull Request resolved: fairinternal/ssl_scaling#41

Unit testing some losses, would need more coverage but that's a start. Just checking that types and dimensions are correct, not a correctness check

Reviewed By: prigoyal

Differential Revision: D22198026

fbshipit-source-id: 161c3ce6948b06736056c69096f064b14ffdf470
…#2)

Summary:
adding all sorts of coding quality standards: isort, flake, black, pre-commit check

and some improvements to setup.py including versioning, requirements.txt etc
Pull Request resolved: #2

Reviewed By: mannatsingh

Differential Revision: D22190024

Pulled By: prigoyal

fbshipit-source-id: 58f8ee7c59c821272a89febf436e1bae35841832
Summary:
Pull Request resolved: fairinternal/ssl_scaling#42

Pull Request resolved: #4

removed the third-party completely
for classy vision, that's in requirements.txt now

for apex, we simplified install instructions with a tarball to pin to a specific version

Reviewed By: mannatsingh

Differential Revision: D22213297

fbshipit-source-id: 374174fae6ff91aad6f18af2c4557c6b7a157ef6
…to fb (#5)

Summary:
Pull Request resolved: #5

1. moving regnet files to fb specific folder as the tests on github will fail since regnet is not OSS yet
2. for unit test, use non internal hydra function. make it work without hydra plugin
3. in test tasks, test the actual lib vissl and not the distributed_train which is a binary
4. re-organized test files under config/test folder for clarity
5. small fix to swav loss - wasn't working on gpus anymore

Reviewed By: mannatsingh

Differential Revision: D22219028

fbshipit-source-id: 94671e18b3adbb6f983284b03e3db60692f2813e
Summary: Rename img pil enhancements. Add docstrings to the image transforms.

Reviewed By: prigoyal

Differential Revision: D22222850

fbshipit-source-id: 08e6d33fb398b9080a5deefbccf0ca87286463d2
Summary:
Pull Request resolved: #7

setting up the docker file so we can have the proper environment and also provide helpful scripts to use otherwise like conda install, etc.

Reviewed By: mannatsingh

Differential Revision: D22233535

fbshipit-source-id: 8aa06a5586ca49c4c61fd3a404daa7a6c3fec836
Summary:
Pull Request resolved: fairinternal/ssl_scaling#43

Pull Request resolved: #8

setting up the config for circle ci testing - cpu and gpu tests both

also had to make some changes to make pre-commit-hook compatible and working nicely

Reviewed By: mannatsingh

Differential Revision: D22257436

fbshipit-source-id: 56d952c014885450dde8ee8c3f4a1292746a4328
Summary:
- adding more gpu tests to run on CI. the CI machines have only 8GB of gpu memory (I tried getting access to the large machine in the CircleCI set but it didn't work; I still got only 8GB). in the meantime, it's okay for us to run with a smaller batch size per replica in the gpu tests since we are not checking correctness.

- also disable the complexity computation for pirl since the model has multiple inputs but the Classy Vision API supports 1 input only. cc mannatsingh

- also one small fix in the deepclusterv2 loss in the logging function

Reviewed By: mannatsingh

Differential Revision: D22266538

fbshipit-source-id: a671f10fe0b71b5d84aa44aafe88f9fef7bcfdb9
Summary: hydra plugin isn't needed anywhere (fbcode/github) so removing it

Reviewed By: mannatsingh

Differential Revision: D22264013

fbshipit-source-id: fafcc23fd994af855a8fcdbff768666476db781a
Summary: fixing usage of hydra.experimental after Hydra update

Reviewed By: jieru-hu

Differential Revision: D22264458

fbshipit-source-id: 4f42a555e9385c72b428c7e4481a45e255583d3e
Summary:
Pull Request resolved: #10

tracking the hydra1.0 branch on github as per recommendation from omry

Reviewed By: mannatsingh

Differential Revision: D22268778

fbshipit-source-id: f9b1e976c157d5ee3e5646154b4930c6569a5c21
…#12)

Summary:
Pull Request resolved: #12

moved the gpu tests to the test script so we don't need to change the circleci config file every time

Reviewed By: blefaudeux

Differential Revision: D22284063

fbshipit-source-id: fe925a842b1175bf27b9d82988202053cdba6b3b
Summary:
Pull Request resolved: #9

Pull Request resolved: fairinternal/ssl_scaling#39

This unit test includes a couple of FW passes, I feel that's important to catch errors in an easier fashion than integration test

- enforce the task run even if checkpoints exist
- add a resnet trunk test task
- add an efficientnet trunk task
- switch off the complexity computation for EfficientNet, until this is fixed

Reviewed By: prigoyal

Differential Revision: D22193592

fbshipit-source-id: d29c797c029dd027ddff9c01c8ab9fe07483a3f1
Summary:
Pull Request resolved: #13

conda packaging vissl for various cuda, pytorch, python versions
- cuda: 9.1, 10.0, 10.1, 10.2
- pytorch: 1.4 , 1.5
- python: 3.6, 3.7, 3.8

Reviewed By: blefaudeux

Differential Revision: D22286187

fbshipit-source-id: efe7d4f4a9805f1eb9af92a9c8facfa410c53d5a
Summary:
Pull Request resolved: #16

- The DiskImageDataset can now use the labels that are computed by the torchvision ImageFolder dataset.
- The DiskImageDataset accepts a `root_dir` argument which makes it so that the image paths used in `npy` files can be relative paths.
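
How a `root_dir` argument makes relative paths usable can be sketched as follows. `resolve_image_paths` is a hypothetical helper, not the `DiskImageDataset` code; it relies on `os.path.join`, which leaves absolute paths unchanged.

```python
import os
from typing import List


def resolve_image_paths(paths: List[str], root_dir: str = "") -> List[str]:
    """Join each path with `root_dir`; absolute paths are returned as-is."""
    return [os.path.join(root_dir, p) for p in paths]
```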

Reviewed By: prigoyal

Differential Revision: D22259478

fbshipit-source-id: 34373e02661903840b379a86270ff4590acd2730
Summary:
Pull Request resolved: #15

Pull Request resolved: fairinternal/ssl_scaling#44

Reviewed By: mannatsingh

Differential Revision: D22308252

Pulled By: prigoyal

fbshipit-source-id: 668c22b9edfbac823177a1567815b7d4378a6c33
@facebook-github-bot

@growlix has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot left a comment

@growlix has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@prigoyal closed this Jan 26, 2021
facebook-github-bot pushed a commit that referenced this pull request Apr 26, 2021
Summary:
# Layer by layer memory profiling

A first version of the memory profiling, tracking the memory used through the forward/backward passes, with a breakdown of the memory dedicated to activations (issue fairinternal/ssl_scaling#97).

- [x] Define the test plan
- [x] Provide example curves and data output
- [x] Run on FSDP vs DDP
- [x] Run on FSDP with or without checkpointing

## Using the feature

Just add `cfg.PROFILING.TRACK_BY_LAYER_MEMORY=True` in the command line when running a job to track the memory usage, layer by layer, during both the forward and backward passes.

Further configuration is available to choose:
- which rank is monitored
- for how many iterations
- starting from which iteration

Pull Request resolved: fairinternal/ssl_scaling#100

Test Plan:
The feature comes with its own set of unit tests

## Example outputs

The output directory will contain the following files for each rank and iteration monitored:

```
memory_rank_0_iteration_0.json
memory_rank_0_iteration_0.jpg
```

The JSON file contains the raw data, while the JPG file provides an overview of what is happening in terms of memory:

<img width="1047" alt="Screenshot 2021-04-19 at 11 26 06" src="https://user-images.githubusercontent.com/7412790/115261974-19376780-a102-11eb-838c-688d807094d3.png">
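
When post-processing an output directory, the file naming shown above can be parsed with a small helper. This is a sketch that relies only on the two example file names; `parse_profile_filename` is a hypothetical function, and the content of the JSON files is not assumed.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the example names, e.g. memory_rank_0_iteration_0.json
_PROFILE_NAME = re.compile(r"memory_rank_(\d+)_iteration_(\d+)\.(json|jpg)$")


def parse_profile_filename(name: str) -> Optional[Tuple[int, int, str]]:
    """Return (rank, iteration, extension), or None if the name does not match."""
    match = _PROFILE_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)
```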

Reviewed By: prigoyal

Differential Revision: D27977734

Pulled By: QuentinDuval

fbshipit-source-id: 4000f84e418afecb7c02dee5c5add260a04046ba
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.