fbshipit-source-id: bad295c2b54e5d8258176d45951637725dd771bf
Summary: Bringing full parity with the torchvision models. A user can now use any torchvision resnet model and all they have to do is append `trunk.base_model._feature_blocks` from the config file and that's it. Reviewed By: imisra Differential Revision: D22116278 fbshipit-source-id: e8eafd4e0de61351956608b1b02096a81ebfaa9e
Summary: properly document and clean up the logic for loading a model file specified by the user. Now the logic is deduped and robust (throws errors and checks for compatibility) 1. remove the need for `LOAD_TRUNK_AND_HEADS`. this wasn't needed and could be confusing 2. if a vissl compatible model is passed, we automatically figure out what prefixes should be appended to the model. No need for the user to pass arguments like `APPEND_PREFIX` 3. if a non-vissl compatible model is passed, the model compatibility is checked and a suggestion is given to the user on how to make it compatible. 4. rename `APPEND_SUFFIX` -> `APPEND_PREFIX` and `REMOVE_SUFFIX` -> `REMOVE_PREFIX` 5. add `check_model_compatibility` to make sure that the model is compatible to load. previously if the checkpoint was not compatible, it would not load the layers and would also NOT fail. now it will fail. 6. combine some of the model checkpoint logic from `base_ssl_model` and `checkpoint`. all centralized in the latter. Reviewed By: blefaudeux Differential Revision: D22164172 fbshipit-source-id: 4c45682f9da679df0c91ee5b301e3551c9e7d349
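The prefix handling and the fail-fast compatibility check described in the summary above can be illustrated with a minimal sketch. This is not the vissl implementation: the helper names `append_prefix`, `remove_prefix` and `check_compatibility_sketch` are hypothetical, and a plain dict stands in for a checkpoint state dict.

```python
from typing import Dict, Iterable


def append_prefix(state_dict: Dict[str, object], prefix: str) -> Dict[str, object]:
    """Return a copy of the state dict with `prefix` added to every key."""
    return {prefix + key: value for key, value in state_dict.items()}


def remove_prefix(state_dict: Dict[str, object], prefix: str) -> Dict[str, object]:
    """Return a copy with `prefix` stripped from the keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def check_compatibility_sketch(
    state_dict: Dict[str, object], model_keys: Iterable[str]
) -> None:
    """Hypothetical check: fail loudly instead of silently skipping layers."""
    missing = sorted(set(model_keys) - set(state_dict))
    if missing:
        raise ValueError(
            f"Checkpoint is missing {len(missing)} model key(s), e.g. {missing[0]!r}. "
            "Consider appending or removing a key prefix."
        )


# Usage: a torchvision-style key becomes loadable after the prefix is appended.
checkpoint = {"conv1.weight": "tensor-placeholder"}
prefixed = append_prefix(checkpoint, "trunk.base_model.")
check_compatibility_sketch(prefixed, ["trunk.base_model.conv1.weight"])
```

Raising on missing keys mirrors the behavior change in the summary: an incompatible checkpoint now fails instead of loading nothing.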
#36) Summary: … my FAIR cluster devfairs Pull Request resolved: fairinternal/ssl_scaling#36 Reviewed By: blefaudeux Differential Revision: D22164969 Pulled By: prigoyal fbshipit-source-id: 1ccc44af359fcf06f174f24a20a7d277153bd65b
Summary: renaming from ssl_framework_plugin to vissl_plugin Reviewed By: blefaudeux Differential Revision: D22165297 fbshipit-source-id: 0973b007a1504d55b3bd098824888640397e4cb9
Summary: for the console handlers, the logging level was set to info, which means it would skip the debug messages. setting the correct level now Reviewed By: blefaudeux Differential Revision: D22165926 fbshipit-source-id: 4246c7592e95eb7b319cfabdc23c8442451f4bd1
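The issue described above comes from the fact that in Python's `logging` module a handler has its own level, separate from the logger's. A minimal sketch, assuming a console handler on stdout (the function name `make_console_logger` is hypothetical, not taken from the repository):

```python
import logging
import sys


def make_console_logger(name: str, level: int = logging.DEBUG) -> logging.Logger:
    """Create a logger whose console handler lets `level` messages through."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()  # avoid stacking duplicate handlers on repeated calls

    handler = logging.StreamHandler(sys.stdout)
    # A handler left at INFO would drop DEBUG records even when the logger
    # itself is set to DEBUG, so the handler level is set explicitly.
    handler.setLevel(level)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


logger = make_console_logger("example")
logger.debug("debug messages are now emitted")
```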
Summary: - Trunk declaration around ModuleDict: makes it trivial to index the features you want to pull, makes sure that names and modules are in sync, by design, and makes it possible to have the same forward for most trunks. - Tentatively fix EfficientNet, which I believe was buggy around the drop connection rate, and refactor a bit - Simplify and try to make the code easier Reviewed By: prigoyal Differential Revision: D22078597 fbshipit-source-id: 26d3e50469107a53cb3cb597d9d16eb59cbe51ec
Summary: Pull Request resolved: fairinternal/ssl_scaling#37 Up to now these configs would not actually run if the machine where they were scheduled had already gone through a test and was not wiped out Reviewed By: prigoyal Differential Revision: D22169259 fbshipit-source-id: 1ece8ac5a2c4f866f54eb7f5d97728ab5e3a365b
Summary: logging the metrics to a `metric.json` file. also used the opportunity to rename `tasks` folder to `ssl_tasks` and extract the `accuracy_list_meter.py` from the `__init__.py` for better code readability Reviewed By: blefaudeux Differential Revision: D22166885 fbshipit-source-id: d52728616a2f54223994d64267b4b9d0017d33cb
Summary: I noticed in the code that at several places we use local rank and get it from the env. given this is a helpful thing, I am creating a common utility function Reviewed By: blefaudeux Differential Revision: D22170823 fbshipit-source-id: 1da72d69235ac6da35287797eb50f026540e730c
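A common utility of the kind described above could look like the sketch below. It assumes the launcher exports the rank in a `LOCAL_RANK` environment variable (as `torchrun` does); the function name `get_local_rank` and the default of 0 are illustrative assumptions, not the repository's actual helper.

```python
import os


def get_local_rank(default: int = 0) -> int:
    """Read the local rank from the environment, falling back to `default`.

    Assumes the launcher sets LOCAL_RANK; single-process runs that do not
    set it are treated as rank `default`.
    """
    return int(os.environ.get("LOCAL_RANK", default))


print(f"local rank: {get_local_rank()}")
```

Centralizing the lookup keeps the environment variable name and the fallback in one place instead of repeating them at every call site.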
Summary: Pull Request resolved: fairinternal/ssl_scaling#40 Unblocking master, moving all syncBNs to pytorch until we properly solve that Reviewed By: mannatsingh Differential Revision: D22196409 fbshipit-source-id: 0cae37bae5efcca5ffc17f9ca9d7982d2a3f0e55
Summary: while looking at the circleCI setup, I ran into the problem of getting the dataset to run integration tests. The problem is easily solvable by adding a synthetic dataset class which is very minimal and returns a mean image. the dataset size is set to 500 max by default and the user can control it (increase) from the yaml config Reviewed By: mannatsingh Differential Revision: D22186576 fbshipit-source-id: 06a1562abf2d7f1849ff2d83d3b9cc2849641205
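The idea of a synthetic dataset that always returns a mean image can be sketched as follows. This is an illustration only: the class name `SyntheticImageDataset`, its parameters, the placeholder mean pixel value and the dummy label are assumptions, and nested Python lists stand in for a real image type to keep the sketch dependency-free.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]


class SyntheticImageDataset:
    """Every sample is the same constant 'mean' image with a dummy label."""

    def __init__(
        self,
        size: int = 500,  # default size mentioned in the summary above
        height: int = 224,
        width: int = 224,
        mean_pixel: Pixel = (0.485, 0.456, 0.406),  # placeholder value
    ) -> None:
        self.size = size
        self.height = height
        self.width = width
        self.mean_pixel = mean_pixel

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, idx: int) -> Tuple[Image, int]:
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        # Build the constant image on demand so nothing large is kept in memory.
        image = [[self.mean_pixel] * self.width for _ in range(self.height)]
        return image, 0


dataset = SyntheticImageDataset(size=8, height=4, width=4)
image, label = dataset[0]
print(len(dataset), len(image), len(image[0]), label)
```

Because no files are read, integration tests can run on CI machines that have no access to a real dataset.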
Summary: Pull Request resolved: fairinternal/ssl_scaling#41 Unit testing some losses, would need more coverage but that's a start. Just checking that types and dimensions are correct, not a correctness check Reviewed By: prigoyal Differential Revision: D22198026 fbshipit-source-id: 161c3ce6948b06736056c69096f064b14ffdf470
…#2) Summary: adding all sorts of coding quality standards: isort, flake, black, pre-commit check and some improvements to setup.py including versioning, requirements.txt etc Pull Request resolved: #2 Reviewed By: mannatsingh Differential Revision: D22190024 Pulled By: prigoyal fbshipit-source-id: 58f8ee7c59c821272a89febf436e1bae35841832
Summary: Pull Request resolved: fairinternal/ssl_scaling#42 Pull Request resolved: #4 removed the third-party completely. for classy vision, that's in requirements.txt now. for apex, we simplified install instructions with a tarball to pin to a specific version Reviewed By: mannatsingh Differential Revision: D22213297 fbshipit-source-id: 374174fae6ff91aad6f18af2c4557c6b7a157ef6
Re-sync with internal repository
…to fb (#5) Summary: Pull Request resolved: #5 1. moving regnet files to fb specific folder as the tests on github will fail since regnet is not OSS yet 2. for unit test, use non internal hydra function. make it work without hydra plugin 3. in test tasks, test the actual lib vissl and not the distributed_train which is a binary 4. re-organized test files under config/test folder for clarity 5. small fix to swav loss - wasn't working on gpus anymore Reviewed By: mannatsingh Differential Revision: D22219028 fbshipit-source-id: 94671e18b3adbb6f983284b03e3db60692f2813e
Summary: Rename img pil enhancements. Add docstrings to the image transforms. Reviewed By: prigoyal Differential Revision: D22222850 fbshipit-source-id: 08e6d33fb398b9080a5deefbccf0ca87286463d2
Summary: Pull Request resolved: #7 setting up the docker file so we can have the proper environment and also provide helpful scripts to use otherwise like conda install, etc. Reviewed By: mannatsingh Differential Revision: D22233535 fbshipit-source-id: 8aa06a5586ca49c4c61fd3a404daa7a6c3fec836
Summary: Pull Request resolved: fairinternal/ssl_scaling#43 Pull Request resolved: #8 setting up the config for circle ci testing - cpu and gpu tests both also had to make some changes to make pre-commit-hook compatible and working nicely Reviewed By: mannatsingh Differential Revision: D22257436 fbshipit-source-id: 56d952c014885450dde8ee8c3f4a1292746a4328
Summary: - adding more gpu tests to run on CI. since the CI machines have only 8GB gpu memory (I tried getting access to the large machine in the circle ci set but it didn't work, still got only 8gb), it's okay for us in the meantime to run on a smaller batch size per replica in the gpu test since we are not checking correctness. - also disable the complexity for pirl since the model has multi-input but the classy vision api supports 1 input only cc mannatsingh - also one small fix in the deepclusterv2 loss in the logging function Reviewed By: mannatsingh Differential Revision: D22266538 fbshipit-source-id: a671f10fe0b71b5d84aa44aafe88f9fef7bcfdb9
Summary: hydra plugin isn't needed anywhere (fbcode/github) so removing it Reviewed By: mannatsingh Differential Revision: D22264013 fbshipit-source-id: fafcc23fd994af855a8fcdbff768666476db781a
Summary: fixing usage of hydra.experimental after Hydra update Reviewed By: jieru-hu Differential Revision: D22264458 fbshipit-source-id: 4f42a555e9385c72b428c7e4481a45e255583d3e
Summary: Pull Request resolved: #10 tracking the hydra1.0 branch on github as per recommendation from omry Reviewed By: mannatsingh Differential Revision: D22268778 fbshipit-source-id: f9b1e976c157d5ee3e5646154b4930c6569a5c21
Summary: Pull Request resolved: #9 Pull Request resolved: fairinternal/ssl_scaling#39 This unit test includes a couple of FW passes, I feel that's important to catch errors in an easier fashion than an integration test - enforce the task run even if checkpoints exist - add a resnet trunk test task - add an efficientnet trunk task - switch off the complexity computation for EfficientNet, until this is fixed Reviewed By: prigoyal Differential Revision: D22193592 fbshipit-source-id: d29c797c029dd027ddff9c01c8ab9fe07483a3f1
Summary: Pull Request resolved: #13 conda packaging vissl for various cuda, pytorch, python versions - cuda: 9.1, 10.0, 10.1, 10.2 - pytorch: 1.4, 1.5 - python: 3.6, 3.7, 3.8 Reviewed By: blefaudeux Differential Revision: D22286187 fbshipit-source-id: efe7d4f4a9805f1eb9af92a9c8facfa410c53d5a
Summary: Pull Request resolved: #16 - The DiskImageDataset can now use the labels that are computed by the torchvision ImageFolder dataset. - The DiskImageDataset accepts a `root_dir` argument which makes it so that the image paths used in `npy` files can be relative paths. Reviewed By: prigoyal Differential Revision: D22259478 fbshipit-source-id: 34373e02661903840b379a86270ff4590acd2730
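The effect of the `root_dir` argument described above can be sketched with a small path-resolution helper. The function name `resolve_image_paths` is hypothetical, and leaving absolute paths untouched is an assumption of this sketch rather than documented `DiskImageDataset` behavior.

```python
import os
from typing import List, Sequence


def resolve_image_paths(paths: Sequence[str], root_dir: str = "") -> List[str]:
    """Join relative image paths (e.g. loaded from an `npy` file) onto root_dir.

    Absolute paths are returned unchanged (an assumption of this sketch).
    """
    return [
        path if os.path.isabs(path) else os.path.join(root_dir, path)
        for path in paths
    ]


relative_paths = ["train/cat/001.jpg", "train/dog/002.jpg"]
print(resolve_image_paths(relative_paths, root_dir="/datasets/example"))
```

Storing relative paths in the `npy` file lets the same file be reused when the dataset is moved, with only `root_dir` changing.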
Summary: Pull Request resolved: #15 Pull Request resolved: fairinternal/ssl_scaling#44 Reviewed By: mannatsingh Differential Revision: D22308252 Pulled By: prigoyal fbshipit-source-id: 668c22b9edfbac823177a1567815b7d4378a6c33
@growlix has updated the pull request. You must reimport the pull request before landing.
@growlix has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…etimes crash because 'empty histogram'
…l into vision_transformer "Added deit model implementation"
Summary:

# Layer by layer memory profiling

A first version of the memory profiling, tracking the memory used through the forward/backward passes, with a breakdown of the memory dedicated to activations (issue fairinternal/ssl_scaling#97).

- [x] Define the test plan
- [x] Provide example curves and data output
- [x] Run on FSDP vs DDP
- [x] Run on FSDP with or without checkpointing

## Using the feature

Just add `cfg.PROFILING.TRACK_BY_LAYER_MEMORY=True` in the command line when running a job to track the memory usage, layer by layer, during both the forward and backward passes.

Further configuration is available to choose:

- which rank is monitored
- for how many iterations
- starting from which iteration

Pull Request resolved: fairinternal/ssl_scaling#100

Test Plan: The feature comes with its own set of unit tests

## Example outputs

The output directory will contain the following files for each rank and iteration monitored:

```
memory_rank_0_iteration_0.json
memory_rank_0_iteration_0.jpg
```

The JSON file contains the raw data, while the JPG file provides an overview of what is happening in terms of memory:

<img width="1047" alt="Screenshot 2021-04-19 at 11 26 06" src="https://user-images.githubusercontent.com/7412790/115261974-19376780-a102-11eb-838c-688d807094d3.png">

Reviewed By: prigoyal

Differential Revision: D27977734

Pulled By: QuentinDuval

fbshipit-source-id: 4000f84e418afecb7c02dee5c5add260a04046ba
Vision transformer trunk and head, and functionality to run on slurm/FAIR cluster.