Commit
WIP: Composer Jenkinsfile (mosaicml#82)
* Composer Jenkinsfile
* testing
* Fixed exit
* testing
* Update README.md
* testing
* Update Jenkinsfile
* Removing bad symlink
* DDP Port Auto Selection; Removed spawning in tests
* Fixed jenkinsfile
* Fixed missing tests
* Docker builds
* Smaller build matrix
* Not running dev checks when building images
* testing
* Gpu tests
* Typo fix
* Testing
* Increased cpu limit
* Added log warning
* Ensuring that the launch script raises on sigkilled processes
* Upped the memory limit
* Configure a default virtualenv in the dockerfile
  `pip install -e` does not properly install console_scripts outside of a virtualenv. In addition, a virtualenv eliminates the need to use update-alternatives. This also fixes apt, which can continue to use system python.
* Fix the run directory uploader
* Add ninja for deepspeed test
* Testing
* Fixed pytorch version in jenkinsfile
* Adding git to the jenkinsfile
* Update Dockerfile
  Installing git by default
* Update Dockerfile
  Added `--without-pip` since pip comes from system python. Setuptools breaks when using `--system-site-packages` without `--without-pip`
* Fixed Dockerfile virtualenv
  Need to install with pip (so pip is relative to the installation directory), but immediately upgrade it since the default setuptools is broken.
* Fixed python virtualenv in the dockerfile
* testing
* Update Dockerfile
  Including `--system-site-packages` with the `--upgrade` command. Otherwise it reverts to not including system packages.
* Restore setting the NCCL version
* Fixed pip
* Update Dockerfile
  Keep the nccl hack to fix gcp
* More docker changes
* Use the bash shell
* Update the default path; allow downgrades
* testing
* Testing
* Fixed ubuntu version
* testing
* Added virtualenv arg
* null node selector cpu
* Global virtualenv
  Make one global virtualenv. Works in both user mode and root mode. Compatible if the user overrides it with their own virtualenv.
* Run on colo; fix docker for noninteractive shells
* Fix for non-interactive shells
* Update Dockerfile
  Fix for non-interactive shells
* A yapf update broke some formatting...re-running the linter
* testing
* Enabled dockerfile matrix build; switched to 3080s
* Increase timeout for test_blurmaxpool_shapes
* Use deterministic mode
* Deterministic mode for test_checkpoint
* Fix deterministic mode
* Early check for CUBLAS_WORKSPACE_CONFIG when using deterministic mode
* Using colo to run all pytest
* auto setting CUBLAS_WORKSPACE_CONFIG
* Increase limits
* Fix nit
* Removed change
* Address PR feedback; fix zsh
* Fixes
* Added --no-cache-dir
* Switched to 3090s
* Running deepspeed tests via jenkins
  Fixed ddp test incorrectly marked as gpu when it should be marked as deepspeed
* Node without label
* Switched cloud to colo-research-01
* Fixes
* Simplifying PR
* Make the run directory rank-local; fix checkpoints saving and restoring
  - Sharding of the run directory across ranks won't work in multi-node training. This change makes the run directory rank-local
  - Fixed callbacks and loggers to support rank-local run directories. Specifically, wandb and the run directory uploader now run on all ranks, not just rank zero
  - When using deepspeed with zero-1+, each rank writes to the checkpoint folder. Previously, only rank zero's data was being stored. Now, each rank's data is being stored by the rank-local run directory uploader. The checkpoint loader takes a checkpoint path that is parameterized by the rank, so each node will load only the shards of the checkpoint that are needed.
* Fixed checkpointing tests
* Fixed the node selector; only running deepspeed tests for the time being
* Added build system to pyproject.toml
* Testing
* Fixed isort
* Re-enable python tests
* testing
* Fixed isort
* Fixing deepspeed conditional import
* Speeding up logger test
* Adjusted k8s limits
* Fixed jenkinsfile
* Fixed missing values
* testing
* Fixed typos
* Update Jenkinsfile
  Fixed cpu limits
* Update Jenkinsfile
* Fixing some of the slow tests
* Making tests faster
* Fixed broken tests
* Addressed PR feedback
* Formatting
  Removed run_directory.get_relative_to_run_directory
* Added docstrings
* Lowered the CPU limit
* Fixed tests
* Pinning yapf to 0.31.0 to see if that fixes a concurrency bug
* Bump yapf version
* Fix github status check names
* Updated the README
* Fixed tests
* Addressed PR feedback
* Added lint script to repo; using new Jenkins scratch/command
* Fix typo
* Fixed closure
* Added missing commas
* fix typo
* Added debugging
* Fix the script
* Remove echo
* Dockerfile fix
* Fix jenkinsfile
* Updated shebangs
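
Two of the changes listed in this commit message, automatic DDP port selection and automatic setting of `CUBLAS_WORKSPACE_CONFIG` for deterministic mode, can be illustrated with a minimal standard-library sketch. This is not the code from the commit: the helper names `find_free_port` and `ensure_cublas_workspace_config` are hypothetical, and the default value `:4096:8` is an assumption (it is one of the settings commonly recommended for deterministic cuBLAS behavior; the commit message does not say which value was chosen).

```python
import os
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0 (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 tells the kernel to pick any currently free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


def ensure_cublas_workspace_config(default: str = ":4096:8") -> str:
    """Set CUBLAS_WORKSPACE_CONFIG only when it is not already set (hypothetical helper).

    The default value is an assumption, not taken from the commit.
    """
    # setdefault leaves a user-provided value untouched and returns the effective value.
    return os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", default)
```

Note that the port is released as soon as the socket closes, so another process could claim it before the training launcher binds to it; a real launcher would likely need to tolerate or retry on that race.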