WIP: Composer Jenkinsfile #82

Merged: ravi-mosaicml merged 202 commits into dev from ravi/jenkinsfile on Jan 20, 2022

Conversation

ravi-mosaicml (Contributor) commented Nov 12, 2021

This PR removes GitHub Actions and switches to running the tests on Jenkins.

TODO:

ravi-mosaicml changed the base branch from dev to ravi/rank_local_run_directory January 14, 2022 00:31
Base automatically changed from ravi/rank_local_run_directory to dev January 14, 2022 23:07
nlsapp (Member) left a comment:

Cleanup debug statements, check the questions, otherwise looks good

nlsapp (Member) left a comment:

LGTM, although I'd prefer to see the commit counter hit 200 🤣

ravi-mosaicml (Contributor, Author) replied:

> LGTM, although I'd prefer to see the commit counter hit 200 🤣

Feedback addressed 🤣

nlsapp (Member) left a comment:

Approved - I'd prefer to see safer shebangs in scripts (#!/usr/bin/env <>), but this can be addressed later

ravi-mosaicml merged commit a09056f into dev Jan 20, 2022
ravi-mosaicml deleted the ravi/jenkinsfile branch January 20, 2022 00:50
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
* Composer Jenkinsfile

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* Fixed exit

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* Update README.md

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Removing bad symlink

* DDP Port Auto Selection; Removed spawning in tests
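
One common way to auto-select a port for DDP is to bind a socket to port 0 and let the operating system pick an unused one. The sketch below illustrates that idea only; the function name `find_free_port` is hypothetical and the code is not taken from this PR.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 tells the OS to choose any free port.
        sock.bind((host, 0))
        # For AF_INET sockets, getsockname() returns (address, port).
        return sock.getsockname()[1]
```

The port is released when the socket closes, so another process could claim it before the trainer binds to it. A launcher using this approach would typically pick the port once and pass it to every rank.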

* Fixed jenkinsfile

* Fixed missing tests

* Docker builds

* Smaller build matrix

* Not running dev checks when building images

* testing

* testing

* testing

* Gpu tests

* Typo fix

* Testing

* Testing

* Increased cpu limit

* Added log warning

* Ensuring that the launch script raises on sigkilled processes
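
As a rough illustration of what raising on SIGKILLed processes can look like: on POSIX, Python's `subprocess` reports a child terminated by signal N with the return code `-N`, so a launcher has to treat negative codes as failures too. The helper name `check_exit` is hypothetical and this is not the PR's launch script.

```python
import signal
import subprocess


def check_exit(proc: subprocess.Popen) -> None:
    """Raise if the child exited non-zero or was terminated by a signal.

    Hypothetical helper for illustration; assumes a POSIX platform.
    """
    returncode = proc.wait()
    if returncode < 0:
        # A negative return code -N means the child was killed by signal N.
        name = signal.Signals(-returncode).name
        raise RuntimeError(f"process {proc.pid} was killed by {name}")
    if returncode != 0:
        raise RuntimeError(f"process {proc.pid} exited with code {returncode}")
```

A SIGKILL often comes from the kernel's out-of-memory killer, which may be why the memory limit was raised in the next commit.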

* Upped the memory limit

* Configure a default virtualenv in the dockerfile

`pip install -e` does not properly install console_scripts outside of a virtualenv

In addition, a virtualenv eliminates the need to use `update-alternatives`. This also fixes apt, which can continue to use the system Python.

* Fix the run directory uploader

* Add ninja for deepspeed test

* Testing

* Fixed pytorch version in jenkinsfile

* Adding git to the jenkinsfile

* Update Dockerfile

Installing git by default

* Update Dockerfile

Added `--without-pip` since pip comes from system python. Setuptools breaks when using `--system-site-packages` without `--without-pip`

* Fixed Dockerfile virtualenv

Need to install with pip (so pip is relative to the installation directory), but immediately upgrade it since the default setuptools is broken.

* Fixed python virtualenv in the dockerfile

* testing

* testing

* Update Dockerfile

Including `--system-site-packages` with the `--upgrade` command. Otherwise it reverts to not including system packages.

* Restore setting the NCCL version

* Fixed pip

* Update Dockerfile

Keep the nccl hack to fix gcp

* More docker changes

* Use the bash shell

* Update the default path; allow downgrades

* testing

* Testing

* Fixed ubuntu version

* testing

* Added virtualenv arg

* null node selector cpu

* Global virtualenv

Make one global virtualenv. Works in both user mode and root mode. Compatible if the user overrides it with their own virtualenv.

* Global virtualenv

Make one global virtualenv. Works in both user mode and root mode. Compatible if the user overrides it with their own virtualenv.

* Run on colo; fix docker for noninteractive shells

* Fix for non-interactive shells

* Update Dockerfile

Fix for non-interactive shells

* A yapf update broke some formatting...re-running the linter

* testing

* testing

* Enabled dockerfile matrix build; switched to 3080s

* Increase timeout for test_blurmaxpool_shapes

* Use deterministic mode

* Deterministic mode for test_checkpoint

* Fix deterministic mode

* Early check for CUBLAS_WORKSPACE_CONFIG when using deterministic mode

* Using colo to run all pytest

* auto setting CUBLAS_WORKSPACE_CONFIG
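
For context, PyTorch's reproducibility notes say that deterministic cuBLAS behavior on CUDA 10.2 and later requires `CUBLAS_WORKSPACE_CONFIG` to be `:4096:8` or `:16:8`. A minimal sketch of auto-setting it follows. The function name is hypothetical, the choice of `:4096:8` as the default is an assumption, and this is not the code from this PR.

```python
import os
from typing import MutableMapping


def ensure_cublas_workspace_config(
    env: MutableMapping[str, str] = os.environ,
    default: str = ":4096:8",  # assumed default; ":16:8" is the other documented value
) -> str:
    """Set CUBLAS_WORKSPACE_CONFIG if it is unset and return the effective value.

    Hypothetical helper for illustration. A value the user already set is kept.
    """
    return env.setdefault("CUBLAS_WORKSPACE_CONFIG", default)
```

The variable most likely needs to be set before the process first uses CUDA, so a call like this belongs near the start of the program.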

* Increase limits

* Fix nit

* Removed change

* Address PR feedback; fix zsh

* Fixes

* Added --no-cache-dir

* Switched to 3090s

* Running deepspeed tests via jenkins
Fixed ddp test incorrectly marked as gpu when it should be marked as deepspeed

* Node without label

* Switched cloud to colo-research-01

* Fixes

* Simplifying PR

* Make the run directory rank-local; fix checkpoints saving and restoring

- Sharding the run directory across ranks won't work in multi-node training. This change makes the run directory rank-local.
- Fixed callbacks and loggers to support rank-local run directories. Specifically, wandb and the run directory uploader now run on all ranks, not just rank zero.
- When using deepspeed with zero-1+, each rank writes to the checkpoint folder. Previously, only rank zero's data was being stored. Now, each rank's data is stored by the rank-local run directory uploader. The checkpoint loader takes a checkpoint path that is parameterized by the rank, so each rank loads only the checkpoint shards it needs.
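
The rank-parameterized paths described above could look roughly like the sketch below. The `rank_{rank}` directory layout, the `{rank}` placeholder, and both function names are assumptions made for illustration; the PR's actual naming may differ.

```python
from pathlib import PurePosixPath


def rank_local_run_directory(base: str, rank: int) -> PurePosixPath:
    """Return a per-rank subdirectory of the run directory (assumed layout)."""
    return PurePosixPath(base) / f"rank_{rank}"


def checkpoint_path_for_rank(template: str, rank: int) -> str:
    """Fill the (assumed) {rank} placeholder so each rank loads its own shard."""
    return template.format(rank=rank)
```

For example, with a hypothetical template `runs/exp1/rank_{rank}/checkpoints/latest.pt`, rank 2 would resolve `runs/exp1/rank_2/checkpoints/latest.pt` and never read the other ranks' shards.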

* Fixed checkpointing tests

* Fixed the node selector; only running deepspeed tests for the time being

* Added build system to pyproject.toml

* Testing

* Fixed isort

* Re-enable python tests

* testing

* testing

* Fixed isort

* Fixing deepspeed conditional import

* Speeding up logger test

* Adjusted k8s limits

* Fixed jenkinsfile

* Fixed missing values

* testing

* Fixed typos

* Update Jenkinsfile

Fixed cpu limits

* Update Jenkinsfile

* Fixing some of the slow tests

* Making tests faster

* Fixed broken tests

* Addressed PR feedback

* Formatting

Removed run_directory.get_relative_to_run_directory

* Added docstrings

* Lowered the CPU limit

* Fixed tests

* Pinning yapf to 0.31.0 to see if that fixes a concurrency bug

* Bump yapf version

* Fix github status check names

* Updated the README

* Fixed tests

* Addressed PR feedback

* Added lint script to repo; using new Jenkins scratch/command

* Fix typo

* Fixed closure

* Added missing commas

* fix typo

* Added debugging

* Fix the script

* Remove echo

* Dockerfile fix

* Fix jenkinsfile

* Updated shebangs
Linked issues that merging this pull request may close: Use GPUs in tests; Configure Jenkins