WIP: Composer Jenkinsfile #82
Cleanup debug statements, check the questions, otherwise looks good
LGTM, although I'd prefer to see the commit counter hit 200 🤣
Feedback addressed 🤣
Approved - I'd prefer to see safer shebangs in scripts (`#!/usr/bin/env <>`), but this can be addressed later
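For illustration only, the shebang suggestion above could be checked with something like the following sketch. The helper names and the regular expression are hypothetical and not part of this PR; the sketch assumes "safer" means the first line resolves the interpreter through `/usr/bin/env` rather than a hardcoded path.

```python
import re

# Hypothetical pattern (an assumption, not taken from this PR):
# a first line of the form "#!/usr/bin/env <interpreter>".
_ENV_SHEBANG = re.compile(r"^#!\s*/usr/bin/env\s+\S+")


def _first_line(script_text: str) -> str:
    """Return the first line of a script's text."""
    return script_text.split("\n", 1)[0]


def uses_env_shebang(script_text: str) -> bool:
    """True if the script starts with an env-style shebang."""
    return bool(_ENV_SHEBANG.match(_first_line(script_text)))


def has_hardcoded_shebang(script_text: str) -> bool:
    """True if the script has a shebang that does not go through env."""
    first = _first_line(script_text)
    return first.startswith("#!") and not uses_env_shebang(script_text)
```

A script beginning with `#!/bin/bash` would be flagged by `has_hardcoded_shebang`, while one beginning with `#!/usr/bin/env bash` would not.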
* Composer Jenkinsfile
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* Fixed exit
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* Update README.md
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* Update Jenkinsfile
* Update Jenkinsfile
* Update Jenkinsfile
* Removing bad symlink
* DDP Port Auto Selection; Removed spawning in tests
* Fixed jenkinsfile
* Fixed missing tests
* Docker builds
* Smaller build matrix
* Not running dev checks when building images
* testing
* testing
* testing
* Gpu tests
* Typo fix
* Testing
* Testing
* Increased cpu limit
* Added log warning
* Ensuring that the launch script raises on sigkilled processes
* Upped the memory limit
* Configure a default virtualenv in the dockerfile
  `pip install -e` does not properly install console_scripts outside of a virtualenv. In addition, a virtualenv eliminates the need to use update-alternatives. This also fixes apt, which can continue to use system python.
* Fix the run directory uploader
* Add ninja for deepspeed test
* Testing
* Fixed pytorch version in jenkinsfile
* Adding git to the jenkinsfile
* Update Dockerfile
  Installing git by default
* Update Dockerfile
  Added `--without-pip` since pip comes from system python. Setuptools breaks when using `--system-site-packages` without `--without-pip`
* Fixed Dockerfile virtualenv
  Need to install with pip (so pip is relative to the installation directory), but immediately upgrade it since the default setuptools is broken.
* Fixed python virtualenv in the dockerfile
* testing
* testing
* Update Dockerfile
  Including `--system-site-packages` with the `--upgrade` command. Otherwise it reverts to not including system packages.
* Restore setting the NCCL version
* Fixed pip
* Update Dockerfile
  Keep the nccl hack to fix gcp
* More docker changes
* Use the bash shell
* Update the default path; allow downgrades
* testing
* Testing
* Fixed ubuntu version
* testing
* Added virtualenv arg
* null node selector cpu
* Global virtualenv
  Make one global virtualenv. Works in both user mode and root mode. Compatible if the user overrides it with their own virtualenv.
* Global virtualenv
  Make one global virtualenv. Works in both user mode and root mode. Compatible if the user overrides it with their own virtualenv.
* Run on colo; fix docker for noninteractive shells
* Fix for non-interactive shells
* Update Dockerfile
  Fix for non-interactive shells
* A yapf update broke some formatting...re-running the linter
* testing
* testing
* Enabled dockerfile matrix build; switched to 3080s
* Increase timeout for test_blurmaxpool_shapes
* Use deterministic mode
* Deterministic mode for test_checkpoint
* Fix deterministic mode
* Early check for CUBLAS_WORKSPACE_CONFIG when using deterministic mode
* Using colo to run all pytest
* auto setting CUBLAS_WORKSPACE_CONFIG
* Increase limits
* Fix nit
* Removed change
* Address PR feedback; fix zsh
* Fixes
* Added --no-cache-dir
* Switched to 3090s
* Running deepspeed tests via jenkins
  Fixed ddp test incorrectly marked as gpu when it should be marked as deepspeed
* Node without label
* Switched cloud to colo-research-01
* Fixes
* Simplifying PR
* Make the run directory rank-local; fix checkpoints saving and restoring
  - Sharding of the run directory across ranks won't work in multi-node training. This change makes the run directory rank-local
  - Fixed callbacks and loggers to support rank-local run directories. Specifically, wandb and the run directory uploader now run on all ranks, not just rank zero
  - When using deepspeed with zero-1+, each rank writes to the checkpoint folder. Previously, only rank zero's data was being stored. Now, each rank's data is being stored by the rank-local run directory uploader. The checkpoint loader takes a checkpoint path that is parameterized by the rank, so each node will load only the shards of the checkpoint that are needed.
* Fixed checkpointing tests
* Fixed the node selector; only running deepspeed tests for the time being
* Added build system to pyproject.toml
* Testing
* Fixed isort
* Re-enable python tests
* testing
* testing
* Fixed isort
* Fixing deepspeed conditional import
* Speeding up logger test
* Adjusted k8s limits
* Fixed jenkinsfile
* Fixed missing values
* testing
* Fixed typos
* Update Jenkinsfile
  Fixed cpu limits
* Update Jenkinsfile
* Fixing some of the slow tests
* Making tests faster
* Fixed broken tests
* Addressed PR feedback
* Formatting
  Removed run_directory.get_relative_to_run_directory
* Added docstrings
* Lowered the CPU limit
* Fixed tests
* Pinning yapf to 0.31.0 to see if that fixes a concurrency bug
* Bump yapf version
* Fix github status check names
* Updated the README
* Fixed tests
* Addressed PR feedback
* Added lint script to repo; using new Jenkins scratch/command
* Fix typo
* Fixed closure
* Added missing commas
* fix typo
* Added debugging
* Fix the script
* Remove echo
* Dockerfile fix
* Fix jenkinsfile
* Updated shebangs
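As a rough illustration of the "Make the run directory rank-local" commit above, the idea of a per-rank run directory and a checkpoint path parameterized by rank might look like the sketch below. The function names, the `rank_{N}` folder layout, and the `{rank}` placeholder are assumptions made for this example; they are not taken from the Composer code in this PR.

```python
import os


def rank_local_run_directory(base_dir: str, global_rank: int) -> str:
    """Return a run directory that is private to one rank.

    Hypothetical layout: <base_dir>/rank_<global_rank>. Each rank writes
    only here, so nothing depends on a filesystem shared across nodes.
    """
    return os.path.join(base_dir, f"rank_{global_rank}")


def resolve_checkpoint_path(path_template: str, global_rank: int) -> str:
    """Fill in a rank-parameterized checkpoint path.

    The "{rank}" placeholder is an assumed convention for this sketch.
    Each rank resolves the template with its own rank and so loads only
    its own shard of the checkpoint.
    """
    return path_template.format(rank=global_rank)
```

For example, rank 3 would resolve the template `checkpoints/rank_{rank}/ep1.tar` to `checkpoints/rank_3/ep1.tar`, which matches the behavior described for deepspeed zero-1+ where every rank writes and later reads its own checkpoint data.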
This PR removes GitHub Actions and switches to running tests on Jenkins.
TODO:
* @nlsapp @connor-m-kaz Configure the remote jenkinsfile plugin so we can store the `.ci/Jenkinsfile` in a separate repo (such as https://github.com/mosaicml/testing). Deferring for "Move the Jenkinsfile out of `composer`; use the remote jenkinsfile plugin" #247
* @ravi-mosaicml Write a stub jenkinsfile that the above remote plugin would point to. This stub file should load the correct jenkinsfile for the repo. Deferring for "Move the Jenkinsfile out of `composer`; use the remote jenkinsfile plugin" #247
* @nlsapp @connor-m-kaz Configure jenkins reports to be public (without authentication), so public users can view Jenkins test reports which are auto-linked to the PR. Deferring for "Make Jenkins reports public for 3rd party contributors" #248
* @ravi-mosaicml Configure commit trigger strings that can run specific jenkins jobs. Right now, let's run everything.