WIP: Composer Jenkinsfile (mosaicml#82)
* Composer Jenkinsfile

* testing

* Fixed exit

* testing

* Update README.md

* testing

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Removing bad symlink

* DDP Port Auto Selection; Removed spawning in tests

* Fixed jenkinsfile

* Fixed missing tests

* Docker builds

* Smaller build matrix

* Not running dev checks when building images

* testing

* Gpu tests

* Typo fix

* Testing

* Increased cpu limit

* Added log warning

* Ensuring that the launch script raises on sigkilled processes
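A note on the fix above: a process terminated by a signal exits with status 128 + the signal number, so a SIGKILLed child reports 137. The sketch below shows the general shell idiom for detecting that (it is an illustration only, not Composer's actual Python launcher):

```shell
#!/usr/bin/env bash
# Stand-in for a worker process that gets SIGKILLed; the inner shell kills itself.
bash -c 'kill -KILL $$'
status=$?

# Exit statuses above 128 mean "killed by signal (status - 128)".
if [ "$status" -gt 128 ]; then
    echo "worker killed by signal $((status - 128))"
fi
```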

* Upped the memory limit

* Configure a default virtualenv in the dockerfile

`pip install -e` does not properly install console_scripts outside of a virtualenv

In addition, a virtualenv eliminates the need to use `update-alternatives`. This also fixes apt, which can continue to use the system Python.

* Fix the run directory uploader

* Add ninja for deepspeed test

* Testing

* Fixed pytorch version in jenkinsfile

* Adding git to the jenkinsfile

* Update Dockerfile

Installing git by default

* Update Dockerfile

Added `--without-pip` since pip comes from system python. Setuptools breaks when using `--system-site-packages` without `--without-pip`

* Fixed Dockerfile virtualenv

Need to install the virtualenv with pip bundled (so pip is relative to the installation directory), but immediately upgrade it, since the default setuptools is broken.

* Fixed python virtualenv in the dockerfile

* testing

* Update Dockerfile

Including `--system-site-packages` with the `--upgrade` command. Otherwise it reverts to not including system packages.
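Taken together, the virtualenv commits above amount to roughly the following shell sequence (the `/tmp/venv-demo` path is a hypothetical stand-in for the Dockerfile's actual location):

```shell
#!/usr/bin/env bash
set -euo pipefail
VENV=/tmp/venv-demo   # hypothetical location; the real Dockerfile uses its own path

# --system-site-packages: packages installed by apt/system pip stay visible in the venv.
# --without-pip: skip the bundled pip/setuptools, since pip comes from system python
# (the bundled setuptools breaks under --system-site-packages).
python3 -m venv --system-site-packages --without-pip "$VENV"

# A later `--upgrade` must repeat --system-site-packages, or the venv
# silently reverts to not including system packages.
python3 -m venv --upgrade --system-site-packages --without-pip "$VENV"

grep 'include-system-site-packages' "$VENV/pyvenv.cfg"
```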

* Restore setting the NCCL version

* Fixed pip

* Update Dockerfile

Keep the nccl hack to fix gcp

* More docker changes

* Use the bash shell

* Update the default path; allow downgrades

* testing

* Fixed ubuntu version

* testing

* Added virtualenv arg

* null node selector cpu

* Global virtualenv

Make one global virtualenv. Works in both user mode and root mode. Compatible if the user overrides it with their own virtualenv.

* Run on colo; fix docker for noninteractive shells

* Fix for non-interactive shells

* Update Dockerfile

Fix for non-interactive shells
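For context on the non-interactive-shell fixes: `docker run image some-command` starts a non-interactive, non-login shell, which sources neither `/etc/profile` nor `~/.bashrc`, so a virtualenv activated from a profile script is invisible there. The usual fix is to bake the venv onto `PATH` with a Dockerfile `ENV` line (the `/opt/venv` path here is hypothetical):

```shell
#!/usr/bin/env bash
# Dockerfile equivalent (applies to every shell, interactive or not):
#   ENV PATH=/opt/venv/bin:$PATH
# The same effect expressed in plain shell:
export PATH="/opt/venv/bin:$PATH"

# The venv's bin directory now shadows system python for all commands.
case "$PATH" in
    /opt/venv/bin:*) echo "venv first on PATH" ;;
esac
```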

* A yapf update broke some formatting...re-running the linter

* testing

* Enabled dockerfile matrix build; switched to 3080s

* Increase timeout for test_blurmaxpool_shapes

* Use deterministic mode

* Deterministic mode for test_checkpoint

* Fix deterministic mode

* Early check for CUBLAS_WORKSPACE_CONFIG when using deterministic mode

* Using colo to run all pytest

* auto setting CUBLAS_WORKSPACE_CONFIG
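The `CUBLAS_WORKSPACE_CONFIG` commits refer to a cuBLAS requirement: on CUDA 10.2+, PyTorch's deterministic mode raises at runtime unless cuBLAS is given a fixed workspace configuration via this environment variable (`:4096:8` and `:16:8` are the two values the PyTorch reproducibility docs recommend). "Auto setting" it amounts to something like:

```shell
#!/usr/bin/env bash
# Only set a default if the user hasn't already chosen a workspace config.
if [ -z "${CUBLAS_WORKSPACE_CONFIG:-}" ]; then
    export CUBLAS_WORKSPACE_CONFIG=:4096:8
fi
echo "CUBLAS_WORKSPACE_CONFIG=${CUBLAS_WORKSPACE_CONFIG}"
```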

* Increase limits

* Fix nit

* Removed change

* Address PR feedback; fix zsh

* Fixes

* Added --no-cache-dir

* Switched to 3090s

* Running deepspeed tests via jenkins
Fixed ddp test incorrectly marked as gpu when it should be marked as deepspeed

* Node without label

* Switched cloud to colo-research-01

* Fixes

* Simplifying PR

* Make the run directory rank-local; fix checkpoints saving and restoring

- Sharding the run directory across ranks won't work in multi-node training. This change makes the run directory rank-local.
- Fixed callbacks and loggers to support rank-local run directories. Specifically, wandb and the run directory uploader now run on all ranks, not just rank zero.
- When using deepspeed with zero-1+, each rank writes to the checkpoint folder. Previously, only rank zero's data was being stored. Now, each rank's data is stored by the rank-local run directory uploader. The checkpoint loader takes a checkpoint path that is parameterized by the rank, so each node will load only the shards of the checkpoint that are needed.
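The rank-local layout described above can be sketched as follows. All names here (`RANK`, the `/tmp/my_run` path, the checkpoint filename) are hypothetical illustrations, not Composer's actual API:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Each rank gets its own subfolder of the run directory, so a multi-node job
# never has two ranks sharing (or clobbering) the same files.
RANK="${RANK:-0}"
RUN_DIR="/tmp/my_run/rank_${RANK}"
mkdir -p "${RUN_DIR}/checkpoints"

# With deepspeed zero-1+, every rank writes its own checkpoint shard; the
# loader substitutes the rank into the path, so each rank reads only its shard.
touch "${RUN_DIR}/checkpoints/ep1_rank${RANK}.pt"
ls "${RUN_DIR}/checkpoints"
```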

* Fixed checkpointing tests

* Fixed the node selector; only running deepspeed tests for the time being

* Added build system to pyproject.toml

* Testing

* Fixed isort

* Re-enable python tests

* testing

* Fixed isort

* Fixing deepspeed conditional import

* Speeding up logger test

* Adjusted k8s limits

* Fixed jenkinsfile

* Fixed missing values

* testing

* Fixed typos

* Update Jenkinsfile

Fixed cpu limits

* Update Jenkinsfile

* Fixing some of the slow tests

* Making tests faster

* Fixed broken tests

* Addressed PR feedback

* Formatting

Removed run_directory.get_relative_to_run_directory

* Added docstrings

* Lowered the CPU limit

* Fixed tests

* Pinning yapf to 0.31.0 to see if that fixes a concurrency bug

* Bump yapf version

* Fix github status check names

* Updated the README

* Fixed tests

* Addressed PR feedback

* Added lint script to repo; using new Jenkins scratch/command

* Fix typo

* Fixed closure

* Added missing commas

* fix typo

* Added debugging

* Fix the script

* Remove echo

* Dockerfile fix

* Fix jenkinsfile

* Updated shebangs
ravi-mosaicml authored and coryMosaicML committed Feb 23, 2022
1 parent e3c4956 commit 34eff71
Showing 15 changed files with 311 additions and 244 deletions.
243 changes: 243 additions & 0 deletions .ci/Jenkinsfile
@@ -0,0 +1,243 @@
pCloud = "colo-research-01"
gitUrl = null
gitBranch = null
gitCommit = null
pTimeout = '1800' // in seconds
pytorchDockerChanged = null
runWithChecks = null
expandDockerMatrix = null
prChangeset = null
builds = []
jenkinsJobBasePath = "scratch"

def cloneJenkinsfilesRepo() {
    // Clone the remote jenkinsfiles repo into WORKSPACE_TMP
    dir("$WORKSPACE_TMP") {
        checkout([
            $class: 'GitSCM',
            branches: [[name: 'main']], // TODO RJPP_BRANCH
            doGenerateSubmoduleConfigurations: false,
            extensions: [[$class: 'RelativeTargetDirectory', relativeTargetDir: 'jenkinsfiles']],
            submoduleCfg: [],
            userRemoteConfigs: [[url: 'https://github.com/mosaicml/testing', credentialsId: "9cf9add1-2cdd-414b-8160-94bd4ac4a13d"]] // TODO RJPP_SCM_URL
        ])
        return "$WORKSPACE_TMP/jenkinsfiles"
    }
}

def runPytest(Map args) {
    // Run pytest. Parameters:
    //   extraDeps (str, optional): The pip extra deps to install -- e.g. pip install mosaicml[$extraDeps]. (default: `all`)
    //   pythonVersion (str, optional): The python version (should be 3.7, 3.8, or 3.9).
    //       Required if `pDockerImage` is left blank.
    //   gpu (bool, optional): Whether to run tests on a gpu. (default: `false`)
    //   pDockerImage (str, optional): Base docker image to use. Required if `pythonVersion` is left blank.
    def extraDeps = args.extraDeps ?: 'all'
    def pythonVersion = args.pythonVersion
    def gpu = args.gpu ?: false
    def pDockerImage = args.pDockerImage
    def nGpus = "0"
    def memLimit = "7Gi"
    def cpuLimit = "2"
    def markers = '""' // no markers; interpreted as a bash array

    if (gpu) {
        nGpus = "2"
        cpuLimit = "16" // 8 CPUs per GPU
        memLimit = "15Gi" // 7.5Gi per GPU
        markers = '"gpu" "deepspeed"' // interpreted as a bash array
    }

    def name = null
    def title = null
    if (!pDockerImage) {
        if (!pythonVersion) {
            error("pDockerImage or pythonVersion must be specified")
        }
        def pytorchVersion = pythonVersion == "3.9" ? "1.10.0" : "1.9.1"
        name = "pytest/python${pythonVersion}-extraDeps_${extraDeps}-gpu_$gpu"
        title = "Pytest - Python ${pythonVersion}, composer[${extraDeps}] (GPU $gpu)"
        def cudaVersion = "cpu"
        if (gpu) {
            cudaVersion = pythonVersion == "3.9" ? "cu113" : "cu111"
        }
        pDockerImage = "mosaicml/pytorch:${pytorchVersion}_${cudaVersion}-python${pythonVersion}-ubuntu20.04"
    }
    def summary = title

    def pytestCommand = """#!/usr/bin/env bash
set -euxo pipefail
EXTRA_DEPS=$extraDeps
MARKERS=($markers)

# Install dependencies
if [ -z "\${EXTRA_DEPS}" ]; then
    pip install .
else
    pip install .[\${EXTRA_DEPS}]
fi

# Disable WandB. Since WandB may not be installed, ignore errors
set +e
python -m wandb disabled || true
set -e

# If no markers were given, run pytest once over everything;
# otherwise, run pytest once per marker
I=0
if [ -z "\${MARKERS}" ]; then
    JUNIT_PREFIX=build/output/\${BUILD_NUMBER}_\${I} ./scripts/test.sh --test_duration all -v
    ((I=I+1))
else
    for marker in "\${MARKERS[@]}"; do
        # Run the tests
        JUNIT_PREFIX=build/output/\${BUILD_NUMBER}_\${I} ./scripts/test.sh --test_duration all -v -m \$marker
        ((I=I+1))
    done
fi

# Combine the coverage reports
python -m coverage combine
python -m coverage xml -o build/output/\${BUILD_NUMBER}.coverage.xml"""

    def closure = { ->
        builds << build(
            job: "${jenkinsJobBasePath}/command",
            parameters: [
                string(name: 'P_CLOUD', value: pCloud),
                string(name: 'P_GIT_REPO', value: gitUrl),
                string(name: 'P_GIT_COMMIT', value: gitCommit),
                string(name: 'P_DOCKER_IMAGE', value: pDockerImage),
                string(name: 'P_CPU_LIMIT', value: cpuLimit),
                string(name: 'P_MEM_LIMIT', value: memLimit),
                string(name: 'P_TIMEOUT', value: pTimeout),
                string(name: 'P_N_GPUS', value: nGpus),
                text(name: 'P_COMMAND', value: pytestCommand),
                string(name: 'P_ARTIFACTS_GLOB', value: "build/output/*.xml"),
                string(name: 'P_JUNIT_GLOB', value: "build/output/*.junit.xml"),
                string(name: 'P_COVERAGE_GLOB', value: "build/output/*.coverage.xml"),
            ]
        )
    }
    if (name != null && title != null && summary != null) {
        runWithChecks(
            name: name,
            title: title,
            summary: summary,
        ) {
            closure()
        }
    } else {
        closure()
    }
}

stage('Prepare') {
    node(pCloud) {
        def loadedSCM = checkout scm

        gitUrl = loadedSCM.GIT_URL
        gitBranch = loadedSCM.GIT_BRANCH
        gitCommit = loadedSCM.GIT_COMMIT

        echo "gitUrl: $gitUrl"
        echo "gitBranch: $gitBranch"
        echo "gitCommit: $gitCommit"

        def jenkinsfileWorkspace = cloneJenkinsfilesRepo()

        runWithChecks = load "$jenkinsfileWorkspace/utils/runWithChecks.groovy"
        expandDockerMatrix = load "$jenkinsfileWorkspace/utils/expandDockerMatrix.groovy"
        prChangeset = load "$jenkinsfileWorkspace/utils/prChangeset.groovy"

        pytorchDockerChanged = prChangeset("docker/pytorch/")
    }
}

def dockerImagePostBuild(stagingImageTag) {
    if (gitBranch == "main") {
        // no need to run tests again
        return
    }
    runPytest(pDockerImage: stagingImageTag)
}

stage('Build') {
    def jobs = [:]
    if (pytorchDockerChanged) {
        jobs << expandDockerMatrix(
            P_CLOUD: pCloud,
            P_BUILD_MATRIX: './composer/pytorch_build_matrix.sh',
            P_BUILD_MATRIX_GIT_REPO: 'https://github.com/mosaicml/testing.git', // TODO RJPP_SCM_URL
            P_BUILD_MATRIX_GIT_COMMIT: 'main', // TODO RJPP_BRANCH
            P_DOCKERFILE: 'Dockerfile',
            P_BUILD_CONTEXT: './docker/pytorch',
            P_GIT_REPO: gitUrl,
            P_GIT_COMMIT: gitCommit,
            P_CPU_LIMIT: '4',
            P_MEM_LIMIT: '15Gi',
            P_TIMEOUT: pTimeout,
            P_KANIKO_PUSH_FINAL: gitBranch == "dev" || gitBranch == "main", // only push if we're on the main or dev branch
        ) { stagingImage -> dockerImagePostBuild(stagingImage) }
    }
    if (gitBranch != "main" && gitBranch != "dev") {
        // if not on main or dev, run lint and the pytest suites
        jobs << [
            'Lint': { ->
                runWithChecks(
                    name: 'lint',
                    title: 'Lint',
                    summary: 'Static Analysis Checks',
                ) {
                    builds << build(
                        job: "${jenkinsJobBasePath}/command",
                        parameters: [
                            string(name: 'P_CLOUD', value: pCloud),
                            string(name: 'P_GIT_REPO', value: gitUrl),
                            string(name: 'P_GIT_COMMIT', value: gitCommit),
                            string(name: 'P_DOCKER_IMAGE', value: "mosaicml/pytorch:1.10.0_cpu-python3.9-ubuntu20.04"),
                            string(name: 'P_TIMEOUT', value: pTimeout),
                            string(name: 'P_CPU_LIMIT', value: "2"),
                            string(name: 'P_MEM_LIMIT', value: "7Gi"),
                            string(name: 'P_COMMAND', value: './scripts/lint.sh'),
                        ]
                    )
                }
            },
            'Python 3.7 - All': { -> runPytest(pythonVersion: "3.7") },
            'Python 3.8 - All': { -> runPytest(pythonVersion: "3.8") },
            'Python 3.9 - All': { -> runPytest(pythonVersion: "3.9") },
            'Python 3.9 - Dev': { -> runPytest(pythonVersion: "3.9", extraDeps: "dev") },
            'Python 3.9 - All (GPU)': { -> runPytest(pythonVersion: "3.9", gpu: true) },
        ]
    }
    try {
        parallel(jobs)
    }
    finally {
        stage('Merge Artifacts') {
            node(pCloud) {
                checkout scm // check out the SCM so the coverage report can load the source
                builds.each { item ->
                    copyArtifacts(
                        projectName: item.fullProjectName,
                        selector: specific("${item.number}"),
                        fingerprintArtifacts: true,
                        optional: true,
                    )
                }

                sh 'mkdir -p build/output/'

                archiveArtifacts(artifacts: "build/output/*.xml", fingerprint: true, allowEmptyArchive: true)
                junit(allowEmptyResults: true, testResults: "build/output/*.junit.xml")
                publishCoverage(
                    adapters: [cobertura(path: "build/output/*.coverage.xml", mergeToOneReport: true)],
                    calculateDiffForChangeRequests: true,
                    sourceFileResolver: [level: 'STORE_LAST_BUILD']
                )
            }
        }
    }
}
27 changes: 0 additions & 27 deletions .github/actions/isort/action.yaml

This file was deleted.

23 changes: 0 additions & 23 deletions .github/actions/license/action.yaml

This file was deleted.

22 changes: 0 additions & 22 deletions .github/actions/pyright/action.yaml

This file was deleted.

27 changes: 0 additions & 27 deletions .github/actions/yapf/action.yaml

This file was deleted.

41 changes: 0 additions & 41 deletions .github/workflows/formatting.yaml

This file was deleted.
