Machine Learning Engineering Open Book

An open collection of methodologies to help with successful training of large language models and multi-modal models.

This is technical material suitable for LLM/VLM training engineers and operators. That is, the content here contains lots of scripts and copy-n-paste commands to enable you to quickly address your needs.

This repo is an ongoing brain dump of my experiences training Large Language Models (LLMs) and VLMs. I acquired a lot of the know-how while training the open-source BLOOM-176B model in 2022 and the IDEFICS-80B multi-modal model in 2023. Currently, I'm working on developing/training open-source Retrieval Augmented models at Contextual.AI.

I've been compiling this information mostly for myself so that I could quickly find solutions I have already researched in the past and which have worked, but as usual I'm happy to share these with the wider ML community.

Table of Contents

Accelerator - the workhorses of ML - GPUs, TPUs, IPUs, FPGAs, HPUs, QPUs, RDUs (WIP)
Network - intra-node and inter-node connectivity, calculating bandwidth requirements
Storage - local and distributed disks and file systems
CPU - CPUs, affinities (WIP)
CPU Memory - how much CPU memory is enough - the shortest chapter ever.

Part 3. Performance

Part 4. Operating

SLURM
Training hyper-parameters and model initializations
Instabilities

Part 5. Development

Debugging software and hardware failures
And more debugging
Reproducibility
Tensor precision / Data types
HF Transformers notes - making small models, tokenizers, datasets, and other tips

Part 6. Miscellaneous

Resources - LLM/VLM chronicles

PDF version

Download the PDF version of the book.

I will try to rebuild it once a week or so, but if you want the latest, the instructions for building are here.

Thank you HuggingFace for giving me permission to host my book's PDF at the HF hub.

Shortcuts

Things that you are likely to need to find quickly and often.

Tools:

all_reduce_bench.py - a much easier way to benchmark network throughput than nccl-tests.
torch-distributed-gpu-test.py - a tool to quickly test your inter-node connectivity

Guides:

debugging PyTorch applications - quick copy-n-paste solutions to resolve hanging or breaking PyTorch applications
SLURM for users - a SLURM cheatsheet and tricks
make tiny models/datasets/tokenizers
LLM/VLM chronicles collection

Gratitude

None of this would have been possible without my being entrusted with the specific LLM/VLM trainings from which I learned this know-how. This is a privilege that only a few enjoy, due to the prohibitive cost of renting huge ML compute clusters. So hopefully the rest of the ML community will vicariously learn from these notes.

Special thanks go to Thom Wolf, who proposed that I lead the BLOOM-176B training back when I didn't know anything about large-scale training. This was the project that catapulted me into the intense learning process. And, of course, thanks to HuggingFace for giving me the opportunity to work full time on the BLOOM-176B and later the IDEFICS-80B trainings.

Contributing

If you've found a bug or typo, or would like to propose an improvement, please don't hesitate to open an Issue or contribute a PR.

License

The content of this site is distributed under Attribution-ShareAlike 4.0 International.

My repositories map

✔ Machine Learning: ML Engineering Open Book | ML ways | Porting

✔ Guides: The Art of Debugging

✔ Applications: ipyexperiments

