Test Llama, rebalancing, throughput eval, and all CLI scripts #452

Merged: 32 commits, Aug 8, 2023

Changes from 1 commit
Commits (32, all by borzunov, Aug 8, 2023):

69abacc  Show argparse defaults, fix docstring
ca2850e  Test petals.cli.run_dht
816401e  Increase mean_block_selection_delay
7330653  Test rebalancing
a00e79d  Add help to benchmark argparse
5b3d4c4  Use less RAM
2b765b9  Don't set default model in benchmarks
fae58d9  Fix sleep time
856f53f  Test --throughput eval
05dc383  Fix flapping test
18e5b00  Use AutoDistributed{Config,Model} in tests
168e478  Add Maykeye/TinyLLama-v0 to tests
5760b15  Test using includes only
015238a  Adjust --num_blocks and --block_indices for 8-layer TinyLlama-v0
17cae64  Refactor matrix
b7b7464  Fix commands
c907990  Skip TP tests for llama
0040539  Fix test_greedy_generation() for llama
a5a95c4  Fix commands
c3e7638  Fix test_server_info()
b622a14  Fix server layout
8a379aa  Try reducing RAM usage
ecd7d3f  Check if benchmarks work
6ffbc28  Watch free RAM (common issue in CI)
033a3ca  Reduce RAM further
f06cebd  Tune constants to save RAM
47d2d53  Speed benchmark tests
d8e08e6  Fix flapping test
315c5c6  Try --no_relay
5cbb33b  Increase swap space
54cd213  Fix flapping test
1e34dfd  Fix flapping test
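
Taken together, these commits make the CI spin up a miniature Petals swarm and exercise every CLI script against it. As a rough sketch of the launch pattern being tested (distilled from the workflow diff below; the model name, port, and peer ID here are placeholders, not the CI's exact values):

# Minimal local swarm, mirroring the pattern tested below (a sketch, not the exact CI script).
MODEL=bigscience/bloom-560m  # one of the models from the CI matrix

# 1. Start a bootstrap DHT node pinned to a local address.
python -m petals.cli.run_dht --identity_path tests/bootstrap.id \
  --host_maddrs /ip4/127.0.0.1/tcp/31337 &> bootstrap.log &

# 2. Point servers at the bootstrap peer (the peer ID below is a placeholder).
export INITIAL_PEERS=/ip4/127.0.0.1/tcp/31337/p2p/<bootstrap_peer_id>
sleep 5  # wait for DHT init

# 3. Launch a server hosting a slice of the model; more servers join the same way.
python -m petals.cli.run_server $MODEL --torch_dtype float32 --num_blocks 8 \
  --initial_peers $INITIAL_PEERS --throughput auto &> server1.log &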
Commit being viewed: Reduce RAM further
borzunov committed Aug 8, 2023
commit 033a3ca69d7745f61bb8d0d918d3369d0808890f
30 changes: 17 additions & 13 deletions .github/workflows/run-tests.yaml
@@ -42,9 +42,13 @@ jobs:
       export ADAPTER_NAME="${{ matrix.model == 'bigscience/bloom-560m' && 'artek0chumak/bloom-560m-safe-peft' || '' }}"
       export TENSOR_PARALLEL_ARGS="${{ matrix.model == 'bigscience/bloom-560m' && '--tensor_parallel_devices cpu cpu' || '' }}"

+      # [Step 1] Watch free RAM (lack of RAM is a common issue in CI)
+
       bash -c 'while true; do free -h && sleep 10s; done' &
       RAM_WATCH_PID=$!

+      # [Step 2] Set up a tiny test swarm (see https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm)
+
       python -m petals.cli.run_dht --identity_path tests/bootstrap.id --host_maddrs /ip4/127.0.0.1/tcp/31337 &> bootstrap.log &
       BOOTSTRAP_PID=$!

@@ -53,30 +57,26 @@ jobs:
       sleep 5  # wait for DHT init

-      python -m petals.cli.run_server $MODEL_NAME --adapters $ADAPTER_NAME --torch_dtype float32 --num_blocks 7 \
+      python -m petals.cli.run_server $MODEL_NAME --adapters $ADAPTER_NAME --torch_dtype float32 --num_blocks 8 \
         --mean_balance_check_period 10 \
         --initial_peers $INITIAL_PEERS --throughput 1 &> server1.log &
       SERVER1_PID=$!
       # ^-- this server should choose blocks 0:3, then see that blocks 22:24 are not covered and move to 21:24

       sleep 10  # wait for the 1st server to choose blocks

-      python -m petals.cli.run_server $MODEL_NAME --adapters $ADAPTER_NAME --torch_dtype float32 --block_indices 0:7 \
+      python -m petals.cli.run_server $MODEL_NAME --adapters $ADAPTER_NAME --torch_dtype float32 --block_indices 0:8 \
         --attn_cache_tokens 2048 --max_chunk_size_bytes 1024 --identity_path tests/server2.id \
         --initial_peers $INITIAL_PEERS --throughput 1 &> server2.log &
       SERVER2_PID=$!

-      python -m petals.cli.run_server $MODEL_NAME --adapters $ADAPTER_NAME --torch_dtype float32 --num_blocks 7 \
+      python -m petals.cli.run_server $MODEL_NAME --adapters $ADAPTER_NAME --torch_dtype float32 --num_blocks 8 \
         --initial_peers $INITIAL_PEERS --throughput auto &> server3.log &
       SERVER3_PID=$!

-      python -m petals.cli.run_server $MODEL_NAME --adapters $ADAPTER_NAME --torch_dtype float32 --num_blocks 3 \
+      python -m petals.cli.run_server $MODEL_NAME $TENSOR_PARALLEL_ARGS --torch_dtype float32 --block_indices 0:2 \
         --initial_peers $INITIAL_PEERS --throughput auto &> server4.log &
       SERVER4_PID=$!

-      python -m petals.cli.run_server $MODEL_NAME $TENSOR_PARALLEL_ARGS --torch_dtype float32 --block_indices 0:2 \
-        --initial_peers $INITIAL_PEERS --throughput auto &> server5.log &
-      SERVER5_PID=$!
       # ^-- tensor parallelism is not compatible with adapters yet + we test a server without adapters in the swarm

       sleep 5  # wait for the log files to appear
@@ -85,12 +85,14 @@ jobs:
       LOGGER_PID=$!

       sleep 30  # wait for servers to eval throughput, download layers, and rebalance
-      kill -0 $BOOTSTRAP_PID $SERVER1_PID $SERVER2_PID $SERVER3_PID $SERVER4_PID $SERVER5_PID # ensure all peers survived init
+      kill -0 $BOOTSTRAP_PID $SERVER1_PID $SERVER2_PID $SERVER3_PID $SERVER4_PID # ensure all peers survived init

-      # run standard tests
+      # [Step 3] Run PyTest
+
       pytest tests --durations=0 --durations-min=1.0 -v

-      # check if benchmarks run (the numbers won't show anything due to small models, CPU servers, and low --n_steps)
+      # [Step 4] Check if benchmarks work (their results here are meaningless since it's a tiny swarm of CPU servers)
+
       python benchmarks/benchmark_inference.py --model $MODEL_NAME --initial_peers $INITIAL_PEERS --torch_dtype float32 \
         --seq_len 3
       python benchmarks/benchmark_forward.py --model $MODEL_NAME --initial_peers $INITIAL_PEERS --torch_dtype float32 \
@@ -100,7 +102,9 @@ jobs:
       python benchmarks/benchmark_training.py --model $MODEL_NAME --initial_peers $INITIAL_PEERS --torch_dtype float32 \
         --seq_len 3 --pre_seq_len 3 --n_steps 3 --batch_size 3 --task causal_lm

-      kill -0 $BOOTSTRAP_PID $SERVER1_PID $SERVER2_PID $SERVER3_PID $SERVER4_PID $SERVER5_PID # ensure all peers survived tests
+      # [Step 5] Clean up
+
+      kill -0 $BOOTSTRAP_PID $SERVER1_PID $SERVER2_PID $SERVER3_PID $SERVER4_PID # ensure all peers survived tests

-      kill -s SIGINT $BOOTSTRAP_PID $SERVER1_PID $SERVER2_PID $SERVER3_PID $SERVER4_PID $SERVER5_PID $LOGGER_PID $RAM_WATCH_PID
+      kill -s SIGINT $BOOTSTRAP_PID $SERVER1_PID $SERVER2_PID $SERVER3_PID $SERVER4_PID $LOGGER_PID $RAM_WATCH_PID
       echo "Done!"