Patch MPFuture to not use shared memory #230

Merged
borzunov merged 1 commit into main from patch-mpfuture
Jan 22, 2023

@borzunov borzunov commented Jan 22, 2023

Crashes related to shared memory, like the one below, periodically happen on servers (which causes swarm downtime), bootstrap peers, and the health monitor:

Jan 20 03:01:05 petals-bootstrap2 python[32150]: terminate called after throwing an instance of 'c10::Error'
Jan 20 03:01:05 petals-bootstrap2 python[32150]:   what():  could not unlink the shared memory file /torch_32150_2656295120_383
Jan 20 03:01:05 petals-bootstrap2 python[32150]: Exception raised from close at ../aten/src/ATen/MapAllocator.cpp:514 (most recent call first):
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f3f0aaa156e in /home/borzunov/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x7f3f0aa6bf18 in /home/borzunov/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #2: at::RefcountedMapAllocator::close() + 0xd1 (0x7f3ef0d11141 in /home/borzunov/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #3: THManagedMapAllocator::close() + 0x4b (0x7f3f0b0773ab in /home/borzunov/.local/lib/python3.10/site-packages/torch/lib/libshm.so)
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #4: <unknown function> + 0x4433 (0x7f3f0b077433 in /home/borzunov/.local/lib/python3.10/site-packages/torch/lib/libshm.so)
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #5: <unknown function> + 0x4ac468 (0x7f3f09cac468 in /home/borzunov/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #6: <unknown function> + 0x40465 (0x7f3f0aa89465 in /home/borzunov/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #7: c10::TensorImpl::~TensorImpl() + 0x2ca (0x7f3f0aa827ea in /home/borzunov/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f3f0aa82939 in /home/borzunov/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #9: <unknown function> + 0x6f6268 (0x7f3f09ef6268 in /home/borzunov/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #10: THPVariable_subclass_dealloc(_object*) + 0x2a5 (0x7f3f09ef6555 in /home/borzunov/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
Jan 20 03:01:05 petals-bootstrap2 python[32150]: <omitting python frames>
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #23: <unknown function> + 0x90402 (0x7f3f0ba90402 in /lib/x86_64-linux-gnu/libc.so.6)
Jan 20 03:01:05 petals-bootstrap2 python[32150]: frame #24: <unknown function> + 0x11f590 (0x7f3f0bb1f590 in /lib/x86_64-linux-gnu/libc.so.6)

Currently, this happens once every 1-3 days even on bootstrap peers. On servers, it happens more often. There is also a hypothesis that it happens when people run fine-tuning.

This PR monkey-patches hivemind's MPFuture so that it does not use shared memory, since we don't need the .cancel() functionality in hivemind. This should work around the issue until we ship a proper hivemind fix.
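As a rough illustration of the monkey-patching approach described above, here is a minimal sketch. It is not the code from this PR and does not use hivemind's API: `ToyFuture`, `_allocate_state`, and `patch_toy_future` are hypothetical names invented for the example. It assumes the future keeps its state in shared memory only so that another process can request cancellation, and it swaps that allocation for process-local memory at import time.

```python
from multiprocessing import shared_memory


class ToyFuture:
    """Hypothetical stand-in for a cross-process future; NOT hivemind's MPFuture."""

    def __init__(self):
        self._state = self._allocate_state()

    def _allocate_state(self):
        # Original behavior: one byte of shared memory, so that another
        # process could flip it to request cancellation.
        self._shm = shared_memory.SharedMemory(create=True, size=1)
        return self._shm.buf

    def cancel(self):
        self._state[0] = 1
        return True

    def cancelled(self):
        return self._state[0] == 1


def patch_toy_future():
    """Replace the shared-memory allocation with process-local memory."""

    def _allocate_local_state(self):
        # Process-local buffer: no shared memory file to unlink on destruction.
        return bytearray(1)

    def _cancel_unsupported(self):
        # Other processes can no longer observe the state, so refuse to cancel.
        return False

    # Rebinding class attributes changes behavior for every instance
    # created afterwards, without touching the library's source code.
    ToyFuture._allocate_state = _allocate_local_state
    ToyFuture.cancel = _cancel_unsupported


patch_toy_future()
future = ToyFuture()
print(type(future._state).__name__, future.cancel(), future.cancelled())
```

The trade-off in this sketch is the same one the description names: the patched class gives up cancellation in exchange for never creating a shared memory segment.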

@borzunov borzunov changed the title from "Patch MPFuture to don't use shared memory" to "Patch MPFuture to not use shared memory" Jan 22, 2023
@borzunov borzunov merged commit 2d4653b into main Jan 22, 2023
@borzunov borzunov deleted the patch-mpfuture branch January 22, 2023 05:37
@borzunov

[I have rolled back the merge due to unexpected server behavior. The servers actually started to crash even more often, which is very surprising.]
