Random Segmentation fault with embedding.search after upserting #498

Closed
Vincent-liuwingsang opened this issue Jul 8, 2023 · 7 comments

@Vincent-liuwingsang

Vincent-liuwingsang commented Jul 8, 2023

I'm getting segmentation faults "randomly" when using txtai with FastAPI. They usually happen after the index is updated.

Example errors are below, from test runs with this setup:

  • sqlite as db
  • 10 embedding.search calls per second via GET (under RLock A)
  • embedding.upsert + embedding.persist for ~2k docs, taking roughly 5 seconds to index (under the same RLock A)
  • the seg fault usually occurs within a couple dozen embedding.search calls after the embedding update is done (e.g. 3-4 seconds after)
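For readers trying to reproduce the setup, the locking arrangement described above can be sketched roughly as follows. This is only an illustration under assumptions: `InMemoryIndex` is a hypothetical stand-in for the real embeddings instance, `GuardedIndex` is a hypothetical wrapper, and the method names `search`, `upsert` and `persist` simply mirror the ones used in this report.

```python
import threading


class InMemoryIndex:
    """Hypothetical stand-in for the embeddings instance (not txtai)."""

    def __init__(self):
        self.docs = {}
        self.persisted = 0

    def upsert(self, documents):
        for uid, text in documents:
            self.docs[uid] = text

    def persist(self):
        self.persisted += 1

    def search(self, query, limit=3):
        hits = [(uid, text) for uid, text in self.docs.items() if query in text]
        return hits[:limit]


class GuardedIndex:
    """Routes every read and write through one shared re-entrant lock."""

    def __init__(self, index):
        self.index = index
        self.lock = threading.RLock()

    def search(self, query, limit=3):
        with self.lock:
            return self.index.search(query, limit)

    def update(self, documents):
        # upsert and persist happen under the same lock as search, so a
        # reader never sees a half-updated index.
        with self.lock:
            self.index.upsert(documents)
            self.index.persist()


guarded = GuardedIndex(InMemoryIndex())
writer = threading.Thread(
    target=guarded.update, args=([(i, f"doc {i}") for i in range(2000)],)
)
readers = [threading.Thread(target=guarded.search, args=("doc",)) for _ in range(10)]
for thread in [writer, *readers]:
    thread.start()
for thread in [writer, *readers]:
    thread.join()
```

A lock like this only protects Python-level access to the index; it does not prevent crashes that originate in native code on other threads.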

I'm a little stuck, as memory usage doesn't seem too bad at 800 MB (it's an existing index + db; about 400 MB without loading the index). Any idea how I can debug this?
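Per-thread dumps in the "most recent call first" format shown below are what Python's standard `faulthandler` module writes when the process receives a fatal signal. A minimal sketch for enabling it at application startup is shown here; the log file location is an assumption chosen for illustration.

```python
import faulthandler
import os
import tempfile

# Assumed location for the crash log; any writable path works.
crash_log_path = os.path.join(tempfile.gettempdir(), "faulthandler.log")

# The file object must stay open for the lifetime of the process, because
# faulthandler writes to its file descriptor from inside the signal handler.
crash_log = open(crash_log_path, "w")

# On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, dump the Python stack of
# every thread to the log file.
faulthandler.enable(file=crash_log, all_threads=True)

# Optional: write a dump right now to confirm the output format.
faulthandler.dump_traceback(file=crash_log, all_threads=True)
```

The same handler can be switched on without code changes by starting the interpreter with `python -X faulthandler` or by setting `PYTHONFAULTHANDLER=1`. Note that these dumps show only Python frames, so a crash inside native code still needs the OS crash report to pinpoint.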


Thread 0x00000002996b3000 (most recent call first):
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 114 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 449 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 549 in feed_forward_chunk
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/pytorch_utils.py", line 236 in apply_chunking_to_forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 537 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 610 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 1020 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/models/pooling.py", line 108 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/models/pooling.py", line 128 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/models/pooling.py", line 75 in encode
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/vectors/transformers.py", line 48 in encode
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/vectors/base.py", line 136 in batchtransform
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/base.py", line 293 in batchtransform
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/search.py", line 71 in search
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/search.py", line 107 in dbsearch
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/search.py", line 53 in __call__
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/base.py", line 345 in batchsearch
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/base.py", line 329 in search


Thread 0x00000002a26b3000 (most recent call first):
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/activations.py", line 79 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 450 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 549 in feed_forward_chunk
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/pytorch_utils.py", line 236 in apply_chunking_to_forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 537 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 610 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 1020 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/models/pooling.py", line 108 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/models/pooling.py", line 128 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/models/pooling.py", line 75 in encode
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/vectors/transformers.py", line 48 in encode
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/vectors/base.py", line 136 in batchtransform
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/base.py", line 293 in batchtransform
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/search.py", line 71 in search
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/search.py", line 107 in dbsearch
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/search.py", line 53 in __call__
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/base.py", line 345 in batchsearch
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/base.py", line 329 in search
  
  
Thread 0x000000029fbb3000 (most recent call first):
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 114 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 462 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 550 in feed_forward_chunk
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/pytorch_utils.py", line 236 in apply_chunking_to_forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 537 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 610 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 1020 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/models/pooling.py", line 108 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/models/pooling.py", line 128 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/models/pooling.py", line 75 in encode
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/vectors/transformers.py", line 48 in encode
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/vectors/base.py", line 136 in batchtransform
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/base.py", line 293 in batchtransform
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/search.py", line 71 in search
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/search.py", line 107 in dbsearch
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/search.py", line 53 in __call__
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/base.py", line 345 in batchsearch
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/txtai/embeddings/base.py", line 329 in search
  
Thread 0x0000000297f3f000 (most recent call first):
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/functional.py", line 2515 in layer_norm
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/normalization.py", line 190 in forward
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File ".../.pyenv/versions/3.11.1/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 465 in forward
@Vincent-liuwingsang
Author

Tested with a fresh conda environment, an empty embedding instance and the same conditions (e.g. 10 reads/s, then an upsert + persist under the RLock, while the 10 reads/s continue).

I thought some of my installation could be dodgy, but it seems a bit random that the seg fault happens in tqdm 🤔

Thread 0x0000000295a13000 (most recent call first):
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/tqdm/std.py", line 434 in format_meter
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/tqdm/std.py", line 1148 in __str__
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/tqdm/std.py", line 1492 in display
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/tqdm/std.py", line 1344 in refresh
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/tqdm/std.py", line 1095 in __init__
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/tqdm/std.py", line 1521 in trange
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 159 in encode
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/vectors/transformers.py", line 48 in encode
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/vectors/base.py", line 136 in batchtransform
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/embeddings/base.py", line 296 in batchtransform
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/embeddings/search.py", line 71 in search
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/embeddings/search.py", line 108 in dbsearch
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/embeddings/search.py", line 53 in __call__
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/embeddings/base.py", line 349 in batchsearch
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/embeddings/base.py", line 332 in search
  
  
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 387 in set_truncation_and_padding
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 428 in _batch_encode_plus
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2825 in batch_encode_plus
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2634 in _call_one
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2548 in __call__
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 113 in tokenize
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 319 in tokenize
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 161 in encode
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/vectors/transformers.py", line 48 in encode
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/vectors/base.py", line 136 in batchtransform
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/embeddings/base.py", line 296 in batchtransform
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/embeddings/search.py", line 71 in search
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/embeddings/search.py", line 108 in dbsearch
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/embeddings/search.py", line 53 in __call__
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/embeddings/base.py", line 349 in batchsearch
  File "/Users/wingsangvincentliu/miniconda3/lib/python3.10/site-packages/txtai/embeddings/base.py", line 332 in search

  

@davidmezzetti
Member

This is a tricky one; unfortunately, I don't have a good answer for you. There is a known issue between Faiss and PyTorch on macOS regarding OpenMP conflicts.

More information on this issue can be found in the following links. Perhaps one of those can give you a hint on things to try.

Some ideas from those threads:

  • Disable OpenMP threading: export OMP_NUM_THREADS=1
  • Try to build libomp from source
  • Downgrade PyTorch to <= 1.12
  • Switch the index backend from faiss to hnsw

I also recommend upvoting some of those other issues to attempt to get them more visibility. This issue appears to be the root cause.
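As a rough sketch of two of the ideas listed above, the snippet below sets `OMP_NUM_THREADS` from inside Python and builds a configuration that selects the hnsw backend. The model path and the `content` flag are assumptions for illustration, and the hnsw backend may need an extra package installed, so treat this as a starting point rather than a verified fix.

```python
import os

# Set this before torch or faiss are imported, so the OpenMP runtime
# picks the value up when it is loaded.
os.environ["OMP_NUM_THREADS"] = "1"

# Assumed txtai configuration: the model path is only an example.
config = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,    # keep document content in the backing database
    "backend": "hnsw",  # use hnsw instead of the default faiss index
}

# With txtai installed, the configuration would be used like this:
# from txtai.embeddings import Embeddings
# embeddings = Embeddings(config)
```

Setting the variable in the shell with `export OMP_NUM_THREADS=1` before launching the server has the same effect and avoids any dependence on import order.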

@davidmezzetti
Member

I was able to get the macOS build scripts to work with the latest version of PyTorch and setting the OMP_NUM_THREADS=1 variable. This is probably the best bet.

@Vincent-liuwingsang
Author

Thanks, will give this a go tonight. Which version of PyTorch are you building with?

@Vincent-liuwingsang
Author

@davidmezzetti Thanks again for the help and advice. A bit of further digging suggests the random seg fault I'm experiencing is not related to this library or to a Mac-specific setup at all.

The crash logs are consistent, suggesting that pyobjc (a sub-dependency) is the culprit. Isolating the pyobjc part seems to have resolved the issue. @davidmezzetti Cheers for looking into this!

Thread 28 Crashed:
0   libobjc.A.dylib               	       0x193a545f8 objc_release + 16
1   libobjc.A.dylib               	       0x193a5c0b4 AutoreleasePoolPage::releaseUntil(objc_object**) + 196
2   libobjc.A.dylib               	       0x193a58b7c objc_autoreleasePoolPop + 256
3   libobjc.A.dylib               	       0x193a80520 objc_tls_direct_base<AutoreleasePoolPage*, (tls_key)3, AutoreleasePoolPage::HotPageDealloc>::dtor_(void*) + 168
4   libsystem_pthread.dylib       	       0x193ded970 _pthread_tsd_cleanup + 620
5   libsystem_pthread.dylib       	       0x193df069c _pthread_exit + 84
6   libsystem_pthread.dylib       	       0x193deffb4 _pthread_start + 160
7   libsystem_pthread.dylib       	       0x193deada0 thread_start + 8

@Vincent-liuwingsang
Author

The root cause seemed to be that .autorelease() on objc objects doesn't work well in multithreaded environments.

The fix was to use objc.autorelease_pool() to wrap the code where those resources are allocated and deallocated.
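A minimal sketch of that pattern, assuming the Objective-C work runs on worker threads: each thread wraps its work in PyObjC's `objc.autorelease_pool()` context manager, so autoreleased objects are drained inside the `with` block rather than at thread exit. The helper `run_in_pool` and the placeholder workload are hypothetical, and the fallback to `contextlib.nullcontext` exists only so the sketch also runs where PyObjC is not installed.

```python
import contextlib
import threading

try:
    import objc  # PyObjC, available on macOS only

    autorelease_pool = objc.autorelease_pool
except ImportError:
    # Fallback for platforms without PyObjC: a no-op context manager.
    autorelease_pool = contextlib.nullcontext


def run_in_pool(work, *args):
    """Run work(*args) inside its own autorelease pool (hypothetical helper)."""
    with autorelease_pool():
        return work(*args)


def worker(results, index):
    # Placeholder workload; in the real application this would be the
    # code that creates and releases Objective-C objects.
    results[index] = run_in_pool(lambda n: n * 2, index)


results = {}
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```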

@davidmezzetti
Member

Glad you were able to resolve it.
