This repository has been archived by the owner on Dec 16, 2019. It is now read-only.

Ingestion worker quit unexpectedly after many ingestions #285

Open
dtitov opened this issue Apr 23, 2018 · 11 comments

@dtitov

dtitov commented Apr 23, 2018

I've been running the ingestion scenario many times using the test suite. After 30+ runs, ingestion stopped working. I noticed that the ingestion container had died; here's the last part of its logs:

ingest-swe1_1         | Traceback (most recent call last):
ingest-swe1_1         |   File "/usr/bin/ega-ingest", line 11, in <module>
ingest-swe1_1         |     load_entry_point('lega==0.1', 'console_scripts', 'ega-ingest')()
ingest-swe1_1         |   File "/root/.local/lib/python3.6/site-packages/lega/ingest.py", line 183, in main
ingest-swe1_1         |     consume(do_work, 'files', 'staged')
ingest-swe1_1         |   File "/root/.local/lib/python3.6/site-packages/lega/utils/amqp.py", line 113, in consume
ingest-swe1_1         |     from_channel.start_consuming()
ingest-swe1_1         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 1756, in start_consuming
ingest-swe1_1         |     self.connection.process_data_events(time_limit=None)
ingest-swe1_1         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 716, in process_data_events
ingest-swe1_1         |     self._dispatch_channel_events()
ingest-swe1_1         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 518, in _dispatch_channel_events
ingest-swe1_1         |     impl_channel._get_cookie()._dispatch_events()
ingest-swe1_1         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 1385, in _dispatch_events
ingest-swe1_1         |     evt.body)
ingest-swe1_1         |   File "/root/.local/lib/python3.6/site-packages/lega/utils/amqp.py", line 100, in process_request
ingest-swe1_1         |     answer = work( json.loads(body) ) # Exceptions should be already caught
ingest-swe1_1         |   File "/root/.local/lib/python3.6/site-packages/lega/utils/db.py", line 242, in wrapper
ingest-swe1_1         |     raise e
ingest-swe1_1         |   File "/root/.local/lib/python3.6/site-packages/lega/utils/db.py", line 238, in wrapper
ingest-swe1_1         |     res = func(*args)
ingest-swe1_1         |   File "/root/.local/lib/python3.6/site-packages/lega/ingest.py", line 98, in work
ingest-swe1_1         |     assert( isinstance(encrypted_hash,str) )
ingest-swe1_1         | AssertionError
ega_ingest-swe1_1 exited with code 1
@dtitov dtitov added the bug label Apr 23, 2018
@silverdaz
Contributor

I noticed I had put that assertion in place:
assert( isinstance(encrypted_hash,str) ) (on line 98 of ingest.py)

This was useful in the dev cycle, when we could pass anything.
However, in some cases Central EGA doesn't pass anything, and we should guard against a None value. It's a miss.
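
For illustration, a minimal sketch of such a guard, assuming a small helper around the spot where the checksum is read (the function and field names below are hypothetical, not the actual ingest.py code). The point is to report an error for that message instead of letting an AssertionError kill the worker:

import logging

LOG = logging.getLogger(__name__)

def get_encrypted_hash(data):
    # 'encrypted_integrity' / 'checksum' are placeholder field names for illustration.
    enc = (data or {}).get('encrypted_integrity') or {}
    value = enc.get('checksum')
    if not isinstance(value, str) or not value:
        LOG.error('Invalid or missing encrypted checksum: %r', value)
        return None
    return value

The caller would then treat a None return as an error for that particular file (and publish an error message) rather than crashing the consumer.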

@silverdaz
Contributor

Is it still the case?

@dtitov
Author

dtitov commented Jun 12, 2018

Actually, yes, I've noticed such behavior just today while testing the PR. I haven't checked the exception though. But the verify service failed as well, together with ingest.

@silverdaz
Contributor

It might be the keyserver then. Can you show the exceptions?

@dtitov
Author

dtitov commented Jun 12, 2018

When I catch it again, I'll post it here. It happens randomly; it's a floating bug.

@dtitov
Author

dtitov commented Jun 19, 2018

Here we go:

ingest         | /usr/lib64/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
ingest         |   """)
ingest         | Traceback (most recent call last):
ingest         |   File "/usr/bin/ega-ingest", line 11, in <module>
ingest         |     load_entry_point('lega==0.1', 'console_scripts', 'ega-ingest')()
ingest         |   File "/home/lega/.local/lib/python3.6/site-packages/lega/ingest.py", line 148, in main
ingest         |     consume(do_work, broker, 'files', 'archived')
ingest         |   File "/home/lega/.local/lib/python3.6/site-packages/lega/utils/amqp.py", line 114, in consume
ingest         |     from_channel.start_consuming()
ingest         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 1756, in start_consuming
ingest         |     self.connection.process_data_events(time_limit=None)
ingest         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 707, in process_data_events
ingest         |     self._flush_output(common_terminator)
ingest         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 474, in _flush_output
ingest         |     result.reason_text)
ingest         | pika.exceptions.ConnectionClosed: (320, "CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'")

@silverdaz
Contributor

The connection to the local broker was closed abruptly. Do you know what closed it?
I usually get that exception when I kill the broker or restart the component.

We could write the code so that it retries; that would guard us against broken connections. In my mind, though, this logic doesn't belong inside the application but outside it, at the level of the manager, like Kubernetes or Docker.

@dtitov
Author

dtitov commented Jun 19, 2018

Well, this is one of the scenarios: I shut down the broker and then start it again. In most attempts the ingestion worker doesn't exit after that, so it looks like it's able to recover the connection. But apparently not always...

@silverdaz
Contributor

When the broker is killed, the ingestion worker does indeed die. And the container solution is: "Start a new one". So nope, the old one won't restart and recover. This was not in the design.

@dtitov
Author

dtitov commented Jun 19, 2018

Could it possibly be caused by the restart: on-failure:3 option for the ingest service in docker-compose.yml, which makes it restart itself up to 3 times on failure?

@blankdots

blankdots commented Jun 19, 2018

Following this pika issue pika/pika#856 led me to pika/pika#858 ... might be worth a shot
... meaning adding:

except pika.exceptions.ConnectionClosed:
    LOGGER.info('Connection closed. Recovering')
    continue

here: https://github.com/NBISweden/LocalEGA/blob/dev/lega/utils/amqp.py#L112-L118
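
For context, a minimal, self-contained sketch of what that could look like around the consuming loop. This is not the actual amqp.py code: the connection parameters, callback and queue name are placeholders, and the basic_consume signature follows the pika 0.x API that the tracebacks above are using.

import logging
import time

import pika

LOGGER = logging.getLogger(__name__)

def consume_forever(params, on_message, queue):
    """Keep consuming; if the broker closes the connection, reconnect and resume."""
    while True:
        try:
            connection = pika.BlockingConnection(params)    # params: a pika.ConnectionParameters
            channel = connection.channel()
            channel.basic_consume(on_message, queue=queue)  # pika 0.x signature
            channel.start_consuming()
        except pika.exceptions.ConnectionClosed:
            # Catching AMQPConnectionError instead would also cover failed reconnect attempts.
            LOGGER.info('Connection closed. Recovering')
            time.sleep(1)  # small back-off before reconnecting
            continue

Whether this recovery lives inside consume() or is delegated to Docker/Kubernetes restarts, as discussed above, is a design choice; the sketch only illustrates the in-application option.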
