This repository has been archived by the owner on Dec 16, 2019. It is now read-only.

Ingestion worker quit unexpectedly after many ingestions #285

Open
dtitov opened this issue Apr 23, 2018 · 11 comments

@dtitov

dtitov commented Apr 23, 2018

I've been running the ingestion scenario many times using the test suite. After 30+ runs, ingestion stopped working. I noticed that the ingestion container had died; here's the last part of its logs:

ingest-swe1_1         | Traceback (most recent call last):
ingest-swe1_1         |   File "/usr/bin/ega-ingest", line 11, in <module>
ingest-swe1_1         |     load_entry_point('lega==0.1', 'console_scripts', 'ega-ingest')()
ingest-swe1_1         |   File "/root/.local/lib/python3.6/site-packages/lega/ingest.py", line 183, in main
ingest-swe1_1         |     consume(do_work, 'files', 'staged')
ingest-swe1_1         |   File "/root/.local/lib/python3.6/site-packages/lega/utils/amqp.py", line 113, in consume
ingest-swe1_1         |     from_channel.start_consuming()
ingest-swe1_1         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 1756, in start_consuming
ingest-swe1_1         |     self.connection.process_data_events(time_limit=None)
ingest-swe1_1         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 716, in process_data_events
ingest-swe1_1         |     self._dispatch_channel_events()
ingest-swe1_1         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 518, in _dispatch_channel_events
ingest-swe1_1         |     impl_channel._get_cookie()._dispatch_events()
ingest-swe1_1         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 1385, in _dispatch_events
ingest-swe1_1         |     evt.body)
ingest-swe1_1         |   File "/root/.local/lib/python3.6/site-packages/lega/utils/amqp.py", line 100, in process_request
ingest-swe1_1         |     answer = work( json.loads(body) ) # Exceptions should be already caught
ingest-swe1_1         |   File "/root/.local/lib/python3.6/site-packages/lega/utils/db.py", line 242, in wrapper
ingest-swe1_1         |     raise e
ingest-swe1_1         |   File "/root/.local/lib/python3.6/site-packages/lega/utils/db.py", line 238, in wrapper
ingest-swe1_1         |     res = func(*args)
ingest-swe1_1         |   File "/root/.local/lib/python3.6/site-packages/lega/ingest.py", line 98, in work
ingest-swe1_1         |     assert( isinstance(encrypted_hash,str) )
ingest-swe1_1         | AssertionError
ega_ingest-swe1_1 exited with code 1
@dtitov dtitov added the bug label Apr 23, 2018
@silverdaz
Contributor

I noticed I had put that assertion in place:
assert( isinstance(encrypted_hash,str) ) (on line 98 of ingest.py)

This was useful in the dev cycle, when we could pass anything.
However, in some cases Central EGA doesn't pass anything, and we should guard against a None value. It's a miss.
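
For illustration, a minimal sketch of such a guard, assuming a small helper around the spot where the checksum is read (the function and field names below are hypothetical, not the actual ingest.py code). The point is to report an error for that message instead of letting an AssertionError kill the worker:

import logging

LOG = logging.getLogger(__name__)

def get_encrypted_hash(data):
    # 'encrypted_integrity' / 'checksum' are placeholder field names for illustration.
    enc = (data or {}).get('encrypted_integrity') or {}
    value = enc.get('checksum')
    if not isinstance(value, str) or not value:
        LOG.error('Invalid or missing encrypted checksum: %r', value)
        return None
    return value

The caller would then treat a None return as an error for that particular file (and publish an error message) rather than crashing the consumer.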

@silverdaz
Contributor

Is it still the case?

@dtitov
Author

dtitov commented Jun 12, 2018

Actually, yes, I've noticed such behavior just today while testing the PR. I haven't checked the exception though. But the verify service failed as well, together with ingest.

@silverdaz
Contributor

It might be the keyserver then. Can you show the exceptions?

@dtitov
Author

dtitov commented Jun 12, 2018

When I catch it again, I'll post it here. It happens randomly; it's a floating bug.

@dtitov
Author

dtitov commented Jun 19, 2018

Here we go:

ingest         | /usr/lib64/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
ingest         |   """)
ingest         | Traceback (most recent call last):
ingest         |   File "/usr/bin/ega-ingest", line 11, in <module>
ingest         |     load_entry_point('lega==0.1', 'console_scripts', 'ega-ingest')()
ingest         |   File "/home/lega/.local/lib/python3.6/site-packages/lega/ingest.py", line 148, in main
ingest         |     consume(do_work, broker, 'files', 'archived')
ingest         |   File "/home/lega/.local/lib/python3.6/site-packages/lega/utils/amqp.py", line 114, in consume
ingest         |     from_channel.start_consuming()
ingest         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 1756, in start_consuming
ingest         |     self.connection.process_data_events(time_limit=None)
ingest         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 707, in process_data_events
ingest         |     self._flush_output(common_terminator)
ingest         |   File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 474, in _flush_output
ingest         |     result.reason_text)
ingest         | pika.exceptions.ConnectionClosed: (320, "CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'")

@silverdaz
Contributor

The connection to the local broker was closed abruptly. Do you know what closed it?
I usually get that exception when I kill the broker or restart the component.

We could write the code so that it retries; that would guard us against broken connections. In my mind, though, this logic doesn't belong inside the application but outside it, at the level of the manager, like Kubernetes or Docker.

@dtitov
Author

dtitov commented Jun 19, 2018

Well, this is one of the scenarios: I shut down the broker and then start it again. In most attempts the ingestion worker doesn't exit after that, so it looks like it's able to recover the connection. But apparently not always...

@silverdaz
Contributor

When the broker is killed, the ingestion worker does indeed die. And the container solution is: "Start a new one". So nope, the old one won't restart and recover. This was not in the design.

@dtitov
Author

dtitov commented Jun 19, 2018

Could it possibly be caused by the restart: on-failure:3 option for the ingest service in docker-compose.yml, which makes it restart itself up to 3 times on failure?

@blankdots

blankdots commented Jun 19, 2018

Following this pika issue pika/pika#856 led me to pika/pika#858 ... might be worth a shot
... meaning adding:

except pika.exceptions.ConnectionClosed:
    LOGGER.info('Connection closed. Recovering')
    continue

here: https://github.com/NBISweden/LocalEGA/blob/dev/lega/utils/amqp.py#L112-L118
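
For context, a minimal, self-contained sketch of what that could look like around the consuming loop. This is not the actual amqp.py code: the connection parameters, callback and queue name are placeholders, and the basic_consume signature follows the pika 0.x API that the tracebacks above are using.

import logging
import time

import pika

LOGGER = logging.getLogger(__name__)

def consume_forever(params, on_message, queue):
    """Keep consuming; if the broker closes the connection, reconnect and resume."""
    while True:
        try:
            connection = pika.BlockingConnection(params)    # params: a pika.ConnectionParameters
            channel = connection.channel()
            channel.basic_consume(on_message, queue=queue)  # pika 0.x signature
            channel.start_consuming()
        except pika.exceptions.ConnectionClosed:
            # Catching AMQPConnectionError instead would also cover failed reconnect attempts.
            LOGGER.info('Connection closed. Recovering')
            time.sleep(1)  # small back-off before reconnecting
            continue

Whether this recovery lives inside consume() or is delegated to Docker/Kubernetes restarts, as discussed above, is a design choice; the sketch only illustrates the in-application option.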
