Document service configuration, as part of #438.
Rename MiSeq Monitor to MiCall Watcher.
Pull find_groups out of the resistance module, to avoid unnecessary requirements.
donkirkby committed Apr 17, 2018
1 parent b7938c9 commit 3abcd47
Showing 11 changed files with 142 additions and 70 deletions.
126 changes: 98 additions & 28 deletions docs/admin.md
@@ -3,27 +3,23 @@ title: Admin Tasks for the MiCall Pipeline
description: Getting things done
---

## The MiCall Monitor ##
## The MiCall Watcher ##

MiCall Monitor (or just Monitor) handles the automated processing of new MiSeq data
MiCall Watcher handles the automated processing of new MiSeq data
through the MiCall pipeline. It periodically scans the `RAW_DATA` folder for data,
and when new data appears it interfaces with Kive to start the processing.
This folder is populated outside of MiCall:

* Runs get uploaded by the MiSeq to `RAW_DATA`.
* The `watch_runs.rb` script in that folder watches for the files to finish
copying, and then creates a file named `needsprocessing` in the folder.
* The [MiseqQCReport][] scripts upload the QC data to QAI, and then create a
`qc_uploaded` file.

The Monitor looks for folders with both of these flag files, and ignores ones
MiCall Watcher looks for folders with this flag file, and ignores ones
without.

[MiseqQCReport]: https://github.com/cfe-lab/MiSeqQCReport/tree/master/modules

### Hourly scan for new MiSeq runs ###

Every hour, Monitor looks for new data to process with the following procedure.
Every hour, MiCall Watcher looks for new data to process with the following procedure.

* Scan for folders that have a `needsprocessing` flag. Any such folders get added
to a list (in memory) of all run folders. This distinguishes other random stuff or
@@ -40,62 +36,136 @@ results subfolder corresponding to the current version of MiCall (located in
of whether or not it's The Newest Run).

* When a run folder is found that does *not* have such a results folder, one of
two things happens. If this folder is The Newest Run, Monitor gets/creates a MiCall
two things happens. If this folder is The Newest Run, MiCall Watcher gets/creates a MiCall
pipeline run for each sample, all at once (see "Get/Create Run" below). All other
folders have their samples added to an in-memory list of "Samples That Need Processing".

* Get/Create Run: Monitor looks for the existence of the required datasets on Kive
(by both MD5 and filename) and creates them if they don't exist. Then, Monitor
* Get/Create Run: MiCall Watcher looks for the existence of the required datasets on Kive
(by both MD5 and filename) and creates them if they don't exist. Then, MiCall Watcher
looks to see if this data is already being processed through the current version of
MiCall. If not, it starts the processing. Either way, the Kive processing task
is added to an in-memory list of "Samples In Progress".
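
The hourly scan described above can be sketched roughly as follows. This is a minimal illustration, not the actual MiCall Watcher code: the function name `scan_run_folders` is hypothetical, and it assumes that run folders are direct children of `RAW_DATA` and that their names sort by run date, so the last name is The Newest Run. The flag file names and the `Results/version_X.Y` layout come from this page.

```python
from pathlib import Path


def scan_run_folders(raw_data, pipeline_version):
    """Return (newest_run, older_runs) for run folders that still need results.

    Hypothetical sketch: assumes run folders sit directly under raw_data
    and that their names sort in date order.
    """
    raw_data = Path(raw_data)
    # Only folders with a needsprocessing flag count as run folders.
    run_folders = sorted(
        (flag.parent for flag in raw_data.glob('*/needsprocessing')),
        reverse=True)
    # The Newest Run is chosen before any folders are skipped, so skipping
    # it does not promote the next folder.
    newest = run_folders[0] if run_folders else None
    pending = []
    for folder in run_folders:
        if (folder / 'errorprocessing').exists():
            continue  # flagged as failed, so skip it
        results = folder / 'Results' / f'version_{pipeline_version}'
        if results.exists():
            continue  # already handled by this version of MiCall
        pending.append(folder)
    newest_run = newest if newest in pending else None
    older_runs = [folder for folder in pending if folder != newest]
    return newest_run, older_runs
```

In this sketch, samples from `newest_run` would be started right away, and samples from `older_runs` would go on the Samples That Need Processing list.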

### Periodic load monitoring and adding new processing jobs ###

Every 30 seconds, Monitor checks on all Samples In Progress.
Every 30 seconds, MiCall Watcher checks on all Samples In Progress.

* If a MiCall sample is finished, and that is the last one from a given MiSeq run to
finish, then all results for that run are downloaded into a subfolder specific to the
current version of MiCall (located in `Results/version_X.Y` where `X.Y` is the current
version number) in that run's corresponding folder in `RAW_DATA`, and a `doneprocessing`
flag is added to that subfolder. Post-processing (e.g. uploading stuff to QAI) occurs.

* Monitor keeps a specified lower limit of samples to keep active at any given time.
* MiCall Watcher keeps a specified lower limit of samples to keep active at any given time.
If a MiCall sample finishes processing and the number of active samples dips below
that limit, Monitor looks at its list of Samples That Need Processing and starts
that limit, MiCall Watcher looks at its list of Samples That Need Processing and starts
the next one, moving it from that list to Samples In Progress.
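
As a rough illustration of the lower-limit rule above, the following sketch tops up the active list from the waiting list. The name `top_up` and the use of plain lists and a `deque` are assumptions made for the example; the real watcher tracks Kive runs, not strings.

```python
from collections import deque


def top_up(in_progress, waiting, min_active):
    """Move waiting samples into in_progress until min_active are active.

    Hypothetical sketch: in_progress is a list, waiting is a deque, and
    both are modified in place. Returns the samples that were started.
    """
    started = []
    while len(in_progress) < min_active and waiting:
        sample = waiting.popleft()  # oldest waiting sample goes first
        in_progress.append(sample)
        started.append(sample)
    return started


waiting = deque(['sample3', 'sample4', 'sample5'])
active = ['sample1']
newly_started = top_up(active, waiting, min_active=3)
```

Calling it again while the active count is at or above the limit starts nothing, which matches the behaviour described above.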

### Ways to manipulate the order in which the Monitor processes data ###
### Installing MiCall Watcher ###
Install the MiCall source code in a shared location:

    $ cd /usr/local/share
    $ sudo git clone https://github.com/cfe-lab/MiCall.git

It should be run as a service, under its own user account, so first create the new user:

    # For CentOS:
    $ sudo useradd micall
    $ sudo passwd micall

    # For Ubuntu:
    $ sudo adduser micall

Log in as the micall user, and create a Python 3.6 virtual environment:

    $ cd ~
    $ python3.6 -m venv vmicall
    $ . vmicall/bin/activate
    (vmicall) $ cd /usr/local/share/MiCall
    (vmicall) $ pip install -r requirements-watcher.txt
    (vmicall) $ python micall_watcher.py --help

Look at the options you can give to the `micall_watcher.py` script when you
configure the service file in a later step.

Copy the logging configuration if you want to change any of the settings.

    $ cp micall_logging_config.py micall_logging_override.py

Read the instructions in the file, and edit the override copy. If the default
settings are fine, you don't need the override file.

Now configure the service using a systemd [service unit] configuration.
Here's an example configuration, in `/etc/systemd/system/micall_watcher.service`:

    [Unit]
    Description=micall_watcher

    [Service]
    ExecStart=/path/to/virtualenv/bin/python3.6 /path/to/MiCall/micall_watcher.py \
        --pipeline_version=8.0 --raw_data=/data/raw \
        --kive_server=bigbox --kive_user=micall_uploads \
        --micall_filter_quality_pipeline_id=100 --micall_main_pipeline_id=101 \
        --micall_resistance_pipeline_id=102 \
        --qai_server=smallbox --qai_user=micall_uploads
    Environment=MICALL_KIVE_PASSWORD=badexample MICALL_QAI_PASSWORD=worse
    User=micall

    # Allow the process to log its exit.
    KillSignal=SIGINT

    [Install]
    WantedBy=multi-user.target

The settings can either be given on the command line or set as
environment variables. Environment variables are a better option for
sensitive parameters like passwords, because the command line is visible to all
users. Make sure you reduce the read permissions on the `.service` file so
other users can't read it. The environment variable names are the same as the
command options, but they add a `MICALL_` prefix, if it's not already there.
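
The naming rule can be sketched as a small function. The name `env_name` is hypothetical and is not taken from `micall_watcher.py`; the option names in the example are the ones shown in the service file above.

```python
def env_name(option):
    """Map a command option to its environment variable name.

    Hypothetical sketch of the rule: upper-case the option name and add a
    MICALL_ prefix, unless the name already starts with it.
    """
    name = option.lstrip('-').upper()
    if not name.startswith('MICALL_'):
        name = 'MICALL_' + name
    return name


kive_server_variable = env_name('--kive_server')
pipeline_id_variable = env_name('--micall_main_pipeline_id')
```

Under this rule, `--kive_server` maps to `MICALL_KIVE_SERVER`, while `--micall_main_pipeline_id` maps to `MICALL_MAIN_PIPELINE_ID` without a doubled prefix.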

Once you write the configuration file, you
have to enable and start the service. From then on, it will start automatically
when the server boots up.

    $ sudo systemctl daemon-reload
    $ sudo systemctl enable micall_watcher
    $ sudo systemctl start micall_watcher
    $ sudo systemctl status micall_watcher

[service unit]: https://www.freedesktop.org/software/systemd/man/systemd.service.html

### Ways to manipulate the order in which the MiCall Watcher processes data ###
Sometimes, you want to do unusual things. Here are a few scenarios we've run into.

#### You need to force Monitor to reprocess a given run ####
First, stop Monitor. Remove the results subfolder for the current version of
MiCall. Restart Monitor. On the next hourly scan, Monitor will handle this
#### You need to force MiCall Watcher to reprocess a given run ####
Remove the results subfolder for the current version of
MiCall. On the next hourly scan, MiCall Watcher will handle this
folder as if it had never been processed through the current version of MiCall. That
is, if it's The Newest Run, its samples will be immediately started and added to
Samples In Progress, and if it's not The Newest Run, its samples will be added to
Samples That Need Processing.

#### You need Monitor to skip a folder and handle an older one first ####
Stop Monitor. Add an `errorprocessing` flag to the run's folder in `RAW_DATA`.
This will make Monitor's next hourly scan believe that it's failed and should be
skipped, and Monitor will move on to the next one. Note though that this has no
#### You need MiCall Watcher to skip a folder and handle an older one first ####
Stop MiCall Watcher. Add an `errorprocessing` flag to the run's folder in `RAW_DATA`.
This will make MiCall Watcher's next hourly scan believe that it's failed and should be
skipped, and MiCall Watcher will move on to the next one. Note though that this has no
effect on which folder is The Newest Run: even if you're skipping The Newest Run,
the next one down does not inherit the mantle.

* DO NOT delete the `needsprocessing` flag from a folder to try and keep Monitor from
* DO NOT delete the `needsprocessing` flag from a folder to try and keep MiCall Watcher from
handling it. This will cause Conan's scripts to reupload data to that folder.

Restart Monitor. After all samples from this run have finished processing,
remove the fake `errorprocessing` flag you set (no need to stop and restart Monitor).
Restart MiCall Watcher. After all samples from this run have finished processing,
remove the fake `errorprocessing` flag you set (no need to stop and restart MiCall Watcher).
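
Setting and clearing the fake flag can also be scripted. The helper below is a hypothetical sketch, not part of MiCall: it only creates or removes an empty `errorprocessing` file in a run folder, and it never touches the `needsprocessing` flag.

```python
from pathlib import Path


def set_skip_flag(run_folder, skip=True):
    """Create or remove the errorprocessing flag in a run folder.

    Hypothetical helper. Returns True if the flag exists afterwards.
    """
    flag = Path(run_folder) / 'errorprocessing'
    if skip:
        flag.touch()  # an empty file is enough to act as a flag
    elif flag.exists():
        flag.unlink()
    return flag.exists()
```

Call it with `skip=True` while MiCall Watcher is stopped, and with `skip=False` once the older run has finished processing.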

#### You need to stop what's currently running and handle an older run ####
Stop Monitor. In Kive, stop the processing tasks that you need to clear out,
Stop MiCall Watcher. In Kive, stop the processing tasks that you need to clear out,
but don't remove them; the progress you've already made can be reused when
you revisit these tasks later. Now, do the manipulations you need to do as in the
above case to make Monitor deal with your desired run first. Restart Monitor.
above case to make MiCall Watcher deal with your desired run first. Restart MiCall Watcher.

As in the above case, when you are ready to process the run you previously stopped,
you can remove the fake `errorprocessing` flag you created for that run, and Monitor
you can remove the fake `errorprocessing` flag you created for that run, and MiCall Watcher
will then restart those processing tasks on its next hourly scan. Kive will be able
to reuse the progress already made when you stopped them.
36 changes: 36 additions & 0 deletions micall/monitor/find_groups.py
@@ -0,0 +1,36 @@
from collections import namedtuple

from micall.utils.sample_sheet_parser import sample_sheet_parser

SampleGroup = namedtuple('SampleGroup', 'enum names')


def find_groups(file_names, sample_sheet_path, included_projects=None):
    """ Group HCV samples with their MIDI partners.

    :param list[str] file_names: a list of FASTQ file names without paths
    :param sample_sheet_path: path to the SampleSheet.csv file
    :param included_projects: project codes to include, or None to include
        all
    """
    with open(sample_sheet_path) as sample_sheet_file:
        run_info = sample_sheet_parser(sample_sheet_file)

    midi_files = {row['sample']: row['filename']
                  for row in run_info['DataSplit']
                  if row['project'] == 'MidHCV'}
    wide_names = {row['filename']: row['sample']
                  for row in run_info['DataSplit']
                  if (row['project'] != 'MidHCV' and
                      (included_projects is None or
                       row['project'] in included_projects))}
    trimmed_names = {'_'.join(file_name.split('_')[:2]): file_name
                     for file_name in file_names}
    for trimmed_name, file_name in sorted(trimmed_names.items()):
        sample_name = wide_names.get(trimmed_name)
        if sample_name is None:
            # Project was not included.
            continue
        midi_trimmed = midi_files.get(sample_name + 'MIDI')
        midi_name = trimmed_names.get(midi_trimmed)
        yield SampleGroup(sample_name, (file_name, midi_name))
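
To see what the pairing does without a real sample sheet, here is a self-contained sketch of the same logic working on an in-memory list of rows. The function name `pair_samples`, the sample names, and the file names are all made up for illustration; the row keys (`sample`, `filename`, `project`) are the ones `find_groups` reads from `run_info['DataSplit']`.

```python
from collections import namedtuple

SampleGroup = namedtuple('SampleGroup', 'enum names')


def pair_samples(file_names, rows, included_projects=None):
    """Same pairing as find_groups, but on rows already held in memory."""
    midi_files = {row['sample']: row['filename']
                  for row in rows
                  if row['project'] == 'MidHCV'}
    wide_names = {row['filename']: row['sample']
                  for row in rows
                  if (row['project'] != 'MidHCV' and
                      (included_projects is None or
                       row['project'] in included_projects))}
    # Keep only the first two underscore-separated parts of each file name.
    trimmed_names = {'_'.join(file_name.split('_')[:2]): file_name
                     for file_name in file_names}
    for trimmed_name, file_name in sorted(trimmed_names.items()):
        sample_name = wide_names.get(trimmed_name)
        if sample_name is None:
            continue  # MIDI file, or project not included
        midi_trimmed = midi_files.get(sample_name + 'MIDI')
        midi_name = trimmed_names.get(midi_trimmed)  # None if no partner
        yield SampleGroup(sample_name, (file_name, midi_name))


# Hypothetical rows and FASTQ names, for illustration only.
rows = [
    {'sample': '2130A', 'filename': '2130A-HCV_S15', 'project': 'HCV'},
    {'sample': '2130AMIDI', 'filename': '2130AMIDI-MidHCV_S16',
     'project': 'MidHCV'},
    {'sample': '2140A', 'filename': '2140A-HIV_S17', 'project': 'HIV'},
]
file_names = ['2130A-HCV_S15_L001_R1_001.fastq.gz',
              '2130AMIDI-MidHCV_S16_L001_R1_001.fastq.gz',
              '2140A-HIV_S17_L001_R1_001.fastq.gz']
groups = list(pair_samples(file_names, rows))
```

The HCV sample is grouped with its MIDI partner, and the sample without a partner gets `None` in the second position of `names`.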
2 changes: 1 addition & 1 deletion micall/monitor/kive_watcher.py
@@ -20,7 +20,7 @@
from micall.drivers.run_info import parse_read_sizes
from micall.monitor import error_metrics_parser
from micall.monitor.sample_watcher import FolderWatcher, ALLOWED_GROUPS, SampleWatcher, PipelineType
from micall.resistance.resistance import find_groups
from micall.monitor.find_groups import find_groups

logger = logging.getLogger(__name__)
FOLDER_SCAN_INTERVAL = timedelta(hours=1)
35 changes: 0 additions & 35 deletions micall/resistance/resistance.py
@@ -1,4 +1,3 @@
#! /usr/bin/env python3.4
import os
from argparse import ArgumentParser, FileType
from collections import namedtuple, defaultdict
@@ -11,7 +10,6 @@

from micall.resistance.asi_algorithm import AsiAlgorithm, ResistanceLevels
from micall.core.aln2counts import AMINO_ALPHABET
from micall.utils.sample_sheet_parser import sample_sheet_parser

MIN_FRACTION = 0.05 # prevalence of mutations to report
MIN_COVERAGE = 100
@@ -21,8 +19,6 @@

AminoList = namedtuple('AminoList', 'region aminos genotype')

SampleGroup = namedtuple('SampleGroup', 'enum names')


class LowCoverageError(Exception):
    pass
@@ -405,37 +401,6 @@ def load_asi():
    return algorithms


def find_groups(file_names, sample_sheet_path, included_projects=None):
    """ Group HCV samples with their MIDI partners.

    :param list[str] file_names: a list of FASTQ file names without paths
    :param sample_sheet_path: path to the SampleSheet.csv file
    :param list included_projects: project codes to include, or None to include
        all
    """
    with open(sample_sheet_path) as sample_sheet_file:
        run_info = sample_sheet_parser(sample_sheet_file)

    midi_files = {row['sample']: row['filename']
                  for row in run_info['DataSplit']
                  if row['project'] == 'MidHCV'}
    wide_names = {row['filename']: row['sample']
                  for row in run_info['DataSplit']
                  if (row['project'] != 'MidHCV' and
                      (included_projects is None or
                       row['project'] in included_projects))}
    trimmed_names = {'_'.join(file_name.split('_')[:2]): file_name
                     for file_name in file_names}
    for trimmed_name, file_name in sorted(trimmed_names.items()):
        sample_name = wide_names.get(trimmed_name)
        if sample_name is None:
            # Project was not included.
            continue
        midi_trimmed = midi_files.get(sample_name + 'MIDI')
        midi_name = trimmed_names.get(midi_trimmed)
        yield SampleGroup(sample_name, (file_name, midi_name))


def report_resistance(amino_csv,
                      midi_amino_csv,
                      resistance_csv,
2 changes: 1 addition & 1 deletion micall/tests/test_kive_watcher.py
@@ -16,7 +16,7 @@

from micall.monitor.kive_watcher import find_samples, KiveWatcher, FolderEvent, FolderEventType, calculate_retry_wait
from micall.monitor.sample_watcher import PipelineType, ALLOWED_GROUPS, FolderWatcher, SampleWatcher
from micall.resistance.resistance import SampleGroup
from micall.monitor.find_groups import SampleGroup
from micall_watcher import parse_args


2 changes: 1 addition & 1 deletion micall/tests/test_sample_watcher.py
@@ -1,7 +1,7 @@
from pathlib import Path

from micall.monitor.sample_watcher import FolderWatcher, SampleWatcher, PipelineType
from micall.resistance.resistance import SampleGroup
from micall.monitor.find_groups import SampleGroup


class DummySession:
3 changes: 2 additions & 1 deletion micall/utils/genreport_rerun.py
@@ -10,7 +10,8 @@
from datetime import datetime

from micall.resistance.genreport import gen_report
from micall.resistance.resistance import find_groups, report_resistance
from micall.monitor.find_groups import find_groups
from micall.resistance.resistance import report_resistance
from micall.settings import NEEDS_PROCESSING, pipeline_version, DONE_PROCESSING


2 changes: 1 addition & 1 deletion micall_basespace.py
@@ -20,7 +20,7 @@
from micall.drivers.run_info import RunInfo, ReadSizes, parse_read_sizes
from micall.drivers.sample import Sample
from micall.drivers.sample_group import SampleGroup
from micall.resistance.resistance import find_groups
from micall.monitor.find_groups import find_groups
from micall.monitor import error_metrics_parser, quality_metrics_parser
from micall.g2p.pssm_lib import Pssm
from micall.monitor.tile_metrics_parser import summarize_tiles
2 changes: 1 addition & 1 deletion requirements-dev.txt
@@ -1,5 +1,5 @@
# Requirements for running the tests, doing development, and using utilities
-r requirements-test.txt
-r requirements-monitor.txt
-r requirements-watcher.txt
# Used for plotting profiling results.
gprof2dot==2016.10.13
2 changes: 1 addition & 1 deletion requirements-test.txt
@@ -1,7 +1,7 @@
# Requirements for running the tests

-r requirements.txt
-r requirements-monitor.txt
-r requirements-watcher.txt
pytest==3.5.0
coverage==4.3.4
pandas==0.21.0
File renamed without changes.
