Document service configuration, as part of #438.

Rename MiSeq Monitor to MiCall Watcher. Pull find_groups out of the resistance module, to avoid unnecessary requirements.
cfe-lab · Apr 17, 2018 · 3abcd47
1 parent b7938c9

Showing 11 changed files with 142 additions and 70 deletions.
diff --git a/docs/admin.md b/docs/admin.md
@@ -3,27 +3,23 @@ title: Admin Tasks for the MiCall Pipeline
 description: Getting things done
 ---
 
-## The MiCall Monitor ##
+## The MiCall Watcher ##
 
-MiCall Monitor (or just Monitor) handles the automated processing of new MiSeq data
+MiCall Watcher handles the automated processing of new MiSeq data
 through the MiCall pipeline.  It periodically scans the `RAW_DATA` folder for data,
 and when new data appears it interfaces with Kive to start the processing.
 This folder is populated outside of MiCall:
 
 * Runs get uploaded by the MiSeq to `RAW_DATA`.
 * The `watch_runs.rb` script in that folder watches for the files to finish
     copying, and then creates a file named `needsprocessing` in the folder.
-* The [MiseqQCReport][] scripts upload the QC data to QAI, and then create a 
-`qc_uploaded` file.
 
-The Monitor looks for folders with both of these flag files, and ignores ones
+MiCall Watcher looks for folders with this flag file, and ignores ones
 without.
 
-[MiseqQCReport]: https://github.com/cfe-lab/MiSeqQCReport/tree/master/modules
-
 ### Hourly scan for new MiSeq runs ###
 
-Every hour, Monitor looks for new data to process with the following procedure.
+Every hour, MiCall Watcher looks for new data to process with the following procedure.
 
 * Scan for folders that have a `needsprocessing` flag.  Any such folders get added 
 to a list (in memory) of all run folders.  This distinguishes other random stuff or 
@@ -40,62 +36,136 @@ results subfolder corresponding to the current version of MiCall (located in
 of whether or not it's The Newest Run).
 
 * When a run folder is found that does *not* have such a results folder, one of 
-two things happens.  If this folder is The Newest Run, Monitor gets/creates a MiCall 
+two things happens.  If this folder is The Newest Run, MiCall Watcher gets/creates a MiCall 
 pipeline run for each sample, all at once (see "Get/Create Run" below).  All other 
 folders have their samples added to an in-memory list of "Samples That Need Processing".
 
-    * Get/Create Run: Monitor looks for the existence of the required datasets on Kive 
-    (by both MD5 and filename) and creates them if they don't exist.  Then, Monitor 
+    * Get/Create Run: MiCall Watcher looks for the existence of the required datasets on Kive 
+    (by both MD5 and filename) and creates them if they don't exist.  Then, MiCall Watcher 
     looks to see if this data is already being processed through the current version of 
     MiCall.  If not, it starts the processing.  Either way, the Kive processing task
     is added to an in-memory list of "Samples In Progress".
 
 ### Periodic load monitoring and adding new processing jobs ###
 
-Every 30 seconds, Monitor checks on all Samples In Progress.
+Every 30 seconds, MiCall Watcher checks on all Samples In Progress.
 
 * If a MiCall sample is finished, and that is the last one from a given MiSeq run to 
 finish, then all results for that run are downloaded into a subfolder specific to the 
 current version of MiCall (located in `Results/version_X.Y` where `X.Y` is the current 
 version number) in that run's corresponding folder in `RAW_DATA`, and a `doneprocessing` 
 flag is added to that subfolder.  Post-processing (e.g. uploading stuff to QAI) occurs.
 
-* Monitor keeps a specified lower limit of samples to keep active at any given time.  
+* MiCall Watcher keeps a specified lower limit of samples to keep active at any given time.  
 If a MiCall sample finishes processing and the number of active samples dips below 
-that limit, Monitor looks at its list of Samples That Need Reprocessing and starts 
+that limit, MiCall Watcher looks at its list of Samples That Need Processing and starts 
 the next one, moving it from that list to Samples In Progress.
 
-### Ways to manipulate the order in which the Monitor processes data ###
+### Installing MiCall Watcher ###
+Install the MiCall source code in a shared location:
+
+    $ cd /usr/local/share
+    $ sudo git clone https://github.com/cfe-lab/MiCall.git
+
+MiCall Watcher should be run as a service, under its own user account, so first create the new user:
+
+    # For CentOS:
+    $ sudo useradd micall
+    $ sudo passwd micall
+
+    # For Ubuntu:
+    $ sudo adduser micall
+
+Log in as the micall user, and create a Python 3.6 virtual environment:
+
+    $ cd ~
+    $ python3.6 -m venv vmicall
+    $ . vmicall/bin/activate
+    (vmicall) $ cd /usr/local/share/MiCall
+    (vmicall) $ pip install -r requirements-watcher.txt
+    (vmicall) $ python micall_watcher.py --help
+
+Look at the options you can give to the `micall_watcher.py` script when you
+configure the service file in a later step.
+
+Copy the logging configuration if you want to change any of the settings.
+
+    $ cp micall_logging_config.py micall_logging_override.py
+
+Read the instructions in the file, and edit the override copy. If the default
+settings are fine, you don't need the override file.
+
+Now configure the service using a systemd [service unit] configuration.
+Here's an example configuration, in `/etc/systemd/system/micall_watcher.service`:
+
+    [Unit]
+    Description=micall_watcher
+
+    [Service]
+    ExecStart=/path/to/virtualenv/bin/python3.6 /path/to/MiCall/micall_watcher.py \
+        --pipeline_version=8.0 --raw_data=/data/raw \
+        --kive_server=bigbox --kive_user=micall_uploads \
+        --micall_filter_quality_pipeline_id=100 --micall_main_pipeline_id=101 \
+        --micall_resistance_pipeline_id=102 \
+        --qai_server=smallbox --qai_user=micall_uploads
+    Environment=MICALL_KIVE_PASSWORD=badexample MICALL_QAI_PASSWORD=worse
+    User=micall
+
+    # Allow the process to log its exit.
+    KillSignal=SIGINT
+
+    [Install]
+    WantedBy=multi-user.target
+
+The settings can either be given on the command line or set as
+environment variables. Environment variables are a better option for
+sensitive parameters like passwords, because the command line is visible to all
+users. Make sure you reduce the read permissions on the `.service` file so
+other users can't read it. The environment variable names are the same as the
+command options, but they add a `MICALL_` prefix, if it's not already there.
+
+Once you write the configuration file, you
+have to enable and start the service. From then on, it will start automatically
+when the server boots up.
+
+    $ sudo systemctl daemon-reload
+    $ sudo systemctl enable micall_watcher
+    $ sudo systemctl start micall_watcher
+    $ sudo systemctl status micall_watcher
+
+[service unit]: https://www.freedesktop.org/software/systemd/man/systemd.service.html
+
+### Ways to manipulate the order in which the MiCall Watcher processes data ###
 Sometimes, you want to do unusual things. Here are a few scenarios we've run into.
 
-#### You need to force Monitor to reprocess a given run ####
-First, stop Monitor.  Remove the results subfolder for the current version of 
-MiCall.  Restart Monitor.  On the next hourly scan, Monitor will handle this 
+#### You need to force MiCall Watcher to reprocess a given run ####
+Remove the results subfolder for the current version of 
+MiCall.  On the next hourly scan, MiCall Watcher will handle this 
 folder as if it had never been processed through the current version of MiCall.  That 
 is, if it's The Newest Run, its samples will be immediately started and added to 
 Samples In Progress, and if it's not The Newest Run, its samples will be added to 
 Samples That Need Processing.
 
-#### You need Monitor to skip a folder and handle an older one first ####
-Stop Monitor.  Add an `errorprocessing` flag to the run's folder in `RAW_DATA`.  
-This will make Monitor's next hourly scan believe that it's failed and should be 
-skipped, and Monitor will move on to the next one.  Note though that this has no 
+#### You need MiCall Watcher to skip a folder and handle an older one first ####
+Stop MiCall Watcher.  Add an `errorprocessing` flag to the run's folder in `RAW_DATA`.  
+This will make MiCall Watcher's next hourly scan believe that it's failed and should be 
+skipped, and MiCall Watcher will move on to the next one.  Note though that this has no 
 effect on which folder is The Newest Run: even if you're skipping The Newest Run, 
 the next one down does not inherit the mantle.
 
-* DO NOT delete the `needsprocessing` flag from a folder to try and keep Monitor from 
+* DO NOT delete the `needsprocessing` flag from a folder to try and keep MiCall Watcher from 
 handling it.  This will cause Conan's scripts to reupload data to that folder.  
 
-Restart Monitor.  After all samples from this run have finished processing, 
-remove the fake `errorprocessing` flag you set (no need to stop and restart Monitor).
+Restart MiCall Watcher.  After all samples from this run have finished processing, 
+remove the fake `errorprocessing` flag you set (no need to stop and restart MiCall Watcher).
 
 #### You need to stop what's currently running and handle an older run ####
-Stop Monitor.  In Kive, stop the processing tasks that you need to clear out, 
+Stop MiCall Watcher.  In Kive, stop the processing tasks that you need to clear out, 
 but don't remove them; the progress you've already made can be reused when 
 revisiting these tasks later.  Now, do the manipulations you need to do as in the
-above case to make Monitor deal with your desired run first.  Restart Monitor.
+above case to make MiCall Watcher deal with your desired run first.  Restart MiCall Watcher.
 
 As in the above case, when you are ready to process the run you previously stopped,
-you can remove the fake `errorprocessing` flag you created for that run, and Monitor
+you can remove the fake `errorprocessing` flag you created for that run, and MiCall Watcher
 will then restart those processing tasks on its next hourly scan.  Kive will be able 
 to reuse the progress already made when you stopped them.
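
Review note on the environment variable rule documented in `docs/admin.md` above: each setting can come from a command option or from an environment variable with the same name plus a `MICALL_` prefix, unless the prefix is already there. The sketch below shows one way that mapping could work. It is not the parsing code in `micall_watcher.py`; `env_name()` and `build_parser()` are hypothetical helpers, only three options are shown, and `--kive_password` is an assumed option name inferred from `MICALL_KIVE_PASSWORD`.

```python
import os
from argparse import ArgumentParser


def env_name(option_name):
    """ Map an option name to its environment variable name.

    Hypothetical helper: upper-case the name, then add the MICALL_
    prefix only if it is not already there.
    """
    name = option_name.upper()
    if not name.startswith('MICALL_'):
        name = 'MICALL_' + name
    return name


def build_parser(environ=None):
    """ Build a parser whose defaults come from environment variables.

    The option names here are a small, assumed subset for illustration.
    """
    environ = os.environ if environ is None else environ
    parser = ArgumentParser(description='Sketch of watcher options.')
    for option_name in ('kive_user',
                        'kive_password',
                        'micall_main_pipeline_id'):
        # A value on the command line overrides the environment default.
        parser.add_argument('--' + option_name,
                            default=environ.get(env_name(option_name)))
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args()
    # Never print the password itself, only whether one was supplied.
    print(args.kive_user, args.kive_password is not None)
```

With this approach the password can stay in the `Environment=` line of the service file, out of the process command line, while less sensitive settings remain visible in `ExecStart`.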
diff --git a/micall/monitor/find_groups.py b/micall/monitor/find_groups.py
@@ -0,0 +1,36 @@
+from collections import namedtuple
+
+from micall.utils.sample_sheet_parser import sample_sheet_parser
+
+SampleGroup = namedtuple('SampleGroup', 'enum names')
+
+
+def find_groups(file_names, sample_sheet_path, included_projects=None):
+    """ Group HCV samples with their MIDI partners.
+
+    :param list[str] file_names: a list of FASTQ file names without paths
+    :param sample_sheet_path: path to the SampleSheet.csv file
+    :param included_projects: project codes to include, or None to include
+        all
+    """
+    with open(sample_sheet_path) as sample_sheet_file:
+        run_info = sample_sheet_parser(sample_sheet_file)
+
+    midi_files = {row['sample']: row['filename']
+                  for row in run_info['DataSplit']
+                  if row['project'] == 'MidHCV'}
+    wide_names = {row['filename']: row['sample']
+                  for row in run_info['DataSplit']
+                  if (row['project'] != 'MidHCV' and
+                      (included_projects is None or
+                       row['project'] in included_projects))}
+    trimmed_names = {'_'.join(file_name.split('_')[:2]): file_name
+                     for file_name in file_names}
+    for trimmed_name, file_name in sorted(trimmed_names.items()):
+        sample_name = wide_names.get(trimmed_name)
+        if sample_name is None:
+            # Project was not included.
+            continue
+        midi_trimmed = midi_files.get(sample_name + 'MIDI')
+        midi_name = trimmed_names.get(midi_trimmed)
+        yield SampleGroup(sample_name, (file_name, midi_name))
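
Review note on `find_groups()`: the pairing logic is easier to follow with concrete data. The sketch below repeats the same grouping steps, but takes the `DataSplit` rows directly instead of reading a sample sheet, so it runs without `sample_sheet_parser`. `group_rows()` is a hypothetical name, and the sample names, project codes other than `MidHCV`, and file names are made up for illustration.

```python
from collections import namedtuple

SampleGroup = namedtuple('SampleGroup', 'enum names')


def group_rows(file_names, rows, included_projects=None):
    """ Pair each sample's FASTQ file with its MIDI partner, if any.

    Hypothetical variant of find_groups() that accepts the rows
    directly; each row needs 'sample', 'filename', and 'project' keys.
    """
    # MIDI samples, keyed by sample name.
    midi_files = {row['sample']: row['filename']
                  for row in rows
                  if row['project'] == 'MidHCV'}
    # Everything else, keyed by trimmed file name.
    wide_names = {row['filename']: row['sample']
                  for row in rows
                  if (row['project'] != 'MidHCV' and
                      (included_projects is None or
                       row['project'] in included_projects))}
    # Keep the first two underscore-separated fields of each file name.
    trimmed_names = {'_'.join(file_name.split('_')[:2]): file_name
                     for file_name in file_names}
    for trimmed_name, file_name in sorted(trimmed_names.items()):
        sample_name = wide_names.get(trimmed_name)
        if sample_name is None:
            # MIDI file, or project was not included.
            continue
        midi_trimmed = midi_files.get(sample_name + 'MIDI')
        midi_name = trimmed_names.get(midi_trimmed)
        yield SampleGroup(sample_name, (file_name, midi_name))


# Made-up example data.
example_rows = [
    {'sample': '2130A', 'filename': '2130A-HCV_S15', 'project': 'HCV'},
    {'sample': '2130AMIDI', 'filename': '2130AMIDI-MidHCV_S16',
     'project': 'MidHCV'},
    {'sample': '2140A', 'filename': '2140A-HIV_S17', 'project': 'HIV'},
]
example_files = ['2130A-HCV_S15_L001_R1_001.fastq.gz',
                 '2130AMIDI-MidHCV_S16_L001_R1_001.fastq.gz',
                 '2140A-HIV_S17_L001_R1_001.fastq.gz']

if __name__ == '__main__':
    for group in group_rows(example_files, example_rows):
        print(group)
```

A sample without a MIDI partner still produces a group, with `None` in the second position of `names`, so callers need to handle that case.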
diff --git a/micall/monitor/kive_watcher.py b/micall/monitor/kive_watcher.py
@@ -20,7 +20,7 @@
 from micall.drivers.run_info import parse_read_sizes
 from micall.monitor import error_metrics_parser
 from micall.monitor.sample_watcher import FolderWatcher, ALLOWED_GROUPS, SampleWatcher, PipelineType
-from micall.resistance.resistance import find_groups
+from micall.monitor.find_groups import find_groups
 
 logger = logging.getLogger(__name__)
 FOLDER_SCAN_INTERVAL = timedelta(hours=1)

diff --git a/micall/resistance/resistance.py b/micall/resistance/resistance.py
@@ -1,4 +1,3 @@
-#! /usr/bin/env python3.4
 import os
 from argparse import ArgumentParser, FileType
 from collections import namedtuple, defaultdict
@@ -11,7 +10,6 @@
 
 from micall.resistance.asi_algorithm import AsiAlgorithm, ResistanceLevels
 from micall.core.aln2counts import AMINO_ALPHABET
-from micall.utils.sample_sheet_parser import sample_sheet_parser
 
 MIN_FRACTION = 0.05  # prevalence of mutations to report
 MIN_COVERAGE = 100
@@ -21,8 +19,6 @@
 
 AminoList = namedtuple('AminoList', 'region aminos genotype')
 
-SampleGroup = namedtuple('SampleGroup', 'enum names')
-
 
 class LowCoverageError(Exception):
     pass
@@ -405,37 +401,6 @@ def load_asi():
     return algorithms
 
 
-def find_groups(file_names, sample_sheet_path, included_projects=None):
-    """ Group HCV samples with their MIDI partners.
-
-    :param list[str] file_names: a list of FASTQ file names without paths
-    :param sample_sheet_path: path to the SampleSheet.csv file
-    :param list included_projects: project codes to include, or None to include
-        all
-    """
-    with open(sample_sheet_path) as sample_sheet_file:
-        run_info = sample_sheet_parser(sample_sheet_file)
-
-    midi_files = {row['sample']: row['filename']
-                  for row in run_info['DataSplit']
-                  if row['project'] == 'MidHCV'}
-    wide_names = {row['filename']: row['sample']
-                  for row in run_info['DataSplit']
-                  if (row['project'] != 'MidHCV' and
-                      (included_projects is None or
-                       row['project'] in included_projects))}
-    trimmed_names = {'_'.join(file_name.split('_')[:2]): file_name
-                     for file_name in file_names}
-    for trimmed_name, file_name in sorted(trimmed_names.items()):
-        sample_name = wide_names.get(trimmed_name)
-        if sample_name is None:
-            # Project was not included.
-            continue
-        midi_trimmed = midi_files.get(sample_name + 'MIDI')
-        midi_name = trimmed_names.get(midi_trimmed)
-        yield SampleGroup(sample_name, (file_name, midi_name))
-
-
 def report_resistance(amino_csv,
                       midi_amino_csv,
                       resistance_csv,

diff --git a/micall/tests/test_kive_watcher.py b/micall/tests/test_kive_watcher.py
@@ -16,7 +16,7 @@
 
 from micall.monitor.kive_watcher import find_samples, KiveWatcher, FolderEvent, FolderEventType, calculate_retry_wait
 from micall.monitor.sample_watcher import PipelineType, ALLOWED_GROUPS, FolderWatcher, SampleWatcher
-from micall.resistance.resistance import SampleGroup
+from micall.monitor.find_groups import SampleGroup
 from micall_watcher import parse_args
 
 

diff --git a/micall/tests/test_sample_watcher.py b/micall/tests/test_sample_watcher.py
@@ -1,7 +1,7 @@
 from pathlib import Path
 
 from micall.monitor.sample_watcher import FolderWatcher, SampleWatcher, PipelineType
-from micall.resistance.resistance import SampleGroup
+from micall.monitor.find_groups import SampleGroup
 
 
 class DummySession:

diff --git a/micall/utils/genreport_rerun.py b/micall/utils/genreport_rerun.py
@@ -10,7 +10,8 @@
 from datetime import datetime
 
 from micall.resistance.genreport import gen_report
-from micall.resistance.resistance import find_groups, report_resistance
+from micall.monitor.find_groups import find_groups
+from micall.resistance.resistance import report_resistance
 from micall.settings import NEEDS_PROCESSING, pipeline_version, DONE_PROCESSING
 
 

diff --git a/micall_basespace.py b/micall_basespace.py
@@ -20,7 +20,7 @@
 from micall.drivers.run_info import RunInfo, ReadSizes, parse_read_sizes
 from micall.drivers.sample import Sample
 from micall.drivers.sample_group import SampleGroup
-from micall.resistance.resistance import find_groups
+from micall.monitor.find_groups import find_groups
 from micall.monitor import error_metrics_parser, quality_metrics_parser
 from micall.g2p.pssm_lib import Pssm
 from micall.monitor.tile_metrics_parser import summarize_tiles

diff --git a/requirements-dev.txt b/requirements-dev.txt
@@ -1,5 +1,5 @@
 # Requirements for running the tests, doing development, and using utilities
 -r requirements-test.txt
--r requirements-monitor.txt
+-r requirements-watcher.txt
 # Used for plotting profiling results.
 gprof2dot==2016.10.13
diff --git a/requirements-test.txt b/requirements-test.txt
@@ -1,7 +1,7 @@
 # Requirements for running the tests
 
 -r requirements.txt
--r requirements-monitor.txt
+-r requirements-watcher.txt
 pytest==3.5.0
 coverage==4.3.4
 pandas==0.21.0

diff --git a/requirements-monitor.txt b/requirements-watcher.txt