Readme updates and update to parser options for prediction.

MiraldiLab · Nov 29, 2022 · 3a09bb1
1 parent 47705c7
commit 3a09bb1

Showing 4 changed files with 36 additions and 44 deletions.
diff --git a/README.md b/README.md
@@ -21,16 +21,14 @@ ___
 
 It is best to install maxATAC into a dedicated virtual environment.
 
-This version requires python 3.9, `bedtools`, `samtools`, `pigz`, `wget`, `git`, `graphviz`, and `ucsc-bedGraphToBigWig` in order to run all functions.
+This version requires python 3.9, `bedtools`, `samtools`, `pigz`, `wget`, `git`, `graphviz`, and `ucsc-bedgraphtobigwig` in order to run all functions.
 
-> The total install requirements for maxATAC with reference data are ~2 GB.
+> The total install data requirement for maxATAC is ~2 GB.
 
 ### Installing with Conda
 
 1. Create a conda environment for maxATAC with `conda create -n maxatac -c bioconda python=3.9 samtools wget bedtools ucsc-bedgraphtobigwig pigz`
 
-> If you get an error installing ucsc-bedgraphtobigwig try `conda install -c bioconda ucsc-bedgraphtobigwig`
-
 > If you get an error regarding graphviz while training a model, re-install graphviz with `conda install graphviz`
 
 2. Install maxATAC with `pip install maxatac`
@@ -40,6 +38,7 @@ This version requires python 3.9, `bedtools`, `samtools`, `pigz`, `wget`, `git`,
 4. Download reference data with `maxatac data`
 
 > If you have an error related to pybigwig, reference issues: [96](https://github.com/MiraldiLab/maxATAC/issues/96) and [87](https://github.com/MiraldiLab/maxATAC/issues/87#issue-1139117054)
+
 ### Installing with python virtualenv
 
 1. Create a virtual environment for maxATAC with `virtualenv -p python3.9 maxatac`.

diff --git a/docs/readme/predict.md b/docs/readme/predict.md
@@ -20,17 +20,13 @@ maxatac predict --tf CTCF --signal GM12878.bigwig
 
 The user must provide either the TF name that they want to make predictions for or the h5 model file they desire. If the user provides a TF name, the best model will be used and the correct threshold file will be provided for peak calling.
 
-### `-s, --signal`
+### `-s, --signal, -i`
 
 The ATAC-seq signal bigwig track that will be used to make predictions of TF binding.
 
-### `--genome`
-
-Specify which genome build this task is specified for (i.e. hg38). 
-
 ## Optional Arguments
 
-### `--sequence`
+### `--sequence, --seq`
 
 This argument specifies the path to the 2bit DNA sequence for the genome of interest. maxATAC models are trained with hg38 so you will need the correct `.2bit` file.
 
@@ -46,17 +42,17 @@ The cutoff value for the cutoff type provided. Note precision, recall, and F1-sc
 
 The cutoff file provided in /data/models that corresponds to the average validation performance metrics for the TF model.
 
-### `--output`
+### `-o, --output`
 
 Output directory path. Default: `./prediction_results`
 
-### `--blacklist`
+### `-bl, --blacklist`
 
 The path to a bigwig file that has regions to exclude. Default: maxATAC-defined blacklist.
 
-### `--roi`
+### `--bed, --peaks, --regions, --roi, -roi`
 
-The path to a bed file that contains the genomic regions to predict TF binding in. These regions should be at least 1024 bp, the maxATAC model input regions.
+The path to a bed file that contains the genomic regions to focus TF predictions on. These peaks will be used to refine the prediction windows. 
 
 ### `--batch_size`
 
@@ -66,22 +62,26 @@ The number of regions to predict on per batch. Default `10000`. Decrease this va
 
 The step size to use for building the prediction intervals. Overlapping prediction bins will be averaged together. Default: `INPUT_LENGTH/4`, where INPUT_LENGTH is the maxATAC model input size of 1,024 bp. 
 
-### `--prefix`
+### `-n, --name, --prefix`
 
 Output filename prefix to use. Default `maxatac_predict`.
 
-### `--chrom_sizes`
+### `-cs, --chrom_sizes, -chrom_sizes, --chromosome_sizes`
 
 The path to the chromosome sizes file. This is used to generate the bigwig signal tracks.
 
-### `--chromosomes`
+### `-c, -chroms, --chromosomes`
 
 The chromosomes to make predictions on. Our models do not currently consider chromosomes X or Y, so most of the provided files will not contain this information. You should not predict on chrX or chrY unless you know your bigwig contains these chromosomes. Default: Autosomal chromosomes 1-22.
 
 ### `--loglevel`
 
 This argument is used to set the logging level. Currently, the only working logging level is `ERROR`.
 
-### `-bin, --bin_size`
+### `-w, --windows`
+
+The windows to use for prediction. These windows must be 1,024 bp wide and have a consistent step size.
+
+### `-skip_call_peaks, --skip_call_peaks`
 
-The bin size to use for calling peaks. Default: 200 bp based on the same sized used for benchmarking predictions.
+This will skip calling peaks at the end of predictions. 
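
The `-w, --windows` requirement documented above (windows 1,024 bp wide with a consistent step size) can be illustrated with a minimal sketch. The helper names `make_windows` and `has_uniform_windows` are hypothetical and are not part of maxATAC; the 256 bp step assumes the documented default of `INPUT_LENGTH/4`, and the check assumes the windows are sorted by chromosome and start.

```python
from typing import List, Tuple

INPUT_LENGTH = 1024               # maxATAC model input size, per the docs above
DEFAULT_STEP = INPUT_LENGTH // 4  # documented default step size (256 bp)

Window = Tuple[str, int, int]     # (chromosome, start, stop), BED-style


def make_windows(chrom: str, chrom_size: int,
                 width: int = INPUT_LENGTH,
                 step: int = DEFAULT_STEP) -> List[Window]:
    """Hypothetical helper: tile one chromosome with fixed-width windows."""
    # The last window must end at or before the chromosome end.
    return [(chrom, start, start + width)
            for start in range(0, chrom_size - width + 1, step)]


def has_uniform_windows(windows: List[Window],
                        width: int = INPUT_LENGTH) -> bool:
    """Hypothetical check: every window is `width` bp and the step is constant."""
    if any(stop - start != width for _, start, stop in windows):
        return False
    # Collect the distance between consecutive windows on the same chromosome.
    steps = {
        nxt[1] - cur[1]
        for cur, nxt in zip(windows, windows[1:])
        if cur[0] == nxt[0]
    }
    return len(steps) <= 1


if __name__ == "__main__":
    for chrom, start, stop in make_windows("chr1", 2048):
        print(chrom, start, stop, sep="\t")
```

Running a check like this on a windows file before passing it to `maxatac predict` would catch regions of the wrong width early; how maxATAC itself handles such files is not shown in this commit.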
diff --git a/maxatac/analyses/predict.py b/maxatac/analyses/predict.py
@@ -16,18 +16,20 @@
     from maxatac.utilities.peak_tools import get_threshold
     from maxatac.utilities.prediction_tools import write_predictions_to_bigwig, \
         create_prediction_regions, make_stranded_predictions
-    from maxatac.analyses.peaks import run_call_peaks
 
 
 def run_prediction(args):
     """
-    Predict TF binding with a maxATAC model. The user can provide a bed file of regions to predict on or prediction
-    regions can be created based on the chromosome of interest. The default prediction will predict across all autosomal
-    chromosomes.
+    Predict TF binding with a maxATAC model. The user can provide a bed file of regions to predict on,
+    called windows, or prediction regions can be created based on the chromosome of interest. The default prediction
+    will predict across all autosomal chromosomes.
 
-    BED file requirements for prediction. You must have at least a 3 column file with chromosome, start,
+    Peak file requirements for prediction. You must have at least a 3 column file with chromosome, start,
     and stop coordinates.
 
+    Windows file requirements for prediction. These windows will be directly input into the maxATAC data generator
+    and should be 1,024 bp wide. The window step should be uniform across the chromosome.
+
     The user can decide whether to make predictions only on the forward strand or also on the reverse strand.
     If the user wants both strands, bigwig signal tracks will be produced for the forward, reverse, and
     mean-combined predictions.
@@ -36,12 +38,12 @@ def run_prediction(args):
 
     1) Create directories and set up filenames
     2) Prepare regions for prediction. Either import user defined regions or create regions based on chromosomes list.
-    3) Make predictions on the reference strand. 
-    3) Convert predictions to bigwig format and write results.
+    3) Make predictions.
+    4) Convert predictions to bigwig format and write results.
+    5) Write predictions to an optional BED formatted file of regions above a specific threshold.
 
-    Args:
-        output_directory, name, signal, sequence, models, predict_chromosomes, threads, batch_size, roi,
-        chrom_sizes, blacklist, average
+    Args: TF, output_directory, name, signal, sequence, model, threads, batch_size, roi, cutoff_type, cutoff_value,
+    cutoff_file, chrom_sizes, blacklist, average, windows, loglevel, step_size, chromosomes, skip_call_peaks
     """
     # Start Timer
     startTime = timeit.default_timer()
@@ -124,7 +126,7 @@ def run_prediction(args):
     if args.cutoff_file and args.skip_call_peaks is False:
         args.input_bigwig = outfile_name_bigwig
 
-        peaks_filename = os.path.join(output_directory, args.name + "_" + str(args.BIN_SIZE) + "bp.bed")
+        peaks_filename = os.path.join(output_directory, args.name + "_peaks.bed")
 
         thresh = get_threshold(cutoff_file=args.cutoff_file,
                                cutoff_type=args.cutoff_type,

diff --git a/maxatac/utilities/parser.py b/maxatac/utilities/parser.py
@@ -2,8 +2,6 @@
 import random
 import os
 from os import getcwd
-
-from pkg_resources import require
 from yaml import dump
 
 from maxatac.utilities.system_tools import (get_version,
@@ -245,7 +243,7 @@ def get_parser():
                                 help="The blacklisted regions to exclude in BED format"
                                 )
 
-    predict_parser.add_argument("--bed", "--peaks", "--regions", "-roi",
+    predict_parser.add_argument("--bed", "--peaks", "--regions", "--roi", "-roi",
                                 dest="roi",
                                 default=False,
                                 required=False,
@@ -290,7 +288,7 @@ def get_parser():
                                       Example: GM12878_CTCF"
                                 )
 
-    predict_parser.add_argument("-cs", "--chrom_sizes", "--chrom_sizes",
+    predict_parser.add_argument("-cs", "-chrom_sizes", "--chrom_sizes", "--chromosome_sizes",
                                 dest="chrom_sizes",
                                 type=str,
                                 help="Chromosome sizes file"
@@ -305,27 +303,20 @@ def get_parser():
                                       Default: All chromosomes chr1-22"
                                 )
 
-    predict_parser.add_argument("-bin", "--bin_size",
-                                dest="BIN_SIZE",
-                                type=int,
-                                default=DEFAULT_BENCHMARKING_BIN_SIZE,
-                                help="Bin size to use for peak calling"
-                                )
-
-    predict_parser.add_argument("-cutoff_type", "--cutoff_type",
+    predict_parser.add_argument("-ct", "-cutoff_type", "--cutoff_type",
                                 dest="cutoff_type",
                                 default="F1",
                                 type=str,
                                 help="Cutoff type (i.e. Precision)"
                                 )
 
-    predict_parser.add_argument("-cutoff_value", "--cutoff_value",
+    predict_parser.add_argument("-cv", "-cutoff_value", "--cutoff_value",
                                 dest="cutoff_value",
                                 type=float,
                                 help="Cutoff value for the cutoff type provided. Not used with F1 score."
                                 )
 
-    predict_parser.add_argument("-cutoff_file", "--cutoff_file",
+    predict_parser.add_argument("-cf", "-cutoff_file", "--cutoff_file",
                                 dest="cutoff_file",
                                 type=str,
                                 help="Cutoff file provided in /data/models"
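
The alias changes in this file rely on `argparse` accepting several option strings for one argument, all of which write to the same `dest`. The following is a minimal, self-contained sketch of that behavior. It copies only a few of the option strings from the diff; the function name `build_predict_parser`, the `prog` value, and the example file name are assumptions made for illustration and are not the actual maxATAC parser.

```python
import argparse


def build_predict_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for a small part of the predict parser."""
    parser = argparse.ArgumentParser(prog="predict-sketch")

    # Short, single-dash, and double-dash spellings all store to `cutoff_type`.
    parser.add_argument("-ct", "-cutoff_type", "--cutoff_type",
                        dest="cutoff_type",
                        default="F1",
                        type=str,
                        help="Cutoff type (i.e. Precision)")

    parser.add_argument("-cv", "-cutoff_value", "--cutoff_value",
                        dest="cutoff_value",
                        type=float,
                        help="Cutoff value for the cutoff type provided.")

    # Every alias for the regions file stores to `roi`.
    parser.add_argument("--bed", "--peaks", "--regions", "--roi", "-roi",
                        dest="roi",
                        default=False,
                        required=False,
                        help="BED file of regions to focus predictions on")
    return parser


if __name__ == "__main__":
    parser = build_predict_parser()
    short = parser.parse_args(["-ct", "Precision", "-cv", "0.7", "-roi", "peaks.bed"])
    long_ = parser.parse_args(["--cutoff_type", "Precision",
                               "--cutoff_value", "0.7", "--roi", "peaks.bed"])
    print(vars(short) == vars(long_))
```

Because every alias shares one `dest`, downstream code such as `run_prediction` only needs to read `args.cutoff_type` or `args.roi`, regardless of which spelling the user typed.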