Readme updates and update to parser options for prediction.
tacazares committed Nov 29, 2022
1 parent 47705c7 commit 3a09bb1
Showing 4 changed files with 36 additions and 44 deletions.
7 changes: 3 additions & 4 deletions README.md
@@ -21,16 +21,14 @@ ___

It is best to install maxATAC into a dedicated virtual environment.

This version requires python 3.9, `bedtools`, `samtools`, `pigz`, `wget`, `git`, `graphviz`, and `ucsc-bedGraphToBigWig` in order to run all functions.
This version requires python 3.9, `bedtools`, `samtools`, `pigz`, `wget`, `git`, `graphviz`, and `ucsc-bedgraphtobigwig` in order to run all functions.

> The total install requirements for maxATAC with reference data are ~2 GB.
> The total install data requirement for maxATAC is ~2 GB.
### Installing with Conda

1. Create a conda environment for maxATAC with `conda create -n maxatac -c bioconda python=3.9 samtools wget bedtools ucsc-bedgraphtobigwig pigz`

> If you get an error installing ucsc-bedgraphtobigwig try `conda install -c bioconda ucsc-bedgraphtobigwig`
> If you get an error regarding graphviz while training a model, re-install graphviz with `conda install graphviz`
2. Install maxATAC with `pip install maxatac`
@@ -40,6 +38,7 @@ This version requires python 3.9, `bedtools`, `samtools`, `pigz`, `wget`, `git`,
4. Download reference data with `maxatac data`

> If you have an error related to pybigwig, reference issues: [96](https://github.com/MiraldiLab/maxATAC/issues/96) and [87](https://github.com/MiraldiLab/maxATAC/issues/87#issue-1139117054)
### Installing with python virtualenv

1. Create a virtual environment for maxATAC with `virtualenv -p python3.9 maxatac`.
30 changes: 15 additions & 15 deletions docs/readme/predict.md
@@ -20,17 +20,13 @@ maxatac predict --tf CTCF --signal GM12878.bigwig

The user must provide either the TF name that they want to make predictions for or the h5 model file they desire. If the user provides a TF name, the best model will be used and the correct threshold file will be provided for peak calling.

### `-s, --signal`
### `-s, --signal, -i`

The ATAC-seq signal bigwig track that will be used to make predictions of TF binding.

### `--genome`

Specify which genome build this task is specified for (i.e. hg38).

## Optional Arguments

### `--sequence`
### `--sequence, --seq`

This argument specifies the path to the 2bit DNA sequence for the genome of interest. maxATAC models are trained with hg38 so you will need the correct `.2bit` file.

@@ -46,17 +42,17 @@ The cutoff value for the cutoff type provided. Note precision, recall, and F1-sc

The cutoff file provided in /data/models that corresponds to the average validation performance metrics for the TF model.

### `--output`
### `-o, --output`

Output directory path. Default: `./prediction_results`

### `--blacklist`
### `-bl, --blacklist`

The path to a bigwig file that has regions to exclude. Default: maxATAC-defined blacklist.

### `--roi`
### `--bed, --peaks, --regions, --roi, -roi`

The path to a bed file that contains the genomic regions to predict TF binding in. These regions should be at least 1024 bp, the maxATAC model input regions.
The path to a bed file that contains the genomic regions to focus TF predictions on. These peaks will be used to refine the prediction windows.

### `--batch_size`

@@ -66,22 +62,26 @@ The number of regions to predict on per batch. Default `10000`. Decrease this va

The step size to use for building the prediction intervals. Overlapping prediction bins will be averaged together. Default: `INPUT_LENGTH/4`, where INPUT_LENGTH is the maxATAC model input size of 1,024 bp.

### `--prefix`
### `-n, --name, --prefix`

Output filename prefix to use. Default `maxatac_predict`.

### `--chrom_sizes`
### `-cs, --chrom_sizes, -chrom_sizes, --chromosome_sizes`

The path to the chromosome sizes file. This is used to generate the bigwig signal tracks.

### `--chromosomes`
### `-c, -chroms, --chromosomes`

The chromosomes to make predictions on. Our models do not currently consider chromosomes X or Y. This means that most of the files will not contain this information. You should not predict in chrX or chrY unless you know your bigwig contains these chromosomes. Default: Autosomal chromosomes 1-22.

### `--loglevel`

This argument is used to set the logging level. Currently, the only working logging level is `ERROR`.

### `-bin, --bin_size`
### `-w, --windows`

The windows to use for prediction. These windows must be 1,024 bp wide and have a consistent step size.

### `-skip_call_peaks, --skip_call_peaks`

The bin size to use for calling peaks. Default: 200 bp, the same size used for benchmarking predictions.
This will skip calling peaks at the end of predictions.
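
The window and step-size descriptions above can be illustrated with a minimal sketch. It tiles a region with 1,024 bp windows at a uniform step of `INPUT_LENGTH/4` and averages the scores wherever windows overlap. The helper names `make_windows` and `average_overlaps` are hypothetical and are not part of maxATAC, and using a single score per window is a simplifying assumption made only for illustration.

```python
INPUT_LENGTH = 1024            # model input size stated in the docs
STEP_SIZE = INPUT_LENGTH // 4  # documented default step (256 bp)


def make_windows(region_length, input_length=INPUT_LENGTH, step=STEP_SIZE):
    """Hypothetical helper: fixed-width windows tiled with a uniform step."""
    last_start = region_length - input_length
    return [(start, start + input_length)
            for start in range(0, last_start + 1, step)]


def average_overlaps(windows, scores, region_length):
    """Hypothetical helper: mean score at each base covered by a window."""
    total = [0.0] * region_length
    count = [0] * region_length
    for (start, end), score in zip(windows, scores):
        for pos in range(start, end):
            total[pos] += score
            count[pos] += 1
    # Positions covered by no window are reported as 0.0.
    return [t / c if c else 0.0 for t, c in zip(total, count)]


# Assumption: one illustrative score per window (0.0, 1.0, 2.0, ...).
windows = make_windows(region_length=2048)
scores = [float(i) for i in range(len(windows))]
averaged = average_overlaps(windows, scores, region_length=2048)
```

With the default step, interior bases are covered by up to four windows, so each reported value is the mean of up to four predictions.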
24 changes: 13 additions & 11 deletions maxatac/analyses/predict.py
@@ -16,18 +16,20 @@
from maxatac.utilities.peak_tools import get_threshold
from maxatac.utilities.prediction_tools import write_predictions_to_bigwig, \
create_prediction_regions, make_stranded_predictions
from maxatac.analyses.peaks import run_call_peaks


def run_prediction(args):
"""
Predict TF binding with a maxATAC model. The user can provide a bed file of regions to predict on or prediction
regions can be created based on the chromosome of interest. The default prediction will predict across all autosomal
chromosomes.
Predict TF binding with a maxATAC model. The user can provide a bed file of regions to predict on,
called windows, or prediction regions can be created based on the chromosome of interest. The default prediction
will predict across all autosomal chromosomes.
BED file requirements for prediction. You must have at least a 3 column file with chromosome, start,
Peak file requirements for prediction. You must have at least a 3 column file with chromosome, start,
and stop coordinates.
Windows file requirements for prediction. These windows will be directly input into the maxATAC data generator
and should be 1,024 bp wide. The window step should be uniform across the chromosome.
The user can decide whether to make predictions only on the forward strand or also on the reverse strand.
If the user wants both strands, forward, reverse, and mean-combined bigwig signal tracks will be produced.
@@ -36,12 +38,12 @@ def run_prediction(args):
1) Create directories and set up filenames
2) Prepare regions for prediction. Either import user defined regions or create regions based on chromosomes list.
3) Make predictions on the reference strand.
3) Convert predictions to bigwig format and write results.
3) Make predictions.
4) Convert predictions to bigwig format and write results.
5) Write predictions to an optional BED formatted file of regions above a specific threshold.
Args:
output_directory, name, signal, sequence, models, predict_chromosomes, threads, batch_size, roi,
chrom_sizes, blacklist, average
Args: TF, output_directory, name, signal, sequence, model, threads, batch_size, roi, cutoff_type, cutoff_value,
cutoff_file, chrom_sizes, blacklist, average, windows, loglevel, step_size, chromosomes, skip_call_peaks
"""
# Start Timer
startTime = timeit.default_timer()
@@ -124,7 +126,7 @@ def run_prediction(args):
if args.cutoff_file and args.skip_call_peaks is False:
args.input_bigwig = outfile_name_bigwig

peaks_filename = os.path.join(output_directory, args.name + "_" + str(args.BIN_SIZE) + "bp.bed")
peaks_filename = os.path.join(output_directory, args.name + "_peaks.bed")

thresh = get_threshold(cutoff_file=args.cutoff_file,
cutoff_type=args.cutoff_type,
19 changes: 5 additions & 14 deletions maxatac/utilities/parser.py
@@ -2,8 +2,6 @@
import random
import os
from os import getcwd

from pkg_resources import require
from yaml import dump

from maxatac.utilities.system_tools import (get_version,
@@ -245,7 +243,7 @@ def get_parser():
help="The blacklisted regions to exclude in BED format"
)

predict_parser.add_argument("--bed", "--peaks", "--regions", "-roi",
predict_parser.add_argument("--bed", "--peaks", "--regions", "--roi", "-roi",
dest="roi",
default=False,
required=False,
@@ -290,7 +288,7 @@ def get_parser():
Example: GM12878_CTCF"
)

predict_parser.add_argument("-cs", "--chrom_sizes", "--chrom_sizes",
predict_parser.add_argument("-cs", "-chrom_sizes", "--chrom_sizes", "--chromosome_sizes",
dest="chrom_sizes",
type=str,
help="Chromosome sizes file"
@@ -305,27 +303,20 @@ def get_parser():
Default: All chromosomes chr1-22"
)

predict_parser.add_argument("-bin", "--bin_size",
dest="BIN_SIZE",
type=int,
default=DEFAULT_BENCHMARKING_BIN_SIZE,
help="Bin size to use for peak calling"
)

predict_parser.add_argument("-cutoff_type", "--cutoff_type",
predict_parser.add_argument("-ct", "-cutoff_type", "--cutoff_type",
dest="cutoff_type",
default="F1",
type=str,
help="Cutoff type (i.e. Precision)"
)

predict_parser.add_argument("-cutoff_value", "--cutoff_value",
predict_parser.add_argument("-cv", "-cutoff_value", "--cutoff_value",
dest="cutoff_value",
type=float,
help="Cutoff value for the cutoff type provided. Not used with F1 score."
)

predict_parser.add_argument("-cutoff_file", "--cutoff_file",
predict_parser.add_argument("-cf", "-cutoff_file", "--cutoff_file",
dest="cutoff_file",
type=str,
help="Cutoff file provided in /data/models"
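
The alias changes in this parser diff rely on `argparse` accepting several option strings for one argument and storing the value under a single `dest`. The sketch below reproduces three of the `add_argument` calls shown above on a standalone parser; the `predict-sketch` program name, the shortened help strings, and the example file names are assumptions for illustration, not the project's actual setup.

```python
import argparse

# Hypothetical stand-in for the predict subparser built in get_parser().
parser = argparse.ArgumentParser(prog="predict-sketch")

# Every option string listed here writes to args.roi.
parser.add_argument("--bed", "--peaks", "--regions", "--roi", "-roi",
                    dest="roi",
                    default=False,
                    required=False,
                    help="BED file of regions to focus predictions on")

parser.add_argument("-cs", "-chrom_sizes", "--chrom_sizes", "--chromosome_sizes",
                    dest="chrom_sizes",
                    type=str,
                    help="Chromosome sizes file")

parser.add_argument("-ct", "-cutoff_type", "--cutoff_type",
                    dest="cutoff_type",
                    default="F1",
                    type=str,
                    help="Cutoff type (i.e. Precision)")

# Long and short spellings resolve to the same attributes.
args_long = parser.parse_args(
    ["--roi", "peaks.bed", "--chromosome_sizes", "hg38.chrom.sizes"])
args_short = parser.parse_args(
    ["-roi", "peaks.bed", "-cs", "hg38.chrom.sizes", "-ct", "Precision"])
```

Because the attribute name comes from `dest`, downstream code such as `run_prediction` can keep reading `args.roi` or `args.chrom_sizes` regardless of which alias the user typed.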