Skip to content

Commit

Permalink
Merge pull request #113 from MiraldiLab/tests
Browse files Browse the repository at this point in the history
  • Loading branch information
tacazares committed Jan 23, 2023
2 parents 4ad3cdc + 31abfa8 commit 0ab73fb
Show file tree
Hide file tree
Showing 43 changed files with 1,116 additions and 1,235 deletions.
7 changes: 3 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,16 +21,14 @@ ___

It is best to install maxATAC into a dedicated virtual environment.

This version requires python 3.9, `bedtools`, `samtools`, `pigz`, `wget`, `git`, `graphviz`, and `bedGraphToBigWig` in order to run all functions.
This version requires python 3.9, `bedtools`, `samtools`, `pigz`, `wget`, `git`, `graphviz`, and `ucsc-bedgraphtobigwig` in order to run all functions.

> The total install requirements for maxATAC with reference data are ~2 GB.
> The total install data requirements for maxATAC is ~2 GB.
### Installing with Conda

1. Create a conda environment for maxATAC with `conda create -n maxatac -c bioconda python=3.9 samtools wget bedtools ucsc-bedgraphtobigwig pigz`

> If you get an error installing ucsc-bedgraphtobigwig try `conda install -c bioconda ucsc-bedgraphtobigwig`
> If you get an error regarding graphviz while training a model, re-install graphviz with `conda install graphviz`
2. Install maxATAC with `pip install maxatac`
Expand All @@ -40,6 +38,7 @@ This version requires python 3.9, `bedtools`, `samtools`, `pigz`, `wget`, `git`,
4. Download reference data with `maxatac data`

> If you have an error related to pybigwig, reference issues: [96](https://github.com/MiraldiLab/maxATAC/issues/96) and [87](https://github.com/MiraldiLab/maxATAC/issues/87#issue-1139117054)
### Installing with python virtualenv

1. Create a virtual environment for maxATAC with `virtualenv -p python3.9 maxatac`.
Expand Down
36 changes: 22 additions & 14 deletions docs/readme/average.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,38 +4,46 @@ The `average` function will average multiple bigwig files into a single bigwig f

## Example

Example command using only required flags:

```bash
maxatac average -i *.bw -n IMR-90
```

Example command using all flags:

```bash
maxatac average --bigwigs *.bw --prefix IMR-90 --output ./test --chroms chr1
maxatac average -i *.bw -n IMR-90 -o ./test -c chr1 -cs hg38.chrom.sizes
```

## Required Arguments

### `-i, --bigwigs`
### `-i`

This argument is reserved for the input bigwig files to be averaged together. You could use a `*.bw` wildcard to make a list of bigwig files as input or provide the path to two bigwig files.
The input bigwig files. You could use a `*.bw` wildcard to make a list of bigwig files as input or provide the path to each file.

### `--prefix`
### `-n`, `--name`, `--prefix`

This argument is reserved for the prefix used to build the output filename. This can be any string. The extension `.bw` will be added to the filename prefix.
The name string used to build the output filename. The extension `.bw` will be added to the filename.

## Optional Arguments

### `--chrom_sizes`
### `-cs`, `--chrom_sizes`, `--chromosome_sizes`

This argument is used to define the chromosome sizes file that is used to calculate the chromosome ends. The current default is for hg38.
The chromosome sizes file for the reference genome used during alignment. The current default is set for hg38.

### `--chromosomes`
### `-c`, `--chroms`, `--chromosomes`

This argument is used to define the chromosomes that are averaged together. Only the chromosomes in this list will be written to the output file. The current default list of chromosomes are restricted to the autosomal chromosomes:
The chromosomes that are averaged together and written to output. Only the chromosomes in this list will be written to the output file. The current default list of chromosomes are restricted to the autosomal chromosomes:

```pre
chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22
```bash
chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
```

### `--output`
### `-o`, `--output`, `--output_dir`

This argument is used to define the output directory. If the output directory is not supplied a directory called `./average` will be created in the current working directory.
The output directory. If the output directory is not supplied the file will be created in the current working directory.

### `--loglevel`

This argument is used to set the logging level. Currently, the only working logging level is `ERROR`.
Set the logging level. Currently, the only working logging level is `ERROR`.
2 changes: 1 addition & 1 deletion docs/readme/benchmark.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,7 @@ This flag will set the precision of the predictions signal track. Provide an int

The output directory to write the results to. Default: `./prediction_results`

### `--blacklist`
### `--blacklist_bw`

The path to the blacklist bigwig signal track of regions that should be excluded. Default: `hg38_maxatac_blacklist.bed` which contains regions that are specific to ATAC-seq.

Expand Down
36 changes: 17 additions & 19 deletions docs/readme/normalize.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,30 +3,32 @@
The `normalize` function will normalize an input bigwig file based on the following approaches:

* `min-max`: Find the genomic min and max values, then scale them between `[0,1]` or some-user defined range. The max value can be calculated as (1) the absolute max value across the genome (traditional definition of min-max) or (2) you can set a percentile cutoff to use as the max value. Option 2 improved robustness to outlying high ATAC-seq signal and maxATAC prediction accuracy. Specifically, we use the 99th-percentile max value instead of the absolute max value, and, given important performance ramifications, is the default.
* `median-mad`: Find the genomic median and calculate the median absolute deviation.
* `zscore`: Set the mean value to 0 with a standard deviation of 1.
* `arcsinh`: Transform the values using an inverse hyperbolic sin transformation (arcsinh)

## Example

```bash
maxatac normalize --signal GM12878_RP20M.bw --prefix GM12878 --output ./test --method min-max --max_percentile 99
maxatac normalize -i GM12878_RP20M.bw -name GM12878_minmax -o ./test --method min-max --max_percentile 99
```

## Required Arguments

### `--signal`
### `-i`, `--signal`

The input bigwig file to be normalized.

### `-n`, `--name`, `--prefix`

The name used to build the output filename. This can be any string.

## Optional Arguments

### `--method`

This argument is used to determine which method to use for normalization. Default: `min-max`
The method to use for normalization. Default: `min-max`

* `min-max`: Find the genomic min and max values, then scale them between `[0,1]` or some user-defined range. The max value can be calculated as (1) the absolute max value across the genome (traditional definition of min-max) or (2) you can set a percentile cutoff to use as the max value. Option 2 improved robustness to outlying high ATAC-seq signal and maxATAC prediction accuracy. Specifically, we use the 99th-percentile max value instead of the absolute max value, and, given important performance ramifications, is the default.
* `median-mad`: Find the genomic median and calculate the median absolute deviation.
* `zscore`: Set the mean value to 0 with a standard deviation of 1.
* `arcsinh`: Transform the values using an inverse hyperbolic sin transformation (arcsinh)

Expand All @@ -36,40 +38,36 @@ If method is `min-max` this argument will set the percentile value to use as the

### `--min`

The value to use as the minimum value for `min-max` normalization. Default: `0`
The minimum value for `min-max` normalization. Default: `0`

### `--max`

The value to use as the maximum value for `min-max` normalization. Default: `False`, so that max is calculated based on the ATAC signal track.
The maximum value for `min-max` normalization. Default: `False`, so that max is calculated based on the ATAC signal track.

### `--clip`

This flag determines whether to clip the values that are above the max value used in `min-max` normalization or to leave them as their real value. Default: `False`

### `--prefix`

This argument is reserved for the prefix used to build the output filename. This can be any string.

### `--chroms`
### `-c`, `--chroms`, `--chromosomes`

This argument is used to define the chromosomes that are averaged together. Only the chromosomes in this list will be written to the output file. The current default list of chromosomes are restricted to the autosomal chromosomes:
Define the chromosomes that are normalized. Only the chromosomes in this list will be written to the output file. The current default list of chromosomes are restricted to the autosomal chromosomes:

```pre
chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22
```

### `--chrom_sizes`
### `-cs`, `--chrom_sizes`, `--chromosome_sizes`

This argument is used to define the chromosome sizes file that is used to calcuate the chromosome ends. The current default file are the chromosome sizes for hg38.
Define the chromosome sizes file. The current default file are the chromosome sizes for hg38.

### `--blacklist`
### `--blacklist_bw`

The path to the blacklist bigwig file. This file is used to remove all the regions that are considered to have high technical noise. Default: maxATAC publication-defined blacklist.

### `--output`
### `-o`, `--output`, `--output_dir`

This argument is used to define the output directory. If the output directory is not supplied a directory called `./normalize` will be created in the current working directory.
Define the output directory. If the output directory is not supplied a directory called `./normalize` will be created in the current working directory.

### `--loglevel`

This argument is used to set the logging level. Currently, the only working logging level is `ERROR`.
Set the logging level. Currently, the only working logging level is `ERROR`.
32 changes: 18 additions & 14 deletions docs/readme/predict.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,13 +20,17 @@ maxatac predict --tf CTCF --signal GM12878.bigwig

The user must provide either the TF name that they want to make predictions for or the h5 model file they desire. If the user provides a TF name, the best model will be used and the correct threshold file will be provided for peak calling.

### `-s, --signal`
### `-s, --signal, -i`

The ATAC-seq signal bigwig track that will be used to make predictions of TF binding.

### `-n, --name, --prefix`

Output filename prefix to use. Default `maxatac_predict`.

## Optional Arguments

### `--sequence`
### `--sequence, --seq`

This argument specifies the path to the 2bit DNA sequence for the genome of interest. maxATAC models are trained with hg38 so you will need the correct `.2bit` file.

Expand All @@ -42,17 +46,17 @@ The cutoff value for the cutoff type provided. Note precision, recall, and F1-sc

The cutoff file provided in /data/models that corresponds to the average validation performance metrics for the TF model.

### `--output`
### `-o, --output`

Output directory path. Default: `./prediction_results`

### `--blacklist`
### `-bl, --blacklist`

The path to a bigwig file that has regions to exclude. Default: maxATAC-defined blacklist.

### `--roi`
### `--bed, --peaks, --regions, , --roi, -roi`

The path to a bed file that contains the genomic regions to predict TF binding in. These regions should be at least 1024 bp, the maxATAC model input regions.
The path to a bed file that contains the genomic regions to focus TF predictions on. These peaks will be used to refine the prediction windows.

### `--batch_size`

Expand All @@ -62,22 +66,22 @@ The number of regions to predict on per batch. Default `10000`. Decrease this va

The step size to use for building the prediction intervals. Overlapping prediction bins will be averaged together. Default: `INPUT_LENGTH/4`, where INPUT_LENGTH is the maxATAC model input size of 1,024 bp.

### `--prefix`

Output filename prefix to use. Default `maxatac_predict`.

### `--chromosome_sizes`
### `-cs, --chrom_sizes, -chrom_sizes, --chromosome_sizes`

The path to the chromosome sizes file. This is used to generate the bigwig signal tracks.

### `--chromosomes`
### `-c, -chroms, --chromosomes`

The chromosomes to make predictions on. Our models do not currently considered chromosomes X or Y. This means that most of the files will not contain this information. You should not predict in chrX or chrY unless you know your bigwig contains these chromosomes. Default: Autosomal chromosomes 1-22.

### `--loglevel`

This argument is used to set the logging level. Currently, the only working logging level is `ERROR`.

### `-bin, --bin_size`
### `-w, --windows`

The windows to use for prediction. These windows must be 1,024 bp wide and have a consistent step size.

### `-skip_call_peaks, --skip_call_peaks`

The bin size to use for calling peaks. Default: 200 bp based on the same sized used for benchmarking predictions.
This will skip calling peaks at the end of predictions.
20 changes: 10 additions & 10 deletions docs/readme/prepare.md
Original file line number Diff line number Diff line change
Expand Up @@ -35,54 +35,54 @@ Description of filename parts:

## Required Arguments

### `-i, --input`
### `-i`, `--input`

The input file to be processed. The input file can be either:

* `.bam`: Bulk ATAC-seq BAM file.
* `.tsv`: 10X scATAC fragments file. Must end in `.tsv` or `.tsv.gz`.

### `-o, --output`
### `-o`, `--output`

The output directory path.

### `-prefix`
### `-n`, `-name`, `--prefix`, `-prefix`

This argument is used to set the prefix for setting the filenames.

## Optional Arguments

The default values for the optional arguments are based on the testing performed in the maxATAC publication. See the [Methods](https://www.biorxiv.org/content/10.1101/2022.01.28.478235v1.article-metrics) of our publication for a detailed explanation of each parameter choice.

### `-skip_dedup, --skip_deduplication`
### `-skip_dedup`, `--skip_deduplication`

It is important to remove PCR duplicates from your ATAC-seq data if you have not done so already. Include this flag to perform PCR deduplication of the input BAM file if you know that it has not been deduplicated. Skipping this step will speed up data processing. Defualt: False

### `-slop, --slop`
### `-slop`, `--slop`

The slop size used to smooth sparse Tn5 cut sites' signal. Each Tn5 cut site will be extended +/- the slop size (in bp). Because maxATAC models were trained using slop size of 20bp (a value that approximates the size of Tn5 transposase), **this parameter should not be changed from default (20 bp) when using the trained models provided by maxATAC**. Default: 20 bp.

### `-rpm, --rpm_factor`
### `-rpm`, `--rpm_factor`

The reads per million (RPM) factor used for read-depth normalization of signal. Most groups use RPM and therefore 1,000,000 as a scaling factor, but maxATAC uses RP20M and therefore 20,000,000 because it is approximately the median sequencing depth of the ATAC-seq data used for training. Changing from the default (20000000) is not problematic for maxATAC prediction, as this track is only used for visualization. (Predictions are made on a min-max-like normalized signal track, also an output from `maxatac prepare`.) Default: 20000000.

### `--blacklist_bed`
### `--blacklist`

The path to the blacklist bed file. Default: maxATAC defined blacklisted regions for hg38.

### `--blacklist_bw`

The path to the blacklist bigwig file. Default: maxATAC defined blacklist for hg38 as a bigwig file.

### `--chrom_sizes`
### `-cs`, `--chrom_sizes`, `--chromosome_sizes`

The chromosome sizes file. Default: hg38 chrom sizes.

### `-chroms, --chromosomes`
### `-c`, `-chroms`, `--chromosomes`

The chromosomes to use for the final output. Default: Autosomal chromosomes chr1-22.

### `-threads`
### `-t`, `-threads`

The number of threads to use. Default: Get available CPU count.

Expand Down
2 changes: 2 additions & 0 deletions docs/readme/train.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,9 @@ maxatac train --arch DCNN_V2 --sequence hg38.2bit --meta_file CTCF_meta.tsv --ou
```

## Required Arguments
### `--genome`

Specify which genome build this task is specified for (i.e. hg38).
### `--sequence`

This argument specifies the path to the 2bit DNA sequence for the genome of interest
Expand Down
16 changes: 10 additions & 6 deletions docs/readme/variants.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,33 +10,37 @@ maxatac variants -m ELF1_99.h5 -signal GM12878__slop20bp_RP20M_minmax01.bw -name

## Required Arguments

### `--genome`

Specify which genome build this task is specified for (i.e. hg38).

### `-m, --model`

The trained maxATAC model that will be used to predict TF binding. This is a h5 file produced from `maxatac train`.

### `--signal`
### `-i`, `-s`, `--signal`

The ATAC-seq signal bigwig track that will be used to make predictions of TF binding.

### `--variants_bed`

The bed file of nucleotides to change. The first 3 columns should be the coordinates and the fourth column should be the nucleotide to use.

### `-n, --name`
### `-n`, `--name`, `--prefix`

Output filename prefix to use.

## Optional Arguments

### `-s, --sequence`
### `-s`, `--sequence`

This argument specifies the path to the 2bit DNA sequence for the genome of interest. Default: hg38.2bit

### `-roi`

The bed file of intervals to use for prediction windows. Predictions will be limited to these specific regions. Only the first 3 columns of the file will be considered when making the prediction windows. Default: Whole genome prediction.

### `-o, --output`
### `-o`, `--output_dir`

Output directory path. Default: `./variantss`

Expand All @@ -52,10 +56,10 @@ The path to a bigwig file that has regions to exclude. Default: maxATAC defined

The number of base pairs to overlap the 1,024 bp regions during prediction. This should be in multiples of 256. Default: 256

### `-chroms, --chromosomes`
### `-c`, `-chroms`, `--chromosomes`

The chromosomes to make predictions on. Default: All chromosomes. chr1-22, X, Y

### `--chrom_sizes`
### `-cs`, `--chrom_sizes`, `--chromosome_sizes`

The path to the chromosome sizes file. This is used to generate the bigwig signal tracks. Default: hg38.chrom.sizes
Loading

0 comments on commit 0ab73fb

Please sign in to comment.