Info about accessing the legacy version:
For the legacy version used in Kim et al., Nature Genetics, 2020 please see the scripts and README in the "legacy_natgen_2020" folder of this repo. The legacy version is only recommended for reproducing the paper results, and not for state-of-the-art amplicon classification. The legacy version was developed by Nam Nguyen, Jens Luebeck, and Hoon Kim. The current version is developed and maintained by Jens Luebeck.
If using AmpliconClassifier (current version), please cite:
Luebeck et al., Extrachromosomal DNA in the cancerous transformation of Barrett's esophagus.
biorXiv. 2022.
Please note that this software is actively being developed. Stable versions are released on the main branch.
- Supports both python2 and python3.
intervaltree
(python library):pip install intervaltree
scipy
(python library):pip install scipy
$AA_DATA_REPO
environment variable. For instructions see AmpliconArchitect installation.matplotlib
(python library, optional):pip install matplotlib
git clone https://github.com/jluebeck/AmpliconClassifier.git
cd AmpliconClassifier
echo export AC_SRC=$PWD >> ~/.bashrc
source ~/.bashrc
amplicon_classifier.py
takes a collection of (or single) AA graph files and corresponding AA cycles file as inputs.
To classify a single amplicon,
python amplicon_classifier.py --ref [hg19, GRCh37, or GRCh38] --cycles [/path/to/amplicon_cycles.txt] --graph [/path/to/amplicon_graph.txt] > classifier_stdout.log
If passing a list of amplicons, the --input
argument must be formatted as follows, with one amplicon per line:
sample_name_amplicon1 /path/to/sample_name_amplicon1_cycles.txt /path/to/sample_name_amplicon1_graph.txt
To generate the multi-amplicon input file automatically, you can use the make_input.sh
script, which takes a path and an output prefix.
make_input.sh /path/to/AA/output/directory/ [some_prefix]
To subsequently generate classifications for a list of amplicons:
python amplicon_classifier.py --ref [hg19, GRCh37, or GRCh38] --input [file with list of your amplicons] > classifier_stdout.log
and it will search for the cycles and graph files in that directory, and pair the locations into a text file compatible with the --input
argument.
There is also an experimental option you can set to visualize the strength of each amplicon class assigned to an amplicon, which can be turned on by setting --plotStyle individual
.
If combining data from both GRCh37 and hg19 in the same classification run, you can set the flag --add_chr_tag
to add the "chr" prefix to each chromosome name and effectively unify everything as hg19-based.
Contains an abstract classification of the amplicon, and also indicates in separate columns "BFB+" and "ecDNA+" status. Note that amplicons receiving a "Cyclic" classification may be ecDNA+, BFB+ or both.
Column name | Contents |
---|---|
sample_name |
Sample name prefix |
amplicon_number |
AA amplicon index, e.g. amplicon2 |
amplicon_decomposition_class |
Abstract description of the AA amplicon type. Note that Cyclic can refer to either BFB or ecDNA. Please see the following columns for that distinction. |
ecDNA+ |
Prediction about whether the AA amplicon contains ecDNA. Note, an AA amplicon may contain regions surrounding the ecDNA, or multiple linked ecDNA. Either Positive or None detected |
BFB+ |
Prediction about whether the AA amplicon is the result of a BFB. Either Positive or None detected |
ecDNA_amplicons |
Predicted number of distinct (non-overlapping) ecDNA which are represented in a single AA amplicon. This estimate is highly experimental. |
Because an ecDNA may overlap with a BFB, they are reported separately.
Reports the genes present on amplicons with each classification, and which genomic feature (e.g. ecDNA_1, BFB_1, etc), it is located on, along with the copy number and which end(s) of the gene have been lost ("truncated"), will be one of None
, 5p
(5-prime end), 3p
(3-prime end) or 5p_3p
if both. Genes are sourced from RefGene and most lncRNAs and micro-RNAs are excluded from the report.
Column name | Contents |
---|---|
sample_name |
Sample name prefix |
amplicon_number |
AA amplicon index, e.g. amplicon2 |
feature |
Which feature inside the amplicon the gene is present on. May be unknown if cannot be confidently assigned to a feature. |
gene |
Gene name (RefGene) |
gene_cn |
Maximum copy number of genomic segments (larger than 1kbp) overlapping the gene, as reported by AA |
truncated |
Which end(s) of the gene have been lost ("truncated"), will be one of None , 5p (5-prime end), 3p (3-prime end) or 5p_3p if both |
is_canonical_oncogene |
Reports if gene is present in COSMIC, ONGene, or the combined oncogene lists reported in Luebeck et al. biorXiv, 2022. |
Reports amplicon complexity scores as measured by the number of genomic segments and the diversity of copy number among all the amplicon decompositions performed by AA. For more information please see this pre-print.
Column name | Contents |
---|---|
sample_name |
Sample name prefix |
amplicon_number |
AA amplicon index, e.g. amplicon2 |
feature |
Which feature inside the amplicon the gene is present on. May be unknown if cannot be confidently assigned to a feature. |
total_feature_entropy |
This is the amplicon complexity score. |
decomp_entropy |
Amount of entropy or diversity captured in the AA decompositions overlapping this feature. |
Amp_nseg_entropy |
Amount of entropy or diversity captured by the number of genomic segments overlapping this feature. |
This two-column file reports the sample_name
and the number of ecDNA identified in the sample across all amplicons from the sample.
Additionally, there are three directories that can be created by amplicon_classifier.py
. They are
[prefix]_classification_bed_files/
, which contains bed files of the regions classified into each feature. May contain bed files markedunknown
if the region could not be confidently assigned.[prefix]_SV_summaries/
, which contains tab-separated files summarizing the SVs detected by AA and what features the overlap in the amplicon.[prefix]_annotated_cycles_files/
, which contains AA cycles files with additional annotations about length of discovered paths/cycles and their classification status.
Running on a single AA amplicon:
-c/--cycles
: AA cycles file-g/--graph
: AA graph file
OR running on multiple amplicons
-i/--input
: Tab-separated file containing one or more amplicons formatted as
sample_name_amplicon1 /path/to/sample_name_amplicon1_cycles.txt /path/to/sample_name_amplicon1_graph.txt
Other arguments
-
--ref [hg19, GRCh37, GRCh38, mm10, or GRCm38]
: (Required) Choose reference genome version used in generating AA output. -
-v/--version
: Print version and exit. -
-o
: Output filename prefix. Default is prefix of-i
or-c
. -
--add_chr_tag
: If you have a mix of hg19 and GRCh37 amplicons, you can set--ref hg19
and--add_chr_tag
to classify them all together. -
--min_flow
: Minumum cycle CN flow to consider among decomposed paths (default=1). -
--min_size
: Minimum cycle size (in bp) to consider as valid amplicon (default=5000). -
--report_genes [ecdna, bfb, other, all]
: Extract list of genes and bed files from amplicons with given classification(s).) -
--report_comlexity
: Report a measurement of the amplicon's 'complexity' score, which represents a measurement of the complexity of the AA breakpoint graph decomposition. -
--verbose_classification
: Output verbose information in theamplicon_classification_profiles.tsv
file, and createedge_classification_profiles.tsv
. Useful for debugging. -
--force
: Disable No amp/Invalid class, if possible. Use only when extremely large CN seeds were used in AA amplicon generation (>10 Mbp intervals). -
--plotstyle [noplot, individual]
: Produce a radar-style plot of classification strengths. Defaultnoplot
. -
--decomposition_strictness
: Value between 0 and 1 reflecting how strictly to filter low CN decompositions (default = 0.1). Higher values filter more of the low-weight decompositions. -
--exclude_bed
: Provide a bed file of regions to ignore during classification. Useful for separating linked amplicons or augmenting low-complexity annotations. -
--no_LC_filter
: Set this to turn off filtering low-complexity & poor mappability genome region paths & cycles. -
--no_LC_filter
: Set this to turn off filtering low-complexity & poor mappability genome region paths & cycles. -
"--filter_similar
: Only use if all samples are of independent origins (not replicates and not multi-region biopsies). Permits filtering of false-positive amps arising in multiple independent samples based on similarity calculation.
One may wish to compare two overlapping focal amplifications and quantify their similarity - particularly when they are derived from multi-region
or longitudinal sampling. We provide a script which a) identifies overlap between pairs of amplicons (using the same input file as amplicon_classifier.py
),
b) computes measurements of the similarity of the two overlapping amplicons based on shared breakpoints and shared genomic content -
using both a Jaccard index approach and also our own Symmetric Similarity Score and Asymmetric Similarity Score approaches, and c) compares the scores against
the similarity scores for overlapping amplicons derived from unrelated origins (data derived from Turner et al. Nature 2017 and deCarvalho et al. Nature Genetics 2018, Bergstrom et al. Nature 2020 and Paulson et al. Nature Communications).
The output file *_similarity_scores.tsv
reports the following columns:
- Amplicon 1 ID & Amplicon 2 ID
- Symmetric Similarity Score (a combination of GenomicSegment and Breakpoint scores)
- Percentile and P-value of the Sym. Score in the background unrelated overlapping amplicon distribution. P-value based on beta distribution fit to similarity scores.
GenomicSegmentScore1
&GenomicSegmentScore2
based on the directional similarity of genomic segment overlap (Amp1 and Amp2)/(Amp1) or (Amp1 and Amp2)/(Amp2), respectively.BreakpointScore1
&BreakpointScore2
based on the directional similarity of breakpoint matching (Amp1 and Amp2)/(Amp1) or (Amp1 and Amp2)/(Amp2), respectively.JaccardGenomicSegment
, based on overlap of genomic segments (based on overlap of genomic coordinates)JaccardBreakpoint
, based on overlap from matching of breakpoints.
The primary difference between amplicon_similarity.py
and feature_similarity.py
is that the latter only considers regions annotated as specific classification features (e.g. ecDNA, BFB) when computing similarity scores.
Example command for amplicon_similarity.py
$AC_SRC/amplicon_similarity.py --ref hg19 --add_chr_tag -i examples.input -o examples [--subset_bed some_intervals.bed] [--classification_file examples_amplicon_classification_profiles.tsv] [--required_classifications ecDNA BFB "Complex non-cyclic"]
Where "examples.input"
is the tab-separate file generated by make_input.sh /path/to/AA/outputs/ examples
and --subset_bed
is an optional bed file restricting the similarity
calculations to one or more regions of the genome. Can provide classification file and list of classes to only perform similarity classifications on some amplicon classes.
Example command for feature_similarity.py
$AC_SRC/feature_similarity.py --ref hg38 -f [sample]_features_to_graph.txt --min_de 0 [--include_path_in_feature_name]
Where "[sample]_features_to_graph.txt" is one of the output files generated by amplicon_classifier.py
. --include_path_in_features
is useful for comparing two runs with the same sample names
(after combining both _features_to_graph.txt
files from each run into one file).