Christoffer Flensburg edited this page Jan 30, 2022 · 5 revisions

Interpreting output

CNAsegments*.tsv

These tab-separated files hold all the superFreq information about the copy number calls, one line per segment. They serve both for manual QC and interpretation of copy number calls and for user-run automated downstream analysis of copy number calls.

The sheet contains many columns with internal superFreq metrics and statistics for completeness. The most relevant for users are:

chr: chromosome
start: start position of the segment on the chromosome.
end: end position of the segment on the chromosome.
M: Consensus Log2 fold change of read depth with respect to the reference normals across the segment, normalised for ploidy so that 0 corresponds to 2 copies, -1 corresponds to 1 copy, +1 is 4 copies, etc.
width: uncertainty of M.
f: Consensus B-allele frequency across the segment, running between 0 and 0.5. Due to variability, segments with balanced alleles will not show exactly 0.5, but are typically found around 0.45 or slightly higher for exomes. See postHet for a more detailed analysis.
ferr: uncertainty of f.
call: The absolute copy number call, written as the A and B alleles, so that for example AAAB means three copies of the major "A" allele and one copy of the minor "B" allele. The only exception is complete loss (copy number 0), which is denoted "CL".
clonality: the fraction of cells in the sample with the copy number alteration. Note, this is not CANCER fraction, it's SAMPLE fraction. SuperFreq does not explicitly calculate purity of samples, but instead calls the sample fraction of each copy number segment independently.
clonalityError: The uncertainty of clonality.
genes: the genes in the copy number segment. Useful for interpretations.
COSMIC_genes: the genes in the copy number segment that are also COSMIC census genes, i.e. thought to be cancer related.
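
As an illustration of how the columns above might be used in downstream analysis, here is a minimal Python sketch that reads a segment table, converts M into the average copy number it implies, and splits call into major and minor allele counts. The column names come from the list above; the function names and the two-row example table are made up for illustration, and the conversion `2 * 2**M` simply inverts the definition of M (0 is 2 copies, -1 is 1 copy, +1 is 4 copies), giving an average over all cells in the sample rather than a clone-specific copy number.

```python
import csv
import io


def mean_copy_number(m):
    """Average copy number across the sample implied by the log2 fold change M."""
    return 2.0 * 2.0 ** m


def parse_call(call):
    """Return (major, minor) allele counts from a call string such as 'AAAB'."""
    if call == "CL":  # complete loss, copy number 0
        return (0, 0)
    # Count the allele letters; any other characters are ignored (assumption).
    return (call.count("A"), call.count("B"))


def read_segments(handle):
    """Yield one dict per segment, keeping only a few user-facing columns."""
    for row in csv.DictReader(handle, delimiter="\t"):
        major, minor = parse_call(row["call"])
        m = float(row["M"])
        yield {
            "chr": row["chr"],
            # int(float(...)) tolerates positions written like 1e+06
            "start": int(float(row["start"])),
            "end": int(float(row["end"])),
            "M": m,
            "mean_copies": mean_copy_number(m),
            "major": major,
            "minor": minor,
            "clonality": float(row["clonality"]),
        }


# Made-up two-segment table standing in for a real CNAsegments file.
example = (
    "chr\tstart\tend\tM\tcall\tclonality\n"
    "1\t1000000\t5000000\t1.0\tAAAB\t1.0\n"
    "2\t2000000\t9000000\t-1.0\tA\t1.0\n"
)
segments = list(read_segments(io.StringIO(example)))
for seg in segments:
    print(seg["chr"], seg["major"], seg["minor"], seg["mean_copies"])
```

For a real file, `read_segments` could be given an open file handle instead of the in-memory example.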

The remaining columns are unlikely to be helpful to most users:

x1: start in a one-number position coordinate running across all chromosomes from 1 to a bit above 3 billion for a human.
x2: end in a one-number position coordinate running across all chromosomes from 1 to a bit above 3 billion for a human.
df: degrees of freedom of the t-distribution used to model the log fold change M (with the help of limma-voom).
var: total number of sample minor allele reads across all heterozygous SNPs in the segment, where the sample minor allele is the allele with <50% VAF.
cov: total read depth across all heterozygous SNPs in the segment.
Nsnps: number of heterozygous SNPs in the segment.
pHet: likelihood of the alleles being balanced.
pAlt: likelihood of the allele being unbalanced at the observed rate in column f.
odsHet: odds of the segment being balanced.
stat: log odds of the segment being balanced.
nullStat: Expected log odds under the null hypothesis of balanced alleles.
altStat: Expected log odds under the alternative hypothesis of unbalanced allele with fraction f.
nullStatErr: Expected variability of log odds in null hypothesis.
altStatErr: Expected variability of log odds in alternative hypothesis.
postHet: Posterior probability that the alleles are balanced in the segment, i.e. that the true value of f is 0.5.
sigma: Number of uncertainties that the data deviates from the called copy number state and clonality.
pCall: likelihood of data from the called copy number state and clonality.
subclonality: Clonality of a group of segments after a primitive clustering algorithm. Deprecated. Use the clonal tracking (which includes copy number alterations) in the river output if you want to look at clones or purity.
subclonalityError: Uncertainty of subclonality. Deprecated.
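
The one-number coordinate used by x1 and x2 can be pictured as the chromosomes laid end to end. The Python sketch below shows that idea with toy chromosome lengths. The lengths, the function names and the exact offset convention are assumptions for illustration, not taken from superFreq itself.

```python
from itertools import accumulate

# Hypothetical toy chromosome lengths in order, not a real genome build.
CHROM_LENGTHS = [("1", 1000), ("2", 800), ("3", 500)]


def chromosome_offsets(lengths):
    """Map each chromosome to the summed length of all chromosomes before it."""
    names = [name for name, _ in lengths]
    sizes = [size for _, size in lengths]
    # Running totals shifted by one: the first chromosome has offset 0.
    starts = [0] + list(accumulate(sizes))[:-1]
    return dict(zip(names, starts))


def to_genome_coordinate(chrom, pos, offsets):
    """Convert a (chromosome, position) pair to a single genome-wide position."""
    return offsets[chrom] + pos


offsets = chromosome_offsets(CHROM_LENGTHS)
print(to_genome_coordinate("2", 1, offsets))
```

With these toy lengths, position 1 on chromosome 2 falls directly after the last position of chromosome 1.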

The most common question from people looking at this file is whether a copy number call is real. SuperFreq is in general designed to return only reliable calls, as can be seen in figure 3 of https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007603, where the bottom panels show recall and precision of copy number calls as a function of segment size. In short, a call of 1 Mbp or larger is probably real, while calls of 100 kbp and smaller can go either way, though you should not normally see many calls that small. For copy number neutral loss of heterozygosity, you probably want segments to be a bit larger, a few Mbp, before you trust them. Similar plots for RNA-Seq are in figure 1 of https://www.biorxiv.org/content/10.1101/2020.05.31.126888v1.full.pdf.
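
The size rules of thumb above could be turned into a rough first-pass triage, sketched here in Python. The 1 Mbp and 100 kbp cut-offs come from the paragraph above. Reading "a few Mbp" as 3 Mbp, recognising copy number neutral loss of heterozygosity by the call "AA", and the label for sizes between the two cut-offs are all assumptions made for this example. This is not a substitute for looking at the precision and recall plots in the linked papers.

```python
MBP = 1_000_000
KBP = 1_000


def size_confidence(start, end, call, loh_min_bp=3 * MBP):
    """Rough confidence label for a segment, based only on its size."""
    size = end - start
    if call == "AA":
        # Assumed CNN-LOH: ask for a larger segment before trusting the call.
        return "probably real" if size >= loh_min_bp else "uncertain"
    if size >= MBP:
        return "probably real"
    if size <= 100 * KBP:
        return "uncertain"
    # The text gives no rule for sizes between 100 kbp and 1 Mbp.
    return "intermediate"


print(size_confidence(1_000_000, 6_000_000, "AAB"))
print(size_confidence(1_000_000, 1_050_000, "A"))
print(size_confidence(1_000_000, 2_500_000, "AA"))
```

The `loh_min_bp` parameter is exposed so the loss of heterozygosity threshold can be adjusted to the data at hand.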

If you suspect that your samples don't have the expected precision, then there might be quality issues with the samples or reference normals. In general, superFreq CNA calls rely heavily on the reference normals, and it's important that they are generated in the same way as the studied samples.
