"No CNV remains after filtering" using default settings on 10x lymphoma data #55

Closed
juliashay opened this issue Nov 8, 2022 · 4 comments


@juliashay

I wanted to use Numbat to investigate copy number variations in a publicly available 10x lymphoma dataset. My attempts ended with Numbat finding no copy number variations, even though this dataset presumably should have some.

Steps to reproduce:

  1. Start a Docker container with Numbat: docker run -v /home/jupyter/docker_test:/mnt/mydata -it pkharchenkolab/numbat-rbase:latest /bin/bash

  2. Download and extract the following files from the 10x Genomics download page: save the feature matrix along with the barcodes and features from lymph_node_lymphoma_14k_filtered_feature_bc_matrix.tar.gz as lymphoma_matrix.mtx, lymphoma_barcodes.tsv, and lymphoma_features.tsv; save the BAM file and its index from lymph_node_lymphoma_14k_gex_possorted_bam.bam as lymphoma.bam and lymphoma.bam.bai.

  3. Inside the container, run

Rscript /numbat/inst/bin/pileup_and_phase.R --label lymphoma --samples lymphoma --bams /mnt/mydata/lymphoma.bam --barcodes /mnt/mydata/lymphoma_barcodes.tsv --outdir /mnt/mydata/lymphoma --gmap /Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz --snpvcf /data/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf --paneldir /data/1000G_hg38 --ncores 16

to generate an allele count file.

  4. Prepare the matrix and allele count file to work with Numbat (a quick sanity check of these objects is sketched after this list):
# Load the gene x cell count matrix and convert it to a sparse dgCMatrix
mtx <- as(Matrix::readMM("/mnt/mydata/lymphoma_matrix.mtx"), "dgCMatrix")
# Allele counts generated by pileup_and_phase.R in step 3
df_allele_cnt <- read.csv("/mnt/mydata/lymphoma/lymphoma_allele_counts.tsv", sep = "\t")
features <- read.csv("/mnt/mydata/lymphoma_features.tsv", sep = "\t", header = FALSE)
barcodes <- read.csv("/mnt/mydata/lymphoma_barcodes.tsv", sep = "\t", header = FALSE)
# Name rows by gene symbol and columns by cell barcode
dimnames(mtx) <- list(features$V2, barcodes$V1)

# Keep only the first occurrence of each gene symbol so row names are unique
duplicate_genes <- duplicated(features$V2)
features_new <- features[!duplicate_genes, ]
mtx_new <- mtx[!duplicate_genes, ]
  5. Run Numbat:
library(numbat)
run_numbat(mtx_new, ref_hca, df_allele_cnt, genome = "hg38", t = 1e-5, ncores = 4, plot = TRUE, out_dir = "/mnt/mydata/lymphoma_results")
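
For completeness, a minimal sanity check of the prepared objects (a sketch; it assumes the allele count table has a cell barcode column named cell, as in current Numbat output):

# Genes x cells after dropping duplicated gene symbols
dim(mtx_new)
# Barcodes in the allele count table should overlap the matrix column names
length(intersect(unique(df_allele_cnt$cell), colnames(mtx_new)))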

Expected result:
I expected Numbat to find some copy number variations in this dataset.

Actual result:
Numbat runs for 1 iteration, then terminates with a "No CNV after filtering - terminating" message (see the attached screenshot).

[screenshot attached]

Please help me understand the results I am getting: are there settings that should be different? Is the result expected for this dataset?

@teng-gao (Collaborator) commented Nov 8, 2022

Thanks for the detailed report. We have analyzed this sample before and Numbat should find some CNVs. Could you attach the outputs (bulk_subtrees_1.png, exp_roll_clust.png, and joint_post_1.tsv) from the results folder?

@juliashay (Author)

Thanks for the detailed report. We have analyzed this sample before and Numbat should find some CNVs.

That's good to know.

Could you attach the outputs (bulk_subtrees_1.png, exp_roll_clust.png, and joint_post_1.tsv) from the results folder?

In my run, Numbat terminated before it generated the results folder and any of these files.

@teng-gao (Collaborator) commented Nov 8, 2022

That would be surprising, because the output folder is created before any of the analysis steps are executed:

https://github.com/kharchenkolab/numbat/blob/main/R/main.R#L109-L112

Please double-check; I'm fairly sure the results folder shouldn't be empty. The output files are also generated at each step, not all at once at the end.

@teng-gao (Collaborator) commented Nov 9, 2022

Hi @juliashay ,

I was able to reproduce your results. The CNVs are filtered out because of the max_entropy threshold. You can check with:

library(data.table); library(dplyr)
fread('~/lymphoma/lymphoma_results/joint_post_1.tsv') %>% distinct(seg, avg_entropy) %>% count(avg_entropy < 0.5)

This is also documented in the FAQ section.

The uncertainty in the single-cell posterior is high because of the high noise in the expression logFC (ref_hca is not a great match for this dataset; a custom reference would work better). Alternatively, set max_entropy = 0.8 and the workflow should succeed.
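
To illustrate, here is a minimal sketch of the rerun (it assumes the objects from step 4 above are still in the session; the relaxed threshold comes from the suggestion above, and the annot data frame used in the custom-reference option is hypothetical):

library(numbat)

# Option 1: relax the single-cell entropy filter and rerun
run_numbat(mtx_new, ref_hca, df_allele_cnt, genome = "hg38", t = 1e-5,
           ncores = 4, plot = TRUE, max_entropy = 0.8,
           out_dir = "/mnt/mydata/lymphoma_results_maxent")

# Option 2 (sketch): build a custom expression reference from annotated
# normal cells instead of ref_hca. annot is a hypothetical data frame with
# columns cell (barcode) and group (cell type label).
# ref_custom <- aggregate_counts(mtx_new, annot)
# run_numbat(mtx_new, ref_custom, df_allele_cnt, genome = "hg38", t = 1e-5,
#            ncores = 4, plot = TRUE,
#            out_dir = "/mnt/mydata/lymphoma_results_customref")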

teng-gao closed this as completed Nov 9, 2022