Phylogeny on merged samples #130

Open
freddie090 opened this issue Jun 23, 2023 · 6 comments

Comments

@freddie090

Hi,

I have multiple samples from an experiment, each with ~6000 cells of 10X scRNA data. If I were to run NUMBAT on the entire merged experiment, the BAMs would be too big.

Is it possible to run NUMBAT on the independent samples, but then merge the samples for the phylogeny part of the analysis? (for example, as was done in figure 5a of the NUMBAT paper).

Additionally, if I were to subset the BAMs to a set of high-quality cells, merge the subsetted samples, and then run NUMBAT on these merged BAMs/expression matrices, would this improve the robustness of the analysis? I.e., is there any advantage to processing the samples simultaneously rather than independently?

Thanks

@teng-gao
Collaborator

Hi @freddie090 ,

You can genotype the samples (from the same individual) using the multi-sample mode of pileup_and_phase (you can provide a list of BAMs), and provide the combined count_mat and allele_df in a single Numbat run. Numbat should be able to handle 6000 cells fine.

The advantage is that you get consistent CNV and clone calls across samples and get an integrated phylogeny. Genotyping using multiple samples can also improve phasing accuracy.

Best,
Teng

@freddie090
Author

Hi Teng,

Ah great, okay - my hunch was the inference would be more robust if it had access to information from all samples at once. I'll give it a go!

Thanks -
Freddie

@freddie090
Author

Hi @teng-gao - sorry, just to clarify:

After running pileup_and_phase with a list of BAM files and corresponding sample names (each given as comma-separated values in a single argument, e.g. --samples samp_1,samp_2,samp_3 --bams samp_1.bam,samp_2.bam,samp_3.bam), the script produces an 'allele_counts.tsv' file for each sample.

Has the multi sample mode worked? I wasn't sure whether I should expect a single combined allele counts table for all samples. If not, then do you suggest manually merging the expression matrix and allele counts for each sample before running Numbat?

Best
Freddie
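
For reference, the comma-separated --samples and --bams values described above can be assembled from a list of sample names. This is a minimal sketch, not part of the original comment: the helper names join_csv and multi_sample_args are hypothetical, it assumes each BAM is named after its sample with a .bam suffix, and it only prints the argument fragment (the other arguments pileup_and_phase requires are left out).

```shell
# join_csv A B C  ->  prints "A,B,C"
join_csv() {
    # Inside the subshell, "$*" joins the arguments with the first character of IFS.
    (IFS=,; printf '%s\n' "$*")
}

# multi_sample_args SAMPLE...  ->  prints the --samples/--bams fragment.
# Assumption: the BAM for each sample is called SAMPLE.bam.
multi_sample_args() {
    bams=""
    for s in "$@"; do
        # Append ",SAMPLE.bam", omitting the comma for the first entry.
        bams="${bams:+$bams,}$s.bam"
    done
    printf '%s %s %s %s\n' --samples "$(join_csv "$@")" --bams "$bams"
}
```

Calling multi_sample_args samp_1 samp_2 samp_3 prints the same fragment quoted in the comment above.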

@teng-gao
Collaborator

Yes, you should get a separate allele count df for each sample. You can then concatenate them (ditto for the expression count matrices) before feeding them to run_numbat.
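
The concatenation of the per-sample allele count tables could look roughly like the sketch below. It is an illustration under assumptions rather than the maintainer's code: concat_tsv is a hypothetical helper, the inputs are assumed to be uncompressed tab-separated files that share an identical header line, and cell barcodes are assumed to be unique across samples already (if they are not, they would need a per-sample prefix or suffix that matches the expression matrix). It does not cover the expression matrices, which are gene-by-cell and so would be combined column-wise, typically in R.

```shell
# concat_tsv OUT IN1 [IN2 ...]
# Writes the header of IN1 once, then the data rows of every input, to OUT.
# Refuses to continue if any input has a different header line.
concat_tsv() {
    out=$1
    shift
    header=$(head -n 1 "$1")
    printf '%s\n' "$header" > "$out"
    for f in "$@"; do
        if [ "$(head -n 1 "$f")" != "$header" ]; then
            echo "header mismatch in $f" >&2
            return 1
        fi
        # Skip the header line and append the remaining rows.
        tail -n +2 "$f" >> "$out"
    done
}
```

For example, concat_tsv merged_allele_counts.tsv samp_1_allele_counts.tsv samp_2_allele_counts.tsv would produce one table (these file names are placeholders).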

@freddie090
Author

freddie090 commented Jul 27, 2023

Okay - and sorry final Q @teng-gao - are the sample identities preserved somewhere for distinguishing in the phylogeny plots later?

@teng-gao
Collaborator

> Okay - and sorry final Q @teng-gao - are the sample identities preserved somewhere for distinguishing in the phylogeny plots later?

You can plot the sample identities associated with cell barcodes on a sidebar using the annot = option in plot_phylo_heatmap:
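
A sidebar like this needs a mapping from cell barcode to sample. One possible way to prepare it (a hedged sketch, not the maintainer's code) is to derive a two-column table from the per-sample allele count files. The helper name cell_sample_table is hypothetical, the barcode is assumed to sit in the first column of a tab-separated file with a header line, and the exact structure that annot expects should be checked against the Numbat documentation.

```shell
# cell_sample_table SAMPLE FILE
# Prints "barcode<TAB>SAMPLE" for each unique barcode in column 1 of FILE.
# Assumption: FILE is tab-separated and its first line is a header.
cell_sample_table() {
    sample=$1
    file=$2
    tail -n +2 "$file" | cut -f 1 | sort -u |
        awk -v s="$sample" '{ print $0 "\t" s }'
}
```

Running this once per sample and appending the output to a single file yields a table that can be read into R and used to build the annotation.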
