Co-genotype with bulk WGS and scRNA-seq #106

DrFengchenzhao · 2023-02-21T11:47:11Z

Hi Teng Gao,

I have implemented your methods on my scRNA-seq data which showed great advantages over others. I just noticed that you mentioned
'Therefore, a priori genotyping by DNA or co-genotyping with scRNA-seq (via multi-sample mode of pileup-and-phase) would be especially useful.'
on suggestions of preparing spatial transcriptomics data.

We sequenced bulk WGS, scRNA-seq and ST on the same sample. I wondered how I can harmonize these omics, especially using SNPs genotyped by bulk WGS to further increase power to detect CNVs in scRNA-seq.

Also, does 'via multi-sample mode of pileup-and-phase' mean that I should run pileup-and-phase.R with scRNA-seq and ST together?

Thank you for your patience. Looking forward to your reply.

Dr. Chenzhao Feng

teng-gao · 2023-02-27T15:52:41Z

Hi @DrFengchenzhao ,

Thanks for the issue.

We sequenced bulk WGS, scRNA-seq and ST on the same sample. I wondered how I can harmonize these omics, especially using SNPs genotyped by bulk WGS to further increase power to detect CNVs in scRNA-seq.

Yes, this is possible. You can first run allele pileup using cellsnp-lite with the WGS heterozygous SNP VCF. Then you run eagle2 phasing on the WGS VCF, and merge the phased GT fields with the allele counts to produce an allele dataframe in the format of Numbat input. Please refer to the underlying code in pileup_and_phase.R script to see how to modify the procedure to do this.

Also, does 'via multi-sample mode of pileup-and-phase' mean that I should run pileup-and-phase.R with scRNA-seq and ST together?

Yes, you can do that. This will help increase allele coverage for co-genotyping.

teng-gao · 2023-04-02T20:26:29Z

Instructions are now added in the getting started vignette.

cnk113 · 2023-06-13T19:42:08Z

Hey Teng,

I'm a bit lost after running cellsnp on the het regions, how do proceed after that?
Would I attempt to the df_allele matrix from the cellsnp output?

Thanks,
Chang

teng-gao · 2023-06-14T02:14:25Z

Hi @cnk113 ,

You can probably directly use the preprocess_allele function to get df_allele from the cell-snp counts and WGS phased VCF:

numbat/R/genotyping.R

Lines 141 to 293 in dc5c3fe

    
           #' Preprocess allele data 
        
           #'  
        
           #' @param sample character Sample label 
        
           #' @param vcf_pu dataframe Pileup VCF from cell-snp-lite 
        
           #' @param vcf_phased dataframe Phased VCF from eagle2 
        
           #' @param AD dgTMatrix Alt allele depth matrix from pileup 
        
           #' @param DP dgTMatrix Total allele depth matrix from pileup 
        
           #' @param barcodes vector List of barcodes from pileup 
        
           #' @param gtf dataframe Transcript GTF 
        
           #' @param gmap dataframe Genetic map 
        
           #' @return dataframe Allele counts by cell 
        
           #' @keywords internal 
        
           preprocess_allele = function( 
        
               sample, 
        
               vcf_pu, 
        
               vcf_phased, 
        
               AD, 
        
               DP, 
        
               barcodes, 
        
               gtf, 
        
               gmap 
        
           ) { 
        
               # pileup VCF 
        
               vcf_pu = vcf_pu %>% 
        
                   mutate(INFO = str_remove_all(INFO, '[:alpha:]|=')) %>% 
        
                   tidyr::separate(col = 'INFO', into = c('AD', 'DP', 'OTH'), sep = ';') %>% 
        
                   mutate_at(c('AD', 'DP', 'OTH'), as.integer) %>% 
        
                   mutate(snp_id = paste(CHROM, POS, REF, ALT, sep = '_')) 
        
               # pileup count matrices 
        
               DP = as.data.frame(Matrix::summary(DP)) %>% 
        
                   mutate( 
        
                       cell = barcodes[j], 
        
                       snp_id = vcf_pu$snp_id[i] 
        
                   ) %>% 
        
                   select(-i, -j) %>% 
        
                   rename(DP = x) %>% 
        
                   select(cell, snp_id, DP) 
        
               AD = as.data.frame(Matrix::summary(AD)) %>% 
        
                   mutate( 
        
                       cell = barcodes[j], 
        
                       snp_id = vcf_pu$snp_id[i] 
        
                   ) %>% 
        
                   select(-i, -j) %>% 
        
                   rename(AD = x) %>% 
        
                   select(cell, snp_id, AD) 
        
               df = DP %>% left_join(AD, by = c("cell", "snp_id")) %>% 
        
                   mutate(AD = ifelse(is.na(AD), 0, AD)) 
        
               df = df %>% left_join( 
        
                   vcf_pu %>% rename(AD_all = AD, DP_all = DP, OTH_all = OTH), 
        
                   by = 'snp_id') 
        
               df = df %>% mutate( 
        
                       AR = AD/DP, 
        
                       AR_all = AD_all/DP_all 
        
                   ) 
        
               df = df %>% dplyr::filter(DP_all > 1 & OTH_all == 0) 
        
               # vcf has duplicated records sometimes 
        
               df = df %>% distinct()  
        
               df = df %>% mutate( 
        
                   snp_index = as.integer(factor(snp_id, unique(snp_id))), 
        
                   cell_index = as.integer(factor(cell, sample(unique(cell)))) 
        
               ) 
        
               # phased VCF 
        
               vcf_phased = vcf_phased %>%  
        
                   mutate(INFO = str_remove_all(INFO, '[:alpha:]|=')) %>% 
        
                   tidyr::separate(col = 'INFO', into = c('AD', 'DP', 'OTH'), sep = ';') %>% 
        
                   mutate_at(c('AD', 'DP', 'OTH'), as.integer) %>% 
        
                   mutate(snp_id = paste(CHROM, POS, REF, ALT, sep = '_')) %>% 
        
                   mutate(GT = get(sample)) 
        
               # annotate SNP by gene 
        
               overlap_transcript = GenomicRanges::findOverlaps( 
        
                       vcf_phased %>% {GenomicRanges::GRanges( 
        
                           seqnames = .$CHROM, 
        
                           IRanges::IRanges(start = .$POS, 
        
                               end = .$POS) 
        
                       )}, 
        
                       gtf %>% {GenomicRanges::GRanges( 
        
                           seqnames = .$CHROM, 
        
                           IRanges::IRanges(start = .$gene_start, 
        
                               end = .$gene_end) 
        
                       )} 
        
                   ) %>% 
        
                   as.data.frame() %>% 
        
                   setNames(c('snp_index', 'gene_index')) %>% 
        
                   left_join( 
        
                       vcf_phased %>% mutate(snp_index = 1:n()) %>% 
        
                           select(snp_index, snp_id), 
        
                       by = c('snp_index') 
        
                   ) %>% 
        
                   left_join( 
        
                       gtf %>% mutate(gene_index = 1:n()), 
        
                       by = c('gene_index') 
        
                   ) %>% 
        
                   arrange(snp_index, gene) %>% 
        
                   distinct(snp_index, `.keep_all` = TRUE) 
        
               vcf_phased = vcf_phased %>% 
        
                   left_join( 
        
                       overlap_transcript %>% select(snp_id, gene, gene_start, gene_end), 
        
                       by = c('snp_id') 
        
                   ) 
        
               # annotate SNPs by genetic map 
        
               marker_map = GenomicRanges::findOverlaps( 
        
                       vcf_phased %>% {GenomicRanges::GRanges( 
        
                           seqnames = .$CHROM, 
        
                           IRanges::IRanges(start = .$POS, 
        
                               end = .$POS) 
        
                       )}, 
        
                       gmap %>% {GenomicRanges::GRanges( 
        
                           seqnames = .$CHROM, 
        
                           IRanges::IRanges(start = .$start, 
        
                               end = .$end) 
        
                       )} 
        
                   ) %>% 
        
                   as.data.frame() %>% 
        
                   setNames(c('marker_index', 'map_index')) %>% 
        
                   left_join( 
        
                       vcf_phased %>% mutate(marker_index = 1:n()) %>% 
        
                           select(marker_index, snp_id), 
        
                       by = c('marker_index') 
        
                   ) %>% 
        
                   left_join( 
        
                       gmap %>% mutate(map_index = 1:n()), 
        
                       by = c('map_index') 
        
                   ) %>% 
        
                   arrange(marker_index, -start) %>% 
        
                   distinct(marker_index, `.keep_all` = TRUE) %>% 
        
                   select(snp_id, cM) 
        
               vcf_phased = vcf_phased %>%  
        
                   left_join(marker_map, by = 'snp_id') 
        
               # add annotation to cell counts 
        
               df = df %>% left_join(vcf_phased %>% select(snp_id, gene, GT, cM), by = 'snp_id') 
        
               df = df %>% mutate(CHROM = factor(CHROM, unique(CHROM))) 
        
               df = df %>% select(cell, snp_id, CHROM, POS, cM, REF, ALT, AD, DP, GT, gene) 
        
               # only keep hets 
        
               df = df %>% filter(GT %in% c('1|0', '0|1')) 
        
               return(df) 
        
           }

teng-gao closed this as completed Apr 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Co-genotype with bulk WGS and scRNA-seq #106

Co-genotype with bulk WGS and scRNA-seq #106

DrFengchenzhao commented Feb 21, 2023

teng-gao commented Feb 27, 2023 •

edited

Loading

teng-gao commented Apr 2, 2023

cnk113 commented Jun 13, 2023

teng-gao commented Jun 14, 2023 •

edited

Loading

Co-genotype with bulk WGS and scRNA-seq #106

Co-genotype with bulk WGS and scRNA-seq #106

Comments

DrFengchenzhao commented Feb 21, 2023

teng-gao commented Feb 27, 2023 • edited Loading

teng-gao commented Apr 2, 2023

cnk113 commented Jun 13, 2023

teng-gao commented Jun 14, 2023 • edited Loading

teng-gao commented Feb 27, 2023 •

edited

Loading

teng-gao commented Jun 14, 2023 •

edited

Loading