Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Co-genotype with bulk WGS and scRNA-seq #106

Closed
DrFengchenzhao opened this issue Feb 21, 2023 · 4 comments
Closed

Co-genotype with bulk WGS and scRNA-seq #106

DrFengchenzhao opened this issue Feb 21, 2023 · 4 comments

Comments

@DrFengchenzhao
Copy link

Hi Teng Gao,

I have implemented your methods on my scRNA-seq data which showed great advantages over others. I just noticed that you mentioned
'Therefore, a priori genotyping by DNA or co-genotyping with scRNA-seq (via multi-sample mode of pileup-and-phase) would be especially useful.'
on suggestions of preparing spatial transcriptomics data.

We sequenced bulk WGS, scRNA-seq and ST on the same sample. I wondered how I can harmonize these omics, especially using SNPs genotyped by bulk WGS to further increase power to detect CNVs in scRNA-seq.

Also, does 'via multi-sample mode of pileup-and-phase' mean that I should run pileup-and-phase.R with scRNA-seq and ST together?

Thank you for your patience. Looking forward to your reply.

Dr. Chenzhao Feng

@teng-gao
Copy link
Collaborator

teng-gao commented Feb 27, 2023

Hi @DrFengchenzhao ,

Thanks for the issue.

We sequenced bulk WGS, scRNA-seq and ST on the same sample. I wondered how I can harmonize these omics, especially using SNPs genotyped by bulk WGS to further increase power to detect CNVs in scRNA-seq.

Yes, this is possible. You can first run allele pileup using cellsnp-lite with the WGS heterozygous SNP VCF. Then you run eagle2 phasing on the WGS VCF, and merge the phased GT fields with the allele counts to produce an allele dataframe in the format of Numbat input. Please refer to the underlying code in pileup_and_phase.R script to see how to modify the procedure to do this.

Also, does 'via multi-sample mode of pileup-and-phase' mean that I should run pileup-and-phase.R with scRNA-seq and ST together?

Yes, you can do that. This will help increase allele coverage for co-genotyping.

@teng-gao
Copy link
Collaborator

teng-gao commented Apr 2, 2023

Instructions are now added in the getting started vignette.

@teng-gao teng-gao closed this as completed Apr 2, 2023
@cnk113
Copy link

cnk113 commented Jun 13, 2023

Hey Teng,

I'm a bit lost after running cellsnp on the het regions, how do proceed after that?
Would I attempt to the df_allele matrix from the cellsnp output?

Thanks,
Chang

@teng-gao
Copy link
Collaborator

teng-gao commented Jun 14, 2023

Hi @cnk113 ,

You can probably directly use the preprocess_allele function to get df_allele from the cell-snp counts and WGS phased VCF:

numbat/R/genotyping.R

Lines 141 to 293 in dc5c3fe

#' Preprocess allele data
#'
#' @param sample character Sample label
#' @param vcf_pu dataframe Pileup VCF from cell-snp-lite
#' @param vcf_phased dataframe Phased VCF from eagle2
#' @param AD dgTMatrix Alt allele depth matrix from pileup
#' @param DP dgTMatrix Total allele depth matrix from pileup
#' @param barcodes vector List of barcodes from pileup
#' @param gtf dataframe Transcript GTF
#' @param gmap dataframe Genetic map
#' @return dataframe Allele counts by cell
#' @keywords internal
preprocess_allele = function(
sample,
vcf_pu,
vcf_phased,
AD,
DP,
barcodes,
gtf,
gmap
) {
# pileup VCF
vcf_pu = vcf_pu %>%
mutate(INFO = str_remove_all(INFO, '[:alpha:]|=')) %>%
tidyr::separate(col = 'INFO', into = c('AD', 'DP', 'OTH'), sep = ';') %>%
mutate_at(c('AD', 'DP', 'OTH'), as.integer) %>%
mutate(snp_id = paste(CHROM, POS, REF, ALT, sep = '_'))
# pileup count matrices
DP = as.data.frame(Matrix::summary(DP)) %>%
mutate(
cell = barcodes[j],
snp_id = vcf_pu$snp_id[i]
) %>%
select(-i, -j) %>%
rename(DP = x) %>%
select(cell, snp_id, DP)
AD = as.data.frame(Matrix::summary(AD)) %>%
mutate(
cell = barcodes[j],
snp_id = vcf_pu$snp_id[i]
) %>%
select(-i, -j) %>%
rename(AD = x) %>%
select(cell, snp_id, AD)
df = DP %>% left_join(AD, by = c("cell", "snp_id")) %>%
mutate(AD = ifelse(is.na(AD), 0, AD))
df = df %>% left_join(
vcf_pu %>% rename(AD_all = AD, DP_all = DP, OTH_all = OTH),
by = 'snp_id')
df = df %>% mutate(
AR = AD/DP,
AR_all = AD_all/DP_all
)
df = df %>% dplyr::filter(DP_all > 1 & OTH_all == 0)
# vcf has duplicated records sometimes
df = df %>% distinct()
df = df %>% mutate(
snp_index = as.integer(factor(snp_id, unique(snp_id))),
cell_index = as.integer(factor(cell, sample(unique(cell))))
)
# phased VCF
vcf_phased = vcf_phased %>%
mutate(INFO = str_remove_all(INFO, '[:alpha:]|=')) %>%
tidyr::separate(col = 'INFO', into = c('AD', 'DP', 'OTH'), sep = ';') %>%
mutate_at(c('AD', 'DP', 'OTH'), as.integer) %>%
mutate(snp_id = paste(CHROM, POS, REF, ALT, sep = '_')) %>%
mutate(GT = get(sample))
# annotate SNP by gene
overlap_transcript = GenomicRanges::findOverlaps(
vcf_phased %>% {GenomicRanges::GRanges(
seqnames = .$CHROM,
IRanges::IRanges(start = .$POS,
end = .$POS)
)},
gtf %>% {GenomicRanges::GRanges(
seqnames = .$CHROM,
IRanges::IRanges(start = .$gene_start,
end = .$gene_end)
)}
) %>%
as.data.frame() %>%
setNames(c('snp_index', 'gene_index')) %>%
left_join(
vcf_phased %>% mutate(snp_index = 1:n()) %>%
select(snp_index, snp_id),
by = c('snp_index')
) %>%
left_join(
gtf %>% mutate(gene_index = 1:n()),
by = c('gene_index')
) %>%
arrange(snp_index, gene) %>%
distinct(snp_index, `.keep_all` = TRUE)
vcf_phased = vcf_phased %>%
left_join(
overlap_transcript %>% select(snp_id, gene, gene_start, gene_end),
by = c('snp_id')
)
# annotate SNPs by genetic map
marker_map = GenomicRanges::findOverlaps(
vcf_phased %>% {GenomicRanges::GRanges(
seqnames = .$CHROM,
IRanges::IRanges(start = .$POS,
end = .$POS)
)},
gmap %>% {GenomicRanges::GRanges(
seqnames = .$CHROM,
IRanges::IRanges(start = .$start,
end = .$end)
)}
) %>%
as.data.frame() %>%
setNames(c('marker_index', 'map_index')) %>%
left_join(
vcf_phased %>% mutate(marker_index = 1:n()) %>%
select(marker_index, snp_id),
by = c('marker_index')
) %>%
left_join(
gmap %>% mutate(map_index = 1:n()),
by = c('map_index')
) %>%
arrange(marker_index, -start) %>%
distinct(marker_index, `.keep_all` = TRUE) %>%
select(snp_id, cM)
vcf_phased = vcf_phased %>%
left_join(marker_map, by = 'snp_id')
# add annotation to cell counts
df = df %>% left_join(vcf_phased %>% select(snp_id, gene, GT, cM), by = 'snp_id')
df = df %>% mutate(CHROM = factor(CHROM, unique(CHROM)))
df = df %>% select(cell, snp_id, CHROM, POS, cM, REF, ALT, AD, DP, GT, gene)
# only keep hets
df = df %>% filter(GT %in% c('1|0', '0|1'))
return(df)
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants