long vectors not supported yet #32

Closed
josegarciamanteiga opened this issue May 13, 2022 · 5 comments

@josegarciamanteiga

josegarciamanteiga commented May 13, 2022

Hi,
Thanks for the package. It gives spectacular results with single-cell RNA-seq in tumors. I'd like to publish the identification of CAFs as normal cells in my tumors using it. It works smoothly on single 10x datasets, but when I tried to mix three samples I got this error:

..../....
Retesting CNVs..
Retesting CNVs..
Retesting CNVs..
Retesting CNVs..
Retesting CNVs..
Finishing..
Finishing..
Finishing..
Finishing..
Finishing..
Error in vec_slice(x_out, x_slicer) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:535

Error: Tibble columns must have compatible sizes.
• Size 70731: Column 3.
• Size 84967: Column 1.
• Size 104417: Column 2.
• Size 112067: Column 0.
ℹ Only values of size one are recycled.
Backtrace:
  1. ├─numbat::run_numbat(...)
  2. │ └─%>%(...)
  3. ├─numbat::run_group_hmms(...)
  4. │ └─%>%(...)
  5. ├─dplyr::ungroup(.)
  6. ├─dplyr::mutate(., seg_start_index = min(snp_index), seg_end_index = max(snp_index))
  7. ├─dplyr::group_by(., seg, sample)
  8. └─dplyr::bind_rows(.)
  9. ├─tibble::as_tibble(dots)
 10. └─tibble:::as_tibble.list(dots)
 11. └─tibble:::lst_to_tibble(x, .rows, .name_repair, col_lengths(x))
 12. └─tibble:::recycle_columns(x, .rows, lengths)

Warning message:
In mclapply(bulks %>% split(.$sample), mc.cores = ncores, function(bulk) { :
scheduled core 2 encountered error in user code, all values of the job will be affected
Execution halted
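
The mclapply warning in the log above suggests that one parallel job failed first, so the later tibble error may be only a downstream symptom of binding a failed result together with valid ones. As a minimal sketch of how such failures can be caught early: `parallel::mclapply` returns objects of class "try-error" for failed jobs, and these can be separated out before row-binding. The helper name `split_worker_results` and the toy data are hypothetical and not part of numbat.

```r
# Hypothetical helper (not part of numbat): separate failed worker results
# from successful ones before row-binding them.
split_worker_results <- function(results) {
  failed <- vapply(results, inherits, logical(1), what = "try-error")
  list(
    ok = results[!failed],
    errors = vapply(
      results[failed],
      function(e) conditionMessage(attr(e, "condition")),
      character(1)
    )
  )
}

# Simulated worker output: the second element mimics what mclapply() returns
# for a job that raised an error.
results <- list(
  a = data.frame(snp_index = 1:3),
  b = try(stop("long vectors not supported yet"), silent = TRUE),
  c = data.frame(snp_index = 1:2)
)

res <- split_worker_results(results)

# Report which jobs failed, then bind only the successful data frames.
if (length(res$errors) > 0) {
  message("Failed jobs: ", paste(names(res$errors), collapse = ", "))
}
combined <- do.call(rbind, res$ok)
```

Checking for failed jobs before binding makes the original error message visible instead of a size mismatch raised later in the pipeline.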

I ran pileup_and_phase.R without problems on a BAM merged from the cellranger BAMs. In the merged BAM I replaced the "-1" suffix at the end of the barcodes to avoid collisions, after using cellranger aggr to generate the barcodes.
The error is thrown by run_numbat, which I ran with 64 GB of memory and 12 cores.
Thanks for the help
Jose

@evanbiederstedt
Contributor

Hi @josegarciamanteiga

The error is actually from R itself: https://github.com/wch/r-source/blob/trunk/src/include/Rinlinedfuns.h

This used to be a more common error in R before version 3.0.0, I believe, which introduced support for long vectors.
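
For reference, a "long vector" in R is a vector with more than 2^31 - 1 elements, the largest value a 32-bit integer index can hold. The sketch below only illustrates that limit; the helper `needs_long_vector` is hypothetical, and the dimensions are made-up examples rather than values from this issue.

```r
# The largest length addressable with a 32-bit integer index.
max_len <- .Machine$integer.max   # 2^31 - 1

# Hypothetical helper: would an object with this many elements be a long vector?
needs_long_vector <- function(n_elements) {
  n_elements > .Machine$integer.max
}

# Element counts are computed in double precision here, so the products
# themselves do not overflow.
small <- needs_long_vector(50000 * 40000)   # 2.0e9 elements
large <- needs_long_vector(50000 * 50000)   # 2.5e9 elements
print(c(small = small, large = large))
```

Functions that do not handle long vectors raise "long vectors not supported yet" when they receive an object above this limit.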

There's possibly something we could do to fix this. We'll investigate.

For context:
https://stackoverflow.com/questions/24335692/large-matrices-in-r-long-vectors-not-supported-yet
https://support.bioconductor.org/p/118016/

Best, Evan

@teng-gao
Collaborator

Hi @josegarciamanteiga,

Thanks for the issue! Are the three samples from the same individual (so that they have the same germline SNP profile)? If so, there's no need to merge the BAMs manually; you can supply multiple BAMs and barcode files to pileup_and_phase.R and it will produce a consensus VCF for the individual and the allele data frames for each sample. More details here:
https://kharchenkolab.github.io/numbat/articles/numbat.html#preparing-data

Best,
Teng

@josegarciamanteiga
Author

Dear Teng,
Thanks for the reply! Two out of three are indeed from the same individual. I have now run pileup_and_phase.R on them as you advised, and it produced the data without errors. But now, with run_numbat.R, how should I supply the two gene x UMI matrices and the allele data tables? My goal is to have the posteriors and all the numbat output take both samples into account, so that I can load it onto a Seurat/Pagoda scRNA-seq object that contains an integration of both datasets.

As for the 'long vectors' error, it is strange, since it is running with R 4.0.3. Here is the sessionInfo() for further details:

library(numbat)
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /home/garciamanteiga.jose/.conda/envs/numbat/lib/libblas.so.3.8.0
LAPACK: /home/garciamanteiga.jose/.conda/envs/numbat/lib/liblapack.so.3.8.0

locale:
[1] LC_CTYPE=en_US.utf-8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf-8 LC_COLLATE=en_US.utf-8
[5] LC_MONETARY=en_US.utf-8 LC_MESSAGES=en_US.utf-8
[7] LC_PAPER=en_US.utf-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] numbat_0.1.0

loaded via a namespace (and not attached):
[1] treeio_1.14.4 tidyselect_1.1.1 purrr_0.3.4
[4] graphlayouts_0.8.0 lattice_0.20-45 ggfun_0.0.5
[7] colorspace_2.0-3 vctrs_0.3.8 generics_0.1.2
[10] viridisLite_0.4.0 utf8_1.2.2 gridGraphics_0.5-1
[13] rlang_0.4.12 pillar_1.7.0 glue_1.6.2
[16] DBI_1.1.2 tweenr_1.0.2 rvcheck_0.1.8
[19] lifecycle_1.0.1 stringr_1.4.0 munsell_0.5.0
[22] gtable_0.3.0 parallel_4.0.3 fansi_1.0.2
[25] tidygraph_1.2.0 Rcpp_1.0.7 scales_1.1.1
[28] BiocManager_1.30.16 jsonlite_1.8.0 farver_2.1.0
[31] gridExtra_2.3 ggforce_0.3.3 ggplot2_3.3.2
[34] digest_0.6.29 aplot_0.1.2 stringi_1.7.6
[37] dplyr_1.0.7 ggrepel_0.9.1 polyclip_1.10-0
[40] grid_4.0.3 ggtree_2.4.2 tools_4.0.3
[43] yulab.utils_0.0.4 logger_0.2.2 magrittr_2.0.2
[46] lazyeval_0.2.2 patchwork_1.1.1 tibble_3.1.6
[49] ggraph_2.0.5 crayon_1.5.0 ape_5.6-2
[52] tidyr_1.1.2 pkgconfig_2.0.3 tidytree_0.3.9
[55] MASS_7.3-55 ellipsis_0.3.2 data.table_1.14.2
[58] ggplotify_0.1.0 extraDistr_1.9.1 assertthat_0.2.1
[61] viridis_0.6.2 R6_2.5.1 igraph_1.2.11
[64] nlme_3.1-155 compiler_4.0.3

@teng-gao
Collaborator

Hi @josegarciamanteiga,

The error occurred because the allele data contained genotypes from more than one individual. Only data from the same individual should be provided to pileup_and_phase.R and run_numbat. If you have two samples from the same individual, you can concatenate the gene count matrices (e.g. cbind) and the allele dataframes (e.g. rbind) and use them as input to run_numbat. If the third sample belongs to a separate individual, I would run it separately. If you want to plot the single-cell posteriors in an integrated expression embedding from different samples/individuals, you can combine the posterior dataframes (e.g. nb$joint_post, nb$clone_post) after reading in the results for each individual separately. For more info on the output, please see this tutorial.
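
A minimal sketch of the concatenation described above, using toy data. All object names (`count_mat_1`, `df_allele_1`, `post_a`, and so on), the toy columns, and the barcode suffixes are assumptions for illustration; real inputs come from your own count matrices, from pileup_and_phase.R, and from the numbat results. The sketch assumes the matrices have gene names as row names and cell barcodes as column names, and keeping only shared genes is just one simple way to align them.

```r
# Toy gene x cell count matrices for two samples from the SAME individual.
count_mat_1 <- matrix(1:6, nrow = 3,
                      dimnames = list(c("GeneA", "GeneB", "GeneC"),
                                      c("AAAC-1", "AAAG-1")))
count_mat_2 <- matrix(7:12, nrow = 3,
                      dimnames = list(c("GeneB", "GeneC", "GeneD"),
                                      c("AAAC-2", "AAAT-2")))

# cbind() aligns rows by position, not by gene name, so restrict both
# matrices to the shared genes, in the same order, before binding.
genes <- intersect(rownames(count_mat_1), rownames(count_mat_2))
count_mat <- cbind(count_mat_1[genes, , drop = FALSE],
                   count_mat_2[genes, , drop = FALSE])

# Barcodes must stay unique across samples.
stopifnot(!anyDuplicated(colnames(count_mat)))

# Toy allele data frames (illustrative columns only), stacked by row.
df_allele_1 <- data.frame(cell = c("AAAC-1", "AAAG-1"), snp_id = c("s1", "s2"))
df_allele_2 <- data.frame(cell = c("AAAC-2", "AAAT-2"), snp_id = c("s1", "s3"))
df_allele <- rbind(df_allele_1, df_allele_2)

# Posteriors from separate runs (one run per individual) can be stacked the
# same way; a label column records which run each cell came from.
# These are toy stand-ins for posterior data frames such as nb$clone_post.
post_a <- data.frame(cell = c("AAAC-1", "AAAC-2"), p_cnv = c(0.9, 0.1))
post_b <- data.frame(cell = "TTTG-3", p_cnv = 0.5)
post_a$individual <- "A"
post_b$individual <- "B"
post_all <- rbind(post_a, post_b)
```

The combined `count_mat` and `df_allele` would then be passed to run_numbat for that individual, while `post_all` can be joined to an integrated embedding by cell barcode.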

Thanks,
Teng

@josegarciamanteiga
Author

josegarciamanteiga commented May 17, 2022 via email

teng-gao pushed a commit that referenced this issue May 19, 2022

3 participants