Downloading disambiguate reference files and alternative solutions #34

skchronicles · 2024-01-12T17:32:38Z

About
Currently, the cache subcommand of the pipeline does not download disambiguate's reference files, i.e. the bwa indices for each of the supported reference genomes. As such, these reference files must exist on the host's filesystem prior to execution. They already exist on BigSky and Biowulf; however, if the pipeline were to be set up on another cluster, they would need to be downloaded outside of the cache subcommand.

Here is an example command to download disambiguate's reference files from helix/biowulf:

rsync -rav -e ssh helix.nih.gov:/data/OpenOmics/references/genomes .
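After the transfer, it may be worth confirming that each index is complete before starting a run. Here is a minimal sketch, assuming the indices were built with `bwa index`, which writes five files next to the index prefix (.amb, .ann, .bwt, .pac, .sa). The function name and the example path are hypothetical; the actual layout under the genomes directory may differ.

```shell
#!/usr/bin/env bash
# Sketch: print every BWA index file that is missing (or empty) for a prefix.
# `bwa index` produces PREFIX.amb, PREFIX.ann, PREFIX.bwt, PREFIX.pac, PREFIX.sa.
# The function name is hypothetical and not part of the pipeline.
missing_bwa_index_files() {
    local prefix="$1" ext
    for ext in amb ann bwt pac sa; do
        # -s: file exists and is non-empty
        [ -s "${prefix}.${ext}" ] || echo "${prefix}.${ext}"
    done
}

# Example call (hypothetical path, adjust to the real reference tree):
#   missing_bwa_index_files genomes/hg38/hg38.fa
```

If the function prints nothing, all five index files are present for that prefix.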

Road map
Here are some proposed long-term solutions:

Move the reference files into our data-share directory for easy downloads, and update the cache subcommand to pull from this location.
Build the alignment indices on the fly in the output directory and remove them in a post-processing hook. This should not be a rate-limiting step of the pipeline: indexing can start during the bcl2fastq conversion and should finish well before trimming completes. The only downside is a slight increase in disk usage while the pipeline is running; since the pipeline would clean up these files after the run completes, it's not really a big deal.
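The second option could look roughly like the sketch below. It is not how the pipeline is implemented; the function names, the `BWA_BIN` variable, and the `disambiguate_index` directory name are all hypothetical. The only real command it relies on is `bwa index -p PREFIX ref.fa`.

```shell
#!/usr/bin/env bash
# Sketch: build a BWA index in a scratch directory under the pipeline's output
# directory, then remove it in a post-processing step.
# BWA_BIN and the "disambiguate_index" directory name are hypothetical.
BWA_BIN="${BWA_BIN:-bwa}"

build_index() {
    # $1 = reference FASTA, $2 = pipeline output directory
    local fasta="$1" outdir="$2"
    local scratch="${outdir}/disambiguate_index"
    mkdir -p "$scratch" || return 1
    local prefix
    prefix="${scratch}/$(basename "$fasta")"
    # `bwa index -p PREFIX ref.fa` writes PREFIX.{amb,ann,bwt,pac,sa};
    # its log output is sent to stderr so stdout carries only the prefix.
    "$BWA_BIN" index -p "$prefix" "$fasta" >&2 || return 1
    echo "$prefix"
}

cleanup_index() {
    # $1 = pipeline output directory; removes only the scratch index directory
    rm -rf -- "${1:?output directory required}/disambiguate_index"
}

# Example (hypothetical paths):
#   prefix=$(build_index /path/to/ref.fa /path/to/output)
#   ... run alignment against "$prefix" ...
#   cleanup_index /path/to/output
```

Keeping the index in a dedicated subdirectory means the cleanup hook can delete one directory without touching any other pipeline output.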
