Run HaplotypeCallerSpark on WGS in strict mode #5721

tomwhite · 2019-02-26T09:59:42Z

These are the changes needed to run HaplotypeCallerSpark on a whole genome in strict mode. Without them we get out-of-memory errors.

Read downsampling was missing from the step where AssemblyRegions are filled with reads; this PR adds it. Downsampling is not deterministic yet, since that depends on #5437, but that is an orthogonal issue, so it is OK to merge this change and add #5437 later.

codecov-io · 2019-02-26T10:38:34Z

Codecov Report

Merging #5721 into master will decrease coverage by 0.001%.
The diff coverage is 91.667%.

@@               Coverage Diff               @@
##              master     #5721       +/-   ##
===============================================
- Coverage     86.985%   86.984%   -0.001%     
- Complexity     31863     31865        +2     
===============================================
  Files           1943      1943               
  Lines         146768    146775        +7     
  Branches       16223     16225        +2     
===============================================
+ Hits          127666    127671        +5     
  Misses         13189     13189               
- Partials        5913      5915        +2

| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...ils/activityprofile/ActivityProfileStateRange.java | `94.286% <100%> (+0.168%)` | `7 <0> (+1)` | ⬆️ |
| ...lbender/engine/spark/FindAssemblyRegionsSpark.java | `82.09% <87.5%> (+0.122%)` | `20 <3> (+1)` | ⬆️ |
| ...nder/utils/runtime/StreamingProcessController.java | `67.299% <0%> (-0.474%)` | `33% <0%> (ø)` | |

jamesemery

Looks good. My primary comment is that you should probably leave a warning about the inconsistent downsampling between the two places in the code.

jamesemery · 2019-02-26T17:40:21Z

src/main/java/org/broadinstitute/hellbender/engine/spark/FindAssemblyRegionsSpark.java

-        // 5. Convert shards to assembly regions.
-        JavaRDD<AssemblyRegion> assemblyRegions = assemblyRegionShardedReads.map((Function<Shard<GATKRead>, AssemblyRegion>) shard -> toAssemblyRegion(shard, header));
+        // 5. Convert shards to assembly regions. Reads downsampling is done again here, and is assumed to be consistent
+        // with the downsampling done in step 1, since it is deterministic by locus.


Since it sounds like you want to get this branch in first, I would instead make this a comment referencing the other branch and noting the issue.
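For context on what "deterministic by locus" would mean for the two downsampling passes, here is a minimal sketch in plain Java. It is not GATK's downsampler: the class name, the method, the use of read names as stand-ins for `GATKRead`, and the seed formula are all assumptions made for illustration. The idea it shows is that if the random choice is seeded only from the locus, both passes keep the same reads regardless of the order in which the reads arrive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not GATK code: caps the reads kept at one locus and seeds
// the selection from the locus itself so that repeated passes agree.
class LocusDownsamplingSketch {

    // Keeps at most maxReads of the reads that start at (contig, start).
    static List<String> downsample(String contig, int start, List<String> readNames, int maxReads) {
        if (readNames.size() <= maxReads) {
            return new ArrayList<>(readNames);
        }
        // Sort first so the result does not depend on the order reads arrive in.
        List<String> candidates = new ArrayList<>(readNames);
        Collections.sort(candidates);
        // Assumed seed formula: it depends only on the locus, so every pass over
        // this locus shuffles the sorted candidates identically.
        long seed = 31L * contig.hashCode() + start;
        Collections.shuffle(candidates, new Random(seed));
        List<String> kept = new ArrayList<>(candidates.subList(0, maxReads));
        Collections.sort(kept);
        return kept;
    }

    public static void main(String[] args) {
        // The same six reads, seen in a different order by each pass.
        List<String> firstPass = Arrays.asList("r1", "r2", "r3", "r4", "r5", "r6");
        List<String> secondPass = Arrays.asList("r6", "r5", "r4", "r3", "r2", "r1");
        System.out.println(downsample("chr1", 1000, firstPass, 3));
        System.out.println(downsample("chr1", 1000, secondPass, 3));
    }
}
```

Both calls in `main` print the same three read names. Until the downsampling in this code path has that property, the comment in step 5 should not claim that the two passes are consistent.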

jamesemery · 2019-02-26T17:45:46Z

src/main/java/org/broadinstitute/hellbender/engine/spark/FindAssemblyRegionsSpark.java

@@ -159,12 +159,20 @@
        // at which points the reads can be filled in. (See next step.)
        JavaRDD<ReadlessAssemblyRegion> readlessAssemblyRegions = contigToGroupedStates
                .flatMap(getReadlessAssemblyRegionsFunction(header, assemblyRegionArgs));
+        // repartition to distribute the data evenly across the cluster again
+        readlessAssemblyRegions = readlessAssemblyRegions.repartition(readlessAssemblyRegions.getNumPartitions());


Before this repartitioning, it looks like the reads may have been partitioned by contig? I assume flatMap doesn't repartition automatically.

That's right
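The point in this exchange can be illustrated with a small simulation in plain Java, with no Spark dependency. Everything here is hypothetical: each inner list stands in for one partition, `flatMapRegions` mimics a per-partition flatMap, and `repartition` uses a simple round-robin that only approximates the effect of Spark's shuffle. It shows that output elements of a flatMap stay in the partition of their input, so a partition holding a large contig stays large until an explicit repartition spreads the regions out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation, not Spark code: each inner list is one partition.
class RepartitionSketch {

    // Mimics a per-partition flatMap: each input number is how many regions a contig
    // group expands to, and the regions stay in the partition of their input.
    static List<List<String>> flatMapRegions(List<List<Integer>> regionCounts) {
        List<List<String>> out = new ArrayList<>();
        for (int p = 0; p < regionCounts.size(); p++) {
            List<String> regions = new ArrayList<>();
            for (int count : regionCounts.get(p)) {
                for (int i = 0; i < count; i++) {
                    regions.add("partition" + p + "-region" + regions.size());
                }
            }
            out.add(regions);
        }
        return out;
    }

    // Redistributes all elements round-robin over the same number of partitions.
    static <T> List<List<T>> repartition(List<List<T>> partitions) {
        int n = partitions.size();
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(new ArrayList<>());
        }
        int next = 0;
        for (List<T> partition : partitions) {
            for (T item : partition) {
                out.get(next % n).add(item);
                next++;
            }
        }
        return out;
    }

    // Number of elements in each partition.
    static <T> List<Integer> sizes(List<List<T>> partitions) {
        List<Integer> result = new ArrayList<>();
        for (List<T> partition : partitions) {
            result.add(partition.size());
        }
        return result;
    }

    public static void main(String[] args) {
        // Four partitions, one contig group each, expanding to 9, 2, 1 and 0 regions.
        List<List<Integer>> input = Arrays.asList(
                Arrays.asList(9), Arrays.asList(2), Arrays.asList(1), Arrays.asList(0));
        List<List<String>> skewed = flatMapRegions(input);
        System.out.println("after flatMap:     " + sizes(skewed));
        System.out.println("after repartition: " + sizes(repartition(skewed)));
    }
}
```

In the PR itself the partition count is kept unchanged by passing `readlessAssemblyRegions.getNumPartitions()` to `repartition`, which is why the sketch also redistributes into the same number of partitions.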


tomwhite added the HaplotypeCallerSpark label Feb 26, 2019

tomwhite self-assigned this Feb 26, 2019

tomwhite requested a review from jamesemery February 26, 2019 09:59

tomwhite mentioned this pull request Feb 26, 2019

Perform downsampling in AssemblyRegionWalkerSpark's strict mode #5508

Closed

jamesemery approved these changes Mar 4, 2019


tomwhite added 4 commits March 8, 2019 12:13

Reduce ActivityProfileStateRange memory usage so it can run on genome-sized data. (5778385)

Repartition data so it is spread evenly across the cluster after finding assembly regions (5c969f5)

Perform downsampling in AssemblyRegionWalkerSpark's strict mode. Fixes #5476. (dfde7cc)

Update comment (955d034)

tomwhite force-pushed the tw_hcs_strict_genome branch from 9034302 to 955d034 Compare March 8, 2019 12:16

tomwhite merged commit 02dca71 into master Mar 8, 2019

tomwhite deleted the tw_hcs_strict_genome branch March 8, 2019 13:55
