-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Run HaplotypeCallerSpark on WGS in strict mode #5721
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5721 +/- ##
===============================================
- Coverage 86.985% 86.984% -0.001%
- Complexity 31863 31865 +2
===============================================
Files 1943 1943
Lines 146768 146775 +7
Branches 16223 16225 +2
===============================================
+ Hits 127666 127671 +5
Misses 13189 13189
- Partials 5913 5915 +2
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, my primary comment is that you should probably leave a warning about the inconsistent downsampling between the two places in the code
// 5. Convert shards to assembly regions. | ||
JavaRDD<AssemblyRegion> assemblyRegions = assemblyRegionShardedReads.map((Function<Shard<GATKRead>, AssemblyRegion>) shard -> toAssemblyRegion(shard, header)); | ||
// 5. Convert shards to assembly regions. Reads downsampling is done again here, and is assumed to be consistent | ||
// with the downsampling done in step 1, since it is deterministic by locus. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it sounds like you want to get this branch in first, i would instead make this a comment referencing the other branch and noting the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -159,12 +159,20 @@ | |||
// at which points the reads can be filled in. (See next step.) | |||
JavaRDD<ReadlessAssemblyRegion> readlessAssemblyRegions = contigToGroupedStates | |||
.flatMap(getReadlessAssemblyRegionsFunction(header, assemblyRegionArgs)); | |||
// repartition to distribute the data evenly across the cluster again | |||
readlessAssemblyRegions = readlessAssemblyRegions.repartition(readlessAssemblyRegions.getNumPartitions()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this repartitioning it looks like the reads may have been partitioned by contig? Flatmap I assume doesn't repartition automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right
…ing assembly regions
9034302
to
955d034
Compare
These are the changes needed to run on a whole genome in strict mode. We get out of memory errors without these changes.
Reads downsampling was missing for the part where
AssemblyRegion
s are filled with reads - this PR adds it in. Downsampling is not deterministic yet, since that depends on #5437, but that's an orthogonal issue so it's OK to merge this change and add #5437 later.