Processing Datasets that Span Multiple Data Collection Runs #88

Merged (117 commits) Sep 30, 2019

Conversation

@barkasn barkasn commented Jul 22, 2019

This RFC proposes a design solution to allow datasets that span multiple data collection runs to be processed in unison as per the scientific requirements. The proposal generalizes to multiple data type modalities, but also provides specific details for 10X V2 3’ single-cell data as an example and to address an immediate need for a data processing solution.

This RFC has undergone a substantial rewrite based on comments.

Summary of Discussion for Approvers:
This RFC initially proposed the concept of DAPS as a design solution to allow datasets that span multiple data collection runs to be processed in unison. After an extensive rewrite, the concept of data groups is now proposed instead. A data group is a collection of bundles that is version-complete, that is, one that fulfills specified completeness criteria at a given point in time. To reuse existing infrastructure, the RFC proposes implementing a data group as a bundle. The completeness criteria that a data group must fulfill depend on its type. This RFC proposes the PROJECT_SUBMISSION data group type, which signals that the submission of a dataset is complete and downstream processing can commence. Other data group types can be added in the future as project needs dictate.
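
For concreteness, here is a minimal sketch of what a PROJECT_SUBMISSION data group could look like when implemented as a bundle. The field names are illustrative assumptions, not the RFC's normative schema:

```python
# Hypothetical PROJECT_SUBMISSION data group, represented as a bundle so that
# existing storage infrastructure can be reused. All field names are
# illustrative, not normative.
project_submission_data_group = {
    "data_group_type": "PROJECT_SUBMISSION",
    "project_uuid": "...",  # the project this group belongs to (elided)
    # Bundle FQIDs (uuid.version) that must all be present for the group
    # to satisfy its completeness criteria, i.e. be version-complete.
    "members": [
        "...",  # assay bundle FQIDs (elided)
    ],
}
```

Under this sketch, downstream services would treat the group as version-complete once every listed member is present in the data store.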

August 27th, 2019: Last call for community review
September 27th, 2019: Last call for oversight review

@kislyuk kislyuk left a comment

I think the term "DAPS" suffers from being abstract and detached from existing terminology. I understand the distinction that you're trying to draw between "project" and DAPS, but I'd prefer that this be called a "dataset" or "primary dataset", in line with existing terminology.

I don't have specific objections to the design described in this RFC (except for the confusion about the "submission envelope" term pointed out by @mckinsel, and most things described in the "Alternative Approaches" section, which I presume wouldn't make its way into the body of the RFC without further discussion). However, I think at this point this RFC is too abstract and complex to adequately inform the design and implementation of DCP features. It does not concretely explain the status quo (the data wrangling/ingestion/analysis SOP as it exists today) and how this RFC changes it. I think the in-depth, non-normative discussion should be moved somewhere else, and the RFC simplified to just the normative parts and a description of how they change the SOP.

@diekhans

I can see how DAPS is confusing. However, "dataset" has a lot of assumed meaning in bioinformatics and would likely create more confusion. Many would view what the DCP calls a "project" as a "dataset".

A DAPS is more akin to a message about a set of data than to a rigidly defined collection of the data itself.

Ideas:
"Processing subset"?

@claymfischer

Ideas:
"Processing subset"?

I think that is clearer to an average user. +1

@kbergin commented Jul 25, 2019

Are the requested reviewers automated? I think @jkaneria should definitely be on the list, but I can't add her for some reason.


@lauraclarke lauraclarke left a comment

There are several points that lack clarity in this document. Diagrams demonstrating the proposed lifecycle changes, and specific examples of real experiments showing how this proposal would improve how we handle them, would both be very useful additions to this RFC.


@diekhans diekhans left a comment

Didn't actually mean to enter review mode. Anyway, I've made many suggestions to clarify the text, based on comments showing where things are unclear.

@briandoconnor

@diekhans thanks very much for your comments. I don't feel, though, that the data browser will have sufficient information to show/not show a project based on what's proposed here. If the goal is to show only things that have made it through ingest, have had secondary pipelines run on them, and have been loaded in the matrix service then you need to define and agree on those scopes. As it stands right now, this only partially solves the problem from the browser perspective (but may very well solve problems from other components' perspectives). I'm curious to hear what @hannes-ucsc thinks as well since he's the tech lead for the browser.
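
For illustration, the kind of lifecycle milestones being asked for here could look something like the following sketch; these scope names are hypothetical, not something the RFC defines:

```python
from enum import Enum

class DataGroupType(Enum):
    # Hypothetical lifecycle milestones a browser/portal could key off of,
    # each potentially signalled by its own data group type.
    PROJECT_SUBMISSION = "primary data fully ingested"
    SECONDARY_ANALYSIS_COMPLETE = "secondary pipelines finished"
    MATRIX_READY = "matrices loaded into the matrix service"
```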

@diekhans

diekhans commented Sep 22, 2019 via email

…avoid confusion about the meaning of submission. Add step definition to be able to avoid introspection of data bundles
@diekhans

@briandoconnor @hannes-ucsc I updated the text to help remove the confusion about the term "submission". I also went into more detail about what scope is and made the life-cycle step implicit.

Hannes, please review.


@maniarathi maniarathi left a comment

Approval is for the overall design, though I think taking Ingest's and Secondary Analysis's comments into account will make for a much stronger approval than mine!

While this RFC proposes a general mechanism, it was developed in response to the need of analysis to co-process data that are spread across multiple assay bundles. This section uses the co-processing of multiple sequencing runs from a given library as an example of using data groups to identify data that should be processed together. A key concept behind this proposal is that upstream submission should not define how downstream analysis groups inputs for processing. The analysis pipelines need to be able to group inputs in ways that may not be defined at submission time. New analysis pipelines should be able to group inputs in arbitrary ways without requiring changes to the way data is packaged.

### Identifying data to co-process
Pipelines that require grouping data from multiple sequencing assays (co-processing) subscribe to *PROJECT_SCOPES* events instead of per-bundle events. Unlike the one-assay-bundle-per-run model, this makes the pipeline framework responsible for collecting assay bundles into pipeline runs. Pipelines that do not require co-processing continue to operate on a per-bundle basis; however, they still need to subscribe to the project *data group* so they can notify ingest when processing of the input data group is complete, for use by downstream consumers such as Azul.
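
As a sketch of what this could look like in the pipeline framework, with hypothetical callables standing in for metadata lookup and pipeline submission:

```python
from collections import defaultdict

def on_project_scopes_notification(data_group, lookup_library_preparation,
                                   submit_pipeline_run):
    # `data_group` is the parsed data group bundle from the notification.
    # `lookup_library_preparation` and `submit_pipeline_run` are hypothetical
    # stand-ins for DSS metadata lookup and the pipeline execution service.
    runs = defaultdict(list)
    for member_fqid in data_group["members"]:
        # Group member assay bundles by library preparation so that several
        # sequencing runs of one 10x v2 library land in a single pipeline run.
        runs[lookup_library_preparation(member_fqid)].append(member_fqid)
    for prep_id, member_fqids in runs.items():
        submit_pipeline_run(prep_id, member_fqids)
```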

@samanehsan samanehsan Sep 24, 2019

I think this would be more accurate if it said the pipeline service needs to subscribe to the project data group...

The difference is that if every pipeline subscribed to "project data group notifications", then whenever a project is submitted the same notification would be sent to each pipeline. That creates a lot of duplicated work inspecting the contents to determine whether a given pipeline should be run, so it would be more efficient to have a single subscription to data group notifications. However, if there is a way to query the project data group bundles without requesting all of the metadata files from the data store, then one subscription per pipeline would be fine.
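
A minimal sketch of this single-subscription approach, with all names hypothetical:

```python
def dispatch_data_group_notification(notification, fetch_data_group,
                                     pipelines, enqueue):
    # `pipelines` maps pipeline name -> predicate over data group contents;
    # `fetch_data_group` performs the single metadata fetch; `enqueue`
    # schedules a pipeline run. All of these names are hypothetical.
    data_group = fetch_data_group(notification)  # inspect contents once
    for name, applies in pipelines.items():
        if applies(data_group):
            enqueue(name, data_group)
```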

Thanks, @samanehsan; the text has been clarified.

@samanehsan samanehsan left a comment

Having primary and secondary project submission data groups would add a lot of value in indicating whether a project is “version-complete” and ready to be consumed by downstream services such as the data browser and the matrix service. My understanding is that the immediate need to identify 10x v2 sequencing data from the same library preparation will be addressed by ingest, so re-architecting all of secondary analysis to run the existing SmartSeq2 and Optimus pipelines based on project submission data group notifications is not currently necessary.

This allows us to take a more incremental approach to the implementation. Specifically, analysis could first consider how to determine that the analysis for a project submission data group is complete. Then, once we need to run a pipeline (secondary or tertiary) that requires input from multiple bundles, we can subscribe to project submission data group notifications and modify our infrastructure to handle that use case.
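
That first step could be as simple as the following sketch, assuming a hypothetical lookup from a primary bundle to its analysis bundle:

```python
def analysis_complete(data_group, find_analysis_bundle):
    # `find_analysis_bundle` maps a primary bundle FQID to its analysis
    # bundle, or None if analysis has not produced one yet (hypothetical).
    return all(
        find_analysis_bundle(member_fqid) is not None
        for member_fqid in data_group["members"]
    )
```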

@diekhans

diekhans commented Sep 24, 2019 via email

@briandoconnor briandoconnor left a comment

This RFC seems like it would be the place to map out how we use data groups to signal to downstream systems when collections of data are "ready". It doesn't go far enough, though, to explicitly solve my problems with the portal, e.g. how do we know when a project has all of its primary data uploaded and ready for display in the portal, when it has finished being analyzed through secondary pipelines, and when its data is uploaded and ready in the matrix service? If we all agreed to use data groups for these key milestones in the project's data lifecycle, then we would have actions to take on the portal indexing side. Until there's agreement there, I don't think there's anything for the portal or data store to do with regards to this RFC. Mark tells me the proposal I'm asking for here will be part of another RFC, so I'm approving this one since I'm not opposed to the ideas presented here.

@diekhans

Thanks @briandoconnor.

Actually, I indicated this RFC would be updated rather than a new RFC being written. We need to work with the Azul and expression matrix groups to flesh out the details and make any required modifications.

@samanehsan

@diekhans, maybe @justincc can clarify this for us:

I think the plan is to not change the bundling for 10x v2, but I am not 100% sure.

This RFC about representing sequencing library preparations says "Data consumers will not need to depend on multi-bundle notifications for processing all data files from the same library preparation. All data files from a single library preparation will be put in the same logical unit if submitted at one time", so I thought the plan was to update the bundling in ingest.

@diekhans

It looks like 10x v2 is still up in the air. Note Mallory's use of "logical unit" instead of "bundle".

Personally, I would like to avoid creating a more complex upstream bundling structure based on current limitations. However, there is also the project reality of getting datasets into the DCP, which might make waiting for data groups very disadvantageous in the medium term.

I think this is mostly about schedule and priorities. Norman started a working group to oversee the implementation of both RFCs, so we should let it lead to a decision.

@samanehsan

Yes, I see what you mean. If the Optimus pipeline is the only current use case, I think that can be solved more simply by updating the bundling logic in ingest rather than by rewriting analysis. And of course all of the bundles from a submission could still go into a data group.

@diekhans

I talked to @MDunitz today and she is happy that this RFC doesn't cause even more problems than it solves.

So we are declaring it done.

@barkasn barkasn merged commit 45b039d into HumanCellAtlas:master Sep 30, 2019
diekhans pushed a commit to diekhans/dcp-community that referenced this pull request Oct 31, 2019