Processing Datasets that Span Multiple Data Collection Runs #88

Merged (117 commits) Sep 30, 2019

Conversation

@barkasn barkasn commented Jul 22, 2019

This RFC proposes a design solution to allow datasets that span multiple data collection runs to be processed in unison as per the scientific requirements. The proposal generalizes to multiple data type modalities, but also provides specific details for 10X V2 3’ single-cell data as an example and to address an immediate need for a data processing solution.

This RFC has undergone a substantial rewrite based on comments.

Summary of Discussion for Approvers:
This RFC initially proposed the concept of DAPS as a design solution to allow datasets that span multiple data collection runs to be processed in unison. After an extensive rewrite, the concept of data groups is now proposed instead. A data group is a collection of bundles that is version-complete, that is, one that fulfills specified completeness criteria at a given point in time. To reuse existing infrastructure, the RFC proposes implementing a data group as a bundle. The completeness criteria that a data group must fulfill depend on its type. This RFC proposes the PROJECT_SUBMISSION data group type, which signals that the submission of a dataset is complete and downstream processing can commence. Other data group types can be added in the future as project needs dictate.
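
For concreteness, here is a minimal sketch of what a PROJECT_SUBMISSION data group could look like when implemented as a bundle. The field names are illustrative assumptions, not the RFC's normative schema:

```python
# Hypothetical PROJECT_SUBMISSION data group, represented as a bundle so that
# existing storage infrastructure can be reused. All field names are
# illustrative, not normative.
project_submission_data_group = {
    "data_group_type": "PROJECT_SUBMISSION",
    "project_uuid": "...",  # the project this group belongs to (elided)
    # Bundle FQIDs (uuid.version) that must all be present for the group
    # to satisfy its completeness criteria, i.e. be version-complete.
    "members": [
        "...",  # assay bundle FQIDs (elided)
    ],
}
```

Under this sketch, downstream services would treat the group as version-complete once every listed member is present in the data store.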

August 27th, 2019: Last call for community review
September 27th, 2019: Last call for oversight review

@kislyuk kislyuk left a comment

I think the term "DAPS" suffers from being abstract and detached from existing terminology. I understand the distinction that you're trying to draw between "project" and DAPS, but I'd prefer that this be called a "dataset" or "primary dataset", in line with existing terminology.

I don't have specific objections to the design described in this RFC (except for the confusion about the "submission envelope" term pointed out by @mckinsel, and most things described in the "Alternative Approaches" section, which I presume wouldn't make its way into the body of the RFC without further discussion). However, I think at this point this RFC is too abstract and complex to adequately inform the design and implementation of DCP features. It does not concretely explain the status quo (the data wrangling/ingestion/analysis SOP as it exists today) and how this RFC changes it. I think the in-depth, non-normative discussion should be moved somewhere else, and the RFC simplified to just the normative parts and a description of how they change the SOP.

@diekhans

I can see how DAPS is confusing. However, "dataset" has a lot of assumed meaning in bioinformatics and would likely create more confusion. Many would view what the DCP calls a "project" as a "dataset".

A DAPS is more akin to a message about a set of data than to a rigidly defined collection of the data itself.

Ideas:
"Processing subset"?

@claymfischer

Ideas:
"Processing subset"?

I think that is clearer to an average user. +1

@kbergin commented Jul 25, 2019

Are the requested reviewers automated? I think @jkaneria should definitely be on the list, but I can't add her for some reason.


@lauraclarke lauraclarke left a comment

There are several points that lack clarity in this document. Diagrams demonstrating the proposed lifecycle changes, and specific examples of real experiments showing how this proposal would improve how we handle them, would both be very useful additions to this RFC.


@diekhans diekhans left a comment

Didn't actually mean to enter review mode. Anyway, I've made many suggestions to clarify the text, based on comments showing where things are unclear.

@briandoconnor

@diekhans thanks very much for your comments. I don't feel, though, that the data browser will have sufficient information to show/not show a project based on what's proposed here. If the goal is to show only things that have made it through ingest, have had secondary pipelines run on them, and have been loaded in the matrix service then you need to define and agree on those scopes. As it stands right now, this only partially solves the problem from the browser perspective (but may very well solve problems from other components' perspectives). I'm curious to hear what @hannes-ucsc thinks as well since he's the tech lead for the browser.
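
For illustration, the kind of lifecycle milestones being asked for here could look something like the following sketch; these scope names are hypothetical, not something the RFC defines:

```python
from enum import Enum

class DataGroupType(Enum):
    # Hypothetical lifecycle milestones a browser/portal could key off of,
    # each potentially signalled by its own data group type.
    PROJECT_SUBMISSION = "primary data fully ingested"
    SECONDARY_ANALYSIS_COMPLETE = "secondary pipelines finished"
    MATRIX_READY = "matrices loaded into the matrix service"
```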

@diekhans

diekhans commented Sep 22, 2019 via email

…avoid confusion about the meaning of submission. Add step definition to be able to avoid introspection of data bundles
@diekhans

@briandoconnor @hannes-ucsc I updated the text to help remove the confusion about the term "submission". I also went into more detail about what scope is and made the life-cycle step implicit.

Hannes, please review.


@maniarathi maniarathi left a comment

Approval is for the overall design, though I think taking Ingest's and Secondary Analysis's comments into account will make for a much stronger approval than mine!

While this RFC proposes a general mechanism, it was developed in response to the need of analysis to co-process data that are spread across multiple assay bundles. This section uses the co-processing of multiple sequencing runs from a given library as an example of using data groups to identify data that should be processed together. A key concept behind this proposal is that upstream submission should not define how downstream analysis groups inputs for processing. The analysis pipelines need to be able to group inputs in ways that may not be defined at submission time. New analysis pipelines should be able to group inputs in arbitrary ways without requiring changes to the way data is packaged.

### Identifying data to co-process
Pipelines that require grouping data from multiple sequencing assays (co-processing) subscribe to *PROJECT_SCOPES* events instead of per-bundle events. Unlike the one-assay-bundle-per-run model, this makes the pipeline framework responsible for collecting assay bundles into pipeline runs. Pipelines that do not require co-processing continue to operate on a per-bundle basis; however, they still need to subscribe to the project *data group* so they can notify ingest when processing of the input data group is complete, for use by downstream consumers such as Azul.
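
As a sketch of what this could look like in the pipeline framework, with hypothetical callables standing in for metadata lookup and pipeline submission:

```python
from collections import defaultdict

def on_project_scopes_notification(data_group, lookup_library_preparation,
                                   submit_pipeline_run):
    # `data_group` is the parsed data group bundle from the notification.
    # `lookup_library_preparation` and `submit_pipeline_run` are hypothetical
    # stand-ins for DSS metadata lookup and the pipeline execution service.
    runs = defaultdict(list)
    for member_fqid in data_group["members"]:
        # Group member assay bundles by library preparation so that several
        # sequencing runs of one 10x v2 library land in a single pipeline run.
        runs[lookup_library_preparation(member_fqid)].append(member_fqid)
    for prep_id, member_fqids in runs.items():
        submit_pipeline_run(prep_id, member_fqids)
```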

@samanehsan samanehsan Sep 24, 2019

I think this would be more accurate if it said the pipeline service needs to subscribe to the project data group...

The difference is that if every pipeline subscribed to "project data group notifications", then whenever a project is submitted the same notification would be sent to each pipeline. That creates a lot of duplicated work inspecting the contents to determine whether a given pipeline should be run, so it would be more efficient to have a single subscription to data group notifications. However, if there is a way to query the project data group bundles without requesting all of the metadata files from the data store, then one subscription per pipeline would be fine.
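
A minimal sketch of this single-subscription approach, with all names hypothetical:

```python
def dispatch_data_group_notification(notification, fetch_data_group,
                                     pipelines, enqueue):
    # `pipelines` maps pipeline name -> predicate over data group contents;
    # `fetch_data_group` performs the single metadata fetch; `enqueue`
    # schedules a pipeline run. All of these names are hypothetical.
    data_group = fetch_data_group(notification)  # inspect contents once
    for name, applies in pipelines.items():
        if applies(data_group):
            enqueue(name, data_group)
```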

Thanks, @samanehsan; the text has been clarified.

@samanehsan samanehsan left a comment

Having primary and secondary project submission data groups would add a lot of value in indicating whether a project is “version-complete” and ready to be consumed by downstream services such as the data browser and the matrix service. My understanding is that the immediate need to identify 10x v2 sequencing data from the same library preparation will be addressed by ingest, so re-architecting all of secondary analysis to run the existing SmartSeq2 and Optimus pipelines based on project submission data group notifications is not currently necessary.

This allows us to take a more incremental approach to the implementation. Specifically, analysis could first consider how to determine that the analysis for a project submission data group is complete. Then, once we need to run a pipeline (secondary or tertiary) that requires input from multiple bundles, we can subscribe to project submission data group notifications and modify our infrastructure to handle that use case.
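
That first step could be as simple as the following sketch, assuming a hypothetical lookup from a primary bundle to its analysis bundle:

```python
def analysis_complete(data_group, find_analysis_bundle):
    # `find_analysis_bundle` maps a primary bundle FQID to its analysis
    # bundle, or None if analysis has not produced one yet (hypothetical).
    return all(
        find_analysis_bundle(member_fqid) is not None
        for member_fqid in data_group["members"]
    )
```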

@diekhans

diekhans commented Sep 24, 2019 via email

@briandoconnor briandoconnor left a comment

This RFC seems like it would be the place to map out how we use data groups to signal to downstream systems when collections of data are "ready". It doesn't go far enough, though, to explicitly solve my problems with the portal, e.g. how do we know when a project has all of its primary data uploaded and ready for display in the portal, when it has finished being analyzed through secondary pipelines, and when its data is uploaded and ready in the matrix service? If we all agreed to use data groups for these key milestones in the project's data lifecycle, then we would have actions to take on the portal indexing side. Until there's agreement there, I don't think there's anything for the portal or data store to do with regards to this RFC. Mark tells me the proposal I'm asking for here will be part of another RFC, so I'm approving this one since I'm not opposed to the ideas presented here.

@diekhans

Thanks @briandoconnor.

Actually, I indicated this RFC would be updated rather than a new RFC being written. We need to work with the Azul and expression matrix groups to flesh out the details and make any required modifications.

@samanehsan

@diekhans, maybe @justincc can clarify this for us:

I think the plan is to not change the bundling for 10x v2, but I am not 100% sure.

This RFC about representing sequencing library preparations says "Data consumers will not need to depend on multi-bundle notifications for processing all data files from the same library preparation. All data files from a single library preparation will be put in the same logical unit if submitted at one time", so I thought the plan was to update the bundling in ingest.

@diekhans

It looks like 10x v2 is still up in the air. Note Mallory's use of "logical unit" instead of "bundle".

Personally, I would like to avoid creating a more complex upstream bundling structure based on current limitations. However, there is also the project reality of getting datasets into the DCP, which might make waiting for data groups very disadvantageous in the medium term.

I think this is mostly about schedule and priorities. Norman started a working group to oversee the implementation of both RFCs, so we should let it lead to a decision.

@samanehsan

Yes, I see what you mean. If the Optimus pipeline is the only current use case, I think that can be solved more simply by updating the bundling logic in ingest rather than by rewriting analysis. And of course all of the bundles from a submission could still go into a data group.

@diekhans

I talked to @MDunitz today and she is happy that this RFC doesn't cause even more problems than it solves.

So we are declaring it done.

@barkasn barkasn merged commit 45b039d into HumanCellAtlas:master Sep 30, 2019
diekhans pushed a commit to diekhans/dcp-community that referenced this pull request Oct 31, 2019