Feature Proposal: a Merger Tool #1036

Open
1 of 8 tasks
Donaim opened this issue Feb 12, 2024 · 1 comment · May be fixed by #1026

Donaim commented Feb 12, 2024

Background:
The MiCall pipeline currently processes reads one real sample at a time and outputs an assembled consensus sequence for each sample. Each run relies on its SampleSheet.csv file for input and output details. A feature to merge samples, ideally across different runs, would simplify downstream analysis.

Feature Description:
Introduce a merger tool that takes a .csv mapping file and generates a merged SampleSheet.csv, a RunInfo.xml, and a copy of the input .csv for traceability. Each row of the mapping file associates a sample_name and run_folder with an output_name, and together the rows specify the merging plan.
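
As a rough illustration of the mapping file, here is a minimal sketch that groups source samples by output_name. The column names come from this proposal; the header row, the column order, the example run folder paths, and the function name read_merge_plan are assumptions made for the example, not a settled format.

```python
import csv
import io
from collections import defaultdict


def read_merge_plan(mapping_file):
    """Group (run_folder, sample_name) pairs by their output_name.

    Assumes the mapping file has a header row naming the three columns.
    """
    plan = defaultdict(list)
    for row in csv.DictReader(mapping_file):
        plan[row["output_name"]].append((row["run_folder"], row["sample_name"]))
    return dict(plan)


# Hypothetical mapping: two samples from different runs merge into merged_A.
example = io.StringIO(
    "sample_name,run_folder,output_name\n"
    "S1,runs/240101,merged_A\n"
    "S7,runs/240115,merged_A\n"
    "S2,runs/240101,merged_B\n"
)
plan = read_merge_plan(example)
for output_name, sources in plan.items():
    print(output_name, sources)
```

Grouping by output_name up front keeps the later steps simple: each key becomes one record in the merged SampleSheet.csv, and the list order preserves which source was seen first.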

Feature Objectives:

  1. Facilitate efficient sample mergers across different run folders.
  2. Ensure consistency and traceability for merged samples.
  3. Handle default values and conflicts in input .csv files.

Functional Requirements:

  • Input to the tool:
    • Path to the mapping .csv file.
    • Path to the output folder.
  • Outputs of the tool:
    • SampleSheet.csv with merged output_name records.
    • RunInfo.xml copied from the first associated run_folder.
    • Input .csv file to trace origins of merged data.
  • Conflict resolution strategy, with a strict mode option (--strict flag).
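
The inputs listed above could map onto a command line roughly like the following sketch. Only the --strict flag is named in this proposal; the positional argument names (mapping_csv, output_folder) and the help texts are placeholders.

```python
import argparse
from pathlib import Path


def build_parser():
    """Command-line interface sketch: two paths plus an optional strict mode."""
    parser = argparse.ArgumentParser(
        description="Merge samples across run folders (sketch).")
    parser.add_argument("mapping_csv", type=Path,
                        help="path to the mapping .csv file")
    parser.add_argument("output_folder", type=Path,
                        help="folder that receives the merged files")
    parser.add_argument("--strict", action="store_true",
                        help="treat conflicting field values as errors")
    return parser


# Example invocation with hypothetical paths.
args = build_parser().parse_args(["plan.csv", "merged_out", "--strict"])
print(args.mapping_csv, args.output_folder, args.strict)
```

With action="store_true", strict mode is off unless requested, which matches non-strict resolution being the default.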

Conflict Resolution Rules:

  • The project_name header field should follow the $current_date.merged pattern.
  • The date header field should reflect the actual merge date.
  • All other fields should use the first observed value unless --strict is enabled.
  • The index and index2 fields should default to XXXXX.
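
One possible reading of these rules, as a sketch rather than the final implementation: headers are modelled as plain dicts, the date is written in ISO format, strict mode raises an error on any conflict, and "default to XXXXX" is taken to mean "fill in when the value is missing or empty". All four of those choices, and the function names, are assumptions.

```python
import datetime


def merge_header_fields(headers, strict=False, today=None):
    """Merge SampleSheet header dicts, keeping the first observed value."""
    today = today or datetime.date.today()
    merged = {}
    conflicts = []
    for header in headers:
        for key, value in header.items():
            if key in ("project_name", "date"):
                continue  # always regenerated below
            if key not in merged:
                merged[key] = value
            elif merged[key] != value:
                conflicts.append((key, merged[key], value))
    if strict and conflicts:
        raise ValueError(f"conflicting header fields: {conflicts}")
    for key, kept, ignored in conflicts:
        # Non-strict mode: report the conflict on stdout and carry on.
        print(f"conflict in {key!r}: keeping {kept!r}, ignoring {ignored!r}")
    stamp = today.isoformat()  # assumed date format
    merged["project_name"] = f"{stamp}.merged"
    merged["date"] = stamp
    return merged


def apply_index_defaults(row):
    """Fill missing or empty index fields with the XXXXX placeholder."""
    filled = dict(row)
    for key in ("index", "index2"):
        if not filled.get(key):
            filled[key] = "XXXXX"
    return filled
```

Collecting every conflict before deciding what to do with it means the same detection pass can serve both modes: strict mode fails with the full list, and the default mode only reports it.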

Implementation Tasks:

  • Develop a merging script for the underlying sample files.
  • Develop logic to parse the input .csv and handle row defaults.
  • Implement conflict detection logic with stdout reporting.
  • Create file generation procedures for SampleSheet.csv and RunInfo.xml.
  • Build a merging algorithm to create a consolidated .csv from the mapping file.
  • Add a --non-strict mode for conflict resolution and make it the default.
  • Write unit tests to validate merging logic and conflict handling.
  • Add documentation for the merger tool usage and features.
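
For the file generation task, the two traceability outputs might be produced along these lines. This is a sketch that assumes RunInfo.xml sits at the top level of each run folder and that "first" means the first run folder listed in the mapping file; copy_traceability_files is a hypothetical name.

```python
import shutil
from pathlib import Path


def copy_traceability_files(mapping_csv, run_folders, output_folder):
    """Copy RunInfo.xml from the first run folder, plus the mapping file."""
    output_folder = Path(output_folder)
    output_folder.mkdir(parents=True, exist_ok=True)

    # RunInfo.xml is taken from the first associated run folder only.
    run_info = output_folder / "RunInfo.xml"
    shutil.copyfile(Path(run_folders[0]) / "RunInfo.xml", run_info)

    # Keep a copy of the mapping file so merged data can be traced back.
    mapping_copy = output_folder / Path(mapping_csv).name
    shutil.copyfile(mapping_csv, mapping_copy)
    return run_info, mapping_copy
```

Writing the merged SampleSheet.csv itself is left out here, since its exact layout depends on the conflict resolution rules above.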
Donaim self-assigned this Feb 12, 2024

Donaim commented Feb 12, 2024

Merging script implemented in #1026

Donaim linked a pull request Feb 14, 2024 that will close this issue
Donaim added this to the 7.18 milestone May 8, 2024
Donaim modified the milestones: 7.18, near future Jul 11, 2024