-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
workflows: Add hardcoded workdir
directives
#16
Conversation
This seems like a good direction to me (no surprise there since we talked about it previously), and your solution of BTW, https://github.com/joverlee521/nextstrain-testing/blob/2650d10d1eb66a722469bcddda584bc1f50fae04/snakemake/shared-testing/current/README.md is a great design decision doc. Props for writing it up! I wonder if we should document this design decision somewhere more discoverable/official, like as a file in this repo? Out of curiosity, I quickly tested if For runtimes without an isolated filesystem, I believe |
|
||
From within the `ingest` directory, run the workflow with: | ||
All workflows are expected to the be run from the top level pathogen repo directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this still work if we run snakemake
from within ./ingest
, or does this break some paths? (I imagine paths such as ../shared/reference.fasta
are the most likely to not work.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just read pt. 1 in https://github.com/joverlee521/nextstrain-testing/blob/2650d10d1eb66a722469bcddda584bc1f50fae04/snakemake/shared-testing/current/README.md#different-nextstrain-build-commands -- it looks like the answer is that this won't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this still work if we run snakemake from within ./ingest, or does this break some paths?
The paths restrictions only apply when using the containerized runtimes with nextstrain build
, so using snakemake
natively should still work.
From within ./ingest
, you can run either snakemake --cores 1
or nextstrain build --conda
. Although, as @tsibley commented, we'd probably only recommend the top level directory for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's great! Yes, good to be consistent and recommend the approach you want people to use, but it's a big win that people can jump into the ~siloed workflow directory and just run a normal snake-make pipeline.
The paths restrictions only apply when using the containerized runtimes with nextstrain build
Yup, but you can get around this using this approach (internal slack link) which is really nice.
The recommendation to run workflows via As discussed in a recent meeting notes (gdoc) there's some thinking to be done around how our machinery to run repos (currently pathogen-repo-build) will interact with a repo where multiple commands will need to be run. For instance, will each command be run independently, or will they be combined so that they run in a single instance (with shared filesystem etc), or a bit of both? For instance, zika will have
and mpx will have (up to) 5:
Another way of thinking about this may be to treat the above as a DAG and ask how we manage the orchestration of each of the steps. I don't think this is blocking for this PR, but good to keep discussing what we want to do here. |
@jameshadfield in re-reading your comment just now, I noticed what seemed like a minor typo, so I corrected it to avoid future reader confusion. Noting it here in case I misunderstood though! |
Thanks! And maybe there's more than one invocation required depending on how the nextclade pipeline is set up in a given repo. Either way I don't think we're going to add Nextclade to the scheduled / automated rebuild action(s) so I think it's safe to ignore it for the purposes of discussing automation? @joverlee521 what's your current thinking on using |
I just realized today that there could be confusion with filepaths when needing to run a
Notice the target (This is an issue even outside of this change, but wanted to be explicit about it.) |
Generally, I'm not sold because it breaks some existing scenarios as mentioned in comments above. I would suggest additional changes on the CLI side:
|
comment originally written in my notes, copied here I think standardizing on That said, I have a separate proposal for addressing the The proposal by example: $ tree mpox/
mpox/
├── nextstrain-pathogen.yaml
├── ingest/
│ ├── Snakefile
│ └── …
├── phylogenetic/
│ ├── Snakefile
│ ├── config/
│ │ ├── hmpxv1/
│ │ │ ├── config.yaml
│ │ │ └── …
│ │ └── …
│ ├── results/
│ │ ├── hmpxv1/
│ │ │ ├── good_sequences.fasta
│ │ │ └── …
│ │ └── …
│ └── …
├── nextclade/
│ ├── Snakefile
│ └── …
├── shared/
│ ├── reference.fasta
│ └── …
└── …
$ cat nextstrain-pathogen.yaml
---
name: mpox
url: https://github.com/nextstrain/mpox
maintainers:
- name: the Nextstrain team
url: https://nextstrain.org
$ nextstrain build mpox/ingest
$ nextstrain build mpox/phylogenetic
$ nextstrain build mpox/phylogenetic \
> --configfile config/hmpxv1/config.yaml
$ nextstrain build mpox/phylogenetic \
> results/hmpxv1/good_sequences.fasta \
> --configfile config/hmpxv1/config.yaml
$ cd mpox/phylogenetic
$ nextstrain build . \
> --configfile config/hmpxv1/config.yaml
$ nextstrain build . \
> results/hmpxv1/good_sequences.fasta \
> --configfile config/hmpxv1/config.yaml
$ snakemake --cores all \
> results/hmpxv1/good_sequences.fasta \
> --configfile config/hmpxv1/config.yaml How would this work? By detecting and drawing the filesystem isolation boundary at the pathogen repo root instead of at the working directory given to The pathogen repo root is indicated by the presence of What does this look like in a container in practice? $ nextstrain shell --docker mpox/phylogenetic
Entering the Nextstrain runtime (docker)
Mapped volumes:
/nextstrain/build is from /home/tom/nextstrain/mpox
Run the command "exit" to leave the runtime.
$ pwd
/nextstrain/build/phylogenetic
$ tree
/nextstrain/build/phylogenetic/
├── Snakefile
├── config/
│ ├── hmpxv1/
│ │ ├── config.yaml
│ │ └── …
│ └── …
├── results/
│ ├── hmpxv1/
│ │ ├── good_sequences.fasta
│ │ └── …
│ └── …
└── …
$ cd ..
$ tree -L 1
/nextstrain/build/
├── nextstrain-pathogen.yaml
├── ingest/
├── phylogenetic/
├── nextclade/
├── shared/
└── … The core ideas in this proposal have stewed for years now---typically as part of my "workflows as programs" thinking---and while the iron's been warm previously, it now seems like---for this specific UI-focused change---the iron is hot and it's time to strike. ⁂ As a final thought: the tangle thru which we're trying to find the clearest path here arises solely, I believe, from the desire to have files shared between our various workflows (e.g. the workflows in I'm not saying we want to do this... but it doesn't seem terrible and I'm not sure we've seriously considered it? |
Yes, but I think it's the other way around. The tangle we've created here arises from the desire to create siloed workflows for a process which is inherently not made up of independent parts. The need to access shared data, in the form of FWIW, if we continue with the ~siloed approach, I like Tom's proposed CLI UI, it seems to strike a decent balance between "Just Works" whilst also being trivial to recreate the behaviour using snakemake directly. The latter is really important for those who have snakemake familiarity and are tying to understand the CLI. |
I think it's true that we're moving complexity around, not reducing it on a global/absolute level. But I also think that by moving it around, we can reduce local/relative complexity for each part, which can be worth it for someone who doesn't need/want to understand the the whole thing at once. (I don't think a motivating reason for the split is to avoid the use of targets when running workflows. No one seems to mind that?) |
Workflows must be run from the top level pathogen repo directory in order for them to have access to the `shared` directory in the containerized runtimes. After some external testing of different ways to invoke the workflows with the Nextstrain CLI¹, I've decided to add the hardcoded `workdir` directive in each workflow's main Snakefile. Then the default workflows can be run with: ``` nextstrain build . -s phylogenetic/Snakefile nextstrain build . -s ingest/Snakefile nextstrain build . -s nextclade/Snakefile ``` With this uniform organization of workflows across pathogens, we can then move the Snakemake option to behind the scenes within the Nextclade CLI. This will allow us to have a workflow manager agnostic option, such as: ``` nextstrain build --workflow phylogenetic . ``` ¹ https://github.com/joverlee521/nextstrain-testing/blob/2650d10d1eb66a722469bcddda584bc1f50fae04/snakemake/shared-testing/current/README.md
410b56a
to
9be7a57
Compare
Rebased onto main to resolve merge conflicts. |
…ctories …as this is the direction we're moving. The main change is making the filesystem isolation boundary (i.e. what's mapped to `/nextstrain/build` in a container) a _separate thing_ from the workflow and initial working directory (i.e. what's given to `nextstrain build`). In this codebase, these two things are referred to as the _build volume_ (aka `opts.build`) and the _working volume_. Historically, the working volume given to the runners for `nextstrain build` and `nextstrain shell` _was_ the build volume; now, they're separately considered and sometimes differ. See the included changelog entry for usage details and rationale. For background, I made the initial proposal for this feature¹ on a PR in our pathogen-repo-template repository and some discussion ensued. ¹ <nextstrain/pathogen-repo-guide#16 (comment)> or <https://github.com/tsibley/blab-standup/blob/master/2024-01-25.md>
I've implemented my proposal above in Nextstrain CLI as nextstrain/cli#355. |
…ctories …as this is the direction we're moving. The main change is making the filesystem isolation boundary (i.e. what's mapped to `/nextstrain/build` in a container) a _separate thing_ from the workflow and initial working directory (i.e. what's given to `nextstrain build`). In this codebase, these two things are referred to as the _build volume_ (aka `opts.build`) and the _working volume_. Historically, the working volume given to the runners for `nextstrain build` and `nextstrain shell` _was_ the build volume; now, they're separately considered and sometimes differ. See the included changelog entry for usage details and rationale. For background, I made the initial proposal for this feature¹ on a PR in our pathogen-repo-template repository and some discussion ensued. ¹ <nextstrain/pathogen-repo-guide#16 (comment)> or <https://github.com/tsibley/blab-standup/blob/master/2024-01-25.md>
With the new support from Nextstrain CLI for workflows in subdirectories, this change is no longer necessary to standardize the However, it seems like it's still helpful for the maintenance of workflows by forcing all relative paths in workflows to be relative to the top Snakefile. Merging this PR now and will update repo to match the Nextstrain CLI changes in a separate PR. |
Workflows must be run from the top level pathogen repo directory in order for them to have access to the
shared
directory in the containerized runtimes. After some external testing of different ways to invoke the workflows with the Nextstrain CLI¹, I've decided to add the hardcodedworkdir
directive in each workflow's main Snakefile.Then the default workflows can be run with:
With this uniform organization of workflows across pathogens, we can then move the Snakemake option to behind the scenes within the Nextclade CLI.
This will allow us to have a workflow manager agnostic option, such as:
¹ https://github.com/joverlee521/nextstrain-testing/blob/2650d10d1eb66a722469bcddda584bc1f50fae04/snakemake/shared-testing/current/README.md