workflows: Add hardcoded `workdir` directives #16

joverlee521 · 2023-11-21T20:43:32Z

Workflows must be run from the top level pathogen repo directory in order for them to have access to the shared directory in the containerized runtimes. After some external testing of different ways to invoke the workflows with the Nextstrain CLI¹, I've decided to add the hardcoded workdir directive in each workflow's main Snakefile.

Then the default workflows can be run with:

nextstrain build . -s phylogenetic/Snakefile
nextstrain build . -s ingest/Snakefile
nextstrain build . -s nextclade/Snakefile

With this uniform organization of workflows across pathogens, we can then move the Snakemake option to behind the scenes within the Nextclade CLI.

This will allow us to have a workflow manager agnostic option, such as:

nextstrain build --workflow phylogenetic .

¹ https://github.com/joverlee521/nextstrain-testing/blob/2650d10d1eb66a722469bcddda584bc1f50fae04/snakemake/shared-testing/current/README.md

tsibley · 2024-01-10T00:42:18Z

This seems like a good direction to me (no surprise there since we talked about it previously), and your solution of workdir: workflow.current_basedir + -s seems quite succinct and manageable. I'll be interested to see what others think.

BTW, https://github.com/joverlee521/nextstrain-testing/blob/2650d10d1eb66a722469bcddda584bc1f50fae04/snakemake/shared-testing/current/README.md is a great design decision doc. Props for writing it up! I wonder if we should document this design decision somewhere more discoverable/official, like as a file in this repo?

Out of curiosity, I quickly tested if --directory (-d) overrode the declared workdir and it appears to. That's nice because it would, I believe, still let someone reuse these workflows in completely other contexts.

For runtimes without an isolated filesystem, I believe -s-less invocations from within each workflow directory (e.g. within ingest/) are also possible with this approach. Is that right? Though we probably should recommend against them for consistency.

jameshadfield · 2024-01-10T02:17:53Z

ingest/README.md


-From within the `ingest` directory, run the workflow with:
+All workflows are expected to the be run from the top level pathogen repo directory.


Does this still work if we run snakemake from within ./ingest, or does this break some paths? (I imagine paths such as ../shared/reference.fasta are the most likely to not work.)

I've just read pt. 1 in https://github.com/joverlee521/nextstrain-testing/blob/2650d10d1eb66a722469bcddda584bc1f50fae04/snakemake/shared-testing/current/README.md#different-nextstrain-build-commands -- it looks like the answer is that this won't work.

Does this still work if we run snakemake from within ./ingest, or does this break some paths?

The paths restrictions only apply when using the containerized runtimes with nextstrain build, so using snakemake natively should still work.

From within ./ingest, you can run either snakemake --cores 1 or nextstrain build --conda. Although, as @tsibley commented, we'd probably only recommend the top level directory for consistency.

That's great! Yes, good to be consistent and recommend the approach you want people to use, but it's a big win that people can jump into the ~siloed workflow directory and just run a normal snake-make pipeline.

The paths restrictions only apply when using the containerized runtimes with nextstrain build

Yup, but you can get around this using this approach (internal slack link) which is really nice.

jameshadfield · 2024-01-10T23:46:37Z

The recommendation to run workflows via nextstrain build . -s phylogenetic/Snakefile (etc) seems alright to me.

As discussed in a recent meeting notes (gdoc) there's some thinking to be done around how our machinery to run repos (currently pathogen-repo-build) will interact with a repo where multiple commands will need to be run. For instance, will each command be run independently, or will they be combined so that they run in a single instance (with shared filesystem etc), or a bit of both? For instance, zika will have

nextstrain build . -s ingest/Snakefile
nextstrain build . -s phylogenetic/Snakefile

and mpx will have (up to) 5:

nextstrain build . -s ingest/Snakefile
nextstrain build . -s phylogenetic/Snakefile --configfile config/mpxv/config.yaml
nextstrain build . -s phylogenetic/Snakefile --configfile config/hmpxv1/config.yaml
nextstrain build . -s phylogenetic/Snakefile --configfile config/hmpxv1_big/config.yaml
nextstrain build . -s nextclade/Snakefile

Another way of thinking about this may be to treat the above as a DAG and ask how we manage the orchestration of each of the steps. I don't think this is blocking for this PR, but good to keep discussing what we want to do here.

tsibley · 2024-01-24T08:17:50Z

@jameshadfield in re-reading your comment just now, I noticed what seemed like a minor typo, so I corrected it to avoid future reader confusion. Noting it here in case I misunderstood though!

jameshadfield · 2024-01-24T21:25:52Z

I noticed what seemed like a minor typo, so I corrected it to avoid future reader confusion

Thanks! And maybe there's more than one invocation required depending on how the nextclade pipeline is set up in a given repo. Either way I don't think we're going to add Nextclade to the scheduled / automated rebuild action(s) so I think it's safe to ignore it for the purposes of discussing automation?

@joverlee521 what's your current thinking on using workdir as in this PR? It seems there's consensus that this is a good direction

joverlee521 · 2024-01-24T23:54:45Z

I just realized today that there could be confusion with filepaths when needing to run a nextstrain build command that specifies a file as a target.

nextstrain build .  \
    results/hmpxv1/good_sequences.fasta \
    -s phylogenetic/Snakefile \
    --configfile phylogenetic/config/hmpxv1/config.yaml

Notice the target results/hmpxv1/good_sequences.fasta must exclude the phylogenetic/ directory because the targets are relative to the working directory.

(This is an issue even outside of this change, but wanted to be explicit about it.)

victorlin · 2024-01-25T01:51:16Z

Generally, workdir: workflow.current_basedir seems like a good direction.

I'm not sold because it breaks some existing scenarios as mentioned in comments above. I would suggest additional changes on the CLI side:

Add common Snakemake options and examples in the docs for nextstrain build.

Handle common Snakemake errors with useful output.

Example:

$ nextstrain build .
Error: no Snakefile found, tried Snakefile, snakefile, workflow/Snakefile, workflow/snakefile.

can be:

Error: you must specify a workflow in the pathogen repo, e.g.
nextstrain build . -s phylogenetic/Snakefile

Another one which I'm less certain about:

$ nextstrain build .  \
    phylogenetic/results/hmpxv1/good_sequences.fasta \
    -s phylogenetic/Snakefile \
    --configfile phylogenetic/config/hmpxv1/config.yaml
MissingRuleException:
No rule to produce phylogenetic/results/hmpxv1/good_sequences.fasta (if you use input functions make sure that they don't raise unexpected exceptions).

can be:

Error: Unable to use Snakemake target 'phylogenetic/results/hmpxv1/good_sequences.fasta'.
Note that paths must be relative to the Snakefile.

tsibley · 2024-01-25T22:09:19Z

comment originally written in my notes, copied here

I think standardizing on workdir: workflow.current_basedir in our workflows makes sense and is a good idea. It makes paths within the workflow much more predictable: they're always relative to the Snakefile unless you override the workdir (e.g. with --directory or -d). This inverts the default Snakemake behaviour of paths in the workflow being relative to whatever the working directory is when snakemake is invoked, which I've long thought was confusing and backwards. Still, with the workdir: proposal here, that default behaviour may be regained if needed by passing --directory . when invoking workflows from outside their containing directory.

That said, I have a separate proposal for addressing the nextstrain build UI issues. Or, at least, how to address them further so that even -s isn't necessary and specifying files as targets (results/hmpxv1/good_sequences.fasta) doesn't feel as "odd".

The proposal by example:

$ tree mpox/
mpox/
├── nextstrain-pathogen.yaml
├── ingest/
│   ├── Snakefile
│   └── …
├── phylogenetic/
│   ├── Snakefile
│   ├── config/
│   │   ├── hmpxv1/
│   │   │   ├── config.yaml
│   │   │   └── …
│   │   └── …
│   ├── results/
│   │   ├── hmpxv1/
│   │   │   ├── good_sequences.fasta
│   │   │   └── …
│   │   └── …
│   └── …
├── nextclade/
│   ├── Snakefile
│   └── …
├── shared/
│   ├── reference.fasta
│   └── …
└── …

$ cat nextstrain-pathogen.yaml
---
name: mpox
url: https://github.com/nextstrain/mpox
maintainers:
  - name: the Nextstrain team
    url: https://nextstrain.org

$ nextstrain build mpox/ingest
$ nextstrain build mpox/phylogenetic
$ nextstrain build mpox/phylogenetic \
>   --configfile config/hmpxv1/config.yaml

$ nextstrain build mpox/phylogenetic \
>   results/hmpxv1/good_sequences.fasta \
>   --configfile config/hmpxv1/config.yaml

$ cd mpox/phylogenetic
$ nextstrain build . \
>   --configfile config/hmpxv1/config.yaml

$ nextstrain build . \
>   results/hmpxv1/good_sequences.fasta \
>   --configfile config/hmpxv1/config.yaml

$ snakemake --cores all \
>   results/hmpxv1/good_sequences.fasta \
>   --configfile config/hmpxv1/config.yaml

How would this work? By detecting and drawing the filesystem isolation boundary at the pathogen repo root instead of at the working directory given to nextstrain build. That is, making the filesystem isolation boundary (i.e. what's mapped to /nextstrain/build in a container) a separate thing from the workflow and starting working directory (i.e. what's given to nextstrain build).

The pathogen repo root is indicated by the presence of nextstrain-pathogen.yaml, which we can then later leverage to also contain pathogen-level metadata for use elsewhere in indexing, listing, attribution, etc. An empty file would work just fine for now for the purposes of this proposal; the important bit is having a marker file (and one not tied to Git, i.e. .git/ is not a good substitute).

What does this look like in a container in practice?

$ nextstrain shell --docker mpox/phylogenetic
Entering the Nextstrain runtime (docker)

Mapped volumes:
  /nextstrain/build is from /home/tom/nextstrain/mpox

Run the command "exit" to leave the runtime.

$ pwd
/nextstrain/build/phylogenetic

$ tree
/nextstrain/build/phylogenetic/
├── Snakefile
├── config/
│   ├── hmpxv1/
│   │   ├── config.yaml
│   │   └── …
│   └── …
├── results/
│   ├── hmpxv1/
│   │   ├── good_sequences.fasta
│   │   └── …
│   └── …
└── …

$ cd ..
$ tree -L 1
/nextstrain/build/
├── nextstrain-pathogen.yaml
├── ingest/
├── phylogenetic/
├── nextclade/
├── shared/
└── …

The core ideas in this proposal have stewed for years now---typically as part of my "workflows as programs" thinking---and while the iron's been warm previously, it now seems like---for this specific UI-focused change---the iron is hot and it's time to strike.

⁂

As a final thought: the tangle thru which we're trying to find the clearest path here arises solely, I believe, from the desire to have files shared between our various workflows (e.g. the workflows in ingest/ and phylogenetic/ reference shared/). If we ditched the premise that they could access them directly, the whole tangle would go away, right? We could still maintain shared/ but as part of the dev process ensure they're copied into the places in ingest/ and phylogenetic/ as needed (e.g. by running make or ./devel/proliferate-shared or similar and enforcing so with CI/linting).

I'm not saying we want to do this... but it doesn't seem terrible and I'm not sure we've seriously considered it?

jameshadfield · 2024-01-26T00:07:24Z

As a final thought: the tangle thru which we're trying to find the clearest path here arises solely, I believe, from the desire to have files shared between our various workflows (e.g. the workflows in ingest/ and phylogenetic/ reference shared/)

Yes, but I think it's the other way around. The tangle we've created here arises from the desire to create siloed workflows for a process which is inherently not made up of independent parts. The need to access shared data, in the form of ./shared but also in the form of the phylo workflow accessing data from the ingest silo, as well as the need to create an orchestration workflow on top of the snakemake workflows is a result of this siloing. My understanding for why we did this is largely to reduce the complexity of each snakemake workflow and to avoid having to use targets when running workflows. I think this issue shows that this complexity has just been moved elsewhere, in the form of workdirs and proposals like Tom's above.

FWIW, if we continue with the ~siloed approach, I like Tom's proposed CLI UI, it seems to strike a decent balance between "Just Works" whilst also being trivial to recreate the behaviour using snakemake directly. The latter is really important for those who have snakemake familiarity and are tying to understand the CLI.

tsibley · 2024-01-29T18:58:23Z

The tangle we've created here arises from the desire to create siloed workflows for a process which is inherently not made up of independent parts. […] My understanding for why we did this is largely to reduce the complexity of each snakemake workflow and to avoid having to use targets when running workflows. I think this issue shows that this complexity has just been moved elsewhere, in the form of workdirs and proposals like #16 (comment).

I think it's true that we're moving complexity around, not reducing it on a global/absolute level. But I also think that by moving it around, we can reduce local/relative complexity for each part, which can be worth it for someone who doesn't need/want to understand the the whole thing at once.

(I don't think a motivating reason for the split is to avoid the use of targets when running workflows. No one seems to mind that?)

Workflows must be run from the top level pathogen repo directory in order for them to have access to the `shared` directory in the containerized runtimes. After some external testing of different ways to invoke the workflows with the Nextstrain CLI¹, I've decided to add the hardcoded `workdir` directive in each workflow's main Snakefile. Then the default workflows can be run with: ``` nextstrain build . -s phylogenetic/Snakefile nextstrain build . -s ingest/Snakefile nextstrain build . -s nextclade/Snakefile ``` With this uniform organization of workflows across pathogens, we can then move the Snakemake option to behind the scenes within the Nextclade CLI. This will allow us to have a workflow manager agnostic option, such as: ``` nextstrain build --workflow phylogenetic . ``` ¹ https://github.com/joverlee521/nextstrain-testing/blob/2650d10d1eb66a722469bcddda584bc1f50fae04/snakemake/shared-testing/current/README.md

joverlee521 · 2024-01-29T22:45:27Z

Rebased onto main to resolve merge conflicts.

…ctories …as this is the direction we're moving. The main change is making the filesystem isolation boundary (i.e. what's mapped to `/nextstrain/build` in a container) a _separate thing_ from the workflow and initial working directory (i.e. what's given to `nextstrain build`). In this codebase, these two things are referred to as the _build volume_ (aka `opts.build`) and the _working volume_. Historically, the working volume given to the runners for `nextstrain build` and `nextstrain shell` _was_ the build volume; now, they're separately considered and sometimes differ. See the included changelog entry for usage details and rationale. For background, I made the initial proposal for this feature¹ on a PR in our pathogen-repo-template repository and some discussion ensued. ¹ <nextstrain/pathogen-repo-guide#16 (comment)> or <https://github.com/tsibley/blab-standup/blob/master/2024-01-25.md>

tsibley · 2024-01-31T23:10:58Z

I've implemented my proposal above in Nextstrain CLI as nextstrain/cli#355.

…ctories …as this is the direction we're moving. The main change is making the filesystem isolation boundary (i.e. what's mapped to `/nextstrain/build` in a container) a _separate thing_ from the workflow and initial working directory (i.e. what's given to `nextstrain build`). In this codebase, these two things are referred to as the _build volume_ (aka `opts.build`) and the _working volume_. Historically, the working volume given to the runners for `nextstrain build` and `nextstrain shell` _was_ the build volume; now, they're separately considered and sometimes differ. See the included changelog entry for usage details and rationale. For background, I made the initial proposal for this feature¹ on a PR in our pathogen-repo-template repository and some discussion ensued. ¹ <nextstrain/pathogen-repo-guide#16 (comment)> or <https://github.com/tsibley/blab-standup/blob/master/2024-01-25.md>

joverlee521 · 2024-02-02T23:09:49Z

With the new support from Nextstrain CLI for workflows in subdirectories, this change is no longer necessary to standardize the nextstrain build commands.

However, it seems like it's still helpful for the maintenance of workflows by forcing all relative paths in workflows to be relative to the top Snakefile.

Merging this PR now and will update repo to match the Nextstrain CLI changes in a separate PR.

joverlee521 requested a review from a team November 21, 2023 20:45

jameshadfield reviewed Jan 10, 2024

View reviewed changes

joverlee521 mentioned this pull request Jan 17, 2024

How should we standardize config schema for files? #26

Open

jameshadfield mentioned this pull request Jan 26, 2024

How should we standardize GitHub Action workflows per repo? #25

Open

joverlee521 force-pushed the add-workdir branch from 410b56a to 9be7a57 Compare January 29, 2024 22:41

tsibley mentioned this pull request Jan 31, 2024

Better support pathogen repositories which place workflows in subdirectories nextstrain/cli#355

Merged

1 task

joverlee521 merged commit 54d270e into main Feb 2, 2024

joverlee521 deleted the add-workdir branch February 2, 2024 23:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

workflows: Add hardcoded `workdir` directives #16

workflows: Add hardcoded `workdir` directives #16

joverlee521 commented Nov 21, 2023

tsibley commented Jan 10, 2024

jameshadfield Jan 10, 2024

jameshadfield Jan 10, 2024

joverlee521 Jan 10, 2024 •

edited

Loading

jameshadfield Jan 10, 2024

jameshadfield commented Jan 10, 2024 •

edited by tsibley

Loading

tsibley commented Jan 24, 2024

jameshadfield commented Jan 24, 2024

joverlee521 commented Jan 24, 2024

victorlin commented Jan 25, 2024 •

edited

Loading

tsibley commented Jan 25, 2024

jameshadfield commented Jan 26, 2024 •

edited

Loading

tsibley commented Jan 29, 2024

joverlee521 commented Jan 29, 2024

tsibley commented Jan 31, 2024

joverlee521 commented Feb 2, 2024


		From within the `ingest` directory, run the workflow with:
		All workflows are expected to the be run from the top level pathogen repo directory.

workflows: Add hardcoded workdir directives #16

workflows: Add hardcoded workdir directives #16

Conversation

joverlee521 commented Nov 21, 2023

tsibley commented Jan 10, 2024

jameshadfield Jan 10, 2024

Choose a reason for hiding this comment

jameshadfield Jan 10, 2024

Choose a reason for hiding this comment

joverlee521 Jan 10, 2024 • edited Loading

Choose a reason for hiding this comment

jameshadfield Jan 10, 2024

Choose a reason for hiding this comment

jameshadfield commented Jan 10, 2024 • edited by tsibley Loading

tsibley commented Jan 24, 2024

jameshadfield commented Jan 24, 2024

joverlee521 commented Jan 24, 2024

victorlin commented Jan 25, 2024 • edited Loading

tsibley commented Jan 25, 2024

jameshadfield commented Jan 26, 2024 • edited Loading

tsibley commented Jan 29, 2024

joverlee521 commented Jan 29, 2024

tsibley commented Jan 31, 2024

joverlee521 commented Feb 2, 2024

workflows: Add hardcoded `workdir` directives #16

workflows: Add hardcoded `workdir` directives #16

joverlee521 Jan 10, 2024 •

edited

Loading

jameshadfield commented Jan 10, 2024 •

edited by tsibley

Loading

victorlin commented Jan 25, 2024 •

edited

Loading

jameshadfield commented Jan 26, 2024 •

edited

Loading