Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workflows: Add hardcoded workdir directives #16

Merged
merged 1 commit into from
Feb 2, 2024
Merged

Conversation

joverlee521
Copy link
Contributor

Workflows must be run from the top level pathogen repo directory in order for them to have access to the shared directory in the containerized runtimes. After some external testing of different ways to invoke the workflows with the Nextstrain CLI¹, I've decided to add the hardcoded workdir directive in each workflow's main Snakefile.

Then the default workflows can be run with:

nextstrain build . -s phylogenetic/Snakefile
nextstrain build . -s ingest/Snakefile
nextstrain build . -s nextclade/Snakefile

With this uniform organization of workflows across pathogens, we can then move the Snakemake option to behind the scenes within the Nextclade CLI.

This will allow us to have a workflow manager agnostic option, such as:

nextstrain build --workflow phylogenetic .

¹ https://github.com/joverlee521/nextstrain-testing/blob/2650d10d1eb66a722469bcddda584bc1f50fae04/snakemake/shared-testing/current/README.md

@joverlee521 joverlee521 requested a review from a team November 21, 2023 20:45
@tsibley
Copy link
Member

tsibley commented Jan 10, 2024

This seems like a good direction to me (no surprise there since we talked about it previously), and your solution of workdir: workflow.current_basedir + -s seems quite succinct and manageable. I'll be interested to see what others think.

BTW, https://github.com/joverlee521/nextstrain-testing/blob/2650d10d1eb66a722469bcddda584bc1f50fae04/snakemake/shared-testing/current/README.md is a great design decision doc. Props for writing it up! I wonder if we should document this design decision somewhere more discoverable/official, like as a file in this repo?

Out of curiosity, I quickly tested if --directory (-d) overrode the declared workdir and it appears to. That's nice because it would, I believe, still let someone reuse these workflows in completely other contexts.

For runtimes without an isolated filesystem, I believe -s-less invocations from within each workflow directory (e.g. within ingest/) are also possible with this approach. Is that right? Though we probably should recommend against them for consistency.


From within the `ingest` directory, run the workflow with:
All workflows are expected to the be run from the top level pathogen repo directory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this still work if we run snakemake from within ./ingest, or does this break some paths? (I imagine paths such as ../shared/reference.fasta are the most likely to not work.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@joverlee521 joverlee521 Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this still work if we run snakemake from within ./ingest, or does this break some paths?

The paths restrictions only apply when using the containerized runtimes with nextstrain build, so using snakemake natively should still work.

From within ./ingest, you can run either snakemake --cores 1 or nextstrain build --conda. Although, as @tsibley commented, we'd probably only recommend the top level directory for consistency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great! Yes, good to be consistent and recommend the approach you want people to use, but it's a big win that people can jump into the ~siloed workflow directory and just run a normal snake-make pipeline.

The paths restrictions only apply when using the containerized runtimes with nextstrain build

Yup, but you can get around this using this approach (internal slack link) which is really nice.

@jameshadfield
Copy link
Member

jameshadfield commented Jan 10, 2024

The recommendation to run workflows via nextstrain build . -s phylogenetic/Snakefile (etc) seems alright to me.

As discussed in a recent meeting notes (gdoc) there's some thinking to be done around how our machinery to run repos (currently pathogen-repo-build) will interact with a repo where multiple commands will need to be run. For instance, will each command be run independently, or will they be combined so that they run in a single instance (with shared filesystem etc), or a bit of both? For instance, zika will have

nextstrain build . -s ingest/Snakefile
nextstrain build . -s phylogenetic/Snakefile

and mpx will have (up to) 5:

nextstrain build . -s ingest/Snakefile
nextstrain build . -s phylogenetic/Snakefile --configfile config/mpxv/config.yaml
nextstrain build . -s phylogenetic/Snakefile --configfile config/hmpxv1/config.yaml
nextstrain build . -s phylogenetic/Snakefile --configfile config/hmpxv1_big/config.yaml
nextstrain build . -s nextclade/Snakefile

Another way of thinking about this may be to treat the above as a DAG and ask how we manage the orchestration of each of the steps. I don't think this is blocking for this PR, but good to keep discussing what we want to do here.

@tsibley
Copy link
Member

tsibley commented Jan 24, 2024

@jameshadfield in re-reading your comment just now, I noticed what seemed like a minor typo, so I corrected it to avoid future reader confusion. Noting it here in case I misunderstood though! Screenshot from 2024-01-24 00-15-55

@jameshadfield
Copy link
Member

I noticed what seemed like a minor typo, so I corrected it to avoid future reader confusion

Thanks! And maybe there's more than one invocation required depending on how the nextclade pipeline is set up in a given repo. Either way I don't think we're going to add Nextclade to the scheduled / automated rebuild action(s) so I think it's safe to ignore it for the purposes of discussing automation?

@joverlee521 what's your current thinking on using workdir as in this PR? It seems there's consensus that this is a good direction

@joverlee521
Copy link
Contributor Author

I just realized today that there could be confusion with filepaths when needing to run a nextstrain build command that specifies a file as a target.

nextstrain build .  \
    results/hmpxv1/good_sequences.fasta \
    -s phylogenetic/Snakefile \
    --configfile phylogenetic/config/hmpxv1/config.yaml

Notice the target results/hmpxv1/good_sequences.fasta must exclude the phylogenetic/ directory because the targets are relative to the working directory.

(This is an issue even outside of this change, but wanted to be explicit about it.)

@victorlin
Copy link
Member

victorlin commented Jan 25, 2024

Generally, workdir: workflow.current_basedir seems like a good direction.

I'm not sold because it breaks some existing scenarios as mentioned in comments above. I would suggest additional changes on the CLI side:

  1. Add common Snakemake options and examples in the docs for nextstrain build.

  2. Handle common Snakemake errors with useful output.

    Example:

    $ nextstrain build .
    Error: no Snakefile found, tried Snakefile, snakefile, workflow/Snakefile, workflow/snakefile.
    

    can be:

    Error: you must specify a workflow in the pathogen repo, e.g.
    nextstrain build . -s phylogenetic/Snakefile
    

    Another one which I'm less certain about:

    $ nextstrain build .  \
        phylogenetic/results/hmpxv1/good_sequences.fasta \
        -s phylogenetic/Snakefile \
        --configfile phylogenetic/config/hmpxv1/config.yaml
    MissingRuleException:
    No rule to produce phylogenetic/results/hmpxv1/good_sequences.fasta (if you use input functions make sure that they don't raise unexpected exceptions).
    

    can be:

    Error: Unable to use Snakemake target 'phylogenetic/results/hmpxv1/good_sequences.fasta'.
    Note that paths must be relative to the Snakefile.
    

@tsibley
Copy link
Member

tsibley commented Jan 25, 2024

comment originally written in my notes, copied here

I think standardizing on workdir: workflow.current_basedir in our workflows makes sense and is a good idea. It makes paths within the workflow much more predictable: they're always relative to the Snakefile unless you override the workdir (e.g. with --directory or -d). This inverts the default Snakemake behaviour of paths in the workflow being relative to whatever the working directory is when snakemake is invoked, which I've long thought was confusing and backwards. Still, with the workdir: proposal here, that default behaviour may be regained if needed by passing --directory . when invoking workflows from outside their containing directory.

That said, I have a separate proposal for addressing the nextstrain build UI issues. Or, at least, how to address them further so that even -s isn't necessary and specifying files as targets (results/hmpxv1/good_sequences.fasta) doesn't feel as "odd".

The proposal by example:

$ tree mpox/
mpox/
├── nextstrain-pathogen.yaml
├── ingest/
│   ├── Snakefile
│   └── …
├── phylogenetic/
│   ├── Snakefile
│   ├── config/
│   │   ├── hmpxv1/
│   │   │   ├── config.yaml
│   │   │   └── …
│   │   └── …
│   ├── results/
│   │   ├── hmpxv1/
│   │   │   ├── good_sequences.fasta
│   │   │   └── …
│   │   └── …
│   └── …
├── nextclade/
│   ├── Snakefile
│   └── …
├── shared/
│   ├── reference.fasta
│   └── …
└── …

$ cat nextstrain-pathogen.yaml
---
name: mpox
url: https://github.com/nextstrain/mpox
maintainers:
  - name: the Nextstrain team
    url: https://nextstrain.org

$ nextstrain build mpox/ingest
$ nextstrain build mpox/phylogenetic
$ nextstrain build mpox/phylogenetic \
>   --configfile config/hmpxv1/config.yaml

$ nextstrain build mpox/phylogenetic \
>   results/hmpxv1/good_sequences.fasta \
>   --configfile config/hmpxv1/config.yaml

$ cd mpox/phylogenetic
$ nextstrain build . \
>   --configfile config/hmpxv1/config.yaml

$ nextstrain build . \
>   results/hmpxv1/good_sequences.fasta \
>   --configfile config/hmpxv1/config.yaml

$ snakemake --cores all \
>   results/hmpxv1/good_sequences.fasta \
>   --configfile config/hmpxv1/config.yaml

How would this work? By detecting and drawing the filesystem isolation boundary at the pathogen repo root instead of at the working directory given to nextstrain build. That is, making the filesystem isolation boundary (i.e. what's mapped to /nextstrain/build in a container) a separate thing from the workflow and starting working directory (i.e. what's given to nextstrain build).

The pathogen repo root is indicated by the presence of nextstrain-pathogen.yaml, which we can then later leverage to also contain pathogen-level metadata for use elsewhere in indexing, listing, attribution, etc. An empty file would work just fine for now for the purposes of this proposal; the important bit is having a marker file (and one not tied to Git, i.e. .git/ is not a good substitute).

What does this look like in a container in practice?

$ nextstrain shell --docker mpox/phylogenetic
Entering the Nextstrain runtime (docker)

Mapped volumes:
  /nextstrain/build is from /home/tom/nextstrain/mpox

Run the command "exit" to leave the runtime.

$ pwd
/nextstrain/build/phylogenetic

$ tree
/nextstrain/build/phylogenetic/
├── Snakefile
├── config/
│   ├── hmpxv1/
│   │   ├── config.yaml
│   │   └── …
│   └── …
├── results/
│   ├── hmpxv1/
│   │   ├── good_sequences.fasta
│   │   └── …
│   └── …
└── …

$ cd ..
$ tree -L 1
/nextstrain/build/
├── nextstrain-pathogen.yaml
├── ingest/
├── phylogenetic/
├── nextclade/
├── shared/
└── …

The core ideas in this proposal have stewed for years now---typically as part of my "workflows as programs" thinking---and while the iron's been warm previously, it now seems like---for this specific UI-focused change---the iron is hot and it's time to strike.

As a final thought: the tangle thru which we're trying to find the clearest path here arises solely, I believe, from the desire to have files shared between our various workflows (e.g. the workflows in ingest/ and phylogenetic/ reference shared/). If we ditched the premise that they could access them directly, the whole tangle would go away, right? We could still maintain shared/ but as part of the dev process ensure they're copied into the places in ingest/ and phylogenetic/ as needed (e.g. by running make or ./devel/proliferate-shared or similar and enforcing so with CI/linting).

I'm not saying we want to do this... but it doesn't seem terrible and I'm not sure we've seriously considered it?

@jameshadfield
Copy link
Member

jameshadfield commented Jan 26, 2024

As a final thought: the tangle thru which we're trying to find the clearest path here arises solely, I believe, from the desire to have files shared between our various workflows (e.g. the workflows in ingest/ and phylogenetic/ reference shared/)

Yes, but I think it's the other way around. The tangle we've created here arises from the desire to create siloed workflows for a process which is inherently not made up of independent parts. The need to access shared data, in the form of ./shared but also in the form of the phylo workflow accessing data from the ingest silo, as well as the need to create an orchestration workflow on top of the snakemake workflows is a result of this siloing. My understanding for why we did this is largely to reduce the complexity of each snakemake workflow and to avoid having to use targets when running workflows. I think this issue shows that this complexity has just been moved elsewhere, in the form of workdirs and proposals like Tom's above.

FWIW, if we continue with the ~siloed approach, I like Tom's proposed CLI UI, it seems to strike a decent balance between "Just Works" whilst also being trivial to recreate the behaviour using snakemake directly. The latter is really important for those who have snakemake familiarity and are tying to understand the CLI.

@tsibley
Copy link
Member

tsibley commented Jan 29, 2024

The tangle we've created here arises from the desire to create siloed workflows for a process which is inherently not made up of independent parts. […] My understanding for why we did this is largely to reduce the complexity of each snakemake workflow and to avoid having to use targets when running workflows. I think this issue shows that this complexity has just been moved elsewhere, in the form of workdirs and proposals like #16 (comment).

I think it's true that we're moving complexity around, not reducing it on a global/absolute level. But I also think that by moving it around, we can reduce local/relative complexity for each part, which can be worth it for someone who doesn't need/want to understand the the whole thing at once.

(I don't think a motivating reason for the split is to avoid the use of targets when running workflows. No one seems to mind that?)

Workflows must be run from the top level pathogen repo directory in order
for them to have access to the `shared` directory in the containerized
runtimes. After some external testing of different ways to invoke the
workflows with the Nextstrain CLI¹, I've decided to add the hardcoded
`workdir` directive in each workflow's main Snakefile.

Then the default workflows can be run with:
```
nextstrain build . -s phylogenetic/Snakefile
nextstrain build . -s ingest/Snakefile
nextstrain build . -s nextclade/Snakefile
```

With this uniform organization of workflows across pathogens, we can
then move the Snakemake option to behind the scenes within the Nextclade CLI.

This will allow us to have a workflow manager agnostic option, such as:
```
nextstrain build --workflow phylogenetic .
```

¹ https://github.com/joverlee521/nextstrain-testing/blob/2650d10d1eb66a722469bcddda584bc1f50fae04/snakemake/shared-testing/current/README.md
@joverlee521
Copy link
Contributor Author

Rebased onto main to resolve merge conflicts.

tsibley added a commit to nextstrain/cli that referenced this pull request Jan 31, 2024
…ctories

…as this is the direction we're moving.

The main change is making the filesystem isolation boundary (i.e. what's
mapped to `/nextstrain/build` in a container) a _separate thing_ from
the workflow and initial working directory (i.e. what's given to
`nextstrain build`).  In this codebase, these two things are referred to
as the _build volume_ (aka `opts.build`) and the _working volume_.
Historically, the working volume given to the runners for `nextstrain
build` and `nextstrain shell` _was_ the build volume; now, they're
separately considered and sometimes differ.

See the included changelog entry for usage details and rationale.  For
background, I made the initial proposal for this feature¹ on a PR in our
pathogen-repo-template repository and some discussion ensued.

¹ <nextstrain/pathogen-repo-guide#16 (comment)>
  or <https://github.com/tsibley/blab-standup/blob/master/2024-01-25.md>
@tsibley
Copy link
Member

tsibley commented Jan 31, 2024

I've implemented my proposal above in Nextstrain CLI as nextstrain/cli#355.

tsibley added a commit to nextstrain/cli that referenced this pull request Jan 31, 2024
…ctories

…as this is the direction we're moving.

The main change is making the filesystem isolation boundary (i.e. what's
mapped to `/nextstrain/build` in a container) a _separate thing_ from
the workflow and initial working directory (i.e. what's given to
`nextstrain build`).  In this codebase, these two things are referred to
as the _build volume_ (aka `opts.build`) and the _working volume_.
Historically, the working volume given to the runners for `nextstrain
build` and `nextstrain shell` _was_ the build volume; now, they're
separately considered and sometimes differ.

See the included changelog entry for usage details and rationale.  For
background, I made the initial proposal for this feature¹ on a PR in our
pathogen-repo-template repository and some discussion ensued.

¹ <nextstrain/pathogen-repo-guide#16 (comment)>
  or <https://github.com/tsibley/blab-standup/blob/master/2024-01-25.md>
@joverlee521
Copy link
Contributor Author

With the new support from Nextstrain CLI for workflows in subdirectories, this change is no longer necessary to standardize the nextstrain build commands.

However, it seems like it's still helpful for the maintenance of workflows by forcing all relative paths in workflows to be relative to the top Snakefile.

Merging this PR now and will update repo to match the Nextstrain CLI changes in a separate PR.

@joverlee521 joverlee521 merged commit 54d270e into main Feb 2, 2024
@joverlee521 joverlee521 deleted the add-workdir branch February 2, 2024 23:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants