Filter: Errors out when there are more groups than requested sequences #654

Closed
trvrb opened this issue Dec 31, 2020 · 16 comments · Fixed by #659
Labels
bug Something isn't working

Comments

@trvrb
Member

trvrb commented Dec 31, 2020

Current Behavior
I encountered this when running the latest ncov global build. It includes the command:

augur filter \
  --sequences results/filtered.fasta \
  --metadata data/metadata.tsv \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --max-date 2020-08-26 \
  --exclude-where 'region!=Europe' \
  --group-by country year month \
  --subsample-max-sequences 200 \
  --output results/global/sample-europe_early.fasta 2>&1 | tee logs/subsample_global_europe_early.txt

This is requesting 200 sequences from the "early" timepoint for Europe. However, this errors out with:

ERROR: Asked to provide at most 200 sequences, but there are 208 groups.

If I understand #629 correctly, I can fix this with --probabilistic-sampling, and indeed this resolves the issue. However, I believe this shouldn't error out when --probabilistic-sampling is not set. It's a surprise to the user and tricky to debug.
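For context, here is a minimal sketch of the idea behind probabilistic sampling, assuming the Poisson-draw approach described in #629 (the helper name and structure are illustrative, not Augur's actual internals):

import numpy as np

def per_group_counts(n_groups, max_sequences, seed=314159):
    """Hypothetical helper: draw each group's sequence count from a
    Poisson distribution whose mean is the fractional target per group."""
    mean_per_group = max_sequences / n_groups  # e.g. 200 / 208 ≈ 0.96
    rng = np.random.default_rng(seed)
    return rng.poisson(mean_per_group, size=n_groups)

counts = per_group_counts(n_groups=208, max_sequences=200)
# Totals land near 200 on average; some groups contribute 0, others 1 or more.
print(counts.sum())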

Expected behavior
I would suggest just taking the first 200 groups deterministically when --probabilistic-sampling is not set. This would be nicer for users.

I realize this is somewhere between a bug and a feature request.

@trvrb added the bug label Dec 31, 2020
@emmahodcroft
Member

I definitely agree we should change the error message to advise users to run again with --probabilistic-sampling, but I'm hesitant to just go ahead without an error and take the first 200 groups.

For having 208 groups and wanting 200 samples, this is pretty harmless, but if there were 500 groups and you wanted 200 samples, you could end up with pretty different sampling than you intended. I'm not sure that we want that to happen silently?

But we could definitely make it much easier for users to figure out whether they either want to re-think their groupings and continue with what they asked for, or whether they want to keep everything else 'as-is' and add --probabilistic-sampling.

@rneher
Member

rneher commented Dec 31, 2020

The thinking back then was that if the user requests fewer sequences than there are categories, that is an unfulfillable request and should result in an error:

#593 (review)

@rneher
Member

rneher commented Dec 31, 2020

I think we should get the probabilistic subsampling released and flesh out the error message. We could also default to probabilistic sampling if there are too many groups. That seems less surprising than taking the first X groups. But I'd prefer an error.

@trvrb
Member Author

trvrb commented Dec 31, 2020

I see what you mean, @rneher and @emmahodcroft. I certainly agree with immediately updating the error message to recommend probabilistic subsampling.

However, what happens now if you have a situation that looks like: --subsample-max-sequences 200 and there are 150 groups? We make sure that each of the 150 groups is picked once. What happens with the other 50? I was assuming that this was deterministic. For me, this is the same situation as:

For having 208 groups and wanting 200 samples, this is pretty harmless, but if there were 500 groups and you wanted 200 samples, you could end up with pretty different sampling than you intended. I'm not sure that we want that to happen silently?

I.e. if we allow --subsample-max-sequences 200 to act deterministically when there are 150 groups, we should allow --subsample-max-sequences 200 to act deterministically when there are 250 groups. In the former we get the consistent 50 picked twice; in the latter we get the consistent 50 missed.

Actually, I'm now confused on why the --probabilistic-sampling flag is needed. Why isn't this just the default behavior? I understand that this changes the current behavior when --subsample-max-sequences is used. But I don't understand why this should really be an option.

 --probabilistic-sampling
                        Sample probabilitically from groups -- useful when
                        there are more groups than requested sequences
                        (default: False)

isn't a lot to go on. It's also confusing that --probabilistic-sampling only applies to --subsample-max-sequences and not --sequences-per-group (if I'm reading the code correctly).

@rneher
Member

rneher commented Dec 31, 2020

Ie if we allow --subsample-max-sequences 200 to act deterministically when there are 150 groups we should allow --subsample-max-sequences 200 to act deterministically when there are 250 groups. In the former we get the consistent 50 picked twice, in the latter we get the consistent 50 missed.

I don't think that is what happens. In the former case, we get 150 sequences, one from each group. In the latter case, we get an error.
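To make that concrete, here is a toy version of the deterministic rule as I understand it (a hypothetical helper, not the actual augur filter code): the per-group count is the largest whole number that keeps the total at or under the cap, and the request fails when even one sequence per group would exceed it.

def deterministic_per_group(n_groups, max_sequences):
    # Largest whole number of sequences per group that fits under the cap.
    per_group = max_sequences // n_groups
    if per_group == 0:
        # Mirrors the reported error: the request is unfulfillable
        # with whole-number group sizes.
        raise ValueError(f"Asked to provide at most {max_sequences} "
                         f"sequences, but there are {n_groups} groups.")
    return per_group

deterministic_per_group(150, 200)  # 1 -> 150 sequences total, not 200
deterministic_per_group(250, 200)  # raises an error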

Actually, I'm now confused on why the --probabilistic-sampling flag is needed. Why isn't this just the default behavior? I understand that this changes the current behavior when --subsample-max-sequences is used. But I don't understand why this should really be an option.

Mainly so that we don't change default behavior, as this is an addition to what was there before. But I agree that in most cases probabilistic sampling is probably the desired behavior.

@trvrb
Member Author

trvrb commented Dec 31, 2020

Got it... this makes much more sense to me now. A couple of thoughts, then. I would have been less confused if, instead of calling this --probabilistic-sampling, it were called something more like --allow-fractional-sequences-per-group, but I realize this is overly long. It would also be helpful to know from the command line interface that this is tied to --subsample-max-sequences.

Additionally, I don't like a flag like this for something we probably want as default behavior. We'll be stuck with everyone writing --probabilistic-sampling. In other places in Augur we do things like augur refine --covariance vs --no-covariance, where the default is --covariance, but it's possible to flag either direction.

Here, my thinking is either:

  1. Change flag to --enforce-integer-group-sizes and have the new default be the behavior currently encoded with --probabilistic-sampling. This gives the command line flag a clearer name (--probabilistic-sampling is really confusing to me in that we take a random sample from a group if there are more options than sequences, but --probabilistic-sampling implies that we only do this if the flag is on) and makes for a better default behavior.
  2. Change the flag to --integer-group-sizes vs. --fractional-group-sizes, have this default to --fractional-group-sizes, and make the two mutually exclusive the way that --covariance and --no-covariance are.

@huddlej
Contributor

huddlej commented Jan 1, 2021

I don't like a flag like this for something we probably want as default behavior.

One concern was that it's easy for users to unknowingly request more groups than there are total samples requested. The only way users would find out that this is happening is for Augur to throw an error.

If the "probabilistic sampling" behavior should be the default, though, is there really a use case for enforcing the integer group sizes? This opt-in flag would require users to know enough about the subsampling logic to override the fractional group sizes. I personally have to re-read this subsampling code to remember how the logic works. I wonder how many users would care to have this option...

As another way forward, could we just do:

  1. Make --probabilistic-sampling behavior the default
  2. Add a description of how this sampling works in the --subsample-max-sequences argument's help
  3. Make a new major release of Augur (11.0.0) reflecting the change in default behavior of the augur filter API.

Then, if people find a need to enforce the old behavior, we could add a flag to toggle it in a later minor release.

Separately, it would be nice to have a user interface to explore different subsampling schemes interactively for a given metadata file. Then users could be more confident about the composition of their builds. This probably relates to @jameshadfield's RFC for an augur subsample command.

@rneher
Member

rneher commented Jan 1, 2021

I am not sure --integer vs. --fractional is more or less confusing. My suggestion would be to make a minor release NOW so that we avoid running into the subsampling problems we currently have in ncov. Then we can make probabilistic sampling the default in the next major release.

@huddlej
Contributor

huddlej commented Jan 1, 2021

Ok, I just released augur v10.2.0 with probabilistic sampling.

@emmahodcroft
Member

That would be brilliant John, and really help out our shepherds! Thank you!

@trvrb
Member Author

trvrb commented Jan 1, 2021

So in Augur v11 when --probabilistic-sampling becomes default we'll keep the flag around in the command line interface, but it will no longer do anything? If it's removed in v11 it will break builds that have added --probabilistic-sampling in the meantime.

@huddlej
Contributor

huddlej commented Jan 1, 2021

We can keep the flag in v11, label it as deprecated, and remove it in the next major version (v12). I think this is what we did with --output-node-data vs. --output.

It sounds like we want to push for this new default behavior sooner than later, though. If so, we could plan to release v11 with this change next week.

@trvrb
Member Author

trvrb commented Jan 2, 2021

Makes sense. Copying @rneher's Slack comment here as well:

I'd suggest to add a flag --no-probabilistic-subsampling in v11 and then set the default.

@huddlej
Contributor

huddlej commented Jan 5, 2021

So, I accidentally pushed commits that should have been in a branch to master (magit is awesome but sometimes too powerful for me!), but I've added a --no-probabilistic-sampling flag and added a deprecation warning for users who explicitly request the flag.

Besides addressing any errors that might arise when CI tests finish, do folks have any comments about these changes?

@trvrb
Member Author

trvrb commented Jan 6, 2021

No worries at all. Thanks for these additions, John. My feeling is that it's legitimate to have both --probabilistic-sampling and --no-probabilistic-sampling flags. This is in line with the existing command line interface of --covariance vs. --no-covariance. I.e. I don't think the deprecation warning here (9d21aad) is necessary. If someone wants to explicitly flag the default, this should be fine (just like flagging --covariance).
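For reference, this is roughly how a mutually exclusive on/off pair looks with argparse (a sketch of the pattern, not Augur's actual argument registration; the default shown assumes the proposed v11 behavior):

import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--probabilistic-sampling", dest="probabilistic_sampling",
                   action="store_true",
                   help="Allow fractional sequences per group (default).")
group.add_argument("--no-probabilistic-sampling", dest="probabilistic_sampling",
                   action="store_false",
                   help="Enforce whole-number sequences per group.")
parser.set_defaults(probabilistic_sampling=True)

With this setup, passing either flag explicitly is harmless, and passing both at once is rejected by argparse.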

@huddlej
Contributor

huddlej commented Jan 6, 2021

I just reverted those two commits on master to be safe (probably should have done that right away). I'll continue work on this, including removing the deprecation warning, over in the PR referenced above.
