Filter: Errors out when there are more groups than requested sequences #654

Closed
trvrb opened this issue Dec 31, 2020 · 16 comments · Fixed by #659
Labels
bug Something isn't working

Comments

@trvrb
Member

trvrb commented Dec 31, 2020

Current Behavior
I encountered this when running the latest ncov global build. It includes the command:

augur filter \
  --sequences results/filtered.fasta \
  --metadata data/metadata.tsv \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --max-date 2020-08-26 \
  --exclude-where 'region!=Europe' \
  --group-by country year month \
  --subsample-max-sequences 200 \
  --output results/global/sample-europe_early.fasta 2>&1 | tee logs/subsample_global_europe_early.txt

This is requesting 200 sequences from the "early" timepoint for Europe. However, this errors out with:

ERROR: Asked to provide at most 200 sequences, but there are 208 groups.

If I understand #629 correctly, I can fix this with --probabilistic-sampling, and indeed this resolves the issue. However, I believe this shouldn't error out when --probabilistic-sampling is not set. It's a surprise to the user and tricky to debug.
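For context, here is a minimal sketch of the idea behind probabilistic sampling, assuming the Poisson-draw approach described in #629 (the helper name and structure are illustrative, not Augur's actual internals):

import numpy as np

def per_group_counts(n_groups, max_sequences, seed=314159):
    """Hypothetical helper: draw each group's sequence count from a
    Poisson distribution whose mean is the fractional target per group."""
    mean_per_group = max_sequences / n_groups  # e.g. 200 / 208 ≈ 0.96
    rng = np.random.default_rng(seed)
    return rng.poisson(mean_per_group, size=n_groups)

counts = per_group_counts(n_groups=208, max_sequences=200)
# Totals land near 200 on average; some groups contribute 0, others 1 or more.
print(counts.sum())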

Expected behavior
I would suggest just taking the first 200 groups deterministically when --probabilistic-sampling is not set. This would be nicer for users.

I realize this is somewhere between a bug and a feature request.

@trvrb added the bug label Dec 31, 2020
@emmahodcroft
Member

I definitely agree we should change the error message to advise users to run again with --probabilistic-sampling, but I'm hesitant to just go ahead without an error and take the first 200 groups.

For having 208 groups and wanting 200 samples, this is pretty harmless, but if there were 500 groups and you wanted 200 samples, you could end up with pretty different sampling than you intended. I'm not sure that we want that to happen silently?

But we could definitely make it much easier for users to figure out whether they either want to re-think their groupings and continue with what they asked for, or whether they want to keep everything else 'as-is' and add --probabilistic-sampling.

@rneher
Member

rneher commented Dec 31, 2020

The thinking back then was that if the user requests fewer sequences than there are categories, that is an unfulfillable request and should result in an error:

#593 (review)

@rneher
Member

rneher commented Dec 31, 2020

I think we should get the probabilistic subsampling released and flesh out the error message. We could also default to probabilistic sampling if there are too many groups. That seems less surprising than taking the first X groups. But I'd prefer an error.

@trvrb
Member Author

trvrb commented Dec 31, 2020

I see what you mean, @rneher and @emmahodcroft. I certainly agree with immediately updating the error message to recommend probabilistic subsampling.

However, what happens now if you have a situation that looks like: --subsample-max-sequences 200 and there are 150 groups? We make sure that each of the 150 groups is picked once. What happens with the other 50? I was assuming that this was deterministic. For me, this is the same situation as:

For having 208 groups and wanting 200 samples, this is pretty harmless, but if there were 500 groups and you wanted 200 samples, you could end up with pretty different sampling than you intended. I'm not sure that we want that to happen silently?

I.e. if we allow --subsample-max-sequences 200 to act deterministically when there are 150 groups, we should allow --subsample-max-sequences 200 to act deterministically when there are 250 groups. In the former we get the consistent 50 picked twice; in the latter we get the consistent 50 missed.

Actually, I'm now confused on why the --probabilistic-sampling flag is needed. Why isn't this just the default behavior? I understand that this changes the current behavior when --subsample-max-sequences is used. But I don't understand why this should really be an option.

 --probabilistic-sampling
                        Sample probabilitically from groups -- useful when
                        there are more groups than requested sequences
                        (default: False)

isn't a lot to go on. It's also confusing that --probabilistic-sampling only applies to --subsample-max-sequences and not --sequences-per-group (if I'm reading the code correctly).

@rneher
Member

rneher commented Dec 31, 2020

Ie if we allow --subsample-max-sequences 200 to act deterministically when there are 150 groups we should allow --subsample-max-sequences 200 to act deterministically when there are 250 groups. In the former we get the consistent 50 picked twice, in the latter we get the consistent 50 missed.

I don't think that is what happens. In the former case, we get 150 sequences, one from each group. In the latter case, we get an error.
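To make that concrete, here is a toy version of the deterministic rule as I understand it (a hypothetical helper, not the actual augur filter code): the per-group count is the largest whole number that keeps the total at or under the cap, and the request fails when even one sequence per group would exceed it.

def deterministic_per_group(n_groups, max_sequences):
    # Largest whole number of sequences per group that fits under the cap.
    per_group = max_sequences // n_groups
    if per_group == 0:
        # Mirrors the reported error: the request is unfulfillable
        # with whole-number group sizes.
        raise ValueError(f"Asked to provide at most {max_sequences} "
                         f"sequences, but there are {n_groups} groups.")
    return per_group

deterministic_per_group(150, 200)  # 1 -> 150 sequences total, not 200
deterministic_per_group(250, 200)  # raises an error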

Actually, I'm now confused on why the --probabilistic-sampling flag is needed. Why isn't this just the default behavior? I understand that this changes the current behavior when --subsample-max-sequences is used. But I don't understand why this should really be an option.

Mainly so that we don't change default behavior, as this is an addition to what was there before. But I agree that in most cases probabilistic sampling is probably the desired behavior.

@trvrb
Member Author

trvrb commented Dec 31, 2020

Got it... this makes much more sense to me now. A couple of thoughts, then. I would have been less confused if, instead of calling this --probabilistic-sampling, it were called something more like --allow-fractional-sequences-per-group, but I realize this is overly long. It would also be helpful to know from the command line interface that this is tied to --subsample-max-sequences.

Additionally, I don't like a flag like this for something we probably want as default behavior. We'll be stuck with everyone writing --probabilistic-sampling. In other places in Augur we do things like augur refine --covariance vs --no-covariance, where the default is --covariance, but it's possible to flag either direction.

Here, my thinking is either:

  1. Change flag to --enforce-integer-group-sizes and have the new default be the behavior currently encoded with --probabilistic-sampling. This gives the command line flag a clearer name (--probabilistic-sampling is really confusing to me in that we take a random sample from a group if there are more options than sequences, but --probabilistic-sampling implies that we only do this if the flag is on) and makes for a better default behavior.
  2. Change the flag to --integer-group-sizes vs. --fractional-group-sizes, have this default to --fractional-group-sizes, and make the two mutually exclusive the way that --covariance and --no-covariance are.

@huddlej
Contributor

huddlej commented Jan 1, 2021

I don't like a flag like this for something we probably want as default behavior.

One concern was that it's easy for users to unknowingly request more groups than there are total samples requested. The only way users would find out that this is happening is for Augur to throw an error.

If the "probabilistic sampling" behavior should be the default, though, is there really a use case for enforcing the integer group sizes? This opt-in flag would require users to know enough about the subsampling logic to override the fractional group sizes. I personally have to re-read this subsampling code to remember how the logic works. I wonder how many users would care to have this option...

As another way forward, could we just do:

  1. Make --probabilistic-sampling behavior the default
  2. Add a description of how this sampling works in the --subsample-max-sequences argument's help
  3. Make a new major release of Augur (11.0.0) reflecting the change in default behavior of the augur filter API.

Then, if people find a need to enforce the old behavior, we could add a flag to toggle it in a later minor release.

Separately, it would be nice to have a user interface to explore different subsampling schemes interactively for a given metadata file. Then users could be more confident about the composition of their builds. This probably relates to @jameshadfield's RFC for an augur subsample command.

@rneher
Member

rneher commented Jan 1, 2021

I am not sure --integer vs. --fractional is more or less confusing. My suggestion would be to make a minor release NOW so that we avoid running into the subsampling problems we currently have in ncov. Then we can make probabilistic sampling the default in the next major release.

@huddlej
Contributor

huddlej commented Jan 1, 2021

Ok, I just released augur v10.2.0 with probabilistic sampling.

@emmahodcroft
Member

That would be brilliant John, and really help out our shepherds! Thank you!

@trvrb
Member Author

trvrb commented Jan 1, 2021

So in Augur v11 when --probabilistic-sampling becomes default we'll keep the flag around in the command line interface, but it will no longer do anything? If it's removed in v11 it will break builds that have added --probabilistic-sampling in the meantime.

@huddlej
Contributor

huddlej commented Jan 1, 2021

We can keep the flag in v11, label it as deprecated, and remove it in the next major version (v12). I think this is what we did with --output-node-data vs. --output.

It sounds like we want to push for this new default behavior sooner than later, though. If so, we could plan to release v11 with this change next week.

@trvrb
Member Author

trvrb commented Jan 2, 2021

Makes sense. Copying @rneher's Slack comment here as well:

I'd suggest to add a flag --no-probabilistic-subsampling in v11 and then set the default.

@huddlej
Contributor

huddlej commented Jan 5, 2021

So, I accidentally pushed commits that should have been in a branch to master (magit is awesome but sometimes too powerful for me!), but I've added a --no-probabilistic-sampling flag and added a deprecation warning for users who explicitly request the flag.

Besides addressing any errors that might arise when CI tests finish, do folks have any comments about these changes?

@trvrb
Member Author

trvrb commented Jan 6, 2021

No worries at all. Thanks for these additions, John. My feeling is that it's legitimate to have both --probabilistic-sampling and --no-probabilistic-sampling flags. This is in line with the existing command line interface of --covariance vs. --no-covariance. I.e. I don't think the deprecation warning here (9d21aad) is necessary. If someone wants to explicitly flag the default, this should be fine (just like flagging --covariance).
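For reference, this is roughly how a mutually exclusive on/off pair looks with argparse (a sketch of the pattern, not Augur's actual argument registration; the default shown assumes the proposed v11 behavior):

import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--probabilistic-sampling", dest="probabilistic_sampling",
                   action="store_true",
                   help="Allow fractional sequences per group (default).")
group.add_argument("--no-probabilistic-sampling", dest="probabilistic_sampling",
                   action="store_false",
                   help="Enforce whole-number sequences per group.")
parser.set_defaults(probabilistic_sampling=True)

With this setup, passing either flag explicitly is harmless, and passing both at once is rejected by argparse.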

@huddlej
Contributor

huddlej commented Jan 6, 2021

I just reverted those two commits on master to be safe (probably should have done that right away). I'll continue work on this, including removing the deprecation warning, over in the PR referenced above.
