Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter: Split filter.py into smaller files #997

Merged
merged 7 commits into from
Jan 12, 2023

Conversation

victorlin
Copy link
Member

@victorlin victorlin commented Jul 1, 2022

Description of proposed changes

No functional changes here, just splitting a big file up into smaller pieces.

Organization is up for debate.

Note: if #1087 gets merged before this PR, I'll add docs changes here.

Related issue(s)

Testing

Tests updated to reflect new organization of files (see #996).

@victorlin victorlin requested a review from a team July 1, 2022 22:37
@victorlin victorlin self-assigned this Jul 1, 2022
@victorlin victorlin marked this pull request as draft July 1, 2022 23:02
@victorlin victorlin force-pushed the filter/split-file branch 5 times, most recently from 118a499 to f596267 Compare July 1, 2022 23:44
@victorlin victorlin marked this pull request as ready for review July 1, 2022 23:53
@codecov
Copy link

codecov bot commented Jul 1, 2022

Codecov Report

Base: 67.98% // Head: 68.10% // Increases project coverage by +0.12% 🎉

Coverage data is based on head (c00db66) compared to base (c8900c5).
Patch coverage: 95.92% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #997      +/-   ##
==========================================
+ Coverage   67.98%   68.10%   +0.12%     
==========================================
  Files          57       62       +5     
  Lines        6656     6682      +26     
  Branches     1637     1637              
==========================================
+ Hits         4525     4551      +26     
  Misses       1825     1825              
  Partials      306      306              
Impacted Files Coverage Δ
augur/filter/io.py 77.50% <77.50%> (ø)
augur/filter/validate_arguments.py 86.95% <86.95%> (ø)
augur/filter/_run.py 96.39% <96.39%> (ø)
augur/filter/include_exclude_rules.py 97.27% <97.27%> (ø)
augur/filter/subsample.py 98.76% <98.76%> (ø)
augur/filter/__init__.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

Organization is up for debate, but keeping the top level filter.py around is currently required for the CLI to function properly:

My general tendency is to avoid non-specific names like "support" when organizing code bases. Instead, I'd suggest nesting these new files directly under the augur.filter namespace instead, e.g. augur.filter.io, augur.filter.errors, etc. This in turn means extending the command_name() function referenced above to handle the case when command.__name__ == command.__package__, but I expect that to be a small straightforward change.

augur/support/filter/subsample.py Outdated Show resolved Hide resolved
@tsibley
Copy link
Member

tsibley commented Jul 5, 2022

…but I expect that to be a small straightforward change.

Untested, but here are two ways to make that change.

Smallest diff:

diff --git a/augur/__init__.py b/augur/__init__.py
index 31ee6826..a69dcf44 100644
--- a/augur/__init__.py
+++ b/augur/__init__.py
@@ -142,4 +142,9 @@ def command_name(command):
     package     = command.__package__
     module_name = command.__name__
 
+    # If the command module is also a package, e.g. augur/x/__init__.py instead
+    # of augur/x.py, then use the parent package.
+    if package == module_name:
+        package = package.rsplit(".", 1)[0]
+
     return remove_prefix(package, module_name).lstrip(".").replace("_", "-")

Reconsidering some unnecessary machinery:

diff --git a/augur/__init__.py b/augur/__init__.py
index 31ee6826..0a7186fa 100644
--- a/augur/__init__.py
+++ b/augur/__init__.py
@@ -19,7 +19,7 @@ recursion_limit = os.environ.get("AUGUR_RECURSION_LIMIT")
 if recursion_limit:
     sys.setrecursionlimit(int(recursion_limit))
 
-command_strings = [
+COMMAND_NAMES = [
     "parse",
     "index",
     "filter",
@@ -29,10 +29,10 @@ command_strings = [
     "refine",
     "ancestral",
     "translate",
-    "reconstruct_sequences",
+    "reconstruct-sequences",
     "clades",
     "traits",
-    "sequence_traits",
+    "sequence-traits",
     "lbi",
     "distance",
     "titers",
@@ -44,7 +44,7 @@ command_strings = [
     "measurements",
 ]
 
-COMMANDS = [importlib.import_module('augur.' + c) for c in command_strings]
+COMMANDS = [(c, importlib.import_module('augur.' + c.replace("-", "_"))) for c in COMMAND_NAMES]
 
 def make_parser():
     parser = argparse.ArgumentParser(
@@ -56,10 +56,10 @@ def make_parser():
     add_default_command(parser)
     add_version_alias(parser)
 
-    for command in COMMANDS:
+    for command_name, command in COMMANDS:
         # Add a subparser for each command.
         subparser = subparsers.add_parser(
-            command_name(command),
+            command_name,
             help        = first_line(command.__doc__),
             description = command.__doc__)
 
@@ -129,17 +129,3 @@ def add_version_alias(parser):
         nargs  = 0,
         help   = argparse.SUPPRESS,
         action = run_version_command)
-
-
-def command_name(command):
-    """
-    Returns a short name for a command module.
-    """
-
-    def remove_prefix(prefix, string):
-        return re.sub('^' + re.escape(prefix), '', string)
-
-    package     = command.__package__
-    module_name = command.__name__
-
-    return remove_prefix(package, module_name).lstrip(".").replace("_", "-")

@victorlin victorlin mentioned this pull request Jul 7, 2022
joverlee521 added a commit that referenced this pull request Jul 8, 2022
Removes the `command_name` function to simplify the command name parsing
by directly replacing underscores with dashes. This allows us to create
the correct CLI command name for a command regardless of whether the
command is a module or a package and without the "unnecessary machinery."

Note this change was built on a suggestion by @tsibley in another PR¹,
but I'm incorporating here to support the reorganization of commands
with subcommands into packages. I chose to do the string replacement for
the command name rather than the command module so that we can directly
list module names and make the computer do the snake_case to kebab-case
conversion for us.

¹ #997 (comment)
@victorlin
Copy link
Member Author

Will get back to this once #1002 is done due to overlapping desires.

@victorlin victorlin marked this pull request as draft July 12, 2022 18:52
@victorlin victorlin force-pushed the filter/split-file branch 6 times, most recently from 13e160d to 76566bc Compare July 20, 2022 22:36
@victorlin
Copy link
Member Author

Rebased on latest changes with the notable difference of using augur/filter/ instead of augur/support/filter.

@victorlin victorlin marked this pull request as ready for review July 20, 2022 22:41
@victorlin victorlin requested a review from a team July 20, 2022 22:41
@victorlin victorlin force-pushed the filter/split-file branch 2 times, most recently from 735fa4e to f01c450 Compare July 28, 2022 18:16
@victorlin
Copy link
Member Author

@tsibley I'm still not convinced... or at least the following lines should change under your thinking.

from augur.errors import AugurError

from augur.filter.errors import FilterException

Would you suggest to change them to these?

from augur import AugurError
from augur.filter import FilterException

In my opinion, these are to be used internally, and we should just reference them from the original location so that we as developers can clearly see that FilterException is defined in augur/filter/errors.py.

@victorlin victorlin requested a review from tsibley July 28, 2022 18:29
@tsibley
Copy link
Member

tsibley commented Jul 28, 2022

@victorlin

In my opinion, these are to be used internally, and we should just reference them from the original location so that we as developers can clearly see that FilterException is defined in augur/filter/errors.py.

Ah, I have no issue with internal usage referencing the original location they're defined, and indeed agree it's preferable.

My point with #997 (comment) was that if we consider at least some of the functions in augur.filter to be part of Augur's public Python API, then we should keep FilterException available via augur.filter by re-exporting it (rather than making external callers dig around internals to find it). But if none of augur.filter is part of Augur's public Python API—and indeed I don't think we've ever fully clarified what is and isn't—then my point is moot. :-)

If you still don't agree, that's fine, and feel free to drop the re-export!

@victorlin
Copy link
Member Author

victorlin commented Jul 28, 2022

@tsibley Ok I understand now! Indeed, augur.filter.FilterException is the first thing listed on the augur.filter module page. Your small flag on FilterException raises a bigger flag here – pretty much everything else on that page won't be available anymore (look at the new preview page)!

So before continuing here, I think we first need to clarify what part of augur.filter (and other modules) is part of Augur's Python Development API: #1011

@victorlin victorlin marked this pull request as draft July 28, 2022 23:25
@victorlin victorlin force-pushed the filter/split-file branch 2 times, most recently from b45703b to 1ff27fc Compare October 3, 2022 17:29
@victorlin
Copy link
Member Author

victorlin commented Nov 4, 2022

About to rebase this onto the latest changes, but dropping the re-export of FilterException (8ab610b) given work in #1087 that does not include augur.filter in the public API.

@victorlin victorlin marked this pull request as ready for review November 4, 2022 19:28
@victorlin victorlin requested review from joverlee521 and a team November 4, 2022 19:28
@victorlin
Copy link
Member Author

Removed changelog entry b98580c given updated changelog intentions proposed in #1089.

This is an intermediate step before splitting the file even more.

This module needs to be prefixed with an underscore otherwise there is a
name conflict with the run() function in filter/__init__.py.
Also, validate arguments in the top-level `filter.run`. This allows the
imported run() function to be responsible for one less thing.
@victorlin
Copy link
Member Author

Rebased onto latest master and double checked the force-push diffs on the relative paths – changes are as expected.

# force-push
git diff 574355b16f630df83baef59edc4a3b8e60c59e4d c00db66c5747faf7bb5a3563368dcbdbfe2211a3 augur/filter tests/filter
Output
diff --git a/augur/filter/_run.py b/augur/filter/_run.py
index 7c599221..eadb03d1 100644
--- a/augur/filter/_run.py
+++ b/augur/filter/_run.py
@@ -10,7 +10,10 @@ from tempfile import NamedTemporaryFile
 
 from augur.errors import AugurError
 from augur.index import index_sequences, index_vcf
-from augur.io import open_file, read_metadata, read_sequences, write_sequences, is_vcf as filename_is_vcf, write_vcf
+from augur.io.file import open_file
+from augur.io.metadata import read_metadata
+from augur.io.sequences import read_sequences, write_sequences
+from augur.io.vcf import is_vcf as filename_is_vcf, write_vcf
 from .io import cleanup_outputs, read_priority_scores
 from .include_exclude_rules import apply_filters, construct_filters
 from .subsample import PriorityQueue, TooManyGroupsError, calculate_sequences_per_group, create_queues_by_group, get_groups_for_subsampling
diff --git a/augur/filter/errors.py b/augur/filter/errors.py
deleted file mode 100644
index 54a131c4..00000000
--- a/augur/filter/errors.py
+++ /dev/null
@@ -1,7 +0,0 @@
-from augur.errors import AugurError
-
-
-class FilterException(AugurError):
-    """Representation of an error that occurred during filtering.
-    """
-    pass
diff --git a/augur/filter/include_exclude_rules.py b/augur/filter/include_exclude_rules.py
index d4881a34..0edf407c 100644
--- a/augur/filter/include_exclude_rules.py
+++ b/augur/filter/include_exclude_rules.py
@@ -6,7 +6,7 @@ import pandas as pd
 
 from augur.dates import numeric_date, is_date_ambiguous, get_numerical_dates
 from augur.errors import AugurError
-from augur.io import is_vcf as filename_is_vcf
+from augur.io.vcf import is_vcf as filename_is_vcf
 from augur.utils import read_strains
 from .io import filter_kwargs_to_str
 
diff --git a/augur/filter/subsample.py b/augur/filter/subsample.py
index f2bb6f31..d0729286 100644
--- a/augur/filter/subsample.py
+++ b/augur/filter/subsample.py
@@ -9,7 +9,6 @@ from typing import Collection
 from augur.dates import get_iso_year_week
 from augur.errors import AugurError
 from . import GROUP_BY_GENERATED_COLUMNS
-from .errors import FilterException
 
 
 def get_groups_for_subsampling(strains, metadata, group_by=None):
@@ -62,7 +61,7 @@ def get_groups_for_subsampling(strains, metadata, group_by=None):
     >>> get_groups_for_subsampling(strains, metadata, group_by)
     Traceback (most recent call last):
       ...
-    augur.filter.errors.FilterException: The specified group-by categories (['missing_column']) were not found.
+    augur.errors.AugurError: The specified group-by categories (['missing_column']) were not found.
 
     If we try to group by some columns that exist and some that don't, we allow
     grouping to continue and print a warning message to stderr.
@@ -110,9 +109,9 @@ def get_groups_for_subsampling(strains, metadata, group_by=None):
 
     # If we could not find any requested categories, we cannot complete subsampling.
     if 'date' not in metadata and group_by_set <= GROUP_BY_GENERATED_COLUMNS:
-        raise FilterException(f"The specified group-by categories ({group_by}) were not found. Note that using any of {sorted(GROUP_BY_GENERATED_COLUMNS)} requires a column called 'date'.")
+        raise AugurError(f"The specified group-by categories ({group_by}) were not found. Note that using any of {sorted(GROUP_BY_GENERATED_COLUMNS)} requires a column called 'date'.")
     if not group_by_set & (set(metadata.columns) | GROUP_BY_GENERATED_COLUMNS):
-        raise FilterException(f"The specified group-by categories ({group_by}) were not found.")
+        raise AugurError(f"The specified group-by categories ({group_by}) were not found.")
 
     # Warn/error based on other columns grouped with 'week'.
     if 'week' in group_by_set:
diff --git a/augur/filter/validate_arguments.py b/augur/filter/validate_arguments.py
index 322417e9..b0c81d59 100644
--- a/augur/filter/validate_arguments.py
+++ b/augur/filter/validate_arguments.py
@@ -1,5 +1,5 @@
 import sys
-from augur.io import is_vcf as filename_is_vcf
+from augur.io.vcf import is_vcf as filename_is_vcf
 
 
 SEQUENCE_ONLY_FILTERS = (
diff --git a/tests/filter/test_run.py b/tests/filter/test_run.py
index 71338d64..253160d5 100644
--- a/tests/filter/test_run.py
+++ b/tests/filter/test_run.py
@@ -16,7 +16,7 @@ import augur.filter
 import augur.filter._run
 import augur.filter.io
 import augur.filter.include_exclude_rules
-from augur.io import read_metadata
+from augur.io.metadata import read_metadata
 
 @pytest.fixture
 def argparser():
diff --git a/tests/filter/test_subsample.py b/tests/filter/test_subsample.py
index ffe58855..8d1b401f 100644
--- a/tests/filter/test_subsample.py
+++ b/tests/filter/test_subsample.py
@@ -2,7 +2,7 @@ import pytest
 import pandas as pd
 
 import augur.filter.subsample
-from augur.filter.errors import FilterException
+from augur.errors import AugurError
 
 
 @pytest.fixture
@@ -47,7 +47,7 @@ class TestFilterGroupBy:
         groups = ['invalid']
         metadata = valid_metadata.copy()
         strains = metadata.index.tolist()
-        with pytest.raises(FilterException) as e_info:
+        with pytest.raises(AugurError) as e_info:
             augur.filter.subsample.get_groups_for_subsampling(strains, metadata, group_by=groups)
         assert str(e_info.value) == "The specified group-by categories (['invalid']) were not found."
 
@@ -127,7 +127,7 @@ class TestFilterGroupBy:
         metadata = valid_metadata.copy()
         metadata = metadata.drop('date', axis='columns')
         strains = metadata.index.tolist()
-        with pytest.raises(FilterException) as e_info:
+        with pytest.raises(AugurError) as e_info:
             augur.filter.subsample.get_groups_for_subsampling(strains, metadata, group_by=groups)
         assert str(e_info.value) == "The specified group-by categories (['year']) were not found. Note that using any of ['month', 'week', 'year'] requires a column called 'date'."
 
@@ -136,7 +136,7 @@ class TestFilterGroupBy:
         metadata = valid_metadata.copy()
         metadata = metadata.drop('date', axis='columns')
         strains = metadata.index.tolist()
-        with pytest.raises(FilterException) as e_info:
+        with pytest.raises(AugurError) as e_info:
             augur.filter.subsample.get_groups_for_subsampling(strains, metadata, group_by=groups)
         assert str(e_info.value) == "The specified group-by categories (['month']) were not found. Note that using any of ['month', 'week', 'year'] requires a column called 'date'."
 
@@ -145,7 +145,7 @@ class TestFilterGroupBy:
         metadata = valid_metadata.copy()
         metadata = metadata.drop('date', axis='columns')
         strains = metadata.index.tolist()
-        with pytest.raises(FilterException) as e_info:
+        with pytest.raises(AugurError) as e_info:
             augur.filter.subsample.get_groups_for_subsampling(strains, metadata, group_by=groups)
         assert str(e_info.value) == "The specified group-by categories (['year', 'month']) were not found. Note that using any of ['month', 'week', 'year'] requires a column called 'date'."
# changes to master
git diff f0f9200cbd7c2a2dcaa209292e161aefebf1dc93 c8900c5cdba6a83312dd350b4a9ad487f63aa7a5 augur/filter.py tests/test_filter.py
Output
diff --git a/augur/filter.py b/augur/filter.py
index 482f6520..65c965c4 100644
--- a/augur/filter.py
+++ b/augur/filter.py
@@ -21,7 +21,10 @@ from typing import Collection
 from .dates import numeric_date, numeric_date_type, SUPPORTED_DATE_HELP_TEXT, is_date_ambiguous, get_numerical_dates, get_iso_year_week
 from .errors import AugurError
 from .index import index_sequences, index_vcf
-from .io import open_file, read_metadata, read_sequences, write_sequences, is_vcf as filename_is_vcf, write_vcf
+from .io.file import open_file
+from .io.metadata import read_metadata
+from .io.sequences import read_sequences, write_sequences
+from .io.vcf import is_vcf as filename_is_vcf, write_vcf
 from .utils import read_strains
 
 comment_char = '#'
@@ -105,12 +108,6 @@ def register_parser(parent_subparsers):
     return parser
 
 
-class FilterException(AugurError):
-    """Representation of an error that occurred during filtering.
-    """
-    pass
-
-
 def read_priority_scores(fname):
     def constant_factory(value):
         return lambda: value
@@ -957,7 +954,7 @@ def get_groups_for_subsampling(strains, metadata, group_by=None):
     >>> get_groups_for_subsampling(strains, metadata, group_by)
     Traceback (most recent call last):
       ...
-    augur.filter.FilterException: The specified group-by categories (['missing_column']) were not found.
+    augur.errors.AugurError: The specified group-by categories (['missing_column']) were not found.
 
     If we try to group by some columns that exist and some that don't, we allow
     grouping to continue and print a warning message to stderr.
@@ -1005,9 +1002,9 @@ def get_groups_for_subsampling(strains, metadata, group_by=None):
 
     # If we could not find any requested categories, we cannot complete subsampling.
     if 'date' not in metadata and group_by_set <= GROUP_BY_GENERATED_COLUMNS:
-        raise FilterException(f"The specified group-by categories ({group_by}) were not found. Note that using any of {sorted(GROUP_BY_GENERATED_COLUMNS)} requires a column called 'date'.")
+        raise AugurError(f"The specified group-by categories ({group_by}) were not found. Note that using any of {sorted(GROUP_BY_GENERATED_COLUMNS)} requires a column called 'date'.")
     if not group_by_set & (set(metadata.columns) | GROUP_BY_GENERATED_COLUMNS):
-        raise FilterException(f"The specified group-by categories ({group_by}) were not found.")
+        raise AugurError(f"The specified group-by categories ({group_by}) were not found.")
 
     # Warn/error based on other columns grouped with 'week'.
     if 'week' in group_by_set:
diff --git a/tests/test_filter.py b/tests/test_filter.py
index d8b23a3e..580203c5 100644
--- a/tests/test_filter.py
+++ b/tests/test_filter.py
@@ -14,8 +14,8 @@ from Bio.SeqRecord import SeqRecord
 from freezegun import freeze_time
 
 import augur.filter
-from augur.io import read_metadata
-from augur.filter import FilterException
+from augur.io.metadata import read_metadata
+from augur.errors import AugurError
 
 @pytest.fixture
 def argparser():
@@ -391,7 +391,7 @@ class TestFilterGroupBy:
         groups = ['invalid']
         metadata = valid_metadata.copy()
         strains = metadata.index.tolist()
-        with pytest.raises(FilterException) as e_info:
+        with pytest.raises(AugurError) as e_info:
             augur.filter.get_groups_for_subsampling(strains, metadata, group_by=groups)
         assert str(e_info.value) == "The specified group-by categories (['invalid']) were not found."
 
@@ -471,7 +471,7 @@ class TestFilterGroupBy:
         metadata = valid_metadata.copy()
         metadata = metadata.drop('date', axis='columns')
         strains = metadata.index.tolist()
-        with pytest.raises(FilterException) as e_info:
+        with pytest.raises(AugurError) as e_info:
             augur.filter.get_groups_for_subsampling(strains, metadata, group_by=groups)
         assert str(e_info.value) == "The specified group-by categories (['year']) were not found. Note that using any of ['month', 'week', 'year'] requires a column called 'date'."
 
@@ -480,7 +480,7 @@ class TestFilterGroupBy:
         metadata = valid_metadata.copy()
         metadata = metadata.drop('date', axis='columns')
         strains = metadata.index.tolist()
-        with pytest.raises(FilterException) as e_info:
+        with pytest.raises(AugurError) as e_info:
             augur.filter.get_groups_for_subsampling(strains, metadata, group_by=groups)
         assert str(e_info.value) == "The specified group-by categories (['month']) were not found. Note that using any of ['month', 'week', 'year'] requires a column called 'date'."
 
@@ -489,7 +489,7 @@ class TestFilterGroupBy:
         metadata = valid_metadata.copy()
         metadata = metadata.drop('date', axis='columns')
         strains = metadata.index.tolist()
-        with pytest.raises(FilterException) as e_info:
+        with pytest.raises(AugurError) as e_info:
             augur.filter.get_groups_for_subsampling(strains, metadata, group_by=groups)
         assert str(e_info.value) == "The specified group-by categories (['year', 'month']) were not found. Note that using any of ['month', 'week', 'year'] requires a column called 'date'."

Given previous approvals, I'll plan to merge this once checks pass.

@victorlin victorlin merged commit d7d4380 into nextstrain:master Jan 12, 2023
@victorlin victorlin deleted the filter/split-file branch January 12, 2023 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

3 participants