
Commit

[#141] Test and document the "skip" attribute for feature selection transforms
riley-harper committed Aug 27, 2024
1 parent 9b3b233 commit ada39ea
Showing 5 changed files with 58 additions and 1 deletion.
3 changes: 3 additions & 0 deletions docs/_sources/feature_selection_transforms.md.txt
@@ -21,6 +21,9 @@ few utility attributes which are available for all transforms:
- `checkpoint` - Type: `boolean`. Optional. If set to true, checkpoint the
dataset in Spark before computing the feature selection. This can reduce some
resource usage for very complex workflows, but should not be necessary.
- `skip` - Type: `boolean`. Optional. If set to true, don't compute this
feature selection. This has the same effect as commenting the feature
selection out of your config file.

## bigrams

3 changes: 3 additions & 0 deletions docs/feature_selection_transforms.html
@@ -55,6 +55,9 @@ <h1>Feature Selection Transforms<a class="headerlink" href="#feature-selection-t
<li><p><code class="docutils literal notranslate"><span class="pre">checkpoint</span></code> - Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>. Optional. If set to true, checkpoint the
dataset in Spark before computing the feature selection. This can reduce some
resource usage for very complex workflows, but should not be necessary.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">skip</span></code> - Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>. Optional. If set to true, don’t compute this
feature selection. This has the same effect as commenting the feature
selection out of your config file.</p></li>
</ul>
<section id="bigrams">
<h2>bigrams<a class="headerlink" href="#bigrams" title="Link to this heading"></a></h2>
2 changes: 1 addition & 1 deletion docs/searchindex.js

Large diffs are not rendered by default.

48 changes: 48 additions & 0 deletions hlink/tests/core/transforms_test.py
@@ -195,6 +195,54 @@ def test_generate_transforms_override_column_b(
]


@pytest.mark.parametrize("is_a", [True, False])
def test_generate_transforms_skip_attribute_skips_transform(
    spark: SparkSession, preprocessing: LinkTask, is_a: bool
) -> None:
    """When a feature selection has an attribute "skip" set to True,
    generate_transforms() ignores it and doesn't include it in the output data
    frame.
    """
    feature_selections = [
        {
            "input_column": "name",
            "output_column": "name_bigrams",
            "transform": "bigrams",
            "skip": True,
        }
    ]

    df = spark.createDataFrame([[0, "martin"]], "id:integer, name:string")
    df_result = generate_transforms(
        spark, df, feature_selections, preprocessing, is_a, "id"
    )
    # There's no output "name_bigrams" column because the feature selection was skipped
    assert df_result.columns == ["id", "name"]


@pytest.mark.parametrize("is_a", [True, False])
def test_generate_transforms_skip_attribute_does_not_skip_if_false(
    spark: SparkSession, preprocessing: LinkTask, is_a: bool
) -> None:
    """When a feature selection has an attribute "skip", but it's set to False,
    generate_transforms() computes the feature selection as normal.
    """
    feature_selections = [
        {
            "input_column": "name",
            "output_column": "name_bigrams",
            "transform": "bigrams",
            "skip": False,
        }
    ]

    df = spark.createDataFrame([[0, "martin"]], "id:integer, name:string")
    df_result = generate_transforms(
        spark, df, feature_selections, preprocessing, is_a, "id"
    )
    assert "name_bigrams" in df_result.columns


@pytest.mark.parametrize("is_a", [True, False])
def test_generate_transforms_error_when_unrecognized_transform(
    spark: SparkSession, preprocessing: LinkTask, is_a: bool
3 changes: 3 additions & 0 deletions sphinx-docs/feature_selection_transforms.md
@@ -21,6 +21,9 @@ few utility attributes which are available for all transforms:
- `checkpoint` - Type: `boolean`. Optional. If set to true, checkpoint the
dataset in Spark before computing the feature selection. This can reduce some
resource usage for very complex workflows, but should not be necessary.
- `skip` - Type: `boolean`. Optional. If set to true, don't compute this
feature selection. This has the same effect as commenting the feature
selection out of your config file.
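
As a rough illustration of the `skip` semantics described in this list and exercised by the new tests, here is a minimal Python sketch that filters feature selection dictionaries the way a skipped entry is ignored. It is not hlink's implementation: `active_feature_selections` is a hypothetical helper introduced only for this example, and the output column name `name_bigrams_kept` is made up.

```python
from typing import Any


def active_feature_selections(
    feature_selections: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Hypothetical helper: keep only the feature selections that are not
    marked with "skip": True. A missing "skip" key is treated as False.
    """
    return [fs for fs in feature_selections if not fs.get("skip", False)]


feature_selections = [
    {
        "input_column": "name",
        "output_column": "name_bigrams",
        "transform": "bigrams",
        "skip": True,  # ignored, as if commented out of the config
    },
    {
        "input_column": "name",
        "output_column": "name_bigrams_kept",  # made-up name for illustration
        "transform": "bigrams",
        "skip": False,  # computed as normal
    },
]

kept = active_feature_selections(feature_selections)
print([fs["output_column"] for fs in kept])
```

Treating a missing `skip` key as `False` is an assumption in this sketch, chosen because the attribute is documented as optional.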

## bigrams

