Subject: Support to include ID/PK in validation result for each row t… #5876

abekfenn · 2022-08-30T05:00:13Z

…hat failed an expectations

Ticket Number: #3195

Problem:

No arguments available to configure column subset of unexpected_rows

Solution:

Add args to subset unexpected_rows

Updated 2022-11-10 by @Shinnnyshinshin

Changes proposed in this pull request:

Introduction of unexpected_index_columns value that allows users to specify a primary key (PK) column for identifying rows that failed an Expectation (usually returned as part of the unexpected_index_list).

Changes were made to the map_metric_provider to take in the parameter from result_format and output unexpected_index_list as key-value pairs of the primary key column.

Note This is only the Pandas implementation, with Spark and sql to follow.

Can you give me an Example?

Given the following DataFrame:

df = pd.DataFrame(
        {
            "pk_1": [0, 1, 2, 3, 4, 5],
            "pk_2": ["zero", "one", "two", "three", "four", "five"],
            "numbers_with_duplicates": [1, 5, 22, 3, 5, 10],
            "animal_names_no_duplicates": [
                "cat",
                "fish",
                "dog",
                "giraffe",
                "lion",
                "zebra",
            ],
        }
    )

We could run the expect_column_values_to_be_in_set Expectation on the numbers_with_duplicates column (which has [1, 5, 22, 3, 5, 10] as its values) with [1, 5, 22] as the value_set.

    expectationConfiguration = ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_in_set",
        kwargs={
            "column": "numbers_with_duplicates",
            "value_set": [1, 5, 22],
        },
    )

After running the ExpectationConfiguration, we would expect the unexpected_index_list to be [3, 5] which correspond to the indices of 3 and 10, the values that are not in the value_set.

This PR enables the following configuration, which sets unexpected_index_columns to be pk_2.

    expectationConfiguration = ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_in_set",
        kwargs={
            "column": "numbers_with_duplicates",
            "value_set": [1, 5, 22],
            "result_format": {
                "unexpected_index_columns": ["pk_2"],
            },
        },
    )

After running this new ExpectationConfiguration, we expect the unexpected_index_list to be [{"pk_2": "three"}, {"pk_2": "five"}] which correspond to the values in the pk_2 column of the indices of 22 and 10 the numbers_with_duplicates column.

Note

The PR also allows specifying multiple unexpected_index_columns, which are described in the tests.
The PR also allows for the output to follow patterns established by existing result_format levels (ie. SUMMARY, COMPLETE, BASIC and BOOLEAN_ONLY.

Definition of Done

My code follows the Great Expectations style guide
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have added unit tests where applicable and made sure that new and existing tests are passing.
I have run any local integration tests and made sure that nothing is broken.

netlify · 2022-08-30T05:00:15Z

✅ Deploy Preview for niobium-lead-7998 ready!

Name	Link
🔨 Latest commit	`2ff38b9`
🔍 Latest deploy log	https://app.netlify.com/sites/niobium-lead-7998/deploys/6372cb297531e70008ddfb65
😎 Deploy Preview	https://deploy-preview-5876--niobium-lead-7998.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

austiezr

Hey @abekfenn! Thanks for the work here; love to see it! Looks like you're very close to having this functional; just a couple of small fixes and I think you'll be able to move ahead with this.

great_expectations/expectations/expectation.py

austiezr · 2022-08-31T16:40:04Z

tests/expectations/metrics/test_map_metric.py

+    assert convert_to_json_serializable(result.result) == {
+        "element_count": 6,
+        "unexpected_count": 2,
+        "unexpected_index_list": [3, 5],
+        "unexpected_percent": 33.33333333333333,
+        "partial_unexpected_list": [3, 10],
+        "unexpected_list": [3, 10],
+        "unexpected_index_columns": [{"a": 3, "b": "giraffe"}, {"a": 10, "b": "zebra"}],
+        "partial_unexpected_index_list": [3, 5],
+        "partial_unexpected_counts": [
+            {"value": 3, "count": 1},
+            {"value": 10, "count": 1},
+        ],
+        "missing_count": 0,
+        "missing_percent": 0.0,
+        "unexpected_percent_total": 33.33333333333333,
+        "unexpected_percent_nonmissing": 33.33333333333333,
+    }


This assert will need a review once the above metric changes are in place.

austiezr · 2022-08-31T16:41:30Z

Looks like we'll need a linting pass as well!

abekfenn · 2022-09-01T02:27:25Z

Thanks @austiezr your feedback was very helpful. I'm definitely much closer now and was looking to add a test for an expectation that leverages the SqlAlchemyExecutionEngine but I don't see any existing tests or datasets that I could reference.
I assume they're out there, would you be able to point me to an example test or dataset that I could use to test an expectation with these args?

austiezr · 2022-09-01T17:16:00Z

Hey @abekfenn ! Have you had an opportunity to check out these tests?

abekfenn · 2022-09-01T18:08:26Z

@austiezr I haven't seen those tests but I was looking for tests that test metric providers and the results that different configurations and providers produce. Such as those in tests/expectations/metrics/test_map_metric.py. I'll see if I can put something together from both test files, but surprised there are no tests that test the results produced by the metric providers for SQLAlchemy datasets.

* develop: [MAINTENANCE] Change the number of expected Expectations in the 'quick check' stage of build_gallery pipeline (great-expectations#6333) [BUGFIX] add create_temp_table flag to ExecutionEngineConfigSchema (great-expectations#6331) [DOCS] Add ZenML into integration table in Readme (great-expectations#6144)

* develop: [BUGFIX] MapMetrics now return `partial_unexpected` values for `SUMMARY` format (great-expectations#6334)

Shinnnyshinshin · 2022-11-10T03:11:01Z

Thank you very very much @abekfenn for getting the ball rolling, and all the work you put in. I was able to start where you left off, and have a PR for the Pandas implementation that is currently under review by the team. I'll post more updates here :)

…ndex_list * f/dx-21/pandas-adding-pk: [DOCS] DOC-366 updates to docs in support of branding updates (great-expectations#5766) [RELEASE] 0.15.32 (great-expectations#6338) [DOCS] DOC-308 update CLI command in docs when working with RBPs instead of Data Assistants (great-expectations#6222) [DOCS] add `pypi` release badge (great-expectations#6324) cleaned up tests [BUGFIX] MapMetrics now return `partial_unexpected` values for `SUMMARY` format (great-expectations#6334) better typing Update test_map_metric.py small clean up clean up expectation intermediate push pushing before refactoring and polishing

…nn/great_expectations into feature/sql_unexpected_index_list * 'feature/sql_unexpected_index_list' of github.com:abekfenn/great_expectations: [MAINTENANCE] In ExecutionEngine: Make variable names and usage more descriptive of their purpose. (great-expectations#6342) [MAINTENANCE] Refactor variable aggregation/substitution logic into `ConfigurationProvider` hierarchy (great-expectations#6321) [MAINTENANCE] Add missing one-line docstrings and try to make the others consistent (great-expectations#6340) [BUGFIX] Fix issue with misaligned indentation in docs snippets (great-expectations#6339)

Shinnnyshinshin · 2022-11-14T22:20:30Z

great_expectations/expectations/metrics/map_metric_provider.py

+        if result_format["result_format"] == "COMPLETE":
+            return unexpected_index_list
+        return unexpected_index_list[: result_format["partial_unexpected_count"]]


From #6329

Added the filtering for partial_unexpected_count for both unexpected_index_column and just unexpected_index (existing behavior)

Shinnnyshinshin · 2022-11-14T22:21:11Z

tests/expectations/metrics/test_map_metric.py

+def test_pandas_multiple_unexpected_index_column_names_summary_result_format_limit_1(
+    pandas_dataframe_for_unexpected_rows_with_index: pd.DataFrame,
+):


From #6329

1st test for testing partial_unexpected_count limit

Shinnnyshinshin · 2022-11-14T22:21:41Z

tests/expectations/metrics/test_map_metric.py

+def test_pandas_multiple_unexpected_index_column_names_complete_result_format_limit_1(
+    pandas_dataframe_for_unexpected_rows_with_index: pd.DataFrame,
+):


From #6329

2nd test for testing partial_unexpected_count limit

anthonyburdi

LGTM! Thank you for these changes 🙇

austiezr

❤️ 🚀 LGTM!

abekfenn · 2022-11-14T23:08:37Z

Thanks a lot for wrapping up the changes on these @Shinnnyshinshin!

I only see changes related to pandas but my understanding was that this feature (stemming from #3195) is really needed in the SQL engine in order to support identification of the Pk/Id of a failed row.
This functionality already exists for pandas in GE but is not supported for SQL.
Just want to make sure that merging doesn't mean that this issue will be closed out and that we will continue work on this issue for the SQL engine implementation?

Shinnnyshinshin · 2022-11-14T23:12:57Z

Hi @abekfenn thank you for the follow-up message 😄 Yes the next steps for us are to do the implementation for SQL and Spark (we'll be sure to do SQL first). Wanting to thank you again for helping us get this far, and Ill be sure to update you once the SQL implementation has been implemented and merged.

abekfenn · 2022-11-14T23:38:13Z

Awesome! Thanks so much @Shinnnyshinshin

* develop: [MAINTENANCE] Update column_reflection_fallback to also use schema name for Trino (#6350) [MAINTENANCE] Misc cleanup of GX Cloud helpers (#6352) [MAINTENANCE] Pin `mypy` to `0.990` (#6361) Subject: Support to include ID/PK in validation result for each row t… (#5876) [MAINTENANCE] Fix computed metrics type hint in ExecutionEngine.resolve_metrics() method (#6347) [MAINTENANCE] Refactor `usage_stats_opt_out` method in DataContext (#5339) Zep PostgresDatasource returns a list of batches. (#6341) [DOCS] Add `yarn snippet-check` command (#6351) [MAINTENANCE] Refactor out `termcolor` dependency (#6348) [MAINTENANCE] Move Cloud-specific enums to `cloud_constants.py` (#6349)

abekfenn mentioned this pull request Aug 30, 2022

Support to include ID/PK in validation result for each row that failed an expectations #3195

Closed

austiezr suggested changes Aug 31, 2022

View reviewed changes

austiezr added community devrel This item is being addressed by the Developer Relations Team labels Aug 31, 2022

talagluck added the feature label Sep 22, 2022

anthonyburdi self-assigned this Oct 25, 2022

Will Shin added 13 commits November 8, 2022 17:09

pushing before refactoring and polishing

05c4678

Merge branch 'develop' into f/dx-21/pandas-adding-pk

bcf983d

intermediate push

377c7bb

summary will return partial index lists

5e7e67b

after review

feb266f

clean up expectation

0bfc6f1

small clean up

6c913e7

Update test_map_metric.py

a1cdd11

better typing

117bb38

Merge branch 'develop' into f/dx-21/pandas-adding-pk

76d5728

* develop: [BUGFIX] MapMetrics now return `partial_unexpected` values for `SUMMARY` format (great-expectations#6334)

cleaned up tests

1723794

testing push

b826b21

Shinnnyshinshin mentioned this pull request Nov 10, 2022

[FEATURE] Support to include ID/PK in validation result for each row - Pandas #6329

Closed

5 tasks

Merge branch 'develop' into f/dx-21/pandas-adding-pk

f6fbdb1

Shinnnyshinshin force-pushed the feature/sql_unexpected_index_list branch 2 times, most recently from 90eda4a to b826b21 Compare November 10, 2022 22:29

Will Shin added 5 commits November 10, 2022 15:46

rename parameter to unexpected_index_column_names

0a62f5d

remove extra comment

36ddc8a

Merge branch 'develop' into feature/sql_unexpected_index_list

22f804c

Update map_metric_provider.py

274cf4b

anthonyburdi assigned Shinnnyshinshin and unassigned anthonyburdi Nov 14, 2022

pushing after review

982644b

Shinnnyshinshin requested review from austiezr and anthonyburdi November 14, 2022 22:18

Merge branch 'develop' into feature/sql_unexpected_index_list

6ea8d6e

Shinnnyshinshin reviewed Nov 14, 2022

View reviewed changes

Shinnnyshinshin self-requested a review November 14, 2022 22:23

anthonyburdi approved these changes Nov 14, 2022

View reviewed changes

austiezr approved these changes Nov 14, 2022

View reviewed changes

Shinnnyshinshin enabled auto-merge (squash) November 14, 2022 22:34

Merge branch 'develop' into feature/sql_unexpected_index_list

2ff38b9

Shinnnyshinshin merged commit 0df9ff7 into great-expectations:develop Nov 15, 2022

Shinnnyshinshin mentioned this pull request Dec 23, 2022

[FEATURE] ID/PK Rendering in DataDocs #6637

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Subject: Support to include ID/PK in validation result for each row t… #5876

Subject: Support to include ID/PK in validation result for each row t… #5876

abekfenn commented Aug 30, 2022 •

edited by Shinnnyshinshin

Loading

netlify bot commented Aug 30, 2022 •

edited

Loading

austiezr left a comment

austiezr Aug 31, 2022

austiezr commented Aug 31, 2022

abekfenn commented Sep 1, 2022

austiezr commented Sep 1, 2022

abekfenn commented Sep 1, 2022

Shinnnyshinshin commented Nov 10, 2022

Shinnnyshinshin Nov 14, 2022

Shinnnyshinshin Nov 14, 2022

Shinnnyshinshin Nov 14, 2022

anthonyburdi left a comment

austiezr left a comment

abekfenn commented Nov 14, 2022

Shinnnyshinshin commented Nov 14, 2022

abekfenn commented Nov 14, 2022

Subject: Support to include ID/PK in validation result for each row t… #5876

Subject: Support to include ID/PK in validation result for each row t… #5876

Conversation

abekfenn commented Aug 30, 2022 • edited by Shinnnyshinshin Loading

Can you give me an Example?

Definition of Done

netlify bot commented Aug 30, 2022 • edited Loading

✅ Deploy Preview for niobium-lead-7998 ready!

austiezr left a comment

Choose a reason for hiding this comment

austiezr Aug 31, 2022

Choose a reason for hiding this comment

austiezr commented Aug 31, 2022

abekfenn commented Sep 1, 2022

austiezr commented Sep 1, 2022

abekfenn commented Sep 1, 2022

Shinnnyshinshin commented Nov 10, 2022

Shinnnyshinshin Nov 14, 2022

Choose a reason for hiding this comment

Shinnnyshinshin Nov 14, 2022

Choose a reason for hiding this comment

Shinnnyshinshin Nov 14, 2022

Choose a reason for hiding this comment

anthonyburdi left a comment

Choose a reason for hiding this comment

austiezr left a comment

Choose a reason for hiding this comment

abekfenn commented Nov 14, 2022

Shinnnyshinshin commented Nov 14, 2022

abekfenn commented Nov 14, 2022

abekfenn commented Aug 30, 2022 •

edited by Shinnnyshinshin

Loading

netlify bot commented Aug 30, 2022 •

edited

Loading