Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subject: Support to include ID/PK in validation result for each row t… #5876

Conversation

abekfenn
Copy link
Contributor

@abekfenn abekfenn commented Aug 30, 2022

…hat failed an expectations

Ticket Number: #3195

Problem:

  • No arguments available to configure column subset of unexpected_rows

Solution:

  • Add args to subset unexpected_rows

Updated 2022-11-10 by @Shinnnyshinshin

Changes proposed in this pull request:

Introduction of unexpected_index_columns value that allows users to specify a primary key (PK) column for identifying rows that failed an Expectation (usually returned as part of the unexpected_index_list).

  • Changes were made to the map_metric_provider to take in the parameter from result_format and output unexpected_index_list as key-value pairs of the primary key column.
  • Note This is only the Pandas implementation, with Spark and sql to follow.

Can you give me an Example?

Given the following DataFrame:

df = pd.DataFrame(
        {
            "pk_1": [0, 1, 2, 3, 4, 5],
            "pk_2": ["zero", "one", "two", "three", "four", "five"],
            "numbers_with_duplicates": [1, 5, 22, 3, 5, 10],
            "animal_names_no_duplicates": [
                "cat",
                "fish",
                "dog",
                "giraffe",
                "lion",
                "zebra",
            ],
        }
    )

We could run the expect_column_values_to_be_in_set Expectation on the numbers_with_duplicates column (which has [1, 5, 22, 3, 5, 10] as its values) with [1, 5, 22] as the value_set.

    expectationConfiguration = ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_in_set",
        kwargs={
            "column": "numbers_with_duplicates",
            "value_set": [1, 5, 22],
        },
    )

After running the ExpectationConfiguration, we would expect the unexpected_index_list to be [3, 5] which correspond to the indices of 3 and 10, the values that are not in the value_set.

This PR enables the following configuration, which sets unexpected_index_columns to be pk_2.

    expectationConfiguration = ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_in_set",
        kwargs={
            "column": "numbers_with_duplicates",
            "value_set": [1, 5, 22],
            "result_format": {
                "unexpected_index_columns": ["pk_2"],
            },
        },
    )

After running this new ExpectationConfiguration, we expect the unexpected_index_list to be [{"pk_2": "three"}, {"pk_2": "five"}] which correspond to the values in the pk_2 column of the indices of 22 and 10 the numbers_with_duplicates column.

Note

  • The PR also allows specifying multiple unexpected_index_columns, which are described in the tests.
  • The PR also allows for the output to follow patterns established by existing result_format levels (ie. SUMMARY, COMPLETE, BASIC and BOOLEAN_ONLY.

Definition of Done

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

@netlify
Copy link

netlify bot commented Aug 30, 2022

Deploy Preview for niobium-lead-7998 ready!

Name Link
🔨 Latest commit 2ff38b9
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/6372cb297531e70008ddfb65
😎 Deploy Preview https://deploy-preview-5876--niobium-lead-7998.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

Copy link
Contributor

@austiezr austiezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @abekfenn! Thanks for the work here; love to see it! Looks like you're very close to having this functional; just a couple of small fixes and I think you'll be able to move ahead with this.

great_expectations/expectations/expectation.py Outdated Show resolved Hide resolved
great_expectations/expectations/expectation.py Outdated Show resolved Hide resolved
great_expectations/expectations/expectation.py Outdated Show resolved Hide resolved
Comment on lines 303 to 334
assert convert_to_json_serializable(result.result) == {
"element_count": 6,
"unexpected_count": 2,
"unexpected_index_list": [3, 5],
"unexpected_percent": 33.33333333333333,
"partial_unexpected_list": [3, 10],
"unexpected_list": [3, 10],
"unexpected_index_columns": [{"a": 3, "b": "giraffe"}, {"a": 10, "b": "zebra"}],
"partial_unexpected_index_list": [3, 5],
"partial_unexpected_counts": [
{"value": 3, "count": 1},
{"value": 10, "count": 1},
],
"missing_count": 0,
"missing_percent": 0.0,
"unexpected_percent_total": 33.33333333333333,
"unexpected_percent_nonmissing": 33.33333333333333,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assert will need a review once the above metric changes are in place.

@austiezr austiezr added community devrel This item is being addressed by the Developer Relations Team labels Aug 31, 2022
@austiezr
Copy link
Contributor

Looks like we'll need a linting pass as well!

@abekfenn
Copy link
Contributor Author

abekfenn commented Sep 1, 2022

Thanks @austiezr your feedback was very helpful. I'm definitely much closer now and was looking to add a test for an expectation that leverages the SqlAlchemyExecutionEngine but I don't see any existing tests or datasets that I could reference.
I assume they're out there, would you be able to point me to an example test or dataset that I could use to test an expectation with these args?

@austiezr
Copy link
Contributor

austiezr commented Sep 1, 2022

Hey @abekfenn ! Have you had an opportunity to check out these tests?

@abekfenn
Copy link
Contributor Author

abekfenn commented Sep 1, 2022

@austiezr I haven't seen those tests but I was looking for tests that test metric providers and the results that different configurations and providers produce. Such as those in tests/expectations/metrics/test_map_metric.py. I'll see if I can put something together from both test files, but surprised there are no tests that test the results produced by the metric providers for SQLAlchemy datasets.

@anthonyburdi anthonyburdi self-assigned this Oct 25, 2022
Will Shin added 13 commits November 8, 2022 17:09
* develop:
  [MAINTENANCE] Change the number of expected Expectations in the 'quick check' stage of build_gallery pipeline (great-expectations#6333)
  [BUGFIX] add create_temp_table flag to ExecutionEngineConfigSchema (great-expectations#6331)
  [DOCS] Add ZenML into integration table in Readme (great-expectations#6144)
* develop:
  [BUGFIX] MapMetrics now return `partial_unexpected` values for `SUMMARY` format (great-expectations#6334)
@Shinnnyshinshin
Copy link
Contributor

Thank you very very much @abekfenn for getting the ball rolling, and all the work you put in. I was able to start where you left off, and have a PR for the Pandas implementation that is currently under review by the team. I'll post more updates here :)

@Shinnnyshinshin Shinnnyshinshin force-pushed the feature/sql_unexpected_index_list branch 2 times, most recently from 90eda4a to b826b21 Compare November 10, 2022 22:29
…ndex_list

* f/dx-21/pandas-adding-pk:
  [DOCS] DOC-366 updates to docs in support of branding updates (great-expectations#5766)
  [RELEASE] 0.15.32 (great-expectations#6338)
  [DOCS] DOC-308 update CLI command in docs when working with RBPs instead of Data Assistants (great-expectations#6222)
  [DOCS] add `pypi` release badge (great-expectations#6324)
  cleaned up tests
  [BUGFIX] MapMetrics now return `partial_unexpected` values for `SUMMARY` format (great-expectations#6334)
  better typing
  Update test_map_metric.py
  small clean up
  clean up expectation
  intermediate push
  pushing before refactoring and polishing
Will Shin added 5 commits November 10, 2022 15:46
…nn/great_expectations into feature/sql_unexpected_index_list

* 'feature/sql_unexpected_index_list' of github.com:abekfenn/great_expectations:
  [MAINTENANCE] In ExecutionEngine: Make variable names and usage more descriptive of their purpose. (great-expectations#6342)
  [MAINTENANCE] Refactor variable aggregation/substitution logic into `ConfigurationProvider` hierarchy (great-expectations#6321)
  [MAINTENANCE] Add missing one-line docstrings and try to make the others consistent (great-expectations#6340)
  [BUGFIX] Fix issue with misaligned indentation in docs snippets (great-expectations#6339)
Comment on lines +1753 to +1755
if result_format["result_format"] == "COMPLETE":
return unexpected_index_list
return unexpected_index_list[: result_format["partial_unexpected_count"]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From #6329

Added the filtering for partial_unexpected_count for both unexpected_index_column and just unexpected_index (existing behavior)

Comment on lines +603 to +605
def test_pandas_multiple_unexpected_index_column_names_summary_result_format_limit_1(
pandas_dataframe_for_unexpected_rows_with_index: pd.DataFrame,
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From #6329

1st test for testing partial_unexpected_count limit

Comment on lines +509 to +511
def test_pandas_multiple_unexpected_index_column_names_complete_result_format_limit_1(
pandas_dataframe_for_unexpected_rows_with_index: pd.DataFrame,
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From #6329

2nd test for testing partial_unexpected_count limit

Copy link
Member

@anthonyburdi anthonyburdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for these changes 🙇

Copy link
Contributor

@austiezr austiezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ 🚀 LGTM!

@Shinnnyshinshin Shinnnyshinshin enabled auto-merge (squash) November 14, 2022 22:34
@abekfenn
Copy link
Contributor Author

Thanks a lot for wrapping up the changes on these @Shinnnyshinshin!

I only see changes related to pandas but my understanding was that this feature (stemming from #3195) is really needed in the SQL engine in order to support identification of the Pk/Id of a failed row.
This functionality already exists for pandas in GE but is not supported for SQL.
Just want to make sure that merging doesn't mean that this issue will be closed out and that we will continue work on this issue for the SQL engine implementation?

@Shinnnyshinshin
Copy link
Contributor

Hi @abekfenn thank you for the follow-up message 😄 Yes the next steps for us are to do the implementation for SQL and Spark (we'll be sure to do SQL first). Wanting to thank you again for helping us get this far, and Ill be sure to update you once the SQL implementation has been implemented and merged.

@abekfenn
Copy link
Contributor Author

Awesome! Thanks so much @Shinnnyshinshin

@Shinnnyshinshin Shinnnyshinshin merged commit 0df9ff7 into great-expectations:develop Nov 15, 2022
Shinnnyshinshin pushed a commit that referenced this pull request Nov 15, 2022
* develop:
  [MAINTENANCE] Update column_reflection_fallback to also use schema name for Trino (#6350)
  [MAINTENANCE] Misc cleanup of GX Cloud helpers (#6352)
  [MAINTENANCE] Pin `mypy` to `0.990` (#6361)
  Subject: Support to include ID/PK in validation result for each row t… (#5876)
  [MAINTENANCE] Fix computed metrics type hint in ExecutionEngine.resolve_metrics() method (#6347)
  [MAINTENANCE] Refactor `usage_stats_opt_out` method in DataContext (#5339)
  Zep PostgresDatasource returns a list of batches. (#6341)
  [DOCS] Add `yarn snippet-check` command (#6351)
  [MAINTENANCE] Refactor out `termcolor` dependency (#6348)
  [MAINTENANCE] Move Cloud-specific enums to `cloud_constants.py` (#6349)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community devrel This item is being addressed by the Developer Relations Team feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support to include ID/PK in validation result for each row that failed an expectations
5 participants