-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Subject: Support to include ID/PK in validation result for each row t… #5876
Subject: Support to include ID/PK in validation result for each row t… #5876
Conversation
✅ Deploy Preview for niobium-lead-7998 ready!
To edit notification comments on pull requests, go to your Netlify site settings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @abekfenn! Thanks for the work here; love to see it! Looks like you're very close to having this functional; just a couple of small fixes and I think you'll be able to move ahead with this.
assert convert_to_json_serializable(result.result) == { | ||
"element_count": 6, | ||
"unexpected_count": 2, | ||
"unexpected_index_list": [3, 5], | ||
"unexpected_percent": 33.33333333333333, | ||
"partial_unexpected_list": [3, 10], | ||
"unexpected_list": [3, 10], | ||
"unexpected_index_columns": [{"a": 3, "b": "giraffe"}, {"a": 10, "b": "zebra"}], | ||
"partial_unexpected_index_list": [3, 5], | ||
"partial_unexpected_counts": [ | ||
{"value": 3, "count": 1}, | ||
{"value": 10, "count": 1}, | ||
], | ||
"missing_count": 0, | ||
"missing_percent": 0.0, | ||
"unexpected_percent_total": 33.33333333333333, | ||
"unexpected_percent_nonmissing": 33.33333333333333, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assert will need a review once the above metric changes are in place.
Looks like we'll need a linting pass as well! |
Thanks @austiezr your feedback was very helpful. I'm definitely much closer now and was looking to add a test for an expectation that leverages the |
Hey @abekfenn ! Have you had an opportunity to check out these tests? |
@austiezr I haven't seen those tests but I was looking for tests that test metric providers and the results that different configurations and providers produce. Such as those in |
* develop: [MAINTENANCE] Change the number of expected Expectations in the 'quick check' stage of build_gallery pipeline (great-expectations#6333) [BUGFIX] add create_temp_table flag to ExecutionEngineConfigSchema (great-expectations#6331) [DOCS] Add ZenML into integration table in Readme (great-expectations#6144)
* develop: [BUGFIX] MapMetrics now return `partial_unexpected` values for `SUMMARY` format (great-expectations#6334)
Thank you very very much @abekfenn for getting the ball rolling, and all the work you put in. I was able to start where you left off, and have a PR for the Pandas implementation that is currently under review by the team. I'll post more updates here :) |
90eda4a
to
b826b21
Compare
…ndex_list * f/dx-21/pandas-adding-pk: [DOCS] DOC-366 updates to docs in support of branding updates (great-expectations#5766) [RELEASE] 0.15.32 (great-expectations#6338) [DOCS] DOC-308 update CLI command in docs when working with RBPs instead of Data Assistants (great-expectations#6222) [DOCS] add `pypi` release badge (great-expectations#6324) cleaned up tests [BUGFIX] MapMetrics now return `partial_unexpected` values for `SUMMARY` format (great-expectations#6334) better typing Update test_map_metric.py small clean up clean up expectation intermediate push pushing before refactoring and polishing
…nn/great_expectations into feature/sql_unexpected_index_list * 'feature/sql_unexpected_index_list' of github.com:abekfenn/great_expectations: [MAINTENANCE] In ExecutionEngine: Make variable names and usage more descriptive of their purpose. (great-expectations#6342) [MAINTENANCE] Refactor variable aggregation/substitution logic into `ConfigurationProvider` hierarchy (great-expectations#6321) [MAINTENANCE] Add missing one-line docstrings and try to make the others consistent (great-expectations#6340) [BUGFIX] Fix issue with misaligned indentation in docs snippets (great-expectations#6339)
if result_format["result_format"] == "COMPLETE": | ||
return unexpected_index_list | ||
return unexpected_index_list[: result_format["partial_unexpected_count"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From #6329
Added the filtering for partial_unexpected_count
for both unexpected_index_column
and just unexpected_index
(existing behavior)
def test_pandas_multiple_unexpected_index_column_names_summary_result_format_limit_1( | ||
pandas_dataframe_for_unexpected_rows_with_index: pd.DataFrame, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From #6329
1st test for testing partial_unexpected_count
limit
def test_pandas_multiple_unexpected_index_column_names_complete_result_format_limit_1( | ||
pandas_dataframe_for_unexpected_rows_with_index: pd.DataFrame, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From #6329
2nd test for testing partial_unexpected_count limit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you for these changes 🙇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ 🚀 LGTM!
Thanks a lot for wrapping up the changes on these @Shinnnyshinshin! I only see changes related to pandas but my understanding was that this feature (stemming from #3195) is really needed in the SQL engine in order to support identification of the Pk/Id of a failed row. |
Hi @abekfenn thank you for the follow-up message 😄 Yes the next steps for us are to do the implementation for SQL and Spark (we'll be sure to do SQL first). Wanting to thank you again for helping us get this far, and Ill be sure to update you once the SQL implementation has been implemented and merged. |
Awesome! Thanks so much @Shinnnyshinshin |
* develop: [MAINTENANCE] Update column_reflection_fallback to also use schema name for Trino (#6350) [MAINTENANCE] Misc cleanup of GX Cloud helpers (#6352) [MAINTENANCE] Pin `mypy` to `0.990` (#6361) Subject: Support to include ID/PK in validation result for each row t… (#5876) [MAINTENANCE] Fix computed metrics type hint in ExecutionEngine.resolve_metrics() method (#6347) [MAINTENANCE] Refactor `usage_stats_opt_out` method in DataContext (#5339) Zep PostgresDatasource returns a list of batches. (#6341) [DOCS] Add `yarn snippet-check` command (#6351) [MAINTENANCE] Refactor out `termcolor` dependency (#6348) [MAINTENANCE] Move Cloud-specific enums to `cloud_constants.py` (#6349)
…hat failed an expectations
Ticket Number: #3195
Problem:
Solution:
Updated 2022-11-10 by @Shinnnyshinshin
Changes proposed in this pull request:
Introduction of
unexpected_index_columns
value that allows users to specify a primary key (PK) column for identifying rows that failed an Expectation (usually returned as part of theunexpected_index_list
).map_metric_provider
to take in the parameter fromresult_format
and outputunexpected_index_list
as key-value pairs of the primary key column.Pandas
implementation, withSpark
andsql
to follow.Can you give me an Example?
Given the following DataFrame:
We could run the
expect_column_values_to_be_in_set
Expectation on thenumbers_with_duplicates
column (which has[1, 5, 22, 3, 5, 10]
as its values) with[1, 5, 22]
as thevalue_set
.After running the ExpectationConfiguration, we would expect the
unexpected_index_list
to be[3, 5]
which correspond to the indices of3
and10
, the values that are not in thevalue_set
.This PR enables the following configuration, which sets
unexpected_index_columns
to bepk_2
.After running this new ExpectationConfiguration, we expect the
unexpected_index_list
to be[{"pk_2": "three"}, {"pk_2": "five"}]
which correspond to the values in thepk_2
column of the indices of22
and10
thenumbers_with_duplicates
column.Note
unexpected_index_columns
, which are described in the tests.result_format
levels (ie.SUMMARY
,COMPLETE
,BASIC
andBOOLEAN_ONLY
.Definition of Done