Dataset inspect #510
Conversation
CI Results (Jenkins unit test run, commit 7dfa5bf3a668ead8c504e6ab7ee9c502368fa95b, no merge conflicts): the job checked out the PR and installed nvtabular successfully, but the formatting check failed:

```
would reformat /var/jenkins_home/workspace/nvtabular_tests/nvtabular/tools/inspect_dataset.py
Oh no! 💥 💔 💥
1 file would be reformatted, 78 files would be left unchanged.
Build step 'Execute shell' marked build as failure
```
I don't think so. As far as I can see, your PR consists of some tools that check whether a dataset is in a correct and efficient form to be processed by NVTabular. Given a dataset, this PR extracts its stats: general stats like num_rows, num_cats, num_conts, but also stats per column: max/min/std/%nans.
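A minimal sketch of the per-column stats described above, using pandas for illustration since `dask_cudf` requires a GPU (the helper name `column_stats` is hypothetical, not from the PR):

```python
import pandas as pd

def column_stats(df, col):
    # Collect the kinds of per-column stats the tool reports:
    # max/min/std and the percentage of NaN values.
    s = df[col]
    return {
        "min": s.min(),
        "max": s.max(),
        "std": s.std(),
        "per_nan": 100 * s.isna().sum() / len(s),
    }

df = pd.DataFrame({"price": [1.0, 2.0, None, 4.0]})
stats = column_stats(df, "price")
```

Here `stats["per_nan"]` would be 25.0, since one of the four values is missing.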
Perfect!
Please add some unit tests, so that we can ensure any changes we introduce later continue to support the correct behavior of the tool.
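A hypothetical sketch of the kind of unit test being requested: build a tiny DataFrame, run a stats helper over it, and assert on the output (`get_cont_stats` is a stand-in for the PR's `get_stats`, shown with pandas rather than `dask_cudf` so it runs without a GPU):

```python
import pandas as pd

def get_cont_stats(df, col):
    # Stand-in for the tool's get_stats for a continuous column.
    return {"min": float(df[col].min()), "max": float(df[col].max())}

def test_cont_stats():
    # A pytest-style test: tiny known input, exact expected output.
    df = pd.DataFrame({"x": [0.0, 1.0, 2.0]})
    stats = get_cont_stats(df, "x")
    assert stats["min"] == 0.0
    assert stats["max"] == 2.0
```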
```python
data[col]["cardinality"] = ddf[col].nunique().compute()

# Get max/min/mean for cat, cat_mh, and cont
if col_type != "label":
```
What is the reason we don't want to get the min and max for the label? The rossmann dataset label is a column that represents the number of sales on a given day, so it is not always 0/1. It might be good to have this as well. @EvenOldridge or am I overthinking this?
@albert17 I saw that you did this originally because of the client description breakdown... but I'm wondering whether it wouldn't be good to have the information for multi-label classes.
If it makes sense, we could calculate it for the label too. I guess extra information will not hurt.
```python
data = {}
# Store general info
data["num_rows"] = ddf.shape[0].compute()
data["cats"] = cats
data["cats_mh"] = cats_mh
data["conts"] = conts
data["labels"] = labels

# Get categorical columns stats
for col in cats:
    get_stats(ddf, col, data, "cat")

# Get categorical multihot columns stats
for col in cats_mh:
    get_stats(ddf, col, data, "cat_mh")

# Get continuous columns stats
for col in conts:
    get_stats(ddf, col, data, "cont")

# Get label columns stats
for col in labels:
    get_stats(ddf, col, data, "label")

# Write json file
with fsspec.open(args.output_file, "w") as outfile:
    json.dump(data, outfile, cls=NpEncoder)
```
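The `json.dump(..., cls=NpEncoder)` call above relies on a custom encoder, since the standard `json` module cannot serialize numpy scalars or arrays. A hedged sketch of what such an encoder typically looks like (the PR's actual `NpEncoder` implementation is not shown in this diff, so this is an assumption):

```python
import json
import numpy as np

class NpEncoder(json.JSONEncoder):
    # Convert numpy types, which json.dump cannot handle natively,
    # into plain Python equivalents before serialization.
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)

payload = {"num_rows": np.int64(100), "mean": np.float32(1.5)}
text = json.dumps(payload, cls=NpEncoder)
```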
This should be in a function, as this is the actual "inspect" that is occurring. You should also return the data after writing the file. This will allow us to use the generated config on the fly by just taking the return value from the inspect call, i.e.:

```python
# columns_dict is a dictionary of col_type: [cols]
class DatasetInspector():
    # Should also support nvt.dataset
    def inspect(self, <path>, columns_dict):
        ...
        return json

a = DatasetInspector()
json_a = a.inspect(<ds_path/nvt.dataset>, columns_dict)
new_data = DatasetGenerator(json_a)  # creates new dataset based on json config
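The suggested API above could be made runnable roughly as follows. This is a sketch under assumptions, not the PR's implementation: the class and parameter names mirror the reviewer's pseudocode, and the stats computation is stubbed out (the real tool would read the dataset with `dask_cudf`):

```python
import json

class DatasetInspector:
    # Hypothetical sketch: inspect() writes the JSON config and also
    # returns it, so callers can reuse the config without re-reading the file.
    def inspect(self, path, columns_dict, output_file=None):
        # Stub: the real implementation would load the dataset at `path`
        # and compute per-column stats; here we only record the layout.
        data = {"path": path, **columns_dict}
        if output_file is not None:
            with open(output_file, "w") as f:
                json.dump(data, f)
        return data

inspector = DatasetInspector()
config = inspector.inspect("ds.parquet", {"cats": ["c1"], "conts": ["x"]})
```

Returning the config directly is what enables chaining into a generator, as the reviewer suggests.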
Initial dataset inspect tool. Uses `dask_cudf` to be able to scale. Works for cats, conts, and label. Dealing with an issue related to list dtypes in `dask_cudf` dataframes.

@jperez999 for feedback.
@rjzamora for feedback about how I am using `dask_cudf` and for the list `dtype` problem.