
Dataset inspect #510

Closed
wants to merge 6 commits into from

Conversation

albert17
Contributor

@albert17 albert17 commented Dec 18, 2020

Initial dataset inspect tool.

It uses dask_cudf so it can scale. It works for cats, conts, and label columns; I am still dealing with an issue related to list dtypes in dask_cudf dataframes.

@jperez999 for feedback.

@rjzamora for feedback about how I am using dask_cudf and for the list dtype problem.
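The per-column stats the tool computes can be sketched roughly as below. This is a minimal illustration using pandas, since dask_cudf exposes a near-identical dataframe API (with lazy results materialized via `.compute()`); the `get_stats` signature and the column types mirror the PR's code, and the sample data is made up.

```python
# Sketch of per-column stats, pandas stand-in for dask_cudf.
import pandas as pd

def get_stats(df, col, data, col_type):
    data[col] = {"dtype": str(df[col].dtype)}
    # Cardinality only makes sense for categorical columns
    if col_type in ("cat", "cat_mh"):
        data[col]["cardinality"] = int(df[col].nunique())
    # The PR skips min/max for label columns
    if col_type != "label":
        data[col]["min"] = df[col].min()
        data[col]["max"] = df[col].max()

df = pd.DataFrame({"cat": ["a", "b", "a"], "cont": [0.1, 0.5, 0.9]})
data = {}
get_stats(df, "cat", data, "cat")
get_stats(df, "cont", data, "cont")
```

With dask_cudf, each aggregation above would end in `.compute()` so the work is distributed across partitions before being collected.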

@albert17 albert17 changed the base branch from main to new_api December 18, 2020 15:47
@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #510 of commit 7dfa5bf3a668ead8c504e6ab7ee9c502368fa95b, no merge conflicts.
Running as SYSTEM
Setting status of 7dfa5bf3a668ead8c504e6ab7ee9c502368fa95b to PENDING with url http://10.20.13.93:8080/job/nvtabular_tests/1390/ and message: 'Pending'
Using context: Jenkins Unit Test Run
Building in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --progress -- https://github.com/NVIDIA/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --progress -- https://github.com/NVIDIA/NVTabular.git +refs/pull/510/*:refs/remotes/origin/pr/510/* # timeout=10
 > git rev-parse 7dfa5bf3a668ead8c504e6ab7ee9c502368fa95b^{commit} # timeout=10
Checking out Revision 7dfa5bf3a668ead8c504e6ab7ee9c502368fa95b (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 7dfa5bf3a668ead8c504e6ab7ee9c502368fa95b # timeout=10
Commit message: "Works for cats, conts, and label"
 > git rev-list --no-walk 913f0cd50a1e6ab049816bee7013dd3b3165a4b0 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins7702758098135908096.sh
Obtaining file:///var/jenkins_home/workspace/nvtabular_tests/nvtabular
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
    Preparing wheel metadata: started
    Preparing wheel metadata: finished with status 'done'
Installing collected packages: nvtabular
  Running setup.py develop for nvtabular
Successfully installed nvtabular
would reformat /var/jenkins_home/workspace/nvtabular_tests/nvtabular/tools/inspect_dataset.py
Oh no! 💥 💔 💥
1 file would be reformatted, 78 files would be left unchanged.
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash
source activate rapids
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins4574887399636030078.sh

@rjzamora
Collaborator

Thanks @albert17 - I'll look at this asap. Let me know if there is any overlap with #484

@albert17
Contributor Author

> Thanks @albert17 - I'll look at this asap. Let me know if there is any overlap with #484

I don't think so.

As far as I can see, your PR consists of tools that make sure a dataset is in a correct and efficient format for NVTabular to process.

This PR: given a dataset, extract its stats. General stats like num_rows, num_cats, and num_conts, but also stats per column: max/min/std/% NaNs.
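Those two levels of stats can be sketched as follows, using pandas for illustration (the PR computes the same quantities with dask_cudf); the column lists, the `pct_nans` key, and the sample data here are all illustrative.

```python
# General stats plus per-column max/min/std/% NaNs, pandas stand-in.
import numpy as np
import pandas as pd

df = pd.DataFrame({"cat": ["a", None, "a", "b"], "cont": [1.0, 2.0, np.nan, 4.0]})
cats, conts = ["cat"], ["cont"]

# General stats
stats = {"num_rows": len(df), "num_cats": len(cats), "num_conts": len(conts)}

# Per-column stats for continuous columns
for col in conts:
    stats[col] = {
        "min": df[col].min(),
        "max": df[col].max(),
        "std": df[col].std(),                      # NaNs skipped by default
        "pct_nans": 100 * df[col].isna().mean(),   # fraction of NaNs as a percent
    }
```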

@rjzamora
Collaborator

> This PR: given a dataset, extract its stats. General stats like num_rows, num_cats, and num_conts, but also stats per column: max/min/std/% NaNs.

Perfect!

Contributor

@jperez999 jperez999 left a comment


Please add some unit tests, so that we can ensure any changes we introduce later continue to support the correct behavior of the tool.

data[col]["cardinality"] = ddf[col].nunique().compute()

# Get max/min/mean for cat, cat_mh, and cont
if col_type != "label":
Contributor


What is the reason we don't want to get the min and max for the label? The Rossmann dataset label is a column that represents the number of sales on a given day, so it is not always 0/1. It might be good to have this as well. @EvenOldridge, or am I overthinking this?

Contributor


@albert17 I saw that you did this originally because of the client description break down... but I'm wondering if it wouldn't be good to have this information for multi-label classes.

Contributor Author


If it makes sense, we could calculate it for the label as well. I guess the extra information will not hurt.
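To illustrate the point of the discussion above: a label like Rossmann's daily sales count is numeric rather than binary, so computing its min/max is cheap and informative. A pandas stand-in with made-up values:

```python
# Hypothetical daily-sales label column: not restricted to 0/1,
# so min/max carry real information.
import pandas as pd

label = pd.Series([0, 410, 5263, 0, 3139])  # illustrative values
label_stats = {"min": int(label.min()), "max": int(label.max())}
```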

Comment on lines +135 to +161
data = {}
# Store general info
data["num_rows"] = ddf.shape[0].compute()
data["cats"] = cats
data["cats_mh"] = cats_mh
data["conts"] = conts
data["labels"] = labels

# Get categorical columns stats
for col in cats:
    get_stats(ddf, col, data, "cat")

# Get categorical multihot columns stats
for col in cats_mh:
    get_stats(ddf, col, data, "cat_mh")

# Get continuous columns stats
for col in conts:
    get_stats(ddf, col, data, "cont")

# Get label columns stats
for col in labels:
    get_stats(ddf, col, data, "label")

# Write json file
with fsspec.open(args.output_file, "w") as outfile:
    json.dump(data, outfile, cls=NpEncoder)
Contributor


This should be in a function, as this is the actual "inspect" that is occurring. You should also return the result after writing it; this will allow us to use the generated config on the fly by just taking the return value from the inspect call, e.g.:

# columns_dict is a dictionary of col_type: [cols]
class DatasetInspector():
    # Should also support nvt.dataset
    def inspect(self, <path>, columns_dict):
        ...
        return json


a = DatasetInspector()
json_a = a.inspect(<ds_path/nvt.dataset>, columns_dict)

new_data = DatasetGenerator(json_a)  # creates new dataset based on json config
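The proposed interface might be fleshed out as in the runnable sketch below. It assumes a pandas dataframe input and a plain dict config; path/nvt.Dataset handling, the per-column stats, and the DatasetGenerator round-trip are omitted, and all names are illustrative rather than the PR's final API.

```python
# Minimal DatasetInspector sketch: inspect() builds the config dict,
# optionally writes it to disk, and returns it for on-the-fly reuse.
import json
import pandas as pd

class DatasetInspector:
    def inspect(self, df, columns_dict, output_file=None):
        data = {"num_rows": len(df)}
        # Record the column lists per type (cats/conts/labels, etc.)
        for col_type, cols in columns_dict.items():
            data[col_type] = cols
        # Optionally persist the config as JSON
        if output_file is not None:
            with open(output_file, "w") as outfile:
                json.dump(data, outfile)
        return data  # returned so callers can consume the config directly

df = pd.DataFrame({"cat": ["a", "b"], "cont": [0.1, 0.2], "label": [0, 1]})
config = DatasetInspector().inspect(
    df, {"cats": ["cat"], "conts": ["cont"], "labels": ["label"]}
)
```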

@albert17 albert17 linked an issue Jan 4, 2021 that may be closed by this pull request
@benfred benfred closed this Jan 6, 2021
@benfred benfred deleted the branch NVIDIA-Merlin:new_api January 6, 2021 00:51
@albert17 albert17 mentioned this pull request Jan 8, 2021
5 participants