
Dataset inspect #510

Closed
wants to merge 6 commits into from

Conversation

albert17
Contributor

@albert17 albert17 commented Dec 18, 2020

Initial dataset inspect tool.

It uses dask_cudf so it can scale. It works for cats, conts, and label columns; I am still dealing with an issue related to list dtypes in dask_cudf dataframes.

@jperez999 for feedback.

@rjzamora for feedback about how I am using dask_cudf and for the list dtype problem.
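The per-column stats the tool computes can be sketched roughly as below. This is a minimal illustration using pandas, since dask_cudf exposes a near-identical dataframe API (with lazy results materialized via `.compute()`); the `get_stats` signature and the column types mirror the PR's code, and the sample data is made up.

```python
# Sketch of per-column stats, pandas stand-in for dask_cudf.
import pandas as pd

def get_stats(df, col, data, col_type):
    data[col] = {"dtype": str(df[col].dtype)}
    # Cardinality only makes sense for categorical columns
    if col_type in ("cat", "cat_mh"):
        data[col]["cardinality"] = int(df[col].nunique())
    # The PR skips min/max for label columns
    if col_type != "label":
        data[col]["min"] = df[col].min()
        data[col]["max"] = df[col].max()

df = pd.DataFrame({"cat": ["a", "b", "a"], "cont": [0.1, 0.5, 0.9]})
data = {}
get_stats(df, "cat", data, "cat")
get_stats(df, "cont", data, "cont")
```

With dask_cudf, each aggregation above would end in `.compute()` so the work is distributed across partitions before being collected.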

@albert17 albert17 changed the base branch from main to new_api December 18, 2020 15:47
@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #510 of commit 7dfa5bf3a668ead8c504e6ab7ee9c502368fa95b, no merge conflicts.
Running as SYSTEM
Setting status of 7dfa5bf3a668ead8c504e6ab7ee9c502368fa95b to PENDING with url http://10.20.13.93:8080/job/nvtabular_tests/1390/ and message: 'Pending'
Using context: Jenkins Unit Test Run
Building in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --progress -- https://github.com/NVIDIA/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --progress -- https://github.com/NVIDIA/NVTabular.git +refs/pull/510/*:refs/remotes/origin/pr/510/* # timeout=10
 > git rev-parse 7dfa5bf3a668ead8c504e6ab7ee9c502368fa95b^{commit} # timeout=10
Checking out Revision 7dfa5bf3a668ead8c504e6ab7ee9c502368fa95b (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 7dfa5bf3a668ead8c504e6ab7ee9c502368fa95b # timeout=10
Commit message: "Works for cats, conts, and label"
 > git rev-list --no-walk 913f0cd50a1e6ab049816bee7013dd3b3165a4b0 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins7702758098135908096.sh
Obtaining file:///var/jenkins_home/workspace/nvtabular_tests/nvtabular
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
    Preparing wheel metadata: started
    Preparing wheel metadata: finished with status 'done'
Installing collected packages: nvtabular
  Running setup.py develop for nvtabular
Successfully installed nvtabular
would reformat /var/jenkins_home/workspace/nvtabular_tests/nvtabular/tools/inspect_dataset.py
Oh no! 💥 💔 💥
1 file would be reformatted, 78 files would be left unchanged.
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash
source activate rapids
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins4574887399636030078.sh

@rjzamora
Collaborator

Thanks @albert17 - I'll look at this asap. Let me know if there is any overlap with #484

@albert17
Contributor Author

> Thanks @albert17 - I'll look at this asap. Let me know if there is any overlap with #484

I don't think so.

As far as I can see, your PR consists of tools that make sure a dataset is in a correct and efficient format for NVTabular to process.

This PR: given a dataset, extract its stats. General stats like num_rows, num_cats, and num_conts, but also stats per column: max/min/std/% NaNs.
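Those two levels of stats can be sketched as follows, using pandas for illustration (the PR computes the same quantities with dask_cudf); the column lists, the `pct_nans` key, and the sample data here are all illustrative.

```python
# General stats plus per-column max/min/std/% NaNs, pandas stand-in.
import numpy as np
import pandas as pd

df = pd.DataFrame({"cat": ["a", None, "a", "b"], "cont": [1.0, 2.0, np.nan, 4.0]})
cats, conts = ["cat"], ["cont"]

# General stats
stats = {"num_rows": len(df), "num_cats": len(cats), "num_conts": len(conts)}

# Per-column stats for continuous columns
for col in conts:
    stats[col] = {
        "min": df[col].min(),
        "max": df[col].max(),
        "std": df[col].std(),                      # NaNs skipped by default
        "pct_nans": 100 * df[col].isna().mean(),   # fraction of NaNs as a percent
    }
```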

@rjzamora
Collaborator

> This PR: given a dataset, extract its stats. General stats like num_rows, num_cats, and num_conts, but also stats per column: max/min/std/% NaNs.

Perfect!

Contributor

@jperez999 jperez999 left a comment


Please add some unit tests, so that we can ensure any changes we introduce later continue to support the correct behavior of the tool.

data[col]["cardinality"] = ddf[col].nunique().compute()

# Get max/min/mean for cat, cat_mh, and cont
if col_type != "label":
Contributor


What is the reason we don't want to get the min and max for the label? The Rossmann dataset label is a column that represents the number of sales on a given day, so it is not always 0/1. It might be good to have this as well. @EvenOldridge, or am I overthinking this?

Contributor


@albert17 I saw that you did this originally because of the client description break down... but I'm wondering if it wouldn't be good to have this information for multi-label classes.

Contributor Author


If it makes sense, we could calculate it for the label as well. I guess the extra information will not hurt.
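To illustrate the point of the discussion above: a label like Rossmann's daily sales count is numeric rather than binary, so computing its min/max is cheap and informative. A pandas stand-in with made-up values:

```python
# Hypothetical daily-sales label column: not restricted to 0/1,
# so min/max carry real information.
import pandas as pd

label = pd.Series([0, 410, 5263, 0, 3139])  # illustrative values
label_stats = {"min": int(label.min()), "max": int(label.max())}
```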

Comment on lines +135 to +161
data = {}
# Store general info
data["num_rows"] = ddf.shape[0].compute()
data["cats"] = cats
data["cats_mh"] = cats_mh
data["conts"] = conts
data["labels"] = labels

# Get categorical columns stats
for col in cats:
    get_stats(ddf, col, data, "cat")

# Get categorical multihot columns stats
for col in cats_mh:
    get_stats(ddf, col, data, "cat_mh")

# Get continuous columns stats
for col in conts:
    get_stats(ddf, col, data, "cont")

# Get label columns stats
for col in labels:
    get_stats(ddf, col, data, "label")

# Write json file
with fsspec.open(args.output_file, "w") as outfile:
    json.dump(data, outfile, cls=NpEncoder)
Contributor


This should be in a function, as this is the actual "inspect" that is occurring. You should also return the result after writing it; this will allow us to use the generated config on the fly by just taking the return value from the inspect call, e.g.:

# columns_dict is a dictionary of col_type: [cols]
class DatasetInspector():
    # Should also support nvt.dataset
    def inspect(self, <path>, columns_dict):
        ...
        return json


a = DatasetInspector()
json_a = a.inspect(<ds_path/nvt.dataset>, columns_dict)

new_data = DatasetGenerator(json_a)  # creates new dataset based on json config
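The proposed interface might be fleshed out as in the runnable sketch below. It assumes a pandas dataframe input and a plain dict config; path/nvt.Dataset handling, the per-column stats, and the DatasetGenerator round-trip are omitted, and all names are illustrative rather than the PR's final API.

```python
# Minimal DatasetInspector sketch: inspect() builds the config dict,
# optionally writes it to disk, and returns it for on-the-fly reuse.
import json
import pandas as pd

class DatasetInspector:
    def inspect(self, df, columns_dict, output_file=None):
        data = {"num_rows": len(df)}
        # Record the column lists per type (cats/conts/labels, etc.)
        for col_type, cols in columns_dict.items():
            data[col_type] = cols
        # Optionally persist the config as JSON
        if output_file is not None:
            with open(output_file, "w") as outfile:
                json.dump(data, outfile)
        return data  # returned so callers can consume the config directly

df = pd.DataFrame({"cat": ["a", "b"], "cont": [0.1, 0.2], "label": [0, 1]})
config = DatasetInspector().inspect(
    df, {"cats": ["cat"], "conts": ["cont"], "labels": ["label"]}
)
```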

@albert17 albert17 linked an issue Jan 4, 2021 that may be closed by this pull request
@benfred benfred closed this Jan 6, 2021
@benfred benfred deleted the branch NVIDIA-Merlin:new_api January 6, 2021 00:51
@albert17 albert17 mentioned this pull request Jan 8, 2021
5 participants