
Decide what metrics to capture to ASV for merlin-models notebooks #235

Closed · Tracked by #344

@EvenOldridge (Member) opened this issue Apr 25, 2022 · 6 comments
No description provided.

@EvenOldridge (Member, Author)

@bschifferer can you please review the notebooks and share what the standard outputs for these notebooks are? Can we standardize all of the notebooks to a single set of metrics? If not, what are the differences?

@bschifferer (Contributor) commented May 3, 2022

I took a look, and Merlin Models has only a small set of metrics. Merlin Systems and Merlin/Merlin are based on the Merlin Models examples and use the same metrics. Can we discuss this based on the use cases/models?

My proposal is:

All Notebooks:

  • Runtime
  • GPU utilization % (average/peak?)

Training:

  • Throughput (ex/sec)
  • Ranking Metrics (both train and valid):
    -- If binary:
    --- Recall
    --- Precision
    --- AUC
    -- We don't have a multi-class example yet.
  • Retrieval Metrics (both train and valid):
    -- Loss
    -- Recall@10
    -- NDCG@10

Inference:

  • We need to add more tests for it
  • p50, p95, p99 latency
  • Same metrics as for training, so we can check whether inference is deployed correctly (see the sketch below)
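A minimal sketch of how a notebook could collect these numbers, assuming a compiled Keras ranking model and a list of per-request latencies; `model`, `train_ds`, `valid_ds`, `num_train_examples`, and `request_latencies_ms` are placeholder names, not the actual notebook variables:

```python
import time
import numpy as np
import tensorflow as tf

# Ranking metrics for the binary case, attached at compile time.
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[
        tf.keras.metrics.Recall(name="recall"),
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.AUC(name="auc"),
    ],
)

# Runtime and throughput (examples/sec) around training.
start = time.perf_counter()
history = model.fit(train_ds, validation_data=valid_ds, epochs=1)
runtime_s = time.perf_counter() - start
throughput = num_train_examples / runtime_s  # num_train_examples is a placeholder

# Inference latency percentiles from per-request timings in milliseconds.
p50, p95, p99 = np.percentile(request_latencies_ms, [50, 95, 99])
```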

@sararb @gabrielspmoreira do you want to add additional metrics for training? Are there other metrics that would help us understand whether everything is running correctly?

@bschifferer (Contributor) commented May 9, 2022

After our last CI meeting, we want to track only a few metrics for some specific notebooks.

@jperez999 is our CI a single- or multi-GPU environment?

Currently, we use the following datasets in our repositories:

  • MovieLens - too small; not sure whether validation loss/AUC is meaningful
  • AliCCP - good dataset size, but the dataset is quite unbalanced and we do not have an example with high-accuracy results
  • Outbrain - only used in one example and hasn't been updated for a while
  • Rossmann - not a RecSys example

The proposal is to use Criteo: it is a large dataset and it is used in the research community and in perf benchmarks, so there are reference examples with good AUC scores.

Proposal:

  • 7 metrics:
    -- NVTabular runtime
    -- TensorFlow runtime
    -- TensorFlow AUC
    -- PyTorch runtime
    -- PyTorch AUC
    -- Triton p95 latency TensorFlow
    -- Triton p95 latency PyTorch

  • Runtime for the NVTabular workflow - Criteo / 02-ETL-with-NVTabular.ipynb
    -- tracks whether the NVTabular workflow runs efficiently on a large dataset

  • Runtime and validation AUC for the Criteo example with Merlin Models (needs to be built) for TensorFlow
    -- tracks whether the TensorFlow dataloaders are efficient
    -- tracks whether Merlin Models is efficient
    -- tracks whether the training logic is still correct

  • Runtime and validation AUC for the Criteo example for PyTorch
    -- tracks whether the PyTorch dataloaders are efficient
    -- tracks whether the training logic is still correct
    -- needs to be upgraded to Merlin Models when the PyTorch version is available

  **Triton examples need to be defined** (a sketch of how these metrics could be reported to ASV follows below)
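A minimal sketch of how these values might be reported to ASV, assuming the notebook (or a script extracted from it) exposes the measured numbers. ASV discovers `time_*` and `track_*` methods on benchmark classes and uses their return values; the module path and the `run_criteo_tf_training` helper below are hypothetical placeholders:

```python
# benchmarks/criteo_tf.py -- hypothetical ASV benchmark module
import numpy as np


def run_criteo_tf_training():
    # Placeholder: in the real benchmark this would run the notebook
    # (e.g. via papermill) or an extracted script and collect its outputs.
    return {"runtime_s": 0.0, "val_auc": 0.0, "latencies_ms": [0.0]}


class CriteoTensorFlow:
    """Sketch of ASV benchmarks for the Criteo TensorFlow example."""

    timeout = 3600  # full-dataset runs can take a while

    def setup(self):
        self.results = run_criteo_tf_training()

    def track_runtime(self):
        # Total training wall-clock time in seconds.
        return self.results["runtime_s"]
    track_runtime.unit = "seconds"

    def track_auc(self):
        # Validation AUC reported at the end of training.
        return self.results["val_auc"]
    track_auc.unit = "AUC"

    def track_triton_p95_latency(self):
        # p95 latency over recorded per-request timings (milliseconds).
        return float(np.percentile(self.results["latencies_ms"], 95))
    track_triton_p95_latency.unit = "ms"
```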

@bschifferer (Contributor)

How about HugeCTR?

@karlhigley (Contributor)

Not quite sure I understand the Triton part: How do we intend to measure p95 latency? Do we have a way to generate realistic requests that would make p95 meaningful? Or are we just intending to allow for some variance in the latency of serving the same request over and over?
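One way to make p95 meaningful would be to sample distinct requests from held-out data rather than replaying the same request; a rough sketch, where `sample_request_batches` and `send_to_triton` are placeholders for whatever request generation and client code (e.g. tritonclient) the example ends up using:

```python
import time
import numpy as np

latencies_ms = []
# sample_request_batches is a placeholder that yields varied request payloads
# drawn from the validation set, so the latency distribution reflects more
# than the same cached request served over and over.
for batch in sample_request_batches(valid_df, n_requests=1000, batch_size=64):
    start = time.perf_counter()
    send_to_triton(batch)  # placeholder for the actual client call
    latencies_ms.append((time.perf_counter() - start) * 1000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
```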

@bschifferer (Contributor)

I think we decided to collect the metrics here: https://nvidia.slack.com/archives/CVBDJUPEZ/p1679617586380369

We will continue this work in NVIDIA-Merlin/models#1047.
