
[BUG] Cannot load an exported deepfm model with NGC 22.03 inference container #125

Closed
mengdong opened this issue Feb 28, 2022 · 26 comments
Labels: bug (Something isn't working)

mengdong commented Feb 28, 2022

Ran into the following errors:

I0318 00:00:18.082645 172 hugectr.cc:1926] TRITONBACKEND_ModelInstanceInitialize: deepfm_0 (device 0)
I0318 00:00:18.082694 172 hugectr.cc:1566] Triton Model Instance Initialization on device 0
I0318 00:00:18.082792 172 hugectr.cc:1576] Dense Feature buffer allocation:
I0318 00:00:18.083026 172 hugectr.cc:1583] Categorical Feature buffer allocation:
I0318 00:00:18.083095 172 hugectr.cc:1601] Categorical Row Index buffer allocation:
I0318 00:00:18.083143 172 hugectr.cc:1611] Predict result buffer allocation:
I0318 00:00:18.083203 172 hugectr.cc:1939] ******Loading HugeCTR Model******
I0318 00:00:18.083217 172 hugectr.cc:1631] The model origin json configuration file path is: /ensemble_models/deepfm/1/deepfm.json
[HCTR][00:00:18][INFO][RK0][main]: Global seed is 1305961709
[HCTR][00:00:19][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][00:00:19][INFO][RK0][main]: Start all2all warmup
[HCTR][00:00:19][INFO][RK0][main]: End all2all warmup
[HCTR][00:00:19][INFO][RK0][main]: Create inference session on device: 0
[HCTR][00:00:19][INFO][RK0][main]: Model name: deepfm
[HCTR][00:00:19][INFO][RK0][main]: Use mixed precision: False
[HCTR][00:00:19][INFO][RK0][main]: Use cuda graph: True
[HCTR][00:00:19][INFO][RK0][main]: Max batchsize: 64
[HCTR][00:00:19][INFO][RK0][main]: Use I64 input key: True
[HCTR][00:00:19][INFO][RK0][main]: start create embedding for inference
[HCTR][00:00:19][INFO][RK0][main]: sparse_input name data1
[HCTR][00:00:19][INFO][RK0][main]: create embedding for inference success
[HCTR][00:00:19][INFO][RK0][main]: Inference stage skip BinaryCrossEntropyLoss layer, replaced by Sigmoid layer
I0318 00:00:19.826815 172 hugectr.cc:1639] ******Loading HugeCTR model successfully
I0318 00:00:19.827763 172 model_repository_manager.cc:1149] successfully loaded 'deepfm' version 1
E0318 00:00:19.827767 172 model_repository_manager.cc:1152] failed to load 'deepfm_nvt' version 1: Internal: TypeError: 'NoneType' object is not subscriptable

At:
  /ensemble_models/deepfm_nvt/1/model.py(91): _set_output_dtype
  /ensemble_models/deepfm_nvt/1/model.py(76): initialize

E0318 00:00:19.827960 172 model_repository_manager.cc:1332] Invalid argument: ensemble 'deepfm_ens' depends on 'deepfm_nvt' which has no loaded version
I0318 00:00:19.828048 172 server.cc:522]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0318 00:00:19.828117 172 server.cc:549]
+---------+---------------------------------------------------------+-----------------------------------------------+
| Backend | Path                                                    | Config                                        |
+---------+---------------------------------------------------------+-----------------------------------------------+
| hugectr | /opt/tritonserver/backends/hugectr/libtriton_hugectr.so | {"cmdline":{"ps":"/ensemble_models/ps.json"}} |
+---------+---------------------------------------------------------+-----------------------------------------------+

I0318 00:00:19.828209 172 server.cc:592]
+------------+---------+--------------------------------------------------------------------------+
| Model      | Version | Status                                                                   |
+------------+---------+--------------------------------------------------------------------------+
| deepfm     | 1       | READY                                                                    |
| deepfm_nvt | 1       | UNAVAILABLE: Internal: TypeError: 'NoneType' object is not subscriptable |
|            |         |                                                                          |
|            |         | At:                                                                      |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(91): _set_output_dtype          |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(76): initialize                 |
+------------+---------+--------------------------------------------------------------------------+

I0318 00:00:19.845925 172 metrics.cc:623] Collecting metrics for GPU 0: Tesla T4
I0318 00:00:19.846404 172 tritonserver.cc:1932]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                              |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                             |
| server_version                   | 2.19.0                                                                                                                             |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_mem |
|                                  | ory cuda_shared_memory binary_tensor_data statistics trace                                                                         |
| model_repository_path[0]         | /ensemble_models                                                                                                                   |
| model_control_mode               | MODE_NONE                                                                                                                          |
| strict_model_config              | 1                                                                                                                                  |
| rate_limit                       | OFF                                                                                                                                |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                          |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                           |
| response_cache_byte_size         | 0                                                                                                                                  |
| min_supported_compute_capability | 6.0                                                                                                                                |
| strict_readiness                 | 1                                                                                                                                  |
| exit_timeout                     | 30                                                                                                                                 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------+


Aha! Link: https://nvaiinfa.aha.io/features/MERLIN-818

@rnyak (Contributor) commented Mar 7, 2022

@mengdong What functions are you using to generate the config files? NVTabular's export_tensorflow_ensemble? Is this a TF or PyT model? Can you remove max_batch_size in the config file and try again?

@viswa-nvidia viswa-nvidia added the bug Something isn't working label Mar 8, 2022
@albert17 (Contributor)

@mengdong still having this problem in 22.03?

@viswa-nvidia

Two issues: HugeCTR has resolved theirs. For the NVTabular model type error, our plan is to load the model in the latest container and check if the problem persists.

@viswa-nvidia viswa-nvidia added this to the Merlin 22.04 milestone Mar 14, 2022
@mengdong (Author)

Yes, I was able to replicate the error in 22.03.

@mengdong (Author) commented Mar 18, 2022

> @mengdong What functions are you using to generate the config files? NVTabular's export_tensorflow_ensemble? Is this a TF or PyT model? Can you remove max_batch_size in the config file and try again?

We use export_hugectr_ensemble; the HugeCTR model now appears to load fine. I've updated the bug to reflect the latest state.

@karlhigley (Contributor)

This looks to me like an issue with loading information about the NVTabular workflow from the Triton config, and in particular with parsing the data_type field from the config on these lines. Could you share the workflow and exported Triton config (here or on Slack) or a minimal repro so we can debug what's going on?
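
For context, here is a minimal sketch (hypothetical, not the actual NVTabular model.py) of how a config lookup that returns None produces exactly this TypeError. The helper below stands in for the sort of output-by-name lookup Triton's Python backend models do against the parsed config.pbtxt:

# Hypothetical sketch of the failure mode; not the actual NVTabular code.
import json

def get_output_config_by_name(model_config: dict, name: str):
    # Returns the matching output block, or None if it isn't configured.
    return next(
        (o for o in model_config.get("output", []) if o["name"] == name),
        None,
    )

def _set_output_dtype(model_config: dict, name: str):
    output_config = get_output_config_by_name(model_config, name)
    # If the workflow's expected output name doesn't appear in the Triton
    # config, output_config is None, and subscripting it raises
    # TypeError: 'NoneType' object is not subscriptable.
    return output_config["data_type"]

model_config = json.loads('{"output": [{"name": "OUT0", "data_type": "TYPE_FP32"}]}')
_set_output_dtype(model_config, "DES")  # raises the TypeError seen above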

@karlhigley karlhigley self-assigned this Mar 18, 2022
@mengdong (Author)

I have shared the exported model (including the Triton config) on Slack. The repro is a bit complicated; let me know if you still need it.

@karlhigley karlhigley changed the title Cannot load an exported deepfm model with NGC 22.02 inference container [BUG] Cannot load an exported deepfm model with NGC 22.02 inference container Mar 22, 2022
@mengdong mengdong changed the title [BUG] Cannot load an exported deepfm model with NGC 22.02 inference container [BUG] Cannot load an exported deepfm model with NGC 22.03 inference container Mar 22, 2022
karlhigley added a commit to karlhigley/NVTabular that referenced this issue Mar 22, 2022
Since HugeCTR always expects the same three fields, we don't have to consult the `Workflow`'s output schema to determine the dtypes. We can just hard-code them.

Partially addresses NVIDIA-Merlin/Merlin#125
karlhigley added a commit to NVIDIA-Merlin/NVTabular that referenced this issue Mar 22, 2022
Since HugeCTR always expects the same three fields, we don't have to consult the `Workflow`'s output schema to determine the dtypes. We can just hard-code them.

Partially addresses NVIDIA-Merlin/Merlin#125
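
For reference, a minimal sketch of the idea in that commit message. The three field names and dtypes below follow the HugeCTR backend's input convention (DES/CATCOLUMN/ROWINDEX) as I understand it; treat them as assumptions to verify against your backend version:

import numpy as np

# Sketch of the fix's approach: skip the Workflow output schema and pin
# the dtypes of the three fields the HugeCTR backend always expects.
# Names/dtypes are assumptions based on the hugectr_backend convention.
HUGECTR_DTYPES = {
    "DES": np.float32,      # dense features
    "CATCOLUMN": np.int64,  # categorical keys ("Use I64 input key: True")
    "ROWINDEX": np.int32,   # row offsets into CATCOLUMN
}

def _set_output_dtype(name: str) -> np.dtype:
    # No config.pbtxt lookup, so a missing or renamed output can no
    # longer surface as a None subscript at initialization time.
    return np.dtype(HUGECTR_DTYPES[name])
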
@karlhigley (Contributor)

We just merged a fix for the issue that was occurring when trying to load the NVT workflow in Triton. I think there's another issue here, though: some configuration properties moved out of the DeepFM model's config.pbtxt (for Triton) and into a separate ps.json file (for parameter server configuration). That file is missing from the ensemble provided for debugging, which is preventing the DeepFM model itself from being loaded successfully by Triton.

@zehuanw @minseokl Could you assign someone to this issue who's familiar with the parameter server config file and how to create one?

(cc @EvenOldridge)

@yingcanw (Contributor)

Since the HugeCTR backend has added more new features in the past few releases, I suggest that you manually create a ps.json file. For details, please refer to https://github.com/triton-inference-server/hugectr_backend#independent-inference-hierarchical-parameter-server-configuration
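
For illustration, a minimal ps.json in the shape described by that README. JSON carries no comments, so note here that every key name, path, and value below is a placeholder to be checked against the hugectr_backend docs for your release (only the network_file path and max_batch_size are taken from the logs in this thread):

{
  "supportlonglong": true,
  "models": [
    {
      "model": "deepfm",
      "sparse_files": ["/ensemble_models/deepfm/1/deepfm0_sparse_model"],
      "dense_file": "/ensemble_models/deepfm/1/deepfm_dense_model",
      "network_file": "/ensemble_models/deepfm/1/deepfm.json",
      "max_batch_size": 64,
      "deployed_device_list": [0]
    }
  ]
}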

@mengdong (Author)

Thanks, HugeCTR is not an issue for this bug. I manually created ps.json and it worked.

@karlhigley (Contributor)

I'm going to track that part of the issue here and close this one, but I don't think a Triton ensemble creation process that requires our customers to manually create a config file and place it in the exported Triton model repo directory is very user friendly. 😕

@EvenOldridge (Member)

@yingcanw @zehuanw @jconwayNV I'm with @karlhigley on this. We need to move away from manually creating json files as a part of our config.

@yingcanw (Contributor)

@karlhigley @EvenOldridge @zehuanw The ps.json was added manually just to verify whether the issue was caused by missing key parameters in ps.json. I remember that ps.json was always generated manually in the NVT & HugeCTR ensemble mode before, regardless of whether the Triton config was generated automatically. If I understand this correctly, you can just add the logic to generate ps.json automatically here:
https://github.com/NVIDIA-Merlin/NVTabular/blob/c1ef698212c75203909d6c16e70e0e9236ea7d62/nvtabular/inference/triton/ensemble.py#L613
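
A sketch of what that could look like (a hypothetical helper, not actual ensemble.py code; the ps.json keys mirror the placeholder example above and should be verified against the backend docs):

import json
import os

def export_ps_config(name, model_repository_path, version=1,
                     max_batch_size=64, device_list=None):
    # Hypothetical helper: write a minimal ps.json next to the exported
    # ensemble so users don't have to hand-author it. Key names follow
    # the hugectr_backend README and may differ across releases.
    version_dir = os.path.join(model_repository_path, name, str(version))
    ps_config = {
        "supportlonglong": True,  # the model uses I64 input keys
        "models": [{
            "model": name,
            "sparse_files": [os.path.join(version_dir, f"{name}0_sparse_model")],
            "dense_file": os.path.join(version_dir, f"{name}_dense_model"),
            "network_file": os.path.join(version_dir, f"{name}.json"),
            "max_batch_size": max_batch_size,
            "deployed_device_list": device_list or [0],
        }],
    }
    with open(os.path.join(model_repository_path, "ps.json"), "w") as f:
        json.dump(ps_config, f, indent=2)

export_ps_config("deepfm", "/ensemble_models")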

@mengdong (Author)

I would keep this bug open, as NVTabular still errors when loading the model.

@mengdong (Author)

And the error indicates an NVTabular problem, not a HugeCTR one.

@viswa-nvidia

The bug is marked done, but the SA wasn't able to verify the fix that was merged to main. A couple of items here:
1. The nightly container with the fix is unavailable. Alberto said CI is broken; we need to know where we stand.
2. The SA pip-reinstalled from the main branch after the fix was merged, but wasn't able to verify the fix. We need Karl's comment on how the fix can be verified, or whether it needs more work to resolve the issue (maybe from the HCTR team, as he commented in the bug).

@karlhigley @albert17 , I am reopening this bug. Please review these comments. Thanks.

@viswa-nvidia viswa-nvidia reopened this Mar 29, 2022
@EvenOldridge (Member)

Can you give the version you're having this issue with, please, @mengdong, and clarify what issue you're having? You'd previously posted:

> Thanks, HugeCTR is not an issue for this bug. I manually created ps.json and it worked.

What worked? And what do you expect to work now that doesn't that you're flagging in this issue?

@mengdong (Author)

Initially, the bug contained two errors: one in HugeCTR, one in NVTabular.

With the manually created ps.json, the HugeCTR model worked. The NVTabular error still stands with the latest NVTabular pip installation.

Per the current description of the bug, the error message shows:

| deepfm_nvt | 1       | UNAVAILABLE: Internal: TypeError: 'NoneType' object is not subscriptable |
|            |         |                                                                          |
|            |         | At:                                                                      |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(91): _set_output_dtype          |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(76): initialize                 |
+------------+---------+--------------------------------------------------------------------------+

The environment in which I reproduce the bug is the nightly Merlin inference container (where I reinstall NVTabular from the tip of the main branch).

@karlhigley (Contributor)

@mengdong Did you re-run the existing export with the new version of NVTabular or re-export the ensemble with the new version? You'll have to re-export the ensemble with the latest NVTabular to see a difference, since updating NVT doesn't change the code in exported Python models. If you've done that and it's still not working, let me know.
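
For anyone following along, a hedged sketch of the re-export step with export_hugectr_ensemble, loosely following the Merlin HugeCTR examples of the time; all paths, column names, and hugectr_params values below are illustrative and should be checked against your NVTabular version:

# Illustrative re-export; argument names/values are assumptions to verify.
import nvtabular as nvt
from nvtabular.inference.triton import export_hugectr_ensemble

workflow = nvt.Workflow.load("/model/workflow")  # placeholder path

hugectr_params = {
    "config": "/ensemble_models/deepfm/1/deepfm.json",  # network JSON
    "slots": 26,                  # placeholder values below
    "max_nnz": 2,
    "embedding_vector_size": 16,
    "n_outputs": 1,
}

export_hugectr_ensemble(
    workflow=workflow,
    hugectr_model_path="/model/deepfm/1/",  # trained HugeCTR model files
    hugectr_params=hugectr_params,
    name="deepfm",
    output_path="/ensemble_models",
    label_columns=["label"],
    cats=["C" + str(i) for i in range(1, 27)],  # placeholder column names
    conts=["I" + str(i) for i in range(1, 14)],
    max_batch_size=64,
)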

@mengdong (Author)

Thanks Karl. This makes sense. Let me give it another try.

@mengdong (Author) commented Mar 31, 2022

Hello @karlhigley, here is what I have done:
1. Reinstall NVTabular in the 22.03 NGC Merlin training container and export the ensemble model.
2. Reinstall NVTabular in the 22.03 NGC Merlin nightly inference container and run Triton to serve the model.

Error message:

I0331 20:55:12.749189 348 server.cc:549] 
+---------+---------------------------------------------------------+-----------------------------------------------+
| Backend | Path                                                    | Config                                        |
+---------+---------------------------------------------------------+-----------------------------------------------+
| hugectr | /opt/tritonserver/backends/hugectr/libtriton_hugectr.so | {"cmdline":{"ps":"/ensemble_models/ps.json"}} |
+---------+---------------------------------------------------------+-----------------------------------------------+

I0331 20:55:12.749500 348 server.cc:592] 
+------------+---------+--------------------------------------------------------------------------------+
| Model      | Version | Status                                                                         |
+------------+---------+--------------------------------------------------------------------------------+
| deepfm     | 1       | READY                                                                          |
| deepfm_nvt | 1       | UNAVAILABLE: Internal: ModuleNotFoundError: No module named 'merlin.io.worker' |
|            |         |                                                                                |
|            |         | At:                                                                            |
|            |         |   /nvtabular/nvtabular/ops/categorify.py(41): <module>                         |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap_external>(848): exec_module                     |
|            |         |   <frozen importlib._bootstrap>(686): _load_unlocked                           |
|            |         |   <frozen importlib._bootstrap>(975): _find_and_load_unlocked                  |
|            |         |   <frozen importlib._bootstrap>(991): _find_and_load                           |
|            |         |   /nvtabular/nvtabular/ops/__init__.py(29): <module>                           |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap_external>(848): exec_module                     |
|            |         |   <frozen importlib._bootstrap>(686): _load_unlocked                           |
|            |         |   <frozen importlib._bootstrap>(975): _find_and_load_unlocked                  |
|            |         |   <frozen importlib._bootstrap>(991): _find_and_load                           |
|            |         |   /nvtabular/nvtabular/workflow/node.py(17): <module>                          |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap_external>(848): exec_module                     |
|            |         |   <frozen importlib._bootstrap>(686): _load_unlocked                           |
|            |         |   <frozen importlib._bootstrap>(975): _find_and_load_unlocked                  |
|            |         |   <frozen importlib._bootstrap>(991): _find_and_load                           |
|            |         |   /nvtabular/nvtabular/workflow/__init__.py(18): <module>                      |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap_external>(848): exec_module                     |
|            |         |   <frozen importlib._bootstrap>(686): _load_unlocked                           |
|            |         |   <frozen importlib._bootstrap>(975): _find_and_load_unlocked                  |
|            |         |   <frozen importlib._bootstrap>(991): _find_and_load                           |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap>(1050): _handle_fromlist                        |
|            |         |   /nvtabular/nvtabular/__init__.py(25): <module>                               |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap_external>(848): exec_module                     |
|            |         |   <frozen importlib._bootstrap>(686): _load_unlocked                           |
|            |         |   <frozen importlib._bootstrap>(975): _find_and_load_unlocked                  |
|            |         |   <frozen importlib._bootstrap>(991): _find_and_load                           |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(41): <module>                         |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap_external>(848): exec_module                     |
|            |         |   <frozen importlib._bootstrap>(686): _load_unlocked                           |
|            |         |   <frozen importlib._bootstrap>(975): _find_and_load_unlocked                  |
|            |         |   <frozen importlib._bootstrap>(991): _find_and_load                           |
+------------+---------+--------------------------------------------------------------------------------+

@mengdong (Author) commented Mar 31, 2022

If I omit the few latest commits and check out the snapshot with your fix (git checkout 4196e6a4323738a798fbd5e84e81ea1b53fc05ae), I see the same error. I wonder if I should create the workflow from scratch; that is the only thing left to try now. Did you test my model, and does it work for you?

I0331 21:02:14.652909 684 server.cc:549]
+---------+---------------------------------------------------------+-----------------------------------------------+
| Backend | Path                                                    | Config                                        |
+---------+---------------------------------------------------------+-----------------------------------------------+
| hugectr | /opt/tritonserver/backends/hugectr/libtriton_hugectr.so | {"cmdline":{"ps":"/ensemble_models/ps.json"}} |
+---------+---------------------------------------------------------+-----------------------------------------------+

I0331 21:02:14.653137 684 server.cc:592]
+------------+---------+--------------------------------------------------------------------------+
| Model      | Version | Status                                                                   |
+------------+---------+--------------------------------------------------------------------------+
| deepfm     | 1       | READY                                                                    |
| deepfm_nvt | 1       | UNAVAILABLE: Internal: TypeError: 'NoneType' object is not subscriptable |
|            |         |                                                                          |
|            |         | At:                                                                      |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(91): _set_output_dtype          |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(76): initialize                 |
+------------+---------+--------------------------------------------------------------------------+

@karlhigley (Contributor)

Using the ensemble provided in merlin-ensemble.zip, I am able to replicate the error reported above with the latest version of NVTabular from the main branch. If I replace triton-ensemble-20220317011121/deepfm_nvt/1/model.py with nvtabular/inference/triton/workflow_model.py (which is what should be included in a re-export), I am able to successfully load the deepfm_nvt model without encountering the error.
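
In script form, the manual workaround amounts to something like this (paths are the ones mentioned in this thread; adjust for your checkout):

# Replace the exported Python model with the current workflow_model.py,
# mimicking what a re-export with the fixed NVTabular would produce.
import shutil

shutil.copy(
    "nvtabular/inference/triton/workflow_model.py",
    "triton-ensemble-20220317011121/deepfm_nvt/1/model.py",
)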

@bschifferer (Contributor)

I tested the Criteo HugeCTR Inference Example and it worked for me.

@viswa-nvidia

@mengdong, can we close this issue? @sohn21c for viz.

@mengdong (Author)

mengdong commented Apr 28, 2022 via email
