
[BUG] Cannot load an exported deepfm model with NGC 22.03 inference container #125

Closed
mengdong opened this issue Feb 28, 2022 · 26 comments
Labels: bug (Something isn't working)

mengdong commented Feb 28, 2022

Ran into the following errors:

I0318 00:00:18.082645 172 hugectr.cc:1926] TRITONBACKEND_ModelInstanceInitialize: deepfm_0 (device 0)
I0318 00:00:18.082694 172 hugectr.cc:1566] Triton Model Instance Initialization on device 0
I0318 00:00:18.082792 172 hugectr.cc:1576] Dense Feature buffer allocation:
I0318 00:00:18.083026 172 hugectr.cc:1583] Categorical Feature buffer allocation:
I0318 00:00:18.083095 172 hugectr.cc:1601] Categorical Row Index buffer allocation:
I0318 00:00:18.083143 172 hugectr.cc:1611] Predict result buffer allocation:
I0318 00:00:18.083203 172 hugectr.cc:1939] ******Loading HugeCTR Model******
I0318 00:00:18.083217 172 hugectr.cc:1631] The model origin json configuration file path is: /ensemble_models/deepfm/1/deepfm.json
[HCTR][00:00:18][INFO][RK0][main]: Global seed is 1305961709
[HCTR][00:00:19][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][00:00:19][INFO][RK0][main]: Start all2all warmup
[HCTR][00:00:19][INFO][RK0][main]: End all2all warmup
[HCTR][00:00:19][INFO][RK0][main]: Create inference session on device: 0
[HCTR][00:00:19][INFO][RK0][main]: Model name: deepfm
[HCTR][00:00:19][INFO][RK0][main]: Use mixed precision: False
[HCTR][00:00:19][INFO][RK0][main]: Use cuda graph: True
[HCTR][00:00:19][INFO][RK0][main]: Max batchsize: 64
[HCTR][00:00:19][INFO][RK0][main]: Use I64 input key: True
[HCTR][00:00:19][INFO][RK0][main]: start create embedding for inference
[HCTR][00:00:19][INFO][RK0][main]: sparse_input name data1
[HCTR][00:00:19][INFO][RK0][main]: create embedding for inference success
[HCTR][00:00:19][INFO][RK0][main]: Inference stage skip BinaryCrossEntropyLoss layer, replaced by Sigmoid layer
I0318 00:00:19.826815 172 hugectr.cc:1639] ******Loading HugeCTR model successfully
I0318 00:00:19.827763 172 model_repository_manager.cc:1149] successfully loaded 'deepfm' version 1
E0318 00:00:19.827767 172 model_repository_manager.cc:1152] failed to load 'deepfm_nvt' version 1: Internal: TypeError: 'NoneType' object is not subscriptable

At:
  /ensemble_models/deepfm_nvt/1/model.py(91): _set_output_dtype
  /ensemble_models/deepfm_nvt/1/model.py(76): initialize

E0318 00:00:19.827960 172 model_repository_manager.cc:1332] Invalid argument: ensemble 'deepfm_ens' depends on 'deepfm_nvt' which has no loaded version
I0318 00:00:19.828048 172 server.cc:522]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0318 00:00:19.828117 172 server.cc:549]
+---------+---------------------------------------------------------+-----------------------------------------------+
| Backend | Path                                                    | Config                                        |
+---------+---------------------------------------------------------+-----------------------------------------------+
| hugectr | /opt/tritonserver/backends/hugectr/libtriton_hugectr.so | {"cmdline":{"ps":"/ensemble_models/ps.json"}} |
+---------+---------------------------------------------------------+-----------------------------------------------+

I0318 00:00:19.828209 172 server.cc:592]
+------------+---------+--------------------------------------------------------------------------+
| Model      | Version | Status                                                                   |
+------------+---------+--------------------------------------------------------------------------+
| deepfm     | 1       | READY                                                                    |
| deepfm_nvt | 1       | UNAVAILABLE: Internal: TypeError: 'NoneType' object is not subscriptable |
|            |         |                                                                          |
|            |         | At:                                                                      |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(91): _set_output_dtype          |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(76): initialize                 |
+------------+---------+--------------------------------------------------------------------------+

I0318 00:00:19.845925 172 metrics.cc:623] Collecting metrics for GPU 0: Tesla T4
I0318 00:00:19.846404 172 tritonserver.cc:1932]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                              |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                             |
| server_version                   | 2.19.0                                                                                                                             |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_mem |
|                                  | ory cuda_shared_memory binary_tensor_data statistics trace                                                                         |
| model_repository_path[0]         | /ensemble_models                                                                                                                   |
| model_control_mode               | MODE_NONE                                                                                                                          |
| strict_model_config              | 1                                                                                                                                  |
| rate_limit                       | OFF                                                                                                                                |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                          |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                           |
| response_cache_byte_size         | 0                                                                                                                                  |
| min_supported_compute_capability | 6.0                                                                                                                                |
| strict_readiness                 | 1                                                                                                                                  |
| exit_timeout                     | 30                                                                                                                                 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------+


Aha! Link: https://nvaiinfa.aha.io/features/MERLIN-818

@rnyak (Contributor) commented Mar 7, 2022

@mengdong What functions are you using to generate the config files? NVTabular's export_tensorflow_ensemble? Is this a TF or PyT model? Can you remove max_batch_size in the config file and try again?

@viswa-nvidia viswa-nvidia added the bug Something isn't working label Mar 8, 2022
@albert17 (Contributor)

@mengdong still having this problem in 22.03?

@viswa-nvidia

Two issues: HugeCTR has resolved theirs. For the NVTabular model type error, our plan is to load the model in the latest container and check if the problem persists.

@viswa-nvidia viswa-nvidia added this to the Merlin 22.04 milestone Mar 14, 2022
@mengdong (Author)

Yes, I was able to replicate the error in 22.03.

@mengdong (Author) commented Mar 18, 2022

> @mengdong What functions are you using to generate the config files? NVTabular's export_tensorflow_ensemble? Is this a TF or PyT model? Can you remove max_batch_size in the config file and try again?

We use export_hugectr_ensemble; the HugeCTR model now appears to load fine. I've updated the bug to reflect the latest state.

@karlhigley (Contributor)

This looks to me like an issue with loading information about the NVTabular workflow from the Triton config, and in particular with parsing the data_type field from the config on these lines. Could you share the workflow and exported Triton config (here or on Slack) or a minimal repro so we can debug what's going on?
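
For context, here is a minimal sketch (hypothetical, not the actual NVTabular model.py) of how a config lookup that returns None produces exactly this TypeError. The helper below stands in for the sort of output-by-name lookup Triton's Python backend models do against the parsed config.pbtxt:

# Hypothetical sketch of the failure mode; not the actual NVTabular code.
import json

def get_output_config_by_name(model_config: dict, name: str):
    # Returns the matching output block, or None if it isn't configured.
    return next(
        (o for o in model_config.get("output", []) if o["name"] == name),
        None,
    )

def _set_output_dtype(model_config: dict, name: str):
    output_config = get_output_config_by_name(model_config, name)
    # If the workflow's expected output name doesn't appear in the Triton
    # config, output_config is None, and subscripting it raises
    # TypeError: 'NoneType' object is not subscriptable.
    return output_config["data_type"]

model_config = json.loads('{"output": [{"name": "OUT0", "data_type": "TYPE_FP32"}]}')
_set_output_dtype(model_config, "DES")  # raises the TypeError seen above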

@karlhigley karlhigley self-assigned this Mar 18, 2022
@mengdong (Author)

I have shared the exported model (including the Triton config) on Slack. The repro is a bit complicated; let me know if you still need it.

@karlhigley karlhigley changed the title Cannot load an exported deepfm model with NGC 22.02 inference container [BUG] Cannot load an exported deepfm model with NGC 22.02 inference container Mar 22, 2022
@mengdong mengdong changed the title [BUG] Cannot load an exported deepfm model with NGC 22.02 inference container [BUG] Cannot load an exported deepfm model with NGC 22.03 inference container Mar 22, 2022
karlhigley added a commit to karlhigley/NVTabular that referenced this issue Mar 22, 2022
Since HugeCTR always expects the same three fields, we don't have to consult the `Workflow`'s output schema to determine the dtypes. We can just hard-code them.

Partially addresses NVIDIA-Merlin/Merlin#125
karlhigley added a commit to NVIDIA-Merlin/NVTabular that referenced this issue Mar 22, 2022
Since HugeCTR always expects the same three fields, we don't have to consult the `Workflow`'s output schema to determine the dtypes. We can just hard-code them.

Partially addresses NVIDIA-Merlin/Merlin#125
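
For reference, a minimal sketch of the idea in that commit message. The three field names and dtypes below follow the HugeCTR backend's input convention (DES/CATCOLUMN/ROWINDEX) as I understand it; treat them as assumptions to verify against your backend version:

import numpy as np

# Sketch of the fix's approach: skip the Workflow output schema and pin
# the dtypes of the three fields the HugeCTR backend always expects.
# Names/dtypes are assumptions based on the hugectr_backend convention.
HUGECTR_DTYPES = {
    "DES": np.float32,      # dense features
    "CATCOLUMN": np.int64,  # categorical keys ("Use I64 input key: True")
    "ROWINDEX": np.int32,   # row offsets into CATCOLUMN
}

def _set_output_dtype(name: str) -> np.dtype:
    # No config.pbtxt lookup, so a missing or renamed output can no
    # longer surface as a None subscript at initialization time.
    return np.dtype(HUGECTR_DTYPES[name])
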
@karlhigley (Contributor)

We just merged a fix for the issue that was occurring when trying to load the NVT workflow in Triton. I think there's another issue here, though: some configuration properties moved out of the DeepFM model's config.pbtxt (for Triton) and into a separate ps.json file (for parameter server configuration). That file is missing from the ensemble provided for debugging, which is preventing the DeepFM model itself from being loaded successfully by Triton.

@zehuanw @minseokl Could you assign someone to this issue who's familiar with the parameter server config file and how to create one?

(cc @EvenOldridge)

@yingcanw (Contributor)

Since the HugeCTR backend has added more new features in the past few releases, I suggest that you manually create a ps.json file. For details, please refer to https://github.com/triton-inference-server/hugectr_backend#independent-inference-hierarchical-parameter-server-configuration
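
For illustration, a minimal ps.json in the shape described by that README. JSON carries no comments, so note here that every key name, path, and value below is a placeholder to be checked against the hugectr_backend docs for your release (only the network_file path and max_batch_size are taken from the logs in this thread):

{
  "supportlonglong": true,
  "models": [
    {
      "model": "deepfm",
      "sparse_files": ["/ensemble_models/deepfm/1/deepfm0_sparse_model"],
      "dense_file": "/ensemble_models/deepfm/1/deepfm_dense_model",
      "network_file": "/ensemble_models/deepfm/1/deepfm.json",
      "max_batch_size": 64,
      "deployed_device_list": [0]
    }
  ]
}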

@mengdong (Author)

Thanks, HugeCTR is not an issue for this bug. I manually created ps.json and it worked.

@karlhigley (Contributor)

I'm going to track that part of the issue here and close this one, but I don't think a Triton ensemble creation process that requires our customers to manually create a config file and place it in the exported Triton model repo directory is very user friendly. 😕

@EvenOldridge (Member)

@yingcanw @zehuanw @jconwayNV I'm with @karlhigley on this. We need to move away from manually creating json files as a part of our config.

@yingcanw (Contributor)

@karlhigley @EvenOldridge @zehuanw The ps.json was added manually just to verify whether the issue was caused by missing key parameters in ps.json. I remember that ps.json was always generated manually in the NVT & HugeCTR ensemble mode before, regardless of whether the Triton config was generated automatically. If I understand this correctly, you can just add the logic to generate ps.json automatically here:
https://github.com/NVIDIA-Merlin/NVTabular/blob/c1ef698212c75203909d6c16e70e0e9236ea7d62/nvtabular/inference/triton/ensemble.py#L613
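
A sketch of what that could look like (a hypothetical helper, not actual ensemble.py code; the ps.json keys mirror the placeholder example above and should be verified against the backend docs):

import json
import os

def export_ps_config(name, model_repository_path, version=1,
                     max_batch_size=64, device_list=None):
    # Hypothetical helper: write a minimal ps.json next to the exported
    # ensemble so users don't have to hand-author it. Key names follow
    # the hugectr_backend README and may differ across releases.
    version_dir = os.path.join(model_repository_path, name, str(version))
    ps_config = {
        "supportlonglong": True,  # the model uses I64 input keys
        "models": [{
            "model": name,
            "sparse_files": [os.path.join(version_dir, f"{name}0_sparse_model")],
            "dense_file": os.path.join(version_dir, f"{name}_dense_model"),
            "network_file": os.path.join(version_dir, f"{name}.json"),
            "max_batch_size": max_batch_size,
            "deployed_device_list": device_list or [0],
        }],
    }
    with open(os.path.join(model_repository_path, "ps.json"), "w") as f:
        json.dump(ps_config, f, indent=2)

export_ps_config("deepfm", "/ensemble_models")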

@mengdong (Author)

I would keep this bug open, as NVTabular still errors when loading the model.

@mengdong (Author)

And the error indicates an NVTabular problem, not a HugeCTR one.

@viswa-nvidia

The bug is marked done, but the SA wasn't able to verify the fix that was merged to main. A couple of items here:
1. The nightly container with the fix is unavailable. Alberto said CI is broken; we need to know where we stand.
2. The SA pip-reinstalled from the main branch after the fix was merged, but wasn't able to verify the fix. We need Karl's comment on how the fix can be verified, or whether it needs more work to resolve the issue (maybe from the HCTR team, as he commented in the bug).

@karlhigley @albert17 , I am reopening this bug. Please review these comments. Thanks.

@viswa-nvidia viswa-nvidia reopened this Mar 29, 2022
@EvenOldridge (Member)

Can you give the version you're having this issue with, please, @mengdong, and clarify what issue you're having? You'd previously posted:

> Thanks, HugeCTR is not an issue for this bug. I manually created ps.json and it worked.

What worked? And what do you expect to work now that doesn't that you're flagging in this issue?

@mengdong (Author)

Initially, the bug contained two errors: one in HugeCTR, one in NVTabular.

With the manually created ps.json, the HugeCTR model worked. The NVTabular error still stands with the latest NVTabular pip installation.

Per the current description of the bug, the error message shows:

| deepfm_nvt | 1       | UNAVAILABLE: Internal: TypeError: 'NoneType' object is not subscriptable |
|            |         |                                                                          |
|            |         | At:                                                                      |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(91): _set_output_dtype          |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(76): initialize                 |
+------------+---------+--------------------------------------------------------------------------+

The environment in which I reproduce the bug is the nightly Merlin inference container (where I reinstall NVTabular from the tip of the main branch).

@karlhigley (Contributor)

@mengdong Did you re-run the existing export with the new version of NVTabular or re-export the ensemble with the new version? You'll have to re-export the ensemble with the latest NVTabular to see a difference, since updating NVT doesn't change the code in exported Python models. If you've done that and it's still not working, let me know.
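
For anyone following along, a hedged sketch of the re-export step with export_hugectr_ensemble, loosely following the Merlin HugeCTR examples of the time; all paths, column names, and hugectr_params values below are illustrative and should be checked against your NVTabular version:

# Illustrative re-export; argument names/values are assumptions to verify.
import nvtabular as nvt
from nvtabular.inference.triton import export_hugectr_ensemble

workflow = nvt.Workflow.load("/model/workflow")  # placeholder path

hugectr_params = {
    "config": "/ensemble_models/deepfm/1/deepfm.json",  # network JSON
    "slots": 26,                  # placeholder values below
    "max_nnz": 2,
    "embedding_vector_size": 16,
    "n_outputs": 1,
}

export_hugectr_ensemble(
    workflow=workflow,
    hugectr_model_path="/model/deepfm/1/",  # trained HugeCTR model files
    hugectr_params=hugectr_params,
    name="deepfm",
    output_path="/ensemble_models",
    label_columns=["label"],
    cats=["C" + str(i) for i in range(1, 27)],  # placeholder column names
    conts=["I" + str(i) for i in range(1, 14)],
    max_batch_size=64,
)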

@mengdong (Author)

Thanks Karl. This makes sense. Let me give it another try.

@mengdong (Author) commented Mar 31, 2022

Hello @karlhigley, here is what I have done:
1. Reinstall NVTabular in the 22.03 NGC Merlin training container and export the ensemble model.
2. Reinstall NVTabular in the 22.03 NGC Merlin nightly inference container and run Triton to serve the model.

Error message:

I0331 20:55:12.749189 348 server.cc:549] 
+---------+---------------------------------------------------------+-----------------------------------------------+
| Backend | Path                                                    | Config                                        |
+---------+---------------------------------------------------------+-----------------------------------------------+
| hugectr | /opt/tritonserver/backends/hugectr/libtriton_hugectr.so | {"cmdline":{"ps":"/ensemble_models/ps.json"}} |
+---------+---------------------------------------------------------+-----------------------------------------------+

I0331 20:55:12.749500 348 server.cc:592] 
+------------+---------+--------------------------------------------------------------------------------+
| Model      | Version | Status                                                                         |
+------------+---------+--------------------------------------------------------------------------------+
| deepfm     | 1       | READY                                                                          |
| deepfm_nvt | 1       | UNAVAILABLE: Internal: ModuleNotFoundError: No module named 'merlin.io.worker' |
|            |         |                                                                                |
|            |         | At:                                                                            |
|            |         |   /nvtabular/nvtabular/ops/categorify.py(41): <module>                         |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap_external>(848): exec_module                     |
|            |         |   <frozen importlib._bootstrap>(686): _load_unlocked                           |
|            |         |   <frozen importlib._bootstrap>(975): _find_and_load_unlocked                  |
|            |         |   <frozen importlib._bootstrap>(991): _find_and_load                           |
|            |         |   /nvtabular/nvtabular/ops/__init__.py(29): <module>                           |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap_external>(848): exec_module                     |
|            |         |   <frozen importlib._bootstrap>(686): _load_unlocked                           |
|            |         |   <frozen importlib._bootstrap>(975): _find_and_load_unlocked                  |
|            |         |   <frozen importlib._bootstrap>(991): _find_and_load                           |
|            |         |   /nvtabular/nvtabular/workflow/node.py(17): <module>                          |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap_external>(848): exec_module                     |
|            |         |   <frozen importlib._bootstrap>(686): _load_unlocked                           |
|            |         |   <frozen importlib._bootstrap>(975): _find_and_load_unlocked                  |
|            |         |   <frozen importlib._bootstrap>(991): _find_and_load                           |
|            |         |   /nvtabular/nvtabular/workflow/__init__.py(18): <module>                      |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap_external>(848): exec_module                     |
|            |         |   <frozen importlib._bootstrap>(686): _load_unlocked                           |
|            |         |   <frozen importlib._bootstrap>(975): _find_and_load_unlocked                  |
|            |         |   <frozen importlib._bootstrap>(991): _find_and_load                           |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap>(1050): _handle_fromlist                        |
|            |         |   /nvtabular/nvtabular/__init__.py(25): <module>                               |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap_external>(848): exec_module                     |
|            |         |   <frozen importlib._bootstrap>(686): _load_unlocked                           |
|            |         |   <frozen importlib._bootstrap>(975): _find_and_load_unlocked                  |
|            |         |   <frozen importlib._bootstrap>(991): _find_and_load                           |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(41): <module>                         |
|            |         |   <frozen importlib._bootstrap>(219): _call_with_frames_removed                |
|            |         |   <frozen importlib._bootstrap_external>(848): exec_module                     |
|            |         |   <frozen importlib._bootstrap>(686): _load_unlocked                           |
|            |         |   <frozen importlib._bootstrap>(975): _find_and_load_unlocked                  |
|            |         |   <frozen importlib._bootstrap>(991): _find_and_load                           |
+------------+---------+--------------------------------------------------------------------------------+

@mengdong (Author) commented Mar 31, 2022

If I omit the few latest commits and check out the snapshot with your fix (git checkout 4196e6a4323738a798fbd5e84e81ea1b53fc05ae), I see the same error. I wonder if I should create the workflow from scratch; that is the only thing left to try now. Did you test my model, and does it work for you?

I0331 21:02:14.652909 684 server.cc:549]
+---------+---------------------------------------------------------+-----------------------------------------------+
| Backend | Path                                                    | Config                                        |
+---------+---------------------------------------------------------+-----------------------------------------------+
| hugectr | /opt/tritonserver/backends/hugectr/libtriton_hugectr.so | {"cmdline":{"ps":"/ensemble_models/ps.json"}} |
+---------+---------------------------------------------------------+-----------------------------------------------+

I0331 21:02:14.653137 684 server.cc:592]
+------------+---------+--------------------------------------------------------------------------+
| Model      | Version | Status                                                                   |
+------------+---------+--------------------------------------------------------------------------+
| deepfm     | 1       | READY                                                                    |
| deepfm_nvt | 1       | UNAVAILABLE: Internal: TypeError: 'NoneType' object is not subscriptable |
|            |         |                                                                          |
|            |         | At:                                                                      |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(91): _set_output_dtype          |
|            |         |   /ensemble_models/deepfm_nvt/1/model.py(76): initialize                 |
+------------+---------+--------------------------------------------------------------------------+

@karlhigley (Contributor)

Using the ensemble provided in merlin-ensemble.zip, I am able to replicate the error reported above with the latest version of NVTabular from the main branch. If I replace triton-ensemble-20220317011121/deepfm_nvt/1/model.py with nvtabular/inference/triton/workflow_model.py (which is what should be included in a re-export), I am able to successfully load the deepfm_nvt model without encountering the error.
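
In script form, the manual workaround amounts to something like this (paths are the ones mentioned in this thread; adjust for your checkout):

# Replace the exported Python model with the current workflow_model.py,
# mimicking what a re-export with the fixed NVTabular would produce.
import shutil

shutil.copy(
    "nvtabular/inference/triton/workflow_model.py",
    "triton-ensemble-20220317011121/deepfm_nvt/1/model.py",
)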

@bschifferer (Contributor)

I tested the Criteo HugeCTR Inference Example and it worked for me.

@viswa-nvidia

@mengdong, can we close this issue? @sohn21c for viz.

@mengdong (Author)

mengdong commented Apr 28, 2022 via email
