
[QST] How to use pretrained embeddings as features in DLRM? #1013

Open
aabdullah-getguru opened this issue Mar 6, 2023 · 11 comments
Labels: question (Further information is requested), status/needs-triage

@aabdullah-getguru

aabdullah-getguru commented Mar 6, 2023

❓ Questions & Help

I'm a beginner with Merlin Models. I'm setting up a DLRM model, with 3 types of input features:

  1. categorical features
  2. continuous features
  3. pre_trained embeddings for user/item

For simplicity, we can assume we have a data frame with columns user_id, item_id, categorical_1, continuous_1, embeddings_user, embeddings_item.

(1) and (2) are straightforward to add to the architecture simply by using the right tags and nvt.ops. However, I'm not sure how one could add in the embedding columns (embeddings_user, embeddings_item). Is the right approach just to define a custom architecture using the Merlin-provided blocks? I would prefer these embeddings to be trainable if possible.

Or is there a quicker way to use them with DLRM via the right nvtabular ops and tags? Thanks!
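
For reference, the straightforward part for (1) and (2) might look roughly like this in NVTabular (the specific ops and the train.parquet file name are assumptions, not something prescribed in this thread):

    import nvtabular as nvt
    from merlin.schema import Tags

    # Hypothetical toy dataset with the columns described above.
    dataset = nvt.Dataset("train.parquet")

    user_id = ["user_id"] >> nvt.ops.Categorify() >> nvt.ops.TagAsUserID()
    item_id = ["item_id"] >> nvt.ops.Categorify() >> nvt.ops.TagAsItemID()
    cats = ["categorical_1"] >> nvt.ops.Categorify()
    conts = ["continuous_1"] >> nvt.ops.Normalize() >> nvt.ops.AddMetadata(tags=[Tags.CONTINUOUS])

    workflow = nvt.Workflow(user_id + item_id + cats + conts)
    train = workflow.fit_transform(dataset)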

@rnyak
Contributor

rnyak commented Mar 6, 2023

@aabdullah-getguru

  • Are you planning to feed the embeddings to the embedding layer, or use them as an extra continuous input feature? If the latter, you need to aggregate them (e.g. take the average). You cannot yet feed a list of continuous features, or a list of lists of continuous features, to an MLP model without aggregation (note that DLRM has a bottom MLP for numeric features).

If you want to see how you can customize the DLRM building blocks, you can refer to this example: https://github.com/NVIDIA-Merlin/models/blob/main/examples/06-Define-your-own-architecture-with-Merlin-Models.ipynb
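
A rough sketch of the pretrained-embedding route that later comments in this thread refer to, based on the Merlin dataloader's EmbeddingOperator (the lookup key, the item_embeddings.npy file, and the batch size are assumptions, not something prescribed here):

    import numpy as np
    import merlin.models.tf as mm
    from merlin.dataloader.ops.embeddings import EmbeddingOperator

    # Hypothetical file: shape (num_items, dim), where row i holds the
    # pretrained vector for the Categorify-encoded item_id i.
    item_embeddings = np.load("item_embeddings.npy")

    loader = mm.Loader(
        train,  # the transformed merlin.io.Dataset from the NVTabular workflow above
        batch_size=1024,
        transforms=[
            EmbeddingOperator(
                item_embeddings,
                lookup_key="item_id",
                embedding_name="pretrained_item_embeddings",
            )
        ],
        shuffle=True,
    )

The model is then built from the loader's output schema so it sees the extra pretrained_item_embeddings feature; whether that vector is fed to an embedding layer or aggregated for DLRM's bottom MLP is the version-dependent point discussed above.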

@rnyak rnyak added the question Further information is requested label Mar 6, 2023
@rnyak rnyak added this to the Merlin 23.03 milestone Mar 8, 2023
@aabdullah-getguru
Author

@rnyak Thank you, that's a very helpful lead. I'll look into feeding the embeddings to the embedding layer.

@rnyak rnyak modified the milestones: Merlin 23.03, Merlin 23.04 Mar 14, 2023
@hkristof03

hkristof03 commented Sep 21, 2023

Hi @rnyak,

Thanks for the example. I am trying to use embedding vectors from NLP and CV models. The problem is that these extracted features are available for some items but not for others. I see from the example that if the item_id is missing from the embedding table, the lookup result will be an all-zero vector. But I am trying to find out what to do when, for example:

item_id | text_embedding | image_embedding
 1      |  None          | [...]
 2      |  [....]        | [...]
 3      |  [....]        | None

So item_id = 1 does not have a text embedding vector while it has an image embedding vector, and so on. In this case the id 1 cannot be used to retrieve a text embedding vector.
Is the only option to add all-zero vectors to the given embedding tables at certain positions, in this case at index zero in the text embedding table, or is there a better method?

Thanks for your help in advance!

@CarloNicolini

Hi, have you found a way to pass precomputed embeddings to the model? I have a very similar case and I cannot understand whether it is possible to just use the nvtabular Workflow, or whether other methods are needed, to pass both user embeddings and item embeddings. For each I have a 1024-element array associated with the user or item respectively. I believe this kind of input could greatly help model performance, but there are a lot of memory issues: with ~3M rows this explodes quickly.

The merlin-tensorflow documentation is missing this kind of example; instead it associates the embeddings with the movieId, and I don't understand why that is necessary.

@hkristof03

hkristof03 commented Jun 1, 2024

Hi @CarloNicolini, yes, I solved the problem. Just follow this example. If the embedding table is large, you have the option of not moving it to the GPU all at once, only in batches during training (see the 2nd case in the notebook). Keep in mind that the 0th index of the embedding table should be an all-zero vector, which will correspond to unknown IDs. If there are multiple features corresponding to the same embedding table, you can make the embedding table shared for those features with this syntax:

[['feature_x', 'feature_y']] >> nvt.ops. ...

You can verify that the features share the embedding table by checking the schema DataFrame.
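
To make the elided op above concrete: a minimal sketch, assuming the shared table comes from Categorify encoding the two columns jointly (the encode_type argument is an assumption, not quoted from the example):

    import nvtabular as nvt

    # Passing the columns as a nested list asks Categorify to build one joint
    # vocabulary for both features, so they index the same embedding table.
    shared = [["feature_x", "feature_y"]] >> nvt.ops.Categorify(encode_type="joint")

    workflow = nvt.Workflow(shared)

After fitting, the output schema should show both columns pointing at the same domain, which is the check mentioned above.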

I hope this helps.

@CarloNicolini

In my case the values remapped by Categorify start from 3 (reading the unique.item_id.parquet file).
I don't understand why I should only add a single row of zeros instead of three (corresponding to 0, 1 and 2), as those are the IDs reserved for padding, nulls and out-of-vocabulary values respectively.

@hkristof03

@CarloNicolini I was also wondering about the same thing after reading this issue. However, the example I shared only adds one row.

@rnyak could you please comment on this?

@CarloNicolini

CarloNicolini commented Jun 24, 2024


I've experimented and checked the values thoroughly using Loader.peek() as in the example.
I can confirm that one row of zeros is not enough; otherwise the data are not correctly aligned.
Since my categorical id variable starts from the value 3 after workflow.transform, I had to prepend a np.zeros([3, 1024]) block with np.vstack in order for the dataloader to pass the pretrained embeddings to the model correctly.
P.S.
The value 3 is clearly because I use nvt.ops.Categorify with the default num_buckets option.
Your mileage may vary depending on the number of buckets in Categorify, I believe.
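
In code, that alignment is roughly the following (a sketch with a hypothetical embedding file; the offset of 3 matches the reserved IDs explained in the next comment):

    import numpy as np

    dim = 1024
    # Hypothetical file: one pretrained vector per item, ordered by the
    # Categorify-encoded item_id, which starts at 3 here.
    item_embeddings = np.load("item_embeddings.npy")

    # Prepend three zero rows so indices 0, 1 and 2 (padding / nulls / OOV)
    # look up all-zero vectors and the real items start at row 3.
    aligned = np.vstack([np.zeros((3, dim), dtype=item_embeddings.dtype), item_embeddings])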

@rnyak
Contributor

rnyak commented Jun 28, 2024

@CarloNicolini we did not test the pretrained embedding features with the num_buckets option, so it is hard to say whether it would work out of the box. I'd recommend using this functionality with the Categorify op applied without any bucketing or frequency thresholding. Without bucketing you have a 1-1 mapping between the transformed and original item-ids (or whatever categorical column you apply the Categorify op to).

Since my id categorical variable after workflow.transform starts from the value 3

If you apply the Categorify op on a categorical column, we allocate 0 for padding, nulls are mapped to 1, and OOVs are mapped to 2. Then we start the encoding of the most frequent category in item-id from 3.
You should have unique.item_id.parquet files inside the categories folder, from which you can do the reverse mapping.
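
A hedged sketch of that reverse mapping; the exact layout of the unique parquet can differ across NVTabular versions, so the offset handling below is an assumption to verify against your own categories folder:

    import pandas as pd

    # The categories/ folder written by the fitted workflow holds one parquet per
    # encoded column, e.g. categories/unique.item_id.parquet with the raw values.
    uniques = pd.read_parquet("categories/unique.item_id.parquet")["item_id"]

    # Assumes the file lists only real categories, so its first row is the value
    # encoded as 3 (IDs 0-2 being padding / nulls / OOV). If your version already
    # writes the reserved rows into the file, drop the offset.
    offset = 3
    encoded_to_raw = {i + offset: raw for i, raw in enumerate(uniques)}
    raw_to_encoded = {raw: enc for enc, raw in encoded_to_raw.items()}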

@CarloNicolini

Thanks for your feedback!
With freq_threshold I can confirm that the results seem to map correctly.
I manually tested a few dozen indices and verified that the values map to the ones I expect.
As for num_buckets I did not check; those were only hypotheses.

By the way, these kinds of operations strongly call for an nvt.Workflow .inverse_transform method.
That would be fantastically useful for performing certain back-mappings.

@rnyak
Contributor

rnyak commented Jun 28, 2024

@CarloNicolini thanks. Currently we do not have the bandwidth to add extra features to the library. If you are interested, feel free to open a PR.
