
[QST] How to use pretrained embeddings as features in DLRM? #1013

Open
aabdullah-getguru opened this issue Mar 6, 2023 · 11 comments
Labels: question (Further information is requested), status/needs-triage

@aabdullah-getguru

aabdullah-getguru commented Mar 6, 2023

❓ Questions & Help

I'm a beginner with Merlin Models. I'm setting up a DLRM model, with 3 types of input features:

  1. categorical features
  2. continuous features
  3. pre_trained embeddings for user/item

For simplicity, we can assume we have a data frame with columns user_id, item_id, categorical_1, continuous_1, embeddings_user, embeddings_item.

(1) and (2) are straightforward to add to the architecture simply by using the right tags and nvt.ops. However, I'm not sure how one could add in the embedding columns (embeddings_user, embeddings_item). Is the right approach just to define a custom architecture using the Merlin-provided blocks? I would prefer these embeddings to be trainable if possible.

Or is there a quicker way to use them with DLRM via the right nvtabular ops and tags? Thanks!
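
For reference, the straightforward part for (1) and (2) might look roughly like this in NVTabular (the specific ops and the train.parquet file name are assumptions, not something prescribed in this thread):

    import nvtabular as nvt
    from merlin.schema import Tags

    # Hypothetical toy dataset with the columns described above.
    dataset = nvt.Dataset("train.parquet")

    user_id = ["user_id"] >> nvt.ops.Categorify() >> nvt.ops.TagAsUserID()
    item_id = ["item_id"] >> nvt.ops.Categorify() >> nvt.ops.TagAsItemID()
    cats = ["categorical_1"] >> nvt.ops.Categorify()
    conts = ["continuous_1"] >> nvt.ops.Normalize() >> nvt.ops.AddMetadata(tags=[Tags.CONTINUOUS])

    workflow = nvt.Workflow(user_id + item_id + cats + conts)
    train = workflow.fit_transform(dataset)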

@rnyak
Contributor

rnyak commented Mar 6, 2023

@aabdullah-getguru

  • Are you planning to feed the embeddings to the embedding layer, or use them as an extra continuous input feature? If the latter, you need to aggregate them (e.g. take the average). You cannot yet feed a list of continuous features, or a list of lists of continuous features, to an MLP model without aggregation (note that DLRM has a bottom MLP for numeric features).

If you want to see how you can customize the DLRM building blocks, you can refer to this example: https://github.com/NVIDIA-Merlin/models/blob/main/examples/06-Define-your-own-architecture-with-Merlin-Models.ipynb
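
A rough sketch of the pretrained-embedding route that later comments in this thread refer to, based on the Merlin dataloader's EmbeddingOperator (the lookup key, the item_embeddings.npy file, and the batch size are assumptions, not something prescribed here):

    import numpy as np
    import merlin.models.tf as mm
    from merlin.dataloader.ops.embeddings import EmbeddingOperator

    # Hypothetical file: shape (num_items, dim), where row i holds the
    # pretrained vector for the Categorify-encoded item_id i.
    item_embeddings = np.load("item_embeddings.npy")

    loader = mm.Loader(
        train,  # the transformed merlin.io.Dataset from the NVTabular workflow above
        batch_size=1024,
        transforms=[
            EmbeddingOperator(
                item_embeddings,
                lookup_key="item_id",
                embedding_name="pretrained_item_embeddings",
            )
        ],
        shuffle=True,
    )

The model is then built from the loader's output schema so it sees the extra pretrained_item_embeddings feature; whether that vector is fed to an embedding layer or aggregated for DLRM's bottom MLP is the version-dependent point discussed above.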

@rnyak rnyak added the question Further information is requested label Mar 6, 2023
@rnyak rnyak added this to the Merlin 23.03 milestone Mar 8, 2023
@aabdullah-getguru
Author

@rnyak Thank you, that's a very helpful lead. I'll look into feeding the embeddings to the embedding layer.

@rnyak rnyak modified the milestones: Merlin 23.03, Merlin 23.04 Mar 14, 2023
@hkristof03

hkristof03 commented Sep 21, 2023

Hi @rnyak,

Thanks for the example. I am trying to use embedding vectors from NLP and CV models. The problem is that these extracted features are available for some items but not for others. I see from the example that if the item_id is missing from the embedding table, the lookup result will be an all-zero vector. But I am trying to find out what to do when, for example:

item_id | text_embedding | image_embedding
 1      |  None          | [...]
 2      |  [....]        | [...]
 3      |  [....]        | None

So item_id = 1 does not have a text embedding vector while it has an image embedding vector, and so on. In this case the id 1 cannot be used to retrieve a text embedding vector.
Is the only option to add all-zero vectors to the given embedding tables at certain positions, in this case at index zero in the text embedding table, or is there a better method?

Thanks for your help in advance!

@CarloNicolini

Hi, have you found a way to pass precomputed embeddings to the model? I have a very similar case and I cannot understand whether it is possible to just use the nvtabular Workflow, or whether other methods are needed, to pass both user embeddings and item embeddings. For each I have a 1024-element array associated with the user or item respectively. I believe this kind of input could greatly help model performance, but there are a lot of memory issues: with ~3M rows this explodes quickly.

The merlin-tensorflow documentation is missing this kind of example; instead it associates the embeddings with the movieId, and I don't understand why that is necessary.

@hkristof03

hkristof03 commented Jun 1, 2024

Hi @CarloNicolini, yes, I solved the problem. Just follow this example. If the embedding table is large, you have the option of not moving it to the GPU all at once, only in batches during training (see the 2nd case in the notebook). Keep in mind that the 0th index of the embedding table should be an all-zero vector, which will correspond to unknown IDs. If there are multiple features corresponding to the same embedding table, you can make the embedding table shared for those features with this syntax:

[['feature_x', 'feature_y']] >> nvt.ops. ...

You can verify that the features share the embedding table by checking the schema DataFrame.
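
To make the elided op above concrete: a minimal sketch, assuming the shared table comes from Categorify encoding the two columns jointly (the encode_type argument is an assumption, not quoted from the example):

    import nvtabular as nvt

    # Passing the columns as a nested list asks Categorify to build one joint
    # vocabulary for both features, so they index the same embedding table.
    shared = [["feature_x", "feature_y"]] >> nvt.ops.Categorify(encode_type="joint")

    workflow = nvt.Workflow(shared)

After fitting, the output schema should show both columns pointing at the same domain, which is the check mentioned above.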

I hope this helps.

@CarloNicolini

In my case the values remapped by Categorify start from 3 (reading the unique.item_id.parquet file).
I don't understand why I should only add a single row of zeros instead of three (corresponding to 0, 1 and 2), as those are the IDs reserved for padding, nulls and out-of-vocabulary values respectively.

@hkristof03

@CarloNicolini I was also wondering about the same thing after reading this issue. However, the example I shared only adds one row.

@rnyak could you please comment on this?

@CarloNicolini

CarloNicolini commented Jun 24, 2024


I've experimented and checked the values thoroughly using Loader.peek() as in the example.
I can confirm that one row of zeros is not enough; otherwise the data are not correctly aligned.
Since my categorical id variable starts from the value 3 after workflow.transform, I had to prepend a np.zeros([3, 1024]) block with np.vstack in order for the dataloader to pass the pretrained embeddings to the model correctly.
P.S.
The value 3 is clearly because I use nvt.ops.Categorify with the default num_buckets option.
Your mileage may vary depending on the number of buckets in Categorify, I believe.
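
In code, that alignment is roughly the following (a sketch with a hypothetical embedding file; the offset of 3 matches the reserved IDs explained in the next comment):

    import numpy as np

    dim = 1024
    # Hypothetical file: one pretrained vector per item, ordered by the
    # Categorify-encoded item_id, which starts at 3 here.
    item_embeddings = np.load("item_embeddings.npy")

    # Prepend three zero rows so indices 0, 1 and 2 (padding / nulls / OOV)
    # look up all-zero vectors and the real items start at row 3.
    aligned = np.vstack([np.zeros((3, dim), dtype=item_embeddings.dtype), item_embeddings])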

@rnyak
Contributor

rnyak commented Jun 28, 2024

@CarloNicolini we did not test the pretrained embedding features with the num_buckets option, so it is hard to say whether it would work out of the box. I'd recommend using this functionality with the Categorify op applied without any bucketing or frequency thresholding. Without bucketing you have a 1-1 mapping between the transformed and original item-ids (or whatever categorical column you apply the Categorify op to).

Since my id categorical variable after workflow.transform starts from the value 3

If you apply the Categorify op on a categorical column, we allocate 0 for padding, nulls are mapped to 1, and OOVs are mapped to 2. Then we start the encoding of the most frequent category in item-id from 3.
You should have unique.item_id.parquet files inside the categories folder, from which you can do the reverse mapping.
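
A hedged sketch of that reverse mapping; the exact layout of the unique parquet can differ across NVTabular versions, so the offset handling below is an assumption to verify against your own categories folder:

    import pandas as pd

    # The categories/ folder written by the fitted workflow holds one parquet per
    # encoded column, e.g. categories/unique.item_id.parquet with the raw values.
    uniques = pd.read_parquet("categories/unique.item_id.parquet")["item_id"]

    # Assumes the file lists only real categories, so its first row is the value
    # encoded as 3 (IDs 0-2 being padding / nulls / OOV). If your version already
    # writes the reserved rows into the file, drop the offset.
    offset = 3
    encoded_to_raw = {i + offset: raw for i, raw in enumerate(uniques)}
    raw_to_encoded = {raw: enc for enc, raw in encoded_to_raw.items()}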

@CarloNicolini

Thanks for your feedback!
With freq_threshold I can confirm that the results seem to map correctly.
I manually tested a few dozen indices and verified that the values map to the ones I expect.
As for num_buckets I did not check; those were only hypotheses.

By the way, these kinds of operations strongly call for an nvt.Workflow .inverse_transform method.
That would be fantastically useful for performing certain back-mappings.

@rnyak
Contributor

rnyak commented Jun 28, 2024

@CarloNicolini thanks. Currently we do not have the bandwidth to add extra features to the library. If you are interested, feel free to open a PR.
