
[QST] How can I fit a Workflow to a large dataset? #1761

Closed
Azilyss opened this issue Feb 14, 2023 · 27 comments
Labels
P1 question Further information is requested

Comments

Azilyss commented Feb 14, 2023

What is the best way to fit a Workflow to a large dataset?

When fitting the workflow, I run into an OOM error:

MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /project/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory

Setup
GPU - 1 x NVIDIA Tesla V100
Docker image : nvcr.io/nvidia/merlin/merlin-pytorch:22.12
Platform: Ubuntu 20.04
Python version: 3.8

Dataset
1.7M rows - 5 columns
4GB
Operations include nvt.ops.Categorify() with 250k unique items in a column.

I have come across this troubleshooting document but did not succeed in implementing a working solution. Is setting up a LocalCUDACluster necessary to handle this issue?
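For reference, the cluster setup described in that document looks roughly like the following sketch (the device_memory_limit value is illustrative; depending on the NVTabular version, the client may also need to be passed to the Workflow explicitly):

import nvtabular as nvt
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Single-GPU cluster that spills device memory to host once usage passes the limit (illustrative value)
cluster = LocalCUDACluster(n_workers=1, device_memory_limit="12GB")
client = Client(cluster)

# NVTabular / Dask operations submitted after this point can use the active client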

Azilyss added the question label Feb 14, 2023
Tselmeg-C commented:
I am facing the same problem. I am running the Merlin TensorFlow container using this command:
docker run -it --network="<user-container-preprocessing-data>" --name merlin --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v <my_vol> nvcr.io/nvidia/merlin/merlin-tensorflow:22.12
For smaller data it was working well, but for data with 3M rows and 90 columns (2+ GB), workflow_fit_transform is not working; the kernel dies. I only have a CPU for now. Please tell me there are other ways to work around this problem besides setting up a GPU.

rnyak (Contributor) commented Feb 14, 2023

@Azilyss may I ask:

  • what's your GPU? Is it a single 16GB card?
  • Is your raw dataset a single CSV or parquet file, or does it consist of multiple files?
  • If you read your dataset metadata with the following step, what's the number of row groups? If num_row_groups is 1 or 2 (i.e., a small number), your row-group memory size is pretty big. Note that we expect the row-group memory size to be smaller than part_size.
num_rows, num_row_groups, names = cudf.io.read_parquet_metadata('<yourdataset.parquet>')
print(num_rows), print(num_row_groups)

You can always set the part_size argument in nvt.Dataset(), something like this, before doing workflow.fit(dataset):

dataset = nvt.Dataset(PATH, part_size="500MB")

You can adjust part_size as needed.

Azilyss (Author) commented Feb 14, 2023

Hi @rnyak,

  • Yes it's a single Tesla V100 16GB GPU.
  • The dataset is split into 100 parquet files, so ~35MB each
  • num_rows = 12_757_831 and num_row_groups = 1

Setting part_size as follows does not help:

dataset = nvt.Dataset(list_paths, engine="parquet", part_size="128MB")
workflow.fit(dataset)

rnyak (Contributor) commented Feb 14, 2023

@Azilyss where do you get the OOM? During workflow.fit()? Can you please set part_size='1000MB' and try again? Note that your parquet file size is for compressed data.

You can simply read one parquet file via cudf.read_parquet(...) and then check the GPU memory consumption with nvidia-smi.

And please be sure the GPU memory is free when you start using NVTabular. Does your notebook/script have any import tensorflow as tf command?
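A minimal way to do that check in Python (a sketch only; the file name is a placeholder and pynvml is assumed to be installed):

import cudf
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

gdf = cudf.read_parquet("part.0.parquet")  # placeholder: one of the dataset's parquet parts

# Report device memory in use after loading the single file
nvmlInit()
mem = nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
print(f"GPU memory used: {mem.used / 1e9:.2f} GB")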

Azilyss (Author) commented Feb 14, 2023

Hi @rnyak,

Yes, the error occurs during workflow.fit() at:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvtabular/ops/categorify.py", line 463, in transform
    encoded = _encode(
  File "/usr/local/lib/python3.8/dist-packages/nvtabular/ops/categorify.py", line 1427, in _encode
    labels = codes.merge(
  File "/usr/local/lib/python3.8/dist-packages/cudf/core/indexed_frame.py", line 1902, in sort_values
    self._get_columns_by_label(by)._get_sorted_inds(
  File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 1539, in _get_sorted_inds
    return libcudf.sort.order_by(to_sort, ascending, na_position)
  File "sort.pyx", line 138, in cudf._lib.sort.order_by
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/local/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory

The above exception was the direct cause of the following exception:

...

Traceback (most recent call last):
  File "/pythonproject/dataset/workflow.py", line 238, in run_workflow
    workflow.fit(dataset)
  File "/usr/local/lib/python3.8/dist-packages/nvtabular/workflow/workflow.py", line 198, in fit
    self.executor.fit(ddf, current_phase)
  ...
  File "/usr/local/lib/python3.8/dist-packages/merlin/dag/executors.py", line 170, in _transform_data
    output_data = node.op.transform(selection, input_data)
  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvtabular/ops/categorify.py", line 483, in transform
    raise RuntimeError(f"Failed to categorical encode column {name}") from e
RuntimeError: Failed to categorical encode column item_id

The GPU memory consumption is 246MB for a single uncompressed file.

I am using PyTorch for this project, but this script is not importing torch.

Using part_size = "1000MB", the workflow.fit() still fails at categorify.

rnyak (Contributor) commented Feb 16, 2023

@Azilyss can you try feeding only 1 parquet file to NVT and see if you are still getting OOM? Also, does your dataset have very long string columns? If yes, that might be problematic for Categorify, which relies on cudf and dask_cudf under the hood. Can you share a sample dataset and your NVT workflow code, please?

Azilyss (Author) commented Feb 24, 2023

Hi @rnyak,

I am not getting OOM when loading only one file.

The dataset contains np.int64 columns item_id and session_id. The NVT workflow mainly categorifies these 2 columns and groups the item_ids by session_id into a list. A session typically has 60 items, there are about 200k sessions per part, and 200k unique items in the entire dataset.

The workflow code is as follows:

import nvtabular as nvt
from merlin.dag import ColumnSelector
from merlin.schema import Schema, Tags
from nvtabular import ops

input_paths = ["part.0.parquet", "part.1.parquet", ... , "part.99.parquet"]
output_path = "output"

input_features = ["session_id", "item_id"]
max_len = 200
min_len = 2

cat_features = (
    ColumnSelector(input_features)
    >> ops.Categorify()
    >> nvt.ops.AddMetadata(tags=[Tags.CATEGORICAL])
)
groupby_features = cat_features >> nvt.ops.Groupby(
    groupby_cols=["session_id"],
    aggs={"item_id": ["list", "count"]},
    name_sep="-",
)
seq_feats_list = (
    groupby_features["item_id-list"]
    >> nvt.ops.ListSlice(-max_len, pad=True, pad_value=0)
    >> nvt.ops.Rename(postfix="_seq")
    >> nvt.ops.AddMetadata(tags=[Tags.LIST])
    >> nvt.ops.ValueCount()
)
seq_feats_list = seq_feats_list >> nvt.ops.AddMetadata(tags=[Tags.ITEM, Tags.ID, Tags.ITEM_ID])

selected_features = seq_feats_list + groupby_features["item_id-count"]

features = selected_features >> nvt.ops.Filter(
    f=lambda df: df["item_id-count"] >= min_len
)

dataset = nvt.Dataset(input_paths, engine="parquet", part_size="1000MB")
dataset = dataset.shuffle_by_keys(keys=["session_id"])

workflow = nvt.Workflow(features)
workflow.fit(dataset)
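(output_path above is not used in this snippet; the export step that would typically follow the fit is something like the hypothetical sketch below, with an illustrative out_files_per_proc value.)

# Hypothetical export step, not part of the original snippet:
# transform with the fitted workflow and write the result to output_path
workflow.transform(dataset).to_parquet(output_path, out_files_per_proc=10)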

rnyak (Contributor) commented Feb 24, 2023

@Azilyss can you please set max_len = 200 to something smaller, like 60, and test again? If a session typically has 60 items, is there a reason you set the max length to 200? It will add lots of zero padding to sessions that have fewer than 200 items.

Azilyss (Author) commented Feb 24, 2023

Apologies, I meant most sessions have around 60 items but a session can contain between 2 and 200 items.

rnyak (Contributor) commented Feb 24, 2023

@Azilyss you can still reduce max_len to 60 and test :) Where do you read these parquet files from? From Google Cloud or AWS storage, or from a local disk?

rnyak added the P1 label Feb 28, 2023
Azilyss (Author) commented Feb 28, 2023

Hi @rnyak,

I did, but without success. The data is loaded from a local disk.

I modified the code to do the shuffle by keys with dask beforehand, so it does not pile up with the other operations:
ddf.shuffle(on="session_id", ignore_index=True, npartitions=100)
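The full pre-shuffle step looks roughly like this (a sketch; the input and output directories are placeholders):

import dask_cudf

# Read the raw parts, shuffle so each session_id lands in a single partition,
# and write the shuffled parts back to disk before running the NVT workflow
ddf = dask_cudf.read_parquet("raw_parts/")  # placeholder input directory
ddf = ddf.shuffle(on="session_id", ignore_index=True, npartitions=100)
ddf.to_parquet("shuffled_parts/")  # placeholder output directory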

At this point, I am not encountering an OOM error if I remove the ValueCount() operation from the workflow, but the schema is generated without the value_count.min and value_count.max fields, which in this example should be:

value_count {
    min: 60 
    max: 60
}

Is there any reason why this operation would be costly in this case, and how could these fields be generated otherwise, given that the sequence length is constant thanks to pad=True?

rnyak (Contributor) commented Feb 28, 2023

@Azilyss I am not sure why dataset = dataset.shuffle_by_keys(keys=["session_id"]) is required. If you do shuffle_by_keys, you are adding extra computational complexity. Are the same session_ids divided between different parquet files? Meaning, for example, do two different parquet files, say "part.0.parquet" and "part.1.parquet", include the same raw session_id?

With the latest changes in the repo, if you are using ListSlice(..., pad=True), you can remove ValueCount() and you should still be able to see properties.value_count.min and properties.value_count.max in your output schema. I tested it on my end. These changes will be part of the upcoming merlin-pytorch:23.02 docker image, which is going to be released soon.
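A quick way to verify this after fitting (a sketch; the column name assumes the Rename(postfix="_seq") step from the workflow above, and the exact properties layout may vary by version):

workflow.fit(dataset)

# Inspect the fitted output schema for the padded list column
col_schema = workflow.output_schema["item_id-list_seq"]
print(col_schema.properties.get("value_count"))  # expect min/max entries when pad=True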

liyunrui commented Mar 2, 2023

@rnyak

I got the same issue when I try to load data with nvt.Dataset(data_path).

I used the approach you suggested to check num_row_groups:

DATA_FOLDER = os.environ.get("DATA_FOLDER", "/home/ec2-user/SageMaker/p2p_two_tower_data/")

num_rows, num_row_groups, names = cudf.io.read_parquet_metadata(DATA_FOLDER + "raw_train/data.parquet")
print(num_rows), print(num_row_groups) # 13704855, 1

So, per your comment, my row-group memory size is pretty big, and the row-group memory size is expected to be smaller than part_size.

So I tried to use part_size as [1] suggests, but it does not work; I got an error.

DATA_FOLDER = os.environ.get("DATA_FOLDER", "/home/ec2-user/SageMaker/p2p_two_tower_data/")

train_dataset = nvt.Dataset(DATA_FOLDER + "/raw_train/*.parquet", part_size="256MB")
valid_dataset = nvt.Dataset(DATA_FOLDER + "/raw_valid/*.parquet")

Error:
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /home/ec2-user/anaconda3/envs/rapids-22.08/include/rmm/mr/device/cuda_memory_resource.hpp

Anything else I can try? I'd be excited to use more data for training and see if it works.
[1] https://nvidia-merlin.github.io/NVTabular/v0.6.0/api/dataset.html

Note:
One weird thing is that no matter which part_size I tried (for example 512MB, 256MB, or 1000MB), it always consumes the same amount of GPU memory (8273MiB) in nvidia-smi:
| 0 N/A N/A 357 C ...s/rapids-22.08/bin/python 8273MiB |

liyunrui commented Mar 2, 2023

I solved my issue with the steps below.
Step 1: Increase row_group_size as [1] suggests.

You can use most Data Frame frameworks to set the row group size (number of rows) for your parquet files. In the following Pandas and cuDF examples, the row_group_size is the number of rows that will be stored in each row group (internal structure within the parquet file):

#Pandas
pandas_df.to_parquet("/file/path", engine="pyarrow", row_group_size=10000)
#cuDF
cudf_df.to_parquet("/file/path", engine="pyarrow", row_group_size=10000)

Step 2: Change part_size:
train_dataset = nvt.Dataset(DATA_FOLDER + "/raw_train/*.parquet", part_size="256MB")

[1] https://nvidia-merlin.github.io/NVTabular/v0.7.0/resources/troubleshooting.html

rnyak (Contributor) commented Mar 2, 2023

Yes, one can reset the row_group_size of each parquet file and save it again to disk. We recommend keeping row_group_size small rather than large. Please note that if you read and save data with cudf, the correct argument is row_group_size_rows, like below:

cudf_df.to_parquet("/file/path",  row_group_size_rows=10000)

Note that 10000 is just an example. You can set it by checking the total number of rows in your dataset, so that the total number of rows is divisible by row_group_size_rows.
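As a quick sanity check after rewriting the file, you can reuse the metadata call from earlier in this thread (the path is a placeholder):

import cudf

# After rewriting with row_group_size_rows=10000, the row-group count should increase accordingly
num_rows, num_row_groups, names = cudf.io.read_parquet_metadata("/file/path")
print(num_rows, num_row_groups)  # expect num_row_groups ≈ num_rows / 10000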

Please check this document for further info.

Azilyss (Author) commented Mar 3, 2023

Yes, sessions could be divided between different parquet files, but this is no longer the case.

I had pip installed nvtabular==1.8.1, and the ListSlice() changes are not included in that release.

Removing the ValueCount() operation actually solved my OOM issue. I did not see any direct improvement from increasing row_group_size in the raw parquet files or from specifying a part_size in nvt.Dataset, but I did take these suggestions into account when resolving the issue.

Thanks for your time and helpful suggestions during this resolution.

ShengyuanSW commented:
Hi all, I got similar OOM issues. I first encountered this problem when building the dataset (nvt.Dataset) and solved it by setting row_group_size at that stage. But at the model training stage I hit the same problem again: when using model.fit(train_dataset, validation_data=valid_dataset, batch_size=128), the OOM error appeared again.
std::bad_alloc: out_of_memory: CUDA error at: /opt/rapids/rmm/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory

Any suggestions would be appreciated!

rnyak (Contributor) commented Apr 13, 2023

@ShengyuanSW can you please give us some info about your HW, GPU memory size, and your env? Are you getting OOM when you run your NVTabular workflow or when you train your model? What you shared above looks like an error from model training, is that correct? If yes, what model is that?

ShengyuanSW commented:
> @ShengyuanSW can you please give us some info about your HW, GPU memory size, and your env? Are you getting OOM when you run your NVTabular workflow or when you train your model? What you shared above looks like an error from model training, is that correct? If yes, what model is that?

  1. GPU: 24GB, Python 3.8.1, inside a docker container

  2. The OOM occurs at the model training stage

  3. Two-Tower model and DLRM in Merlin Models

Thanks!

rnyak (Contributor) commented Apr 13, 2023

@ShengyuanSW thanks. A couple of follow-up questions:

  • what's the size of your dataset?
  • how many parquet files do you have and what's the size of each parquet file that you feed to the Two-tower or DLRM model?
  • are you using Merlin Models?

ShengyuanSW commented Apr 14, 2023

> @ShengyuanSW thanks. A couple of follow-up questions:
>
>   • what's the size of your dataset?
>   • how many parquet files do you have and what's the size of each parquet file that you feed to the Two-tower or DLRM model?
>   • are you using Merlin Models?

  1. 1M+ data records, 800+ features
  2. I used one parquet file with row_group_size: 10k and part_size: 256MB. I can create the dataset with nvt.Dataset.
  3. Yes, reference here: https://nvidia-merlin.github.io/models/main/models_overview.html

rnyak (Contributor) commented Apr 18, 2023

> 800+ features

Just to confirm, do you have ~800 input features, and are you using the Two-Tower model? How many input features are you feeding to the user tower and the item tower?

Do you mind sharing your model script, please?

> I used one parquet file

Can you please tell us the size of this one parquet file? Is it 1GB, 5GB?

You are able to run the NVTabular pipeline, is that correct? You can process your dataset and then export your processed parquet file?

ShengyuanSW commented Apr 26, 2023

Features:

I used user embeddings and item embeddings; each embedding is a 384-dimensional BERT embedding, which I split into 384 columns. Besides these, there are some other user and item properties used as features.
Number of user features: 412
Number of item features: 390

> Can you please tell us the size of this one parquet file? Is it 1GB, 5GB?

About 11GB.

> You are able to run the NVTabular pipeline, is that correct? You can process your dataset and then export your processed parquet file?

Yes. The NVTabular pipeline runs with no error, and exporting and importing the data gives no error. The OOM error only occurs in Two-Tower model.fit.

rnyak (Contributor) commented May 1, 2023

@ShengyuanSW can you calculate the model size in GB? What's the size of the embedding tables? Can you share your schema file, please?

Also be sure you are adding these lines at the very beginning of your Two-Tower model code, before doing import tensorflow as tf:

import os
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

rnyak (Contributor) commented May 25, 2023

Closing this issue due to low activity. Please reopen if you still have an issue.

rnyak closed this as completed May 25, 2023
Chevolier commented:
Encountered the same problem: .fit() cannot handle a large dataset.

rnyak (Contributor) commented Aug 20, 2024

@Chevolier thanks for your interest in Merlin. We would need more than "Encountered the same problem" :) If you can give us some information about your model size, HW specs, number of features, etc., that would be more informative.

If your embedding table cannot fit in a single GPU's memory, .fit() will naturally result in OOM. Did you check your embedding table size? What's your GPU memory?
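For reference, a back-of-the-envelope embedding-table estimate looks like this (a sketch with illustrative numbers; the 250k cardinality echoes the item count mentioned earlier in this thread and the embedding dimension is an assumption):

# Embedding table size ≈ cardinality x embedding_dim x bytes per parameter
cardinality = 250_000   # e.g. unique item_ids after Categorify (from earlier in this thread)
embedding_dim = 128     # illustrative embedding dimension
bytes_per_param = 4     # float32

size_gb = cardinality * embedding_dim * bytes_per_param / 1e9
print(f"~{size_gb:.3f} GB for this embedding table")  # ~0.128 GB in this example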
