Specify `columns` when reading files with `DocumentDataset` #311

sarahyurick · 2024-10-18T19:44:30Z

Closes #180.

With this change, all of the following calls work:

(1) Neither Pandas nor cuDF read_json supports a columns parameter, so we read the entire DataFrame and then drop the unwanted columns behind the scenes.

dataset = DocumentDataset.read_json(dataset_path, columns=["col1", "col2"])

(2) Pandas and cuDF read_parquet both support a columns parameter, so we pass columns through to the underlying reader.

dataset = DocumentDataset.read_parquet(dataset_path, columns=["col1", "col2"])

(3) Pandas read_pickle does not support a columns parameter (and cuDF has no read_pickle), so we read the entire DataFrame and then drop the unwanted columns behind the scenes.

dataset = DocumentDataset.read_pickle(dataset_path, columns=["col1", "col2"])

(4) As with cudf.read_json, you can pass dtype together with prune_columns=True to return only the columns named in the dtype argument. Note that Pandas does not support prune_columns, so this option applies only with backend="cudf".

dataset = DocumentDataset.read_json(
    dataset_path,
    dtype={"col1": str, "col2": str},
    prune_columns=True,
    backend="cudf",
)
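For cases (1) and (3), the "read everything, then drop unwanted columns" approach can be sketched in plain Pandas as follows. This is only an illustration of the pattern, not the code in this PR: the helper name `select_columns` and the sample records are hypothetical, and the real implementation in `DocumentDataset` may differ.

```python
import io
import os
import tempfile

import pandas as pd


def select_columns(df, columns=None):
    """Hypothetical helper: keep only `columns`, or return df unchanged if None."""
    if columns is None:
        return df
    return df[list(columns)]


# Hypothetical JSON Lines sample with one column we do not want.
records = (
    '{"col1": "a", "col2": "b", "col3": 1}\n'
    '{"col1": "c", "col2": "d", "col3": 2}\n'
)

# JSON: the reader has no `columns` argument, so read all columns and then select.
full_df = pd.read_json(io.StringIO(records), lines=True)
json_df = select_columns(full_df, ["col1", "col2"])

# Pickle: same pattern, round-tripped through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pkl")
    full_df.to_pickle(path)
    pickle_df = select_columns(pd.read_pickle(path), ["col1", "col2"])

print(list(json_df.columns))
print(list(pickle_df.columns))
```

Because the whole file is loaded before the selection, this pattern does not save memory or I/O during the read itself, unlike the Parquet case in (2), where the reader accepts the column list directly.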

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

nemo_curator/datasets/doc_dataset.py

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

sarahyurick added 2 commits October 18, 2024 12:30

add column param

7036095

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

read_pickle and black

8b9fbc3

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

sarahyurick requested a review from ryantwolf October 18, 2024 20:05

praateekmahajan reviewed Oct 18, 2024


nemo_curator/datasets/doc_dataset.py (outdated, resolved)

optional param

f1ee675

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

sarahyurick requested a review from praateekmahajan October 18, 2024 21:16
