[FEA] Allow specify fields when reading files with DocumentDataset. #180

Open
miguelusque opened this issue Aug 5, 2024 · 0 comments · May be fixed by #311
Labels: enhancement (New feature or request)

Comments

@miguelusque
Contributor

In some scenarios, a corpus file may contain columns that are not needed during the data curation step.

We could reduce the memory footprint by letting the user specify which columns to load when calling DocumentDataset.read_json or other similar methods.

Maybe something similar to the following snippet, where I have added the columns parameter.

# Load the dataset
dataset = DocumentDataset.read_json(
    "./corpus",
    add_filename=True,
    input_meta={"file_name": str, "language": str},
    columns=["file_name", "language"],  # proposed parameter: load only these fields
)

Hope it helps!
Miguel
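
To make the request concrete, here is a minimal sketch of the behavior being asked for, written with only the Python standard library. It is not how DocumentDataset is implemented. The helper name `read_jsonl_columns` is hypothetical, and the sketch assumes the corpus is stored as JSON Lines (one JSON object per line). Each record is trimmed to the requested fields right after parsing, so unneeded fields such as a large `text` column are never accumulated in memory.

```python
import json
import os
import tempfile


def read_jsonl_columns(path, columns):
    """Hypothetical helper: read a JSON Lines file, keeping only `columns`."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = json.loads(line)
            # Keep only the requested fields; a missing field becomes None.
            records.append({key: row.get(key) for key in columns})
    return records


# Build a tiny sample corpus in a temporary directory (illustrative data only).
sample = [
    {"text": "Hello world", "file_name": "a.jsonl", "language": "en"},
    {"text": "Hola mundo", "file_name": "b.jsonl", "language": "es"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for row in sample:
            f.write(json.dumps(row) + "\n")
    rows = read_jsonl_columns(path, columns=["file_name", "language"])

print(rows)
```

In this sketch the `text` field is parsed but discarded per record. A real implementation would more likely push the column selection down into the underlying reader so that it applies to every partition.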

@miguelusque miguelusque added the enhancement New feature or request label Aug 5, 2024
@miguelusque miguelusque changed the title Allow selecting specific fields when reading json or other file types with DocumentDataset [FEA} Allow selecting specific fields when reading json or other file types with DocumentDataset Aug 5, 2024
@miguelusque miguelusque changed the title [FEA} Allow selecting specific fields when reading json or other file types with DocumentDataset [FEA] Allow selecting specific fields when reading json or other file types with DocumentDataset Aug 5, 2024
@miguelusque miguelusque changed the title [FEA] Allow selecting specific fields when reading json or other file types with DocumentDataset [FEA] Allow specify fields when reading files with DocumentDataset. Aug 5, 2024
@sarahyurick sarahyurick self-assigned this Oct 14, 2024