Fix #53 - Add batched files reading support to separate_by_metadata script #54

Open: miguelusque wants to merge 6 commits into main

miguelusque
Contributor

Adds batched file reading to separate_by_metadata.py to avoid out-of-memory (OOM) errors.

In the current implementation, all files are read at once and then distributed to the workers.

With this change, only one batch of files is sent to the workers at a time.
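The batching idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `batched`, `process_in_batches`, and `process_batch` are hypothetical names, and the real script hands its batches to workers rather than to a plain callable.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size paths (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(paths)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def process_in_batches(paths: Iterable[str], batch_size: int,
                       process_batch: Callable[[List[str]], object]) -> list:
    """Run process_batch on one batch at a time, so only that batch is in flight."""
    results = []
    for batch in batched(paths, batch_size):
        # Assumption: process_batch stands in for "send this batch to the workers".
        results.append(process_batch(batch))
    return results
```

With this shape, peak memory is bounded by the size of one batch rather than by the whole file list, at the cost of some parallelism when the batch size is small.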

…g performance

This commit fixes issue NVIDIA#43 (empty files created when invoking the reshard_jsonl method in nemo_curator.utils.file_utils.py) by checking the size of each file after it is generated and deleting any file whose size is zero.

In addition, I noticed there is no need to parse the content of each line into a JSON object, since the lines should already be in JSON format. Removing that extra parsing step significantly speeds up this method.

Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
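The two ideas in the commit message above (copying lines verbatim instead of parsing them, and deleting zero-size output files) might look roughly like this. It is a sketch under stated assumptions, not the actual reshard_jsonl implementation: `reshard_lines`, `remove_empty_files`, and the `shard_N.jsonl` naming are all hypothetical.

```python
import os
from typing import Iterable, List


def reshard_lines(input_paths: Iterable[str], output_dir: str,
                  lines_per_file: int, prefix: str = "shard") -> List[str]:
    """Copy JSONL lines into fixed-size shards without parsing them as JSON."""
    os.makedirs(output_dir, exist_ok=True)
    written: List[str] = []
    out = None
    count = 0
    for path in input_paths:
        with open(path, "r", encoding="utf-8") as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                if out is None or count == lines_per_file:
                    if out is not None:
                        out.close()
                    # Hypothetical naming scheme: shard_0.jsonl, shard_1.jsonl, ...
                    out_path = os.path.join(output_dir, f"{prefix}_{len(written)}.jsonl")
                    out = open(out_path, "w", encoding="utf-8")
                    written.append(out_path)
                    count = 0
                # Write the line as-is; it is assumed to be valid JSON already.
                out.write(line if line.endswith("\n") else line + "\n")
                count += 1
    if out is not None:
        out.close()
    return written


def remove_empty_files(paths: Iterable[str]) -> List[str]:
    """Delete zero-size files and return the paths that were kept."""
    kept = []
    for p in paths:
        if os.path.getsize(p) == 0:
            os.remove(p)
        else:
            kept.append(p)
    return kept
```

Skipping the parse step trades validation for speed: a malformed line is copied through unchanged, so this only makes sense when the inputs are already trusted to be valid JSON Lines.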
Fix issue NVIDIA#43 (empty files creation) and improve reading/writing speed
To reduce potential OOMs, I have added support for batched file reading.

Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Fixed some style errors

Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
I should not have pushed these changes in this PR. Sorry.

Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
@miguelusque
Contributor Author

Please do not review this PR yet. There is something wrong with how the metadata is written.

I will fix it ASAP. Thanks!
