Update incorrect data processing in DataCollatorForChatML #2172

Open: wants to merge 2 commits into base: main


ruijunfeng

What does this PR do?

Fixes the extra BOS token and the missing EOS token in the returned `input_ids`, and the potentially missing target string in the returned `labels`.

Fixes #2169
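
The three symptoms described above could be checked with something like the sketch below. It is not taken from the PR or from the library's test suite: `check_collated_example` is a hypothetical helper, the token IDs are invented, and the `-100` ignore value is assumed from the usual PyTorch cross-entropy convention. It also assumes a single unpadded sequence that should contain exactly one BOS token.

```python
from typing import List, Sequence

IGNORE_INDEX = -100  # assumption: the common PyTorch loss ignore value


def check_collated_example(
    input_ids: Sequence[int],
    labels: Sequence[int],
    bos_id: int,
    eos_id: int,
    ignore_index: int = IGNORE_INDEX,
) -> List[str]:
    """Return the problems found in one unpadded example (empty list if none)."""
    problems = []
    # More than one BOS means a special token was added twice.
    if list(input_ids).count(bos_id) > 1:
        problems.append("extra BOS token")
    # The sequence should be terminated by EOS.
    if not input_ids or input_ids[-1] != eos_id:
        problems.append("missing EOS token")
    # If every label is masked, there is no target left to train on.
    if all(label == ignore_index for label in labels):
        problems.append("no target tokens in labels")
    return problems


# Toy token IDs, invented for illustration only.
BOS, EOS = 1, 2
buggy = check_collated_example([BOS, BOS, 10, 11, 12], [-100] * 5, BOS, EOS)
fixed = check_collated_example(
    [BOS, 10, 11, 12, EOS], [-100, -100, -100, 12, EOS], BOS, EOS
)
print(buggy)
print(fixed)
```

A real regression test would feed actual collator output for a given tokenizer into checks like these rather than hand-written lists.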

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

kashif (Collaborator) commented Oct 4, 2024

awesome @ruijunfeng can we also have a test for this?

ruijunfeng (Author) commented Oct 4, 2024

> awesome @ruijunfeng can we also have a test for this?

Sure thing. I have tested it on the instruct-tuned versions of the Llama 2 and Gemma 1 series with my own dataset, and it seems to work well. Let me know if you need me to provide anything.


Successfully merging this pull request may close these issues.

Incorrect data processing in DataCollatorForChatML
2 participants