fix attention mask collation #1603

Merged
merged 1 commit into main on May 14, 2024
Conversation

winglian (Collaborator) commented May 8, 2024

fixes #1597 @timpal0l

not even sure how the previous code worked 🤷

Here's the relevant part of the YAML that worked for me:

pretraining_dataset:
  - path: HuggingFaceTB/cosmopedia_6M
    split: train
    type: completion
max_steps: 10_000

pretrain_multipack_attn: true
pretrain_multipack_buffer_size: 10_000

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
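
For context, here is a minimal sketch of the sample-packing idea behind these options (the `pack_samples` name and the segment-id convention are illustrative assumptions, not axolotl's actual collator): several tokenized samples are concatenated into one packed row, and the attention mask stores a per-sample segment id rather than plain 0/1 so downstream code can still tell the samples apart.

```python
import torch

def pack_samples(samples, max_len, pad_token_id=0):
    """Concatenate tokenized samples into one packed row and build a
    segment-id attention mask (1, 2, 3, ... per sample; 0 marks padding)."""
    input_ids, attention_mask = [], []
    for seg_id, ids in enumerate(samples, start=1):
        input_ids.extend(ids)
        attention_mask.extend([seg_id] * len(ids))
    # truncate, then pad out to the packed sequence length
    input_ids = input_ids[:max_len]
    attention_mask = attention_mask[:max_len]
    pad = max_len - len(input_ids)
    return (
        torch.tensor([input_ids + [pad_token_id] * pad]),
        torch.tensor([attention_mask + [0] * pad]),
    )

ids, mask = pack_samples([[11, 12, 13], [21, 22], [31, 32, 33, 34]], max_len=12)
# mask -> tensor([[1, 1, 1, 2, 2, 3, 3, 3, 3, 0, 0, 0]])
```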

timpal0l commented May 8, 2024

@winglian I will try!

This should make training faster if I understand correctly? Since, as I understand it, the attention mask was not being set correctly with this set to false? (I guess this is only relevant when sample_packing: false as well?)

This new error popped up instead:

RuntimeError: CUDA error: an illegal memory access was encountered

ali-mosavian (Contributor)

Not sure how this is done in Axolotl, but the idea is to use the attention mask to prohibit attention across samples, if correctly implemented.

winglian (Collaborator, Author)

> Not sure how this is done in Axolotl, but the idea is to use the attention mask to prohibit attention across samples, if correctly implemented.

Yes, the general philosophy of axolotl is to ensure the highest quality for SFT by preventing attention across packed samples. However, we let this be toggled on or off for (continued) pretraining, because in those cases you want to concatenate samples to reach the full context length. Depending on the attention implementation (FA2 vs SDPA vs eager), the model may otherwise never see position_ids up to the maximum context length. I don't have evidence one way or the other that this is bad, but I would expect you'd want to maintain the existing context length during continued pretraining.
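
As a rough illustration of that point (a sketch assuming the segment-id mask convention above, not axolotl's or FlashAttention's actual code path): a packed-sample mask can be expanded into a block-diagonal causal attention matrix so tokens only attend within their own sample, whereas with pretrain_multipack_attn: false the packed row is effectively treated as one long sample and attention flows across the boundaries.

```python
import torch

def block_diagonal_causal_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """segment_ids: (batch, seq_len) ints, 0 = padding, 1..N = sample id.
    Returns (batch, seq_len, seq_len) booleans, True = attention allowed."""
    same_sample = segment_ids.unsqueeze(-1) == segment_ids.unsqueeze(-2)
    not_padding = (segment_ids > 0).unsqueeze(-1)
    seq_len = segment_ids.shape[-1]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return same_sample & not_padding & causal

seg = torch.tensor([[1, 1, 1, 2, 2, 0]])
print(block_diagonal_causal_mask(seg).int()[0])
# rows 3-4 (sample 2) have zeros in columns 0-2, i.e. no attention into sample 1
```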

winglian merged commit 0298273 into main on May 14, 2024
4 checks passed
Development

Successfully merging this pull request may close these issues.

Streaming large datasets not working with pretrain_multipack_attn: true