
Misc Changes #7264

Merged: 7 commits merged into dotnet:main on Oct 11, 2024

Conversation

@tarekgh (Member) commented Oct 9, 2024

The changes here include the following:

  • Support o1 model labels in the Tiktoken tokenizer
  • Replace the use of Tuple in EncodedToken with Range and remove the TorchSharp Range/Index implementation
  • Rename the SentencePieceBpeTokenizer to allow adding more models to it in the future
  • Make the Tokenizer.Decode method return a non-nullable string
  • Add support for added tokens in the BPE tokenizer

Each commit in the PR represents one of the listed changes. Reviewing the commits separately will make the changes easier to understand.
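The EncodedToken change in the list above can be sketched roughly as follows. This is an illustrative sketch, not the actual Microsoft.ML.Tokenizers source; the member names and shape are assumptions. The point is that a System.Range offset, unlike a tuple, can index directly into the original text:

```csharp
// Illustrative sketch only -- not the actual ML.NET source.
// The change replaces a tuple-based offset with System.Range,
// so a token's span can be applied directly to the input string.
using System;

public readonly struct EncodedToken
{
    public int Id { get; }
    public string Value { get; }
    public Range Offset { get; }   // previously a tuple of offsets

    public EncodedToken(int id, string value, Range offset)
    {
        Id = id;
        Value = value;
        Offset = offset;
    }
}

public static class Demo
{
    public static void Main()
    {
        string text = "Hello world";
        var token = new EncodedToken(42, "Hello", 0..5);

        // A Range can index the source string directly (C# 8+).
        Console.WriteLine(text[token.Offset]); // prints "Hello"
    }
}
```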

@tarekgh tarekgh self-assigned this Oct 9, 2024
@tarekgh (Member, Author) commented Oct 9, 2024

@michaelgsharp Could you please help review this change?

@ericstj could you please help review the changes in the csproj files that add the new Microsoft.Bcl.Memory dependency and enable restoring from the net9 feed? The second commit includes these changes.

@ericstj (Member) left a comment:

Dependency changes look OK. @michaelgsharp, can you please review as well?

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 92.19858% with 22 lines in your changes missing coverage. Please review.

Project coverage is 68.80%. Comparing base (9baf26b) to head (af53b84).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs 84.21% 5 Missing and 1 partial ⚠️
src/Microsoft.ML.TorchSharp/Loss/FocalLoss.cs 80.00% 4 Missing ⚠️
...crosoft.ML.GenAI.Core/Pipeline/CausalLMPipeline.cs 0.00% 3 Missing ⚠️
...oft.ML.Tokenizers/Model/EnglishRobertaTokenizer.cs 66.66% 2 Missing and 1 partial ⚠️
.../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs 90.00% 0 Missing and 2 partials ⚠️
src/Microsoft.ML.TorchSharp/Utils/RangeUtil.cs 60.00% 0 Missing and 2 partials ⚠️
src/Microsoft.ML.Tokenizers/Model/Phi2Tokenizer.cs 0.00% 1 Missing ⚠️
...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7264      +/-   ##
==========================================
+ Coverage   68.78%   68.80%   +0.01%     
==========================================
  Files        1463     1461       -2     
  Lines      272288   272405     +117     
  Branches    28177    28176       -1     
==========================================
+ Hits       187299   187427     +128     
+ Misses      77745    77743       -2     
+ Partials     7244     7235       -9     
Flag Coverage Δ
Debug 68.80% <92.19%> (+0.01%) ⬆️
production 63.29% <86.66%> (+<0.01%) ⬆️
test 89.07% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...c/Microsoft.ML.GenAI.LLaMA/LlamaTokenizerHelper.cs 100.00% <100.00%> (ø)
src/Microsoft.ML.Tokenizers/EncodedToken.cs 100.00% <100.00%> (ø)
...rc/Microsoft.ML.Tokenizers/Model/LlamaTokenizer.cs 59.09% <ø> (ø)
...Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs 78.28% <100.00%> (+0.37%) ⬆️
src/Microsoft.ML.Tokenizers/Model/Word.cs 61.42% <100.00%> (ø)
...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs 100.00% <100.00%> (ø)
...ft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs 87.23% <100.00%> (ø)
src/Microsoft.ML.Tokenizers/Tokenizer.cs 84.57% <100.00%> (ø)
...rc/Microsoft.ML.TorchSharp/AutoFormerV2/Anchors.cs 88.13% <100.00%> (ø)
.../Microsoft.ML.TorchSharp/AutoFormerV2/Attention.cs 98.87% <100.00%> (ø)
... and 18 more

... and 10 files with indirect coverage changes

@@ -919,7 +919,7 @@ private Tensor PrepInputTensors(ref MLImage image, ValueGetter<MLImage> imageGet
     var padW = 32 - (image.Width % 32);
     var padH = 32 - (image.Height % 32);
     var transMidTensor = torch.zeros(1, 3, image.Height + padH, image.Width + padW, device: _parent.Device);
-    transMidTensor[.., .., ..image.Height, ..image.Width] = reMidTensor / 255.0;
+    transMidTensor[RangeUtil.ToTensorIndex(..), RangeUtil.ToTensorIndex(..), RangeUtil.ToTensorIndex(..image.Height), RangeUtil.ToTensorIndex(..image.Width)] = reMidTensor / 255.0;
(Member) commented:

Why do we need to change this from ..?

@tarekgh (Member, Author) replied:

transMidTensor accepts TensorIndex, and .. is a Range type that is now defined in another library, so we need a way to convert .. (a Range) to a TensorIndex. One way to make this syntax work again would be to add an implicit conversion inside the TensorIndex struct. @tannergooding may be able to advise on that.

@tarekgh (Member, Author) replied:

By the way, this syntax used to work because we had our own implementation of Range, which had an implicit conversion to TensorIndex.
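The implicit-conversion idea discussed in these comments can be sketched as follows. TensorIndex here is a simplified stand-in, not TorchSharp's actual type; it only illustrates how an implicit operator would let `..` be passed where a TensorIndex is expected:

```csharp
// Simplified stand-in for TorchSharp's TensorIndex -- illustrative only.
using System;

public readonly struct TensorIndex
{
    public Range Range { get; }

    private TensorIndex(Range range) => Range = range;

    // An implicit conversion like this would let `..` (a System.Range)
    // be used wherever a TensorIndex is expected, restoring the old
    // `tensor[.., ..height]` indexing syntax without RangeUtil calls.
    public static implicit operator TensorIndex(Range range) => new TensorIndex(range);
}

public static class Demo
{
    // A method taking TensorIndex, as tensor indexers do.
    public static string Describe(TensorIndex index) =>
        $"{index.Range.Start}..{index.Range.End}";

    public static void Main()
    {
        // `..2` converts implicitly; no explicit conversion helper needed.
        Console.WriteLine(Describe(..2)); // prints "0..2"
    }
}
```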

@tarekgh (Member, Author) replied:

@michaelgsharp, let me know if you want to open an issue to track that.

@tarekgh tarekgh merged commit 823fc17 into dotnet:main Oct 11, 2024
25 checks passed