
Misc Changes #7264

Merged: 7 commits merged into dotnet:main on Oct 11, 2024

Conversation

@tarekgh (Member) commented Oct 9, 2024

The changes here include the following:

  • Support o1 model labels in the Tiktoken tokenizer
  • Replace the use of Tuple in EncodedToken with Range and remove the TorchSharp Range/Index implementation
  • Rename the SentencePieceBpeTokenizer to allow adding more models to it in the future
  • Make the Tokenizer.Decode method return a non-nullable string
  • Add support for added tokens in the BPE tokenizer

Each commit in the PR represents one of the listed changes. Reviewing the commits separately will make the changes easier to understand.
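The EncodedToken change in the list above can be sketched roughly as follows. This is an illustrative sketch, not the actual Microsoft.ML.Tokenizers source; the member names and shape are assumptions. The point is that a System.Range offset, unlike a tuple, can index directly into the original text:

```csharp
// Illustrative sketch only -- not the actual ML.NET source.
// The change replaces a tuple-based offset with System.Range,
// so a token's span can be applied directly to the input string.
using System;

public readonly struct EncodedToken
{
    public int Id { get; }
    public string Value { get; }
    public Range Offset { get; }   // previously a tuple of offsets

    public EncodedToken(int id, string value, Range offset)
    {
        Id = id;
        Value = value;
        Offset = offset;
    }
}

public static class Demo
{
    public static void Main()
    {
        string text = "Hello world";
        var token = new EncodedToken(42, "Hello", 0..5);

        // A Range can index the source string directly (C# 8+).
        Console.WriteLine(text[token.Offset]); // prints "Hello"
    }
}
```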

@tarekgh tarekgh self-assigned this Oct 9, 2024
@tarekgh (Member, Author) commented Oct 9, 2024

@michaelgsharp Could you please help review this change?

@ericstj could you please help review the changes in the csproj files that add the new Microsoft.Bcl.Memory dependency and enable restoring from the net9 feed? The second commit includes these changes.

@ericstj (Member) left a comment:

Dependency changes look OK. @michaelgsharp, can you please review as well?

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 92.19858% with 22 lines in your changes missing coverage. Please review.

Project coverage is 68.80%. Comparing base (9baf26b) to head (af53b84).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs 84.21% 5 Missing and 1 partial ⚠️
src/Microsoft.ML.TorchSharp/Loss/FocalLoss.cs 80.00% 4 Missing ⚠️
...crosoft.ML.GenAI.Core/Pipeline/CausalLMPipeline.cs 0.00% 3 Missing ⚠️
...oft.ML.Tokenizers/Model/EnglishRobertaTokenizer.cs 66.66% 2 Missing and 1 partial ⚠️
.../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs 90.00% 0 Missing and 2 partials ⚠️
src/Microsoft.ML.TorchSharp/Utils/RangeUtil.cs 60.00% 0 Missing and 2 partials ⚠️
src/Microsoft.ML.Tokenizers/Model/Phi2Tokenizer.cs 0.00% 1 Missing ⚠️
...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7264      +/-   ##
==========================================
+ Coverage   68.78%   68.80%   +0.01%     
==========================================
  Files        1463     1461       -2     
  Lines      272288   272405     +117     
  Branches    28177    28176       -1     
==========================================
+ Hits       187299   187427     +128     
+ Misses      77745    77743       -2     
+ Partials     7244     7235       -9     
Flag Coverage Δ
Debug 68.80% <92.19%> (+0.01%) ⬆️
production 63.29% <86.66%> (+<0.01%) ⬆️
test 89.07% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...c/Microsoft.ML.GenAI.LLaMA/LlamaTokenizerHelper.cs 100.00% <100.00%> (ø)
src/Microsoft.ML.Tokenizers/EncodedToken.cs 100.00% <100.00%> (ø)
...rc/Microsoft.ML.Tokenizers/Model/LlamaTokenizer.cs 59.09% <ø> (ø)
...Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs 78.28% <100.00%> (+0.37%) ⬆️
src/Microsoft.ML.Tokenizers/Model/Word.cs 61.42% <100.00%> (ø)
...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs 100.00% <100.00%> (ø)
...ft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs 87.23% <100.00%> (ø)
src/Microsoft.ML.Tokenizers/Tokenizer.cs 84.57% <100.00%> (ø)
...rc/Microsoft.ML.TorchSharp/AutoFormerV2/Anchors.cs 88.13% <100.00%> (ø)
.../Microsoft.ML.TorchSharp/AutoFormerV2/Attention.cs 98.87% <100.00%> (ø)
... and 18 more

... and 10 files with indirect coverage changes

@@ -919,7 +919,7 @@ private Tensor PrepInputTensors(ref MLImage image, ValueGetter<MLImage> imageGet
     var padW = 32 - (image.Width % 32);
     var padH = 32 - (image.Height % 32);
     var transMidTensor = torch.zeros(1, 3, image.Height + padH, image.Width + padW, device: _parent.Device);
-    transMidTensor[.., .., ..image.Height, ..image.Width] = reMidTensor / 255.0;
+    transMidTensor[RangeUtil.ToTensorIndex(..), RangeUtil.ToTensorIndex(..), RangeUtil.ToTensorIndex(..image.Height), RangeUtil.ToTensorIndex(..image.Width)] = reMidTensor / 255.0;
(Member) commented:

Why do we need to change this from ..?

@tarekgh (Member, Author) replied:

transMidTensor accepts TensorIndex, and .. is a Range type that is now defined in another library, so we need a way to convert .. (a Range) to a TensorIndex. One way to make this syntax work again would be to add an implicit conversion inside the TensorIndex struct. @tannergooding may be able to advise on that.

@tarekgh (Member, Author) replied:

By the way, this syntax used to work because we had our own implementation of Range, which had an implicit conversion to TensorIndex.
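The implicit-conversion idea discussed in these comments can be sketched as follows. TensorIndex here is a simplified stand-in, not TorchSharp's actual type; it only illustrates how an implicit operator would let `..` be passed where a TensorIndex is expected:

```csharp
// Simplified stand-in for TorchSharp's TensorIndex -- illustrative only.
using System;

public readonly struct TensorIndex
{
    public Range Range { get; }

    private TensorIndex(Range range) => Range = range;

    // An implicit conversion like this would let `..` (a System.Range)
    // be used wherever a TensorIndex is expected, restoring the old
    // `tensor[.., ..height]` indexing syntax without RangeUtil calls.
    public static implicit operator TensorIndex(Range range) => new TensorIndex(range);
}

public static class Demo
{
    // A method taking TensorIndex, as tensor indexers do.
    public static string Describe(TensorIndex index) =>
        $"{index.Range.Start}..{index.Range.End}";

    public static void Main()
    {
        // `..2` converts implicitly; no explicit conversion helper needed.
        Console.WriteLine(Describe(..2)); // prints "0..2"
    }
}
```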

@tarekgh (Member, Author) replied:

@michaelgsharp, let me know if you want to open an issue to track that.

@tarekgh tarekgh merged commit 823fc17 into dotnet:main Oct 11, 2024
25 checks passed