WebCmdlets get encoding from BOM #19379

CarloToso · 2023-03-21T12:19:49Z

PR Summary

Add EncodingHelper to detect the response encoding from the BOM
Add tests

PR Context

using: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/IO/StreamReader.cs

fixes #11547

PR Checklist

iSazonov · 2023-03-21T17:05:55Z

src/Microsoft.PowerShell.Commands.Utility/commands/utility/WebCmdlet/StreamHelper.cs

+                stream.ReadExactly(buffer, 0, 4);
+            }
+
+            EncodingHelper.TryDetectEncodingFromBom(buffer, out encoding, out int preambleLength);


I would be surprised if this is not already done in .Net. It might be worth looking at HttpClient and other related code.

I read the code in .Net and it doesn't match perfectly with our needs (it doesn't consider UTF32-BE, it checks for the BOM after checking the charset)
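For readers following along, here is a minimal sketch of manual BOM detection that covers UTF-32 BE as well. It is not the PR's actual EncodingHelper: the class name `BomSniffer` and the method `TryDetect` are hypothetical and exist only for this illustration.

```csharp
#nullable enable
using System;
using System.Text;

// Hypothetical helper for illustration; not the PR's EncodingHelper.
public static class BomSniffer
{
    // Maps the leading bytes of a body to an encoding.
    // preambleLength is the number of BOM bytes to skip before decoding.
    public static bool TryDetect(ReadOnlySpan<byte> head, out Encoding? encoding, out int preambleLength)
    {
        // The 4-byte UTF-32 patterns go first: the UTF-32 LE BOM (FF FE 00 00)
        // starts with the UTF-16 LE BOM (FF FE). A UTF-16 LE body whose first
        // character is U+0000 would be misread here; this sketch accepts that.
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
        {
            encoding = new UTF32Encoding(bigEndian: false, byteOrderMark: true);
            preambleLength = 4;
            return true;
        }

        if (head.Length >= 4 && head[0] == 0x00 && head[1] == 0x00 && head[2] == 0xFE && head[3] == 0xFF)
        {
            encoding = new UTF32Encoding(bigEndian: true, byteOrderMark: true);
            preambleLength = 4;
            return true;
        }

        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        {
            encoding = Encoding.UTF8;
            preambleLength = 3;
            return true;
        }

        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        {
            encoding = Encoding.Unicode; // UTF-16 LE
            preambleLength = 2;
            return true;
        }

        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        {
            encoding = Encoding.BigEndianUnicode; // UTF-16 BE
            preambleLength = 2;
            return true;
        }

        encoding = null;
        preambleLength = 0;
        return false;
    }
}
```

A caller would read up to four bytes from the start of the body, call `TryDetect`, and then skip `preambleLength` bytes before decoding.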

it checks for the BOM after checking the charset

Maybe that is the more correct behavior? Could you please point me to the code?

Gets encoding from charset:
https://github.com/dotnet/runtime/blob/bd6cbe3642f51d70839912a6a666e5de747ad581/src/libraries/System.Net.Http/src/System/Net/Http/HttpContent.cs#L194-L219

Gets encoding from bom:
https://github.com/dotnet/runtime/blob/bd6cbe3642f51d70839912a6a666e5de747ad581/src/libraries/System.Net.Http/src/System/Net/Http/HttpContent.cs#L221-L234

That code is 8 years old, so we should follow its logic too.

I followed what was discussed here: #11547 (comment)

Thanks for pointing out the history. It is correct.

Can we use StreamReader with auto encoding detection?
https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/IO/StreamReader.cs,109

@iSazonov I tried using StreamReader with auto encoding detection, what do you think?

internal static string DecodeStream(Stream stream, string characterSet, out Encoding encoding)
{
    bool isDefaultEncoding = false;

    StreamReader reader = new(stream, detectEncodingFromByteOrderMarks: true);

    // reader.CurrentEncoding defaults to UTF8
    encoding = reader.CurrentEncoding;

    if (encoding == Encoding.UTF8)
    {
        isDefaultEncoding = !TryGetEncodingFromCharset(characterSet, out encoding);
    }

    if (isDefaultEncoding)
    {
        reader = new(stream, encoding);

        // We only look within the first 1k characters as the meta element and
        // the xml declaration are at the start of the document
        int bufferLength = (int)Math.Min(reader.BaseStream.Length, 1024);
        char[] buffer = new char[bufferLength];
        reader.ReadBlock(buffer, 0, bufferLength);
        stream.Seek(0, SeekOrigin.Begin);
        string substring = new(buffer);

        // Check for a charset attribute on the meta element to override the default
        Match match = s_metaRegex.Match(substring);

        // Check for a encoding attribute on the xml declaration to override the default
        if (!match.Success)
        {
            match = s_xmlRegex.Match(substring);
        }

        if (match.Success)
        {
            characterSet = match.Groups["charset"].Value;

            if (TryGetEncodingFromCharset(characterSet, out Encoding localEncoding))
            {
                encoding = localEncoding;
            }
        }
    }

    reader.Dispose();
    return new StreamReader(stream, encoding).ReadToEnd();
}

I guess we can use https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/IO/StreamReader.cs,119
and pass an encoding based on characterSet. This will automatically use the BOM if present.
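A minimal sketch of that suggestion, assuming the charset arrives as a plain string and that UTF-8 is the fallback when it is missing or unknown. The class `CharsetFallbackDemo` and its `Decode` method are hypothetical names; only the `StreamReader(Stream, Encoding, bool)` constructor and `Encoding.GetEncoding` are real .NET APIs.

```csharp
#nullable enable
using System;
using System.IO;
using System.Text;

// Hypothetical illustration; not the PR's DecodeStream.
public static class CharsetFallbackDemo
{
    // Decodes the body with the header charset unless a BOM says otherwise.
    public static string Decode(byte[] body, string? characterSet, out Encoding used)
    {
        Encoding fallback = Encoding.UTF8; // assumed default for this sketch

        if (!string.IsNullOrEmpty(characterSet))
        {
            try
            {
                fallback = Encoding.GetEncoding(characterSet);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the UTF-8 fallback.
            }
        }

        using var stream = new MemoryStream(body);

        // With detectEncodingFromByteOrderMarks set, a BOM at the start of the
        // stream overrides the encoding passed to the constructor.
        using var reader = new StreamReader(stream, fallback, detectEncodingFromByteOrderMarks: true);

        string text = reader.ReadToEnd();
        used = reader.CurrentEncoding; // read only after the data has been consumed
        return text;
    }
}
```

Note that this shape gives the BOM precedence over the charset, which is one of the points debated in this thread.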

iSazonov · 2023-03-23T08:37:22Z

src/Microsoft.PowerShell.Commands.Utility/commands/utility/WebCmdlet/StreamHelper.cs

-
-            stream.Seek(preambleLength, SeekOrigin.Begin);
-            string content = StreamToString(stream, encoding);
+            encoding = reader.CurrentEncoding;


CurrentEncoding is updated only after we start reading.
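The timing issue can be shown with a small, self-contained probe. The class and method names here are hypothetical; the behavior relies only on the documented fact that StreamReader performs encoding autodetection on the first read.

```csharp
using System.IO;
using System.Text;

// Hypothetical probe for illustration only.
public static class CurrentEncodingTiming
{
    // Returns the code page reported before and after the first read.
    public static (int Before, int After) Probe()
    {
        // UTF-16 LE BOM followed by "hi".
        byte[] body = { 0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00 };

        using var reader = new StreamReader(
            new MemoryStream(body),
            Encoding.UTF8,
            detectEncodingFromByteOrderMarks: true);

        // Nothing has been read yet, so this is still the constructor's encoding.
        int before = reader.CurrentEncoding.CodePage;

        // Peek fills the internal buffer, which is when BOM detection runs.
        reader.Peek();

        int after = reader.CurrentEncoding.CodePage;
        return (before, after);
    }
}
```

Calling `Peek()` just to trigger detection has a cost on network streams, because it blocks until the first bytes arrive; reading the content first and inspecting `CurrentEncoding` afterwards avoids the extra call.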

But should we still process the isDefaultEncoding block below? I guess if we had to process XML this would mean that the original request is fundamentally flawed. I doubt it's worth complicating the code for that case.
@mklement0 What do you think?

@iSazonov, I'm afraid I don't understand the question (I'm not familiar with the code).

@mklement0 I ask about p.3 from #11547 (comment)

otherwise, for XML and HTML, respect the encoding specified in the XML declaration

@mklement0 Friendly ping.

@iSazonov, sorry, I forgot to respond: I'm still unclear on the details, but what I suggested in #11547 (comment) meant looking no further once a BOM is found. Only if there's none, look at charset. Only if there's none, look at the XML declaration / HTML <meta> element (the latter seems to be absent from the snippet above - is HTML handled elsewhere?).
This hinges on explicitly knowing if a BOM is present.

It is true that the above means that there can be inconsistencies that will be tolerated (e.g., a UTF-8 BOM, but an XML declaration specifying a different encoding), but the above precedence makes it clear which information "wins".
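The precedence described in this comment (BOM, then charset, then the XML declaration / HTML meta element) can be sketched as a simple fall-through. Everything here is hypothetical naming for illustration, including the UTF-8 default, and it does not reflect what the existing cmdlet code does.

```csharp
#nullable enable
using System;
using System.Text;

// Hypothetical illustration of the suggested precedence.
public static class EncodingPrecedence
{
    // The first source that yields an encoding wins; later sources are ignored.
    public static Encoding Resolve(Encoding? fromBom, string? headerCharset, string? declaredInDocument)
    {
        return fromBom
            ?? Lookup(headerCharset)
            ?? Lookup(declaredInDocument)
            ?? Encoding.UTF8; // assumed default for this sketch
    }

    // Returns null when the name is missing or not a known encoding.
    private static Encoding? Lookup(string? name)
    {
        if (string.IsNullOrWhiteSpace(name))
        {
            return null;
        }

        try
        {
            return Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            return null;
        }
    }
}
```

With this shape, a UTF-8 BOM combined with an XML declaration naming another encoding is tolerated, and the BOM wins.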

@iSazonov, good point about breaking changes.
Note that I am not deeply immersed in this, so do tell me if I'm getting something wrong:

Looking at the old code (PowerShell/src/Microsoft.PowerShell.Commands.Utility/commands/utility/WebCmdlet/StreamHelper.cs, line 389 in 420c29d):

internal static string DecodeStream(Stream stream, string characterSet, out Encoding encoding, CancellationToken cancellationToken)

I see that the out-of-band information - i.e. a charset attribute - currently already takes precedence, short-circuiting further investigation.

So we need to stick with that to avoid a breaking change, but do note that it contradicts the relevant RFC, as noted in #11547 (comment) (which recommends prioritizing in-band information).
Rethinking my over-specification statement: perhaps the right thing to do, if both out-of-band encoding information and a BOM are present, is to remove the BOM if the indicated and the BOM-implied encodings are the same.

In the absence of out-of-band information, the question is then how to handle the in-band information - BOM and/or encoding / charset attribute in <?xml> declaration / <meta> HTML element.

The current code searches only for the latter, and does so incorrectly, as discussed (no support for 2+-byte Unicode encodings).

To fix this:

For XML-based in-band encoding information, we can delegate to the .NET APIs based on raw byte streams.

For the HTML <meta> case, we still need to do our own text-based matching, but it requires something like the .Replace("\0") approach mentioned above.

This too avoids a breaking change, though (like now) it requires the relatively expensive parsing of the first 1K / letting a .NET method call potentially fail (for potential XML content).
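One way to picture the NUL-stripping idea for the HTML meta case is below. This is a sketch under assumptions: the regex is a simplified stand-in (the cmdlets' real s_metaRegex may differ), and `MetaCharsetSniffer` / `FindCharset` are hypothetical names. `Encoding.Latin1` requires .NET 5 or later.

```csharp
#nullable enable
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical illustration; not the cmdlets' actual matching code.
public static class MetaCharsetSniffer
{
    // Simplified pattern for this sketch only.
    private static readonly Regex s_meta = new(
        @"<meta\s[^>]*charset\s*=\s*[""']?(?<charset>[A-Za-z0-9_-]+)",
        RegexOptions.IgnoreCase);

    public static string? FindCharset(ReadOnlySpan<byte> body)
    {
        // Only the first 1K bytes are inspected, as in the existing code.
        ReadOnlySpan<byte> head = body.Slice(0, Math.Min(body.Length, 1024));

        // Latin-1 maps every byte to exactly one char, so ASCII markup survives
        // decoding. In UTF-16 / UTF-32 bodies the ASCII characters are padded
        // with NUL bytes, which are stripped before matching.
        string text = Encoding.Latin1.GetString(head).Replace("\0", string.Empty);

        Match match = s_meta.Match(text);
        return match.Success ? match.Groups["charset"].Value : null;
    }
}
```

Because a 1K byte window holds only 512 UTF-16 characters, a real implementation might need a larger window for multi-byte encodings.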

Giving precedence to manual BOM detection would perform better at least with non-XML content, but the only way this would not amount to a breaking change is if the presence of a BOM currently breaks things, which is the case, right?
I'm not sure how much performance is a concern here, however.

Manual BOM detection is still necessary, at least in the absence of out-of-band encoding information and possibly also XML-declaration / HTML <meta> in-band information.

.NET already does encoding detection for XML: https://source.dot.net/#System.Private.Xml/System/Xml/Core/XmlTextReaderImpl.cs,2895
So there is no need for us to do the same thing, only worse.
Since we try Feeds/Atom as XML first, we can simplify the code: try reading as XML first, then fall back to JSON.
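As a rough illustration of delegating XML encoding detection to .NET, the sketch below reads the first node with the public XmlTextReader class and reports its Encoding property. The names `XmlEncodingProbe` and `Detect` are hypothetical, and returning null for non-XML input is an assumption of this sketch, not behavior taken from the PR.

```csharp
#nullable enable
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical illustration of letting the XML reader sniff the encoding.
public static class XmlEncodingProbe
{
    // Returns the encoding the XML reader settled on, or null if the body
    // does not start like well-formed XML.
    public static Encoding? Detect(byte[] body)
    {
        using var stream = new MemoryStream(body);
        using var reader = new XmlTextReader(stream)
        {
            // DTDs are not needed for sniffing and are refused for safety.
            DtdProcessing = DtdProcessing.Prohibit,
        };

        try
        {
            // Reading the first node makes the reader process the BOM and
            // the XML declaration, if present.
            reader.Read();
            return reader.Encoding;
        }
        catch (XmlException)
        {
            return null;
        }
    }
}
```

The caught XmlException is the "potentially failing .NET call" cost mentioned earlier in the thread for content that turns out not to be XML.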

@iSazonov we only try Feeds/Atom as XML in Invoke-RestMethod not in Invoke-WebRequest

This does not prevent my suggestion.

It doesn't, but from my understanding it only applies to Invoke-RestMethod, we could work on your suggestion in another PR

test/tools/WebListener/Controllers/EncodingController.cs

CarloToso · 2023-03-25T10:01:02Z

@iSazonov any insight on the many test failures after the last commit?

iSazonov · 2023-03-25T14:16:11Z

I restarted the failed CIs. Let's wait for the results.

iSazonov · 2023-03-25T18:42:36Z

@CarloToso I guess the changes in EncodingController.cs are causing the CI failures.

CarloToso · 2023-03-26T13:02:13Z

@iSazonov it seems the errors were caused by reader.Peek()

ghost · 2023-04-16T14:00:55Z

This pull request has been automatically marked as Review Needed because there has not been any activity for 7 days.
Maintainer, please provide feedback and/or mark it as Waiting on Author

doctordns · 2023-05-21T10:55:05Z

What is the status of this PR??

CarloToso · 2023-05-21T16:32:39Z

@doctordns The code is complete and awaiting further review

daxian-dbw · 2023-12-11T20:12:10Z

src/Microsoft.PowerShell.Commands.Utility/commands/utility/WebCmdlet/StreamHelper.cs

        {
-            StringBuilder result = new(capacity: ChunkSize);
-            Decoder decoder = encoding.GetDecoder();
+            bool isDefaultEncoding = !TryGetEncodingFromCharset(characterSet, out Encoding encoding);


The precedence of encoding detection is unclear to me.
Looking at the code:

- When characterSet is specified, isDefaultEncoding will be false, and
  - the meta element will not be checked
  - the encoding detected from the BOM takes precedence, then the encoding resolved from characterSet
- When characterSet is not specified, isDefaultEncoding will be true, and
  - the meta element will be checked
  - the character set from the meta element takes precedence, then the encoding detected from the BOM, and then the default encoding

As shown, the encoding detection is inconsistent: if the characterSet specified in the header has lower precedence than the BOM, then why does the characterSet specified in the meta element have higher precedence than the BOM?

Can you please summarize the precedence you use and the reason for it (e.g. to avoid a breaking change? defined by an RFC? If the two contradict each other, then maybe we should make a breaking change to adhere to the RFC)? The summary needs to be put in the code as a comment.

@CarloToso I see that you've made changes to the encoding detection. But can you please summarize the precedence and the reason for using that precedence?

daxian-dbw · 2023-12-11T23:58:14Z

@CarloToso The CIs are failing due to a compilation error. Can you please fix it?

CarloToso · 2023-12-12T00:10:20Z

There were some extensive changes in #19558, @stevenebutler could you help me once more?

stevenebutler · 2023-12-14T03:06:46Z

Hi @CarloToso - If it's still an issue I may be able to have a look once I'm on vacation in a week or two.

Added extension methods to StreamReader that will add a timeout to each stream read if a timeout property is set in IWR/IRM.

stevenebutler · 2024-01-04T21:53:41Z

Hi @CarloToso - I have made a PR on your fork with fixes for this PR

Make encoding changes handle network stall timeouts

This is needed to stop web methods from deadlocking when windows forms are loaded from within the PowerShell process.
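A per-read timeout combined with ConfigureAwait(false) could look roughly like the sketch below. This is not the code from the fork PR: the extension method name `ReadToEndWithTimeoutAsync` and its parameters are hypothetical, and whether a stalled read is actually interrupted depends on the underlying stream honoring cancellation.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration; not the extension methods added in the PR.
public static class StreamReaderTimeoutExtensions
{
    // Reads to the end, cancelling any single read that exceeds perReadTimeout.
    public static async Task<string> ReadToEndWithTimeoutAsync(
        this StreamReader reader,
        TimeSpan perReadTimeout,
        CancellationToken cancellationToken)
    {
        var result = new StringBuilder();
        char[] buffer = new char[4096];

        while (true)
        {
            // A fresh linked token per read, so the timeout restarts each time
            // data arrives instead of bounding the whole download.
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            cts.CancelAfter(perReadTimeout);

            // ConfigureAwait(false) avoids resuming on a captured
            // synchronization context (for example a Windows Forms one).
            int read = await reader.ReadAsync(buffer.AsMemory(), cts.Token).ConfigureAwait(false);

            if (read == 0)
            {
                return result.ToString();
            }

            result.Append(buffer, 0, read);
        }
    }
}
```

A timed-out read surfaces as an OperationCanceledException, which a caller would translate into whatever timeout error the cmdlet reports.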

pull-request-quantifier-deprecated · 2024-01-07T10:54:29Z

This PR has 159 quantified lines of changes. In general, a change size of up to 200 lines is ideal for the best PR experience!

Quantification details

Label      : Medium
Size       : +121 -38
Percentile : 51.8%

Total files changed: 3

Change summary by file extension:
.cs : +95 -38
.ps1 : +26 -0

Change counts above are quantified counts, based on the PullRequestQuantifier customizations.


microsoft-github-policy-service · 2024-04-24T20:23:13Z

This pull request has been automatically marked as Review Needed because there has not been any activity for 7 days.
Maintainer, please provide feedback and/or mark it as Waiting on Author

CarloToso added 3 commits March 21, 2023 12:39

Add EncodingHelper

0cebc78

remove const

dd9811b

add stream.length chack

efe8e0a

CarloToso requested a review from PaulHigin as a code owner March 21, 2023 12:19

ghost assigned TravisEz13 Mar 21, 2023

pull-request-quantifier-deprecated bot added the Small label Mar 21, 2023

use when

3d42ae5

iSazonov reviewed Mar 21, 2023

use StreamReader detectEncodingFromByteOrderMarks

b43ef89

pull-request-quantifier-deprecated bot added Extra Small and removed Small labels Mar 23, 2023

CarloToso added 2 commits March 23, 2023 09:24

fix return

1141be9

stream leaveOpen: true

4a90872

CarloToso changed the title ~~WIP WebCmdlets get encoding from BOM~~ WebCmdlets get encoding from BOM Mar 24, 2023

add tests

a2233fc

CarloToso requested review from daxian-dbw, TravisEz13, adityapatwardhan and anmenaga as code owners March 24, 2023 23:55

pull-request-quantifier-deprecated bot added Small and removed Extra Small labels Mar 24, 2023

CarloToso added 2 commits March 25, 2023 00:58

fix codefactor

af44f31

fix tests

314c979

iSazonov reviewed Mar 25, 2023

reader.Peek()

2d1107b

comment out reader.Peek()

c64591a

daxian-dbw added WG-Cmdlets general cmdlet issues Needs-Triage The issue is new and needs to be triaged by a work group. labels May 1, 2023

CarloToso added 2 commits May 3, 2023 09:48

Merge branch 'master' into encodinghelper

08f4cda

fix nullable

3ecad71

SteveL-MSFT added the CommunityDay-Small A small PR that the PS team has identified to prioritize to review label Nov 15, 2023

daxian-dbw requested changes Dec 11, 2023

microsoft-github-policy-service bot added Waiting on Author The PR was reviewed and requires changes or comments from the author before being accept and removed Review - Needed The PR is being reviewed labels Dec 11, 2023

Add comments to code; better name for variable; encoding precedence t…

61b23d3

…o BOM

pull-request-quantifier-deprecated bot added Medium and removed Small labels Dec 11, 2023

microsoft-github-policy-service bot removed the Waiting on Author The PR was reviewed and requires changes or comments from the author before being accept label Dec 11, 2023

Merge branch 'master' into encodinghelper

c9d68d9

microsoft-github-policy-service bot added the Review - Needed The PR is being reviewed label Dec 21, 2023

Make encoding changes handle network stall timeouts

2edb659

Added extension methods to StreamReader that will add a timeout to each stream read if a timeout property is set in IWR/IRM.

CarloToso and others added 3 commits January 4, 2024 23:13

Merge pull request #2 from stevenebutler/encodinghelper-patch

311a0ad

Make encoding changes handle network stall timeouts

Merge branch 'PowerShell:master' into encodinghelper

bf03dfc

Add ConfigureAwait(false) to awaits (#3)

4abf713

This is needed to stop web methods from deadlocking when windows forms are loaded from within the PowerShell process.

SteveL-MSFT removed the WG-Cmdlets general cmdlet issues label Apr 17, 2024

microsoft-github-policy-service bot removed the Review - Needed The PR is being reviewed label Apr 17, 2024

microsoft-github-policy-service bot added the Review - Needed The PR is being reviewed label Apr 24, 2024
