WebCmdlets get encoding from BOM #19379

Open: wants to merge 23 commits into base: master

CarloToso
Contributor

@CarloToso CarloToso commented Mar 21, 2023

PR Summary

  • Add EncodingHelper to detect the response encoding from the BOM
  • Add tests

PR Context

using: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/IO/StreamReader.cs

fixes #11547

PR Checklist

stream.ReadExactly(buffer, 0, 4);
}

EncodingHelper.TryDetectEncodingFromBom(buffer, out encoding, out int preambleLength);
Collaborator

I would be surprised if this is not already done in .NET. It might be worth looking at HttpClient and other related code.

Contributor Author

I read the code in .NET and it doesn't quite match our needs: it doesn't consider UTF-32 BE, and it checks for the BOM after checking the charset.
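
For readers following along, here is a minimal sketch of BOM detection that also covers UTF-32 BE. It is not the PR's EncodingHelper; the class and method names (BomSketch, TryDetect) are hypothetical and only the byte patterns are standard. The UTF-32 checks come first because the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not the PR's EncodingHelper.
internal static class BomSketch
{
    internal static bool TryDetect(ReadOnlySpan<byte> bytes, out Encoding? encoding, out int preambleLength)
    {
        // UTF-32 must be tested before UTF-16: FF FE 00 00 starts with FF FE.
        // (A UTF-16 LE stream whose first character is U+0000 is ambiguous here.)
        if (bytes.Length >= 4 && bytes[0] == 0xFF && bytes[1] == 0xFE && bytes[2] == 0x00 && bytes[3] == 0x00)
        {
            encoding = new UTF32Encoding(bigEndian: false, byteOrderMark: true);
            preambleLength = 4;
            return true;
        }

        if (bytes.Length >= 4 && bytes[0] == 0x00 && bytes[1] == 0x00 && bytes[2] == 0xFE && bytes[3] == 0xFF)
        {
            encoding = new UTF32Encoding(bigEndian: true, byteOrderMark: true);
            preambleLength = 4;
            return true;
        }

        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        {
            encoding = Encoding.UTF8;
            preambleLength = 3;
            return true;
        }

        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        {
            encoding = Encoding.Unicode; // UTF-16 LE
            preambleLength = 2;
            return true;
        }

        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            encoding = Encoding.BigEndianUnicode; // UTF-16 BE
            preambleLength = 2;
            return true;
        }

        // No BOM found; the caller decides which fallback to use.
        encoding = null;
        preambleLength = 0;
        return false;
    }
}
```

A caller would typically seek past preambleLength bytes before decoding the rest of the stream with the detected encoding.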

Collaborator

it checks for the BOM after checking the charset

Maybe that is the more correct behavior? Could you please point to the code?

Contributor Author

Collaborator

The code is 8 years old. So we should follow the logic too.

Contributor Author

I followed what was discussed here: #11547 (comment)

Collaborator

Thanks for pointing out the history. It is correct.

Can we use StreamReader with auto encoding detection?
https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/IO/StreamReader.cs,109

Contributor Author

@CarloToso CarloToso Mar 22, 2023

@iSazonov I tried using StreamReader with auto encoding detection, what do you think?

        internal static string DecodeStream(Stream stream, string characterSet, out Encoding encoding)
        {
            bool isDefaultEncoding = false;

            StreamReader reader = new(stream, detectEncodingFromByteOrderMarks: true);

            // reader.CurrentEncoding defaults to UTF8
            encoding = reader.CurrentEncoding;

            if (encoding == Encoding.UTF8)
            {
                isDefaultEncoding = !TryGetEncodingFromCharset(characterSet, out encoding);
            }

            if (isDefaultEncoding)
            {
                reader = new(stream, encoding);

                // We only look within the first 1k characters as the meta element and
                // the xml declaration are at the start of the document
                int bufferLength = (int)Math.Min(reader.BaseStream.Length, 1024);

                char[] buffer = new char[bufferLength];
                reader.ReadBlock(buffer, 0, bufferLength);
                stream.Seek(0, SeekOrigin.Begin);

                string substring = new(buffer);

                // Check for a charset attribute on the meta element to override the default
                Match match = s_metaRegex.Match(substring);
                
                // Check for an encoding attribute on the xml declaration to override the default
                if (!match.Success)
                {
                    match = s_xmlRegex.Match(substring);
                }
                
                if (match.Success)
                {
                    characterSet = match.Groups["charset"].Value;

                    if (TryGetEncodingFromCharset(characterSet, out Encoding localEncoding))
                    {
                        encoding = localEncoding;
                    }
                }
            }
            reader.Dispose();

            return new StreamReader(stream, encoding).ReadToEnd();
        }

Collaborator

I guess we can use https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/IO/StreamReader.cs,119
and pass an encoding based on characterSet. This will automatically use the BOM if present.

@CarloToso CarloToso changed the title WIP WebCmdlets get encoding from BOM WebCmdlets get encoding from BOM Mar 24, 2023

stream.Seek(preambleLength, SeekOrigin.Begin);
string content = StreamToString(stream, encoding);
encoding = reader.CurrentEncoding;
Collaborator

CurrentEncoding is updated only after we start reading.

But should we still process the isDefaultEncoding block below? I guess if we had to process XML, this would mean that the original request is fundamentally flawed. I doubt it's worth complicating the code for that.
@mklement0 What do you think?
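
The point about CurrentEncoding can be checked with a small experiment. The sketch below is illustrative only: CurrentEncodingDemo and Probe are hypothetical names, and it reads from an in-memory stream rather than an HTTP response. It records the reader's code page before and after the first read.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class; not part of the PR.
internal static class CurrentEncodingDemo
{
    // Returns the code page reported by CurrentEncoding before and after the first read.
    internal static (int Before, int After) Probe(byte[] payload)
    {
        using var stream = new MemoryStream(payload);
        using var reader = new StreamReader(stream, Encoding.UTF8, detectEncodingFromByteOrderMarks: true);

        // Nothing has been read yet, so this is still the encoding passed to the constructor.
        int before = reader.CurrentEncoding.CodePage;

        // The first read fills the internal buffer and triggers BOM detection.
        reader.Read();

        int after = reader.CurrentEncoding.CodePage;
        return (before, after);
    }
}
```

With a payload that starts with the UTF-16 BE BOM (FE FF), Before is the UTF-8 code page and After is the UTF-16 BE code page, so any decision based on CurrentEncoding has to happen after the first read.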

Contributor

@iSazonov, I'm afraid I don't understand the question (I'm not familiar with the code).

Collaborator

@mklement0 I am asking about point 3 from #11547 (comment)

otherwise, for XML and HTML, respect the encoding specified in the XML declaration

Collaborator

@mklement0 Friendly ping.

Contributor

@mklement0 mklement0 Apr 7, 2023

@iSazonov, sorry, I forgot to respond: I'm still unclear on the details, but what I suggested in #11547 (comment) meant looking no further once a BOM is found. Only if there's none, look at charset. Only if there's none, look at the XML declaration / HTML <meta> element (the latter seems to be absent from the snippet above - is HTML handled elsewhere?).
This hinges on explicitly knowing if a BOM is present.

It is true that the above means that there can be inconsistencies that will be tolerated (e.g., a UTF-8 BOM, but an XML declaration specifying a different encoding), but the above precedence makes it clear which information "wins".
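
The ordering proposed in this comment can be sketched as a small pure function. This is only an illustration of that ordering, not necessarily what the PR implements: PrecedenceSketch, Resolve, and TryGet are hypothetical names, and the final UTF-8 fallback is an assumption.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;
using System.Text;

// Hypothetical sketch of the proposed precedence: BOM, then charset, then in-band declaration.
internal static class PrecedenceSketch
{
    internal static Encoding Resolve(Encoding? fromBom, string? headerCharset, string? declaredCharset)
    {
        // 1. A BOM wins outright; look no further.
        if (fromBom is not null)
        {
            return fromBom;
        }

        // 2. Otherwise use the charset from the Content-Type header, if it is usable.
        if (TryGet(headerCharset, out Encoding? encoding))
        {
            return encoding;
        }

        // 3. Otherwise use the XML declaration / HTML <meta> value, if it is usable.
        if (TryGet(declaredCharset, out encoding))
        {
            return encoding;
        }

        // 4. Assumed fallback for this sketch.
        return Encoding.UTF8;
    }

    private static bool TryGet(string? name, [NotNullWhen(true)] out Encoding? encoding)
    {
        encoding = null;
        if (string.IsNullOrWhiteSpace(name))
        {
            return false;
        }

        try
        {
            encoding = Encoding.GetEncoding(name);
            return true;
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return false;
        }
    }
}
```

Because each source is consulted only when the previous one is absent, a conflict such as a UTF-8 BOM combined with a different XML declaration is tolerated, and the BOM wins.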

Contributor

@mklement0 mklement0 Apr 16, 2023

@iSazonov, good point about breaking changes.
Note that I am not deeply immersed in this, so do tell me if I'm getting something wrong:

Looking at the old code (internal static string DecodeStream(Stream stream, string characterSet, out Encoding encoding, CancellationToken cancellationToken)), I see that the out-of-band information - i.e. a charset attribute - currently already takes precedence, short-circuiting further investigation.

So we need to stick with that to avoid a breaking change, but do note that it contradicts the relevant RFC, as noted in #11547 (comment) (which recommends prioritizing in-band information).
Rethinking my over-specification statement: perhaps the right thing to do, if both out-of-band encoding information and a BOM are present, is to remove the BOM if the indicated and BOM-implied encodings are the same.

In the absence of out-of-band information, the question is then how to handle the in-band information - BOM and/or encoding / charset attribute in <?xml> declaration / <meta> HTML element.

The current code searches only for the latter, and does so incorrectly, as discussed (no support for 2+-byte Unicode encodings).

To fix this:

  • For XML-based in-band encoding information, we can delegate to the .NET APIs based on raw byte streams.
  • For the HTML <meta> case, we still need to do our own text-based matching, but it requires something like the .Replace("\0") approach mentioned above.

This too avoids a breaking change, though (like now) it requires the relatively expensive parsing of the first 1K / letting a .NET method call potentially fail (for potential XML content).

Giving precedence to manual BOM detection would perform better at least with non-XML content, but the only way this would not amount to a breaking change is if the presence of a BOM currently breaks things, which is the case, right?
I'm not sure how much performance is a concern here, however.

Manual BOM detection is still necessary, at least in the absence of out-of-band encoding information and possibly also XML-declaration / HTML <meta> in-band information.
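
One possible reading of the text-based matching idea for the HTML <meta> case is sketched below. It is an interpretation, not code from the PR: MetaSniffSketch, FindDeclaredCharset, and the regular expression are hypothetical (the PR's s_metaRegex is not shown in this thread). The head of the response is decoded byte-for-byte as Latin-1, NUL characters are dropped so that UTF-16 and UTF-32 markup becomes matchable as plain text, and the charset value is then searched for.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical sketch; the pattern below is an assumption, not the PR's s_metaRegex.
internal static class MetaSniffSketch
{
    private static readonly Regex s_charset = new(
        @"<meta\b[^>]*?charset\s*=\s*[""']?(?<charset>[A-Za-z0-9_.:-]+)",
        RegexOptions.IgnoreCase);

    // 'head' is expected to hold only the first bytes of the response (e.g. up to 1024).
    internal static string? FindDeclaredCharset(ReadOnlySpan<byte> head)
    {
        // Latin-1 maps every byte to exactly one char, so decoding cannot fail.
        string text = Encoding.Latin1.GetString(head);

        // In UTF-16 and UTF-32, ASCII characters are padded with 0x00 bytes;
        // removing the resulting NUL chars leaves the ASCII markup intact.
        text = text.Replace("\0", string.Empty);

        Match match = s_charset.Match(text);
        return match.Success ? match.Groups["charset"].Value : null;
    }
}
```

The result is only a charset name; it still has to be resolved to an Encoding, and an unknown name has to be handled by the caller.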

Collaborator

.NET already does encoding detection for XML: https://source.dot.net/#System.Private.Xml/System/Xml/Core/XmlTextReaderImpl.cs,2895
So we have no need to do the same but worse.
Since we first try Feeds/Atom as XML, we can simplify the code: first try reading as XML, then fall back to JSON.

Contributor Author

@iSazonov we only try Feeds/Atom as XML in Invoke-RestMethod not in Invoke-WebRequest

Collaborator

This does not prevent my suggestion.

Contributor Author

@CarloToso CarloToso May 8, 2023

It doesn't, but from my understanding it only applies to Invoke-RestMethod; we could work on your suggestion in another PR.

@CarloToso
Contributor Author

@iSazonov any insight on the many test failures after the last commit?

@iSazonov
Collaborator

I restarted the failed CIs. Let's wait for the result.

@iSazonov
Collaborator

@CarloToso I guess the changes in EncodingController.cs cause the CI failures.

@CarloToso
Contributor Author

@iSazonov it seems the errors were caused by reader.Peek()

@ghost

ghost commented Apr 16, 2023

This pull request has been automatically marked as Review Needed because there has not been any activity for 7 days.
Maintainer, please provide feedback and/or mark it as Waiting on Author

@daxian-dbw daxian-dbw added WG-Cmdlets general cmdlet issues Needs-Triage The issue is new and needs to be triaged by a work group. labels May 1, 2023
@doctordns
Contributor

What is the status of this PR??

@CarloToso
Contributor Author

@doctordns The code is complete and awaiting further review

@SteveL-MSFT SteveL-MSFT added the CommunityDay-Small A small PR that the PS team has identified to prioritize to review label Nov 15, 2023
{
StringBuilder result = new(capacity: ChunkSize);
Decoder decoder = encoding.GetDecoder();
bool isDefaultEncoding = !TryGetEncodingFromCharset(characterSet, out Encoding encoding);
Member

The precedence of encoding detection is unclear to me.
By looking at the code, when characterSet is specified, isDefaultEncoding will be false, and

  • meta element will not be checked
  • encoding detected from BOM will take precedence, then the encoding resolved from characterSet

when characterSet is not specified, isDefaultEncoding will be true, and

  • meta element will be checked
  • character set from the meta element will take precedence, then the encoding detected from BOM, and then the default encoding

As shown, the encoding detection is inconsistent -- if the characterSet specified in the header has lower precedence than the BOM, then why does the characterSet specified in the meta element have higher precedence than the BOM?

Can you please summarize the precedence you use and the reason for it (e.g. to avoid a breaking change? RFC defined? If they contradict each other, then maybe we should make a breaking change to adhere to the RFC?) The summary needs to be put in the code as a comment.

Member

Can you please summarize the precedence you use and the reason for it (e.g. to avoid a breaking change? RFC defined? If they contradict each other, then maybe we should make a breaking change to adhere to the RFC?) The summary needs to be put in the code as a comment.

@CarloToso I see that you've made changes to the encoding detection. But can you please summarize the precedence and the reason for using that precedence?

@microsoft-github-policy-service microsoft-github-policy-service bot added Waiting on Author The PR was reviewed and requires changes or comments from the author before being accept and removed Review - Needed The PR is being reviewed labels Dec 11, 2023
@microsoft-github-policy-service microsoft-github-policy-service bot removed the Waiting on Author The PR was reviewed and requires changes or comments from the author before being accept label Dec 11, 2023
@daxian-dbw
Member

@CarloToso The CIs are failing due to a compilation error. Can you please fix it?

@CarloToso
Contributor Author

@CarloToso The CIs are failing due to a compilation error. Can you please fix it?

There were some extensive changes in #19558, @stevenebutler could you help me once more?

@stevenebutler
Contributor

@CarloToso The CIs are failing due to a compilation error. Can you please fix it?

There were some extensive changes in #19558, @stevenebutler could you help me once more?

Hi @CarloToso - If it's still an issue I may be able to have a look once I'm on vacation in a week or two.

@microsoft-github-policy-service microsoft-github-policy-service bot added the Review - Needed The PR is being reviewed label Dec 21, 2023
Added extension methods to StreamReader that will add a timeout
to each stream read if a timeout property is set in IWR/IRM.
@stevenebutler
Contributor

@CarloToso The CIs are failing due to a compilation error. Can you please fix it?

There were some extensive changes in #19558, @stevenebutler could you help me once more?

Hi @CarloToso - If it's still an issue I may be able to have a look once I'm on vacation in a week or two.

Hi @CarloToso - I have made a PR on your fork with fixes for this PR

CarloToso and others added 3 commits January 4, 2024 23:13
Make encoding changes handle network stall timeouts
This is needed to stop web methods from deadlocking when windows forms are loaded from
within the PowerShell process.

This PR has 159 quantified lines of changes. In general, a change size of up to 200 lines is ideal for the best PR experience!

Quantification details

Label      : Medium
Size       : +121 -38
Percentile : 51.8%

Total files changed: 3

Change summary by file extension:
.cs : +95 -38
.ps1 : +26 -0

Change counts above are quantified counts, based on the PullRequestQuantifier customizations.

@SteveL-MSFT SteveL-MSFT removed the WG-Cmdlets general cmdlet issues label Apr 17, 2024
@microsoft-github-policy-service microsoft-github-policy-service bot removed the Review - Needed The PR is being reviewed label Apr 17, 2024
@microsoft-github-policy-service microsoft-github-policy-service bot added the Review - Needed The PR is being reviewed label Apr 24, 2024
Contributor

This pull request has been automatically marked as Review Needed because there has not been any activity for 7 days.
Maintainer, please provide feedback and/or mark it as Waiting on Author

Labels
  • CommunityDay-Small: A small PR that the PS team has identified to prioritize to review
  • Medium
  • Needs-Triage: The issue is new and needs to be triaged by a work group.
  • Review - Needed: The PR is being reviewed

Successfully merging this pull request may close these issues.

Invoke-WebRequest and Invoke-RestMethod do not decode content in accordance with BOM/Content-Type

8 participants