Invoke-WebRequest and Invoke-RestMethod do not decode content in accordance with BOM/Content-Type #11547

Open
he852100 opened this issue Jan 10, 2020 · 11 comments · May be fixed by #19379
Labels
Hacktoberfest Potential candidate to participate in Hacktoberfest In-PR Indicates that a PR is out for the issue Issue-Question ideally support can be provided via other mechanisms, but sometimes folks do open an issue to get a Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module

Comments

@he852100

he852100 commented Jan 10, 2020

The response is not recognized as XML and is returned as garbled text.
Example

$url='https://storage.live.com/items/A78ACCAEBB24EDD7!37945?&authkey=!APfFKTYtceWCfG0'
$g='./xmltest'
$reg='pN|utf'
((irm $URL) -split "[`r`n]+") -match $reg
irm $URL -outfile $g
(Get-Content $g) -match $reg

Expected

PS /sh> irm $URL

xml                            Folder
---                            ------
version="1.0" encoding="utf-8" Folder

PS /sh> (irm $URL).Folder.Items.Document

ItemType ResourceID             RelationshipName
-------- ----------             ----------------
Document A78ACCAEBB24EDD7!37948 测试.json

Results

PS /sh> (iwr $URL).Headers.'Content-Type'
text/xml
PS /sh> ((irm $URL) -split "[`r`n]+") -match $reg
<?xml version="1.0" encoding="utf-8"?>
      <RelationshipName>æµè¯.json</RelationshipName>
  <RelationshipName>BingClients</RelationshipName>

Reading the saved file shows no problem:

PS /s> (get-content  ../aa/irm )-match 'pN|utf'
 <?xml version="1.0" encoding="utf-8"?>
<RelationshipName>测试.json</RelationshipName>
<RelationshipName>BingClients</RelationshipName>
PS /sdcard/Documents/sh>

curl

PS /sdcard/Documents/sh> ((curl $URL) -split "[`r`n]+") -match $reg
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2693  100  2693    0     0   2170      0  0:00:01  0:00:01 --:--:--  2170
<?xml version="1.0" encoding="utf-8"?>
      <RelationshipName>测试.json</RelationshipName>
  <RelationshipName>BingClients</RelationshipName>
@he852100 he852100 added the Issue-Question label Jan 10, 2020
@he852100
Author

he852100 commented Jan 10, 2020

The problem is that live.com is not returning the encoding it's using in its headers. PowerShell obeys the standard by assuming ISO-8859-1, but unfortunately the site is using UTF-8.
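The mismatch described here can be reproduced locally without any network call. The following is a minimal sketch, not the cmdlets' actual code path: it encodes the file name from the example as UTF-8, decodes the bytes as ISO-8859-1 to get the garbled form, and then reverses the mistake. The variable names are illustrative only.

```powershell
# Illustrative only: reproduce the garbling locally.
$utf8   = [System.Text.Encoding]::UTF8
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')

# The file name from the example, built from code points (U+6D4B, U+8BD5)
# so the script does not depend on how the script file itself is encoded.
$original = "$([char]0x6D4B)$([char]0x8BD5).json"

# What the server sends: UTF-8 bytes (3 bytes per CJK character here).
$bytes = $utf8.GetBytes($original)

# What a decoder produces when it assumes ISO-8859-1: one character per byte.
$garbled = $latin1.GetString($bytes)

# ISO-8859-1 maps every byte to exactly one character, so the wrong decoding
# can be undone by re-encoding as ISO-8859-1 and decoding as UTF-8.
$repaired = $utf8.GetString($latin1.GetBytes($garbled))

$garbled.Length          # 11 (6 bytes for the two CJK characters + 5 for ".json")
$repaired -eq $original  # True
```

Some of the garbled characters fall in the C1 control range and are invisible in most terminals, which is why the output in the report appears to show only four odd characters before `.json`.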

Stack Overflow
I am trying to get information from the Spotify database through their Web API. However, I'm facing issues with accented vowels (ä,ö,ü etc.)

Lets take Tiësto as an example.
Spotify's API Browser can

@iSazonov
Collaborator

@he852100 Please add info about your PowerShell version. Can you repro with the latest PowerShell Core build?

@he852100
Author

he852100 commented Jan 12, 2020

PSVersion                      7.0.0-daily.20200110
PSEdition                      Core
GitCommitId                    7.0.0-daily.20200110
OS                             Linux 3.10.0-1062.9.1.el7.x86_64 …
Platform                       Unix
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
sh> Invoke-WebRequest 'https://pscoretestdata.blob.core.windows.net/v7-0-0-daily-20200110/powershell-7.0.0-daily.20200110-linux-arm64.tar.gz' -O ~/powershell.tar.gz -Resume
StatusCode        : 416                                           
StatusDescription : RequestedRangeNotSatisfiable                  
Content           : <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidRange</Code><Message>The rang
                    e specified is invalid for the current size of the resource.
                    RequestId:e8b88225-401e-0127-7cdc-c866f8000000

PS /root> $a.headers.GetEnumerator()

Key             Value
---             -----
Server          {Windows-Azure-Blob/1.0, Microsoft-HTTPAPI/2.0}
x-ms-request-id {322455bd-301e-008d-77e3-c8f642000000}
x-ms-version    {2014-02-14}
Date            {Sun, 12 Jan 2020 00:56:33 GMT}
Content-Length  {249}
Content-Type    {application/xml}
Content-Range   {bytes */46486387}

Windows.net

PowerShell obeys the standard by assuming ISO-8859-1, but unfortunately the site is using UTF-8.

@he852100
Author

@iSazonov This confirms that PowerShell does not recognize the UTF-8 BOM.

@iSazonov
Collaborator

@he852100 I guess it comes from .Net Core.

@scriptingstudio

@he852100 I guess it comes from .Net Core.

That comes from PS 5 and older. If the website says it's UTF-8, why does iwr return ASCII?

@iSazonov iSazonov added the WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module label May 31, 2020
@mklement0
Contributor

mklement0 commented May 31, 2020

Note: I don't know what the intended behavior is, but here is what seems to be happening:

Because the response doesn't indicate a character encoding (charset) in its Content-Type header field (text/xml rather than text/xml; charset=utf-8), PowerShell defaults to ISO-8859-1, in accordance with the - obsolete since 2014 - RFC 2616.

Because it blindly assumes ISO-8859-1, the UTF-8 BOM is read as data, and the payload is therefore not recognized as XML, so the cmdlet falls back to returning a(n incorrectly decoded) string instead of an XmlDocument instance.

Note that the current RFC, RFC 7231, no longer mandates an overall default and instead defers to the default encoding of the given media type.
For XML, RFC 7303 mandates looking at the BOM first and, if there is none, at the charset attribute in the Content-Type header. If that isn't present either, respect the encoding specified in the XML declaration, and if there is none, default to UTF-8.
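The precedence order just described for XML can be sketched as a small function. This is only an illustration under stated assumptions: `Resolve-XmlEncoding` is a hypothetical helper (it is not part of PowerShell or of any proposed fix), it checks only the UTF-8 and UTF-16 BOMs, and it reads the XML declaration with a simple regular expression rather than a real XML parser.

```powershell
function Resolve-XmlEncoding {
    # Hypothetical helper for illustration; not part of PowerShell.
    param(
        [Parameter(Mandatory)] [byte[]] $Bytes,
        [string] $Charset   # charset attribute of the Content-Type header, if any
    )

    # 1. A BOM in the payload wins. Only UTF-8 and UTF-16 are checked here;
    #    a UTF-32 LE BOM (FF FE 00 00) would be mistaken for UTF-16 LE.
    if ($Bytes.Length -ge 3 -and $Bytes[0] -eq 0xEF -and $Bytes[1] -eq 0xBB -and $Bytes[2] -eq 0xBF) {
        return [System.Text.Encoding]::UTF8
    }
    if ($Bytes.Length -ge 2 -and $Bytes[0] -eq 0xFF -and $Bytes[1] -eq 0xFE) {
        return [System.Text.Encoding]::Unicode            # UTF-16 LE
    }
    if ($Bytes.Length -ge 2 -and $Bytes[0] -eq 0xFE -and $Bytes[1] -eq 0xFF) {
        return [System.Text.Encoding]::BigEndianUnicode   # UTF-16 BE
    }

    # 2. No BOM: use the charset attribute (throws for an unknown charset name).
    if ($Charset) {
        return [System.Text.Encoding]::GetEncoding($Charset)
    }

    # 3. No charset: look for encoding="..." in the XML declaration.
    $head = [System.Text.Encoding]::ASCII.GetString($Bytes, 0, [Math]::Min($Bytes.Length, 200))
    if ($head -match '^<\?xml[^>]*encoding\s*=\s*["'']([A-Za-z0-9._-]+)["'']') {
        return [System.Text.Encoding]::GetEncoding($Matches[1])
    }

    # 4. Nothing found: default to UTF-8.
    return [System.Text.Encoding]::UTF8
}

# Usage: a UTF-8 BOM overrides a conflicting charset attribute.
$bom  = [byte[]](0xEF, 0xBB, 0xBF)
$body = [System.Text.Encoding]::UTF8.GetBytes('<?xml version="1.0"?><a/>')
(Resolve-XmlEncoding -Bytes ($bom + $body) -Charset 'iso-8859-1').WebName   # utf-8
(Resolve-XmlEncoding -Bytes $body -Charset 'iso-8859-1').WebName            # iso-8859-1
```

As a possible workaround with the current cmdlets, the raw bytes of a response should be obtainable via `(Invoke-WebRequest $url).RawContentStream.ToArray()` and could then be decoded with the encoding such a helper returns.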

Given that HTML5 now also defaults to UTF-8 and given that RFC 2616 is obsolete, we should consider implementing the following logic in both Invoke-WebRequest and Invoke-RestMethod:

@iSazonov
Collaborator

iSazonov commented Jun 1, 2020

Currently we have many workarounds. I guess they come from PS 5.0.
Now we could use the HttpContent.ReadAsStringAsync() method. It seems to already have the decoding logic:
https://github.com/dotnet/runtime/blob/bd6cbe3642f51d70839912a6a666e5de747ad581/src/libraries/System.Net.Http/src/System/Net/Http/HttpContent.cs#L182
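To see what `HttpContent.ReadAsStringAsync()` does with a BOM-prefixed payload, it can be exercised offline with an in-memory `ByteArrayContent`. This is an exploratory sketch, not a proposed change to the cmdlets: `Read-AsString` is a hypothetical helper, and the results may differ between .NET versions, so the two comparisons at the end are left for the reader to run rather than stated here.

```powershell
Add-Type -AssemblyName System.Net.Http   # typically already loaded in PowerShell 7

# A small XML payload with non-ASCII text (U+6D4B, U+8BD5), prefixed with a UTF-8 BOM.
$text = "<?xml version=`"1.0`"?><n>$([char]0x6D4B)$([char]0x8BD5)</n>"
[byte[]] $payload = [byte[]](0xEF, 0xBB, 0xBF) + [System.Text.Encoding]::UTF8.GetBytes($text)

function Read-AsString {
    # Hypothetical helper: wraps bytes in an HttpContent with the given
    # Content-Type and lets .NET decode them to a string.
    param([byte[]] $Bytes, [string] $ContentType)
    $content = [System.Net.Http.ByteArrayContent]::new($Bytes)
    $content.Headers.ContentType =
        [System.Net.Http.Headers.MediaTypeHeaderValue]::Parse($ContentType)
    $content.ReadAsStringAsync().GetAwaiter().GetResult()
}

# Case 1: no charset attribute, as in the response from the original report.
$noCharset = Read-AsString -Bytes $payload -ContentType 'text/xml'

# Case 2: a charset attribute that contradicts the BOM.
$withCharset = Read-AsString -Bytes $payload -ContentType 'text/xml; charset=iso-8859-1'

# Compare each result with the original text to see how the payload was decoded.
$noCharset -ceq $text
$withCharset -ceq $text
```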


@iSazonov iSazonov added the Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors label Jun 1, 2020
@mklement0
Contributor

mklement0 commented Jun 1, 2020

That's promising, @iSazonov, but it looks like the referenced method gives precedence to the charset attribute over the payload's BOM, correct?

This is the reverse of how XML data is supposed to be handled according to RFC 7303 (leaving the additional need to respect an encoding in the XML declaration aside), and, arguably, for all textual media types, according to section "5. Security Considerations" of RFC 6657:

this document recommends the use of charset information that is more likely to be correct (for example, in-band over out-of-band).

A BOM is an instance of in-band information, whereas the charset header-field attribute is out-of-band information; therefore, the BOM should take precedence.

Therefore, the method you link to wouldn't solve the problem described in #12861, for instance.

@iSazonov
Collaborator

iSazonov commented Jun 2, 2020

the BOM should take precedence

It looks like a .NET bug. You could open a new issue in the .NET Runtime repo.

In general, I guess we could simplify the PowerShell code if we followed the .NET API.

@rjmholt rjmholt changed the title from "[My bug report]irm,iwr get xml Problem" to "Invoke-WebRequest and Invoke-RestMethod do not decode content in accordance with BOM/Content-Type" Dec 11, 2020
@SteveL-MSFT
Member

@PowerShell/wg-powershell-cmdlets reviewed this. We agree that the BOM should take precedence and that, where it makes sense, the web cmdlets should have the same behavior as curl. We're explicitly not making any statement about implementation.

@SteveL-MSFT SteveL-MSFT added the Hacktoberfest Potential candidate to participate in Hacktoberfest label Oct 5, 2022
@CarloToso CarloToso linked a pull request Mar 21, 2023 that will close this issue
22 tasks
@ghost ghost added the In-PR Indicates that a PR is out for the issue label Mar 21, 2023
5 participants