Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-46208: Fix the relative path normalization of _Py_normpath() in fileutils.c #30362

Merged
merged 15 commits into from
Jan 6, 2022
Merged

bpo-46208: Fix the relative path normalization of _Py_normpath() in fileutils.c #30362

merged 15 commits into from
Jan 6, 2022

Conversation

neonene
Copy link
Contributor

@neonene neonene commented Jan 3, 2022

@neonene
Copy link
Contributor Author

neonene commented Jan 3, 2022

Only Windows pass the test. I'll check further.

@neonene
Copy link
Contributor Author

neonene commented Jan 4, 2022

Please review.
Historically, //../foo (relative path?) gets converted to //foo on posix.

@neonene neonene changed the title bpo-46208: Fix the relative path normalization in fileutils.c bpo-46208: Fix the relative path normalization of _Py_normpath() in fileutils.c Jan 4, 2022
@eryksun
Copy link
Contributor

eryksun commented Jan 4, 2022

Historically, //../foo (relative path?) gets converted to //foo on posix.

"//../foo" is an absolute path in POSIX. The ".." component gets resolved by the "//" root, so it normalizes as "//foo". AFAIK, in Linux "//" is functionally equivalent to "/".

In Windows, "//" has to be followed by two path components to be a valid root, e.g. "//server/share". A ".." component should be resolved by a UNC root, but this PR introduces a bug:

>>> os.path.normpath('//server/share/..')
'\\\\server\\share\\..'

It's not important how "//.." and "//server/.." get resolved in Windows, since they're invalid, but they may as well absorb the ".." component if it doesn't complicate the implementation.

@neonene
Copy link
Contributor Author

neonene commented Jan 5, 2022

Thanks for the truly helpful informations.
In Windows, it seems I need to dig into the following results.

ntpath.normpath('//server/share/..')
    pure python ver: '\\\\server\\share\\'    # expected
    c ver(original): '\\\\server\\share\\'    # expected
    c ver(this PR) : '\\\\server\\share\\..'
ntpath.normpath('//server/share/../')
    pure python ver: '\\\\server\\share\\'    # expected
    c ver(original): '\\\\server\\share\\..'
    c ver(this PR) : '\\\\server\\share\\..'
ntpath.normpath('//..')
    pure python ver: '\\'      # absorbed
    c ver(original): '\\\\..'
    c ver(this PR) : '\\\\..'
ntpath.normpath('//../')
    pure python ver: '\\\\..\\'
    c ver(original): '\\\\..\\'
    c ver(this PR) : '\\\\..\\'
ntpath.normpath('//server/..')
    pure python ver: '\\\\server\\..'
    c ver(original): '\\\\server\\..'
    c ver(this PR) : '\\\\server\\..'
ntpath.normpath('//server/../')
    pure python ver: '\\\\server\\..\\'
    c ver(original): '\\\\server\\..\\'
    c ver(this PR) : '\\\\server\\..\\'

@@ -2223,6 +2223,7 @@ _Py_normpath(wchar_t *path, Py_ssize_t size)
wchar_t *pEnd = size >= 0 ? &path[size] : NULL;
wchar_t *p2 = path;
wchar_t *minP2 = path;
int minP2_dotdot_joinable = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add comments on all of the variable declarations here describing what each one represents. the existing code names of p1, p2, minP2, etc. are not meaningful to reason about and understand an algorithm. This isn't supposed to be assembly code. Lets improve its readability as it gets fixed. :)

Otherwise I think your logic change in here is correct.

@@ -2273,6 +2274,7 @@ _Py_normpath(wchar_t *path, Py_ssize_t size)
*p2++ = *p1++;
minP2 = p2;
lastC = SEP;
minP2_dotdot_joinable = 0; // follow pure python's caluculation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you link to the code location of the pure Python code in question?

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@eryksun
Copy link
Contributor

eryksun commented Jan 5, 2022

ntpath.normpath('//..')
pure python ver: '\' # absorbed

The above quoted example is a bug in the pure Python implementation. The "//" path has been reduced to "/", which changes it from an invalid path to a valid path. The "//" path isn't valid since it translates to the NT path "\??\UNC\". A filesystem is never mounted at the root path of the "UNC" device. The "/" path, on the other hand, is a valid relative path. It's the root path of the drive or UNC share of the current working directory.

The standard for correct behavior is WinAPI GetFullPathNameW(). For example:

>>> nt._getfullpathname('//..')
'\\\\'
>>> nt._getfullpathname('//server/..')
'\\\\server\\'

@gpshead
Copy link
Member

gpshead commented Jan 5, 2022

The standard for correct behavior is WinAPI GetFullPathNameW().

Could we actually call that instead of using our own implementation?

@eryksun
Copy link
Contributor

eryksun commented Jan 5, 2022

The standard for correct behavior is WinAPI GetFullPathNameW().

Could we actually call that instead of using our own implementation?

No, it's only a standard for absolute paths, since it returns an absolute path. We use it to implement ntpath.abspath(). It also maps reserved DOS device names such as "con" -> "\\.\con" and strips trailing spaces and dots from paths, neither of which is implemented by normpath().

@nascheme
Copy link
Member

nascheme commented Jan 5, 2022

FYI, I think that _Py_normpath could use a re-write. It is really hard to review and confirm it works correctly, as is. I took a try at re-writing it using the posixpath.py version as a reference, see:

https://gist.github.com/nascheme/553576e857f89083423858f174e23d10

I can make a PR but I have some work to do on it yet. Need to add in the MS Windows behaviour. To me, the C version doesn't match what ntpath does, e.g. with respect to UNC prefixes. I think we need more test cases.

@eryksun
Copy link
Contributor

eryksun commented Jan 5, 2022

the C version doesn't match what ntpath does, e.g. with respect to UNC prefixes.

The pure Python normpath() is wrong to exclude "\\?\" and "\\.\" paths in special_prefixes. Device paths should never bypass explicit normalization. This behavior was added years ago in order to work around a bug in an old implementation that's no longer relevant. It was not added because it's the right thing to do. Again, GetFullPathNameW() sets the standard:

>>> nt._getfullpathname('\\\\?\\C:////spam////eggs. . .')
'\\\\?\\C:\\spam\\eggs'
>>> nt._getfullpathname('\\\\.\\C:////spam////eggs. . .')
'\\\\.\\C:\\spam\\eggs'

"\\.\" device paths never bypass normalization, and "\\?\" extended paths bypass normalization only when accessed, e.g. os.open(), os.mkdir(), and os.stat().

@neonene
Copy link
Contributor Author

neonene commented Jan 5, 2022

I added comments on the variable declarations and fixed //server/share/.. case (only). I'll close the PR later.

BTW, the results of WinAPI below are correct?

>>> import nt
>>> nt._getfullpathname('//..')
'\\\\'
>>> nt._getfullpathname('//../')
'\\\\..\\'
>>> nt._getfullpathname('//../..')
'\\\\..\\'
>>> nt._getfullpathname('//../../')
'\\\\..\\..\\'
>>> nt._getfullpathname('//../../..')
'\\\\..\\'

@eryksun
Copy link
Contributor

eryksun commented Jan 5, 2022

BTW, the results of WinAPI below are correct?

They're correct in as much as the behavior of GetFullPathNameW() sets the standard. Its implementation is shared with the runtime library code that Windows uses to create an NT path for file API calls such as CreateFileW() (API call) -> NtCreateFile() (system call). The NT runtime library has been hacked on since development began on Windows NT in early 1989.

Whether ".." is ever a valid server or share name in a UNC path is questionable. Technically, it's an allowed registered name (reg-name, from RFC3986), and "." is an allowed pchar character (disallowed characters: \/*?<>"|[]:;,=+). However, ".." isn't valid for SMB and NFS server/host names since ".." isn't a valid NETBIOS name, domain name, or IP address. Also, ".." isn't a valid SMB share name or NFS export name since they should be valid filenames in addition to the UNC 1*80pchar restriction [1].

It's clear that GetFullPathNameW() is not designed to handle ".." as a valid server or share name. Any path that uses them as such is de facto invalid. I wouldn't worry about matching the behavior of GetFullPathNameW() exactly for invalid paths. But I strongly prefer for normpath() to never return a valid path for an invalid path. For example, normalize invalid "//.." as invalid "\\\\" or "\\\\..", not as valid "\\".


[1] Some SMB share names are allowed in Windows even if they're not normal filenames, such as "spam..". This can be dysfunctional. For example, os.stat('//localhost/spam..') will fail as not found because the system strips the trailing dots from the share name. An extended path would be required to access the share, e.g. "\\?\UNC\localhost\spam..".

p2 = p3 + 1;
} else {
p2 = p3;
if (p2 == minP2) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should just be able to merge this into the condition at the original line 2295.

@@ -2265,6 +2266,7 @@ _Py_normpath(wchar_t *path, Py_ssize_t size)
}
}
minP2 = p2;
is_absolute = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While reviewing this, it seems like we might need to be setting minP2 one character prior here (probably just moving it to p2 every time we copy a SEP at 2262 but before the increment).

That should leave it pointing at the slash, which is interpreted below as meaning we have an absolute path.

@@ -2273,6 +2275,7 @@ _Py_normpath(wchar_t *path, Py_ssize_t size)
*p2++ = *p1++;
minP2 = p2;
lastC = SEP;
is_absolute = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, if we set minP2 here to point at the second slash, it'll act like an absolute path without the need for more state. But that isn't how we currently treat it, though I'm not sure that we treat it right at the moment anyway.

Technically, this is probably best treated like two leading slashes on Windows, but only preserving one segment instead of two. At least, that's based on my reading of the spec. I've never seen these used in real life (apart from Blender using them subtly differently), so I don't know what the correct expectation would be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The POSIX spec says the following in 4.13 Pathname Resolution:

If a pathname begins with two successive characters, the first component following the leading characters may be interpreted in an implementation-defined manner, although more than two leading characters shall be treated as a single character.

Is this implying that "//.." actually should not be normalized as "//"? In Linux, a "//" root is effectively the same as "/". For example, os.stat('//..') queries the root directory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what it implies. Linux could be using // to imply that the following segment is the top-level directory by that name, but if we're following that rule we should normalise all paths starting with // to start with / and we don't.

If we're allowing paths to use // as defined, we shouldn't be doing anything to change the first segment, because we don't know how it's going to be interpreted.

But that's not the release blocking bug we're facing here, and I'm not aware of anyone raising the issue about it before, so let's just get the smallest fix in to unblock the release and can dive into obscure correctness cases later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the posix spec: It's better for us to aim for strict compatibility with prior Python versions than new interpretations of specs without a practical bug being filed about something not matching a spec exactly causing a logic problem that prevents a desirable use case and on which platform(s).

Otherwise behavior change creates churn for all users who have code depending on past Python stdlib existing behaviors (whether they know it or not or explicitly test for it or not). Hyrum's Law applies here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I brought it up because it was suggested to shift minP2 past the first component. If so, minP2 would have to be special cased for "//.." if it should normalize as "//", per the existing test case. But using "//x" as the root isn't workable anyway because normalizing "//etc/.." as "//etc" wouldn't match the behavior of Linux.

In Cygwin and MSYS2, "//" is used for UNC paths. In MSYS2, I confirmed that the stat result for "//" is different from the result for "/"; that the stat results for "//", "//..", and "//localhost/.." are all the same; and that the stat result for "//localhost/C$/.." is the same as the result for "//localhost". This is compatible with the current implementation of normpath().

The following UNIX stack exchange question discusses current and defunct systems and applications that special case "//".

https://unix.stackexchange.com/questions/256497/on-what-systems-is-foo-bar-different-from-foo-bar

@neonene
Copy link
Contributor Author

neonene commented Jan 5, 2022

Please check.

@@ -329,13 +329,22 @@ def test_expanduser_pwd(self):
("/..", "/"),
("/../", "/"),
("/..//", "/"),
("//.", "//"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confirmed: this matches behavior of 3.8.

Copy link
Member

@zooba zooba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good to me. Anyone else taking a look? Or shall I merge

@zooba
Copy link
Member

zooba commented Jan 6, 2022

The Azure Pipelines PR failure is the test_ftplib issue - doesn't have to block this issue.

@eryksun
Copy link
Contributor

eryksun commented Jan 6, 2022

All looks good to me. Anyone else taking a look? Or shall I merge

For Windows, should there be tests for invalid UNC paths such as "//", "//..", "//../..", "//../../..", "//server", "//server/..", and "//server/../.."? This will help to ensure that future changes never allow an invalid path to be normalized as a valid path.

Also, it's not a major problem that should prevent merging, but the way repeated slashes are handled prior to the second component of a UNC path is less than ideal:

>>> os.path.normpath('//spam///eggs')
'\\\\spam\\\\eggs'
>>> os.path.normpath('//spam///eggs/..')
'\\\\spam\\\\'

This case isn't a valid UNC share, since it's just "//spam", without a share component. However, the repeated slashes start the filepath part and should be reduced to a single backslash. That's what the GetFullPathNameW() call does in abspath():

>>> os.path.abspath('//spam///eggs')
'\\\\spam\\eggs'
>>> os.path.abspath('//spam///eggs/..')
'\\\\spam\\'

@zooba
Copy link
Member

zooba commented Jan 6, 2022

We've never gotten UNC paths correct in every way. They're definitely different now from 3.10, but let's file that as a new bug and try to deal with it before beta.

@zooba zooba merged commit 9c5fa9c into python:main Jan 6, 2022
@neonene
Copy link
Contributor Author

neonene commented Jan 6, 2022

Thanks all for reviewing and merging.

For Windows, should there be tests for invalid UNC paths such as "//", "//..", "//../..", "//../../..", "//server", "//server/..", and "//server/../.."?

I think //.. can be tested after pure ntpath.normpath() is fixed, which will run on non-Windows machines.

@neonene neonene deleted the bpo-46208 branch January 6, 2022 21:49
@hugovk
Copy link
Member

hugovk commented Jan 6, 2022

And thanks @neonene for the fix and identifying some followup fixes too!

Confirmed fixed in main (on macOS) for both my test case in https://bugs.python.org/issue46208 and the original Sphinx issue: sphinx-doc/sphinx#10030.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants