Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distribute a zipapp version of pip #158

Merged
merged 10 commits into from
Sep 18, 2022
Merged

Distribute a zipapp version of pip #158

merged 10 commits into from
Sep 18, 2022

Conversation

pfmoore
Copy link
Member

@pfmoore pfmoore commented Jul 11, 2022

This PR adds generation of a .pyz version of pip to the build script. Two copies of the .pyz are created in the public directory - an unversioned pip.pyz and a versioned pip-22.1.2.pyz. My assumption is that users will typically fetch the unversioned zipapp (and that's what our documentation should suggest) but people who want to pin a specific pip version can fetch the versioned zipapp.

As we don't do any deletion of generated files, older versioned zipapps will accumulate over time.

We could generate older versions, by fetching the wheels and rebuilding the zipapp. I'm not sure if this is worth it, but it might be if we ever plan on adding removal of previously generated files. In that case, we should probably decide on a policy for what zipapp versions we retain (regenerate) - I'd suggest versions for the current and previous year, as a simple rule.

This PR doesn't include a generated zipapp. I wanted to get the generation script sorted out and agreed, and then work on publicising the new method in the pip documentation, before actually publishing anything. Note we should probably not merge this before the pip 22.2 release, to give us time to do the pip side of things (which involves documenting the new distribution option, basically - but getting the messaging right is important so I don't think we want to rush this into 22.2).

@pfmoore pfmoore requested a review from pradyunsg July 11, 2022 10:43
lib = os.path.join(os.path.dirname(__file__), "lib")
sys.path.insert(0, lib)

runpy.run_module("pip", run_name="__main__")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need some sort of minimum python version compatibility check here ? Or is it done by pip somewhere else ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what you mean here. Are you concerned about someone running pip.pyz with Python 3.5? I'm fairly tempted to just say "don't do that". I don't know if pip has such a check, because the Python-requires metadata should stop it being installed for an unsupported version. The zipapp wouldn't have that protection, but is that a major problem?

I guess I could extract Requires-Python from the wheel I'm packing into the zipapp, and embed that into the __main__.py script so I could do a test against sys.executable before calling runpy. Is it worth it? I don't know. Reading metadata in a standards-compliant way is pretty messy. I could probably do it by hacking around with semi-documented bits of importlib.metadata. Or I could just add a dependency on pkg_metadata to the generate script, that would only be at build time. Or I could go full-on hack mode and just look for a line in pip-*.dist-info/METADATA that starts with "Requires-Python:" That would look something like

elif info.filename.endswith(".dist-info/METADATA"):
    data = src.read(info).decode("utf-8")
    m = re.search(r"^Requires-Python:\s*(.*)$", data, re.MULTILINE|re.IGNORECASE)
    rp = None
    if m:
        rp = m.group(1)
    console.log(f"  Zipapp requires Python {rp!r}")

The __main__.py script in the zipapp could use pip's vendored copy of packaging to do the actual check.

What do you think? I'm willing to include one of those approaches if you feel it's worthwhile.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: The check itself isn't too hard, and on reflection I'd just parse the metadata file using pkg_metadata, so implementing this isn't a major issue, the main question is whether it's worth it in the first place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the main question is whether it's worth it in the first place.

Well, even today we sometimes get support requests for users who manage to run pip with an unsupported python.
With this approach I can see myself wrapping the zipapp in a global pip script and launching it from active virtual environments. I'm sure I'll end up myself launching it by accident from older python 2 or python 3.5 environment so...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fair point. I'll add it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sbidoul coming round to this again, the code for the version check here is not compatible with older Python versions. This is the same issue as we had with __pip-runner__.py. However, in this case, I'm a lot less comfortable trying to hard code the version as a tuple, because we'd have to either keep it in sync with the pip repository, or do something nasty like try to read the tuple from __pip-runner__.py.

What are your thoughts here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a version of the check which looks at Python-Requires, and as long as it has the form ">= X.Y" it adds a compatible version check like we do in __pip-runner__.py. If the requirement is more complex than that (which hopefully it won't ever be) the code simply omits the check with a warning (in the generate script, not in the zipapp itself).

We'd have to revisit this if we ever needed a more complex Python requirement in pip in the future. Maybe we should therefore abort the generate rather than just warn? But given that we don't generate here until release day, it might be a bit late to find out at that point 🙁

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the zipapp use __pip_runner__.py ? (Disclaimer: I've not looked closely at all, just throwing the idea, which maybe complete nonsense)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that, and I couldn't work out how. Because __pip-runner__.py locates pip from its own location, it has to be run in place. I tried to get runpy.run_script to work, but it wouldn't co-operate, and I honestly don't know why :-( And I don't reqlly want to invoke pip via an extra subprocess, that just seems wasteful.

I think it should be possible to use runpy.run_script for this, but the runpy docs are so obscure I find it really hard to work out how to do anything with them :-( I'll do some further tests, though - I might be able to get it to work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it appears that neither runpy.run_script nor the Python interpreter itself will run a file as python arc.zip/something.py. So I think we have to copy the check.

I've updated the check to be 2.x compatible, by parsing a Requires-Python of ">=X.Y". I'm happy to improve this, but I don't have any particularly good ideas of what would be better.

@pradyunsg
Copy link
Member

Whelp, before we merge this, we need to update the CI checks to account for untracked files being generated by the generate script.

@pfmoore
Copy link
Member Author

pfmoore commented Jul 15, 2022

Thanks, I wasn't aware of that. Added to the list. My current plan is to pick this up after the 22.2 release, at which time we can sort out details like

  • documenting and publicising the zipapp
  • finalising the deployment process and merging this PR
  • deciding whether we want to test the zipapp in pip's CI.

Edit: Wait, CI ran on this PR and didn't fail. So what exactly do you mean by accounting for untracked files?

@pradyunsg
Copy link
Member

pradyunsg commented Jul 15, 2022

I think the CI should fail, if files that are going to be generated by the generation script are not checked in. Right now, it's not doing that.

In other words, the zipapps generated are not staged/committed in this PR right now. Thus, when running the script would generate it, there are new files generated in the working/public directory that aren't checked in. This means that we can have a discrepency between what's checked in and what's auto-generated, which isn't great.

The CI check is designed to ensure that our copies of get-pip.py are correct/up-to-date, for the version of pip that's been released. It, however, does not ensure that the newly-generated files are also included in the repository. I reckon, it should -- which is what I was describing in the previous comment about it needing an update. The zipapp files that this would generate would be marked as "untracked" by git right now and we should have the CI fail when this happens because it's got untracked files in the working directory after generation (similar to how pip's vendoring would fail if you modified a file in the _vendor directory).

@pradyunsg
Copy link
Member

pradyunsg commented Jul 15, 2022

FWIW, another example of the sort of situation this would trip on would be when adding 3.7 to the mapping on top (i.e. logic for generating public/3.7/get-pip.py), but not adding the generated file. That's a similar sort of "generation script generates an untracked file and CI does not complain".

@pfmoore
Copy link
Member Author

pfmoore commented Jul 15, 2022

Ah, OK, so you're saying we need to improve CI, not specifically for this PR but in general. And we should do that before merging this because - I'm not quite sure why, but because this PR means we'll have more risk of forgetting to check in the newly generated files?

I think I get this, but I'm somewhat hazy on the release process, and I'm not sure I can safely play around with it because the noxfile seems to actually do a push - plus, it's doing signed releases and I'm not set up with any signing mechanism, so I don't think I can even sign a release in the first place...?

I guess this has changed a lot singe I last did a release, as I don't remember any of this. But as I say, I'll pick this up after 22.2 is out.

@pradyunsg
Copy link
Member

I'm not quite sure why, but because this PR means we'll have more risk of forgetting to check in the newly generated files?

Because this PR is adding new generated files (to the generation scripts) but not to the actual repository. This would mean that the next time someone runs the release automation, they'll also check in the additional zip app (if we land this before 22.2).

@pfmoore
Copy link
Member Author

pfmoore commented Sep 3, 2022

OK, I'm just coming back to this.

@pradyunsg what do you want to do about untracked files? As things stand, everything is fine (as far as I can tell by simply desk-checking the release scripts - see above regarding how it doesn't seem possible to test them). The generate script creates the new zipapp, the release process will add it (using git add .). The CI doesn't check new files, I understand that. But I don't see that as blocking this PR - instead, I see this PR as illustrating that the approach in #159 needs some extra thought about how we handle new files being created by the generate process. I'm happy to have that discussion, but I see it as independent of this PR.

I'll add some thoughts to #159. But are you going to insist that this PR must wait until #159 is merged in a form that handles new, untracked files?

@pfmoore
Copy link
Member Author

pfmoore commented Sep 3, 2022

Test failures are because of #170.

@pfmoore
Copy link
Member Author

pfmoore commented Sep 4, 2022

Added the zipapps to the PR, as per comments on #159

@pfmoore
Copy link
Member Author

pfmoore commented Sep 4, 2022

Hmm, why are the zipapps different when generated here? Maybe Windows vs Unix weirdness?

@pfmoore
Copy link
Member Author

pfmoore commented Sep 4, 2022

Grr, it's file timestamps.

@pfmoore
Copy link
Member Author

pfmoore commented Sep 4, 2022

Sigh, something else as well. I'll have to look at this later. But I will note that I don't think that checking that the files are binary-identical is a particularly good test here. It may be that it's flagging a real issue, but (as the timestamps show) it's just as likely it's a functionally irrelevant discrepancy.

@pradyunsg
Copy link
Member

pradyunsg commented Sep 4, 2022

I’d argue that we should also have signatures and checksums on these files, at some point in the future (like we should for get-pip.py, as the open issues suggest), so generating them in a stable/reproducible manner is important. Besides, if we don’t have this be reproducible, it’d mean that we’d regenerate all of these files on every instance that we run the generate command (i.e. new binary files being checked into git for every pip release).

Finally, I’ll remind you that all these checks and mechanisms were originally designed for the existing text files and making sure those are kept up to date (those are generated in a reproducible manner, though they bake in an existing “stable” wheel). I didn’t think of binary blobs when I was writing this CI. :P

@pfmoore
Copy link
Member Author

pfmoore commented Sep 4, 2022

Finally, I’ll remind you that all these checks and mechanisms were originally designed for the existing text files and making sure those are kept up to date (those are generated in a reproducible manner, though they bake in an existing “stable” wheel). I didn’t think of binary blobs when I was writing this CI. :P

Understood, I wasn't suggesting the checks themselves were at fault, just that they don't take into account the subtleties of binary files - and I'd not been aware of the check when I made this change, so I didn't anticipate needing to deal with it.

Regarding signatures and checksums, I see no issue with checksums, but I'm not sure how signatures would work. Do we have a "PyPA/pip" signature that we'd use for this? Expecting individual committers to use a personal signature doesn't seem ideal (I note that I can't currently do a release here, as the release automation does git tag --sign and I don't have a GPG signature). I think this is something that probably needs discussion between the pip committers somewhere - but I don't know what would be a good forum for that.

@pfmoore
Copy link
Member Author

pfmoore commented Sep 4, 2022

OK, that's fixed the reproducibility issue. Sorry about the complaints re the checks. I'm still not sure I follow the logic where we check in the generate script and the files, and then we have CI that checks that the generate produces the files we just added (I don't really understand what we're trying to check against) but that doesn't mean it's OK for me to come in after the fact and complain about it just because I don't like doing what it requires. Blame the fact that I was trying to do things in a hurry and got frustrated (not an excuse, I know).

Let me know if you have any further concerns with this PR.

@pfmoore
Copy link
Member Author

pfmoore commented Sep 17, 2022

As there have been no further comments, and I don't want to leave this until the next pip release is looming, I'd like to merge this as it stands and address any CI questions via followups. @pradyunsg are you OK with that? If so, could you approve the PR? Having requested your review, I'd rather not merge without your approval.

Copy link
Member

@pradyunsg pradyunsg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with one note/concern around inlined code in a string.

Comment on lines 274 to 275
ZIPAPP_MAIN = """\
#!/usr/bin/env python
Copy link
Member

@pradyunsg pradyunsg Jul 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this string, to be loaded from a file kept in the templates directory? It’ll need a change in line 122 as well.

It might be worth inlining the version check and allowing None as a value on the variable post-render of the main script, to indicate skipping the version check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with doing this. However:

  1. The template processing code (including the detect_newline calls) will then be duplicated in 3 places, making it a good candidate for refactoring.
  2. There's a possible bug in the code as it uses a raw open() without a specified encoding. We should probably review and fix this (even if the "fix" is simply to comment that omitting the encoding is OK) when we refactor.

I'd be happier to handle this refactoring as a follow-up PR, so we don't block this change on getting the details of the template handling right.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I should have said "I'd be happier to handle switching to using a template here and refactoring the common code out as a follow-up PR". In other words, can we leave the code as an inline string until we refactor the template code out (at which point I won't have to work out what parts of the current approach matter, or do my own inconsistent version)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I’m on board for this being a follow up — I should’ve been more explicit. This isn’t a blocking concern, hence the approval. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, having played around with this, I did move the code to a template, without the newline-matching code. Let me know what you think, I'm OK with it but I'm neutral on whether it's an improvement.

@pfmoore
Copy link
Member Author

pfmoore commented Sep 17, 2022

Unfortunately, I just found some issues in manual testing. I could have sworn I tested with 3.9/3.8, but maybe not (and what's more, the pip zipapp tests are only for 3.10).

The problem is in the stdlib zipimport with a subdirectory:

import sys
import importlib.resources as ir

sys.path.insert(0,"../public/pip.pyz/lib")
print(len(ir.read_text("pip._vendor.certifi", "cacert.pem")))

This fails with an exception in Python 3.9 and earlier. As 3.9 is no longer receiving non-security fixes, we're not going to get this fixed.

The issue seems to be with using a "lib" subdirectory in the zipapp and adding that to sys.path. So I've removed that directory layer from the zipfile and the top level now contains __main__.py and pip/, and we just add the pyz to sys.path. I can't think of a way that this will cause an issue.

I've also added a (very simplistic) test that the zipapp works (i.e., runs pip or reports an incompatibility error) on all Python versions we test against, just to ensure we don't break this in the future.

@pfmoore pfmoore force-pushed the zipapp branch 3 times, most recently from 9ffde85 to 6291294 Compare September 17, 2022 17:04
@pradyunsg
Copy link
Member

pradyunsg commented Sep 17, 2022

then we have CI that checks that the generate produces the files we just added (I don't really understand what we're trying to check against)

I guess I never responded to this and I'll try one last time...

Think of this similarly to pip's vendoring CI check. It's primarily because we aren't the only people who might open a PR here (wherein we know exactly how stuff works) and to reduce the effort on the reviewers' end that the scripts included in the commit match what would be generated + work as advertised.

Over on pip, we check that every PR that modifies vendoring files changes them in a way that's repeatable through our vendoring process which maintains clear patches and has pinned Python versions. This is validated in CI, so that means that any changes made in vendored files are clearly tracked in our patches and vendor.txt file.

It's the same here even though it's useful for somewhat different reasons -- we're effectively embedding blurbs of opaque data into scripts that'll get executed in thousands of machines daily, and it's useful to have those opaque blurbs be generated in a manner that they can be validated and to ensure that the changes are reflected in both the templates and the generated scripts simultaneously. It also ensures that if we change either, we change both and that they evolve in sync.

In effect, the public directory is like pip's own _vendor directory. If we make changes there, we want those to happen because the generation inputs were changed (here: templates and the script, there: vendor.txt and patches) and that changes to those inputs are reflected in the output.

@pfmoore pfmoore merged commit 56695dc into pypa:main Sep 18, 2022
@pfmoore pfmoore deleted the zipapp branch September 18, 2022 09:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants