Lockfiles: Add invalidation logic #12415

Closed
Eric-Arellano opened this issue Jul 23, 2021 · 10 comments

Eric-Arellano commented Jul 23, 2021

Problem

We need to know when a lockfile is stale and then warn/error/regenerate. (Until #12014 is solved, we will not regenerate.)

Usually, we would simply rely on the engine's caching and memoization: i.e. attempt to run a Process and hope for a cache hit. However, this does not work for two reasons:

  1. If a lockfile is present and the inputs have not changed, we should not attempt to regenerate a lockfile unless the user explicitly asked to.
    • Imagine you're a new team member building the repo for the first time, so the local cache is empty and remote caching is not used. We should use the lockfile, not regenerate a new one.
  2. We likely should not be caching lockfile generation because it depends on the external state of the world, e.g. a new release could have come out 5 minutes ago and we want to pick it up: "Cacheable Processes do not play well with non-lockfile resolves" #12199.

Instead, we need a different caching scheme.

The inputs to lockfile generation

Currently, this is our experimental request type to generate a lockfile:

@dataclass(frozen=True)
class PythonLockfileRequest:
    requirements: FrozenOrderedSet[str]
    interpreter_constraints: InterpreterConstraints
    dest: str
    description: str

The requirement strings and interpreter constraints are what really matter; dest and description should not impact the lockfile content.

The requirement strings obviously matter to know what to resolve for. If reqs are changed, added, or removed, we need to regenerate.

Interpreter constraints matter because they can impact which requirements are resolved, per #12200. However, it is a little tricky to performantly determine the ICs on every Pants run, because some contexts need to consult the entire repository to calculate which ICs are used: #12412.

Finally, we might determine that the platform (OS + arch) matters. Technically, resolves can vary between platforms, e.g. if certain requirements are only needed on Linux vs. macOS. This part can be punted on until we figure out how we want to address this problem.

Proposal

Calculate a SHA-256 hash of the input requirements + interpreter constraints every time a lockfile is used. When lockfiles are generated, write the hash into a custom header, like:

# DO NOT REMOVE THIS.
# Invalidation hash: abjafd....

If the calculated SHA does not match, warn or error saying the lockfile is stale and how to regenerate it. This behavior can be controlled by an option; not sure where it makes sense to live, but maybe [python-setup].stale_lockfile_behavior, using an Enum w/ warn and error variants.
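As a minimal sketch of what the hash computation could look like (the function name and the JSON payload layout are illustrative assumptions, not the actual Pants implementation):

```python
import hashlib
import json


def compute_invalidation_digest(requirements, interpreter_constraints):
    # Hypothetical helper: hash only the inputs that affect lockfile
    # content (requirement strings + interpreter constraints), sorted
    # so that input ordering does not change the digest.
    payload = json.dumps(
        {
            "requirements": sorted(requirements),
            "interpreter_constraints": sorted(interpreter_constraints),
        },
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Note that dest and description are deliberately excluded, matching the point above that they should not impact the lockfile content.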

There are two parts to implement:

  1. Generation of the lockfile.
  2. Consumption of the lockfile header.

Pt 1: generation

Decide how to format the header section. Probably add a @property on PythonLockfileRequest for the hash.

Then, change the result.output_digest using the filesystem API: get the DigestContents, then use CreateDigest with a FileContent combining the new header and the original content.

result = await Get(
    ProcessResult,
    # TODO(#12314): Figure out named_caches for pip-tools. The best would be to share
    # the cache between Pex and Pip. Next best is a dedicated named_cache.
    VenvPexProcess(
        pip_compile_pex,
        description=req.description,
        # TODO(#12314): Wire up all the pip options like indexes.
        argv=[
            "reqs.txt",
            "--generate-hashes",
            f"--output-file={req.dest}",
            # NB: This allows pinning setuptools et al, which we must do. This will become
            # the default in a future version of pip-tools.
            "--allow-unsafe",
        ],
        input_digest=input_requirements,
        output_files=(req.dest,),
    ),
)
return PythonLockfile(result.output_digest, req.dest)

Semi-blocked by #12382, in that we need to safely write a header to the generated files. We probably want to start using pip-compile's --no-header option to make it easier to add.
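The content transformation itself is just prepending bytes; in the actual rule it would go through the engine's filesystem API (DigestContents, then CreateDigest with a FileContent), but a plain-Python sketch of the change (the function name is hypothetical) is:

```python
def add_invalidation_header(lockfile_bytes: bytes, digest: str) -> bytes:
    # Assumes pip-compile was run with --no-header, so we can safely
    # prepend our own header without clashing with pip-compile's banner.
    header = (
        "# DO NOT REMOVE THIS.\n"
        f"# Invalidation hash: {digest}\n"
    ).encode("utf-8")
    return header + lockfile_bytes
```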

Pt 2: Consumption

This will be a little trickier to figure out how to factor, as this code is somewhat in flight. It might make sense to pair on this.

Consumption of lockfiles is intentionally abstracted from callers. All they do is set PexRequirements to point to a file on the system when installing a Pex via PexRequest:

@frozen_after_init
@dataclass(unsafe_hash=True)
class PexRequirements:
    req_strings: FrozenOrderedSet[str]
    file_content: FileContent | None
    file_path: str | None
    file_path_description_of_origin: str | None
    is_lockfile: bool

It's possible this factoring won't work, and we need to start explicitly linking the PythonLockfileRequest to the PexRequest at the call site so that our pex.py code can look at the hash and do the right thing.

Open questions

Missing hashes

We should probably error if the lockfile does not have any hash in it. Should we suggest what the SHA should be, or force the user to regenerate?

Note: we're not hashing the requirements.txt file itself, only the inputs that were presumably used to generate the lockfile. If a user changes the lockfile on us, we wouldn't detect it, which is probably acceptable and allows users to do things like add custom comments.

Performance of determining interpreter constraints

As mentioned above, calculating the ICs for tools like Flake8 and Bandit is slow because we have to consult the whole repo (#12412). That perf is acceptable when generating lockfiles, as that's not done very frequently.

But it's not great for when you're simply trying to use the tools. Even though it will get memoized w/ Pantsd, this will probably add to the startup time of using Pants :/

I don't know what the right solution is. I think we do need to use ICs to determine if the lockfile is valid or not. I'm wondering if we could skip that piece, but that seems too risky 🤔

@chrisjrn

Performance of determining interpreter constraints

It seems to me that exhaustively validating lockfiles should only need to be done regularly, not every time a build is run (particularly since many individual builds will just be for a single interpreter). Perhaps exhaustive invalidation is something that's added to a daily scheduled task, with per-interpreter invalidation being the default?

@Eric-Arellano

It seems to me that exhaustively validating lockfiles should only need to be done regularly, not every time a build is run (particularly since many individual builds will just be for a single interpreter). Perhaps exhaustive invalidation is something that's added to a daily scheduled task, with per-interpreter invalidation being the default?

Unfortunately, I don't think so. If you changed [black].requirements, we need to know right away that your lockfile is now invalid and warn/error/regenerate.

But, do note that the impact of Pantsd should be that you don't need to do this check very often. Pantsd is meant to be long lived, and we've been trying to reduce the # of times it gets invalidated. Once you do that initial check, you'd only need to recheck if your inputs have changed or pantsd is killed.

@chrisjrn

OK, I'll wrap my head around the issue a bit better, and then I'll give a more informed opinion

@Eric-Arellano

Performance of determining interpreter constraints

We should probably default to not worrying too much about this. Note that we already have to query all of the repo for dependency inference to build the module_mapping. That slows down our performance, but should not happen frequently thanks to Pantsd.

If we can do something clever to avoid the cost, great. But we shouldn't give up correctness because Pantsd means you should not need to pay the cost frequently. (And we should keep improving Pantsd for that to be the case.)


chrisjrn commented Jul 26, 2021

Firstly, I'm trying to avoid the use of hash in the Python code, so I can avoid confusion with Python dictionary hash. I'm using digest at the moment, but I also realise that clashes with the concept in the filesystem API, so I'm happy to take pointers there.

I'm thinking of making sure the invalidation digest is somewhat resilient to edits -- at the very least so that it doesn't need to be the first thing in the file. Using a similar approach to PGP/OpenSSH keys would give us something like this:

# --- BEGIN LOCKFILE METADATA: DO NOT EDIT OR REMOVE ---
# invalidation digest: cafeb0bacafeb0bacafeb0bacafeb0ba
# --- END LOCKFILE METADATA ---

This way, if someone were to want to edit the file, it would be somewhat clear where they could do so, and we can also add extra metadata in future if we want to.
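A consumer-side sketch of pulling the digest back out of such a block (the regex and function name are illustrative, and the delimiters assume the exact format above):

```python
import re
from typing import Optional

# Matches the delimited metadata block proposed above, wherever it
# sits in the file (it need not be the first thing in the file).
METADATA_RE = re.compile(
    r"^# --- BEGIN LOCKFILE METADATA: DO NOT EDIT OR REMOVE ---\n"
    r"(?P<body>(?:^#.*\n)+?)"
    r"^# --- END LOCKFILE METADATA ---$",
    re.MULTILINE,
)


def read_invalidation_digest(lockfile_text: str) -> Optional[str]:
    # Returns None if the metadata block or digest line is missing,
    # which a consumer would treat as "lockfile has no hash".
    match = METADATA_RE.search(lockfile_text)
    if match is None:
        return None
    for line in match.group("body").splitlines():
        key, _, value = line.lstrip("# ").partition(": ")
        if key == "invalidation digest":
            return value
    return None
```

Because the block is delimited rather than position-dependent, it can grow extra "key: value" lines later without changing the parsing approach.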

Thoughts on this as far as format/naming conventions go?


Eric-Arellano commented Jul 26, 2021

Firstly, I'm trying to avoid the use of hash in the Python code

Maybe something with sha in it? I agree with your intuition to make it very clear this invalidation mechanism != the normal engine's mechanism. Perhaps your suggestion of invalidation_digest?

Using a similar approach to PGP/OpenSSH keys would give us something like this:

Yes, I like this. And we will probably want to expand it to include things like "Input interpreter constraints", for example.

@chrisjrn

OK, invalidation_digest it is!


jsirois commented Jul 29, 2021

Finally, we might determine that the platform (OS + arch) matter. Technically, resolves can vary between platforms, e.g. if certain requirements are only needed on Linux vs. macOS. This part can be punted on until we figure out how we want to address this problem.

Hrm, ok. platform (pex --platform) definitely does matter. It seems a bit off to be charging forward on lockfile work without actually being able to generate lockfiles for Pants, which allows users to specify platforms à la pex --platform.

Eric-Arellano pushed a commit that referenced this issue Jul 30, 2021
This WIP adds the lockfile invalidation header, but does not yet consume it.

Partially addresses #12415.

Eric-Arellano commented Aug 3, 2021

Performance of determining interpreter constraints

I'm actually pretty concerned about the performance of determining the hash for lockfiles every Pants run. In particular, computing the interpreter constraints for the Pytest lockfile is slow because we have to get the transitive target for every single python_tests in your project, even if you just want to run one test: #12491. Sure, it will be memoized, but it will be invalidated if a teammate changed unrelated tests in another commit.

I'm thinking maybe we do add an ignore option to [python-setup].invalid_lockfile_behavior. Strongly encourage using error in CI, but most devs would have it set to ignore for desktop use. There is a hit to correctness, but maybe justified given the performance concerns?

cc @stuhood , thoughts?

--

EDIT: I had an idea that makes me comfortable with this! Add [python-setup].allow_mixed_interpreter_constraints, which, if false, will ban the interpreter_constraints field in targets and result in only needing to look at [python-setup].interpreter_constraints, which is fast.

@Eric-Arellano

Performance of determining interpreter constraints

From #12491, there are currently 4 proposals to improve performance. All should be compatible with each other:

  1. When generating the lockfile via ./pants tool-lock, continue to inspect the whole repo. But when checking at call sites if the lockfile is stale, only inspect the code in question: create what the PythonToolLockfile would be if it were just for that code, and ensure the ICs are compatible w/ the checked-in file.
  2. Stop treating the IC field as the constraints for the source itself, and start treating it as constraints for the whole transitive closure. This would allow us to look only at the target roots, w/o needing to resolve transitive dependencies.
  3. Allow repos to opt-out of mixed interpreter constraints and only set ICs via [python-setup].interpreter_constraints (Add [python-setup].disable_mixed_interpreter_constraints for repositories not using the interpreter_constraints field #12495). This has no impact if you are using mixed ICs, but greatly speeds things up for other repos not using the mechanism.
  4. Add an ignore option to our check for stale lockfiles. Users could risk ignoring for desktop builds, and then error in CI.

@chrisjrn and I discussed this morning that the first one seems particularly important. If you run ./pants test on a single file, we should not have to consult your entire repository first to determine if the lockfile is still valid. That would be very problematic for performance and invalidation.

I also think opting out of mixed-interpreter constraints (#12495) is an easy win and makes semantic sense, beyond this particular performance consideration.
