Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC 0084] Input-aware fetchers #84

Closed
wants to merge 2 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
147 changes: 147 additions & 0 deletions rfcs/0084-input-aware-fetchers.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,147 @@
---
feature: input-aware-fetchers
start-date: 2020-12-30
author: @infinisil
co-authors: (find a buddy later to help our with the RFC)
related-issues: (will contain links to implementation PRs)
---

# Summary
[summary]: #summary

Change the name of the main fixed-output derivations to contain a hash of all inputs that could affect the output hash, such that if any of them change, a previously-fetched path with the same hash won't be reused. This resolves the long-standing confusion about Nix using an old version of a source even though the url it should be fetched from was updated.

# Motivation
[motivation]: #motivation

Say we have a `fetchFromGitHub` call like
```nix
(import <nixpkgs> {}).fetchFromGitHub {
owner = "NixOS";
repo = "nix";
rev = "2.3.9";
sha256 = "0d4c3ddpqa1q4j15cl8d7g3igiw6clqczf8dcp4pbpvlm9a64rki";
}
```

With a current nixpkgs version, this can be built successfully, returning the result in store path
```
/nix/store/l98gjfznp8lpxi0hvj4i0rw34xnnqma8-source
```

However, if we now intuitively update the revision of the expression to the new 2.3.10 release like
```nix
(import <nixpkgs> {}).fetchFromGitHub {
owner = "NixOS";
repo = "nix";
rev = "2.3.10";
sha256 = "0d4c3ddpqa1q4j15cl8d7g3igiw6clqczf8dcp4pbpvlm9a64rki";
}
```

We can still build this successfully, getting the exact same store path
```
/nix/store/l98gjfznp8lpxi0hvj4i0rw34xnnqma8-source
```

This can't be right! Of course, people with Nix knowledge will tell you that you need to change the hash to something invalid first in order for it to attempt a refetch. This is because Nix already has a path with that exact output `sha256` hash, so it just reuses that. Indeed, changing the output hash to something invalid works:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, people with Nix knowledge will tell you that you need to change the hash to something invalid first in order for it to attempt a refetch.

I'm sorry to hear that this is encouraged as a good practice. Shouldn't people be told instead to use prefetchers whenever updating a source, rather than wastefully downloading the source twice just to get a hash mismatch the first time?

(I concede that I do the invalid-hash procedure myself for Rust packages, for which using a prefetcher would have been beyond my understanding of the Rust packaging functions, but I always use prefetchers for URL sources and Git sources.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't download it twice, since it stores the result under the store path for the calculated hash the first time

```nix
(import <nixpkgs> {}).fetchFromGitHub {
owner = "NixOS";
repo = "nix";
rev = "2.3.10";
sha256 = "0000000000000000000000000000000000000000000000000000";
}
```

Building this we get a hash mismatch as expected:

```
hash mismatch in fixed-output derivation '/nix/store/5d3k20pzgjyccmpqfina1cvbl28zxz6a-source':
wanted: sha256:0000000000000000000000000000000000000000000000000000
got: sha256:1cx9yv62rylfv8p09pidsmqy8qim1bbjaa8pj1j8xj7vkrm0dri1
```

Now we can copy the hash into the expression:
```nix
(import <nixpkgs> {}).fetchFromGitHub {
owner = "NixOS";
repo = "nix";
rev = "2.3.10";
sha256 = "1cx9yv62rylfv8p09pidsmqy8qim1bbjaa8pj1j8xj7vkrm0dri1";
}
```

And build it successfully to get the correct result in the output path
```
/nix/store/5d3k20pzgjyccmpqfina1cvbl28zxz6a-source
```

It is however very easy to forget to update the hash to something invalid, which as many know has been a source of much confusion, questions and frustration.

However it doesn't have to be this way!

# Detailed design
[design]: #detailed-design

This problem can be solved relatively easily by realizing that Nix not only uses the output hash of derivations to check whether they need to be built, but the _full_ store path, which notably also includes the _derivation name_. Thus, by ensuring that the derivation name changes if (and only if) relevant inputs change, we can force Nix to rebuild a derivation even if the output hash stays the same, therefore fixing above problem.

This RFC therefore proposes to use a hash of all relevant derivation inputs as the default name of fixed-output derivations. The hash should be computed using `sha256`, and encoded using the [url-safe base64](https://tools.ietf.org/html/rfc4648#section-5), without any leading `=`, leading to a character count of 43. For reference, this hash can be computed in `nix-shell -p openssl` with
```
[nix-shell:~]$ echo -n 'example string' | openssl dgst -sha256 -binary | openssl base64 -A | cut -b1-42 | tr +/ -_
rt-5KzBTohoRT08wGgKjxq1d_1BNEk3CzuYRdiPuxw
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For most sources, the pname and version attributes are probably enough to invalidate the inputs. One of the benefit is that those could be included in the name, making the store path more readable.

Another benefit of using pname and version is that it allow to invert the control. Historically the pname is first set on the package derivation, and then inherited in the source. In cases where we have a package generator function, like with most languages, the pname could be set on the source, and then inherited by the generator function.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this rfc is mainly for minority, such as git tarball with revision or patch with patchutils arguments, with no pname nor version. As for normal sources, new pname and version will generate a new name, which could already trigger refetch now. (Store::makeFixedOutputPath takes name as paramater and generate a different derivation path.)

We may want something other than hash or name could affect fixedOutputPath.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@holymonson if I'm interpreting zimbatm's comment right, he's suggesting that name/pname/version be moved into the source rather than the derivation that builds the source.


The following describes exactly which string should be used as the hash input for each of the fetcher types. This is necessary since there are multiple ways to fetch the same resources, e.g. with the builtin `fetchTarball` and nixpkgs' `fetchzip`. In order for them to be able to reuse the same store path, they should use the same derivation name.

| Fetchers | Input string |
| --- | --- |
| `pkgs.fetchurl`, `<nix/fetchurl.nix>`, `builtins.fetchurl`, `nix-prefetch-url` | "fetchurl-" + first URL |
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@7c6f434c Oh and actually I even specified this here already that only the first URL of multiple should be used in the hash, so yeah, no need for additionalUrls, just the existing urls can do that already.

| `pkgs.fetchzip`, `<nix/fetchurl.nix>` (with `unpack = true`), `builtins.fetchTarball`, `nix-prefetch-url --unpack` | "fetchurl-unpack-" + URL |
| `pkgs.fetchgit`, `builtins.fetchGit` and co. | "fetchgit-" + url + "-" + rev |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkgs.fetchgit would also need to have deepClone, leaveDotGit and fetchSubmodules included, wouldn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, though we'll have to make sure that it harmonizes with builtins.fetchGit which doesn't support deepClone and leaveDotGit. So it would probably have to be something like

"fetchgit-${optionalString deepClone "deepclone-"}${optionalString leaveDotGit "leavedotgit-"}${optionalString fetchSubmodules "fetchsubmodules-"}..."


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth talking about fetchers that download a bunch of things, like the Rust cargoSha256 and Go vendorSha256? They are a bit different in nature than the above-mentioned fetchers.

One trick we did with Cargo is to add the Cargo.lock in the derivation itself, and then compare it in the main derivation to make sure they are the same. This was done as a workaround for the issue mentioned in this RFC. So if a good solution is found, we could remove that hack.

Changing the name of a derivation doesn't change its fixed-output hash. Therefore there's no need to update existing fixed-output hashes.

## Example

```nix
(import <nixpkgs> {}).fetchzip {
url = "mirror://gnupg/gnupg/gnupg-2.2.24.tar.bz2";
sha256 = "0ilcp7m1dvwnri3i7q9wanf5pvhwxk7h106pd62g0d5fz80b944h";
}
```

yields store path
```
/nix/store/q1nsvfvzqzfsxcdcjnnfrw9cwmr1fb2j-DRzMDNAD89ZITk4wqEOz8oELAfOdOvvBfxE9vSbEDj
```

## Synchronizing Nix and nixpkgs changes

Since this proposal requires changes on both the Nix and nixpkgs side, it should be synchronized between the two, such that both Nix and nixpkgs fetchers always use the same store paths.

To manage this, nixpkgs should condition the change of behavior on `builtins.nixVersion` being above-or-equal the version where the change is done in Nix. Therefore both the previous and the new behavior need to be maintained in nixpkgs until `lib/minver.nix` is increased to the version where the change in Nix occured.

# Drawbacks
[drawbacks]: #drawbacks

- Using only the hash as the derivation name leads to store path names that don't indicate what they contain. This is already the case with the current `/nix/store/l98gjfznp8lpxi0hvj4i0rw34xnnqma8-source` store paths for `fetchzip`, but this RFC would also make it as such for `fetchurl` and `fetchgit`.

# Alternatives
[alternatives]: #alternatives

- The trivial alternative of keeping the status quo. This leads to the common beginner problem described in the motivation section.
- Adding a special `outputHashInputs` attribute to the Nix `derivation` primitive, which can be set to any attributes. These attributes then influence Nix's store path hash directly, without the need for using the derivation name as a hash influencer. This could be a much nicer solution, but is a much more indepth change to how Nix works.
Comment on lines +131 to +134
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like to highlight this alternative I haven't considered a whole lot. Implementing this is a bit harder, but doing that would allow us to have the best of both worlds: Arbitrary inputs that cause a rebuild, but no increase of store path sizes. (<- @holymonson)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this could be a very good alternative. This would also create a more obvious “disable if nix is to old” situation for nixpkgs.

- In order to keep the store path shorter, the `sha256` could be truncated to a shorter length. This was not done in order to not increase the chance of collisions.
- The derivation name could not only contain the hash, but also a descriptive name of the inputs, such as `baseNameOf url`, which is what `fetchurl` currently does. While possible, this is more complicated than just `baseNameOf`, since URLs can also have characters that are invalid in a derivation name (see https://github.com/NixOS/nixpkgs/pull/107515). To keep it simple this was not done.

# Unresolved questions
[unresolved]: #unresolved-questions

- Should the hash be truncated? (see alternatives)
- Should the name also contain a descriptive name of the inputs? (see alternatives)

# Future work
[future]: #future-work

- [Restricting fixed-output derivations](https://github.com/NixOS/nix/issues/2270) would indicate moving all fetchers into Nix itself. Therefore this change wouldn't have to be synchronized across Nix and nixpkgs