Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cargo.lock considered harmful #327063

Open
Atemu opened this issue Jul 14, 2024 · 55 comments
Open

Cargo.lock considered harmful #327063

Atemu opened this issue Jul 14, 2024 · 55 comments
Labels
6.topic: hygiene 6.topic: rust 9.needs: maintainer feedback significant Novel ideas, large API changes, notable refactorings, issues with RFC potential, etc.

Comments

@Atemu
Copy link
Member

Atemu commented Jul 14, 2024

Introduction

I've been doing a little investigation on the impact of Cargo.lock files because, if you run ncdu against a Nixpkgs checkout, they're usually the largest individual files you come across and rust packages are frequently at the top in any given sub-directory.

AFAICT the functionality to import Cargo.lock has existed since May 2021. Usage has exploded since:

$ for ver in 21.05 22.05 23.05 24.05 ; do echo -n "nixos-$ver " ; git checkout --quiet NixOS/nixos-$ver && fd '^Cargo.lock$' . | wc -l ; done
nixos-21.05 4
nixos-22.05 15
nixos-23.05 208
nixos-24.05 316

Measurements

Next I measured the total disk usage of all Cargo.lock files combined:

$ for ver in 21.05 22.05 23.05 24.05 ; do echo -n "nixos-$ver " ; git checkout --quiet NixOS/nixos-$ver && fd '^Cargo.lock$' . -X du -b | cut -f 1 | jq -s 'add' ; done
nixos-21.05 156292
nixos-22.05 113643
nixos-23.05 12505511
nixos-24.05 24485533

24MiB!

Realistically though, anyone who cares about space efficiency in any way will use compression, so I measured again with each Cargo.lock compressed individually:

$ for ver in 21.05 22.05 23.05 24.05 ; do echo -n "nixos-$ver " ; git checkout --quiet NixOS/nixos-$ver && fd '^Cargo.lock$' . -x sh -c 'gzip -9 < {} | wc -c' | jq -s 'add' ; done
nixos-21.05 38708
nixos-22.05 30104
nixos-23.05 3075375
nixos-24.05 5986458

Further, evidence in #320528 (comment) suggests that handling Cargo.lock adds significant eval overhead. Eval time for Nixpkgs via nix-env is ~28% lower if parsing/handling of Cargo.lock files is stubbed.

Analysis

Just ~300/116231 packages (~0.25%) make up ~6MiB of our ~41MiB compressed nixpkgs tarball which is about 15% in relative terms (18.5KiB per package).
For comparison, our hackage-packages.nix containing the entire Hackage package set (18191 packages) is ~2.3MiB compressed (133 Bytes per package).

Breaking down eval time by package reveals that each Cargo.lock takes on average about 76.67 ms to handle/parse.

Discussion

I do not believe that this trend is sustainable, especially given the likely increasing importance of rust in the coming years. If we had one order of magnitude more rust packages in Nixpkgs and assumed the same amount of data per package that we currently observe, just the rust packages alone would take up ~54 MiB compressed.

If nothing is done, I could very well see the compressed Nixpkgs tarball bloat beyond 100MiB in just a few years.

Extrapolating eval time does not paint a bright picture either: If we assume one order of magnitude more Cargo.lock packages again, evaluating just those packages would take ~4x as long as evaluating the entire rest of Nixpkgs currently does.

This does not scale.

Solutions

I'm not deep into rust packaging but I remember the vendorHash being the predominant pattern a few years ago which did not have any of these issue as it's just one 32 Byte string literal per package.

Would it be possible to revert back to using vendorHashes again?

(At least for packages in Nixpkgs, having Cargo.lock support available for external use is fine.)

What else could be done to mitigate this situation?

Limitations/Future work

Files were compressed individually, adding gzip overhead for each lockfile. You could create a tarball out of all Cargo.lock files and compress it as a whole to mitigate this effect.

I found some Cargo.lock files that have a different name or a prefix/suffix and were not considered.


CC @NixOS/rust

@Atemu Atemu added 6.topic: rust 6.topic: hygiene 9.needs: maintainer feedback significant Novel ideas, large API changes, notable refactorings, issues with RFC potential, etc. labels Jul 14, 2024
@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/cargo-lock-considered-harmful/49047/1

@superherointj
Copy link
Contributor

To add to the evidence:
#221716 (comment)

@emilazy
Copy link
Member

emilazy commented Jul 14, 2024

I don’t think this should be prohibited entirely because I think it’s the only option if upstream doesn’t provide a Cargo.lock or it is somehow inadequate.

@diogotcorreia
Copy link
Member

I'm currently including a Cargo.lock in a package I maintain (pgvecto-rs) because upstream uses git dependencies and vendorHash does not work with that last time I checked. I'd love to be able to stop including Cargo.lock in nixpkgs, so let me know if there is an alternative.

@superherointj
Copy link
Contributor

I don’t think this should be prohibited entirely because I think it’s the only option if upstream doesn’t provide a Cargo.lock or it is somehow inadequate.

On making too easy for doing the wrong thing, only the wrong thing is done.
We should encourage to work with upstream first then.
For failures, we need to discuss the correct process for this.

@emilazy
Copy link
Member

emilazy commented Jul 14, 2024

I don’t think blocking Nixpkgs on upstream packaging issues is a viable approach, especially ones that primarily only affect us. We wouldn’t have a package set at all. In this case, many upstreams are actively unwilling to maintain a Cargo.lock file in version control.

@patka-123
Copy link
Contributor

The same goes for composer.lock files for php packages. There are soooo many upstreams that will never accept a lockfile, no matter how much you talk with them.

Perhaps we can still allow lockfiles, with the recommendation that you first talk to upstream to see if they are willing to accept it there. And reduce the tarball size slowly that way?

@Atemu
Copy link
Member Author

Atemu commented Jul 14, 2024

@emilazy I don't think @superherointj suggested doing that. They suggested prioritising upstream collaboration and only falling back to in-tree workarounds when collaboration fails but have a proper process for that in place.

I don't think we need to eradicate Cargo.lock entirely myself. There are likely edge-cases where anything else is simply impossible. It needs to be the exception rather than the norm though; something that 30 packages might do, not 300.


@patka-123 w.r.t. composer.lock and friends, also see #327064.


I have two further thoughts on possible solutions:

  1. Consider having a rustPackages set for deps, similar to haskellPackages or pythonPackages if technologically feasible. It'd be a lot more work but would make for much higher quality packages and we'd never run into issues like this.
  2. Consider generating lockfiles ourselves but storing them elsewhere. We could create a separate repo which would house lockfiles for projects where upstream lockfiles don't exist. In nixpkgs, we'd simply fetchurl from there. We don't actually need the lockfiles to be in the same repo as version granularity should be enough; we just need an immutable lockfile for a certain version of the upstream software somewhere. This could actually be a cross-distro effort as we're certainly not the only ones relying on lockfiles for r13y (any GUIX readers here?).

@emilazy
Copy link
Member

emilazy commented Jul 14, 2024

I’m all for upstream collaboration – I’ve opened like 7 upstream PRs during Nixpkgs work in the past month or so – but from my experience of upstream responsivity, I don’t think we can viably have a workflow like “work with upstream, then work around when that fails”. A lot of the time upstreams just take ages to even acknowledge an issue or PR, and even when they do it can take several rounds to achieve mutual understanding and consensus. “Work with upstream, apply a workaround in the meantime, then remove it if upstream collaboration succeeds” is much better for Nixpkgs maintainers and users. I think that making sure that workarounds get removed in a timely fashion when no longer necessary, and aren’t introduced unnecessarily in the first place, are more effective sites of intervention.

@superherointj
Copy link
Contributor

Since a simple way to avoid a lockfile is to ask upstream to add the lockfile.

My proposal:

  • When making a PR adding a lockfile we should open an issue at upstream and reference this issue as a comment near the lockfile code. If upstream accepts it, the package maintainer should update the package to remove the lockfile from nixpkgs. If not, we need some other solution then. (Which is unclear atm.)

Can we agree with this?

@emilazy
Copy link
Member

emilazy commented Jul 14, 2024

I think that is a reasonable enough expectation for packages with non‐dormant upstreams that haven’t previously expressed an opinion on lock files, when there is no other obstacle to removing the lock file (e.g. Git dependencies or us needing to extensively patch it for our own purposes), yeah.

Note that Cargo upstream used to explicitly recommend committing lock files for applications but not libraries, but they have since changed their guidance and it is now considerably more equivocal. So we don’t really have anything slam‐dunk that we can point people to here.

@Atemu
Copy link
Member Author

Atemu commented Jul 14, 2024

Consider generating lockfiles ourselves but storing them elsewhere. We could create a separate repo which would house lockfiles for projects where upstream lockfiles don't exist. In nixpkgs, we'd simply fetchurl from there. We don't actually need the lockfiles to be in the same repo as version granularity should be enough; we just need an immutable lockfile for a certain version of the upstream software somewhere. This could actually be a cross-distro effort as we're certainly not the only ones relying on lockfiles for r13y (any GUIX readers here?).

I've written this out more thoroughly in an amendment in #327064

@emilazy
Copy link
Member

emilazy commented Jul 14, 2024

I think that the Rust stuff can’t process a fetcher‐produced Cargo.lock because it’d be IFD. I’m not sure how it works in the case where you just use a cargoHash, though, and I could be mixing it up with one of the third‐party Nix Rust build solutions.

@lucasew
Copy link
Contributor

lucasew commented Jul 14, 2024

What if Nix would be able to import files from zip files?

Flakes and stuff could just download zip files instead of the extracted tree and Nix could load the files from the inside of the zip file without unpacking. Wouldn't solve the problem but could be a little easier on the IOPS, so, probably more usable in slower storage.

@emilazy
Copy link
Member

emilazy commented Jul 14, 2024

See the perpetual‐work‐in‐progress lazy trees work. It’s a hard problem, unfortunately, so we shouldn’t hold our breaths for it.

@Aleksanaa
Copy link
Member

I haven't studied the problem of git dependencies carefully, but this problem may be solved by using some scripts (instead of cargo) to parse Cargo.lock at build time (instead of eval time)?

@Aleksanaa
Copy link
Member

#217084 The original plan was to replace every cargoHash with Cargo.lock, but this was not implemented. This PR also lists some benefits of migrating to Cargo.lock.

@minijackson
Copy link
Member

minijackson commented Jul 14, 2024

I think I remember people recommending parsing the Cargo.lock file instead of using cargoHash, in order to have an easier time scanning for known vulnerabilities in dependencies.

For example, with nix path-info -r --derivation '.#gnvim-unwrapped' you can see without building anything which source dependencies it has.

But with nix path-info -r --derivation '.#bingrep' you just get a bingrep-0.11.0-vendor.tar.gz.drv.

Is this something we want to abandon, and either have ecosystem-specific tools, or wait for something like recursive Nix?

@alyssais
Copy link
Member

We don't need all the data in the Cargo.lock file. (We could drop the dependencies array that every package has.)

@Mic92
Copy link
Member

Mic92 commented Jul 14, 2024

The other big reason we vendor Cargo.lock are git dependencies. Is there maybe a way to make them worm without dependency vendoring?

@alyssais
Copy link
Member

Eventually, the way I'd like this to work would be that we have deduplication as well — some big file that defines every version of a Cargo package required for Nixpkgs, that we'd only have to process once. We could even eventually deduplicate semver-compatible packages, since the Rust ecosystem is very good about this. This would mitigate the problem of relying on upstream to update for dependency crates for security fixes.

But this would require some tooling to keep that file up to date when adding / updating a package. The quicker fix would be to remove the dependency information as suggested above, which would just require doing the deletion, and modifying the code that checks the Cargo.lock file matches to allow this.

@Mic92
Copy link
Member

Mic92 commented Jul 14, 2024

We don't need all the data in the Cargo.lock file. (We could drop the dependencies array that every package has.)

This would help with filesystem size but does it help with RAM usage?
Is the ram usage coming from that and not from the many small derivations that gets created from these crate sources?

@emilazy
Copy link
Member

emilazy commented Jul 14, 2024

I like the idea of One Gigantic Lock, but a directory with one package per file would probably be better than one file, even if less efficient on disk, because we won’t be fighting Git merges constantly. And of course we’d probably still want per‐package local additions/overrides for things like random Git fork dependencies.

@emilazy
Copy link
Member

emilazy commented Aug 10, 2024

I’ve opened #333702 to discuss an approach that I hope would make everyone happier.

@axelkar
Copy link
Contributor

axelkar commented Oct 3, 2024

Could someone explain why cargoHash doesn't work for Git dependencies and why outputHashes is needed on top of the lockfile?

@a-kenji
Copy link
Contributor

a-kenji commented Oct 6, 2024

From cargo:

Cargo locks the commits of git dependencies in Cargo.lock file at the time of their addition and checks for updates only when you run cargo update command.

There is no hash for git dependencies in the lock file. So it either needs to be computed (IFD / comparable mechanism) or pre-computed (outputHashes).

@axelkar
Copy link
Contributor

axelkar commented Oct 10, 2024

If the commits are locked in Cargo.lock, why can't the Git dependencies just be fetched in the vendoring process with a single output hash like without Git dependencies?

@Mic92
Copy link
Member

Mic92 commented Oct 10, 2024

If the commits are locked in Cargo.lock, why can't the Git dependencies just be fetched in the vendoring process with a single output hash like without Git dependencies?

This is done in most case. However sometimes we cannot use the Cargo.lock file from upstream because it's either not present or not up-to-date. In those cases we need to provide the cargo.lock in nixpkgs and because we cannot use the git hash as we need nar hashes, we have to compute those.

@Mic92
Copy link
Member

Mic92 commented Oct 10, 2024

Btw. Recent nix versions also have a feature called git-hashing. This allows to use git commits directly without having to provide a nar hash.

@emilazy
Copy link
Member

emilazy commented Oct 10, 2024

This is done in most case. However sometimes we cannot use the Cargo.lock file from upstream because it's either not present or not up-to-date. In those cases we need to provide the cargo.lock in nixpkgs and because we cannot use the git hash as we need nar hashes, we have to compute those.

Is this actually true? My experience has been that we always need to specify individual hashes for Git dependencies.

@Mic92
Copy link
Member

Mic92 commented Oct 10, 2024

This is done in most case. However sometimes we cannot use the Cargo.lock file from upstream because it's either not present or not up-to-date. In those cases we need to provide the cargo.lock in nixpkgs and because we cannot use the git hash as we need nar hashes, we have to compute those.

Is this actually true? My experience has been that we always need to specify individual hashes for Git dependencies.

Ah yeah, you are right. Wrong memory. We can only support stuff that we can reliable download with cargo-vendor.

@axelkar
Copy link
Contributor

axelkar commented Oct 11, 2024

We can only support stuff that we can reliable download with cargo-vendor.

Could we ask upstream to add an option that doesn't update Git dependencies? Isn't this an issue with cargo-vendor and not Nix/Nixpkgs?

Edit: Doesn't --locked work?

@alyssais
Copy link
Member

This is not the problem.

The problem is that cargo vendor does not promise a stable output format at all, for git dependencies or crate tarballs. The reason it is not allowed for Git dependencies specifically in Nixpkgs is that, since we became aware of this problem, only the representation of Git dependencies has changed in practice. When that happened, we decided to disallow fetching git dependencies with fetchCargoTarball, as a compromise solution that solved the immediate problem, but at any time a new version of cargo vendor could also change the representation for tarball crates, and then we'd have a big problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.topic: hygiene 6.topic: rust 9.needs: maintainer feedback significant Novel ideas, large API changes, notable refactorings, issues with RFC potential, etc.
Projects
None yet
Development

No branches or pull requests