Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xtask ls-apis: try SSH access for dendrite #6840

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

iliana
Copy link
Contributor

@iliana iliana commented Oct 12, 2024

Fixes #6839, at least from my perspective.

  1. If running cargo metadata on the Maghemite workspace manifest fails, try re-running it with a set of environment variables that replaces Dendrite's Git URL with an SSH one. This removes the requirement to have a functioning ~/.netrc in order to run the full test suite locally; having a functioning SSH agent is significantly less hassle than fiddling with GitHub tokens.
  2. Move the cargo xtask ls-apis exercising out of the already-quite-long build-and-test jobs. We've seen a flake on main where it took just over an hour to get to the ls-apis bits by which time the GitHub token had already expired. build-and-test is already the longest required step to merge and it's not ideal to add more things to it.

@iliana
Copy link
Contributor Author

iliana commented Oct 12, 2024

Well that didn't work in CI as I had hoped.

@davepacheco
Copy link
Collaborator

The two logical changes here address different problems. So commenting on those separately:

On the first one (re-trying a failed clone with different Git env vars to see if an ssh clone will work): I don't love the implicitness of this behavior. I burned a lot of time debugging the https/ssh URL stuff the first time I ran into it and this feels like it layers on another level of complexity if it goes wrong. This also doesn't feel like a really scalable solution, in that: the root problem here I think is only trying to use a private repo smoothly in both CI and locally, but this only fixes one particular dependency in one context. (I realize this tool is a little weird for doing some of the cargo updating at runtime rather than before the build, but I don't think that fundamentally changes what's going on here?) Would it not make sense for us to pick one standard way to do that and then use that everywhere? Maybe that's changing the Maghemite dependency to use ssh URLs and use the https re-writing that we do in Omicron to make it work in CI? Put differently: is there any reason not to do that instead, aside from it being more work? (That might well be a good reason -- just trying to understand the tradeoff.)


I'm not sure the second part of this (moving the cargo xtask ls-apis invocations) will fix the reported flake. Since ls-apis opts into the hakari machinery, I expect that building it already depends on quite a lot of the Omicron build. So I'm not sure moving it from "right after the build step in the build-and-test job" to "its own separate job [that builds most of Omicron]" is going to move the needle in terms of getting it done within the first hour. Do we have any way to know?

In the reported flake, the failure was immediately preceded by a 40-minute rustc invocation. That seems awfully long. Is that common?

Some other ideas:

  • We could extend the buildomat token mechanism to support jobs that take longer than an hour. I think @jclulow had ideas for this. This feels like it solves the more general problem, but I don't have a sense of how much work it is.
  • Do these invocations in a separate buildomat job that uses binaries built by the build-and-test job. This way, we're not using up any of the 60m building anything.
  • If we want quick-and-dirty, I wonder if it would fix things to inject some cargo metadata invocations by hand before even building so Cargo is forced to download the Git deps and their deps so that it's already present by the time we get to them later.

@davepacheco
Copy link
Collaborator

We talked about this offline, with details summarized here: #6839 (comment)

The biggest shift for me is realizing that Dendrite is the exception (because it's private) rather than the rule, and the choice here isn't forever. Given that, I think it makes less sense to invest in first-classing support for private repos.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

omicron-ls-apis::test_dependencies requires GitHub credentials or git rewrite rules
2 participants