macos-13 job crashes without logs #7509

Closed
1 of 9 tasks
AkihiroSuda opened this issue Apr 28, 2023 · 20 comments

@AkihiroSuda

AkihiroSuda commented Apr 28, 2023

Description

macos-13 job crashes after (roughly) 30 minutes, without logs

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 20.04
  • Ubuntu 22.04
  • macOS 11
  • macOS 12
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

Image: macos-13
Version: 20230419.1

https://github.com/lima-vm/lima/actions/runs/4799814791/jobs/8596609974?pr=1511

Is it regression?

Yes (macos-12 → macos-13)

Expected behavior

Should not crash without logs

Actual behavior

Crashes after (roughly) 30 minutes without logs

[image]

Repro steps

Run a job with macos-13 runner

@ilia-shipitsin
Contributor

interesting. I'll have a look

@ilia-shipitsin
Contributor

@AkihiroSuda, I tried running it in my fork.

It seems to be waiting for:

time="2023-04-28T12:37:24Z" level=info msg="[hostagent] 2023/04/28 12:37:24 tcpproxy: for incoming conn 127.0.0.1:49278, error dialing \"192.168.5.15:22\": connect tcp 192.168.5.15:22: no route to host"
time="2023-04-28T12:37:34Z" level=info msg="[hostagent] Waiting for the essential requirement 1 of 3: \"ssh\""
time="2023-04-28T12:37:37Z" level=info msg="[hostagent] 2023/04/28 12:37:37 tcpproxy: for incoming conn 127.0.0.1:49279, error dialing \"192.168.5.15:22\": connect tcp 192.168.5.15:22: no route to host"
time="2023-04-28T12:37:47Z" level=info msg="[hostagent] Waiting for the essential requirement 1 of 3: \"ssh\""
time="2023-04-28T12:37:51Z" level=info msg="[hostagent] 2023/04/28 12:37:51 tcpproxy: for incoming conn 127.0.0.1:49280, error dialing \"192.168.5.15:22\": connect tcp 192.168.5.15:22: no route to host"

What is that? How is that route supposed to appear? Where is it trying to connect to? Is it some QEMU VM started on the runner?

@AkihiroSuda
Author

AkihiroSuda commented Apr 28, 2023

@ilia-shipitsin Thanks for testing.

  • How did you get that log? Is that timeout relevant to the 30-min timeout of the GHA job?
  • 192.168.5.15 is a virtual IP that is not accessible from the host. It is the equivalent of 10.0.2.15 in QEMU and VirtualBox.
    Lima uses a Unix socket to dial that virtual IP, but it looks like the virtual network stack is somehow not functional.
    I'm not sure whether that virtual network failure is relevant to the 30-min timeout of the GHA job.
  • The "vz" mode uses Apple's Virtualization.framework, not QEMU.
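
To gauge how often the host agent hits the "no route to host" dial failure quoted earlier in this thread, the log lines can be tallied with a short shell helper. This is only a sketch: the function name count_dial_failures is hypothetical (it is not part of Lima), and the argument is whatever file the hostagent output was saved to.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lima): count the lines in a saved
# hostagent log that report the "no route to host" dial failure.
count_dial_failures() {
  log_file="$1"
  # grep -c prints the number of matching lines (0 if none match).
  grep -c 'no route to host' "$log_file"
}
```

For example, `count_dial_failures path/to/saved-hostagent.log` prints a single number. Note that grep exits with a non-zero status when nothing matches, which matters if the calling script uses `set -e`.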

@ilia-shipitsin
Contributor

I cloned your repo and removed all build definitions except "vz".

The 30-minute limit, I guess, is defined here:

[image]

@AkihiroSuda
Author

AFAICS that's the timeout of the step, not the job?

@AkihiroSuda
Author

I removed nick-invision/retry@v2, and the job is still failing without logs:
https://github.com/lima-vm/lima/actions/runs/4832217890/jobs/8610695152?pr=1511

[image]

AkihiroSuda changed the title from "macos-13 job crashes after 30 minutes" to "macos-13 job crashes without logs" on Apr 28, 2023
@ilia-shipitsin
Contributor

@AkihiroSuda, I see better logs with the following invocation:

    - name: Test
      run: |
        ./hack/test-example.sh examples/experimental/vz.yaml

@ilia-shipitsin
Contributor

Previously, I removed the 30-minute limit, and the job then failed after 48 minutes (no logs captured).

[image]

@ilia-shipitsin
Contributor

Interesting observation: while the build is running, I can see the logs.
Once the build has completed, there are no logs.

[image]

@ilia-shipitsin
Contributor

ilia-shipitsin commented May 5, 2023

@AkihiroSuda, I tried running the commands on a fresh macos-13 VM:

INFO[0000] Terminal is not available, proceeding without opening an editor
WARN[0000] `vmType: vz` is experimental
INFO[0000] Attempting to download the image              arch=x86_64 digest= location="https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img"
INFO[0002] Using cache "/Users/runner/Library/Caches/lima/download/by-url-sha256/6b15519b255a45a238b7a8154cd57da120344ea388143af2821bb790af7fc587/data"
INFO[0006] Attempting to download the nerdctl archive    arch=x86_64 digest="sha256:955f9a4853762b1258cd38c967e45b6061a181a668907059e56cc01c32f1cf21" location="https://github.com/containerd/nerdctl/releases/download/v1.3.1/nerdctl-full-1.3.1-linux-amd64.tar.gz"
INFO[0006] Using cache "/Users/runner/Library/Caches/lima/download/by-url-sha256/ced4c1bbc347f1a74f9f9f25172cdb69115d21bf84000cc529013917a24a067a/data"
INFO[0006] [hostagent] Replacing "ftp_proxy" value "http://localhost:2121" with "http://192.168.5.2:2121"
INFO[0008] [hostagent] Starting VZ (hint: to watch the boot progress, see "/Users/runner/.lima/vz/serial.log")
INFO[0008] [hostagent] Setting up Rosetta share
WARN[0008] [hostagent] Unable to configure Rosetta: Rosetta is unsupported on non-ARM64 hosts

After that, communication with the VM was lost (no idea why).
is there a way to make limactl more verbose ?

The file /Users/runner/.lima/vz/serial.log is empty.

@AkihiroSuda
Author

> is there a way to make limactl more verbose ?

limactl --debug may print more verbose logs.
Also, you may find some errors in ~/.lima/vz/ha.{stdout,stderr}.log.

But I'm not sure if that is relevant to my OP. (The problem is that I can't see the failure log)
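
Since the job log may be unavailable once the job has ended, one option is to print Lima's own log files from within the job while it is still running. The sketch below is an illustration under assumptions, not a confirmed fix: the function name dump_lima_logs is hypothetical, and the file names are simply the ones mentioned in this thread (ha.stdout.log, ha.stderr.log, serial.log) inside an instance directory such as ~/.lima/vz.

```shell
#!/bin/sh
# Hypothetical helper: print the Lima log files named in this thread,
# skipping any that do not exist. $1 is the instance directory
# (for example "$HOME/.lima/vz"; adjust for your instance name).
dump_lima_logs() {
  instance_dir="$1"
  for name in ha.stdout.log ha.stderr.log serial.log; do
    file="$instance_dir/$name"
    if [ -f "$file" ]; then
      # Header line so each file's content can be told apart in the output.
      echo "===== $name ====="
      cat "$file"
    fi
  done
}
```

It could be called right after a failing command, for instance `./hack/test-example.sh examples/experimental/vz.yaml || { dump_lima_logs "$HOME/.lima/vz"; exit 1; }`. This only surfaces Lima's logs in the step output; it does not address why the job log itself goes missing.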

@ilia-shipitsin
Contributor

ilia-shipitsin commented May 5, 2023

Yes, I agree that having logs would be good.

As I discussed with colleagues, there are some edge conditions in GitHub Actions in which logs are lost. That is bad, but we cannot fix it within the "runner-images" repo. I'll ask what the proper way to track that issue is.

Logs are only lost once the task has completed. While it is in progress, logs are available (if that helps in investigating your issue).

@ilia-shipitsin
Contributor

@AkihiroSuda , may I ask you to open an issue at https://support.github.com/ ?

I was told by my colleague that this is the proper channel for customer support.
Also, I searched for similar issues there; surprisingly, no luck.

Actually, we have had several similar issues:

#7004
#3517
#736
#7188
#6378
#6350

All were closed as "it is something with resource consumption". I'm not sure that resource overconsumption should lead to logs being "completely" lost. For me, it would be fine to keep at least the logs that were available during the run.

@AkihiroSuda
Author

> @AkihiroSuda , may I ask you to open an issue at https://support.github.com/ ?

Sure: https://support.github.com/ticket/2146507

ilia-shipitsin self-assigned this on May 24, 2023
@ilia-shipitsin
Contributor

@AkihiroSuda, I contacted colleagues, they followed up escalated issue.
I'm closing this one.

feel free to reopen if needed

@AkihiroSuda
Author

> they followed up escalated issue.

Yes, but the issue is still unresolved AFAICS
https://support.github.com/ticket/2146507

@ilia-shipitsin
Contributor

It is not something I have access to ((

if you think it makes sense, we can keep this issue open

@AkihiroSuda
Author

> if you think it makes sense, we can keep this issue open

Yes, that would be appreciated, thanks

@ilia-shipitsin
Contributor

Hello, @AkihiroSuda!

Did you receive any update on https://support.github.com/ticket/2146507 ?
(I do not have permission to check.)

@AkihiroSuda
Author

Yes, I heard that logs are only available on a best-effort basis. So I'm fine with closing this issue.

AkihiroSuda closed this as not planned on Jul 7, 2023
iluuu1994 added a commit to php/php-src that referenced this issue Sep 15, 2023
We get some mysterious failures on macOS on GA with no evident error. This is a
blind attempt to solve it. There are many similar reports but there's no clear
resolution.

actions/runner-images#7509 (comment)

Closes GH-12210