Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sporadic failures in CRI test container. #121

Open
shyamjesal opened this issue Feb 15, 2021 · 4 comments
Open

Sporadic failures in CRI test container. #121

shyamjesal opened this issue Feb 15, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@shyamjesal
Copy link
Collaborator

Knative function deployment times out sometimes.
Test the runner for simultaneous tests inside a single machine.

@shyamjesal shyamjesal self-assigned this Feb 15, 2021
@ustiugov ustiugov added the bug Something isn't working label Feb 17, 2021
@shyamjesal
Copy link
Collaborator Author

shyamjesal commented Mar 24, 2021

Error type 1: Context deadline exceeded

=== CONT  TestAutoscaler                                                                                                                                                                          
    cri_test.go:169:                                                                                                                                                                              
                Error Trace:    cri_test.go:169
                                                        cri_test.go:102                          
                                                        cri_test.go:112                          
                Error:          Received unexpected error:                                       
                                rpc error: code = DeadlineExceeded desc = context deadline exceeded                                                                                               
                Test:           TestAutoscaler
                Messages:       Failed to get response from function

vHive Logs

time="2021-03-24T17:17:31.075989531Z" level=info msg="HEARTBEAT: number of active VMs: 0"
time="2021-03-24T17:17:31.077268378Z" level=info msg="FuncPool heartbeat: ==== Stats by cold functions ====\nfID, #started, #served\n==================================="
time="2021-03-24T17:17:53.525957406Z" level=error msg="coordinator failed to start VM" error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to build VM configuration: failed to create container stub drives: failed to create container stub drive: open /var/lib/firecracker-containerd/shim-base/firecracker-containerd/9/ctrstub0: file exists" image="vhiveease/helloworld:var_workload" vmID=9
time="2021-03-24T17:17:53.526081045Z" level=error msg="failed to start VM" error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to build VM configuration: failed to create container stub drives: failed to create container stub drive: open /var/lib/firecracker-containerd/shim-base/firecracker-containerd/9/ctrstub0: file exists"
time="2021-03-24T17:17:53.529882359Z" level=error msg="VM config for pod c7cb6f98019a2fda3920240060cebc7b1b9eabfd9d372b550041a2ea08a87a25 does not exist"
time="2021-03-24T17:17:53.529935759Z" level=error error="VM config for pod does not exist"

....

time="2021-03-24T17:17:54.957187031Z" level=warning msg="Failed to configure conntrack exit status 4\nAnother app is currently holding the xtables lock. Perhaps you want to use the -w option?\n\n"
time="2021-03-24T17:17:54.957240008Z" level=warning msg="Ni allocation failed" vmID=14
time="2021-03-24T17:17:54.957263439Z" level=error msg="failed to allocate VM in VM pool" image="vhiveease/helloworld:var_workload" vmID=14
time="2021-03-24T17:17:54.957289266Z" level=error msg="coordinator failed to start VM" error="exit status 4" image="vhiveease/helloworld:var_workload" vmID=14
time="2021-03-24T17:17:54.957320617Z" level=error msg="failed to start VM" error="exit status 4"
time="2021-03-24T17:17:54.968332094Z" level=error msg="VM config for pod c7cb6f98019a2fda3920240060cebc7b1b9eabfd9d372b550041a2ea08a87a25 does not exist"
time="2021-03-24T17:17:54.968387126Z" level=error error="VM config for pod does not exist"
time="2021-03-24T17:18:09.200570496Z" level=warning msg="Failed to configure NAT exit status 4\nAnother app is currently holding the xtables lock. Perhaps you want to use the -w option?\n\n"
time="2021-03-24T17:18:09.200615332Z" level=warning msg="Ni allocation failed" vmID=15
time="2021-03-24T17:18:09.200626540Z" level=error msg="failed to allocate VM in VM pool" image="vhiveease/helloworld:var_workload" vmID=15
time="2021-03-24T17:18:09.200639062Z" level=error msg="coordinator failed to start VM" error="exit status 4" image="vhiveease/helloworld:var_workload" vmID=15
time="2021-03-24T17:18:09.200670269Z" level=error msg="failed to start VM" error="exit status 4"
time="2021-03-24T17:18:09.204056013Z" level=error msg="VM config for pod c7cb6f98019a2fda3920240060cebc7b1b9eabfd9d372b550041a2ea08a87a25 does not exist"
time="2021-03-24T17:18:09.204089221Z" level=error error="VM config for pod does not exist"
time="2021-03-24T17:18:09.219684994Z" level=warning msg="Failed to setup forwarding out from tap exit status 4\nAnother app is currently holding the xtables lock. Perhaps you want to use the -w option?\n\n"
time="2021-03-24T17:18:09.219725679Z" level=warning msg="Ni allocation failed" vmID=16
time="2021-03-24T17:18:09.219737360Z" level=error msg="failed to allocate VM in VM pool" image="vhiveease/helloworld:var_workload" vmID=16
time="2021-03-24T17:18:09.219749769Z" level=error msg="coordinator failed to start VM" error="exit status 4" image="vhiveease/helloworld:var_workload" vmID=16
time="2021-03-24T17:18:09.219766842Z" level=error msg="failed to start VM" error="exit status 4"
time="2021-03-24T17:18:09.222554189Z" level=error msg="VM config for pod 8fb241d353e59c336b3866f83a70487b5814326181215d46e3c741c56cf12796 does not exist"
time="2021-03-24T17:18:09.222582026Z" level=error error="VM config for pod does not exist"
time="2021-03-24T17:18:21.225954250Z" level=warning msg="Failed to setup forwarding into tap exit status 4\nAnother app is currently holding the xtables lock. Perhaps you want to use the -w option?\n\n"
time="2021-03-24T17:18:21.230963337Z" level=warning msg="Ni allocation failed" vmID=21
time="2021-03-24T17:18:21.231034108Z" level=error msg="failed to allocate VM in VM pool" image="vhiveease/helloworld:var_workload" vmID=21
time="2021-03-24T17:18:21.231056239Z" level=error msg="coordinator failed to start VM" error="exit status 4" image="vhiveease/helloworld:var_workload" vmID=21
time="2021-03-24T17:18:21.231076448Z" level=error msg="failed to start VM" error="exit status 4"
time="2021-03-24T17:18:21.243821114Z" level=error msg="VM config for pod c7cb6f98019a2fda3920240060cebc7b1b9eabfd9d372b550041a2ea08a87a25 does not exist"
time="2021-03-24T17:18:21.243852213Z" level=error error="VM config for pod does not exist"
time="2021-03-24T17:18:31.076004001Z" level=info msg="HEARTBEAT: number of active VMs: 4"
time="2021-03-24T17:18:31.077267660Z" level=info msg="FuncPool heartbeat: ==== Stats by cold functions ====\nfID, #started, #served\n==================================="

Fccd logs:

time="2021-03-24T17:17:53.435502537Z" level=debug msg="create VM request: VMID:\"9\" MachineCfg:<MemSizeMib:256 VcpuCount:1 > KernelArgs:\"ro noapic reboot=k panic=1 pci=off nomodules systemd.log_color=false systemd.unit=firecracker.target init=/sbin/overlay-init tsc=reliable quiet 8250.nr_uarts=0 ipv6.disable=1\" NetworkInterfaces:<StaticConfig:<MacAddress:\"02:FC:00:00:00:08\" HostDevName:\"9_tap\" IPConfig:<PrimaryAddr:\"190.128.0.10/10\" GatewayAddr:\"190.128.0.1\" Nameservers:\"10.96.0.10\" Nameservers:\"8.8.8.8\" > > > TimeoutSeconds:100 "
time="2021-03-24T17:17:53.435585033Z" level=debug msg="using namespace: firecracker-containerd"
time="2021-03-24T17:17:53.436032591Z" level=debug msg="starting containerd-shim-aws-firecracker" vmID=9
time="2021-03-24T17:17:53.466895345Z" level=info msg="starting signal loop" namespace=firecracker-containerd path=/var/lib/firecracker-containerd/shim-base/firecracker-containerd/9 pid=94101
time="2021-03-24T17:17:53.467985819Z" level=info msg="creating new VM" runtime=aws.firecracker vmID=9
time="2021-03-24T17:17:53.468223860Z" level=error msg="failed to create VM" error="failed to build VM configuration: failed to create container stub drives: failed to create container stub drive: open /var/lib/firecracker-containerd/shim-base/firecracker-containerd/9/ctrstub0: file exists" runtime=aws.firecracker vmID=9
time="2021-03-24T17:17:53.468497084Z" level=error msg="shim CreateVM returned error" error="rpc error: code = Unknown desc = failed to create VM: failed to build VM configuration: failed to create container stub drives: failed to create container stub drive: open /var/lib/firecracker-containerd/shim-base/firecracker-containerd/9/ctrstub0: file exists"
time="2021-03-24T17:17:53.469552432Z" level=debug msg="shim has been terminated" error="signal: killed" vmID=9
time="2021-03-24T17:17:53.533591717Z" level=debug msg="create VM request: VMID:\"10\" MachineCfg:<MemSizeMib:256 VcpuCount:1 > KernelArgs:\"ro noapic reboot=k panic=1 pci=off nomodules systemd.log_color=false systemd.unit=firecracker.target init=/sbin/overlay-init tsc=reliable quiet 8250.nr_uarts=0 ipv6.disable=1\" NetworkInterfaces:<StaticConfig:<MacAddress:\"02:FC:00:00:00:09\" HostDevName:\"10_tap\" IPConfig:<PrimaryAddr:\"190.128.0.11/10\" GatewayAddr:\"190.128.0.1\" Nameservers:\"10.96.0.10\" Nameservers:\"8.8.8.8\" > > > TimeoutSeconds:100 "
time="2021-03-24T17:17:53.533667047Z" level=debug msg="using namespace: firecracker-containerd"

...

time="2021-03-24T17:18:13.453043615Z" level=debug msg="starting containerd-shim-aws-firecracker" vmID=17
time="2021-03-24T17:18:13.495012783Z" level=info msg="starting signal loop" namespace=firecracker-containerd path=/var/lib/firecracker-containerd/shim-base/firecracker-containerd/17 pid=95077
time="2021-03-24T17:18:13.496182755Z" level=info msg="creating new VM" runtime=aws.firecracker vmID=17
time="2021-03-24T17:18:13.496482776Z" level=info msg="Called startVMM(), setting up a VMM on firecracker.sock" runtime=aws.firecracker
setupLogging
setupMetrics
time="2021-03-24T17:18:13.512705517Z" level=info msg="VMM metrics disabled." runtime=aws.firecracker
time="2021-03-24T17:18:13.513262662Z" level=info msg="refreshMachineConfiguration: [GET /machine-config][200] getMachineConfigurationOK  &{CPUTemplate:T2 HtEnabled:0xc000908933 MemSizeMib:0xc000908928 VcpuCount:0xc000908920}" runtime=aws.firecracker
time="2021-03-24T17:18:13.513686040Z" level=info msg="PutGuestBootSource: [PUT /boot-source][204] putGuestBootSourceNoContent " runtime=aws.firecracker
time="2021-03-24T17:18:13.513705997Z" level=info msg="Attaching drive /var/lib/firecracker-containerd/runtime/default-rootfs.img, slot root_drive, root true." runtime=aws.firecracker
time="2021-03-24T17:18:13.514113713Z" level=info msg="Attached drive /var/lib/firecracker-containerd/runtime/default-rootfs.img: [PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent " runtime=aws.firecracker
time="2021-03-24T17:18:13.514148036Z" level=info msg="Attaching drive /var/lib/firecracker-containerd/shim-base/firecracker-containerd/17/ctrstub0, slot MN2HE43UOVRDA, root false." runtime=aws.firecracker
time="2021-03-24T17:18:13.514413232Z" level=info msg="Attached drive /var/lib/firecracker-containerd/shim-base/firecracker-containerd/17/ctrstub0: [PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent " runtime=aws.firecracker
time="2021-03-24T17:18:13.514447997Z" level=info msg="Attaching NIC 17_tap (hwaddr 02:FC:00:00:00:10) at index 1" runtime=aws.firecracker
time="2021-03-24T17:18:13.531276260Z" level=info msg="startInstance successful: [PUT /actions][204] createSyncActionNoContent " runtime=aws.firecracker
time="2021-03-24T17:18:13.531321520Z" level=info msg="calling agent" runtime=aws.firecracker vmID=17
time="2021-03-24T17:18:13.582440559Z" level=debug msg="create VM request: VMID:\"18\" MachineCfg:<MemSizeMib:256 VcpuCount:1 > KernelArgs:\"ro noapic reboot=k panic=1 pci=off nomodules systemd.log_color=false systemd.unit=firecracker.target init=/sbin/overlay-init tsc=reliable quiet 8250.nr_uarts=0 ipv6.disable=1\" NetworkInterfaces:<StaticConfig:<MacAddress:\"02:FC:00:00:00:11\" HostDevName:\"18_tap\" IPConfig:<PrimaryAddr:\"190.128.0.19/10\" GatewayAddr:\"190.128.0.1\" Nameservers:\"10.96.0.10\" Nameservers:\"8.8.8.8\" > > > TimeoutSeconds:100 "
time="2021-03-24T17:18:13.582511635Z" level=debug msg="using namespace: firecracker-containerd"
time="2021-03-24T17:18:13.582767953Z" level=debug msg="starting containerd-shim-aws-firecracker" vmID=18
time="2021-03-24T17:18:13.628695870Z" level=info msg="starting signal loop" namespace=firecracker-containerd path=/var/lib/firecracker-containerd/shim-base/firecracker-containerd/18 pid=95139

Complete logs

orch.out.txt
fccd.err.txt

@ustiugov
Copy link
Member

@shyamjesal it seems that the following folders /var/lib/firecracker-containerd/shim-base/firecracker-containerd/<vmID> are not deleted after each run of CRI tests. This might be attributed to the fact that shutting down VMs fails sometimes. This is not critical because this problem can only appear across vHive-CRI starts not within a single run (because vHive-CRI uses monotonically increasing vmIDs).

I suggest adding the following line to the CRI cleaning script:

sudo rm -rf /var/lib/firecracker-containerd/shim-base/firecracker-containerd/*

@ustiugov
Copy link
Member

@shyamjesal as a feedback, you need to collect logs of vHive with Debug. The info verbosity level may not be sufficient.

@ustiugov
Copy link
Member

ustiugov commented Jun 8, 2021

#64 seems related

HermioneKT pushed a commit that referenced this issue Jan 31, 2024
# This is the 1st commit message:

refactor

# This is the commit message #2:

modify powerNodeSelector

# This is the commit message #3:

modify powerNodeSelector

# This is the commit message #4:

modify powerNodeSelector

# This is the commit message #5:

modify namespace

# This is the commit message #6:

modify namespace

# This is the commit message #7:

add namespace

# This is the commit message #8:

init power manager script

# This is the commit message #9:

add go code

# This is the commit message #10:

test

# This is the commit message #11:

test

# This is the commit message #12:

test

# This is the commit message #13:

test

# This is the commit message #14:

go.sum

# This is the commit message #15:

go sum

# This is the commit message #16:

go mod tidy

# This is the commit message #17:

test

# This is the commit message #18:

test

# This is the commit message #19:

test

# This is the commit message #20:

test

# This is the commit message #21:

test

# This is the commit message #22:

test

# This is the commit message #23:

test

# This is the commit message #24:

test

# This is the commit message #25:

test

# This is the commit message #26:

test

# This is the commit message #27:

test

# This is the commit message #28:

test

# This is the commit message #29:

print output

# This is the commit message #30:

test

# This is the commit message #31:

test

# This is the commit message #32:

test

# This is the commit message #33:

test

# This is the commit message #34:

test

# This is the commit message #35:

test

# This is the commit message #36:

test

# This is the commit message #37:

test

# This is the commit message #38:

test

# This is the commit message #39:

test

# This is the commit message #40:

refactor

# This is the commit message #41:

change val

# This is the commit message #42:

test

# This is the commit message #43:

test

# This is the commit message #44:

test

# This is the commit message #45:

exp1

# This is the commit message #46:

add url

# This is the commit message #47:

test

# This is the commit message #48:

test

# This is the commit message #49:

test auth

# This is the commit message #50:

test

# This is the commit message #51:

test

# This is the commit message #52:

test

# This is the commit message #53:

test

# This is the commit message #54:

test

# This is the commit message #55:

fix bug

# This is the commit message #56:

remove log

# This is the commit message #57:

test

# This is the commit message #58:

test

# This is the commit message #59:

test

# This is the commit message #60:

fix loop

# This is the commit message #61:

increase invocation

# This is the commit message #62:

test

# This is the commit message #63:

test

# This is the commit message #64:

test

# This is the commit message #65:

test

# This is the commit message #66:

test

# This is the commit message #67:

test

# This is the commit message #68:

test

# This is the commit message #69:

Test

# This is the commit message #70:

test

# This is the commit message #71:

test

# This is the commit message #72:

test

# This is the commit message #73:

test

# This is the commit message #74:

test

# This is the commit message #75:

test

# This is the commit message #76:

test

# This is the commit message #77:

test

# This is the commit message #78:

test

# This is the commit message #79:

test

# This is the commit message #80:

test

# This is the commit message #81:

test

# This is the commit message #82:

test

# This is the commit message #83:

test

# This is the commit message #84:

test

# This is the commit message #85:

test

# This is the commit message #86:

test

# This is the commit message #87:

test

# This is the commit message #88:

test

# This is the commit message #89:

test

# This is the commit message #90:

test1

# This is the commit message #91:

test

# This is the commit message #92:

test

# This is the commit message #93:

test

# This is the commit message #94:

test

# This is the commit message #95:

test

# This is the commit message #96:

test

# This is the commit message #97:

test

# This is the commit message #98:

test

# This is the commit message #99:

test

# This is the commit message #100:

test

# This is the commit message #101:

test

# This is the commit message #102:

test

# This is the commit message #103:

test

# This is the commit message #104:

test

# This is the commit message #105:

test

# This is the commit message #106:

test

# This is the commit message #107:

test

# This is the commit message #108:

test

# This is the commit message #109:

test1

# This is the commit message #110:

test

# This is the commit message #111:

test

# This is the commit message #112:

Test

# This is the commit message #113:

test

# This is the commit message #114:

test

# This is the commit message #115:

test

# This is the commit message #116:

test

# This is the commit message #117:

test

# This is the commit message #118:

test

# This is the commit message #119:

Test

# This is the commit message #120:

Test

# This is the commit message #121:

test

# This is the commit message #122:

test

# This is the commit message #123:

Test

# This is the commit message #124:

test

# This is the commit message #125:

test

# This is the commit message #126:

test

# This is the commit message #127:

test

# This is the commit message #128:

test

# This is the commit message #129:

test

# This is the commit message #130:

test

# This is the commit message #131:

test

# This is the commit message #132:

test

# This is the commit message #133:

test

# This is the commit message #134:

test

# This is the commit message #135:

test

# This is the commit message #136:

test

# This is the commit message #137:

test

# This is the commit message #138:

test

# This is the commit message #139:

test

# This is the commit message #140:

test

# This is the commit message #141:

test

# This is the commit message #142:

test

# This is the commit message #143:

Test

# This is the commit message #144:

Test

# This is the commit message #145:

Test

# This is the commit message #146:

Test

# This is the commit message #147:

test

# This is the commit message #148:

test

# This is the commit message #149:

test

# This is the commit message #150:

test

# This is the commit message #151:

test

# This is the commit message #152:

test

# This is the commit message #153:

test

# This is the commit message #154:

Test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
vHive Roadmap
  
Awaiting triage
Development

No branches or pull requests

2 participants