tcmalloc (Istio 1.22+) causes Envoy to fail to startup on some CPUs (rockchip) #51708

Open
2 tasks done
phoesh opened this issue Jun 25, 2024 · 7 comments

Comments

@phoesh

phoesh commented Jun 25, 2024

Is this the right place to submit this?

  • This is not a security vulnerability or a crashing bug
  • This is not a question about how to use Istio

Bug Description

$istioctl install --set profile=ambient --set "components.ingressGateways[0].enabled=true" --set "components.ingressGateways[0].name=istio-ingressgateway" --skip-confirmation
The default revision has been updated to point to this installation.
✔ Istio core installed
✔ Istiod installed
✔ CNI installed
✔ Ztunnel installed
✘ Ingress gateways encountered an error: failed to wait for resource: resources not ready after 5m0s: context deadline exceeded
  Deployment/istio-system/istio-ingressgateway (container failed to start: CrashLoopBackOff: back-off 2m40s restarting failed container=istio-proxy pod=istio-ingressgateway-6f48dfb7db-9pnlv_istio-system(22a1ec19-1b1d-416a-9053-20b21b0c153c))
- Pruning removed resources
Error: failed to install manifests: errors occurred during operation

I used istioctl to install Istio in ambient mode on my cluster and ran into the error above.

I also tried installing with Helm and hit the same problem:
helm install istio-ingress istio/gateway -n istio-ingress --create-namespace --wait

The pod logs:

2024-06-25T13:27:08.977104Z	info	Set max file descriptors (ulimit -n) to: 1048576
2024-06-25T13:27:08.977376Z	info	Proxy role	ips=[10.244.161.177] type=router id=istio-ingressgateway-6f48dfb7db-9pnlv.istio-system domain=istio-system.svc.cluster.local
2024-06-25T13:27:08.977480Z	info	Apply mesh config from file defaultConfig:
  discoveryAddress: istiod.istio-system.svc:15012
  image:
    imageType: distroless
  proxyMetadata:
    ISTIO_META_ENABLE_HBONE: "true"
defaultProviders:
  metrics:
  - prometheus
enablePrometheusMerge: true
rootNamespace: istio-system
trustDomain: cluster.local
2024-06-25T13:27:08.981053Z	info	cpu limit detected as 2, setting concurrency
2024-06-25T13:27:08.982905Z	info	Effective config: binaryPath: /usr/local/bin/envoy
concurrency: 2
configPath: ./etc/istio/proxy
controlPlaneAuthPolicy: MUTUAL_TLS
discoveryAddress: istiod.istio-system.svc:15012
drainDuration: 45s
image:
  imageType: distroless
proxyAdminPort: 15000
proxyMetadata:
  ISTIO_META_ENABLE_HBONE: "true"
serviceCluster: istio-proxy
statNameLength: 189
statusPort: 15020
terminationDrainDuration: 5s
2024-06-25T13:27:08.983142Z	info	JWT policy is third-party-jwt
2024-06-25T13:27:08.983169Z	info	using credential fetcher of JWT type in cluster.local trust domain
2024-06-25T13:27:09.188499Z	info	Workload SDS socket not found. Starting Istio SDS Server
2024-06-25T13:27:09.188585Z	info	Opening status port 15020
2024-06-25T13:27:09.188634Z	info	CA Endpoint istiod.istio-system.svc:15012, provider Citadel
2024-06-25T13:27:09.188718Z	info	Using CA istiod.istio-system.svc:15012 cert with certs: var/run/secrets/istio/root-cert.pem
2024-06-25T13:27:09.217597Z	info	ads	All caches have been synced up in 241.833552ms, marking server ready
2024-06-25T13:27:09.218079Z	info	xdsproxy	Initializing with upstream address "istiod.istio-system.svc:15012" and cluster "Kubernetes"
2024-06-25T13:27:09.218153Z	info	sds	Starting SDS grpc server
2024-06-25T13:27:09.219081Z	info	starting Http service at 127.0.0.1:15004
2024-06-25T13:27:09.222977Z	info	Pilot SAN: [istiod.istio-system.svc]
2024-06-25T13:27:09.226722Z	info	Starting proxy agent
2024-06-25T13:27:09.226865Z	info	Envoy command: [-c etc/istio/proxy/envoy-rev.json --drain-time-s 45 --drain-strategy immediate --local-address-ip-version v4 --file-flush-interval-msec 1000 --disable-hot-restart --allow-unknown-static-fields -l warning --component-log-level misc:error --concurrency 2]
external/com_github_google_tcmalloc/tcmalloc/arena.cc:58] FATAL ERROR: Out of memory trying to allocate internal tcmalloc data (bytes, object-size); is something preventing mmap from succeeding (sandbox, VSS limitations)? 131072 632 @ 0x55610c797c 0x55610a437c 0x55610c06e8 0x55610c04cc 0x5561099338 0x5560fc5cd4 0x5560fc2a60 0x5561090748 0x7f9184783c
external/com_github_google_tcmalloc/tcmalloc/system-alloc.cc:625] MmapAligned() failed - unable to allocate with tag (hint, size, alignment) - is something limiting address placement? 0x2402c0000000 1073741824 1073741824 @ 0x55610c7614 0x55610c38d8 0x55610c3178 0x55610a42ec 0x55610c06e8 0x55610c04cc 0x5561099338 0x5560fc5cd4 0x5560fc2a60 0x5561090748 0x7f9184783c
2024-06-25T13:27:09.251207Z	error	Envoy exited with error: signal: aborted
2024-06-25T13:27:09.251944Z	info	sds	SDS server for workload certificates started, listening on "./var/run/secrets/workload-spiffe-uds/socket"
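
The numbers on the MmapAligned() line in the log above can be decoded with a little arithmetic. The sketch below is only an illustration, not a diagnosis: the hint, size, and alignment are copied from the log, while the 39-bit and 48-bit address widths are assumptions chosen for comparison (the log does not state the kernel's address width). The return addresses in the trace, for example 0x7f9184783c, all sit below 2^39, which is consistent with a 39-bit user address space but does not prove one.

```python
# Numbers copied from the MmapAligned() failure line in the log above.
HINT = 0x2402C0000000
SIZE = 1073741824
ALIGNMENT = 1073741824

# Assumption: candidate user-space address widths to compare against.
# The log itself does not say which one the kernel uses.
CANDIDATE_VA_BITS = (39, 48)


def fits_in_address_space(addr: int, size: int, va_bits: int) -> bool:
    """Return True if the range [addr, addr + size) lies below 2**va_bits."""
    return addr + size <= (1 << va_bits)


if __name__ == "__main__":
    print(f"size = {SIZE // 2**30} GiB, hint aligned: {HINT % ALIGNMENT == 0}")
    print(f"hint needs {HINT.bit_length()} address bits")
    for bits in CANDIDATE_VA_BITS:
        print(f"{bits}-bit space: fits = {fits_in_address_space(HINT, SIZE, bits)}")
```

If the kernel really is limited to 39 address bits, a hint that needs more bits than that cannot be honored, which would match the "is something limiting address placement?" wording in the error.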

Version

$istioctl version
client version: 1.22.1
control plane version: 1.22.1
data plane version: 1.22.1 (3 proxies)

$kubectl version
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.1

$helm version --short
v3.15.1+ge211f2a

Additional Information

Kubernetes is installed on 3 local machines.
1 control plane and 2 workers.
Installed Calico CNI and MetalLB.

@istio-policy-bot added the area/ambient and area/environments labels Jun 25, 2024
@howardjohn
Member

Interesting. Never seen this one. Do you have a "weird" host OS? Is it arm64?

@phoesh
Author

phoesh commented Jun 25, 2024

> Interesting. Never seen this one. Do you have a "weird" host OS? Is it arm64?

Yes, the machines are Orange Pi 5 Pro boards with arm64 CPUs.
Debian GNU/Linux 12 (bookworm)
5.10.160-rockchip-rk3588

@howardjohn
Member

This looks like envoyproxy/envoy#15235

@phoesh
Author

phoesh commented Jun 25, 2024

> This looks like envoyproxy/envoy#15235

Yeah! Thank you.
I noticed that the issue mentions a commit that already makes Armbian builds compatible with tcmalloc:
https://github.com/armbian/build/commit/b8bea3edb95c3a21399d3e42c90e6bdf032a2864
It seems I would need to reinstall the OS with Armbian, but I have no idea whether Armbian supports my boards.

@howardjohn
Member

In theory you could make a Debian rockchip build with the kernel config set, but that might require recompiling your own kernel, which is a pretty steep learning curve...

It would be nice for tcmalloc to handle this directly

fyi @kyessenov @briansonnenberg
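
Before rebuilding a kernel, it may help to check what the running kernel was built with. The sketch below assumes that the kernel option in question is CONFIG_ARM64_VA_BITS. That is a reading of the linked Envoy issue, not something stated in this thread. The two file locations are the usual places a distro exposes its kernel config, and either may be missing on a given system. The helper names are hypothetical.

```python
import gzip
import platform
import re
from pathlib import Path
from typing import Optional


def parse_va_bits(config_text: str) -> Optional[int]:
    """Return the numeric CONFIG_ARM64_VA_BITS value from kernel config text."""
    # Assumption: this is the relevant option; the thread does not name it.
    match = re.search(r"^CONFIG_ARM64_VA_BITS=(\d+)\s*$", config_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_kernel_config() -> Optional[str]:
    """Try the usual config locations; return None if neither is readable."""
    proc_config = Path("/proc/config.gz")
    boot_config = Path("/boot") / f"config-{platform.release()}"
    try:
        if proc_config.exists():
            with gzip.open(proc_config, "rt") as handle:
                return handle.read()
        if boot_config.exists():
            return boot_config.read_text()
    except OSError:
        return None
    return None


if __name__ == "__main__":
    text = read_kernel_config()
    va_bits = parse_va_bits(text) if text is not None else None
    if va_bits is None:
        print("CONFIG_ARM64_VA_BITS not found (not arm64, or config not exposed)")
    else:
        print(f"CONFIG_ARM64_VA_BITS={va_bits}")
```

Keeping the parsing separate from the file lookup means the same check can be run against a config file copied off the board.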

@iamasmith

One common issue when you see malloc-related errors, particularly well known with jemalloc, is an unexpected page size. It's worth running getconf PAGESIZE and checking whether the library supports that page size.

On a regular Pi 5 there's an option to run with kernel=kernel8.img to switch to a 4KiB page size from the regular 16KiB. You lose a little performance, but it does work. Well-known examples are fluentbit and fluentd, which don't work without that. I switched to promtail to avoid this issue, but I had it for a while.
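
The page-size check suggested above can also be scripted. This is a minimal sketch: `mmap.PAGESIZE` reports the same value that getconf PAGESIZE prints, and the supported-size tuple is a hypothetical placeholder, since the real values depend on how a given allocator was built.

```python
import mmap

# Hypothetical set of page sizes an allocator build supports; consult the
# library's own documentation or build flags for the real values.
ASSUMED_SUPPORTED_PAGE_SIZES = (4096,)


def describe_page_size(page_size: int, supported=ASSUMED_SUPPORTED_PAGE_SIZES) -> str:
    """Summarize a page size and whether it is in the assumed supported set."""
    kib = page_size // 1024
    status = "supported" if page_size in supported else "NOT in the assumed supported set"
    return f"{kib} KiB pages: {status}"


if __name__ == "__main__":
    # Same value that `getconf PAGESIZE` reports on the host.
    print(describe_page_size(mmap.PAGESIZE))
```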

@howardjohn
Member

fyi @keithmattix - a change when we moved to tcmalloc

@howardjohn changed the title from "Install Ambient ingress gateways error" to "tcmalloc (Istio 1.22+) causes Envoy to fail to startup on some CPUs (rockchip)" Aug 22, 2024
@howardjohn removed the area/ambient label Sep 9, 2024
4 participants