webhook autoscaler doesn't recognize organization target #534

Closed
yfried opened this issue May 9, 2021 · 33 comments
Labels
question Further information is requested stale

Comments

@yfried
Contributor

yfried commented May 9, 2021

I keep getting this in logs:

github-webhook-server 2021-05-09T11:48:28.986Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "296414796", "delivery": "787a18e0-b0bc-11eb-8d24-0ff6126380d2", "checkRun.status": "completed", "action": "completed"}

It looks like it can't pull the org name or type from the event payload (but I might be misunderstanding the code).

This is my config:

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: kubernetes-1cores-2048mi-summerwind-autoscaler
spec:
  scaleTargetRef:
    name: kubernetes-1cores-2048mi-summerwind
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 3
    duration: "5m"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: kubernetes-1cores-2048mi-summerwind
spec:
  template:
    spec:
      organization: Myorg-Private
      labels:
        - kubernetes-1cores-2048mi-summerwind
      ephemeral: true
      env:
        - name: RUNNER_DEBUG
          value: "true"
        - name: ACTIONS_RUNNER_INPUT_LABELS
          value: kubernetes-1cores-2048mi-summerwind
      image: my-custom-image:v1
      resources:
        requests:
          cpu: "1"
          memory: "2048Mi"
        limits:
          cpu: "1"
          memory: "2048Mi"
      dockerEnabled: false

I'm deploying the controller using the Helm chart from my own branch, which shouldn't matter, but just in case...

Here's my values file:

authSecret:
  create: false
  name: controller-manager
createDummySecret: false
githubWebhookServer:
  enabled: true
  imagePullSecrets: []
  ingress:
    enabled: true
    hosts:
      - host: summerwind-webhook-listener.myorg.com
        paths:
          - backend:
              serviceName: summerwind-actions-runner-controller-webhook
              servicePort: http
    tls:
      - hosts:
          - summerwind-webhook-listener.myorg.com
        secretName: myorg.com
  nameOverride: summerwind-webhook-listener
  replicaCount: 1
  secret:
    create: true
    name: github-webhook-server
metrics:
  proxy:
    enabled: false
  serviceMonitor: true
@yfried yfried changed the title webhook autoscaler recognize organization target webhook autoscaler doesn't recognize organization target May 9, 2021
@mumoshu
Collaborator

mumoshu commented May 9, 2021

@yfried Could you share with us the webhook payload of the event whose delivery ID is "787a18e0-b0bc-11eb-8d24-0ff6126380d2"? Without seeing it, I can't tell what's happening. The webhook-based autoscaler should just work with organizational runners.

@mumoshu
Collaborator

mumoshu commented May 9, 2021

@yfried This is the relevant code in the controller. You could try debugging it to see whether it really works as intended on your webhook payload: https://github.com/actions-runner-controller/actions-runner-controller/blob/082245c5db64e023cd79604d9b158f336770e3fe/controllers/horizontal_runner_autoscaler_webhook.go#L170-L184

@yfried
Contributor Author

yfried commented May 10, 2021

@yfried Could you share with us the webhook payload of the event whose delivery ID is "787a18e0-b0bc-11eb-8d24-0ff6126380d2"? Without seeing it, I can't tell what's happening. The webhook-based autoscaler should just work with organizational runners.

@mumoshu
I'm attaching a queued event that should have triggered a scale-up:

Request URL: https://summerwind-webhook-listener.mysoluto.com:
Request method: POST
Accept: */*
content-type: application/json
User-Agent: GitHub-Hookshot/536e04f
X-GitHub-Delivery: 787a18e0-b0bc-11eb-8d24-0ff6126380d2
X-GitHub-Event: check_run
X-GitHub-Hook-ID: 296414796
X-GitHub-Hook-Installation-Target-ID: 11223344
X-GitHub-Hook-Installation-Target-Type: organization
{
  "action": "created",
  "check_run": {
    "id": 1234567890,
    "node_id": "MDg6Q2hlY2tSdW4yNTM4NzI2Nzk0",
    "head_sha": "0989e14252273b848d0a6bd20c58ec41558c02ce",
    "external_id": "f73aec57-e22c-559d-9485-1eb8b431809b",
    "url": "https://api.github.com/repos/Myorg-Private/myrepo/check-runs/1234567890",
    "html_url": "https://github.com/Myorg-Private/myrepo/runs/1234567890",
    "details_url": "https://github.com/Myorg-Private/myrepo/runs/1234567890",
    "status": "queued",
    "conclusion": null,
    "started_at": "2021-05-09T11:46:42Z",
    "completed_at": null,
    "output": {
      "title": null,
      "summary": null,
      "text": null,
      "annotations_count": 0,
      "annotations_url": "https://api.github.com/repos/Myorg-Private/myrepo/check-runs/1234567890/annotations"
    },
    "name": "Run - yarn test:integration, node 14.x",
    "check_suite": {
      "id": 2690524465,
      "node_id": "MDEwOkNoZWNrU3VpdGUyNjkwNTI0NDY1",
      "head_branch": "main",
      "head_sha": "0989e14252273b848d0a6bd20c58ec41558c02ce",
      "status": "queued",
      "conclusion": null,
      "url": "https://api.github.com/repos/Myorg-Private/myrepo/check-suites/2690524465",
      "before": "c4e307e05ceba8530355c5df987c2220402956dc",
      "after": "0989e14252273b848d0a6bd20c58ec41558c02ce",
      "pull_requests": [

      ],
      "app": {
        "id": 15368,
        "slug": "github-actions",
        "node_id": "MDM6QXBwMTUzNjg=",
        "owner": {
          "login": "github",
          "id": 9919,
          "node_id": "MDEyOk9yZ2FuaXphdGlvbjk5MTk=",
          "avatar_url": "https://avatars.githubusercontent.com/u/9919?v=4",
          "gravatar_id": "",
          "url": "https://api.github.com/users/github",
          "html_url": "https://github.com/github",
          "followers_url": "https://api.github.com/users/github/followers",
          "following_url": "https://api.github.com/users/github/following{/other_user}",
          "gists_url": "https://api.github.com/users/github/gists{/gist_id}",
          "starred_url": "https://api.github.com/users/github/starred{/owner}{/repo}",
          "subscriptions_url": "https://api.github.com/users/github/subscriptions",
          "organizations_url": "https://api.github.com/users/github/orgs",
          "repos_url": "https://api.github.com/users/github/repos",
          "events_url": "https://api.github.com/users/github/events{/privacy}",
          "received_events_url": "https://api.github.com/users/github/received_events",
          "type": "Organization",
          "site_admin": false
        },
        "name": "GitHub Actions",
        "description": "Automate your workflow from idea to production",
        "external_url": "https://help.github.com/en/actions",
        "html_url": "https://github.com/apps/github-actions",
        "created_at": "2018-07-30T09:30:17Z",
        "updated_at": "2019-12-10T19:04:12Z",
        "permissions": {
          "actions": "write",
          "checks": "write",
          "contents": "write",
          "deployments": "write",
          "issues": "write",
          "metadata": "read",
          "organization_packages": "write",
          "packages": "write",
          "pages": "write",
          "pull_requests": "write",
          "repository_hooks": "write",
          "repository_projects": "write",
          "security_events": "write",
          "statuses": "write",
          "vulnerability_alerts": "read"
        },
        "events": [
          "check_run",
          "check_suite",
          "create",
          "delete",
          "deployment",
          "deployment_status",
          "fork",
          "gollum",
          "issues",
          "issue_comment",
          "label",
          "milestone",
          "page_build",
          "project",
          "project_card",
          "project_column",
          "public",
          "pull_request",
          "pull_request_review",
          "pull_request_review_comment",
          "push",
          "registry_package",
          "release",
          "repository",
          "repository_dispatch",
          "status",
          "watch",
          "workflow_dispatch",
          "workflow_run"
        ]
      },
      "created_at": "2021-05-09T11:46:41Z",
      "updated_at": "2021-05-09T11:46:41Z"
    },
    "app": {
      "id": 15368,
      "slug": "github-actions",
      "node_id": "MDM6QXBwMTUzNjg=",
      "owner": {
        "login": "github",
        "id": 9919,
        "node_id": "MDEyOk9yZ2FuaXphdGlvbjk5MTk=",
        "avatar_url": "https://avatars.githubusercontent.com/u/9919?v=4",
        "gravatar_id": "",
        "url": "https://api.github.com/users/github",
        "html_url": "https://github.com/github",
        "followers_url": "https://api.github.com/users/github/followers",
        "following_url": "https://api.github.com/users/github/following{/other_user}",
        "gists_url": "https://api.github.com/users/github/gists{/gist_id}",
        "starred_url": "https://api.github.com/users/github/starred{/owner}{/repo}",
        "subscriptions_url": "https://api.github.com/users/github/subscriptions",
        "organizations_url": "https://api.github.com/users/github/orgs",
        "repos_url": "https://api.github.com/users/github/repos",
        "events_url": "https://api.github.com/users/github/events{/privacy}",
        "received_events_url": "https://api.github.com/users/github/received_events",
        "type": "Organization",
        "site_admin": false
      },
      "name": "GitHub Actions",
      "description": "Automate your workflow from idea to production",
      "external_url": "https://help.github.com/en/actions",
      "html_url": "https://github.com/apps/github-actions",
      "created_at": "2018-07-30T09:30:17Z",
      "updated_at": "2019-12-10T19:04:12Z",
      "permissions": {
        "actions": "write",
        "checks": "write",
        "contents": "write",
        "deployments": "write",
        "issues": "write",
        "metadata": "read",
        "organization_packages": "write",
        "packages": "write",
        "pages": "write",
        "pull_requests": "write",
        "repository_hooks": "write",
        "repository_projects": "write",
        "security_events": "write",
        "statuses": "write",
        "vulnerability_alerts": "read"
      },
      "events": [
        "check_run",
        "check_suite",
        "create",
        "delete",
        "deployment",
        "deployment_status",
        "fork",
        "gollum",
        "issues",
        "issue_comment",
        "label",
        "milestone",
        "page_build",
        "project",
        "project_card",
        "project_column",
        "public",
        "pull_request",
        "pull_request_review",
        "pull_request_review_comment",
        "push",
        "registry_package",
        "release",
        "repository",
        "repository_dispatch",
        "status",
        "watch",
        "workflow_dispatch",
        "workflow_run"
      ]
    },
    "pull_requests": [

    ]
  },
  "repository": {
    "id": 1231231231,
    "node_id": "MDEwOlJlcG9zaXRvcnkzMjAwNTMwOTg=",
    "name": "myrepo",
    "full_name": "Myorg-Private/myrepo",
    "private": true,
    "owner": {
      "login": "Myorg-Private",
      "id": 11223344,
      "node_id": "MDEyOk9yZ2FuaXphdGlvbjMyMzYxMTk5",
      "avatar_url": "https://avatars.githubusercontent.com/u/11223344?v=4",
      "gravatar_id": "",
      "url": "https://api.github.com/users/Myorg-Private",
      "html_url": "https://github.com/Myorg-Private",
      "followers_url": "https://api.github.com/users/Myorg-Private/followers",
      "following_url": "https://api.github.com/users/Myorg-Private/following{/other_user}",
      "gists_url": "https://api.github.com/users/Myorg-Private/gists{/gist_id}",
      "starred_url": "https://api.github.com/users/Myorg-Private/starred{/owner}{/repo}",
      "subscriptions_url": "https://api.github.com/users/Myorg-Private/subscriptions",
      "organizations_url": "https://api.github.com/users/Myorg-Private/orgs",
      "repos_url": "https://api.github.com/users/Myorg-Private/repos",
      "events_url": "https://api.github.com/users/Myorg-Private/events{/privacy}",
      "received_events_url": "https://api.github.com/users/Myorg-Private/received_events",
      "type": "Organization",
      "site_admin": false
    },
    "html_url": "https://github.com/Myorg-Private/myrepo",
    "description": "We are merging all of our customer facing applications into a single, seamless experience...my.myenterprise.com",
    "fork": false,
    "url": "https://api.github.com/repos/Myorg-Private/myrepo",
    "forks_url": "https://api.github.com/repos/Myorg-Private/myrepo/forks",
    "keys_url": "https://api.github.com/repos/Myorg-Private/myrepo/keys{/key_id}",
    "collaborators_url": "https://api.github.com/repos/Myorg-Private/myrepo/collaborators{/collaborator}",
    "teams_url": "https://api.github.com/repos/Myorg-Private/myrepo/teams",
    "hooks_url": "https://api.github.com/repos/Myorg-Private/myrepo/hooks",
    "issue_events_url": "https://api.github.com/repos/Myorg-Private/myrepo/issues/events{/number}",
    "events_url": "https://api.github.com/repos/Myorg-Private/myrepo/events",
    "assignees_url": "https://api.github.com/repos/Myorg-Private/myrepo/assignees{/user}",
    "branches_url": "https://api.github.com/repos/Myorg-Private/myrepo/branches{/branch}",
    "tags_url": "https://api.github.com/repos/Myorg-Private/myrepo/tags",
    "blobs_url": "https://api.github.com/repos/Myorg-Private/myrepo/git/blobs{/sha}",
    "git_tags_url": "https://api.github.com/repos/Myorg-Private/myrepo/git/tags{/sha}",
    "git_refs_url": "https://api.github.com/repos/Myorg-Private/myrepo/git/refs{/sha}",
    "trees_url": "https://api.github.com/repos/Myorg-Private/myrepo/git/trees{/sha}",
    "statuses_url": "https://api.github.com/repos/Myorg-Private/myrepo/statuses/{sha}",
    "languages_url": "https://api.github.com/repos/Myorg-Private/myrepo/languages",
    "stargazers_url": "https://api.github.com/repos/Myorg-Private/myrepo/stargazers",
    "contributors_url": "https://api.github.com/repos/Myorg-Private/myrepo/contributors",
    "subscribers_url": "https://api.github.com/repos/Myorg-Private/myrepo/subscribers",
    "subscription_url": "https://api.github.com/repos/Myorg-Private/myrepo/subscription",
    "commits_url": "https://api.github.com/repos/Myorg-Private/myrepo/commits{/sha}",
    "git_commits_url": "https://api.github.com/repos/Myorg-Private/myrepo/git/commits{/sha}",
    "comments_url": "https://api.github.com/repos/Myorg-Private/myrepo/comments{/number}",
    "issue_comment_url": "https://api.github.com/repos/Myorg-Private/myrepo/issues/comments{/number}",
    "contents_url": "https://api.github.com/repos/Myorg-Private/myrepo/contents/{+path}",
    "compare_url": "https://api.github.com/repos/Myorg-Private/myrepo/compare/{base}...{head}",
    "merges_url": "https://api.github.com/repos/Myorg-Private/myrepo/merges",
    "archive_url": "https://api.github.com/repos/Myorg-Private/myrepo/{archive_format}{/ref}",
    "downloads_url": "https://api.github.com/repos/Myorg-Private/myrepo/downloads",
    "issues_url": "https://api.github.com/repos/Myorg-Private/myrepo/issues{/number}",
    "pulls_url": "https://api.github.com/repos/Myorg-Private/myrepo/pulls{/number}",
    "milestones_url": "https://api.github.com/repos/Myorg-Private/myrepo/milestones{/number}",
    "notifications_url": "https://api.github.com/repos/Myorg-Private/myrepo/notifications{?since,all,participating}",
    "labels_url": "https://api.github.com/repos/Myorg-Private/myrepo/labels{/name}",
    "releases_url": "https://api.github.com/repos/Myorg-Private/myrepo/releases{/id}",
    "deployments_url": "https://api.github.com/repos/Myorg-Private/myrepo/deployments",
    "created_at": "2020-12-09T19:04:06Z",
    "updated_at": "2021-05-06T14:27:15Z",
    "pushed_at": "2021-05-07T19:15:14Z",
    "git_url": "git://github.com/Myorg-Private/myrepo.git",
    "ssh_url": "git@github.com:Myorg-Private/myrepo.git",
    "clone_url": "https://github.com/Myorg-Private/myrepo.git",
    "svn_url": "https://github.com/Myorg-Private/myrepo",
    "homepage": "https://my.myenterprise.com",
    "size": 5296,
    "stargazers_count": 5,
    "watchers_count": 5,
    "language": "JavaScript",
    "has_issues": true,
    "has_projects": true,
    "has_downloads": true,
    "has_wiki": true,
    "has_pages": false,
    "forks_count": 1,
    "mirror_url": null,
    "archived": false,
    "disabled": false,
    "open_issues_count": 49,
    "license": null,
    "forks": 1,
    "open_issues": 49,
    "watchers": 5,
    "default_branch": "main"
  },
  "organization": {
    "login": "Myorg-Private",
    "id": 11223344,
    "node_id": "MDEyOk9yZ2FuaXphdGlvbjMyMzYxMTk5",
    "url": "https://api.github.com/orgs/Myorg-Private",
    "repos_url": "https://api.github.com/orgs/Myorg-Private/repos",
    "events_url": "https://api.github.com/orgs/Myorg-Private/events",
    "hooks_url": "https://api.github.com/orgs/Myorg-Private/hooks",
    "issues_url": "https://api.github.com/orgs/Myorg-Private/issues",
    "members_url": "https://api.github.com/orgs/Myorg-Private/members{/member}",
    "public_members_url": "https://api.github.com/orgs/Myorg-Private/public_members{/member}",
    "avatar_url": "https://avatars.githubusercontent.com/u/11223344?v=4",
    "description": null
  },
  "enterprise": {
    "id": 4321,
    "slug": "myenterprise",
    "name": "Myenterprise",
    "node_id": "MDEwOkVudGVycHJpc2U0MzYx",
    "avatar_url": "https://avatars.githubusercontent.com/b/4321?v=4",
    "description": "",
    "website_url": "https://www.myenterprise.com/",
    "html_url": "https://github.com/enterprises/myenterprise",
    "created_at": "2020-10-15T17:08:47Z",
    "updated_at": "2021-04-29T14:54:21Z"
  },
  "sender": {
    "login": "myuser",
    "id": 3214567,
    "node_id": "MDQ6VXNlcjg1NTY0OTU=",
    "avatar_url": "https://avatars.githubusercontent.com/u/3214567?v=4",
    "gravatar_id": "",
    "url": "https://api.github.com/users/myuser",
    "html_url": "https://github.com/myuser",
    "followers_url": "https://api.github.com/users/myuser/followers",
    "following_url": "https://api.github.com/users/myuser/following{/other_user}",
    "gists_url": "https://api.github.com/users/myuser/gists{/gist_id}",
    "starred_url": "https://api.github.com/users/myuser/starred{/owner}{/repo}",
    "subscriptions_url": "https://api.github.com/users/myuser/subscriptions",
    "organizations_url": "https://api.github.com/users/myuser/orgs",
    "repos_url": "https://api.github.com/users/myuser/repos",
    "events_url": "https://api.github.com/users/myuser/events{/privacy}",
    "received_events_url": "https://api.github.com/users/myuser/received_events",
    "type": "User",
    "site_admin": false
  }
}
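
For anyone comparing their own delivery against the payload above, here is a minimal Python sketch that pulls out the fields that identify the event and its owner. The function name `summarize_check_run_event` is hypothetical and not part of actions-runner-controller; the field paths are only the ones visible in the payload above.

```python
import json


def summarize_check_run_event(payload: dict) -> dict:
    """Collect the fields of a check_run delivery that identify the event and its owner."""
    check_run = payload.get("check_run", {})
    repository = payload.get("repository", {})
    organization = payload.get("organization", {})
    return {
        "action": payload.get("action"),
        "status": check_run.get("status"),
        "repository": repository.get("full_name"),
        "owner_type": repository.get("owner", {}).get("type"),
        "organization": organization.get("login"),
    }


if __name__ == "__main__":
    # Trimmed copy of the delivery above; only the fields read by the function are kept.
    sample = json.loads("""
    {
      "action": "created",
      "check_run": {"status": "queued"},
      "repository": {
        "full_name": "Myorg-Private/myrepo",
        "owner": {"login": "Myorg-Private", "type": "Organization"}
      },
      "organization": {"login": "Myorg-Private"}
    }
    """)
    print(summarize_check_run_event(sample))
```

Comparing `action` and `status` from the delivery with the values printed in the controller log for the same delivery ID is a quick way to confirm that both refer to the same event.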

@mumoshu
Collaborator

mumoshu commented May 10, 2021

@yfried Thanks! The webhook payload seems fine.

Theoretically speaking, the only possibility seems to be that you have two or more HorizontalRunnerAutoscalers that target a RunnerDeployment for the organization Myorg-Private.

Could you verify that by running kubectl -n $YOUR_HRA_NS get horizontalrunnerautoscaler?
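
If several HRAs come back, the following Python sketch shows one way to check whether two of them end up targeting the same organization. It is only an illustration: `find_ambiguous_orgs` is a hypothetical helper, not part of actions-runner-controller, and it assumes its inputs are the parsed `items` lists from `kubectl get runnerdeployment -o json` and `kubectl get horizontalrunnerautoscaler -o json`. Namespaces, runner groups, and labels are ignored.

```python
from collections import defaultdict


def find_ambiguous_orgs(runner_deployments: list, autoscalers: list) -> dict:
    """Return {organization: [HRA names]} for organizations targeted by more than one HRA."""
    # Map RunnerDeployment name -> organization (spec.template.spec.organization).
    org_by_deployment = {}
    for rd in runner_deployments:
        org = rd["spec"]["template"]["spec"].get("organization")
        if org:
            org_by_deployment[rd["metadata"]["name"]] = org

    # Group HRA names by the organization of the RunnerDeployment they scale.
    hras_by_org = defaultdict(list)
    for hra in autoscalers:
        target = hra["spec"]["scaleTargetRef"]["name"]
        org = org_by_deployment.get(target)
        if org:
            hras_by_org[org].append(hra["metadata"]["name"])

    return {org: names for org, names in hras_by_org.items() if len(names) > 1}
```

An empty result means no organization is targeted by more than one HRA under these simplified assumptions.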

@yfried
Contributor Author

yfried commented May 10, 2021

No.
I had only 1.
Especially because of this caveat.

@mumoshu
Collaborator

mumoshu commented May 10, 2021

@yfried Thanks. Can you see this message in your log? https://github.com/actions-runner-controller/actions-runner-controller/blob/082245c5db64e023cd79604d9b158f336770e3fe/controllers/horizontal_runner_autoscaler_webhook.go#L377

What does the organization field contain? Is it empty, or Myorg-Private?

@yfried
Contributor Author

yfried commented May 10, 2021

No, this message isn't in my logs, which is why I'm guessing it doesn't know to look for org

@mumoshu
Collaborator

mumoshu commented May 10, 2021

@yfried Thanks. Maybe the next possibility is that we aren't indexing the HRA correctly? In other words, can you confirm that it isn't matching any HRA here by adding your own log statement there? https://github.com/actions-runner-controller/actions-runner-controller/blob/082245c5db64e023cd79604d9b158f336770e3fe/controllers/horizontal_runner_autoscaler_webhook.go#L261-L269

mumoshu added a commit that referenced this issue May 11, 2021
… repo and org runners

Adds what I used while verifying #534
mumoshu added a commit that referenced this issue May 11, 2021
Adds some helpful debug log messages I have used while verifying #534
@mumoshu
Collaborator

mumoshu commented May 11, 2021

@yfried Hey! I've made a few changes to the controller myself and gave it a shot.

Webhook-based autoscaling on organizational runners seems to work for me.
This is the config I've used
https://github.com/actions-runner-controller/actions-runner-controller/blob/master/acceptance/testdata/org.hra.yaml
https://github.com/actions-runner-controller/actions-runner-controller/blob/master/acceptance/testdata/org.runnerdeploy.yaml

Note that I use envsubst to replace envvars like TEST_ORG with my own github org for testing

@mumoshu mumoshu added the question Further information is requested label May 11, 2021
@awoimbee

the only chance seems to be that you have two or more HorizontalRunnerAutoscaler that targets RunnerDeployment for the organization Myorg-Private.

Is there a plan to handle multiple org-wide HorizontalRunnerAutoscalers? Would it be possible to select the right HRA by looking at the RunnerDeployment's labels?

@mumoshu
Collaborator

mumoshu commented May 25, 2021

@awoimbee Could you elaborate? You can have one HRA and one RunnerDeployment per GitHub organization, and that should just work.

@awoimbee

@mumoshu
Let's say:
I want multiple kinds of runners (one with a GPU, one with 16GB of RAM, ...) that I select with labels.
I want all my runners to be organization-wide.
I want scaling to 0 (I don't want a runner with 24GB of RAM and a GPU to sit idle).

-> I have to create multiple organization-wide RunnerDeployments and also multiple HRAs.

It seems like the webhook scaling doesn't handle having multiple organization-wide HRAs (yet).

Is there a plan to handle this use case?
(Thanks for the awesome work btw)

@mumoshu
Collaborator

mumoshu commented May 26, 2021

@awoimbee How can you differentiate runners? Basically, webhook event payloads do not contain information about which runner (with certain labels, groups, orgs, repositories, etc.) the webhook event is going to trigger a workflow job run on.

If you have unique enough job names per org/repo/labels/groups/etc. for your workflows, then for scaling based on check run events you can set checkRun.names on the HRA so that it only reacts to check runs with those names.

@mumoshu
Collaborator

mumoshu commented May 26, 2021

// So depending on your requirement, you'd need to raise feature requests to GitHub, not us.

@mumoshu
Collaborator

mumoshu commented May 26, 2021

BTW, to be clear, although this issue says "webhook autoscaler doesn't recognize organization target", I've successfully tested it in my environment.

So I still believe this was some user error, even though there might be some documentation or operational enhancements we need to make.

@stale

stale bot commented Jun 25, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 25, 2021
@stale stale bot closed this as completed Jul 9, 2021
@fabiano-amaral

fabiano-amaral commented Oct 7, 2021

I found the same problem in our setup.

This issue is caused by an HRA without minReplicas and maxReplicas.

@mumoshu
Collaborator

mumoshu commented Oct 7, 2021

@fabiano-amaral Thanks for the info! Could you clarify a bit more? I have no idea how the minReplicas and maxReplicas of an HRA would affect it, because actions-runner-controller doesn't use those fields for filtering the right HRA for a webhook event.

@fabiano-amaral

fabiano-amaral commented Nov 2, 2021

@fabiano-amaral Thanks for the info! Could you clarify a bit more? I have no idea how the minReplicas and maxReplicas of an HRA would affect it, because actions-runner-controller doesn't use those fields for filtering the right HRA for a webhook event.

@mumoshu
Yes, I totally agree with you, but the only thing that worked for us was adding these settings to the HRA.

Min and max can MAYBE be used to set the maximum number of jobs that can be started at the same time, so you can limit how many pods spawn, and minReplicas can keep idle runners waiting for a webhook event.

But this is only my theory; I haven't debugged the source code. This information could be in the documentation.
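
To make the theory above concrete, here is a small Python sketch of how a minimum and a maximum could bound a webhook-driven replica count. It is an assumption for illustration only, not a reading of the controller source: `desired_replicas` and the `reserved` parameter (extra runners requested by webhook events that have not expired yet) are hypothetical names.

```python
def desired_replicas(min_replicas: int, max_replicas: int, reserved: int) -> int:
    """Start from min_replicas, add webhook-requested capacity, and cap at max_replicas.

    Assumption: this models the theory in the comment above, not ARC's actual code.
    """
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    if reserved < 0:
        raise ValueError("reserved must not be negative")
    return min(max_replicas, min_replicas + reserved)


if __name__ == "__main__":
    # With minReplicas=1 and maxReplicas=5, ten queued jobs still yield at most 5 runners.
    for reserved in (0, 3, 10):
        print(reserved, desired_replicas(1, 5, reserved))
```

Under this model, leaving both bounds unset would give the calculation nothing to start from or stop at, which would be consistent with the observation that scaling only worked once they were added.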

@mumoshu
Collaborator

mumoshu commented Nov 3, 2021

@awoimbee Hey! We don't yet have out-of-the-box support for handling multiple sets of organizational runners with a single HRA, which is what you asked about.

But we do now have workflow_job webhook-based scaling that doesn't require manual mapping on check run names:

https://github.com/actions-runner-controller/actions-runner-controller#example-1-scale-on-each-workflow_job-event

With that it should be pretty easy to set up multiple HRA+RunnerDeployment pairs.

Does it solve your original issue? Or were you talking about anything else?

@mumoshu
Collaborator

mumoshu commented Nov 4, 2021

@fabiano-amaral Thanks a lot for the info! I have not reproduced your issue yet, but I'll definitely keep your information in mind and report back if there's any update 👍

@UrosCvijan

UrosCvijan commented Jul 26, 2022

Hi @mumoshu, did you reach any conclusion on this? I am seeing the same thing. I started testing webhook-based autoscaling on GitHub Enterprise and get the same message; the log never shows it finding an organization runner.

2022-07-26T10:52:06Z DEBUG controllers.webhookbasedautoscaler Found 0 HRAs by key {"key": "engineering/ReleaseTesting"}
2022-07-26T10:52:06Z DEBUG controllers.webhookbasedautoscaler Found some runner groups are managed by ARC {"event": "check_run", "hookID": "326", "delivery": "fd99f650-0cd0-11ed-8638-58a055482577", "groups": "RunnerGroup{Scope:Organization, Kind:Custom, Name:prod-scaling}"}
2022-07-26T10:52:06Z DEBUG controllers.webhookbasedautoscaler groups {"event": "check_run", "hookID": "326", "delivery": "fd99f650-0cd0-11ed-8638-58a055482577", "groups": "RunnerGroup{Scope:Organization, Kind:Custom, Name:prod-scaling}"}
2022-07-26T10:52:06Z DEBUG controllers.webhookbasedautoscaler Found 1 HRAs by key {"key": "engineering/group/prod-scaling"}
2022-07-26T10:52:06Z DEBUG controllers.webhookbasedautoscaler no repository/organizational/enterprise runner found {"event": "check_run", "hookID": "326", "delivery": "fd99f650-0cd0-11ed-8638-58a055482577", "repository": "engineering/ReleaseTesting", "organization": "engineering", "enterprise": "xxxxxxxx"}
2022-07-26T10:52:06Z DEBUG controllers.webhookbasedautoscaler Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event {"event": "check_run", "hookID": "326", "delivery": "fd99f650-0cd0-11ed-8638-58a055482577", "checkRun.status": "completed", "action": "completed"}
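
The `Found N HRAs by key` lines above suggest that lookups happen under keys shaped like `<org>/<repo>` and `<org>/group/<group>`. The following Python sketch only rebuilds those two key shapes as they appear in these logs; `candidate_keys` is a hypothetical name, and any other key shapes the controller may use are not covered.

```python
def candidate_keys(repository_full_name: str, organization: str, groups: list) -> list:
    """Build lookup keys in the two shapes seen in the log lines above."""
    # Repository key, e.g. "engineering/ReleaseTesting".
    keys = [repository_full_name]
    # One key per organization runner group, e.g. "engineering/group/prod-scaling".
    keys.extend(f"{organization}/group/{group}" for group in groups)
    return keys


if __name__ == "__main__":
    print(candidate_keys("engineering/ReleaseTesting", "engineering", ["prod-scaling"]))
```

In the logs above the group key does find one HRA, so the key lookup itself is not what fails for this delivery.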

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: engineering-runner-prod
  namespace: actions-runner-engineering
spec:
  template:
    spec:
      organization: engineering
      # serviceAccountName: actions-runner
      labels:
        - scale
        - prod
      group: prod-scaling
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: engineering-runner-prod-autoscaler
  namespace: actions-runner-engineering
spec:
  scaleDownDelaySecondsAfterScaleOut: 300
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    name: engineering-runner-prod
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "30m"

So I tried the simplest approach from the docs. Now the runner is in a runner group and has specific tags; does that have anything to do with it?

The idea is for us to have multiple runner groups in one org, basically one per team, so we can divide permissions. I have multiple runner controllers and that all works fine. But now I wanted to test out scaling to limit the number of runners just waiting idle, and I get the above.

@toast-gear
Collaborator

Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event {"event": "check_run"

scaleUpTriggers:
  - githubEvent:
      workflowJob: {}   # <---- your configured event
    duration: "30m"

@UrosCvijan

Hmmm,
I understood from this example that I just need to leave it empty there:

https://github.com/actions-runner-controller/actions-runner-controller#example-1-scale-on-each-workflow_job-event

And it will pick up anything that comes in

@toast-gear
Collaborator

toast-gear commented Jul 26, 2022

And it will pick up anything that comes in

It will pick up the configured event i.e. the child key of the githubEvent: key e.g.:

Look for the workflow_job event:

githubEvent:
  workflowJob: {}

Look for the check_run event:

githubEvent:
  checkRun:
    ...

Look for the pull_request event:

githubEvent:
  pullRequest:
    ...

We are removing all event types other than workflow_job, so you should configure the webhook on GitHub to send only the workflow_job event to ARC and configure your HRAs to look for the workflow_job event (which you've done already). See #1607 for the why.

We currently don't have any child keys for the workflowJob: key, as we don't think any are needed. The other events are legacy (again, see the issue for details), so we are deprecating and removing them.
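
As a rough illustration of the mapping described above, this Python sketch checks whether an HRA spec is configured for the webhook event that was actually received. The key-to-event pairs are the three named in this comment; the function names are hypothetical, and the real controller also applies further filters (for example check run types or status) that are not modelled here.

```python
# HRA trigger key -> GitHub webhook event name, as listed in the comment above.
TRIGGER_TO_EVENT = {
    "workflowJob": "workflow_job",
    "checkRun": "check_run",
    "pullRequest": "pull_request",
}


def configured_events(hra_spec: dict) -> set:
    """Return the webhook event names an HRA spec is configured to react to."""
    events = set()
    for trigger in hra_spec.get("scaleUpTriggers", []):
        for key in trigger.get("githubEvent", {}):
            if key in TRIGGER_TO_EVENT:
                events.add(TRIGGER_TO_EVENT[key])
    return events


def reacts_to(hra_spec: dict, received_event: str) -> bool:
    """True if the received webhook event matches one of the HRA's configured triggers."""
    return received_event in configured_events(hra_spec)


if __name__ == "__main__":
    spec = {"scaleUpTriggers": [{"githubEvent": {"workflowJob": {}}, "duration": "30m"}]}
    print(reacts_to(spec, "check_run"))     # the event seen in the logs
    print(reacts_to(spec, "workflow_job"))  # the event the HRA is configured for
```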

@UrosCvijan

UrosCvijan commented Jul 26, 2022

So did I set it up correctly for the workflow_job event? I just took the example from the docs and tried to replicate it, but for some reason it does not work. Is there something I need to change for the HRA to scale the runners?

I am using these versions for helm chart:
actions-runner-controller-0.20.2 0.25.2

@toast-gear
Collaborator

toast-gear commented Jul 26, 2022

Your HRA looks correct, the controller logs suggest it's receiving check_run events:

2022-07-26T10:52:06Z DEBUG controllers.webhookbasedautoscaler Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event {"event": "check_run", "hookID": "326", "delivery": "fd99f650-0cd0-11ed-8638-58a055482577", "checkRun.status": "completed", "action": "completed"}

See the event key value for what webhook event ARC received:

{"event": "check_run", "hookID": "326", "delivery": "fd99f650-0cd0-11ed-8638-58a055482577", "checkRun.status": "completed", "action": "completed"}
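
If you have many log lines to go through, a short Python sketch like the one below can pull out that trailing JSON object so the `event` value is easy to read. `log_fields` is a hypothetical helper; it assumes the first `{` on the line starts the JSON object, which holds for the lines quoted in this thread.

```python
import json


def log_fields(line: str) -> dict:
    """Parse the JSON object at the end of a controller log line."""
    # Assumption: the human-readable message contains no "{" before the JSON starts.
    start = line.find("{")
    if start == -1:
        return {}
    return json.loads(line[start:])


if __name__ == "__main__":
    line = (
        '2022-07-26T10:52:06Z DEBUG controllers.webhookbasedautoscaler '
        'Scale target not found. '
        '{"event": "check_run", "hookID": "326", "action": "completed"}'
    )
    fields = log_fields(line)
    print(fields["event"], fields["action"])
```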

This is a misalignment between your configured webhook on github.com and your HRA, you need to take a look at your webhook configuration on github.com.
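
If you want to check this across many log lines, the event name can be pulled out of the trailing JSON. This is a rough sketch, assuming each log line ends with a single JSON object as in the line quoted above; `fields_from_log` is a hypothetical helper, not part of ARC.

```python
import json


def fields_from_log(line):
    """Parse the JSON object at the end of a controller log line.

    Returns a dict, or None when the line carries no parseable JSON.
    """
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


# The log line quoted above, split across literals for readability.
line = (
    '2022-07-26T10:52:06Z DEBUG controllers.webhookbasedautoscaler '
    'Scale target not found. If this is unexpected, ensure that there is '
    'exactly one repository-wide or organizational runner deployment that '
    'matches this webhook event '
    '{"event": "check_run", "hookID": "326", '
    '"delivery": "fd99f650-0cd0-11ed-8638-58a055482577", '
    '"checkRun.status": "completed", "action": "completed"}'
)

fields = fields_from_log(line)
print(fields["event"])  # check_run
```

Seeing `check_run` here while the HRA only has a `workflowJob` trigger points at the webhook configuration rather than the HRA.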

@UrosCvijan

OK, you were correct. I tried the check_run event and it worked, so I went back to the webhook and found I hadn't selected the workflow jobs event. But now I have another problem, or is this behavior normal?

Basically, as soon as the jobs finished, all runner pods were killed, and all runners are now in offline mode. I understood that they should stay online for the scaleDownDelaySecondsAfterScaleOut: 300 period and then be removed one by one, and that they should also be removed from my list of runners. Or did I misunderstand that?

@toast-gear
Collaborator

toast-gear commented Jul 26, 2022

OK, you were correct. I tried the check_run event and it worked, so I went back to the webhook and found I hadn't selected the workflow jobs event. But now I have another problem, or is this behavior normal?

Use the workflow_job webhook event only. All other webhook events are deprecated and will be removed from ARC entirely soon. We are going to update the docs to reflect this so that people stop using them.

Basically, as soon as the jobs finished, all runner pods were killed, and all runners are now in offline mode. I understood that they should stay online for the scaleDownDelaySecondsAfterScaleOut: 300 period and then be removed one by one, and that they should also be removed from my list of runners. Or did I misunderstand that?

Sounds odd, but I would first move over to the workflow_job event. If you are still seeing odd behaviour, raise a new issue with the full details of your deployment, YAML, logs, etc.

@UrosCvijan

Yeah, I moved to the workflow_job event and I still have 5 runners in offline mode. I will wait a bit longer to see, but they were removed as soon as the jobs completed and are now offline. I will run the same test workflow again to see whether it creates 5 new runners and leaves them offline as well once it has finished. If it does, I will open a new issue with all the details.

@UrosCvijan

There was some weird behavior yesterday, so I will delete everything today and, if I find some time, rebuild it all from scratch so that I have a clean slate and can describe every step. Yesterday, for example, after all runners were killed (as soon as they finished their jobs) they were offline, and still are, but like 5 minutes later 4 new runners were spun up and there was no action called, no webhook, nothing. Then after some time (I didn't see exactly when) they were removed, and those were not left in an offline state.
So I will rebuild it all from scratch and see how it behaves, so that I can tell you all the steps I took if it happens again.

@mumoshu
Collaborator

mumoshu commented Jul 27, 2022

they were offline, and still are

ARC tries its best to call the "Remove Runner" GitHub Actions API for every runner being shut down so that (ideally) no runners are left hanging "offline". If it fails to do so, or GitHub Actions fails to handle the API call at all, you might end up with dangling "offline" runners that you have to clean up manually. I guess, though, that's very rare 🤔

5 minutes later 4 new runners were spun up and there was no action called

What were the desired replicas of your RunnerDeployment at that time? Also, did the desired replicas go down to 4 from a larger value when it brought up the 4 new runners for no reason?
A possible scenario is that your load balancer, network, or ARC's GitHub webhook server exposed to the Internet somehow misbehaved and was unable to fully receive or handle all the webhook events sent by GitHub around that time.

If it received workflow_job events of status=queued 5 times and then, due to the problem, received only 1 of the 5 workflow_job events of status=completed, ARC has no way to know that it needed to scale the desired replicas down from 4 to 0, so it will temporarily end up creating 4 (eventually redundant) runners. As you observed, ARC will eventually fix the skew, so in that sense ARC worked as designed.
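
The arithmetic in this scenario can be sketched with a toy model. This is not ARC's implementation: it assumes each queued event reserves one runner, each completed event releases one, and an unreleased reservation lapses once it is older than the trigger's duration (that expiry rule is an assumption about how the skew is eventually corrected). All names and timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    status: str  # "queued" or "completed"
    at: float    # seconds since the start of the test


def desired_replicas(events, now, duration=1800.0, min_replicas=0, max_replicas=5):
    """Toy model of webhook-driven scaling (assumed behaviour, not ARC's code).

    Each queued event reserves one runner; each completed event releases the
    oldest reservation; reservations older than `duration` seconds lapse.
    """
    reservations = []  # creation times of reservations not yet released
    for ev in sorted(events, key=lambda e: e.at):
        if ev.at > now:
            break
        if ev.status == "queued":
            reservations.append(ev.at)
        elif ev.status == "completed" and reservations:
            reservations.pop(0)
    live = [t for t in reservations if now - t < duration]
    return max(min_replicas, min(max_replicas, len(live)))


# 5 queued events arrive, but only 1 of the 5 completed events is received.
events = [Event("queued", 0.0) for _ in range(5)] + [Event("completed", 60.0)]

print(desired_replicas(events, now=120.0))   # 4: four reservations were never released
print(desired_replicas(events, now=1801.0))  # 0: stale reservations have lapsed
```

With all 5 completed events delivered, the same model returns 0 straight away, which is the behaviour you would expect when no webhook deliveries are lost.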

@UrosCvijan

OK, sorry for not getting back to this sooner, I was assigned to other things :) First, I should mention that I am doing all of this on a cluster running an older version, 1.20.11.
I think I found out why the runners were left offline: I had my own image, built from your image with some packages added. I am now using the regular latest actions-runner image and runners no longer stay offline. But they are still removed as soon as the jobs are done.

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: engineering-runner-prod-autoscaler
  namespace: actions-runner-engineering
spec:
  scaleDownDelaySecondsAfterScaleOut: 300
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    name: engineering-runner-prod
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
    duration: "30m"

Did I understand correctly that they should stay for as long as scaleDownDelaySecondsAfterScaleOut is set and then be scaled down one by one? Or, because I am using the workflowJob GitHub event, does the HRA scale down because it received the event that the job is completed?

But the first question is: should we spend time on this, given that the cluster version is 1.20.11? If not, I completely understand and will look at it when we upgrade and come back to you. But if this is not supposed to happen even on this version, then we can take a look at it together. I can share my screen if that would make things faster.
