Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating and adding a timeout to the cleanup cluster step #7727

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ytimocin
Copy link
Contributor

@ytimocin ytimocin commented Jul 5, 2024

Description

Updating and adding a timeout to the cleanup cluster step.

Type of change

  • This pull request is a minor refactor, code cleanup, test improvement, or other maintenance task and doesn't change the functionality of Radius (issue link optional).

@ytimocin ytimocin requested review from a team as code owners July 5, 2024 19:49
@@ -52,5 +52,14 @@ for ns in $namespaces; do
break
fi
echo "deleting namespaces: $ns"

# Remove finalizers from the namespace if present
kubectl get namespace $ns -o json | jq '.spec.finalizers = []' >temp-namespace.json
Copy link
Contributor Author

@ytimocin ytimocin Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, UCP and Applications RP Pods were not healthy, and this caused finalizers to not finalize in some of the resources in some of the namespaces. And then this caused those namespaces to not be deleted.

UCP Pod Down -> Finalizers in some resources -> Namespaces couldn't be deleted -> Clean up step got stuck -> Long Running Workflow failed back-to-back

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have logs from the failing pods? or from the Kubernetes controller-manager pod?

@@ -515,6 +515,7 @@ jobs:
--yes --verbose
- name: Clean up cluster
if: always()
timeout-minutes: 60
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added timeout to clean up step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the goal of the timeout? That we'd be notified if we hit it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default is 6 hours, so fail faster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reformatted the files

Copy link

github-actions bot commented Jul 5, 2024

Unit Tests

3 218 tests  ±0   3 212 ✅ ±0   3m 55s ⏱️ +2s
  261 suites ±0       6 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit d8a3fd0. ± Comparison against base commit 534cba6.

♻️ This comment has been updated with latest results.

Copy link

codecov bot commented Jul 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.99%. Comparing base (534cba6) to head (d8a3fd0).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7727      +/-   ##
==========================================
- Coverage   61.00%   60.99%   -0.01%     
==========================================
  Files         520      520              
  Lines       27010    27010              
==========================================
- Hits        16478    16476       -2     
- Misses       9080     9081       +1     
- Partials     1452     1453       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@radius-functional-tests
Copy link

radius-functional-tests bot commented Jul 5, 2024

Radius functional test overview

🔍 Go to test action run

Name Value
Repository radius-project/radius
Commit ref 65113ae
Unique ID func470aad4369
Image tag pr-func470aad4369
Click here to see the list of tools in the current test run
  • gotestsum 1.10.0
  • KinD: v0.20.0
  • Dapr: 1.12.0
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.1.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-func470aad4369
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-func470aad4369
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-func470aad4369
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-func470aad4369
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting msgrp functional tests...
⌛ Starting shared functional tests...
⌛ Starting daprrp functional tests...
⌛ Starting cli functional tests...
⌛ Starting ucp functional tests...
⌛ Starting kubernetes functional tests...
⌛ Starting samples functional tests...
⌛ Starting datastoresrp functional tests...
✅ msgrp functional tests succeeded
✅ samples functional tests succeeded
✅ kubernetes functional tests succeeded
✅ ucp functional tests succeeded
✅ daprrp functional tests succeeded
✅ cli functional tests succeeded
✅ datastoresrp functional tests succeeded
✅ shared functional tests succeeded

Signed-off-by: ytimocin <ytimocin@microsoft.com>
kubectl delete namespace $ns --ignore-not-found=true
done

# Cleanup temporary file
rm -f temp-namespace.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if an error occurs above/script exists, is it possible that file is not cleaned up ?


# Remove finalizers from the namespace if present
kubectl get namespace $ns -o json | jq '.spec.finalizers = []' >temp-namespace.json
kubectl replace --raw "/api/v1/namespaces/$ns/finalize" -f ./temp-namespace.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does --force help here? https://kubernetes.io/docs/reference/kubectl/generated/kubectl_delete/

Just thinking it might be easier if it works.

I'm also curious what the finalizers are. We have to be a little bit careful with this, because certain types in Kubernetes will manage cloud resources. If we delete the Kubernetes resource without running the finalizers, then the cloud resources leak.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants