Harbor 2.6 - Jobservice cannot run replicated with RWO PVCs #1320

Closed
Tonkari opened this issue Oct 12, 2022 · 20 comments · Fixed by #1378


Tonkari commented Oct 12, 2022

Summary

Jobservice can only run as a single pod when the underlying Kubernetes cluster does not support the ReadWriteMany access mode.

Environment

We are using this chart to deploy Harbor on a Kubernetes cluster that only supports ReadWriteOnce for persistent volumes. Currently we are running 3 replicas of the jobservice component. We are using the database JobLogger and S3 as the storage backend for the registry.

Problem

When installing or upgrading to chart version 1.10.0 (Harbor 2.6), only one jobservice pod becomes available, while the others stay in the "ContainerCreating" phase. This happens because they all try to mount the same new scan data export volume. As a workaround we can only reduce the number of replicas to 1, which we have been trying to avoid.
The chart does not allow any configuration of this volume: if persistence is enabled, a single scan data export claim is always created and used by the deployment, roughly as sketched below.
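
For illustration, this is roughly the claim the chart renders when persistence is enabled (the claim name is taken from the workaround further down this thread and depends on the release name; the storage size is an assumption):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: registry-harbor-jobservice-scandata
    spec:
      accessModes:
        - ReadWriteOnce   # mountable read-write by a single node only
      resources:
        requests:
          storage: 1Gi    # assumed size, not relevant to the problem

With anti-affinity spreading the replicas across nodes, only the pod on the node that attaches this volume can start; the others stay stuck in ContainerCreating.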

Possible solutions

Separate persistence option for scan data export

Without knowing whether it's necessary to persist the scan data exports, a possible solution would be to add a separate option to disable persistence for this feature. Instead, exports could be stored in the pod's filesystem - long enough to download them - and disappear when the pod gets terminated. Since I have not really dug into the Harbor code, I'm not sure if this is realistic. A sketch of such an option follows below.
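
Purely illustrative - neither the option name nor the structure below exists in chart 1.10.0; it is only what such a toggle could look like:

    # hypothetical values.yaml fragment
    jobservice:
      scanDataExports:
        persistence: false   # fall back to an emptyDir inside the pod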

Store scan data exports in the database

The exports could be stored in the database. Same as above, I do not know if it's realistic, especially regarding the size of the generated CSV files.

External storage support for scan data export

Since the registry already has an S3 bucket to use for storage, the same could be done for the jobservice component. Whether to reuse the same bucket with a different path or use a separate bucket can be discussed; a hypothetical configuration is sketched below.
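
Again hypothetical - the key names below merely mirror the chart's existing persistence.imageChartStorage.s3 section; no such jobservice option exists:

    # hypothetical values.yaml fragment, modelled on persistence.imageChartStorage.s3
    jobservice:
      scanDataExports:
        storage:
          type: s3
          s3:
            bucket: my-harbor-bucket          # could reuse the registry bucket...
            rootdirectory: /scandata-exports  # ...under a dedicated prefix
            region: us-east-1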

Use statefulSet for jobservice

The jobservice deployment could be converted to a StatefulSet, and a volumeClaimTemplate could be added for the scan data exports, so each replica gets its own RWO claim (see the fragment below). For the file JobLogger the persistentVolumeClaim could remain the same.
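
A minimal sketch of the relevant fragment, assuming the conversion were done (storage size again assumed):

    apiVersion: apps/v1
    kind: StatefulSet
    # metadata and the rest of the pod spec as in the current deployment
    spec:
      volumeClaimTemplates:
        - metadata:
            name: job-scandata-exports
          spec:
            accessModes:
              - ReadWriteOnce   # fine here: each replica gets its own claim
            resources:
              requests:
                storage: 1Gi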

Update docs

If none of the above solutions are desirable, this incompatibility should be mentioned in the docs.


r-ising commented Oct 14, 2022

We have the same issue after upgrading to Harbor 2.6.

@niklasweimann

+1

@janosmiko

+1

@zyyw self-assigned this Oct 20, 2022
@Duke1616

+1


blufor commented Oct 26, 2022

I can confirm this is a valid bug


maxdanilov commented Oct 27, 2022

Struggling with the same issue, I went for an approach of choosing HA over scan data export, by manually switching the volume type from a PVC to an emptyDir:

      - name: job-scandata-exports
        persistentVolumeClaim:
          claimName: registry-harbor-jobservice-scandata

to

      - name: job-scandata-exports
        emptyDir: {}

very meh (scan data export does not seem to work at all), but at least jobservice can run in HA mode.
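
For reference, one way to apply this change in place is a strategic merge patch (the deployment name below is inferred from the claim name above and depends on the release). Pod volumes merge by name, and the $retainKeys directive ensures the old persistentVolumeClaim key is dropped rather than kept alongside emptyDir:

    # scandata-emptydir-patch.yaml (hypothetical file name), applied with:
    #   kubectl patch deployment registry-harbor-jobservice --patch-file scandata-emptydir-patch.yaml
    spec:
      template:
        spec:
          volumes:
            - name: job-scandata-exports
              emptyDir: {}
              $retainKeys:
                - name
                - emptyDir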


derekcha commented Dec 6, 2022

+1

Vad1mo (Member) commented Jan 5, 2023

Thanks for the report. I will bring this issue up during the next community meeting; please join:

https://github.com/goharbor/community/wiki/Harbor-Community-Meetings.

@chlins removed the "help wanted" label Jan 11, 2023
chlins (Member) commented Jan 11, 2023

@Tonkari Hi, there is another volume mounted in RWO mode by default in the jobservice before v2.6:

    accessMode: ReadWriteOnce

So did you work around this by switching the job log location to the database?

Tonkari (Author) commented Jan 11, 2023

@chlins Yes, we did; I described our environment in the issue description. The possibility of using the database for the job log was also the reason why I proposed using the database for the exports as well.
Any solution for the export issue could also be applied to the job log data, so RWO clusters would not have to use the database. Currently that is the only way to get this running replicated.

chlins (Member) commented Jan 14, 2023

@maxdanilov Hi, what problem did you encounter with scan data export when you changed the volume to emptyDir?

chlins (Member) commented Jan 14, 2023

@Tonkari Are you running Kubernetes on AWS or Azure, or have you configured any affinity/anti-affinity policies? In my environment, after scaling the jobservice up to 3 replicas, all 3 pods become ready because they are scheduled to the same node.

Tonkari (Author) commented Jan 14, 2023

@chlins Yes, we did specify anti-affinity. Having all pods run on the same node is not an option for us, because we replace nodes fairly often. Also, it only compensates for pod-level failures; node-level failures would be met with downtime. We are also running our cluster in multiple availability zones, and a volume in a single availability zone does not protect us from zone-level failures.


maxdanilov commented Jan 15, 2023

@chlins we're running Harbor on GCP, with multiple instances of the jobservice and anti-affinity so the pods run on different nodes. With the standard configuration from the chart, the second replica of jobservice won't start, since a persistent volume in GCP can't be mounted on multiple nodes in RW mode.

chlins (Member) commented Jan 16, 2023

@maxdanilov Yes, HA is the problem this issue should fix. But another question: you mentioned that when you changed the volume type to emptyDir, the scan data export function did not work at all. Could you share details of that problem, like the jobservice logs?


maxdanilov commented Jan 16, 2023

@chlins

It doesn't seem to complain when exporting project CVEs:

2023-01-16T12:51:22Z [INFO] [/jobservice/worker/cworker/c_worker.go:77]: Job incoming: {"name":"SCAN_DATA_EXPORT","id":"bda3478934353bb315599b43","t":1673873482,"args":null}
2023-01-16T12:51:22Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://registry-harbor-core:80/api/v2.0/internalconfig
2023-01-16T12:51:22Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://registry-harbor-core:80/api/v2.0/internalconfig
2023-01-16T12:51:22Z [INFO] [/pkg/scan/export/filter_processor.go:53]: Retrieved user id :6 for user name : <redacted>
2023-01-16T12:51:22Z [INFO] [/pkg/scan/export/filter_processor.go:234]: User <redacted> is not sys admin. Selecting projects with admin roles for export.
2023-01-16T12:51:22Z [INFO] [/pkg/scan/export/filter_processor.go:65]: Selected 1 projects administered by user <redacted>
2023-01-16T12:51:22Z [INFO] [/pkg/config/rest/rest.go:47]: get configuration from url: http://registry-harbor-core:80/api/v2.0/internalconfig
2023-01-16T12:51:24Z [INFO] [/jobservice/runner/redis.go:152]: Job 'SCAN_DATA_EXPORT:bda3478934353bb315599b43' exit with success

But the files in the Event Log bar on the right are empty when downloaded (despite images in the project having vulnerabilities). I'm not selecting any filters, just trying to export everything.

@thangamani-arun

Same with us. We are running 2.4.x; when upgrading to 2.6, it fails to unmount or mount the PVC because it has ReadWriteOnce.

@Vad1mo Would using S3 for these HA pods solve the upgrade failures?

chlins (Member) commented Feb 1, 2023

@maxdanilov Judging from the log, the export function works; finding no CVEs may be caused by some other problem, like a filter or permissions.

chlins (Member) commented Feb 1, 2023

@thangamani-arun The temporary workaround is to change the scandata volume to emptyDir; we'll fix this issue completely in the coming patch.

@maxdanilov

@chlins

I don't think that's the case: the export was run by a user with access to everything, no filters were applied, and vulnerabilities were present in the exported repos.
