-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Policy: create dynamic-pools policy for cri-rm #969
Conversation
The commits should be squashed together and commit message enhanced. |
b0a91a5
to
9125219
Compare
The commits have been squashed together. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for this impressive enhancement PR! This is a very nice add-on to CRI-RM features.
I added some comments to help you polishing the code. I would also ask you to include end-to-end tests (most importantly test/e2e/policies.test-suite/dynamic-pools/n4c16/testNN-TESTNAME/code.var.sh
files) to ensure functional quality now and for the future.
|
||
// updatePoolCpuset updates the cpuset of the dynamicPools. | ||
func (p *dynamicPools) updatePoolCpuset() error { | ||
requestCpu, remainFree := p.calculateAllPoolRequests() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A situations to consider and include in e2e tests: none of the containers in a dynamic pool request any CPUs. (The containers may be BestEffort or they may have only resource limit.) It looks to me that in that case the container might get an empty cpuset, which will mean "can run on any CPU".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a container has no requests, it will be allocated to the shared dynamic-pool, which has cpuset.
We will include this case in the e2e tests later.
} | ||
|
||
for _, dp := range p.dynamicPools { | ||
if dp.Def.Name == "reserved" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reallocation of CPUs can severely harm performance of a workload. If the workload has been running on CPU package0 (socket0) and is suddenly pinned to CPU cores on package1 (socket1), then suddenly memory accesses become very expensive.
In order to avoid unnecessary CPU reallocations, I would suggest:
- First release only the necessary amount of CPUs from a pool (current number of CPUs minus requestCpu[dp], if this is a positive number).
- Then allocate CPUs only to those pools that really need them.
The current balloons policy implements topology-aware CPU allocation, that is, it allocates CPU cores that are topologically as close to the currently allocated CPUs as possible. This is a bigger enhancement and can be left to be done in future PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestions!
This is indeed where the dynamic-pools policy needs to be improved. We will seriously consider your suggestions, and we will improve the performance of dynamic-pools policy in the future.
522df2e
to
9125219
Compare
The CPU will always belong to a dynamic-pool, so the IdleCPUClass configuration option is not needed.
1. Change `defaultDynamicPoolDefName` to `sharedDynamicPoolDefName`, and use `reservedDynamicPoolDefName` and `sharedDynamicPoolDefName` instead of "reserved" or "shared" strings in all codes of dynamic-pool policy. 2. Call `getCpuUtilization()` once in the beginning of `calculateAllPoolWeights()`, and pass the same cpuInfo to all `updateRealCpuUsed()` calls as a parameter. 3. use `ToSliceNoSort()` instead of `ToSlice()`. 4. Let `updateRealCpuUsed()` use `DynamicPool` instead of `dynamicPools` as the receiver. 5. Modify the return value type of `updateRealCpuUsed` to `float64` instead of `int`. 6. Fix typo: s/dyanmic/dynamic/g in this file. 7. Update the information that needs to be included in `log.Info`, and put the rest in `log.Debug()`.
…de cleanup 1. Rename `formatStatLine` to `parseStatLine`. 2. Return `nil` instead of a slice that includes an empty string. 3. Check if cpuLine is less than 8, if so, do error handling. 4. Function `parseStatLine` does not process total cpu line, and add corresponding comments. 5. Remove `clocksPerSecond`, no need to scale the data (user, system, ...). 6. Document the return value of `calculateOneCpuUtilization`.
We are going to doing end-to-end tests for dynamic-pools policy. We read the end-to-end tests of the balloons policy, but we didn't understand it very well. Do you have relevant documents that we can learn from? Thanks! |
Hi @Yugar-1, I wrote you a documented example tests that introduce how to use the e2e test framework for testing the dynamic-pools policy. Please see the e2e commit in my e2e_dynamic_pools_example branch. I would advice you to start by installing e2e framework dependencies: docker and govm. You can find details in the CRI Resource Manager Developer's guide / End-to-end tests documentation. Then, if you read through test01 code first, test02 code second, I'm sure you will get the idea and will be able to extend the test set further. Feel free to take the contents of this commit as a starting point for your e2e tests. |
1. Split out the two functions `isNeedReallocate()` and `calculatePoolCpuset()` from the function `updatePoolCpuset()`. 2. Added unit tests for two functions `isNeedReallocate()` and `calculatePoolCpuset()`.
…tats instead of new environment variable `HOST_PROC`
Hi, @askervin , thank you very much for your support! |
1. If there is no container in a dynamic pool, there is no need to calculate its weight, that is, there is no need to allocate CPUs to it. 2. If there are containers in the shared dynamic pool, allocate at least one CPU to it, otherwise there is no need to allocate CPUs to it. 3. Update data types in log.debug. 4. Delete the `IdleCPUClass` of example configuration in the documentation of dynamic-pools policy.
Codecov Report
@@ Coverage Diff @@
## master #969 +/- ##
==========================================
- Coverage 33.68% 31.96% -1.72%
==========================================
Files 61 65 +4
Lines 9156 9833 +677
==========================================
+ Hits 3084 3143 +59
- Misses 5789 6402 +613
- Partials 283 288 +5
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@Yugar-1 , how do you see, has this PR stabilized now for next review round, or are you working on some changes still? |
Hi @askervin , this PR is stable and the function has been completed, can it be merged first? |
Have you experienced flakiness in the e2e tests? I got a failure from the first run of test03-rebalancing.
Having a deeper look:
...in the cri-resmgr perspective the last rebalancing looked like this:
The same test failed when I ran with
Would you check if these tests could be made more reliable? (Allow running a couple of rebalance rounds to let cpu load stabilize? Maybe The coverage in e2e tests looks pretty good, so I'm ok with not increasing unit test coverage in this PR. |
Hi, @askervin , thank you very much for your suggestion! I have fixed the e2e tests of dynamic-pools policy. Please try again and contact me if you have any questions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Yugar-1, Thank you for all these improvements!
I think this large PR is very close to getting merged now.
I added a few mostly cosmetic comments. My wish is that anyone will be able to try out the dynamic-pools policy without errors just by copy-pasting the configuration samples.
All E2E tests passed with distro={debian-sid,ubuntu-22.04,fedora-36}
.
docs/policy/dynamic-pools.md
Outdated
HTTPEndpoint::8891 | ||
PrometheusExport:true | ||
logger: | ||
Debug:policy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copy-pasting this to the end of a working configuration file results in yaml parsing errors. You will need to add a space between :
that separates a key from a value. (Please test sample configuration snippets.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version still results in configuration failure:
$ cri-resmgr -force-config doc-test.cfg
...
F0130 09:52:03.497887 36423 main.go:85] failed to create resource manager instance: resource-manager: failed to load forced configuration doc-test.cfg: config error: failed to apply configuration from file: config error: failed to load configuration from file "doc-test.cfg": error converting YAML to JSON: yaml: line 31: could not find expected ':'
Two more key-value fixes needed. Please test.
Hi, @askervin, thank you so much for your comments, you noticed a lot of things that I had overlooked! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, would you remove the ConfigMap header and related indentation from the "Example configuration that runs all pods in dynamic-pools." yaml in the documentation.
Please make sure that it runs fine with cri-resmgr -force-config doc-test.cfg
. This will make it easy for readers to try out.
docs/policy/dynamic-pools.md
Outdated
HTTPEndpoint::8891 | ||
PrometheusExport:true | ||
logger: | ||
Debug:policy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version still results in configuration failure:
$ cri-resmgr -force-config doc-test.cfg
...
F0130 09:52:03.497887 36423 main.go:85] failed to create resource manager instance: resource-manager: failed to load forced configuration doc-test.cfg: config error: failed to apply configuration from file: config error: failed to load configuration from file "doc-test.cfg": error converting YAML to JSON: yaml: line 31: could not find expected ':'
Two more key-value fixes needed. Please test.
I'm so sorry for my carelessness, I've fixed it, please check again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @Yugar-1!
LGTM.
Approved for CI. @klihub , @marquiz , @jukkar , @fmuyassarov , would you like to have a look, too? :) |
…/cri-resource-manager into add-dynamic-pools-policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this. Thanks Antti ! You've done a great job of reviewing it. And I apologize that I haven't had time to look into this.
I have one general comment which might be irrelevant for this PR. Since I understood so that this was forked off originally from an existing policy, it's good to eventually make sure that there is no unnecessary verbatim code duplication. If there is any substantial duplication of code which is not directly related to the actual policy logic but for instance metrics collection, or something other similar 'utility' functionality, then those maybe could be separated/exported out (in a subsequent PR) in the original policy into a subdirectory and then imported from there to this new policy. But as I said, it might not apply to this PR, if such code duplication does not exist...
Yeah, lgtm this now and address issues have further enhancements in subsequent PRs. And thanks @askervin for all the reviews! I was really useless here 🙄 |
This PR creates a new policy for
cri-resource-manager
nameddynamic-pools
policy.This policy can put the workload in different dynamic pools, and the CPUs of dynamic pools do not overlap with each other. These dynamic pools can be dynamically resized based on the requests of the containers and the CPU utilization of the workload in the dynamic pools.
Users can define and configure dynamic pools. A dynamic pool has a set of CPUs and a set of containers running on the CPUs. CPUs in dynamic pools can be configured, for example, by setting min and max frequencies on CPU cores and uncore.
The code and documentation of the
dynamic-pools
policy are included in this PR.