PromMetrics exports execution and controller metrics for old/stopped executions #3743

Open
briend opened this issue Sep 9, 2024 · 1 comment
briend commented Sep 9, 2024

After configuring Teraslice v2.1.0 to use the new internal Prometheus metrics, the number of exported metrics appears to grow without bound. Series for old executions of the same job_id accumulate, whereas previously only executions in non-terminal statuses such as running appeared to be exported. If old executions are eventually removed automatically, this new behavior may make sense, but I'm not sure whether they are or whether there are plans for that. The relevant configuration:

terafoundation:
  prom_metrics_enabled: true

Only one of these executions is in the running state; the other two are stopped. Previously only one series was exported, but now there are three. The same is true for teraslice_execution_status, teraslice_controller_.*, and others.

teraslice_execution_info{assignment="master", container="teraslice", ex_id="99d401f2-9d11-4ce9-8882-fbfc246ba2f3", image="terascope/teraslice:v2.1.0-nodev18.19.1", instance="node1", job="teraslice-ops2", job_id="92f53103-7de6-4bb3-929b-e2bb13598502", name="teraslice-ops2", namespace="ts-ops2", pod="teraslice-ops2-master-84d7d64f7b-5n9sj", prometheus="ops/mon-prometheus", service="teraslice-ops2", version="2.1.0"}  1

teraslice_execution_info{assignment="master", container="teraslice", ex_id="a837cda0-7801-43e0-96e5-47ebfd3f1303", image="terascope/teraslice:v2.1.0-nodev18.19.1", instance="node1", job="teraslice-ops2", job_id="92f53103-7de6-4bb3-929b-e2bb13598502", name="teraslice-ops2", namespace="ts-ops2", pod="teraslice-ops2-master-84d7d64f7b-5n9sj", prometheus="ops/mon-prometheus", service="teraslice-ops2", version="2.1.0"} 1

teraslice_execution_info{assignment="master", container="teraslice", ex_id="undefined", image="terascope/teraslice:v2.1.0-nodev18.19.1", instance="node1", job="teraslice-ops2", job_id="undefined", name="teraslice-ops2", namespace="ts-ops2", pod="teraslice-ops2-master-84d7d64f7b-5n9sj", prometheus="ops/mon-prometheus", service="teraslice-ops2", version="2.1.0"}  1

Also note that one of these series has job_id="undefined" and ex_id="undefined", which may be a separate bug, since the executions listed by the regular /txt/ex endpoint look normal:

curl -Ss ts-ops2/txt/ex
name             lifecycle   slicers  workers  _status  ex_id                                 job_id                                _created                  _updated                
---------------  ----------  -------  -------  -------  ------------------------------------  ------------------------------------  ------------------------  ------------------------
datagen-noop-v1  persistent  1        1        running  a837cda0-7801-43e0-96e5-47ebfd3f1303  92f53103-7de6-4bb3-929b-e2bb13598502  2024-09-06T22:33:13.407Z  2024-09-06T22:33:35.682Z
datagen-noop-v1  persistent  1        1        stopped  99d401f2-9d11-4ce9-8882-fbfc246ba2f3  92f53103-7de6-4bb3-929b-e2bb13598502  2024-07-25T20:54:42.083Z  2024-09-06T22:33:04.632Z
datagen-noop-v1  persistent  1        1        stopped  c2faa2f4-9148-4cf8-b97f-7bda77db8e9b  92f53103-7de6-4bb3-929b-e2bb13598502  2024-07-25T20:40:27.951Z  2024-07-25T20:54:29.218Z
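The issue does not say where the "undefined" values come from. One plausible cause, and it is only an assumption, is a label value built from a missing field: JavaScript turns `undefined` into the string "undefined" when it is interpolated into a template string. The sketch below uses a hypothetical `ExecutionRecord` shape and hypothetical `naiveLabels()` / `executionInfoLabels()` helpers (none of them are Teraslice code) to show the symptom and a guard that skips such executions instead of exporting them:

```typescript
// Hypothetical execution record; field names mirror the /txt/ex columns.
interface ExecutionRecord {
  ex_id?: string;
  job_id?: string;
  _status?: string;
}

type ExecutionLabels = { ex_id: string; job_id: string };

// Assumed failure mode: interpolating a missing field produces the literal
// string "undefined", which then looks like a valid label value.
function naiveLabels(ex: ExecutionRecord): ExecutionLabels {
  return { ex_id: `${ex.ex_id}`, job_id: `${ex.job_id}` };
}

// Hypothetical guard: return null when either id is missing so the caller
// can skip the series instead of exporting ex_id="undefined".
function executionInfoLabels(ex: ExecutionRecord): ExecutionLabels | null {
  if (!ex.ex_id || !ex.job_id) return null;
  return { ex_id: ex.ex_id, job_id: ex.job_id };
}

const incomplete: ExecutionRecord = { _status: 'running' };
console.log(naiveLabels(incomplete).ex_id); // prints undefined (as a string)
console.log(executionInfoLabels(incomplete)); // prints null
```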
godber pushed a commit that referenced this issue Sep 19, 2024
…updates (#3747)

This PR makes the following changes:

- The `PromMetrics` class needs to reset its list of metrics on each
scrape. If it doesn't, all executions are listed, not just the
active ones. `resetMetrics()` functions were added to
`PromMetrics` and `Exporter` to reset the `prom-client` register.
- Add a `prom_metrics_display_url` field to terafoundation. This value
will be used as the default `url` label added to all prom metrics.
It defaults to an empty string, making it more obvious when this field is
missing from the config.
- Include cluster analytics metrics (GET '/cluster/stats' endpoint
results) in the cluster master

ref: #3743
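The first change in this commit can be illustrated with a self-contained TypeScript sketch that does not use `prom-client`. The hypothetical `InfoGauge` class stands in for a labelled gauge: each distinct label set is a separate series, and series are never dropped on their own. Unless the gauge is cleared before each scrape repopulates it, series for stopped executions linger, which reproduces the three-versus-one difference reported in the issue. The `ACTIVE_STATUSES` set is an assumption for illustration, not taken from the Teraslice source; the actual fix resets the `prom-client` register through `resetMetrics()`.

```typescript
interface Execution {
  ex_id: string;
  job_id: string;
  _status: string;
}

// Assumed set of non-terminal ("active") statuses, for illustration only.
const ACTIVE_STATUSES = new Set<string>([
  'pending', 'initializing', 'running', 'recovering', 'failing', 'paused', 'stopping',
]);

// Hypothetical stand-in for a labelled gauge: one series per distinct label set.
// Series stay in the map until reset() is called.
class InfoGauge {
  private series = new Map<string, number>();

  set(labels: Record<string, string>, value: number): void {
    this.series.set(JSON.stringify(labels), value);
  }

  reset(): void {
    this.series.clear();
  }

  count(): number {
    return this.series.size;
  }
}

// Runs on every scrape and exports one series per active execution.
// With resetFirst = false, series from earlier scrapes are never removed.
function collect(gauge: InfoGauge, executions: Execution[], resetFirst: boolean): void {
  if (resetFirst) gauge.reset();
  for (const ex of executions) {
    if (!ACTIVE_STATUSES.has(ex._status)) continue;
    gauge.set({ ex_id: ex.ex_id, job_id: ex.job_id }, 1);
  }
}

// Three scrapes of one job, using shortened IDs from the /txt/ex output:
// each time, the previous execution has stopped and a new one is running.
const JOB = '92f53103';
const scrapes: Execution[][] = [
  [{ ex_id: 'c2faa2f4', job_id: JOB, _status: 'running' }],
  [
    { ex_id: 'c2faa2f4', job_id: JOB, _status: 'stopped' },
    { ex_id: '99d401f2', job_id: JOB, _status: 'running' },
  ],
  [
    { ex_id: 'c2faa2f4', job_id: JOB, _status: 'stopped' },
    { ex_id: '99d401f2', job_id: JOB, _status: 'stopped' },
    { ex_id: 'a837cda0', job_id: JOB, _status: 'running' },
  ],
];

const leaky = new InfoGauge(); // never reset
const fixed = new InfoGauge(); // reset before each scrape
for (const executions of scrapes) {
  collect(leaky, executions, false);
  collect(fixed, executions, true);
}
console.log(leaky.count(), fixed.count()); // prints "3 1"
```

Filtering by status alone is not enough here: each execution was active when it was first exported, so without the reset its series survives after it stops.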
godber commented Sep 19, 2024

This should be fixed in the v2.3.2 release.
