promxy makes up datapoints by repeating the last dataset #606

Closed · Z3po opened this issue Jul 14, 2023 · 4 comments · Fixed by #609

Z3po commented Jul 14, 2023

We noticed that promxy repeats the last datapoint when running simple queries like "max()" across a set of clusters.

Our setup looks like the following:

flowchart TD

  Promxy --> Cluster1
  Promxy --> Cluster2
  Promxy --> Cluster3

  subgraph Cluster1
    Prometheus1-1
    Prometheus1-2
  end

  subgraph Cluster2
    Prometheus2-1 
    Prometheus2-2
  end

  subgraph Cluster3
    Prometheus3-1
    Prometheus3-2
  end

While promxy accesses each of the clusters, it only connects to a single Prometheus instance per cluster through an HTTP proxy configuration. Each k8s cluster only contains data from its own running pods.

The configuration looks like the following:

---
global:
  evaluation_interval: 5s
  external_labels:
    source: promxy
promxy:
  server_groups:
    - static_configs:
        - targets:
            - prometheus-cluster-1:443
      labels:
        cluster: cluster1
      remote_read: true
      remote_read_path: /api/v1/read
      query_params:
        nocache: 1
      scheme: https
      http_client:
        dial_timeout: 4s
        tls_config:
          insecure_skip_verify: true
          cert_file: /etc/certs/client.crt
          key_file: /etc/certs/client.key
      ignore_error: true
    - static_configs:
        - targets:
            - prometheus-cluster-2:443
      labels:
        cluster: cluster2
      remote_read: true
      remote_read_path: /api/v1/read
      query_params:
        nocache: 1
      scheme: https
      http_client:
        dial_timeout: 4s
        tls_config:
          insecure_skip_verify: true
          cert_file: /etc/certs/client.crt
          key_file: /etc/certs/client.key
      ignore_error: true
    - static_configs:
        - targets:
            - prometheus-cluster3:443
      labels:
        cluster: cluster3
      remote_read: true
      remote_read_path: /api/v1/read
      query_params:
        nocache: 1
      scheme: https
      http_client:
        dial_timeout: 4s
        tls_config:
          insecure_skip_verify: true
          cert_file: /etc/certs/client.crt
          key_file: /etc/certs/client.key
      ignore_error: true
    - static_configs:
        - targets:
            - prometheus-cluster4:443
      labels:
        cluster: cluster4
      remote_read: true
      remote_read_path: /api/v1/read
      query_params:
        nocache: 1
      scheme: https
      http_client:
        dial_timeout: 4s
        tls_config:
          insecure_skip_verify: true
          cert_file: /etc/certs/client.crt
          key_file: /etc/certs/client.key
      ignore_error: true
      ...

Now, querying for data either through promxy or against the Prometheus instances directly via port-forwarding:

curl -v --get --data-urlencode 'query=max(container_memory_usage_bytes{id!="/",namespace=~"XXXX",container!="",container="XXXX",container!="POD",pod=~".*XXXXX"}) by (pod, container, cluster)' --data-urlencode "start=2023-07-14T06:39:53.152Z" --data-urlencode "end=2023-07-14T07:24:58.224Z" --data-urlencode "step=30s" 'http://localhost:9090/api/v1/query_range'

returns the following data through promxy:

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "cluster": "cluster1",
          "container": "XXXX",
          "pod": "XXXX"
        },
        "values": [
          [
            1689317393.152,
            "1536000"
          ],
          [
            1689317423.152,
            "342065152"
          ],
          [
            1689317453.152,
            "1119674368"
          ],
          [
            1689317483.152,
            "1345925120"
          ],
          [
            1689317513.152,
            "1498701824"
          ],
          [
            1689317543.152,
            "2520608768"
          ],
          [
            1689317573.152,
            "2830626816"
          ],
          [
            1689317603.152,
            "2830626816"
          ],
          [
            1689317633.152,
            "2830626816"
          ],
          [
            1689317663.152,
            "2830626816"
          ],
          [
            1689317693.152,
            "2830626816"
          ],
          [
            1689317723.152,
            "2830626816"
          ],
          [
            1689317753.152,
            "2830626816"
          ],
          [
            1689317783.152,
            "2830626816"
          ],
          [
            1689317813.152,
            "2830626816"
          ],
          [
            1689317843.152,
            "2830626816"
          ],
          [
            1689317873.152,
            "2830626816"
          ]
        ]
      }
    ]
  }
}

But through prometheus1-1:

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "container": "XXXX",
          "pod": "XXXX"
        },
        "values": [
          [
            1689317393.152,
            "1536000"
          ],
          [
            1689317423.152,
            "342065152"
          ],
          [
            1689317453.152,
            "1119674368"
          ],
          [
            1689317483.152,
            "1345925120"
          ],
          [
            1689317513.152,
            "1498701824"
          ],
          [
            1689317543.152,
            "2520608768"
          ],
          [
            1689317573.152,
            "2830626816"
          ]
        ]
      }
    ]
  }
}

and through prometheus1-2:

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "container": "XXXX",
          "pod": "XXXX"
        },
        "values": [
          [
            1689317393.152,
            "92098560"
          ],
          [
            1689317423.152,
            "659861504"
          ],
          [
            1689317453.152,
            "1243791360"
          ],
          [
            1689317483.152,
            "1345925120"
          ],
          [
            1689317513.152,
            "2733039616"
          ],
          [
            1689317543.152,
            "2545987584"
          ]
        ]
      }
    ]
  }
}

As you can see, there are additional datapoints in the promxy output that shouldn't be there.
Am I somehow holding it wrong?

If I'm missing any data necessary to debug the issue, just let me know.

Thanks for your help in advance!

@jacksontj (Owner)

I did some initial looking and am unable to reproduce this issue locally.

From a quick look the results are confusing to me for both promxy and prometheus. The query covers a range of ~45m with a step of 30s, which should yield about 90 datapoints; promxy was returning 17 and prometheus 6 or 7. So it's a bit hard to say without the full data that promxy was seeing (either trace logs or a tcpdump). My guess here is that the max isn't unique, such that it's putting multiple series together, but that's a really off-the-wall guess without more data.
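
For context, a minimal sketch of that step-count arithmetic, using the start/end/step values from the earlier curl command (Go used purely for illustration, not part of promxy):

package main

import (
    "fmt"
    "time"
)

func main() {
    // Range boundaries taken from the curl query earlier in the thread.
    start, _ := time.Parse(time.RFC3339, "2023-07-14T06:39:53.152Z")
    end, _ := time.Parse(time.RFC3339, "2023-07-14T07:24:58.224Z")
    step := 30 * time.Second

    // A range query is evaluated once at `start` and then every `step` until
    // `end`, so each series can hold at most floor(range/step)+1 points.
    points := int(end.Sub(start)/step) + 1
    fmt.Printf("range=%v step=%v -> up to %d evaluation steps\n", end.Sub(start), step, points)
}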


Z3po commented Aug 8, 2023

Thanks for taking a first look!

I've run the queries with trace loglevel and got the following (now for another pod, but still the same issue):

{"level":"debug","msg":"AggregateExpr max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"}) max","time":"2023-08-08T11:28:43Z"}

{"api":"QueryRange","level":"debug","msg":"http://Cluster8:9090","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster6:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster2:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster5:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster4:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster1:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster7:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster3:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}

{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster6:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":5202879,"value":[],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"http://Cluster8:9090","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":6089343,"value":[],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster5:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":11511931,"value":[{"metric":{"container":"build","pod":"XXXX"},"values":[[1691487358,"1765376"],[1691487388,"725573632"]]}],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster7:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":18630738,"value":[],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster2:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":24686659,"value":[],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster4:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":25874023,"value":[],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster1:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":26241856,"value":[],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster3:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":25840574,"value":[],"warnings":null}

{"remoteAddr":"127.0.0.1","time":"2023-08-08T11:28:43.732172484Z","method":"GET","path":"/api/v1/query_range","protocol":"HTTP/1.1","status":200,"responseBytes":383,"duration":0.030032405,"query":"start=2023-08-08T09%3A29%3A58Z\u0026end=2023-08-08T09%3A40%3A04Z\u0026step=30s\u0026query=max%28container_memory_usage_bytes%7Bid%21%3D%22%2F%22%2Cnamespace%3D~%22XXXXXXX%22%2Ccontainer%21%3D%22%22%2Ccontainer%3D%22build%22%2Ccontainer%21%3D%22POD%22%2Cpod%3D~%22XXXX"}

And the result is:

❯ curl -v --get --data-urlencode 'query=max(container_memory_usage_bytes{id!="/",namespace=~"XXXXXXX",container!="",container="build",container!="POD",pod=~"XXXX"}) by (pod, container)' --data-urlencode "start=2023-08-08T09:29:58Z" --data-urlencode end="2023-08-08T09:40:04Z" --data-urlencode "step=30s" http://localhost:8082/api/v1/query_range
*   Trying 127.0.0.1:8082...
* Connected to localhost (127.0.0.1) port 8082 (#0)
> GET /api/v1/query_range?query=max%28container_memory_usage_bytes%7bid%21%3d%22%2f%22%2cnamespace%3d~%22XXXXXXX%22%2ccontainer%21%3d%22%22%2ccontainer%3d%22build%22%2ccontainer%21%3d%22POD%22%2cpod%3d~%22XXXX%22%7d%29+by+%28pod%2c+container%29&start=2023-08-08T09%3a29%3a58Z&end=2023-08-08T09%3a40%3a04Z&step=30s HTTP/1.1
> Host: localhost:8082
> User-Agent: curl/7.88.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Tue, 08 Aug 2023 11:28:43 GMT
< Content-Length: 383
< 
* Connection #0 to host localhost left intact
{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"container":"build","pod":"XXXX"},"values":[[1691487358,"1765376"],[1691487388,"725573632"],[1691487418,"725573632"],[1691487448,"725573632"],[1691487478,"725573632"],[1691487508,"725573632"],[1691487538,"725573632"],[1691487568,"725573632"],[1691487598,"725573632"]]}]}}

Looks to me like it's not receiving data points from any other cluster on that query....

And BTW we're running quay.io/jacksontj/promxy:v0.0.81.


Z3po commented Aug 8, 2023

Sorry for the comment-update mess. I decided it was better to run the curl command again and fetch the necessary information from that same execution, so the logs should definitely match the query.

jacksontj added a commit that referenced this issue Aug 14, 2023
Since we already did a query of a downstream (in some aggregate) we
don't want to apply any LookbackDelta ranges to this (as it was already
computed). This avoids the behavior where a response will "repeat" the
last result seen for the duration of the LookbackDelta.

Fixes #606

@jacksontj (Owner)

Thanks for the report here! With all this data it was trivial to reproduce, which really speeds things up. Looking at it, this seems to be just another case of lookback-delta mischief. The query lookback delta, from the upstream docs:

The maximum lookback duration for retrieving metrics during expression evaluations and federation. Use with server mode only.

Promxy does the same; the difference is that the dataset it applies the lookback to has (usually) already had it applied downstream. In this case the default lookback delta is 5m, so the promql engine in promxy decided the last point it saw was okay to repeat for that duration. For other selectors (e.g. VectorSelector) this has already been fixed, but it seems the aggregate path still had this issue. I have opened #609 to resolve it.
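
To illustrate the behavior (a sketch of the lookback semantics, not promxy's actual implementation): with the default 5m lookback delta, each evaluation step reuses the most recent sample within the window, so a series that stops reporting keeps repeating its last value. The timestamps and values below are the two points Cluster5 returned in the trace log above; the helper names are hypothetical.

package main

import "fmt"

type sample struct {
    ts    int64 // unix seconds
    value string
}

// valueAt mimics PromQL instant-selector semantics: at evaluation time ts it
// returns the most recent sample that is no older than `lookback` seconds.
func valueAt(samples []sample, ts, lookback int64) (sample, bool) {
    var found sample
    ok := false
    for _, s := range samples {
        if s.ts <= ts && ts-s.ts <= lookback {
            found, ok = s, true // samples are ordered; keep the latest match
        }
    }
    return found, ok
}

func main() {
    // The only two points Cluster5 actually returned (see the trace log above).
    downstream := []sample{{1691487358, "1765376"}, {1691487388, "725573632"}}

    const step, lookback = 30, 300 // 30s query step, 5m default lookback delta
    for ts := int64(1691487358); ts <= 1691487598; ts += step {
        if s, ok := valueAt(downstream, ts, lookback); ok {
            fmt.Printf("%d -> %s\n", ts, s.value)
        }
    }
    // This prints "725573632" for every step after 1691487388, mirroring the
    // promxy response above, until the lookback window would finally expire.
}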
