promxy makes up datapoints by repeating the last dataset #606

Closed · Z3po opened this issue Jul 14, 2023 · 4 comments · Fixed by #609

Z3po commented Jul 14, 2023

We noticed that promxy repeats the last datapoint when running simple queries like "max()" across a set of clusters.

Our setup looks like the following:

flowchart TD

  Promxy --> Cluster1
  Promxy --> Cluster2
  Promxy --> Cluster3

  subgraph Cluster1
    Prometheus1-1
    Prometheus1-2
  end

  subgraph Cluster2
    Prometheus2-1 
    Prometheus2-2
  end

  subgraph Cluster3
    Prometheus3-1
    Prometheus3-2
  end

While promxy accesses each of the clusters, it only connects to a single Prometheus instance per cluster through an HTTP proxy configuration. Each k8s cluster only contains data from its own running pods.

The configuration looks like the following:

---
global:
  evaluation_interval: 5s
  external_labels:
    source: promxy
promxy:
  server_groups:
    - static_configs:
        - targets:
            - prometheus-cluster-1:443
      labels:
        cluster: cluster1
      remote_read: true
      remote_read_path: /api/v1/read
      query_params:
        nocache: 1
      scheme: https
      http_client:
        dial_timeout: 4s
        tls_config:
          insecure_skip_verify: true
          cert_file: /etc/certs/client.crt
          key_file: /etc/certs/client.key
      ignore_error: true
    - static_configs:
        - targets:
            - prometheus-cluster-2:443
      labels:
        cluster: cluster2
      remote_read: true
      remote_read_path: /api/v1/read
      query_params:
        nocache: 1
      scheme: https
      http_client:
        dial_timeout: 4s
        tls_config:
          insecure_skip_verify: true
          cert_file: /etc/certs/client.crt
          key_file: /etc/certs/client.key
      ignore_error: true
    - static_configs:
        - targets:
            - prometheus-cluster3:443
      labels:
        cluster: cluster3
      remote_read: true
      remote_read_path: /api/v1/read
      query_params:
        nocache: 1
      scheme: https
      http_client:
        dial_timeout: 4s
        tls_config:
          insecure_skip_verify: true
          cert_file: /etc/certs/client.crt
          key_file: /etc/certs/client.key
      ignore_error: true
    - static_configs:
        - targets:
            - prometheus-cluster4:443
      labels:
        cluster: cluster4
      remote_read: true
      remote_read_path: /api/v1/read
      query_params:
        nocache: 1
      scheme: https
      http_client:
        dial_timeout: 4s
        tls_config:
          insecure_skip_verify: true
          cert_file: /etc/certs/client.crt
          key_file: /etc/certs/client.key
      ignore_error: true
      ...

Now, querying for data either through promxy or against the Prometheus instances directly via port-forwarding:

curl -v --get --data-urlencode 'query=max(container_memory_usage_bytes{id!="/",namespace=~"XXXX",container!="",container="XXXX",container!="POD",pod=~".*XXXXX"}) by (pod, container, cluster)' --data-urlencode "start=2023-07-14T06:39:53.152Z" --data-urlencode "end=2023-07-14T07:24:58.224Z" --data-urlencode "step=30s" 'http://localhost:9090/api/v1/query_range'

returns the following data through promxy:

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "cluster": "cluster1",
          "container": "XXXX",
          "pod": "XXXX"
        },
        "values": [
          [
            1689317393.152,
            "1536000"
          ],
          [
            1689317423.152,
            "342065152"
          ],
          [
            1689317453.152,
            "1119674368"
          ],
          [
            1689317483.152,
            "1345925120"
          ],
          [
            1689317513.152,
            "1498701824"
          ],
          [
            1689317543.152,
            "2520608768"
          ],
          [
            1689317573.152,
            "2830626816"
          ],
          [
            1689317603.152,
            "2830626816"
          ],
          [
            1689317633.152,
            "2830626816"
          ],
          [
            1689317663.152,
            "2830626816"
          ],
          [
            1689317693.152,
            "2830626816"
          ],
          [
            1689317723.152,
            "2830626816"
          ],
          [
            1689317753.152,
            "2830626816"
          ],
          [
            1689317783.152,
            "2830626816"
          ],
          [
            1689317813.152,
            "2830626816"
          ],
          [
            1689317843.152,
            "2830626816"
          ],
          [
            1689317873.152,
            "2830626816"
          ]
        ]
      }
    ]
  }
}

But through prometheus1-1:

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "container": "XXXX",
          "pod": "XXXX"
        },
        "values": [
          [
            1689317393.152,
            "1536000"
          ],
          [
            1689317423.152,
            "342065152"
          ],
          [
            1689317453.152,
            "1119674368"
          ],
          [
            1689317483.152,
            "1345925120"
          ],
          [
            1689317513.152,
            "1498701824"
          ],
          [
            1689317543.152,
            "2520608768"
          ],
          [
            1689317573.152,
            "2830626816"
          ]
        ]
      }
    ]
  }
}

and through prometheus1-2:

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "container": "XXXX",
          "pod": "XXXX"
        },
        "values": [
          [
            1689317393.152,
            "92098560"
          ],
          [
            1689317423.152,
            "659861504"
          ],
          [
            1689317453.152,
            "1243791360"
          ],
          [
            1689317483.152,
            "1345925120"
          ],
          [
            1689317513.152,
            "2733039616"
          ],
          [
            1689317543.152,
            "2545987584"
          ]
        ]
      }
    ]
  }
}

As you can see, there are additional datapoints in the promxy output that shouldn't be there.
Am I somehow holding it wrong?

If I'm missing any data necessary to debug the issue, just let me know.

Thanks for your help in advance!

@jacksontj (Owner)

I did some initial looking and am unable to reproduce this issue locally.

From a quick look the results are confusing to me for both promxy and prometheus. The query covers a range of ~45m with a step of 30s, which should yield about 90 datapoints; promxy was returning 17 and prometheus 6 or 7. So it's a bit hard to say without the full data that promxy was seeing (either trace logs or a tcpdump). My guess here is that the max isn't unique, such that it's putting multiple series together, but that's a really off-the-wall guess without more data.
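
For context, a minimal sketch of that step-count arithmetic, using the start/end/step values from the earlier curl command (Go used purely for illustration, not part of promxy):

package main

import (
    "fmt"
    "time"
)

func main() {
    // Range boundaries taken from the curl query earlier in the thread.
    start, _ := time.Parse(time.RFC3339, "2023-07-14T06:39:53.152Z")
    end, _ := time.Parse(time.RFC3339, "2023-07-14T07:24:58.224Z")
    step := 30 * time.Second

    // A range query is evaluated once at `start` and then every `step` until
    // `end`, so each series can hold at most floor(range/step)+1 points.
    points := int(end.Sub(start)/step) + 1
    fmt.Printf("range=%v step=%v -> up to %d evaluation steps\n", end.Sub(start), step, points)
}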


Z3po commented Aug 8, 2023

Thanks for taking a first look!

I've run the queries with trace loglevel and got the following (now for another pod, but still the same issue):

{"level":"debug","msg":"AggregateExpr max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"}) max","time":"2023-08-08T11:28:43Z"}

{"api":"QueryRange","level":"debug","msg":"http://Cluster8:9090","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster6:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster2:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster5:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster4:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster1:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster7:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}
{"api":"QueryRange","level":"debug","msg":"https://Cluster3:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z"}

{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster6:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":5202879,"value":[],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"http://Cluster8:9090","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":6089343,"value":[],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster5:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":11511931,"value":[{"metric":{"container":"build","pod":"XXXX"},"values":[[1691487358,"1765376"],[1691487388,"725573632"]]}],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster7:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":18630738,"value":[],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster2:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":24686659,"value":[],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster4:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":25874023,"value":[],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster1:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":26241856,"value":[],"warnings":null}
{"api":"QueryRange","error":null,"level":"trace","msg":"https://Cluster3:443","query":"max by(pod, container) (container_memory_usage_bytes{container!=\"\",container!=\"POD\",container=\"build\",id!=\"/\",namespace=~\"XXXXXXX\",pod=~\"XXXX\"})","r":{"Start":"2023-08-08T09:29:58Z","End":"2023-08-08T09:40:04Z","Step":30000000000},"time":"2023-08-08T11:28:43Z","took":25840574,"value":[],"warnings":null}

{"remoteAddr":"127.0.0.1","time":"2023-08-08T11:28:43.732172484Z","method":"GET","path":"/api/v1/query_range","protocol":"HTTP/1.1","status":200,"responseBytes":383,"duration":0.030032405,"query":"start=2023-08-08T09%3A29%3A58Z\u0026end=2023-08-08T09%3A40%3A04Z\u0026step=30s\u0026query=max%28container_memory_usage_bytes%7Bid%21%3D%22%2F%22%2Cnamespace%3D~%22XXXXXXX%22%2Ccontainer%21%3D%22%22%2Ccontainer%3D%22build%22%2Ccontainer%21%3D%22POD%22%2Cpod%3D~%22XXXX"}

And the result is:

❯ curl -v --get --data-urlencode 'query=max(container_memory_usage_bytes{id!="/",namespace=~"XXXXXXX",container!="",container="build",container!="POD",pod=~"XXXX"}) by (pod, container)' --data-urlencode "start=2023-08-08T09:29:58Z" --data-urlencode end="2023-08-08T09:40:04Z" --data-urlencode "step=30s" http://localhost:8082/api/v1/query_range
*   Trying 127.0.0.1:8082...
* Connected to localhost (127.0.0.1) port 8082 (#0)
> GET /api/v1/query_range?query=max%28container_memory_usage_bytes%7bid%21%3d%22%2f%22%2cnamespace%3d~%22XXXXXXX%22%2ccontainer%21%3d%22%22%2ccontainer%3d%22build%22%2ccontainer%21%3d%22POD%22%2cpod%3d~%22XXXX%22%7d%29+by+%28pod%2c+container%29&start=2023-08-08T09%3a29%3a58Z&end=2023-08-08T09%3a40%3a04Z&step=30s HTTP/1.1
> Host: localhost:8082
> User-Agent: curl/7.88.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Tue, 08 Aug 2023 11:28:43 GMT
< Content-Length: 383
< 
* Connection #0 to host localhost left intact
{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"container":"build","pod":"XXXX"},"values":[[1691487358,"1765376"],[1691487388,"725573632"],[1691487418,"725573632"],[1691487448,"725573632"],[1691487478,"725573632"],[1691487508,"725573632"],[1691487538,"725573632"],[1691487568,"725573632"],[1691487598,"725573632"]]}]}}

Looks to me like it's not receiving data points from any other cluster on that query....

And BTW we're running quay.io/jacksontj/promxy:v0.0.81.


Z3po commented Aug 8, 2023

Sorry for the comment-update mess. I decided it was better to run the curl command again and fetch the necessary information from that same execution, so the logs should definitely match the query.

jacksontj added a commit that referenced this issue Aug 14, 2023
Since we already did a query of a downstream (in some aggregate) we
don't want to apply any LookbackDelta ranges to this (as it was already
computed). This avoids the behavior where a response will "repeat" the
last result seen for the duration of the LookbackDelta.

Fixes #606

@jacksontj (Owner)

Thanks for the report here! With all this data it was trivial to reproduce, which really speeds things up. Looking at it, this seems to be just another case of lookback-delta mischief. The query lookback delta, from the upstream docs:

The maximum lookback duration for retrieving metrics during expression evaluations and federation. Use with server mode only.

Promxy does the same; the difference is that the dataset it applies the lookback to has (usually) already had it applied downstream. In this case the default lookback delta is 5m, so the promql engine in promxy decided the last point it saw was okay to repeat for that duration. For other selectors (e.g. VectorSelector) this has already been fixed, but it seems the aggregate path still had this issue. I have opened #609 to resolve it.
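
To illustrate the behavior (a sketch of the lookback semantics, not promxy's actual implementation): with the default 5m lookback delta, each evaluation step reuses the most recent sample within the window, so a series that stops reporting keeps repeating its last value. The timestamps and values below are the two points Cluster5 returned in the trace log above; the helper names are hypothetical.

package main

import "fmt"

type sample struct {
    ts    int64 // unix seconds
    value string
}

// valueAt mimics PromQL instant-selector semantics: at evaluation time ts it
// returns the most recent sample that is no older than `lookback` seconds.
func valueAt(samples []sample, ts, lookback int64) (sample, bool) {
    var found sample
    ok := false
    for _, s := range samples {
        if s.ts <= ts && ts-s.ts <= lookback {
            found, ok = s, true // samples are ordered; keep the latest match
        }
    }
    return found, ok
}

func main() {
    // The only two points Cluster5 actually returned (see the trace log above).
    downstream := []sample{{1691487358, "1765376"}, {1691487388, "725573632"}}

    const step, lookback = 30, 300 // 30s query step, 5m default lookback delta
    for ts := int64(1691487358); ts <= 1691487598; ts += step {
        if s, ok := valueAt(downstream, ts, lookback); ok {
            fmt.Printf("%d -> %s\n", ts, s.value)
        }
    }
    // This prints "725573632" for every step after 1691487388, mirroring the
    // promxy response above, until the lookback window would finally expire.
}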
