Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the docs of pyroscope.scrape #5847

Merged
merged 6 commits into from
Jan 10, 2024
Merged

Conversation

ptodev
Copy link
Contributor

@ptodev ptodev commented Nov 24, 2023

Adding extra information which me and @thampiotr recently learned while tweaking Agents running pyroscope.scrape.

The PR also cleans up the docs in general.

I am not sure who should review this PR from the Docs squad. IIUC @clayton-cornell is normally helping with Agent docs whereas @knylander-grafana usually helps with Pyroscope docs :)

Copy link
Contributor

@thampiotr thampiotr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor comments, otherwise LGTM

docs/sources/flow/reference/components/pyroscope.scrape.md Outdated Show resolved Hide resolved
docs/sources/flow/reference/components/pyroscope.scrape.md Outdated Show resolved Hide resolved
docs/sources/flow/reference/components/pyroscope.scrape.md Outdated Show resolved Hide resolved
docs/sources/flow/reference/components/pyroscope.scrape.md Outdated Show resolved Hide resolved
#### `scrape_interval` argument

If `scrape_interval` is short:
* Advantages:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One advantage is also being able to catch events such as OOM when they happen quickly. So maybe something like "Ability to capture rapid changes in profiles".

Copy link
Contributor Author

@ptodev ptodev Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how this is true if delta = false, because then profiles would work similarly to Prometheus metrics.

However, if delta = true, wouldn't a lower scrape frequency lead to less catched events?

E.g. if scrape_frequency = "15s", then each observation will last 14 seconds. For every 60 seconds we will lose 4 seconds of observations - this is the time between scrapes. On the other hand, for scrape_frequency = "60s" we will only lose 1 second worth of samples for every minute.

It'd be great if someone from the Pyroscope team confirms this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is when a bug leads to e.g. a rapid growth in memory and within 20 seconds the instance OOMs. This is much harder to catch with 1m scrape interval, as you'd need to be lucky to finish the profile just a few seconds before OOM event happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that when analysing OOMs, it's better to have a short scrape frequency. However, in case the application is not crashing and duration = true, I think that longer scrapes will actually result in more samples and will better detect rapid changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you refer to in duration = false|true? I don't see such setting on pyroscope.scrape.

Copy link
Contributor Author

@ptodev ptodev Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant delta = true. I'll edit my comment above.

docs/sources/flow/reference/components/pyroscope.scrape.md Outdated Show resolved Hide resolved
docs/sources/flow/reference/components/pyroscope.scrape.md Outdated Show resolved Hide resolved
`job_name` | `string` | The job name to override the job label with. | component name | no
`params` | `map(list(string))` | A set of query parameters with which the target is scraped. | | no
`scrape_interval` | `duration` | How frequently to scrape the targets of this scrape config. | `"15s"` | no
`scrape_timeout` | `duration` | The timeout for scraping targets of this config. | `"18s"` | no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's best to add it right here: "It must be larger than scrape_interval".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too sure about this one. We usually document constraints under the table... I'll let the comment sit for now in case someone else wants to opine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @cyriltovena curious your thoughts here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a bug to require scrape_timeout longer that scrape_interval.
It may make sense for cpu, but does not makes sense for other (not delta profiles) requests which should not last long

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Shouldn't the docs therefore explain this - that CPU profiles may need different timeout & interval than the remaining profile types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I will just state "must be larger than scrape_interval" in the table as @thampiotr suggested. We usually state such things under the table, but I think it improves readability in this case and it's worth making an exception.

Shouldn't the docs therefore explain this - that CPU profiles may need different timeout & interval than the remaining profile types?

I think for now we can just explain how the component behaves, and we can improve with such advice later on. I think what @korniltsev states applies not just for CPU, but any "delta" profile.

Btw I think the reason for timing out delta profiles before the next query is started is that there can't be more than one active pprof for a certain type at a time. However, even if we have a very slow non-delta pprof query, I suppose we still won't be able to send a new query until the old one has finished. This is just a guess though.

@@ -142,7 +230,7 @@ an `oauth2` block.
The `profiling_config` block configures the profiling settings when scraping
targets.

The block contains the following attributes:
The following arguments are supported:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is path_prefix even used? I can't find code which uses it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a bug yes. Worth confirming it doesn't work and open an issue.

`scrape_timeout` | `duration` | The timeout for scraping targets of this config. | `"18s"` | no
`scheme` | `string` | The URL scheme with which to fetch metrics from targets. | `"http"` | no
`bearer_token` | `secret` | Bearer token to authenticate with. | | no
`bearer_token_file` | `string` | File containing a bearer token to authenticate with. | | no
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such _file arguments should be deprecated. Normally we recommend using components like local.file, remote.vault, remote.s3 for such use cases.

@@ -322,6 +404,19 @@ If the agent is _not_ running in clustered mode, this block is a no-op.

[using clustering]: {{< relref "../../concepts/clustering.md" >}}

## Common configuration
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clayton-cornell - this is another use case for a "Common configuration" section. Yesterday we discussed it in the context of documenting shared blocks. Here I use it to document shared arguments.

@rfratto - you mentioned in a previous PR that you're concerned with too much repetition, which makes users think there's more to learn than there really is. I agree, and I hope this section here helps with that.


### `delta` argument

When the `delta` argument is `false`, the [pprof][] HTTP query will be instantaneous.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does a delta of false actually mean, e.g. in the context of CPU profiles? Does it lead to visible differences in the Pyroscope UI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Delta" to me means that the difference between the first sample and the last sample.
Is that the case here? I was under the impression that all samples for the duration of seconds=14 will be taken into account.


For example, the `job_name` of `pyroscope.scrape "local" { ... }` will be `"pyroscope.scrape/local"`.

#### `targets` argument
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not normal to add a section for an argument inside the ## Arguments section. However, I think it really helps here because there's lots to discuss. I added the sections using #### and not ### so that they don't stand out too much and so that they are not listed in the ToC:
Screenshot 2023-11-24 at 13 28 50

@@ -288,7 +370,7 @@ profile.custom "PROFILE_TYPE" {
Multiple `profile.custom` blocks can be specified. Labels assigned to
`profile.custom` blocks must be unique across the component.

The `profile.custom` block accepts the following arguments:
The following arguments are supported:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be nice to have an example with profile.custom in the Examples section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clayton-cornell - The changes in the PR make this page more in line with our Writing documentation for Flow components guide, but I have taken a few liberties which are noted in comments.

@knylander-grafana - there is a Go (pull mode) page on the Pyroscope website which contains lots of duplicate information. Should we remove the duplications?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We created that page in response to feedback that this page was too verbose and hard to get started quickly. We will definitely keep the other page for now and I'll just manage the duplicate work of making sure changes made here make it over there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could look into converting Go (pull mode) and eBPF into new task-based docs under Agent's Getting Started section called "Profiling Go Applications" and "Profiling with eBPF".

@ptodev ptodev force-pushed the ptodev/clarify-pyroscope-config branch from 07edd31 to 71a0a29 Compare November 24, 2023 16:00
`targets`. The scraped performance profiles are forwarded to the list of receivers passed in
`forward_to`.
`pyroscope.scrape` collects [pprof] performance profiles for a given set of HTTP `targets`.

Copy link
Contributor

@Rperry2174 Rperry2174 Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually share this picture to show visually what the grafana agent does... may be worth adding in here. we could update this image to have more detail (i.e. showing multiple endpoints for scraping)

image

Copy link
Contributor Author

@ptodev ptodev Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the picture shows the interactions between different components, I'd prefer not to place it in the reference docs for pyroscope.scrape. It'd be great to have a new "Profiling Go Applications" task-based doc in the Getting Started section. That doc can walk people through installing the Agent for profiling and we could put the picture there.

I think having such an architecture picture would be great for most task-based docs, and it would be even better if they are all consistent.
cc @clayton-cornell

@clayton-cornell clayton-cornell added the type/docs Docs Squad label across all Grafana Labs repos label Nov 29, 2023
Copy link
Contributor

github-actions bot commented Jan 7, 2024

This PR has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If you do not have enough time to follow up on this PR or you think it's no longer relevant, consider closing it.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your PR will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

@github-actions github-actions bot added the needs-attention An issue or PR has been sitting around and needs attention. label Jan 7, 2024
@ptodev ptodev force-pushed the ptodev/clarify-pyroscope-config branch from 2b066e1 to 7970f78 Compare January 8, 2024 09:00
Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@clayton-cornell
Copy link
Contributor

@ptodev Does this make this one ready for merge? There's a lot of unresolved comments throughout the doc.

@github-actions github-actions bot removed the needs-attention An issue or PR has been sitting around and needs attention. label Jan 9, 2024
@ptodev
Copy link
Contributor Author

ptodev commented Jan 9, 2024

@ptodev Does this make this one ready for merge? There's a lot of unresolved comments throughout the doc.

@clayton-cornell I am happy to merge it. It is a good improvement over what we have atm, and most of the discussions are closed now. The only remaining thing I'm curious about is how delta vs-non-delta profiles are displayed in the UI, but we could ask about that some other time. Would you mind if I merge it please? I'm not sure if you want to review further.

@ptodev ptodev force-pushed the ptodev/clarify-pyroscope-config branch from 2b90bd4 to 89b7cba Compare January 9, 2024 08:17
@ptodev ptodev force-pushed the ptodev/clarify-pyroscope-config branch from 89b7cba to d935d1b Compare January 9, 2024 08:39
@clayton-cornell
Copy link
Contributor

@ptodev Lets get this one published. We can open a new PR when needed ;-)

@ptodev ptodev merged commit ed6128e into main Jan 10, 2024
10 checks passed
@ptodev ptodev deleted the ptodev/clarify-pyroscope-config branch January 10, 2024 20:09
BarunKGP pushed a commit to BarunKGP/grafana-agent that referenced this pull request Feb 20, 2024
* Clarify some aspects of pyroscope.scrape

---------

Co-authored-by: Piotr <17101802+thampiotr@users.noreply.github.com>
Co-authored-by: Ryan Perry <Rperry2174@gmail.com>
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. type/docs Docs Squad label across all Grafana Labs repos
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants