cockroach retains a high memory footprint after a large query is processed #20078
Comments
If there isn't memory pressure on the system, I'm not sure Go's GC proactively returns memory back to the OS. Might be worth looking into that.
(or that the OS might not bother taking it back until it's needed)
To @a-robinson's point, I was about to suggest that you attempt to add a debug endpoint or something that calls FreeOSMemory and observe what happens after you call it.
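A minimal sketch of what such a debug endpoint could look like (the handler path and port here are made up for illustration; this is not an existing cockroach endpoint):

```go
package main

import (
	"fmt"
	"net/http"
	"runtime"
	"runtime/debug"
)

func main() {
	// Hypothetical endpoint: force a GC, ask the runtime to return as much
	// memory to the OS as it can, and report how HeapReleased changed.
	http.HandleFunc("/debug/free-os-memory", func(w http.ResponseWriter, r *http.Request) {
		var before, after runtime.MemStats
		runtime.ReadMemStats(&before)
		debug.FreeOSMemory() // runs a GC, then releases unused heap pages to the OS
		runtime.ReadMemStats(&after)
		fmt.Fprintf(w, "HeapReleased: %d -> %d bytes\n", before.HeapReleased, after.HeapReleased)
	})
	http.ListenAndServe("localhost:6060", nil)
}
```

If RSS drops after hitting such an endpoint, the memory was merely unscavenged; if it doesn't, something is still holding references to it.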
The Go GC returns memory to the OS at a very slow rate. I can't recall what the time frame is, but multiple minutes.
I ran …
The scavenge period is 5min. While looking this up, I also discovered that the Go GC "forces" a GC every 2min. That seems to be independent of whether GC is necessary. Cc @a-robinson and @nvanbenschoten as this is the first time I've encountered something that occurs at a 2min frequency in our system.
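A small standalone program (an illustration only, nothing cockroach-specific) that makes these periods observable: with the process otherwise idle, MemStats.NumForcedGC should tick up roughly every two minutes, and HeapReleased should grow once the scavenger hands idle spans back to the OS.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Poll the runtime stats every 30s while the process is idle.
	for {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		fmt.Printf("%s NumGC=%d NumForcedGC=%d HeapReleased=%dKB\n",
			time.Now().Format(time.Stamp), ms.NumGC, ms.NumForcedGC, ms.HeapReleased/1024)
		time.Sleep(30 * time.Second)
	}
}
```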
@petermattis the fact that we had an hour where it wasn't freeing memory leads me to believe that something's up with the Go GC scavenging.
Oooh that's a very interesting discovery. It pairs nicely with #12713.
The Go GC is not a compacting/copying collector, so fragmentation can result in memory that is idle that cannot be returned to the system (even a single long-lived value can force a whole page to be retained). There are allocation strategies that can be used to minimize fragmentation, but I think it would be best to just set expectations that cockroachdb will grow to its configured size and not return memory to the system.
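As a toy illustration of that fragmentation effect (a standalone example, not cockroach code): keeping a small fraction of many small allocations alive pins every heap span that still contains a live object, so the heap stays large even though almost everything was freed.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	const n = 1 << 18 // ~256K allocations of 1KB each (~256MB total)
	blocks := make([][]byte, n)
	for i := range blocks {
		blocks[i] = make([]byte, 1024)
	}

	// Retain only every 1024th block (~256KB of live data); drop the rest.
	var keep [][]byte
	for i := 0; i < n; i += 1024 {
		keep = append(keep, blocks[i])
	}
	blocks = nil
	runtime.GC()

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	// HeapInuse stays well above the ~256KB actually retained, because a span
	// with even one live object can be neither freed nor returned to the OS.
	// HeapIdle is freed memory the runtime still holds; it only goes back to
	// the OS when the scavenger runs (or debug.FreeOSMemory is called).
	fmt.Printf("live blocks=%d HeapInuse=%dKB HeapIdle=%dKB HeapReleased=%dKB\n",
		len(keep), ms.HeapInuse/1024, ms.HeapIdle/1024, ms.HeapReleased/1024)
}
```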
@bdarnell If Go GC can hold up so much freed memory due to fragmentation, then how does one reasonably guarantee that cockroach will remain within its configured limit? Does cockroach track process RSS to disallow further allocations? In my experiments, I saw cockroach RSS go well beyond what was configured (it took up 10 GB when configured to take up 6 GB). Also, it seems unlikely that such large amounts of memory can be held up due to fragmentation.
These pages that can't be returned to the system can still be reused by go's allocator, so we can reasonably expect the process to stop growing. It just won't necessarily return that memory to the system. Note that …
@bdarnell I'm hesitant to chalk this up as an inability to release memory due to fragmentation of the Go heap. That's certainly a possibility, but there might be something else going on here. There is some more detail in https://golang.org/pkg/runtime/#MemStats (…)
I agree with @petermattis. The allocator not releasing 2 out of 4 GB of freed memory for hours indicates that either the allocator is too rudimentary, or the allocation pattern happens to be adversarial. It's also unclear what caused it to get freed up eventually.
This seems reasonable assuming the allocator is not rudimentary.
Sounds reasonable. But I would say that what I am observing is not peak usage of 10 GB but sustained usage of 10 GB on an idle system after a peak.
If the system (meaning the entire machine) is idle, though, then the OS has little motivation to reclaim the memory from the process. From golang/go#14521 (comment): "The go runtime can only advise the operating system that a section of …" If you want to verify whether Go is doing the right thing, you could either dump the go GC stats and examine …
Yeah, it's possible. This certainly seems like a lot of memory to be held up by fragmentation. But it's worth noting that returning memory to the system has not historically been a priority for google's allocators (tcmalloc never returned memory to the system when it was first released), so I'm not surprised if go's allocator doesn't do well here.
@arjunravinarayan already ran stress on the machine and the OS reclaimed the memory.
I don't think this is the cause here. See the madvise man page: http://man7.org/linux/man-pages/man2/madvise.2.html
Note that the OS may not reclaim immediately, but the RSS will go down. In my case, I was only looking at RSS, which did not go down. But I'm not a Linux expert, so I might be missing something about madvise's behavior.
I looked at this open issue yesterday: golang/go#16930 (and a user bug related to it, golang/go#14045). It seems like the allocator is not "prompt" enough about returning the memory, but from what I gather, if there are no major fragmentation issues, a forced GC will run every 2 minutes and a forced scavenging attempt will be made every 5 minutes, which would release the memory back to the system (using madvise on Linux). So if you wait a few minutes, the memory should be released. My suggestion is to add some more monitoring in cockroach (maybe an endpoint to call GetMemStats? does pprof already have this?) and at least confirm whether Go's allocator is releasing the memory at all.
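For reference, the standard library's expvar package already publishes runtime memstats at /debug/vars on the default mux, but a dedicated endpoint makes it easy to watch just the fields that matter here. A rough sketch (the path is made up, not an existing cockroach endpoint):

```go
package main

import (
	"encoding/json"
	"net/http"
	"runtime"
)

func main() {
	// Hypothetical monitoring endpoint exposing the fields relevant to this issue:
	// HeapSys (memory obtained from the OS for the heap), HeapInuse (live spans),
	// HeapIdle (freed but still held by the runtime), and HeapReleased
	// (handed back to the OS via madvise).
	http.HandleFunc("/debug/memstats", func(w http.ResponseWriter, r *http.Request) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		json.NewEncoder(w).Encode(map[string]uint64{
			"HeapSys":      ms.HeapSys,
			"HeapInuse":    ms.HeapInuse,
			"HeapIdle":     ms.HeapIdle,
			"HeapReleased": ms.HeapReleased,
		})
	})
	http.ListenAndServe("localhost:6060", nil)
}
```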
@arjunravinarayan Given the reproducibility of the original issue, it seems like we should be able to identify where the memory is going. Let's extend the log message in …
Thanks for the lead, @petermattis. I reran the query, with the additions to the log message. I found the following things:
From the go1.9 source:
@a-robinson, I wasn't careful about noting down timings for when the RSS went down; it could have happened when I applied memory pressure to the system using …
@a-robinson RSS only goes down when you apply memory pressure to the system. It stays at …
I think we can close this issue, given that we've tracked it down to the following conclusions:
Any objections @petermattis?
No objections from me.
TL;DR: memory does not appear to be freed after the query completes, so running a single large query leaves cockroach with a large memory footprint, which can be detrimental to colocated processes.
I restored a TPC-H scale-factor-5 dataset, then performed the following steps:
Start a cluster, limiting `--max-sql-memory` and the RocksDB cache:
`./cockroach start --insecure --background --max-sql-memory 2GB --cache 1GB`
These limits are set purely to understand the problem better, since the RocksDB cache's space usage does not show up in pprof memory profiles.
I then ran the following query, with DistSQL turned off (in order to simplify things):
`select * from tpch.partsupp JOIN tpch.lineitem ON l_partkey = ps_partkey;`
This was just a large join query that I made up, which would build a large hash table. Note that the local SQL processor constructs the hash table from the right-side table, so this JOIN query was deliberately written to build the larger hash table. DistSQL would run a merge join in this case, which wouldn't achieve my goal of a large memory footprint.
The query OOMs as expected after a bit:
`pq: root: memory budget exceeded: 10240 bytes requested, 2000000000 bytes in budget`
I terminated the SQL shell, just to be absolutely sure resources weren't tied to the session, then I did something else on my machine for an hour.
Memory usage remains at a constant `3.46GB`. Grabbing a heap profile (attached) shows that the `--inuse_space` is 6 megabytes. Attached profile. An `--alloc_space` profile confirms that I'm not crazy, and that this node did at one point allocate a bunch of space in order to execute the JOIN query.

We do not have visibility into the amount of memory held in CGo AFAIK, but we restricted the RocksDB cache to `1GB`, so if that's working correctly, that's still 2.5GB unaccounted for.

@petermattis, do you have any suggestions for how I should proceed to investigate where this memory is held? I still have the running cluster, and it doesn't seem to be in any hurry to free up the space.
cc @asubiotto, @vivekmenezes