cockroach retains a high memory footprint after a large query is processed #20078

Closed
rjnn opened this issue Nov 15, 2017 · 26 comments

Comments

@rjnn
Contributor

rjnn commented Nov 15, 2017

TL;DR: memory does not appear to be freed after a large query completes, so running a single large query leaves cockroach with a large memory footprint, which can be detrimental to colocated processes.

I restored a TPC-H scale-factor-5 dataset, then performed the following steps:

Start a cluster, limiting --max-sql-memory and the RocksDB cache:
./cockroach start --insecure --background --max-sql-memory 2GB --cache 1GB
These limits are set purely to understand the problem better, since RocksDB cache usage does not show up in pprof memory profiles.

I then ran the following query with DistSQL turned off (to simplify things):

select * from tpch.partsupp JOIN tpch.lineitem ON l_partkey = ps_partkey;

This was just a large join query that I made up, one that would build a large hash table. Note that the local SQL processor builds the hash table from the right-side table, so this JOIN was deliberately written to put the larger table on the right. DistSQL would run a merge join in this case, which wouldn't achieve my goal of a large memory footprint.

As expected, the query errors out after a bit, having exceeded the memory budget:
pq: root: memory budget exceeded: 10240 bytes requested, 2000000000 bytes in budget

I terminated the SQL shell, just to be absolutely sure resources weren't tied to the session, then I did something else on my machine for an hour.

Memory usage remains at a constant 3.46GB. Grabbing a heap profile shows that --inuse_space is only 6 megabytes (profile attached).

An --alloc_space profile confirms that I'm not crazy, and that this node did at one point allocate a bunch of space in order to execute the JOIN query.

We do not have visibility into the amount of memory held in CGo, AFAIK, but we restricted the RocksDB cache to 1GB, so if that limit is working correctly, there's still 2.5GB unaccounted for.

@petermattis, do you have any suggestions for how I should proceed to investigate where this memory is held? I still have the running cluster, and it doesn't seem to be in any hurry to free up the space.

cc @asubiotto, @vivekmenezes

@rjnn
Contributor Author

rjnn commented Nov 15, 2017

[screenshot: Admin UI memory graphs, 2017-11-15 15:25]
The Admin UI confirms the memory usage numbers, and that CGo is behaving properly.

@a-robinson
Contributor

If there isn't memory pressure on the system, I'm not sure Go's GC proactively returns memory back to the OS. Might be worth looking into that.

@a-robinson
Contributor

(or that the OS might not bother taking it back until it's needed)

@jordanlewis
Member

To @a-robinson's point, I was about to suggest that you add a debug endpoint or something that calls FreeOSMemory, and observe what happens after you call it.
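
For illustration, a minimal sketch of that kind of endpoint as a standalone program (the /debug/free-os-memory path and the port are invented here, not an existing cockroach endpoint):

    package main

    import (
        "fmt"
        "net/http"
        "runtime/debug"
    )

    func main() {
        // Hypothetical debug endpoint: force a GC and ask the runtime to
        // return as much memory to the OS as possible.
        http.HandleFunc("/debug/free-os-memory", func(w http.ResponseWriter, r *http.Request) {
            debug.FreeOSMemory()
            fmt.Fprintln(w, "debug.FreeOSMemory() called")
        })
        http.ListenAndServe("localhost:6060", nil)
    }

Hitting the endpoint and then watching the process RSS would tell us whether the memory is merely unscavenged or genuinely still referenced.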

@petermattis
Collaborator

The Go GC returns memory to the OS at a very slow rate. I can't recall what the time frame is, but multiple minutes.

@rjnn
Contributor Author

rjnn commented Nov 15, 2017

I ran stress --vm 64 to add some memory pressure to my machine. stress spawns workers that consume memory (each "vm" worker eats 256MB by default, and I'm on a 16GB laptop, so I used 64 workers). After running this for a couple of minutes, I killed stress. Cockroach was down to a reasonable amount of memory, validating @a-robinson's hypothesis.

[screenshot: Admin UI memory graphs after applying memory pressure, 2017-11-15 15:39]

@petermattis
Collaborator

The scavenge period is 5min.

While looking this up, I also discovered that the Go GC "forces" a GC every 2min. That seems to be independent of whether a GC is necessary. Cc @a-robinson and @nvanbenschoten as this is the first time I've encountered something that occurs at a 2min frequency in our system.
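
One way to watch both periods from the outside, sketched as a standalone Go program rather than anything wired into cockroach:

    package main

    import (
        "log"
        "runtime"
        "time"
    )

    func main() {
        // On an otherwise idle process, NumGC should still tick up roughly
        // every 2 minutes (the periodic forced GC), and HeapReleased should
        // grow once the ~5-minute scavenger returns idle pages to the OS.
        for range time.Tick(30 * time.Second) {
            var ms runtime.MemStats
            runtime.ReadMemStats(&ms)
            log.Printf("NumGC=%d, last GC %s ago, HeapReleased=%d MiB",
                ms.NumGC,
                time.Since(time.Unix(0, int64(ms.LastGC))).Round(time.Second),
                ms.HeapReleased>>20)
        }
    }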

@rjnn
Contributor Author

rjnn commented Nov 15, 2017

@petermattis the fact that we had an hour where it wasn't freeing memory leads me to believe that something's up with the Go GC scavenging.

@a-robinson
Contributor

While looking this up, I also discovered that the Go GC "forces" a GC every 2min. That seems to be independent of whether a GC is necessary. Cc @a-robinson and @nvanbenschoten as this is the first time I've encountered something that occurs at a 2min frequency in our system.

Oooh that's a very interesting discovery. It pairs nicely with #12713.

@bdarnell
Contributor

The Go GC is not a compacting/copying collector, so fragmentation can result in idle memory that cannot be returned to the system (even a single long-lived value can force a whole page to be retained). There are allocation strategies that can be used to minimize fragmentation, but I think it would be best to just set the expectation that cockroachdb will grow to its configured size and not return memory to the system.

@garvitjuniwal
Contributor

@bdarnell If the Go GC can hold on to that much freed memory due to fragmentation, then how does one reasonably guarantee that cockroach will remain within its configured limit? Does cockroach track process RSS to disallow further allocations? In my experiments, I saw cockroach's RSS go well beyond what was configured (it took up 10 GB when configured to use 6 GB).

Also, it seems unlikely that such a large amount of memory could be held up by fragmentation alone.

@bdarnell
Contributor

These pages that can't be returned to the system can still be reused by go's allocator, so we can reasonably expect the process to stop growing. It just won't necessarily return that memory to the system.

Note that --max-sql-memory and --cache-size are not a complete accounting of all of cockroach's memory - we currently recommend setting each of these flags to 25% of the memory allocated to cockroach and leave the other half to things that are not currently controlled by these flags (over time, we'll move more of our memory usage under the control of these or other limits). A peak of 10GB when --max-sql-memory plus --cache-size are 6GB is not unexpected.

@petermattis
Collaborator

@bdarnell I'm hesitant to chalk this up as an inability to release memory due to fragmentation of the Go heap. That's certainly a possibility, but there might be something else going on here. There is some more detail in https://golang.org/pkg/runtime/#MemStats (HeapIdle vs HeapInuse) that could give us a more definitive answer.

@garvitjuniwal
Contributor

I agree with @petermattis. The allocator not releasing 2 out of 4 GB of freed memory for hours indicates that either the allocator is too rudimentary, or the allocation pattern is adversarial. It's also unclear what caused the memory to be freed up eventually.

These pages that can't be returned to the system can still be reused by go's allocator, so we can reasonably expect the process to stop growing. It just won't necessarily return that memory to the system.

This seems reasonable assuming the allocator is not rudimentary.

Note that --max-sql-memory and --cache-size are not a complete accounting of all of cockroach's memory - we currently recommend setting each of these flags to 25% of the memory allocated to cockroach and leave the other half to things that are not currently controlled by these flags (over time, we'll move more of our memory usage under the control of these or other limits). A peak of 10GB when --max-sql-memory plus --cache-size are 6GB is not unexpected.

Sounds reasonable. But I would say that what I am observing is not a peak usage of 10 GB but sustained usage of 10 GB on an idle system after the peak.

@a-robinson
Contributor

If the system (meaning the entire machine) is idle, though, then the OS has little motivation to reclaim the memory from the process. From golang/go#14521 (comment):

"The go runtime can only advise the operating system that a section of
virtual address space is no longer needed, it cannot force the operating
system to take action. Operating systems, especially linux, and especially
if they are not under significant memory pressure can, and regularly do,
ignore this advice."

If you want to verify whether Go is doing the right thing, you could either dump the go GC stats and examine HeapReleased, or just apply memory pressure to the system (as @arjunravinarayan did) and see whether it gets reclaimed then.

@bdarnell
Contributor

I'm hesitant to chalk this up as an inability to release memory due to fragmentation of the Go heap. That's certainly a possibility, but there might be something else going on here.

Yeah, it's possible. This certainly seems like a lot of memory to be held up by fragmentation. But it's worth noting that returning memory to the system has not historically been a priority for google's allocators (tcmalloc never returned memory to the system when it was first released), so I'm not surprised if go's allocator doesn't do well here.

@vivekmenezes
Contributor

@arjunravinarayan already ran stress on the machine and the OS reclaimed the memory.

@garvitjuniwal
Contributor

"The go runtime can only advise the operating system that a section of
virtual address space is no longer needed, it cannot force the operating
system to take action. Operating systems, especially linux, and especially
if they are not under significant memory pressure can, and regularly do,
ignore this advice."

I don't think this is the cause here. See the madvise man page: http://man7.org/linux/man-pages/man2/madvise.2.html

    Note that, when applied to shared mappings, MADV_DONTNEED might not lead
    to immediate freeing of the pages in the range. The kernel is free to
    delay freeing the pages until an appropriate moment. The resident set
    size (RSS) of the calling process will be immediately reduced however.

Note that the OS may not reclaim the pages immediately, but the RSS will go down. In my case, I was only looking at RSS, which did not go down. But I'm not a Linux expert, so I might be missing something about madvise's behavior.

Yeah, it's possible. This certainly seems like a lot of memory to be held up by fragmentation. But it's worth noting that returning memory to the system has not historically been a priority for google's allocators (tcmalloc never returned memory to the system when it was first released), so I'm not surprised if go's allocator doesn't do well here.

I looked at this open issue yesterday: golang/go#16930 (and a related user bug, golang/go#14045). It seems like the allocator is not "prompt" enough about returning the memory, but from what I gather, if there are no major fragmentation issues, a forced GC will run every 2 minutes and a forced scavenging attempt will be made every 5 minutes, which should release the memory back to the system (using madvise on Linux). So if you wait a few minutes, the memory should be released.

My suggestion is to add some more monitoring to cockroach (maybe an endpoint that exposes the runtime's MemStats? does pprof already have this?) and at least make sure whether Go's allocator is releasing the memory at all.
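
In Go, runtime.ReadMemStats is the relevant API (the standard expvar package also publishes a "memstats" variable at /debug/vars when imported). A minimal sketch of such an endpoint as a standalone program; the /debug/memstats path is invented here:

    package main

    import (
        "encoding/json"
        "net/http"
        "runtime"
    )

    func main() {
        // Hypothetical monitoring endpoint: serialize the full runtime.MemStats,
        // including HeapIdle, HeapInuse, and HeapReleased, as JSON.
        http.HandleFunc("/debug/memstats", func(w http.ResponseWriter, r *http.Request) {
            var ms runtime.MemStats
            runtime.ReadMemStats(&ms)
            w.Header().Set("Content-Type", "application/json")
            json.NewEncoder(w).Encode(&ms)
        })
        http.ListenAndServe("localhost:6060", nil)
    }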

@petermattis petermattis added this to the 1.2 milestone Nov 20, 2017
@petermattis
Collaborator

@arjunravinarayan Given the reproducibility of the original issue, it seems like we should be able to identify where the memory is going. Let's extend the log message in server/status/runtime.go to include HeapIdle and HeapInuse.
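
A sketch of what the extended log line could report (not the actual server/status/runtime.go code, just the idea under discussion):

    package main

    import (
        "log"
        "runtime"
    )

    func logHeapStats() {
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms)
        // HeapIdle - HeapReleased approximates memory Go is holding on to
        // but has not (yet) returned to the OS.
        log.Printf("heap: %d MiB in use, %d MiB idle, %d MiB released to OS",
            ms.HeapInuse>>20, ms.HeapIdle>>20, ms.HeapReleased>>20)
    }

    func main() { logHeapStats() }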

@rjnn
Contributor Author

rjnn commented Nov 20, 2017

Thanks for the lead, @petermattis. I reran the query, with the additions to the log message. I found the following things:

ms.HeapInuse jumps from 400MiB to 2.4GiB and stays there for ~2 minutes (the query, in contrast, runs out of memory and errors out within ~15 seconds). After that, it drops to ~35MiB, just as ms.HeapIdle jumps to 2.6GiB. That memory stays in ms.HeapIdle for the rest of the run. Ten minutes after the start of the query (8 minutes after the switch from HeapInuse to HeapIdle), the 2.6GiB also starts showing up in ms.HeapReleased (although it is then counted in both HeapIdle and HeapReleased). It stays this way for over half an hour.

@a-robinson
Contributor

HeapReleased is included in HeapIdle, so that's expected once go has released the memory back to the OS. Did the process's RSS go down or stay the same?

@petermattis
Collaborator

From the go1.9 source:

	// TODO(austin): heap_released is both useless and inaccurate
	// in its current form. It's useless because, from the user's
	// and OS's perspectives, there's no difference between a page
	// that has not yet been faulted in and a page that has been
	// released back to the OS. We could fix this by considering
	// newly mapped spans to be "released". It's inaccurate
	// because when we split a large span for allocation, we
	// "unrelease" all pages in the large span and not just the
	// ones we split off for use. This is trickier to fix because
	// we currently don't know which pages of a span we've
	// released. We could fix it by separating "free" and
	// "released" spans, but then we have to allocate from runs of
	// free and released spans.

@rjnn
Contributor Author

rjnn commented Nov 20, 2017

@a-robinson, I wasn't careful about noting down when the RSS went down; it could have happened when I applied memory pressure with stress, or by itself after some cutoff time. I'm rerunning the query on a fresh machine and waiting over 15 minutes before applying memory pressure, to be absolutely sure. I will report back.

@rjnn
Contributor Author

rjnn commented Nov 20, 2017

@a-robinson RSS only goes down when memory pressure is applied to the system. It stays at 3.2GiB until I run stress --vm 64 (which I ran after 12 minutes of waiting), and then it immediately drops to ~90 MiB.

@rjnn
Contributor Author

rjnn commented Nov 27, 2017

I think we can close this issue, given that we've tracked it down to the following conclusions:

  1. The extra memory is allocated only when Cockroach is explicitly put into a mode that asks it to use a lot of memory for query processing, with the features that mitigate memory usage (namely DistSQL and external storage) turned off.
  2. The memory is released when there is memory pressure on the system.
  3. In the meantime, the memory is explicitly marked as free to return to the system.

Any objections @petermattis?

@petermattis
Collaborator

No objections from me.

@rjnn rjnn closed this as completed Nov 28, 2017