cockroach retains a high memory footprint after a large query is processed #20078

Closed
rjnn opened this issue Nov 15, 2017 · 26 comments

Comments

@rjnn
Contributor

rjnn commented Nov 15, 2017

TL;DR: memory does not appear to be freed after a large query completes, so running a single large query leaves cockroach with a large memory footprint, which can be detrimental to colocated processes.

I restored a TPC-H scale-factor-5 dataset, then performed the following steps:

Start a cluster, limiting --max-sql-memory and the RocksDB cache:
./cockroach start --insecure --background --max-sql-memory 2GB --cache 1GB
These limits are set purely to understand the problem better, since RocksDB cache usage does not show up in pprof memory profiles.

I then ran the following query with DistSQL turned off (to simplify things):

select * from tpch.partsupp JOIN tpch.lineitem ON l_partkey = ps_partkey;

This was just a large join query that I made up, one that would build a large hash table. Note that the local SQL processor builds the hash table from the right-side table, so this JOIN was deliberately written to put the larger table on the right. DistSQL would run a merge join in this case, which wouldn't achieve my goal of a large memory footprint.

As expected, the query errors out after a bit, having exceeded the memory budget:
pq: root: memory budget exceeded: 10240 bytes requested, 2000000000 bytes in budget

I terminated the SQL shell, just to be absolutely sure resources weren't tied to the session, then I did something else on my machine for an hour.

Memory usage remains at a constant 3.46GB. Grabbing a heap profile shows that --inuse_space is only 6 megabytes (profile attached).

An --alloc_space profile confirms that I'm not crazy, and that this node did at one point allocate a bunch of space in order to execute the JOIN query.

We do not have visibility into the amount of memory held in CGo, AFAIK, but we restricted the RocksDB cache to 1GB, so if that limit is working correctly, there's still 2.5GB unaccounted for.

@petermattis, do you have any suggestions for how I should proceed to investigate where this memory is held? I still have the running cluster, and it doesn't seem to be in any hurry to free up the space.

cc @asubiotto, @vivekmenezes

@rjnn
Contributor Author

rjnn commented Nov 15, 2017

[screenshot: Admin UI memory graphs, 2017-11-15 15:25]
The Admin UI confirms the memory usage numbers, and that CGo is behaving properly.

@a-robinson
Contributor

If there isn't memory pressure on the system, I'm not sure Go's GC proactively returns memory back to the OS. Might be worth looking into that.

@a-robinson
Contributor

(or that the OS might not bother taking it back until it's needed)

@jordanlewis
Member

To @a-robinson's point, I was about to suggest that you add a debug endpoint or something that calls FreeOSMemory, and observe what happens after you call it.
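
For illustration, a minimal sketch of that kind of endpoint as a standalone program (the /debug/free-os-memory path and the port are invented here, not an existing cockroach endpoint):

    package main

    import (
        "fmt"
        "net/http"
        "runtime/debug"
    )

    func main() {
        // Hypothetical debug endpoint: force a GC and ask the runtime to
        // return as much memory to the OS as possible.
        http.HandleFunc("/debug/free-os-memory", func(w http.ResponseWriter, r *http.Request) {
            debug.FreeOSMemory()
            fmt.Fprintln(w, "debug.FreeOSMemory() called")
        })
        http.ListenAndServe("localhost:6060", nil)
    }

Hitting the endpoint and then watching the process RSS would tell us whether the memory is merely unscavenged or genuinely still referenced.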

@petermattis
Collaborator

The Go GC returns memory to the OS at a very slow rate. I can't recall what the time frame is, but multiple minutes.

@rjnn
Contributor Author

rjnn commented Nov 15, 2017

I ran stress --vm 64 to add some memory pressure to my machine. stress spawns workers that consume memory (each "vm" worker eats 256MB by default, and I'm on a 16GB laptop, so I used 64 workers). After running this for a couple of minutes, I killed stress. Cockroach was down to a reasonable amount of memory, validating @a-robinson's hypothesis.

[screenshot: Admin UI memory graphs after applying memory pressure, 2017-11-15 15:39]

@petermattis
Collaborator

The scavenge period is 5min.

While looking this up, I also discovered that the Go GC "forces" a GC every 2min. That seems to be independent of whether a GC is necessary. Cc @a-robinson and @nvanbenschoten as this is the first time I've encountered something that occurs at a 2min frequency in our system.
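
One way to watch both periods from the outside, sketched as a standalone Go program rather than anything wired into cockroach:

    package main

    import (
        "log"
        "runtime"
        "time"
    )

    func main() {
        // On an otherwise idle process, NumGC should still tick up roughly
        // every 2 minutes (the periodic forced GC), and HeapReleased should
        // grow once the ~5-minute scavenger returns idle pages to the OS.
        for range time.Tick(30 * time.Second) {
            var ms runtime.MemStats
            runtime.ReadMemStats(&ms)
            log.Printf("NumGC=%d, last GC %s ago, HeapReleased=%d MiB",
                ms.NumGC,
                time.Since(time.Unix(0, int64(ms.LastGC))).Round(time.Second),
                ms.HeapReleased>>20)
        }
    }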

@rjnn
Contributor Author

rjnn commented Nov 15, 2017

@petermattis the fact that we had an hour where it wasn't freeing memory leads me to believe that something's up with the Go GC scavenging.

@a-robinson
Contributor

While looking this up, I also discovered that the Go GC "forces" a GC every 2min. That seems to be independent of whether a GC is necessary. Cc @a-robinson and @nvanbenschoten as this is the first time I've encountered something that occurs at a 2min frequency in our system.

Oooh that's a very interesting discovery. It pairs nicely with #12713.

@bdarnell
Contributor

The Go GC is not a compacting/copying collector, so fragmentation can result in idle memory that cannot be returned to the system (even a single long-lived value can force a whole page to be retained). There are allocation strategies that can be used to minimize fragmentation, but I think it would be best to just set the expectation that cockroachdb will grow to its configured size and not return memory to the system.

@garvitjuniwal
Contributor

@bdarnell If the Go GC can hold on to that much freed memory due to fragmentation, then how does one reasonably guarantee that cockroach will remain within its configured limit? Does cockroach track process RSS to disallow further allocations? In my experiments, I saw cockroach's RSS go well beyond what was configured (it took up 10 GB when configured to use 6 GB).

Also, it seems unlikely that such a large amount of memory could be held up by fragmentation alone.

@bdarnell
Contributor

These pages that can't be returned to the system can still be reused by go's allocator, so we can reasonably expect the process to stop growing. It just won't necessarily return that memory to the system.

Note that --max-sql-memory and --cache-size are not a complete accounting of all of cockroach's memory - we currently recommend setting each of these flags to 25% of the memory allocated to cockroach and leave the other half to things that are not currently controlled by these flags (over time, we'll move more of our memory usage under the control of these or other limits). A peak of 10GB when --max-sql-memory plus --cache-size are 6GB is not unexpected.

@petermattis
Collaborator

@bdarnell I'm hesitant to chalk this up as an inability to release memory due to fragmentation of the Go heap. That's certainly a possibility, but there might be something else going on here. There is some more detail in https://golang.org/pkg/runtime/#MemStats (HeapIdle vs HeapInuse) that could give us a more definitive answer.

@garvitjuniwal
Contributor

I agree with @petermattis. The allocator not releasing 2 out of 4 GB of freed memory for hours indicates that either the allocator is too rudimentary, or the allocation pattern is adversarial. It's also unclear what caused the memory to be freed up eventually.

These pages that can't be returned to the system can still be reused by go's allocator, so we can reasonably expect the process to stop growing. It just won't necessarily return that memory to the system.

This seems reasonable assuming the allocator is not rudimentary.

Note that --max-sql-memory and --cache-size are not a complete accounting of all of cockroach's memory - we currently recommend setting each of these flags to 25% of the memory allocated to cockroach and leave the other half to things that are not currently controlled by these flags (over time, we'll move more of our memory usage under the control of these or other limits). A peak of 10GB when --max-sql-memory plus --cache-size are 6GB is not unexpected.

Sounds reasonable. But I would say that what I am observing is not a peak usage of 10 GB but sustained usage of 10 GB on an idle system after the peak.

@a-robinson
Contributor

If the system (meaning the entire machine) is idle, though, then the OS has little motivation to reclaim the memory from the process. From golang/go#14521 (comment):

"The go runtime can only advise the operating system that a section of
virtual address space is no longer needed, it cannot force the operating
system to take action. Operating systems, especially linux, and especially
if they are not under significant memory pressure can, and regularly do,
ignore this advice."

If you want to verify whether Go is doing the right thing, you could either dump the go GC stats and examine HeapReleased, or just apply memory pressure to the system (as @arjunravinarayan did) and see whether it gets reclaimed then.

@bdarnell
Contributor

I'm hesitant to chalk this up as an inability to release memory due to fragmentation of the Go heap. That's certainly a possibility, but there might be something else going on here.

Yeah, it's possible. This certainly seems like a lot of memory to be held up by fragmentation. But it's worth noting that returning memory to the system has not historically been a priority for google's allocators (tcmalloc never returned memory to the system when it was first released), so I'm not surprised if go's allocator doesn't do well here.

@vivekmenezes
Contributor

@arjunravinarayan already ran stress on the machine and the OS reclaimed the memory.

@garvitjuniwal
Contributor

"The go runtime can only advise the operating system that a section of
virtual address space is no longer needed, it cannot force the operating
system to take action. Operating systems, especially linux, and especially
if they are not under significant memory pressure can, and regularly do,
ignore this advice."

I don't think this is the cause here. See the madvise man page: http://man7.org/linux/man-pages/man2/madvise.2.html

    Note that, when applied to shared mappings, MADV_DONTNEED might not lead
    to immediate freeing of the pages in the range. The kernel is free to
    delay freeing the pages until an appropriate moment. The resident set
    size (RSS) of the calling process will be immediately reduced however.

Note that the OS may not reclaim the pages immediately, but the RSS will go down. In my case, I was only looking at RSS, which did not go down. But I'm not a Linux expert, so I might be missing something about madvise's behavior.

Yeah, it's possible. This certainly seems like a lot of memory to be held up by fragmentation. But it's worth noting that returning memory to the system has not historically been a priority for google's allocators (tcmalloc never returned memory to the system when it was first released), so I'm not surprised if go's allocator doesn't do well here.

I looked at this open issue yesterday: golang/go#16930 (and a related user bug, golang/go#14045). It seems like the allocator is not "prompt" enough about returning the memory, but from what I gather, if there are no major fragmentation issues, a forced GC will run every 2 minutes and a forced scavenging attempt will be made every 5 minutes, which should release the memory back to the system (using madvise on Linux). So if you wait a few minutes, the memory should be released.

My suggestion is to add some more monitoring to cockroach (maybe an endpoint that exposes the runtime's MemStats? does pprof already have this?) and at least make sure whether Go's allocator is releasing the memory at all.
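
In Go, runtime.ReadMemStats is the relevant API (the standard expvar package also publishes a "memstats" variable at /debug/vars when imported). A minimal sketch of such an endpoint as a standalone program; the /debug/memstats path is invented here:

    package main

    import (
        "encoding/json"
        "net/http"
        "runtime"
    )

    func main() {
        // Hypothetical monitoring endpoint: serialize the full runtime.MemStats,
        // including HeapIdle, HeapInuse, and HeapReleased, as JSON.
        http.HandleFunc("/debug/memstats", func(w http.ResponseWriter, r *http.Request) {
            var ms runtime.MemStats
            runtime.ReadMemStats(&ms)
            w.Header().Set("Content-Type", "application/json")
            json.NewEncoder(w).Encode(&ms)
        })
        http.ListenAndServe("localhost:6060", nil)
    }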

@petermattis petermattis added this to the 1.2 milestone Nov 20, 2017
@petermattis
Collaborator

@arjunravinarayan Given the reproducibility of the original issue, it seems like we should be able to identify where the memory is going. Let's extend the log message in server/status/runtime.go to include HeapIdle and HeapInuse.
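
A sketch of what the extended log line could report (not the actual server/status/runtime.go code, just the idea under discussion):

    package main

    import (
        "log"
        "runtime"
    )

    func logHeapStats() {
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms)
        // HeapIdle - HeapReleased approximates memory Go is holding on to
        // but has not (yet) returned to the OS.
        log.Printf("heap: %d MiB in use, %d MiB idle, %d MiB released to OS",
            ms.HeapInuse>>20, ms.HeapIdle>>20, ms.HeapReleased>>20)
    }

    func main() { logHeapStats() }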

@rjnn
Contributor Author

rjnn commented Nov 20, 2017

Thanks for the lead, @petermattis. I reran the query, with the additions to the log message. I found the following things:

ms.HeapInuse jumps from 400MiB to 2.4GiB and stays there for ~2 minutes (the query, in contrast, runs out of memory and errors out within ~15 seconds). After that, it drops to ~35MiB, just as ms.HeapIdle jumps to 2.6GiB. That memory stays in ms.HeapIdle for the rest of the run. Ten minutes after the start of the query (8 minutes after the switch from HeapInuse to HeapIdle), the 2.6GiB also starts showing up in ms.HeapReleased (although it is then counted in both HeapIdle and HeapReleased). It stays this way for over half an hour.

@a-robinson
Contributor

HeapReleased is included in HeapIdle, so that's expected once go has released the memory back to the OS. Did the process's RSS go down or stay the same?

@petermattis
Collaborator

From the go1.9 source:

	// TODO(austin): heap_released is both useless and inaccurate
	// in its current form. It's useless because, from the user's
	// and OS's perspectives, there's no difference between a page
	// that has not yet been faulted in and a page that has been
	// released back to the OS. We could fix this by considering
	// newly mapped spans to be "released". It's inaccurate
	// because when we split a large span for allocation, we
	// "unrelease" all pages in the large span and not just the
	// ones we split off for use. This is trickier to fix because
	// we currently don't know which pages of a span we've
	// released. We could fix it by separating "free" and
	// "released" spans, but then we have to allocate from runs of
	// free and released spans.

@rjnn
Contributor Author

rjnn commented Nov 20, 2017

@a-robinson, I wasn't careful about noting down when the RSS went down; it could have happened when I applied memory pressure with stress, or by itself after some cutoff time. I'm rerunning the query on a fresh machine and waiting over 15 minutes before applying memory pressure, to be absolutely sure. I will report back.

@rjnn
Contributor Author

rjnn commented Nov 20, 2017

@a-robinson RSS only goes down when memory pressure is applied to the system. It stays at 3.2GiB until I run stress --vm 64 (which I ran after 12 minutes of waiting), and then it immediately drops to ~90 MiB.

@rjnn
Contributor Author

rjnn commented Nov 27, 2017

I think we can close this issue, given that we've tracked it down to the following conclusions:

  1. The extra memory is allocated only when Cockroach is explicitly put into a mode that asks it to use a lot of memory for query processing, with the features that mitigate memory usage (namely DistSQL and external storage) turned off.
  2. The memory is released when there is memory pressure on the system.
  3. In the meantime, the memory is explicitly marked as free to return to the system.

Any objections @petermattis?

@petermattis
Collaborator

No objections from me.

@rjnn rjnn closed this as completed Nov 28, 2017