
Make optimal memory allocation #6

Closed

pPanda-beta opened this issue Sep 7, 2023 · 8 comments

Considering the fixes from #4 (comment) are done, the next step will be to support highly concurrent IO.

So far I have observed that 32MiB is allocated for every IO request, regardless of its size, which increases memory usage unnecessarily:

b := make([]byte, maximumPacketSize)

As an alternative, we can allocate exactly the request size (for both reads and writes):

-	b := make([]byte, maximumPacketSize)
	for {
		var requestHeader protocol.TransmissionRequestHeader
...
		case protocol.TRANSMISSION_TYPE_REQUEST_READ:
			if err := binary.Write(conn, binary.BigEndian, protocol.TransmissionReplyHeader{
				ReplyMagic: protocol.TRANSMISSION_MAGIC_REPLY,
				Error:      0,
				Handle:     requestHeader.Handle,
			}); err != nil {
				return err
			}

+			b := make([]byte, int64(requestHeader.Length))
			n, err := export.Backend.ReadAt(b, int64(requestHeader.Offset))

I have tested this, and since Linux device-mapper devices (if layered on top of an NBD device) issue very small IO requests, this reduced memory usage significantly, by around 80-90% (most IO requests do not exceed 4MiB).

pojntfx self-assigned this Sep 11, 2023
pojntfx added the enhancement label Sep 11, 2023
pojntfx (Owner) commented Sep 11, 2023

Hmmm, I seem to remember that there was a reason why it was this way; I'll take a look at this ASAP. I'll probably add a check that it doesn't exceed the maximum length first (so that a client can't do DoS attacks) and then use the request header!
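
A minimal sketch of that validate-then-allocate approach (it reuses the defaultMaximumRequestSize and ErrInvalidBlocksize identifiers that appear later in this thread, and is an illustration rather than the final commit):

	// Reject requests larger than the allowed maximum before allocating,
	// so a malicious client cannot force arbitrarily large allocations.
	length := requestHeader.Length
	if length > defaultMaximumRequestSize {
		return ErrInvalidBlocksize
	}

	// Size the buffer from the request header instead of always using 32MiB.
	b := make([]byte, length)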

pojntfx added a commit that referenced this issue Oct 4, 2023
… amount of memory that is required per request (see #6)
pojntfx (Owner) commented Oct 5, 2023

So I've tried this patch, but we're getting significantly worse performance:

With the patch:

pojntfx@fedora:~/Projects/r3map$ go build -o /tmp/r3map-benchmark-direct-mount ./cmd/r3map-benchmark-direct-mount/ && sudo /tmp/r3map-benchmark-direct-mount
Open: 10.24412ms
Latency till first two chunks: 362.348µs
Read throughput: 1713.94 MB/s (13711.50 Mb/s)
Read check: Passed
Write throughput: 477.73 MB/s (3821.86 Mb/s)
Sync: 215.875989ms
Write check: Passed
Close: 35.668024ms

Before the patch:

pojntfx@fedora:~/Projects/r3map$ go build -o /tmp/r3map-benchmark-direct-mount ./cmd/r3map-benchmark-direct-mount/ && sudo /tmp/r3map-benchmark-direct-mount
Open: 14.540542ms
Latency till first two chunks: 51.69µs
Read throughput: 2356.33 MB/s (18850.61 Mb/s)
Read check: Passed
Write throughput: 502.19 MB/s (4017.50 Mb/s)
Sync: 197.373918ms
Write check: Passed
Close: 36.308795ms

I'll take a look at why that is exactly.

pojntfx (Owner) commented Oct 5, 2023

Figured it out: if we only increase the buffer size when we actually need it, we get the same or better performance: 184321a
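
A minimal sketch of that "reuse or grow" strategy (an illustration based on the description above, not the exact contents of 184321a):

	b := []byte{}
	for {
		var requestHeader protocol.TransmissionRequestHeader
		if err := binary.Read(conn, binary.BigEndian, &requestHeader); err != nil {
			return err
		}

		// Grow the buffer only when a request needs more than it currently
		// holds; never shrink it, so steady-state traffic reuses the allocation.
		if length := int(requestHeader.Length); length > len(b) {
			b = make([]byte, length)
		}

		// Use b[:requestHeader.Length] as this request's payload buffer.
...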

pojntfx closed this as completed Oct 5, 2023
pPanda-beta (Author) commented

@pojntfx Thanks for the patch.

Basically, neither of these solutions (what we had before vs. what we have now) solves both of the problems at once.

The problems:

  1. (The one I faced) N connections require N*32MiB of memory; in my case just 384 connections were eating 12GiB of (locked) RAM.
  2. If we only allocate "as much as required", frequent allocation and extremely poor garbage collection behaviour result (see #6 (comment)), which impacts overall IOPS and throughput.

With the current solution, once the request size increases we grow the buffer, but we never shrink it again. This properly addresses problem #2, but when we are under memory pressure, 12GiB of resident memory is pretty expensive. In my use case there can be spikes of large read requests, which will grow the buffers up to 12GiB, and that usage will never come down. Adaptive allocation could be applied instead: for example a pool of buffers with sizes at every power of two (4KiB-32MiB), where a request picks the closest larger buffer (upper bound), or a smaller buffer at the cost of a few extra CPU cycles via io.SectionReader + io.CopyBuffer.
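
A hypothetical sketch of such a power-of-two buffer pool (the package name, pool layout, and getBuffer/putBuffer helpers are illustrative only, not part of the existing codebase):

package buffers

import "sync"

// Pools of reusable buffers; index i holds buffers of size 4KiB << i,
// covering 4KiB up to 32MiB.
var pools [14]sync.Pool

func init() {
	for i := range pools {
		size := 4096 << i
		pools[i].New = func() any { return make([]byte, size) }
	}
}

// getBuffer returns a buffer of at least length bytes and the index of the
// pool it came from (-1 if it had to be allocated outside the pools).
func getBuffer(length int) ([]byte, int) {
	for i := range pools {
		if size := 4096 << i; size >= length {
			return pools[i].Get().([]byte), i
		}
	}
	return make([]byte, length), -1
}

// putBuffer hands a pooled buffer back so later requests can reuse it,
// which lets memory be reclaimed between request spikes.
func putBuffer(b []byte, i int) {
	if i >= 0 {
		pools[i].Put(b)
	}
}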

At this point I would request making the strategy configurable so that users can choose whatever suits them best. If performance is the priority, one can choose the grow-and-hold strategy; if low memory overhead is wanted, one can choose an allocate-and-release strategy.

It could also be an interface:

type transfer interface {
	CopyN(backend io.ReaderAt, offset int64, size int, socket io.Writer) (int, error)
}

Or we could introduce a new Backend interface with bufferless functions:

type ReaderTo interface {
	ReadNTo(offset int64, size int, socket io.Writer) (int, error)
}

type WriterFrom interface {
	WriteNFrom(socket io.Reader, writeAtOffset int64, size int) (int, error)
}

type BackendV2 interface {
	ReaderTo
	WriterFrom

	Size() (int64, error)
	Sync() error
}
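
For illustration, a hypothetical file-backed backend could satisfy ReaderTo without any per-request buffer by streaming through an io.SectionReader (the FileBackend type and package name here are made up for the example):

package backend

import (
	"io"
	"os"
)

// FileBackend is a hypothetical example type, not part of the existing codebase.
type FileBackend struct {
	f *os.File
}

// ReadNTo streams size bytes starting at offset directly into the socket.
// io.Copy manages its own small scratch buffer internally, so no buffer
// sized to the whole request is ever allocated.
func (b *FileBackend) ReadNTo(offset int64, size int, socket io.Writer) (int, error) {
	n, err := io.Copy(socket, io.NewSectionReader(b.f, offset, int64(size)))
	return int(n), err
}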

pojntfx (Owner) commented Oct 5, 2023

Hmm, this sounds like a good idea. I'd personally argue against changing the underlying interface (since moving away from the typical io.ReaderAt interface would require a lot of changes in downstreams), but adding an option to choose between the two strategies (in the Options struct) seems like the right approach. What do you think?
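
One hypothetical shape for such an option (the field and constant names here are illustrative, not taken from the actual Options struct):

// Illustrative only; the real Options struct may differ.
type BufferStrategy int

const (
	// Keep one buffer per connection and only grow it (favours throughput).
	BufferStrategyGrowAndHold BufferStrategy = iota
	// Size a buffer per request and release it afterwards (favours low resident memory).
	BufferStrategyAllocateAndRelease
)

type Options struct {
	// ... existing options ...

	// BufferStrategy selects how transmission buffers are managed.
	BufferStrategy BufferStrategy
}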

pojntfx (Owner) commented Oct 5, 2023

Another option could also be to do this:

	b := []byte{}
	for {
		var requestHeader protocol.TransmissionRequestHeader
		if err := binary.Read(conn, binary.BigEndian, &requestHeader); err != nil {
			return err
		}

		if requestHeader.RequestMagic != protocol.TRANSMISSION_MAGIC_REQUEST {
			return ErrInvalidMagic
		}

		length := requestHeader.Length
		if length > defaultMaximumRequestSize {
			return ErrInvalidBlocksize
		}

		if length != uint32(len(b)) {
			b = make([]byte, length)
		}

This way we only reallocate the buffer if the length we require differs from the one we currently have. In my local r3map benchmark, this gives the same performance as the "reuse or grow" strategy, and it should help fix your problem.
