…o tasks
Currently, the following situation causes the export worker to consume more
memory than necessary:
* A user exports a project with images included, and
* The project contains multiple tasks with video chunks.
The main reason for this is that the `image_maker_per_task` dictionary in
`CVATProjectDataExtractor.__init__` indirectly references a distinct
`FrameProvider` for each task, which, in turn, indirectly references an
`av.containers.Container` object corresponding to the most recent chunk
opened in that task.
Initially, none of the chunks are opened, so the memory usage is low. But as
Datumaro iterates over each frame, it eventually requests at least one image
from each task of the project. This causes the corresponding `FrameProvider`
to open a chunk file and keep a handle to it. The only way for a
`FrameProvider` to close this chunk file is to open a new chunk when a frame
from that chunk is requested. Therefore, each `FrameProvider` keeps at least
one chunk open from the moment Datumaro requests the first frame from its
task and _until the end of the export_.
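The lifetime described above can be sketched as follows. This is an illustrative model, not the actual CVAT code: class names, the open-on-demand/close-on-switch logic, and the chunk numbering are all stand-ins.

```python
class ChunkHandle:
    """Stands in for an open chunk container; a real handle would also
    hold FFmpeg-allocated buffers and a file descriptor."""

    def __init__(self, chunk_id):
        self.chunk_id = chunk_id
        self.closed = False

    def close(self):
        self.closed = True


class FrameProviderSketch:
    """Opens chunks lazily and closes one only when a *different* chunk
    is requested, so the last chunk stays open until the provider dies."""

    def __init__(self):
        self._handle = None

    def get_frame(self, chunk_id, frame_number):
        if self._handle is None or self._handle.chunk_id != chunk_id:
            if self._handle is not None:
                self._handle.close()
            self._handle = ChunkHandle(chunk_id)
        return (chunk_id, frame_number)


# One provider per task, as with image_maker_per_task:
providers = {task_id: FrameProviderSketch() for task_id in range(5)}

# The export eventually requests at least one frame from every task...
for provider in providers.values():
    provider.get_frame(chunk_id=0, frame_number=0)

# ...after which every task keeps a chunk open until the export finishes.
open_chunks = sum(
    p._handle is not None and not p._handle.closed for p in providers.values()
)
print(open_chunks)  # 5
```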
This manifests as the export worker's memory usage steadily growing as
Datumaro goes from task to task. An open chunk consumes memory for the
Python objects that represent it and, more importantly, for any C-level
buffers allocated by FFmpeg, which can be significant: in my testing, the
per-chunk memory was on the order of tens of megabytes. An open chunk also
takes up a file descriptor.
The fix for this is conceptually simple: ensure that only one
`FrameProvider` object exists at a time. AFAICS, when a project is exported,
all frames from a given task are grouped together, so we shouldn't need
multiple tasks' chunks to be open at the same time anyway.
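One way to sketch the idea is a small cache that holds at most one provider and closes the previous one whenever a different task is requested. The names and structure here are hypothetical, not the actual implementation in this patch:

```python
class SingleProviderCache:
    """Holds at most one provider at a time; switching tasks closes the
    previous provider and, with it, its open chunk."""

    def __init__(self, make_provider, close_provider):
        self._make = make_provider
        self._close = close_provider
        self._task_id = None
        self._provider = None

    def get(self, task_id):
        if task_id != self._task_id:
            self.close()
            self._provider = self._make(task_id)
            self._task_id = task_id
        return self._provider

    def close(self):
        if self._provider is not None:
            self._close(self._provider)
            self._provider = None
            self._task_id = None


closed = []
cache = SingleProviderCache(lambda t: f"provider-{t}", closed.append)
cache.get(1)
cache.get(1)  # same task: the provider is reused, nothing is closed
cache.get(2)  # new task: provider-1 is closed before provider-2 is made
cache.close()
print(closed)  # ['provider-1', 'provider-2']
```

Because all frames of a task are grouped together during a project export, each task triggers at most one switch, so at most one chunk-holding provider is alive at any moment.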
I had to restructure the code to make this work, so I took the opportunity
to fix a few other things as well:
* The code currently relies on garbage collection of PyAV's `Container`
objects to free the resources used by them. Even though
`VideoReader.__iter__` uses a `with` block to close the container, the
`with` block can only do so if the container is iterated all the way to
the end. This doesn't happen when `FrameProvider` uses it, since
`FrameProvider` seeks to the needed frame and then stops iterating.
I don't have evidence that this causes any issues at the moment, but
Python does not guarantee that objects are GC'd promptly, so this could
become another source of excessive memory usage.
I added cleanup methods (`close`/`unload`/`__exit__`) at several layers of
the code to ensure that each chunk is closed as soon as it is no longer
needed.
* I factored out and merged the code used to generate `dm.Image` objects
when exporting projects and jobs/tasks. It's likely possible to merge even
more code, but I don't want to expand the scope of the patch too much.
* I fixed a seemingly useless optimization in the handling for 3D frames.
Specifically, `CVATProjectDataExtractor` performs queries of the form:
  ```
  task.data.images.prefetch_related().get(id=i)
  ```
The prefetch here looks useless, as only a single object is queried - it
wouldn't be any less efficient to just let Django fetch the `Image`'s
`related_files` when needed.
I rewrote this code to prefetch `related_files` for all images in the task
at once instead.
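To illustrate the difference, here is a pure-Python stand-in for the two query patterns. `QueryCounter` models database round trips only; it is not Django, and the function names are made up for the example:

```python
class QueryCounter:
    """Counts simulated database round trips."""

    def __init__(self):
        self.queries = 0


db = QueryCounter()


def get_one_image_with_prefetch(image_id):
    # Old pattern: prefetch_related().get(id=i) per frame. The prefetch
    # only covers the single row just fetched, so it saves nothing.
    db.queries += 2  # one query for the image, one for its related_files
    return image_id


def get_all_images_with_prefetch(image_ids):
    # New pattern: prefetch related_files for every image in the task at
    # once, then look frames up in memory.
    db.queries += 2  # one query for all images, one for all related_files
    return {i: i for i in image_ids}


frame_ids = range(100)

for i in frame_ids:
    get_one_image_with_prefetch(i)
per_frame_queries = db.queries  # 200

db.queries = 0
images = get_all_images_with_prefetch(frame_ids)
batched_queries = db.queries  # 2
print(per_frame_queries, batched_queries)  # 200 2
```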