
Fix excessive memory usage when exporting projects with multiple video tasks #7374

Merged 2 commits on Jan 24, 2024

Commits on Jan 19, 2024

  1. Fix excessive memory usage when exporting projects with multiple video tasks
    
    Currently, the following situation causes the export worker to consume more
    memory than necessary:
    
    * A user exports a project with images included, and
    * The project contains multiple tasks with video chunks.
    
    The main reason for this is that the `image_maker_per_task` dictionary in
    `CVATProjectDataExtractor.__init__` indirectly references a distinct
    `FrameProvider` for each task, which, in turn, indirectly references an
    `av.containers.Container` object corresponding to the most recent chunk
    opened in that task.
    
    Initially, none of the chunks are opened, so the memory usage is low. But as
    Datumaro iterates over each frame, it eventually requests at least one image
    from each task of the project. This causes the corresponding `FrameProvider`
    to open a chunk file and keep a handle to it. The only way for a
    `FrameProvider` to close this chunk file is to open a new chunk when a frame
    from that chunk is requested. Therefore, each `FrameProvider` keeps at least
    one chunk open from the moment Datumaro requests the first frame from its
    task and _until the end of the export_.
    
    This manifests as the export worker's memory usage growing steadily as
    Datumaro moves from task to task. An open chunk consumes memory for the
    Python objects that represent it and, more importantly, for any C-level
    buffers allocated by FFmpeg, which can be significant. In my testing, the
    per-chunk memory was on the order of tens of megabytes. An open chunk also
    ties up a file descriptor.
    
    The fix for this is conceptually simple: ensure that only one
    `FrameProvider` object exists at a time. AFAICS, when a project is exported,
    all frames from a given task are grouped together, so we shouldn't need
    multiple tasks' chunks to be open at the same time anyway.
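    A minimal sketch of this idea (the class and method names here are
    hypothetical, not the actual CVAT code): cache at most one provider at a
    time, and close the previous one as soon as a frame from a different task
    is requested.

```python
class SingleProviderCache:
    """Keeps at most one provider alive, keyed by task id.

    Switching to a different task closes the previous provider (and
    therefore its open chunk) before a new one is created.
    """

    def __init__(self, provider_factory):
        self._factory = provider_factory
        self._task_id = None
        self._provider = None

    def get(self, task_id):
        if self._task_id != task_id:
            if self._provider is not None:
                # Release the open chunk immediately, not at GC time.
                self._provider.close()
            self._provider = self._factory(task_id)
            self._task_id = task_id
        return self._provider

    def close(self):
        if self._provider is not None:
            self._provider.close()
            self._provider = None
            self._task_id = None
```

    Because project exports group all frames of a task together, each provider
    is created exactly once under this scheme, so the cache never thrashes.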
    
    I had to restructure the code to make this work, so I took the opportunity
    to fix a few other things, as well:
    
    * The code currently relies on garbage collection of PyAV's `Container`
      objects to free the resources used by them. Even though
      `VideoReader.__iter__` uses a `with` block to close the container, the
      `with` block can only do so if the container is iterated all the way to
      the end. This doesn't happen when `FrameProvider` uses it, since
      `FrameProvider` seeks to the needed frame and then stops iterating.
    
      I don't have evidence that this causes any issues at the moment, but
      Python does not guarantee that objects are GC'd promptly, so this could
      become another source of excessive memory usage.
    
      I added cleanup methods (`close`/`unload`/`__exit__`) at several layers of
      the code to ensure that each chunk is closed as soon as it is no longer
      needed.
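    A toy illustration of the underlying Python behavior (not the actual
    `VideoReader` code): a `with` block inside a generator only runs its
    cleanup when the generator is exhausted, explicitly closed, or garbage
    collected - a consumer that stops iterating early leaves the resource
    open indefinitely.

```python
class Resource:
    """Stand-in for a PyAV container: tracks whether it was closed."""
    def __init__(self):
        self.closed = False
    def __enter__(self):
        return self
    def __exit__(self, *exc_info):
        self.closed = True

def frames(resource):
    # Like VideoReader.__iter__: the `with` block closes the resource,
    # but only when this generator finishes or is closed.
    with resource:
        for i in range(100):
            yield i

res = Resource()
gen = frames(res)
next(gen)                # consumer reads one frame, then stops iterating
assert not res.closed    # the `with` block has NOT run yet

gen.close()              # explicit cleanup, as the added close() methods do
assert res.closed
```

    `gen.close()` raises `GeneratorExit` at the suspended `yield`, which
    triggers the `with` block's `__exit__` deterministically instead of
    waiting for garbage collection.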
    
    * I factored out and merged the code used to generate `dm.Image` objects
      when exporting projects and jobs/tasks. It's likely possible to merge even
      more code, but I don't want to expand the scope of the patch too much.
    
    * I fixed a seemingly useless optimization in the handling for 3D frames.
      Specifically, `CVATProjectDataExtractor` performs queries of the form:
    
          task.data.images.prefetch_related().get(id=i)
    
      The prefetch here looks useless, as only a single object is queried - it
      wouldn't be any less efficient to just let Django fetch the `Image`'s
      `related_files` when needed.
    
      I rewrote this code to prefetch `related_files` for all images in the task
      at once instead.
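    A runnable toy showing why the batched prefetch helps (the "ORM" here is
    faked with a query counter; the real change uses Django's
    `prefetch_related("related_files")` on the task's image queryset):
    fetching per image issues one query per image, while a single batched
    prefetch issues one query for the whole task.

```python
class FakeDB:
    """Stand-in for the database: counts how many queries are issued."""

    def __init__(self, related):
        self.related = related  # image id -> list of related files
        self.queries = 0

    def get_related(self, image_id):
        # One query per call: the old per-image pattern.
        self.queries += 1
        return self.related[image_id]

    def get_related_bulk(self, image_ids):
        # A single query for all images: the new batched pattern.
        self.queries += 1
        return {i: self.related[i] for i in image_ids}

db = FakeDB({1: ["a.bin"], 2: ["b.bin"], 3: []})

# Old pattern: one query per image (N queries for N images).
per_image = {i: db.get_related(i) for i in [1, 2, 3]}
assert db.queries == 3

# New pattern: one query for the whole task.
db.queries = 0
batched = db.get_related_bulk([1, 2, 3])
assert db.queries == 1
assert batched == per_image
```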
    SpecLad committed Jan 19, 2024
    5e1698b

Commits on Jan 24, 2024

  1. 6de6db8