
Outlier detection #120

Merged: 41 commits merged into master from sim-sort on Jan 25, 2024

Conversation

mzur commented Dec 6, 2023

Resolves #88

Notes:

  • Move feature vector code from biigle/maia to biigle/largo.
  • Make code in biigle/maia reuse/subclass code from biigle/largo to generate feature vectors.
  • Merge the vector database into the regular database (see explanation below). This is done in maia#150 ("Move feature vector models to default database connection").
  • Implement ImageAnnotationLabelFeatureVector and VideoAnnotationLabelFeatureVector models
  • Extend GenerateAnnotationPatch jobs to generate feature vectors, too (in the same job).
  • GenerateAnnotationPatch should update existing feature vectors if they already exist
  • Handle feature vectors added because of newly added annotation labels (reuse the existing feature vector).
  • Largo patches are computed with a padding but the feature vectors are generated from the "tight" box around the annotation.
  • Update the feature vector Python script to enable a "single-file" mode where it reads a single file and outputs the feature vector to stdout. This can be used in GenerateAnnotationPatch to avoid reading and writing additional files (see the first sketch after this list).
    • Maybe leave the implementation with the CSV file exclusive to MAIA after all? Remove the Trait again.
    • Or use a RAMDisk to exchange the files (use /dev/shm which is 64 MB and should be enough by default in Docker?)
  • Handle feature vectors of whole frame annotations, too.
  • Unify CPU and GPU workers. Both use the same (PyTorch) Docker image and adapt to the available hardware.
    • Or add pytorch CPU to the CPU worker image because it is much smaller?
  • Implement the UI
    • This will be implemented as a "sorting" tab as explained in Patch sorting #97. The initial sorting options will be "ID" (i.e. ascending order of creation date) and "outliers" (i.e. the outliers first). Outliers will be determined by computing the average feature vector of all shown annotations and then sorting by most dissimilar (see the second sketch after this list). Users should be able to choose the sorting direction just as with the volume overview.
    • Volume UI
    • Project UI
  • Transform the "GenerateAnntationPatch" job to a "ProcessAnnotatedFile" job that generates all patches, feature vectors (and SVGs Add annotation outlines to Largo thumbnails #119) for a file. The job has an optional "only" argument where it can be limited to generate only stuff for a single annotation. It also has optional arguments to disable generation of patches, feature vectors (and SVGs).
  • Implement a console command to generate feature vectors of existing annotations. This uses only the Largo annotation patches (thumbnails) so the original images don't have to be downloaded/processed. This is a tradeoff between processing speed and feature vector quality.
    • Test if the sorting results are about the same as with full-resolution feature vectors.
    • The sorting is worse than with full-resolution feature vectors. We need to find a solution to initialize the vectors from the original files.
    • I've implemented an updated FV initialization based on thumbnails in the sim-sort-thumbs branch. This method uses the approximated bounding box of the annotation instead of the whole thumbnail to generate the FV. Maybe this can be used to generate FVs for remote volumes, whereas we generate from the original files for locally stored data. It's hard to determine how well the sorting works with this, as it looks OK but is not identical to the sorting based on the original files.
    • With the new "ProcessFile" job, the console command here will be the existing generate-missing command that submits one job per file. The command should be made more intelligent so it checks for missing data file by file and groups submitted jobs (with $only annotations) by file.
    • The changes in sim-sort-thumbs seem to work quite well compared to the "real thing" (I was finally able to compare the sorting on real data). So we can use this to initialize all remote volumes.
    • Update generate-missing with the new "ProcessAnnotatedFile" jobs.
    • Finalize and merge the changes of sim-sort-thumbs.
  • Update the manual with sorting instructions
  • Implement synchronization between the regular database and the vector database: If an annotation changes, update all feature vectors of the annotation, if a label changes, add/remove a feature vector (there can be several places where this happens).
  • Clean up feature vectors on annotation/file/volume/project delete.
  • Enable the index from the beginning so it doesn't take long to compute for LabelBOT. It's unclear how the index should work (with partitioned tables etc.).
  • Integrate biigle/largo in biigle/schema
  • Make sure feature vector generation works with the sync import
    • Check why annotation patches are processed before image thumbnails are generated while at it.
  • Test the feature for images and videos (by importing some volumes)
  • Don't execute the CopyFeatureVector job when an annotation is created
  • Make call to CopyFeatureVector in Largo save controller more efficient (copy in batches with insert?)
    • Too complicated. Can do this if we run into performance issues.
  • Create an issue to implement the "MAIA style" sorting by a single patch, too (later): #125 ("Sort by similarity").
    • If implemented like in MAIA, make the button to select a reference hidable in the (upcoming) settings tab.
    • Otherwise it could be implemented in the sorting tab. Select the sorter, then click on a patch and then it will sort. But the MAIA style would be more intuitive I think.
  • Maybe think of a different title for sorting by "ID"? Or add a help text? Also update this in the volume overview if changed.
    • I chose "created" for Largo but left "ID" in the volume overview, as it makes more sense there.
  • Update the clone volume and import/sync features to use the new "ProcessAnnotatedFile" jobs.
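
For the "single-file" mode above, here is a minimal sketch of what such a script could look like. It assumes a DINO backbone loaded via torch.hub; the actual model, preprocessing, and output format in biigle/largo may differ:

```python
# Minimal sketch of a "single-file" mode for the feature vector script.
# Assumptions: a DINO ViT-S/8 backbone from torch.hub and standard ImageNet
# preprocessing; the real script may use a different model and transforms.
import sys

import torch
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def main(path: str) -> None:
    # Adapt to the available hardware (cf. the unified CPU/GPU workers above).
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.hub.load("facebookresearch/dino:main", "dino_vits8")
    model = model.to(device).eval()
    image = transform(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        vector = model(image).squeeze(0).cpu()
    # Write the vector to stdout so the caller needs no temporary files.
    print(",".join(f"{x:.6f}" for x in vector.tolist()))

if __name__ == "__main__":
    main(sys.argv[1])
```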
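And a minimal sketch of the "outliers" ranking from the sorting tab item, assuming the feature vectors are already loaded into a NumPy array (all names are illustrative; in BIIGLE this would rather happen in the database, see the query sketch further below):

```python
# Sketch of the "outliers" sort: average the feature vectors of all shown
# annotations and rank by dissimilarity to that mean, most dissimilar first.
import numpy as np

def sort_outliers_first(annotation_ids: list[int], vectors: np.ndarray) -> list[int]:
    mean = vectors.mean(axis=0)
    # Cosine distance to the mean vector; larger means more of an outlier.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(mean) + 1e-12
    distances = 1.0 - (vectors @ mean) / norms
    order = np.argsort(-distances)  # descending: outliers first
    return [annotation_ids[i] for i in order]
```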

@mzur mzur mentioned this pull request Dec 8, 2023
mzur commented Dec 12, 2023

Adding annotation feature vectors poses a problem with the current approach of a separate vector database (for MAIA). If the vector database gets an added image_annotation_feature_vectors table, this table must be kept in sync with the main database. This means that, whenever an annotation is added, modified or deleted, the feature vector in the vector database has to be updated, too. What's more, for Largo and later LabelBOT, we also need the label_id for the annotation. However, each annotation can have multiple label IDs. Here we can either add the label_id to the table and duplicate the entries for the same annotation but with different label IDs or add a pivot table for annotation IDs and label IDs. But each of these must be kept in sync with any label change now, too. This is quite risky, as we might miss some locations where annotation labels are modified. Also, it might duplicate data unnecessarily.

So I'm now thinking about putting all the feature vectors back into the regular database. This way we could use joins and foreign key constraints to get label IDs without duplication and also automatically delete items when they are no longer needed. Annotation feature vectors can be added/modified/deleted with the same logic that handles the annotation thumbnails.
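
As an illustration, here is a hypothetical sketch of what one merged feature vector table could look like (all names, the pgvector extension, and the vector dimension are assumptions, not the final schema):

```python
# Hypothetical DDL for a merged feature vector table. With label_id stored
# next to the vector and ON DELETE CASCADE constraints, rows disappear
# automatically when annotations or labels are deleted, instead of being
# synced by hand. Assumes the pgvector extension; dimension is a placeholder.
import psycopg2

DDL = """
CREATE TABLE image_annotation_label_feature_vectors (
    id bigserial PRIMARY KEY,
    annotation_id bigint NOT NULL
        REFERENCES image_annotations (id) ON DELETE CASCADE,
    label_id bigint NOT NULL
        REFERENCES labels (id) ON DELETE CASCADE,
    volume_id bigint NOT NULL,
    vector vector(384) NOT NULL
);
"""

with psycopg2.connect("dbname=biigle") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```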

The main reason to separate the vector database from the main database was that the backups could be separated, too. I don't want to back up a 100 GB database every 10 minutes. So I'm now experimenting with pg_dump --exclude-table to create a backup that does not include feature vectors (this makes it necessary to create separate tables for all feature vectors and, e.g., not just add another column to image_annotations). I'm also trying pg_dump -Fc --table to create a dump that contains only the feature vector tables. The -Fc custom archive format is necessary so that individual tables can be selected during the import.

mzur commented Dec 12, 2023

Dumping with -Fc takes a very long time (maybe because it's compressing at the same time).

I'll now try a combination of pg_dump --exclude-table-data "*_feature_vectors" and pg_dump --table "*_feature_vectors" --data-only with the (faster) plain format and see if this can be successfully restored.
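
A sketch of that combination as a backup script (paths and the database name are placeholders; the flags are the ones quoted above):

```python
# Sketch of the split backup: a frequent plain-format dump of everything
# except the feature vector rows, and an infrequent data-only dump of just
# the feature vector tables. Paths and the database name are placeholders.
import subprocess

DB = "biigle"

# Main backup (frequent): full schema and all data except feature vectors.
subprocess.run([
    "pg_dump",
    "--exclude-table-data", "*_feature_vectors",
    "--file", "/backup/main.sql",
    DB,
], check=True)

# Feature vector backup (infrequent): data of the feature vector tables only.
subprocess.run([
    "pg_dump",
    "--table", "*_feature_vectors",
    "--data-only",
    "--file", "/backup/feature_vectors.sql",
    DB,
], check=True)
```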

mzur commented Dec 12, 2023

Here is a possible strategy to migrate the existing setup:

  • Create MAIA feature vector tables with the exact same names and columns in the regular database (via Laravel migrations). Proposal/candidate IDs can become foreign keys with constraints, so they don't have to be manually deleted any more.
  • Update the MAIA code to use these tables instead.
  • Instruct instance admins to generate a database dump of the vector database with pg_dump --data-only and then restore the dump to the regular database (see the sketch below).
  • After this the separate database can be removed.
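
The dump/restore step of this strategy might look like the following (host and database names are placeholders):

```python
# Sketch of the one-time migration: dump only the data of the old vector
# database and restore it into the regular database, where the empty tables
# already exist via migrations. Connection details are placeholders.
import subprocess

dump = subprocess.run(
    ["pg_dump", "--data-only", "--host", "vector-db", "biigle_vectors"],
    check=True, capture_output=True,
)
subprocess.run(
    ["psql", "--host", "main-db", "biigle"],
    input=dump.stdout, check=True,
)
```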

mzur commented Dec 12, 2023

Here is what I found:

  • Backup and restore of separated feature vectors and a single database is possible with the commands shown above
  • I will merge the separate vector database into the regular database to benefit from foreign key constraints. The main argument for separate databases was the backup and this point is now moot.
  • Joins will not be necessary for sorting in Largo because the annotation feature vectors can also directly store the associated volume ID (see the query sketch below).
  • Joins may be impractical for LabelBOT because they may be too slow with large tables (this needs to be investigated once this has landed in Largo and we have the actual database). This means that we might store the label_id and label_tree_id directly in the feature vector table, too, which is only possible by manually managing add/modify/delete operations on annotation labels, the very thing that prompted me to investigate this in the first place. Performance is crucial here, so this may really be necessary.
  • An alternative to adding redundant information to the annotation feature vector tables may be a Postgres materialized view that is refreshed regularly. This would double (!) the storage requirements for the feature vectors, though, which almost immediately rules this solution out. Such a table might also be refreshed regularly with a scheduled job in Laravel. Trading off a slightly outdated table in favor of less implementation effort/logic may be a viable solution.
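
To illustrate the join-free sorting mentioned above, here is a sketch of the kind of query this enables, assuming pgvector and the table layout sketched earlier (<=> is pgvector's cosine distance operator; all names are illustrative):

```python
# Sketch of a join-free "outliers first" query for one volume. The subquery
# averages all feature vectors of the volume; ordering by cosine distance
# (<=>) to that mean in descending order puts the outliers first.
import psycopg2

QUERY = """
SELECT annotation_id
FROM image_annotation_label_feature_vectors
WHERE volume_id = %(volume)s
ORDER BY vector <=> (
    SELECT AVG(vector)
    FROM image_annotation_label_feature_vectors
    WHERE volume_id = %(volume)s
) DESC
"""

with psycopg2.connect("dbname=biigle") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY, {"volume": 123})
        outliers_first = [row[0] for row in cur.fetchall()]
```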

@mzur mzur mentioned this pull request Jan 5, 2024
@mzur mzur mentioned this pull request Jan 18, 2024
lehecht added a commit that referenced this pull request Jan 25, 2024
Keep SVG generation call in GenerateImage/VideoAnnotationPatch
due to changes in #120.
@mzur mzur marked this pull request as ready for review January 25, 2024 12:51
@mzur mzur self-assigned this Jan 25, 2024
@mzur mzur merged commit dcdd6da into master Jan 25, 2024
2 checks passed
@mzur mzur deleted the sim-sort branch January 25, 2024 14:12