
New object detection and image segmentation widgets #6

Closed
NielsRogge opened this issue Jun 4, 2021 · 32 comments
NielsRogge (Contributor) commented Jun 4, 2021

For the DETR model, which will soon be part of HuggingFace Transformers (see huggingface/transformers#11653 (comment)), it would be cool to have object detection and image segmentation (actually panoptic segmentation) inference widgets.

Similar to the image classification widget, a user should be able to upload/drag an image to the widget, which is then annotated with bounding boxes and classes (in case of object detection), or turned into a segmentation map (in case of panoptic segmentation).

Here are 2 notebooks which illustrate what you can do with the head models of DETR:

DetrForObjectDetection: https://colab.research.google.com/drive/170dlGN5s37uaYO32XKUHfPklGS8oB059?usp=sharing
DetrForSegmentation: https://colab.research.google.com/drive/1hTGTPGBLPRY1QkLmG7P9air6v04tcXUL?usp=sharing

The models are already on the hub: https://huggingface.co/models?search=facebook/detr

cc @LysandreJik

julien-c (Member) commented Jun 4, 2021

Really cool!

So I think the steps are:

  • define and agree on an API shape (it should be future-proof for other potential models with the same task). I usually try to take inspiration from the existing hosted APIs (Google Vision, etc.) that do those tasks
  • implement those models in the Inference API, or here in api-inference-community (@Narsil and team can review a first draft)
  • build a widget (<= we'll open source a current snapshot of our widget code in this repo in the next few days)

NielsRogge (Contributor, Author) commented Jun 4, 2021

Ok, I think it might make sense to divide the API into object detection, semantic segmentation, instance segmentation, and panoptic segmentation. This blog post explains the difference between semantic/instance/panoptic segmentation well.

The input/output of the various tasks is as follows:

  • object detection: input = RGB image. Output: RGB image with bounding boxes and corresponding instance labels.
  • semantic segmentation: input = RGB image. Output: per-pixel semantic class label.
  • instance segmentation: input = RGB image. Output: per-object (instance) mask and instance label.
  • panoptic segmentation: input = RGB image. Output: per-pixel semantic class + optional instance labels.
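
A minimal sketch of how those outputs could be typed on the widget side (field names are illustrative only, not a settled API):

// Sketch: possible output shapes for the tasks listed above (illustrative field names).
interface DetectedObject {
  label: string;                                                   // class name, e.g. "cat"
  score: number;                                                   // confidence in [0, 1]
  box: { xmin: number; ymin: number; xmax: number; ymax: number }; // bounding box in pixels
}

interface SegmentMask {
  label: string;      // semantic class (optionally with an instance id for instance/panoptic)
  score: number;      // confidence in [0, 1]
  mask: boolean[][];  // per-pixel membership, same height/width as the input image
}

type ObjectDetectionOutput = DetectedObject[];
type ImageSegmentationOutput = SegmentMask[];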

julien-c (Member) commented Jun 4, 2021

Agreed. Though we'll probably group all of those under a generic image-segmentation task on the hub side, for ease of use/accessibility.

@LysandreJik (Member) commented:

We just open sourced widgets in huggingface/huggingface_hub#87 if you want to take a look @NielsRogge! We'll write a document on how to get started but feel free to try it out locally!

@julien-c (Member) commented:

maybe @mishig25 can take a look at the widget side of this!

Narsil commented Jun 23, 2021

Proposed a PR on the transformers side: huggingface/transformers#12321

@mishig25 I think image manipulation will be a bit tricky to do client-side, hence I propose a "reduction" mechanism for the actual API (that goes on top of the pipeline, or within the pipeline) to simply output 1 image. What do you think?

If we kept the actual masks, we could do improved UX, with maybe mouse-hover effects and so on, but I am not 100% sure how easy that is to do in JS (doing it in Python is somewhat trivial, maybe half a day to fix all the odd issues like label placement and so on).

@julien-c (Member) commented:

Tagging @gary149 and @severo as well, but I think client-side rendering can/will be way cooler (interactivity as you mention, quality of the UX, etc.)

Also, if someone ends up calling the API outside of the widget (as in an actual programmatic use case), I don't think they will want the rendered output.

severo (Collaborator) commented Jun 23, 2021

I love doing this kind of processing in JS

Narsil commented Jun 23, 2021

Yes, API-wise we need to be able to support raw masks for sure!

@mishig25 (Collaborator) commented:

@severo @julien-c please let me know if there's anything particular you'd like me to work on. Otherwise, I can start digging more into the Visualizer module of detectron2 and see how the desired results can be achieved with JS & web interactivity.

mishig25 (Collaborator) commented Jun 25, 2021

Assuming the API output will be

[
   {
        "mask": // Array<Array<bool>>, a 2D array of booleans
        "score": // float
        "label": // str
   },
   // ...
]

In terms of visualizing masks, which option would you suggest:

  1. <canvas>-based approach
  2. <img>-based approach that uses CSS property mask-image
  3. something else entirely

Unless there is an objection to using the <canvas> element, I think the <canvas>-based approach will be the most straightforward (I might be wrong).

severo (Collaborator) commented Jun 25, 2021

I think it's the best approach too

julien-c (Member) commented Jun 25, 2021

I'm not a pro at frontend drawing technologies: what are the pros & cons of SVG vs. canvas? What about WebGL, maybe via three.js? 🤯

(just out of curiosity!)

severo (Collaborator) commented Jun 25, 2021

SVG is really meant for vector drawings. We can incorporate bitmap images in it, but it's not really natural, and you'd have to generate the images anyway (using canvas).
Managing mask images is naturally done by modifying the pixels of a canvas (see https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API/Tutorial/Pixel_manipulation_with_canvas for example, or https://observablehq.com/@severo/voronoi-stippling-on-elevation-dem).
Re: WebGL/three.js... why not! But it's a bit the same as SVG: we have to generate the textures, then apply them to the 3D geometries (like in https://observablehq.com/@severo/voronoi-cloth).
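
A rough sketch of that pixel-manipulation route, assuming the Array<Array<bool>> mask shape proposed above (illustrative only, not a settled API):

// Sketch: paint a boolean mask as a semi-transparent colored overlay on a <canvas>.
// The <canvas> is assumed to sit as an overlay positioned above the image element;
// the browser composites the two when rendering.
function drawMask(
  canvas: HTMLCanvasElement,
  mask: boolean[][],
  rgba: [number, number, number, number] = [255, 0, 0, 128]
): void {
  const height = mask.length;
  const width = height > 0 ? mask[0].length : 0;
  canvas.width = width;
  canvas.height = height;

  const ctx = canvas.getContext("2d");
  if (!ctx || width === 0 || height === 0) return;

  const imageData = ctx.createImageData(width, height);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      if (mask[y][x]) {
        const i = (y * width + x) * 4;
        imageData.data[i] = rgba[0];     // R
        imageData.data[i + 1] = rgba[1]; // G
        imageData.data[i + 2] = rgba[2]; // B
        imageData.data[i + 3] = rgba[3]; // A (semi-transparent so the image shows through)
      }
    }
  }
  ctx.putImageData(imageData, 0, 0);
}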

@mishig25 (Collaborator) commented:

Widget-wise:
SVG would be better suited for object detection (bounding boxes), as sketched below.
WebGL/three.js would be better suited for anything that relates to 3D or complex 2D graphics (like 3D object generation from images).
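
A rough sketch of the SVG route for a single labeled bounding box (the xmin/ymin/xmax/ymax box fields are illustrative, not a settled API):

// Sketch: append one labeled <rect> to an <svg> overlay sized to the image.
function drawBox(
  svg: SVGSVGElement,
  box: { xmin: number; ymin: number; xmax: number; ymax: number },
  label: string,
  color = "red"
): void {
  const ns = "http://www.w3.org/2000/svg";

  const rect = document.createElementNS(ns, "rect");
  rect.setAttribute("x", String(box.xmin));
  rect.setAttribute("y", String(box.ymin));
  rect.setAttribute("width", String(box.xmax - box.xmin));
  rect.setAttribute("height", String(box.ymax - box.ymin));
  rect.setAttribute("fill", "none");
  rect.setAttribute("stroke", color);
  rect.setAttribute("stroke-width", "2");
  svg.appendChild(rect);

  const text = document.createElementNS(ns, "text");
  text.setAttribute("x", String(box.xmin));
  text.setAttribute("y", String(Math.max(box.ymin - 4, 10))); // keep the label inside the viewport
  text.setAttribute("fill", color);
  text.textContent = label;
  svg.appendChild(text);
}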

mishig25 (Collaborator) commented Jun 30, 2021

I'll create a draft PR once I refactor the code.
Assuming the API output will be [{mask, score, label}, ...] and based on @gary149's feedback, here is a screenshot (please add any other feedback as a comment below):
[screenshot: image segmentation widget output]

Currently it supports highlighting on mouseover; should I add a mobile version of it, like highlighting on touch/tap?

@osanseviero (Member) commented:

😮

@julien-c (Member) commented:

looks really cool!

@NielsRogge (Contributor, Author) commented:

Wow, super cool!! Really nice to see the cats picture still going strong haha (it's part of the COCO evaluation set; they are not my cats, sadly).

I should add the id2label mapping to the config file, so that we can see the actual labels instead of LABEL_93 and so on.

mishig25 (Collaborator) commented Jul 7, 2021

Pushed updates in branches widget-image-segmentation & widget-object-detection

Updates:

  1. WIP/implement object detection widget
    Assuming the API input will be identical to that of ImageClassificationWidget and the API output will be
[
   {
        "boundingBox": // Array<{x: number, y: number}>, the 4 corner vertices of the bounding box
        "score": // float
        "label": // str
   },
   // ...
]
  2. CSS fadeIn/fadeOut transitions when masks/boundingBoxes change visibility on user interaction
  3. Subtle darkening of WidgetOutputChart when switched to dark mode

Screenshots:

[screenshots: image segmentation and object detection widgets]

@osanseviero (Member) commented:

FYI @nateraw

mishig25 (Collaborator) commented Jul 20, 2021

@Narsil and I are discussing whether image seg & obj det should be the same (identical) pipeline or two different pipelines.

Reasons to treat them as the same pipeline:
  1. The tasks are related
  2. It limits the number of unique pipelines & widgets
  3. In this case, we would treat bounding boxes from obj det as masks from image seg. In other words, obj det becomes a special case of image seg.
Reasons to treat them as different pipelines:
  1. The outputs (masks vs. bounding boxes) differ enough to warrant a unique widget for each task (image seg & obj det)

Please let us know which option you would prefer and why, @julien-c @NielsRogge @osanseviero @severo, so we can reach a consensus.

@julien-c (Member) commented:

I would treat them as different pipelines/widgets; I think it's clearer.

But it depends on whether a single model/checkpoint can output both representations at the same time, in one forward pass? (My understanding was no.)

NielsRogge (Contributor, Author) commented Jul 21, 2021

I would also treat them differently, as they are quite separate tasks.

Object detection is fairly simple: given an image, predict class labels + corresponding bounding boxes.

However, image segmentation has different subtasks: [attached image: 20210721_094012.jpg]

I wonder whether all of these can be supported by a general image segmentation pipeline, or whether we should create one for every subtask. I am also wondering about the names of the head models: for now I have called DETR's panoptic segmentation model DetrForSegmentation, but it might be more appropriate to call it DetrForImageSegmentation (if we join all subtasks into one) or DetrForPanopticSegmentation (if we decide to split up the different subtasks).

Currently I'm working on another model, SegFormer, a semantic segmentation model that predicts a label per pixel. So here too I'm wondering what to call the head model: SegFormerForImageSegmentation, or SegFormerForSemanticSegmentation?

Image segmentation seems to take all kinds of exotic forms; for example, last week a paper by Facebook AI came out called "Per-Pixel Classification is Not All You Need for Semantic Segmentation". So even for semantic segmentation, there are different ways to solve the problem. Edit: reading the abstract, it seems fairly simple: they predict a binary mask per label rather than doing per-pixel classification.

Curious to hear your thoughts on all of this. I guess I should do a deep dive into image segmentation, because I'm coming from NLP.

Narsil commented Jul 21, 2021

Hi,

Bounding boxes are really the same as segmentation to me; it's just that the output can be simplified to rectangles.
You are, after all, declaring part of the image as belonging to a certain class. The fact that it is a rectangle shouldn't really matter to a user.
The parallel in NLP is NER vs POS, which are really identical and were correctly grouped under token-classification in transformers.

Image segmentation is really multi-class classification PER pixel, so a general list of masks + labels should cover all the potential needs (a single pixel can be covered by multiple masks). (It's equivalent to a list of classes per pixel.)

For instance-aware + part-aware segmentation, one simply needs to add some form of dependency between the parts (everything is most likely a tree, so a simple "parent" link should cover all cases there).
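
A minimal sketch of what such a generalized output could look like (field names are my own, illustrative only):

// Sketch: one generic segment type covering semantic, instance, panoptic
// and part-aware segmentation via an optional "parent" link.
interface Segment {
  id: number;         // unique id of this segment within the response
  label: string;      // semantic class, e.g. "person" or "arm"
  score: number;      // confidence in [0, 1]
  mask: boolean[][];  // per-pixel membership; a pixel may belong to several segments
  parent?: number;    // id of the enclosing segment (e.g. a part pointing to its instance)
}

type ImageSegmentationOutput = Segment[];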

@mishig25 (Collaborator) commented:

Aggregating all the opinions expressed above, could I preliminarily conclude that:

  1. Separate pipelines for: [object-detection, image-segmentation]
    Points considered:
    a. Outputs different enough to be separate tasks (mask:image vs box:array of 4 vertices)
  2. Only one general image segmentation pipeline for seg_subtasks: [semantic, panoptic, part, ...]
    Points considered:
    a. Ease-of-use/accessibility
    b. All subtasks can be covered with a general image seg task

One possibility is to have one pipeline (for image seg & object detection) but two different widgets for visualizing masks vs. boxes.
However, ideally we'd keep a 1-1 relationship between pipelines & widgets.

Also, do we have to consider size differences when treating object detection outputs as masks vs. boxes? A box (an array of 4 {x: int, y: int} points) would be much smaller than a mask (which is image data). Is this difference in size significant enough to matter?
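
For rough scale (a back-of-envelope estimate, assuming an uncompressed boolean mask): a mask for a 640×480 image is 640 × 480 = 307,200 values, versus 8 numbers for a box (four {x, y} pairs), so masks are several orders of magnitude larger before any compression or run-length encoding.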

Please let me know any thoughts 👍

@julien-c (Member) commented:

Yes, let's do two distinct widgets on the frontend side, and I would also tend to do two distinct pipelines 👍

@mishig25 (Collaborator) commented:

I've uploaded a demo with hardcoded inputs and outputs for the object detection widget here (until we figure out the pipeline):
https://6102a74c4d4db912930e6357--huggingface-widgets.netlify.app/
Please provide feedback on anything: interaction, colors, etc.

@julien-c (Member) commented:

my only feedback is that I...

LOVE IT 🔥

severo (Collaborator) commented Jul 30, 2021

Excellent!

Feedback: I'm wondering whether the bars and the bounding boxes use the same base colors. Not sure, but it seems like the bounding boxes use the browser's base colors ('red', 'blue', etc.) while the bars use the Tailwind CSS base colors.

@mishig25 (Collaborator) commented:

@severo that's great feedback! That's indeed how it's currently done (see here and here): if the bounding box is red, then the bar/label is red-400. I'll update the bounding boxes to use color-400 as well 👍

LysandreJik transferred this issue from huggingface/huggingface_hub on Mar 16, 2022
osanseviero (Member) commented Mar 17, 2022

Since these pipelines and widgets have been merged, I'll close this issue.
