InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

Microsoft Research Asia
*Indicates Equal Contribution

Indicates Corresponding Author

InstructDiffusion is a unifying and generic framework for aligning computer vision tasks with human instructions.


Abstract

We present InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions. Unlike existing approaches that integrate prior knowledge and pre-define the output space (e.g., categories and coordinates) for each vision task, we cast diverse vision tasks into a human-intuitive image-manipulating process whose output space is a flexible and interactive pixel space. Concretely, the model is built on the diffusion process and is trained to predict pixels according to user instructions, such as circling the left shoulder of the man in red or placing a blue mask on the left car. InstructDiffusion can handle a variety of vision tasks, including understanding tasks (such as segmentation and keypoint detection) and generative tasks (such as editing and restoration). It even demonstrates the ability to handle unseen tasks and outperforms previous methods on unseen datasets. This represents a significant step towards a generalist modeling interface for vision tasks, advancing artificial general intelligence in computer vision.

Keypoint Detection


(a) Mark the car logo with a blue circle.
(b) Put a blue circle on the nose of the white tiger and use the red color to draw a circle around the left shoulder of the white tiger.
(c) Create a yellow circle around the right eye of the whale.
(d) Use blue to encircle the right wrist of the person on the far left and draw a yellow circle over the left wrist of the person on the far right.
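Because the output of each instruction above is an edited image rather than a list of coordinates, a downstream consumer may want to turn the drawn circles back into points. The following is a minimal sketch of one way to do that, not the authors' post-processing: it assumes the circle is drawn in a nearly pure color and takes the centroid of all pixels close to that color. The names `COLORS`, `locate_marker`, and the tolerance `tol` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0..255
Image = List[List[Pixel]]      # rows of pixels, indexed image[y][x]

# Hypothetical nominal RGB values for the color words used in the instructions.
COLORS: Dict[str, Pixel] = {
    "red": (255, 0, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
}


def locate_marker(
    edited: Image, color: Pixel, tol: int = 60
) -> Optional[Tuple[float, float]]:
    """Return the (x, y) centroid of pixels within `tol` of `color`.

    A pixel matches when every channel differs from `color` by at most `tol`.
    Returns None when no pixel matches, i.e. no marker of that color is found.
    """
    xs: List[int] = []
    ys: List[int] = []
    for y, row in enumerate(edited):
        for x, pixel in enumerate(row):
            if max(abs(a - b) for a, b in zip(pixel, color)) <= tol:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (sum(xs) / len(xs), sum(ys) / len(ys))
```

For an instruction that uses two colors, such as (b), calling `locate_marker` once with `COLORS["blue"]` and once with `COLORS["red"]` would give one estimate per keypoint. The sketch will be misled if the scene already contains large regions of the marker color.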

Segmentation


(a) Mark the pixels of cat in the mirror to blue and leave the rest unchanged.
(b) Fill in the pixels of neutrophil with yellow, retaining the existing colors of the remaining pixels.
(c) Modify the pixels of Oriental Pearl Tower to red without affecting any other pixels.
(d) Paint the pixels of shadow in blue and maintain the current appearance of the other pixels.
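The segmentation instructions above ask the model to recolor the target region and leave everything else unchanged, so a conventional binary mask has to be read back from the edited image. The sketch below shows one simple way this could be done, assuming the region is painted in a nearly opaque, nearly pure color; it is an illustration under that assumption, not the method used by the authors. The function name `extract_mask` and the tolerance `tol` are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0..255
Image = List[List[Pixel]]      # rows of pixels, indexed image[y][x]


def extract_mask(edited: Image, color: Pixel, tol: int = 60) -> List[List[bool]]:
    """Return a boolean mask that is True where a pixel is close to `color`.

    A pixel counts as close when every channel differs from `color`
    by at most `tol`.
    """
    return [
        [max(abs(a - b) for a, b in zip(pixel, color)) <= tol for pixel in row]
        for row in edited
    ]
```

Calling `extract_mask(edited, (0, 0, 255))` on the output of instruction (a) would mark the pixels painted blue. Pixels that were already close to the mask color in the input are also marked, so comparing against the original image may be needed in practice.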

Low Level Tasks


Image Editing


BibTeX


      @article{Geng23instructdiff,
        author       = {Zigang Geng and
                        Binxin Yang and
                        Tiankai Hang and
                        Chen Li and
                        Shuyang Gu and
                        Ting Zhang and
                        Jianmin Bao and
                        Zheng Zhang and
                        Han Hu and
                        Dong Chen and
                        Baining Guo},
        title        = {InstructDiffusion: {A} Generalist Modeling Interface for Vision Tasks},
        journal      = {CoRR},
        volume       = {abs/2309.03895},
        year         = {2023},
        url          = {https://doi.org/10.48550/arXiv.2309.03895},
        doi          = {10.48550/arXiv.2309.03895},
      }