Reading a Paper Together: DiffEdit

Paper: "DiffEdit: Diffusion-based semantic image editing with mask guidance" (arXiv link)

Zotero: save the paper from the Zotero button in the Chrome menu bar, and it will appear in the Zotero app. In Zotero, a paper:

  • can have the URL
  • has metadata
  • can be read in the app
  • can be annotated, edited + organized with your other papers

Screenshot-2022-10-30-at-4-55-08-PM

Observations

  • First sentence = "wow, look how great current advances are!"
  • Second sentence = "here is DiffEdit"
  • Third sentence = "when an image and prompt are given, the generated image should retain as much of the original as possible"

Then from the pictures, it looks like the text prompt is closely aligned to the original image, so the generated image should only change what is requested.

Main contribution: previous techniques usually require a "mask" to be supplied manually, but this paper's main contribution is to generate the mask automatically.

Introductions to papers

The introduction talks about what the paper is trying to do, describes the problem + provides an overview of current methods.

Screenshot-2022-10-30-at-5-02-46-PM

The related work section is a good place to find other current papers.

Background

Screenshot-2022-10-30-at-5-06-23-PM

Scary time. A lot of equations. Background is often written last and intended to look smart to the reviewers.

Just a note: No one in the world is going to look at this paragraph and immediately understand it.

Let's walk through the math part of the paper.

  • Super helpful tip: learn the Greek alphabet to interpret the equations

Equation 1 can be read as follows:

L = "loss"
epsilon = "true noise"
epsilon_theta = "noise estimator", which takes two inputs:
    - X_t = the image at step `t`
    - t = the step number
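
For reference, here is Equation 1 written out in LaTeX. This is a transcription that assumes the screenshot shows the usual denoising objective; the subscripts on the expectation (X_0 for the clean image, t, epsilon) come from that usual form and are not spelled out in the notes above.

```latex
\mathcal{L} = \mathbb{E}_{X_0,\, t,\, \epsilon} \left\| \epsilon - \epsilon_\theta(X_t, t) \right\|_2^2
```

Read left to right: the loss is the average, over images, steps and noise samples, of the squared distance between the true noise and the noise the model predicts.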

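|| epsilon - epsilon_theta(X_t, t) || is the L2 (Euclidean) norm of the difference between the true noise and the predicted noise
  • What does double pipe mean: Quora link
  • What does the super + sub script mean: Quora link
  • the norm itself is the root of the sum of squares; the subscript 2 says which norm (L2) and the superscript 2 squares it, so what is left is just the sum of squares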
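
To make the norm concrete, here is a minimal sketch in plain Python. The function names and the three-number "noise" vectors are made up for illustration; in the real model, epsilon and epsilon_theta(X_t, t) are image-sized tensors.

```python
import math

def l2_norm(v):
    """Root of the sum of squares: ||v||_2."""
    return math.sqrt(sum(x * x for x in v))

def squared_l2_error(true_noise, predicted_noise):
    """||epsilon - epsilon_theta||_2^2: the sum of squared differences."""
    return sum((e - p) ** 2 for e, p in zip(true_noise, predicted_noise))

# Toy values standing in for epsilon and epsilon_theta(X_t, t).
true_noise = [3.0, 0.0, -1.0]
predicted_noise = [0.0, 4.0, -1.0]

difference = [e - p for e, p in zip(true_noise, predicted_noise)]  # [3.0, -4.0, 0.0]
print(l2_norm(difference))                            # 5.0
print(squared_l2_error(true_noise, predicted_noise))  # 25.0
```

Squaring the norm removes the square root, which is why the loss ends up as a plain sum of squared errors.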

What does the capital E mean?

  • Expected value operator. In statistics the expected value is the average of the possible outcomes, weighted by how likely each one is
  • Say you have a 50% chance of winning $10 (and otherwise win nothing), then E(situation) = 0.5 x $10 + 0.5 x $0 = $5, which means that "on average" people will win $5 per play
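
The $10 example can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the simulation assumes the other 50% of the time you win nothing.

```python
import random

def expected_value(outcomes):
    """Probability-weighted average of (probability, payoff) pairs."""
    return sum(p * payoff for p, payoff in outcomes)

def simulated_average(outcomes, trials=100_000, seed=0):
    """Average payoff over many simulated plays; it approaches the expected value."""
    rng = random.Random(seed)
    probs = [p for p, _ in outcomes]
    payoffs = [payoff for _, payoff in outcomes]
    draws = rng.choices(payoffs, weights=probs, k=trials)
    return sum(draws) / trials

# 50% chance of winning $10, otherwise $0 (the "otherwise" part is an assumption).
bet = [(0.5, 10.0), (0.5, 0.0)]
print(expected_value(bet))      # 5.0
print(simulated_average(bet))   # close to 5, varies slightly with the seed
```

Nobody ever wins exactly $5 on a single play; the expected value describes the long-run average, which is the sense in which the E in the loss averages over images, steps and noise samples.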

Looking at the "picture" algorithm diagram

Screenshot-2022-10-30-at-5-28-18-PM

Walking through the diagram step by step, the big idea is:

  1. run the noise estimator on a noised copy of the original image twice, once with the "horse" prompt and once with the "zebra" prompt, and then do a diff.
  2. the two predictions will disagree on the pixels relating to the animal and agree on the background
  3. so the diff is large on the animal and close to zero on the background; thresholding the diff gives us our MASK (the animal, i.e. the region to edit)
  4. then when going through the normal diffusion process with the "zebra" prompt, at every step, replace everything outside the MASK (the background) with the original image at the matching noise level, to ensure that those pixels remain the same
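
The diff-then-blend idea in the steps above can be sketched on a toy row of six "pixels". This is not the paper's implementation: the function names, the numbers and the 0.5 threshold are assumptions for illustration, and the real method works on noise predictions from a diffusion model and repeats the blend at every denoising step. Here the mask is 1 where the two predictions differ (the region to edit) and 0 for the background that is kept.

```python
def difference_mask(noise_ref, noise_query, threshold=0.5):
    """1 where the two noise predictions disagree strongly, else 0."""
    diffs = [abs(q - r) for r, q in zip(noise_ref, noise_query)]
    peak = max(diffs) or 1.0  # avoid dividing by zero when nothing differs
    return [1 if d / peak > threshold else 0 for d in diffs]

def blend(original, edited, mask):
    """Take edited pixels inside the mask and original pixels outside it."""
    return [e if m else o for o, e, m in zip(original, edited, mask)]

# Toy noise predictions: the animal sits in the middle two pixels.
noise_horse = [0.1, 0.10, 0.9, 0.8, 0.1, 0.1]
noise_zebra = [0.1, 0.12, 0.1, 0.2, 0.1, 0.1]
mask = difference_mask(noise_horse, noise_zebra)

original = [10, 11, 12, 13, 14, 15]  # stand-in for the original image
edited = [90, 91, 92, 93, 94, 95]    # stand-in for one "zebra" denoising step
result = blend(original, edited, mask)

print(mask)    # [0, 0, 1, 1, 0, 0]
print(result)  # [10, 11, 92, 93, 14, 15]
```

Only the two middle pixels change; the background values are copied from the original untouched, which is the property the paper is after.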

Comment on Appendices:

  • often contain experiments or lessons learned while developing the process

HW comments

  • using Hugging Face's pipeline and the provided code, try to recreate the paper's results!