Reading a Paper together: Diffedit
DiffEdit: Diffusion-based semantic image editing with mask guidance Arxiv link
Zotero - chrome menu bar. The paper will appear in the software (zotero)
- can have the url
- has metadata
- can also read the paper
- annotate, edit + organize to papers

Observations
- First sentence == "wow, how great is current advances!"
- Second sentence = "here is DiffEdit"
- Third sentence = "when an image and prompt are given, the generated image should retain as much of the original as possible"
Then from the pictures, it looks like the text prompt is closely aligned to the original image, so the generated image should only change what is requested.
Main contribution Previous techniques usually require a "mask" to be manually supplied, but this papers' main contribution is to dynamically find the mask itself
Introductions to papers
Talks about what its trying to do, tries to describe the problem + provide an overview to current methods.

Looking at related work is a good place to look into current papers.
Background

Scary time. A lot of equations. Background is often written last and intended to look smart to the reviewers.
Just a note: No one in the world is going to look at this paragraph and immediately know.
Let's walk through the math part of the papers.
- Super helpful tip: learn the greek alphabet to interpret the equations
Equation 1 can be read as follows:
L = "loss"
epsilon = "true noise"
epsilon_theta = "noise estimator"
- X_t = is the image at step `t`
- t = the step number
|| epsilon - epsilon_theta(X_t, t)|| is the rank 2 norm
- What does double pipe mean: Quora link
- What does the super + sub script mean: Quora link
- means the root sum of squares
What does the capital E mean?
Expected valueoperator. In statistics the expected value is often the weighted average- Say you have 50% chance of willing $10, the
E(situation) = 0.5 x $10 = $5which means "in-general" people will win $5
Looking at the "picture" algorithm diagram

With walking through the talk step by step, the big idea is:
- run inference on the original image for the truth
horsevs.zebraand then do a diff. - for the zebra inference, it will highlight the pixels relating to the animal and then leave the background alone
- after inferring on both, doing the diff, will show that the background is the same on both images, this will be our MASK
- then when going through the normal diffusion process, at every step, will replace the MASKED area (background) with the original, to ensure that the pixels remain the same
Comment on Appendices:
- often contain experiments or lessons learned while developing the process
HW comments
- using hugging face's pipeline and the provided code, try and create the above paper's results!