Diffusion Models: AI Doesn't Draw Images, It Sculpts Them
Deep dive into diffusion models for AI image generation: DDPM to Stable Diffusion to FLUX explained.

Diffusion Models: AI Doesn't "Draw" Images, It "Sculpts" Them
Most people think AI generates images by painting stroke by stroke on a blank canvas. Wrong. Completely wrong.
The truth is far more interesting — diffusion models create images the way a sculptor carves stone. They don't create from scratch. They slowly "carve" meaningful pictures out of pure noise.
Honestly, when I first understood this principle, I got goosebumps. Not because it's complicated, but because it's so elegant.
The Core Idea: Mess It Up, Then Fix It
Training a diffusion model has two phases, and the logic is simple enough to summarize in one sentence: add noise to good photos, then teach AI to remove it.
Phase one is "forward diffusion." Take a clear photo of a cat, add a tiny bit of random noise in the first step, a little more in the second, keep going... After a few hundred steps, the photo becomes pure random noise. You can't tell it was ever a cat.
Phase two is "reverse diffusion" — and this is where the magic happens. We train a neural network to "guess" what noise was added at each step, then subtract it. During training, we tell it the right answer — because we added the noise ourselves, we know exactly what was added at each step. After learning from millions of photos, the network gains the ability to reconstruct an image from pure noise.
In code:
# Forward: good image → noise (we know the noise at each step)
x = [clean_image]                                  # x[0] is the original photo
for t in range(1, T + 1):
    x.append(add_noise(x[t - 1], noise_we_know[t]))

# Reverse: noise → good image (the network learns to predict noise)
x_t = random_noise()
for t in reversed(range(1, T + 1)):
    predicted_noise = model(x_t, t)
    x_t = remove_noise(x_t, predicted_noise)
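For the curious, the real training step is even more compact than those loops suggest. Here is a minimal PyTorch-style sketch, assuming a model that takes the noisy image and the step number; the helper names and the model signature are illustrative, not any particular library's API. Noise for step t can be added in one shot with a closed-form formula, and the loss is simply "how well did you guess the noise":

import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, alpha_bar, T):
    # alpha_bar: 1-D tensor of cumulative noise-schedule products, one entry per step
    t = torch.randint(0, T, (x0.shape[0],))        # pick a random step for each image in the batch
    eps = torch.randn_like(x0)                     # the noise we add (so we know the right answer)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps     # jump straight to "noised up to step t"
    predicted_eps = model(x_t, t)                  # the network guesses what noise was added
    return F.mse_loss(predicted_eps, eps)          # train it to guess correctly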
During inference it gets even wilder — you give the model a blob of random noise and it starts "sculpting": step one removes some noise and the image vaguely shows structure; step two removes more and the structure sharpens; by step 20, details appear; by step 50, a complete image emerges.
Why Is It Called "Diffusion"?
The name comes from the diffusion process in physics. Think about it: drop ink into clear water and it slowly spreads until the whole glass becomes a uniform color — this process is irreversible. The ink won't gather back into a drop on its own.
Diffusion models reverse this process. They teach AI to do something impossible in nature: make a glass of uniformly colored water re-concentrate into a single drop of ink. Sounds like time reversal, right? That's why the entire CV community exploded when this approach was first proposed.
From DDPM to Stable Diffusion to FLUX: The Evolution
In 2020, Ho et al. proposed DDPM (Denoising Diffusion Probabilistic Models), proving the approach works. But DDPM had a fatal flaw — it was too slow. Generating one image required 1,000 steps, taking minutes per image. Fine for a lab, but nobody in production can wait that long.
Then in 2022, Stable Diffusion arrived. Its breakthrough was running diffusion in latent space. Instead of operating in pixel space — where a 1024×1024 image has over a million pixels to compute at every step — it first compresses the image into a tiny latent space (say 128×128) using an encoder, runs diffusion there, then reconstructs the pixels with a decoder. That single move shrinks the grid the denoiser has to process at every step by a factor of 64 (1024² positions down to 128²).
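In pseudocode, the latent trick looks roughly like this; vae_decode, denoiser, and the other helpers are placeholder names for the idea, not the actual Stable Diffusion API:

def generate_with_latent_diffusion(prompt_embedding, steps=20):
    z = random_noise(shape=(4, 128, 128))              # a tiny latent, not 1024x1024x3 pixels
    for t in reversed(range(steps)):
        noise_pred = denoiser(z, t, prompt_embedding)  # all the heavy compute happens here
        z = remove_noise(z, noise_pred, t)
    return vae_decode(z)                               # the decoder expands the latent back to pixels

During training the same encoder squeezes real images into that small space first, so the denoiser never touches raw pixels at all.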
Then came FLUX. The team at Black Forest Labs (many from the original Stable Diffusion crew) made several key changes: replacing U-Net with DiT (Diffusion Transformer) architecture, introducing rectified flow training, and compressing steps down to around 20. Faster than Stable Diffusion, with higher quality too.
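Rectified flow sounds exotic, but the training recipe is tiny. A hedged sketch of the idea (an illustration of the technique, not Black Forest Labs' actual code): instead of predicting noise, the model predicts the "velocity" along a straight line between the image and the noise:

import torch
import torch.nn.functional as F

def rectified_flow_step(model, x0):
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0]).view(-1, 1, 1, 1)   # random point along the path
    x_t = (1 - t) * x0 + t * noise                  # straight-line mix of image and noise
    target_velocity = noise - x0                    # constant direction along that line
    predicted = model(x_t, t)
    return F.mse_loss(predicted, target_velocity)

Because the path is (roughly) a straight line, the sampler can take big strides along it, which is a large part of why around 20 steps are enough.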
This is what our lab runs every day. FLUX on MS01, about 15 seconds for 20 steps. Fifteen seconds ago it was pure noise, now it's a complete cover image — I still find this process amazing every time.
Why Prompts Matter So Much
You might be wondering: if the model starts sculpting from random noise, how does it know whether to sculpt a cat or a dog?
The answer is conditional guidance. At every training step, we give the model not just the current noisy image, but also a condition — usually a text description. The model learns: "given this text description, what's the most likely noise?"
You type "an orange baby fire dragon doing experiments in a lab," and the model references this condition at every denoising step. The final "sculpted" image will match your description. Vague prompts like just "cat" leave the model to fill in details. Specific prompts yield specific results.
This also explains why FLUX is so sensitive to prompt quality. It's powerful, but you need to tell it what you want. A precise prompt and a vague prompt can produce vastly different images.
Diffusion Models vs GAN: Who Won?
Before 2022, GANs (Generative Adversarial Networks) ruled image generation. GAN's approach: two networks fight — one fakes images, one detects fakes. The more they fight, the better the faking gets.
GAN's problem: unstable training. You might train for three days and get nothing but noise. The classic failure even has a name, "mode collapse": the generator finds a shortcut and outputs the same image regardless of input. The discriminator can't tell. Both "win," but the output is worthless.
Diffusion models have none of these problems. Training is stable: the loss curve just keeps descending. And diversity comes for free, because starting from different random noise means every output is different.
Looking back, diffusion models beating GANs was almost inevitable. GANs are like teaching two people to deceive each other. Diffusion models teach a student reverse engineering. Which one is more controllable is obvious.
What Actually Happens When You Use FLUX?
When you submit a cover image task in our system, the pipeline looks like this:
1. Submit prompt → local-image-submit creates job
2. FLUX receives: pure random noise + your encoded prompt + the step count (20)
3. Step 1: model predicts noise, removes some → structure faintly appears
4. Steps 2-10: structure sharpens, main subject emerges
5. Steps 11-20: detail refinement — colors, lighting, texture gradually complete
6. Output 1200×630 PNG → convert to WebP → upload to OSS → return URL
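Step 6's PNG-to-WebP conversion is the least glamorous part, but for completeness, here is a minimal sketch with Pillow; the paths and quality setting are illustrative, and the OSS upload is omitted because that part is specific to our internal tooling:

from PIL import Image

def png_to_webp(png_path, webp_path, quality=90):
    img = Image.open(png_path)
    img.save(webp_path, "WEBP", quality=quality)   # lossy WebP, far smaller than the PNG
    return webp_path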
20 steps. 15 seconds. From noise to cover image. Every time I see this pipeline complete, I think: in 2020 it took 1,000 steps and several minutes. Six years later, 20 steps does it. The acceleration of technology is genuinely terrifying.
What's Next: World Models?
Diffusion models have won in images. What about video? Sora, Kling, Vidu all use diffusion models too, just extended from 2D to spatiotemporal 3D. But video faces a challenge images don't — temporal consistency. The cat in frame 1 must be the same cat as in frame 50, with coherent motion. This is still being tackled.
Further out, the concept of World Models is being seriously discussed. Not generating images, not generating video, but generating an "interactive 3D world." You walk in, things are still there. You push a cup, it falls and shatters. That's the real endgame for diffusion models.
But that's another story. For today, understanding diffusion models' "stone carving" process is enough.
SFD Editor's Note: We generate cover images with FLUX every day, but truly understanding what it does changes how you feel watching the 20-step denoising process. Try this in your own environment: add noise to an image, then let the model restore it. Seeing the "sculpting" process with your own eyes beats reading ten papers.
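If you only want to see the forward half (the "mess it up" phase), you don't even need a model; a few lines of NumPy will do. The file name is a placeholder, and the reverse half of course still needs a trained network:

import numpy as np
from PIL import Image

img = np.asarray(Image.open("cat.jpg")).astype(np.float32) / 255.0
for step, sigma in enumerate([0.1, 0.3, 0.6, 1.0]):
    noisy = np.clip(img + np.random.randn(*img.shape) * sigma, 0.0, 1.0)
    Image.fromarray((noisy * 255).astype(np.uint8)).save(f"noisy_step_{step}.png")

Watch the cat dissolve step by step; everything a diffusion model learns is how to walk back along exactly that path.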