This is a blog post by Zhanfeng Mo ([email protected]) illustrating diffusion-based LLMs, a non-auto-regressive family of LLMs proposed in https://arxiv.org/abs/2502.09992.
TLDR: LLaDA is the first work to scale masked-sequence-denoising LLMs to 8B parameters, achieving in-context learning performance comparable to auto-regressive baselines.
Suppose $x=(x_1,...,x_L) \in V^L$ is a trajectory of $L$ tokens, where $V$ is the vocabulary.

The key to designing new LMs lies in the parameterization of $p_{\theta}(\cdot)$, the unconditional probability of **any** trajectory.
Let $P_{\theta}(\cdot)$ be the LM, i.e., a conditional probability over tokens. We now have various choices for parameterizing $p_{\theta}(\cdot)$ with $P_{\theta}(\cdot)$.
Auto-regressive parameterization (AR):
$$ p_{\theta}(x)=P_{\theta}(x_L|x_{1:L-1}) \cdots P_{\theta}(x_3|x_{1:2}) P_{\theta}(x_2|x_{1}) P_{\theta}(x_1) $$
Therefore, the AR generation process requires $L$ sequential next-token-prediction steps.
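To make the $L$-step decoding loop concrete, below is a minimal PyTorch sketch of AR sampling. `TinyARModel` and its GRU backbone are hypothetical stand-ins (not the architecture of any particular paper); all that matters is that the model maps a prefix $x_{1:l-1}$ to logits over the next token.

```python
import torch
import torch.nn as nn

class TinyARModel(nn.Module):
    """Hypothetical toy causal model: prefix x_{1:l-1} -> logits over the next token."""
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # left-to-right recurrence, hence causal
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, prefix):                          # prefix: (B, l)
        h, _ = self.rnn(self.embed(prefix))
        return self.head(h[:, -1])                      # next-token logits: (B, |V|)

@torch.no_grad()
def ar_generate(model, bos_id: int, L: int):
    """Generate x_1, ..., x_L with L sequential next-token-prediction steps."""
    x = torch.full((1, 1), bos_id)                      # start from a BOS prefix
    for _ in range(L):
        logits = model(x)                               # P_theta(x_l | x_{1:l-1})
        next_tok = torch.distributions.Categorical(logits=logits).sample()
        x = torch.cat([x, next_tok[:, None]], dim=1)
    return x[:, 1:]                                     # drop the BOS token

model = TinyARModel(vocab_size=100)
print(ar_generate(model, bos_id=0, L=8))
```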
The learning objective of the AR-LLM becomes
$$ \mathbb{E}_{x\sim p_{\text{data}}} \left[\sum_{l=1}^{L}\log P_{\theta}(x_l|x_{1:l-1}) \right] $$
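In code, this objective is just the standard next-token cross-entropy averaged over positions. The sketch below assumes a toy causal model `TinyCausalLM` (hypothetical, chosen only so the snippet runs) and a reserved BOS id of 0.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalLM(nn.Module):
    """Hypothetical toy causal LM: (B, L) tokens -> (B, L, |V|) per-position next-token logits."""
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)

def ar_nll(model, x, bos_id: int = 0):
    """Per-token NLL: -(1 / BL) * sum_{b,l} log P_theta(x_l | x_{1:l-1}) on a batch x of shape (B, L)."""
    B, L = x.shape
    bos = torch.full((B, 1), bos_id, dtype=x.dtype)
    inputs = torch.cat([bos, x[:, :-1]], dim=1)        # right-shift: position l sees only the prefix x_{1:l-1}
    logits = model(inputs)                             # (B, L, |V|)
    return F.cross_entropy(logits.reshape(B * L, -1), x.reshape(B * L))

model = TinyCausalLM(vocab_size=100)
x = torch.randint(1, 100, (4, 16))                     # toy batch of token ids (0 reserved for BOS)
print(ar_nll(model, x))
```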
Diffusion-denoising parameterization:
Parameterize an invertible process (e.g., a Markov chain) bridging $p_{\text{data}}$ and an easy-to-sample prior $\pi$, such that generating data amounts to reversing (denoising) samples from $\pi$ back to the data distribution.
To this end, we can construct a Markov chain $(x^t)_{t\in I}$ (for continuous time, take $I\triangleq[0,1]$; for discrete time, take $I\triangleq \{0,\Delta t,2\Delta t,\dots,1\}$).
$$ \text{Data}=x^0 \to x^{\Delta t} \to \cdots \to x^{1-\Delta t} \to x^1 = \text{Prior} $$
Informally, the diffusion-denoising parameterization is
$$ p_{\theta}(x) = \int p_{\theta}(x^0,x^{\Delta t},\dots,x^{1-\Delta t},x^{1}) \,\mathrm{d} (x^{\Delta t},\dots,x^{1-\Delta t},x^{1}), $$
$$ p_{\theta}(x^0,x^{\Delta t},\dots,x^{1-\Delta t},x^{1})=P_{\theta}(x^0 | x^{\Delta t})\, P_{\theta}(x^{\Delta t} | x^{2\Delta t}) \cdots P_{\theta}(x^{1-\Delta t} | x^{1})\, \pi(x^1), $$
where $x$ is identified with $x^0$.
Therefore, the diffusion generation process requires $T \triangleq 1/\Delta t$ denoising (next-state-prediction) steps.
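To illustrate the $T$-step reverse process, here is a minimal sketch that instantiates the prior $\pi$ as an all-`[MASK]` sequence, the absorbing-state choice used by masked diffusion LMs such as LLaDA. `TinyDenoiser`, `MASK_ID`, and the simple random unmasking schedule are assumptions of this sketch, not the paper's exact sampler.

```python
import torch
import torch.nn as nn

MASK_ID = 0  # reserved [MASK] token id, an assumption of this sketch

class TinyDenoiser(nn.Module):
    """Hypothetical toy denoiser: partially masked (B, L) tokens -> (B, L, |V|) logits."""
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.mix = nn.GRU(dim, dim, batch_first=True, bidirectional=True)  # sees the full context
        self.head = nn.Linear(2 * dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.mix(self.embed(tokens))
        return self.head(h)

@torch.no_grad()
def diffusion_generate(model, L: int, T: int):
    """T denoising steps from the fully masked prior x^1 back to a clean sequence x^0."""
    x = torch.full((1, L), MASK_ID)                       # x^1 ~ pi: every position is [MASK]
    for i in reversed(range(T)):                          # step from x^{(i+1)*dt} to x^{i*dt}
        logits = model(x)                                 # per-position predictions of clean tokens
        pred = torch.distributions.Categorical(logits=logits).sample()
        still_masked = x == MASK_ID
        # Reveal a 1/(i+1) fraction of the remaining masked positions, so that roughly
        # an i/T fraction stays masked after this step; the last step (i = 0) reveals all.
        reveal = still_masked & (torch.rand_like(x, dtype=torch.float) < 1.0 / (i + 1))
        x = torch.where(reveal, pred, x)
    return x

model = TinyDenoiser(vocab_size=100)
print(diffusion_generate(model, L=8, T=4))
```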
The learning objective becomes
$$ \mathbb{E}_{x\sim p_{\text{data}},\,(x^{i\Delta t})\sim \text{Diff}} \left[\sum_{i=0}^{T-1}\log P_{\theta}(x^{i\Delta t}|x^{(i+1)\Delta t}) \right] $$
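One concrete instantiation of this objective is the masked (absorbing-state) corruption used by LLaDA: sample a masking ratio $t$, mask each token of $x^0$ independently with probability $t$, and train the model to recover the masked tokens. The sketch below reuses the hypothetical `TinyDenoiser` above and, for brevity, omits the $1/t$ reweighting that the paper uses to obtain a likelihood bound.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0   # reserved [MASK] id, an assumption of this sketch

def masked_denoising_loss(model, x0):
    """Cross-entropy on masked positions: a simple Monte-Carlo estimate of the denoising objective."""
    B, L = x0.shape
    t = torch.rand(B, 1)                                      # per-sequence noise level t ~ U(0, 1)
    is_masked = torch.rand(B, L) < t                          # forward process: mask each token w.p. t
    xt = torch.where(is_masked, torch.full_like(x0, MASK_ID), x0)
    logits = model(xt)                                        # (B, L, |V|)
    loss = F.cross_entropy(logits.reshape(B * L, -1), x0.reshape(B * L), reduction="none")
    return (loss * is_masked.reshape(B * L).float()).sum() / is_masked.sum().clamp(min=1)

model = TinyDenoiser(vocab_size=100)                          # hypothetical model from the sketch above
x0 = torch.randint(1, 100, (4, 16))                           # toy clean sequences (0 reserved for [MASK])
print(masked_denoising_loss(model, x0))
```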
An ideal design of $(x^t)_{t\in I}$ satisfies
To build a diffusion-based LM, we only need to design