This is a blog post by Zhanfeng Mo ([email protected]) illustrating diffusion-based LLMs, a non-auto-regressive family of LLMs proposed in https://arxiv.org/abs/2502.09992.
TLDR: LLaDA is the first work to scale masked-sequence-denoising LLMs to 8B parameters, achieving in-context learning performance comparable to auto-regressive baselines.
Suppose $x=(x_1,...,x_L) \in V^L$ is a trajectory of $L$ tokens, where $V$ is the vocabulary.

The key to designing new LMs lies in the parameterization of $p_{\theta}(\cdot)$, the unconditional probability of **any** trajectory.
Let $P_{\theta}(\cdot)$ be the LM, i.e., a conditional probability over tokens. We now have various choices for parameterizing $p_{\theta}(\cdot)$ with $P_{\theta}(\cdot)$.
Auto-regressive parameterization (AR):
$$ p_{\theta}(x)=P_{\theta}(x_L|x_{1:L-1}) \cdots P_{\theta}(x_3|x_{1:2}) P_{\theta}(x_2|x_{1}) P_{\theta}(x_1) $$
Therefore, the AR generation process requires $L$ sequential next-token-prediction steps.
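To make the $L$-step decoding loop concrete, below is a minimal PyTorch sketch of AR sampling. `TinyARModel` and its GRU backbone are hypothetical stand-ins (not the architecture of any particular paper); all that matters is that the model maps a prefix $x_{1:l-1}$ to logits over the next token.

```python
import torch
import torch.nn as nn

class TinyARModel(nn.Module):
    """Hypothetical toy causal model: prefix x_{1:l-1} -> logits over the next token."""
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # left-to-right recurrence, hence causal
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, prefix):                          # prefix: (B, l)
        h, _ = self.rnn(self.embed(prefix))
        return self.head(h[:, -1])                      # next-token logits: (B, |V|)

@torch.no_grad()
def ar_generate(model, bos_id: int, L: int):
    """Generate x_1, ..., x_L with L sequential next-token-prediction steps."""
    x = torch.full((1, 1), bos_id)                      # start from a BOS prefix
    for _ in range(L):
        logits = model(x)                               # P_theta(x_l | x_{1:l-1})
        next_tok = torch.distributions.Categorical(logits=logits).sample()
        x = torch.cat([x, next_tok[:, None]], dim=1)
    return x[:, 1:]                                     # drop the BOS token

model = TinyARModel(vocab_size=100)
print(ar_generate(model, bos_id=0, L=8))
```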
The learning objective of the AR-LLM becomes
$$ \mathbb{E}_{x\sim p_{\text{data}}} \left[\sum_{l=1}^{L}\log P_{\theta}(x_l|x_{1:l-1}) \right] $$
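In code, this objective is just the standard next-token cross-entropy averaged over positions. The sketch below assumes a toy causal model `TinyCausalLM` (hypothetical, chosen only so the snippet runs) and a reserved BOS id of 0.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalLM(nn.Module):
    """Hypothetical toy causal LM: (B, L) tokens -> (B, L, |V|) per-position next-token logits."""
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)

def ar_nll(model, x, bos_id: int = 0):
    """Per-token NLL: -(1 / BL) * sum_{b,l} log P_theta(x_l | x_{1:l-1}) on a batch x of shape (B, L)."""
    B, L = x.shape
    bos = torch.full((B, 1), bos_id, dtype=x.dtype)
    inputs = torch.cat([bos, x[:, :-1]], dim=1)        # right-shift: position l sees only the prefix x_{1:l-1}
    logits = model(inputs)                             # (B, L, |V|)
    return F.cross_entropy(logits.reshape(B * L, -1), x.reshape(B * L))

model = TinyCausalLM(vocab_size=100)
x = torch.randint(1, 100, (4, 16))                     # toy batch of token ids (0 reserved for BOS)
print(ar_nll(model, x))
```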
Diffusion-denoising parameterization:
Parameterize an invertible process (e.g., a Markov chain) bridging $p_{\text{data}}$ and an easy-to-sample prior $\pi$, such that generating data amounts to reversing (denoising) samples from $\pi$ back to the data distribution.
To this end, we can construct a Markov chain $(x^t)_{t\in I}$ (for continuous time, take $I\triangleq[0,1]$; for discrete time, take $I\triangleq \{0,\Delta t,2\Delta t,\dots,1\}$).
$$ \text{Data}=x^0 \to x^{\Delta t} \to \cdots \to x^{1-\Delta t} \to x^1 = \text{Prior} $$
Informally, the diffusion-denoising parameterization is
$$ p_{\theta}(x) = \int p_{\theta}(x^0,x^{\Delta t},\dots,x^{1-\Delta t},x^{1}) \,\mathrm{d} (x^{\Delta t},\dots,x^{1-\Delta t},x^{1}), $$
$$ p_{\theta}(x^0,x^{\Delta t},\dots,x^{1-\Delta t},x^{1})=P_{\theta}(x^0 | x^{\Delta t})\, P_{\theta}(x^{\Delta t} | x^{2\Delta t}) \cdots P_{\theta}(x^{1-\Delta t} | x^{1})\, \pi(x^1), $$
where $x$ is identified with $x^0$.
Therefore, the diffusion generation process requires $T \triangleq 1/\Delta t$ denoising (next-state-prediction) steps.
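To illustrate the $T$-step reverse process, here is a minimal sketch that instantiates the prior $\pi$ as an all-`[MASK]` sequence, the absorbing-state choice used by masked diffusion LMs such as LLaDA. `TinyDenoiser`, `MASK_ID`, and the simple random unmasking schedule are assumptions of this sketch, not the paper's exact sampler.

```python
import torch
import torch.nn as nn

MASK_ID = 0  # reserved [MASK] token id, an assumption of this sketch

class TinyDenoiser(nn.Module):
    """Hypothetical toy denoiser: partially masked (B, L) tokens -> (B, L, |V|) logits."""
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.mix = nn.GRU(dim, dim, batch_first=True, bidirectional=True)  # sees the full context
        self.head = nn.Linear(2 * dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.mix(self.embed(tokens))
        return self.head(h)

@torch.no_grad()
def diffusion_generate(model, L: int, T: int):
    """T denoising steps from the fully masked prior x^1 back to a clean sequence x^0."""
    x = torch.full((1, L), MASK_ID)                       # x^1 ~ pi: every position is [MASK]
    for i in reversed(range(T)):                          # step from x^{(i+1)*dt} to x^{i*dt}
        logits = model(x)                                 # per-position predictions of clean tokens
        pred = torch.distributions.Categorical(logits=logits).sample()
        still_masked = x == MASK_ID
        # Reveal a 1/(i+1) fraction of the remaining masked positions, so that roughly
        # an i/T fraction stays masked after this step; the last step (i = 0) reveals all.
        reveal = still_masked & (torch.rand_like(x, dtype=torch.float) < 1.0 / (i + 1))
        x = torch.where(reveal, pred, x)
    return x

model = TinyDenoiser(vocab_size=100)
print(diffusion_generate(model, L=8, T=4))
```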
The learning objective becomes
$$ \mathbb{E}_{x\sim p_{\text{data}},\,(x^{i\Delta t})\sim \text{Diff}} \left[\sum_{i=0}^{T-1}\log P_{\theta}(x^{i\Delta t}|x^{(i+1)\Delta t}) \right] $$
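One concrete instantiation of this objective is the masked (absorbing-state) corruption used by LLaDA: sample a masking ratio $t$, mask each token of $x^0$ independently with probability $t$, and train the model to recover the masked tokens. The sketch below reuses the hypothetical `TinyDenoiser` above and, for brevity, omits the $1/t$ reweighting that the paper uses to obtain a likelihood bound.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0   # reserved [MASK] id, an assumption of this sketch

def masked_denoising_loss(model, x0):
    """Cross-entropy on masked positions: a simple Monte-Carlo estimate of the denoising objective."""
    B, L = x0.shape
    t = torch.rand(B, 1)                                      # per-sequence noise level t ~ U(0, 1)
    is_masked = torch.rand(B, L) < t                          # forward process: mask each token w.p. t
    xt = torch.where(is_masked, torch.full_like(x0, MASK_ID), x0)
    logits = model(xt)                                        # (B, L, |V|)
    loss = F.cross_entropy(logits.reshape(B * L, -1), x0.reshape(B * L), reduction="none")
    return (loss * is_masked.reshape(B * L).float()).sum() / is_masked.sum().clamp(min=1)

model = TinyDenoiser(vocab_size=100)                          # hypothetical model from the sketch above
x0 = torch.randint(1, 100, (4, 16))                           # toy clean sequences (0 reserved for [MASK])
print(masked_denoising_loss(model, x0))
```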
An ideal design of $(x^t)_{t\in I}$ satisfies
To build a diffusion-based LM, we only need to design