TLDR: This blog is written by Zhanfeng Mo, who shares some personal viewpoints and comments on recent advancements in LM-based RL.

LM Finetuning

For any vocabulary $V$, a sequence of $L$ tokens is denoted by $q\triangleq (q_1,\dots,q_L)\in V^L$. A language model (LM) policy parameterized by $\theta$ is defined as $\pi_{\theta}(\cdot|q) \in \mathcal{P}(V)$, i.e., a probability measure over the vocabulary. In practice, the LM policy is implemented as

$$ \pi_{\theta}(\cdot | q) \triangleq \mathrm{Multi}(z),\quad z=\mathrm{LM}_{\theta}(q)\in \mathbb{R}^{|V|}, $$

where $\mathrm{Multi}(\cdot)$ denotes the multinomial (categorical) distribution over $V$ parameterized by the LM logits $z$, i.e., the distribution $\mathrm{softmax}(z)$. Without ambiguity, we denote the conditional probability that the LM assigns to an arbitrary sequence $a\in V^{I}$ as

$$ \pi_{\theta}(a|q) \triangleq \pi_{\theta}(a_I|[q,a_{<I}])\cdots \pi_{\theta}(a_2|[q,a_1])\pi_{\theta}(a_1|q). $$
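To make the notation concrete, here is a minimal sketch, assuming a Hugging Face causal LM ("gpt2" is only an arbitrary stand-in checkpoint, and the function names are illustrative, not from the blog): it computes $\log \pi_{\theta}(a|q)$ by summing per-token log-probabilities according to the chain rule above, and samples one token from $\mathrm{Multi}(\mathrm{softmax}(z))$.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(q_text: str, a_text: str) -> torch.Tensor:
    """Return log pi_theta(a | q) = sum_i log pi_theta(a_i | [q, a_{<i}])."""
    q_ids = tokenizer(q_text, return_tensors="pt").input_ids        # shape (1, |q|)
    a_ids = tokenizer(a_text, return_tensors="pt").input_ids        # shape (1, |a|)
    input_ids = torch.cat([q_ids, a_ids], dim=1)                     # shape (1, |q|+|a|)

    with torch.no_grad():
        logits = model(input_ids).logits                             # shape (1, |q|+|a|, |V|)

    # The logits at position t parameterize Multi(.) for the token at position t+1,
    # so the distributions over the answer tokens live at positions |q|-1, ..., |q|+|a|-2.
    log_probs = torch.log_softmax(logits[:, q_ids.size(1) - 1 : -1, :], dim=-1)
    token_log_probs = log_probs.gather(-1, a_ids.unsqueeze(-1)).squeeze(-1)   # (1, |a|)
    return token_log_probs.sum(dim=-1)                               # log pi_theta(a | q)

# Sampling one next token from pi_theta(. | q) = Multi(softmax(z)):
with torch.no_grad():
    z = model(tokenizer("What is 2 + 2?", return_tensors="pt").input_ids).logits[:, -1, :]
next_token = torch.multinomial(torch.softmax(z, dim=-1), num_samples=1)
```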

LM fine-tuning aims to reshape the generation policy of the LM so that it satisfies some form of supervision, e.g.,

  1. Strong process supervision: minimize the distance between the LM policy and a given QA distribution $p^*$,

    $$ \min_{\theta}\ \mathbb{E}_{q\sim \mathcal{D}} \left[ d(\pi_{\theta}(\cdot|q), p^*(\cdot|q)) \right] = \mathbb{E}_{q\sim \mathcal{D}} \left[ \mathrm{KL}(p^*(\cdot|q) \,\|\, \pi_{\theta}(\cdot|q)) \right] = \mathbb{E}_{q\sim \mathcal{D}}\, \mathbb{E}_{a\sim p^*(\cdot|q)} \left[ \log p^*(a|q) - \log \pi_{\theta}(a|q) \right], $$

    which is equivalent to the Supervised Fine-Tuning (SFT) objective (a minimal code sketch follows this list):

    $$ \max_{\theta}\ \mathbb{E}_{q\sim \mathcal{D}}\, \mathbb{E}_{a\sim p^*(\cdot|q)} \left[ \log \pi_{\theta}(a|q) \right] \approx \max_{\theta}\ \mathbb{E}_{(q,a)\sim \mathcal{D}_{\mathrm{SFT}}} \left[ \log \pi_{\theta}(a|q) \right]. $$

    Here, $\mathcal{D}_{\mathrm{SFT}}$ denotes the empirical distribution over a finite set of pre-collected QA pairs.

  2. Weak outcome supervision: optimize the LM policy to maximize a reward $r(\cdot,\cdot): V^{\infty}\times V^{\infty} \to \mathbb{R}$ (e.g., for producing correct, preferable, or well-formatted answers). This optimization problem is known as Reinforcement Fine-Tuning (RFT); a minimal policy-gradient sketch is given at the end of this section:

    $$ \max_{\theta}\ J(\pi_{\theta}) = \mathbb{E}_{(q,a)\sim \mathcal{D}}\, \mathbb{E}_{o\sim \pi_{\theta}(\cdot|q)} \left[ r(o,a) \right]. $$
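As referenced in item 1, here is a minimal SFT sketch under the same assumptions as before (a Hugging Face causal LM with "gpt2" as a stand-in; `sft_loss` is an illustrative helper, not a library function). It maximizes $\mathbb{E}_{(q,a)\sim\mathcal{D}_{\mathrm{SFT}}}[\log \pi_{\theta}(a|q)]$ by minimizing the cross-entropy over answer tokens only.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sft_loss(q_text: str, a_text: str) -> torch.Tensor:
    """Negative log-likelihood of the answer tokens given the prompt."""
    q_ids = tokenizer(q_text, return_tensors="pt").input_ids
    a_ids = tokenizer(a_text, return_tensors="pt").input_ids
    input_ids = torch.cat([q_ids, a_ids], dim=1)

    logits = model(input_ids).logits
    # Distributions over the answer tokens sit at positions |q|-1, ..., L-2.
    answer_logits = logits[:, q_ids.size(1) - 1 : -1, :]
    # Mean of -log pi_theta(a_i | [q, a_{<i}]) over answer tokens only.
    return F.cross_entropy(answer_logits.reshape(-1, answer_logits.size(-1)), a_ids.reshape(-1))

# One gradient step on a single (q, a) pair drawn from D_SFT:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = sft_loss("What is 2 + 2?", "4")
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Masking out the prompt tokens (scoring only the answer) is what distinguishes this from plain next-token pre-training on the concatenated text.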

Personally, I do NOT consider RFT a conventional RL problem:

  1. It does not involve sequential decision modeling; the whole response is treated as a single sample from $\pi_{\theta}(\cdot|q)$.
  2. The reward is given only once, for the complete response.

Instead, I would consider RFT a distributional optimization problem.
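Whatever one calls it, the RFT objective above can be optimized with a plain score-function (REINFORCE) estimator: sample $o \sim \pi_{\theta}(\cdot|q)$, score it once with $r(o,a)$, and weight $\log \pi_{\theta}(o|q)$ by that scalar. Below is a minimal single-sample sketch under the same assumptions as the earlier snippets; `exact_match_reward` is a hypothetical stand-in for the outcome reward, and in practice one would add batching and a baseline/advantage to reduce variance.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def exact_match_reward(o_text: str, a_text: str) -> float:
    """Hypothetical outcome reward r(o, a): 1 if the sampled answer matches the reference, else 0."""
    return float(o_text.strip() == a_text.strip())

def rft_loss(q_text: str, a_text: str) -> torch.Tensor:
    """Single-sample REINFORCE estimator: loss = -r(o, a) * log pi_theta(o | q)."""
    q_ids = tokenizer(q_text, return_tensors="pt").input_ids

    # Sample o ~ pi_theta(. | q); no gradients are needed for the sampling itself.
    with torch.no_grad():
        gen_ids = model.generate(q_ids, do_sample=True, max_new_tokens=32,
                                 pad_token_id=tokenizer.eos_token_id)
    o_ids = gen_ids[:, q_ids.size(1):]                                # completion tokens only
    o_text = tokenizer.decode(o_ids[0], skip_special_tokens=True)

    # Recompute log pi_theta(o | q) with gradients enabled.
    logits = model(gen_ids).logits[:, q_ids.size(1) - 1 : -1, :]
    seq_log_prob = torch.log_softmax(logits, dim=-1).gather(-1, o_ids.unsqueeze(-1)).sum()

    return -exact_match_reward(o_text, a_text) * seq_log_prob

loss = rft_loss("What is 2 + 2?", "4")   # minimizing this ascends E[r(o, a)]
```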

LM Policy Optimization