A Brief Introduction to Natural Language Generation (NLG)

In this blog post, I dive into some of the mathematical foundations of natural language generation (NLG).

In language modeling, the likelihood of generating a word \(w_i\) is given by the conditional probability \(P(w_i \mid w_1, \ldots, w_{i-1})\).

To generate a coherent and relevant sequence of \(t\) words using causal language modeling (CLM), three components are essential:

  • a language model \(M\),
  • a prompt \(X\),
  • and a decoding algorithm.

Prompting

Prompts are task-specific instructions that guide a model’s output (Amatriain, 2024; Sahoo et al., 2024). In other words, the prompt is the user’s input to the model.

The model \(M\) generates the subsequent text by maximizing the probability of each word \(w_i\), conditioned not only on the previously generated words \(w_1, \ldots, w_{i-1}\), but also on the prompt
\(X = x_1, \ldots, x_m\):

\[P(w_1, \ldots, w_t \mid X) = P(w_1 \mid X) \cdot \prod_{i=2}^t P(w_i \mid X, w_1, \ldots, w_{i-1})\]

where \(t\) is the length of the generated text sequence (Bengio et al., 2003; Vaswani et al., 2017).
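
As a minimal sketch (with hypothetical per-step conditionals), this product is typically computed as a sum of log-probabilities to avoid numerical underflow:

```python
import numpy as np

# Hypothetical values of P(w_i | X, w_1, ..., w_{i-1}) for a 4-token sequence.
step_probs = [0.4, 0.6, 0.9, 0.5]

# Chain rule: the product of the conditionals is the sum of their logs.
log_prob = np.sum(np.log(step_probs))
print(np.exp(log_prob))  # P(w_1, ..., w_t | X) = 0.108
```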

A common way to find effective prompts is prompt engineering, i.e., the systematic crafting of prompts. Below I cover some of the most widely used techniques.


Zero-Shot Prompting

Zero-shot prompting (Radford et al., 2019) directly instructs a model to perform a task without examples:

Classify the text into neutral, negative or positive.
Text: I think the vacation is okay.
Sentiment:

Domain-specific fine-tuning or instruction tuning improves zero-shot performance (Wei et al., 2022).


Few-Shot Prompting

Few-shot prompting (Brown et al., 2020) provides a small number of examples (“shots”) in the prompt:

This is awesome! // Positive
This is bad! // Negative
Wow that movie was rad! // Positive
What a horrible show! //

Few-shot prompting is most effective when the model is sufficiently large (Kaplan et al., 2020).


Chain-of-Thought (CoT) Prompting

Chain-of-Thought prompting (Wei et al., 2022) enables multi-step reasoning by providing intermediate reasoning steps:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. Two cans contain 6 balls.
\(5 + 6 = 11\). The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 and bought 6 more, how many apples do they have?
A:

CoT can be combined with zero-shot or few-shot prompting (Kojima et al., 2022).


Tree-of-Thoughts (ToT) Prompting

Tree-of-Thoughts prompting (Long, 2023; Yao et al., 2023) encourages self-evaluation by maintaining a tree of reasoning paths:

Imagine three different experts answering this question.
Each writes down one step of their thinking.
They share it with the group and continue.
If any expert realizes they are wrong, they leave.
The question is…


Decoding Algorithms

Decoding algorithms transform probability distributions into coherent natural language. The choice of decoding algorithm significantly affects output quality and diversity.


Greedy Search

Greedy search selects the most probable token at each timestep:

\[y_t = \arg \max_y P(y \mid y_1, \ldots, y_{t-1}, x)\]

Its main limitations are a lack of diversity and the accumulation of early errors, since a single suboptimal choice can never be revised later.
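
As a minimal sketch over a hypothetical five-token vocabulary, one greedy decoding step is just an argmax over the model’s logits:

```python
import numpy as np

def greedy_step(logits: np.ndarray) -> int:
    """Pick the single most probable token at this timestep (the argmax)."""
    return int(np.argmax(logits))

# Hypothetical logits over a 5-token vocabulary at one decoding step.
logits = np.array([1.2, 0.3, 2.5, -0.7, 0.9])
print(greedy_step(logits))  # -> 2
```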


Beam Search

Beam search maintains multiple candidate sequences (“beams”) and scores each by its cumulative log-probability:

\[\mathcal{L}(y_1, \ldots, y_t) = \sum_{k=1}^t \log P(y_k \mid y_1, \ldots, y_{k-1}, x)\]

This reduces the risk of missing high-probability sequences but can still lack diversity.
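
Below is a toy beam search sketch in Python. Here `fake_model` is a hypothetical stand-in that maps a prefix to a log-softmax over an 8-token vocabulary, not a real language model:

```python
import numpy as np

def beam_search(step_log_probs, vocab_size, beam_width=3, length=4):
    """Toy beam search. step_log_probs(prefix) returns log P(y | prefix, x)
    for every token; beams are scored by summed log-probability."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            log_p = step_log_probs(seq)
            for tok in range(vocab_size):
                candidates.append((seq + [tok], score + log_p[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the top beam_width sequences
    return beams

def fake_model(prefix):
    """Hypothetical stand-in for a language model: a deterministic
    pseudo-random log-softmax over an 8-token vocabulary."""
    rng = np.random.default_rng(abs(hash(tuple(prefix))) % 2**32)
    logits = rng.normal(size=8)
    return logits - np.log(np.sum(np.exp(logits)))

print(beam_search(fake_model, vocab_size=8))
```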


Top-\(k\) Sampling

Top-\(k\) sampling (Fan et al., 2018) restricts sampling to the \(k\) most likely tokens:

\[V^{(k)} = \{ w \in V \mid \text{rank}(P(w)) \le k \}\]

The distribution is renormalized:

\[P^*(w) = \begin{cases} P(w) / Z & \text{if } w \in V^{(k)} \\ 0 & \text{otherwise} \end{cases}\]

where \(Z = \sum_{w \in V^{(k)}} P(w)\).

Fixed \(k\) can lead to generic or incoherent text (Holtzman et al., 2020).
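
A minimal numpy sketch, assuming hypothetical logits over a five-token vocabulary and \(k = 3\):

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int, seed: int = 0) -> int:
    """Sample from the k most likely tokens after renormalization."""
    top = np.argsort(logits)[-k:]                  # indices of V^(k)
    probs = np.exp(logits[top] - np.max(logits[top]))
    probs /= probs.sum()                           # P(w) / Z over V^(k)
    return int(np.random.default_rng(seed).choice(top, p=probs))

logits = np.array([1.2, 0.3, 2.5, -0.7, 0.9])     # hypothetical logits
print(top_k_sample(logits, k=3))
```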


Nucleus (Top-\(p\)) Sampling

Nucleus sampling (Holtzman et al., 2020) selects the smallest set of tokens \(V^{(p)}\) whose cumulative probability is at least \(p\):

\[\sum_{w \in V^{(p)}} P(w) \ge p\]

Typically \(0.7 \le p \le 0.95\).
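
A minimal sketch, again assuming hypothetical logits over a five-token vocabulary:

```python
import numpy as np

def nucleus_sample(logits: np.ndarray, p: float = 0.9, seed: int = 0) -> int:
    """Sample from the smallest token set whose cumulative probability >= p."""
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # most probable first
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    nucleus = order[:cutoff]                       # the set V^(p)
    probs_p = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.default_rng(seed).choice(nucleus, p=probs_p))

logits = np.array([1.2, 0.3, 2.5, -0.7, 0.9])     # hypothetical logits
print(nucleus_sample(logits, p=0.9))
```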


Temperature Sampling

Temperature rescales logits:

\[P^*(w_i) = \frac{\exp(z_i / t)}{\sum_j \exp(z_j / t)}\]

Lower \(t\) sharpens the distribution and pushes decoding toward deterministic, greedy-like behavior; higher \(t\) flattens it and yields more diverse, “creative” output.
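
A minimal sketch of this rescaling, with hypothetical logits:

```python
import numpy as np

def temperature_probs(logits: np.ndarray, t: float) -> np.ndarray:
    """Apply temperature t to the logits before the softmax."""
    z = logits / t
    z = z - np.max(z)                    # subtract max for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

logits = np.array([1.2, 0.3, 2.5])       # hypothetical logits
print(temperature_probs(logits, t=0.5))  # sharper: mass concentrates on the argmax
print(temperature_probs(logits, t=2.0))  # flatter: closer to uniform
```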


\(\eta\)-Sampling

\(\eta\)-sampling (Hewitt et al., 2022) uses entropy-based truncation:

\[\mathcal{C}(y_{<t}) = \{y \in V \mid P(y \mid y_{<t}) > \eta\}\]

with

\[\eta = \min\left(\epsilon, \sqrt{\epsilon}\, e^{-H(P)}\right)\]

where \(\epsilon\) is a fixed hyperparameter and \(H(P)\) is the entropy of the next-token distribution.
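
A minimal sketch, with a hypothetical next-token distribution; \(\epsilon\) is set far larger than the paper’s values so that the truncation is visible on a tiny vocabulary:

```python
import numpy as np

def eta_truncate(probs: np.ndarray, epsilon: float = 0.1) -> np.ndarray:
    """Return the indices of C(y_<t): tokens with P(y | y_<t) > eta,
    where eta shrinks as the entropy H(P) of the distribution grows."""
    entropy = -np.sum(probs * np.log(probs))                 # H(P)
    eta = min(epsilon, np.sqrt(epsilon) * np.exp(-entropy))  # truncation threshold
    return np.flatnonzero(probs > eta)

# Hypothetical next-token distribution over a 5-token vocabulary.
probs = np.array([0.168, 0.068, 0.615, 0.025, 0.124])
print(eta_truncate(probs))  # low-probability tokens fall out of C(y_<t)
```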


Locally Typical Sampling

Locally typical sampling (Meister et al., 2023) restricts decoding to tokens whose surprisal \(-\log P(y \mid y_{<t})\) stays close to the conditional entropy, by solving:

\[\min_{\mathcal{C}} \sum_{y \in \mathcal{C}} \left| H(Y_t \mid Y_{<t}) + \log P(y \mid y_{<t}) \right|\]

subject to:

\[\sum_{y \in \mathcal{C}} P(y \mid y_{<t}) \ge \tau\]
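
A minimal sketch of this selection rule, with a hypothetical next-token distribution:

```python
import numpy as np

def locally_typical_set(probs: np.ndarray, tau: float = 0.9) -> np.ndarray:
    """Keep the tokens whose surprisal -log P(y | y_<t) is closest to the
    conditional entropy, until their total mass reaches tau."""
    entropy = -np.sum(probs * np.log(probs))      # H(Y_t | Y_<t)
    divergence = np.abs(entropy + np.log(probs))  # |H + log P(y | y_<t)|
    order = np.argsort(divergence)                # most "typical" first
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), tau)) + 1
    return order[:cutoff]

# Hypothetical next-token distribution over a 5-token vocabulary.
probs = np.array([0.168, 0.068, 0.615, 0.025, 0.124])
print(locally_typical_set(probs, tau=0.9))  # -> the selected set C
```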

Domain-Specific Fine-Tuning

Domain-specific fine-tuning adapts a pretrained model to a dataset \(D\):

\[\theta^* = \arg \min_\theta \sum_{(x_i, y_i) \in D} -\log P(y_i \mid x_i, \theta)\]

Parameters are updated iteratively via gradient descent, where \(\eta\) is the learning rate:

\[\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)\]

Adding a regularization term \(R(\theta)\) with weight \(\lambda\) to mitigate overfitting yields:

\[\theta^* = \arg \min_\theta \left[ \sum_{(x_i, y_i) \in D} -\log P(y_i \mid x_i, \theta) + \lambda R(\theta) \right]\]
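
As a minimal sketch of this regularized objective, the loop below fits a toy logistic-regression stand-in for \(P(y_i \mid x_i, \theta)\) with \(R(\theta) = \lVert\theta\rVert^2\); the dataset and hyperparameters are hypothetical:

```python
import numpy as np

# Toy dataset D of (x_i, y_i) pairs; logistic regression stands in
# for P(y_i | x_i, theta).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0]) > 0).astype(float)

theta = np.zeros(4)
lr, lam = 0.1, 0.01  # learning rate eta and regularization weight lambda

for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-X @ theta))  # P(y_i = 1 | x_i, theta)
    grad_nll = X.T @ (p - y)              # gradient of sum_i -log P(y_i | x_i, theta)
    grad = grad_nll + lam * 2.0 * theta   # plus gradient of lambda * ||theta||^2
    theta -= lr * grad                    # theta <- theta - eta * grad L(theta)

print(theta)  # fitted parameters theta*
```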

Concluding Remarks

Prompting, decoding algorithms, and domain-specific fine-tuning are all powerful mechanisms for controlling the quality, coherence, and diversity of machine-generated text.