# Topical Language Generation with Transformers

Large-scale transformer-based language models (LMs) demonstrate impressive capabilities in open text generation. However, controlling the generated text’s properties such as the topic, style, and sentiment is challenging and often requires significant changes to the model architecture or retraining and fine-tuning the model on new supervised data.

We introduce Topical Language Generation (TLG) by combining a pre-trained LM with topic modeling information. We cast the problem using a Bayesian probability formulation, with topic probabilities as the prior, LM probabilities as the likelihood, and the topical language generation probability as the posterior. In learning the model, we derive the topic probability distribution from the natural structure of the user-provided documents. Furthermore, we extend our model by introducing new parameters and functions that influence how many topical features appear in the generated text. This allows us to easily control the topical properties of the generated text.

## Language modeling and decoding

The applications of language generation in NLP can be divided into two main categories: directed language generation and open-ended language generation. Directed language generation involves transforming an input into an output, as in machine translation and summarization. These approaches need some semantic alignment between the inputs and the outputs. On the other hand, open-ended language generation has much more freedom in the generation process because it does not need to be aligned with any output. Open-ended language generation has applications in conditional story generation, dialog systems, and predictive response generation. Even though there is more flexibility in choosing the next tokens compared to directed language generation, controlling the top-level features of the generated text is a desirable property that remains a challenging problem.

Given a sequence of m tokens x_1, …, x_m as the context, the problem of open-ended language generation can be formulated as finding the continuation x_{m+1}, …, x_{m+n} with n tokens. In other words, we consider the whole sequence of context plus continuation, x_1, …, x_{m+n}.

The language modeling probability of this sequence can be decomposed using the chain rule into a product, over all positions i, of the probability of each token x_i given the tokens that precede it.
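In symbols, the full sequence and its chain-rule factorization can be written as follows. The notation x_{<i} for all tokens before position i is an assumption here, chosen to match the notation used later in the text:

$$
x_{1:m+n} = x_1, \ldots, x_{m+n},
\qquad
P(x_{1:m+n}) = \prod_{i=1}^{m+n} P(x_i \mid x_{<i})
$$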

The language modeling probability can be used with a *decoding strategy* to generate the next token for language generation. Finding the optimal continuation means choosing the n tokens x_{m+1}, …, x_{m+n} that jointly maximize this probability given the context.

Solving this maximization exactly is not tractable, so practical decoding strategies use approximations to generate the next tokens. The most widely used decoding strategies are greedy decoding and beam search. Greedy decoding selects the highest-probability token at each time step, while beam search keeps a set of hypotheses and updates them as it decodes more tokens. These approaches are well suited for directed language generation, but in open-ended generation they suffer from repetition, genericness, and degenerate continuations.

Both of these approaches are deterministic in the sense that they do not involve any random selection in their algorithms.

On the other hand, stochastic decoding methods sample the next token x_i from a model-dependent distribution q that is conditioned on the previous tokens x_{<i}.

The simplest stochastic sampling consists of sampling from the top-k probabilities. The use of a constant k is problematic because in some contexts the probability distribution of the next token is flat, which means there are plenty of reasonable next tokens to select from, while in other contexts the distribution is concentrated in a small number of tokens. To solve this problem, Holtzman et al. (2020) proposed Nucleus Sampling. In this method, a subset of the vocabulary is defined as the smallest set V(p) such that the total probability of its tokens, given the context, is at least p.

The resulting distribution over this reduced vocabulary is then re-scaled to form a probability distribution. Under Nucleus Sampling, the number of plausible next tokens changes dynamically with the context and the generated tokens. In this work, we use Nucleus Sampling as the base decoding technique and propose a new method to take into account topical knowledge about the tokens.
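The nucleus construction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `nucleus_filter` and `nucleus_sample` and the toy distributions are hypothetical, and a real model would supply probabilities over its full vocabulary.

```python
import random

def nucleus_filter(probs, p):
    """Keep the smallest set of tokens whose total mass is at least p, then re-scale."""
    # Rank tokens from most to least probable.
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept, mass = [], 0.0
    for token in ranked:
        kept.append(token)
        mass += probs[token]
        if mass >= p:
            break
    # Re-scale the kept probabilities so they sum to one.
    return {token: probs[token] / mass for token in kept}

def nucleus_sample(probs, p, rng=random):
    """Draw one token from the re-scaled nucleus distribution."""
    filtered = nucleus_filter(probs, p)
    tokens = list(filtered)
    weights = [filtered[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy next-token distributions (made-up numbers): one flat, one peaked.
flat = {"a": 0.26, "b": 0.25, "c": 0.25, "d": 0.24}
peaked = {"on": 0.93, "in": 0.04, "at": 0.02, "by": 0.01}

print(len(nucleus_filter(flat, 0.9)), "candidates for the flat distribution")
print(len(nucleus_filter(peaked, 0.9)), "candidate(s) for the peaked distribution")
print(nucleus_sample(flat, 0.9, random.Random(0)))
```

With the same value of p, the flat distribution keeps many candidates while the peaked one keeps very few, which is the dynamic behavior that a fixed k cannot provide.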

## Topical Language Modeling

Given a list of K topics t = {1, …, K}, to control the outputs of the language model to follow a certain topic, at each generation step we have to model the probability of the next token conditioned on both the previous tokens and the chosen topic, P(x_i|x<i,t_j).

Compared to the base language modeling probability, the only difference is the conditioning on the topic t_j. To compute this distribution, we change the last layer of the network that creates the logits.

Here, we adopt the GPT transformer architecture. If S denotes the logits produced by the last layer, the base model applies softmax to S to get the final probabilities P(x_i|x<i).

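Applying the Bayes rule to P(x_i|x<i,t_j) expresses it as the product of the LM probability P(x_i|x<i) and the topic probability P(t_j|x_i,x<i), normalized over the vocabulary.

Because topic modeling treats documents as bags of words, we can also assume that the probability of the topic given a token is independent of the previously generated tokens, that is, P(t_j|x_i,x<i) = P(t_j|x_i).

Now, assuming that we have P(t_j|x_i), we can prove that the conditional topical language model can be written as a softmax over the base logits S shifted by the log topic probability, that is, softmax(S + log(P(t_j|x_i))).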
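The result and a short sketch of why it holds can be written out as below. The symbols S_z (the base logit of token z) and V (the vocabulary) are notation assumed here for illustration. The first line is the Bayes rule, the second applies the bag-of-words independence assumption, and the third substitutes the softmax form of P(z|x_{<i}), whose normalizer cancels between numerator and denominator:

$$
\begin{aligned}
P(x_i \mid x_{<i}, t_j)
&= \frac{P(x_i \mid x_{<i})\, P(t_j \mid x_i, x_{<i})}{\sum_{z \in V} P(z \mid x_{<i})\, P(t_j \mid z, x_{<i})} \\
&= \frac{P(x_i \mid x_{<i})\, P(t_j \mid x_i)}{\sum_{z \in V} P(z \mid x_{<i})\, P(t_j \mid z)} \\
&= \frac{\exp\big(S_{x_i} + \log P(t_j \mid x_i)\big)}{\sum_{z \in V} \exp\big(S_z + \log P(t_j \mid z)\big)}
\end{aligned}
$$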

For the complete proof, refer to the paper.
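As a quick numerical sanity check of this identity, the sketch below computes the topical distribution in two ways: by multiplying the LM probabilities with the topic probabilities and renormalizing, and by adding the log topic probabilities to the logits before the softmax. The logits and topic probabilities are made-up numbers and the function names are hypothetical. The sketch assumes every P(t_j|x_i) is strictly positive so that the logarithm is defined.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bayes_posterior(lm_probs, topic_given_token):
    """P(x | context, t) proportional to P(x | context) * P(t | x)."""
    unnormalized = [p * q for p, q in zip(lm_probs, topic_given_token)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def topical_softmax(logits, topic_given_token):
    """softmax(S + log P(t | x)); requires strictly positive topic probabilities."""
    shifted = [s + math.log(q) for s, q in zip(logits, topic_given_token)]
    return softmax(shifted)

# Toy values for a four-token vocabulary (assumed for illustration).
logits = [2.0, 1.0, 0.5, -1.0]
topic_given_token = [0.05, 0.60, 0.30, 0.05]

via_bayes = bayes_posterior(softmax(logits), topic_given_token)
via_logits = topical_softmax(logits, topic_given_token)
largest_gap = max(abs(a - b) for a, b in zip(via_bayes, via_logits))
print("largest difference between the two computations:", largest_gap)
```

In this toy example the base model prefers the first token, while the topical distribution prefers the second, which has the highest topic probability.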

## Topic modeling

Topic modeling algorithms automatically extract topics from a collection of textual data. They are based on statistical unsupervised models that discover the themes running through documents. We use two main algorithms in topic modeling.

- LDA (Latent Dirichlet Allocation): The basic idea behind LDA is that, in a collection of documents, every document is a mixture of multiple topics, each present with some probability. Moreover, each topic has a distribution over the vocabulary. For example, a document can be on the topics of “Football”, “News” and “America”, and the topic of “Football” can contain words including “NFL”, “Football”, and “teams” with a higher probability compared to other words. Given a collection of M documents with vocabulary V, we can fix the number of topics to be K. In LDA, the probabilities of topics per document and of tokens per topic can be summarized in matrix form as θ_{M×K} and φ_{K×|V|}, respectively. After learning, we have the distribution of topics for each token, so P(t_j|x_i) can be computed from the entries of φ for token i.

- LSI (Latent Semantic Indexing): LSI applies singular value decomposition to the word-document matrix, whose rows and columns represent the words and documents, respectively. Let X_{|V|×M} be the token-document matrix such that X_{i,j} is the occurrence of token i in document j. Singular value decomposition can then be used to find the low-rank approximation X ≈ U Σ V^T.

After the decomposition, U still has the same number of rows as there are tokens, but it has fewer columns, which represent a latent space that is usually interpreted as “topics”. So, normalizing U gives us a score for each token per topic. We can use this score as the probability of topic j for each token i in the vocabulary.
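One plausible way to turn a topic-word matrix into per-token topic probabilities is to normalize each token's column over the topics. The sketch below assumes a uniform prior over topics and strictly positive column sums; the function name, the toy vocabulary, and the numbers in `phi` are hypothetical and are not taken from the paper or from a trained model.

```python
def topic_given_token(phi):
    """Normalize each column of a K x |V| matrix so a token's topic scores sum to 1."""
    num_topics, vocab_size = len(phi), len(phi[0])
    result = [[0.0] * vocab_size for _ in range(num_topics)]
    for i in range(vocab_size):
        # Total score of token i across all topics (assumed to be positive).
        column_total = sum(phi[j][i] for j in range(num_topics))
        for j in range(num_topics):
            result[j][i] = phi[j][i] / column_total
    return result

# Toy topic-word matrix: rows are topics, columns are tokens.
vocab = ["nfl", "teams", "album", "the"]
phi = [
    [0.50, 0.30, 0.01, 0.19],  # topic 0, e.g. "football"
    [0.02, 0.08, 0.70, 0.20],  # topic 1, e.g. "music"
]

p_topic = topic_given_token(phi)
for i, token in enumerate(vocab):
    print(token, [round(p_topic[j][i], 3) for j in range(len(phi))])
```

A topical token such as "nfl" ends up strongly associated with one topic, while a function word such as "the" is split almost evenly. Scores taken from an LSI matrix U can be negative, so they would need an extra step (for example shifting or clipping) before this kind of normalization makes sense.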

## Controllable Generation Methods

The conditional topical language model above gives us token generation that is conditioned on a specific topic, but it does not let us control the amount of topical influence.

1- Adding topical parameter and logit threshold: adding the term log(P(t_j|x_i)) directly to the actual logit from the model can deteriorate the fluency of the generated text in some cases. We propose two methods to alleviate this problem. First, we introduce a new parameter γ that scales log(P(t_j|x_i)) before it is added to the logit, which controls the influence of the topical distribution.

Higher values of γ result in more on-topic text generation because the final probability will be dominated more by log(P(t_j|x_i)) than by the logit from the base language model.

The other approach is to cut the log probabilities of the topic with a threshold. The lower values of S correspond to tokens to which the model gives very low probabilities, and we do not want to change them because doing so introduces unwanted tokens and diminishes the fluency. We therefore keep log(P(t_j|x_i)) only for the values of S that are larger than the threshold, and this thresholded log probability is the one that is scaled by γ and added to the logit.

Lower values of the threshold correlate with more on-topic text generation because more tokens from the original model are changed by log(P(t_j|x_i)).
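The two knobs can be combined in a small function. This is a sketch under assumptions: the name `topical_logits`, the default `threshold=None` (meaning no thresholding), and the toy numbers are all hypothetical. It follows the description above by adding γ times the log topic probability only to logits that exceed the threshold.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topical_logits(logits, topic_given_token, gamma=1.0, threshold=None):
    """Add gamma * log P(t | x) to each logit that is larger than the threshold."""
    adjusted = []
    for s, q in zip(logits, topic_given_token):
        if threshold is not None and s <= threshold:
            log_prob = 0.0  # leave very unlikely tokens untouched
        else:
            log_prob = math.log(q)  # q is assumed to be strictly positive
        adjusted.append(s + gamma * log_prob)
    return adjusted

# Toy values: the second token is the most on-topic one.
logits = [2.0, 1.0, 0.5, -6.0]
topic_given_token = [0.05, 0.60, 0.30, 0.05]

for gamma in (0.0, 1.0, 3.0):
    probs = softmax(topical_logits(logits, topic_given_token, gamma=gamma))
    print("gamma =", gamma, [round(p, 3) for p in probs])

# With a threshold of -5.0, the last token (logit -6.0) keeps its original logit.
print(topical_logits(logits, topic_given_token, gamma=1.0, threshold=-5.0))
```

Setting `gamma=0.0` recovers the base distribution, and increasing it shifts more probability mass toward the on-topic token.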

2- Using α-entmax instead of softmax: The problem with the softmax function is that it gives non-zero probabilities to a lot of unnecessary and implausible tokens. The softmax function is dense because it is proportional to the exp function and can never give exactly zero probabilities at the output. We use α-entmax instead to create sparser probabilities that are less prone to degenerate text. α-entmax maps a logit vector z to the distribution p in the probability simplex that maximizes p·z + H^T_α(p),

where ∆^{|V|−1} := {p ∈ ℝ^{|V|} : p ≥ 0, ∑_i p_i = 1} is the probability simplex and, for α ≥ 1, H^T_α(p) is the Tsallis α-entropy, a family of entropies that includes the Shannon entropy at α = 1.

α-entmax is the generalized form of the softmax function. In particular, for α=1 it exactly reduces to the softmax function, and as α increases, the sparsity in the output probabilities continuously increases. Here we are specifically interested in α=2, which results in sparsemax, the Euclidean projection of the logits onto the probability simplex.
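For reference, the definitions referred to in this item can be written as below, following the form in which they are commonly stated in the α-entmax literature. The symbol z for the logit vector is notation assumed here:

$$
\alpha\text{-entmax}(z) = \arg\max_{p \in \Delta^{|V|-1}} \; p^\top z + H^T_\alpha(p),
\qquad
H^T_\alpha(p) =
\begin{cases}
\dfrac{1}{\alpha(\alpha-1)} \sum_j \left(p_j - p_j^{\alpha}\right), & \alpha \neq 1 \\[2ex]
-\sum_j p_j \log p_j, & \alpha = 1
\end{cases}
$$

For α = 2 the entropy term becomes ½(1 − ‖p‖²), so maximizing p·z − ½‖p‖² is the same as minimizing the distance to z:

$$
\text{sparsemax}(z) = \arg\min_{p \in \Delta^{|V|-1}} \lVert p - z \rVert^2
$$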

Unlike the softmax function, sparsemax can assign zero probabilities.
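To make the contrast concrete, here is a minimal sketch of sparsemax using the standard sort-and-threshold procedure, next to a plain softmax. The function names and the example logits are illustrative assumptions; the general α-entmax for other values of α needs a different solver and is not shown.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(scores):
    """Euclidean projection of the scores onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        # The k-th largest score stays in the support while this holds.
        if 1.0 + k * value > cumulative:
            tau = (cumulative - 1.0) / k
        else:
            break
    # Shift every score by the threshold tau and clip at zero.
    return [max(s - tau, 0.0) for s in scores]

logits = [1.0, 0.8, 0.1, -1.0]
print("softmax:  ", [round(p, 3) for p in softmax(logits)])
print("sparsemax:", [round(p, 3) for p in sparsemax(logits)])
```

On these logits, softmax gives every token some probability, while sparsemax puts all of the mass on the two highest-scoring tokens and assigns exactly zero to the rest.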

3- Adding temperature and repetition penalty parameters: We need to make some changes to the base nucleus sampling to control the flatness of the base distribution and prevent it from generating repetitive words. We denote the final logit after the above changes as u_i. Given a temperature T, a repetition penalty r, and the list of generated tokens g, the final probability distribution for sampling is a softmax over the logits u_i divided by T, with the logits of tokens already in g additionally divided by r.

When T→0, the sampling reduces to greedy sampling, while if T→∞ the distribution becomes flatter and more random. The penalized sampling discourages drawing already generated tokens.
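A sketch of this final distribution is shown below. It assumes the commonly used form of penalized sampling, in which the logit of every already generated token is divided by r in addition to the temperature; the function name, parameter names, and numbers are hypothetical.

```python
import math

def penalized_distribution(logits, generated, temperature=1.0, repetition_penalty=1.0):
    """Softmax over u_i / (T * penalty_i), where penalty_i is r for generated tokens."""
    scaled = []
    for token_id, u in enumerate(logits):
        penalty = repetition_penalty if token_id in generated else 1.0
        scaled.append(u / (temperature * penalty))
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy positive logits; token 0 has already been generated once.
logits = [3.0, 2.5, 1.0, 0.5]
generated = {0}

plain = penalized_distribution(logits, generated)
penalized = penalized_distribution(logits, generated, repetition_penalty=1.5)
sharp = penalized_distribution(logits, set(), temperature=0.05)
flat = penalized_distribution(logits, set(), temperature=100.0)

print("no penalty:      ", [round(p, 3) for p in plain])
print("penalty r = 1.5: ", [round(p, 3) for p in penalized])
print("T = 0.05:        ", [round(p, 3) for p in sharp])
print("T = 100:         ", [round(p, 3) for p in flat])
```

Note that dividing by r lowers a token's probability only when its logit is positive, which is why the toy logits here are all positive; implementations that need to handle negative logits typically treat that case separately.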

## Topical Text Generation with Different Topics

One of the biggest benefits of TLG is that it can be used with different language models without any retraining or fine-tuning of the base model. However, to generate topical texts we need topics extracted from a text corpus. For training the topic models, we used the Alexa Topical-Chat dataset. This dataset contains conversations and a knowledge base on a wide variety of topics, from politics and music to sports. We do not use the topic tags in the dataset but extract topics automatically with our LDA and LSI topic models. This unsupervised approach gives us the flexibility to work with any raw text corpus.

In this experiment, a fixed neutral prompt has been used to make sure the model is not conditioned on the few initial tokens. The results in the table below show that after selecting a topic from the topic modeling output, the model can create long, coherent, and fluent text continuation without manually injecting extra knowledge from other resources or through training on labeled datasets.

## Effects of Hyperparameters on TLG

In our proposed approach, we can use γ and the threshold as knob parameters to control the amount of topic influence on the language generation process. More specifically, because γ scales the topical term that is added to the logits, higher values of γ result in more on-topic text. Also, lower values of the threshold are correlated with more on-topic language generation. In the limit, if we set γ=0 and threshold=0, TLG reduces to the original language model without any topic. However, our experiments have shown that changing γ is less detrimental to the fluency of the generated text than changing the threshold. This is because thresholding can easily cut off the probabilities of function tokens (like stop words) in the vocabulary, which hurts the fluency of the model. The figure below demonstrates language generation on a fixed topic (football) with different values of γ and the threshold. To show how much each token accounts for the topic, we use color-coding in which stronger colors show more on-topic words. We skipped the last stage of decoding, which is why the individual tokens from Byte Pair Encoding (BPE) tokenization can be seen.

## How does it work?

Language generation is the task of generating the next token conditioned on the previously generated tokens. The probability distribution of the next token in the base language model is flatter at some token positions and more peaked at others. For example, given the prompt “The issue is that”, there are plenty of possible next tokens, whereas the next token of a prompt like “It is focused” is almost always “on”. This property of language models gives us the flexibility to meddle in the generation process and steer it towards desired tokens when the probability distribution is flatter.

The concept of a flat or peaked distribution can be easily measured in terms of the entropy of the distribution. In Figures a and b we compare the entropy of the base model (token entropy) with the entropy of the posterior probability distribution of the topical language model (total entropy). Higher entropy for the base model at one position is a sign of its capability to sample from a large set of potential tokens with almost equal probabilities, but in our conditional language modeling we want to restrict that set to a smaller set that conforms with the chosen topic. Therefore, in almost all cases, the entropy of the TLG model drops significantly compared to the base model. We can observe that the differences are larger for the tokens that represent the topic (like teams, football, culture, and music) and smaller for function tokens (like stop words, which do not play any role in different topics).
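The entropy comparison can be illustrated with a few lines of Python. The distributions below are made-up numbers, not measurements from the model, and the helper names are hypothetical. The sketch computes the Shannon entropy of a flat base distribution and of the distribution obtained after weighting it by a peaked topic prior.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def bayes_posterior(lm_probs, topic_given_token):
    """Weight the base probabilities by the topic probabilities and renormalize."""
    unnormalized = [p * q for p, q in zip(lm_probs, topic_given_token)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# A flat base distribution: four equally plausible next tokens.
base = [0.25, 0.25, 0.25, 0.25]
# A topic prior that strongly favors the first token.
topic_given_token = [0.90, 0.05, 0.03, 0.02]

posterior = bayes_posterior(base, topic_given_token)
print("base entropy:     ", round(entropy(base), 3))
print("posterior entropy:", round(entropy(posterior), 3))
```

In this toy case the entropy drops once the topic prior is applied, mirroring the behavior described for topical tokens. The drop is not guaranteed for every possible pair of distributions, which is consistent with the "almost all cases" wording above.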

Another interesting observation is how the prior distribution that was extracted from topic modeling forces the language model to choose the topical tokens. The top-5 most likely tokens in a generation process are depicted in Figure 4. For the topic of football, the top-5 candidate tokens chosen by the model are compatible with the chosen topic.

## Graphical User Interface

We also provide the GUI as a playground for users to work with the TLG. On the left panel, you can control the dataset, topic model, number of topics, and other generation settings. The playground gives you a graph plot which is a novel representation of the topics and how they are related to each other. Then you can choose the topic of interest and choose a prompt and finally hit the generate button to get the topical text.