Over the past few months I’ve been playing with language models of various sizes.

Most people, when they think of language models, think of very large language models which can respond to a single prompt.

By any standard, the different versions of ChatGPT are very large language models, and can classify texts without needing to be fed examples. These models can also helpfully provide justification for the classifications they provide.

Here’s an example sentence, which happens (for reasons that will become obvious) to be taken from a political party manifesto:¹

“The legacy of the Celtic Tiger includes ‘ghost’ housing estates, negative equity, an unprecedented banking crisis, unemployment, a personal debt crisis, badly planned towns and a huge Exchequer deficit.”

Here’s how we might turn that into a prompt eliciting classification:

“Imagine it is February 2011. Classify the following sentence as about the past, present or future. ‘The legacy of the Celtic Tiger includes ’ghost’ housing estates, negative equity, an unprecedented banking crisis, unemployment, a personal debt crisis, badly planned towns and a huge Exchequer deficit.’”

We don’t need to supply examples of past, present or future-oriented sentences in order to get ChatGPT to give us a result. We just instruct it to classify. We did give some context (“imagine it is 2011”): this probably does not affect the response given in this example, but might matter when we are asking ChatGPT to classify a sentence which refers to a date which is now in the past but was in the future at the time the sentence was written.
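If you wanted to build prompts like this for many sentences, a small helper is enough. This is only a sketch: build_prompt and its as_of argument are names I’ve made up for illustration, and the example sentence is a shortened version of the one above.

```python
def build_prompt(sentence, as_of="February 2011"):
    """Return a zero-shot classification prompt for a single sentence.

    `as_of` supplies the date context, which matters for sentences that
    refer to dates which were in the future when they were written.
    """
    return (
        f"Imagine it is {as_of}. "
        "Classify the following sentence as about the past, present or future. "
        f"'{sentence}'"
    )


# A shortened version of the manifesto sentence, for illustration only
example = "The legacy of the Celtic Tiger includes negative equity."
prompt = build_prompt(example)
print(prompt)
```

The prompt text is then sent to whichever chat model you are using; that step is not shown here.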

Here’s ChatGPT’s response:

“In the context of February 2011, the sentence ”The legacy of the Celtic Tiger includes ‘ghost’ housing estates, negative equity, an unprecedented banking crisis, unemployment, a personal debt crisis, badly planned towns, and a huge Exchequer deficit” is about the past. It describes the consequences and legacy of the Celtic Tiger economic boom in Ireland, which took place in the past, leading to the issues and challenges mentioned in the sentence.”

I have been hand-coding the temporal focus of a number of sentences together with a research assistant. In some cases, we’ve got a third opinion from ChatGPT. I’ve been staggered by the fluency of the response. Whilst I don’t always agree with the classification given, I find the reasoning very helpful, and it almost always shifts my assessments in the way an assessment from another human would.

Zero-shot classification like this is an incredible technology. However, it’s not always the most accurate way of classifying a large number of texts. Several papers have found that fine-tuning smaller language models can give more accurate classification.

Unfortunately, “fine-tuning a smaller language model” is not easy. It imposes several moderately demanding software and hardware requirements. In this post, I’ll try replicating an existing classification effort using a fine-tuned model.

The paper I’m replicating

I’m working with “The Temporal Focus of Campaign Communication”, written by Stefan Müller, which appeared in the Journal of Politics in 2022. The aim of the paper is to describe the temporal focus of party manifestos – or in other words, to calculate the proportion of sentences or quasi-sentences which are about the past, the present, or the future. The main finding of the paper is that “parties devote, on average, around half of their manifestos to the future, 10% to the past, and 40% to the present”. Secondary findings include differences between incumbents and all other parties (incumbents talk more about the past), and differences in sentiment when talking about past, present and future.

The paper uses a support vector machine (SVM) to classify sentences. This SVM is trained on 5858 hand-coded English language sentences. The input to the SVM is a huge matrix with 5858 rows (one row for each sentence) and 9926 columns (one column for each unique word in the corpus), where the entries in each cell are the number of times each word occurred in each sentence. When the data is split into a training and testing set using a 70:30 split, the SVM manages to classify three-quarters of sentences correctly.

It’s a measure of the speed of progress in language modelling (and a comment on the speed of the academic publication process) that classifying sentences using an SVM with a term-document matrix now seems rather old-fashioned.² Although term-document matrices contain a tremendous amount of information, bag-of-words approaches don’t take account of the order of the words in a sentence. The sentences

“The dog bit the man”

and

“The man bit the dog”

give the same term-document matrix, but have very different meanings. It’s therefore natural to ask whether language models, which are designed to take account of the sequence of words, are able to classify sentences more accurately.
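Here’s a quick way to see this for yourself. It’s a minimal sketch using only the Python standard library, with a deliberately naive tokenizer (lower-case, split on spaces); bag_of_words is a name I’ve made up, and it is not the pre-processing used in the paper.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count lower-cased, space-separated words, ignoring their order."""
    return Counter(sentence.lower().split())


a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")

# One column per unique word, one row per sentence
vocabulary = sorted(set(a) | set(b))
rows = [[bag[word] for word in vocabulary] for bag in (a, b)]

print(vocabulary)  # ['bit', 'dog', 'man', 'the']
print(rows)        # [[1, 1, 1, 2], [1, 1, 1, 2]]
print(a == b)      # True
```

Both sentences produce exactly the same row of counts, so any classifier that only sees this matrix cannot tell them apart.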

Fine-tuning a model

Let’s start with a short Python script which will train our model. I’m going to assume that you have the data from Stefan’s replication archive already downloaded, and that the file data_sentences_classified_english.rds is in your working directory.

Modules

We begin by loading the modules we’ll use.

import pandas as pd
import numpy as np
import pyreadr
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch
import evaluate

The pandas and numpy modules may be familiar to those of you who have previously used Python for data analysis. The pyreadr module allows us to import the data file, saved in one of R’s native file formats, without even having to touch R.

The datasets, transformers and evaluate modules are part of the HuggingFace 🤗 ecosystem. They make it possible to conduct machine learning tasks in a very few lines of code. It was initially hard for me to understand how essential HuggingFace 🤗 is to the machine learning community. I imagine it was like coming to R and discovering the influence of Hadley Wickham.

The HuggingFace 🤗 libraries are, of course, built on other modules which make GPU go brr. The torch module is one of these. If you have torch or these other modules installed, you have probably dealt with lots of grisly details to do with package management in Python. Well done you.

Because we’ve imported torch, we’ll specify the device we’re using: either the GPU (if the CUDA backend is available) or the CPU. Because I’m running this on my desktop, I’ll use the GPU.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Data import

We begin by reading in the file using pyreadr. Because pyreadr supports both .RData files and .rds files, and because .RData files can include multiple objects, the return value of read_r is a dictionary where the names of the different R objects map on to dictionary keys. Because an .rds file is a single unnamed object, we access the data frame by using the special key None. After this, we end up with a pandas data frame stored in df. We then select just the two columns text and class.

fileloc = "../data_sentences_classified_english.rds"
df = pyreadr.read_r(fileloc)
df = df[None]
df = df[['text', 'class']]

Here it’s helpful to turn the categorical variable class into a numeric variable. We do that with some helper dictionaries.

id2label = {
    0: "Past",
    1: "Present",
    2: "Future",
}
label2id = {
    "Past": 0,
    "Present": 1,
    "Future": 2,
}

df['labels'] = df['class'].map(label2id)

We can now convert our pandas data frame into a HuggingFace 🤗 dataset:

mueller = Dataset.from_pandas(df)

Selecting a base model

At this point, we set up our model. This consists of two parts – a tokenizer, and the model proper.

Our model is the distilBERT model, which is recognized as being a fairly zippy and lightweight model which inherits the good performance of the older BERT model from which it has been, uh, distilled. distilBERT exists in case-sensitive and uncased versions. Here, I’ve used the slightly smaller uncased version. If we were repeating this in German, where all nouns are capitalized, and where letter case makes a difference, we might choose the case-sensitive version. This distilbert-base-uncased model is available through the HuggingFace 🤗 repositories, and if you are running the following commands for the first time the model will be saved to a cached location on your hard drive.

The tokenizer splits our text into component parts, and represents these parts as a matrix. In this respect, it’s similar to the bag of words approach we discussed previously. Tokenizers usually come attached to models. In the case of models from the BERT family, the tokenizer splits the sentences into word pieces. This tokenizer will also be downloaded the first time it is used.
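To give a feel for what “word pieces” means, here is a toy version of the greedy longest-match-first splitting used by BERT-style tokenizers. It is a sketch only: the vocabulary below is a tiny made-up one (the real one has tens of thousands of entries), and the function name wordpiece is mine, not part of the transformers library.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into pieces by repeatedly taking the longest match.

    Pieces that continue a word are marked with a leading '##'.
    If some part of the word cannot be matched, the whole word is unknown.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# A tiny, hypothetical vocabulary for illustration
toy_vocab = {"un", "##employ", "##ment", "debt", "crisis"}

print(wordpiece("unemployment", toy_vocab))  # ['un', '##employ', '##ment']
print(wordpiece("debt", toy_vocab))          # ['debt']
print(wordpiece("exchequer", toy_vocab))     # ['[UNK]']
```

The real tokenizer then replaces each piece with its integer position in the vocabulary, which is what the model actually sees.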

We use these models through two commands from the transformers module: AutoTokenizer and AutoModelForSequenceClassification. They allow us to instantiate the model easily with configuration options appropriate to the relevant task – in this case, sequence classification, a slightly broader category than text classification, but one which works for our application.

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels = 3,
                                                           id2label=id2label,
                                                           label2id=label2id)

We will get some warning messages: the key message is “You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference”. That’s what we’ll do now.

Configuration

We now configure the model. There are lots of configuration options here. Some, like the learning rate, affect the potential accuracy of the model. Others, like the batch size, mainly affect how resource-intensive the training is. If this model gives you an “out of memory” error (something which is common when using a GPU to train), then you can reduce the batch size, usually without much effect on the accuracy of the model. Most of the parameters here are taken from tutorials I’ve seen: I would no more change these parameters than I would change the arguments of glm.control if I were fitting a logit regression.

batch_size = 16 ## make this smaller if you get OOM errors

logging_steps = len(mueller) // batch_size
### Equivalent to R: logging_steps = floor(nrow(mueller) / batch_size)
model_name = f"{model_name}-finetuned-temporal"

training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy="epoch",  ## recent transformers releases call this eval_strategy
    disable_tqdm=False,
    logging_steps=logging_steps,
    push_to_hub=False,
    log_level="error"
)
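It can help to see what changing the batch size does to the amount of work per epoch. The sketch below is plain arithmetic rather than anything from the transformers library: steps_per_epoch is a made-up helper, and it assumes a single device, no gradient accumulation, and roughly 80% of the 5858 sentences ending up in the training set.

```python
import math


def steps_per_epoch(n_train, batch_size):
    """Number of optimizer steps needed to see every training example once."""
    return math.ceil(n_train / batch_size)


n_sentences = 5858
n_train = int(n_sentences * 0.8)  # 4686 sentences under an 80:20 split

for bs in (16, 8, 4):
    print(bs, steps_per_epoch(n_train, bs))
# 16 293
# 8 586
# 4 1172
```

Halving the batch size roughly doubles the number of steps, but each step needs less memory, which is why it is the first thing to try after an out-of-memory error.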

As part of this stage we can also set out the metrics we’ll be using to evaluate the model. This evaluation is post-estimation: the BERT model targets its own loss function when training, and our calculations of accuracy and F1 are for our own benefit, not the model’s.

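## Load the metrics once, rather than on every evaluation
f1 = evaluate.load("f1")
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    ## The predicted class is the one with the largest logit
    predictions = np.argmax(logits, axis=-1)

    ## compute() returns a dictionary, so we pull out the number itself
    f1val = f1.compute(references=labels, predictions=predictions, average="weighted")["f1"]
    accval = accuracy.compute(references=labels, predictions=predictions)["accuracy"]
    return {"F1": f1val, "accuracy": accval}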

Tokenization

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

### Tokenize every sentence
mueller_encoded = mueller.map(tokenize, batched=True)

### Add an 80:20 train/test split
mueller_encoded = mueller_encoded.train_test_split(test_size=0.2)
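For anyone unsure what the split is doing, here is a rough standard-library sketch of the same idea: shuffle the rows, then hold back a fifth of them. The function split_train_test and the seed value are my own illustrative choices, not what the datasets library does internally. I believe the real train_test_split method also accepts a seed argument if you want a reproducible split, but check the documentation for your version.

```python
import random


def split_train_test(items, test_size=0.2, seed=42):
    """Shuffle items reproducibly and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for 100 row numbers
train, test = split_train_test(range(100), test_size=0.2, seed=42)
print(len(train), len(test))  # 80 20
```

Without a fixed seed, each run produces a different split, so the accuracy figures will vary slightly from run to run.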

Training and saving the model

The lines below actually train or fine-tune the model. This is the most computationally intensive stage. With just under six thousand examples, my machine is able to fine-tune the model in about six minutes.

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=mueller_encoded["train"],
    eval_dataset=mueller_encoded["test"],
    tokenizer=tokenizer
)

trainer.train()

We can investigate where the trainer ended up by printing the result of trainer.state:

print(trainer.state)

This gives us different figures concerning the F1 and accuracy metrics in the training and testing data. In this case, both the accuracy and F1 scores are around 0.83. This is good: if our accuracy was high but our F1 was comparatively low it might suggest that the classifier is getting good accuracy by predicting the most common class all of the time. This is certainly something I’ve run into in the past.
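To make the point about accuracy and F1 concrete, here is a small hand-rolled example. It is a sketch with made-up labels (ten sentences, six of them about the future) and a lazy classifier that always predicts the most common class; the two functions are my own simplified versions, not the ones in the evaluate module.

```python
def accuracy(y_true, y_pred):
    """Share of predictions which match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share."""
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * (tp + fn) / len(y_true)
    return total


# Made-up data: a classifier which always says "Future"
y_true = ["Future"] * 6 + ["Present"] * 3 + ["Past"] * 1
y_pred = ["Future"] * 10

acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
print(acc)           # 0.6
print(round(f1, 2))  # 0.45
```

The lazy classifier scores 60% on accuracy but only 0.45 on weighted F1, which is the gap to look out for.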

These accuracy statistics compare favourably to the accuracy statistics reported in the supplemental information for the original article. There the accuracy was reported at 75%. This means that we’ve improved on this model by roughly half as much (8 percentage points) as this model improved over a null model which simply predicts the most common ‘future’ class all the time (60% accuracy; an improvement of 15 percentage points).
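The comparison is just arithmetic on the three accuracy figures quoted above (60% for the null model, 75% for the SVM, and around 83% here), which can be checked in a few lines:

```python
# Accuracy figures in percentage points, as quoted in the text
null_acc, svm_acc, bert_acc = 60, 75, 83

gain_svm = svm_acc - null_acc    # SVM over the null model
gain_bert = bert_acc - svm_acc   # fine-tuned model over the SVM
ratio = gain_bert / gain_svm

print(gain_svm, gain_bert, round(ratio, 2))  # 15 8 0.53
```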

At this point it’s helpful to check that we can generate predictions; that the accuracy is what was reported, and that the predictions are not all the same.

predictions = trainer.predict(mueller_encoded["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

preds = np.argmax(predictions.predictions, axis=-1)
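## Count how many test sentences were assigned to each class
print(sum(preds == 0), sum(preds == 1), sum(preds == 2))
## Accuracy computed by hand, to compare with the figure reported by the trainer
manual_accuracy = np.mean(preds == np.array(mueller_encoded["test"]["labels"]))
print(manual_accuracy)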
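The predictions returned by the trainer are raw scores (logits), one per class, which is why we take the argmax. If you want something which looks like a probability, the usual step is a softmax. Here is a sketch in plain Python with made-up logits for a single sentence; I’m assuming the class order Past, Present, Future used in label2id.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities which sum to one."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["Past", "Present", "Future"]
logits = [2.0, 0.5, -1.0]  # made-up scores for one sentence

probs = softmax(logits)
predicted = labels[probs.index(max(probs))]

print([round(p, 3) for p in probs])
print(predicted)  # Past
```

The softmax does not change which class comes out on top; it only rescales the scores.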

We’ll want to save this model so that we can use it later for the purposes of inference. It is possible to save these to the HuggingFace 🤗 site/hub if you have a log-in, but I am not quite ready to do that.

trainer.save_model("distilbert-base-uncased-finetuned-temporal")

Using the model to classify

Now that we’ve got a fine-tuned model, we can use it on the remaining unlabelled manifesto sentences. This stage can be carried out entirely separately from the training stage. Remember though that you will probably need to import many of the same modules as before.

import pandas as pd
import numpy as np
import pyreadr
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, pipeline
import torch
from transformers.pipelines.pt_utils import KeyDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_ckpt = "distilbert-base-uncased-finetuned-temporal"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

num_labels = 3
id2label = {
    0: "Past",
    1: "Present",
    2: "Future",
}
label2id = {
    "Past": 0,
    "Present": 1,
    "Future": 2,
}

model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=num_labels, id2label=id2label, label2id=label2id)

In this next code chunk we set up a pipeline, which will allow us to batch process new inputs, and thus classify them faster. The pipeline will however require us to specify a batch size, which will depend on the hardware available. If you think your machine has a better specification than my machine, increase the batch size. If you think your machine has a worse specification than my machine, lower the batch size.

classify = pipeline(task='text-classification',  # replace with whatever task you have
                    model=model,
                    tokenizer = tokenizer,
                    device = device,  # use the GPU if one was found above
                    top_k = None)
batch_size = 8

To produce the data we wish to feed into our pipeline, we turn to R, and gather the manifestos of parties in English-speaking countries. We do this using the manifestoR package, for which you’ll need your own API key.

library(manifestoR)
library(tidyverse)
library(tidytext)
marpor_apikey_locn <- "/path/to/your/key.txt"

mp_setapikey(key.file = marpor_apikey_locn)
dat <- mp_maindataset()

english_speaking_countries <- c("United Kingdom",
                               "Ireland", "New Zealand",
                               "Australia", "Canada")

corpus_uk <- mp_corpus(countryname == "United Kingdom")
corpus_irl <- mp_corpus(countryname == "Ireland")
corpus_nzl <- mp_corpus(countryname == "New Zealand")
corpus_aus <- mp_corpus(countryname == "Australia")
corpus_can <- mp_corpus(countryname == "Canada")

I’ve written a short function to pull out the quasi-sentences and attach some identifying variables.

tidy_manif <- function(x) {
    dat <- as.data.frame(x$content)
    dat$party <- x$meta["party"]$party
    dat$date <- x$meta["date"]$date
    dat$manifesto_id <- x$meta["id"]$id
    return(dat)
}
### for each corpus, get the date of each document
uk <- bind_rows(lapply(corpus_uk, tidy_manif))
irl <- bind_rows(lapply(corpus_irl, tidy_manif))
nzl <- bind_rows(lapply(corpus_nzl, tidy_manif))
aus <- bind_rows(lapply(corpus_aus, tidy_manif))
can <- bind_rows(lapply(corpus_can, tidy_manif))

Although the entries in the manifesto data are supposed already to be split into quasi-sentences, that’s not true for some of the earlier manifestos. As a result, not only do I have to split the sentences again using the tidytext package’s unnest_tokens function, I have to exclude some “sentences” with a very large number of characters. If I feed these mammoth sentences into a BERT-based model, I’ll get a complaint about the input data exceeding the maximum number of tokens (512). If I were doing this properly (rather than just for a blog post), I’d amend the data so that I could split these sentences properly.

### If the sentences aren't already split, try again using tidytext
dat <- bind_rows(list(uk, irl, nzl, aus, can))

dat <- dat |>
    unnest_tokens(output = "the_sentence", input = text, token = "sentences")

dat <- dat |>
    filter(nchar(the_sentence) < 1800)

saveRDS(dat, file = "english_language_manifestos.rds")
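If you would rather stay in Python for this step, the same two ideas (split into sentences, then drop anything implausibly long) can be sketched with the standard library. This is a rough stand-in, not a port of tidytext: the regular expression is a naive splitter of my own, the example text is made up, and the 1800-character cut-off simply mirrors the R code above.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def keep_short(sentences, max_chars=1800):
    """Drop 'sentences' too long to be real sentences."""
    return [s for s in sentences if len(s) < max_chars]


# Made-up text for illustration
text = "We will build homes. We cut taxes last year! What comes next?"
sentences = split_sentences(text)
kept = keep_short(sentences + ["x" * 2000])

print(sentences)
print(len(kept))  # 3
```

A naive splitter like this will trip over abbreviations and decimal points, so treat it as a starting point only.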

We now head back to Python and read in our data using the pyreadr package like we did before.

df = pyreadr.read_r("./english_language_manifestos.rds")[None]
df = df[['party', 'manifesto_id', 'the_sentence']]

This gives us a data frame of manifesto sentences. We now iterate over the rows in this data frame by converting it to a HuggingFace 🤗 dataset.

dataset = Dataset.from_pandas(df)

def list_of_lists2pd(l):
    inner_list = l
    labels = [item['label'] for item in inner_list]
    scores = [item['score'] for item in inner_list]
    # Create a DataFrame from the extracted data
    df = pd.DataFrame([scores], columns=labels)
    return(df)

res = [list_of_lists2pd(out) for out in classify(KeyDataset(dataset, "the_sentence"), batch_size = batch_size)]

This will take some time. Again, it’s hard to say how much time it will take in total. If you are running this on your CPU without graphics card acceleration and with a small batch size, it might run overnight. If you’re running this on the cloud with a nice NVIDIA GPU, it might take less time than you’ll take to read this paragraph. But it’s probably worth your while leaving your desk, taking a walk around, and doing some back stretches.

The chunk above uses a Python list comprehension, together with a function which takes the pipeline output for one sentence (a list of label and score pairs) and turns it into a one-row pandas data frame. This still gives us a list of data frames, when what we want is something more rectangular. Happily, we can perform the Python equivalent of bind_rows using pd.concat. We then bind this data frame of predictions with the original data frame, where we bind by row number. You don’t need to worry too much about the calls to reset_index – they are there because we know we’re operating on two data frames where the rows correspond, and where we don’t need to duplicate any indices in the result of the merger.

### now concatenate these by row
res = pd.concat(res)
df = df.reset_index(drop = True)
res = res.reset_index(drop = True)
out = pd.concat([df, res], axis = 1)
out = out.reset_index(drop = True)

out.to_parquet("classified_manifesto_sentences.parquet")

We can then return to R to conduct our analysis.

library(tidyverse)
library(arrow)
dat <- read_parquet("classified_manifesto_sentences.parquet")

We want to aggregate by party, and check whether there are differences by incumbency.

dat <- dat |>
    group_by(party, manifesto_id) |>
    summarize(across(c("Past", "Present", "Future"),
                     mean),
              .groups = "drop")

aux <- readRDS("../data_merged.rds") |>
    dplyr::select(manifesto_id, party, incumbency_status2_factor) |>
    distinct()

dat <- left_join(dat,
                 aux,
                 by = join_by(manifesto_id, party))

The first claim we want to reproduce was this:

“Across the sample of 621 party manifestos, on average, 54% of sentences relate to the future, 37% focus on the present, and 9% describe the past.”

We can get that by summarizing again.

dat |>
    summarize(across(c("Past", "Present", "Future"),
                     mean),
              .groups = "drop")

We find that the proportions are slightly different: here, 57% of sentences relate to the future, 32% to the present, and 11% to the past. This could either be due to the basis for classification (fine-tuned language model rather than SVM) or the inclusion of new manifestos from the post-2019 period. The broad-brush implications of the findings (a majority of sentences are about the future; sentences about the past are the least common) are unchanged.
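One thing worth knowing about the summary above is that it averages manifesto-level averages, so a short manifesto counts for as much as a long one. That is not the same as pooling all sentences together. Here is a sketch in plain Python with made-up numbers (one long manifesto, one very short one) showing how far the two can diverge.

```python
from collections import defaultdict
from statistics import mean

# Made-up data: (manifesto id, predicted probability of "Future")
rows = [
    ("A", 0.9), ("A", 0.7), ("A", 0.8), ("A", 0.6),
    ("B", 0.2),
]

by_manifesto = defaultdict(list)
for manifesto_id, p_future in rows:
    by_manifesto[manifesto_id].append(p_future)

# Average within each manifesto, then across manifestos (as in the R code)
manifesto_means = {key: mean(values) for key, values in by_manifesto.items()}
mean_of_means = mean(manifesto_means.values())

# Alternative: average over all sentences at once
pooled_mean = mean(p for _, p in rows)

print(round(mean_of_means, 3))  # 0.475
print(round(pooled_mean, 3))    # 0.64
```

Which of the two is appropriate depends on whether the manifesto or the sentence is the unit you care about.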

The second claim we want to reproduce is that incumbents talk about the past more:

“Incumbent parties’ average emphasis on the past exceeds the focus on the past by opposition parties by around 5 percentage points”

We can show this with a linear regression:

library(modelsummary)
mod <- lm(Past ~ incumbency_status2_factor, data = dat)
modelsummary(list("Past proportion" = mod),
             coef_rename = c("incumbency_status2_factorIncumbent" = "Incumbent"),
             output = "markdown")

Linear regression of the proportion of sentences concerning the past

             Past proportion
(Intercept)  0.090
             (0.004)
Incumbent    0.082
             (0.007)
Num.Obs.     359
R2           0.265
R2 Adj.      0.263
AIC          -956.3
BIC          -944.6
Log.Lik.     481.141
RMSE         0.06

Once again, the figures are somewhat different (here the gap is around 8 percentage points rather than 5), but the conclusion (incumbents talk about the past more) is unchanged.
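If the regression table looks opaque, it may help to remember that with a single yes/no predictor, the intercept is just the average for the reference group and the coefficient is the difference between the two group averages. On that reading, and assuming non-incumbents are the reference group, the table says they devote about 9% of their manifestos to the past, and incumbents about 0.090 + 0.082 = 17%. The sketch below checks the general point on made-up data; ols_binary is a hand-written helper, not a library function.

```python
from statistics import mean


def ols_binary(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    covariance = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    variance = sum((xi - x_bar) ** 2 for xi in x)
    slope = covariance / variance
    return y_bar - slope * x_bar, slope


# Made-up data: 1 = incumbent, 0 = not; outcome = share of sentences about the past
incumbent = [0, 0, 0, 1, 1]
past_share = [0.08, 0.09, 0.10, 0.16, 0.18]

intercept, slope = ols_binary(incumbent, past_share)
other_mean = mean(y for x, y in zip(incumbent, past_share) if x == 0)
incumbent_mean = mean(y for x, y in zip(incumbent, past_share) if x == 1)

print(round(intercept, 3), round(other_mean, 3))            # 0.09 0.09
print(round(slope, 3), round(incumbent_mean - other_mean, 3))  # 0.08 0.08
```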

Conclusions

In this post, I’ve given code which can be used to fine-tune a language model to classify sentences given some existing hand-coded sentences. I’ve written this post mostly as an aide-memoire, but I hope that it might prove useful to anyone who is keen on classification, who is somewhat skeptical of zero-shot classification, but who is worried about the set-up costs of fine-tuning a large language model.

My machine

This analysis was run on a Linux machine (Ubuntu 23.04) with an AMD Ryzen 9 3900X 12-Core Processor and 64 gigabytes of RAM; my graphics card is an NVIDIA GeForce GTX 1650 with 4 GB of memory. Compared to CPUs currently available, my CPU is in the top quintile. Compared to graphics cards currently available, this is below average.

Footnotes

¹ Sinn Féin general election manifesto, 2011.
² None of this is intended in any way as a slight on Stefan Müller, who’s been working in this area for a long time and who is already working on much richer analyses.