Important

tl:dr; R code to produce a lookup table connecting They Work For You constituency codes to Wikidata entities, and the self-same lookup table.

Observational social scientists often work with named entities which have no canonical name. Examples include politicians and places.

Sometimes the different names given to these entities differ quite fundamentally. The names “Western Isles” and “Na h-Eileanan an Iar” pick out the same constituency, but there is no way of knowing this unless you know both English and Gaelic.

Most of the time, the different names given to these entities differ only in minor respects. The two strings “St. Helen’s” and “St Helens” pick out the same constituency, but differences in punctuation mean that these two strings don’t match exactly, and can’t therefore be used to join different sources of data.

Although we could remove punctuation to deal with the St Helens of this world, rule-based transformations like stripping punctuation won’t work all the time. Someone not familiar with British politics might be able to work out that “Tam Dalyell” and “Thomas Dalyell” are the same person, but there’s no rule that enables us to move between these two different representations.

The general name for the problem I’m talking about is record linkage. In this post, I’ll give R code to link two data sets which record information about parliamentary constituencies:

They Work For You, a website which makes information from parliamentary websites more easy to use
Wikidata, the structured knowledge base that acts as a companion to Wikipedia

They Work For You provides good information about parliamentary divisions, and records structured information about constituencies. Wikidata contains rich biographical information about parliamentarians (and most things). It’s therefore helpful to connect these two sources of data.

Although most of the time we’ll be linking data-sets regarding politicians, linking constituencies is an important first step, since it allows us to block exactly on constituency, making our matches much more accurate, and allowing us to distinguish between legislators with identical names but who represent different constituencies.

The packages used

I’ll be using six packages in total. I’ll use the tidyverse, as I do with most projects. I’ll be using the jsonlite package, because some of the data from TheyWorkForYou is in JSON format. The record linkage itself is done in reclin2, a recent rewrite of the reclin package which I’ve been using for the past year or so. Finally, I’ll be querying WikiData using the WikidataQueryServiceR package. The remaining two packages, parallel and glue, are used only incidentally.

library(tidyverse)
library(jsonlite)
library(reclin2)
library(WikidataQueryServiceR)
library(parallel)
library(glue)

Because record linkage can be computationally expensive (comparing 650 constituency names to 650 other constituency names is a lot of comparisons!), I tell reclin2 to use all but one of my cores.

num_cores <- parallel::detectCores() - 1
setDTthreads(threads = num_cores)

Querying Wikidata

If you don’t already know about Wikidata, you should read up on it. Wikidata contains information about many different kinds of entities and their relationships.

Wikidata can be queried using SPARQL, a language not too similar from other query languages used for databases. If you want to see some examples of SPARQL queries for British politicians, there is a truly excellent page of examples.

Wikidata can be difficult to get to grips with because you often need to start from a particular named entity or properties, and finding those named entities can be difficult. I’m going to start from a list of properties relating to membership in different parliaments. Here’s my starting point, written out as a tribble.

terms <- tribble(~start_date, ~wikidata_member_id,
        "2019-12-12","Q77685926",
        "2017-06-08","Q30524710",
        "2015-05-07","Q30524718",
        "2010-05-06","Q35494253",
        "2005-05-05","Q35647955",
        "2001-06-07","Q35921591",
        "1997-05-01","Q36634044",
        "1992-04-09","Q41582627",
        "1987-06-11","Q41582624",
        "1983-06-09","Q41582621",
        "1979-05-03","Q41582619",
        "1974-10-10","Q41582617",
        "1974-02-28","Q41582615",
        "1970-06-18","Q41582612",
        "1966-03-31","Q41582609",
        "1964-10-15","Q41582608",
        "1959-10-08","Q41582606",
        "1955-05-26","Q41582604",
        "1951-10-25","Q41582603",
        "1950-02-23","Q41582600",
        "1945-07-05","Q41582597")

This tribble contains information on the start date of each parliamentary term in the post-war period, and a WikiData property. The property Q30524710 is the property of “being a member of the 57th Parliament of the United Kingdom”, and we can learn more about this property by prefixing it with https://www.wikidata.org/wiki/. The resulting page is human readable, and tells us a bit more about the relationships of this property, most obviously the dates of this term.

Because we’ll be looking for members elected between the start of each term and the beginning of the following term, I’ll do some date handling, including a purely notional end date for the current parliament.

terms <- terms |>
    mutate(start_date = as.Date(start_date),
           end_date = lag(start_date) - 1,
           end_date = coalesce(end_date, as.Date("2099-12-31")))

I’m now going to construct a SPARQL query based on this set of properties. Here’s what the query looks like, with a placeholder session:

SELECT DISTINCT ?constituency ?constituencyLabel {
 ?item p:P39 ?positionStatement .
 ?positionStatement ps:P39 wd:<<session>> . 
 ?positionStatement pq:P768 ?constituency . 
 SERVICE wikibase:label { bd:serviceParam wikibase:language 'en' }
}

Here’s an English translation of that query:

“start with those items which are items which hold a position (have property P39), and store the position held as positionStatement; keep those positions which are positions in the named parliamentary session, and store the property qualifier electoral district (P768) as constituency; use the labelling service to construct the additional variable constituencyLabel

This is a rather indirect query, because it starts with “items which hold positions”. In writing this blog-post, I’ve found it easier to start from legislators and abstract to the constituencies they represent, but that might not work for all cases.

You can try this query at the public WikiData SPARQL endpoint for the 2017-2019 parliament. Here are the first five rows of the result:

constituency	constituencyLabel
wd:Q750994	Argyll and Bute
wd:Q875403	Streatham
wd:Q751233	Ashton-under-Lyne
wd:Q1077604	Lincoln
wd:Q874276	Windsor

We could use the public endpoint, but we want to store these lists programmatically. Here, I write a function which takes a session identifier (say, Q30524710) and returns a data frame. I then map over the different values of wikidata_member_id and save the results in a new data-frame column called wikidata.

wd_consts <- function(session) {
    query_template <- "
SELECT DISTINCT ?constituency ?constituencyLabel {
 ?item p:P39 ?positionStatement .
 ?positionStatement ps:P39 wd:<<session>> . 
 ?positionStatement pq:P768 ?constituency . 
 SERVICE wikibase:label { bd:serviceParam wikibase:language 'en' }
}
"
    query <- glue::glue(session = session,
                        query_template,
                        .open = "<<",
                        .close = ">>")
    res <- query_wikidata(query)   
}

terms <- terms |>
    mutate(wikidata = map(wikidata_member_id, wd_consts))

Working with They Work For You data

They Work For You is an excellent site which has, together with the related project Public Whip, done an incredible amount to familiarize people with how MPs vote in the House of Commons.

Both They Work For You and Public Whip are powered by the parlparse project, which can be found on GitHub. I love that all the underlying data is freely available, but I don’t always like the format it’s stored in. For our purposes, we’re interested in the (monster) JSON file, people.json. Assuming you’ve downloaded this file to your working directory, you can read it in using the following code:

library(jsonlite)
ppl <- fromJSON("./people.json")

Although jsonlite’s fromJSON command will try and simplify to a data frame as much as possible, the ppl object is a list of four different data frames:

memberships
organizations
persons
posts

Memberships are different from persons. Diane Abbott is a person, who has the person ID uk.org.publicwhip/person/10001. Diane Abbott is associated with multiple different memberships. For example, membership uk.org.publicwhip/member/2069 is Diane Abbott’s membership in the Commons between 1987 and 1992, representing Hackney North and Stoke Newington. The memberships data frame contains a variable, post_id, which corresponds to the id field in the posts data frame.

Navigating between these different data frames and identifiers can sometimes be difficult, especially where some columns are actually data frame columns. The following code splits up the JSON file into these separate elements, and makes sure that all the variables we need are addressable using the dollar sign.

### Split the JSON up
members <- ppl$members
persons <- ppl$persons
posts <- ppl$posts

### Handle awkward details in posts
posts <- data.frame(name = posts$area$name,
                    id = posts$id,
                    start_date = posts$start_date,
                    end_date = posts$end_date)

I’ll use the same data frame I used before (terms), and map over the start and end dates in order to store the constituencies used for each parliamentary terms. As before, you might find the logic a bit circuitous, because I first identify the members who served in this period before finding out the constituencies they represented.

twfy_consts <- function(begin, end, members, posts) {
### type checking
    begin <- as.Date(begin)
    end <- as.Date(end)    
    tmp <- members |>
        mutate(start_date = as.Date(start_date),
               end_date = as.Date(end_date))

### Just get Commons seats
    tmp <- tmp |>
        filter(start_reason == "general_election") |>
        filter(grepl("cons", post_id)) |>
        filter(!grepl("-NI$", post_id))

    ### The start date has to be equal to or after the `begin` argument
    tmp <- tmp |>
        filter(start_date >= begin)

### Deal with missing end dates by imputing a notional end date way in the future
    tmp <- tmp |>
        mutate(end_date = coalesce(end_date, as.Date("2099-11-30")))

    ### The end date has to be (strictly) less than the `end` argument
    tmp <- tmp |>
        filter(is.na(end_date) | end_date < end)

    ### Pull out the unique values... 
    tmp <- tmp |>    
        dplyr::select(post_id) |>
        distinct()

    #### ... and merge
    tmp <- left_join(tmp,
                     posts |> dplyr::distinct(name, id, start_date),
                     by = join_by(post_id == id))
                     
    
}

terms <- terms |>
    mutate(twfydata = map2(start_date, end_date, twfy_consts, members, posts))

If you’re playing along at home, you should have a data frame terms which looks a little bit like this. We have a row for each session, and two special data frame columns which contain the constituencies for that session according to WikiData (wikidata), and the constituencies for that session according to TheyWorkForYou (twfydata).

# A tibble: 6 × 5
  start_date wikidata_member_id end_date   wikidata             twfydata      
  <date>     <chr>              <date>     <list>               <list>        
1 2019-12-12 Q77685926          2099-12-31 <spc_tbl_ [650 × 2]> <df [650 × 3]>
2 2017-06-08 Q30524710          2019-12-11 <spc_tbl_ [650 × 2]> <df [650 × 3]>
3 2015-05-07 Q30524718          2017-06-07 <spc_tbl_ [650 × 2]> <df [650 × 3]>
4 2010-05-06 Q35494253          2015-05-06 <spc_tbl_ [650 × 2]> <df [650 × 3]>
5 2005-05-05 Q35647955          2010-05-05 <spc_tbl_ [646 × 2]> <df [645 × 3]>
6 2001-06-07 Q35921591          2005-05-04 <spc_tbl_ [659 × 2]> <df [659 × 3]>

Record linkage with `reclin2`

Now that we know what names and codes each project uses to store information for each parliamentary term, we can start linking the two sets of data. This was the only part of my code that I had to use a for loop for.

for (i in 1:nrow(terms)) {
    ### Pull out our data frames
    wd <- terms$wikidata[[i]] |>
        dplyr::select(wikidata_id = constituency,
                      name = constituencyLabel)
    
    twfy <- terms$twfydata[[i]] |>
        dplyr::select(twfy_id = post_id,
                      name)

### Link the two data-sets...
    pp <- reclin2::pair(wd, twfy)

    ### Using a single variable, name, comparing them using a function
    ### which is good for strings
    pp <- compare_pairs(pp, on = c("name"),
                        default_comparator = cmp_jarowinkler(0.9))

    ### These two steps create a scoring/`weights` variable
    m <- problink_em(~name, pp)
    pp <- predict(m, pairs = pp, add = TRUE)

    ### Exact one-to-one matching, as specified by the n and m arguments
    pp <- select_n_to_m(pp, variable = "ntom", score = "weights",
                        threshold = 0,
                        n = 1, m = 1)

    ### Get the links
    linked_data_set <- link(pp, selection = "ntom") |>
        dplyr::select(wikidata_id, twfy_id)

    ## Add this on to the wikidata data
    terms$wikidata[[i]] <- left_join(terms$wikidata[[i]],
                                     linked_data_set,
                                     by = join_by(constituency == wikidata_id))
    
}

This for loop is not particularly quick, because the select_n_to_m matching has to engage in a lot of comparisons rather than matching greedily. Since we’re only doing this once, slow optimal matching is probably better than greedy matching.

If we just want the lookup, we can extract it. Here’s the code, and what it looks like. You can download it here.

### Just extract the lookup
lu <- terms |>
    dplyr::select(wikidata) |>
    unnest(cols = c(wikidata)) |>
    dplyr::select(wikidata_id = constituency,
                  twfy_id = twfy_id)

lu <- lu |>
    distinct()

# A tibble: 1,650 × 2
   wikidata_id                             twfy_id                   
   <chr>                                   <chr>                     
 1 http://www.wikidata.org/entity/Q299984  uk.org.publicwhip/cons/3  
 2 http://www.wikidata.org/entity/Q613294  uk.org.publicwhip/cons/648
 3 http://www.wikidata.org/entity/Q611841  uk.org.publicwhip/cons/191
 4 http://www.wikidata.org/entity/Q988356  uk.org.publicwhip/cons/549
 5 http://www.wikidata.org/entity/Q751715  uk.org.publicwhip/cons/519
 6 http://www.wikidata.org/entity/Q1077604 uk.org.publicwhip/cons/343
 7 http://www.wikidata.org/entity/Q874607  uk.org.publicwhip/cons/619
 8 http://www.wikidata.org/entity/Q875108  uk.org.publicwhip/cons/601
 9 http://www.wikidata.org/entity/Q1080424 uk.org.publicwhip/cons/335
10 http://www.wikidata.org/entity/Q988533  uk.org.publicwhip/cons/541
# ℹ 1,640 more rows

Conclusions

In this post, I’ve shown

how to access Wikidata from R
how to access the data underlying TheyWorkForYou
how to link these two data sources using the reclin2 package

I encourage you to read up more on the reclin2 package and explore the possibilities that Wikidata offers. This blog post emerged from experimenting with record linkage trying to link legislators directly, and realizing that “blocking” on constituency was a necessary first step. Hopefully it’s also a useful first step for you.

The packages used

Querying Wikidata

Working with They Work For You data

Record linkage with reclin2

Conclusions

Record linkage with `reclin2`