library(tidyverse)
library(jsonlite)
library(reclin2)
library(WikidataQueryServiceR)
library(parallel)
library(glue)
tl;dr: R code to produce a lookup table connecting They Work For You constituency codes to Wikidata entities, and the lookup table itself.
Observational social scientists often work with named entities which have no canonical name. Examples include politicians and places.
Sometimes the different names given to these entities differ quite fundamentally. The names “Western Isles” and “Na h-Eileanan an Iar” pick out the same constituency, but there is no way of knowing this unless you know both English and Gaelic.
Most of the time, the different names given to these entities differ only in minor respects. The two strings “St. Helen’s” and “St Helens” pick out the same constituency, but differences in punctuation mean that these two strings don’t match exactly, and can’t therefore be used to join different sources of data.
Although we could remove punctuation to deal with the St Helens of this world, rule-based transformations like stripping punctuation won’t work all the time. Someone not familiar with British politics might be able to work out that “Tam Dalyell” and “Thomas Dalyell” are the same person, but there’s no rule that enables us to move between these two different representations.
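As a rough illustration of why rule-based cleaning only gets us part of the way, here is a minimal sketch in base R. The helper normalise_name is a hypothetical name introduced for this example, not part of any package; it lower-cases a string and strips ASCII punctuation.

```r
# Hypothetical helper: lower-case a name and strip ASCII punctuation.
normalise_name <- function(x) {
  tolower(gsub("[[:punct:]]", "", x))
}

# Punctuation differences disappear after normalisation...
st_helens_match <- normalise_name("St. Helen's") == normalise_name("St Helens")

# ...but no amount of punctuation stripping turns "Tam" into "Thomas".
dalyell_match <- normalise_name("Tam Dalyell") == normalise_name("Thomas Dalyell")

print(c(st_helens = st_helens_match, dalyell = dalyell_match))
```

The first comparison succeeds and the second fails, which is the gap that probabilistic record linkage is meant to fill.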
The general name for the problem I’m talking about is record linkage. In this post, I’ll give R code to link two data sets which record information about parliamentary constituencies:
- They Work For You, a website which makes information from parliamentary websites easier to use
- Wikidata, the structured knowledge base that acts as a companion to Wikipedia
They Work For You provides good information about parliamentary divisions, and records structured information about constituencies. Wikidata contains rich biographical information about parliamentarians (and most things). It’s therefore helpful to connect these two sources of data.
Although most of the time we’ll be linking data-sets regarding politicians, linking constituencies is an important first step, since it allows us to block exactly on constituency, making our matches much more accurate, and allowing us to distinguish between legislators with identical names but who represent different constituencies.
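To make the idea of blocking concrete, here is a small sketch in base R. All of the names, constituencies and identifiers are invented for illustration. It compares joining on name alone with joining on name within constituency.

```r
# Toy data: the names, constituencies and ids below are invented.
source_a <- data.frame(
  name = c("John Smith", "John Smith"),
  constituency = c("Northtown", "Southtown"),
  id_a = c("a1", "a2")
)
source_b <- data.frame(
  name = c("John Smith", "John Smith"),
  constituency = c("Northtown", "Southtown"),
  id_b = c("b1", "b2")
)

# Joining on name alone pairs every John Smith with every other John Smith.
by_name <- merge(source_a, source_b, by = "name")

# Blocking on constituency keeps only pairs from the same seat.
blocked <- merge(source_a, source_b, by = c("name", "constituency"))

print(nrow(by_name))
print(nrow(blocked))
```

Without blocking there are four candidate pairs, two of them wrong; with blocking only the two correct pairs remain. This only works if the constituency identifiers agree across sources, which is what the rest of this post sets up.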
The packages used
I’ll be using six packages in total. I’ll use the tidyverse, as I do with most projects. I’ll be using the jsonlite package, because some of the data from TheyWorkForYou is in JSON format. The record linkage itself is done in reclin2, a recent rewrite of the reclin package which I’ve been using for the past year or so. Finally, I’ll be querying Wikidata using the WikidataQueryServiceR package. The remaining two packages, parallel and glue, are used only incidentally.
Because record linkage can be computationally expensive (comparing 650 constituency names to 650 other constituency names is a lot of comparisons!), I tell reclin2 to use all but one of my cores.
num_cores <- parallel::detectCores() - 1
setDTthreads(threads = num_cores)
Querying Wikidata
If you don’t already know about Wikidata, you should read up on it. Wikidata contains information about many different kinds of entities and their relationships.
Wikidata can be queried using SPARQL, a language not too dissimilar from other query languages used for databases. If you want to see some examples of SPARQL queries for British politicians, there is a truly excellent page of examples.
Wikidata can be difficult to get to grips with because you often need to start from a particular named entity or property, and finding those can be difficult. I’m going to start from a list of entities relating to membership in different parliaments. Here’s my starting point, written out as a tribble.
terms <- tribble(~start_date, ~wikidata_member_id,
"2019-12-12","Q77685926",
"2017-06-08","Q30524710",
"2015-05-07","Q30524718",
"2010-05-06","Q35494253",
"2005-05-05","Q35647955",
"2001-06-07","Q35921591",
"1997-05-01","Q36634044",
"1992-04-09","Q41582627",
"1987-06-11","Q41582624",
"1983-06-09","Q41582621",
"1979-05-03","Q41582619",
"1974-10-10","Q41582617",
"1974-02-28","Q41582615",
"1970-06-18","Q41582612",
"1966-03-31","Q41582609",
"1964-10-15","Q41582608",
"1959-10-08","Q41582606",
"1955-05-26","Q41582604",
"1951-10-25","Q41582603",
"1950-02-23","Q41582600",
"1945-07-05","Q41582597")
This tribble contains the start date of each parliamentary term in the post-war period, and a Wikidata identifier for membership of that parliament. The identifier Q30524710 stands for “being a member of the 57th Parliament of the United Kingdom”, and we can learn more about it by prefixing it with https://www.wikidata.org/wiki/. The resulting page is human readable, and tells us a bit more about how this entity relates to others, most obviously the dates of this term.
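If you want to build these page addresses for every row rather than by hand, a one-line helper will do. This is only a sketch; wikidata_page is a hypothetical name of my own.

```r
# Hypothetical helper: build the human-readable Wikidata page URL
# for an identifier such as "Q30524710".
wikidata_page <- function(id) {
  paste0("https://www.wikidata.org/wiki/", id)
}

url_2017 <- wikidata_page("Q30524710")
print(url_2017)
```

Because paste0 is vectorised, the same helper can be applied to a whole column of identifiers at once.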
Because we’ll be looking for members elected between the start of each term and the beginning of the following term, I’ll do some date handling, including a purely notional end date for the current parliament.
terms <- terms |>
  mutate(start_date = as.Date(start_date),
         end_date = lag(start_date) - 1,
         end_date = coalesce(end_date, as.Date("2099-12-31")))
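The date logic relies on the rows being sorted from the most recent term to the oldest, so that the "previous" row holds the following term. Here is the same calculation written out in base R for the three most recent terms, as a sketch of what lag and coalesce are doing; the variable next_start is a name of my own.

```r
start_date <- as.Date(c("2019-12-12", "2017-06-08", "2015-05-07"))

# Base R stand-in for dplyr's lag(): shift the vector down by one position,
# so each row sees the start date of the term that followed it.
next_start <- c(as.Date(NA), head(start_date, -1))

# A term ends the day before the following term starts...
end_date <- next_start - 1

# ...and the current term gets a purely notional end date.
end_date[is.na(end_date)] <- as.Date("2099-12-31")

print(data.frame(start_date, end_date))
```

If the tribble were sorted oldest first, lead would be needed instead of lag.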
I’m now going to construct a SPARQL query based on this set of identifiers. Here’s what the query looks like, with a placeholder session:
SELECT DISTINCT ?constituency ?constituencyLabel {
  ?item p:P39 ?positionStatement .
  ?positionStatement ps:P39 wd:<<session>> .
  ?positionStatement pq:P768 ?constituency .
  SERVICE wikibase:label { bd:serviceParam wikibase:language 'en' }
}
Here’s an English translation of that query:
“start with those items which hold a position (have property P39), and store the position held as positionStatement; keep those positions which are positions in the named parliamentary session, and store the property qualifier electoral district (P768) as constituency; use the labelling service to construct the additional variable constituencyLabel”
This is a rather indirect query, because it starts with “items which hold positions”. In writing this blog-post, I’ve found it easier to start from legislators and abstract to the constituencies they represent, but that might not work for all cases.
You can try this query at the public WikiData SPARQL endpoint for the 2017-2019 parliament. Here are the first five rows of the result:
constituency | constituencyLabel |
---|---|
wd:Q750994 | Argyll and Bute |
wd:Q875403 | Streatham |
wd:Q751233 | Ashton-under-Lyne |
wd:Q1077604 | Lincoln |
wd:Q874276 | Windsor |
We could use the public endpoint, but we want to store these lists programmatically. Here, I write a function which takes a session identifier (say, Q30524710) and returns a data frame. I then map over the different values of wikidata_member_id and save the results in a new data-frame column called wikidata.
wd_consts <- function(session) {
  query_template <- "
SELECT DISTINCT ?constituency ?constituencyLabel {
  ?item p:P39 ?positionStatement .
  ?positionStatement ps:P39 wd:<<session>> .
  ?positionStatement pq:P768 ?constituency .
  SERVICE wikibase:label { bd:serviceParam wikibase:language 'en' }
}
"
  ### Fill in the <<session>> placeholder
  query <- glue::glue(query_template,
                      session = session,
                      .open = "<<",
                      .close = ">>")
  res <- query_wikidata(query)
  res
}
terms <- terms |>
  mutate(wikidata = map(wikidata_member_id, wd_consts))
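If you want to inspect the query text for a given session without sending anything to Wikidata, the substitution step can be reproduced on its own. The sketch below uses base R's gsub rather than glue, and fill_session is a hypothetical helper name; it is meant only to show what the filled-in template looks like.

```r
query_template <- "
SELECT DISTINCT ?constituency ?constituencyLabel {
  ?item p:P39 ?positionStatement .
  ?positionStatement ps:P39 wd:<<session>> .
  ?positionStatement pq:P768 ?constituency .
  SERVICE wikibase:label { bd:serviceParam wikibase:language 'en' }
}
"

# Hypothetical helper: replace the <<session>> placeholder with an identifier.
# fixed = TRUE treats the placeholder as literal text, not a regular expression.
fill_session <- function(session, template = query_template) {
  gsub("<<session>>", session, template, fixed = TRUE)
}

query_2017 <- fill_session("Q30524710")
cat(query_2017)
```

The unusual << and >> delimiters are there because SPARQL itself uses curly braces, which are glue's default delimiters.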
Working with They Work For You data
They Work For You is an excellent site which has, together with the related project Public Whip, done an incredible amount to familiarize people with how MPs vote in the House of Commons.
Both They Work For You and Public Whip are powered by the parlparse project, which can be found on GitHub. I love that all the underlying data is freely available, but I don’t always like the format it’s stored in. For our purposes, we’re interested in the (monster) JSON file, people.json. Assuming you’ve downloaded this file to your working directory, you can read it in using the following code:
library(jsonlite)
ppl <- fromJSON("./people.json")
Although jsonlite’s fromJSON command will try to simplify to a data frame as much as possible, the ppl object is a list of four different data frames:
memberships
organizations
persons
posts
Memberships are different from persons. Diane Abbott is a person, who has the person ID uk.org.publicwhip/person/10001. Diane Abbott is associated with multiple different memberships. For example, membership uk.org.publicwhip/member/2069 is Diane Abbott’s membership in the Commons between 1987 and 1992, representing Hackney North and Stoke Newington. The memberships data frame contains a variable, post_id, which corresponds to the id field in the posts data frame.
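The relationship between memberships and posts can be sketched with two one-row toy data frames. The member and person identifiers are the ones quoted above; the post identifier uk.org.publicwhip/cons/999 is made up for the example, and I am assuming the person link is stored in a column called person_id.

```r
# Toy versions of two of the data frames. The post id ending in 999 is
# invented, and the person_id column name is an assumption.
memberships <- data.frame(
  id = "uk.org.publicwhip/member/2069",
  person_id = "uk.org.publicwhip/person/10001",
  post_id = "uk.org.publicwhip/cons/999"
)
posts <- data.frame(
  id = "uk.org.publicwhip/cons/999",
  name = "Hackney North and Stoke Newington"
)

# memberships$post_id points at posts$id, so that is the join key.
joined <- merge(memberships, posts, by.x = "post_id", by.y = "id")
print(joined)
```

After the merge, the remaining id column is the membership identifier, because the posts identifier has been absorbed into post_id.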
Navigating between these different data frames and identifiers can sometimes be difficult, especially where some columns are actually data frame columns. The following code splits up the JSON file into these separate elements, and makes sure that all the variables we need are addressable using the dollar sign.
### Split the JSON up
### (the element is called memberships; ppl$members only works through partial matching)
members <- ppl$memberships
persons <- ppl$persons
posts <- ppl$posts

### Handle awkward details in posts
posts <- data.frame(name = posts$area$name,
                    id = posts$id,
                    start_date = posts$start_date,
                    end_date = posts$end_date)
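If data frame columns are new to you, the following base R sketch builds a tiny data frame with a nested area column, similar in shape to what fromJSON produces for nested JSON objects, and then flattens it in the same way as above. The identifiers and names are invented.

```r
# A toy posts data frame with a nested data frame column. Ids and names
# are invented for illustration.
posts_raw <- data.frame(id = c("cons/1", "cons/2"))
posts_raw$area <- data.frame(name = c("Northtown", "Southtown"))

# posts_raw has two columns, one of which is itself a data frame,
# so the constituency name lives at posts_raw$area$name.
print(names(posts_raw))
print(is.data.frame(posts_raw$area))

# Flatten to an ordinary data frame with one atomic vector per column.
posts_flat <- data.frame(name = posts_raw$area$name,
                         id = posts_raw$id)
print(posts_flat)
```

Once flattened, the name column can be used directly in select, distinct and join calls.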
I’ll use the same data frame I used before (terms), and map over the start and end dates in order to store the constituencies used for each parliamentary term. As before, you might find the logic a bit circuitous, because I first identify the members who served in this period before finding out the constituencies they represented.
twfy_consts <- function(begin, end, members, posts) {
  ### type checking
  begin <- as.Date(begin)
  end <- as.Date(end)
  tmp <- members |>
    mutate(start_date = as.Date(start_date),
           end_date = as.Date(end_date))
  ### Just get Commons seats
  tmp <- tmp |>
    filter(start_reason == "general_election") |>
    filter(grepl("cons", post_id)) |>
    filter(!grepl("-NI$", post_id))
  ### The start date has to be equal to or after the `begin` argument
  tmp <- tmp |>
    filter(start_date >= begin)
  ### Deal with missing end dates by imputing a notional end date way in the future
  tmp <- tmp |>
    mutate(end_date = coalesce(end_date, as.Date("2099-11-30")))
  ### The end date has to be (strictly) less than the `end` argument
  tmp <- tmp |>
    filter(is.na(end_date) | end_date < end)
  ### Pull out the unique values...
  tmp <- tmp |>
    dplyr::select(post_id) |>
    distinct()
  #### ... and merge
  tmp <- left_join(tmp,
                   posts |> dplyr::distinct(name, id, start_date),
                   by = join_by(post_id == id))
}
terms <- terms |>
  mutate(twfydata = map2(start_date, end_date, twfy_consts, members, posts))
If you’re playing along at home, you should have a data frame terms which looks a little bit like this. We have a row for each session, and two special data frame columns which contain the constituencies for that session according to Wikidata (wikidata), and the constituencies for that session according to TheyWorkForYou (twfydata).
# A tibble: 6 × 5
start_date wikidata_member_id end_date wikidata twfydata
<date> <chr> <date> <list> <list>
1 2019-12-12 Q77685926 2099-12-31 <spc_tbl_ [650 × 2]> <df [650 × 3]>
2 2017-06-08 Q30524710 2019-12-11 <spc_tbl_ [650 × 2]> <df [650 × 3]>
3 2015-05-07 Q30524718 2017-06-07 <spc_tbl_ [650 × 2]> <df [650 × 3]>
4 2010-05-06 Q35494253 2015-05-06 <spc_tbl_ [650 × 2]> <df [650 × 3]>
5 2005-05-05 Q35647955 2010-05-05 <spc_tbl_ [646 × 2]> <df [645 × 3]>
6 2001-06-07 Q35921591 2005-05-04 <spc_tbl_ [659 × 2]> <df [659 × 3]>
Record linkage with reclin2
Now that we know what names and codes each project uses to store information for each parliamentary term, we can start linking the two sets of data. This was the only part of my code that I had to use a for loop for.
for (i in seq_len(nrow(terms))) {
  ### Pull out our data frames
  wd <- terms$wikidata[[i]] |>
    dplyr::select(wikidata_id = constituency,
                  name = constituencyLabel)
  twfy <- terms$twfydata[[i]] |>
    dplyr::select(twfy_id = post_id,
                  name)

  ### Link the two data-sets...
  pp <- reclin2::pair(wd, twfy)

  ### Using a single variable, name, comparing them using a function
  ### which is good for strings
  pp <- compare_pairs(pp, on = c("name"),
                      default_comparator = cmp_jarowinkler(0.9))

  ### These two steps create a scoring/`weights` variable
  m <- problink_em(~name, pp)
  pp <- predict(m, pairs = pp, add = TRUE)

  ### Exact one-to-one matching, as specified by the n and m arguments
  pp <- select_n_to_m(pp, variable = "ntom", score = "weights",
                      threshold = 0,
                      n = 1, m = 1)

  ### Get the links
  linked_data_set <- link(pp, selection = "ntom") |>
    dplyr::select(wikidata_id, twfy_id)

  ## Add this on to the wikidata data
  terms$wikidata[[i]] <- left_join(terms$wikidata[[i]],
                                   linked_data_set,
                                   by = join_by(constituency == wikidata_id))
}
This for loop is not particularly quick, because the select_n_to_m matching has to engage in a lot of comparisons rather than matching greedily. Since we’re only doing this once, slow optimal matching is probably better than greedy matching.
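To see why greedy matching can go wrong, here is a toy example in base R. It is not a reproduction of what reclin2 does internally: the scores are invented, greedy_total and optimal_total are hypothetical helpers, and the "optimal" version simply compares the only two one-to-one assignments that exist for a 2 × 2 problem.

```r
# Toy similarity scores between two records in source A (rows) and two in
# source B (columns). The numbers are invented for illustration.
scores <- matrix(c(0.90, 0.85, 0.80, 0.10), nrow = 2,
                 dimnames = list(c("a1", "a2"), c("b1", "b2")))

# Greedy matching: repeatedly take the best remaining pair, then remove
# that row and column from consideration.
greedy_total <- function(s) {
  total <- 0
  while (length(s) > 0) {
    best <- which(s == max(s), arr.ind = TRUE)[1, ]
    total <- total + s[best[["row"]], best[["col"]]]
    s <- s[-best[["row"]], -best[["col"]], drop = FALSE]
  }
  total
}

# Optimal matching for the 2 x 2 case: compare the only two possible
# one-to-one assignments and keep the better one.
optimal_total <- function(s) {
  max(s[1, 1] + s[2, 2], s[1, 2] + s[2, 1])
}

greedy_score <- greedy_total(scores)
optimal_score <- optimal_total(scores)
print(c(greedy = greedy_score, optimal = optimal_score))
```

Greedy matching grabs the a1/b1 pair first and is then forced to accept the very poor a2/b2 pair, for a total of 1.0; pairing a1 with b2 and a2 with b1 gives 1.65. With 650 constituencies the number of possible assignments is far too large to enumerate, which is why the optimal approach takes longer.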
If we just want the lookup, we can extract it. Here’s the code, and what it looks like. You can download it here.
### Just extract the lookup
lu <- terms |>
  dplyr::select(wikidata) |>
  unnest(cols = c(wikidata)) |>
  dplyr::select(wikidata_id = constituency,
                twfy_id = twfy_id)
lu <- lu |>
  distinct()
# A tibble: 1,650 × 2
wikidata_id twfy_id
<chr> <chr>
1 http://www.wikidata.org/entity/Q299984 uk.org.publicwhip/cons/3
2 http://www.wikidata.org/entity/Q613294 uk.org.publicwhip/cons/648
3 http://www.wikidata.org/entity/Q611841 uk.org.publicwhip/cons/191
4 http://www.wikidata.org/entity/Q988356 uk.org.publicwhip/cons/549
5 http://www.wikidata.org/entity/Q751715 uk.org.publicwhip/cons/519
6 http://www.wikidata.org/entity/Q1077604 uk.org.publicwhip/cons/343
7 http://www.wikidata.org/entity/Q874607 uk.org.publicwhip/cons/619
8 http://www.wikidata.org/entity/Q875108 uk.org.publicwhip/cons/601
9 http://www.wikidata.org/entity/Q1080424 uk.org.publicwhip/cons/335
10 http://www.wikidata.org/entity/Q988533 uk.org.publicwhip/cons/541
# ℹ 1,640 more rows
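Note that the lookup stores full entity URIs rather than bare identifiers like Q299984. If the data you want to join to uses the short form, the prefix can be stripped with a fixed-string substitution. This is a small sketch using the first two values shown above; wikidata_qid is a variable name of my own.

```r
wikidata_uri <- c("http://www.wikidata.org/entity/Q299984",
                  "http://www.wikidata.org/entity/Q613294")

# Remove the entity prefix, treating it as literal text rather than a regex.
wikidata_qid <- sub("http://www.wikidata.org/entity/", "", wikidata_uri,
                    fixed = TRUE)
print(wikidata_qid)
```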
Conclusions
In this post, I’ve shown
- how to access Wikidata from R
- how to access the data underlying TheyWorkForYou
- how to link these two data sources using the reclin2 package
I encourage you to read up more on the reclin2 package and explore the possibilities that Wikidata offers. This blog post emerged from experimenting with record linkage trying to link legislators directly, and realizing that “blocking” on constituency was a necessary first step. Hopefully it’s also a useful first step for you.