coal_id <- "LP; NP"Introduction
In the previous post, I created a function which could generate all \(2^N-1\) possible coalitions of \(N\) parties. I placed particular emphasis on making that function as fast as possible. This matters because typically we’ll be analyzing hundreds of government formation opportunities, and even code that is relatively quick for a single government formation opportunity (say, less than half a second) can become a bottleneck when dealings with hundreds of such opportunities.
One of the most time-consuming stages in creating all possible coalitions was creating the coalition identifier – a sequence of party codes, separated by a colon. Is there a way to make this stage faster?
List columns
One option is to not to store the coalition identifier as a character vector of length one, but as a character vector with length equal to the number of parties. That is, instead of storing the coalition identifier for the two-party coalition involving both Labour and National as
we can store it as
coal_id <- c("LP", "NP")This seems natural when dealing with a single coalition, but becomes more problematic when we must store information on multiple coalitions in a data frame. One key principle of tidy data is that every cell in the data frame should correspond to a single value. We respected that principle previously because we constructed a single value by concatenating different strings. Here, we’re breaking that principle by using a ragged array or (to use terminology more specific to R) a list-column.
List columns are supported in ‘base’ R. Here’s an example.
dat <- data.frame(iso3c = rep("NZL", 3),
election_date = as.Date("2022-11-19"))
listcol <- list("LP", "NP", c("LP", "NP"))
dat$listcol <- listcolWhen we inspect the data it all looks normal:
dat iso3c election_date listcol
1 NZL 2022-11-19 LP
2 NZL 2022-11-19 NP
3 NZL 2022-11-19 LP, NP
We have to peek in to the internals to see how listcol is represented.
str(dat)'data.frame': 3 obs. of 3 variables:
$ iso3c : chr "NZL" "NZL" "NZL"
$ election_date: Date, format: "2022-11-19" "2022-11-19" ...
$ listcol :List of 3
..$ : chr "LP"
..$ : chr "NP"
..$ : chr "LP" "NP"
However many operations which work fine for data frames break when working with list-columns. It’s not, for example, possible to merge two data frames on a list-column.
aux <- data.frame(numpar = c(1, 1, 2))
aux$listcol <- listcol
comb <- merge(dat, aux, by = "listcol", all = TRUE)Error in sort.list(bx[m$xi]): 'x' must be atomic for 'sort.list', method "shell" and "quick"
Have you called 'sort' on a list?
I’ve also had to construct data frames with a list-column quite carefully, using the $ notation. The following will not work:
aux <- data.frame(numpar = c(1, 1, 2),
listcol = listcol)Error in data.frame(numpar = c(1, 1, 2), listcol = listcol): arguments imply differing number of rows: 3, 2
and we have to use very obscure R commands to tell R that we do in fact wish to construct a data frame where one column is a list.
One solution is to use tibbles, which are a subclass of data frames. R for Data Science describes tibbles as:
“opinionated data frames that make working in the tidyverse a little easier”
and certainly they make working with list-columns much easier. We can construct a tibble containing a list column by replacing our earlier call to data.frame with a call to tibble:
require(tibble)Loading required package: tibble
aux <- tibble(numpar = c(1, 1, 2),
listcol = listcol)which gives the following result:
aux# A tibble: 3 × 2
numpar listcol
<dbl> <list>
1 1 <chr [1]>
2 1 <chr [1]>
3 2 <chr [2]>
Revising our coalition building function
How does this help us with our function to create coalitions? Let’s dust down the function used previously:
make_coals <- function(df,
party_name_var = "party_name_short",
key_var = "election_id") {
if (!inherits(df, "data.frame")) {
stop("`df` must be a data frame")
}
### Unclass if it's a tibble, otherwise df[,var] won't work
if (inherits(df, "tbl")) {
df <- as.data.frame(df)
}
party_names <- df[, party_name_var]
key <- df[, key_var]
key <- unique(key)
if (length(key) != 1) {
stop("Variable in `key_var` must have one unique value")
}
### List where each list entry records participation (1) or
### non-participation (0)
plist <- lapply(party_names, function(x) c(0, 1))
### Expand the grid
coals <- expand.grid(plist)
### Remove the "no-party" entry, which is always the first
coals <- coals[-1,]
colnames(coals) <- party_names
coal_id <- apply(coals, 1, function(x){
coal_parties <- party_names[which(x == 1)]
paste(sort(coal_parties), collapse = "; ")
})
retval <- data.frame(coal_id = coal_id,
key = key)
### Amend the last name
### use double brackets b/c we're only picking out a single value
names(retval)[[ncol(retval)]] <- key_var
retval
}and compare it to the following revised function declaration:
make_coals_v2 <- function(df,
party_name_var = "party_name_short",
key_var = "election_id") {
require(tibble)
party_names <- df |> pull({{party_name_var}})
key <- df |> pull({{key_var}})
key <- unique(key)
if (length(key) != 1) {
stop("Variable in `key_var` must have one unique value")
}
### List where each list entry records participation (1) or
### non-participation (0)
plist <- lapply(party_names, function(x) c(0, 1))
### Expand the grid
coals <- expand.grid(plist)
### Remove the "no-party" entry, which is always the first
coals <- coals[-1,]
colnames(coals) <- party_names
coal_id <- apply(coals, 1, function(x){
party_names[which(x == 1)]
})
retval <- tibble(coal_id = coal_id,
key = key)
### Amend the last name
### use double brackets b/c we're only picking out a single value
retval
}Note how I’ve used the curly-curly operator ({{) to interpolate the name of a variable, and supplied this as an argument to pull. You can think of pull as the tidyverse’s slightly longer-winded equivalent of the dollar sign notation, $.
I use pull because if I try to access a variable programatically the way I did before, it give me a single column tibble, which is usually not what I want:
var <- "numpar"
aux[,var]# A tibble: 3 × 1
numpar
<dbl>
1 1
2 1
3 2
We can test out these two functions with our examples. As before, the code to load all the data is hidden.
Show the code
library(dplyr)
library(Amelia)
pg_elex <- read.csv("data/view_election.csv")
pg_party <- read.csv("data/view_party.csv")
### Restrict to national parliament elections
pg_elex <- pg_elex |>
subset(election_type == "parliament")
### Date handling
pg_elex <- pg_elex |>
transform(election_date = as.Date(election_date))
### Add on the party family
pg_elex <- merge(pg_elex,
pg_party |>
subset(select = c(party_id, family_name)) |>
unique(),
by = c("party_id"),
all.x = TRUE,
all.y = FALSE)
### Add on lagged vote and seat share
pg_elex <- pg_elex |>
transform(vote_share = vote_share / 100,
seat_share = seats / seats_total)
pg_elex <- merge(pg_elex,
pg_elex |> subset(select = c(election_id, party_id, vote_share, seat_share)),
by.x = c("previous_parliament_election_id", "party_id"),
by.y = c("election_id", "party_id"),
suffixes = c("", ".l"),
all.x = TRUE,
all.y = FALSE)
### Remove the very first election for each country
pg_elex <- pg_elex |>
dplyr::group_by(country_name_short) |>
dplyr::filter(election_date > min(election_date, na.rm = TRUE))
### Coalesce the remaining vote and seat shares to zero
pg_elex <- pg_elex |>
transform(vote_share.l = dplyr::coalesce(vote_share.l, 0.0),
seat_share.l = dplyr::coalesce(seat_share.l, 0.0))
###
### Restrict to post-1945 elections
pg_elex <- pg_elex |>
subset(election_date > as.Date("1945-01-01"))
### Calculate the sum of vote and seat shares
pg_elex <- pg_elex |>
dplyr::group_by(election_id) |>
dplyr::mutate(sum_seat_sh = sum(seat_share, na.rm = TRUE),
sum_vote_sh = sum(vote_share, na.rm = TRUE))
### Remove cases where the sum of seat shares is greater than one
### At present, this is the 2022 Australian election
eg <- pg_elex |>
subset(sum_seat_sh > 1.01,
select = c(election_date, country_name_short, party_name_english, seat_share))
pg_elex <- pg_elex |>
subset(sum_seat_sh == 1)
### Set missing seats to zero
pg_elex <- pg_elex |>
transform(seat_share = dplyr::coalesce(seat_share, 0.0),
seats = dplyr::coalesce(seats, 0L))
### In-fill missing vote share with the smaller value of 1 percent or underflow / nMissing
pg_elex <- pg_elex |>
dplyr::group_by(election_id) |>
dplyr::mutate(infill = 1 - unique(sum_vote_sh),
nMiss = sum(is.na(vote_share)),
nMiss = ifelse(nMiss == 0, 1, nMiss),
replacement = min(c(1, unique(infill) / nMiss)),
vote_share = dplyr::coalesce(vote_share, replacement))
### Vote share may still not sum to one, but there will be no missing values
### check <- sum(is.na(pg_elex$vote_share))
### Remove some variables we no longer need
pg_elex <- pg_elex |>
subset(select = c(previous_parliament_election_id, election_id, election_date,
country_name_short, country_name,
party_id, party_name, party_name_english, party_name_short,
left_right, family_name,
seats, seats_total, seat_share, seat_share.l,
vote_share, vote_share.l))
### Coerce to data frame (not tibble)
### Helpful for Amelia
pg_elex <- as.data.frame(pg_elex)
### Single-shot imputation of left-right
a.out <- amelia(pg_elex, m = 1,
idvars = c("previous_parliament_election_id", "election_id", "election_date",
"country_name_short", "country_name",
"party_id", "party_name", "party_name_english", "party_name_short"),
noms = c("family_name"))
pg_elex <- a.out$imputations[[1]]
pg_elex <- pg_elex |>
subset(party_name != "one seat") |>
subset(party_name != "no seat") |>
subset(party_name_english != "no seat") |>
subset(party_name != "no party affiliation") |>
subset(party_name_english != "no party affiliation") |>
subset(seats > 0)### 1975 New Zealand election -- two parties
small_eg_df <- subset(pg_elex, election_id == 459)
### 2005 Norwegian election - seven parties
medium_eg_df <- subset(pg_elex, election_id == 3)
### 1994 Italian election - eighteen parties
large_eg_df <- subset(pg_elex, election_id == 146)make_coals(small_eg_df) coal_id election_id
2 LP 459
3 NP 459
4 LP; NP 459
make_coals_v2(small_eg_df)# A tibble: 3 × 2
coal_id key
<named list> <int>
1 <chr [1]> 459
2 <chr [1]> 459
3 <chr [2]> 459
If we want to see the coalition identifiers, we can inspect them:
make_coals_v2(small_eg_df) |>
pull(coal_id)$`2`
[1] "LP"
$`3`
[1] "NP"
$`4`
[1] "LP" "NP"
Speed-ups
Does this help us with generating our possible coalitions over a plausibly sized data-set? Let’s find out…
split_data <- split(pg_elex, pg_elex$election_id)
t1 <- system.time(res <- lapply(split_data, make_coals))
print(t1) user system elapsed
121.487 0.255 121.766
t2 <- system.time(res <- lapply(split_data, make_coals_v2))
print(t2) user system elapsed
24.171 0.244 24.420
The speed gains are quite considerable. This is a good example of when tibbles can be used to great effect. Using tibbles in your code does have drawbacks – subsetting doesn’t always work the way you think it should, or the way you are used to – but when the gains are this considerable, these drawbacks are worth it.