Introduction

In the previous post, I created a function which could generate all $2^N-1$ possible coalitions of $N$ parties. I placed particular emphasis on making that function as fast as possible. This matters because typically we’ll be analyzing hundreds of government formation opportunities, and even code that is relatively quick for a single government formation opportunity (say, less than half a second) can become a bottleneck when dealings with hundreds of such opportunities.

One of the most time-consuming stages in creating all possible coalitions was creating the coalition identifier – a sequence of party codes, separated by a colon. Is there a way to make this stage faster?

List columns

One option is to not to store the coalition identifier as a character vector of length one, but as a character vector with length equal to the number of parties. That is, instead of storing the coalition identifier for the two-party coalition involving both Labour and National as

coal_id <- "LP; NP"

we can store it as

coal_id <- c("LP", "NP")

This seems natural when dealing with a single coalition, but becomes more problematic when we must store information on multiple coalitions in a data frame. One key principle of tidy data is that every cell in the data frame should correspond to a single value. We respected that principle previously because we constructed a single value by concatenating different strings. Here, we’re breaking that principle by using a ragged array or (to use terminology more specific to R) a list-column.

List columns are supported in ‘base’ R. Here’s an example.

dat <- data.frame(iso3c = rep("NZL", 3),
                  election_date = as.Date("2022-11-19"))

listcol <- list("LP", "NP", c("LP", "NP"))

dat$listcol <- listcol

When we inspect the data it all looks normal:

dat

  iso3c election_date listcol
1   NZL    2022-11-19      LP
2   NZL    2022-11-19      NP
3   NZL    2022-11-19  LP, NP

We have to peek in to the internals to see how listcol is represented.

str(dat)

'data.frame':   3 obs. of  3 variables:
 $ iso3c        : chr  "NZL" "NZL" "NZL"
 $ election_date: Date, format: "2022-11-19" "2022-11-19" ...
 $ listcol      :List of 3
  ..$ : chr "LP"
  ..$ : chr "NP"
  ..$ : chr  "LP" "NP"

However many operations which work fine for data frames break when working with list-columns. It’s not, for example, possible to merge two data frames on a list-column.

aux <- data.frame(numpar = c(1, 1, 2))
aux$listcol <- listcol

comb <- merge(dat, aux, by = "listcol", all = TRUE)

Error in sort.list(bx[m$xi]): 'x' must be atomic for 'sort.list', method "shell" and "quick"
Have you called 'sort' on a list?

I’ve also had to construct data frames with a list-column quite carefully, using the $ notation. The following will not work:

aux <- data.frame(numpar = c(1, 1, 2),
                  listcol = listcol)

Error in data.frame(numpar = c(1, 1, 2), listcol = listcol): arguments imply differing number of rows: 3, 2

and we have to use very obscure R commands to tell R that we do in fact wish to construct a data frame where one column is a list.

One solution is to use tibbles, which are a subclass of data frames. R for Data Science describes tibbles as:

“opinionated data frames that make working in the tidyverse a little easier”

and certainly they make working with list-columns much easier. We can construct a tibble containing a list column by replacing our earlier call to data.frame with a call to tibble:

require(tibble)

Loading required package: tibble

aux <- tibble(numpar = c(1, 1, 2),
                  listcol = listcol)

which gives the following result:

aux

# A tibble: 3 × 2
  numpar listcol  
   <dbl> <list>   
1      1 <chr [1]>
2      1 <chr [1]>
3      2 <chr [2]>

Revising our coalition building function

How does this help us with our function to create coalitions? Let’s dust down the function used previously:

make_coals <- function(df,
                       party_name_var = "party_name_short",
                       key_var = "election_id") {
    if (!inherits(df, "data.frame")) {
        stop("`df` must be a data frame")
    }
    
### Unclass if it's a tibble, otherwise df[,var] won't work
    if (inherits(df, "tbl")) {
        df <- as.data.frame(df)
    }
    
    party_names <- df[, party_name_var]
    key <- df[, key_var]

    key <- unique(key)

    if (length(key) != 1) {
        stop("Variable in `key_var` must have one unique value")
    }
    
### List where each list entry records participation (1) or
### non-participation (0)
    plist <- lapply(party_names, function(x) c(0, 1))
### Expand the grid
    coals <- expand.grid(plist)
### Remove the "no-party" entry, which is always the first
    coals <- coals[-1,]
    
    colnames(coals) <- party_names

    coal_id <- apply(coals, 1, function(x){
        coal_parties <- party_names[which(x == 1)]
        paste(sort(coal_parties), collapse = "; ")
    })
    retval <- data.frame(coal_id = coal_id,
                         key = key)

### Amend the last name
    ### use double brackets b/c we're only picking out a single value
    names(retval)[[ncol(retval)]] <- key_var

    retval
}

and compare it to the following revised function declaration:

make_coals_v2 <- function(df,
                       party_name_var = "party_name_short",
                       key_var = "election_id") {

    require(tibble)
    party_names <- df |> pull({{party_name_var}})
    key <- df |> pull({{key_var}})
    key <- unique(key)

    if (length(key) != 1) {
        stop("Variable in `key_var` must have one unique value")
    }
    
### List where each list entry records participation (1) or
### non-participation (0)
    plist <- lapply(party_names, function(x) c(0, 1))
### Expand the grid
    coals <- expand.grid(plist)
### Remove the "no-party" entry, which is always the first
    coals <- coals[-1,]
    colnames(coals) <- party_names

    coal_id <- apply(coals, 1, function(x){
        party_names[which(x == 1)]
    })
    retval <- tibble(coal_id = coal_id,
                         key = key)

### Amend the last name
    ### use double brackets b/c we're only picking out a single value
    retval
}

Note how I’ve used the curly-curly operator ({{) to interpolate the name of a variable, and supplied this as an argument to pull. You can think of pull as the tidyverse’s slightly longer-winded equivalent of the dollar sign notation, $.

I use pull because if I try to access a variable programatically the way I did before, it give me a single column tibble, which is usually not what I want:

var <- "numpar"
aux[,var]

# A tibble: 3 × 1
  numpar
   <dbl>
1      1
2      1
3      2

We can test out these two functions with our examples. As before, the code to load all the data is hidden.

Show the code

library(dplyr)
library(Amelia)
pg_elex <- read.csv("data/view_election.csv")
pg_party <- read.csv("data/view_party.csv")

### Restrict to national parliament elections
pg_elex <- pg_elex |>
    subset(election_type == "parliament")

### Date handling
pg_elex <- pg_elex |>
    transform(election_date = as.Date(election_date))

### Add on the party family 
pg_elex <- merge(pg_elex,
                 pg_party |>
                 subset(select = c(party_id, family_name)) |>
                 unique(),
                 by = c("party_id"),
                 all.x = TRUE,
                 all.y = FALSE)

### Add on lagged vote and seat share
pg_elex <- pg_elex |>
    transform(vote_share = vote_share / 100,
              seat_share = seats / seats_total)

pg_elex <- merge(pg_elex,
                 pg_elex |> subset(select = c(election_id, party_id, vote_share, seat_share)),
                 by.x = c("previous_parliament_election_id", "party_id"),
                 by.y = c("election_id", "party_id"),
                 suffixes = c("", ".l"),
                 all.x = TRUE,
                 all.y = FALSE)

### Remove the very first election for each country
pg_elex <- pg_elex |>
    dplyr::group_by(country_name_short) |>
    dplyr::filter(election_date > min(election_date, na.rm = TRUE))

### Coalesce the remaining vote and seat shares to zero
pg_elex <- pg_elex |>
    transform(vote_share.l = dplyr::coalesce(vote_share.l, 0.0),
              seat_share.l = dplyr::coalesce(seat_share.l, 0.0))

### 
### Restrict to post-1945 elections
pg_elex <- pg_elex |>
    subset(election_date > as.Date("1945-01-01"))

### Calculate the sum of vote and seat shares
pg_elex <- pg_elex |>
    dplyr::group_by(election_id) |>
    dplyr::mutate(sum_seat_sh = sum(seat_share, na.rm = TRUE),
                  sum_vote_sh = sum(vote_share, na.rm = TRUE))

### Remove cases where the sum of seat shares is greater than one
### At present, this is the 2022 Australian election
eg <- pg_elex |>
    subset(sum_seat_sh > 1.01,
           select = c(election_date, country_name_short, party_name_english, seat_share))

pg_elex <- pg_elex |>
    subset(sum_seat_sh == 1)

### Set missing seats to zero
pg_elex <- pg_elex |>
    transform(seat_share = dplyr::coalesce(seat_share, 0.0),
           seats = dplyr::coalesce(seats, 0L))

### In-fill missing vote share with the smaller value of 1 percent or underflow / nMissing
pg_elex <- pg_elex |>
    dplyr::group_by(election_id) |>
    dplyr::mutate(infill = 1 - unique(sum_vote_sh),
              nMiss = sum(is.na(vote_share)),
              nMiss = ifelse(nMiss == 0, 1, nMiss),
              replacement = min(c(1, unique(infill) / nMiss)),
              vote_share = dplyr::coalesce(vote_share, replacement))

### Vote share may still not sum to one, but there will be no missing values
### check <- sum(is.na(pg_elex$vote_share))

### Remove some variables we no longer need
pg_elex <- pg_elex |>
    subset(select = c(previous_parliament_election_id, election_id, election_date,
                      country_name_short, country_name,
                      party_id, party_name, party_name_english, party_name_short,
                      left_right, family_name,
                      seats, seats_total, seat_share, seat_share.l,
                      vote_share, vote_share.l))

### Coerce to data frame (not tibble)
### Helpful for Amelia
pg_elex <- as.data.frame(pg_elex)

### Single-shot imputation of left-right
a.out <- amelia(pg_elex, m = 1,
       idvars = c("previous_parliament_election_id", "election_id", "election_date",
                  "country_name_short", "country_name",
                  "party_id", "party_name", "party_name_english", "party_name_short"),
       noms = c("family_name"))
       
pg_elex <- a.out$imputations[[1]]

pg_elex <- pg_elex |>
    subset(party_name != "one seat") |>
    subset(party_name != "no seat") |>
    subset(party_name_english != "no seat") |>
    subset(party_name != "no party affiliation") |>
    subset(party_name_english != "no party affiliation") |>
    subset(seats > 0)

### 1975 New Zealand election -- two parties
small_eg_df <- subset(pg_elex, election_id == 459)
### 2005 Norwegian election - seven parties
medium_eg_df <- subset(pg_elex, election_id == 3)
### 1994 Italian election - eighteen parties
large_eg_df <- subset(pg_elex, election_id == 146)

make_coals(small_eg_df)

  coal_id election_id
2      LP         459
3      NP         459
4  LP; NP         459

make_coals_v2(small_eg_df)

# A tibble: 3 × 2
  coal_id        key
  <named list> <int>
1 <chr [1]>      459
2 <chr [1]>      459
3 <chr [2]>      459

If we want to see the coalition identifiers, we can inspect them:

make_coals_v2(small_eg_df) |>
    pull(coal_id)

$`2`
[1] "LP"

$`3`
[1] "NP"

$`4`
[1] "LP" "NP"

Speed-ups

Does this help us with generating our possible coalitions over a plausibly sized data-set? Let’s find out…

split_data <- split(pg_elex, pg_elex$election_id)
t1 <- system.time(res <- lapply(split_data, make_coals))
print(t1)

   user  system elapsed 
121.487   0.255 121.766

t2 <- system.time(res <- lapply(split_data, make_coals_v2))
print(t2)

   user  system elapsed 
 24.171   0.244  24.420

The speed gains are quite considerable. This is a good example of when tibbles can be used to great effect. Using tibbles in your code does have drawbacks – subsetting doesn’t always work the way you think it should, or the way you are used to – but when the gains are this considerable, these drawbacks are worth it.