Coalition formation is an important topic in political science, but coalition data is difficult to work with. When a number of groups come together to form a coalition, they make a single discrete choice out of a large number of possible coalitions. The number of possible coalitions grows exponentially with the number of groups. If we exclude the coalition with no groups, the number of possible coalitions is \(2^N - 1\).
Creating these possible coalitions in your favourite stats package can be complex. In this series of posts, I’ll look at how you can create data on possible coalitions using R.
I’ll pay particular attention to how you can can do this quickly. Whilst there will always be a place for code snippets which run overnight, slow code is code which is difficult to integrate into a quickly changing pipeline.
I’m going to build this coalition data using data from the ParlGov project. I’ve put some of the code I use to tidy the data below. You can expand it so that you can see what I’m doing.
Show the code
library(dplyr)library(Amelia)pg_elex <-read.csv("data/view_election.csv")pg_party <-read.csv("data/view_party.csv")### Restrict to national parliament electionspg_elex <- pg_elex |>subset(election_type =="parliament")### Date handlingpg_elex <- pg_elex |>transform(election_date =as.Date(election_date))### Add on the party family pg_elex <-merge(pg_elex, pg_party |>subset(select =c(party_id, family_name)) |>unique(),by =c("party_id"),all.x =TRUE,all.y =FALSE)### Add on lagged vote and seat sharepg_elex <- pg_elex |>transform(vote_share = vote_share /100,seat_share = seats / seats_total)pg_elex <-merge(pg_elex, pg_elex |>subset(select =c(election_id, party_id, vote_share, seat_share)),by.x =c("previous_parliament_election_id", "party_id"),by.y =c("election_id", "party_id"),suffixes =c("", ".l"),all.x =TRUE,all.y =FALSE)### Remove the very first election for each countrypg_elex <- pg_elex |> dplyr::group_by(country_name_short) |> dplyr::filter(election_date >min(election_date, na.rm =TRUE))### Coalesce the remaining vote and seat shares to zeropg_elex <- pg_elex |>transform(vote_share.l = dplyr::coalesce(vote_share.l, 0.0),seat_share.l = dplyr::coalesce(seat_share.l, 0.0))### ### Restrict to post-1945 electionspg_elex <- pg_elex |>subset(election_date >as.Date("1945-01-01"))### Calculate the sum of vote and seat sharespg_elex <- pg_elex |> dplyr::group_by(election_id) |> dplyr::mutate(sum_seat_sh =sum(seat_share, na.rm =TRUE),sum_vote_sh =sum(vote_share, na.rm =TRUE))### Remove cases where the sum of seat shares is greater than one### At present, this is the 2022 Australian electioneg <- pg_elex |>subset(sum_seat_sh >1.01,select =c(election_date, country_name_short, party_name_english, seat_share))pg_elex <- pg_elex |>subset(sum_seat_sh ==1)### Set missing seats to zeropg_elex <- pg_elex |>transform(seat_share = dplyr::coalesce(seat_share, 0.0),seats = dplyr::coalesce(seats, 0L))### In-fill missing vote share with the smaller value of 1 percent or underflow / nMissingpg_elex <- pg_elex |> dplyr::group_by(election_id) |> dplyr::mutate(infill =1-unique(sum_vote_sh),nMiss =sum(is.na(vote_share)),nMiss =ifelse(nMiss ==0, 1, nMiss),replacement =min(c(1, unique(infill) / nMiss)),vote_share = dplyr::coalesce(vote_share, replacement))### Vote share may still not sum to one, but there will be no missing values### check <- sum(is.na(pg_elex$vote_share))### Remove some variables we no longer needpg_elex <- pg_elex |>subset(select =c(previous_parliament_election_id, election_id, election_date, country_name_short, country_name, party_id, party_name, party_name_english, party_name_short, left_right, family_name, seats, seats_total, seat_share, seat_share.l, vote_share, vote_share.l))### Coerce to data frame (not tibble)### Helpful for Ameliapg_elex <-as.data.frame(pg_elex)### Single-shot imputation of left-righta.out <-amelia(pg_elex, m =1,idvars =c("previous_parliament_election_id", "election_id", "election_date","country_name_short", "country_name","party_id", "party_name", "party_name_english", "party_name_short"),noms =c("family_name"))pg_elex <- a.out$imputations[[1]]
I’ll be working with a data frame called pg_elex. It contains information on elections (each with a unique identifier, election_id) contested by different parties (unique identifier: party_id) who won votes (vote_share, transformed to be in the range \([0, 1]\)) and seats (seats or seat_share). Because we’re working with coalitions, and because coalitions include seat-winning parties rather than independents or single members of parliament, we’ll remove some entries from the data.
We’ll start writing out the most minimal version of our function, make_coals. This function will operate on a data frame which has values for a single election (or some other grouping factor). Below I’ve written my first attempt at this function. (I’ve turned line numbering on – this will be useful later).
make_coals <-function(df,party_name_var ="party_name_short",key_var ="election_id") {if (!inherits(df, "data.frame")) {stop("`df` must be a data frame") }### Unclass if it's a tibble, otherwise df[,var] won't workif (inherits(df, "tbl")) { df <-as.data.frame(df) } party_names <- df[, party_name_var] key <- df[, key_var] key <-unique(key)if (length(key) !=1) {stop("Variable in `key_var` must have one unique value") }### List where each list entry records participation (1) or### non-participation (0) plist <-lapply(party_names, function(x) c(0, 1))### Expand the grid coals <-expand.grid(plist)### Remove the "no-party" entry, which is always the first coals <- coals[-1,]colnames(coals) <- party_names coal_id <-apply(coals, 1, function(x){ coal_parties <- party_names[which(x ==1)]paste(sort(coal_parties), collapse ="; ") }) retval <-data.frame(coal_id = coal_id,key = key)### Amend the last name### use double brackets b/c we're only picking out a single valuenames(retval)[[ncol(retval)]] <- key_var retval}
There are a couple of things to note here. First, the combination of all possible parties is generated using expand.grid, supplying a list of length equal to the number of parties, where each list item has a zero and a one. Second, the coalition is given an identifier, which is just a concatenation of the sorted party names, separated with a semi-colon. Later on I’ll use party_id as the basis for this identifier.
Let’s check the output of this function in a very simple case, before moving on to larger examples, where we’ll start timing the function. We begin by setting up our examples.
### 1975 New Zealand election -- two partiessmall_eg_df <-subset(pg_elex, election_id ==459)### 2005 Norwegian election - seven partiesmedium_eg_df <-subset(pg_elex, election_id ==3)### 1994 Italian election - eighteen partieslarge_eg_df <-subset(pg_elex, election_id ==146)nrow(small_eg_df)
[1] 2
nrow(medium_eg_df)
[1] 7
nrow(large_eg_df)
[1] 18
Here is the example output for the New Zealand case, where we have \(2^2 - 1 = 3\) possible coalitions.
How quickly does this function generate the possible coalitions? We can get a rough approximation using the system.time() function to return the elapsed time in seconds.
system.time(make_coals(small_eg_df))
user system elapsed
0.001 0.000 0.001
system.time(make_coals(medium_eg_df))
user system elapsed
0.004 0.000 0.004
system.time(make_coals(large_eg_df))
user system elapsed
9.616 0.080 9.697
It seems that run-time increases exponentially with the number of seat-winning parties, and that the run-time for a large number of parties is unacceptably high (i.e., more than a few seconds). Bear in mind that we will be calling this function several hundred times, and that we have not yet added other functions which might themselves be computationally expensive.
What can be improved?
There a number of problems with the code as it is written. We can identify these problems by using profiling. This will show us where in the code we are spending most time. In order to use line numbers with the default R profiling function Rprof, I need to write the function out to a file.
We can see that most of the time is spent on line 33 of the function (paste(sort(coal_parties), collapse = "; ")). This line seems natural, but it’s actually horribly inefficient. Here, we are sorting the vector coal_parties\(2^n - 1\) times. We can do much better by sorting the data frame just once, in ascending alphabetical order of party_name_short, at the beginning of the function. (If we were using numerical identifiers we would use ascending numerical order).
We can also do the concatenation of party names in a smarter way. Although R’s base paste function is great, there are some specialised packages which handling strings in better or faster ways. The stringi package has a function stri_c which concatenates character vectors slightly quicker (at the cost of omitting a few validation checks).
Here is a rewritten version of the function, which remedies these mistakes, and also uses a slightly more efficient way of generating binary combinations (make_coal_mat), which is taken from package e1071 (function bincombinations).
make_coal_mat <-function(party_names) { n0 <-length(party_names) ncombs <-2^ n0 coals <-matrix(FALSE, nrow = ncombs, ncol = n0)for (i in1:n0) { coals[, i] <-rep(c(rep(FALSE, (ncombs/2^i)),rep(TRUE, (ncombs/2^i))), length.out = ncombs) }colnames(coals) <- party_names### omit "zero parties" row coals <- coals[-1,]return(coals)}make_coals <-function(df,party_name_var ="party_name_short",key_var ="election_id") {require(stringi)if (!inherits(df, "data.frame")) {stop("`df` must be a data frame") }### Unclass if it's a tibble, otherwise df[,var] won't workif (inherits(df, "tbl")) { df <-as.data.frame(df) }### Order the data frame by party name### This avoids lots of sorts later on df <- df[order(df[,party_name_var]), ] party_names <- df[, party_name_var] key <- df[, key_var] key <-unique(key)if (length(key) !=1) {stop("Variable in `key_var` must have one unique value") } coals <-make_coal_mat(party_names) coal_id <-apply(coals, 1, function(x) { stringi::stri_c(party_names[which(x ==TRUE)], collapse ="; ") }) retval <-data.frame(coal_id = coal_id,key = key,row.names =NULL,check.names =FALSE,stringsAsFactors =FALSE)### Amend the last namenames(retval)[[ncol(retval)]] <- key_var retval}
Here we see the result of the function on the simplest case:
which is around one quarter of the time the first version took. We can now carry out this exercise for all groups in the data using a split-apply-combine strategy.
Total evaluation time of half a minute is reasonable for 618 unique elections.
Parallelization
If we are confident we’ve optimized the code as far as we can, we can turn to hardware based speed-ups. Here, I parallelize the split-apply-combine operation in the following way, achieving a 2.5x speedup on my eight core machine. Note that I have to send the make_coal_mat function to the compute cluster I’ve created, even though it’s a local cluster.
In this way, through a combination of rewritten code and better use of computing power, we’re able to calculate possible coalitions for all elections in almost the same time it originally took us to calculate possible coalitions for one (admittedly complex) election!