Introduction

Coalition formation is an important topic in political science, but coalition data is difficult to work with. When a number of groups come together to form a coalition, they make a single discrete choice out of a large number of possible coalitions. The number of possible coalitions grows exponentially with the number of groups. If we exclude the coalition with no groups, the number of possible coalitions is \(2^N - 1\).

Creating these possible coalitions in your favourite stats package can be complex. In this series of posts, I’ll look at how you can create data on possible coalitions using R.

I’ll pay particular attention to how you can can do this quickly. Whilst there will always be a place for code snippets which run overnight, slow code is code which is difficult to integrate into a quickly changing pipeline.

I’m going to build this coalition data using data from the ParlGov project. I’ve put some of the code I use to tidy the data below. You can expand it so that you can see what I’m doing.

Show the code

library(dplyr)
library(Amelia)
pg_elex <- read.csv("data/view_election.csv")
pg_party <- read.csv("data/view_party.csv")

### Restrict to national parliament elections
pg_elex <- pg_elex |>
    subset(election_type == "parliament")

### Date handling
pg_elex <- pg_elex |>
    transform(election_date = as.Date(election_date))

### Add on the party family 
pg_elex <- merge(pg_elex,
                 pg_party |>
                 subset(select = c(party_id, family_name)) |>
                 unique(),
                 by = c("party_id"),
                 all.x = TRUE,
                 all.y = FALSE)

### Add on lagged vote and seat share
pg_elex <- pg_elex |>
    transform(vote_share = vote_share / 100,
              seat_share = seats / seats_total)

pg_elex <- merge(pg_elex,
                 pg_elex |> subset(select = c(election_id, party_id, vote_share, seat_share)),
                 by.x = c("previous_parliament_election_id", "party_id"),
                 by.y = c("election_id", "party_id"),
                 suffixes = c("", ".l"),
                 all.x = TRUE,
                 all.y = FALSE)

### Remove the very first election for each country
pg_elex <- pg_elex |>
    dplyr::group_by(country_name_short) |>
    dplyr::filter(election_date > min(election_date, na.rm = TRUE))

### Coalesce the remaining vote and seat shares to zero
pg_elex <- pg_elex |>
    transform(vote_share.l = dplyr::coalesce(vote_share.l, 0.0),
              seat_share.l = dplyr::coalesce(seat_share.l, 0.0))

### 
### Restrict to post-1945 elections
pg_elex <- pg_elex |>
    subset(election_date > as.Date("1945-01-01"))

### Calculate the sum of vote and seat shares
pg_elex <- pg_elex |>
    dplyr::group_by(election_id) |>
    dplyr::mutate(sum_seat_sh = sum(seat_share, na.rm = TRUE),
                  sum_vote_sh = sum(vote_share, na.rm = TRUE))

### Remove cases where the sum of seat shares is greater than one
### At present, this is the 2022 Australian election
eg <- pg_elex |>
    subset(sum_seat_sh > 1.01,
           select = c(election_date, country_name_short, party_name_english, seat_share))

pg_elex <- pg_elex |>
    subset(sum_seat_sh == 1)

### Set missing seats to zero
pg_elex <- pg_elex |>
    transform(seat_share = dplyr::coalesce(seat_share, 0.0),
           seats = dplyr::coalesce(seats, 0L))

### In-fill missing vote share with the smaller value of 1 percent or underflow / nMissing
pg_elex <- pg_elex |>
    dplyr::group_by(election_id) |>
    dplyr::mutate(infill = 1 - unique(sum_vote_sh),
              nMiss = sum(is.na(vote_share)),
              nMiss = ifelse(nMiss == 0, 1, nMiss),
              replacement = min(c(1, unique(infill) / nMiss)),
              vote_share = dplyr::coalesce(vote_share, replacement))

### Vote share may still not sum to one, but there will be no missing values
### check <- sum(is.na(pg_elex$vote_share))

### Remove some variables we no longer need
pg_elex <- pg_elex |>
    subset(select = c(previous_parliament_election_id, election_id, election_date,
                      country_name_short, country_name,
                      party_id, party_name, party_name_english, party_name_short,
                      left_right, family_name,
                      seats, seats_total, seat_share, seat_share.l,
                      vote_share, vote_share.l))

### Coerce to data frame (not tibble)
### Helpful for Amelia
pg_elex <- as.data.frame(pg_elex)

### Single-shot imputation of left-right
a.out <- amelia(pg_elex, m = 1,
       idvars = c("previous_parliament_election_id", "election_id", "election_date",
                  "country_name_short", "country_name",
                  "party_id", "party_name", "party_name_english", "party_name_short"),
       noms = c("family_name"))
       
pg_elex <- a.out$imputations[[1]]

I’ll be working with a data frame called pg_elex. It contains information on elections (each with a unique identifier, election_id) contested by different parties (unique identifier: party_id) who won votes (vote_share, transformed to be in the range \([0, 1]\)) and seats (seats or seat_share). Because we’re working with coalitions, and because coalitions include seat-winning parties rather than independents or single members of parliament, we’ll remove some entries from the data.

pg_elex <- pg_elex |>
    subset(party_name != "one seat") |>
    subset(party_name != "no seat") |>
    subset(party_name_english != "no seat") |>
    subset(party_name != "no party affiliation") |>
    subset(party_name_english != "no party affiliation") |>
    subset(seats > 0)

A first attempt

We’ll start writing out the most minimal version of our function, make_coals. This function will operate on a data frame which has values for a single election (or some other grouping factor). Below I’ve written my first attempt at this function. (I’ve turned line numbering on – this will be useful later).

make_coals <- function(df,
                       party_name_var = "party_name_short",
                       key_var = "election_id") {
    if (!inherits(df, "data.frame")) {
        stop("`df` must be a data frame")
    }
    
### Unclass if it's a tibble, otherwise df[,var] won't work
    if (inherits(df, "tbl")) {
        df <- as.data.frame(df)
    }
    
    party_names <- df[, party_name_var]
    key <- df[, key_var]

    key <- unique(key)

    if (length(key) != 1) {
        stop("Variable in `key_var` must have one unique value")
    }
    
### List where each list entry records participation (1) or
### non-participation (0)
    plist <- lapply(party_names, function(x) c(0, 1))
### Expand the grid
    coals <- expand.grid(plist)
### Remove the "no-party" entry, which is always the first
    coals <- coals[-1,]
    
    colnames(coals) <- party_names

    coal_id <- apply(coals, 1, function(x){
        coal_parties <- party_names[which(x == 1)]
        paste(sort(coal_parties), collapse = "; ")
    })
    retval <- data.frame(coal_id = coal_id,
                         key = key)

### Amend the last name
    ### use double brackets b/c we're only picking out a single value
    names(retval)[[ncol(retval)]] <- key_var

    retval
}

There are a couple of things to note here. First, the combination of all possible parties is generated using expand.grid, supplying a list of length equal to the number of parties, where each list item has a zero and a one. Second, the coalition is given an identifier, which is just a concatenation of the sorted party names, separated with a semi-colon. Later on I’ll use party_id as the basis for this identifier.

Let’s check the output of this function in a very simple case, before moving on to larger examples, where we’ll start timing the function. We begin by setting up our examples.

### 1975 New Zealand election -- two parties
small_eg_df <- subset(pg_elex, election_id == 459)
### 2005 Norwegian election - seven parties
medium_eg_df <- subset(pg_elex, election_id == 3)
### 1994 Italian election - eighteen parties
large_eg_df <- subset(pg_elex, election_id == 146)
nrow(small_eg_df)

[1] 2

nrow(medium_eg_df)

[1] 7

nrow(large_eg_df)

[1] 18

Here is the example output for the New Zealand case, where we have \(2^2 - 1 = 3\) possible coalitions.

make_coals(small_eg_df)

  coal_id election_id
2      LP         459
3      NP         459
4  LP; NP         459

How quickly does this function generate the possible coalitions? We can get a rough approximation using the system.time() function to return the elapsed time in seconds.

system.time(make_coals(small_eg_df))

   user  system elapsed 
  0.001   0.000   0.001

system.time(make_coals(medium_eg_df))

   user  system elapsed 
  0.004   0.000   0.004

system.time(make_coals(large_eg_df))

   user  system elapsed 
  9.616   0.080   9.697

It seems that run-time increases exponentially with the number of seat-winning parties, and that the run-time for a large number of parties is unacceptably high (i.e., more than a few seconds). Bear in mind that we will be calling this function several hundred times, and that we have not yet added other functions which might themselves be computationally expensive.

What can be improved?

There a number of problems with the code as it is written. We can identify these problems by using profiling. This will show us where in the code we are spending most time. In order to use line numbers with the default R profiling function Rprof, I need to write the function out to a file.

tmp <- tempfile()
dump("make_coals", file = tmp)

Rprof(filename = "Rprof.out", line.profiling = TRUE)
parse(tmp)
foo <- make_coals(large_eg_df)
Rprof(NULL)

summaryRprof(filename = "Rprof.out",
             lines = "show")

$by.self
          self.time self.pct total.time total.pct
<text>#34      6.50    73.86       6.50     73.86
<text>#33      0.92    10.45       0.92     10.45
<text>#32      0.68     7.73       8.10     92.05
<text>#36      0.40     4.55       0.40      4.55
<text>#28      0.28     3.18       0.28      3.18
<text>#26      0.02     0.23       0.02      0.23

$by.total
          total.time total.pct self.time self.pct
<text>#32       8.10     92.05      0.68     7.73
<text>#34       6.50     73.86      6.50    73.86
<text>#33       0.92     10.45      0.92    10.45
<text>#36       0.40      4.55      0.40     4.55
<text>#28       0.28      3.18      0.28     3.18
<text>#26       0.02      0.23      0.02     0.23

$by.line
          self.time self.pct total.time total.pct
<text>#26      0.02     0.23       0.02      0.23
<text>#28      0.28     3.18       0.28      3.18
<text>#32      0.68     7.73       8.10     92.05
<text>#33      0.92    10.45       0.92     10.45
<text>#34      6.50    73.86       6.50     73.86
<text>#36      0.40     4.55       0.40      4.55

$sample.interval
[1] 0.02

$sampling.time
[1] 8.8

We can see that most of the time is spent on line 33 of the function (paste(sort(coal_parties), collapse = "; ")). This line seems natural, but it’s actually horribly inefficient. Here, we are sorting the vector coal_parties \(2^n - 1\) times. We can do much better by sorting the data frame just once, in ascending alphabetical order of party_name_short, at the beginning of the function. (If we were using numerical identifiers we would use ascending numerical order).

We can also do the concatenation of party names in a smarter way. Although R’s base paste function is great, there are some specialised packages which handling strings in better or faster ways. The stringi package has a function stri_c which concatenates character vectors slightly quicker (at the cost of omitting a few validation checks).

Here is a rewritten version of the function, which remedies these mistakes, and also uses a slightly more efficient way of generating binary combinations (make_coal_mat), which is taken from package e1071 (function bincombinations).

make_coal_mat <- function(party_names) {
    n0 <- length(party_names)
    ncombs <- 2 ^ n0
    coals <- matrix(FALSE, nrow = ncombs, ncol = n0)
    for (i in 1:n0) {
        coals[, i] <- rep(c(rep(FALSE, (ncombs/2^i)),
                            rep(TRUE, (ncombs/2^i))), 
                          length.out = ncombs)
    }
    colnames(coals) <- party_names
    ### omit "zero parties" row
    coals <- coals[-1,]
    return(coals)
}


make_coals <- function(df,
                       party_name_var = "party_name_short",
                       key_var = "election_id") {

    require(stringi)
    
    if (!inherits(df, "data.frame")) {
        stop("`df` must be a data frame")
    }
    
### Unclass if it's a tibble, otherwise df[,var] won't work
    if (inherits(df, "tbl")) {
        df <- as.data.frame(df)
    }

### Order the data frame by party name
    ### This avoids lots of sorts later on
    df <- df[order(df[,party_name_var]), ]
    
    party_names <- df[, party_name_var]
    
    key <- df[, key_var]

    key <- unique(key)

    if (length(key) != 1) {
        stop("Variable in `key_var` must have one unique value")

    }
    coals <- make_coal_mat(party_names)
    
    coal_id <- apply(coals, 1, function(x) {
        stringi::stri_c(party_names[which(x == TRUE)], collapse = "; ")
    })
        
    retval <- data.frame(coal_id = coal_id,
                         key = key,
                         row.names = NULL,
                         check.names = FALSE,
                         stringsAsFactors = FALSE)

    ### Amend the last name
    names(retval)[[ncol(retval)]] <- key_var

    retval
}

Here we see the result of the function on the simplest case:

make_coals(small_eg_df)

Loading required package: stringi

  coal_id election_id
1      NP         459
2      LP         459
3  LP; NP         459

And the medium example:

make_coals(medium_eg_df)

                       coal_id election_id
1                            V           3
2                           SV           3
3                        SV; V           3
4                           Sp           3
5                        Sp; V           3
6                       Sp; SV           3
7                    Sp; SV; V           3
8                          KrF           3
9                       KrF; V           3
10                     KrF; SV           3
11                  KrF; SV; V           3
12                     KrF; Sp           3
13                  KrF; Sp; V           3
14                 KrF; Sp; SV           3
15              KrF; Sp; SV; V           3
16                           H           3
17                        H; V           3
18                       H; SV           3
19                    H; SV; V           3
20                       H; Sp           3
21                    H; Sp; V           3
22                   H; Sp; SV           3
23                H; Sp; SV; V           3
24                      H; KrF           3
25                   H; KrF; V           3
26                  H; KrF; SV           3
27               H; KrF; SV; V           3
28                  H; KrF; Sp           3
29               H; KrF; Sp; V           3
30              H; KrF; Sp; SV           3
31           H; KrF; Sp; SV; V           3
32                          Fr           3
33                       Fr; V           3
34                      Fr; SV           3
35                   Fr; SV; V           3
36                      Fr; Sp           3
37                   Fr; Sp; V           3
38                  Fr; Sp; SV           3
39               Fr; Sp; SV; V           3
40                     Fr; KrF           3
41                  Fr; KrF; V           3
42                 Fr; KrF; SV           3
43              Fr; KrF; SV; V           3
44                 Fr; KrF; Sp           3
45              Fr; KrF; Sp; V           3
46             Fr; KrF; Sp; SV           3
47          Fr; KrF; Sp; SV; V           3
48                       Fr; H           3
49                    Fr; H; V           3
50                   Fr; H; SV           3
51                Fr; H; SV; V           3
52                   Fr; H; Sp           3
53                Fr; H; Sp; V           3
54               Fr; H; Sp; SV           3
55            Fr; H; Sp; SV; V           3
56                  Fr; H; KrF           3
57               Fr; H; KrF; V           3
58              Fr; H; KrF; SV           3
59           Fr; H; KrF; SV; V           3
60              Fr; H; KrF; Sp           3
61           Fr; H; KrF; Sp; V           3
62          Fr; H; KrF; Sp; SV           3
63       Fr; H; KrF; Sp; SV; V           3
64                         DNA           3
65                      DNA; V           3
66                     DNA; SV           3
67                  DNA; SV; V           3
68                     DNA; Sp           3
69                  DNA; Sp; V           3
70                 DNA; Sp; SV           3
71              DNA; Sp; SV; V           3
72                    DNA; KrF           3
73                 DNA; KrF; V           3
74                DNA; KrF; SV           3
75             DNA; KrF; SV; V           3
76                DNA; KrF; Sp           3
77             DNA; KrF; Sp; V           3
78            DNA; KrF; Sp; SV           3
79         DNA; KrF; Sp; SV; V           3
80                      DNA; H           3
81                   DNA; H; V           3
82                  DNA; H; SV           3
83               DNA; H; SV; V           3
84                  DNA; H; Sp           3
85               DNA; H; Sp; V           3
86              DNA; H; Sp; SV           3
87           DNA; H; Sp; SV; V           3
88                 DNA; H; KrF           3
89              DNA; H; KrF; V           3
90             DNA; H; KrF; SV           3
91          DNA; H; KrF; SV; V           3
92             DNA; H; KrF; Sp           3
93          DNA; H; KrF; Sp; V           3
94         DNA; H; KrF; Sp; SV           3
95      DNA; H; KrF; Sp; SV; V           3
96                     DNA; Fr           3
97                  DNA; Fr; V           3
98                 DNA; Fr; SV           3
99              DNA; Fr; SV; V           3
100                DNA; Fr; Sp           3
101             DNA; Fr; Sp; V           3
102            DNA; Fr; Sp; SV           3
103         DNA; Fr; Sp; SV; V           3
104               DNA; Fr; KrF           3
105            DNA; Fr; KrF; V           3
106           DNA; Fr; KrF; SV           3
107        DNA; Fr; KrF; SV; V           3
108           DNA; Fr; KrF; Sp           3
109        DNA; Fr; KrF; Sp; V           3
110       DNA; Fr; KrF; Sp; SV           3
111    DNA; Fr; KrF; Sp; SV; V           3
112                 DNA; Fr; H           3
113              DNA; Fr; H; V           3
114             DNA; Fr; H; SV           3
115          DNA; Fr; H; SV; V           3
116             DNA; Fr; H; Sp           3
117          DNA; Fr; H; Sp; V           3
118         DNA; Fr; H; Sp; SV           3
119      DNA; Fr; H; Sp; SV; V           3
120            DNA; Fr; H; KrF           3
121         DNA; Fr; H; KrF; V           3
122        DNA; Fr; H; KrF; SV           3
123     DNA; Fr; H; KrF; SV; V           3
124        DNA; Fr; H; KrF; Sp           3
125     DNA; Fr; H; KrF; Sp; V           3
126    DNA; Fr; H; KrF; Sp; SV           3
127 DNA; Fr; H; KrF; Sp; SV; V           3

and we can check the timings now.

system.time(make_coals(large_eg_df))

   user  system elapsed 
  2.369   0.016   2.384

which is around one quarter of the time the first version took. We can now carry out this exercise for all groups in the data using a split-apply-combine strategy.

split_data <- split(pg_elex, pg_elex$election_id)
system.time(res <- lapply(split_data, make_coals))

   user  system elapsed 
 33.104   0.136  33.258

Total evaluation time of half a minute is reasonable for 618 unique elections.

Parallelization

If we are confident we’ve optimized the code as far as we can, we can turn to hardware based speed-ups. Here, I parallelize the split-apply-combine operation in the following way, achieving a 2.5x speedup on my eight core machine. Note that I have to send the make_coal_mat function to the compute cluster I’ve created, even though it’s a local cluster.

library(parallel)
nCores <- parallel::detectCores()
cl <- makeCluster(nCores - 1)
clusterExport(cl, varlist = "make_coal_mat")
system.time(res <- parLapply(cl, split_data, make_coals))

   user  system elapsed 
  1.627   0.059  13.752

In this way, through a combination of rewritten code and better use of computing power, we’re able to calculate possible coalitions for all elections in almost the same time it originally took us to calculate possible coalitions for one (admittedly complex) election!