<- "LP; NP" coal_id
Introduction
In the previous post, I created a function which could generate all \(2^N-1\) possible coalitions of \(N\) parties. I placed particular emphasis on making that function as fast as possible. This matters because typically we’ll be analyzing hundreds of government formation opportunities, and even code that is relatively quick for a single government formation opportunity (say, less than half a second) can become a bottleneck when dealings with hundreds of such opportunities.
One of the most time-consuming stages in creating all possible coalitions was creating the coalition identifier – a sequence of party codes, separated by a colon. Is there a way to make this stage faster?
List columns
One option is to not to store the coalition identifier as a character vector of length one, but as a character vector with length equal to the number of parties. That is, instead of storing the coalition identifier for the two-party coalition involving both Labour and National as
we can store it as
<- c("LP", "NP") coal_id
This seems natural when dealing with a single coalition, but becomes more problematic when we must store information on multiple coalitions in a data frame. One key principle of tidy data is that every cell in the data frame should correspond to a single value. We respected that principle previously because we constructed a single value by concatenating different strings. Here, we’re breaking that principle by using a ragged array or (to use terminology more specific to R) a list-column.
List columns are supported in ‘base’ R. Here’s an example.
<- data.frame(iso3c = rep("NZL", 3),
dat election_date = as.Date("2022-11-19"))
<- list("LP", "NP", c("LP", "NP"))
listcol
$listcol <- listcol dat
When we inspect the data it all looks normal:
dat
iso3c election_date listcol
1 NZL 2022-11-19 LP
2 NZL 2022-11-19 NP
3 NZL 2022-11-19 LP, NP
We have to peek in to the internals to see how listcol
is represented.
str(dat)
'data.frame': 3 obs. of 3 variables:
$ iso3c : chr "NZL" "NZL" "NZL"
$ election_date: Date, format: "2022-11-19" "2022-11-19" ...
$ listcol :List of 3
..$ : chr "LP"
..$ : chr "NP"
..$ : chr "LP" "NP"
However many operations which work fine for data frames break when working with list-columns. It’s not, for example, possible to merge two data frames on a list-column.
<- data.frame(numpar = c(1, 1, 2))
aux $listcol <- listcol
aux
<- merge(dat, aux, by = "listcol", all = TRUE) comb
Error in sort.list(bx[m$xi]): 'x' must be atomic for 'sort.list', method "shell" and "quick"
Have you called 'sort' on a list?
I’ve also had to construct data frames with a list-column quite carefully, using the $
notation. The following will not work:
<- data.frame(numpar = c(1, 1, 2),
aux listcol = listcol)
Error in data.frame(numpar = c(1, 1, 2), listcol = listcol): arguments imply differing number of rows: 3, 2
and we have to use very obscure R commands to tell R that we do in fact wish to construct a data frame where one column is a list.
One solution is to use tibbles, which are a subclass of data frames. R for Data Science describes tibbles as:
“opinionated data frames that make working in the tidyverse a little easier”
and certainly they make working with list-columns much easier. We can construct a tibble containing a list column by replacing our earlier call to data.frame
with a call to tibble
:
require(tibble)
Loading required package: tibble
<- tibble(numpar = c(1, 1, 2),
aux listcol = listcol)
which gives the following result:
aux
# A tibble: 3 × 2
numpar listcol
<dbl> <list>
1 1 <chr [1]>
2 1 <chr [1]>
3 2 <chr [2]>
Revising our coalition building function
How does this help us with our function to create coalitions? Let’s dust down the function used previously:
<- function(df,
make_coals party_name_var = "party_name_short",
key_var = "election_id") {
if (!inherits(df, "data.frame")) {
stop("`df` must be a data frame")
}
### Unclass if it's a tibble, otherwise df[,var] won't work
if (inherits(df, "tbl")) {
<- as.data.frame(df)
df
}
<- df[, party_name_var]
party_names <- df[, key_var]
key
<- unique(key)
key
if (length(key) != 1) {
stop("Variable in `key_var` must have one unique value")
}
### List where each list entry records participation (1) or
### non-participation (0)
<- lapply(party_names, function(x) c(0, 1))
plist ### Expand the grid
<- expand.grid(plist)
coals ### Remove the "no-party" entry, which is always the first
<- coals[-1,]
coals
colnames(coals) <- party_names
<- apply(coals, 1, function(x){
coal_id <- party_names[which(x == 1)]
coal_parties paste(sort(coal_parties), collapse = "; ")
})<- data.frame(coal_id = coal_id,
retval key = key)
### Amend the last name
### use double brackets b/c we're only picking out a single value
names(retval)[[ncol(retval)]] <- key_var
retval }
and compare it to the following revised function declaration:
<- function(df,
make_coals_v2 party_name_var = "party_name_short",
key_var = "election_id") {
require(tibble)
<- df |> pull({{party_name_var}})
party_names <- df |> pull({{key_var}})
key <- unique(key)
key
if (length(key) != 1) {
stop("Variable in `key_var` must have one unique value")
}
### List where each list entry records participation (1) or
### non-participation (0)
<- lapply(party_names, function(x) c(0, 1))
plist ### Expand the grid
<- expand.grid(plist)
coals ### Remove the "no-party" entry, which is always the first
<- coals[-1,]
coals colnames(coals) <- party_names
<- apply(coals, 1, function(x){
coal_id which(x == 1)]
party_names[
})<- tibble(coal_id = coal_id,
retval key = key)
### Amend the last name
### use double brackets b/c we're only picking out a single value
retval }
Note how I’ve used the curly-curly operator ({{
) to interpolate the name of a variable, and supplied this as an argument to pull
. You can think of pull
as the tidyverse’s slightly longer-winded equivalent of the dollar sign notation, $
.
I use pull
because if I try to access a variable programatically the way I did before, it give me a single column tibble, which is usually not what I want:
<- "numpar"
var aux[,var]
# A tibble: 3 × 1
numpar
<dbl>
1 1
2 1
3 2
We can test out these two functions with our examples. As before, the code to load all the data is hidden.
Show the code
library(dplyr)
library(Amelia)
<- read.csv("data/view_election.csv")
pg_elex <- read.csv("data/view_party.csv")
pg_party
### Restrict to national parliament elections
<- pg_elex |>
pg_elex subset(election_type == "parliament")
### Date handling
<- pg_elex |>
pg_elex transform(election_date = as.Date(election_date))
### Add on the party family
<- merge(pg_elex,
pg_elex |>
pg_party subset(select = c(party_id, family_name)) |>
unique(),
by = c("party_id"),
all.x = TRUE,
all.y = FALSE)
### Add on lagged vote and seat share
<- pg_elex |>
pg_elex transform(vote_share = vote_share / 100,
seat_share = seats / seats_total)
<- merge(pg_elex,
pg_elex |> subset(select = c(election_id, party_id, vote_share, seat_share)),
pg_elex by.x = c("previous_parliament_election_id", "party_id"),
by.y = c("election_id", "party_id"),
suffixes = c("", ".l"),
all.x = TRUE,
all.y = FALSE)
### Remove the very first election for each country
<- pg_elex |>
pg_elex ::group_by(country_name_short) |>
dplyr::filter(election_date > min(election_date, na.rm = TRUE))
dplyr
### Coalesce the remaining vote and seat shares to zero
<- pg_elex |>
pg_elex transform(vote_share.l = dplyr::coalesce(vote_share.l, 0.0),
seat_share.l = dplyr::coalesce(seat_share.l, 0.0))
###
### Restrict to post-1945 elections
<- pg_elex |>
pg_elex subset(election_date > as.Date("1945-01-01"))
### Calculate the sum of vote and seat shares
<- pg_elex |>
pg_elex ::group_by(election_id) |>
dplyr::mutate(sum_seat_sh = sum(seat_share, na.rm = TRUE),
dplyrsum_vote_sh = sum(vote_share, na.rm = TRUE))
### Remove cases where the sum of seat shares is greater than one
### At present, this is the 2022 Australian election
<- pg_elex |>
eg subset(sum_seat_sh > 1.01,
select = c(election_date, country_name_short, party_name_english, seat_share))
<- pg_elex |>
pg_elex subset(sum_seat_sh == 1)
### Set missing seats to zero
<- pg_elex |>
pg_elex transform(seat_share = dplyr::coalesce(seat_share, 0.0),
seats = dplyr::coalesce(seats, 0L))
### In-fill missing vote share with the smaller value of 1 percent or underflow / nMissing
<- pg_elex |>
pg_elex ::group_by(election_id) |>
dplyr::mutate(infill = 1 - unique(sum_vote_sh),
dplyrnMiss = sum(is.na(vote_share)),
nMiss = ifelse(nMiss == 0, 1, nMiss),
replacement = min(c(1, unique(infill) / nMiss)),
vote_share = dplyr::coalesce(vote_share, replacement))
### Vote share may still not sum to one, but there will be no missing values
### check <- sum(is.na(pg_elex$vote_share))
### Remove some variables we no longer need
<- pg_elex |>
pg_elex subset(select = c(previous_parliament_election_id, election_id, election_date,
country_name_short, country_name,
party_id, party_name, party_name_english, party_name_short,
left_right, family_name,
seats, seats_total, seat_share, seat_share.l,
vote_share, vote_share.l))
### Coerce to data frame (not tibble)
### Helpful for Amelia
<- as.data.frame(pg_elex)
pg_elex
### Single-shot imputation of left-right
<- amelia(pg_elex, m = 1,
a.out idvars = c("previous_parliament_election_id", "election_id", "election_date",
"country_name_short", "country_name",
"party_id", "party_name", "party_name_english", "party_name_short"),
noms = c("family_name"))
<- a.out$imputations[[1]]
pg_elex
<- pg_elex |>
pg_elex subset(party_name != "one seat") |>
subset(party_name != "no seat") |>
subset(party_name_english != "no seat") |>
subset(party_name != "no party affiliation") |>
subset(party_name_english != "no party affiliation") |>
subset(seats > 0)
### 1975 New Zealand election -- two parties
<- subset(pg_elex, election_id == 459)
small_eg_df ### 2005 Norwegian election - seven parties
<- subset(pg_elex, election_id == 3)
medium_eg_df ### 1994 Italian election - eighteen parties
<- subset(pg_elex, election_id == 146) large_eg_df
make_coals(small_eg_df)
coal_id election_id
2 LP 459
3 NP 459
4 LP; NP 459
make_coals_v2(small_eg_df)
# A tibble: 3 × 2
coal_id key
<named list> <int>
1 <chr [1]> 459
2 <chr [1]> 459
3 <chr [2]> 459
If we want to see the coalition identifiers, we can inspect them:
make_coals_v2(small_eg_df) |>
pull(coal_id)
$`2`
[1] "LP"
$`3`
[1] "NP"
$`4`
[1] "LP" "NP"
Speed-ups
Does this help us with generating our possible coalitions over a plausibly sized data-set? Let’s find out…
<- split(pg_elex, pg_elex$election_id)
split_data <- system.time(res <- lapply(split_data, make_coals))
t1 print(t1)
user system elapsed
121.487 0.255 121.766
<- system.time(res <- lapply(split_data, make_coals_v2))
t2 print(t2)
user system elapsed
24.171 0.244 24.420
The speed gains are quite considerable. This is a good example of when tibbles can be used to great effect. Using tibbles in your code does have drawbacks – subsetting doesn’t always work the way you think it should, or the way you are used to – but when the gains are this considerable, these drawbacks are worth it.