Recently I had to simulate the expected absolute distance between citizens and their nearest political party, under the assumption that both citizen and party positions are drawn from a standard normal distribution. I have to simulate this quantity because there’s no closed-form solution outside of the single-party case.
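For reference, the single-party case can be worked out directly: the difference between a voter and a single party is the difference of two independent standard normals, which is itself normal with variance two, so its expected absolute value is 2 / sqrt(pi). The snippet below is my addition, not part of the timings that follow; it just checks that claim by simulation.
2 / sqrt(pi)                        # exact value for one party: about 1.128
mean(abs(rnorm(1e6) - rnorm(1e6)))  # quick simulation check, should land close to that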
Initial code
Here’s some set-up that’s common to all simulations. I set a seed, draw 10,000 voters, and say that I want one hundred simulations for each batch, or each size of party system. One hundred simulations is not a lot, but it helps illustrate the point without making this document too slow to render.
library(tictoc)

set.seed(236435)
n_voters <- 1e4
voters <- rnorm(n_voters)
n_sims <- 100
Here’s my first implementation.
tic()
out <- sapply(1:6, function(n_parties) {
  res <- sapply(1:n_sims, function(i) {
    parties <- rnorm(n_parties, mean = 0, sd = 1)
    ### Create all the distances
    d <- outer(voters, parties, `-`)
    d <- abs(d)
    min_d <- apply(d, 1, min)
    return(mean(min_d))
  })
  return(res)
})
toc()
8.293 sec elapsed
In this code, the outer loop works over the number of parties in the system. The inner loop works over the simulations. Within each iteration of the inner loop, I simulate positions for n_parties parties. I then calculate all the pairwise differences between the vector of parties and the vector of voters using the outer function. I convert these differences into distances by taking the absolute value. This gives me a matrix with n_voters rows and n_parties columns. In order to get the “nearest” party, I apply the min function over the rows of this matrix.
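To make the shape of that matrix concrete, here’s a tiny hand-sized example (my addition, not part of the original simulations) with three voters and two parties.
v <- c(-1, 0, 2)            # three voters
p <- c(-0.5, 1)             # two parties
d <- abs(outer(v, p, `-`))  # 3 x 2 matrix, with d[i, j] = |v[i] - p[j]|
apply(d, 1, min)            # each voter's distance to their nearest party
# [1] 0.5 0.5 1.0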
As you can see from the calls to tic() and toc(), I’ve timed the execution using the tictoc library. I remember being disappointed by how slow this was. What if I wanted 1,000 simulations for each size of party system? Sure, I could parallelize this, but since it involves tailoring code to different machines with different numbers of cores, parallelization is almost always my last resort.
Profiling
In an effort to improve the code, I used the built-in R profiler. Here’s the code and the output I got.
Rprof()
out <- sapply(1:6, function(n_parties) {
  res <- sapply(1:n_sims, function(i) {
    parties <- rnorm(n_parties, mean = 0, sd = 1)
    ### Create all the distances
    d <- outer(voters, parties, `-`)
    d <- abs(d)
    min_d <- apply(d, 1, min)
    return(mean(min_d))
  })
  return(res)
})
Rprof(NULL)
summaryRprof()$by.self
self.time self.pct total.time total.pct
"apply" 7.68 72.05 10.52 98.69
"FUN" 2.12 19.89 10.66 100.00
"unlist" 0.36 3.38 0.36 3.38
"lengths" 0.22 2.06 0.22 2.06
"aperm.default" 0.12 1.13 0.12 1.13
"outer" 0.08 0.75 0.10 0.94
"aperm" 0.02 0.19 0.14 1.31
"abs" 0.02 0.19 0.02 0.19
"mean.default" 0.02 0.19 0.02 0.19
"rep.int" 0.02 0.19 0.02 0.19
Let’s focus on the first two lines of the output. The second line talks about FUN, which is R’s way of talking about an anonymous function. It would be better if I had written out my functions in separate function declarations like a proper programmer. I’m sure there must be an R style guide out there which deprecates anonymous functions longer than two lines.
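For what it’s worth, a named version of the inner function might look something like the sketch below (my rewrite, not code from the original timings); it at least gives the profiler a proper name to attribute time to.
# Simulate one draw of n_parties party positions and return the mean
# distance from each voter to their nearest party.
sim_one <- function(n_parties, voters) {
  parties <- rnorm(n_parties, mean = 0, sd = 1)
  d <- abs(outer(voters, parties, `-`))
  mean(apply(d, 1, min))
}

out <- sapply(1:6, function(n_parties) {
  sapply(1:n_sims, function(i) sim_one(n_parties, voters))
})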
It’s the first line, however, which is key. Calls to apply are taking up the lion’s share of the compute time. It turns out that using the general function apply is overkill, and that more specialized functions can get row minima more quickly. Here’s a rewrite which replaces min_d <- apply(d, 1, min) with min_d <- rowMins(d), where rowMins() comes from the matrixStats package.
library(matrixStats)

tic()
out <- sapply(1:6, function(n_parties) {
  res <- sapply(1:n_sims, function(i) {
    parties <- rnorm(n_parties, mean = 0, sd = 1)
    ### Create all the distances
    d <- outer(voters, parties, `-`)
    d <- abs(d)
    min_d <- rowMins(d)
    return(mean(min_d))
  })
  return(res)
})
toc()
0.316 sec elapsed
As you can see, there’s a substantial improvement. Using rowMins is faster by a factor of twenty-five. If we look at the result of summaryRprof() below, we can see that rowMins and outer are now the two functions taking up most of the time.
self.time self.pct total.time total.pct
"rowMins" 0.18 60.00 0.18 60.00
"mean.default" 0.06 20.00 0.06 20.00
"outer" 0.04 13.33 0.04 13.33
"abs" 0.02 6.67 0.02 6.67
Make it go faster
Is there any scope for speeding up further? Since we were able to speed things up by using a specialised function from a separate package, maybe we should use a specialised function from a specialised package, one which delights in making things go faster? Such a package is the Rfast package. I don’t know anything about the authors, but the package loads with such a ridiculous l33t startup message that it feels like they might have been interested in the demo coding scene in the 1990s.
library(Rfast)
Loading required package: Rcpp
Loading required package: zigg
Loading required package: RcppParallel
Attaching package: 'RcppParallel'
The following object is masked from 'package:Rcpp':
LdFlags
Rfast: 2.1.5.1
[large ASCII-art Rfast startup banner omitted]
Attaching package: 'Rfast'
The following objects are masked from 'package:matrixStats':
colMads, colMaxs, colMedians, colMins, colRanks, colVars, rowMads,
rowMaxs, rowMedians, rowMins, rowRanks, rowVars
The following object is masked from 'package:tictoc':
Stack
Let’s try their implementation of rowMins.
tic()
out <- sapply(1:6, function(n_parties) {
  res <- sapply(1:n_sims, function(i) {
    parties <- rnorm(n_parties, mean = 0, sd = 1)
    ### Create all the distances
    d <- outer(voters, parties, `-`)
    d <- abs(d)
    # value = TRUE asks for the minima themselves; by default Rfast::rowMins()
    # returns the column index of each row's minimum
    min_d <- Rfast::rowMins(d, value = TRUE)
    return(mean(min_d))
  })
  return(res)
})
toc()
0.22 sec elapsed
Once again, we’ve shaved some time off. It may not look like a lot, but this implementation is roughly 1.5 times quicker than the matrixStats version.
What doesn’t work (1)
This reliance on sampling might seem strange. Why, if we are postulating that voter positions follow a normal distribution, and if we are setting the number of voters to A Big Number, don’t we aim for an exact solution?
You can, if you like, write out a function that takes a vector of party positions, calculates the midpoints between them, and integrates the distance to the nearest party, weighted by the normal density, over each segment between those midpoints. But it’s incredibly slow: R likes matrix operations; it doesn’t like integration.
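Here’s roughly what that looks like (a sketch of my own under the same standard-normal assumptions, not code from the original post). Between two adjacent midpoints, one particular party is the nearest, so the expected distance is a sum of segment-wise integrals.
# Expected distance from a standard-normal voter to their nearest party,
# for a fixed vector of party positions.
exact_mean_min_dist <- function(parties) {
  p <- sort(parties)
  # Segment boundaries: -Inf, the midpoints between adjacent parties, +Inf
  cuts <- c(-Inf, head(p, -1) + diff(p) / 2, Inf)
  segments <- vapply(seq_along(p), function(j) {
    integrate(function(x) abs(x - p[j]) * dnorm(x),
              lower = cuts[j], upper = cuts[j + 1])$value
  }, numeric(1))
  sum(segments)
}

exact_mean_min_dist(rnorm(3))  # exact value for one random draw of three parties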
What doesn’t work (2)
In this code, I’ve used two loops. R programmers are often told that loops should be avoided, if at all possible. One way of avoiding an outer loop is to endogenize the number of parties – to simply sample a number of parties per iteration. This involves a certain loss of control – we never get exactly 100 simulations per party system size – but does it help speed things up?
tic()
sf <- function() {
  n_parties <- sample.int(6, size = 1)
  parties <- rnorm(n_parties, mean = 0, sd = 1)
  ### Create all the distances
  d <- outer(voters, parties, `-`)
  d <- abs(d)
  min_d <- Rfast::rowMins(d, value = TRUE)
  return(c(n_parties, mean(min_d)))
}
out <- replicate(6 * n_sims, sf(), simplify = 'array')
toc()
0.261 sec elapsed
This loss of control doesn’t give an increase in speed. It seems that the overhead of the outer loop – which only runs over six elements – simply isn’t significant.
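To put a rough number on that overhead (my addition, not a timing from the original post): even ten thousand passes of an outer sapply() over six elements that does no real work take only a fraction of a second.
system.time(for (i in 1:1e4) sapply(1:6, function(n_parties) NULL))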
Conclusions
The meta-level conclusion is that it’s always important to time things and check that your intuitions about what works or doesn’t work are correct.
The only-slightly-less-meta-level conclusion is that some “natural” R programming patterns aren’t always that quick. Although base R functions have been optimized a lot, they are generic tools, and tools with a narrower remit can often do the job quicker.