Recently I had to simulate the expected absolute distance between citizens and their nearest political party, under the assumption that both citizen and party positions are drawn from a standard normal distribution. I have to simulate this quantity because there’s no closed-form solution outside of the single-party case.
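For reference, the single-party case can be worked out directly: the difference between a voter and a single party is the difference of two independent standard normals, which is itself normal with variance two, so its expected absolute value is 2 / sqrt(pi). The snippet below is my addition, not part of the timings that follow; it just checks that claim by simulation.
2 / sqrt(pi)                        # exact value for one party: about 1.128
mean(abs(rnorm(1e6) - rnorm(1e6)))  # quick simulation check, should land close to that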
Initial code
Here’s some set-up that’s common to all simulations. I set a seed, draw 10,000 voters, and say that I want one hundred simulations for each batch, or each size of party system. One hundred simulations is not a lot, but it helps illustrate the point without making this document too slow to render.
library(tictoc)

set.seed(236435)
n_voters <- 1e4
voters <- rnorm(n_voters)
n_sims <- 100
Here’s my first implementation.
tic()
out <- sapply(1:6, function(n_parties) {
  res <- sapply(1:n_sims, function(i) {
    parties <- rnorm(n_parties, mean = 0, sd = 1)
    ### Create all the distances
    d <- outer(voters, parties, `-`)
    d <- abs(d)
    min_d <- apply(d, 1, min)
    return(mean(min_d))
  })
  return(res)
})
toc()
8.293 sec elapsed
In this code, the outer loop works over the number of parties in the system. The inner loop works over the simulations. Within each iteration of the inner loop, I simulate positions for n_parties parties. I then calculate all the pairwise differences between the vector of parties and the vector of voters using the outer function. I convert these differences into distances by taking the absolute value. This gives me a matrix with n_voters rows and n_parties columns. In order to get the “nearest” party, I apply the min function over the rows of this matrix.
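To make the shape of that matrix concrete, here’s a tiny hand-sized example (my addition, not part of the original simulations) with three voters and two parties.
v <- c(-1, 0, 2)            # three voters
p <- c(-0.5, 1)             # two parties
d <- abs(outer(v, p, `-`))  # 3 x 2 matrix, with d[i, j] = |v[i] - p[j]|
apply(d, 1, min)            # each voter's distance to their nearest party
# [1] 0.5 0.5 1.0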
As you can see from the calls to tic() and toc(), I’ve timed the execution using the tictoc library. I remember being disappointed by how slow this was. What if I wanted 1,000 simulations for each size of party system? Sure, I could parallelize this, but since it involves tailoring code to different machines with different numbers of cores, parallelization is almost always my last resort.
Profiling
In an effort to improve the code, I used the built-in R profiler. Here’s the code and the output I got.
Rprof()
out <- sapply(1:6, function(n_parties) {
  res <- sapply(1:n_sims, function(i) {
    parties <- rnorm(n_parties, mean = 0, sd = 1)
    ### Create all the distances
    d <- outer(voters, parties, `-`)
    d <- abs(d)
    min_d <- apply(d, 1, min)
    return(mean(min_d))
  })
  return(res)
})
Rprof(NULL)
summaryRprof()$by.self
self.time self.pct total.time total.pct
"apply" 7.68 72.05 10.52 98.69
"FUN" 2.12 19.89 10.66 100.00
"unlist" 0.36 3.38 0.36 3.38
"lengths" 0.22 2.06 0.22 2.06
"aperm.default" 0.12 1.13 0.12 1.13
"outer" 0.08 0.75 0.10 0.94
"aperm" 0.02 0.19 0.14 1.31
"abs" 0.02 0.19 0.02 0.19
"mean.default" 0.02 0.19 0.02 0.19
"rep.int" 0.02 0.19 0.02 0.19
Let’s focus on the first two lines of the output. The second line talks about FUN, which is R’s way of talking about an anonymous function. It would be better if I had written out my functions in separate function declarations like a proper programmer. I’m sure there must be an R style guide out there which deprecates anonymous functions longer than two lines.
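For what it’s worth, a named version of the inner function might look something like the sketch below (my rewrite, not code from the original timings); it at least gives the profiler a proper name to attribute time to.
# Simulate one draw of n_parties party positions and return the mean
# distance from each voter to their nearest party.
sim_one <- function(n_parties, voters) {
  parties <- rnorm(n_parties, mean = 0, sd = 1)
  d <- abs(outer(voters, parties, `-`))
  mean(apply(d, 1, min))
}

out <- sapply(1:6, function(n_parties) {
  sapply(1:n_sims, function(i) sim_one(n_parties, voters))
})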
It’s the first line, however, which is key. Calls to apply are taking up the lion’s share of the compute time. It turns out that using the general function apply is overkill, and that more specialized functions can get row minima more quickly. Here’s a rewrite which replaces min_d <- apply(d, 1, min) with min_d <- rowMins(d), where rowMins() comes from the matrixStats package.
library(matrixStats)

tic()
out <- sapply(1:6, function(n_parties) {
  res <- sapply(1:n_sims, function(i) {
    parties <- rnorm(n_parties, mean = 0, sd = 1)
    ### Create all the distances
    d <- outer(voters, parties, `-`)
    d <- abs(d)
    min_d <- rowMins(d)
    return(mean(min_d))
  })
  return(res)
})
toc()
0.316 sec elapsed
As you can see, there’s a substantial improvement. Using rowMins is faster by a factor of twenty-five. If we look at the result of summaryRprof() below, we can see that rowMins and outer are now the two functions taking up most of the time.
self.time self.pct total.time total.pct
"rowMins" 0.18 60.00 0.18 60.00
"mean.default" 0.06 20.00 0.06 20.00
"outer" 0.04 13.33 0.04 13.33
"abs" 0.02 6.67 0.02 6.67
Make it go faster
Is there any scope for speeding up further? Since we were able to speed things up by using a specialised function from a separate package, maybe we should use a specialised function from a specialised package, one which delights in making things go faster? Such a package is the Rfast package. I don’t know anything about the authors, but the package loads with such a ridiculous l33t startup message that it feels like they might have been interested in the demo coding scene in the 1990s.
library(Rfast)
Loading required package: Rcpp
Loading required package: zigg
Loading required package: RcppParallel
Attaching package: 'RcppParallel'
The following object is masked from 'package:Rcpp':
LdFlags
Rfast: 2.1.5.1
[large ASCII-art Rfast startup banner omitted]
Attaching package: 'Rfast'
The following objects are masked from 'package:matrixStats':
colMads, colMaxs, colMedians, colMins, colRanks, colVars, rowMads,
rowMaxs, rowMedians, rowMins, rowRanks, rowVars
The following object is masked from 'package:tictoc':
Stack
Let’s try their implementation of rowMins.
tic()
out <- sapply(1:6, function(n_parties) {
  res <- sapply(1:n_sims, function(i) {
    parties <- rnorm(n_parties, mean = 0, sd = 1)
    ### Create all the distances
    d <- outer(voters, parties, `-`)
    d <- abs(d)
    # value = TRUE asks for the minima themselves; by default Rfast::rowMins()
    # returns the column index of each row's minimum
    min_d <- Rfast::rowMins(d, value = TRUE)
    return(mean(min_d))
  })
  return(res)
})
toc()
0.22 sec elapsed
Once again, we’ve shaved some time off. It may not look like a lot, but this implementation is roughly 1.5 times quicker than the matrixStats version.
What doesn’t work (1)
This reliance on sampling might seem strange. Why, if we are postulating that voter positions follow a normal distribution, and if we are setting the number of voters to A Big Number, don’t we aim for an exact solution?
You can, if you like, write out a function that takes a vector of party positions, calculates the midpoints between them, and integrates the distance to the nearest party, weighted by the normal density, over each segment between those midpoints. But it’s incredibly slow: R likes matrix operations; it doesn’t like integration.
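Here’s roughly what that looks like (a sketch of my own under the same standard-normal assumptions, not code from the original post). Between two adjacent midpoints, one particular party is the nearest, so the expected distance is a sum of segment-wise integrals.
# Expected distance from a standard-normal voter to their nearest party,
# for a fixed vector of party positions.
exact_mean_min_dist <- function(parties) {
  p <- sort(parties)
  # Segment boundaries: -Inf, the midpoints between adjacent parties, +Inf
  cuts <- c(-Inf, head(p, -1) + diff(p) / 2, Inf)
  segments <- vapply(seq_along(p), function(j) {
    integrate(function(x) abs(x - p[j]) * dnorm(x),
              lower = cuts[j], upper = cuts[j + 1])$value
  }, numeric(1))
  sum(segments)
}

exact_mean_min_dist(rnorm(3))  # exact value for one random draw of three parties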
What doesn’t work (2)
In this code, I’ve used two loops. R programmers are often told that loops should be avoided, if at all possible. One way of avoiding an outer loop is to endogenize the number of parties – to simply sample a number of parties per iteration. This involves a certain loss of control – we never get exactly 100 simulations per party system size – but does it help speed things up?
tic()
sf <- function() {
  n_parties <- sample.int(6, size = 1)
  parties <- rnorm(n_parties, mean = 0, sd = 1)
  ### Create all the distances
  d <- outer(voters, parties, `-`)
  d <- abs(d)
  min_d <- Rfast::rowMins(d, value = TRUE)
  return(c(n_parties, mean(min_d)))
}
out <- replicate(6 * n_sims, sf(), simplify = 'array')
toc()
0.261 sec elapsed
This loss of control doesn’t give an increase in speed. It seems that the overhead of the outer loop – which only runs over six elements – simply isn’t significant.
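To put a rough number on that overhead (my addition, not a timing from the original post): even ten thousand passes of an outer sapply() over six elements that does no real work take only a fraction of a second.
system.time(for (i in 1:1e4) sapply(1:6, function(n_parties) NULL))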
Conclusions
The meta-level conclusion is that it’s always important to time things and check that your intuitions about what works or doesn’t work are correct.
The only-slightly-less-meta-level conclusion is that some “natural” R programming patterns aren’t always that quick. Although base R functions have been optimized a lot, they are generic tools, and tools with a narrower remit can often do the job quicker.