Minimal important differences in the social sciences, part 1

Categories: measurement, academia
Author: Chris Hanretty
Published: August 27, 2024

If I told you that a weight loss drug would cause you to lose fourteen kilograms, give or take four kilograms, I think you would be impressed.1

If I told you that a weight loss drug would cause you to lose fourteen grams, give or take four grams, I think you would be less than impressed. You might say that a difference of fourteen grams “does not matter”, or is “inconsequential” or is not “substantively significant” – and these are just the polite things you might say to me.

Somewhere between fourteen grams and fourteen kilograms the effects of a weight loss drug start to matter, or become consequential, or become substantively significant. I’m going to call that point the minimal important difference, or MID.

I don’t know what the MID is for weight loss treatments: I’m not an expert on weight loss.2

I do think that the MID for weight loss exists. I also think it’s a good idea to concentrate resources (time, attention, money) on treatments which have effects bigger than the MID, compared to treatments which have effects smaller than the minimal important difference, assuming everything about the treatments other than their effect size is the same.

I think that what is true of weight loss is also true of lots of outcomes in the social sciences, even where these outcomes are regarded by almost everyone as normatively very important. For example: whilst I think that higher levels of democracy are always better, I think that there are some differences in levels of democracy which are not substantively significant.

This blog post is one of two describing how we can establish MIDs in the social sciences. In this post, I’ll look at how we can set a lower bound for the MIDs by looking at the minimal detectable change. In a sequel, I’ll look at a more specific case where researchers are indifferent between multiple measures of the same concept. I’ll argue that indifference between rival measures implies indifference to changes of less than a specified amount.

The minimal detectable change

Measurement – at least for continuous quantities – is never exact. This is true for ordinary observable quantities as well as for the social scientist’s latent constructs. In the recent summer Olympics, times in swimming competitions were recorded to the hundredth of a second, rather than to the thousandth of a second.

The reason is rather prosaic: although it’s possible to record elapsed time to the thousandth of a second, it’s not possible to construct swimming pools with tolerances of less than a millimetre, and so any difference of less than a hundredth of a second may be due to minute differences in the length of the lanes.

There is therefore some sense in which a difference of less than a hundredth of a second is not important. If it were important, we’d start constructing swimming pools the same way we construct particle accelerators.

Social scientists tend not to talk about tolerances, but about the standard error of measurement. This standard error of measurement comes up in different guises:

  • Maybe you have a psychometric test estimated under the assumptions of classical test theory, and you have a known standard error of measurement.
  • Maybe you have some rates of intercoder reliability and have converted a reliability statistic to a standard error of measurement.
  • Maybe you have a Bayesian measurement model and approximate the standard error of measurement by the posterior standard deviation of some parameter.

If you are in the first two categories, you have a single standard error of measurement; if you are in the last category, you have as many standard errors of measurement as you have measured parameters. In that case, you will want to replace references to the standard error of measurement with the smallest standard error of measurement.
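
For the second of those cases, the usual classical test theory conversion treats the reliability coefficient as the share of observed-score variance that is true-score variance, so that SEM = SD × √(1 − reliability). Here is a minimal sketch with hypothetical numbers (an observed standard deviation of 1.2 and a reliability of 0.9, both made up for illustration):

## Hypothetical inputs: an observed-score SD and a reliability coefficient
observed_sd <- 1.2
reliability <- 0.9
## Classical test theory: SEM = SD * sqrt(1 - reliability)
sem_from_reliability <- observed_sd * sqrt(1 - reliability)
sem_from_reliability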

If you have a standard error of measurement, you can work out the minimum detectable change, or “the amount of change needed to exceed measurement error for a specific measure based on a predetermined confidence threshold”. We may suppose that any two measures we care to compare are draws from independent normal distributions which have different means (the true scores we want to estimate but can never know) but identical standard deviations (the standard error of measurement). The difference between two normal distributions is distributed normally with standard deviation equal to the square root of the sum of the two variances. If we want to make claims about differences at the 95% level of significance,

\[ \mathrm{MDC} = 1.96 \times \sqrt{2} \times \mathrm{SEM} \tag{1}\]

where SEM is the standard error of measurement. If you’re a Bayesian, and if you have access to the posterior distributions needed, you might want to ignore this useful equation and calculate the probability that a difference is greater than zero directly.
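
Here is a minimal sketch of that direct calculation, with made-up posterior draws standing in for the real posterior distributions:

## Made-up posterior draws for two units' latent scores
set.seed(123)
draws_a <- rnorm(4000, mean = 0.52, sd = 0.02)
draws_b <- rnorm(4000, mean = 0.48, sd = 0.02)
## Probability that the difference between the two units is greater than zero
mean(draws_a - draws_b > 0)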

The MDC for electoral democracy

Let’s use as an example the minimal detectable change for electoral democracy. For this I’ll use data from the V-Dem project, and values of their variable v2x_polyarchy for the post-war period. I’ll also use the posterior standard deviation (v2x_polyarchy_sd) as an equivalent of the standard error of measurement.

library(vdemdata)
data("vdem")
## Keep post-war country-years with a non-missing polyarchy score
vdem <- vdem |>
    subset(year >= 1945,
           select = c(country_text_id, year, v2x_polyarchy, v2x_polyarchy_sd)) |>
    subset(!is.na(v2x_polyarchy))

## Smallest posterior standard deviation across all country-years
smallest_posterior_sd <- min(vdem$v2x_polyarchy_sd)

Specifically, I’ll use the smallest posterior standard deviation. In part because of the way the V-Dem scores are transformed to the range [0-1], the standard deviations are smallest at low values. The V-Dem project as a whole is most confident about the levels of electoral democracy in Eritrea in 2007. (Spoiler: they’re pretty confident there wasn’t much). The posterior standard deviation for this country year is 0.002 units on a 0-1 scale.

If we multiply this quantity by 1.96 and again by the square root of two, we get a figure of 0.0055 units. To put it on a different metric, that’s roughly 0.02 standard deviations.
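
The same calculation in code, reusing smallest_posterior_sd from the chunk above:

## MDC = 1.96 * sqrt(2) * SEM, using the smallest posterior SD as the SEM
mdc_polyarchy <- qnorm(0.975) * sqrt(2) * smallest_posterior_sd
mdc_polyarchy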

This MDC gives a rationale for only reporting V-Dem figures to two decimal places, because any information in that third decimal place is smaller than the minimum detectable change.

It is possible to give a different minimum detectable change depending on the research question. This alternative figure can be larger or smaller.

## Smallest posterior standard deviation amongst country-years scoring above 0.4
conditional_posterior_sd <- vdem |>
    subset(v2x_polyarchy > 0.4)

conditional_posterior_sd <- conditional_posterior_sd$v2x_polyarchy_sd |>
    min()

Here’s an example of a larger MDC. Perhaps you are interested in democratic back-sliding, and want to focus on countries which score above 0.4 on this 0-1 scale. Recall that the V-Dem project tends to be more confident about low scores on their democracy measures. When we restrict ourselves to cases which score above 0.4, the smallest posterior standard deviation is 0.019. This in turn means that the MDC is around 0.05 units. This is an order of magnitude greater than what we had before. For comparison: levels of electoral democracy in the UK since 2000 have never ranged by more than 0.04 units (from 0.84 to 0.88).

It’s also possible to argue for a smaller MDC by shifting the unit of analysis. Suppose that we were not interested in country-years, but in regional averages. For independent measures, the standard error of measurement decreases with the square root of the number of observations. If our regions have thirty-six countries in them, the MDC is smaller by a factor of 6. If we really formed our ideas of what was detectable (and, a fortiori, important) by thinking first about regional averages, this might be a good MDC to use. I tend not to think “region-first”, but maybe you do.
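
A minimal sketch of that aggregation argument, assuming a hypothetical region of thirty-six countries whose measurement errors are independent, and reusing the smallest posterior standard deviation from above:

## Hypothetical region of 36 countries with independent measurement errors
n_countries <- 36
regional_sem <- smallest_posterior_sd / sqrt(n_countries)
regional_mdc <- qnorm(0.975) * sqrt(2) * regional_sem
regional_mdc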

Is the MDC for democracy a MID?

I don’t think anything in the previous section is particularly controversial. It’s as good a guide to the MDC as you can get without having access to the full posterior distribution for the V-Dem variable in question.3

What is more controversial is the idea that the minimum detectable change should set a lower limit on the minimal important difference.

There is a good pragmatic argument, which says that you should not hold out any putative change in one direction as important if later, more accurate measurement could show that it was in fact a change in the opposite direction.

This pragmatic argument, however, only works in the context of the measures we have. It is perfectly consistent to say that the minimal important difference is in fact smaller than the minimum detectable change, and for that reason we should improve our measurement techniques.

I think the strongest context for this argument comes from conflict deaths. One of my colleagues at Royal Holloway works a lot on the measurement of conflict deaths. His paper on survey measurement of conflict deaths in Iraq shows that estimates of the number of deaths there differ by one and a half orders of magnitude, from 26,000 to 1.0 million. If the differences between estimates represented the standard error of measurement rather than poor research conduct, we would be placed in a position where the minimum detectable change involved several thousand deaths. That seems wrong. Maybe we should just get better at measuring conflict casualties – a sort of “git gud” approach to substantive significance.

The pragmatic argument does, however, seem to work for measures of electoral democracy. It would in principle be possible to improve the V-Dem project by involving more experts per country, and to increase its scope by extending the analysis not just forward (at a rate of one year per year) but backwards. The increase in the number of country experts would improve the accuracy; the increase in the number of country-years can only make the smallest posterior standard deviation smaller.

In practice, however, the V-Dem project is one of the best funded social science projects in history, and the project has literally unrivalled temporal and geographic coverage. In what follows, I’ll look at other big projects which have near complete coverage of the units of analysis.

MDCs for other variables

Party positions

The Manifesto Project generates estimates of parties’ left-right positions in different ways based on the content of their manifestos. It allows researchers to calculate bootstrapped standard errors for each document.

Let’s suppose we use the logit RILE measure. This variable has a mean close to zero, a standard deviation close to one, and fat tails which mean that there are some observations with values greater than plus or minus three.

The lowest standard error in the data has a value of just under 0.024 units. This means that the minimal detectable change is around 0.06 units, or 0.07 standard deviations. This is a change roughly equal to the change in the Liberal Democrats’ (putative) left-right position between 2017 and 2019.
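
A back-of-the-envelope version of that calculation, plugging in the reported smallest bootstrapped standard error rather than recomputing it from the Manifesto Project data:

## Smallest bootstrapped standard error reported for the logit RILE measure
smallest_boot_se <- 0.024
mdc_logit_rile <- qnorm(0.975) * sqrt(2) * smallest_boot_se
mdc_logit_rile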

Left-right positions of Supreme Court justices

Martin-Quinn scores are annual estimates of the positions of justices on the Supreme Court of the United States. They’re estimated using a Bayesian item response model, and the authors report the posterior standard deviation. The mean score is close to zero; the standard deviation across all judge years is close to two.

mq <- read.csv("http://mqscores.wustl.edu/media/2022/justices.csv")

The justice with the smallest posterior standard deviation is Byron White, who served on the court for a long time (1962 - 1993) and who was located close to the court median. The posterior standard deviation for White in 1967 is 0.141 units. This implies that the minimum detectable change is 0.4 units, or just under 0.2 standard deviations. Considering the most recent Martin-Quinn scores, this difference is roughly the difference between a Brett Kavanaugh and an Amy Coney Barrett.
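
A sketch of how to locate that figure in the downloaded file and convert it to an MDC; the column names justiceName, term and post_sd are assumptions about the file layout, so check names(mq) before relying on them:

## Justice-year with the smallest posterior standard deviation
## (justiceName, term and post_sd are assumed column names)
mq[which.min(mq$post_sd), c("justiceName", "term", "post_sd")]
## Convert the smallest posterior SD into a minimum detectable change
mdc_mq <- qnorm(0.975) * sqrt(2) * min(mq$post_sd)
mdc_mq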

Pledge fulfillment

Pledge fulfillment is the proportion of a party’s promises which are fulfilled before the next election. Measuring pledge fulfillment requires researchers to work out which statements are pledges, and which pledges have been (fully or partly) fulfilled. Both of these tasks are subjective. The authors of the leading study on pledge fulfillment therefore report rates of agreement between coders. Across pairs of researchers, the average rate of agreement on pledge fulfillment (not pledge identification) was 93%.

I’m sure there’s a formula which would allow me to convert between percentage agreement and other reliability statistics, but I took the cheat’s way out and simulated some fake data. Here I use the IRRsim package to simulate data with 100 raters, 100 “events” (pledges) and three categories, collapsing these to top-two versus all else. This gives data which approximates the pledge fulfillment data.

library(tidyverse)
library(IRRsim)
set.seed(26943)
## Simulate 100 coders rating 100 pledges on a three-point scale with 93%
## pairwise agreement, then collapse to fulfilled (top two) versus not fulfilled
tmp <- IRRsim::simulateRatingMatrix(nLevels = 3, 
                             k = 100, 
                             k_per_event = 100, 
                             agree = 0.93, 
                             nEvents = 100) |>
    as.data.frame() |>
    mutate(across(everything(), function(x) as.numeric(x > 1)))
## Standard error of measurement: spread of the estimated fulfillment rates
sem <- sd(colMeans(tmp))
mdc <- qnorm(.975) * sqrt(2) * sem

The minimum detectable change is 3.3 percentage points, or just under 0.07 standard deviations. In other terms, that’s slightly less than the difference between the average pledge fulfillment rate of UK governing parties (86%) and that of the Swedish minority coalition led by Fredrik Reinfeldt.

Conclusions

In this post I’ve suggested one way that we can establish minimal important differences for outcomes in the social sciences, at least for those outcomes which are measured on a continuous scale with some reported standard error of measurement (or something analogous).

I’ve applied this to four measures – levels of democracy, left-right positions of political parties, left-right positions of judges on SCOTUS, and rates of pledge fulfillment.

I’ve noted that the MDC depends on how the unit of analysis is characterised. Some measures – like seat-share weighted cabinet positions – are measured on the same scale as constituent units, but we can be more precise when dealing with aggregates. The situation would be more complicated with measurements which depend on two noisy inputs.

In the next blog, I’ll look at how we might instead establish minimal important differences in the case where we are indifferent between two rival measures. This can be applied to cases where both measures are measured ostensibly without error, and generally gives larger MIDs.

Footnotes

  1. This is roughly the effect of semaglutide / Wegovy, according to this study; effect sizes in weight loss are more commonly specified in percent, and so I have multiplied the percentage point effect size (-15.2) by the average weight of the trial participants (106 kg). I think it’s fair to say that Wegovy has been broadly recognised as a major new development in the treatment of weight loss.↩︎

  2. Experts on weight loss, as you might expect, have written about this. This paper suggests that most funding agencies use a minimal important difference of five percent, but at the same time it argues that the MID should really be higher, at around twenty percent, which would mean that the effects of Wegovy are less than the MID.↩︎

  3. The full posterior distributions are incredibly unwieldy, and very infrequently updated. I would not recommend that you work with them unless you have a lot of time, a lot of working memory, and incredible patience.↩︎