Minimal important differences in the social sciences, part 2

measurement
academia
Author

Chris Hanretty

Published

September 3, 2024

In a previous post, I suggested that one way of determining the minimal important difference is to establish the minimal detectable change. If you don’t know that a change is different from zero, it surely can’t be that important.

In this post, I want to suggest another way to determine the minimal important difference. This is by working out whether you are indifferent between multiple proposed continuous measures of a concept. If you’re indifferent between two measures, the times when they disagree surely can’t be that important.

But first, a fable…

Why the king’s rooms run hot and cold

A long time ago, in a far away country, there lived two inventors.

These two inventors wished to measure heat.

One inventor measured heat by placing brandy in a thin column of glass.

As the level of the brandy rose, so the heat in the room was judged greater.

The other inventor did the same, but with quicksilver instead of brandy.

The devices made by these men sometimes agreed, but sometimes differed.

People in the mountains preferred the measure based on brandy; people in the plains preferred the measure based on quicksilver.

The two men brought their inventions to the king of that land.

They showed him their inventions, and asked him to judge which was best.

The king, wishing to offend neither man, said he could not decide, but offered to install devices from both men in the rooms in his palace.

Shortly after, the king’s steward noted disagreements between these devices.

The cellar to the west of the palace was judged by brandy to be the coldest room in the palace.

By quicksilver, though, it was the cellar to the east of the palace that was judged coldest.

The two inventors, hearing of the steward’s observations, called on the king again.

“Sire”, they said, “you said you could not decide between us on general grounds. Here you need only decide on particulars. Go into the west cellar, and then into the east cellar, and say which is coldest. Then shall you know which measure is best”.

The king went to the west cellar, and then to the east cellar, and then back to the two inventors. He told them that just as before, he could not judge which of the rooms was colder, and thus could not say which measure was better.

The two inventors thanked the king and left disappointed. The king too was disappointed, for he had no wish to spend any time in the cellars of his palace, and treated his steward rudely thereafter.

Time passed, and the king grew old, and took to complaining of the temperature in his rooms.

He asked his steward to make the temperature in his water closet more equal to the temperature in his bed chamber.

His steward, much put upon, declined.

“No sire, I shall not. For I mark the difference between the measure made in your bed-chamber, and the measure made in your water closet, and these two measures are closer together than the difference between the measures in the west cellar and the measures in the east cellar, which you thought so finely matched. If you tell me now there is a difference, then I must go back to those two men who troubled you earlier, and have them trouble you again”.

The king, not wishing to see the two inventors again, and knowing that his steward had the better of him, accepted the man’s argument. And that is why the king’s rooms blow hot and cold.

Back to the social sciences

Though that fable was exceeding subtle, I hope you see some analogies to the social sciences. We have different measures of things, and if we really want to say that we’re indifferent between them, and indifferent to cases where they disagree, we have to say that some differences (in terms of one of those measures) aren’t important.

The fable simplifies things, because by construction there’s only one relevant case.

Now let’s start again from first principles, and only then deal with the complications posed by multiple cases.

Let’s suppose that there is some social scientist who is indifferent between two different measures of the same concept.

(This individual may have written articles which use one measure, and different articles which use the rival measure. Maybe they’ve even written articles which use both measures. After all, doesn’t everyone love robustness to different measurement strategies?)

Let’s suppose that these two measures are continuous. This means that we can always make a pairwise comparison between two cases and identify one case which has “more” of the underlying concept on this measure. Figuratively: the measures never put their hands up and say, “too close to call”.

Because our two measures are different, then with enough comparisons they will on occasion disagree. If they always agreed, they wouldn’t be different.[1] One measure will say that example A in the comparison has “more”; the other measure will say that it’s example B that has more.

Let’s suppose that there is precisely one pairwise comparison where there is disagreement. This is, of course, the situation given in the fable above.

In this case, the social scientist either thinks

  • “example A has more [of this concept]”, or
  • they are indifferent between the examples, or they think
  • “example B has more of this concept”.

I would argue that they must be indifferent, because if they thought that either example A or example B had more of this concept, they would have a good reason to prefer the measure that they sided with, and couldn’t therefore be indifferent between the two measures as we initially supposed they were.
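The single-disagreement case can be put in numbers. What follows is a minimal sketch using made-up readings for the rooms in the fable (the values, and the names `west`, `east`, `bed` and `closet`, are hypothetical). If the king is indifferent between the two cellars, then the gap between the cellars on each device is a difference he has already judged unimportant, and any smaller gap has to be unimportant too.

```r
# Hypothetical readings (arbitrary units) from each device in each room
west   <- c(brandy = 8.0,  quicksilver = 10.0)
east   <- c(brandy = 9.5,  quicksilver = 9.0)
bed    <- c(brandy = 18.0, quicksilver = 19.0)
closet <- c(brandy = 17.2, quicksilver = 18.4)

# The devices disagree about which cellar is colder:
# brandy says the east cellar is warmer (+1), quicksilver says it is colder (-1)
sign(east - west)

# The king could not tell the cellars apart, so these gaps are "not important"
threshold <- abs(east - west)

# The gap he later complains about is smaller on both devices
complaint <- abs(bed - closet)
complaint < threshold
```

With these invented readings, both comparisons in the last line come out `TRUE`, which is the steward's argument.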

Now let’s consider the more typical situation where there are lots of cases of pairwise disagreement between measures.

We can’t take any specific instance and expect our social scientist to be indifferent. Maybe that instance of disagreement results from very particular circumstances. Maybe in that case the researcher thinks that one example has more, thus giving them a reason to prefer one index, but there are other offsetting cases which suggest the rival index.

What I’ll suggest is that our researcher has to be indifferent in the average or median case of disagreement. They have to be indifferent in this typical case of disagreement because if they weren’t, it would strongly imply that they were not in fact indifferent between measures.

This argument is informal and not fully worked out. In theory, if we allow no restrictions on the social scientist’s intuitive judgements, we could pair off freak results and find that the social scientist only has to be indifferent in cases involving the smallest disagreement.

A worked example with disproportionality

In this section, I’ll look at how we might establish a minimal important difference for votes-seats disproportionality, given indifference between measures. Votes-seats disproportionality is a good example because most of the time votes and seats are measured without error. As a result we can’t calculate a minimal important difference based on the minimal detectable change. I’ll assume that there is a social scientist who is indifferent between the Gallagher index and the Sainte-Laguë index, two commonly used indices of disproportionality. I’ll calculate the values of these indices for modern parliamentary elections covered by ParlGov, and generate all pairs of elections.

suppressPackageStartupMessages(library(tidyverse))
dat <- read.csv("view_election.csv")
dat <- dat |>
    filter(election_type == "parliament") |>
    mutate(election_date = as.Date(election_date)) |>
    filter(election_date > as.Date("1945-12-31")) |>
    mutate(vote_share = vote_share / 100,
           seat_share = seats / seats_total)

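### Gallagher (least squares) index: square root of half the sum of
### squared differences between vote shares and seat shares
gallagher <- function(v, s) {
    sqrt(1/2 * sum((v - s)^2, na.rm = TRUE))
}
### Sainte-Laguë index: squared differences between vote shares and
### seat shares, each divided by the vote share, then summed
sainte_lague <- function(v, s) {
    sum((v - s)^2 / v, na.rm = TRUE)
}
### One row per election, with both indices
dat <- dat |>
    group_by(country_name_short, election_date, election_id) |>
    summarize(d_gall = gallagher(vote_share, seat_share),
              d_sl = sainte_lague(vote_share, seat_share),
              .groups = "drop")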

### Generate all pairs
pairs <- expand.grid(A = unique(dat$election_id),
                     B = unique(dat$election_id)) |>
    filter(A < B) |>
    left_join(dat, by = join_by(A == election_id)) |>
    left_join(dat, by = join_by(B == election_id),
              suffix = c(".A", ".B"))
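To make the two indices concrete, here is a minimal sketch on two made-up three-party elections. The vote and seat shares are hypothetical, not ParlGov data, and the two functions are repeated only so that the snippet runs on its own. Because the Sainte-Laguë index divides each squared gap by the party's vote share, it punishes the exclusion of a small party more heavily than the Gallagher index does, which is one way the two can rank a pair of elections differently.

```r
gallagher <- function(v, s) sqrt(1/2 * sum((v - s)^2, na.rm = TRUE))
sainte_lague <- function(v, s) sum((v - s)^2 / v, na.rm = TRUE)

# Hypothetical election A: largest party over-rewarded, smallest under-rewarded
v_a <- c(0.50, 0.30, 0.20)
s_a <- c(0.60, 0.30, 0.10)

# Hypothetical election B: a party with 10% of the vote wins no seats
v_b <- c(0.45, 0.45, 0.10)
s_b <- c(0.50, 0.50, 0.00)

g  <- c(A = gallagher(v_a, s_a),    B = gallagher(v_b, s_b))
sl <- c(A = sainte_lague(v_a, s_a), B = sainte_lague(v_b, s_b))

round(g, 3)   # A: 0.100, B: 0.087 -> Gallagher says A is more disproportional
round(sl, 3)  # A: 0.070, B: 0.111 -> Sainte-Laguë says B is more disproportional
```

A pair like this one, where the signs of the two differences are opposed, is a case of disagreement.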

Having generated all pairs of elections, I focus on those pairs where there is disagreement concerning which example has “more” disproportionality. As might be expected, only a minority of pairings throw up some disagreement, but that minority, at roughly one in eight, isn’t negligible.

nrow(pairs)
[1] 212878
pairs <- pairs |>
    mutate(gall_pref = sign(d_gall.B - d_gall.A),
           sl_pref = sign(d_sl.B - d_sl.A)) |>
    filter(gall_pref != sl_pref)
nrow(pairs)
[1] 26250

Now let’s examine the absolute differences between values of the Sainte-Laguë index.

pairs <- pairs |>
    mutate(delta_sl = d_sl.B - d_sl.A)

summary(abs(pairs$delta_sl))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000000 0.005039 0.013049 0.022868 0.028760 0.599469 

In this case, we wouldn’t want to force anyone to say that a difference in the Sainte-Laguë index of almost 0.6 units(!) was not important just because it arose in the context of a disagreement between indices. But it does seem plausible to suggest that the median absolute value of 0.013 units might be a minimal important difference. For what it’s worth, if we compare that value to the standard deviation of values of the index in our original data, we have a minimal important difference of roughly 0.125 standard deviations.

We could, of course, have started from the other index. Let’s examine the absolute difference between values of the Gallagher index.

pairs <- pairs |>
    mutate(delta_gall = d_gall.B - d_gall.A)

summary(abs(pairs$delta_gall))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
1.600e-07 3.848e-03 9.325e-03 1.339e-02 1.863e-02 1.091e-01 

In this case, we get a pretty small median absolute difference of just under 0.01 units – but of course the standard deviation of the Gallagher index is much smaller, at 0.048 units. Expressed in terms of standard deviations, we have a minimal important difference of roughly 0.2 standard deviations.

We therefore have a minimal important difference for one index, and a minimal important difference for another index. This means that some differences will be important when we use one index, and not important when we use another index. That might seem unsatisfactory, but “disagreement over indices” should surely have consequences for what is substantively important.
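The steps followed for each index (form all pairs, keep the pairs where the measures disagree, take the median absolute difference, and express it in standard deviations) can be wrapped up in one function. This is a sketch rather than the code used above: `mid_from_disagreement` is a hypothetical helper, it uses base R's `combn` instead of `expand.grid`, and it assumes two numeric vectors of equal length with no missing values.

```r
# Hypothetical helper: median absolute difference among discordant pairs,
# for two measures x and y of the same cases (no missing values assumed)
mid_from_disagreement <- function(x, y, strict = FALSE) {
    stopifnot(length(x) == length(y), length(x) >= 2)
    idx <- combn(length(x), 2)        # each column indexes one pair of cases
    dx <- x[idx[2, ]] - x[idx[1, ]]   # difference on the first measure
    dy <- y[idx[2, ]] - y[idx[1, ]]   # difference on the second measure
    discordant <- if (strict) {
        sign(dx) * sign(dy) == -1     # opposite signs only
    } else {
        sign(dx) != sign(dy)          # also counts ties on one measure
    }
    c(n_pairs = ncol(idx),
      n_discordant = sum(discordant),
      mid_x = median(abs(dx[discordant])),
      mid_y = median(abs(dy[discordant])),
      mid_x_sd = median(abs(dx[discordant])) / sd(x),
      mid_y_sd = median(abs(dy[discordant])) / sd(y))
}

# Toy data: the two measures order cases 2 and 3 differently
x <- c(1, 2, 3, 4)
y <- c(1, 3, 2, 4)
res <- mid_from_disagreement(x, y)
res
```

In the toy data only one of the six pairs is discordant, and the absolute difference in that pair is one unit on both measures. Enumerating pairs with `combn` is only practical for a modest number of cases.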

Back to measures of democracy

We can apply this same logic to democracy. Let’s suppose there is a researcher who is indifferent between two measures of electoral democracy:

  • the V-Dem project’s measure v2x_polyarchy
  • Polity V scores

Thankfully these two measures are both available in the V-Dem data.

library(vdemdata)
data("vdem")
### Just select the variables we're interested in
dat <- vdem |>
    dplyr::select(country_text_id, year,
                  v2x_polyarchy,
                  e_polity2)
### Fix some Polity codes
dat <- dat |>
    mutate(e_polity2 = na_if(e_polity2, -88),
           e_polity2 = na_if(e_polity2, -66))

Generating all possible pairwise comparisons of all country years across all of the modern period is memory-intensive, so I’ll focus on years since 1945 and countries covered by both projects.

### Restrict it to the post-war period
dat <- dat |>
    filter(year >= 1945)

### Restrict it to common cases
dat <- dat |>
    filter(!is.na(e_polity2)) |>
    filter(!is.na(v2x_polyarchy))

### Create a unique label
dat <- dat |>
    mutate(label = paste0(country_text_id, year))

### Generate all pairwise combinations of country years.
### Note that there are around 9,900 country years
### so we have 9,900 * (9,900 - 1) ordered pairings
### but we can restrict it to cases where the order is alphabetical,
### taking it down to around 49 million
pairs <- expand.grid(A = unique(dat$label),
                     B = unique(dat$label),
                     stringsAsFactors = FALSE) |>
    filter(A < B)

### Start merging on the indicators
pairs <- left_join(pairs,
                   dat |> dplyr::select(label, v2x_polyarchy, e_polity2),
                   by = join_by(A == label))

pairs <- left_join(pairs,
                   dat |> dplyr::select(label, v2x_polyarchy, e_polity2),
                   by = join_by(B == label),
                   suffix = c(".A", ".B"))
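If a grid of this size does not fit in memory, one alternative is to draw pairs at random rather than enumerate all of them. This is not what the post does; it is a sketch of a swapped-in approximation, and `sample_discordant` and its `n_draws` argument are hypothetical. It estimates the share of strictly discordant pairs and the median absolute differences from a random sample of pairs.

```r
# Hypothetical helper: approximate the discordant-pair summaries from a
# random sample of pairs instead of the full n * (n - 1) / 2 grid
sample_discordant <- function(x, y, n_draws = 1e5) {
    stopifnot(length(x) == length(y), length(x) >= 2)
    n <- length(x)
    a <- sample.int(n, n_draws, replace = TRUE)
    b <- sample.int(n, n_draws, replace = TRUE)
    keep <- a != b                       # drop pairings of a case with itself
    dx <- x[b[keep]] - x[a[keep]]
    dy <- y[b[keep]] - y[a[keep]]
    strict <- sign(dx) * sign(dy) == -1  # strict disagreement only
    list(share_discordant = mean(strict),
         median_abs_dx = median(abs(dx[strict])),
         median_abs_dy = median(abs(dy[strict])))
}

# Simulated data: two noisy measures of the same underlying quantity
set.seed(1)
latent <- rnorm(500)
m1 <- latent + rnorm(500, sd = 0.3)
m2 <- latent + rnorm(500, sd = 0.3)
approx_res <- sample_discordant(m1, m2, n_draws = 20000)
approx_res$share_discordant
```

The estimates vary from draw to draw, so this trades exactness for memory. Where the full grid can be built, it gives the exact median.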

As before, we focus just on cases of disagreement. Here, I’ll look at cases of “strict” disagreement.

nrow(pairs)
[1] 48960460
pairs <- pairs |>
    mutate(vdem_pref = sign(v2x_polyarchy.B - v2x_polyarchy.A),
           polity_pref = sign(e_polity2.B - e_polity2.A)) |>
    filter((vdem_pref == 1 & polity_pref == -1) |
           (vdem_pref == -1 & polity_pref == 1))
nrow(pairs)
[1] 4961321

Here, around 10% of pairings of country years see a disagreement between the two indices. Let’s work out what the median absolute difference in these cases of disagreement is.

First, let’s do v2x_polyarchy:

pairs <- pairs |>
    mutate(delta_vdem = v2x_polyarchy.B - v2x_polyarchy.A)

summary(abs(pairs$delta_vdem))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00100 0.02800 0.06500 0.08811 0.11900 0.78000 

Our candidate for the minimal important difference is 0.065 units. That’s around 0.25 standard deviations when looking at the V-Dem data as a whole, and bigger than 0.05, the minimal important difference implied by the minimal detectable change.

And now e_polity2:

pairs <- pairs |>
    mutate(delta_polity = e_polity2.B - e_polity2.A)

summary(abs(pairs$delta_polity))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   3.295   4.000  20.000 

Because the Polity score is a fine-grained ordinal variable rather than a continuous variable, the median absolute difference in cases of disagreement is exactly two units.

Conclusions

In this post I’ve suggested that if you are indifferent between two measures, this can help us establish a minimal important difference. This way of establishing a minimal important difference is valuable in cases where, like electoral disproportionality, we can’t establish a minimal important difference by way of measurement error.

I don’t know how common it is for researchers to be genuinely indifferent between measures. I also don’t know how persistent that indifference is. Indeed, one possible way of reacting to these claims about minimal important differences is to quickly develop a more fine-grained preference between measures based on cases of disagreement. Obviously it’s fanciful to imagine anyone going through almost 5 million pairs of country years and determining whether they side more with Polity or V-Dem, but sharpening our preferences by looking at cases of disagreement might still be a worthwhile way to spend some time.

Footnotes

  1. There’s a possible de re/de dicto confusion here, but pay no attention to that.