How to lie/mislead with trend lines

February 27, 2008

Recently, there’s been a minor polling skirmish about the size of the PdL’s lead. As Ilvo Diamanti puts it in a wonderful column,

the polls that Berlusconi personally commissions, projects and releases give the PdL a ten point advantage over the PD. Not 6, as instead suggest the faked polls commissioned by the PD and the left-wing newspapers.

Plotting lots of different polls and drawing a trend line through them is one easy way of checking whether the gap is getting bigger or not. One simply looks at the trend lines for the PdL and PD, and sees whether they’re going up or down.

If only it were that easy. There are lots of different ways to draw a trend line. You can fire up your copy of Excel, punch in the data, and ask it to draw a 2nd order polynomial, y=ax+bx² (plus a constant), where y is the degree of support (%), x is time, and the computer fills in a and b as best it can. Or, you could (still in Excel) choose a 3rd order polynomial, y=ax+bx²+cx³. Or, better yet, a fourth order polynomial (you get the picture). Or, you could open your favourite stats software, and run what’s called a loess (lowess) regression, where the computer runs a similar calculation for each point in turn, each time using only the polls close to that point.
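For anyone who would rather see the polynomial option spelled out than click through Excel, here is a minimal sketch in plain Python. It assumes ordinary least squares with a constant term included, and the poll figures are made up for illustration (they are not the real PdL or PD numbers). The helper names polyfit, predict and sse are my own, not anything from a library.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit.

    Returns [c0, c1, c2, ...] for y = c0 + c1*x + c2*x**2 + ...
    """
    n = degree + 1
    # Normal equations (X^T X) c = X^T y, built from sums of powers of x.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (b[i] - s) / A[i][i]
    return coef


def predict(coef, x):
    """Value of the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coef))


def sse(coef, xs, ys):
    """Sum of squared gaps between the polls and the fitted line."""
    return sum((y - predict(coef, x)) ** 2 for x, y in zip(xs, ys))


# Made-up support figures (%), one poll every half week. NOT real polls.
weeks = [0.5 * i for i in range(12)]
support = [44.0, 45.5, 43.5, 44.5, 43.0, 44.0,
           42.5, 43.5, 43.0, 44.5, 44.0, 45.0]

for degree in (2, 3):
    coef = polyfit(weeks, support, degree)
    # Direction of the fitted line over the final week of the series.
    change = predict(coef, weeks[-1]) - predict(coef, weeks[-1] - 1.0)
    print(f"order {degree}: change over last week = {change:+.2f} points, "
          f"SSE = {sse(coef, weeks, support):.2f}")
```

The higher-order line always hugs the points at least as closely (its SSE can only go down), but that says nothing about which line gives the more believable trend at the right-hand edge.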

Problem is, not all of these calculations give the same line. The graph below shows three candidates – a 2nd and 3rd order polynomial, and a local regression – drawn for the PdL (blue) and the PD (red). They give pretty different results. The third-order polynomial shows a swing up for the PdL, a swing down for the PD. The second-order polynomial shows almost the reverse. The local regression is between the two.

[Figure: 2nd-order polynomial, 3rd-order polynomial and local regression trend lines for the PdL (blue) and the PD (red)]

So which is best? Well, think about what’s happening with the polynomials. Here, the computer is trying to find the values of a and b that make the line fit the points as well as possible – but the values of a and b have to be the same all the way along the line. So if you’ve got a point that sticks out like a sore thumb early on, it’s going to affect how the line is drawn much later. The polls for the PdL in January will affect the curve of the line in late February. With local regression, however, the effect is – just like it says – local.
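Here is a small sketch of that difference, again in plain Python and again with invented numbers: twenty days of polls all at 40%, except for one early poll at 50%. The local fit is a stripped-down stand-in for loess (a weighted straight line through the 7 nearest polls, with tricube weights), not the full procedure a stats package runs. The function names and the choice of 7 neighbours are my own assumptions.

```python
def solve(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (b[i] - s) / A[i][i]
    return coef


def weighted_polyfit(xs, ys, ws, degree):
    """Weighted least-squares polynomial fit; returns [c0, c1, ...]."""
    n = degree + 1
    A = [[sum(w * x ** (i + j) for x, w in zip(xs, ws)) for j in range(n)]
         for i in range(n)]
    b = [sum(w * y * x ** i for x, y, w in zip(xs, ys, ws)) for i in range(n)]
    return solve(A, b)


def predict(coef, x):
    """Value of the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coef))


def local_fit(xs, ys, x0, k=7):
    """Local straight-line fit at x0 from its k nearest points (tricube weights)."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    h = max(abs(xs[i] - x0) for i in nearest)  # distance to the k-th neighbour
    shifted = [xs[i] - x0 for i in nearest]    # centre on x0
    ws = [(1 - (abs(d) / h) ** 3) ** 3 for d in shifted]
    vals = [ys[i] for i in nearest]
    return weighted_polyfit(shifted, vals, ws, 1)[0]  # intercept = value at x0


# Invented data: flat at 40%, with one early poll that sticks out.
days = list(range(20))
support = [40.0] * 20
support[0] = 50.0

# Global 2nd-order polynomial: every poll gets the same weight.
global_late = predict(weighted_polyfit(days, support, [1.0] * 20, 2), 19)
# Local fit at the last day: only the last 7 polls are used.
local_late = local_fit(days, support, 19)

print(f"global 2nd-order fit at day 19: {global_late:.2f}")
print(f"local fit at day 19:            {local_late:.2f}")
```

On these numbers the polynomial ends up roughly a point above 40 on the last day, purely because of the poll on day 0, while the local fit stays at 40. Both fits come out of the same weighted calculation; the only difference is which polls are allowed to count.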

Now, because computers are smart (in a dumb way), our polynomial line’s not going to look stupid, and the gap between the line in February and any one poll will probably not be large. But the approach is wrong. We’re letting something that happened much earlier affect our judgement of today’s trend. So be careful about which trend lines you trust!

posted in election, italy, polling, statistics by Chris
