My SO works at a newspaper — a pretty good one, in fact.
Unfortunately, many of the fun headlines don’t make it into the paper. It’s not because they’re particularly scabrous (though some of the funny ones certainly are, and I still can’t believe that “long march” made it into a story about house prices in Chinatown). Rather, they get jettisoned because they don’t work well if you’re trying to engage in a little bit of search engine optimization (SEO). Google, for all its tremendous achievements in information retrieval, is very bad at understanding puns.
I mention this because I’ve come across a rare instance where the British penchant for punning has complicated my life.
I’m currently working on a project looking at the representation of constituency opinion in Parliament. One of our objectives involves examining the distribution of parliamentary attention: whether MPs from constituencies that are very concerned about immigration talk more about immigration than MPs from constituencies that are more relaxed about the issue.
To do that, I’ve been relying on the excellent datasets made available by the UK Policy Agendas Project. In particular, I’ve been exploring the possibility of using their hand-coded data to engage in automated coding of parliamentary questions.
One of their datasets features headlines from the Times. Coincidentally, one of the easier-to-use packages for automated coding of texts (RTextTools) includes a dataset of headlines from the New York Times. Both datasets use similar topic codes, although the UK team has dropped a couple of codes.
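To make the pipeline concrete, here is a minimal sketch of supervised topic coding: train on hand-coded headlines, then predict codes for unseen ones. This is a bag-of-words naive Bayes classifier in Python rather than the R-based RTextTools, and the headlines and topic codes below are invented examples, not data from either project.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Crude bag-of-words tokenizer: lowercase, split on whitespace."""
    return text.lower().split()

def train(examples):
    """Count words per topic code from (headline, code) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for headline, code in examples:
        class_counts[code] += 1
        for w in tokenize(headline):
            word_counts[code][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(headline, class_counts, word_counts, vocab):
    """Pick the topic code with the highest naive Bayes log score."""
    total = sum(class_counts.values())
    best_code, best_score = None, float("-inf")
    for code, n in class_counts.items():
        score = math.log(n / total)  # log prior for this code
        # Laplace smoothing so unseen words don't zero out a class
        denom = sum(word_counts[code].values()) + len(vocab)
        for w in tokenize(headline):
            score += math.log((word_counts[code][w] + 1) / denom)
        if score > best_score:
            best_code, best_score = code, score
    return best_code

# Invented training headlines with made-up topic codes
training = [
    ("asylum seekers face new visa rules", "immigration"),
    ("migration figures hit record high", "immigration"),
    ("house prices rise in london", "housing"),
    ("mortgage rates squeeze home buyers", "housing"),
]
model = train(training)
prediction = classify("visa rules for migration tightened", *model)
print(prediction)
```

RTextTools itself wraps several learners (SVM, maximum entropy, and others) over a document-term matrix; the naive Bayes above just illustrates the same bag-of-words logic in miniature.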
How well does automated topic coding work on these two sets of newspaper headlines?
With the New York Times data (3104 headlines over ten years, divided into a 2600-headline training set and a 400-headline test set), automated topic coding works well. 56.8% of the 400 test headlines put into the classifier were classified correctly. That’s pretty amazing considering the large number of categories (27) and the limited training data.
How do things fare when we turn to the (London) Times (6571 headlines over ten years, divided into a 6131-headline training set and an 871-headline test set)? Unfortunately, despite having much more in the way of training data, only 46.6% of headlines were classified correctly.
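For a rough sense of scale (my own back-of-the-envelope check, not part of the original analysis): with 27 roughly equally likely categories, uniform guessing would get only about 3.7% right, so even the weaker Times result is a big lift over chance.

```python
# Back-of-the-envelope comparison against a uniform-chance baseline.
# The accuracies are the ones reported above; the baseline assumes 27
# equally likely categories (the UK set has slightly fewer codes, and a
# proper majority-class baseline would need the actual label counts).
categories = 27
chance = 1 / categories
nyt_accuracy = 0.568
times_accuracy = 0.466
print(f"chance baseline: {chance:.1%}")  # about 3.7%
print(f"NYT lift over chance: {nyt_accuracy / chance:.1f}x")
print(f"Times lift over chance: {times_accuracy / chance:.1f}x")
```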
Looks like those puns aren’t just bad for SEO; they’re also bad for the text-as-data movement…
Update: Mark Liberman suggests (convincingly, IMHO) that the difference is due to headline length.