Getting tables out of PDFs in Italy

September 15, 2009

The Italian Parliament annoys me tremendously. Not for substantive reasons (though it might annoy me for those too), but for technical ones.

They have some nicely formatted XML files for the resoconti (minutes) of each parliamentary sitting.

But their voting information is stuck in crappy PDFs.

Grrr.

So, I have to

  • download all the PDF files using a horrible bash script;
  • convert them to XML (for file in *.pdf; do pdftohtml -xml "$file"; done);
  • examine the XML files to find out where the column breaks are;
  • write a Perl script to parse the files using this information

…and then merge them.
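
The parsing step can be sketched roughly as follows. This is a minimal illustration in Python rather than the Perl script mentioned above, and it is not the script actually used. It relies on the fact that pdftohtml -xml writes each text fragment as a <text> element with top and left attributes, and it assumes that fragments on the same table row share exactly the same top value (real files may be off by a pixel or two and need some tolerance). The names COLUMN_BREAKS, rows_from_page and parse_xml, the break positions, and the sample rows are all made up for the example.

```python
import xml.etree.ElementTree as ET
from bisect import bisect_right
from collections import defaultdict

# Hypothetical column boundaries ("left" offsets), found by eyeballing the XML.
COLUMN_BREAKS = [200, 400]


def rows_from_page(page, breaks=COLUMN_BREAKS):
    """Group one page's <text> fragments into rows (by top) and columns (by left)."""
    rows = defaultdict(lambda: [""] * (len(breaks) + 1))
    fragments = sorted(
        page.iter("text"),
        key=lambda el: (int(el.get("top")), int(el.get("left"))),
    )
    for el in fragments:
        top = int(el.get("top"))
        # A fragment starting at or past a break falls in the column to its right.
        col = bisect_right(breaks, int(el.get("left")))
        # itertext() also picks up text nested in <b>, <i> or <a> children.
        content = "".join(el.itertext()).strip()
        rows[top][col] = f"{rows[top][col]} {content}".strip()
    return [rows[top] for top in sorted(rows)]


def parse_xml(source, breaks=COLUMN_BREAKS):
    """Return every table row in a pdftohtml -xml document, page by page."""
    root = ET.fromstring(source)
    table = []
    for page in root.iter("page"):
        table.extend(rows_from_page(page, breaks))
    return table


# Made-up sample in the shape pdftohtml -xml produces (not real voting data).
SAMPLE = """<pdf2xml>
<page number="1">
<text top="100" left="50" width="80" height="12" font="0">ROSSI Mario</text>
<text top="100" left="210" width="10" height="12" font="0">F</text>
<text top="120" left="50" width="80" height="12" font="0">BIANCHI Anna</text>
<text top="120" left="210" width="10" height="12" font="0">C</text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    for row in parse_xml(SAMPLE, breaks=[200]):
        print("\t".join(row))
```

Writing each row out tab-separated (or via the csv module) gives one flat file per PDF, which makes the final merge a simple concatenation.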

posted in italy, parliament, rollcall by Chris


3 Comments to "Getting tables out of PDFs in Italy"

  1. Anidride Carbonica wrote:

    Dear Chris,
    may I ask you where in Italian Parliament’s website you find the XML files quoted in your post? Thanks a lot.

  2. Chris wrote:

    Sure: http://www.camera.it -> Documenti -> Resoconti, then pick a month and a seduta (sitting) from the drop-down menu. When you click on a seduta, you’ll see “Resoconto in formato XML” in the left-hand bar. If you want to download them all (and have a linux-ish system), run:

    for i in $(seq 1 213); do wget "http://documenti.camera.it/apps/resoconto/getXmlStenografico.aspx?idNumero=$i&idLegislatura=16"; done

  3. Andrea wrote:

    Hi,
    what information are you looking for exactly?
    do you know about this project?
    http://parlamento.openpolis.it

    they seem to have voting details and some
    way to make them accessible.

    my 2c,
    Andrea
