We have one hour to take a data story from press release to data analysis, so LET’S GO.
We’re going to look at crime data. The very best thing to have would be our own data from a good, clean, reliable source like the government. Or at least, as reliable as you’re going to get. The best thing to do would be to file a Freedom of Information request.
The Act covers any recorded information that is held by a public authority in England, Wales and Northern Ireland, and by UK-wide public authorities based in Scotland. Information held by Scottish public authorities is covered by Scotland’s own Freedom of Information (Scotland) Act 2002.
Your request can be in the form of a question, rather than a request for specific documents, but the authority does not have to answer your question if this would mean creating new information or giving an opinion or judgment that is not already recorded.
You do not have to: mention the Freedom of Information Act, although it may be helpful; know whether the information is covered by the Freedom of Information Act or the Environmental Information Regulations; or say why you want the information.
But since we don’t have time to do that in this one-hour session, let’s start with this press release: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/445753/hosb0115.pdf. Annual statistics, woohoo! If we want to use this data ourselves, we have to get it out of the PDF. There are lots of free tools that can do this.
But let’s use Tabula, a free, open-source tool made by journalists. Believe it or not, some of the best tools for journalists out there were made by journalists themselves, who contributed to this open Web community we depend on so much. It depends on us, too.
Once you download Tabula, you can import the PDF into it and select the data table you want to scrape, just like you’re taking a screenshot of it.
As journalists, we’re looking for change. Let’s find a time series (data showing change over time) and grab that table with Tabula. You may need to do some cleaning in Excel, which means deleting stray spaces, separating numbers that clumped together, and so on. IRE, a journalism group in the US, just released a new tutorial on Excel – but there are lots of different ways you can learn it. And there’s always googling if you have a specific problem. (I’m not being lazy, I swear, that’s really how the “experts” do it.)
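If you'd rather script the cleanup than do it by hand, here is a minimal Python sketch of the same chores. Everything in it is an assumption for illustration: the table, its column names and its numbers are made up, and the helper names (`clean_cell`, `to_number`, `clean_row`) are hypothetical. A real Tabula export will have its own quirks.

```python
import csv
import io
import re

# Hypothetical Tabula export with made-up numbers: stray spaces, thousands
# separators, and two figures clumped into a single cell.
raw = '''Drug,2013/14,2014/15
 Cannabis ,"1,200","1,050"
Cocaine,300 280,
'''

def clean_cell(cell):
    """Trim the cell and collapse runs of whitespace to a single space."""
    return re.sub(r"\s+", " ", cell.strip())

def to_number(cell):
    """Convert '1,200' to 1200; return None if the cell is not a plain number."""
    digits = cell.replace(",", "")
    return int(digits) if digits.isdigit() else None

def clean_row(row):
    cells = []
    for cell in row:
        # Split cells like '300 280' where two numbers clumped together.
        parts = clean_cell(cell).split(" ")
        if len(parts) > 1 and all(to_number(p) is not None for p in parts):
            cells.extend(parts)
        else:
            cells.append(clean_cell(cell))
    # Shortcut that suits this toy table only: drop the empty cells left
    # behind once the clumped numbers have been split out.
    return [c for c in cells if c != ""]

rows = [clean_row(r) for r in csv.reader(io.StringIO(raw))]
header = rows[0]
table = {row[0]: [to_number(c) for c in row[1:]] for row in rows[1:]}
print(header)
print(table)
```

Always compare the cleaned table against the PDF by eye. Splitting on spaces is a guess about how the cells clumped, and it will be wrong for some tables.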
To quickly look for change, I might make sparklines like so:
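Outside a spreadsheet, a rough text-only equivalent can be sketched in Python. This is an illustration under stated assumptions, not the method used in the session: the `sparkline` helper is hypothetical and the yearly counts are made up.

```python
BARS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Map each value to one of eight bar heights, scaled from min to max."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series has no change to show, so draw the lowest bar throughout.
        return BARS[0] * len(values)
    return "".join(BARS[round((v - lo) / span * (len(BARS) - 1))] for v in values)

# Made-up yearly counts, for illustration only (not from the press release).
counts = [120, 110, 115, 90, 60, 40]
print(sparkline(counts))
```

Note the design choice: the bars are scaled between the smallest and largest value, so a drop from 100 to 98 fills the same height as a drop from 100 to 2. Keep that in mind when a sparkline looks dramatic.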
Pick a trend that interests you – cannabis arrests, for instance. Before we really start investigating, we need to think critically about this data. That’s what separates journalism – and civic hacking – from plain data dumping.
With every data set ever made, there are both computer flaws and human flaws you might have to grapple with. Here are a few hints.
- Why are the drops so stark? Did the police change their tactics? Was there a change in law? Is it just a mistake in the data? Does the sparkline make it look more intense than it actually is?
- Where are they getting this data? Is it a reliable source? Do people actually report as thoroughly as they’re supposed to? What law makes them report? Are there loopholes? Is there a political underpinning?
But if you’re sitting in this Mozfest session, we’re just gonna roll on ahead, throwing caution to the wind! (Disclaimer: Please come back and vet this stuff before you turn in this story to your editor. Thank you.)
Let’s get our own data so we can localize a trend to where we’re from. Generally, there are four paths to data:
- Through humans
  - the easy way: ask nicely
  - the hard way: file a FOI request (see above)
- Through computers
  - the easy way: download from a website
  - the hard way: scrape (see this tutorial)
Sometimes getting our own data might mean scraping it off a website. I made another tutorial on doing this – peeling back the covers of websites to find the data underneath.
But for this one-hour session, we’re going to fly over to data that’s already been beautifully opened. Oftentimes you’ll see this data liberated and published by civic hacking groups, but this was actually done by the government: https://data.police.uk/
Here you can download CSVs (comma-separated values, that is, a spreadsheet) of data relating to your county. How is this data different from the press release? It’s individual data points – individual crimes – rather than aggregate numbers.
So if we’re investigating cannabis arrests in our county, for instance, we’ll filter (search) for Crime type: Drugs, then perhaps make a Pivot Table showing the most common outcomes.
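The same filter-then-count step can be sketched in Python if you prefer code to a Pivot Table. This is a minimal sketch under assumptions: the rows below are made up, and while `Crime type` comes from the text above, `Last outcome category` is an assumed column name that you should check against the header of your own download.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration. "Crime type" appears in the text above;
# "Last outcome category" is an assumed column name, so check your CSV header.
sample = '''Month,Crime type,Last outcome category
2015-08,Drugs,Offender given a drugs possession warning
2015-08,Burglary,Investigation complete; no suspect identified
2015-08,Drugs,Offender given a drugs possession warning
2015-08,Drugs,Local resolution
'''

rows = list(csv.DictReader(io.StringIO(sample)))

# Filter: keep only the rows whose crime type is Drugs.
drugs = [r for r in rows if r["Crime type"] == "Drugs"]

# Pivot: count how often each outcome appears among those rows.
outcomes = Counter(r["Last outcome category"] for r in drugs)

# Print outcomes from most to least common.
for outcome, n in outcomes.most_common():
    print(f"{n:>3}  {outcome}")
```

To run this on a real download, swap `io.StringIO(sample)` for an opened file, for example `open("my-county.csv", newline="", encoding="utf-8")`, where the filename is a placeholder for whatever you saved.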
As a journalist or civic hacker, it’s most important to think along these lines:
- What data or information is most beneficial or interesting to my community?
  - In this case, crime is always interesting because it happens everywhere.
- Where can I get reliable data, and what are all the ways it could lead me wrong?
  - Primary sources like the government are best, but the data will still be full of possible errors, both from the computer side and the human side.
- What is the story? Usually, what is the change?
  - In this case, we’re looking for change by doing a quick analysis (sparklines) in the national data before narrowing it down to our county.