Yesterday was the one year anniversary of the Canadian federal government’s open data portal. Over the past year government officials have been continuously adding to the portal, but as it isn’t particularly easy to browse data sets on the website, I’ve noticed a lot of people aren’t aware of what data is now available (self included!). Consequently, I want to encourage people to scan the available data sets and blog about ones that they think might be interesting to them personally, to others, or to communities of interests they may know.
Such an undertaking has been rendered MUCH easier thanks to the data.gc.ca administrators decision to publish a list of all the data sets available on the site. Turns out, there are 11680 data sets listed in this file. Of course, reviewing all this data took me much longer than I thought it would! (and to be clear, I didn’t explore each one in detail), but the process has been deeply interesting. Below are some thoughts, ideas and data sets that have come out of this exploration – I hope you’ll keep reading, and that it will be of interest to ordinary citizens, prospective data users and to managers of open government data portals.
- A TagCloud of the Data Sets on data.gc.ca
Some Brief Thoughts on the Portal (and for others thinking about exploring the data)
Trying to review all the data sets on the portal is a enormous task and trying to do it has taught me some lessons about what works and doesn’t. The first is that, while the search function on the website is probably good if you have a keyword or a specific data you are looking for, it is much easier to browse the data in an excel than on the website. What was particularly nice about this is that, in excel, the data was often clustered by type. This made easy to spot related data sets – a great example of this when I found the data on “Building permits, residential values and number of units, by type of dwelling” I could immediately see there were about 12 other data sets on building permits available.
Another issue that became clear to me is the problem of how a data set is classified. For example, because of the way the data is structured (really as a report) the Canadian Dairy Exports data has a unique data file for every month and year (you can look at May 1988 as an example). That means each month is counted as a unique “data set” in the catalog. Of course, French and English versions are also counted as unique. This means that what I would consider to be a single data set “Canadian Dairy Exports Month Dairy Year from 1988 to present” actually counts as 398 data sets. This has two outcomes. First, it is hard to imagine anyone wants the data for just one month. This means a user looking for longitudinal data on this subject has to download 199 distinct data sets (very annoying). Why not just group it into one? Second, given that governments like to keep score about how many data sets they share – counting each month as a unique data set feels… unsportsmanlike. To be clear, this outcome is an artifact of how Agriculture Canada gathers and exports this data, but it is an example of the types of problems an open data catalog needs to come to grips with.
Finally, many users – particularly, but not exclusively, developers – are looking for data that is up to date. Indeed, real time data is particularly sexy since its dynamic nature means you can do interesting things with it. This it was frustrating to occasionally find data sets that were no longer being collected. A great example of this was the Provincial allocation of corporate taxable income, by industry. This data set jumped out at me as I thought it could be quite interesting. Sadly, StatsCan stopped collecting data on this in 1987 so any visualization will have limited use today. This is not to say data like this should be pulled from the catalog, but it might be nice to distinguish between datasets that are being collected on an ongoing basis versus those that are no longer being updated.
Data Sets I found Interesting
Just quickly before I begin, some quick thoughts on my very unscientific methodology for identifying interesting data sets.
- First, browsing the data sets really brought home to me how many will be interesting to different groups – we really are in the world of the long tail of public policy. As a result, there is lots of data that I think will be interesting to many, many people that is not on this list.
- Second, I tried to not include too much of StatsCan’s data. StatsCan data already has a fairly well developed user base. And while I’m confident that base is going to get bigger still now that its data is free, I figure there are already a number of people who will be sharing/talking about it
- Finally, I’ve tried to identify some data sets that I think would make for good mashups or apps. This isn’t easy with federal government data sets since they tend do be more aggregate and high-level than say municipal data sets… but I’ve tried to tease out what I can. That said, I’m sure there is much, much more.
New GeoSpatial API!
So the first data set is a little bit of a cheat since it is not on the open data portal, but I was emailed about it yesterday and it is so damn exciting, I’ve got to share it. It is a recently released public BETA of a new RESTful API from the very cool people at GeoGratis that provides a consolidated access point to several repositories of geospatial data and information products including GeoGratis, GeoPub and Mirage. (huge thank you to the GeoGratis team for sending this to me).
Documentation can be found here (and in french here) and a sample search client that demonstrates some of its functionality and how to interact with the API can be found here. Formats include ATOM, HTML Fragment, CSV, RSS, JSON, and KML. (So you can see results – for example – in Google Earth by using the KML format (example here).
I’m also told that these fine folks have been working on geolocation service, so you can do sexy things like search by place name, by NTS map or by the first three characters of a postal code. Documentation will be posted here in english and french. Super geeks may notice that there is a field in the JSON called CGNDBkey. I’m also told you can use this key to select an individual placename according to the Canadian Geographic names board. Finally, you can also search all their Metadata through search engines like google (here is a sample search for gold they sent me).
All data is currently licensed under GeoGratis.
The National Pollutant Release Inventory
Description: The National Pollutant Release Inventory (NPRI) is Canada’s public inventory of pollutant releases (to air, water and land), disposals and transfers for recycling.
Notes: This is the same data set (but updated) that we used to create emitter.ca. I frankly feel like the opportunities around this data set, for environmentalists, investors (concerned about regulatory and lawsuit risks), the real estate industry, and others, is enormous. The public could be very interested in this.
Greenhouse Gas Emissions Reporting Program
Description: The Greenhouse Gas Emissions Reporting Program (GHGRP) is Canada’s legislated, publicly-accessible inventory of facility-reported greenhouse gas (GHG) data and information.
Notes: What interesting here is that while it doesn’t have lat/longs, it does have facility names and addresses. That means you should be able to cross reference it with the NPRI (which does have lat/longs) to be able to plot where the big greenhouse gas emitters are on a map. Think the same people as the NPRI might be interested in this data.
The Canadian Ice Thickness Program
Description: The Ice Thickness program dataset documents the thickness of ice on the ocean. Measurements begin when the ice is safe to walk on and continue until it is no longer safe to do so. This data can help gauge the impact of global warming and is relevant to shipping data in the north of Canada.
Notes: Students interested in global warming… this could make for some fun visualization.
Argo: Canadian Tracked Data
Description: Argo Data documents some of the approximately 3000 profiling floats were deployed around the world. Once at sea, the float sinks to a preprogrammed target depth of 2000 meters for a preprogrammed period of time. It then floats to the surface, taking temperature and salinity values during its ascent at set depths. — The Canadian Tracked Argo Datadescribes the Argo programme in Canada and provides data and information about Canadian floats.
Notes: Okay, so I can think of no use for this data, but I just that it was so awesome that people are doing this that I totally geeked out.
Civil Aircraft Register Database
Description: Civil Aircraft Register Database – this file contains the current mark, aircraft and owner information of all Canadian civil registered aircraft.
Notes: Here I really think there could be a geeky app. Just a simple app that you can type an aircraft’s number into and it will tell you the owner and details about the plane. I actually think the government could do a lot of work with this data. If regulatory and maintenance data were made available as well – then you’d have a powerful app that would tell you a lot about the planes you fly in. At a minimum would be of interest to flight enthusiasts.
Real Time Hydrometric Data Tool
Description: Real Time Hydrometric Data Tool – this site provides public access to real-time hydrometric (water level and streamflow) data collected at over 1700 locations in Canada. These data are collected under a national program jointly administered under federal-provincial and federal-territorial cost-sharing agreements. It is through partnerships that the Water Survey of Canada program has built a standardized and credible environmental information base for Canada. This dataset contains both current and historical datasets. The current month can be viewed in an HTML table, and historical data can be downloaded in CSV format.
Notes: So ripe for an API! What is cool is that the people at Environment Canada have integrated it into google maps. I could imagine fly fisherman and communities at risk of flooding being interested in this data set.
Access to information data sets
Description: 2006-2010 Access to Information and Privacy Statistics (With the previous years here, here and here.) is a compilation of statistical information about access to information and privacy submitted by government institutions subject to the Access to Information Act and the Privacy Act for 2006-2010.
Notes: I’d love to crunch this stuff again and see whose naughty and nice in the ATIP world…
Poultry and Forestry data
No links, BECAUSE THERE IS SO MUCH OF IT. Anyone interested in the Poultry or Forestry industry will find lots of data… obviously this stuff is useful to people who analyze these industries but I suspect there are a couple of “A” university level papers hidden in that data set as well.
There is tons on building permits., construction.. Actually one of the benefits of looking at the data in a spread sheet, easy to see other related data sets.
It really is amazing how much Statistic Canada data there is. Even reviewing something like the supply and demand of natural gas liquids got me thinking about the wealth of information trapped in there. One thing I do hope statscan starts to do is geolocate its data whenever possible.
As this has been in the news I couldn’t help but include it. It’s nice that any citizen can look at the crime data direct from StatsCan too see how our crime rate is falling (which is why we should build more expensive prisons) Crime statistics, by detailed offences. Of course unreported crime, which we all know is climbing at 3000% a year, is not included in these stats.
Legal Aid Applications
Legal aid applications, by status and type of matter. This was interesting to me since, here in BC there is much talk about funding for the Justice system and yet, the number of legal aid applications has remained more or less flat over the past 5 years.
National Broadband Coverage data
Description: The National Broadband Coverage Data represents broadband coverage information, by technology, for existing broadband service providers as of January 2012. Coverage information for Broadband Canada Program projects is included for all completed projects. Coverage information is aggregated over a grid of hexagons, which are each 6 km across. The estimated range of unserved / underserved population within in each hexagon location is included.
Notes: What’s nice is that there is lat/long data attached to all this, so mapping it, and potentially creating a heat map is possible. I’m certain the people at OpenMedia might appreciate such a map.
Census Consolidated Subdivision
Description: Census Consolidated Subdivision Cartographic Boundary Files portrays the geographic limits used for the 2006 census dissemination. The Census Consolidated Subdivision Boundary Files contain the boundaries of all 2,341 census consolidated subdivisions.
Notes: Obviously this one is on every data geeks radar, but just in case you’ve been asleep for the past 5 months, I wanted to highlight it.
Non-Emergency Surgeries, distribution of waiting times
Description: Non-emergency surgeries, distribution of waiting times, household population aged 15 and over, Canada, provinces and territories
Notes: Would love to see this at the hospital and clinic level!
Border Wait Times
Description: Estimates Border Wait Times (commercial and travellers flow) for the top 22 Canada Border Services Agency land border crossings.
Notes: Here I really think there is an app that could be made. At the very least there is something that could tell you historical averages and ideally, could be integrated into Google and Bing maps when calculating trip times… I can also imagine a lot of companies that export goods to the US are concerned about this issue and would be interested in better data to predict the costs and times of shipping goods. Big potential here.
Okay, that’s my list. Hope it inspires you to take a look yourself, or play with some of the data listed above!