I think this conversation should be on OpenDataBC and CivicAccess.
I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.
The spreadsheet lists 9556 non-geographic datasets, but the actual number is half that (4778), as every dataset is translated and counted twice, once for each official language. A number of issues limit the level of interest:
1) Data has low precision
Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.
Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.
2) Data has little value on its own
Most of the datasets need significant added value to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.
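As a rough illustration of what a naive historical "safeness" estimate could look like, here is a minimal sketch. It is not the method actually used above, which is not described in detail. It assumes each riding's history is simply a list of past winning parties and scores a riding by the share of elections won by its most frequent winner. The riding names, party labels and the safeness function are all hypothetical.

```python
from collections import Counter

def safeness(past_winners):
    """Return (most frequent winning party, share of past elections it won).

    Assumption: "safe" is approximated by how often one party has won;
    vote margins, boundary changes and party mergers are ignored.
    """
    party, wins = Counter(past_winners).most_common(1)[0]
    return party, wins / len(past_winners)

# Hypothetical histories: one winning party label per past election.
history = {
    "Riding X": ["A", "A", "A", "B", "A"],
    "Riding Y": ["A", "B", "C", "B", "A"],
}

scores = {riding: safeness(winners) for riding, winners in history.items()}

for riding, (party, share) in scores.items():
    print(f"{riding}: party {party} won {share:.0%} of past elections")
```

Even this toy version skips the time-consuming part: reconciling riding boundaries and party names across more than a century of results.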
Another example: you can easily draw graphs using the Fiscal Reference Tables, but what story do they tell? For them to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.
Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.
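To make the methodology point concrete, here is a minimal sketch comparing a percentile rank computed within an industry with one computed across all facilities. The facility names and emission figures are invented for illustration, and the percentile_rank definition (share of values at or below a given value) is an assumption, not necessarily Emitter's formula.

```python
from collections import defaultdict

def percentile_rank(value, values):
    """Percentage of values that are less than or equal to value."""
    return 100.0 * sum(v <= value for v in values) / len(values)

# Hypothetical facilities: (name, industry, total emissions in tonnes).
facilities = [
    ("Bakery A", "bakery", 5.0),
    ("Bakery B", "bakery", 1.0),
    ("Bakery C", "bakery", 2.0),
    ("Refinery A", "petrochemical", 900.0),
    ("Refinery B", "petrochemical", 400.0),
    ("Refinery C", "petrochemical", 650.0),
]

# Group emissions by industry, and keep one list of all emissions.
by_industry = defaultdict(list)
for _, industry, emissions in facilities:
    by_industry[industry].append(emissions)
all_emissions = [emissions for _, _, emissions in facilities]

# For each facility: (rank within its industry, rank across all facilities).
ranks = {
    name: (
        percentile_rank(emissions, by_industry[industry]),
        percentile_rank(emissions, all_emissions),
    )
    for name, industry, emissions in facilities
}

for name, (within, overall) in ranks.items():
    print(f"{name}: within industry {within:.0f}, overall {overall:.0f}")
```

With these made-up numbers, the highest-emitting bakery ranks worse within its industry than the lowest-emitting refinery does within its own, even though the refinery emits far more in absolute terms.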
3) Government has already done valuable analysis
Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.
4) Data is not timely
Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website, since morels tend to fruit in recently burned forest. If I want to find wild morels, I need to rely on non-open provincial data.
5) Data quality is inconsistent
The Canada-provided industry categorizations used by Emitter are of poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.
6) Data is not politically sensitive
There are 1600 datasets on eggs, dairy and meat production. I can't find a single dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.
I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).
It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.
--
James McKinney
Open North
+1.514.247.0223
Twitter: @opennorth
On 2012-03-05, at 11:53 AM, David Eaves wrote: