(Removing names that I know are on these two lists. Sorry for the cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. I just want to check how much we can conceivably get through. I think the ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures.

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew: I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid wiki page edit conflicts. Etherpad allows live collaboration and keeps a chat history. We can then organize and publish the Etherpad content on a wiki.

Cheers,
James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far.

I'm copying OpenDataBC and civic-access as suggested. Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool. As most of you know, we do have the ODUI (http://www.opendatabc.ca/odui.html), which we could leverage to rate all of the datasets to give them a basic usability rating.
It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :) The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability, though. It doesn't address the subject matter (how interesting it might be), the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose. It does do a decent job of many other aspects - legal framework, accessibility and readability - so perhaps it's at least part of the picture. If we are able to articulate additional qualities, they could be added to the tool as well. We don't have to use the ODUI, of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool, I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel. Depending on how much help we can muster, we might want to come up with a scheme where more than one person rates every dataset and we then average the ratings.

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think these are areas where we can get movement from the federal government through something like what David suggests.

The lack of precision in open data is one of the areas most in need of improvement. I think this problem can arise out of a desire to aggregate data in order to protect the privacy of individuals, as with many StatsCan datasets. It can also come out of a genuine desire to provide citizens with information, like a mortality rate, and spare us the nitty-gritty details. Since these details are what we are looking for most often, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out, and links to the data you'd need to do them, would also be useful for others to run with. Better that someone else work on an idea of mine than have it just sit on my desk and go stale.

Cheers,
Andrew

On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own
Most of the datasets need significant value added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: in Canada, not so much; in the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story do they tell? For them to be interesting, you'd have to mark the graphs with important events, like recessions, the introduction of the GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means a high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petrochemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.

3) Government has already done valuable analysis

Much of the most interesting analysis of the data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government.
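(For what it's worth, the naive riding-"safety" estimate mentioned under point 2 could be sketched roughly like this. The data layout and riding names are hypothetical; this is not the actual scraping or analysis code, just one way to score a riding by how often its historically dominant party wins.)

```python
# Sketch of a naive "safe riding" score: the share of past elections
# won by the riding's most frequent winning party. Hypothetical data
# layout; assumes you've already scraped (riding, year, winner) rows.
from collections import defaultdict

def safety_scores(results):
    """results: iterable of (riding, year, winning_party) tuples.
    Returns {riding: fraction of elections won by its modal party}."""
    by_riding = defaultdict(list)
    for riding, year, party in results:
        by_riding[riding].append(party)

    scores = {}
    for riding, parties in by_riding.items():
        modal = max(set(parties), key=parties.count)  # most frequent winner
        scores[riding] = parties.count(modal) / len(parties)
    return scores

# Hypothetical history: Riding A always elects party X; Riding B flips.
history = [
    ("Example--Riding A", 2004, "X"), ("Example--Riding A", 2006, "X"),
    ("Example--Riding A", 2008, "X"), ("Example--Riding A", 2011, "X"),
    ("Example--Riding B", 2004, "X"), ("Example--Riding B", 2006, "Y"),
    ("Example--Riding B", 2008, "X"), ("Example--Riding B", 2011, "Y"),
]
scores = safety_scores(history)
# Riding A scores 1.0 (safe); Riding B scores 0.5 (competitive)
```

A score near 1.0 flags a "safe" riding, near 1/(number of parties) a competitive one; as noted above, this simple approach predicts poorly in Canada.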
It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom-hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed: "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't find a single dataset along the lines of votes in the House of Commons, MP expenses and schedules, the lobbying registry and meetings, etc.

I've only used the boundary files so far (census subdivisions, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first movers but especially because cities have more interesting non-political data.

--
James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,
Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.
What I'd like to propose is that we go through the data on data.gc.ca, see what is interesting, and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca URL for it too. My suspicion is that almost none of us know what is actually available, since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.
If someone has a clever proposal for how to go through it, I'd love for us to highlight the high-value datasets (if there are any) available on data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.
I think it would be fun to have a day where people post the cool datasets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can put in after work. I was going to propose after work on March 15th in everyone's respective time zone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).
I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.
I'll blog this idea as well so that others can read it.
Cheers,
dave
--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury