Re: Upcoming events


Re: Upcoming events

Herb Lainchbury
+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though.  It doesn't address the subject matter (how interesting it might be), the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose.  It does do a decent job of many other aspects in terms of legal framework, accessibility and readability, so perhaps it's at least part of the picture.  If we are able to articulate additional qualities they could be added to the tool as well.  We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool, I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  Depending on how much help we can muster, we might want to come up with a scheme where more than one person rates every dataset and then we average.
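As a rough illustration of the "more than one person rates every dataset and then we average" idea, here is a minimal Python sketch. It assumes the spreadsheet can be exported as rows of (dataset id, rater, score); those column names, the sample rows and the two-rater minimum are hypothetical and not part of the ODUI.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical export of the shared spreadsheet: (dataset_id, rater, score).
RATINGS = [
    ("dataset-a", "rater-1", 3),
    ("dataset-a", "rater-2", 5),
    ("dataset-b", "rater-1", 4),
]

def average_ratings(rows):
    """Return the mean score per dataset across all of its raters."""
    scores = defaultdict(list)
    for dataset_id, _rater, score in rows:
        scores[dataset_id].append(score)
    return {dataset_id: mean(values) for dataset_id, values in scores.items()}

def needs_more_raters(rows, minimum=2):
    """Return dataset ids rated by fewer than `minimum` distinct people."""
    raters = defaultdict(set)
    for dataset_id, rater, _score in rows:
        raters[dataset_id].add(rater)
    return sorted(d for d, people in raters.items() if len(people) < minimum)

averages = average_ratings(RATINGS)
pending = needs_more_raters(RATINGS)
print(averages)
print(pending)
```

The second function is there because an average over a single rater is not really an average; listing the datasets that still need a second opinion makes it easier to split up the remaining work.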

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect the privacy of individuals, as with many StatsCan datasets. It can also arise out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty-gritty details. Since these details are what we are looking for most often, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.
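To make the methodology warning in point 2 concrete, here is a small Python sketch of how a percentile rank computed within an industry can differ from one computed across all facilities. The facility names and emission figures are invented for illustration, and this is not a description of how Emitter.ca actually calculates its grades.

```python
from collections import defaultdict

def percentile_rank(value, population):
    """Percentage of values in `population` that are less than or equal to `value`."""
    return 100.0 * sum(1 for v in population if v <= value) / len(population)

# Hypothetical facilities: (name, industry, total emissions in arbitrary units).
FACILITIES = [
    ("Bakery A", "bakery", 10.0),
    ("Bakery B", "bakery", 2.0),
    ("Bakery C", "bakery", 1.0),
    ("Plant P", "petrochemical", 1000.0),
    ("Plant Q", "petrochemical", 5000.0),
]

def rank_facilities(facilities):
    """Rank each facility within its own industry and across all facilities."""
    by_industry = defaultdict(list)
    for _name, industry, emissions in facilities:
        by_industry[industry].append(emissions)
    everyone = [emissions for _name, _industry, emissions in facilities]
    return {
        name: {
            "within_industry": percentile_rank(emissions, by_industry[industry]),
            "overall": percentile_rank(emissions, everyone),
        }
        for name, industry, emissions in facilities
    }

ranks = rank_facilities(FACILITIES)
print(ranks["Bakery A"])
print(ranks["Plant P"])
```

With these made-up numbers, the highest-emitting bakery tops its own industry while sitting well below the petrochemical plants overall, and neither ranking says anything about whether the air is actually unhealthy to breathe.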


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose is that we go through the data on data.gc.ca, see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can do after their work. I was going to propose after work on March 15th in everyone's respective timezone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave





--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury

Re: Upcoming events

James McKinney-2
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James


Re: Upcoming events

Tracey P. Lauriault

Appraising the value of a dataset is a very tricky proposition.

 

For example, how would the Canada Land Inventory (CLI) be evaluated as 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?

 

The CLI, for example, fueled the invention of GIS (a Canadian invention), and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.

 

These data now form the base of so many other important data sets and maps. For example, they were used to inform the creation of ecological regions and territories, which changed the way Canadian land was envisioned, inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada, after all, remains a resource-based extraction economy, with lots of 'unpeopled places', even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on GeoGratis, which as you are aware was the first government portal to deliver free data, and it did so with ground-breaking licenses.

 

The CLI is also, according to some NRCan friends, one of the most popular data sets and the most often downloaded.  This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps, and it was a game changer worldwide, both technologically and information-wise.

 

So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  1 dataset that took decades to create, was built by hundreds of scientists collaborating to do so, and then took at least a decade to save.  A dataset that cost millions.  A dataset which describes the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.

 

GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file, for instance, not only costs millions to produce and update, but is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord between levels of government that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?

 

These geodata, like statistical data, are part of well-established disciplines and come from foundational Canadian institutions, dating to 1842 for NRCan and 1666 (or officially 1871) for the census.  They have their communities and they have a context.

 

The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!

 

The TBS portal also lacks curatorship and a well-developed mechanism to discover the data.  Furthermore, the data are stripped of context.  I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal.  The TBS portal does not deliver data in a way that lets me trust them.  There is very little metadata; there are no methodological documents to explain the data fields or how the data were collected; nothing is available to help assess the quality, reliability and accuracy of the data; there are no data dictionaries; and there is no provenance, lineage or dataset authority.  These critical elements of what constitutes a method to evaluate fit for use have been removed.  A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.

 

How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?

 

In short, I am not sure we are the best people to do this job.  I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data.  I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices, roles and professions, such as librarians, curators, archivists, scientists, etc.  I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets.

 

And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base.  And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding.  Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though.  It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as wasn't really it's purpose.  It does do a decent job of many other aspects in terms of legal framework, accessibility and readability so perhaps it's at least part of the picture.  If we are able to articulate additional qualities they could be added to the tool as well.  We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals as with many StatsCan datasets. It can also be out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty gritty details. Since these details are what we are looking after most often, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story do they tell? For them to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.
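
To illustrate the methodology point, here is a small sketch comparing a percentile rank computed within each industry to one computed across all facilities. The facility names and emission figures are invented, and this is not Emitter.ca's actual code or method, just an example of how a within-group ranking can label a small emitter "bad".

```python
def percentile_rank(value, population):
    """Percentage of values in population that are less than or equal to value."""
    return 100.0 * sum(1 for v in population if v <= value) / len(population)


# Hypothetical annual emissions in tonnes; not real data.gc.ca or Emitter.ca figures.
facilities = [
    ("Bakery A", "bakery", 5),
    ("Bakery B", "bakery", 8),
    ("Bakery C", "bakery", 12),
    ("Plant X", "petrochemical", 9000),
    ("Plant Y", "petrochemical", 15000),
    ("Plant Z", "petrochemical", 22000),
]

all_emissions = [emissions for _, _, emissions in facilities]
by_industry = {}
for _, industry, emissions in facilities:
    by_industry.setdefault(industry, []).append(emissions)

# name -> (rank within its own industry, rank across all facilities)
ranks = {}
for name, industry, emissions in facilities:
    within = percentile_rank(emissions, by_industry[industry])
    overall = percentile_rank(emissions, all_emissions)
    ranks[name] = (within, overall)
    print(f"{name}: within-industry {within:.0f}, overall {overall:.0f}")
```

In this toy data the largest bakery gets the top within-industry rank while emitting a tiny fraction of what the smallest petrochemical plant does, which is the kind of result that needs careful labelling.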

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't find a single dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose is that we go through the data on data.gc.ca, see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can do after their work. I was going to propose after work on March 15th in everyone's respective time zone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave





--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury


_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss



--
Tracey P. Lauriault
613-234-2805
 
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
 
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)

Re: Upcoming events

James McKinney-2
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.
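
As a rough sketch of what aggregating interest per community could look like, assuming each expression of interest is recorded as a (community, dataset) pair. The community and dataset names below are placeholders, not survey results.

```python
from collections import Counter, defaultdict


def top_datasets(votes, n=3):
    """Count (community, dataset) votes and return each community's n most popular datasets."""
    counts = defaultdict(Counter)
    for community, dataset in votes:
        counts[community][dataset] += 1
    return {
        community: [dataset for dataset, _ in counter.most_common(n)]
        for community, counter in counts.items()
    }


# Hypothetical expressions of interest.
votes = [
    ("community-1", "dataset-a"),
    ("community-1", "dataset-a"),
    ("community-1", "dataset-b"),
    ("community-2", "dataset-c"),
]

recommendations = top_datasets(votes, n=1)
print(recommendations)
```

Because the counts are kept separately per community, the output stays useful even if only one community takes part.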

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

With respect to CLI, no one disagrees that it's important, valuable, etc., as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.

James

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:

Appraising the value of a dataset is a very tricky proposition.

 

For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?

 

The CLI, for example, fueled the invention of GIS (a Canadian invention), and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.

 

These data now form the base of so many other important data sets and maps. For example, they were used to inform the creation of ecological regions and territories, which changed the way Canadian land was envisioned, inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada, after all, remains a resource-based extraction economy, with lots of 'unpeopled places', even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis, which, as you are aware, was the first government portal to deliver free data, and it did so with ground-breaking licenses.

 

The CLI is also, according to some NRCan friends, one of the most popular data sets and the most oft downloaded.  This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps, and it was a game changer worldwide, both technologically and information-wise.

 

So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  1 dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then at least a decade to save.  A dataset that cost millions.  A dataset which described the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.

 

GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file for instance, not only costs millions to produce and update, but it is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government and that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?

 

These geodata, like statistical data, are part of well-established disciplines and are foundational Canadian institutions: 1842 for NRCan and 1666 for the census, or officially 1871.  They have their communities and they have a context.

 

The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!

 

The TBS portal also lacks curatorship and a well-developed mechanism to discover the data.  Furthermore, the data are stripped of context.  I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal.  The TBS portal does not deliver data in a way that I can trust them.  There is very little metadata, there are no methodological documents to explain the data fields or how the data were collected, what is available to help assess the quality, reliability and accuracy of the data, there are no data dictionaries, there is no provenance - lineage - or dataset authority; these critical elements of what constitutes a method to evaluate fit for use have been removed.  A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.

 

How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?

 

Bref, I am not sure we are the best people to do this job.  I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data.  I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices and roles and professions, such as librarianship, curatorship, archivists, scientists, etc.  I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets. 

 

And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base.  And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding.  Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James



Re: Upcoming events

David Eaves
+1 on James's comment. The purpose here is to get people engaging with the data. Few have even looked to see what is available on data.gc.ca. My feeling is to keep this light and fun. Moreover, I'd love it if different epistemic communities had their own events - I'd love to read a blog post about what urban planners found interesting vs. what political geeks found interesting vs. what an economist found interesting.

On 12-03-06 11:37 AM, James McKinney wrote:
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

With respect to CLI, no one disagrees that it's important, valuable, etc., as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.

James

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:

Appraising the value of a dataset is a very tricky proposition.

 

For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?

 

The CLI, for example, fueled the invention of GIS (a Canadian invention), and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.

 

These data now form the base of so many other important data sets and maps. For example, they were used to inform the creation of ecological regions and territories, which changed the way Canadian land was envisioned, inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada, after all, remains a resource-based extraction economy, with lots of 'unpeopled places', even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis, which, as you are aware, was the first government portal to deliver free data, and it did so with ground-breaking licenses.

 

The CLI is also, according to some NRCan friends, one of the most popular data sets and the most often downloaded. This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on. It is also invaluable as it forms the base of other maps, and it was a game changer technologically and information-wise worldwide.

 

So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140? A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc. 1 dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then at least a decade to save. A dataset that cost millions. A dataset which described the platform upon which we live. Also, a national historical artifact. It is not a NEW dataset, but it remains a critically important dataset.

 

GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file for instance, not only costs millions to produce and update, but it is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government and that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?

 

These geodata, like statistical data, are part of well-established disciplines and are foundational Canadian institutions: 1842 for NRCan and 1666 for the census, or officially 1871. They have their communities and they have a context.

 

The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!

 

The TBS portal also lacks curatorship and a well-developed mechanism to discover the data. Furthermore, the data are stripped of context. I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal. The TBS portal does not deliver data in a way that I can trust them. There is very little metadata; there are no methodological documents to explain the data fields or how the data were collected, or to help assess the quality, reliability and accuracy of the data; there are no data dictionaries; there is no provenance, lineage or dataset authority. These critical elements of what constitutes a method to evaluate fitness for use have been removed. A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.

 

How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?

 

In short, I am not sure we are the best people to do this job. I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data. I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices and roles and professions, such as librarianship, curatorship, archivists, scientists, etc. I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets.

 

And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base. And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding. Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though. It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose. It does do a decent job of many other aspects in terms of legal framework, accessibility and readability, so perhaps it's at least part of the picture. If we are able to articulate additional qualities they could be added to the tool as well. We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.
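
The averaging scheme described above could be sketched roughly as follows. This is only an illustration, assuming the shared spreadsheet is exported as (dataset, rater, score) rows; the function names, dataset ids and scores are made up and are not part of the ODUI.

```python
from collections import defaultdict
from statistics import mean


def average_ratings(ratings):
    """Average the scores given to each dataset.

    ratings: iterable of (dataset_id, rater, score) tuples.
    Returns a dict mapping dataset_id to its mean score.
    """
    scores = defaultdict(list)
    for dataset_id, _rater, score in ratings:
        scores[dataset_id].append(score)
    return {dataset_id: mean(values) for dataset_id, values in scores.items()}


def under_rated(ratings, minimum=2):
    """List datasets rated by fewer than `minimum` distinct people."""
    raters = defaultdict(set)
    for dataset_id, rater, _score in ratings:
        raters[dataset_id].add(rater)
    return sorted(d for d, who in raters.items() if len(who) < minimum)


# Hypothetical rows, as they might come out of the spreadsheet.
sample = [
    ("dataset-a", "alice", 60),
    ("dataset-a", "bob", 80),
    ("dataset-b", "alice", 90),
]

print(average_ratings(sample))
print(under_rated(sample))
```

Tracking which datasets still have fewer than two raters is one way to see how far the available help stretches before averaging means anything.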

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals, as with many StatsCan datasets. It can also be out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty gritty details. Since these details are what we are most often after, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.
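
The methodology warning under point 2 (a "bad" polluter being one with a high percentile rank within its own industry) can be illustrated with a small sketch. The facilities and emission figures below are invented, and the percentile formula is a common textbook definition, not necessarily the one Emitter.ca uses.

```python
def percentile_rank(value, values):
    """Percentage of entries in `values` that are less than or equal to `value`."""
    return 100.0 * sum(v <= value for v in values) / len(values)


# (name, industry, total emissions in tonnes) -- made-up numbers
facilities = [
    ("Bakery A", "bakery", 5),
    ("Bakery B", "bakery", 8),
    ("Bakery C", "bakery", 12),
    ("Plant X", "petrochemical", 9000),
    ("Plant Y", "petrochemical", 15000),
]


def ranks(facilities):
    """Return {name: (rank within own industry, rank across all facilities)}."""
    everyone = [emissions for _, _, emissions in facilities]
    result = {}
    for name, industry, emissions in facilities:
        peers = [e for _, ind, e in facilities if ind == industry]
        result[name] = (
            percentile_rank(emissions, peers),
            percentile_rank(emissions, everyone),
        )
    return result


for name, (within, overall) in ranks(facilities).items():
    print(f"{name}: {within:.0f} within industry, {overall:.0f} overall")
```

With these numbers the worst bakery ranks at the top of its industry while emitting a tiny fraction of what the cleaner petrochemical plant does, which is the kind of misreading the warning is about.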


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose is that we go through the data on data.gc.ca and see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can do after their work. I was going to propose after work on March 15th in everyone's respective timezone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave





--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury


_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss



--
Tracey P. Lauriault
613-234-2805
 
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
 
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)

Re: [OpenDataBC] Re: Upcoming events

Herb Lainchbury
In reply to this post by James McKinney-2
James: The province provided about 10,000 datasets to one of our hackathons.  Three people rated most of that 10,000 in a few hours at that hackathon.

It was reasonably quick because many of the datasets were similar to each other.  The last 100 or so were a bit trickier and took a bit longer.

The 250 on OpenDataBC were done over a longer period of time without any sustained effort in any one session.

H

--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury

Re: Upcoming events

Mark Weiler-2
In reply to this post by David Eaves
+1 to getting librarians involved. Librarians, are you here?!


From: David Eaves <[hidden email]>
To: [hidden email]
Sent: Tuesday, March 6, 2012 11:40:35 AM
Subject: Re: [CivicAccess-discuss] Upcoming events

+1 on James's comment. The purpose here is to get people engaging with the data. Few have even looked to see what is available on data.gc.ca. My feeling is to keep this light and fun. Moreover, I'd love it if different epistemic communities had their own events - I'd love to read a blog post about what urban planners found interesting vs. what political geeks found interesting vs. what an economist found interesting.

On 12-03-06 11:37 AM, James McKinney wrote:
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

With respect to CLI, no one disagrees that it's important, valuable, etc. as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and most downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.

James

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:

Appraising the value of a dataset is a very tricky proposition.
 
For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?
 
The CLI, for example, fueled the invention of GIS (a Canadian invention), and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.
 
These data now form the base of so many other important datasets and maps. For example, they were used to inform the creation of ecological regions and territories, which changed the way Canadian land was envisioned, inspired a new way to consider resources, and transformed the way land was to be managed, organized and classified.  Canada, after all, remains a resource-based extraction economy, with lots of 'unpeopled places', even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis, which, as you are aware, was the first government portal to deliver free data, and it did so with ground-breaking licenses.
 
The CLI is also, according to some NRCan friends, one of the most popular datasets and the most often downloaded.  This dataset has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps, and it was a game changer worldwide, both technologically and information-wise.
 
So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  1 dataset that took decades to create, built by hundreds of scientists working together, and then at least a decade to save.  A dataset that cost millions.  A dataset which described the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.
 
GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file, for instance, not only costs millions to produce and update, but is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord between levels of government that took decades to develop.  It is a foundational dataset for all other datasets.  That would also be counted as 1 of 140?
 
These geodata, like statistical data, are part of well-established disciplines and come from foundational Canadian institutions: 1842 for NRCan and 1666 for the census (or officially 1871).  They have their communities and they have a context.
 
The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!
 
The TBS portal also lacks curatorship and a well-developed mechanism to discover the data.  Furthermore, the data are stripped of context.  I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal.  The TBS portal does not deliver data in a way that lets me trust them.  There is very little metadata; there are no methodological documents to explain the data fields or how the data were collected; there is little available to help assess the quality, reliability and accuracy of the data; there are no data dictionaries; and there is no provenance, lineage or dataset authority.  These critical elements of a method to evaluate fitness for use have been removed.  A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.
 
How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?
 
Bref, I am not sure we are the best people to do this job.  I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data.  I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices and roles and professions, such as librarianship, curatorship, archivists, scientists, etc.  I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets. 
 
And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base.  And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding.  Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.
Cheers
t
 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though.  It doesn't address the subject matter (how interesting it might be), the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose.  It does do a decent job of many other aspects in terms of legal framework, accessibility and readability, so perhaps it's at least part of the picture.  If we are able to articulate additional qualities they could be added to the tool as well.  We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.
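A minimal sketch of the averaging scheme described above, assuming the shared spreadsheet can be exported as (dataset, rater, score) rows. The dataset names, rater names and the 0-100 score scale are hypothetical, chosen only for illustration; they are not the ODUI's actual scale.

```python
from collections import defaultdict
from statistics import mean


def average_ratings(rows, min_raters=2):
    """Average scores per dataset, skipping datasets with too few raters."""
    scores = defaultdict(dict)
    for dataset, rater, score in rows:
        # Keep one score per rater per dataset (a later row overwrites an earlier one).
        scores[dataset][rater] = score
    return {
        dataset: mean(by_rater.values())
        for dataset, by_rater in scores.items()
        if len(by_rater) >= min_raters
    }


# Hypothetical export of the rating spreadsheet.
rows = [
    ("election-results", "rater_a", 80),
    ("election-results", "rater_b", 60),
    ("fiscal-reference-tables", "rater_a", 70),
]

print(average_ratings(rows))
```

With `min_raters=2`, a dataset rated by only one person is left out of the averages, which makes it easy to see where more help is still needed.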

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals, as with many StatsCan datasets. It can also be out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty-gritty details. Since these details are what we are most often after, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out, and links to the data you'd need to do them, would be useful too, so that others can run with them. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story do they tell? For them to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.
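
The methodology warning under point 2 can be made concrete with a small sketch. It ranks the same facilities twice: once against peers in the same industry, and once against all facilities. The facility names and emission figures are invented, and this is not Emitter.ca's actual calculation, only an illustration of how a within-industry percentile can label a small emitter as "bad".

```python
def percentile_rank(value, values):
    """Percentage of values that are less than or equal to value."""
    return 100.0 * sum(v <= value for v in values) / len(values)


# (name, industry, total emissions in arbitrary units) -- made-up numbers
facilities = [
    ("Bakery A", "bakery", 1.0),
    ("Bakery B", "bakery", 2.0),
    ("Bakery C", "bakery", 5.0),
    ("Plant X", "petrochemical", 900.0),
    ("Plant Y", "petrochemical", 1500.0),
]

all_emissions = [e for _, _, e in facilities]

for name, industry, emissions in facilities:
    peers = [e for _, ind, e in facilities if ind == industry]
    within = percentile_rank(emissions, peers)
    overall = percentile_rank(emissions, all_emissions)
    print(f"{name}: within-industry {within:.0f}, overall {overall:.0f}")
```

In this toy data, the highest-emitting bakery sits at the top of its own industry but in the middle of the overall ranking, while the cleaner of the two plants sits mid-pack within its industry despite emitting far more than any bakery.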


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose is that we go through the data on data.gc.ca, see what is interesting, and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool datasets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can do after their work. I was going to propose after work on March 15th in everyone's respective timezone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave





--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury


_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss



--
Tracey P. Lauriault
613-234-2805
 
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
 
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)


Re: Upcoming events

Clinton Boyda

We host over 100 libraries (2 regions) in Alberta on TownLife, a customized Ruby on Rails CMS. We are thinking of experimenting with the Socrata open data platform (http://opendata.socrata.com/), but are really waiting to see some kind of provincial leadership set a precedent before lone-wolfing it.

 

 

--

Clinton Boyda

Leading Strategic Development

ph (866) 310-1875 x7

 

Econolution Inc.

Helping Rural Communities Diversify, Grow & Prosper.

www.townlife.com Community Powered Websites!

 

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Mark Weiler
Sent: March 6, 2012 1:14 PM
To: civicaccess discuss
Subject: Re: [CivicAccess-discuss] Upcoming events

 

+1 to getting librarians involved. Librarians are you here?!

 


From: David Eaves <[hidden email]>
To: [hidden email]
Sent: Tuesday, March 6, 2012 11:40:35 AM
Subject: Re: [CivicAccess-discuss] Upcoming events

 

+1 on James's comment. The purpose here is to get people engaging with the data. Few have even looked to see what is available on data.gc.ca. My feeling is to keep this light and fun. Moreover, I'd love it if different epistemic communities had their own events - I'd love to read a blog post about what urban planners found interesting vs. what political geeks found interesting vs. what an economist found interesting.

On 12-03-06 11:37 AM, James McKinney wrote:

Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

 

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

 

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.

 

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

 

With respect to CLI, no one disagrees that it's important, valuable, etc. as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is of the most popular and downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

 

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms with help start to build a community around data.gc.ca.

 

James

 

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:



Appraising the value of a dataset is a very tricky proposition.

 

For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet of that includes a demographic variable arranged by census metropolitan area? 

 

The CLI for example fueled the invention of GIS (a Canadian invention) and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post war push to map Canada with air photo recognizance and part of early computerization.  The project was eventually, shelved by 'management' and forgotten and the software became obsolete.  Boxes of tapes got moved around, eventually I think, they made their way into someone's garage (classic). A team was eventually built, to restore these data.  Ecologists, agricultural data specialist, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.

 

These data now form the base of so many other important data sets and maps, for example, they were used to inform the creation of ecological regions and territories which changed the way Canadian land was envisioned, and it inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada after all, remains a resource based extraction economy, with lots of 'unpeopled places' even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis which as you are aware was the first government portal to deliver free data and it did so with ground-breaking licenses.

 

The CLI, is also, according to some NRCan friends, one of the most popular data sets and the most oft downloaded.  This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps and it was a game changer technologically and information wise worldwide.

 

So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  1 dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then a least a decade to save.  A dataset that cost millions.  A dataset which described the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.

 

GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file for instance, not only costs millions to produce and update, but it is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government and that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?

 

These geodata, like statistical data are part of well-established disciplines, and are foundational Canadian Institutions, 1842 for NRCan and 1666 for the census or officially 1871.  They have their communities and they have a context.

 

The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!

 

The TBS portal also lacks curatorship and a well-developed mechanism to discover the data.  Furthermore, the data are stripped of context.  I trust data coming from DFO or NRCan or NRC, I do not have the same level of trust in the data I see in the TBS portal.  The TBS portal does not deliver data in a way that I can trust them.  There is very little metadata, there are no methodological documents to explain the data fields or how the data were collected, what is available to help assess the quality, reliability and accuracy of the data, there are no data dictionaries, there is no provenance - lineage - or dataset authority, these critical elements of what constitutes a method to evaluate fit for use have been removed.  A municipal portal is one thing, as the municipality is a context, but at the federal level where these is a mix of administrative, scientific, geomatics, statistical data, the TBS portal falls short.

 

How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?

 

Bref, I am not sure we are the best people to do this job.  I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data.  I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices and roles and professions, such as librarianship, curatorship, archivists, scientists, etc.  I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets. 

 

And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base.  And I think we would be way better off, if we acknowledged that we know stuff,  but that we would be smarter and do better work if it we collaboratively worked with best people in a varied of data fields who can provide some grounding.  Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 

On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:

(Removing names that I know are on these two lists. Sorry for cross-post.)

 

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

 

In reply to Andrew:

 

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

 

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

 

Cheers,

 

James

 

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:



+1 for the great responses thus far

 

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

 

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

 

The ODUI only addresses usability though.  It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as wasn't really it's purpose.  It does do a decent job of many other aspects in terms of legal framework, accessibility and readability so perhaps it's at least part of the picture.  If we are able to articulate additional qualities they could be added to the tool as well.  We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

 

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.

 

H

 

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:

James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals as with many StatsCan datasets. It can also arise out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty gritty details. Since these details are what we are most often looking for, we could really use a way to communicate this.

 

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

 

Cheers,
Andrew

 

On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:

I think this conversation should be on OpenDataBC and CivicAccess.

 

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

 

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

 

1) Data has low precision

 

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

 

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

 

2) Data has little value on its own

 

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

 

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

 

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.
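
The methodological point can be illustrated with a small sketch that ranks the same facility twice: once within its own industry and once against all facilities. This is not Emitter.ca's code or data; the facility names, industries and emission figures below are invented for illustration, and `percentile_rank` and `rank_both_ways` are hypothetical helper names.

```python
def percentile_rank(value, population):
    """Percentage of values in population that are less than or equal to value."""
    return 100.0 * sum(1 for v in population if v <= value) / len(population)


# Invented figures: (name, industry, total emissions in arbitrary units).
FACILITIES = [
    ("Bakery A", "bakery", 5),
    ("Bakery B", "bakery", 1),
    ("Bakery C", "bakery", 2),
    ("Refinery X", "petrochemical", 900),
    ("Refinery Y", "petrochemical", 1500),
]


def rank_both_ways(facilities):
    """Return name -> (rank within own industry, rank across all facilities)."""
    all_emissions = [emissions for _, _, emissions in facilities]
    by_industry = {}
    for _, industry, emissions in facilities:
        by_industry.setdefault(industry, []).append(emissions)

    return {
        name: (
            percentile_rank(emissions, by_industry[industry]),
            percentile_rank(emissions, all_emissions),
        )
        for name, industry, emissions in facilities
    }


ranks = rank_both_ways(FACILITIES)
```

With these made-up numbers, the highest-emitting bakery sits at the top of its industry ranking while falling below both refineries in the overall ranking, which is the gap between "bad within an industry" and "bad in absolute terms" described above.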

 

3) Government has already done valuable analysis

 

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

 

4) Data is not timely

 

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

 

5) Data quality is inconsistent

 

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

 

6) Data is not politically sensitive

 

There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.

 

 

I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

 

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

 

--

James McKinney
Open North
+1.514.247.0223

http://citizenbudget.com/ interactive budget consultations for municipalities

Twitter: @opennorth

 

On 2012-03-05, at 11:53 AM, David Eaves wrote:

 

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose is that we go through the data on data.gc.ca, see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can do after their work. I was going to propose after work on March 15th in everyone's respective timezone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave

 

 



 

--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury

 


_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss




--
Tracey P. Lauriault
613-234-2805
 
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
 
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)
_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss






Re: Upcoming events

David Eaves
In reply to this post by James McKinney-2
Some of you may find this deck of interest.

Of particular interest were the most downloaded data set numbers

Since Launch:

DATASET | DEPARTMENT | DOWNLOADS
Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 4730
Permanent Resident Summary by Mission (English) | Citizenship and Immigration Canada | 1733
Overseas Permanent Resident Inventory (English) | Citizenship and Immigration Canada | 1558
Canada – Permanent residents by category (English) | Citizenship and Immigration Canada | 1261
Permanent Resident Applicants Awaiting a Decision (English) | Citizenship and Immigration Canada | 873
Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 852
Meteorological Service of Canada (MSC) – Weather Element Forecasts | Environment Canada | 851
Permanent Resident Visa Applications Received Abroad - English Version | Citizenship and Immigration Canada | 800
Water Quality Indicators - Reports, Maps, Charts and Data | Environment Canada | 697
Canada - Permanent and Temporary Residents - English version | Citizenship and Immigration Canada | 625


Last 30 days:

# | DATASET | DEPARTMENT | DOWNLOADS
1 | Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 481
2 | Sales of commodities of large retailers - English version | Statistics Canada | 247
3 | Permanent Resident Summary by Mission - English Version | Citizenship and Immigration Canada | 207
4 | CIC Operational Network at a Glance - English Version | Citizenship and Immigration Canada | 163
5 | Gross domestic product at basic prices, communications, transportation and trade - English version | Statistics Canada | 159
6 | Anthropogenic disturbance footprint within boreal caribou ranges across Canada - As interpreted from 2008-2010 Landsat satellite imagery | Environment Canada | 102
7 | Canada - Permanent residents by category - English version | Citizenship and Immigration Canada | 98
8 | Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 61
9 | Sales of fuel used for road motor vehicles, by province and territory - English version | Statistics Canada | 52
10 | Government of Canada Core Subject Thesaurus - English Version | Library and Archives Canada | 51


Anyways, thought this might be interesting.

Dave

On 12-03-06 11:37 AM, James McKinney wrote:
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

With respect to CLI, no one disagrees that it's important, valuable, etc. as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.

James

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:

Appraising the value of a dataset is a very tricky proposition.

 

For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?

 

The CLI, for example, fueled the invention of GIS (a Canadian invention) and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.

 

These data now form the base of so many other important data sets and maps, for example, they were used to inform the creation of ecological regions and territories which changed the way Canadian land was envisioned, and it inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada after all, remains a resource based extraction economy, with lots of 'unpeopled places' even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis which as you are aware was the first government portal to deliver free data and it did so with ground-breaking licenses.

 

The CLI, is also, according to some NRCan friends, one of the most popular data sets and the most oft downloaded.  This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps and it was a game changer technologically and information wise worldwide.

 

So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  1 dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then at least a decade to save.  A dataset that cost millions.  A dataset which described the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.

 

GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file for instance, not only costs millions to produce and update, but it is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government and that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?

 

These geodata, like statistical data are part of well-established disciplines, and are foundational Canadian Institutions, 1842 for NRCan and 1666 for the census or officially 1871.  They have their communities and they have a context.

 

The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!

 

The TBS portal also lacks curatorship and a well-developed mechanism to discover the data.  Furthermore, the data are stripped of context.  I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal.  The TBS portal does not deliver data in a way that I can trust them.  There is very little metadata, there are no methodological documents to explain the data fields or how the data were collected, nothing to help assess the quality, reliability and accuracy of the data, no data dictionaries, and no provenance, lineage or dataset authority.  These critical elements of what constitutes a method to evaluate fit for use have been removed.  A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.

 

How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?

 

In short, I am not sure we are the best people to do this job.  I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data.  I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices, roles and professions, such as librarians, curators, archivists, scientists, etc.  I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets.

 

And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base.  And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding.  Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though.  It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose.  It does do a decent job of many other aspects in terms of legal framework, accessibility and readability so perhaps it's at least part of the picture.  If we are able to articulate additional qualities they could be added to the tool as well.  We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals as with many StatsCan datasets. It can also arise out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty gritty details. Since these details are what we are most often looking for, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose is that we go through the data on data.gc.ca, see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can do after their work. I was going to propose after work on March 15th in everyone's respective timezone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave





--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury


_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss



--
Tracey P. Lauriault
613-234-2805
 
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
 
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)
_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss




Re: data.gc.ca usage

James McKinney-2
Is there a possibility TBS will post these tables on its website? I'd be really interested to know how many downloads are from Canadian vs. foreign IPs. It'd also be great to know what people are doing with these datasets.

On 2012-03-07, at 6:23 PM, David Eaves wrote:

Some of you may find this deck of interest.

Of particular interest were the most downloaded data set numbers

Since Launch:

DATASET | DEPARTMENT | DOWNLOADS
Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 4730
Permanent Resident Summary by Mission (English) | Citizenship and Immigration Canada | 1733
Overseas Permanent Resident Inventory (English) | Citizenship and Immigration Canada | 1558
Canada – Permanent residents by category (English) | Citizenship and Immigration Canada | 1261
Permanent Resident Applicants Awaiting a Decision (English) | Citizenship and Immigration Canada | 873
Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 852
Meteorological Service of Canada (MSC) – Weather Element Forecasts | Environment Canada | 851
Permanent Resident Visa Applications Received Abroad - English Version | Citizenship and Immigration Canada | 800
Water Quality Indicators - Reports, Maps, Charts and Data | Environment Canada | 697
Canada - Permanent and Temporary Residents - English version | Citizenship and Immigration Canada | 625


Last 30 days:

# | DATASET | DEPARTMENT | DOWNLOADS
1 | Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 481
2 | Sales of commodities of large retailers - English version | Statistics Canada | 247
3 | Permanent Resident Summary by Mission - English Version | Citizenship and Immigration Canada | 207
4 | CIC Operational Network at a Glance - English Version | Citizenship and Immigration Canada | 163
5 | Gross domestic product at basic prices, communications, transportation and trade - English version | Statistics Canada | 159
6 | Anthropogenic disturbance footprint within boreal caribou ranges across Canada - As interpreted from 2008-2010 Landsat satellite imagery | Environment Canada | 102
7 | Canada - Permanent residents by category - English version | Citizenship and Immigration Canada | 98
8 | Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 61
9 | Sales of fuel used for road motor vehicles, by province and territory - English version | Statistics Canada | 52
10 | Government of Canada Core Subject Thesaurus - English Version | Library and Archives Canada | 51


Anyways, thought this might be interesting.

Dave

On 12-03-06 11:37 AM, James McKinney wrote:
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

With respect to CLI, no one disagrees that it's important, valuable, etc. as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.

James

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:

Appraising the value of a dataset is a very tricky proposition.

 

For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?

 

The CLI, for example, fueled the invention of GIS (a Canadian invention), and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.

 

These data now form the base of so many other important data sets and maps, for example, they were used to inform the creation of ecological regions and territories which changed the way Canadian land was envisioned, and it inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada after all, remains a resource based extraction economy, with lots of 'unpeopled places' even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis which as you are aware was the first government portal to deliver free data and it did so with ground-breaking licenses.

 

The CLI is also, according to some NRCan friends, one of the most popular data sets and the most often downloaded.  This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps, and it was a game changer technologically and information-wise worldwide.

 

So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  1 dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then at least a decade to save.  A dataset that cost millions.  A dataset which described the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.

 

GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file for instance, not only costs millions to produce and update, but it is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government and that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?

 

These geodata, like statistical data are part of well-established disciplines, and are foundational Canadian Institutions, 1842 for NRCan and 1666 for the census or officially 1871.  They have their communities and they have a context.

 

The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!

 

The TBS portal also lacks curatorship and a well-developed mechanism to discover the data.  Furthermore, the data are stripped of context.  I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal.  The TBS portal does not deliver data in a way that lets me trust them.  There is very little metadata; there are no methodological documents to explain the data fields, how the data were collected, or what is available to help assess the quality, reliability and accuracy of the data; there are no data dictionaries; and there is no provenance, lineage or dataset authority.  These critical elements of what constitutes a method to evaluate fitness for use have been removed.  A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.

 

How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?

 

Bref, I am not sure we are the best people to do this job.  I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data.  I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices and roles and professions, such as librarianship, curatorship, archivists, scientists, etc.  I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets. 

 

And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base.  And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding.  Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though.  It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose.  It does do a decent job of many other aspects in terms of legal framework, accessibility and readability so perhaps it's at least part of the picture.  If we are able to articulate additional qualities they could be added to the tool as well.  We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.
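
[A minimal sketch of the multi-rater averaging scheme suggested above, assuming the ratings are exported from the shared spreadsheet as (dataset, rater, score) rows. The dataset names, rater names and the 0-100 score scale are illustrative assumptions, not the ODUI's actual output format.]

```python
from collections import defaultdict
from statistics import mean


def average_ratings(rows):
    """Group (dataset, rater, score) rows by dataset.

    Returns {dataset: (mean score, number of raters)} so that
    datasets rated by only one person are easy to spot.
    """
    scores = defaultdict(list)
    for dataset, _rater, score in rows:
        scores[dataset].append(score)
    return {name: (mean(vals), len(vals)) for name, vals in scores.items()}


# Hypothetical ratings; real rows would come from the shared spreadsheet.
ratings = [
    ("Dataset A", "rater1", 70),
    ("Dataset A", "rater2", 80),
    ("Dataset B", "rater1", 40),
]

summary = average_ratings(ratings)
for name, (avg, n) in sorted(summary.items()):
    print(f"{name}: mean score {avg:.1f} from {n} rater(s)")
```

[Keeping the rater count next to the average makes it visible which datasets still need a second opinion, which matters if help is limited.]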

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals, as with many StatsCan datasets. It can also arise out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty-gritty details. Since these details are what we are most often after, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.
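
[A small sketch of the methodology point above: the same facility can rank at the top of its own industry while sitting in the middle of the pack overall. The facility names and emission figures are invented for illustration and are not taken from Emitter.ca or any federal dataset; the percentile definition used here (share of values less than or equal to the facility's value) is also an assumption.]

```python
def percentile_rank(value, population):
    """Percentage of values in `population` that are <= `value`."""
    return 100.0 * sum(v <= value for v in population) / len(population)


# Hypothetical emitters: (name, industry, total emissions in arbitrary units)
emitters = [
    ("Bakery A", "bakery", 5),
    ("Bakery B", "bakery", 8),
    ("Bakery C", "bakery", 12),
    ("Plant X", "petrochemical", 900),
    ("Plant Y", "petrochemical", 1500),
    ("Plant Z", "petrochemical", 2100),
]

all_values = [emissions for _, _, emissions in emitters]

# For each facility: (rank within its own industry, rank across all facilities)
ranks = {}
for name, industry, emissions in emitters:
    peers = [e for _, ind, e in emitters if ind == industry]
    ranks[name] = (
        percentile_rank(emissions, peers),
        percentile_rank(emissions, all_values),
    )

for name, (within, overall) in ranks.items():
    print(f"{name}: {within:.0f}th percentile in industry, {overall:.0f}th overall")
```

[In this made-up data the highest-emitting bakery is the "worst" in its industry but ranks below every petrochemical plant overall, which is the distinction a within-industry grade hides.]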

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose is that we go through the data on data.gc.ca, see what is interesting, and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high-value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can do after their work. I was going to propose after work on March 15th in everyone's respective timezone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave





--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury





--
Tracey P. Lauriault
613-234-2805
 
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
 
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)

Re: data.gc.ca usage

David Eaves
I have no idea if they'll post them. But you can ask.

You may also receive! A colleague requested these early and sent them to me at some point today. Haven't had a chance to look at them all that much except to note that Reunion has a LOT of downloads.

Here they are for two months (all I have); I just had to get them out of a chart in a PowerPoint.

December
12/02/11-12/08/11 to 12/23/11-12/29/11

Total visits for the month of December: 59,396

January
12/30/11-01/05/12 to 01/27/12-02/02/12

United States | 19638
Canada | 10143
Reunion | 9392
Germany | 7248
Australia | 4252
Europe (West) | 3679
Netherlands | 2941
Hong Kong | 2917
Japan | 1975
Italy | 1833

Total visits for the month of January: 93,242
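
[A rough sketch of the Canadian vs. foreign split asked about above, computed from the January figures. It assumes the per-country numbers are visits (not downloads) and that they are a subset of the January total, neither of which the chart states explicitly.]

```python
# January figures as listed above; assumed to be visits per country.
january_visits = {
    "United States": 19638,
    "Canada": 10143,
    "Reunion": 9392,
    "Germany": 7248,
    "Australia": 4252,
    "Europe (West)": 3679,
    "Netherlands": 2941,
    "Hong Kong": 2917,
    "Japan": 1975,
    "Italy": 1833,
}
total_visits = 93242  # "Total visits for the month of January"


def share(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


top_ten = sum(january_visits.values())
unlisted = total_visits - top_ten  # visits not attributed to a listed country

for country, count in january_visits.items():
    print(f"{country}: {share(count, total_visits):.1f}%")
print(f"Not listed: {share(unlisted, total_visits):.1f}%")
```

[Under those assumptions Canada accounts for roughly a tenth of January visits, and about a third of visits are not attributed to any of the ten listed countries.]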

Hope to blog on this shortly.

d

On 12-03-07 3:33 PM, James McKinney wrote:
Is there a possibility TBS will post these tables on its website? I'd be really interested to know how many downloads are from Canadian vs. foreign IPs. It'd also be great to know what people are doing with these datasets.

On 2012-03-07, at 6:23 PM, David Eaves wrote:

Some of you may find this deck of interest.

Of particular interest were the most downloaded data set numbers

Since Launch:

DATASET | DEPARTMENT | DOWNLOADS
Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 4730
Permanent Resident Summary by Mission (English) | Citizenship and Immigration Canada | 1733
Overseas Permanent Resident Inventory (English) | Citizenship and Immigration Canada | 1558
Canada – Permanent residents by category (English) | Citizenship and Immigration Canada | 1261
Permanent Resident Applicants Awaiting a Decision (English) | Citizenship and Immigration Canada | 873
Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 852
Meteorological Service of Canada (MSC) – Weather Element Forecasts | Environment Canada | 851
Permanent Resident Visa Applications Received Abroad - English Version | Citizenship and Immigration Canada | 800
Water Quality Indicators - Reports, Maps, Charts and Data | Environment Canada | 697
Canada - Permanent and Temporary Residents - English version | Citizenship and Immigration Canada | 625


Last 30 days:

# | DATASET | DEPARTMENT | DOWNLOADS
1 | Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 481
2 | Sales of commodities of large retailers - English version | Statistics Canada | 247
3 | Permanent Resident Summary by Mission - English Version | Citizenship and Immigration Canada | 207
4 | CIC Operational Network at a Glance - English Version | Citizenship and Immigration Canada | 163
5 | Gross domestic product at basic prices, communications, transportation and trade - English version | Statistics Canada | 159
6 | Anthropogenic disturbance footprint within boreal caribou ranges across Canada - As interpreted from 2008-2010 Landsat satellite imagery | Environment Canada | 102
7 | Canada - Permanent residents by category - English version | Citizenship and Immigration Canada | 98
8 | Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 61
9 | Sales of fuel used for road motor vehicles, by province and territory - English version | Statistics Canada | 52
10 | Government of Canada Core Subject Thesaurus - English Version | Library and Archives Canada | 51


Anyways, thought this might be interesting.

Dave

On 12-03-06 11:37 AM, James McKinney wrote:
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

With respect to CLI, no one disagrees that it's important, valuable, etc. as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.

James

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:

Appraising the value of a dataset is a very tricky proposition.

 

For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?

 

The CLI, for example, fueled the invention of GIS (a Canadian invention), and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.

 

These data now form the base of so many other important data sets and maps, for example, they were used to inform the creation of ecological regions and territories which changed the way Canadian land was envisioned, and it inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada after all, remains a resource based extraction economy, with lots of 'unpeopled places' even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis which as you are aware was the first government portal to deliver free data and it did so with ground-breaking licenses.

 

The CLI is also, according to some NRCan friends, one of the most popular data sets and the most often downloaded.  This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps, and it was a game changer technologically and information-wise worldwide.

 

So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  1 dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then at least a decade to save.  A dataset that cost millions.  A dataset which described the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.

 

GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file for instance, not only costs millions to produce and update, but it is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government and that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?

 

These geodata, like statistical data are part of well-established disciplines, and are foundational Canadian Institutions, 1842 for NRCan and 1666 for the census or officially 1871.  They have their communities and they have a context.

 

The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!

 

The TBS portal also lacks curatorship and a well-developed mechanism to discover the data. Furthermore, the data are stripped of context. I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal. The TBS portal does not deliver data in a way that lets me trust them. There is very little metadata. There are no methodological documents to explain the data fields or how the data were collected, nothing to help assess the quality, reliability and accuracy of the data, no data dictionaries, and no provenance, lineage or dataset authority. These critical elements of what constitutes a method to evaluate fitness for use have been removed. A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.

 

How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?

 

In short, I am not sure we are the best people to do this job. I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data. I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices, roles and professions, such as librarians, curators, archivists and scientists. I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets.

 

And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base. And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding. Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though. It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose. It does do a decent job of many other aspects in terms of legal framework, accessibility and readability, so perhaps it's at least part of the picture. If we are able to articulate additional qualities they could be added to the tool as well. We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.
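
The "more than one person rates every dataset and then we average" scheme could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the dataset names, the rater names and the 0-100 score scale are made up, and this is not how the ODUI tool itself works. It also keeps the number of raters next to each average, since an average of one rating deserves less weight than an average of several.

```python
from collections import defaultdict
from statistics import mean


def average_ratings(ratings):
    """Average the scores given to each dataset.

    ratings: iterable of (dataset, rater, score) rows, e.g. exported
    from a shared spreadsheet (hypothetical layout).
    Returns {dataset: (average_score, number_of_ratings)}.
    """
    by_dataset = defaultdict(list)
    for dataset, _rater, score in ratings:
        by_dataset[dataset].append(score)
    return {name: (mean(scores), len(scores)) for name, scores in by_dataset.items()}


# Made-up example rows: two people rated "Dataset A", one rated "Dataset B".
ratings = [
    ("Dataset A", "rater1", 80),
    ("Dataset A", "rater2", 60),
    ("Dataset B", "rater1", 40),
]

summary = average_ratings(ratings)
for name, (avg, count) in sorted(summary.items()):
    print(f"{name}: average {avg} from {count} rating(s)")
```

Datasets with only one rating could then be flagged for a second pass, depending on how much help is available.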

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals as with many StatsCan datasets. It can also be out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty gritty details. Since these details are what we are looking after most often, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.
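
To make the methodology point concrete, here is a small Python sketch of how a percentile rank within an industry can diverge from a rank across all facilities. The facility names and emission figures are invented, and the ranking function is only one plausible definition of percentile rank; it is not a description of how Emitter.ca actually computes its grades.

```python
def percentile_rank(value, population):
    """Percentage of values in population that are less than or equal to value."""
    return 100.0 * sum(1 for v in population if v <= value) / len(population)


# Invented figures: (facility, industry, total emissions in arbitrary units).
facilities = [
    ("Bakery X", "bakery", 12.0),
    ("Bakery Y", "bakery", 3.0),
    ("Bakery Z", "bakery", 5.0),
    ("Refinery P", "petrochemical", 9000.0),
    ("Refinery Q", "petrochemical", 15000.0),
    ("Refinery R", "petrochemical", 20000.0),
]

all_emissions = [emissions for _, _, emissions in facilities]


def ranks(name):
    """Return (rank within own industry, rank across all facilities)."""
    _, industry, emissions = next(f for f in facilities if f[0] == name)
    peers = [e for _, ind, e in facilities if ind == industry]
    return percentile_rank(emissions, peers), percentile_rank(emissions, all_emissions)


for name in ("Bakery X", "Refinery P"):
    within, overall = ranks(name)
    print(f"{name}: {within:.0f}th percentile in its industry, {overall:.0f}th overall")
```

In this made-up data the worst bakery sits at the top of its own industry while emitting a tiny fraction of what the cleanest refinery does, which is the trap described above.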

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose was we go through the data on data.gc.ca and see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even full day, but a couple of hours that people can do after their work. I was going to propose afterwork on March 15th in everyone's respective timezone as this is the one year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave





--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury


_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss



--
Tracey P. Lauriault
613-234-2805
 
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
 
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)

Re: data.gc.ca usage

Gerry Tychon
In reply to this post by James McKinney-2
A note on CLI -- since I have some involvement in this -- for those that may be interested.
Generating the CLI ratings was a major effort, and the ratings ended up being incorporated into many different areas, but they were deficient in several respects:
  • lack of consideration for climate
  • not all soils taken into account (e.g., organic)
  • ratings were determined subjectively - no repeatable objective process
  • no clear or readily available documents/explanation on how ratings are determined
  • ratings were not consistent over large geographic areas (e.g., nationally).

I have done work on LSRS -- the Land Suitability Rating System -- meant to replace CLI ratings. For those interested some information can be found at:

http://xspatial.com/node/6

... gerry tychon


On Wed, Mar 7, 2012 at 4:33 PM, James McKinney <[hidden email]> wrote:
Is there a possibility TBS will post these tables on its website? I'd be really interested to know how many downloads are from Canadian vs. foreign IPs. It'd also be great to know what people are doing with these datasets.

On 2012-03-07, at 6:23 PM, David Eaves wrote:

Some of you may find this deck of interest.

Of particular interest were the most downloaded data set numbers

Since Launch:

DATASET | DEPARTMENT | DOWNLOADS
Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 4730
Permanent Resident Summary by Mission (English) | Citizenship and Immigration Canada | 1733
Overseas Permanent Resident Inventory (English) | Citizenship and Immigration Canada | 1558
Canada – Permanent residents by category (English) | Citizenship and Immigration Canada | 1261
Permanent Resident Applicants Awaiting a Decision (English) | Citizenship and Immigration Canada | 873
Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 852
Meteorological Service of Canada (MSC) – Weather Element Forecasts | Environment Canada | 851
Permanent Resident Visa Applications Received Abroad - English Version | Citizenship and Immigration Canada | 800
Water Quality Indicators - Reports, Maps, Charts and Data | Environment Canada | 697
Canada - Permanent and Temporary Residents - English version | Citizenship and Immigration Canada | 625

Last 30 days:

# | DATASET | DEPARTMENT | DOWNLOADS
1 | Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 481
2 | Sales of commodities of large retailers - English version | Statistics Canada | 247
3 | Permanent Resident Summary by Mission - English Version | Citizenship and Immigration Canada | 207
4 | CIC Operational Network at a Glance - English Version | Citizenship and Immigration Canada | 163
5 | Gross domestic product at basic prices, communications, transportation and trade - English version | Statistics Canada | 159
6 | Anthropogenic disturbance footprint within boreal caribou ranges across Canada - As interpreted from 2008-2010 Landsat satellite imagery | Environment Canada | 102
7 | Canada - Permanent residents by category - English version | Citizenship and Immigration Canada | 98
8 | Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 61
9 | Sales of fuel used for road motor vehicles, by province and territory - English version | Statistics Canada | 52
10 | Government of Canada Core Subject Thesaurus - English Version | Library and Archives Canada | 51


Anyways, thought this might be interesting.

Dave

On 12-03-06 11:37 AM, James McKinney wrote:
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

With respect to CLI, no one disagrees that it's important, valuable, etc., as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and most downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.

James

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:

Appraising the value of a dataset is a very tricky proposition.

 

For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?

 

The CLI, for example, fueled the invention of GIS (a Canadian invention), and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization. The project was eventually shelved by 'management' and forgotten, and the software became obsolete. Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data. Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.

 

These data now form the base of so many other important data sets and maps, for example, they were used to inform the creation of ecological regions and territories which changed the way Canadian land was envisioned, and it inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada after all, remains a resource based extraction economy, with lots of 'unpeopled places' even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis which as you are aware was the first government portal to deliver free data and it did so with ground-breaking licenses.

 

The CLI is also, according to some NRCan friends, one of the most popular data sets and the most often downloaded. This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on. It is also invaluable as it forms the base of other maps, and it was a game changer worldwide, both technologically and information-wise.

 

So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140? A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc. 1 dataset that took decades to create, was built by hundreds of scientists collaborating to do so, and then took at least a decade to save. A dataset that cost millions. A dataset which described the platform upon which we live. Also, a national historical artifact. It is not a NEW dataset, but it remains a critically important dataset.

 

GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file for instance, not only costs millions to produce and update, but it is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government and that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?

 

These geodata, like statistical data are part of well-established disciplines, and are foundational Canadian Institutions, 1842 for NRCan and 1666 for the census or officially 1871.  They have their communities and they have a context.

 

The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!

 

The TBS portal also lacks curatorship and a well-developed mechanism to discover the data. Furthermore, the data are stripped of context. I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal. The TBS portal does not deliver data in a way that lets me trust them. There is very little metadata. There are no methodological documents to explain the data fields or how the data were collected, nothing to help assess the quality, reliability and accuracy of the data, no data dictionaries, and no provenance, lineage or dataset authority. These critical elements of what constitutes a method to evaluate fitness for use have been removed. A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.

 

How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?

 

In short, I am not sure we are the best people to do this job. I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data. I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices, roles and professions, such as librarians, curators, archivists and scientists. I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets.

 

And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base. And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding. Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though. It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose. It does do a decent job of many other aspects in terms of legal framework, accessibility and readability, so perhaps it's at least part of the picture. If we are able to articulate additional qualities they could be added to the tool as well. We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals as with many StatsCan datasets. It can also be out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty gritty details. Since these details are what we are looking after most often, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

The spreadsheet lists 9556 non-geographic datasets; halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.
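
To make the methodology point concrete, here is a small sketch comparing a percentile rank computed within an industry against one computed across all facilities. It only illustrates the description above ("high percentile rank for total emissions within a specific industry"); it is not Emitter.ca's actual code, and the facility names, industries and emission figures are hypothetical:

```python
def percentile_rank(value, population):
    """Percentage of values in `population` that are <= `value`."""
    return 100 * sum(v <= value for v in population) / len(population)

# Hypothetical facilities: (name, industry, total emissions in arbitrary units).
facilities = [
    ("bakery-1", "bakery", 1),
    ("bakery-2", "bakery", 2),
    ("bakery-3", "bakery", 5),
    ("plant-1", "petrochemical", 400),
    ("plant-2", "petrochemical", 900),
]

all_emissions = [e for _, _, e in facilities]

def ranks(name):
    """Return (rank within the facility's own industry, rank across all facilities)."""
    _, industry, emissions = next(f for f in facilities if f[0] == name)
    same_industry = [e for _, i, e in facilities if i == industry]
    return (
        percentile_rank(emissions, same_industry),
        percentile_rank(emissions, all_emissions),
    )

for name in ("bakery-3", "plant-1"):
    within, overall = ranks(name)
    print(name, "within industry:", within, "overall:", overall)
```

With these made-up numbers the top-ranked bakery scores higher within its industry than the cleaner of the two plants does within its own, even though the plant emits far more in absolute terms.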

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't find a single dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose is that we go through the data on data.gc.ca, see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can do after their work. I was going to propose after work on March 15th in everyone's respective timezone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave





--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury


_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss



--
Tracey P. Lauriault
613-234-2805
 
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
 
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)


Re: data.gc.ca usage

James McKinney
In reply to this post by David Eaves
I wonder if France is being misclassified as Reunion. I can understand if France is requesting a lot of immigration (and other) datasets.

On 2012-03-07, at 6:44 PM, David Eaves wrote:

I have no idea if they'll post them. But you can ask.

You may also receive! A colleague requested these early and sent them to me at some point today. Haven't had a chance to look at them all that much except to note that Reunion has a LOT of downloads.

Here they are for two months (all I have); I just had to get them out of a chart in a PowerPoint.

December
12/02/11-12/08/11 to 12/23/11-12/29/11
<gdjaccag.png>
Total visits for the month of December: 59,396

January
12/30/11-01/05/12 to 01/27/11-02/02/12
United States    19638
Canada           10143
Reunion           9392
Germany           7248
Australia         4252
Europe (West)     3679
Netherlands       2941
Hong Kong         2917
Japan             1975
Italy             1833
Total visits for the month of January: 93,242
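
As a rough sketch of the Canadian vs. foreign split for January, the figures above can be turned into shares of the monthly total. This assumes the per-country figures are counted in the same unit as the monthly total (the message refers to both downloads and visits), and that everything other than Canada, including the "Europe (West)" bucket, counts as foreign; the variable and function names are illustrative only:

```python
# Figures copied from the January numbers above.
january_total = 93242
january_by_country = {
    "United States": 19638,
    "Canada": 10143,
    "Reunion": 9392,
    "Germany": 7248,
    "Australia": 4252,
    "Europe (West)": 3679,
    "Netherlands": 2941,
    "Hong Kong": 2917,
    "Japan": 1975,
    "Italy": 1833,
}

def share(count, total):
    """Return `count` as a percentage of `total`, rounded to one decimal place."""
    return round(100 * count / total, 1)

listed = sum(january_by_country.values())
canada = january_by_country["Canada"]
foreign_listed = listed - canada      # the nine non-Canadian entries
unlisted = january_total - listed     # traffic not broken out by country

print("Canada:", share(canada, january_total), "%")
print("Listed foreign:", share(foreign_listed, january_total), "%")
print("Not broken out:", share(unlisted, january_total), "%")
```

The remainder that is not broken out by country cannot be attributed either way from these figures alone.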

Hope to blog on this shortly.

d

On 12-03-07 3:33 PM, James McKinney wrote:
Is there a possibility TBS will post these tables on its website? I'd be really interested to know how many downloads are from Canadian vs. foreign IPs. It'd also be great to know what people are doing with these datasets.

On 2012-03-07, at 6:23 PM, David Eaves wrote:

Some of you may find this deck of interest.

Of particular interest were the most downloaded data set numbers

Since Launch:

DATASET | DEPARTMENT | DOWNLOADS
Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 4730
Permanent Resident Summary by Mission (English) | Citizenship and Immigration Canada | 1733
Overseas Permanent Resident Inventory (English) | Citizenship and Immigration Canada | 1558
Canada – Permanent residents by category (English) | Citizenship and Immigration Canada | 1261
Permanent Resident Applicants Awaiting a Decision (English) | Citizenship and Immigration Canada | 873
Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 852
Meteorological Service of Canada (MSC) – Weather Element Forecasts | Environment Canada | 851
Permanent Resident Visa Applications Received Abroad - English Version | Citizenship and Immigration Canada | 800
Water Quality Indicators - Reports, Maps, Charts and Data | Environment Canada | 697
Canada - Permanent and Temporary Residents - English version | Citizenship and Immigration Canada | 625


Last 30 days:

# | DATASET | DEPARTMENT | DOWNLOADS
1 | Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 481
2 | Sales of commodities of large retailers - English version | Statistics Canada | 247
3 | Permanent Resident Summary by Mission - English Version | Citizenship and Immigration Canada | 207
4 | CIC Operational Network at a Glance - English Version | Citizenship and Immigration Canada | 163
5 | Gross domestic product at basic prices, communications, transportation and trade - English version | Statistics Canada | 159
6 | Anthropogenic disturbance footprint within boreal caribou ranges across Canada - As interpreted from 2008-2010 Landsat satellite imagery | Environment Canada | 102
7 | Canada - Permanent residents by category - English version | Citizenship and Immigration Canada | 98
8 | Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 61
9 | Sales of fuel used for road motor vehicles, by province and territory - English version | Statistics Canada | 52
10 | Government of Canada Core Subject Thesaurus - English Version | Library and Archives Canada | 51


Anyways, thought this might be interesting.

Dave

On 12-03-06 11:37 AM, James McKinney wrote:
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

With respect to the CLI, no one disagrees that it's important, valuable, etc., as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and most downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.

James

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:

Appraising the value of a dataset is a very tricky proposition.

 

For example, how would the Canada Land Inventory (CLI) be evaluated as 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?

 

The CLI, for example, fueled the invention of GIS (a Canadian invention), and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data from those tapes and to strip them of obsolete code.

 

These data now form the base of so many other important data sets and maps. For example, they were used to inform the creation of ecological regions and territories, which changed the way Canadian land was envisioned, inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada, after all, remains a resource-based extraction economy, with lots of 'unpeopled places', even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis, which, as you are aware, was the first government portal to deliver free data, and it did so with ground-breaking licenses.

 

The CLI is also, according to some NRCan friends, one of the most popular data sets and the most often downloaded.  This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps, and it was a game changer worldwide, both technologically and information-wise.

 

So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  One dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then at least a decade to save.  A dataset that cost millions.  A dataset which described the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.

 

GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file, for instance, not only costs millions to produce and update, but is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord between levels of government that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?

 

These geodata, like statistical data, are part of well-established disciplines and are foundational Canadian institutions: 1842 for NRCan and 1666 for the census (or, officially, 1871).  They have their communities and they have a context.

 

The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!

 

The TBS portal also lacks curatorship and a well-developed mechanism to discover the data.  Furthermore, the data are stripped of context.  I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal.  The TBS portal does not deliver data in a way that lets me trust them.  There is very little metadata; there are no methodological documents to explain the data fields or how the data were collected; nothing is provided to help assess the quality, reliability and accuracy of the data; there are no data dictionaries; and there is no provenance, lineage or dataset authority.  These critical elements of what constitutes a method to evaluate fitness for use have been removed.  A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.

 

How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?

 

In short, I am not sure we are the best people to do this job.  I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data.  I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices, roles and professions, such as librarians, curators, archivists, scientists, etc.  I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets.

 

And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base.  And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding.  Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though.  It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as wasn't really it's purpose.  It does do a decent job of many other aspects in terms of legal framework, accessibility and readability so perhaps it's at least part of the picture.  If we are able to articulate additional qualities they could be added to the tool as well.  We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals as with many StatsCan datasets. It can also be out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty gritty details. Since these details are what we are looking after most often, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
<a moz-do-not-send="true" href="tel:%2B1.514.247.0223" target="_blank" value="+15142470223">+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose was we go through the data on data.gc.ca and see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even full day, but a couple of hours that people can do after their work. I was going to propose afterwork on March 15th in everyone's respective timezone as this is the one year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave





--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury


_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss



--
Tracey P. Lauriault
613-234-2805
 
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
 
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)


Re: data.gc.ca usage

michael gurstein
I believe that Reunion has become a bit of a call centre hub for Francophone countries (France, and Quebec?)...
 
M
-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of James McKinney
Sent: Wednesday, March 07, 2012 5:32 PM
To: [hidden email]
Subject: Re: [CivicAccess-discuss] data.gc.ca usage

I wonder if France is being misclassified as Reunion. I can understand if France is requesting a lot of immigration (and other) datasets.

On 2012-03-07, at 6:44 PM, David Eaves wrote:

I have no idea if they'll post them. But you can ask.

You may also receive! A colleague requested these early and sent them to me at some point today. Haven't had a chance to look at them all that much except to note that Reunion has a LOT of downloads.

Here they are for two months (all I have); I just had to get them out of a chart in a PowerPoint.

December
12/02/11-12/08/11 to 12/23/11-12/29/11
<gdjaccag.png>
Total visits for the month of December: 59,396

January
12/30/11-01/05/12 to 01/27/12-02/02/12
United States    19638
Canada           10143
Reunion           9392
Germany           7248
Australia         4252
Europe (West)     3679
Netherlands       2941
Hong Kong         2917
Japan             1975
Italy             1833
Total visits for the month of January: 93,242
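
To put the January figures quoted above in proportion, here is a minimal sketch that computes each listed region's share of the month's total. It assumes the per-region figures are counted in the same unit as the 93,242 monthly total, which the excerpt does not confirm; the variable names are illustrative only.

```python
# January figures as quoted in the message. Assumption: the per-region
# counts use the same unit as the monthly total (not stated explicitly).
january = {
    "United States": 19638,
    "Canada": 10143,
    "Reunion": 9392,
    "Germany": 7248,
    "Australia": 4252,
    "Europe (West)": 3679,
    "Netherlands": 2941,
    "Hong Kong": 2917,
    "Japan": 1975,
    "Italy": 1833,
}
january_total = 93242

# Share of the monthly total for each listed region, in percent.
shares = {region: 100.0 * count / january_total for region, count in january.items()}

# Whatever is not attributed to the ten listed regions.
unlisted = january_total - sum(january.values())

for region, share in shares.items():
    print(f"{region:<15}{share:5.1f}%")
print(f"{'Other':<15}{100.0 * unlisted / january_total:5.1f}%")
```

On these numbers Reunion accounts for roughly a tenth of the January total, close to Canada's share.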

Hope to blog on this shortly.

d

On 12-03-07 3:33 PM, James McKinney wrote:
Is there a possibility TBS will post these tables on its website? I'd be really interested to know how many downloads are from Canadian vs. foreign IPs. It'd also be great to know what people are doing with these datasets.

On 2012-03-07, at 6:23 PM, David Eaves wrote:

Some of you may find this deck of interest.

Of particular interest were the most downloaded data set numbers

Since Launch:

DATASET | DEPARTMENT | DOWNLOADS
Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 4730
Permanent Resident Summary by Mission (English) | Citizenship and Immigration Canada | 1733
Overseas Permanent Resident Inventory (English) | Citizenship and Immigration Canada | 1558
Canada – Permanent residents by category (English) | Citizenship and Immigration Canada | 1261
Permanent Resident Applicants Awaiting a Decision (English) | Citizenship and Immigration Canada | 873
Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 852
Meteorological Service of Canada (MSC) – Weather Element Forecasts | Environment Canada | 851
Permanent Resident Visa Applications Received Abroad - English Version | Citizenship and Immigration Canada | 800
Water Quality Indicators - Reports, Maps, Charts and Data | Environment Canada | 697
Canada - Permanent and Temporary Residents - English version | Citizenship and Immigration Canada | 625


Last 30 days:

# | DATASET | DEPARTMENT | DOWNLOADS
1 | Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 481
2 | Sales of commodities of large retailers - English version | Statistics Canada | 247
3 | Permanent Resident Summary by Mission - English Version | Citizenship and Immigration Canada | 207
4 | CIC Operational Network at a Glance - English Version | Citizenship and Immigration Canada | 163
5 | Gross domestic product at basic prices, communications, transportation and trade - English version | Statistics Canada | 159
6 | Anthropogenic disturbance footprint within boreal caribou ranges across Canada - As interpreted from 2008-2010 Landsat satellite imagery | Environment Canada | 102
7 | Canada - Permanent residents by category - English version | Citizenship and Immigration Canada | 98
8 | Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 61
9 | Sales of fuel used for road motor vehicles, by province and territory - English version | Statistics Canada | 52
10 | Government of Canada Core Subject Thesaurus - English Version | Library and Archives Canada | 51


Anyways, thought this might be interesting.

Dave

On 12-03-06 11:37 AM, James McKinney wrote:
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

With respect to CLI, no one disagrees that it's important, valuable, etc., as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.

James

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:

Appraising the value of a dataset is a very tricky proposition.

For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?

The CLI, for example, fueled the invention of GIS (a Canadian invention), and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.

These data now form the base of so many other important data sets and maps. For example, they were used to inform the creation of ecological regions and territories, which changed the way Canadian land was envisioned; this inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada, after all, remains a resource-based extraction economy, with lots of 'unpeopled places', even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis, which, as you are aware, was the first government portal to deliver free data, and it did so with ground-breaking licenses.

The CLI is also, according to some NRCan friends, one of the most popular data sets and the most often downloaded.  This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps, and it was a game changer technologically and information-wise worldwide.

So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  1 dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then at least a decade to save.  A dataset that cost millions.  A dataset which described the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.

GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file, for instance, not only costs millions to produce and update, but is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?

These geodata, like statistical data, are part of well-established disciplines and are foundational Canadian institutions: 1842 for NRCan and 1666 for the census (or, officially, 1871).  They have their communities and they have a context.

The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!

The TBS portal also lacks curatorship and a well-developed mechanism to discover the data.  Furthermore, the data are stripped of context.  I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal.  The TBS portal does not deliver data in a way that lets me trust them.  There is very little metadata; there are no methodological documents to explain the data fields or how the data were collected; nothing is available to help assess the quality, reliability and accuracy of the data; there are no data dictionaries; there is no provenance, lineage or dataset authority.  These critical elements of what constitutes a method to evaluate fitness for use have been removed.  A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.

How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?

In short, I am not sure we are the best people to do this job.  I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data.  I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices and roles and professions, such as librarianship, curatorship, archivists, scientists, etc.  I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets.

And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base.  And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding.  Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though.  It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose.  It does do a decent job of many other aspects in terms of legal framework, accessibility and readability so perhaps it's at least part of the picture.  If we are able to articulate additional qualities they could be added to the tool as well.  We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.
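
The averaging scheme suggested above could look something like this. It is only a sketch: the dataset names, rater names, the 0-100 scale and the `min_raters` threshold are all assumptions made for illustration, not part of the ODUI.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings exported from a shared spreadsheet:
# (dataset, rater, score). All names and scores are invented.
ratings = [
    ("dataset-a", "rater-1", 80),
    ("dataset-a", "rater-2", 60),
    ("dataset-b", "rater-1", 40),
    ("dataset-b", "rater-3", 50),
    ("dataset-c", "rater-2", 90),
]

def average_scores(rows, min_raters=2):
    """Average the scores per dataset.

    Datasets rated by fewer than min_raters people are not averaged;
    they are returned separately so more raters can be assigned.
    """
    by_dataset = defaultdict(list)
    for dataset, _rater, score in rows:
        by_dataset[dataset].append(score)
    averaged = {d: mean(s) for d, s in by_dataset.items() if len(s) >= min_raters}
    needs_more = sorted(d for d, s in by_dataset.items() if len(s) < min_raters)
    return averaged, needs_more

averaged, needs_more = average_scores(ratings)
print(averaged)
print(needs_more)
```

Keeping the under-rated datasets in a separate list makes it easy to see where extra volunteers are needed, which matters if help is limited.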

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals, as with many StatsCan datasets. It can also arise out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty gritty details. Since these details are what we are most often after, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.
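
The distinction drawn above, between ranking a polluter against its own industry and ranking it against everyone, can be made concrete with a small sketch. The facilities and emission figures are invented, and the `percentile_rank` helper is a hypothetical definition (share of values at or below the given one); this is not a description of how Emitter.ca actually computes its grades.

```python
# Hypothetical facilities: (name, industry, total emissions in tonnes).
# All values are made up for illustration.
facilities = [
    ("Bakery A", "bakery", 5.0),
    ("Bakery B", "bakery", 2.0),
    ("Bakery C", "bakery", 1.0),
    ("Refinery X", "petrochemical", 900.0),
    ("Refinery Y", "petrochemical", 500.0),
]

def percentile_rank(value, population):
    """Percentage of values in population that are <= value."""
    return 100.0 * sum(1 for v in population if v <= value) / len(population)

def rank_facilities(rows):
    """Compute each facility's rank within its industry and across all facilities."""
    all_emissions = [e for _, _, e in rows]
    result = {}
    for name, industry, emissions in rows:
        peers = [e for _, ind, e in rows if ind == industry]
        result[name] = {
            "within_industry": percentile_rank(emissions, peers),
            "overall": percentile_rank(emissions, all_emissions),
        }
    return result

ranks = rank_facilities(facilities)
for name, r in ranks.items():
    print(f"{name:<11} within industry: {r['within_industry']:5.1f}  overall: {r['overall']:5.1f}")
```

In this toy data the largest bakery tops its own industry while sitting below both refineries overall, and neither ranking says anything about whether the air is actually unhealthy to breathe.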

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth





Re: data.gc.ca usage

James McKinney-2
In reply to this post by James McKinney-2
Published two days ago: Data.gc.ca Portal Catalogue!

http://www.data.gc.ca/default.asp?lang=En&n=5175A6F0-1&xsl=datacataloguerecord&xml=5175A6F0-61E1-49FC-8E5D-0BBCDAF5969D&formid=C4C5C7F1-BFA6-4FF6-B4A0-C164CB2060F7&showfromadmin=1&readonly=true

It seems to have more records than the file David posted to BuzzData, so I think we should refer to this new one, instead. For whatever reason, the file doesn't list itself as being in the catalogue.

On 2012-03-07, at 6:33 PM, James McKinney wrote:

> Is there a possibility TBS will post these tables on its website? I'd be really interested to know how many downloads are from Canadian vs. foreign IPs. It'd also be great to know what people are doing with these datasets.
>
> On 2012-03-07, at 6:23 PM, David Eaves wrote:
>
>> Some of you may find this deck of interest.
>>
>> Of particular interest were the most downloaded data set numbers
>>
>> Since Launch:
>>
>> DATASET | DEPARTMENT | DOWNLOADS
>> Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 4730
>> Permanent Resident Summary by Mission (English) | Citizenship and Immigration Canada | 1733
>> Overseas Permanent Resident Inventory (English) | Citizenship and Immigration Canada | 1558
>> Canada – Permanent residents by category (English) | Citizenship and Immigration Canada | 1261
>> Permanent Resident Applicants Awaiting a Decision (English) | Citizenship and Immigration Canada | 873
>> Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 852
>> Meteorological Service of Canada (MSC) – Weather Element Forecasts | Environment Canada | 851
>> Permanent Resident Visa Applications Received Abroad - English Version | Citizenship and Immigration Canada | 800
>> Water Quality Indicators - Reports, Maps, Charts and Data | Environment Canada | 697
>> Canada - Permanent and Temporary Residents - English version | Citizenship and Immigration Canada | 625
>>
>> Last 30 days:
>>
>> # | DATASET | DEPARTMENT | DOWNLOADS
>> 1 | Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 481
>> 2 | Sales of commodities of large retailers - English version | Statistics Canada | 247
>> 3 | Permanent Resident Summary by Mission - English Version | Citizenship and Immigration Canada | 207
>> 4 | CIC Operational Network at a Glance - English Version | Citizenship and Immigration Canada | 163
>> 5 | Gross domestic product at basic prices, communications, transportation and trade - English version | Statistics Canada | 159
>> 6 | Anthropogenic disturbance footprint within boreal caribou ranges across Canada - As interpreted from 2008-2010 Landsat satellite imagery | Environment Canada | 102
>> 7 | Canada - Permanent residents by category - English version | Citizenship and Immigration Canada | 98
>> 8 | Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 61
>> 9 | Sales of fuel used for road motor vehicles, by province and territory - English version | Statistics Canada | 52
>> 10 | Government of Canada Core Subject Thesaurus - English Version | Library and Archives Canada | 51
>>
>> Anyways, thought this might be interesting.
>>
>> Dave
>>
>> On 12-03-06 11:37 AM, James McKinney wrote:
>>> Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.
>>>
>>> I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.
>>>
>>> Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.
>>>
>>> Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.
>>>
>>> With respect to CLI, no one disagrees that it's important, valuable, etc., as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.
>>>
>>> We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.
>>>
>>> James
>>>
>>> On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:
>>>
>>>> Appraising the value of a dataset is a very tricky proposition.
>>>>  
>>>> For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?
>>>>  
>>>> The CLI, for example, fueled the invention of GIS (a Canadian invention) and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.
>>>>  
>>>> These data now form the base of so many other important data sets and maps. For example, they were used to inform the creation of ecological regions and territories, which changed the way Canadian land was envisioned, inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada, after all, remains a resource-based extraction economy, with lots of 'unpeopled places', even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis, which as you are aware was the first government portal to deliver free data, and it did so with ground-breaking licenses.
>>>>  
>>>> The CLI is also, according to some NRCan friends, one of the most popular data sets and the most often downloaded.  This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps, and it was a game changer technologically and information-wise worldwide.
>>>>  
>>>> So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  1 dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then at least a decade to save.  A dataset that cost millions.  A dataset which described the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.
>>>>  
>>>> GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file for instance, not only costs millions to produce and update, but it is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government and that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?
>>>>  
>>>> These geodata, like statistical data are part of well-established disciplines, and are foundational Canadian Institutions, 1842 for NRCan and 1666 for the census or officially 1871.  They have their communities and they have a context.
>>>>  
>>>> The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!
>>>>  
>>>> The TBS portal also lacks curatorship and a well-developed mechanism to discover the data.  Furthermore, the data are stripped of context.  I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal.  The TBS portal does not deliver data in a way that I can trust them.  There is very little metadata; there are no methodological documents to explain the data fields, how the data were collected, or what is available to help assess the quality, reliability and accuracy of the data; there are no data dictionaries; and there is no provenance, lineage or dataset authority.  These critical elements of what constitutes a method to evaluate fit for use have been removed.  A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.
>>>>  
>>>> How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?
>>>>  
>>>> Bref, I am not sure we are the best people to do this job.  I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data.  I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices and roles and professions, such as librarianship, curatorship, archivists, scientists, etc.  I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets.
>>>>  
>>>> And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base.  And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding.  Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.
>>>>
>>>> Cheers
>>>>
>>>> t
>>>>
>>>>  
>>>> On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
>>>> (Removing names that I know are on these two lists. Sorry for cross-post.)
>>>>
>>>> Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures.
>>>>
>>>> By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.
>>>>
>>>> In reply to Andrew:
>>>>
>>>> I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.
>>>>
>>>> I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.
>>>>
>>>> Cheers,
>>>>
>>>> James
>>>>
>>>> On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:
>>>>
>>>>> +1 for the great responses thus far
>>>>>
>>>>> I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.
>>>>>
>>>>> I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.
>>>>>
>>>>> The ODUI only addresses usability though.  It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose.  It does do a decent job of many other aspects in terms of legal framework, accessibility and readability so perhaps it's at least part of the picture.  If we are able to articulate additional qualities they could be added to the tool as well.  We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.
>>>>>
>>>>> Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.
>>>>>
>>>>> H
>>>>>
>>>>> On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
>>>>> James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests.
>>>>>
>>>>> The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals as with many StatsCan datasets. It can also be out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty gritty details. Since these details are what we are looking after most often, we could really use a way to communicate this.
>>>>>
>>>>> I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.
>>>>>
>>>>> Cheers,
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
>>>>> I think this conversation should be on OpenDataBC and CivicAccess.
>>>>>
>>>>> I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.
>>>>>
>>>>> Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:
>>>>>
>>>>> 1) Data has low precision
>>>>>
>>>>> Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.
>>>>>
>>>>> Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.
>>>>>
>>>>> 2) Data has little value on its own
>>>>>
>>>>> Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.
>>>>>
>>>>> Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.
>>>>>
>>>>> Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.
>>>>>
>>>>> 3) Government has already done valuable analysis
>>>>>
>>>>> Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.
>>>>>
>>>>> 4) Data is not timely
>>>>>
>>>>> Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.
>>>>>
>>>>> 5) Data quality is inconsistent
>>>>>
>>>>> The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.
>>>>>
>>>>> 6) Data is not politically sensitive
>>>>>
>>>>> There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.
>>>>>
>>>>>
>>>>> I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).
>>>>>
>>>>> It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.
>>>>>
>>>>> --
>>>>> James McKinney
>>>>> Open North
>>>>> +1.514.247.0223
>>>>> http://opennorth.ca/
>>>>> http://citizenbudget.com/ interactive budget consultations for municipalities
>>>>> [hidden email]
>>>>> Twitter: @opennorth
>>>>> Subscribe to our newsletter
>>>>>
>>>>> On 2012-03-05, at 11:53 AM, David Eaves wrote:
>>>>>
>>>>>> Hey guys,
>>>>>>
>>>>>> Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.
>>>>>>
>>>>>> What I'd like to propose was we go through the data on data.gc.ca and see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.
>>>>>>
>>>>>> If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.
>>>>>>
>>>>>> I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can do after their work. I was going to propose after work on March 15th in everyone's respective timezone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).
>>>>>>
>>>>>> I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.
>>>>>>
>>>>>> I'll blog this idea as well so that others can read it.
>>>>>>
>>>>>> Cheers,
>>>>>> dave
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Herb Lainchbury
>>>>> Dynamic Solutions Inc.
>>>>> www.dynamic-solutions.com
>>>>> http://twitter.com/herblainchbury
>>>>
>>>>
>>>> _______________________________________________
>>>> CivicAccess-discuss mailing list
>>>> [hidden email]
>>>> http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss
>>>>
>>>>
>>>>
>>>> --
>>>> Tracey P. Lauriault
>>>> 613-234-2805
>>>>  
>>>> "Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
>>>>  
>>>> Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)



Re: data.gc.ca usage

Gerry Tychon-2
James ...

Thanks for posting that information on the Data.gc.ca portal and the catalogue contents being available.
In October of last year I sent a comment/request that it be included. My opinion is that the contents of an open data catalogue or repository or whatever you would like to call it should also be available as a dataset (or via some kind of api). I think this is absolutely fundamental. To miss this is to miss the point of open data. The reply I received was that it would be done in the future.

The other day I was able to find the catalogue contents document (you mentioned) by doing a search using the spelling "catalogue". The alternate spelling "catalog" did not succeed.
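
A minimal sketch of how a search could treat the two spellings as the same word, so that "catalog" also finds "Catalogue". This is only an illustration and says nothing about how the data.gc.ca search actually works; the variant table, the function names and the sample titles are all hypothetical.

```python
import re

# Hypothetical table of variant spellings mapped to one preferred form.
# A real portal would need a much fuller list (or a proper thesaurus).
SPELLING_VARIANTS = {
    "catalog": "catalogue",
    "color": "colour",
    "center": "centre",
}


def normalize(text):
    """Lowercase the text, split it into words and unify variant spellings."""
    words = re.findall(r"[a-z]+", text.lower())
    return [SPELLING_VARIANTS.get(word, word) for word in words]


def search(query, titles):
    """Return the titles that contain every (normalized) word of the query."""
    wanted = set(normalize(query))
    return [title for title in titles if wanted <= set(normalize(title))]


# Sample titles, made up for the example.
titles = ["Data.gc.ca Portal Catalogue", "City Page Weather"]
print(search("catalog", titles))
print(search("catalogue", titles))
```

Both queries return the same title because the query and the titles are normalized the same way before they are compared.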

Perhaps someone else knows more, but a quick look at the file shows a free text title, a sometimes longer free text description, and URLs to further information. But there is no structured taxonomy or controlled vocabulary within the file, which makes it difficult to do any analysis on the contents of the file.

In addition, Library and Archives Canada has done a huge amount of work on controlled vocabularies:

http://www.collectionscanada.gc.ca/government/controlled-vocabularies/index-e.html

and metadata:

http://www.collectionscanada.gc.ca/government/products-services/007002-5000-e.html

within the government, and yet I can't find any of that via the open data portal, or at least not with the search words and spelling I used.

... gerry tychon


On 11/03/2012 1:17 PM, James McKinney wrote:
Published two days ago: Data.gc.ca Portal Catalogue!

http://www.data.gc.ca/default.asp?lang=En&n=5175A6F0-1&xsl=datacataloguerecord&xml=5175A6F0-61E1-49FC-8E5D-0BBCDAF5969D&formid=C4C5C7F1-BFA6-4FF6-B4A0-C164CB2060F7&showfromadmin=1&readonly=true

It seems to have more records than the file David posted to BuzzData, so I think we should refer to this new one, instead. For whatever reason, the file doesn't list itself as being in the catalogue.

On 2012-03-07, at 6:33 PM, James McKinney wrote:

Is there a possibility TBS will post these tables on its website? I'd be really interested to know how many downloads are from Canadian vs. foreign IPs. It'd also be great to know what people are doing with these datasets.

On 2012-03-07, at 6:23 PM, David Eaves wrote:

Some of you may find this deck of interest.

Of particular interest were the most downloaded data set numbers

Since Launch:

DATASET | DEPARTMENT | DOWNLOADS
Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 4730
Permanent Resident Summary by Mission (English) | Citizenship and Immigration Canada | 1733
Overseas Permanent Resident Inventory (English) | Citizenship and Immigration Canada | 1558
Canada – Permanent residents by category (English) | Citizenship and Immigration Canada | 1261
Permanent Resident Applicants Awaiting a Decision (English) | Citizenship and Immigration Canada | 873
Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 852
Meteorological Service of Canada (MSC) – Weather Element Forecasts | Environment Canada | 851
Permanent Resident Visa Applications Received Abroad - English Version | Citizenship and Immigration Canada | 800
Water Quality Indicators - Reports, Maps, Charts and Data | Environment Canada | 697
Canada - Permanent and Temporary Residents - English version | Citizenship and Immigration Canada | 625

Last 30 days:
# | DATASET | DEPARTMENT | DOWNLOADS
1 | Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 481
2 | Sales of commodities of large retailers - English version | Statistics Canada | 247
3 | Permanent Resident Summary by Mission - English Version | Citizenship and Immigration Canada | 207
4 | CIC Operational Network at a Glance - English Version | Citizenship and Immigration Canada | 163
5 | Gross domestic product at basic prices, communications, transportation and trade - English version | Statistics Canada | 159
6 | Anthropogenic disturbance footprint within boreal caribou ranges across Canada - As interpreted from 2008-2010 Landsat satellite imagery | Environment Canada | 102
7 | Canada - Permanent residents by category - English version | Citizenship and Immigration Canada | 98
8 | Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 61
9 | Sales of fuel used for road motor vehicles, by province and territory - English version | Statistics Canada | 52
10 | Government of Canada Core Subject Thesaurus - English Version | Library and Archives Canada | 51
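
A rough sketch of one way to read the "Last 30 days" figures: total the downloads per department to see which publishers account for most of the top-10 traffic. The rows are copied from the list above; the function and variable names are made up for the example, and ten rows say nothing about the rest of the catalogue.

```python
from collections import defaultdict

# (dataset, department, downloads) rows copied from the "Last 30 days" list.
LAST_30_DAYS = [
    ("Permanent Resident Applications Processed Abroad and Processing Times", "Citizenship and Immigration Canada", 481),
    ("Sales of commodities of large retailers", "Statistics Canada", 247),
    ("Permanent Resident Summary by Mission", "Citizenship and Immigration Canada", 207),
    ("CIC Operational Network at a Glance", "Citizenship and Immigration Canada", 163),
    ("Gross domestic product at basic prices, communications, transportation and trade", "Statistics Canada", 159),
    ("Anthropogenic disturbance footprint within boreal caribou ranges across Canada", "Environment Canada", 102),
    ("Canada - Permanent residents by category", "Citizenship and Immigration Canada", 98),
    ("Meteorological Service of Canada (MSC) - City Page Weather", "Environment Canada", 61),
    ("Sales of fuel used for road motor vehicles, by province and territory", "Statistics Canada", 52),
    ("Government of Canada Core Subject Thesaurus", "Library and Archives Canada", 51),
]


def downloads_by_department(rows):
    """Sum the download counts for each department."""
    totals = defaultdict(int)
    for _dataset, department, downloads in rows:
        totals[department] += downloads
    return dict(totals)


totals = downloads_by_department(LAST_30_DAYS)
grand_total = sum(totals.values())

# Print departments from most to least downloaded, with their share of the top 10.
for department, count in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{department}: {count} ({count / grand_total:.0%})")
```

The same function works on the "Since Launch" rows if they are entered in the same (dataset, department, downloads) shape.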

Anyways, thought this might be interesting.

Dave

On 12-03-06 11:37 AM, James McKinney wrote:
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

With respect to CLI, no one disagrees that it's important, valuable, etc. as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.

James

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:

Appraising the value of a dataset is a very tricky proposition.
 
For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?
 
The CLI, for example, fueled the invention of GIS (a Canadian invention) and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.
 
These data now form the base of so many other important data sets and maps. For example, they were used to inform the creation of ecological regions and territories, which changed the way Canadian land was envisioned, inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada, after all, remains a resource-based extraction economy, with lots of 'unpeopled places', even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis, which as you are aware was the first government portal to deliver free data, and it did so with ground-breaking licenses.
 
The CLI is also, according to some NRCan friends, one of the most popular data sets and the most often downloaded.  This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps, and it was a game changer technologically and information-wise worldwide.
 
So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  1 dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then at least a decade to save.  A dataset that cost millions.  A dataset which described the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.
 
GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file for instance, not only costs millions to produce and update, but it is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government and that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?
 
These geodata, like statistical data are part of well-established disciplines, and are foundational Canadian Institutions, 1842 for NRCan and 1666 for the census or officially 1871.  They have their communities and they have a context.
 
The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!
 
The TBS portal also lacks curatorship and a well-developed mechanism to discover the data.  Furthermore, the data are stripped of context.  I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal.  The TBS portal does not deliver data in a way that I can trust them.  There is very little metadata; there are no methodological documents to explain the data fields, how the data were collected, or what is available to help assess the quality, reliability and accuracy of the data; there are no data dictionaries; and there is no provenance, lineage or dataset authority.  These critical elements of what constitutes a method to evaluate fit for use have been removed.  A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.
 
How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?
 
Bref, I am not sure we are the best people to do this job.  I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data.  I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices and roles and professions, such as librarianship, curatorship, archivists, scientists, etc.  I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets. 
 
And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base.  And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding.  Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney [hidden email] wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though.  It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose.  It does do a decent job of many other aspects in terms of legal framework, accessibility and readability, so perhaps it's at least part of the picture.  If we are able to articulate additional qualities, they could be added to the tool as well.  We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.
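
For illustration only, here is a minimal sketch of the "more than one person rates every dataset and then we average" scheme. It assumes the spreadsheet can be exported as rows of (dataset, rater, score); the function name, the row layout and the sample values are all hypothetical, not anything ODUI or the spreadsheet prescribes.

```python
from collections import defaultdict
from statistics import mean


def average_ratings(ratings):
    """Average the scores each dataset received from different raters.

    `ratings` is an iterable of (dataset, rater, score) tuples.
    Returns a dict mapping dataset -> (mean score, number of ratings).
    """
    scores = defaultdict(list)
    for dataset, _rater, score in ratings:
        scores[dataset].append(score)
    return {name: (mean(vals), len(vals)) for name, vals in scores.items()}


# Hypothetical rows, as they might be exported from a shared spreadsheet.
sample = [
    ("dataset-a", "rater-1", 80.0),
    ("dataset-a", "rater-2", 60.0),
    ("dataset-b", "rater-1", 40.0),
]
summary = average_ratings(sample)
```

Keeping the number of ratings next to each average makes it easy to see which datasets have only been looked at by one person.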

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck [hidden email] wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals as with many StatsCan datasets. It can also be out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty gritty details. Since these details are what we are looking after most often, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney [hidden email] wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.
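
To make the methodology warning in point 2 concrete, here is a small sketch of how a percentile rank computed within an industry can differ from one computed across all facilities. The facility names and emission figures are invented for illustration, and this is not a description of how Emitter.ca actually computes its grades.

```python
def percentile_rank(value, population):
    """Percentage of values in `population` that are <= `value`."""
    return 100.0 * sum(v <= value for v in population) / len(population)


# Hypothetical facilities: (name, industry, total emissions, arbitrary units).
facilities = [
    ("bakery-1", "bakery", 1.0),
    ("bakery-2", "bakery", 2.0),
    ("bakery-3", "bakery", 5.0),
    ("plant-1", "petrochemical", 900.0),
    ("plant-2", "petrochemical", 1500.0),
]


def ranks(rows):
    """Return {name: (rank within industry, rank across all facilities)}."""
    all_values = [e for _, _, e in rows]
    result = {}
    for name, industry, emissions in rows:
        peers = [e for _, ind, e in rows if ind == industry]
        result[name] = (
            percentile_rank(emissions, peers),
            percentile_rank(emissions, all_values),
        )
    return result


result = ranks(facilities)
```

In this made-up sample the largest bakery ranks at the top of its own industry but below both petrochemical plants overall, which is the distinction the "bad" label hides.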


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
+1.514.247.0223
http://opennorth.ca/
http://citizenbudget.com/ interactive budget consultations for municipalities
[hidden email]
Twitter: @opennorth
Subscribe to our newsletter

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose is that we go through the data on data.gc.ca, see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can do after their work. I was going to propose after work on March 15th in everyone's respective timezone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave




-- 
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury

_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss



-- 
Tracey P. Lauriault
613-234-2805
 
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
 
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)


Re: Upcoming events

Andrew Dyck
In reply to this post by David Eaves
I just read an article by those behind allourideas on the analysis method used for their pairwise comparison data. The method they use is quite complex. Good opportunity for some guerilla statistical analysis :)
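
As a point of comparison, here is a deliberately naive sketch of scoring items from pairwise votes: each item's score is simply the share of its comparisons that it won. This is not the method described in the allourideas article, which is more complex; the function name and the sample votes are hypothetical.

```python
from collections import Counter


def win_rates(votes):
    """Score each item by the share of its comparisons that it won.

    `votes` is an iterable of (winner, loser) pairs.
    """
    wins, appearances = Counter(), Counter()
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {item: wins[item] / appearances[item] for item in appearances}


# Hypothetical votes: each pair records which dataset was preferred.
votes = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
scores = win_rates(votes)
```

A raw win rate ignores how strong the opposing items were and how few votes an item may have received, which is presumably part of why a more careful statistical model is worth the extra complexity.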

Any updates on plans for this event?

On Wed, Mar 7, 2012 at 5:23 PM, David Eaves <[hidden email]> wrote:
Some of you may find this deck of interest.

Of particular interest were the most downloaded data set numbers

Since Launch:

DATASET | DEPARTMENT | DOWNLOADS
Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 4730
Permanent Resident Summary by Mission (English) | Citizenship and Immigration Canada | 1733
Overseas Permanent Resident Inventory (English) | Citizenship and Immigration Canada | 1558
Canada – Permanent residents by category (English) | Citizenship and Immigration Canada | 1261
Permanent Resident Applicants Awaiting a Decision (English) | Citizenship and Immigration Canada | 873
Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 852
Meteorological Service of Canada (MSC) – Weather Element Forecasts | Environment Canada | 851
Permanent Resident Visa Applications Received Abroad - English Version | Citizenship and Immigration Canada | 800
Water Quality Indicators - Reports, Maps, Charts and Data | Environment Canada | 697
Canada - Permanent and Temporary Residents - English version | Citizenship and Immigration Canada | 625


Last 30 days:

# | DATASET | DEPARTMENT | DOWNLOADS
1 | Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 481
2 | Sales of commodities of large retailers - English version | Statistics Canada | 247
3 | Permanent Resident Summary by Mission - English Version | Citizenship and Immigration Canada | 207
4 | CIC Operational Network at a Glance - English Version | Citizenship and Immigration Canada | 163
5 | Gross domestic product at basic prices, communications, transportation and trade - English version | Statistics Canada | 159
6 | Anthropogenic disturbance footprint within boreal caribou ranges across Canada - As interpreted from 2008-2010 Landsat satellite imagery | Environment Canada | 102
7 | Canada - Permanent residents by category - English version | Citizenship and Immigration Canada | 98
8 | Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 61
9 | Sales of fuel used for road motor vehicles, by province and territory - English version | Statistics Canada | 52
10 | Government of Canada Core Subject Thesaurus - English Version | Library and Archives Canada | 51

Anyways, thought this might be interesting.

Dave

On 12-03-06 11:37 AM, James McKinney wrote:
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

With respect to CLI, no one disagrees that it's important, valuable, etc. as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.

James

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:

Appraising the value of a dataset is a very tricky proposition.

 

For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?

 

The CLI, for example, fueled the invention of GIS (a Canadian invention), and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.

 

These data now form the base of so many other important data sets and maps, for example, they were used to inform the creation of ecological regions and territories which changed the way Canadian land was envisioned, and it inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada after all, remains a resource based extraction economy, with lots of 'unpeopled places' even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis which as you are aware was the first government portal to deliver free data and it did so with ground-breaking licenses.

 

The CLI, is also, according to some NRCan friends, one of the most popular data sets and the most oft downloaded.  This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps and it was a game changer technologically and information wise worldwide.

 

So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  1 dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then at least a decade to save.  A dataset that cost millions.  A dataset which described the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.

 

GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file for instance, not only costs millions to produce and update, but it is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government and that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?

 

These geodata, like statistical data, are part of well-established disciplines and are foundational Canadian institutions, dating to 1842 for NRCan and to 1666 for the census (or officially 1871).  They have their communities and they have a context.

 

The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!

 

The TBS portal also lacks curatorship and a well-developed mechanism to discover the data.  Furthermore, the data are stripped of context.  I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal.  The TBS portal does not deliver data in a way that lets me trust them.  There is very little metadata; there are no methodological documents to explain the data fields or how the data were collected, nothing to help assess the quality, reliability and accuracy of the data, no data dictionaries, and no provenance, lineage or dataset authority.  These critical elements of what constitutes a method to evaluate fitness for use have been removed.  A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.

 

How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?

 

In short, I am not sure we are the best people to do this job.  I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data.  I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices, roles and professions, such as librarians, curators, archivists, scientists, etc.  I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets.

 

And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base.  And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding.  Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though.  It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as wasn't really it's purpose.  It does do a decent job of many other aspects in terms of legal framework, accessibility and readability so perhaps it's at least part of the picture.  If we are able to articulate additional qualities they could be added to the tool as well.  We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals as with many StatsCan datasets. It can also be out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty gritty details. Since these details are what we are looking after most often, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose is that we go through the data on data.gc.ca, see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can do after work. I was going to propose after work on March 15th in everyone's respective timezone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave





--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury


_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss



--
Tracey P. Lauriault
613-234-2805
 
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
 
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)

Re: Upcoming events

David Eaves
Hi Andrew (and everyone),

There is no central event; I'm going through all the data sets with some friends and we are pulling out what we think is interesting. If anyone ends up writing a blog post on what they think is interesting on data.gc.ca or wants to do anything more ambitious (like create a better interface), please do let me know.

cheers,
dave

On 12-03-13 7:56 AM, Andrew Dyck wrote:
I just read an article by those behind allourideas on the analysis method used for their pairwise comparison data. The method they use is quite complex. Good opportunity for some guerilla statistical analysis :)

Any updates on plans for this event?

On Wed, Mar 7, 2012 at 5:23 PM, David Eaves <[hidden email]> wrote:
Some of you may find this deck of interest.

Of particular interest were the download numbers for the most downloaded data sets.

Since Launch:

DATASET | DEPARTMENT | DOWNLOADS
Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 4730
Permanent Resident Summary by Mission (English) | Citizenship and Immigration Canada | 1733
Overseas Permanent Resident Inventory (English) | Citizenship and Immigration Canada | 1558
Canada – Permanent residents by category (English) | Citizenship and Immigration Canada | 1261
Permanent Resident Applicants Awaiting a Decision (English) | Citizenship and Immigration Canada | 873
Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 852
Meteorological Service of Canada (MSC) – Weather Element Forecasts | Environment Canada | 851
Permanent Resident Visa Applications Received Abroad - English Version | Citizenship and Immigration Canada | 800
Water Quality Indicators - Reports, Maps, Charts and Data | Environment Canada | 697
Canada - Permanent and Temporary Residents - English version | Citizenship and Immigration Canada | 625


Last 30 days:

# | DATASET | DEPARTMENT | DOWNLOADS
1 | Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 481
2 | Sales of commodities of large retailers - English version | Statistics Canada | 247
3 | Permanent Resident Summary by Mission - English Version | Citizenship and Immigration Canada | 207
4 | CIC Operational Network at a Glance - English Version | Citizenship and Immigration Canada | 163
5 | Gross domestic product at basic prices, communications, transportation and trade - English version | Statistics Canada | 159
6 | Anthropogenic disturbance footprint within boreal caribou ranges across Canada - As interpreted from 2008-2010 Landsat satellite imagery | Environment Canada | 102
7 | Canada - Permanent residents by category - English version | Citizenship and Immigration Canada | 98
8 | Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 61
9 | Sales of fuel used for road motor vehicles, by province and territory - English version | Statistics Canada | 52
10 | Government of Canada Core Subject Thesaurus - English Version | Library and Archives Canada | 51


Anyways, thought this might be interesting.

Dave

On 12-03-06 11:37 AM, James McKinney wrote:
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.

I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.

Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.

Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.

With respect to CLI, no one disagrees that it's important, valuable, etc. as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.

We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.

James

On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:

Appraising the value of a dataset is a very tricky proposition.

 

For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?

 

The CLI, for example, fueled the invention of GIS (a Canadian invention) and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization.  The project was eventually shelved by 'management' and forgotten, and the software became obsolete.  Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data.  Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.

 

These data now form the base of so many other important data sets and maps. For example, they were used to inform the creation of ecological regions and territories, which changed the way Canadian land was envisioned, inspired a new way to consider resources and transformed the way land was to be managed, organized and classified.  Canada, after all, remains a resource-based extraction economy, with lots of 'unpeopled places' even if we do not see that from the vantage of our urban cafe settings.  The CLI was also one of the first datasets released on Geogratis, which, as you are aware, was the first government portal to deliver free data, and it did so with ground-breaking licenses.

 

The CLI is also, according to some NRCan friends, one of the most popular data sets and the most often downloaded.  This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on.  It is also invaluable as it forms the base of other maps, and it was a game changer technologically and information-wise worldwide.

 

So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140?  A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc.  1 dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then at least a decade to save.  A dataset that cost millions.  A dataset which described the platform upon which we live.  Also, a national historical artifact.  It is not a NEW dataset, but it remains a critically important dataset.

 

GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them.  Data that fit into their business practices and their disciplinary fields.  The street network file for instance, not only costs millions to produce and update, but it is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government and that took decades to develop.  It is a foundational data set for all other datasets.  That would also be counted as 1 of 140?

 

These geodata, like statistical data are part of well-established disciplines, and are foundational Canadian Institutions, 1842 for NRCan and 1666 for the census or officially 1871.  They have their communities and they have a context.

 

The problem with the TBS portal is its lack of context and a lack of a user community.  Who does it serve?  Which community is it aimed at?  And which communities were involved in its production, and design?  What is it really trying to do?  And is that the new model to replace something like Geogratis?  I hope not!

 

The TBS portal also lacks curatorship and a well-developed mechanism to discover the data.  Furthermore, the data are stripped of context.  I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal.  The TBS portal does not deliver data in a way that I can trust them.  There is very little metadata; there are no methodological documents to explain the data fields, how the data were collected, or what is available to help assess the quality, reliability and accuracy of the data; there are no data dictionaries; there is no provenance, lineage or dataset authority.  These critical elements of what constitutes a method to evaluate fit for use have been removed.  A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.

 

How can we decide the value of a set of data stripped of context in a spreadsheet?  Who gets to assign value?  Is fit for use part of that value proposition? Who are we?  What are our criteria?

 

Bref, I am not sure we are the best people to do this job.  I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data.  I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices and roles and professions, such as librarianship, curatorship, archivists, scientists, etc.  I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets. 

 

And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base.  And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding.  Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.

Cheers

t

 
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)

Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures. 

By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.

In reply to Andrew:

I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.

I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.

Cheers,

James

On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:

+1 for the great responses thus far

I'm copying OpenDataBC and civic-access as suggested.  Apologies for duplicates.

I think it's important to provide government with actionable feedback in the form of measures using a tool.  As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating.  It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :)  The point of it is to show government exactly where the issues are so they can be addressed.

The ODUI only addresses usability though.  It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose.  It does do a decent job of many other aspects in terms of legal framework, accessibility and readability, so perhaps it's at least part of the picture.  If we are able to articulate additional qualities they could be added to the tool as well.  We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.

Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel.  We might want to come up with a scheme where more than one person rates every dataset and then we average.  Depending on how much help we can muster.
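
The multi-rater scheme could be sketched roughly as follows. This is only an illustration under stated assumptions: the dataset ids, rater names, the 0-100 score scale and the `min_raters` threshold are all hypothetical, and the rows stand in for whatever a shared spreadsheet export would actually contain.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows exported from a shared ratings spreadsheet:
# (dataset id, rater, score on an assumed 0-100 scale).
RATINGS = [
    ("dataset-a", "rater1", 70),
    ("dataset-a", "rater2", 80),
    ("dataset-b", "rater1", 40),
    ("dataset-b", "rater3", 50),
    ("dataset-b", "rater2", 60),
    ("dataset-c", "rater1", 90),
]


def average_ratings(rows, min_raters=2):
    """Average scores per dataset, keeping only datasets with enough raters.

    Returns (averaged, needs_more): a dict of mean scores for datasets rated
    by at least `min_raters` people, and a sorted list of datasets that still
    need more ratings before an average is meaningful.
    """
    scores = defaultdict(list)
    for dataset, _rater, score in rows:
        scores[dataset].append(score)
    averaged = {d: mean(s) for d, s in scores.items() if len(s) >= min_raters}
    needs_more = sorted(d for d, s in scores.items() if len(s) < min_raters)
    return averaged, needs_more


if __name__ == "__main__":
    averaged, needs_more = average_ratings(RATINGS)
    print("averaged:", averaged)
    print("needs more raters:", needs_more)
```

Reporting the under-rated datasets separately gives volunteers an obvious place to pitch in, depending on how much help turns up.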

H

On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests. 

The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals as with many StatsCan datasets. It can also be out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty gritty details. Since these details are what we are most often looking for, we could really use a way to communicate this.

I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.

Cheers,
Andrew


On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.

I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.

Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:

1) Data has low precision

Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.

Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.

2) Data has little value on its own

Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.

Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.

Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.

3) Government has already done valuable analysis

Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.

4) Data is not timely

Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.

5) Data quality is inconsistent

The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.

6) Data is not politically sensitive

There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.


I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).

It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.

--
James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth

On 2012-03-05, at 11:53 AM, David Eaves wrote:

Hey guys,

Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.

What I'd like to propose is that we go through the data on data.gc.ca, see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.

If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.

I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even a full day, but a couple of hours that people can do after work. I was going to propose after work on March 15th in everyone's respective timezone, as this is the one-year anniversary of the open data portal (I think Scilib suggested that date).

I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.

I'll blog this idea as well so that others can read it.

Cheers,
dave





--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury





--
Tracey P. Lauriault
613-234-2805
 
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
 
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)




allourideas opendata

Gerry Tychon-2
In reply to this post by Andrew Dyck
Andrew ...

Thanks for posting the link on allourideas.org. I believe I have used this approach (AHP - Analytic Hierarchy Process) before for land use suitability. It allows for both qualitative and quantitative analysis and avoids "herding". But, I have not seen a web site constructed such as allourideas.org.
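
For anyone who wants to play with pairwise votes, here is a rough sketch that scores each idea by the share of its match-ups it wins. To be clear about the assumptions: this is neither AHP nor the method allourideas.org actually uses (which Andrew describes as quite complex), and the ideas and votes below are invented for illustration.

```python
from collections import Counter

# Hypothetical votes: each tuple is (winner, loser) from one pairwise prompt.
VOTES = [
    ("timely data", "more datasets"),
    ("timely data", "better search"),
    ("better search", "more datasets"),
    ("timely data", "more datasets"),
    ("more datasets", "better search"),
]


def win_rates(votes):
    """Return {idea: fraction of its pairwise match-ups that it won}."""
    wins = Counter()
    appearances = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {idea: wins[idea] / appearances[idea] for idea in appearances}


if __name__ == "__main__":
    ranked = sorted(win_rates(VOTES).items(), key=lambda kv: kv[1], reverse=True)
    for idea, rate in ranked:
        print(f"{idea}: {rate:.2f}")
```

A raw win rate ignores how strong the opposing idea was and how few votes an idea may have received, which is presumably part of why the real analysis is more involved.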

As an experiment, I have created a question on allourideas.org and seeded it with some ideas.

The question is: What will make Open Data successful?

The link is:

http://www.allourideas.org/opendata

I think it would be great if folks gave their opinion or added ideas.


... gerry tychon