I have done work on LSRS -- the Land Suitability Rating System -- meant to replace CLI ratings. For those interested some information can be found at:
... gerry tychon
Is there a possibility TBS will post these tables on its website? I'd be really interested to know how many downloads are from Canadian vs. foreign IPs. It'd also be great to know what people are doing with these datasets.
On 2012-03-07, at 6:23 PM, David Eaves wrote:
Some of you may find this deck of interest.
Of particular interest were the most downloaded data set numbers
Since Launch:
DATASET | DEPARTMENT | DOWNLOADS
Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 4730
Permanent Resident Summary by Mission (English) | Citizenship and Immigration Canada | 1733
Overseas Permanent Resident Inventory (English) | Citizenship and Immigration Canada | 1558
Canada – Permanent residents by category (English) | Citizenship and Immigration Canada | 1261
Permanent Resident Applicants Awaiting a Decision (English) | Citizenship and Immigration Canada | 873
Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 852
Meteorological Service of Canada (MSC) – Weather Element Forecasts | Environment Canada | 851
Permanent Resident Visa Applications Received Abroad - English Version | Citizenship and Immigration Canada | 800
Water Quality Indicators - Reports, Maps, Charts and Data | Environment Canada | 697
Canada - Permanent and Temporary Residents - English version | Citizenship and Immigration Canada | 625
Last 30 days:
# | DATASET | DEPARTMENT | DOWNLOADS
1 | Permanent Resident Applications Processed Abroad and Processing Times (English) | Citizenship and Immigration Canada | 481
2 | Sales of commodities of large retailers - English version | Statistics Canada | 247
3 | Permanent Resident Summary by Mission - English Version | Citizenship and Immigration Canada | 207
4 | CIC Operational Network at a Glance - English Version | Citizenship and Immigration Canada | 163
5 | Gross domestic product at basic prices, communications, transportation and trade - English version | Statistics Canada | 159
6 | Anthropogenic disturbance footprint within boreal caribou ranges across Canada - As interpreted from 2008-2010 Landsat satellite imagery | Environment Canada | 102
7 | Canada - Permanent residents by category - English version | Citizenship and Immigration Canada | 98
8 | Meteorological Service of Canada (MSC) - City Page Weather | Environment Canada | 61
9 | Sales of fuel used for road motor vehicles, by province and territory - English version | Statistics Canada | 52
10 | Government of Canada Core Subject Thesaurus - English Version | Library and Archives Canada | 51
Anyways, thought this might be interesting.
Dave
On 12-03-06 11:37 AM, James McKinney wrote:
Thanks for the thoughtful reply, Tracey. Do you have any specific suggestions for how we can "work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data"? I think that would help a lot.
I don't think anyone was suggesting that any of us could rank the absolute value of datasets. The primary suggestion was that we can encourage people to find datasets that are interesting to them. I expect that's uncontroversial. It's good to help people satisfy their interests. The current TBS portal makes that hard. We are trying to find ways to make it easy.
Another suggestion was to aggregate people's interests in order to recommend datasets. Yes, we run into the problem of *who* is expressing interest. But even if only one community were to participate, it would still be worthwhile, because we would at least know what datasets that community finds interesting, and be able to make recommendations to that community.
Another suggestion was to score datasets according to different metrics. Coming up with good metrics is hard, and in this context we barely even broached the subject. In a different context, OpenDataBC came up with metrics for "usability": http://www.opendatabc.ca/odui.html Datasets that cost more, for example, are less "usable". (They define what they mean by "usability".) The point of such metrics is to help government identify potential problem areas.
With respect to CLI, no one disagrees that it's important, valuable, etc., as are so many of the GeoGratis, etc. datasets. However, these datasets are already so successful (as you wrote, the CLI is one of the most popular and downloaded) that they wouldn't benefit as much from extra publicity as other datasets. They are already used for a vast number of purposes by the community. On the other hand, there are likely many datasets on the TBS portal that haven't been used except by the government for the purposes for which they were collected in the first place. We'd like to encourage use of those potentially underused datasets.
We're trying to come up with better mechanisms to discover data, which, as you write, the TBS portal lacks. Additional context is great for evaluating individual datasets, but I don't see how it helps with discovery. We don't know what the best solutions are, so we are experimenting. We are hoping that better discovery mechanisms will help start to build a community around data.gc.ca.
James
On 2012-03-06, at 1:57 PM, Tracey P. Lauriault wrote:
Appraising the value of a dataset is a very tricky proposition.
For example, how would the Canada Land Inventory (CLI) be evaluated in that 1 of 140? Can you compare it to a spreadsheet that includes a demographic variable arranged by census metropolitan area?
The CLI, for example, fueled the invention of GIS (a Canadian invention) and at the time (1960s) it was one of the world’s biggest land classification undertakings. It was part of that big post-war push to map Canada with air photo reconnaissance and part of early computerization. The project was eventually shelved by 'management' and forgotten, and the software became obsolete. Boxes of tapes got moved around; eventually, I think, they made their way into someone's garage (classic). A team was eventually built to restore these data. Ecologists, agricultural data specialists, geomaticians and engineers dedicated hours of 'after hours' work to extract the data off of those tapes and to strip them of obsolete code.
These data now form the base of so many other important data sets and maps, for example, they were used to inform the creation of ecological regions and territories which changed the way Canadian land was envisioned, and it inspired a new way to consider resources and transformed the way land was to be managed, organized and classified. Canada after all, remains a resource based extraction economy, with lots of 'unpeopled places' even if we do not see that from the vantage of our urban cafe settings. The CLI was also one of the first datasets released on Geogratis which as you are aware was the first government portal to deliver free data and it did so with ground-breaking licenses.
The CLI is also, according to some NRCan friends, one of the most popular data sets and the most oft downloaded. This data set has tremendous value to geographers, agriculturalists, ecologists, urban planners, road builders and so on. It is also invaluable as it forms the base of other maps and it was a game changer technologically and information-wise worldwide.
So how do we urbanites with our mobile devices value it? It counts as 1 in a spreadsheet of 140? A list of 140 datasets stripped of context, history, influence, cost to produce, science, metadata, etc. 1 dataset that took decades to create and was built by hundreds of scientists collaborating to do so, and then at least a decade to save. A dataset that cost millions. A dataset which described the platform upon which we live. Also, a national historical artifact. It is not a NEW dataset, but it remains a critically important dataset.
GeoGratis, GeoBase and the GeoConnections discovery portal are successful because they are delivering data to an established community of practice, who collaboratively developed standards, metadata, file formats, services and an infrastructure to manage and disseminate data of value to them. Data that fit into their business practices and their disciplinary fields. The street network file for instance, not only costs millions to produce and update, but it is also a distributed data collaboration between provinces, territories and the feds, produced and shared under a signed accord of relationships between levels of government and that took decades to develop. It is a foundational data set for all other datasets. That would also be counted as 1 of 140?
These geodata, like statistical data, are part of well-established disciplines and are foundational Canadian institutions, dating from 1842 for NRCan and 1666 for the census (or officially 1871). They have their communities and they have a context.
The problem with the TBS portal is its lack of context and a lack of a user community. Who does it serve? Which community is it aimed at? And which communities were involved in its production, and design? What is it really trying to do? And is that the new model to replace something like Geogratis? I hope not!
The TBS portal also lacks curatorship and a well-developed mechanism to discover the data. Furthermore, the data are stripped of context. I trust data coming from DFO or NRCan or NRC; I do not have the same level of trust in the data I see in the TBS portal. The TBS portal does not deliver data in a way that I can trust them. There is very little metadata; there are no methodological documents to explain the data fields, how the data were collected, or what is available to help assess the quality, reliability and accuracy of the data; there are no data dictionaries; and there is no provenance, lineage or dataset authority. These critical elements of what constitutes a method to evaluate fit for use have been removed. A municipal portal is one thing, as the municipality is a context, but at the federal level, where there is a mix of administrative, scientific, geomatics and statistical data, the TBS portal falls short.
How can we decide the value of a set of data stripped of context in a spreadsheet? Who gets to assign value? Is fit for use part of that value proposition? Who are we? What are our criteria?
Bref, I am not sure we are the best people to do this job. I would rather we work with people who curate, appraise, describe, catalogue, manage, preserve, and disseminate data. I am not a traditionalist nor am I conservative, but I do have tremendous respect for some well-established practices and roles and professions, such as librarianship, curatorship, archivists, scientists, etc. I also appreciate communities of practice, disciplines and recognized bodies of knowledge, and people who thematically understand certain datasets.
And I wish, please forgive me here, we could stop being so arrogant as to think that we can do this work without tapping into that rich knowledge base. And I think we would be way better off if we acknowledged that we know stuff, but that we would be smarter and do better work if we worked collaboratively with the best people in a variety of data fields who can provide some grounding. Otherwise, I think, we will keep building sub-optimal solutions that satisfy no one.
Cheers
t
On Tue, Mar 6, 2012 at 11:45 AM, James McKinney <[hidden email]> wrote:
(Removing names that I know are on these two lists. Sorry for cross-post.)
Herb, how many people and how much time did it take to rate the 240 datasets on opendatabc.ca? For Canada, the low count is 4778 non-geographic and 102 geographic. Just want to check how much we can conceivably get through. I think ODUI is a great measure of functional openness. For data.gc.ca, the challenge is relevance and novelty. Subject matter, precision, timeliness and the possibility to create something new all contribute to one or both of those measures.
By the way, my earlier point (2) is not really something data publishers can improve upon - it's just an acknowledgment of an issue that will limit interest in the data on data.gc.ca. Unlike federal data, municipal data often has value on its own, e.g. locations of public bathrooms.
In reply to Andrew:
I think the event is still valuable. If we encourage people to explore the data, I'm sure many will find something of interest that they didn't know was available. For many, a particular dataset will spark a desire to do some deeper analysis. I'm just saying I don't expect awesome apps like Recollect or Open Parliament to come out of the datasets on data.gc.ca.
I think for online collaboration, an Etherpad (e.g. http://okfnpad.org/ or http://beta.etherpad.org/) is best, to avoid Wiki page edit conflicts. Etherpad allows live collaboration and chat history. We can then organize and publish the Etherpad on a wiki.
Cheers,
James
On 2012-03-06, at 3:13 AM, Herb Lainchbury wrote:
+1 for the great responses thus far
I'm copying OpenDataBC and civic-access as suggested. Apologies for duplicates.
I think it's important to provide government with actionable feedback in the form of measures using a tool. As most of you know we do have the ODUI (http://www.opendatabc.ca/odui.html) which we could leverage to rate all of the data sets to give them a basic usability rating. It's a hackathon project and has had much (sometimes vigorous) discussion in the OpenDataBC group. :) The point of it is to show government exactly where the issues are so they can be addressed.
The ODUI only addresses usability though. It doesn't address the subject matter (how interesting it might be) or the close-to-the-sourceness or rawness of the data, or the timeliness of the data, as that wasn't really its purpose. It does do a decent job of many other aspects in terms of legal framework, accessibility and readability so perhaps it's at least part of the picture. If we are able to articulate additional qualities they could be added to the tool as well. We don't have to use the ODUI of course, but we should at least use something that everyone can use as a guide for rating.
Given a standardized tool then I think we could throw the catalogue into a Google spreadsheet and invite people to have a go at rating it in parallel. We might want to come up with a scheme where more than one person rates every dataset and then we average. Depending on how much help we can muster.
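A minimal sketch of the "more than one person rates every dataset and then we average" scheme, assuming the ratings are exported from the shared spreadsheet as (dataset, rater, score) rows. The dataset names, rater names, the 0-100 scale and the `average_ratings` helper are all made up for illustration and are not part of the ODUI:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings: (dataset, rater, score on an assumed 0-100 scale).
ratings = [
    ("dataset-a", "rater-1", 80),
    ("dataset-a", "rater-2", 60),
    ("dataset-b", "rater-1", 40),
    ("dataset-b", "rater-3", 50),
    ("dataset-b", "rater-2", 60),
]


def average_ratings(rows):
    """Return {dataset: mean score} across all raters of that dataset."""
    scores = defaultdict(list)
    for dataset, _rater, score in rows:
        scores[dataset].append(score)
    return {dataset: mean(values) for dataset, values in scores.items()}


averages = average_ratings(ratings)
for dataset, score in sorted(averages.items()):
    print(dataset, score)
```

A plain mean is only one option; with few raters per dataset, a median or a minimum number of ratings per dataset may be less sensitive to one outlying score.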
H
On Mon, Mar 5, 2012 at 9:31 PM, Andrew Dyck <[hidden email]> wrote:
James, I agree with you on a number of your points. I think that these are areas that we can get movement on from the federal government through something like what David suggests.
The lack of precision in open data is one of the most important things that we could use improvement on. I think that this problem can arise out of a desire to aggregate data in order to protect privacy of individuals, as with many StatsCan datasets. It can also be out of a genuine desire to provide citizens with information, like a mortality rate, and save us the nitty-gritty details. Since these details are what we are most often after, we could really use a way to communicate this.
I'm interested in participating in something next week that works towards a crowd-sourced data portal. A simple wiki page with some project ideas we hash out and links to the data you'd need to do it would be useful too for others to run with it. Better that someone else work on an idea of mine than it just sit on my desk and go stale.
Cheers,
Andrew
On Mon, Mar 5, 2012 at 5:20 PM, James McKinney <[hidden email]> wrote:
I think this conversation should be on OpenDataBC and CivicAccess.
I went through the geographic datasets. I didn't see anything of interest, though they can be useful when combined with other datasets. For the other datasets, it would be great if you could get a CSV that includes the department, category and file format (as on data.gc.ca). It would be even better if TBS made this file available on data.gc.ca.
Of the 9556 non-geographic datasets in the spreadsheet, halve that to get the actual number of datasets (4778), as every dataset is translated and counted twice, once for each official language. There are a number of issues that limit the level of interest:
1) Data has low precision
Much value is derived from merging statistical data with geospatial data to analyze and visualize trends and to offer local, relevant information to citizens. Most, if not all, of the 1,911 CANSIM tables, for example, have no geography (e.g. revenue from sales of recordings by musical category) or only large geographies (e.g. provincial level). Even when data is given at a city-wide level, it's often only for the largest cities. Data has to be relevant and novel to be interesting. The low geographic precision of the data on data.gc.ca makes most of the data neither novel nor relevant.
Financial data is similarly uninteresting at low precision. MP expenses are interesting. The total budget of Parliament is not.
2) Data has little value on its own
Most of the datasets need significant value-added to be of interest. For example, one of the datasets is election results going back to Confederation, which I had requested. I had previously scraped this to answer the question, "Is a riding's electoral history a good predictor of an upcoming election's results?" The results: In Canada, not so much. In the UK, much more so. Coming up with a rough estimate of how "safe" each riding is (using this naive historical approach) was time-consuming.
Another example: you can easily draw graphs using the Fiscal Reference Tables - but what story does it tell? For it to be interesting, you'd have to mark the graphs with important events, like recessions, introduction of GST, corporate tax cuts, etc. And if you've marked the graphs diligently, you may be able to analyze different parties' contributions to Canada's expenses and revenues. The data, on its own, is boring, unless you already have a specific research question.
Finally, a warning: you have to be careful with methodology. For example, I live in Montreal, and Emitter.ca shows a lot of "bad" polluters. In this case, "bad" just means high percentile rank for total emissions *within a specific industry*, even though a "bad" bakery is probably "good" compared to a petro-chemical plant. At any rate, "bad" according to Emitter is likely not actually bad. Polluters are regulated and measured. A "bad" grade should be reserved for those who actually make the air unhealthy to breathe. Not to pick on a specific project, but it does bring up a challenge to using data.gc.ca datasets.
3) Government has already done valuable analysis
Much of the most interesting analysis of data has already been done by the government (which is the reason for their having collected the data in the first place). I have little interest in, for example, radio listening data, as the major trends have already been identified by the government. It's not easy to come up with novel ways to use much of this data.
4) Data is not timely
Some of the data would be more interesting if it were timely. For example, timely forest fire data could power a mushroom hunting community website. If I want to find wild morels, I need to rely on non-open provincial data.
5) Data quality is inconsistent
The Canada-provided industry categorizations used by Emitter are poor quality, with many missing or unusual categorizations. Such issues make it hard or impossible to perform some analyses. Some datasets are misnamed. "Inter-city indexes of consumer price levels" compares only four cities in the Maritimes.
6) Data is not politically sensitive
There are 1600 datasets on eggs, dairy and meat production. I can't count one dataset along the lines of votes in the House of Commons, MP expenses and schedules, lobbying registry and meetings, etc.
I've only used the boundary files so far (census subdivision, electoral districts, etc.). Before receiving this list, I thought there were interesting datasets and that they were just too hard to find. Having read the list, I think there simply may not be interesting datasets (unless, as I've written above, you have a specific research question and the resources to add the significant, necessary value to the data).
It may be the case that the federal level just doesn't have that much of interest that can be released (i.e. that doesn't infringe on privacy laws by having high precision) and that is not politically sensitive, and that activity around open data is concentrated at the municipal level not only because cities have been first-movers but especially because cities have more interesting nonpolitical data.
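The methodology warning under point 2 can be made concrete with a small sketch. It assumes a "bad" grade is simply a high percentile rank for total emissions among facilities in the same industry; that is a reading of the description above, not Emitter.ca's actual code, and the facilities, numbers and helper names are hypothetical:

```python
# Hypothetical facilities: (name, industry, total emissions in arbitrary units).
facilities = [
    ("Bakery A", "bakery", 5),
    ("Bakery B", "bakery", 9),
    ("Bakery C", "bakery", 2),
    ("Plant X", "petrochemical", 900),
    ("Plant Y", "petrochemical", 400),
]


def percentile_rank(value, population):
    """Percentage of values in population that are less than or equal to value."""
    return 100.0 * sum(v <= value for v in population) / len(population)


def rank_facility(name, rows):
    """Return (within-industry percentile, overall percentile) for one facility."""
    _, industry, emissions = next(r for r in rows if r[0] == name)
    same_industry = [e for _, ind, e in rows if ind == industry]
    everyone = [e for _, _, e in rows]
    return (
        percentile_rank(emissions, same_industry),
        percentile_rank(emissions, everyone),
    )


within, overall = rank_facility("Bakery B", facilities)
print(within, overall)
```

With this made-up data, the highest-emitting bakery ranks at the top of its own industry while emitting far less than either petrochemical plant, which is the sense in which a within-industry "bad" can mislead.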
--James McKinney
Open North
+1.514.247.0223
http://citizenbudget.com/ interactive budget consultations for municipalities
Twitter: @opennorth
On 2012-03-05, at 11:53 AM, David Eaves wrote:
Hey guys,
Expanding the number of people on this thread. I know I haven't included everyone - so feel free to add.
What I'd like to propose is that we go through the data on data.gc.ca, see what is interesting and pull it out so people know about it. I've heard several people say they'd like to design an alternative data portal - I'm definitely game for that and am happy to offer up the datadotgc.ca url for that too. My suspicion is that almost none of us know what is actually available since a) there is a lot, b) much of it is not interesting and c) it is very hard to search.
If someone had a clever proposal about how to go through it, I'd love for us to highlight the high value datasets (if there are any) available in data.gc.ca. Such an exercise would also help us write supportive and critical comments about the site.
I think it would be fun to have a day where people post the cool data sets they've found - or we do something even more collaborative. Something that doesn't require a full hackathon or even full day, but a couple of hours that people can do after their work. I was going to propose afterwork on March 15th in everyone's respective timezone as this is the one year anniversary of the open data portal (I think Scilib suggested that date).
I've been waiting to get lists of data from the feds, which I just got and have uploaded to buzzdata. Non-geographic data list is here, geographic data list is here.
I'll blog this idea as well so that others can read it.
Cheers,
dave
--
Herb Lainchbury
Dynamic Solutions Inc.
www.dynamic-solutions.com
http://twitter.com/herblainchbury
_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss
--
Tracey P. Lauriault
613-234-2805
"Every epoch dreams the one that follows it's the dream form of the future, not its reality" it is the "wish image of the collective".
Walter Benjamin, between 1927-1940, (http://www.columbia.edu/itc/architecture/ockman/pdfs/dossier_4/buck-morss.pdf)