Standards rant on structured documents


Standards rant on structured documents

john whelan
I'm feeling old. I was at the CanLII Law, Government and Open Data Conference today, and one of the issues that came up was that the data is often supplied in PDF format or printed out on paper.

I think Tracy recently mentioned something in the energy world that triggered a memory.

There is an ISO standard, "ISO 8879:1986 Information processing — Text and office systems — Standard Generalized Markup Language (SGML)", which was adopted by Treasury Board as one of the Treasury Board Informatics Standards, or TBITS. The idea behind it was to provide a tagged document that was machine readable; I seem to recall that a DTD defined the tags. One area where a lot of input went into SGML was legal documents.

The NEB, or National Energy Board, was a strong proponent and accepted submissions from companies and other parties in SGML format, and I think that is the history behind Tracy's pipeline data. We smiled sweetly at both Microsoft and WordPerfect, and I'm fairly certain that both companies provided SGML exporters for their word processors, so that for government use documents could be stored and sent out in a format that was stable and could be read by different computer systems. Even in those days, document formats changed with every version of the software. XML grew out of SGML, by the way.

Today more information is probably handled in email than in word-processed documents, but could, or should, an effort be made to have documents supplied by government sources (for access to information requests and otherwise) delivered in an ISO format, i.e. SGML? It might make life a lot easier when handling unstructured documents obtained through Access to Information requests and the like.

I think there is work to be done on setting standards for structured documents to make them easier to process, but step one might be to dust off the ideas behind SGML.

I shall now retreat beneath my stone.

Cheerio John

_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss

Re: Standards rant on structured documents

James McKinney-2
HTML, though not an application of SGML, is related to it. Extracting information from HTML documents (and XML documents) in the general case is not much easier than extracting information from PDFs [1] - because most HTML documents are not carefully written to communicate structured data, but rather to display information in a browser to a human reader. Hence the emergence of various standards for adding more structure to HTML documents, like Schema.org, RDFa, etc.

More relevantly, SGML, XML, HTML, JSON, etc. are standards for data *formats*, but they are not standards for *data*. A single data standard might define multiple serializations (formats). Schema.org is a data standard (an "ontology" in data modeling jargon), and it defines HTML and RDFa Lite serializations. Structured data using Schema.org terms can also be serialized in any RDF serialization, like RDF/XML, Turtle, JSON-LD, etc.

So, salvation is more likely to take the form of adopting common ontologies/vocabularies, than adopting specific data formats.
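To make the format-versus-data distinction concrete, here is a minimal sketch in Python of one record described with Schema.org terms and serialized as JSON-LD. The type and property names are drawn from the Schema.org vocabulary, but the record itself is invented for illustration, not taken from any real dataset.

```python
import json

# One record described with Schema.org terms (the data standard).
# JSON-LD is merely the serialization chosen here; the same terms
# could equally be carried in RDFa, Turtle, or RDF/XML.
record = {
    "@context": "https://schema.org",
    "@type": "Legislation",
    "name": "An Act respecting pipelines",
    "author": {"@type": "Organization", "name": "National Energy Board"},
}

serialized = json.dumps(record, indent=2)
print(serialized)
```

Swapping the serialization (to Turtle, say) changes the bytes on the wire but not the vocabulary, and it is the shared vocabulary that lets independent consumers agree on what the data means.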

James


Re: Standards rant on structured documents

Peder Jakobsen

On 2013-09-13, at 10:08 PM, James McKinney <[hidden email]> wrote:

HTML, though not an application of SGML, is related to it. Extracting information from HTML documents (and XML documents) in the general case is not much easier than extracting information from PDFs [1]

I have not worked with PDFs, but it seems quite intimidating. Extracting information from XML documents, on the other hand, is straightforward, because there is a language for that: the XPath query language.

If HTML pages are generated from a data store by a web app (as most are), then XPath still applies, and it's even easier because you can see the structure with your own eyes as you work with the queries. In fact, you don't need to be a programmer to do this: you can download a plugin for Chrome or Firefox that will generate the XPath queries automatically with point and click. Then you just send those queries to a programmer, who can whip up something for you in a matter of hours that is as structured as you want it to be.

If the HTML pages are not XHTML, the tools and libraries available today for scraping the web are truly awesome, but they require that a programmer be involved for the whole round trip. So does querying most other complex data sources, as far as I know (SQL, etc.).
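The XPath approach above can be sketched with nothing but Python's standard library, whose ElementTree module supports a useful subset of XPath (the full XPath 1.0 language, as produced by the browser plugins mentioned, needs a library such as lxml). The judgment markup below is invented for illustration, not a real court schema.

```python
import xml.etree.ElementTree as ET

# A hypothetical SGML/XML-style judgment; the tag names are
# invented for this example.
doc = """
<judgment>
  <court>Federal Court</court>
  <party role="applicant">Jane Doe</party>
  <party role="respondent">Attorney General</party>
  <paragraph n="1">The applicant seeks judicial review...</paragraph>
</judgment>
"""

root = ET.fromstring(doc)

# ElementTree's find/findtext accept XPath-like paths, including
# attribute predicates such as [@role='applicant'].
court = root.findtext("court")
applicant = root.find("party[@role='applicant']").text

print(court)       # Federal Court
print(applicant)   # Jane Doe
```

Once documents share a schema like this, the same two-line query works across every document in the collection, which is the whole appeal of the approach.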

On another note, without the many innovations in SGML, the world wide web perhaps never would have happened as quickly as it did (or at all).  

Peder 





Re: Standards rant on structured documents

James McKinney-2
Peder, my main points, copied below, are everything after that first sentence. I don't particularly care which data format is easier to extract structured data from. The point is that HTML documents are a tag soup with very little semantic structure, and when there is semantic structure, it very rarely follows any standard. Hence the push for people to adopt Schema.org, etc.


most HTML documents are not carefully written to communicate structured data, but rather to display information in a browser to a human reader. Hence the emergence of various standards for adding more structure to HTML documents, like Schema.org, RDFa, etc.

More relevantly, SGML, XML, HTML, JSON, etc. are standards for data *formats*, but they are not standards for *data*. A single data standard might define multiple serializations (formats). Schema.org is a data standard (an "ontology" in data modeling jargon), and it defines HTML and RDFa Lite serializations. Structured data using Schema.org terms can also be serialized in any RDF serialization, like RDF/XML, Turtle, JSON-LD, etc.

So, salvation is more likely to take the form of adopting common ontologies/vocabularies, than adopting specific data formats.


Re: Standards rant on structured documents

john whelan
Most Access to Information requests are answered with a printed paper response, and if you look at CanLII.org, the Canadian Legal Information Institute, which "provides access to court judgments, tribunal decisions, statutes and regulations from all Canadian jurisdictions", its sources of documents are mixed. Not all are machine readable.

What I'd like to see is one SGML/XML schema used, with all documents from government sources supplied in machine-readable form and tagged in the same way. It might be easier to ask government to follow an accepted standard, such as the one set by the National Energy Board, than to go through the effort of building consensus on which schema should be used.

Step one: machine readable. Step two: standard tags.

Cheerio John



Re: Standards rant on structured documents

Stéphane Guidoin
Hello,

Just to add a point here, more about the initial issue John raised: the problem is not how we format the data but the will to do something other than what is already done. Two significant points were raised during the discussion yesterday (I had the pleasure of being there, since I worked on the CanLII API within Open North):
- Some tribunals want to publish their judgments and other documents as locked PDFs so that nobody can change them (we know it's useless... but some want to do it).
- The source and "form factor" used to present a document to a court is at the discretion of the judge, who can, for example, ask for the original paper version of a judgment. Again, nothing makes it impossible to have both a plain-text paper version AND structured data, but it shows how far away the discussion AND the willingness are.

As tribunals are not covered by the legislation, enforcing open data and other requirements is not possible. Each tribunal does as it wants.

However, it is not just an issue for tribunals. To my knowledge, no legislation in Canada is available as open data, while it is in the UK: http://www.legislation.gov.uk/

To come back to the discussion about how to format it:
- Just plain old HTML with the correct tags, such as <hX> headings where they belong and links when citing other cases or legislation (though such links, to be useful, would need a unified registry at the source, which does not exist), plus other existing HTML/HTML5 features, would be an incredible improvement for processing the data.
- Obviously, adding semantic value with something like Schema.org would be ideal.

Steph


Re: Standards rant on structured documents

john whelan
I think there are two points: one is getting someone to create the documents electronically, and the second is getting everyone to create the documents electronically. Normally, if you are trying to do something revolutionary in government, you don't say it's revolutionary; you take it a tiny step at a time. So find someone who is already creating documents as you would like them, or at least tagged in some way that you can refer back to a standard.


I think Tim Knight showed the use of "RDF:author" as an XML tag. It would be ideal if that schema could be used, and as long as it was documented in the right way and was easy for the department to do, then, given a friendly department or court, you might be able to get the first request through. After that you follow up gradually with other departments or courts: "I found this format terribly useful for xyz; could you supply the data in this format?" A bit of flattery goes a long way, and sometimes summer students can get away with murder.

We'd need to identify something that is easy to do, something that can be exported from a Microsoft Word document. Someone somewhere must have done it; if not, Visual Basic has extensions to rip open a Word document and get hold of the values in the document properties. So first you establish that it can be done easily from a technical point of view, then you work on someone to do it for you.
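As a sketch of how easy the technical half has become: a .docx file is a ZIP archive, and the document properties John mentions live in docProps/core.xml, so they can be read with nothing but the Python standard library, with no copy of Word or Visual Basic required. The function name and the file name in the usage note are invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used inside docProps/core.xml (OOXML core properties
# plus Dublin Core).
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def word_properties(path_or_file):
    """Return the author and title stored in a .docx file's properties."""
    with zipfile.ZipFile(path_or_file) as z:
        core = ET.fromstring(z.read("docProps/core.xml"))
    return {
        "author": core.findtext("dc:creator", namespaces=NS),
        "title": core.findtext("dc:title", namespaces=NS),
    }
```

Calling `word_properties("ati_response.docx")` on a hypothetical release would return a small dictionary of metadata, which is exactly the kind of low-effort, department-friendly first step John describes.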

Once you have one or two running smoothly, you come back with the argument that governments are now consumers of the same data, so it's in their best interests to make this stuff happen.

The reference to the National Energy Board was simply that this is an agency that has been doing this sort of thing for some time now, so either learn from them or cite them as an example. Yes, I understand this stuff isn't easy. I spent years setting things up, then saying something like "Oh, we could do this this way", and lo and behold we were doing it completely differently, though I don't think management found out until a year or two later.

Cheerio John

 



Re: Standards rant on structured documents

David Akin
I, too, would love to see ATI releases and any electronic government document published with some sort of schema that would allow for easier data extraction, analysis, etc.

Questions:

1. Has anyone asked anyone at Treasury Board whether this is under consideration? (I'm happy to make inquiries and report back to the list.)
2. Is anyone aware of other national governments publishing documents this way? Examples?
3. Is there any work underway anywhere on standards/suggested tags/schemas that national and sub-national governments ought to use?




David Akin
Full contact details at:
cell: +1 613 698 7412



Re: Standards rant on structured documents

Stéphane Guidoin
Well, we must be careful not to mix two different things. Judgments and regulations do indeed come from various sources, but they are all the same kinds of documents, and even if there is no common electronic data format for them, there are general practices for publishing them in a common form, which allows CanLII to present them in a rather uniform way (even if unstructured, apart from some metadata). The idea would be to move new judgments and regulations (whose number is high but manageable, as opposed to past judgments) to a possible new data format (work which, to my knowledge, has not started yet).

Restoring the lost structure of a document in a structured format is more or less a manual job. That's why it would be manageable for regulations but not for past judgments.

ATI requests are a different beast: they come from far more varied sources (in Québec alone, thousands of organizations are covered by the ATI regulation), and the range of formats adds another scale of complexity: you have memos, notes, reports, already structured data, structured data whose structure has been lost, etc.

So it's not unreasonable to ask for structured data in the case of regulations and possibly new judgments. For the moment, though, asking for structured data for ATI outcomes appears to me to be asking the impossible.

Steph

Le 2013-09-14 21:18, David Akin a écrit :
I, too, would love to see ATI releases and any electronic government document published with some sort of schema that would allow for easier data extraction, analysis, etc.

Questions:

1. Has anyone asked anyone at Treasury Board if this is under consideration? (I'm happy to make inquiries and report to the list)
2. Is anyone aware of other national governments publishing documents this way? Examples?
3. Is there any work underway anywhere on standards/suggested tags/schemas that national and sub-national governments ought to use?



On 2013-09-14, at 1:58 PM, john whelan <[hidden email]> wrote:

Most Access to Information requests are answered with a printed paper response. And if you look at CanLII.org (the Canadian Legal Information Institute), which says "This website provides access to court judgments, tribunal decisions, statutes and regulations from all Canadian jurisdictions," its sources of documents are mixed; not all are machine readable.

What I'd like to see is one SGML/XML schema used, with all documents from government sources supplied in machine-readable form and tagged in the same way. It might be easier to request that government follow an accepted standard, such as the one set by the National Energy Board, than to go through the effort of reaching consensus on which schema should be used.

Step one machine readable, step two with standard tags.

Cheerio John


On 14 September 2013 13:36, James McKinney <[hidden email]> wrote:
Peder, my main points, copied below, are everything after that first sentence. I don't particularly care which data format is easier to extract structured data from. The point is that HTML documents are a tag soup with very little semantic structure, and when there is semantic structure, it very rarely follows any standard. Hence the push for people to adopt Schema.org, etc.


most HTML documents are not carefully written to communicate structured data, but rather to display information in a browser to a human reader. Hence the emergence of various standards for adding more structure to HTML documents, like Schema.org, RDFa, etc.

More relevantly, SGML, XML, HTML, JSON, etc. are standards for data *formats*, but they are not standards for *data*. A single data standard might define multiple serializations (formats). Schema.org is a data standard (an "ontology" in data modeling jargon), and it defines HTML and RDFa Lite serializations. Structured data using Schema.org terms can also be serialized in any RDF serialization, like RDF/XML, Turtle, JSON-LD, etc.

So, salvation is more likely to take the form of adopting common ontologies/vocabularies, than adopting specific data formats.
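To make James's format-versus-data distinction concrete, here is a structured record expressed as JSON-LD using Schema.org terms (the record itself is invented; `Legislation` is one of Schema.org's document types):

```python
import json

# The ontology (Schema.org) supplies the terms; JSON-LD is merely one
# serialization of them. The same data could be emitted as RDFa in HTML,
# Turtle, RDF/XML, etc., without changing its meaning.
record = {
    "@context": "https://schema.org",
    "@type": "Legislation",
    "name": "An Act respecting example matters",
    "legislationJurisdiction": "CA",
    "datePublished": "2013-09-14",
}
serialized = json.dumps(record, indent=2)
print(serialized)
```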


Re: Standards rant on structured documents

john whelan
In reply to this post by David Akin
There was some work being done at Treasury Board some years ago. I can remember the person's face but not their name; they had a librarianship background. The NEB initiative certainly had TBITS support, but according to the TB web site it looks as if TBITS has been cut back. I'm trying to think of the best approach, and I'd be inclined to go incremental if possible. NEB is using structured, electronically submitted documents, and I'd be inclined initially to see if their schema could at least be reused, as it is almost certainly ISO standards based. It may not be the latest ISO standard, but most TBITS were ISO based, and once you can persuade them to give you something tagged, the tags can always be renamed.

I think [hidden email] is heading up the open data side, so maybe a request for one or two structured documents from NEB, or somewhere else if we can specify something non-controversial, then follow up with: wow, wouldn't it be great if other documents were in this ISO standard? Do be aware that the federal government uses a variety of word processing software, mainly Microsoft Word, but I think there are pockets of WordPerfect and Ami Pro around. Lotus Notes might also be around, and emails have taken the place of many documents, but reports are still word processed.

Tim Knight (York University?) seems to have his finger on the pulse of current standards, and I think you need to pick the right one; he might be the person to assist in the selection. I'm out of date, unfortunately. I seem to recall Tim mentioning NRC, which means they would have an interest in publishing scientific reports, and these may well be published electronically with tags. That might give us a foot in the door.

Cheerio John



Re: Standards rant on structured documents

john whelan
In reply to this post by Stéphane Guidoin

>So it's not unreasonable to ask for structured data in the case of regulation and possibly new judgements. for the moment, it would appear to me to ask the impossible to get structured data for ATI outcomes.

If we can get some, that's an improvement. Rome wasn't built in a day, but I think the process is: find one or two, then coax the rest. There is an internal requirement to manage these documents, so the feds certainly have an incentive to tag them, and they are better stored in a neutral format (XML or another ISO standard) than in a proprietary one. I seem to recall the minister's office can get quite touchy when told, no, we don't have any way to get this document off a 12-inch floppy.

Cheerio John



Re: Standards rant on structured documents

Peder Jakobsen
In reply to this post by James McKinney-2

On 2013-09-14, at 1:36 PM, James McKinney <[hidden email]> wrote:

The point is HTML documents are a tag soup with very little semantic structure - and when there is semantic structure, it very rarely follows any standard. 

And that's OK.  The desire to organize things is understandable, but the problem with standards is that they can and do impede information production, and production is ultimately more important than extraction.  

Peder 






Re: Standards rant on structured documents

Peder Jakobsen
In reply to this post by john whelan

On 2013-09-14, at 9:44 PM, john whelan <[hidden email]> wrote:

I think [hidden email] is heading up the open data 

You are right, he's the boss, but he juggles many very different balls, so you may have to drill down a bit at TBS to get to the right people who know the details. I worked on the Open Data v.2 project, including the development of the new standard, as a "technical advisor". My main task was to make the current Open Data collection from all departments comply with the new schema as we developed it, and many "nice to have" features had to be dropped in favour of pragmatism and cost savings. The government really did this project (v.2) on a dime, and in record time, thanks in part to OKFN and CKAN. Kudos to them.

If you have any questions, I can perhaps try to answer them. All of our work, including the schema, is available on GitHub, so there are no secrets.

The basis for the standard was the North American Profile (NAP) of ISO 19115.

Peder 









Re: Standards rant on structured documents

Karl Dubost
In reply to this post by john whelan
john whelan [2013-09-13T21:08]:
> I think there is work to be done on setting standards for structured documents to make them easier to process


# principles

1. regularity in the document == processable now
2. continuity in time == processable in the future
3. follow your nose == autonomy when browsed/parsed

Standards are conventions chosen by a community of interests. All communities have such working conventions, passed from person to person, by teaching or recorded in a manual. Take a simple dictionary of words in the form of a paper book: it's full of conventions.

# paper dictionary:

1. regularity: each definition is organized the same way across the book
2. continuity: the paper artifact is not modified through time; if I come back to the same dictionary, I will still be able to read it with the same understanding
3. follow your nose: the way to read and understand the conventions is contained in the book itself


# setting standards

So when we complain about the lack of standards, we are in fact often complaining about the lack of one of the three points above in the digital form of the data.

The key for these 3 points to exist is that:

    The community of interests
    which is producing the data
    has an immediate benefit
    in creating these conventions.

So if a specific community doesn't need an organized, stable, workable digital form of the data, there is a high chance that secondary users of the data will be highly frustrated at some point in time ;)

--
Karl Dubost
http://www.la-grange.net/karl/



Re: Standards rant on structured documents

john whelan
In reply to this post by Peder Jakobsen
So ideally we would like something that is quick and easy for government departments to produce: XML format, and ISO, as everyone recognizes and accepts it, especially in government circles.


May I suggest ISO/IEC 29500 (Office Open XML)? It has a basic standard schema, and it's already used internally for documents that Treasury Board creates with other departments. Templates are available for many word processors. Its schema might not be ideal, but it's a start.

I'd suggest someone tags their Access to Information request with: would it be possible to have the results as an email attachment in ISO/IEC 29500 format (a Microsoft Word document would be acceptable), and see what happens.

ISO/IEC 29500 is more or less the native format of the Microsoft Office suite but it is an ISO standard.

Again, on the legal documents: request them in ISO/IEC 29500 from the courts. If you get them, great; at least they are machine readable, and you can always parse the documents to pull the text out of the XML tags. There should be an agreed set of XML tags for legal documents; it might not be perfect for Canadian documents, but it might be a starting point.
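John's suggestion is workable because an ISO/IEC 29500 (.docx) file is simply a ZIP archive of XML parts, so the text can be pulled out of the tags with the standard library alone. A minimal sketch, which first builds a tiny in-memory stand-in for a real document:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Build a minimal word/document.xml in memory to stand in for a real file.
document_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Decision of the Board</w:t></w:r></w:p></w:body>"
    "</w:document>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", document_xml)

# Extraction: open the ZIP, parse the XML, collect the text runs (w:t).
with zipfile.ZipFile(buf) as zf:
    root = ET.fromstring(zf.read("word/document.xml"))
text = " ".join(t.text for t in root.iter(W + "t"))
print(text)  # Decision of the Board
```

With a real file, `buf` would simply be replaced by the path to the .docx; the extraction half is unchanged.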

Cheerio John



Re: Standards rant on structured documents

Peder Jakobsen
In the context of open data, rather than having standards for the documents themselves, isn't it enough to simply have a common metadata schema to describe the documents, so they can be searched and shared across the various open data portals?

Peder 


Re: Standards rant on structured documents

john whelan
Two issues. First, getting the documents from various sources in machine-readable format with as much tagging information as possible, with minimum effort for the sources (make it easy for them or they'll find a reason not to do it). Second, what to do with them once you've got them.

Cheerio John



Re: Standards rant on structured documents

James McKinney-2
In reply to this post by David Akin
Hi David, to try to answer your questions:

1. Treasury Board seems to be sticking to the Open Government Partnership Action Plan, which only promised searchable ATI request summaries and being able to submit ATI requests online - nothing about publication of responses the way BC does. I think if schema were to be adopted, documents released via ATI would have to be more centralized than they are now (as Stephane described). Anyway, it's worthwhile to make inquiries.

2. There are parts of governments that publish documents with metadata, etc. but I don't know of any government-wide programs. The EU Commission has issued a lot of public sector information directives though, so there may be something there.

3. For a generic document, Dublin Core is the most widely adopted standard for metadata (author, date published, topics, etc.). For specific types of documents, there are more specific standards.

James

On 2013-09-14, at 9:18 PM, David Akin wrote:

I, too, would love to see ATI releases and any electronic government document published with some sort of schema that would allow for easier data extraction, analysis, etc.

Questions:

1. Has anyone asked anyone at Treasury Board if this is under consideration? (I'm happy to make inquiries and report to the list)
2. Is anyone aware of other national governments publishing documents this way? Examples?
3. Is there any work underway anywhere on standards/suggested tags/schemas that national and sub-national governments ought to use?



On 2013-09-14, at 1:58 PM, john whelan <[hidden email]> wrote:

Most Access to Information requests are answered with a printed paper response. And if you look at CanLII.org, the Canadian Legal Information Institute, which "provides access to court judgments, tribunal decisions, statutes and regulations from all Canadian jurisdictions", its sources of documents are mixed; not all are machine readable.

What I'd like to see is one SGML/XML schema used, with all documents from government sources supplied in machine-readable form and tagged in the same way. It might be easier to request that government follow an accepted standard, such as the one set by the National Energy Board, than to go through the effort of reaching consensus on which schema should be used.

Step one: machine readable. Step two: standard tags.

Cheerio John


On 14 September 2013 13:36, James McKinney <[hidden email]> wrote:
Peder, my main points, copied below, are everything after that first sentence. I don't particularly care which data format is easier to extract structured data from. The point is that HTML documents are tag soup with very little semantic structure - and when there is semantic structure, it very rarely follows any standard. Hence the push for people to adopt Schema.org, etc.


most HTML documents are not carefully written to communicate structured data, but rather to display information in a browser to a human reader. Hence the emergence of various standards for adding more structure to HTML documents, like Schema.org, RDFa, etc.

More relevantly, SGML, XML, HTML, JSON, etc. are standards for data *formats*, but they are not standards for *data*. A single data standard might define multiple serializations (formats). Schema.org is a data standard (an "ontology" in data modeling jargon), and it defines HTML and RDFa Lite serializations. Structured data using Schema.org terms can also be serialized in any RDF serialization, like RDF/XML, Turtle, JSON-LD, etc.

So, salvation is more likely to take the form of adopting common ontologies/vocabularies, than adopting specific data formats.
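As a small illustration of the format-versus-data distinction, here is a hypothetical Schema.org description serialized as JSON-LD using only Python's standard library. The title and names below are invented; the same Schema.org terms could just as well be serialized as RDFa in an HTML page or as Turtle.

```python
import json

# The *data standard* is the Schema.org vocabulary (CreativeWork, author,
# datePublished); JSON-LD is merely one *format* it can be written in.
doc = {
    "@context": "https://schema.org",
    "@type": "CreativeWork",
    "name": "Access to Information response 2013-001",  # hypothetical title
    "author": {"@type": "GovernmentOrganization", "name": "Treasury Board"},
    "datePublished": "2013-09-14",
}

print(json.dumps(doc, indent=2))
```

Swapping the serialization changes none of the statements being made, which is exactly why agreeing on the vocabulary matters more than agreeing on the file format.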

On 2013-09-14, at 10:49 AM, Peder Jakobsen wrote:


On 2013-09-13, at 10:08 PM, James McKinney <[hidden email]> wrote:

HTML, though not an application of SGML, is related to it. Extracting information from HTML documents (and XML documents) in the general case is not much easier than extracting information from PDFs [1]

I have not worked with PDFs, but they seem quite intimidating. Extracting information from XML documents, on the other hand, is straightforward because there is a language for that: the XPath query language.

If HTML pages are generated from a data store by a web app (which most are), then XPath still applies, and it's even easier because you can see the structure with your own eyes as you work on the queries. In fact, you don't need to be a programmer to do this: you just download a plugin for Chrome or Firefox that will generate the XPath queries automatically with point and click. Then you send those queries to a programmer, who can whip up something for you in a matter of hours that is as structured as you want it to be.
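A minimal sketch of the XPath approach, using Python's built-in ElementTree (which supports only a subset of XPath; the lxml package offers the full language) against an invented XHTML fragment:

```python
import xml.etree.ElementTree as ET

# Invented, well-formed fragment standing in for a generated XHTML page.
html = """
<html><body>
  <table id="decisions">
    <tr><td>2013-01</td><td>Decision A</td></tr>
    <tr><td>2013-02</td><td>Decision B</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(html)
# .//table[@id='decisions']/tr/td[2] : the second cell of every row.
titles = [td.text for td in root.findall(".//table[@id='decisions']/tr/td[2]")]
print(titles)  # → ['Decision A', 'Decision B']
```

The query survives layout changes as long as the table keeps its id and column order, which is the practical appeal of XPath over ad hoc string matching.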

If the HTML pages are not XHTML, the tools and libraries available today for scraping the web are still truly awesome, but using them requires that a programmer be involved for the whole round trip. Then again, so does querying most other complex data sources, AFAIK (SQL, etc.).

On another note, without the many innovations in SGML, the world wide web perhaps never would have happened as quickly as it did (or at all).  

Peder 




David Akin
Full contact details at:
cell: +1 613 698 7412


Re: Standards rant on structured documents

john whelan
>3. For a generic document, Dublin Core is the most widely adopted standard for metadata (author, date published, topics, etc.). For specific types of documents, there are more specific standards.

I'm under the impression that much of this could be extracted from the XML of an ISO/IEC 29500 document.
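That impression holds up in a sketch, using only the Python standard library and an invented in-memory package standing in for a real file: the package metadata of an ISO/IEC 29500 document lives in docProps/core.xml and already uses Dublin Core elements directly.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespaces used inside docProps/core.xml.
DC = "{http://purl.org/dc/elements/1.1/}"
DCTERMS = "{http://purl.org/dc/terms/}"

def core_metadata(source):
    """Return a small Dublin Core record from an ISO/IEC 29500 package."""
    with zipfile.ZipFile(source) as z:
        root = ET.fromstring(z.read("docProps/core.xml"))
    return {
        "title": root.findtext(DC + "title"),
        "creator": root.findtext(DC + "creator"),
        "created": root.findtext(DCTERMS + "created"),
    }

# Minimal in-memory package; a real .docx would be opened by file name.
core_xml = (
    '<cp:coreProperties'
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
    ' xmlns:dcterms="http://purl.org/dc/terms/">'
    "<dc:title>ATI response</dc:title>"
    "<dc:creator>Treasury Board</dc:creator>"
    "<dcterms:created>2013-09-15</dcterms:created>"
    "</cp:coreProperties>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("docProps/core.xml", core_xml)

print(core_metadata(buf))
```

So the author, title, and creation date James mentions come out of the package with no Office software involved; only the less common metadata fields would need mapping work.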

Cheerio John


On 15 September 2013 12:35, James McKinney <[hidden email]> wrote:
Hi David, to try to answer your questions:

1. Treasury Board seems to be sticking to the Open Government Partnership Action Plan, which only promised searchable ATI request summaries and being able to submit ATI requests online - nothing about publication of responses the way BC does. I think if schema were to be adopted, documents released via ATI would have to be more centralized than they are now (as Stephane described). Anyway, it's worthwhile to make inquiries.

2. There are parts of governments that publish documents with metadata, etc. but I don't know of any government-wide programs. The EU Commission has issued a lot of public sector information directives though, so there may be something there.

3. For a generic document, Dublin Core is the most widely adopted standard for metadata (author, date published, topics, etc.). For specific types of documents, there are more specific standards.

James


Re: Standards rant on structured documents

Russell McOrmond

On 13-09-15 09:41 PM, john whelan wrote:
>>3. For a generic document, Dublin Core is the most widely adopted
> standard for metadata (author, date published, topics, etc.). For
> specific types of documents, there are more specific standards.
>
> I'm under the impression that much of this could be extracted from the
> XML of an ISO/IEC 29500 document.

  Just a reminder in case anyone missed it: ISO/IEC 29500 is the highly
controversial "Office Open XML" from Microsoft, which was laundered
through the standards process to compete against the multi-vendor Open
Document Format (ODF). OOXML is a standard way for third-party
applications to interoperate with the Microsoft Office suite, and thus
isn't the same thing as a platform-neutral standard like ODF.

  While it is possible to extract metadata from ISO/IEC 29500 files,
replicating formatting really needs the Microsoft Office rendering
engines, as there are many controversial tags in the Transitional
schema (such as the option that says "lay out the document like Word
95") that were never adequately documented. While the claim is that
these will be phased out, Microsoft appears to be doing what Adobe did:
confusing governments with multiple formats that "sound" the same (e.g.
the PDF/A ISO standard, the publicly documented but non-standard PDF,
and proprietary Acrobat files all share the common .pdf file extension
but are not the same).

  I'm not intending to initiate a flame-fest or anything, but to remind
people that there is vendor politics and anti-competitive vendor
behavior that will always get in the way of Civic Access work.


BTW: Something being in XML can theoretically provide more
reverse-engineering clues when documentation isn't provided, but a
format isn't automatically open simply because it is XML. Without
documentation, we can discern the "structure" of a binary format nearly
as well. Some vendors have used XML as a marketing tool to claim their
formats are more open than they actually are.

--
 Russell McOrmond, Internet Consultant: <http://www.flora.ca/>
 Please help us tell the Canadian Parliament to protect our property
 rights as owners of Information Technology. Sign the petition!
 http://l.c11.ca/ict

 "The government, lobbied by legacy copyright holders and hardware
  manufacturers, can pry my camcorder, computer, home theatre, or
  portable media player from my cold dead hands!"