Standards rant on structured documents

Re: Standards rant on structured documents

Glen Newton
Thanks Russell: I was getting round to saying this too.
Suggesting ISO/IEC 29500 from both a moral and technical perspective
is offensive.

-Glen

On Mon, Sep 16, 2013 at 10:44 AM, Russell McOrmond <[hidden email]> wrote:

>
> On 13-09-15 09:41 PM, john whelan wrote:
>>>3. For a generic document, Dublin Core is the most widely adopted
>> standard for metadata (author, date published, topics, etc.). For
>> specific types of documents, there are more specific standards.
>>
>> I'm under the impression that much of this could be extracted from the
>> XML of an ISO/IEC 29500 document.
>
>   Just a reminder in case anyone missed it, ISO/IEC 29500 is the highly
> controversial "Office Open XML" from Microsoft that was laundered
> through the standards process to compete against the multi-vendor Open
> Document Format (ODF).  OOXML is a standard way for third party
> applications to interoperate with the Microsoft Office Suite, and thus
> isn't the same as a platform neutral standard like ODF.
>
>   While it is possible to extract metadata from ISO/IEC 29500 files,
> replicating formatting really needs the Microsoft Office rendering
> engines as there are many controversial tags in the Transitional variant (such
> as the options that said "lay out the document like Word 95") that were
> never adequately documented.  While the claim is that these will be
> phased out, Microsoft is obviously planning to do as Adobe did which is
> to confuse governments by having multiple formats that "sound" the same
> (IE: PDF/1 ISO standard, PDF publicly documented non-standard, and
> proprietary Acrobat files all share common .PDF file extension but are
> not the same).
>
>   I'm not intending to initiate a flame-fest or anything, but to remind
> people that there is vendor politics and anti-competitive vendor
> behavior that will always get in the way of Civic Access work.
>
>
> BTW: Something being in XML can theoretically provide more
> reverse-engineering clues when documentation isn't provided, but
> something isn't automatically an open file format simply because it is
> XML.   We know the "structure" of a binary format nearly as well without
> documentation.  Some vendors have used XML as a marketing tool to claim
> their formats are more open than they actually are.
>
> --
>  Russell McOrmond, Internet Consultant: <http://www.flora.ca/>
>  Please help us tell the Canadian Parliament to protect our property
>  rights as owners of Information Technology. Sign the petition!
>  http://l.c11.ca/ict
>
>  "The government, lobbied by legacy copyright holders and hardware
>   manufacturers, can pry my camcorder, computer, home theatre, or
>   portable media player from my cold dead hands!"
> _______________________________________________
> CivicAccess-discuss mailing list
> [hidden email]
> http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss



--
-
http://zzzoot.blogspot.com/
-

Re: Standards rant on structured documents

john whelan
In reply to this post by Russell McOrmond
Layout is always problematic; typefaces, for example, are copyrighted.  However, ISO/IEC 29500 does have defined XML tags that can be used to extract useful metadata, which I think was the objective here: to identify something that could be machine readable and let us get at the author and other information without manual intervention and its attendant costs and errors.
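
As a rough illustration of the extraction described above: an ISO/IEC 29500 file is a ZIP package, and Word conventionally writes a core-properties part at docProps/core.xml that reuses Dublin Core element names. The Python sketch below works under that assumption. The function names and the sample document are made up for illustration, and a more careful reader would locate the part through the package relationships instead of a fixed path.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part of an OOXML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical sample content, standing in for a real departmental document.
SAMPLE_CORE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>Briefing note</dc:title>
  <dc:creator>A. Analyst</dc:creator>
  <dcterms:created>2013-09-16T10:44:00Z</dcterms:created>
</cp:coreProperties>
"""


def read_ooxml_core_properties(source):
    """Return author, title and date fields found in docProps/core.xml."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    wanted = {
        "title": "dc:title",
        "author": "dc:creator",
        "subject": "dc:subject",
        "created": "dcterms:created",
        "modified": "dcterms:modified",
    }
    found = {}
    for name, path in wanted.items():
        element = root.find(path, NS)
        # Skip fields that are absent or empty instead of guessing a value.
        if element is not None and element.text:
            found[name] = element.text
    return found


def make_sample_package():
    """Build a tiny in-memory package containing only the metadata part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", SAMPLE_CORE_XML)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(read_ooxml_core_properties(make_sample_package()))
```

Nothing here needs Microsoft software: the standard library ZIP and XML modules are enough to get at the author, title and dates.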

All Federal Government departments have it available since it's the way they create / review documents with Treasury Board, even though they use Ami Pro, WordPerfect etc. more generally.  There are templates available so that Word will create a formal ISO/IEC 29500 document, and there are templates available for other word processors.  From a practical point of view it's cheap and easy for the originators to do.  From our point of view we don't have to reverse engineer the XML tags and then verify we got them all correct every time the software is revised.

I'm in complete agreement about vendor politics and extensions to .pdf etc., and I've often thought that we should have an ISO Open Standard word processor, but the reality is that these days word processing is no longer stand-alone but part of a system.  Form letters are created with fields filled in from an SQL database.  Microsoft's Visual Basic is part of a word processor these days.  You just can't switch to another word processor without impacting all sorts of systems.  For example, there is a whole slew of software for blind and other handicapped people, and it was being able to provide support to these people that stopped one local government in the US from switching to the "free" OpenOffice.

Realistically you can't request documents in LibreOffice format, it's proprietary; besides, it's a lot of extra work for a government department to have it available and support it.  It has to be a formal standard, and ideally one that is in line with TB standards.  TB has a history of promoting ISO standards.  The alternative would be the ISO SGML standard and DTD used by NEB; from our point of view I don't think it matters, but that would be extra work on the department's side.  Why not request the results in either ISO standard?  Both are more useful than a pile of paper.

Lowest common denominator solutions often leave things to be desired, but they are useful in that they give us a way to request data in a machine readable format from which we can extract some tags.  That is more useful than a pile of printed paper that needs to be scanned or retyped.

How you store the information once you have it is a different question and I personally don't think it should be stored in Microsoft Word format but if you wish to give access to handicapped people it might well be a consideration.

Cheerio John



Re: Standards rant on structured documents

Glen Newton
>Realistically you can't request documents in LibreOffice format, it's proprietary
Incorrect.
The OpenDocument format (which both OpenOffice and LibreOffice use) is an
ISO and OASIS standard: ISO/IEC 26300:2006/Amd 1:2012 - Open Document Format for
Office Applications
https://en.wikipedia.org/wiki/OpenDocument

BTW MS-Office 2010 and 2013 both support the OpenDocument format:
https://en.wikipedia.org/wiki/OpenDocument#Software
So you could change the format without changing the software, allowing
MS to keep at least its software monopoly in the government office, if
that is important to you. ;-)

-Glen




--
-
http://zzzoot.blogspot.com/
-

Re: Standards rant on structured documents

john whelan
I stand corrected on LibreOffice; I was under the impression that it had extended the ISO standard of OpenOffice and so wasn't exactly the ISO standard.

So even better: we can now request that the document be supplied in either ISO/IEC 26300:2006 or ISO/IEC 29500, or even the ISO SGML standard used by NEB, though that needs an agreed DTD.
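
To make the "either ISO standard" idea concrete, here is a minimal Python sketch, assuming both formats arrive as ZIP packages: ODF packages carry a "mimetype" entry and a meta.xml part, while OOXML packages carry a "[Content_Types].xml" entry. The function names and the sample content are hypothetical, and real documents may need more careful handling (encrypted or flat-XML files, for instance).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used inside an ODF meta.xml part.
ODF_NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample metadata, not taken from any real document.
SAMPLE_META_XML = """<?xml version="1.0" encoding="UTF-8"?>
<office:document-meta
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <office:meta>
    <dc:title>Survey methodology</dc:title>
    <meta:initial-creator>B. Statistician</meta:initial-creator>
    <meta:creation-date>2013-09-16T11:43:00</meta:creation-date>
  </office:meta>
</office:document-meta>
"""


def sniff_package(source):
    """Guess which ISO standard a ZIP-based document follows."""
    with zipfile.ZipFile(source) as package:
        names = set(package.namelist())
        if "mimetype" in names:
            mimetype = package.read("mimetype").decode("ascii", "replace")
            if mimetype.startswith("application/vnd.oasis.opendocument"):
                return "ISO/IEC 26300 (ODF)"
        if "[Content_Types].xml" in names:
            return "ISO/IEC 29500 (OOXML)"
    return "unknown"


def read_odf_metadata(source):
    """Return title, author and creation date from an ODF meta.xml part."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("meta.xml"))
    wanted = {
        "title": "office:meta/dc:title",
        "author": "office:meta/meta:initial-creator",
        "created": "office:meta/meta:creation-date",
    }
    found = {}
    for name, path in wanted.items():
        element = root.find(path, ODF_NS)
        if element is not None and element.text:
            found[name] = element.text
    return found


def make_sample(files):
    """Build an in-memory ZIP package from a {name: content} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name, content in files.items():
            package.writestr(name, content)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    odt = make_sample({
        "mimetype": "application/vnd.oasis.opendocument.text",
        "meta.xml": SAMPLE_META_XML,
    })
    print(sniff_package(odt), read_odf_metadata(odt))
```

A requester could run the sniffing step first and then dispatch to whichever metadata reader matches, so the department is free to supply either format.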

I'm not wedded to Microsoft, by the way.  At INAC I guided the department into WordPerfect for technical reasons, a major one being the susceptibility of Word documents to macro viruses.  Then we had a new IT Director who held deep religious views that we should be using Microsoft Word.  He paused, though, when I asked who would pay the training costs.  In those days I used to take the average salary, double it, then divide by 200 to give a cost per day.  Five days of conversion training dwarfed the cost of the software.  The department stayed with WordPerfect.
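
The rule of thumb above (average salary, doubled, divided by 200) can be written out as a small Python sketch. The salary, head count and licence price below are purely hypothetical placeholders to show the shape of the comparison; they are not figures from INAC.

```python
def daily_cost(average_salary, overhead_multiplier=2, working_days=200):
    """Cost of one staff day: salary doubled for overhead, spread over 200 days."""
    return average_salary * overhead_multiplier / working_days


def training_cost(average_salary, staff_count, training_days=5):
    """Cost of sending every staff member on conversion training."""
    return daily_cost(average_salary) * training_days * staff_count


if __name__ == "__main__":
    salary = 50_000        # hypothetical average salary
    staff = 1_000          # hypothetical number of users
    licence_price = 300    # hypothetical per-seat software price

    print("cost per day:      ", daily_cost(salary))
    print("training, all staff:", training_cost(salary, staff))
    print("software, all staff:", licence_price * staff)
```

With numbers of this order the training line is several times the software line, which is the comparison that gave the IT Director pause.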

At Stats we seriously looked at getting out of Microsoft Word.  Stats has traditionally hired computer science graduates who hold beliefs that UNIX and open source are wonderful, so the willingness was more than there.  Stats had enough technical expertise to overcome any technical problems.  Microsoft was very concerned, because if Stats showed it could be done then a lot of other governments might follow.

There were three show stoppers.  The first was typefaces: each typeface comes with kerning pairs, and it is extremely difficult to have exactly the same layout in two different typefaces.  The manager who looked at it first thought that a three page difference over a three hundred page document didn't matter.  Well, it doesn't, until you get complex documents with footnotes etc. that are going to be printed.

The second one was handicap access tools.  There was simply more available in the Microsoft environment, and it's a political one.

The third one was Visual Basic macros.  There were complex macros, really VB programs, that had been written by summer students and others.  We had no central record of them, but they were critical to particular surveys, and some surveys only ran once every five years.  The cost of finding, redeveloping and testing them was too high.

To give you some idea of testing costs: for year 2000, the cost of running one month's data in CPI was estimated at $1,000,000, and when asked for the money to run the test TB's answer was "use existing budgets".  I'll let you guess if that test was run.

Microsoft made a presentation saying they could save us a couple of million a year in licensing if we switched to SQL Server rather than use Oracle.  True enough, but the cost of redeveloping and testing one single system would have been greater than the savings of five years' licensing fees.

Cheerio John



Re: Standards rant on structured documents

Russell McOrmond
In reply to this post by john whelan
On 13-09-16 11:32 AM, john whelan wrote:
> I'm in complete agreement about vendor politics and extensions to .pdf
> etc and I've often thought that we should have an ISO Open Standard word
> processor but reality is these days that Word processing is no longer
> stand alone but part of a system.  Form letters are created with fields
> filled in from an SQL database.  Microsoft's Visual Basic is part of a
> word processor these days.

   This is exactly why many of us don't (and realistically can't)
separate Open Data from Free/Libre and open source software and file
formats.  If a foreign (or domestic) corporation controls access to the
government data it can't be claimed to be open.

  Microsoft Visual Basic can't be part of my word processor, given there
is no Microsoft software on my Ubuntu desktops or servers or the various
Android mobile devices I own.   No Apple either, but they aren't as much
of a problem in this space (far worse on the "who owns your computer",
etc space).

  Saying that I have to run specific offensive vendor software in order
to access government data would be as offensive to me as telling a
devout religious person that they must publicly renounce their faith in
order to interact with the government.

> Realistically you can't request documents in LibreOffice format, its
> propitiatory,

  This isn't the case.  LibreOffice is a fork of the OpenOffice
software, but both use the ODF standard format which pre-dates OOXML --
in fact it was the existing ODF standard and the global governmental
movement towards that vendor-neutral standard that sparked Microsoft to
create the far-less-standard OOXML in the first place (and to corrupt
the ECMA and ISO processes to fast-track this process).

> Why not request the results in either ISO standard, both are more useful
> than a pile of paper.

 I currently work at Canadiana.org, and part of what we do is scan old
government documents to make available online.   Without adequate
documentation of a digital file format I would question the suggestion
that it would automatically be better than a pile of paper.   We have no
interoperability problems with older paper, but documents written far
more recently in "word processors", from the 60's to recent years, are
inaccessible.

  The vendor lock-in from untrustworthy companies like Adobe and
Microsoft makes this into an issue that CivicAccess advocates should be
very well aware of.  It isn't enough that the government release its
limitations on access to government information; it also needs to not
collaborate in transferring control over barriers to the private sector.

  I'm agreeing that there is a sliding scale from worse to better,
but that "being digital" doesn't automatically make it better any more
than "being XML" does.  There are more complications involved in
evaluating that, and some of these complications were deliberately
manufactured by technology vendors.

> How you store the information once you have it is a different question
> and I personally don't think it should be stored in Microsoft Word
> format but if you wish to give access to handicapped people it might
> well be a consideration.

  I realize that Microsoft is good at the politics of abusing communities
to work on their behalf, but that doesn't mean we have to buy into it.

  Accessibility issues are a user interface issue within operating
systems and applications, not a file format issue.  Microsoft is the
source of the manufactured incompatibilities that handicapped people
have historically expressed concern with, not the solution.

  Governments have the ability to fund solutions to these problems,
rather than claiming that if some historical version of a competitor
doesn't have a feature, it is a legitimate reason for the government
to lock all citizens into a single vendor.

--
 Russell McOrmond, Internet Consultant: <http://www.flora.ca/>
 Please help us tell the Canadian Parliament to protect our property
 rights as owners of Information Technology. Sign the petition!
 http://l.c11.ca/ict

 "The government, lobbied by legacy copyright holders and hardware
  manufacturers, can pry my camcorder, computer, home theatre, or
  portable media player from my cold dead hands!"

Re: Standards rant on structured documents

David Akin
In reply to this post by Stéphane Guidoin
I think it is quite possible to provide some meta-tags for any ATI release. For one thing, every release is "wrapped" in the same structure, i.e. a cover page and often a list of how various exceptions (censor marks) have been applied and why.

And then: the records themselves typically have a basic taxonomy: financial records, estimates, house cards, invoices, memos, briefs, e-mails, etc.

Within each "doctype", then, there would be other common kinds of structure …

That would make comparing, collating, and indexing ATIs much more valuable for journalists.
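
One way to picture the wrapper-plus-taxonomy idea is the Python sketch below. Every class, field and identifier in it is hypothetical, chosen only to mirror the cover page, the list of exceptions and the doctype list described above; it is not an existing schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class DocType(Enum):
    """The basic record taxonomy listed above."""
    FINANCIAL_RECORD = "financial record"
    ESTIMATE = "estimate"
    HOUSE_CARD = "house card"
    INVOICE = "invoice"
    MEMO = "memo"
    BRIEF = "brief"
    EMAIL = "e-mail"


@dataclass
class Record:
    doctype: DocType
    title: str
    pages: int


@dataclass
class AtiRelease:
    request_id: str                                   # from the cover page
    exceptions: list = field(default_factory=list)    # which were applied, and why
    records: list = field(default_factory=list)


def index_by_doctype(releases):
    """Group (request id, record title) pairs by document type."""
    index = defaultdict(list)
    for release in releases:
        for record in release.records:
            index[record.doctype].append((release.request_id, record.title))
    return dict(index)


if __name__ == "__main__":
    # Placeholder data for illustration only.
    release = AtiRelease(
        request_id="REQ-0001",
        exceptions=["placeholder exception, with reason"],
        records=[
            Record(DocType.MEMO, "Memo to the deputy minister", 3),
            Record(DocType.INVOICE, "Consulting invoice", 1),
        ],
    )
    for doctype, entries in index_by_doctype([release]).items():
        print(doctype.value, entries)
```

Even tags this coarse would let a journalist pull every memo or every invoice across many releases without opening each file.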


On 2013-09-14, at 9:37 PM, Stéphane Guidoin <[hidden email]> wrote:

So it's not unreasonable to ask for structured data in the case of regulation and possibly new judgements. for the moment, it would appear to me to ask the impossible to get structured data for ATI outcomes.

David Akin
Full contact details at:
cell: +1 613 698 7412



Re: Standards rant on structured documents

john whelan
In reply to this post by Russell McOrmond

The point I was trying to make was that the Federal Government has a lot of existing software.  Not all of it is Microsoft Word.  However, there are pockets in government that use the extended feature set of the Microsoft environment, and currently it's too expensive to change; nor, as you point out, are there reasonable alternatives in the UNIX world.  It's even too expensive to change everyone in the federal government to the same word processor, be it Ami Pro, Microsoft Word etc.

When you scan paper, something has to crawl over it and tag it somehow.  If you can get a document in some form of ISO standard machine readable format that you can read, then you can normally get at some of the metadata generated.  The idea is to use the ISO standard as an interface: I don't care that you used Visual Basic to create the document; all I want is a machine readable document that follows an ISO standard.  I'm fairly certain that you can read any of the ISO standard document formats, whether SGML or XML tagged, in the UNIX environment.

I don't see that asking for a document in an ISO standard format locks you into a particular vendor in order to read or process the document.  In database terms, the text is a blob field that can be text-searched, but the blob can also have tags with the author, title, date and all the other things one tags a document with.  I make no recommendation as to how the documents should be converted or stored, nor as to the schema that should be used.
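
To make the "blob plus tags" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are assumptions chosen for illustration only (no schema is recommended in this thread), and the search is a plain case-insensitive substring match, not a real full-text index.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with one row per document."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id        INTEGER PRIMARY KEY,
               author    TEXT,
               title     TEXT,
               published TEXT,          -- ISO 8601 date string
               body      TEXT NOT NULL  -- the searchable text blob
           )"""
    )
    return conn


def add_document(conn, author, title, published, body):
    """Insert one tagged document and return its row id."""
    cur = conn.execute(
        "INSERT INTO documents (author, title, published, body) VALUES (?, ?, ?, ?)",
        (author, title, published, body),
    )
    conn.commit()
    return cur.lastrowid


def search(conn, phrase, author=None):
    """Find documents whose body contains phrase, optionally filtered by author tag."""
    sql = "SELECT id, title FROM documents WHERE instr(lower(body), lower(?)) > 0"
    params = [phrase]
    if author is not None:
        sql += " AND author = ?"
        params.append(author)
    sql += " ORDER BY id"
    return conn.execute(sql, params).fetchall()
```

The tags narrow the search and the blob is what gets searched; how the text arrived in the body column (conversion, OCR, export) is deliberately left out of the sketch.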

Yes, there are documents in government that were produced before ISO standards were around.  I think at INAC we paid someone $1,000,000 to convert our Wang word processing documents to HP Word format.  When we moved to WordPerfect I was able to identify a way to extract the text and font characteristics from the HP Word environment so that they could be converted directly into WordPerfect documents.

Commercial interests are unfortunately part of the Open Data environment.  Telus didn't sponsor the City of Ottawa's Open Data whatever out of the goodness of their heart, but rather because most of the apps were running on smart phones and needed an online data connection to work.  Bus stops don't move very often.  If you have an offline map with the 560 bus stop numbers you can either text or phone to find out the time of the next bus.  You may not require a $400 a year data plan.  How much Open Data is laid over Google maps?  There are Open Data alternatives such as OpenStreetMap, FOSM etc. but that's a different issue.

Cheerio John


On 16 September 2013 15:49, Russell McOrmond <[hidden email]> wrote:
On 13-09-16 11:32 AM, john whelan wrote:
> I'm in complete agreement about vendor politics and extensions to .pdf
> etc and I've often thought that we should have an ISO Open Standard word
> processor but reality is these days that Word processing is no longer
> stand alone but part of a system.  Form letters are created with fields
> filled in from an SQL database.  Microsoft's Visual Basic is part of a
> word processor these days.

   This is exactly why many of us don't (and realistically can't)
separate Open Data from Free/Libre and open source software and file
formats.  If a foreign (or domestic) corporation controls access to the
government data it can't be claimed to be open.

  Microsoft Visual Basic can't be part of my word processor, given there
is no Microsoft software on my Ubuntu desktops or servers or the various
Android mobile devices I own.   No Apple either, but they aren't as much
of a problem in this space (far worse on the "who owns your computer",
etc space).

  Saying that I have to run specific offensive vendor software in order
to access government data would be as offensive to me as telling a
devout religious person that they must publicly renounce their faith in
order to interact with the government.

> Realistically you can't request documents in LibreOffice format, it's
> proprietary,

  This isn't the case.  LibreOffice is a fork of the OpenOffice
software, but both use the ODF standard format which pre-dates OOXML --
in fact it was the existing ODF standard and the global governmental
movement towards that vendor-neutral standard that sparked Microsoft to
create the far-less-standard OOXML in the first place (and to corrupt
the ECMA and ISO processes to fast-track this process).

> Why not request the results in either ISO standard, both are more useful
> than a pile of paper.

 I currently work at Canadiana.org, and part of what we do is scan old
government documents to make available online.   Without adequate
documentation of a digital file format I would question the suggestion
that it would automatically be better than a pile of paper.   We have no
interoperability problems with older paper, but documents written far
more recently in "word processors" from the 60's to recently are
inaccessible.

  The vendor-lock-in from untrustworthy companies like Adobe and
Microsoft make this into an issue that CivicAccess advocates should be
very well aware of.  It isn't enough that the government release its
limitations on access to government information, they need to not
collaborate to transfer control over barriers to the private sector.

  I'm agreeing that there is a sliding scale from worse to better,
but that "being digital" doesn't automatically make it better any more
than "being XML" does.  There are more complications involved in
evaluating that, and some of these complications were deliberately
manufactured by technology vendors.

> How you store the information once you have it is a different question
> and I personally don't think it should be stored in Microsoft Word
> format but if you wish to give access to handicapped people it might
> well be a consideration.

  I realize that Microsoft is good at the politics of abusing communities
to work on their behalf, but that doesn't mean we have to buy into it.

  Accessibility issues are a user interface issue within operating
systems and applications, not a file format issue.  Microsoft is the
source of the manufactured incompatibilities that handicapped people
have historically expressed concern with, not the solution.

  Governments have the ability to fund solutions to these problems,
rather than claiming that if some historical version of a competitor
doesn't have a feature that it is legitimate reason for the government
to lock all citizens into a single vendor.

--
 Russell McOrmond, Internet Consultant: <http://www.flora.ca/>
 Please help us tell the Canadian Parliament to protect our property
 rights as owners of Information Technology. Sign the petition!
 http://l.c11.ca/ict

 "The government, lobbied by legacy copyright holders and hardware
  manufacturers, can pry my camcorder, computer, home theatre, or
  portable media player from my cold dead hands!"

Re: Standards rant on structured documents

Stéphane Guidoin
In reply to this post by David Akin
> I think it quite possible to provide some meta-tags for any ATI release. For one thing, every release is "wrapped" in the same structure, i.e. a cover page and often a list of how various exceptions (censor marks) have been applied and why.

Yes! I thought you meant a standard format for the content by itself. Yes, metatags/metadata would be possible as James wrote previously.

Stephane


Re: Standards rant on structured documents

Russell McOrmond
In reply to this post by john whelan

On 13-09-16 04:20 PM, john whelan wrote:
> I don't see that asking for a document in an ISO standard format locks
> you into a particular vendor in order to read or process the document.


  Something being an "ISO standard" doesn't mean what you seem to think
it means.  For instance, an ISO standard can be patent encumbered and
under a RAND license which legally excludes FLOSS implementations of
that standard.  Sure, there may not be a "particular vendor" one is
locked into with RAND but that is like saying you can choose any
candidate as long as they are part of the Communist party, and claim
that is choice.

  Being an ISO standard doesn't even mean that the format is defined
enough to do what you need to be able to do with the document.  For
instance, if you want to be able to view or print a document you need to
be able to render it, and there is quite a bit of OOXML (as only one
example -- this isn't a Microsoft issue) that isn't documented such that
third parties can render it.

  If what you mean is that you can extract metadata and store the data
blob in a database for others to find, you can do that without there
being any "ISO standard" at all.

  It does go back to what one means by "open data", and what value we
think it might have.  If all we mean is that the data blob is copyright
licensed in such a way that it allows third parties to store and
communicate it without additional permission or payment then that
doesn't really mean "open data" is very valuable for civic access to
government.

--
 Russell McOrmond, Internet Consultant: <http://www.flora.ca/>
 Please help us tell the Canadian Parliament to protect our property
 rights as owners of Information Technology. Sign the petition!
 http://l.c11.ca/ict

 "The government, lobbied by legacy copyright holders and hardware
  manufacturers, can pry my camcorder, computer, home theatre, or
  portable media player from my cold dead hands!"

Re: Standards rant on structured documents

Stéphane Guidoin
While watching the OKCon
webcast (http://new.livestream.com/accounts/5389255/okcon), I found this:
Legal XML - http://legalxml.org/about/index.shtml

Steph


Re: Standards rant on structured documents

James McKinney-2
Indeed, I've made an incomplete list of legislative and legal document standards here: https://github.com/opennorth/popolo-spec/issues/13



Re: Standards rant on structured documents

john whelan
In reply to this post by Russell McOrmond
The objective is to request machine-readable documents from the federal government, nothing else.

Being pragmatic, to do that successfully you need to find a solution that is easy for them to implement and appears to be in a non-proprietary format. TB has a tradition of following ISO standards, hence the ISO standard format.  This is a big change from a paper photocopy, and since it is a very large change, to be successful it has to appear non-threatening and it has to appear to be a minor change.  Anything else they'd have to think about, and that's civil service for "no" without saying it: just take ten years to think about it.

I agree you don't need an ISO standard to use machine-readable documents, but what format are you going to request?  And that's the key here.  If you request it as plain text you lose the metadata, and which character set is it encoded in?  Anything other than those ISO standards and they have to think about it.
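
A small sketch of the character set problem with untagged plain text: the bytes alone do not say how to read them, so the same file decodes differently depending on what the reader assumes. The helper name decode_both is made up for this illustration.

```python
def decode_both(raw):
    """Decode the same bytes under two common assumptions about the encoding."""
    return {
        "utf-8": raw.decode("utf-8", errors="replace"),
        # Latin-1 maps every byte to a character, so it never fails;
        # it just silently yields the wrong text when the guess is wrong.
        "latin-1": raw.decode("latin-1"),
    }


# A name with an accented character, saved as UTF-8 with no charset label.
raw = "Stéphane".encode("utf-8")
readings = decode_both(raw)
print(readings["utf-8"])    # the text as written
print(readings["latin-1"])  # mojibake: the accented letter becomes two characters
```

A format that records its own encoding, as XML does in its declaration, removes the guesswork that plain text leaves to the reader.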

Layout will always be a problem unless you have access to the original software and typefaces.  PostScript might work if you want the layout, but that doesn't tag the metadata.

With paper you need to OCR it, and OCR isn't as reliable as you might like.  Additionally, it needs manual intervention to tag the document; otherwise you end up with lots of .jpg images, and it can be difficult to search through them and find the one you want.

I think Steph has identified an XML standard way of tagging legal documents with LegalXML, and I think that would be an excellent way to store the tagged documents using agreed, standard, defined tags.  This is fairly common in the XML world.  But whilst it would be nice, the LegalXML tags are not appropriate for every document that the government produces, and notice the word "agreed": translated, that means the feds would need to think about it, and normally that takes effort on their part, which means you might not get a positive answer.  However, some documents might be tagged in this way internally, and if you can get them so much the better.

It's not a perfect answer, but it is a solution that the feds might find easy to implement and that would give machine-readable documents.  A major change can take a long time: it took three years from the time that Stats introduced resource booking as part of Outlook before it was regularly used to book rooms.  This wasn't a technology issue, just a fairly simple change in the way things were done.

Cheerio John


