Practical question: how to manage "textual" content


Practical question: how to manage "textual" content

Stéphane Guidoin
Hi there,

I don't think I had the opportunity to mention it here, but in July I joined
the Smart City Office of the City of Montréal, where I am handling the open
data program.

Today, I have a fairly simple, very practical, but still (to my mind) tough
issue: how to deal with "textual" content. Here, we have council minutes
(unfortunately still unstructured) but also all sorts of reports, some of
them already published on our open data portal.

My natural tendency is to believe that this is not "data" in the sense
people expect from open data (and that, as such, all these minutes and
reports do not belong on an open data portal). And let's face it, open data
portal tools are not really convenient for this sort of content (lacking,
for example, the ability to search text within documents). On the other
hand, users tend to find it useful to have one place for all "documents"
that are open, and it is possible to consider text as data.

Any comments on this? Any examples of portals managing textual content
efficiently?

Bonus track: what about images (historical pictures, etc.)?

Stéphane
_______________________________________________
CivicAccess-discuss mailing list
[hidden email]
http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss

Re: Practical question: how to manage "textual" content

Glen Newton
To make it searchable: use Solr. https://lucene.apache.org/solr/
You _can_ also store it (only) in Solr (Lucene is the backend), but you
_should_ (also) keep the content in some kind of primary store (Postgres,
MySQL, even SQLite...).
If you want some help with it, let me know. I'm a bit of a Lucene/Solr expert...
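Glen's suggestion can be sketched concretely. A minimal sketch, assuming a Solr core fed through its JSON update endpoint; the field names and document values below are illustrative, not an actual Montréal schema:

```python
import json

# Sketch: council minutes shaped for Solr's JSON update endpoint.
# The id also stays in the primary store (Postgres, etc.), per Glen's advice.
def solr_update_payload(docs):
    """Serialize a list of document dicts as a Solr JSON update body."""
    return json.dumps(docs, ensure_ascii=False)

minutes = [
    {
        "id": "conseil-2015-08-24",          # same key as in the primary store
        "title": "Séance du conseil municipal",
        "date": "2015-08-24T19:00:00Z",
        "body_txt_fr": "Procès-verbal ...",  # French text field for stemming
    },
]

payload = solr_update_payload(minutes)
```

With a running Solr instance, this payload would be POSTed to the core's `/update/json/docs` handler; the index can then be rebuilt from the primary store at any time.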

Glen


Re: Practical question: how to manage "textual" content

Robin Millette
Allô,

On Mon, Aug 24, 2015 at 4:18 PM, Glen Newton <[hidden email]> wrote:
> To make it searchable: use Solr. https://lucene.apache.org/solr/
> You _can_ also store it (only) in Solr (Lucene is the backend), but
> you _should_ (also) keep the content in some kind of primary store
> (Postgres, MySQL, even SQLite...)

Depending on the types of documents, Tika might be more appropriate:
https://tika.apache.org/
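Tika is commonly run as an HTTP server (tika-server, default port 9998), which keeps text extraction language-agnostic. A minimal sketch, assuming such a server; only the request is built here, so nothing is actually sent:

```python
import urllib.request

# Sketch: asking a running Tika server to extract plain text from a file.
# The server URL is an assumption (tika-server's default port is 9998);
# PUT /tika with an Accept header selects the output format.
def tika_extract_request(file_bytes, server="http://localhost:9998"):
    """Build a PUT /tika request asking for plain-text extraction."""
    return urllib.request.Request(
        url=server + "/tika",
        data=file_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},  # "application/json" adds metadata
    )

req = tika_extract_request(b"%PDF-1.4 ...")  # placeholder bytes, not a real PDF
# With a live server:
# text = urllib.request.urlopen(req).read().decode("utf-8")
```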

P.S.:
If you're in Québec, join us for the Semaine québécoise de
l'informatique libre next month: http://2015.sqil.info/ :-)

--
Robin

Re: Practical question: how to manage "textual" content

David H. Mason
In reply to this post by Glen Newton

With documents, the best approach is tags and search. Similar to Glen's suggestion, I'd suggest ElasticSearch. It's also based on Lucene and is a sibling of Solr. At one point ElasticSearch pulled ahead in ease of use (particularly around clustering, but also schema-less data), though Solr has, I believe, mostly caught up. ElasticSearch is native JSON, with plugins for indexing Office document types, relevance search, clustering, and so on. See [1]. As Robin linked, it supports Tika natively. [2] Language support (human and programming) is very good.
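The kind of relevance search described above can be sketched as a plain JSON search body. A minimal sketch; the `body` field name and the query text are assumptions, not a real index layout:

```python
import json

# Sketch: a full-text search body of the sort ElasticSearch accepts
# (POST /<index>/_search): a match query plus highlighted snippets.
def document_search_body(query, field="body"):
    return {
        "query": {"match": {field: {"query": query, "operator": "and"}}},
        "highlight": {"fields": {field: {}}},  # return matching fragments
    }

body = document_search_body("contrat déneigement")
print(json.dumps(body, ensure_ascii=False))
```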

I worked extensively with ES while developing an open source, portal-like system for research organizations to actively classify and share information. [3] It worked well, but the project didn't continue. Though my dance card is quite full, I'd love a chance to be part of a version 2.

David



Re: Practical question: how to manage "textual" content

Gerry Tychon
In reply to this post by Stéphane Guidoin
Stéphane …

Without getting into the technical aspects, which others have addressed, I think having a mechanism to find electronic documents in a central location/portal would be most useful: a form of electronic library. Currently, most government organizations have documents scattered across various departments/ministries, which often leads to frustration when trying to find what you want efficiently.

I don't want to call free-form electronic documents "open data", since that goes against the early proposition that the data can be consumed in some standard fashion. And I would be concerned that government employees will publish documents as PDF files and then say they have published open data.

But if we can keep the distinction between open data and electronic publications, then I think some good can come of it.

Pet peeve: many of the publications created and then put out as PDFs are designed as if the method of consumption were a glossy paper document. Today, most documents are viewed electronically and/or printed locally. I find it frustrating to have to print a document that has 10 pages of content but has been puffed up to 20 or more pages with needless images or charts. Let us make these documents greener: less paper, less toner, less electricity, and less time wasted.

… gerry



Re: Practical question: how to manage "textual" content

Pascal Robichaud
In reply to this post by David H. Mason
For the contracts in the 'Ordre du jour', I've been working on a Python script to extract the data.

The tests are working for the Executive Committee.

I'm putting the final touches on it to pull everything together.

Until a better solution is in place, the goal is to watch the city web site for new PDF files. When one appears: download it, convert it to text, process it to extract the contracts in CSV format, and then send a tweet to inform people about new contracts to be adopted.

This would be a proactive approach, instead of the current 'after the fact' one.

Citizens would then be informed and could take action if need be.

It won't be 100% effective, but it will be better than nothing ;-) We have to start somewhere.
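The watch-and-extract loop described above can be sketched as follows. The HTML snippet, link pattern, and contract fields are illustrative, and the PDF download and PDF-to-text steps are left out:

```python
import csv
import io
import re

# Sketch of the pipeline: find PDF links in a page, keep only the ones
# not seen before, and serialize extracted contract rows as CSV.
PDF_LINK = re.compile(r'href="([^"]+\.pdf)"', re.IGNORECASE)

def new_pdf_links(html, seen):
    """Return links to PDFs that have not been processed yet."""
    return [link for link in PDF_LINK.findall(html) if link not in seen]

def contracts_to_csv(rows):
    """Serialize extracted contract rows (dicts) as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["date", "supplier", "amount"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical agenda page fragment:
html = '<a href="/sim/ordre-du-jour-2015-08-24.pdf">Ordre du jour</a>'
links = new_pdf_links(html, seen=set())
```

A real run would fetch the page on a schedule, persist the `seen` set between runs, and hand each new link to the PDF-to-text step.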

Pascal


Re: Practical question: how to manage "textual" content

Josée Plamondon
Stéphane,

I totally agree with Gerry on this: "I don't want to call free-form electronic documents 'open data', since that goes against the early proposition that the data can be consumed in some standard fashion. And I would be concerned that government employees will publish documents as PDF files and then say they have published open data."

I am working on a metadata project at another level of government (one that could have an open data "side" in the near future). It is important to at least maintain the distinction (a separate section, page, or tab) between readily usable, machine-readable data and text files or PDFs, and to explain why. It is still hard to change the old habit of putting anything in a PDF to make it "public".

Josée



--
Josée Plamondon, MSI MBA
Analyste, Exploitation de contenu numérique

514.969.1273




Re: Practical question: how to manage "textual" content

john whelan
I think I'd just make them available in their original format. It's easy enough to put a collection of documents on a hard drive, and WordPerfect, for example, comes with a full-text search engine that can search across document collections.

Also, if you convert a document you tend to lose something.

Cheerio John


Re: Practical question: how to manage "textual" content

Stéphane Guidoin
Thanks, all, for the answers/pointers. For the moment I am less concerned with "how to do it technically" than with "what should go where".

Currently, reports are posted on the open data portal, but even when they are published as open document text, like others I am not comfortable putting them next to CSVs and other structured data. It is still difficult to get many "data owners" to understand what structured data is. Accepting textual docs gives the impression that *any* electronic file qualifies.

On top of that, we are bound by open data portal software: most of it is made for structured data. Even if I add full-text search for docs in CKAN, I am still stuck with a structure that does not have good support for textual documents.

Last week we had some exchanges on Twitter about this (thanks Josée & Pascal), but 140 characters is not the best format for discussing it. I wanted to see how practitioners here felt about textual docs sitting next to structured data.

Don't hesitate to point me to places where you feel textual documents (and images) are well supported in an approach similar to open data.

Stéphane



Re: Practical question: how to manage "textual" content

Tracey P. Lauriault
Stéphane,

I am not sure how to deal with the technical issues of structured vs. unstructured content, but there is certainly a need for documentation in the portal, especially the documentation that accompanies the data: methodological documents, reports and so on. https://rd-alliance.org/groups/rdawds-publishing-data-services-wg.html

The tagging of things is, at the moment, not good enough, and I cannot believe I am saying this, but we need to link, as in linked data, and we need unique identifiers for both datasets and the textual documents that may be related in some way to the data.

There was quite a bit of talk in Ireland about a URI engine (https://data.gov.ie/technical-framework#unique-resource-identifiers), but what and how to label things was an issue, especially as institutional names change. There are DOIs for documents, however!

http://www.icsu-wds.org/files/nasa-doi-sept-oct-2012.pdf
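Tracey's suggestion, a stable identifier on each side plus an explicit link between them, can be sketched as a minimal record; every identifier and key below is an invented placeholder, not a real URI scheme or DOI:

```python
# A dataset and a related document, each carrying its own stable identifier.
dataset = {
    "uri": "https://example.org/id/dataset/council-minutes",  # hypothetical URI
    "title": "Council minutes (extracted data)",
}
document = {
    "doi": "10.1234/example.minutes.2015",  # hypothetical DOI
    "title": "Council minutes, 2015 (PDF)",
}

# The link is by identifier only, so it survives renamed institutions,
# moved portals, and re-titled documents.
links = [(dataset["uri"], "documentedBy", document["doi"])]
print(links)
```

The design point is that neither record embeds the other's title or URL: only identifiers cross the boundary, which is what makes the relation durable.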

Irrespective, there is much talk about this at the Research Data Alliance, among big linked data shops such as INSIGHT Galway, and among the semantic web people.

I am not sure if that helps or not, but this is what came to mind while reading the exchange.

Cheerio
t

On Mon, Aug 24, 2015 at 9:32 PM, Stéphane Guidoin <[hidden email]> wrote:
Thanks all for the answers/pointers. For the moment I am less interested in "how to do it technically" than in "what should go where".

Currently, reports are posted on the open data portal, but even when they are published as OpenDocument Text, I am, like others, not comfortable putting them next to CSVs and other structured data. It is still difficult to get many "data owners" to understand what structured data is. Accepting textual docs gives the impression that *any* electronic file makes the cut.

On top of that, we are bound by the open data portal software: most of these tools are made for structured data. Even if I add full-text search over documents in CKAN, I am still stuck with a structure that does not have good support for textual documents.

Last week we had some exchanges on Twitter about this (thanks Josée & Pascal), but 140 characters is not the best format for such a discussion. I wanted to see how the practitioners here feel about textual docs sitting next to structured data.

Don't hesitate to point me to places where you feel textual documents (and images) are well supported in an approach similar to open data.

Stéphane



On 2015-08-24 19:24, john whelan wrote:
I think I'd just make them available in their original format. It's easy enough to put a collection of documents on a hard drive, and WordPerfect, for example, comes with a full-text search engine that can search across document collections.

Also if you convert the document you tend to lose something.

Cheerio John

On 24 August 2015 at 18:44, Josée Plamondon <[hidden email]> wrote:
Stéphane,

I totally agree with Gerry on this: "I don’t want to call free-form electronic documents “open data”, since it goes against early propositions that the data can be consumed in some standard fashion. And I would be concerned that government employees will publish documents as PDF files and then say they have published open data."

I am working on a metadata project at another level of government (one that could have an open data "side" in the near future). It is important to, at least, maintain the distinction (a separate section, page or tab) between readily usable, machine-readable data and text files or PDFs, and to explain why it is so. It is still hard to change the old habit of putting anything in a PDF to make it "public".

Josée



2015-08-24 17:14 GMT-04:00 Pascal Robichaud <[hidden email]>:
For the contracts in the 'Ordre du jour' (the agenda), I've been working on a Python script to extract the data.

The tests are working for the Executive Committee.

I'm putting the final touches on it to bring everything together.

Until a better solution is put in place, the goal is to watch the city web site for new PDF files. If one appears: download it, convert it to text, process it to extract the contracts in CSV format, and then send a tweet to inform people about new contracts to be adopted.

This would then be a proactive approach, instead of the current after-the-fact one.

Citizens would then be informed and could take action if need be.

It won't be 100% effective, but it will be better than nothing ;-) We have to start somewhere.
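The extract-to-CSV step of the pipeline Pascal describes might look roughly like this; the line format, regex, and sample text are invented stand-ins, since the real layout of the city's agenda PDFs isn't shown in the thread, and the PDF-to-text step (e.g. piping files through the `pdftotext` utility) is left out:

```python
import csv
import io
import re

# Invented sample of what the converted agenda text might contain.
minutes_text = """\
20.01 Accorder un contrat a Exemple Inc. - 125000 $
20.02 Accorder un contrat a Autre Cie - 98500 $
30.01 Adoption du proces-verbal
"""

# Hypothetical pattern: item number, supplier, amount in dollars.
contract_re = re.compile(r"^(\d+\.\d+) Accorder un contrat a (.+?) - (\d+) \$", re.M)

# Each match becomes one (item, supplier, amount) row; non-contract
# agenda items simply don't match.
rows = contract_re.findall(minutes_text)

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["item", "supplier", "amount"])
writer.writerows(rows)
print(out.getvalue())
```

In a real run, `out` would be a file on disk (or the body of the tweet-alert step), and the regex would have to be tuned against actual PDFs, which is exactly why it "won't be 100% effective".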

Pascal

2015-08-24 16:36 GMT-04:00 David H. Mason <[hidden email]>:

With documents, the best approach is tags and search. Similar to Glen's suggestion, I'd suggest Elasticsearch. It is also based on Lucene and is a sibling of Solr. At one point Elasticsearch pulled ahead in ease of use (particularly around clustering, but also schema-less data), though Solr has, I believe, mostly caught up. Elasticsearch is natively JSON, with plugins for indexing Office document types, relevance searches, clustering, and so on. See [1]. As Robin linked, it supports Tika natively. [2] Language (human and programming) support is very good.

I worked extensively with ES while developing an open-source, portal-like system for research organizations to actively classify and share information. [3] It worked well, but the project didn't continue. Though my dance card is quite full, I'd love a chance to be part of a version 2.
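As a rough illustration of what David describes, the request bodies for an Elasticsearch document index and a full-text query are just JSON. The index layout and field names below are invented for illustration (and mapping syntax varies across Elasticsearch versions); the Tika/attachment plugin he mentions would do the actual text extraction from Office or PDF files into the `body` field:

```python
import json

# Hypothetical mapping for a "city-docs" index: one text field for the
# extracted body, plus a little metadata.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "body": {"type": "text"},
            "published": {"type": "date"},
        }
    }
}

# Full-text relevance query against the extracted body.
query = {"query": {"match": {"body": "contracts"}}}

# These would be sent over HTTP, e.g.:
#   PUT  /city-docs          (create the index with the mapping)
#   POST /city-docs/_search  (run the query)
print(json.dumps(query))
```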

David


On 24 August 2015 at 16:18, Glen Newton <[hidden email]> wrote:
To make it searchable: use Solr. https://lucene.apache.org/solr/
You _can_ also store the content (only) in Solr (Lucene on the backend), but
you _should_ (also) keep it in some kind of primary store
(Postgres, MySQL, even SQLite...).
If you want some help with it, let me know. I'm a bit of a Lucene/Solr expert...
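Glen's two-part pattern, an authoritative primary store plus a rebuildable full-text index, can be sketched in a few lines. Purely so the sketch is self-contained, SQLite's FTS5 extension stands in for the Solr/Lucene side here (an assumption, not his actual stack, and FTS5 support in the local SQLite build is assumed); the table and document names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Primary store: the authoritative copy of each document.
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")

# Search index: derived from the primary store and rebuildable at any time.
# FTS5 plays the Lucene role in this sketch, as an external-content table.
conn.execute(
    "CREATE VIRTUAL TABLE docs_fts USING fts5("
    "title, body, content='docs', content_rowid='id')")

docs = [
    (1, "Council minutes 2015-08-24", "Adoption of the agenda and contracts."),
    (2, "Annual report", "Summary of open data activities for the year."),
]
conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", docs)
# Keep the external-content index in sync with the primary store.
conn.executemany("INSERT INTO docs_fts(rowid, title, body) VALUES (?, ?, ?)", docs)

# Full-text query: which documents mention "contracts"?
hits = [row[0] for row in conn.execute(
    "SELECT title FROM docs_fts WHERE docs_fts MATCH ?", ("contracts",))]
print(hits)  # ['Council minutes 2015-08-24']
```

The point of keeping the primary store separate is that the index can be dropped and rebuilt (or swapped for Solr or Elasticsearch) without any risk to the documents themselves.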

Glen

On Mon, Aug 24, 2015 at 2:12 PM, Stéphane Guidoin
<[hidden email]> wrote:
> Hi there,
>
> I don't think I had the opportunity to mention it here, but since July, I
> joined the Smart City Office of the City of Montréal where I am handling the
> open data program.
>
> Today, I have a fairly simple, very practical but still tough (in my mind)
> issue: how to deal with "textual" content. Here, we have minutes from
> councils (unfortunately still unstructured) but also all sorts of reports,
> some of them already published on our open data portal.
>
> I have a natural tendency to believe that this is not "data" as people
> expect it in open data (and as such all these minutes and reports do not
> fit in an open data portal). And let's face it, open data portal tools are
> not really convenient for this sort of content (lacking, for example, the
> capability to search text within documents). On the other hand, users tend
> to find it useful to have one place holding all "documents" that are open,
> and it is possible to consider text as data.
>
> Any comments on this? Any examples of portals that manage textual content
> efficiently?
>
> Bonus track: what about images (like historical pictures, etc.)
>
> Stéphane
> _______________________________________________
> CivicAccess-discuss mailing list
> [hidden email]
> http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss



--
Josée Plamondon, MSI MBA
Analyste, Exploitation de contenu numérique

514.969.1273






--
Tracey P. Lauriault
Assistant Professor 
Critical Media Studies and Big Data
Communication Studies
School of Journalism and Communication
Suite 4110, River Building
Carleton University
1125 Colonel By Drive
Ottawa (ON) K1S 5B6
1-613-520-2600 x7443
[hidden email]
@TraceyLauriault
Skype: Tracey.P.Lauriault
