scraping and scrubbing

scraping and scrubbing

Michael Lenczner-2
I'm spending some time exploring some of the data tools out there.
Anybody playing around with Needlebase?
http://www.needlebase.com

I'm also interested in chatting with people playing around with
Refine or Tableau.  I'm wondering how I should be using what, since
there's some overlap.  I want to geek out on some data acquisition /
scrubbing / integrating / visualizing processes.

Re: scraping and scrubbing

Michael Mulley
Just read this today:

http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data

It's an introductory guide to scraping data, which goes into some
especially heroic cases (scanned document images delivered via Flash).
Most of it's targeted to people with some coding knowledge, but the
first part of the series is a nice example of using Refine.

On Tue, Jan 4, 2011 at 12:01 AM, Michael Lenczner <[hidden email]> wrote:

> I'm spending some time exploring some of the data tools out there.
> Anybody playing around with Needlebase?
> http://www.needlebase.com
>
> I'm also interested in chatting with people playing around with
> Refine or Tableau.  I'm wondering how I should be using what, since
> there's some overlap.  I want to geek out on some data acquisition /
> scrubbing / integrating / visualizing processes.
> _______________________________________________
> CivicAccess-discuss mailing list
> [hidden email]
> http://lists.pwd.ca/mailman/listinfo/civicaccess-discuss
>

Re: scraping and scrubbing

Karl Dubost
In reply to this post by Michael Lenczner-2

On Jan 4, 2011, at 00:01, Michael Lenczner wrote:
> I'm spending some time exploring some of the data tools out there.

Also, even if it is not strictly scraping, Yahoo! Pipes can be a convenient tool.
http://pipes.yahoo.com/pipes/

--
Karl Dubost
Montréal, QC, Canada
http://www.la-grange.net/karl/


Re: scraping and scrubbing

Morgen Peers
In reply to this post by Michael Lenczner-2
For a great scraping/multi-purpose internet adventuring tool, check out:

Outwit Hub / Outwit Technologies

http://www.outwit.com/products/hub/

Cheers,
Morgen

Re: scraping and scrubbing

Michael Lenczner-2
That's a great article, Michael.  And thanks, Morgen and Karl.

I'm more interested in data scrubbing and merging.  Gridworks and
Needlebase do clustering, and you can use Freebase (and now other
datasets via an API) to additionally reconcile stuff from a Gridworks
project.

There's a thread here on reconciling data based on an entire record
(my needs are name and address and registration number), not only one
field - http://lists.freebase.com/pipermail/freebase-discuss/2010-May/001491.html
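
To make the whole-record idea concrete, here is a minimal sketch in Python of
key-collision matching on several fields at once. It is only loosely modeled
on the fingerprint-style clustering these tools offer, not taken from any of
them, and the field names (name, address, registration_number) and the sample
records are hypothetical placeholders for whatever the real data uses.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalize free text so trivially different spellings collide."""
    # Strip accents, lowercase, and turn punctuation into spaces.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = re.sub(r"[^a-z0-9\s]", " ", ascii_text.lower())
    # Sort the unique tokens so word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def normalize_id(value):
    """Keep only letters and digits of an identifier (drops dashes, spaces)."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def record_key(record):
    """Build one composite key from the whole record (hypothetical fields)."""
    return (
        fingerprint(record.get("name", "")),
        fingerprint(record.get("address", "")),
        normalize_id(record.get("registration_number", "")),
    )


def cluster_records(records):
    """Group records whose composite keys collide; return groups of 2 or more."""
    groups = defaultdict(list)
    for record in records:
        groups[record_key(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]


# Made-up sample records, for illustration only.
records = [
    {"name": "Société Canadienne, The", "address": "12 Main St.",
     "registration_number": "12345-6789"},
    {"name": "the societe canadienne", "address": "12 main st",
     "registration_number": "123456789"},
    {"name": "Another Org", "address": "99 Elm Ave",
     "registration_number": "987654321"},
]
clusters = cluster_records(records)
```

Exact key collision is deliberately strict: it only catches differences in
case, accents, punctuation and word order. Anything fuzzier (abbreviations,
typos) would need a distance-based comparison on top of this.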

Any suggestions / thoughts for ways forward? I'm happy to go off-list
if anyone has expertise on this.

I hope this isn't too off-topic for CA.  I figured we could do with
some technical discussion. :) The data I'm working on is Canadian
federal data on non-profits.

Mike

On Tue, Jan 4, 2011 at 9:05 AM, Morgen Peers <[hidden email]> wrote:

> For a great scraping/multi-purpose internet adventuring tool, check out:
>
> Outwit Hub / Outwit Technologies
>
> http://www.outwit.com/products/hub/
>
> Cheers,
> Morgen