Hi folks,
It took about 2 hours of work, but I have a list of 813,358 postal codes. That must include some old ones. If anyone wants a copy, it's available from either:

http://s3.amazonaws.com/danielharan/postal_codes.txt.gz
http://s3.amazonaws.com/danielharan/postal_codes.txt.gz?torrent

Now the work on scraping the corresponding EDIDs can begin.

Cheers,
d.
Someone was asking me what page scraping means. Could you explain - in sorta lay person terms?
cheers
t

On Tue, Sep 23, 2008 at 4:33 PM, Daniel Haran <[hidden email]> wrote:
> Hi folks,

--
Tracey P. Lauriault
613-234-2805
https://gcrc.carleton.ca/confluence/display/GCRCWEB/Lauriault
On Tue, Sep 23, 2008 at 5:46 PM, Tracey P. Lauriault <[hidden email]> wrote:
> Someone was asking me what page scraping means. Could you explain - in
> sorta lay person terms?

Scraping is a way to extract structured information from websites. Let's use my next project as an example.

The list of 813,358 postal codes is now public. I am writing software that will go to a political party's website, submit the form to 'find your candidate' and save the resulting page. Then I'll write another small bit of software that reads each page, finds the electoral district id, and outputs a single line:

<postal_code>,<district_id>

813,358 pages, one resulting file with as many lines.

Because of the large number of requests, compiling the data can take a very long time. Even fetching one page per second, it would still take 9.4 days to get this data file.

I hope that helps... I may be in too deep to offer a good lay person's explanation :)

d.
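To make the two steps above concrete, here is a rough Python sketch. The endpoint URL, the form field name, and the pattern used to find the district id are all invented placeholders; a real party site would have its own URL, field names, and page layout.

    import csv
    import re
    import time
    import urllib.parse
    import urllib.request

    # Hypothetical search endpoint and page pattern; a real party site
    # has its own URL, form fields, and markup.
    SEARCH_URL = "http://example-party.ca/find-your-candidate"
    DISTRICT_RE = re.compile(r"district_id=(\d+)")

    def fetch_page(postal_code):
        """Step 1: submit the 'find your candidate' form, save the page."""
        data = urllib.parse.urlencode({"postal_code": postal_code}).encode()
        with urllib.request.urlopen(SEARCH_URL, data) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def extract_district_id(html):
        """Step 2: find the electoral district id in a saved page."""
        match = DISTRICT_RE.search(html)
        return match.group(1) if match else ""

    with open("postal_codes.txt") as codes, \
         open("postal_code_districts.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for line in codes:
            code = line.strip()
            writer.writerow([code, extract_district_id(fetch_page(code))])
            time.sleep(1)  # 1 request/s: 813,358 s / 86,400 s per day = 9.4 days

The one-second sleep is what makes the scraper polite and what makes it slow: running several scrapers in parallel shortens the wait but is harder on the party's server.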
Daniel Haran wrote:
> Hi folks,
>
> It took about 2 hours of work, but I have a list of 813,358 postal
> codes. That must include some old ones.

The number of codes in the August 2007 PCFRF database is 813,666, and in the February 2005 version it is 801,340. That suggests to me that what you have is pretty darn complete!

--
Russell McOrmond, Internet Consultant: <http://www.flora.ca/>
Please help us tell the Canadian Parliament to protect our property rights as owners of Information Technology. Sign the petition!
http://www.digital-copyright.ca/petition/ict/

"The government, lobbied by legacy copyright holders and hardware manufacturers, can pry my camcorder, computer, home theatre, or portable media player from my cold dead hands!"
Tracey P. Lauriault wrote:
> Someone was asking me what page scraping means. Could you explain - in
> sorta lay person terms?

A computer-automated cut-and-paste, where which page you go to is automated, and which piece of information you try to learn from the resulting page is automated.

--
Russell McOrmond, Internet Consultant: <http://www.flora.ca/>
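In code, that automated cut-and-paste can be as small as this sketch (Python; both the URL and the pattern are invented for illustration):

    import re
    import urllib.request

    # Automated "go to a page": fetch it the way a browser would.
    url = "http://example.com/find-your-candidate?postal_code=K1A0A1"
    html = urllib.request.urlopen(url).read().decode("utf-8")

    # Automated "cut and paste": pull out the one piece of
    # information we want from the page.
    match = re.search(r"District:\s*(\d+)", html)
    print(match.group(1) if match else "not found")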
On Tue, 23 Sep 2008 19:56:55 -0400, Russell McOrmond <[hidden email]> wrote:
> Tracey P. Lauriault wrote:
> > Someone was asking me what page scraping means. Could you explain - in
> > sorta lay person terms?
>
> A computer-automated cut-and-paste, where which page you go to is
> automated, and which piece of information you try to learn from the
> resulting page is automated.

When it's _really_ automated, it's called a feed or a microformat. It's called scraping because it usually also involves manual labour to get the job done right, as HTML pages are often modified with no regard to their semantic value.

--
Robin
Robin Millette wrote:
> On Tue, 23 Sep 2008 19:56:55 -0400, Russell McOrmond
> <[hidden email]> wrote:
>> Tracey P. Lauriault wrote:
>>> Someone was asking me what page scraping means. Could you explain -
>>> in sorta lay person terms?
>> A computer-automated cut-and-paste, where which page you go to is
>> automated, and which piece of information you try to learn from the
>> resulting page is automated.
>
> When it's _really_ automated, it's called a feed or a microformat.
> It's called scraping because it usually also involves manual labour to
> get the job done right, as HTML pages are often modified with no
> regard to their semantic value.

Aren't definitions fun. The difference in my mind between a feed/microformat and 'scraping' is whether the relevant output format was designed to be human readable (i.e., HTML) or machine readable (XML, CSV, etc.). Whether there is manual labour involved is unrelated in my mind.

It is often called "screen scraping" from back in the days when a screen of information was drawn and we then tried to pull information from that screen based on where it appeared. Re-interpreting HTML like we are doing here is a bit different, but we are still talking about taking a page intended to be read by a human (rendered by a browser) and instead interpreting it as data: input to a program/database/etc.

--
Russell McOrmond, Internet Consultant: <http://www.flora.ca/>
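The human-readable/machine-readable split shows up directly in code. A contrived contrast in Python (both the two-line feed and the HTML snippet are invented examples, not real sources):

    import csv
    import io
    import re

    # Machine readable: the format was designed for programs,
    # so the structure is the data.
    feed = "K1A0A1,35075\nH0H0H0,24001\n"
    for postal_code, district_id in csv.reader(io.StringIO(feed)):
        print(postal_code, district_id)

    # Human readable: a program has to infer the data from presentation
    # markup, which breaks whenever the page design changes.
    page = '<td class="riding">District: <b>35075</b></td>'
    match = re.search(r"District:\s*<b>(\d+)</b>", page)
    print(match.group(1) if match else "layout changed?")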
Thanks gang!
--
Tracey P. Lauriault