Re: What page scraping means

Posted by Russell McOrmond-2 on Sep 24, 2008; 1:58am
URL: http://civicaccess.416.s1.nabble.com/Canadian-Postal-Code-list-tp1258p1272.html

Robin Millette wrote:

> On Tue, 23 Sep 2008 19:56:55 -0400, Russell McOrmond
> <[hidden email]> wrote:
>
>> Tracey P. Lauriault wrote:
>>> Someone was asking me what page scraping means.  Could you explain -
>>> in sorta lay person terms?
>> A computer-automated cut-and-paste, where both the choice of page to
>> visit and the piece of information you try to learn from the
>> resulting page are automated.
>
> When it's _really_ automated, it's called a feed or a microformat.
> It's called scraping because it usually also involves manual labor to
> get the job done right, as HTML pages are often modified with no
> regard to their semantic value.


   Aren't definitions fun?  The difference in my mind between a
feed/microformat and 'scraping' is whether the relevant output format
was designed to be human readable (i.e. HTML) or machine readable (XML,
CSV, etc.).  Whether there is manual labour is unrelated in my mind.
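
   As a rough illustration of that distinction, here is a minimal Python
sketch that reads the same record from a machine-readable format (CSV)
and from a human-readable page (HTML).  The sample data, the field names
and the assumption that the values sit inside <b> tags are all made up
for this example; a real page would need its own guesses about layout.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample data: the same record in two output formats.
CSV_TEXT = "postal_code,city\nK1A 0A6,Ottawa\n"
HTML_TEXT = (
    "<html><body><h1>Lookup result</h1>"
    "<p>Postal code: <b>K1A 0A6</b></p>"
    "<p>City: <b>Ottawa</b></p></body></html>"
)


def read_feed(text):
    """Machine-readable input: the format itself names each field."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


class BoldText(HTMLParser):
    """Scraper: collects text found inside <b> tags (a guess about layout)."""

    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            self.values.append(data.strip())


def scrape_page(text):
    """Human-readable input: we must assume where each value appears."""
    parser = BoldText()
    parser.feed(text)
    parser.close()
    # Assumption: first bold value is the postal code, second is the city.
    return [{"postal_code": parser.values[0], "city": parser.values[1]}]


if __name__ == "__main__":
    print(read_feed(CSV_TEXT))
    print(scrape_page(HTML_TEXT))
```

   Both functions return the same record, but only the scraper would
break if the page author swapped the two paragraphs or changed <b> to
another tag.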

   It is often called "screen scraping", a term from the days when a
screen of information was drawn and we then tried to pull information
from that screen based on where it appeared.  Reinterpreting
HTML as we are doing here is a bit different, but we are still talking
about taking a page intended to be read by a human (rendered by a
browser) and instead interpreting it as data, as input to a
program/database/etc.
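
   For the older, location-based kind of screen scraping, a minimal
Python sketch might look like the following.  The screen contents and
the row/column positions are hypothetical, chosen only to show the idea
of pulling a value out by where it sits on the screen.

```python
# Hypothetical 3-line terminal screen; contents are made up for this sketch.
SCREEN = [
    "CUSTOMER LOOKUP              PAGE 1",
    "NAME: J. SMITH               ",
    "POSTAL CODE: K1A 0A6         ",
]

# Assumed field positions: (row, start column, end column), zero-based.
FIELDS = {
    "name": (1, 6, 29),
    "postal_code": (2, 13, 29),
}


def scrape_screen(screen, fields):
    """Extract each field purely by its position on the screen."""
    return {
        name: screen[row][start:end].strip()
        for name, (row, start, end) in fields.items()
    }


if __name__ == "__main__":
    print(scrape_screen(SCREEN, FIELDS))
```

   Nothing in the screen itself says which characters are the postal
code; the program only knows because someone looked at the screen and
wrote down the coordinates.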

--
  Russell McOrmond, Internet Consultant: <http://www.flora.ca/>
  Please help us tell the Canadian Parliament to protect our property
  rights as owners of Information Technology. Sign the petition!
  http://www.digital-copyright.ca/petition/ict/

  "The government, lobbied by legacy copyright holders and hardware
   manufacturers, can pry my camcorder, computer, home theatre, or
   portable media player from my cold dead hands!"