Using Twitter as a source for regional linguistic data (and R for the analysis)

Glen Newton
This is pretty cool:
"Soda vs. Pop with Twitter"
http://blog.echen.me/2012/07/06/soda-vs-pop-with-twitter/

However, one of the problems with using Twitter data is that the
license explicitly prohibits collecting data for redistribution.
This means that if you collect some tweets, do some analysis on them,
and publish a paper, you cannot make your tweet collection available
to others.
This has significantly impacted the research area that I work in
(natural language processing and information retrieval) to the extent
that research corpora (i.e., text data sets) have been yanked after Twitter
told the researchers to cease and desist.

These two corpora have been yanked:
1 - http://snap.stanford.edu/data/twitter7.html
2 - http://homepages.inf.ed.ac.uk/miles/papers/socmed10.pdf

A few corpora are still available, but they require a signed
agreement that you will not redistribute them
(http://trec.nist.gov/data/tweets/tweets2011-agreement.pdf ).

So while there are various instances of very useful big data out
there, most of them are in the hands of private companies that do not
have open data policies. They all have much more data on most of us
than any single government.

Related: How Recent Changes to Twitter's Terms of Service Might Hurt
Academic Research,
http://www.readwriteweb.com/archives/how_recent_changes_to_twitters_terms_of_service_mi.php

-Glen

--
-
http://zzzoot.blogspot.com/
-

Re: Using Twitter as a source for regional linguistic data (and R for the analysis)

Russell McOrmond


On Jul 21, 2012 12:43 AM, "Glen Newton" <[hidden email]> wrote:
> However, one of the problems with using Twitter data is that the
> license explicitly prohibits collecting data for redistribution.

  I know this is going to seem redundant, but this is the reason for Fair Use/Fair Dealing. It should not be possible to restrict data like this for non-commercial academic or research purposes. Unfortunately, governments are largely narrowing these limits and exceptions to copyright (or bypassing them entirely through access rights and access controls), and institutions are getting more and more conservative about exercising them.

  We shouldn't have to rely on open access/open data licenses for this type of research; the law should simply not support these restrictions in the first place.