slides from my R tutorial on Twitter text mining #rstats

July 4, 2011 — Jeffrey Breen

Update: An expanded version of this tutorial will appear in the new Elsevier book Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications by Gary Miner et al., which is now available for pre-order from Amazon.

In conjunction with the book, I have cleaned up the tutorial code and published it on github.

Last month I presented this introduction to R at the Boston Predictive Analytics MeetUp on Twitter Sentiment.

The goal of the presentation was to expose a first-time (but technically savvy) audience to working in R. The scenario we work through is to estimate the sentiment expressed in tweets about major U.S. airlines. Even with a tiny sample and a very crude algorithm (simply counting the number of positive vs. negative words), we find a believable result. We conclude by comparing our result with scores we scrape from the American Customer Satisfaction Index web site.

Jeff Gentry’s twitteR package makes it easy to fetch the tweets. Also featured are the plyr, ggplot2, doBy, and XML packages. A real analysis would, no doubt, lean heavily on the tm text mining package for stemming, etc.

Here is the slimmed-down version of the slides:

And here’s a PDF version to download.

Special thanks to John Verostek for putting together such an interesting event, and for providing valuable feedback and help with these slides.

Update: thanks to eagle-eyed Carl Howe for noticing a slightly out-of-date version of the score.sentiment() function in the deck. Missing was handling for NA values from match(). The deck has been updated and the code is reproduced here for convenience:


```
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
	require(plyr)
	require(stringr)
	
	# we got a vector of sentences. plyr will handle a list
	# or a vector as an "l" for us
	# we want a simple array ("a") of scores back, so we use 
	# "l" + "a" + "ply" = "laply":
	scores = laply(sentences, function(sentence, pos.words, neg.words) {
		
		# clean up sentences with R's regex-driven global substitute, gsub():
		sentence = gsub('[[:punct:]]', '', sentence)
		sentence = gsub('[[:cntrl:]]', '', sentence)
		sentence = gsub('\\d+', '', sentence)
		# and convert to lower case:
		sentence = tolower(sentence)

		# split into words. str_split is in the stringr package
		word.list = str_split(sentence, '\\s+')
		# sometimes a list() is one level of hierarchy too much
		words = unlist(word.list)

		# compare our words to the dictionaries of positive & negative terms
		pos.matches = match(words, pos.words)
		neg.matches = match(words, neg.words)
	
		# match() returns the position of the matched term or NA
		# we just want a TRUE/FALSE:
		pos.matches = !is.na(pos.matches)
		neg.matches = !is.na(neg.matches)

		# and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
		score = sum(pos.matches) - sum(neg.matches)

		return(score)
	}, pos.words, neg.words, .progress=.progress )

	scores.df = data.frame(score=scores, text=sentences)
	return(scores.df)
}
```
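
To see the counting logic in isolation, here is a minimal base-R sketch of the same idea. The helper name `score.one` and the two tiny word lists are made up for illustration; the function above uses plyr and stringr, and a real run would use a full opinion lexicon such as the Hu & Liu lists.

```r
# Hypothetical helper (not part of the tutorial code): score one sentence
# by counting positive and negative word matches, using base R only.
score.one = function(sentence, pos.words, neg.words) {
	# same clean-up steps as score.sentiment()
	sentence = gsub('[[:punct:]]', '', sentence)
	sentence = gsub('[[:cntrl:]]', '', sentence)
	sentence = gsub('\\d+', '', sentence)
	words = unlist(strsplit(tolower(sentence), '\\s+'))

	# match() gives a position or NA; !is.na() turns that into TRUE/FALSE
	sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
}

# Toy word lists, invented for this example
pos.words = c('great', 'awesome')
neg.words = c('delayed', 'awful')

tweets = c('Great flight, awesome crew!',
           'Delayed again. Awful service.',
           'Boarding at gate 12')

scores = sapply(tweets, score.one, pos.words, neg.words, USE.NAMES = FALSE)
print(scores)
```

With these toy lists the first tweet scores positive, the second negative, and the third neutral, which is all the algorithm is meant to capture.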

Posted in Tutorials. Tags: airlines, Boston Predictive Analytics, doBy, ggplot2, Hu & Liu, plyr, R, sentiment analysis, text mining, tm.

```
> library(twitteR)
Loading required package: RCurl
Loading required package: bitops
Loading required package: rjson
> tweets.list = searchTwitter('#rstats')
> tweets.df = twListToDF(tweets.list)
> write.csv(tweets.df, file='/tmp/tweets.csv', row.names=F)
```

Ronert Obst Says:
July 5, 2011 at 2:35 AM

interesting post!

Jeffrey Breen Says:
July 5, 2011 at 9:47 AM
Thanks, Ronert!

- venkat Says:
  November 27, 2012 at 8:27 AM
  while executing the statement:
  r.tweets = searchTwitter('#rstats', n=1500)
  I am getting the error below:
  Error in function (type, msg, asError = TRUE) :
  Failure when receiving data from the peer
  Please let me know what the problem is.

Will Says:
December 13, 2014 at 5:22 PM
Wow I must confess you make some very trenchant points.

biao Says:
July 5, 2011 at 9:28 AM

thanks a lot, Jeffrey, I am very interested in the slides, especially your example of using twitteR package, would you mind sending me a copy of this slide? it seems i can’t download it. I won’t share with others, thanks.

Jeffrey Breen Says:
July 5, 2011 at 9:47 AM
Feel free to share — you should be able to download from slideshare, but I have emailed you a PDF copy just in case. (The original was made in Keynote which you may not have.)

- biao Says:
  July 7, 2011 at 5:28 AM
  thanks a lot, I got it.

Norm Albertson Says:
July 5, 2011 at 10:07 AM

Very good presentation. Interesting subject but your slide presentation is outstanding. As an R neophyte I felt comfortable following the slides, understanding both the data mining and manipulation techniques, and more importantly, what you were accomplishing with each step. All without your narration to explain things! Well done, thanks for sharing.

Jeffrey Breen Says:
July 5, 2011 at 10:19 AM
Thanks, Norm!

John Verostek helped enormously with the slides. In addition to wordsmithing and flow, he suggested the colored text box callouts. I think they helped the live audience, but they definitely make it more “standalone”.

I’m glad you found the presentation useful.

Jeffrey

Shea Says:
July 5, 2011 at 5:08 PM

I of course love the faceting ability in ggplot… but for comparing distributions may I recommend doing a geom_freqpoly or geom_density with an aes(color=airline). It makes it much easier to see discrepancies in their distributions.

Thanks for the fun bits about twitteR too!
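
A sketch of what this suggestion might look like follows. The `all.scores` data frame is simulated purely for illustration (the airline labels and score distributions are made up), and the code assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Simulated scores for two hypothetical airlines (not real data)
set.seed(42)
all.scores = data.frame(
	airline = rep(c('Airline A', 'Airline B'), each = 200),
	score   = c(rpois(200, 3) - 3, rpois(200, 2) - 3)
)

# One frequency polygon per airline, overlaid and distinguished by colour,
# instead of one facet per airline
p = ggplot(all.scores, aes(x = score, colour = airline)) +
	geom_freqpoly(binwidth = 1)
print(p)
```

Swapping `geom_freqpoly(binwidth = 1)` for `geom_density()` gives the smoothed variant mentioned above.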

Jeffrey Breen Says:
July 5, 2011 at 7:03 PM
Good suggestion, Shea.

Glad you enjoyed it!

Jeffrey

Shea Says:
July 5, 2011 at 5:11 PM

Additionally, I’ve *heard* that if you transform the tweets into word frequencies and then perform SVD, the first dimension most often naturally represents positive or negative feelings. There’s some fancy “stubbing” and other bits you need to do too. I haven’t tried my hand, but I’m thinking up plenty of twitter terms that might give me some meaningful data to play with.

Jeffrey Breen Says:
July 5, 2011 at 7:04 PM
That sounds interesting. There’s lots in the tm package to do stemming, stopwords, and the like.

Let me know what you turn up!

Jeffrey

Soren Macbeth Says:
July 5, 2011 at 9:52 PM

Very cool. line 10 has a typo. instead of:

scores = laply(sentences, function(sentence, pos.words, neg.words) {

it should read:

scores = lapply(sentences, function(sentence, pos.words, neg.words) {

Jeffrey Breen Says:
July 5, 2011 at 10:41 PM
Thanks, Soren!

That line is correct — it’s using the laply() function from Hadley Wickham’s plyr package. Base R’s lapply() would work, too (except for the .progress=.progress parameter specified near the bottom), but I favor plyr for a few reasons:
1. Standard naming convention. I could never remember whether I wanted mapply(), tapply(), sapply() or what. The functions in the plyr package are named simply: the first letter specifies what data type you’re passing in (“d” for data.frame, “l” for list, “a” for array, etc.) and the second letter specifies the data type you want as output. So, laply() takes a list and outputs an array (or a vector if one-dimensional). Similarly, ldply() would take the same list but return a data.frame.
2. Support for parallel processing with foreach. Just specify .parallel=TRUE and plyr’s functions will execute using the foreach package’s parallel backend automagically.
3. Progress bars for free. I find this especially handy during interactive, exploratory work. Just specify a value for the .progress parameter and you’ll receive graphical feedback of its progress. Valid values include “tk” for Tcl/Tk (Unix/Mac), “win” for Windows, and my favorite, “text” for old school ASCII in the console window. (Default is “none”.)
Jeffrey
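
To make the naming convention concrete, here is a small sketch that feeds the same list to laply() and ldply(). The sentences and the `count.words` helper are made up for illustration, and the plyr package is assumed to be installed.

```r
library(plyr)

sentences = list('good', 'very good', 'not so good at all')

# Hypothetical helper: number of whitespace-separated words in a string
count.words = function(s) length(strsplit(s, '\\s+')[[1]])

# "l" + "a": list in, array out (a plain vector here, since it is one-dimensional)
n.words = laply(sentences, count.words, .progress = 'none')

# "l" + "d": the same list in, a data.frame out
words.df = ldply(sentences, function(s) data.frame(text = s, n = count.words(s)))

print(n.words)
print(words.df)
```

Changing `.progress = 'none'` to `.progress = 'text'` is all it takes to get the console progress bar described in point 3.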

Larry (IEOR Tools) Says:
July 6, 2011 at 8:07 AM

How were you able to use searchTwitter() to get 1500 tweets? I thought Twitter has an API limit of 100. I’m only able to get 100 tweets.

Jeffrey Breen Says:
July 6, 2011 at 8:35 AM
Hi Larry:

My understanding is that searchTwitter() uses Twitter’s Search API, and according to Twitter’s Things Every Developer Should Know (at the bottom), its limit is 1500. (You can try specifying a larger n, but 1500 is the maximum you’ll get back.)

But there’s also a time limit on what is in the index (and a rate limit too, by IP), so if you’re only getting 100 tweets, it may be that your query is relatively low-volume and they have aged out of the cache. Take a look at “#rstats” vs. “#twitter”:
```
> r.tweets = searchTwitter('#rstats', n=1500)
> length(r.tweets)
[1] 104
> twitter.tweets = searchTwitter('#twitter', n=1500)
> length(twitter.tweets)
[1] 1500
```
But those 104 R tweets go back a few days, whereas 1500 Twitter tweets barely span two hours:
```
> r.tweets[[104]]$getCreated()
[1] "2011-07-02 13:42:08 UTC"
> twitter.tweets[[1500]]$getCreated()
[1] "2011-07-06 11:53:17 UTC"
```
What happens if you search for something more common like ‘#twitter’?

Jeffrey

- Jeff Gentry Says:
  July 6, 2011 at 10:19 AM
  There are multiple limitations, both time & number. The search will only go back for 2 days and a maximum of 1500 tweets as Jeffrey said.
  
  However the search API will only return 100 tweets at a time, in a paged manner. The searchTwitter() function handles all of the pagination for you, which allows you to get more than 100 at once.
  
  The Twitter API has a few different ways of handling that sort of situation, but in all cases that is abstracted away from the user in the twitteR package.
- Jeffrey Breen Says:
  July 6, 2011 at 10:37 AM
  Hi Jeff:
  
  Thanks for chiming in. Without the hard work you put into the twitteR package, I’m sure the whole presentation would only use 100 tweets (if that)!
  
  Thanks again,
  Jeffrey
- Jeff Gentry Says:
  July 6, 2011 at 12:28 PM
  Without a good story/use case it’d just sit there unused, so I know which is more valuable 😉 Now I’ve got something cool to point to when people ask for an example of it being useful!
- Larry (IEOR Tools) Says:
  July 6, 2011 at 12:43 PM
  I still only get 100 tweets when doing the following
  
  > tweets length(tweets)
  [1] 100
- Jeff Gentry Says:
  July 6, 2011 at 12:49 PM
  How are you generating “tweets”? What version of the package (IIRC there was a bug a couple of months ago)
- Larry (IEOR Tools) Says:
  July 6, 2011 at 1:22 PM
  That was it! I was using an old version. For those that want to know version 0.99.9 works for this example.

steve o'grady Says:
July 6, 2011 at 10:16 AM

Outstanding presentation, and I’m having a lot of fun playing with the scripts. Ran a few without incident, but ran into trouble with my last one, which kicked a: “50%Error in tolower(sentence).” Not to ask you to do public support, but thought I’d drop it just in case you or anyone else here had seen and handled this.

In any event, thanks for sharing this.

Jeffrey Breen Says:
July 6, 2011 at 10:33 AM
Thanks, Steve!

That seems like a non-controversial place to fail — sounds like a character set issue. I didn’t run into it for this presentation (apparently people only curse at US airlines in ASCII), but I see that I have a call to Base R’s iconv() function in some “real” code I had written before.

Try inserting this line before the call to tolower():
```
sentence = iconv(sentence, 'UTF-8', 'ASCII')
```
HTH,
Jeffrey
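
One caveat worth sketching: by default iconv() tends to return NA for any string it cannot convert, which just moves the problem downstream. The `sub` argument is a standard iconv() parameter; using an empty string for it is my own choice here, and the sample strings are made up.

```r
# Two made-up tweets; the second contains a non-ASCII character
x = c('on time, great crew', 'caf\u00e9 was closed')

# Without sub, a string that cannot be converted typically comes back as NA
strict = iconv(x, 'UTF-8', 'ASCII')

# With sub = '', bytes that cannot be converted are replaced by nothing,
# so every element stays a usable string
clean = iconv(x, 'UTF-8', 'ASCII', sub = '')

print(strict)
print(tolower(clean))
```

The exact behaviour of iconv() depends on the platform's iconv library, so it is worth checking the result on your own machine before relying on it.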

Tony Breyal Says:
July 6, 2011 at 11:12 AM

I know it’s been said above, but still, I feel this really is a set of very good stand-alone presentation slides; I often find that I can get lost in the narrative of a talk, but by referring back to the flow chart I found it easy to follow and knew where in the story I was, where I’d been and where I was heading next. This visual idea is something I hope to use myself in the future.

Very interesting topic indeed. Out of curiosity, why did you opt not to use the tm package?

Jeffrey Breen Says:
July 6, 2011 at 11:47 AM
Thanks very much, Tony.

The tm package is great, but I wanted to focus on the basics of R (along with my go-to packages). Also, I am far from a text mining expert, and didn’t want to get bogged down trying to explain (or answer questions) about stemming, stopwords, and document-term/term-document matrices — especially with such a crude sentiment algorithm. Believe me — no one was more surprised than I when a sensible result came out (and all I had to fudge was throwing out low-volume, about-to-be-merged-away @continental)!

Jeffrey

Angela Waner Says:
July 6, 2011 at 2:17 PM

Please contact me if you would be interested in including your tutorial in a book on text mining. The same authors who wrote this book, http://www.amazon.com/Handbook-Statistical-Analysis-Applications-ebook/dp/B002ZJSVPA/ref=sr_1_2?ie=UTF8&qid=1309979464&sr=8-2, are working on a text mining book.

John Johnson Says:
July 6, 2011 at 10:55 PM

Interesting, and I thoroughly enjoyed working through the example. Then I tried the techniques on the major players in my industry, the pharmaceutical industry. From my previous experience working through Matthew Russell’s _Mining the Social Web_ I realized that most of the people who tweet about pharma are experienced commentators or consultants, so as it turned out, “sentiment” turned out to equate to whether the company had good or bad news (you could tell by looking at the text of the tweets that had the extreme sentiments). So this could be useful for some automated early signal detection about companies you want to follow.

Jeffrey Breen Says:
July 7, 2011 at 8:41 AM
Hi John:

That’s an interesting thing to look at. I wouldn’t have guessed that most of the airline tweeters are industry insiders or experts — perhaps a B2B vs. B2C split?

Thanks,
Jeffrey

Paul Says:
July 7, 2011 at 5:19 AM

Can I get a PDF copy?

Thanks!

Paul

Jeffrey Breen Says:
July 7, 2011 at 8:14 AM
On its way.

Jeffrey

pinboard July 7, 2011 — arghh.net Says:
July 7, 2011 at 12:36 PM

[…] R tutorial on twitter text mining […]

State of Data #56 « Dr Data's Blog Says:
July 8, 2011 at 2:50 AM

[…] of Data #56 by doctordata on July 8, 2011 #analysis – Mining Twitter for consumer attitudes towards airlines (using R) – (a) “search twitter in 1 line of code”; (b) Estimate sentiment from ‘opinion […]

andrew clark Says:
July 8, 2011 at 8:21 AM

Fascinating work. I’m having problems installing twitteR with R v 2.13.0 on Windows

Warning: dependencies ‘RCurl’, ‘RJSONIO’ are not available

The RCurl package does not display in the available CRAN list and if I try
install.packages(“RCurl”, dependencies = TRUE) I get error message

package ‘RCurl’ is not available (for R version 2.13.0)

Any suggestions?

Jeffrey Breen Says:
July 8, 2011 at 12:36 PM
Hi Andrew:

Thanks!

It looks like Windows users have to fetch the RCurl and XML binaries from http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.13/ per this Readme.

RJSONIO appears to be at http://cran.r-project.org/bin/windows/contrib/r-release/RJSONIO_0.7-3.zip, so that should be OK.

Good luck!
Jeffrey

andrew clark Says:
July 8, 2011 at 2:22 PM

Jeffrey
Thanks for that. I have installed all three packages successfully but when I load twitteR I get
Error: package ‘RJSONIO’ is not installed for ‘arch=x64’

guess that means it does not work on R 64-bit? Though cannot immediately see that referred to in docs
Andrew

Jeffrey Breen Says:
July 9, 2011 at 9:40 AM
Hi Andrew:

It sounds as though the RJSONIO binary wasn’t compiled for 64-bit (as confirmed by this answer on StackOverflow).

Unless you want to compile your own (which sounds like a bit of an ordeal on Windows), you may be limited to 32-bit R. But here is someone on StackOverflow who did get it compiled and working, along with pointers to the 64 bit toolchain you’ll need. (Perhaps you could get a binary from him?)

(Personally, I gave up on 64-bit Windows when my preview edition of XP-64 expired. It was a mercy killing/suicide since it was impossible to find drivers — but I know things have improved somewhat since 2005. 🙂)

The R for Windows FAQ has a lot of good info on 32-vs-64 bit.

Jeffrey

Suresh Says:
July 11, 2011 at 4:39 AM

Nice post, Jeffrey would you mind sending me a copy of this?

Jeffrey Breen Says:
July 11, 2011 at 8:58 AM
Thanks, Suresh.

Here’s a PDF version: http://ow.ly/5Bn6K — looks like SlideShare only allows downloading of the original format.

Jeffrey

» Data analysis of Twitter reaction to the Carbon Tax Tom's Blog Says:
July 12, 2011 at 6:04 AM

[…] To perform this analysis I used R an awesome stats language, a ‘sentiment-lexicon’ from Hu & Liu and the method described in this powerpoint by Jeffrey Breen. […]

Abhijit Sanyal Says:
July 13, 2011 at 11:45 PM

Hi Jeffrey,
I am taking the risk of taking your time to ask for free support but I have not been able to find resolution to running this basic twitteR command in your presentation and so any help would be appreciated. I am a relatively new user of R and have installed twitteR but cannot get beyond the error below:
> delta.tweets = searchTwitter('@delta', n=1500)
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
couldn’t connect to host
Thanks for your help
Abhijit

Jeff Gentry Says:
July 14, 2011 at 9:02 AM
Abhijit –

Funny you should mention this, as Jeffrey’s been running into the same problem and we’ve been trying to identify the cause. The current theory* is that it’s the API version of the fail whale, and I realized last night why it’s not being gracefully handled (to be resolved soon, although it’d still throw an error – just a less cryptic one).

* I have a few reasons to believe that this is at least partially incorrect, but either way I still believe it’s something along those lines

- Jeffrey Breen Says:
  July 14, 2011 at 9:19 AM
  Hi Jeff:
  
  Thanks so much for jumping in! (I swear I never got such great support for commercial vendor$…)
  
  I got the impression that Abhijit is having this problem all of the time. My problem is more intermittent and restricted to the userTimeline() function (so far…), not with simpler non-OAuth calls to searchTwitter().
  
  Jeffrey
- Emma Says:
  September 23, 2011 at 2:06 PM
  Hi Jeff,
  
  I was wondering if you had solved this issue. I am running into a similar problem. The behavior I am experiencing is an error that results in too many mysterious open connections.
  
  Thanks!

Jeffrey Breen Says:
July 14, 2011 at 9:13 AM
Hi Abhijit:

searchTwitter() connects to the Twitter Search API via http://search.twitter.com/search.json.

First make sure you can ‘ping search.twitter.com’ from your operating system.

Then, from R, you can try accessing the API directly:
```
library(RCurl)
url = "http://search.twitter.com/search.json?q=%40Delta&result_type=recent&rpp=100&page=1"
getURL(url)
```
Whoa, I seem unable to get WordPress to display the URL without wrapping it in HTML cruft. The string you need to feed to getURL is:
```
url = "http: //search.twitter.com/search.json?q=%40Delta&result_type=recent&rpp=100&page=1"
```
without the space between the “:” and the “/”.

Hopefully that helps you narrow down the problem.

Good luck!

Jeffrey

- Jeff Gentry Says:
  July 14, 2011 at 9:19 AM
  Actually you’re right Jeffrey 🙂 I don’t think that’s exactly the same error you’ve been getting, I just looked at it quickly. It does look more like a network connectivity issue.
  
  Another thing to test is just to make sure that network connectivity is working w/ R in general – ie can you use install.packages() from the R prompt.

Abhijit Sanyal Says:
July 15, 2011 at 10:57 PM

Hi Jeffrey and Jeff
Many thanks for your responses. First I should say that I am doing all this on Windows XP. I did try the various suggestions given above – which gave me the same result as before “Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : couldn’t connect to host”. However I was able to ping twitter from the C:\ command line and I can download and install R packages.
I installed R on another laptop but connected to the same home office internet service provider and in this case twitteR worked and I was able to work on some of the examples given in the above presentations. I think the problem is machine specific – something to do with the internet connectivity / postini restrictions that are built into my current machine.
However here are some of the issues that I ran up against:
The other airline names are very generic: while @delta might work, american does not work very well, and americanair was better but I could not get more than 400 tweets. Most of the other airlines had the same problem, and rbind gives errors if the number of columns is not the same. This may require a different reshape function or a flexible rbind. With jetblue I got 1500 tweets, and I think unique and distinctive brands garner more specific data at the first pass.
I must say that I am very grateful to you for your help and responsiveness. I may come back to you for further advice and help.
Thanks

Abhijit

Jeffrey Breen Says:
July 17, 2011 at 6:42 PM
Glad you got twitteR working, albeit on a different machine.

Some airlines will definitely give fewer than 1,500 tweets. Twitter’s Search API only keeps a couple of days’ worth of tweets in its index, so lower-volume airlines’ tweets can age out.

But rbind() will only fail if your data.frames differ in number of columns, not rows. If rbind() is failing, check that one of the data.frames isn’t empty, or that you have added extra columns (like airline names) to all.

Good luck!
Jeffrey
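
The rbind() behaviour is easy to reproduce with two toy data.frames. The scores and tweet text below are made up; only base R is used.

```r
# Two made-up score tables with different numbers of rows
delta.scores   = data.frame(score = c(1, -2),
                            text  = c('great crew', 'lost my bag, awful'))
jetblue.scores = data.frame(score = 2,
                            text  = 'love the snacks, great seats')

# Add the extra column to only one of them and rbind() refuses
delta.scores$airline = 'Delta'
result = tryCatch(rbind(delta.scores, jetblue.scores), error = function(e) e)
print(inherits(result, 'error'))

# Add it to every data.frame and the bind works, whatever the row counts
jetblue.scores$airline = 'JetBlue'
all.scores = rbind(delta.scores, jetblue.scores)
print(dim(all.scores))
```

If a more forgiving bind is wanted, plyr also provides rbind.fill(), which pads missing columns with NA rather than failing.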

vasundhar Says:
July 21, 2011 at 8:25 AM

Thanks for wonderful guide.

Jeffrey Breen Says:
July 21, 2011 at 9:56 AM
You’re welcome! Thanks for the kind words.

Jeffrey

One-liners which make me love R: twitteR’s searchTwitter() #rstats » 统计代码银行StatCodeBank Says:
July 21, 2011 at 9:45 PM

[…] recent R tutorial on mining Twitter for consumer sentiment wouldn’t have been possible without Jeff Gentry’s amazing twitteR package (available on CRAN). […]

Mining Twitter with R Says:
July 22, 2011 at 1:13 PM

[…] Jeffrey Breen’s blog you can find slides from a presentation entitled “R by Example: mining Twitter for consumer […]

Madhan Says:
July 28, 2011 at 5:29 AM

Hi,
Great Article.. I desperately wanted to try my hands on it… But then i am stuck at a place where i need your expertise to help !!!! i couldn’t proceed beyond this error
delta.tweets = searchTwitter('@delta', n=1500)
“Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
Could not resolve host: search.twitter.com; No data record of requested type”
I am Behind the proxy , so is there a way to move forward ?????

Jeffrey Breen Says:
July 28, 2011 at 9:55 AM
It sounds as though you need to specify your proxy details. twitteR uses RCurl, so this post by Duncan Temple Lang should point you in the right direction:

http://r.789695.n4.nabble.com/RGoogleDocs-RCurl-through-proxy-tp892277p892278.html

RCurl uses libcurl, so it’s possible Google can help you find how to set such options globally for your OS.

Good luck!
Jeffrey
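
As a rough sketch of where this usually ends up: RCurl takes its default curl options from R's `RCurlOptions` option, so a proxy can be set once per session. The host and port below are placeholders, not real values, and whether your proxy also needs credentials is something to check locally.

```r
# Placeholder proxy address (hypothetical); replace with your own host and port
proxy.url = 'proxy.example.com:8080'

# RCurl uses this option as its defaults, so requests made through RCurl
# later in the session should be sent via the proxy
options(RCurlOptions = list(proxy = proxy.url))

# Confirm what has been set
print(getOption('RCurlOptions'))
```

This only sets the option; it makes no network call, so the real test is whether searchTwitter() gets through afterwards.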

David Says:
July 29, 2011 at 10:31 AM

Any idea why all my scores would be returning as zeroes? The three samples are even scoring as zeroes.

David Says:
July 29, 2011 at 10:52 AM
Nevermind, got it.

- Jeffrey Breen Says:
  July 29, 2011 at 11:08 AM
  Hi David:
  
  Glad you got it sorted. Was it related to the word lists? All-zero scores could come from either no matches with the word lists (from, say, empty word lists) or identical, canceling matches (by mistakenly using the same word list for both positive and negative).
  
  Jeffrey
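
Both failure modes can be checked in a few lines before scoring anything. The word lists here are tiny stand-ins invented for the example; in practice they would be read from the lexicon files.

```r
# Stand-in word lists (made up for this example)
pos.words = c('good', 'great', 'love')
neg.words = c('bad', 'awful', 'hate')

# Empty lists mean nothing can ever match, so every score is zero
stopifnot(length(pos.words) > 0, length(neg.words) > 0)

# Words present in both lists cancel each other out; ideally there are none
overlap = intersect(pos.words, neg.words)
print(length(overlap))

# Loading the same file twice would make the two lists identical
print(identical(pos.words, neg.words))
```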

Brand sentiment showdown Says:
July 29, 2011 at 5:25 PM

[…] Breen provides an easy-to-follow tutorial on Twitter sentiment in R. The scoring system is pretty basic. All you do is load tweets with a […]

Watch The Throne Tweets: Twitter Users React To The New Jay-Z Kanye West Album Says:
August 10, 2011 at 7:45 AM

[…] think about the album. So, on Monday afternoon, using a Twitter sentiment analysis procedure developed by Jeff Gentry, The Huffington Post analyzed the sentiment of 9,394 tweets about tracks on the album and scored […]

Watch The Throne Tweets: Twitter Users React To The New Jay-Z Kanye West Album | Twitter Template Blog Says:
August 10, 2011 at 12:07 PM

[…] think about the album. So, on Monday afternoon, using a Twitter sentiment analysis procedure developed by Jeffrey Breen, The Huffington Post analyzed the sentiment of 9,394 tweets about tracks on the album and scored […]

Introducing Project Blue Bird: An Open Source Web Front End for R Sentiment Analysis – tecosystems Says:
September 10, 2011 at 4:09 PM

[…] early July, I ran across Jeffrey Breen’s post on doing sentiment analysis in R a bit before two in the morning. It was interesting enough that I […]

slides from R tutorial on Twitter text mining #rstats | R | Scoop.it Says:
September 22, 2011 at 10:20 AM

[…] slides from R tutorial on Twitter text mining #rstats […]

VISUALIZING ÁRAS ELECTION, Part Two – Martha Rotter's Blog Says:
October 20, 2011 at 8:19 AM

[…] thanks to Jeffrey Breen for his excellent slides on Twitter text mining and for publishing the code – very […]

Ben Says:
November 22, 2011 at 5:58 PM

Thanks for this fantastic tutorial and code for the sentiment function. I know I’m a bit late to the party, but your work here is really a gem of public service, I’ve found it very interesting and useful. Congratulations on getting it published, also!

Jeffrey Breen Says:
December 5, 2011 at 2:57 PM
Thanks, Ben!

Are there any frameworks that perform sentiment analysis? - Quora Says:
December 5, 2011 at 2:55 PM

[…] You could use R of course; here is an example http://jeffreybreen.wordpress.co… […]

arifwic Says:
December 8, 2011 at 6:48 PM

Interesting post,

In your experience, is R suitable for longer text, e.g. text that comes from blogs or news?

Thank you.

Jeffrey Breen Says:
March 21, 2012 at 9:10 AM
Definitely. Check out the Natural Language Processing task view on CRAN for a comprehensive overview of the R packages available for more traditional text processing (= not quick and dirty, like mine).

Andy Harper Says:
January 23, 2012 at 9:10 AM

That’s very interesting indeed! Thanks for sharing this and being so open (code snippets).

Aleksei Beloshytski (@LadderRunner) Says:
January 28, 2012 at 9:05 AM

Thank you for this, Jeffrey.

My only concern is why you compared absolute estimates (ACSI and yours), since I’m not sure ACSI used the same approach for scaling values as you did. So in my opinion it’s more suitable to compare the airlines relative to each other by position number.

Jeffrey Breen Says:
March 21, 2012 at 9:25 AM
I am so sorry this got lost in the comment queue:

Thanks for the question. I didn’t worry about the relative scalings because… well, I didn’t. Upon further reflection, I’m still not worried because as long as both measures purport to be linear, any scaling difference is simply an arbitrary constant with no meaning of its own.

The ACSI scores fall on a 0-100 scale reflecting customer satisfaction as measured through surveys and (I think) interviews. My Twitter sentiment scores also fall on a 0-100 scale based on the percentage of strong expressed emotions which are positive. Any such measurement may have biases and inaccuracies, but both should be linear.
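
For concreteness, the 0-100 Twitter score described here can be computed along the following lines. All numbers are made up, and the cutoff of 2 for what counts as a "strong" emotion is an assumption for illustration. The last few lines check the scaling argument: a linear rescaling of one measure leaves its correlation with the other unchanged.

```r
# Made-up per-tweet sentiment scores for one airline
scores = c(3, -2, 0, 2, -4, 1, 2, -1)

# Assumed cutoff: |score| >= 2 counts as a strongly expressed emotion
cutoff = 2
very.pos = sum(scores >= cutoff)
very.neg = sum(scores <= -cutoff)

# Percentage of strong emotions that are positive, on a 0-100 scale
pct.pos = 100 * very.pos / (very.pos + very.neg)
print(pct.pos)

# Made-up airline-level results from two different measures
twitter = c(62, 48, 71, 55, 40)
survey  = c(66, 61, 79, 64, 56)

# Rescaling one measure linearly does not change the correlation
r.original = cor(twitter, survey)
r.rescaled = cor(twitter, 0.5 * survey + 10)
print(all.equal(r.original, r.rescaled))
```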

isomorphismes Says:
March 21, 2012 at 7:03 AM

your javascript is using the warning() a few times

isomorphismes Says:
March 21, 2012 at 7:04 AM

Awesome, thank you for sharing.

isomorphismes Says:
April 1, 2012 at 10:29 PM

What do you think about other sentiments besides “good” and “bad” polarity? How any “sentiments” do you think you think could usefully be counted in a typical airline-customer application? Annoyed at delays; suggestions; merely comments that they are taking off soon; comments on the flight; etc.

isomorphismes Says:
April 1, 2012 at 10:31 PM
*how many

Gunjan Says:
April 30, 2012 at 2:58 AM
We should also add “actionable positive” and “actionable negative” as sentiments. Customers may simply be negative, or they may be frustrated to such a level that there is a risk of churn.

- Jeffrey Breen Says:
  August 18, 2012 at 9:58 AM
  Agreed there are many subtleties glossed over for this simple example. One could also differentiate between the emotional content of the words themselves (from ‘OK’ and ‘whatever’ to ‘awesome’ and F-bomb).

jchoi007 Says:
April 3, 2012 at 2:32 PM

Great presentation. A little late to the party but trying to follow along with the example and getting “Error in sort.list(y): invalid input ‘@Delta @thenyrangers JORDAN ain’t got nothin on me! í ½í¸œ’ in ‘utf8towcs’ ” So it’s not creating my delta.scores object…any ideas?

jchoi007 Says:
April 3, 2012 at 2:35 PM

Any idea how to get around this error? Error in sort.list(y) :
invalid input ‘@Delta @thenyrangers JORDAN ain’t got nothin on me! í ½í¸œ’ in ‘utf8towcs’

jchoi007 Says:
April 3, 2012 at 2:36 PM
Sorry for double posting, browser was acting funny so didn’t know if it made it through!

Gunjan Says:
April 12, 2012 at 2:45 AM

Hi Jeffrey, is there any other post from you on the twitteR package? Is there any package for Facebook also?

Jeffrey Breen Says:
August 18, 2012 at 9:57 AM
There’s an expanded version of this tutorial on Revolution’s Inside-R.org site: http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment

Jeffrey Breen Says:
August 18, 2012 at 9:59 AM
I haven’t looked at Facebook, but I know others have.

santosh Says:
June 11, 2012 at 2:54 PM

Hi Jeff ,

I have tried to run the function :

> delta.scores = score.sentiment(delta.text, pos.words, neg.words, .progress='text')
> delta.scores
NULL

Could you please help ?

Thanks,

Regards,
Santosh

Jeffrey Breen Says:
August 18, 2012 at 9:54 AM
Sorry for the delay… and that I can’t be much help without more detail.

That line attempts to score the messages in `delta.text` using the dictionaries in `pos.words` and `neg.words`. The first thing to check is that each of those objects contains what you expect (you can use the `str` function for a quick peek, as in `str(delta.text)`).

asidrinkcoffee Says:
June 28, 2012 at 2:40 PM

I made a video out of this tutorial, I hope you don’t mind. Credited you though!

Jeffrey Breen Says:
August 18, 2012 at 9:49 AM
Nicely done. (And great subject matter. WTH is going on down there since I left?!?)

Mr. Lee Says:
July 17, 2012 at 9:28 AM

Should RT’s be ignored? For example, a company’s website could tweet some promotion detail using lots of positive words and people would retweet that to share with their followers. This could skew the distribution quite significantly.

Jeffrey Breen Says:
August 18, 2012 at 9:51 AM
Good point. When I use this technique for the day job, I aggressively filter out all potential duplicates (including RTs). Otherwise, as you point out, you can be fooled by the echo chamber of positive marketing. 🙂
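
As a rough illustration of that kind of filtering (not necessarily the filter used in practice), retweets and exact repeats can be dropped with base R. The tweet texts below are made up, and the "RT @" prefix test is only an assumed heuristic for spotting retweets.

```r
# Hypothetical tweet texts (invented for illustration)
tweets <- c(
  "Great deals this week! http://example.com",
  "RT @airline: Great deals this week! http://example.com",
  "My flight was delayed again",
  "My flight was delayed again",
  "rt @someone: lost my bag"
)

# Flag anything that starts with "RT @" (case-insensitive)
is.rt  <- grepl("^[[:space:]]*RT[[:space:]]+@", tweets, ignore.case = TRUE)
no.rts <- tweets[!is.rt]

# Drop exact duplicates after normalising case and whitespace
key <- tolower(gsub("[[:space:]]+", " ", trimws(no.rts)))
unique.tweets <- no.rts[!duplicated(key)]
unique.tweets
```

Near-duplicates (the same promotion with a different shortened link, say) would need fuzzier matching than this.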

Elif Elif (@androidine) Says:
September 4, 2012 at 9:45 AM

Great work! Thanks for the tutorial!
Is it possible to store tweets in an Excel or CSV file, to perform the analysis later?
I want to store tweets that are older than one week, but I couldn’t find a tutorial related to this question.
Thanks in advance!

Jeffrey Breen Says:
September 4, 2012 at 1:37 PM

Thanks.

Absolutely — once the tweets are in a data.frame, you can use any of R’s standard I/O functions to store them however you’d like. save() will write them to disk in R’s native format, but CSV is just as easy (especially with twitteR’s new `twListToDF` convenience function):

$ head -3 /tmp/tweets.csv 
"text","favorited","replyToSN","created","truncated","replyToSID","id","replyToUID","statusSource","screenName"
"Anyone know why Jonathan Chang’s lda package was removed from CRAN? #rstats",FALSE,NA,2012-09-04 18:06:24,FALSE,NA,"243047374027640832",NA,"&lt;a href=&quot;http://tapbots.com&quot;&gt;Tweetbot for Mac&lt;/a&gt;","treycausey"
"RT @jebyrnes: Awesome visually weighted regression in #rstats with #ggplot2 http://t.co/EYCLiM7b (link!)",FALSE,NA,2012-09-04 17:49:29,FALSE,NA,"243043116171550721",NA,"&lt;a href=&quot;http://twitter.com/&quot;&gt;web&lt;/a&gt;","RD_Denton"

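The save-and-reload step described above can be sketched in base R as follows. The data.frame here is a hand-built stand-in for what `twListToDF` would return (its columns and values are assumptions), and the file path is a temporary one.

```r
# Hypothetical data.frame standing in for the output of twListToDF()
tweets.df <- data.frame(
  text       = c("Anyone know a good #rstats book?", "RT @user: nice plot"),
  screenName = c("alice", "bob"),
  stringsAsFactors = FALSE
)

# Write to a CSV file (the path is arbitrary; a temp file is used here)
csv.path <- tempfile(fileext = ".csv")
write.csv(tweets.df, csv.path, row.names = FALSE)

# Read it back later for analysis
restored <- read.csv(csv.path, stringsAsFactors = FALSE)
str(restored)
```

Passing `row.names = FALSE` keeps R from adding an extra column of row numbers to the file.
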
Manju Says:
May 1, 2013 at 3:49 PM
Am getting an error message
Error in twInterfaceObj$doAPICall(cmd, params, “GET”, …) :
OAuth authentication is required with Twitter’s API v1.1
Please guide

Elif Elif (@androidine) Says:
September 4, 2012 at 3:36 PM

Thank you very much for your quick reply.
I’ve just collected the tweets and realized that I need to remove some that are duplicates or not even related to the topic. Is it possible to manipulate (delete) cells within the CSV file?

Jeffrey Breen Says:
September 4, 2012 at 3:56 PM
There are a number of ways to subset data in R. Check out the subset() function. (Type ?subset to see its documentation page). Also, there are some excellent resources on the Net. The Quick-R site has a nice page showing how to subset using bracket notation in addition to the subset function: http://www.statmethods.net/management/subset.html

Depending on how sophisticated you need your text matching to be, you may be interested in knowing that R can also handle regular expressions: see the grep() and grepl() functions.
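
A small sketch of both approaches, using a made-up data.frame (the column names and the "flight|luggage" pattern are assumptions chosen for illustration):

```r
# Hypothetical tweets data.frame (values invented for illustration)
tweets.df <- data.frame(
  text = c("Delta lost my luggage", "Studying delta functions in calculus",
           "Great Delta flight today", "Delta lost my luggage"),
  screenName = c("a", "b", "c", "a"),
  stringsAsFactors = FALSE
)

# subset() with a regular expression: keep rows that mention flights or luggage
on.topic <- subset(tweets.df, grepl("flight|luggage", text, ignore.case = TRUE))

# Bracket notation doing the same, and also dropping repeated texts
keep <- grepl("flight|luggage", tweets.df$text, ignore.case = TRUE) &
  !duplicated(tweets.df$text)
cleaned <- tweets.df[keep, ]
cleaned
```

The cleaned data.frame can then be written back out with `write.csv()` rather than editing the file's cells by hand.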

Elif Elif (@androidine) Says:
September 4, 2012 at 5:23 PM

Great link, thanks! Just a last question:
I want to load the dataset from the CSV file and get the corresponding tweet texts. I tried to figure out whether the following code would work:
test = read.csv(file='/tmp/tweets.csv')
tweet_text = sapply(test, function(x) x$getText())
However, I get this error: x$getText : $ operator is invalid for atomic vectors.

Jeffrey Breen Says:
September 4, 2012 at 8:45 PM
The sapply(… x$getText()) bit is used to pull the text field out of the Tweet object. But we parsed all those fields out before we saved the tweets to disk (it’s what the twListToDF function did).

In any case, read.csv() returns a data.frame so you can access the tweet_text as test$text. You should use the str() function (or just click on the object in RStudio) to take a look at the contents of test — it should make more sense then.
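
To make that concrete, here is a minimal sketch with a made-up data.frame standing in for the result of `read.csv()` (the column names and values are assumptions):

```r
# Hypothetical stand-in for the data.frame returned by read.csv()
test <- data.frame(
  text      = c("first tweet", "second tweet"),
  favorited = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# sapply() over a data.frame visits each COLUMN, and each column is a plain
# (atomic) vector, which is why calling x$getText() on it fails
column.classes <- sapply(test, class)
column.classes

# The tweet text is already a column, so no accessor method is needed
tweet_text <- test$text
tweet_text
```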

HTH,
Jeffrey

- Abhishek Says:
  December 30, 2013 at 8:03 AM
  hi Jeffrey Breen
  Gr8 link. Thanks a lot for sharing
  I am facing the same problem too.
  Presently I am using R 3.02 (on the Windows platform).
  Below is the code
  
  df<- read.csv ('C:/Users/abc/Desktop/SSIndiaTweets.csv')
  dm_txt = sapply(df, function(x) x$getTweet_text ())
  
  I get this error: x$getText : $ operator is invalid for atomic vectors.
  
  is there a way out
  
  Thanks in Advance
  
  Regards
  Abhishek

bindu Says:
November 4, 2012 at 6:13 AM

I’m using the sentiment package to analyse some tweets. Most of the emotions are just not available. Can you suggest any solution?

bindu Says:
November 6, 2012 at 4:12 AM

Hello, I’m using the R sentiment package. It shows the emotion as NA for “i am not angry” and also for “don’t trust anybody”. Can somebody explain this?

rizwana irfan Says:
November 25, 2012 at 5:52 PM

Hello Jeffrey, this is indeed a very useful and informative tutorial on Twitter text mining and sentiment analysis. I am a PhD student trying to use your code for practice. Unfortunately, I have a problem and hope you can help me out. I am using the code
pos.matches = match(words, pos.words)
> neg.matches = match(words, neg.words)
but all of my output is NA. I have checked everything but cannot figure out what the problem is. The following is the code for your understanding:
sample = c("You're Awesome and I love you",
"I hate and hate and hate. So angry. Die!",
"Impressed and Amazed: you are peerless in your achievement of
unparalleled mediocrity.")
scores = laply(sample, function(sentence, pos.words, neg.words){
# clean up sentences with R's regex-driven global substitute, gsub():
sample = gsub('[[:punct:]]', '', sample)
sample = gsub('[[:cntrl:]]', '', sample)
sample = gsub('\\d+', '', sample)
# and convert to lower case:
sample = tolower(sample)
sample
# split into words. str_split is in the stringr package
word.list = str_split(sample, '\\s+')
word.list
# sometimes a list() is one level of hierarchy too much
words = unlist(word.list)
words
# compare our words to the dictionaries of positive & negative terms
pos.matches = match(words, pos.words)
neg.matches = match(words, neg.words)
The output should not be NA, as the positive words are in pos.words. The same goes for neg.words.
Your help would be highly appreciated.
Rizwana irfan
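
For what it is worth, seeing NA in the output of `match()` is expected for every word that is not in the dictionary; only the hits come back as positions. The sketch below shows this with a tiny made-up dictionary (an assumption for illustration, not the lexicon used in the tutorial).

```r
# Tiny made-up dictionary and word list, for illustration only
pos.words <- c("love", "awesome", "impressed")
words     <- c("youre", "awesome", "and", "i", "love", "you")

# match() gives the position in pos.words for a hit and NA otherwise
pos.matches <- match(words, pos.words)
pos.matches

# Turn the result into TRUE/FALSE and count the hits
n.pos <- sum(!is.na(pos.matches))
n.pos
```

If every single entry is NA, a quick check such as `"awesome" %in% pos.words` may show whether the dictionary was loaded as expected.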

Arnaud Says:
November 25, 2014 at 9:53 AM
i got a similar problem trying to get the positive or negative words of a sample. my full code:
scores = laply(sample, function(sentence, pos.words, neg.words){
# clean up sentences with R's regex-driven global substitute, gsub():
sample = gsub('[[:punct:]]', '', sample)
sample = gsub('[[:cntrl:]]', '', sample)
sample = gsub('\\d+', '', sample)
# and convert to lower case:
sample = tolower(sample)
sample
# split into words. str_split is in the stringr package
word.list = str_split(sample, '\\s+')
word.list
# sometimes a list() is one level of hierarchy too much
words = unlist(word.list)
words
# compare our words to the dictionaries of positive & negative terms
pos.matches = match(words, pos.words)
neg.matches = match(words, neg.words)
# match() returns the position of the matched term or NA
# we just want a TRUE/FALSE:
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)
# and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
score = sum(pos.matches) - sum(neg.matches)
return(score)
}, pos.words, neg.words, .progress=.progress )

Error in llply(.data = .data, .fun = .fun, …, .progress = .progress, :
object ‘.progress’ not found
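
The `.progress` name in that last line is an argument of the enclosing `score.sentiment()` function shown in the post, so running the inner `laply()` call on its own leaves it undefined. One way around this is to keep everything inside a function. The sketch below does that with base R's `vapply()` substituted for `laply()`, so there is no progress bar at all; the name `score_sentences` and the toy dictionaries are invented for illustration.

```r
# Base-R sketch of the scoring function; vapply() stands in for plyr::laply()
score_sentences <- function(sentences, pos.words, neg.words) {
  vapply(sentences, function(sentence) {
    # strip punctuation, control characters and digits, then lower-case
    sentence <- gsub("[[:punct:]]", "", sentence)
    sentence <- gsub("[[:cntrl:]]", "", sentence)
    sentence <- gsub("[[:digit:]]+", "", sentence)
    sentence <- tolower(sentence)

    # split on whitespace
    words <- unlist(strsplit(sentence, "[[:space:]]+"))

    # TRUE/FALSE hits are summed as 1/0
    sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
  }, numeric(1), USE.NAMES = FALSE)
}

# Toy dictionaries (assumed values, not the lexicon files from the tutorial)
pos <- c("awesome", "love", "impressed", "amazed")
neg <- c("hate", "angry", "die", "mediocrity")

sample.text <- c("You're Awesome and I love you",
                 "I hate and hate and hate. So angry. Die!")
scores <- score_sentences(sample.text, pos, neg)
scores
```

With plyr, the equivalent fix would be to call the whole `score.sentiment()` function, or to replace `.progress=.progress` with a literal value such as `.progress='none'` when running the `laply()` call by itself.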

Alex Says:
November 30, 2012 at 9:58 AM

This is a fabulous presentation, very clear, informative and works very well as a stand alone document. Thanks so much for posting it.

Alex Benning Says:
December 14, 2012 at 5:38 PM

Great tutorial. I have gotten this to work, but I am having an issue with duplicate tweets skewing the results. Is there a simple way to remove the duplicate tweets?

wushuo1988 Says:
December 18, 2012 at 2:00 PM

I keep getting error like this when I run this statement:

> delta.scores = score.sentiment(delta.text, pos.words, neg.words, .progress='text')

Error in sort.list(y) :
invalid input ‘@delta should rethink touchscreen in-seat games like bejewelled on long haul flights. Someone played morse code with my head for hours í ½í¸³’ in ‘utf8towcs’

it seems the tweet has some symbols/emoticons like 🙂 😦 . How do I handle this issue? Thanks in advance!

Shokoufeh Mirzaei (@sxmirzaei) Says:
June 11, 2013 at 2:24 PM
could you solve the problem? I am getting the same error!

Nandi Says:
December 19, 2014 at 5:35 PM
Just try
delta.text = str_replace_all(delta.text, "[^[:graph:]]", " ")

Removing the non-graphical characters worked for me.
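
Another commonly used option is to convert the text to plain ASCII with base R's `iconv()`, dropping whatever cannot be represented. This is only a sketch: the tweet below is made up, it assumes the input is meant to be UTF-8, and it also throws away accented letters, which may or may not be acceptable.

```r
# Made-up tweets; the first ends with an emoji written as a \U escape
delta.text <- c("JORDAN ain't got nothin on me! \U0001F61C", "plain ascii tweet")

# Convert to ASCII; sub = "" drops any bytes that cannot be converted
ascii.text <- iconv(delta.text, from = "UTF-8", to = "ASCII", sub = "")

# Functions such as tolower() now see only plain ASCII text
lowered <- tolower(ascii.text)
lowered
```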

skishchampi Says:
December 21, 2012 at 9:27 AM

Hi Jeffrey.

I was following the slides for a different @user and got stuck at the following

hist(ge.scores$score)
Error in plot.new(): figure margins too large

I am very new to R. Can you help me out here ?

Thanks,
Skish.
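
That message commonly means the plotting area is too small for the default margins, so enlarging the plot pane or shrinking the margins with `par(mar = ...)` is worth a try. A minimal sketch, with made-up scores standing in for `ge.scores$score` and an off-screen device so that it runs anywhere:

```r
# Made-up scores standing in for ge.scores$score
scores <- c(-2, -1, 0, 0, 1, 1, 1, 2, 3)

pdf(NULL)                             # off-screen device; omit this interactively
old.par <- par(mar = c(4, 4, 2, 1))   # margins in lines: bottom, left, top, right
h <- hist(scores, main = "Sentiment scores", xlab = "score")
par(old.par)                          # restore the previous settings
invisible(dev.off())
```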

Mohamed Ali Abdulle Says:
December 22, 2012 at 5:29 PM

Hi jeffrey,
Thanks for your presentation, it was really helpful.
I was trying to check the Kuwait Airways tweets. It is my first time using R.
I am using Windows 7 and have hit this problem a couple of times:
neg.words = c(hu.liu.neg)
Error: object ‘hu.liu.neg’ not found
Could you please help.
Best Regards,
Mohamed Ali

Marian Dragt Says:
December 27, 2012 at 3:56 PM

Fantastic post, thanks! I added 1 line to deal with “strange” characters in tweets:

sentence = gsub('[[:punct:]]', '', sentence)
sentence = gsub('[[:cntrl:]]', '', sentence)
--->sentence = gsub('[[:alnum:]]', '', sentence)
sentence = gsub('\\d+', '', sentence)

Marian Dragt Says:
December 27, 2012 at 4:23 PM

Sorry, it should be:
#strip strange characters
x.text = gsub("[^[:alnum:]/// ']", "", x.text)

Shokoufeh Mirzaei (@sxmirzaei) Says:
June 11, 2013 at 2:25 PM
I tried this, it did not work! I am still getting the same error!

sai Says:
January 2, 2013 at 11:55 PM

Can you send me the PDF version of this…

Bryan Osorio Says:
January 4, 2013 at 12:02 AM

Hi Jeffrey, thanks a lot, this is very useful and great for learning. The function code returns a matrix with n+1 columns, where n is the number of tweets. I changed the last line before the return to:
scores.df = data.frame(cbind(score=scores, text=sentences))

I only add cbind, is it correct?
Thank you!
Bryan
https://www.facebook.com/ClickMetrics
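
One thing to watch with the `cbind()` variant: combining a numeric vector with a character vector produces a character matrix, so the score column stops being numeric. A small sketch with made-up values:

```r
# Made-up scores and sentences, for illustration only
scores    <- c(2, -5)
sentences <- c("You're awesome", "I hate delays")

# cbind() builds a character matrix first, so score ends up as text
with.cbind <- data.frame(cbind(score = scores, text = sentences))

# Passing the vectors directly keeps score numeric
without.cbind <- data.frame(score = scores, text = sentences,
                            stringsAsFactors = FALSE)

is.numeric(with.cbind$score)
is.numeric(without.cbind$score)
```

Functions such as `hist()` or `mean()` need the numeric version, so the form without `cbind()` is probably the safer one.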

Aniks Says:
January 24, 2013 at 7:52 PM

Hello Jeffrey,

This is a great presentation, I learnt many new things using this example. I was successful to go through the entire presentation without errors.

Now when I am trying to work with it again I am getting this error.

> twitter.tweets = searchTwitter(‘@delta’, n=1500)
Error in .self$twFromJSON(out) :
Error: Malformed response from server, was not JSON

Earlier it worked fine.

Is there some limit on the number of tweets that can be accessed? I checked online and found that I need to load all the libraries. I am doing that. Can you help me with this?

Thank you very much for sharing this.

Duncan McQueen Says:
April 21, 2013 at 12:46 PM

I fixed the utf8 issue and updated the code for Twitter’s OAuth authentication. My patch is located here – http://pastebin.com/Pp8ijRTk

Fernando Says:
July 24, 2013 at 7:41 AM

Can you help me? This error occur:

> source(“R/scrape.R”)
[1] “Searching Twitter for airline tweets and saving to disk”
Error in twInterfaceObj$doAPICall(cmd, params, “GET”, …) :
OAuth authentication is required with Twitter’s API v1.1

simak Says:
August 8, 2013 at 7:20 AM

Fantastic presentation…Thanks for sharing.

If we search for “delta” in the tweet, without the hashtag, I understand that searchTwitter returns tweets with “delta” in the handle as well. Is there a way to force it to return only the tweets, and not the tweets and handles?

Henk Says:
November 15, 2013 at 3:20 AM

Very good presentation sir.

Deepak Says:
December 5, 2013 at 9:09 PM

Hello Jeffrey,
I need to get past tweets using R. I tried to use “since” and “until” but API returned me no tweets. How could I get those tweets. Any idea???

Federica Says:
January 2, 2014 at 2:06 PM

Very useful post..Thanks!!
I would like to ask for your help. I’m doing my final thesis work for my second-cycle degree in Marketing and I’m studying the use of social networks. I’m using the twitteR package (the searchTwitter query) to export all the tweets containing a specific hashtag in CSV format. I would like to analyze their text and discover how many of them contain words from a specific list that I have saved in a file called importantwords.txt. Could you help me create a function that returns a score of how many tweets contain the words in my file importantwords.txt?

I created this draft of function but it doesn’t work. Could you correct it for me?

library (plyr)
library (stringr)

score.sentiment = function(sentences, important.words, .progress='none')
{
  require(plyr)
  require(stringr)
  scores = laply(sentences, function(sentence, important.words) {
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    sentence = tolower(sentence)
    word.list = str_split(sentence, '\\s+')
    words = unlist(word.list)
    pos.matches = match(words, important.words)
    pos.matches = !is.na(pos.matches)
    score = sum(pos.matches)
    return(score)
  }, important.words, .progress=.progress )
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

hu.liu.pos = scan('C:/Users/XX/Desktop/importantwords.txt', what='character', comment.char=';')
pos.words = c(hu.liu.pos)

Thank you very much for your help. I don’t know anybody who can use this package, and your posts explain its use so well; that is why I am asking for your help.
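
One possible base-R sketch for counting how many tweets contain at least one word from such a list is shown below. The word list and tweets are made up (in practice the words would be read from the file with `scan()`), and the helper name `contains_important` is invented for illustration.

```r
# Made-up word list and tweets, for illustration only
important.words <- c("discount", "loyalty", "brand")
tweets <- c("Loved the DISCOUNT from this brand!",
            "Nothing relevant here",
            "Loyalty programs rock")

# TRUE when a tweet contains at least one listed word
contains_important <- function(tweet, word.list) {
  clean <- tolower(gsub("[[:punct:]]", "", tweet))
  words <- unlist(strsplit(clean, "[[:space:]]+"))
  any(words %in% word.list)
}

has.word <- vapply(tweets, contains_important, logical(1),
                   word.list = important.words, USE.NAMES = FALSE)

# Number of tweets that mention at least one important word
n.with.word <- sum(has.word)
n.with.word
```

This counts tweets rather than word occurrences, so a tweet that uses two listed words still counts once.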

Bach Says:
March 11, 2014 at 4:58 PM

Thanks for the great presentation. Very helpful.
Assuming that I want to replicate my sentiment index on a monthly basis to track the change of trend since last month, is there a specific command I need to add to get only the tweets made since the previous month only .
Thanks

Paola Says:
April 4, 2014 at 8:33 AM

Hi Jeffrey, can you help me with this error?

Error: unexpected ‘)’ in:
”
scores =laply(sentences, function(sentence, pos.words, neg.words)

I’m from Colombia and I’m trying to use your code for sentiment analysis, but in Spanish.

Thanks!!

abraham Says:
April 27, 2014 at 1:15 PM

I was trying to check trending words for a certain country using the function getTrends(), but I get the following error.
Error in twInterfaceObj$doAPICall(“trends/place”, params = params, …) :
Error: Could not resolve host: api.twitter.com; No data record of requested type

But when I go to Twitter I am able to connect. What could be the cause of the error?
In fact, I was able to use it last week.

j k lakshna (@lakshnajk) Says:
January 20, 2015 at 3:33 AM

Hello, I’m getting an error in the line

}, pos.words, neg.words, .progress=.progress )

the error says

Error: unexpected ‘}’ in:
“}
, pos.words,neg.words, .progress=.progress }”

Kindly help!
Thanks
