slides from my R tutorial on Twitter text mining #rstats

Update: An expanded version of this tutorial will appear in the new Elsevier book Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications by Gary Miner et al., which is now available for pre-order from Amazon.

In conjunction with the book, I have cleaned up the tutorial code and published it on github.


Last month I presented this introduction to R at the Boston Predictive Analytics MeetUp on Twitter Sentiment.

The goal of the presentation was to expose a first-time (but technically savvy) audience to working in R. The scenario we work through is to estimate the sentiment expressed in tweets about major U.S. airlines. Even with a tiny sample and a very crude algorithm (simply counting the number of positive vs. negative words), we find a believable result. We conclude by comparing our result with scores we scrape from the American Customer Satisfaction Index (ACSI) website.

Jeff Gentry’s twitteR package makes it easy to fetch the tweets. Also featured are the plyr, ggplot2, doBy, and XML packages. A real analysis would, no doubt, lean heavily on the tm text mining package for stemming, etc.

Here is the slimmed-down version of the slides:

And here’s a PDF version to download.

Special thanks to John Verostek for putting together such an interesting event, and for providing valuable feedback and help with these slides.


Update: thanks to eagle-eyed Carl Howe for noticing a slightly out-of-date version of the score.sentiment() function in the deck. Missing was handling for NA values from match(). The deck has been updated and the code is reproduced here for convenience:


score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
	require(plyr)
	require(stringr)
	
	# we got a vector of sentences. plyr will handle a list
	# or a vector as an "l" for us
	# we want a simple array ("a") of scores back, so we use 
	# "l" + "a" + "ply" = "laply":
	scores = laply(sentences, function(sentence, pos.words, neg.words) {
		
		# clean up sentences with R's regex-driven global substitute, gsub():
		sentence = gsub('[[:punct:]]', '', sentence)
		sentence = gsub('[[:cntrl:]]', '', sentence)
		sentence = gsub('\\d+', '', sentence)
		# and convert to lower case:
		sentence = tolower(sentence)

		# split into words. str_split is in the stringr package
		word.list = str_split(sentence, '\\s+')
		# sometimes a list() is one level of hierarchy too much
		words = unlist(word.list)

		# compare our words to the dictionaries of positive & negative terms
		pos.matches = match(words, pos.words)
		neg.matches = match(words, neg.words)
	
		# match() returns the position of the matched term or NA
		# we just want a TRUE/FALSE:
		pos.matches = !is.na(pos.matches)
		neg.matches = !is.na(neg.matches)

		# and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
		score = sum(pos.matches) - sum(neg.matches)

		return(score)
	}, pos.words, neg.words, .progress=.progress )

	scores.df = data.frame(score=scores, text=sentences)
	return(scores.df)
}
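
To see the counting logic on a toy input, here is a minimal base-R sketch. It restates the same steps (clean, lower-case, split, match, subtract) without plyr or stringr, so it is not the function from the deck. The helper name `score_one` and the tiny word lists are made up for illustration; a real run would use a full opinion lexicon.

```r
# Hypothetical toy dictionaries; a real analysis would load a full lexicon.
pos.words <- c("love", "awesome", "great")
neg.words <- c("hate", "angry", "awful")

# Base-R restatement of the per-sentence scoring steps (assumed helper name).
score_one <- function(sentence, pos.words, neg.words) {
  sentence <- gsub("[[:punct:]]", "", sentence)
  sentence <- gsub("[[:cntrl:]]", "", sentence)
  sentence <- gsub("\\d+", "", sentence)
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  # count dictionary hits: positives minus negatives
  sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
}

sentences <- c("You're awesome and I love you",
               "I hate and hate and hate. So angry!",
               "The flight left on time")

scores <- vapply(sentences, score_one, numeric(1),
                 pos.words = pos.words, neg.words = neg.words,
                 USE.NAMES = FALSE)
scores.df <- data.frame(score = scores, text = sentences)
print(scores.df)
```

With these made-up lists the first sentence scores 2, the second -4, and the third 0, because none of its words appear in either list.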

129 Responses to “slides from my R tutorial on Twitter text mining #rstats”

  1. Ronert Obst Says:

    interesting post!

  2. biao Says:

    thanks a lot, Jeffrey, I am very interested in the slides, especially your example of using twitteR package, would you mind sending me a copy of this slide? it seems i can’t download it. I won’t share with others, thanks.

  3. Norm Albertson Says:

    Very good presentation. Interesting subject but your slide presentation is outstanding. As an R neophyte I felt comfortable following the slides, understanding both the data mining and manipulation techniques, and more importantly, what you were accomplishing with each step. All without your narration to explain things! Well done, thanks for sharing.

    • Jeffrey Breen Says:

      Thanks, Norm!

      John Verostek helped enormously with the slides. In addition to wordsmithing and flow, he suggested the colored text box callouts. I think they helped the live audience, but they definitely make it more “standalone”.

      I’m glad you found the presentation useful.

      Jeffrey

  4. Shea Says:

    I of course love the faceting ability in ggplot… but for comparing distributions may I recommend doing a geom_freqpoly or geom_density with an aes(color=airline). It makes it much easier to see discrepancies in their distributions.

    Thanks for the fun bits about twitteR too!

  5. Shea Says:

    Additionally, I’ve *heard* that if you transform the tweets into word frequencies and then perform SVD, the first dimension most often naturally represents positive or negative feelings. There’s some fancy “stubbing” and other bits you need to do too. I haven’t tried my hand, but I’m thinking up plenty of twitter terms that might give me some meaningful data to play with.

  6. Soren Macbeth Says:

    Very cool. line 10 has a typo. instead of:

    scores = laply(sentences, function(sentence, pos.words, neg.words) {

    it should read:

    scores = lapply(sentences, function(sentence, pos.words, neg.words) {

    • Jeffrey Breen Says:

      Thanks, Soren!

      That line is correct — it’s using the laply() function from Hadley Wickham’s plyr package. Base R’s lapply() would work, too (except for the .progress=.progress parameter specified near the bottom), but I favor plyr for a few reasons:

      1. Standard naming convention. I could never remember whether I wanted mapply(), tapply(), sapply() or what. The functions in the plyr package are named simply: the first letter specifies what data type you’re passing in (“d” for data.frame, “l” for list, “a” for array, etc.) and the second letter specifies the data type you want as output. So, laply() takes a list and outputs an array (or a vector if one-dimensional). Similarly, ldply() would take the same list but return a data.frame.
      2. Support for parallel processing with foreach. Just specify .parallel=TRUE and plyr’s functions will execute using the foreach package’s parallel backend automagically.
      3. Progress bars for free. I find this especially handy during interactive, exploratory work. Just specify a value for the .progress parameter and you’ll receive graphical feedback of its progress. Valid values include “tk” for Tcl/Tk (Unix/Mac), “win” for Windows, and my favorite, “text” for old school ASCII in the console window. (Default is “none”.)
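
The naming convention in point 1 can be tried on a small input. This is only a sketch: it assumes the plyr package is installed, and the sample sentences and variable names are made up.

```r
library(plyr)

sentences <- c("short one", "a slightly longer sentence", "x")

# "l" in, "a" out: one scalar per element comes back as a plain vector
char.counts <- laply(sentences, nchar)

# "l" in, "d" out: one-row data.frames are stacked into a single data.frame
char.df <- ldply(sentences, function(s) {
  data.frame(text = s, chars = nchar(s), stringsAsFactors = FALSE)
})

# base R counterpart of the laply() call (no .progress option available)
char.counts.base <- sapply(sentences, nchar, USE.NAMES = FALSE)

print(char.counts)
print(char.df)
```

Swapping laply() for ldply() changes only the shape of the result, not the function applied to each element.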

      Jeffrey

  7. Larry (IEOR Tools) Says:

    How were you able to use searchTwitter() to get 1500 tweets? I thought Twitter had an API limit of 100. I’m only able to get 100 tweets.

    • Jeffrey Breen Says:

      Hi Larry:

      My understanding is that searchTwitter() uses Twitter’s Search API, and according to Twitter’s Things Every Developer Should Know (at the bottom), its limit is 1500. (You can try specifying a larger n, but 1500 is the maximum you’ll get back.)

      But there’s also a time limit on what is in the index (and a rate limit too, by IP), so if you’re only getting 100 tweets, it may be that your query is relatively low-volume and they have aged out of the cache. Take a look at “#rstats” vs. “#twitter”:

      > r.tweets = searchTwitter('#rstats', n=1500)
      > length(r.tweets)
      [1] 104
      > twitter.tweets = searchTwitter('#twitter', n=1500)
      > length(twitter.tweets)
      [1] 1500
      

      But those 104 R tweets go back a few days, whereas 1500 Twitter tweets barely span two hours:

      > r.tweets[[104]]$getCreated()
      [1] "2011-07-02 13:42:08 UTC"
      > twitter.tweets[[1500]]$getCreated()
      [1] "2011-07-06 11:53:17 UTC"
      

      What happens if you search for something more common like ‘#twitter’?

      Jeffrey

      • Jeff Gentry Says:

        There are multiple limitations, both time & number. The search will only go back for 2 days and a maximum of 1500 tweets as Jeffrey said.

        However the search API will only return 100 tweets at a time, in a paged manner. The searchTwitter() function handles all of the pagination for you, which allows you to get more than 100 at once.

        The Twitter API has a few different ways of handling that sort of situation, but in all cases that is abstracted away from the user in the twitteR package.

      • Jeffrey Breen Says:

        Hi Jeff:

        Thanks for chiming in. Without the hard work you put into the twitteR package, I’m sure the whole presentation would only use 100 tweets (if that)!

        Thanks again,
        Jeffrey

      • Jeff Gentry Says:

        Without a good story/use case it’d just sit there unused, so I know which is more valuable 😉 Now I’ve got something cool to point to when people ask for an example of it being useful!

      • Larry (IEOR Tools) Says:

        I still only get 100 tweets when doing the following

        > tweets length(tweets)
        [1] 100

      • Jeff Gentry Says:

        How are you generating “tweets”? What version of the package (IIRC there was a bug a couple of months ago)

      • Larry (IEOR Tools) Says:

        That was it! I was using an old version. For those that want to know version 0.99.9 works for this example.

  8. steve o'grady Says:

    Outstanding presentation, and I’m having a lot of fun playing with the scripts. Ran a few without incident, but ran into trouble with my last one, which kicked a: “50%Error in tolower(sentence).” Not to ask you to do public support, but thought I’d drop it just in case you or anyone else here had seen and handled this.

    In any event, thanks for sharing this.

    • Jeffrey Breen Says:

      Thanks, Steve!

      That seems like a non-controversial place to fail — sounds like a character set issue. I didn’t run into it for this presentation (apparently people only curse at US airlines in ASCII), but I see that I have a call to Base R’s iconv() function in some “real” code I had written before.

      Try inserting this line before the call to tolower():

      sentence = iconv(sentence, 'UTF-8', 'ASCII')
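
One caveat, hedged: as base R documents it, iconv() returns NA for a string containing characters it cannot convert unless the `sub` argument is supplied. Below is a minimal sketch, with made-up sample tweets, that drops the unconvertible bytes instead. The exact replacement behaviour can vary by platform.

```r
# Made-up sample input: one plain tweet, one with a non-ASCII character
tweets <- c("plain ascii tweet", "caf\u00e9 at the gate")

# Without `sub`, a string that cannot be converted typically becomes NA;
# sub = "" asks iconv() to drop the offending bytes instead.
clean <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

print(tolower(clean))
```

Dropping characters is lossy, so this suits a crude word-count score better than any analysis that needs the original text.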
      

      HTH,
      Jeffrey

  9. Tony Breyal Says:

    I know it’s been said above, but still, I feel this really is a set of very good stand-alone presentation slides; I often find that I can get lost in the narrative of a talk, but by referring back to the flow chart I found it easy to follow and knew where in the story I was, where I’d been and where I was heading next. This visual idea is something I hope to use myself in the future.

    Very interesting topic indeed. Out of curiosity, why did you opt not to use the tm package?

    • Jeffrey Breen Says:

      Thanks very much, Tony.

      The tm package is great, but I wanted to focus on the basics of R (along with my go-to packages). Also, I am far from a text mining expert, and didn’t want to get bogged down trying to explain (or answer questions) about stemming, stopwords, and document-term/term-document matrices — especially with such a crude sentiment algorithm. Believe me — no one was more surprised than I when a sensible result came out (and all I had to fudge was throwing out low-volume, about-to-be-merged-away @continental)!

      Jeffrey

  10. Angela Waner Says:

    Please contact me if you would be interested in including your tutorial in a book on text mining. The same authors who wrote this book, http://www.amazon.com/Handbook-Statistical-Analysis-Applications-ebook/dp/B002ZJSVPA/ref=sr_1_2?ie=UTF8&qid=1309979464&sr=8-2, are working on a text mining book.

  11. John Johnson Says:

    Interesting, and I thoroughly enjoyed working through the example. Then I tried the techniques on the major players in my industry, the pharmaceutical industry. From my previous experience working through Matthew Russell’s _Mining the Social Web_ I realized that most of the people who tweet about pharma are experienced commentators or consultants, so, as it turned out, “sentiment” equated to whether the company had good or bad news (you could tell by looking at the text of the tweets that had the extreme sentiments). So this could be useful for some automated early signal detection about companies you want to follow.

    • Jeffrey Breen Says:

      Hi John:

      That’s an interesting thing to look at. I wouldn’t have guessed that most of the airline tweeters are industry insiders or experts — perhaps a B2B vs. B2C split?

      Thanks,
      Jeffrey

  12. Paul Says:

    Can I get a PDF copy?

    Thanks!

    Paul

  13. pinboard July 7, 2011 — arghh.net Says:

    […] R tutorial on twitter text mining […]

  14. State of Data #56 « Dr Data's Blog Says:

    […] of Data #56 by doctordata on July 8, 2011  #analysis – Mining Twitter for consumer attitudes towards airlines (using R) – (a) “search twitter in 1 line of code”; (b) Estimate sentiment from ‘opinion […]

  15. andrew clark Says:

    Fascinating work. I’m having problems installing twitteR with R v 2.13.0 on Windows

    Warning: dependencies ‘RCurl’, ‘RJSONIO’ are not available

    The RCurl package does not display in the available CRAN list and if I try
    install.packages("RCurl", dependencies = TRUE) I get the error message

    package ‘RCurl’ is not available (for R version 2.13.0)

    Any suggestions?

  16. andrew clark Says:

    Jeffrey
    Thanks for that. I have installed all three packages successfully but when I load twitteR I get
    Error: package ‘RJSONIO’ is not installed for ‘arch=x64’

    guess that means it does not work on R 64-bit? Though cannot immediately see that referred to in docs
    Andrew

  17. Suresh Says:

    Nice post, Jeffrey would you mind sending me a copy of this?

  18. » Data analysis of Twitter reaction to the Carbon Tax Tom's Blog Says:

    […] To perform this analysis I used R an awesome stats language, a ‘sentiment-lexicon’ from Hu & Liu and the method described in this powerpoint by Jeffrey Breen. […]

  19. Abhijit Sanyal Says:

    Hi Jeffrey,
    I am taking the risk of taking your time to ask for free support but I have not been able to find resolution to running this basic twitteR command in your presentation and so any help would be appreciated. I am a relatively new user of R and have installed twitteR but cannot get beyond the error below:
    > delta.tweets = searchTwitter('@delta', n=1500)
    Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
    couldn’t connect to host
    Thanks for your help
    Abhijit

    • Jeff Gentry Says:

      Abhijit –

      Funny you should mention this, as Jeffrey’s been running into the same problem and we’ve been trying to identify the cause. The current theory* is that it’s the API version of the fail whale, and I realized last night why it’s not being gracefully handled (to be resolved soon, although it’d still throw an error – just a less cryptic one).

      * I have a few reasons to believe that this is at least partially incorrect, but either way I still believe it’s something along those lines

      • Jeffrey Breen Says:

        Hi Jeff:

        Thanks so much for jumping in! (I swear I never got such great support for commercial vendor$…)

        I got the impression that Abhijit is having this problem all of the time. My problem is more intermittent and restricted to the userTimeline() function (so far…), not with simpler non-OAuth calls to searchTwitter().

        Jeffrey

      • Emma Says:

        Hi Jeff,

        I was wondering if you had solved this issue. I am running into a similar problem. The behavior I am experiencing is an error that results in too many mysterious open connections.

        Thanks!

    • Jeffrey Breen Says:

      Hi Abhijit:

      searchTwitter() connects to the Twitter Search API via http://search.twitter.com/search.json.

      First make sure you can ‘ping search.twitter.com’ from your operating system.

      Then, from R, you can try accessing the API directly:

      library(RCurl)
      url = "http://search.twitter.com/search.json?q=%40Delta&result_type=recent&rpp=100&page=1"
      getURL(url)
      

      Whoa, I seem unable to get WordPress to display the URL without wrapping it in HTML cruft. The string you need to feed to getURL is:

      url = "http: //search.twitter.com/search.json?q=%40Delta&result_type=recent&rpp=100&page=1"
      

      without the space between the “:” and the “/”.

      Hopefully that helps you narrow down the problem.

      Good luck!

      Jeffrey

      • Jeff Gentry Says:

        Actually you’re right Jeffrey 🙂 I don’t think that’s exactly the same error you’ve been getting, I just looked at it quickly. It does look more like a network connectivity issue.

        Another thing to test is just to make sure that network connectivity is working w/ R in general – ie can you use install.packages() from the R prompt.

  20. Abhijit Sanyal Says:

    Hi Jeffrey and Jeff
    Many thanks for your responses. First I should say that I am doing all this on Windows XP. I did try the various suggestions given above – which gave me the same result as before “Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : couldn’t connect to host”. However I was able to ping twitter from the C:\ command line and I can download and install R packages.
    I installed R on another laptop but connected to the same home office internet service provider and in this case twitteR worked and I was able to work on some of the examples given in the above presentations. I think the problem is machine specific – something to do with the internet connectivity / postini restrictions that are built into my current machine.
    However here are some of the issues that I ran up against:
    The other airline names are very generic and while @delta might work, american does not work very well and americanair was better but I could not get more than 400 tweets. Most of the other airlines had the same problem and rbind gives errors if the number of columns are not the same. This may require a different rshape function or a flexible rbind. With jetblue – I got 1500 tweets and I think unique and distinctive brands garner more specific data at the first pass.
    I must say that I am very grateful to you for your help and responsiveness. I may come back to you for further advice and help.
    Thanks

    Abhijit

    • Jeffrey Breen Says:

      Glad you got twitteR working, albeit on a different machine.

      Some airlines will definitely give fewer than 1,500 tweets. Twitter’s Search API only keeps a couple of days’ worth of tweets in its index, so lower-volume airlines’ tweets can age out.

      But rbind() will only fail if your data.frames differ in their columns (number or names), not in their number of rows. If rbind() is failing, check that none of the data.frames is empty and that you have added any extra columns (like airline names) to all of them.
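
A small sketch of that check, using made-up score tables (the object names and values are hypothetical):

```r
# Made-up score tables for two airlines
delta.scores  <- data.frame(score = c(1, -2), text = c("great crew", "awful delay"))
united.scores <- data.frame(score = c(0, 3),  text = c("boarding now", "love the seats"))

# Add the same extra column to every data.frame *before* stacking them
delta.scores$airline  <- "Delta"
united.scores$airline <- "United"

all.scores <- rbind(delta.scores, united.scores)   # columns match, so this works

# If one data.frame is missing the column, rbind() stops with an error
mismatch <- tryCatch(
  rbind(delta.scores, united.scores[, c("score", "text")]),
  error = function(e) "rbind failed"
)
print(all.scores)
print(mismatch)
```

Running names() or str() on each piece before the rbind() call is a quick way to spot the odd one out.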

      Good luck!
      Jeffrey

  21. vasundhar Says:

    Thanks for wonderful guide.

  23. One-liners which make me love R: twitteR’s searchTwitter() #rstats « Things I tend to forget Says:

    […] Comments Jeffrey Breen on slides from my R tutorial on T…vasundhar on slides from my R tutorial on T…Use Dropbox’s … on One-liners which […]

  24. One-liners which make me love R: twitteR’s searchTwitter() #rstats » 统计代码银行StatCodeBank Says:

    […] recent R tutorial on mining Twitter for consumer sentiment wouldn’t have been possible without Jeff Gentry’s amazing twitteR package (available on CRAN). […]

  25. Mining Twitter with R Says:

    […] Jeffrey Breen’s blog you can find slides from a presentation entitled “R by Example: mining Twitter for consumer […]

  26. Madhan Says:

    Hi,
    Great Article.. I desperately wanted to try my hands on it… But then i am stuck at a place where i need your expertise to help !!!! i couldn’t proceed beyond this error
    delta.tweets = searchTwitter('@delta', n=1500)
    “Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
    Could not resolve host: search.twitter.com; No data record of requested type”
    I am Behind the proxy , so is there a way to move forward ?????

  27. David Says:

    Any idea why all my scores would be returning as zeroes? The three samples are even scoring as zeroes.

    • David Says:

      Nevermind, got it.

      • Jeffrey Breen Says:

        Hi David:

        Glad you got it sorted. Was it related to the word lists? All-zero scores could come from either no matches with the word lists (from, say, empty word lists) or identical, canceling matches (by mistakenly using the same word list for both positive and negative).

        Jeffrey

  28. Brand sentiment showdown Says:

    […] Breen provides an easy-to-follow tutorial on Twitter sentiment in R. The scoring system is pretty basic. All you do is load tweets with a […]

  29. Watch The Throne Tweets: Twitter Users React To The New Jay-Z Kanye West Album Says:

    […] think about the album. So, on Monday afternoon, using a Twitter sentiment analysis procedure developed by Jeff Gentry, The Huffington Post analyzed the sentiment of 9,394 tweets about tracks on the album and scored […]

  30. Watch The Throne Tweets: Twitter Users React To The New Jay-Z Kanye West Album | Twitter Template Blog Says:

    […] think about the album. So, on Monday afternoon, using a Twitter sentiment analysis procedure developed by Jeffrey Breen, The Huffington Post analyzed the sentiment of 9,394 tweets about tracks on the album and scored […]

  31. Introducing Project Blue Bird: An Open Source Web Front End for R Sentiment Analysis – tecosystems Says:

    […] early July, I ran across Jeffrey Breen’s post on doing sentiment analysis in R a bit before two in the morning. It was interesting enough that I […]

  32. R tutorial on Twitter text mining #rstats (via Things I tend to forget) « Nothing but Truth Says:

    […] Update: An expanded version of this tutorial will appear in the new Elsevier book Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications by Gary Miner et. al which is now available for pre-order from Amazon. In conjunction with the book, I have cleaned up the tutorial code and published it on github. Last month I presented this introduction to R at the Boston Predictive Analytics MeetUp on Twitter Sentiment. The … Read More […]

  33. slides from R tutorial on Twitter text mining #rstats | R | Scoop.it Says:

    […] slides from R tutorial on Twitter text mining #rstats […]

  34. VISUALIZING ÁRAS ELECTION, Part Two – - Martha Rotter's BlogMartha Rotter's Blog Says:

    […] thanks to Jef­frey Breen for his excel­lent slides on Twit­ter text min­ing and for pub­lish­ing the code – very […]

  35. Ben Says:

    Thanks for this fantastic tutorial and code for the sentiment function. I know I’m a bit late to the party, but your work here is really a gem of public service, I’ve found it very interesting and useful. Congratulations on getting it published, also!

  36. Are there any frameworks that perform sentiment analysis? - Quora Says:

    […] McCann, Statistician for a prominent online m… You could use R of course; here is an example http://jeffreybreen.wordpress.co… […]

  37. arifwic Says:

    Interesting post,

    In your experience, is R suitable for longer text, e.g. text coming from blogs or news?

    Thank you.

  38. Andy Harper Says:

    That’s very interesting indeed! Thanks for sharing this and being so open (code snippets).

  39. Aleksei Beloshytski (@LadderRunner) Says:

    Thank you for this, Jeffrey.

    My only concern is why you compared absolute estimates (ACSI and yours), since I’m not sure ACSI used the same approach for scaling values as you did. So in my opinion it’s more suitable to compare the airlines relative to each other by rank.

    • Jeffrey Breen Says:

      I am so sorry this got lost in the comment queue:

      Thanks for the question. I didn’t worry about the relative scalings because… well, I didn’t. Upon further reflection, I’m still not worried because as long as both measures purport to be linear, any scaling difference is simply an arbitrary constant with no meaning of its own.

      The ACSI scores fall on a 0-100 scale reflecting customer satisfaction as measured through surveys and (I think) interviews. My Twitter sentiment scores also fall on a 0-100 scale based on the percentage of strong expressed emotions which are positive. Any such measurement may have biases and inaccuracies, but both should be linear.
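
For concreteness, a score of that kind could be computed as sketched below. The per-tweet scores are made up, and the cutoffs of +2 and -2 for a "strong" emotion are an assumption chosen for illustration, not something stated in this post.

```r
# Made-up per-tweet sentiment scores for one airline
scores <- c(3, 2, 0, -1, -2, 1, 4, -3, 0, 2)

# Assumed cutoffs for a "strong" emotion; not taken from the post
very.pos <- sum(scores >= 2)
very.neg <- sum(scores <= -2)

# Percentage of strongly expressed tweets that are positive, on a 0-100 scale
twitter.score <- round(100 * very.pos / (very.pos + very.neg))
print(twitter.score)
```

Here four tweets are strongly positive and two strongly negative, so the score is 67.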

  40. isomorphismes Says:

    your javascript is using the warning() a few times

  41. isomorphismes Says:

    Awesome, thank you for sharing.

  42. isomorphismes Says:

    What do you think about other sentiments besides “good” and “bad” polarity? How any “sentiments” do you think you think could usefully be counted in a typical airline-customer application? Annoyed at delays; suggestions; merely comments that they are taking off soon; comments on the flight; etc.

    • isomorphismes Says:

      *how many

    • Gunjan Says:

      We should also add “actionable positive” and “actionable negative” as sentiments. Are customers just being negative, or are they frustrated to the point that there is a risk of churn?

      • Jeffrey Breen Says:

        Agreed there are many subtleties glossed over for this simple example. One could also differentiate between the emotional content of the words themselves (from ‘OK’ and ‘whatever’ to ‘awesome’ and F-bomb).

  43. jchoi007 Says:

    Great presentation. A little late to the party but trying to follow along with the example and getting “Error in sort.list(y): invalid input ‘@Delta @thenyrangers JORDAN ain’t got nothin on me! 😜’ in ‘utf8towcs’ ” So it’s not creating my delta.scores object…any ideas?

  44. jchoi007 Says:

    Any idea how to get around this error? Error in sort.list(y) :
    invalid input ‘@Delta @thenyrangers JORDAN ain’t got nothin on me! 😜’ in ‘utf8towcs’

  45. Gunjan Says:

    Hi Jeffrey, is there any other post from you on the twitteR package? Is there a package for Facebook as well?

  46. santosh Says:

    Hi Jeff ,

    I have tried to run the function :

    > delta.scores = score.sentiment(delta.text, pos.words, neg.words, .progress='text')
    > delta.scores
    NULL

    Could you please help ?

    Thanks,

    Regards,
    Santosh

    • Jeffrey Breen Says:

      Sorry for the delay… and that I can’t be much help without more detail.

      That line attempts to score the messages in `delta.text` using the dictionaries in `pos.words` and `neg.words`. The first thing to check is that each of those objects contains what you expect (you can use the `str` function for a quick peek, as in `str(delta.text)`).

  47. asidrinkcoffee Says:

    I made a video out of this tutorial, I hope you don’t mind. Credited you though!

  48. Mr. Lee Says:

    Should RT’s be ignored? For example, a company’s website could tweet some promotion detail using lots of positive words and people would retweet that to share with their followers. This could skew the distribution quite significantly.

    • Jeffrey Breen Says:

      Good point. When I use this technique for the day job, I aggressively filter out all potential duplicates (including RTs). Otherwise, as you point out, you can be fooled by the echo chamber of positive marketing. 🙂
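
One possible way to do that kind of filtering is sketched below. It is not the filter used at the day job; the sample tweets are made up and the retweet pattern is a simple assumption.

```r
# Made-up tweets, including a retweet and an exact duplicate
tweets <- c("Love the new @airline lounge",
            "RT @airline: Great fares this week!",
            "Love the new @airline lounge",
            "Delayed again, so angry")

# Flag anything that looks like a retweet ("RT @user" at the start or after a space)
is.rt <- grepl("(^|\\s)RT @", tweets)

# Drop retweets, then drop exact duplicates (ignoring case)
filtered <- tweets[!is.rt]
filtered <- filtered[!duplicated(tolower(filtered))]
print(filtered)
```

Near-duplicates, such as the same text with a different shortened link, would need a looser comparison than duplicated().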

  49. Elif Elif (@androidine) Says:

    Great work! Thanks for the tutorial!
    Is it possible to store tweets in an Excel or CSV file, to perform the analysis later?
    I want to store tweets that are older than one week, but I couldn’t find a tutorial related to this question.
    Thanks in advance!

    • Jeffrey Breen Says:

      Thanks.

      Absolutely — once the tweets are in a data.frame, you can use any of R’s standard I/O functions to store them however you’d like. save() will write them to disk in R’s native format, but CSV is just as easy (especially with twitteR’s new `twListToDF` convenience function):

      > library(twitteR)
      Loading required package: RCurl
      Loading required package: bitops
      Loading required package: rjson
      > tweets.list = searchTwitter('#rstats')
      > tweets.df = twListToDF(tweets.list)
      > write.csv(tweets.df, file='/tmp/tweets.csv', row.names=F)
      
      $ head -3 /tmp/tweets.csv 
      "text","favorited","replyToSN","created","truncated","replyToSID","id","replyToUID","statusSource","screenName"
      "Anyone know why Jonathan Chang’s lda package was removed from CRAN? #rstats",FALSE,NA,2012-09-04 18:06:24,FALSE,NA,"243047374027640832",NA,"<a href="http://tapbots.com">Tweetbot for Mac</a>","treycausey"
      "RT @jebyrnes: Awesome visually weighted regression in #rstats with #ggplot2 http://t.co/EYCLiM7b (link!)",FALSE,NA,2012-09-04 17:49:29,FALSE,NA,"243043116171550721",NA,"<a href="http://twitter.com/">web</a>","RD_Denton"
      
      
      • Manju Says:

        Am getting an error message
        Error in twInterfaceObj$doAPICall(cmd, params, “GET”, …) :
        OAuth authentication is required with Twitter’s API v1.1
        Please guide

  50. Elif Elif (@androidine) Says:

    Thank you very much for your quick reply.
    I’ve just collected the tweets and realized that I need to remove some tweets that are duplicates or not even related to the topic. Is it possible to manipulate (delete) cells within the CSV file?

    • Jeffrey Breen Says:

      There are a number of ways to subset data in R. Check out the subset() function. (Type ?subset to see its documentation page). Also, there are some excellent resources on the Net. The Quick-R site has a nice page showing how to subset using bracket notation in addition to the subset function: http://www.statmethods.net/management/subset.html

      Depending on how sophisticated you need your text matching to be, you may be interested in knowing that R can also handle regular expressions: see the grep() and grepl() functions.
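
A minimal sketch of both approaches, using a made-up data.frame in place of the saved tweets (the column names follow the CSV header shown earlier, the contents are invented):

```r
# Made-up stand-in for the data.frame of saved tweets
tweets.df <- data.frame(
  text = c("Learning #rstats today", "Lunch was great", "New #rstats package on CRAN"),
  screenName = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Keep only rows whose text mentions the topic (case-insensitive regex match)
on.topic <- subset(tweets.df, grepl("#rstats", text, ignore.case = TRUE))

# The same selection with bracket notation
on.topic2 <- tweets.df[grepl("#rstats", tweets.df$text, ignore.case = TRUE), ]
print(on.topic)
```

Negating the condition with `!` keeps the off-topic rows instead, which is handy for checking what would be thrown away.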

  51. Elif Elif (@androidine) Says:

    Great link, thanks! Just a last question:
    I want to load dataset from the csv file and get corresponding tweet texts. Just tried to figure out, if following code would function.
    test = read.csv(file='/tmp/tweets.csv')
    tweet_text = sapply(test, function(x) x$getText())
    However, I get this error: x$getText : $ operator is invalid for atomic vectors.

    • Jeffrey Breen Says:

      The sapply(… x$getText()) bit is used to pull the text field out of the Tweet object. But we parsed all those fields out before we saved the tweets to disk (it’s what the twListToDF function did).

      In any case, read.csv() returns a data.frame so you can access the tweet_text as test$text. You should use the str() function (or just click on the object in RStudio) to take a look at the contents of test — it should make more sense then.
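
A self-contained sketch of that round trip, with a made-up data.frame in place of the output of twListToDF() and a temporary file in place of /tmp/tweets.csv:

```r
# Made-up stand-in for the output of twListToDF()
tweets.df <- data.frame(
  text = c("first tweet", "second tweet"),
  screenName = c("user_a", "user_b"),
  stringsAsFactors = FALSE
)

# Round trip through a temporary CSV file
csv.file <- tempfile(fileext = ".csv")
write.csv(tweets.df, file = csv.file, row.names = FALSE)
test <- read.csv(csv.file, stringsAsFactors = FALSE)

str(test)                # inspect the columns that came back
tweet_text <- test$text  # plain character vector; no $getText() needed
print(tweet_text)
```

Because the text is already a column, it can be passed straight to a scoring function without any sapply() step.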

      HTH,
      Jeffrey

      • Abhishek Says:

        hi Jeffrey Breen
        Gr8 link. Thanks a lot for sharing
        Even i am facing same problem
        Presently I am using R 3.02 (on windows platform
        Below is the code

        df<- read.csv ('C:/Users/abc/Desktop/SSIndiaTweets.csv')
        dm_txt = sapply(df, function(x) x$getTweet_text ())

        I get this error: x$getText : $ operator is invalid for atomic vectors.

        is there a way out

        Thanks in Advance

        Regards
        Abhishek

  52. bindu Says:

    m using sentiment package for analysing some tweets..most of the emotions are just not available..can u suggest me any solution?

  53. bindu Says:

    hello…m using R sentiment package..it shows emotion as NA for “i am not angry” and also for “don’t trust anybody”…can sumbdy xplain dis concept

  54. rizwana irfan Says:

    Hello Jeffrey, this is indeed a very useful and informative tutorial on Twitter text mining and sentiment analysis. I am a PhD student trying to use your code for practice. Unfortunately, I have a problem and I hope you can help me out. I am using the code
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    but all of my output is NA. I have checked everything but cannot figure out what the problem is. The following is the code for your understanding:
    sample = c("You're Awesome and I love you",
    "I hate and hate and hate. So angry. Die!",
    "Impressed and Amazed: you are peerless in your achievement of
    unparalleled mediocrity.")
    scores = laply(sample, function(sentence, pos.words, neg.words){
    # clean up sentences with R's regex-driven global substitute, gsub():
    sample = gsub('[[:punct:]]', '', sample)
    sample = gsub('[[:cntrl:]]', '', sample)
    sample = gsub('\\d+', '', sample)
    # and convert to lower case:
    sample = tolower(sample)
    sample
    # split into words. str_split is in the stringr package
    word.list = str_split(sample, '\\s+')
    word.list
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    words
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    The output should not be NA, as the positive words are in pos.words, and likewise for neg.words.
    Your help would be highly appreciated.
    Rizwana Irfan

    • Arnaud Says:

      I got a similar problem trying to get the positive or negative words of a sample. My full code:
      scores = laply(sample, function(sentence, pos.words, neg.words){
      # clean up sentences with R's regex-driven global substitute, gsub():
      sample = gsub('[[:punct:]]', '', sample)
      sample = gsub('[[:cntrl:]]', '', sample)
      sample = gsub('\\d+', '', sample)
      # and convert to lower case:
      sample = tolower(sample)
      sample
      # split into words. str_split is in the stringr package
      word.list = str_split(sample, '\\s+')
      word.list
      # sometimes a list() is one level of hierarchy too much
      words = unlist(word.list)
      words
      # compare our words to the dictionaries of positive & negative terms
      pos.matches = match(words, pos.words)
      neg.matches = match(words, neg.words)
      # match() returns the position of the matched term or NA
      # we just want a TRUE/FALSE:
      pos.matches = !is.na(pos.matches)
      neg.matches = !is.na(neg.matches)
      # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
      score = sum(pos.matches) - sum(neg.matches)
      return(score)
      }, pos.words, neg.words, .progress=.progress )

      Error in llply(.data = .data, .fun = .fun, ..., .progress = .progress, :
      object ‘.progress’ not found

  55. Alex Says:

    This is a fabulous presentation, very clear, informative and works very well as a stand alone document. Thanks so much for posting it.

  56. Alex Benning Says:

    Great tutorial. I have gotten this to work, but I am having an issue with duplicate tweets skewing the results. Is there a simple way to remove the duplicate tweets?

  57. wushuo1988 Says:

    Hi Jeff,

    I keep getting an error like this when I run this statement:

    > delta.scores = score.sentiment(delta.text, pos.words, neg.words, .progress='text')

    Error in sort.list(y) :
    invalid input ‘@delta should rethink touchscreen in-seat games like bejewelled on long haul flights. Someone played morse code with my head for hours í ½í¸³’ in ‘utf8towcs’

    it seems the tweet has some symbols/emoticons like 🙂 😦 . How do I handle this issue? Thanks in advance!

  58. skishchampi Says:

    Hi Jeffrey.

    I was following the slides for a different @user and got stuck at the following

    hist(ge.scores$score)
    Error in plot.new() : figure margins too large

    I am very new to R. Can you help me out here ?

    Thanks,
    Skish.

  59. Mohamed Ali Abdulle Says:

    Hi Jeffrey,
    Thanks for your presentation, it was really helpful.
    I was trying to check the Kuwait Airways tweets. It is my first time using R.
    I am using Windows 7 and have run into this problem a couple of times:
    neg.words = c(hu.liu.neg)
    Error: object ‘hu.liu.neg’ not found
    Could you please help.
    Best Regards,
    Mohamed Ali

  60. Marian Dragt Says:

    Fantastic post, thanks! I added 1 line to deal with “strange” characters in tweets:

    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    -->sentence = gsub('[[:alnum:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)

  61. Marian Dragt Says:

    Sorry, it should be:
    #strip strange characters
    x.text = gsub("[^[:alnum:]/// ']", "", x.text)
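
    A minimal sketch of this kind of character stripping, run on a made-up string. The sample text is hypothetical, and iconv() is shown only as an alternative that is not from the comment above: it drops anything that cannot be represented in ASCII.

```r
# Hypothetical tweet text containing an emoji (not from the real data set)
x.text = "played morse code with my head for hours \U0001F633"

# Keep only letters, digits, slashes, spaces and apostrophes
stripped = gsub("[^[:alnum:]/ ']", "", x.text)

# Alternative (an assumption, not from the comment above): convert to ASCII
# and drop every byte that has no ASCII equivalent
ascii.only = iconv(x.text, from = "UTF-8", to = "ASCII", sub = "")

print(stripped)
print(ascii.only)
```

    Note that both approaches also remove accented letters or other non-ASCII text in some locales, so they may be too aggressive for non-English tweets.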

  62. sai Says:

    Can you send me the PDF version of this?

  63. Bryan Osorio Says:

    Hi Jeffrey, thanks a lot, this is very useful and great for learning. The function code returns a matrix with n+1 columns, where n is the number of tweets. I changed the last line before the return to:
    scores.df = data.frame(cbind(score=scores, text=sentences))

    I only added cbind. Is it correct?
    Thank you!
    Bryan
    https://www.facebook.com/ClickMetrics

  64. Aniks Says:

    Hello Jeffrey,

    This is a great presentation. I learnt many new things using this example, and I was able to go through the entire presentation without errors.

    Now when I am trying to work with it again I am getting this error.

    > twitter.tweets = searchTwitter('@delta', n=1500)
    Error in .self$twFromJSON(out) :
    Error: Malformed response from server, was not JSON

    Earlier it worked fine.

    Is there some limit on the number of tweets that can be accessed? I checked online and found that I need to load all the libraries. I am doing that. Can you help me with this?

    Thank you very much for sharing this.

  65. Duncan McQueen Says:

    I fixed the utf8 issue and updated the code for Twitter’s OAuth authentication. My patch is located here – http://pastebin.com/Pp8ijRTk

  66. Fernando Says:

    Can you help me? This error occurs:

    > source("R/scrape.R")
    [1] "Searching Twitter for airline tweets and saving to disk"
    Error in twInterfaceObj$doAPICall(cmd, params, "GET", ...) :
    OAuth authentication is required with Twitter’s API v1.1

  67. simak Says:

    Fantastic presentation…Thanks for sharing.

    If we search for “delta” without the hashtag, I understand that searchTwitter returns tweets with “delta” in the handle as well. Is there a way to force it to return only tweets that mention it in the text, and not those that only match the handle?

  68. Henk Says:

    Very good presentation sir.

  69. Deepak Says:

    Hello Jeffrey,
    I need to get past tweets using R. I tried to use “since” and “until” but the API returned no tweets. How could I get those tweets? Any ideas?

  70. Federica Says:

    Very useful post. Thanks!
    I would like to ask for your help. I’m doing my final thesis for my second-cycle degree in Marketing, studying the use of social networks. I’m using the twitteR package (the searchTwitter query) to export to CSV all the tweets containing a specific hashtag. I would like to analyze their text and find out how many of them contain words from a specific list that I have saved in a file called importantwords.txt. Could you help me create a function that returns a score of how many tweets contain the words in my file importantwords.txt?

    I created this draft of function but it doesn’t work. Could you correct it for me?

    library(plyr)
    library(stringr)

    score.sentiment = function(sentences, important.words, .progress='none')
    {
      require(plyr)
      require(stringr)
      scores = laply(sentences, function(sentence, important.words) {
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        sentence = tolower(sentence)
        word.list = str_split(sentence, '\\s+')
        words = unlist(word.list)
        pos.matches = match(words, important.words)
        pos.matches = !is.na(pos.matches)
        score = sum(pos.matches)
        return(score)
      }, important.words, .progress=.progress )
      scores.df = data.frame(score=scores, text=sentences)
      return(scores.df)
    }

    hu.liu.pos = scan('C:/Users/XX/Desktop/importantwords.txt', what='character', comment.char=';')
    pos.words = c(hu.liu.pos)

    Thank you very much for your help. I don’t know anybody who can use this package, and your posts explain its use so well, which is why I am asking for your help.

  71. Bach Says:

    Thanks for the great presentation. Very helpful.
    Assuming that I want to replicate my sentiment index on a monthly basis to track the change in trend since last month, is there a specific command I need to add to get only the tweets made since the previous month?
    Thanks

  72. Paola Says:

    Hi Jeffrey, can you help me with this error?

    Error: unexpected ‘)’ in:

    scores =laply(sentences, function(sentence, pos.words, neg.words)

    I am from Colombia and I’m trying to use your code for sentiment analysis, but in Spanish.

    Thanks!!

  73. abraham Says:

    I was trying to check trending words for a certain country using the function getTrends(), but I get the following error:
    Error in twInterfaceObj$doAPICall(“trends/place”, params = params, ...) :
    Error: Could not resolve host: api.twitter.com; No data record of requested type

    But when I go to Twitter I am able to. What could be the cause of the error?
    In fact, I was able to use it last week.

  74. j k lakshna (@lakshnajk) Says:

    Hello, I’m getting an error in the line

    }, pos.words, neg.words, .progress=.progress )

    the error says

    Error: unexpected ‘}’ in:
    “}
    , pos.words,neg.words, .progress=.progress }”

    Kindly help!
    Thanks

