Analysis of Cricketer “Dhoni’s 200” tweets on Twitter using R

What an innings of 200 on Day 3 in Chennai yesterday. I loved it. Wanting to see what people on Twitter thought about his 200 is what triggered this blog post, but unfortunately it required a lot of learning about using Twitter with R, which I have summed up below. Whatever the intent behind analyzing the Dhoni's 200 data, analyzing trends in social media also makes a lot of business sense: to understand how social media is treating your brand or products, it is important to analyze the data available on Twitter. Here I use R for a basic analysis of tweets, based on the twitteR package.

  1. If you have not installed the twitteR package, install it with the command install.packages("twitteR"). Note that package names in R are case-sensitive.
  2. This also installs the package's dependencies (RCurl, bitops, rjson).
  3. Load the package using library(twitteR).

  4. When fetching the tweets in the R console, I tried to get up to 1000 of them but managed to get only 377. That is why the call uses n=377; with larger values it returned the error "Error: Malformed response from server, was not JSON".
  5. If you don't specify a value for n, it returns 25 records by default, which you can confirm using length(dhoni200_tweets).
  6. Next we need to analyze the tweets, so install the text mining package "tm".

  7. The next step is to pass the collected tweets to the text mining package. To do so, we first need to convert the tweets into a data frame, whose dimensions and contents can then be checked:


    > dim(dhoni200_df)

    [1] 377 10

    > dhoni200_df
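The conversion in step 7 can be sketched as follows. This is only an illustration under stated assumptions: the `fake_tweets` list and its fields are made-up stand-ins for the status objects returned by the search, and only base R is used so that it runs without Twitter access. With real twitteR results, the package's twListToDF() helper performs this kind of conversion directly.

```r
# Hypothetical stand-ins for tweets: each element holds a few of the fields
# that appear as columns in the real data frame.
fake_tweets <- list(
  list(text = "What a 200 by Dhoni!", screenName = "fan_one", favorited = FALSE),
  list(text = "Double century in Chennai", screenName = "fan_two", favorited = FALSE),
  list(text = "Dhoni 200, take a bow", screenName = "fan_one", favorited = TRUE)
)

# Turn every record into a one-row data frame, then stack the rows.
rows <- lapply(fake_tweets, function(tw) as.data.frame(tw, stringsAsFactors = FALSE))
tweets_df <- do.call(rbind, rows)

dim(tweets_df)    # rows = tweets, columns = fields
names(tweets_df)
```

Each row is one tweet and each column one field, which is the shape the later steps rely on when they read the text and screenName columns.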

  8. Next we need to move the text data into a Corpus via VectorSource, using the command > dhoni200.corpus=Corpus(VectorSource(dhoni200_df$text))
  9. Issuing the command > dhoni200.corpus then gives the result "A corpus with 377 text documents".
  10. Next, refine the content by converting it to lowercase and removing punctuation and unwanted words, then convert it to a term-document matrix:

    > # content_transformer() keeps the corpus structure intact in tm 0.6 and later

    > dhoni200.corpus=tm_map(dhoni200.corpus,content_transformer(tolower))

    > dhoni200.corpus=tm_map(dhoni200.corpus,removePunctuation)

    > mystopwords=c(stopwords('english'),'profile','prochoice')

    > dhoni200.corpus=tm_map(dhoni200.corpus,removeWords,mystopwords)

    > dhoni200.dtm=TermDocumentMatrix(dhoni200.corpus)

    > dhoni200.dtm

    A term-document matrix (783 terms, 377 documents)

    Non-/sparse entries: 3930/291261

    Sparsity : 99%

    Maximal term length: 23

    Weighting : term frequency (tf)

  11. Analysis: When we look for the words that have occurred 30 and 50 times respectively, these were the results:
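A frequency check of this kind is typically done with tm's findFreqTerms(). The sketch below is a minimal illustration on three made-up tweets, not the collected data; the names `toy_tweets`, `toy_corpus` and `toy_tdm` are hypothetical, and the threshold of 2 stands in for the 30 and 50 used on the real matrix, assuming those were applied as minimum counts.

```r
library(tm)

# Made-up tweets for illustration only (not the collected data).
toy_tweets <- c(
  "dhoni scores a double century",
  "first ever double century by an indian keeper",
  "what a century by dhoni"
)

toy_corpus <- VCorpus(VectorSource(toy_tweets))
toy_tdm <- TermDocumentMatrix(toy_corpus)

# Terms whose total count across all documents is at least 2.
frequent <- findFreqTerms(toy_tdm, lowfreq = 2)
frequent
```

Note that TermDocumentMatrix() by default drops words shorter than three characters, so "a", "by" and "an" never reach the frequency count.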

  12. Analysis: I then looked at the words associated with the word "century". The following were the results:

    The term firstever has the highest association, at 0.61. In the findAssocs command, the number 0.20 is the correlation limit: only terms whose correlation with "century" is at least 0.20 are returned.
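To make the correlation figure concrete, here is a small base R sketch of what such an association score measures: the Pearson correlation between the per-document counts of two terms. The four tweets and the helper `count_word` are made up for illustration, so the resulting number is unrelated to the 0.61 reported above.

```r
# Made-up tweets for illustration only.
toy_tweets <- c(
  "first ever century",
  "first ever double century",
  "great innings",
  "century again"
)

# Count how often a word appears in each tweet.
count_word <- function(word, docs) {
  sapply(strsplit(docs, " ", fixed = TRUE), function(tokens) sum(tokens == word))
}

century <- count_word("century", toy_tweets)  # 1 1 0 1
first   <- count_word("first", toy_tweets)    # 1 1 0 0

# Pearson correlation between the two per-document count vectors.
r <- cor(century, first)
round(r, 2)  # 0.58

# A term is reported only if its correlation reaches the limit.
r >= 0.20
```

Words that tend to appear in the same tweets as "century" get a score close to 1, while words that appear independently of it fall below the limit and are left out.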

  13. The command names(dhoni200_df) lists the columns that are available once the tweets are converted to a data frame.

    [1] "text" "favorited" "replyToSN" "created" "truncated"

    [6] "replyToSID" "id" "replyToUID" "statusSource" "screenName"

  14. Analysis: Users with the most tweets

    > counts=table(dhoni200_df$screenName)

    > barplot(counts)
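With several hundred distinct users, a bar plot of every screen name can get crowded. One possible refinement, sketched here on made-up screen names (the vector `screen_names` is hypothetical), is to sort the counts and plot only the most active users.

```r
# Made-up screen names for illustration only.
screen_names <- c("fan_one", "fan_two", "fan_one", "fan_three", "fan_one", "fan_two")

counts <- table(screen_names)

# Sort so that the most active users come first, and keep the top 2.
top_users <- head(sort(counts, decreasing = TRUE), 2)
top_users

# las = 2 rotates the axis labels so long screen names stay readable.
barplot(top_users, las = 2, ylab = "Number of tweets")
```

On the real data, the same idea would apply to dhoni200_df$screenName with a larger cut-off, for example the top 10.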