Analysis of Cricketer “Dhoni’s 200” tweets on Twitter using R

What an innings of 200 on Day 3 in Chennai yesterday. I loved it. Wanting to see what people on Twitter thought about his 200 is what triggered me to write this blog, but it turned out to need a lot of learning about using Twitter with R, which I have summed up below. Whatever the intent behind analyzing the Dhoni 200 data, analyzing trends in social media also makes a lot of business sense: to understand how social media is treating your brand or products, it is important to analyze the data available on Twitter. Here I use R for a basic analysis of tweets, based on the twitteR package.

  1. If you have not installed the twitteR package, use the command install.packages("twitteR"). Package names in R are case-sensitive.
  2. This also installs the package's dependencies (RCurl, bitops, rjson).
  3. Load the twitteR package using library(twitteR)

  4. In the R console statements for this step I tried to fetch up to 1000 tweets, but I managed to get only 377. That is why you see n=377; with larger values the call returned the error “Error: Malformed response from server, was not JSON”.
  5. If you don’t specify a value for n, 25 records are returned by default, which you can confirm using length(dhoni200_tweets)
  6. Next we need to analyze the tweets, so install the text mining package “tm”.

  7. The next step is to pass the collected tweets to the text mining package. To do that, first convert the tweets into a data frame with the following commands:


    > dim(dhoni200_df)

    [1] 377 10

    > dhoni200_df

  8. Next we need to load the tweet text into a Corpus through a VectorSource, using the command > dhoni200.corpus=Corpus(VectorSource(dhoni200_df$text))
  9. When we issue the command > dhoni200.corpus we get the result “A corpus with 377 text documents”
  10. Next, refine the content by converting it to lowercase and removing punctuation and unwanted words, then convert it to a term-document matrix:

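    > # tm 0.6 and later need content_transformer() around base functions such as tolower

    > dhoni200.corpus=tm_map(dhoni200.corpus,content_transformer(tolower))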

    > dhoni200.corpus=tm_map(dhoni200.corpus,removePunctuation)

    > mystopwords=c(stopwords('english'),'profile','prochoice')

    > dhoni200.corpus=tm_map(dhoni200.corpus,removeWords,mystopwords)

    > dhoni200.dtm=TermDocumentMatrix(dhoni200.corpus)

    > dhoni200.dtm

    A term-document matrix (783 terms, 377 documents)

    Non-/sparse entries: 3930/291261

    Sparsity : 99%

    Maximal term length: 23

    Weighting : term frequency (tf)

  11. Analysis: When we look for the words that occur at least 30 times and at least 50 times, respectively, these were the results:

  12. Analysis: I went further and looked at the words associated with the word “century”. The following were the results:

    The term firstever has the highest association, at 0.61. In the findAssocs command, the number 0.20 is the lower correlation limit: only terms whose correlation with “century” is at least 0.20 are listed.

  13. The command names(dhoni200_df) lists the columns available for each tweet once the tweets are converted to a data frame.

    [1] "text" "favorited" "replyToSN" "created" "truncated"

    [6] "replyToSID" "id" "replyToUID" "statusSource" "screenName"

  14. Analysis: Which users tweeted the most

    > counts=table(dhoni200_df$screenName)

    > barplot(counts)
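
Steps 11 and 12 are easier to follow once you see what a term-document matrix holds. The sketch below is not the tm code used above: it rebuilds the same ideas in base R on four invented tweets, with a tiny hand-written stopword list, so that it runs without Twitter access. The tweets, the stopword list and all variable names are assumptions for illustration. It counts how often each term occurs (the idea behind tm's findFreqTerms) and correlates one term's row with every other row (the idea behind findAssocs).

```r
# Invented tweets standing in for dhoni200_df$text (illustration only).
tweets <- c(
  "Dhoni scores a century in Chennai",
  "First ever double century by an Indian keeper",
  "What a century, what a captain",
  "Chennai crowd loved the innings"
)

# A tiny hand-written stopword list; tm's stopwords('english') is far longer.
stopw <- c("a", "an", "the", "by", "in", "what")

# Lowercase, strip punctuation, split on whitespace, drop stopwords.
clean  <- gsub("[[:punct:]]", "", tolower(tweets))
tokens <- lapply(strsplit(clean, "\\s+"), function(tok) tok[!tok %in% stopw])

# Term-document matrix built by hand: one row per term, one column per tweet.
terms <- sort(unique(unlist(tokens)))
tdm <- vapply(tokens,
              function(tok) as.integer(table(factor(tok, levels = terms))),
              integer(length(terms)))
dimnames(tdm) <- list(Terms = terms, Docs = paste0("doc", seq_along(tweets)))

# Frequent terms: total count across all tweets of at least 2.
freq     <- rowSums(tdm)
frequent <- names(freq[freq >= 2])
print(frequent)

# Associations: Pearson correlation of the "century" row with every other row,
# keeping only terms at or above a lower limit of 0.20.
assoc <- cor(t(tdm))["century", ]
assoc <- assoc[names(assoc) != "century"]
print(sort(assoc[assoc >= 0.20], decreasing = TRUE))
```

On a real corpus the matrix is very sparse, as the 99% sparsity figure in step 10 shows, which is why tm stores it in a sparse format rather than as a plain matrix like this one.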


Correlation made simple using R

When we have a dataset, it is always useful to understand how the variables in it are correlated with each other. In this blog we will take the studentsdata dataset, discussed in my previous blog, as the example and see how well the scores in individual subjects correlate with the total scores.

What is correlation?

In simple terms, correlation describes how two sets of data relate to each other. For example, you might want to know whether your salary has increased with your age, how much your expenditure has risen as your salary has risen, or how much umbrella sales increase with the level or frequency of rainfall.

Before we begin, if you want to know how to import data from Excel into the R environment, please read my blog.

Let us have a look at the data we have now. The studentsdata data frame has 8 columns, imported from an Excel sheet.

We can use cor(var1, var2) to determine the correlation; by default it returns Pearson's correlation coefficient. We will first find the correlation between the scores in the Tamil subject and TotalScores. In the picture below we have used cor(studentsdata$Tamil, studentsdata$TotalScores), which returns 0.4370992 (about 0.44), a low positive correlation. We have also plotted the two variables against each other using plot. If you want to learn how the correlation is calculated by hand, please refer to this link for a simple example.
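
To make the calculation less of a black box, here is a small sketch that computes Pearson's coefficient by hand and compares it with cor(). The marks are made up for six hypothetical students, since the real studentsdata set is not reproduced in this post, so the numbers will not match the ones quoted above.

```r
# Hypothetical marks for six students (illustration only, not the real studentsdata).
scores <- data.frame(
  Tamil       = c(55, 70, 62, 80, 45, 90),
  English     = c(40, 65, 58, 85, 50, 95),
  TotalScores = c(310, 402, 365, 470, 298, 520)
)

x <- scores$Tamil
y <- scores$TotalScores

# Pearson's r: the co-variation of x and y divided by the
# product of their individual variations.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# cor() uses method = "pearson" unless told otherwise.
r_builtin <- cor(x, y)

# The two values agree up to floating-point error.
print(c(manual = r_manual, builtin = r_builtin))
```

The result always lies between -1 and 1: values near 1 mean the two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is little linear relationship.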

Interestingly, for the English subject and TotalScores the correlation coefficient is 0.7475341 (about 0.75), which indicates a strong relationship between English marks and the total scores secured. If one variable tends to increase when the other increases, there is a positive correlation, and the closer the relationship, the closer the coefficient is to 1. In this instance, students with higher English marks tend to have markedly higher TotalScores, as the two are strongly positively correlated.

The same scatter plot has also been drawn with the qplot function from the ggplot2 library. Please find the picture below:

Have a look at how the correlation fares for the other subjects. R makes it so simple, doesn’t it?
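
Rather than calling cor() once per subject, you can loop over the subject columns in one go. The sketch below assumes a data frame with a Name column, a few subject columns and a TotalScores column; the Maths column and all the marks are invented for illustration.

```r
# Hypothetical data frame; column names other than Tamil, English,
# Name and TotalScores are assumptions for this example.
scores <- data.frame(
  Name        = c("A", "B", "C", "D", "E", "F"),
  Tamil       = c(55, 70, 62, 80, 45, 90),
  English     = c(40, 65, 58, 85, 50, 95),
  Maths       = c(75, 60, 88, 92, 55, 70),
  TotalScores = c(310, 402, 365, 470, 298, 520)
)

# Every column except the name and the total is treated as a subject.
subjects <- setdiff(names(scores), c("Name", "TotalScores"))

# Correlate each subject column with TotalScores.
subject_cor <- sapply(scores[subjects], cor, y = scores$TotalScores)

# Strongest relationship first.
print(sort(subject_cor, decreasing = TRUE))
```

The result is a named numeric vector, one coefficient per subject, which makes it easy to see at a glance which subject tracks the total most closely.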

Selecting the top 10 and bottom 10 in R: Easy approach

In my previous article you saw how to rank the data in a table. In this blog we will see how simple it is in R to display the top 10 records. I assume that you have the dataset ready with the ranking done. Before we get to the top 10, we will see how to select specific columns from the data table:

Selecting specific columns from the data table:

> RankedScores[,c('Name','TotalScores','Rank')]

We have selected only the required fields: Name, TotalScores and Rank. Notice that all 20 records are shown, even though they are ordered by rank. Now we will see how to select the top 10 from the same data.

> RankedScores[1:10,c('Name','TotalScores','Rank')]

To display the bottom 10 records, starting from the last rank:

> RankedScores[20:11,c('Name','TotalScores','Rank')]
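
Hard-coded row numbers work only while the table has exactly 20 rows. As an alternative sketch, head() and tail() pick the first and last rows whatever the table size. The data frame below is a made-up stand-in for RankedScores, with invented names and totals.

```r
# Hypothetical stand-in for RankedScores: 20 students with invented totals.
RankedScores <- data.frame(
  Name        = paste0("Student", 1:20),
  TotalScores = seq(310, 500, by = 10)
)

# Rank 1 goes to the highest total; then order the rows by rank.
RankedScores$Rank <- rank(-RankedScores$TotalScores, ties.method = "min")
RankedScores <- RankedScores[order(RankedScores$Rank), ]

cols <- c("Name", "TotalScores", "Rank")

# First 10 rows of the rank-ordered table.
top10 <- head(RankedScores[, cols], 10)

# Last 10 rows of the rank-ordered table, still in ascending rank order.
bottom10 <- tail(RankedScores[, cols], 10)

print(top10)
print(bottom10)
```

If the table later grows to 50 students, the same two calls still return the first ten and last ten ranks without any change to the row numbers.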