What an innings of 200 on Day 3 in Chennai yesterday! I loved it. The idea of exploring what people on Twitter think about his 200 is what triggered this blog, though it turned out to require a fair amount of learning about using Twitter with R, which I have summed up below. Irrespective of the intent behind analyzing Dhoni's 200, it also makes a lot of business sense to analyze trends in social media: to understand how social media is treating your brand or products, it's important to analyze the data available on Twitter. Here I use R for a fundamental analysis of tweets, based on the twitteR package available for R.
- If you have not installed the twitteR package, install it with install.packages("twitteR") (note the capital R in the package name).
- It will also install the necessary dependencies of that package (RCurl, bitops, rjson).
- Load the package using library(twitteR).
- In the R console statements above I tried to fetch up to 1000 tweets, but I managed to get only 377. That is why you see n=377; with a larger n it returned the error "Error: Malformed response from server, was not JSON".
- If you don't specify a value for n, it will return 25 tweets by default, which you can confirm using length(dhoni200_tweets).
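The collection step above can be sketched as follows. The hashtag is an assumption (the original search term isn't shown), and current versions of the Twitter API require OAuth setup first, which this post predates:

```r
library(twitteR)

# OAuth setup is required by the current Twitter API; the keys below are
# placeholders (assumption: setup_twitter_oauth() from a recent twitteR)
# setup_twitter_oauth("consumer_key", "consumer_secret",
#                     "access_token", "access_secret")

# n is an upper bound: the server may return fewer tweets than requested
dhoni200_tweets <- searchTwitter("#Dhoni", n = 377)
length(dhoni200_tweets)   # number of tweets actually retrieved
```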
Next we need to analyze the tweets, so install the text-mining package "tm".
The next step is to feed the collected tweets into the text-mining functions, but to do so we first need to convert the tweets into a data frame, using the following commands:
The resulting data frame has 377 rows (tweets) and 10 columns.
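The conversion can be done with twListToDF from the twitteR package (a sketch, assuming dhoni200_tweets from the previous step):

```r
library(twitteR)

# twListToDF flattens a list of status objects into a data frame:
# one row per tweet, one column per field (text, created, screenName, ...)
dhoni200_df <- twListToDF(dhoni200_tweets)
dim(dhoni200_df)   # rows = tweets, columns = tweet fields
```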
- Next we need to move the text data into a Corpus via a VectorSource, using the command > dhoni200.corpus = Corpus(VectorSource(dhoni200_df$text))
- When we issue the command > dhoni200.corpus, we get the result "A corpus with 377 text documents".
Next, refine the content by converting it to lowercase, removing punctuation and unwanted words, and then convert the result to a term-document matrix:
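The refinement steps might look like this; exact tm_map usage varies between tm versions, so treat it as a sketch for a recent tm:

```r
library(tm)

dhoni200.corpus <- Corpus(VectorSource(dhoni200_df$text))

# lower-case, strip punctuation, drop common English stopwords
dhoni200.corpus <- tm_map(dhoni200.corpus, content_transformer(tolower))
dhoni200.corpus <- tm_map(dhoni200.corpus, removePunctuation)
dhoni200.corpus <- tm_map(dhoni200.corpus, removeWords, stopwords("english"))

# rows = terms, columns = documents (tweets)
dhoni200.tdm <- TermDocumentMatrix(dhoni200.corpus)
```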
A term-document matrix (783 terms, 377 documents)
Non-/sparse entries: 3930/291261
Sparsity : 99%
Maximal term length: 23
Weighting : term frequency (tf)
Analysis: When we look for the words that occurred at least 30 and at least 50 times respectively, these were the results:
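The frequent terms can be pulled out with findFreqTerms (assuming the dhoni200.tdm built above):

```r
library(tm)

# terms that appear in at least 30 / at least 50 tweets
findFreqTerms(dhoni200.tdm, lowfreq = 30)
findFreqTerms(dhoni200.tdm, lowfreq = 50)
```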
Analysis: I then analyzed which words are associated with the word "century". The following were the results:
The term "firstever" has the highest association, at 0.61. In the findAssocs command, the number 0.20 is the correlation threshold: only terms correlated with "century" at 0.20 or above are returned.
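The association query itself is a one-liner (again assuming the dhoni200.tdm from above):

```r
library(tm)

# terms whose occurrence correlates with "century" at 0.20 or above
findAssocs(dhoni200.tdm, "century", 0.20)
```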
The command names(dhoni200_df) lists the columns that the tweets produce when converted to a data frame:
 “text” “favorited” “replyToSN” “created” “truncated”
 “replyToSID” “id” “replyToUID” “statusSource” “screenName”
Analysis: Which users posted the most tweets?
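One way to find the busiest tweeters, a sketch assuming the dhoni200_df data frame from earlier:

```r
# count tweets per user and show the ten most active screen names
tweet_counts <- sort(table(dhoni200_df$screenName), decreasing = TRUE)
head(tweet_counts, 10)
```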
“Data exploration is a process of probing more deeply into the dataset, while being careful to stay organized and avoid errors.”
It is always important to bring focus to the data before analysis begins, especially when the data was not collected in a controlled manner. Data exploration is the term used for the steps involved in searching through and analyzing a large amount of data to understand what has been gathered.
Data exploration refers to the following typical tasks:
- Checking the data for recurring patterns
- Looking at the data's structure and relationships
- Obtaining straightforward graphical representations of the data to understand aspects of the first two points
Why do we need data exploration?
- To identify data-related issues such as errors or outliers
- Patterns: symmetric, skewed, bimodal, clusters. Please click here to know more about data patterns.
- Relationships: linear, polynomial, exponential
- Identification of a data model
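As a quick illustration of these tasks in R (using the built-in mtcars dataset rather than the blog's data):

```r
# structure: column names, types, sizes
str(mtcars)

# summaries: ranges and quartiles hint at errors and outliers
summary(mtcars)

# distribution shape: symmetric, skewed, bimodal?
hist(mtcars$mpg)

# pairwise relationships: linear, polynomial, exponential?
pairs(mtcars[, c("mpg", "disp", "hp", "wt")])
```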
Good References for data exploration with R:
When we have a dataset, it's always useful to understand how its variables are correlated with each other. In this blog we will take the studentsdata set, which we discussed in my previous blog, and see how well the scores in individual subjects correlate with the total scores.
What is correlation?
In simple terms, correlation is the relationship two variables establish with each other. For example, you might want to know whether your salary has increased with your age, how much your expenditure has grown with your increase in salary, or how much umbrella sales rise with the level or periodicity of rainfall.
Before we begin, if you want to know how to import data from Excel into the R environment, please read my blog.
Let's have a look at the data we now have. The studentsdata table has 8 columns, imported from an Excel sheet.
We can use cor(var1, var2) to determine the correlation; by default it returns the Pearson correlation coefficient. We will first find the correlation between the Tamil subject scores and TotalScores. In the picture below we have used cor(studentsdata$Tamil, studentsdata$TotalScores), which returns 0.4370992, i.e. about 43.7%, a low positive correlation. We have also plotted the two variables against each other using plot. If you want to learn how to calculate correlation by hand, please refer to this link for a simple example.
Interestingly, in this dataset the correlation coefficient between the English subject and TotalScores is 0.7475341, about 74.8%, which establishes a strong relationship between English scores and the total scores. If one variable increases when the second one increases, there is a positive correlation, and the coefficient will be closer to 1. In this instance, as the English marks increase, TotalScores increases significantly, since the two are strongly positively correlated.
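To make the calls concrete, here is a toy version with made-up scores (the real post uses studentsdata imported from Excel):

```r
# hypothetical marks for five students
tamil <- c(55, 60, 48, 72, 65)
total <- c(300, 310, 280, 360, 330)

cor(tamil, total)      # Pearson correlation coefficient by default
plot(tamil, total)     # quick scatter plot of the relationship
```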
The same scatter plot has been drawn with the ggplot2 library, using the qplot method. Please find the picture below:
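With ggplot2 the plot is a single call (a sketch assuming the studentsdata columns used above):

```r
library(ggplot2)

# qplot is ggplot2's quick-plot convenience wrapper
qplot(Tamil, TotalScores, data = studentsdata)
```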
Have a look at how the correlation fares with the other subjects. R made it so simple, didn't it?
In my previous article you saw how to rank the data in a table. In this blog we will see how simple it is in R to display the top 10 records. I assume you have the dataset ready with the ranking done. Before we look at the top records, we will see how to select specific columns from the data table:
Selecting specific columns from the data table:
We have selected only the required fields: Name, TotalScores and Rank. Notice that all 20 records are shown, even though they are ordered by rank. Now we will see how to select the top 10 from the same set of data.
To display the bottom 10 records:
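These selections can be sketched with a small made-up table (the column names follow the blog's dataset; the values are invented):

```r
# toy stand-in for the ranked students table
RankedScores <- data.frame(
  Name        = paste0("Student", 1:20),
  TotalScores = seq(420, 230, by = -10),
  Rank        = 1:20
)

# select specific columns only
RankedScores[, c("Name", "TotalScores", "Rank")]

# top 10 and bottom 10 by rank
head(RankedScores, 10)
tail(RankedScores, 10)
```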
Ranking using R: Step by Step
In this blog we are going to see how we can rank the students in a class based on their total scores using R.
Let's have a worksheet with the details of all the students, their marks and totals. We will name the worksheet DataSourceMarks.xls.
First let's get the maximum total score in this table. We can access a column with the table name followed by $ and the column name, so for the maximum total score we can use the following command.
Before we rank the data, we will order it by TotalScores and Name, in ascending order, using the following command:
We can order the same in descending order by prefixing the column with the minus "-" symbol, changing the command as follows:
Next we will store this ordered data in an OrderedScores object for further ranking, using the following command:
Next we will create a new object with a Rank column added to the table, the ranking being done on OrderedScores, using the following command.
Now we have the ranked data, in which Prabhakar, with a total score of 437, is ranked number 1.
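Putting the steps together on a toy table (the names other than Prabhakar and the other scores are made up):

```r
marks <- data.frame(
  Name        = c("Prabhakar", "Asha", "Ravi", "Kumar"),
  TotalScores = c(437, 402, 389, 402)
)

max(marks$TotalScores)   # the highest total score: 437

# ascending order by TotalScores, then Name
marks[order(marks$TotalScores, marks$Name), ]

# descending order: prefix the column with "-"
OrderedScores <- marks[order(-marks$TotalScores, marks$Name), ]

# add a Rank column; ties.method = "first" breaks ties in table order
RankedScores <- transform(OrderedScores,
                          Rank = rank(-OrderedScores$TotalScores,
                                      ties.method = "first"))
RankedScores
```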
Things we have learnt:
- To access the column we need to use the tableobject$columnname
- To use the Max function to get the maximum value in a vector.
- We can order the data using order function.
- We can rank the ordered data using the rank function.
Update: Please refer to this blog for various other ways to import Excel files, http://www.milanor.net/blog/?p=779, as the steps outlined by me have a dependency on Perl.
In this blog I’m going to share with you the steps involved in importing the data and understanding its aspects from R.
- Windows XP
- Completed the installation of R (In this example I’m using R version 2.15.2)
Step 1: Keep the Excel file ready. In this example I've prepared my own sample data: a table that captures the various employees who have shown interest in a stream change. The column labeled "Agreed" captures whether they agreed or not. Snapshot of the worksheet:
Step 2: I have named the file as datasource.xls
Step 3: For reading Excel files we will need the "gdata" package. If it is not installed, we can install it with install.packages("gdata"). Make sure you have an internet connection to download the package.
Step 4: Then issue the command library(gdata), which enables support for using .xls files in R.
Step 5: We are going to use the command read.xls. If you need additional help, you can issue help(read.xls), which will start the help server and load the relevant content.
Step 6: I have saved datasource.xls in the My Documents folder.
Step 7: Issue the command mydata = read.xls("datasource.xls") and then type mydata to see the Excel file loaded into the R environment. If the file is missing or there is a problem with the path, you will get an error.
Step 8: In this final step we can see a summary of the data using the command summary(mydata) (note the lowercase s).
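The whole import, as a sketch (read.xls needs Perl installed, as the update above notes; the file must exist in the working directory):

```r
library(gdata)

# read the first sheet of the workbook into a data frame;
# the file is looked up relative to the current working directory
mydata <- read.xls("datasource.xls")

mydata            # print the imported table
summary(mydata)   # per-column summary statistics
```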