Step by Step Sentiment Analysis on Twitter Data Using R with Airtel Tweets: Part III

After a lot of difficulties, here is my third post on this topic this weekend. In my first post we saw what sentiment analysis is and what steps are involved in it. In my previous post we saw, step by step, how to retrieve the tweets and store them in a file. Now we will move on to the sentiment analysis step.

Goal: To do sentiment analysis on Airtel Customer support via Twitter in India.

In this post: We will load the tweets that were retrieved and stored in the previous post and start the analysis. I'm going to use the simple algorithm used by Jeffrey Breen to determine the score/mood of a particular brand on Twitter.

We will use the opinion lexicon provided by him, which is primarily based on the Hu and Liu papers. You can visit their site for a lot of useful information on sentiment analysis. We determine the positive and negative words in each tweet, and the scoring happens based on those.

Step 1: We will import the CSV file into R using read.csv; you can use summary to display a summary of the data frame.
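A minimal sketch of this step; the file name airteltweets.csv is my assumption, so adjust it to match the file you saved in the previous post:

# Import the tweets saved in the previous post (file name assumed)
airteltweetdata <- read.csv("airteltweets.csv", stringsAsFactors = FALSE)
summary(airteltweetdata)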

Step 2: We load the positive and negative word lists, store them locally, and import them using the scan function as given below:
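The Hu and Liu lexicon ships as two plain-text files; a sketch of loading them with scan, assuming the standard file names and that the files sit in your working directory:

# Load the opinion lexicon; lines starting with ';' are comments in these files
pos_words <- scan("positive-words.txt", what = "character", comment.char = ";")
neg_words <- scan("negative-words.txt", what = "character", comment.char = ";")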


Step 3:

Now we will look at the code for evaluating the sentiment. It has been taken from http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/. Thanks to Jeffrey for the source code.
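For reference, here is the score.sentiment function essentially as Jeffrey Breen published it; do check it against the link above:

score.sentiment <- function(sentences, pos.words, neg.words, .progress = 'none')
{
  require(plyr)
  require(stringr)
  scores <- laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up the sentence with R's regex-driven global substitute, gsub():
    sentence <- gsub('[[:punct:]]', '', sentence)
    sentence <- gsub('[[:cntrl:]]', '', sentence)
    sentence <- gsub('\\d+', '', sentence)
    # convert to lowercase:
    sentence <- tolower(sentence)
    # split into words
    word.list <- str_split(sentence, '\\s+')
    words <- unlist(word.list)
    # compare the words to the dictionaries of positive & negative terms
    pos.matches <- match(words, pos.words)
    neg.matches <- match(words, neg.words)
    # match() returns the position of the matched term or NA; we want TRUE/FALSE
    pos.matches <- !is.na(pos.matches)
    neg.matches <- !is.na(neg.matches)
    # score = number of positive words minus number of negative words
    score <- sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress = .progress)
  scores.df <- data.frame(score = scores, text = sentences)
  return(scores.df)
}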


Step 4:

We will test this score.sentiment function with some sample data.
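A minimal test might look like this; the three sample sentences are my own illustration, not the ones from the original screenshot:

# Three made-up sentences mixing positive and negative words
test <- c("I love the quick and awesome customer support",
          "horrible network, bad and slow support",
          "the service was ok, nothing special")
result <- score.sentiment(test, pos_words, neg_words)
result$score  # one score per sentence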


In this step we have created a vector named test and added three sentences to it, containing a mix of positive and negative words. We pass test to the score.sentiment function along with the pos_words and neg_words we loaded in the previous steps. You then get a score back from score.sentiment for each sentence.

We will also try to understand a little more about this function and what it does:

a. Two libraries are loaded: plyr and stringr, both written by Hadley Wickham, one of the great contributors to R. You can learn more about plyr using this page or tutorial, and get more insights on the split-apply-combine strategy here, the best place to start according to Hadley Wickham. You can think of it as analogous to Google's Map-Reduce, which is used more for parallelism. stringr makes string handling easier.

b. Next, laply is used. You can learn more about what the apply family of functions does here. In our case we pass the vector of sentences to laply. In simple terms, this method takes each tweet, passes it to the inner function along with the positive and negative words, and combines the results.

c. Next, gsub handles the replacements, using gsub(pattern, replacement, x) to strip punctuation, control characters and digits from each tweet.

d. Then the sentence is converted to lowercase using tolower.

e. The sentence is split into words using str_split, and the words are matched against the positive and negative lists; the score is the number of positive matches minus the number of negative matches.

Step 5: Now we will pass the Airtel tweets from airteltweetdata$text to the score.sentiment function to retrieve the scores.
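A sketch of that call; the result object name airtel.scores is my choice:

# Score every tweet; .progress = 'text' shows a progress bar for long vectors
airtel.scores <- score.sentiment(airteltweetdata$text, pos_words, neg_words,
                                 .progress = 'text')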

Step 6: We will look at the summary of the scores and their histogram:
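Continuing with the names used above:

summary(airtel.scores$score)  # distribution of the sentiment scores
hist(airtel.scores$score,
     main = "Sentiment scores of Airtel tweets", xlab = "Score")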

The histogram outcome:

It shows that most of the 1,499 responses about Airtel are negative.

Disclaimer: Please note that this is only sample data, analyzed purely for educational and learning purposes. It is not intended to target or influence any brand.


Market Basket Analysis Retail Foodmart Example: Step by step using R

This post is a small step-by-step implementation of Market Basket Analysis using the Apriori algorithm in R, on a small dataset, for a better understanding of the implementation. It will also help to show how simply R can be used for such purposes.

I've turned the foodmart dataset into a transaction set using the combination of Time_Id and Customer_Id as a composite key. This gives a unique transaction id, created as Trans_Id, and I have incorporated the product name for easier understanding, in a table named POS_Transactions. I exported the data from this table as RetailFoodMartData.csv, which has 86,829 records. For the sake of simplicity and quick understanding, I copied a few transactions, limited the number of rows to 105, and re-executed the whole exercise with RetailFoodMartDataTest.csv. The final results shown are the output of RetailFoodMartDataTest.csv.

Before we convert the data into transactions for the Apriori algorithm, we need to make sure there are no duplicates in the vector or data.frame. Otherwise you will get an error like "cannot coerce list with transactions with duplicated items". So please remove duplicates from the CSV source file using Data -> Remove Duplicates before you import the data into R.

Hope this suffices for this exercise.

I’m using R version 3.0.1 for the analysis.

Data Preparation:

Step 1: Import the CSV into the R environment. If you would like to know how to import it, please refer to my blog post here.

Step 2: Please find the outcome of the import step and summary using R below; you can view the first few records using head.
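A sketch of the import and the quick checks; RetailPosData is the object name used in the next step:

# Import the prepared transaction data and take a first look
RetailPosData <- read.csv("RetailFoodMartDataTest.csv", stringsAsFactors = FALSE)
summary(RetailPosData)
head(RetailPosData)  # first few records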

Step 3: In the above screenshot you can see that the first 6 items belong to the same transaction; our objective now is to group or aggregate the items by transaction id. We can do that using AggPosData <- split(RetailPosData$ProductName, RetailPosData$Trans_Id), which aggregates the product names per transaction. In the example shown below, transaction id 396 has 3 products.
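A minimal sketch of the aggregation and a spot check on one transaction:

# Group product names by transaction id
AggPosData <- split(RetailPosData$ProductName, RetailPosData$Trans_Id)
AggPosData[["396"]]  # e.g. the three products recorded for transaction 396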

Implementation of Association Rules Algorithm:

"Mining frequent item sets and association rules is a popular and well researched method for discovering interesting relations between variables in large databases." As the next step, we need to load the arules library into the R console.

Step 4: When you try to load the arules library and the package doesn't exist, it shows an error as in the picture; you can then install the package and load the library, as sketched below.
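A sketch of the install-and-load step:

# Install arules on first use, then load it
if (!require("arules")) {
  install.packages("arules")
  library(arules)
}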

Step 5: Now we use the aggregation done with the split method in Step 3. We need to coerce it into transactions so the Apriori algorithm can process the data, as follows: Txns <- as(AggPosData, "transactions").
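Putting the coercion and the summary together:

# Coerce the per-transaction item lists into an arules 'transactions' object
Txns <- as(AggPosData, "transactions")
summary(Txns)  # itemMatrix summary, used as input for apriori below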

Now we will quickly review the Apriori algorithm implementation in R, with a picture that shows its process in a simplified manner:


Courtesy: http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput499/slides/Lect10/img054.jpg

Result of summary(Txns):


In this example, summary describes the transactions as an itemMatrix; this is the input to the Apriori algorithm. Here the most frequent item is Atomic Bubble Gum, with 6 occurrences.

Step 6: Now we will run the algorithm using the following statement:

Rules <- apriori(Txns, parameter = list(supp = 0.05, conf = 0.4, target = "rules", minlen = 2))

The results obtained give an understanding that if a customer buys Just Right Canned Yams, there is a 100% possibility that he might buy Atomic Bubble Gum; similarly, if a customer purchases CDR Hot Chocolate, there is a possibility he will buy either "Just Right Large Canned Shrimp" or "Atomic Bubble Gum". Confidence refers to the likelihood of the purchase, and support refers to the percentage of transactions the items are involved in.

Step 7: Now we will decrease the confidence level to 0.2 and look at the results given below; the number of rules generated has increased. You can inspect the rules using inspect(Rules), and you can look at a specific rule using inspect(Rules[1]):
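A sketch, reusing the Rules object name from Step 6:

# Lower the confidence threshold from 0.4 to 0.2; more (weaker) rules appear
Rules <- apriori(Txns, parameter = list(supp = 0.05, conf = 0.2,
                                        target = "rules", minlen = 2))
inspect(Rules)     # list all generated rules
inspect(Rules[1])  # look at the first rule only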

Step 8: Now we will visualize the top five items by frequency using the following statement: itemFrequencyPlot(Txns, topN = 5)
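As a runnable snippet:

# Plot the five most frequent items in the transaction set
itemFrequencyPlot(Txns, topN = 5)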


Good references are available with more detailed steps:

http://snowplowanalytics.com/analytics/catalog-analytics/market-basket-analysis-identifying-products-that-sell-well-together.html

http://prdeepakbabu.wordpress.com/2010/11/13/market-basket-analysisassociation-rule-mining-using-r-package-arules/

http://www.eecs.qmul.ac.uk/~christof/html/courses/ml4dm/week10-association-4pp.pdf

Though this is considered a "poor man's recommendation engine", it's a very useful one. In my next post we will continue to analyze how we can do this kind of analysis on a large volume of data.

Connecting to Mysql via SQOOP – List Tables

In continuation of my posts on market basket analysis, I will continue toward analytics using the data available in the FoodMart dataset, which you can download from https://sites.google.com/a/dlpage.phi-integration.com/pentaho/mondrian/mysql-foodmart-database. Before moving on to the next steps, it's important that we understand certain things about connecting to MySQL from Sqoop, since we are focusing on big data; retail data is always big. Here are the steps.

    1. Download the JDBC driver for Sqoop to interact with MySQL from the following URL: http://dev.mysql.com/downloads/connector/j/
    2. Make sure you download mysql-connector-java-5.1.25.tar.gz, either using wget or from your Windows machine if you are connected via VirtualBox or VMware.
    3. Then extract the files to get mysql-connector-java-5.1.25-bin.jar and place it under the sqoop/lib folder.
    4. Make sure you have the necessary MySQL server information, like hostname, username and password with the necessary access.
    5. Once you have that, make sure you have granted the necessary privileges for other hosts to access the MySQL server using the following statement:

grant all privileges on *.* to 'username'@'%' identified by 'userpassword';

  • Then you can get the list of tables from the MySQL database foodmart using the following command:

sqoop list-tables --connect jdbc:mysql://192.168.1.32:3306/foodmart --username root

Note: I have done this experiment with Sqoop version 1.4.3, Ubuntu 12.04 LTS on VirtualBox, and MySQL 5.5.24 with WAMP.

Caution: In my example I have used root as the username; please don't use the root username in practice.

Other links for your reference:

http://www.devx.com/Java/hadoop-sqoop-to-import-data-from-mysql.html

http://www.datastax.com/docs/datastax_enterprise2.0/sqoop/sqoop_demo

Recommendation in Retail

So you go to a shop and see that a specific brand of deodorant and a bathing bar are bundled as one product and displayed with a specific discount, and you hand-pick it with immense happiness and the satisfaction of a good deal. How do the shopkeepers come to know about this? Intuition, analytics, case-based reasoning, pattern matching, etc.

Whether it is Walmart, Target, Macy's, TESCO or even a small self-owned retail outlet, it's important that they understand customer/consumer behavior correctly to make a good profit at the end of the day. And let's not think it's useful only in the retail industry; it's very much important for services-based organizations to understand consumer behavior as well.

For ease of understanding, and to move toward the practical aspects of such an implementation, we will try to understand some of the factors which would or could influence a recommendation:

  • Demography (city, locality, country, etc.) (transactions)
  • Culture (transactions)
  • Product mix based on past sales history (transactions)
  • Social recommendations (Twitter, Facebook, posts) (social analytics/NoSQL/semi-structured)
  • Product reviews (blog/review/semi-structured data)
  • Post-sales experience (transactions)

The challenge is to relate these data and make good recommendations through the system in a very short span of time, to influence customers' buying decisions. In my next post we will try to evaluate some of the datasets available on the internet for further experiments on the same.

My aim is to understand and implement a recommendation system, or at least arrive at the right steps for making a recommendation system which is reliable and can handle the complexity involved in the data.

Stay tuned for the next post.