Beginner's Guide to ARIMA: Problem Identification & Data Gathering – Part 2

Continuing from my earlier post, I'm going to explore ARIMA using an example. In this post we will go through, in detail, each step needed to accomplish ARIMA-based forecasting for a concrete problem.

Step 1: Problem Identification or Our Scenario

We are going to take historical time series data on household power consumption and use it to forecast with ARIMA. A research paper published in the Proceedings of the International MultiConference of Engineers and Computer Scientists 2013, Vol. I, analyses the same dataset and compares the performance of ARMA and ARIMA. This post focuses on accomplishing the forecast step by step in R on the same dataset, keeping it easy for beginners. Later we will also evaluate other tools, such as AutoBox, that can be used to solve this kind of problem.

Step 2: Data Gathering or Identification of dataset

The dataset we are going to use is the Individual household electric power consumption dataset available in the UCI Repository at https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption. It is a multivariate dataset; please check this link to understand the difference between univariate, bivariate and multivariate data.

Quick Summary of the Dataset:

  • Dataset contains data between December 2006 and November 2010
  • It has around 19.7 MB of Data
  • File is in .TXT Format
  • Columns in the Dataset are:
    • Date (DD/MM/YYYY)
    • Time (HH:MM:SS)
    • Global active power
    • Global Reactive power
    • Voltage
    • Global Intensity
    • Sub Metering 1
    • Sub Metering 2
    • Sub Metering 3

You can open this semicolon-delimited text file in Excel and, after following the necessary steps in the import wizard, you will have an Excel sheet with the data as shown below. Note that Excel could only load rows up to 1,048,576, while the text file actually contains 2,075,260 rows. Whoa!

 

Next, for Step 3 (preliminary analysis), we can use R as the tool. To do that we first need to load the data into R. We can save the Excel sheet in CSV or XLS format and then import it into R as outlined in my other post or using this link. I'm using RStudio for this purpose, and the data loading process is shown in the screenshots in the subsequent sections.

The first installation attempt showed an error, but the subsequent attempt to install gdata was successful. We can then load the library with library(gdata), read the CSV file into a powerData variable for further analysis, and view the data using View(). Please check the console window for the code.
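A minimal sketch of this loading step is given below; the file name and path are assumptions, so adjust them to wherever you saved the exported data.

# gdata is only needed if you kept the file in XLS format; a CSV needs no extra package
# install.packages("gdata")
library(gdata)

# Read the CSV export of the power consumption data (assumed file name)
powerData <- read.csv("household_power_consumption.csv", header=TRUE)

# Inspect the first rows and open the RStudio data viewer
head(powerData)
View(powerData)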

In the next post we will do some preliminary analysis on this data which we have loaded.


Step by Step connecting to MySQL from R

Many of you would like to explore data from a MySQL database with R, which lets you analyze relational data using R's capabilities. This post walks through the steps to connect to MySQL from R. It follows a step-by-step approach similar to the steps outlined in this PDF document, with some additional information and an easier-to-follow sequence.

Assumptions:

  1. MySQL is already installed on your System
  2. Operating System is Windows 7 (64-Bit)

Dataset:

We will use the sample database available at http://www.eclipse.org/birt/phoenix/db/, a fictitious retailer named "classicmodels" (approximately 3.1 MB) intended for testing applications and databases. Make sure you import the dataset into MySQL as outlined here. It has the following tables: Customers, Employees, Orders, OrderDetails, Payments, Products and ProductLines. You can look at what is available in the database using SQLyog Community Edition.

Various ways to connect to MySQL from R:

The following are the ways in which we can connect to MySQL from R:

  • Using RODBC Library
  • Using RMySQL Library

In this post we will see the steps for connecting to MySQL from R using ODBC. Also refer to this PDF Attachment on the ODBC connectivity with R.

Step 1:

If you don't already have the MySQL ODBC driver, download it from https://dev.mysql.com/downloads/connector/odbc/. I downloaded mysql-connector-odbc-5.3.2-winx64.msi as my operating system is 64-bit. Please proceed with its installation.

ODBC Setup:

Going forward we will do the ODBC setup related steps.

Step 2:

Go to Control Panel -> Administrative Tools -> ODBC Data Sources


Step 3: Click Add to add a new ODBC Setup for MySQL


 

Step 4: Once you click Finish you will get the screen shown below, where you need to enter the IP address/hostname of the MySQL server, the username and password credentials, and the database, and then click the Test button to make sure you can connect to the database without any problems. Once the test is successful, click OK to add it to the list of ODBC connections.


After the MySQL ODBC connection is added:


 

Now we will move to R, invoke this data source, and try to access one of the tables.

Step 5: Now load the RODBC library using the following command; if it fails to load, look at the instructions in the link to install the RODBC package.

library(RODBC)


Step 6: With RODBC installed, we can now connect to the classicmodels database in MySQL and test the RODBC library.
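A minimal sketch of this step, assuming the data source was named "classicmodels" in the ODBC setup and using placeholder credentials:

library(RODBC)

# Connect through the DSN created above; replace uid/pwd with your MySQL credentials
conn <- odbcConnect("classicmodels", uid="root", pwd="password")

# List the tables exposed through this connection
sqlTables(conn)

# Pull one table, e.g. Customers, into a data frame and peek at it
customers <- sqlQuery(conn, "SELECT * FROM customers")
head(customers)

# Close the connection when finished
odbcClose(conn)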

Hopefully you now have an idea of how to bring MySQL data into R for further processing. In the next post we will attempt the same using the RMySQL library. Thanks, and please share your feedback.

Data Visualization: Scatter Plot step by step with R

Continuing from my previous post on scatter plots, which was more of an introduction, in this post we will take a step-by-step approach to drawing a scatter plot with R.

Background:

The following table provides the data used in this post. It has two columns: "Year" and "Total Telephones" (in millions).

 Source: http://mospi.nic.in/Mospi_New/upload/SYB2014/ch31.html

 Source Data:


Steps:

Step 1: We have this data in a CSV file named “ScatterPlotData.csv”

Step 2: We will load this data using read.csv: mydata=read.csv("E:\\Personal\\Learning\\Data Visualization\\ScatterPlotData.csv", header=TRUE). This handles the header row and loads the data appropriately.

Step 3: Now we have the data in mydata

Step 4: Now we can plot the graph using the following statements:

mydata <- read.csv("E:\\Personal\\Learning\\Data Visualization\\ScatterPlotData.csv", header=TRUE)
plot(mydata$Year, mydata$TotalTelephones, main="Year Vs Sale of Telephones", xlab="Year", ylab="Sale of Telephones (Millions)", xaxt="n", ann=TRUE)
axis(1, at=mydata$Year, labels=mydata$Year)

Step 5: The Result

Step by Step Sentiment analysis on Twitter data using R with Airtel Tweets: Part – III

After a lot of difficulties, here is my third post on this topic this weekend. In the first post we saw what sentiment analysis is and what steps are involved in it. In the previous post we saw, step by step, how to retrieve the tweets and store them in a file. Now we will move on to the sentiment analysis itself.

Goal: To do sentiment analysis on Airtel Customer support via Twitter in India.

In this post: We will load the tweets that were retrieved and stored in the previous post and start the analysis. I'm going to use the simple algorithm used by Jeffrey Breen to determine the scores/moods around a particular brand on Twitter.

We will use the opinion lexicon he uses, which is primarily based on the Hu and Liu papers. You can visit their site for a lot of useful information on sentiment analysis. We determine the positive and negative words in each tweet, and the scoring is based on those matches.

Step 1: We will import the CSV file into R using read.csv, and you can use summary() to display a summary of the data frame.
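A short sketch of this import, assuming the file written in the previous post (airteltweets.csv) is in the working directory:

# Read the tweets saved in Part II; keep the text column as character strings
airteltweetdata <- read.csv("airteltweets.csv", stringsAsFactors=FALSE)

summary(airteltweetdata)     # overview of the data frame
head(airteltweetdata$text)   # first few tweet texts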

Step 2: We store the positive-word and negative-word lists locally and import them using the scan() function as given below:
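A sketch of loading the lexicon, assuming the Hu and Liu files positive-words.txt and negative-words.txt have been downloaded into the working directory:

# The ';' comment character skips the header lines in the lexicon files
pos_words <- scan("positive-words.txt", what="character", comment.char=";")
neg_words <- scan("negative-words.txt", what="character", comment.char=";")

# Optionally add a few domain-specific words of your own
neg_words <- c(neg_words, "wait", "waiting", "wtf")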


Step 3:

Now we will look at the code for evaluating the sentiment. It has been taken from http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/. Thanks to Jeffrey for the source code.
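In essence, the scoring function from Jeffrey Breen's slides looks like the sketch below (reproduced with light comments; refer to his original post for the authoritative version):

library(plyr)
library(stringr)

score.sentiment <- function(sentences, pos.words, neg.words, .progress='none')
{
  scores <- laply(sentences, function(sentence, pos.words, neg.words) {
    # clean the tweet: strip punctuation, control characters and digits
    sentence <- gsub('[[:punct:]]', '', sentence)
    sentence <- gsub('[[:cntrl:]]', '', sentence)
    sentence <- gsub('\\d+', '', sentence)
    # lower-case the sentence and split it into individual words
    sentence <- tolower(sentence)
    words <- unlist(str_split(sentence, '\\s+'))
    # compare the words against the positive and negative lexicons
    pos.matches <- !is.na(match(words, pos.words))
    neg.matches <- !is.na(match(words, neg.words))
    # score = number of positive matches minus number of negative matches
    sum(pos.matches) - sum(neg.matches)
  }, pos.words, neg.words, .progress=.progress)

  data.frame(score=scores, text=sentences)
}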


Step 4:

We will test this score.sentiment function with some sample data.


In this step we create a test vector with 3 sentences containing different words that may be positive or negative. We pass this test vector to the score.sentiment function along with the pos_words and neg_words loaded in the previous step, and get a result score back for each sentence.
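A sketch of such a test, using made-up sentences (yours may differ):

# three hypothetical sentences: clearly positive, clearly negative, neutral
test <- c("I love the fast and helpful support",
          "the network is terrible and the service is horrible",
          "my plan renews next month")

result <- score.sentiment(test, pos_words, neg_words)
result$score   # one score per sentence, e.g. positive, negative, zero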

We will also try to understand a little more about this function and what it does:

a. Two libraries are loaded: plyr and stringr, both written by Hadley Wickham, one of the great contributors to R. You can learn more about plyr using this page or tutorial, and get more insight into the split-apply-combine approach here, the best place to start according to Hadley Wickham. You can think of it by analogy with Google's MapReduce algorithm, which is used more for parallelism. stringr makes string handling easier.

b. Next, laply is used. You can learn more about what the apply functions do here. In our case we pass the sentences vector to laply; in simple terms, it takes each tweet, passes it to the inner function along with the positive and negative word lists, and combines the results.

c. Next, gsub handles the text clean-up replacements, following the signature gsub(pattern, replacement, x).

d. Then convert the sentence to lowercase

e. The sentences are split into individual words using str_split, and the scores are computed by matching those words against the positive and negative word lists.

Step 5: Now we pass the Airtel tweet texts from airteltweetdata$text to the score.sentiment function to retrieve the scores.
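Scoring the collected tweets is then a single call (a sketch, assuming the objects created in the earlier steps):

# score every tweet text retrieved in Part II
airtel.scores <- score.sentiment(airteltweetdata$text, pos_words, neg_words)
head(airtel.scores)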

Step 6: We will see the summary of the scores and its histogram:
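A sketch of the two statements:

summary(airtel.scores$score)

# distribution of sentiment scores across the retrieved tweets
hist(airtel.scores$score, main="Sentiment scores for @airtel_presence tweets", xlab="Score")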

The histogram outcome:

It shows that most of the 1,499 responses about Airtel are negative.

Disclaimer: Please note that this is only sample data, analyzed purely for educational and learning purposes. It is not intended to target or influence any brand.

Step by Step Sentiment analysis on Twitter data using R with Airtel Tweets: Part – II

In the previous post we saw what sentiment analysis is and what steps are involved in it. In this post we will go through step-by-step instructions for doing sentiment analysis on the microblogging site Twitter, with a specific objective in mind. I came across an interesting post by Chetan S on the DTH operators' use of social media for providing customer support, and it triggered the idea for this post.

Goal: To do sentiment analysis on Airtel Customer support via Twitter in India.

In this post: We will retrieve the tweets, look at how to access the Twitter API, make the best use of the twitteR R package, and write these tweets to a file.

Important Note:

1. Before you use searchTwitter, go to dev.twitter.com, open your application, go to the "Settings" tab and select "Read, Write and Access direct messages". Make sure to click the Save button after doing this.

Refer to this link http://stackoverflow.com/questions/15713073/twitter-help-unable-to-authorize-even-with-registering

2. If you get an SSL problem when searching with searchTwitter after the above step, make sure you have RCurl enabled and follow the steps outlined here: http://stackoverflow.com/questions/15347233/ssl-certificate-failed-for-twitter-in-r.

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

Also make sure you have loaded the necessary packages, such as ROAuth.

Step 1: Make sure you have done the OAuth authentication with Twitter as described in the previous post and the steps outlined above. You can also check the loaded libraries with sessionInfo().

Step 2: Load the tweets for the relevant Twitter handle:

airtel.tweets=searchTwitter("@airtel_presence", n=1500)

We have now loaded the 1,499 tweets returned by the Twitter API into airtel.tweets. Next we will save these to a file for future processing.

Step 3: Before we write these tweets to a file, let us look at some of the tweets and data collected so far for better understanding. head(airtel.tweets) shows the top 6 tweets. Further to our analysis, we check the length of the tweet list, what class it is, and how we can access the tweets. Look at the screenshot given below.

Step 4: We will look at some examples of how to access the Twitter data with respect to the Twitter API, using the twitteR library, by drilling into one tweet out of the 1,499 available. In the example given below we have selected the 3rd item from the list and retrieved the user information, how many friends the user has, how many followers, and so on. These factors are vital to understand, as they can make a tweet go viral and impact the image of a particular brand. Now we will go to the next step of storing these tweets for further analysis.

Step 5: We will store the tweets collected in airtel.tweets to a file for future analysis and reference. We convert the list of tweets to a data frame using apply functions and write it to a file, using the plyr library. plyr allows you to split a data set apart into smaller subsets, apply methods to each subset, and combine the results. Please click here for a detailed introduction to plyr. So we convert the list into a data frame to prepare it for writing to a file; the tweets and all the necessary information are then available in the tweets.df data frame. You can look at the screenshots below for its summary.

Step 6: Set up the working directory and write the tweets.df data frame to the file airteltweets.csv. You can verify the data in this file using Notepad++ or Excel. The whole sequence is condensed into a single code sketch at the end of this post.

In the next post we will look at how to do sentiment analysis with this file data.
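For reference, here is the sequence above condensed into one sketch. It assumes the OAuth handshake from the previous post has already been completed; the working directory path is an assumption, and twitteR's twListToDF(airtel.tweets) is an equivalent shortcut for the plyr step.

library(twitteR)
library(plyr)

# Step 2: retrieve up to 1,500 tweets mentioning the Airtel support handle
airtel.tweets <- searchTwitter("@airtel_presence", n=1500)

# Step 3: a quick look at what came back
length(airtel.tweets)   # number of tweets actually returned
class(airtel.tweets)    # a list of status objects
head(airtel.tweets)     # top 6 tweets

# Step 4: drill into a single tweet and its author
tweet <- airtel.tweets[[3]]
tweet$getText()
user <- getUser(tweet$getScreenName())
user$getFollowersCount()
user$getFriendsCount()

# Step 5: flatten the list of status objects into one data frame
tweets.df <- ldply(airtel.tweets, function(t) t$toDataFrame())

# Step 6: set the working directory (assumed path) and write the tweets to CSV
setwd("E:/Personal/Learning/Twitter")
write.csv(tweets.df, "airteltweets.csv", row.names=FALSE)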

Market Basket Analysis Retail Foodmart Example: Step by step using R

This post is a small step-by-step implementation of Market Basket Analysis using the Apriori algorithm in R, on a small dataset, for a better understanding of the implementation. It also shows how simply R can be used for such purposes.

I've turned the data from the foodmart dataset into a transaction set using the combination of Time_Id and Customer_Id as a composite key. This forms a unique transaction id, created as Trans_ID, and I have included the product name for easier understanding in a table named POS_Transactions. I exported the data from this table as RetailFoodMartData.csv, which has 86,829 records. For the sake of simplicity and quick understanding, I copied a few transactions, limited the number of rows to 105, and re-executed the whole exercise with RetailFoodMartDataTest.csv. The final results shown are the output of RetailFoodMartDataTest.csv.

Before we convert the data into transactions for the Apriori algorithm, we need to make sure there are no duplicates in the vector or data frame. Otherwise you will get an error like "cannot coerce list with transactions with duplicated items". So please remove duplicates from the CSV source file using Data -> Remove Duplicates in Excel before you import the data into R.

Hope this would suffice for this exercise.

I’m using R version 3.0.1 for the analysis.

Data Preparation:

Step 1: Import Excel to the R environment. If you would like to know how to import, please refer to my blog post here.

Step 2: Below are the outcomes of the import step and a summary produced in R; you can see the top records using head().
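A minimal sketch of the import, assuming the exported file name used above:

# read the exported transactions; keep product names as plain character strings
RetailPosData <- read.csv("RetailFoodMartDataTest.csv", stringsAsFactors=FALSE)

summary(RetailPosData)   # column-wise summary of the data frame
head(RetailPosData)      # first few records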

Step 3: In the above screenshot you can see that the first 6 items belong to the same transaction; our objective now is to group or aggregate the items based on the transaction id. We can do that using AggPosData<-split(RetailPosData$ProductName,RetailPosData$Trans_Id). This aggregates the product names per transaction. In the example shown below, Transaction ID 396 contains 3 products.
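As a sketch, the aggregation and a quick check of one basket look like this (column names as described above):

# group product names by transaction id; each list element becomes one basket
AggPosData <- split(RetailPosData$ProductName, RetailPosData$Trans_Id)

# example: the products belonging to transaction 396
AggPosData[["396"]]

# how many items each transaction contains
head(sapply(AggPosData, length))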

Implementation of Association Rules Algorithm:

"Mining frequent item sets and association rules is a popular and well researched method for discovering interesting relations between variables in large databases." As the next step, we need to load the arules library into the R console.

Step 4: When you try to load the arules library and the package doesn't exist, it will show an error as in the picture; you can then install the package and load the library.
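If the package is missing, installing and loading it looks like this:

# install arules once, then load it for this session
install.packages("arules")
library(arules)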

Step 5: Now we use the aggregated data produced by the split method in Step 3. For the Apriori algorithm to process the data, we need to coerce it into transactions as follows: Txns<-as(AggPosData,"transactions").
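A sketch of the coercion and a quick check of the result:

# coerce the list of baskets into the 'transactions' class used by arules
Txns <- as(AggPosData, "transactions")

# the summary reports the number of transactions, items and the most frequent items
summary(Txns)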

Now we will quickly review the Apriori Algorithm implementation in R with the picture which shows its process in a simplified manner:


Courtesy: http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput499/slides/Lect10/img054.jpg

Result of summary(Txns):


Here the summary describes the transactions as an itemMatrix, which will be the input to the Apriori algorithm. In this example the most frequent item is Atomic Bubble Gum, with 6 occurrences.

Step 6: Now we will run the algorithm using the following statement:

Rules<-apriori(Txns, parameter=list(supp=0.05, conf=0.4, target="rules", minlen=2))
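You can list the rules this produces with inspect(); a short sketch:

# print the generated rules with their support, confidence and lift
inspect(Rules)

# sort by confidence to see the strongest rules first
inspect(sort(Rules, by="confidence"))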

The results above tell us that if a customer buys Just Right Canned Yams there is a 100% likelihood that they will also buy Atomic Bubble Gum; similarly, if a customer purchases CDR Hot Chocolate there is a chance they will also buy either "Just Right Large Canned Shrimp" or "Atomic Bubble Gum". Confidence refers to the likelihood of the purchase, and support refers to the percentage of transactions in which the items are involved.

Step 7: Now we will decrease the confidence level to 0.2 and look at the results given below; the number of rules generated has increased. You can inspect the rules using inspect(Rules) and look at a specific rule using inspect(Rules[1]):
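A sketch of re-running the algorithm with the lower confidence threshold:

# more rules qualify when the confidence threshold is reduced to 0.2
Rules <- apriori(Txns, parameter=list(supp=0.05, conf=0.2, target="rules", minlen=2))

inspect(Rules)      # all generated rules
inspect(Rules[1])   # look at the first rule on its own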

Step 8: Now we will visualize the top 5 items by frequency using the following statement: itemFrequencyPlot(Txns, topN = 5)

 

Good references are available which have more steps in detail:

http://snowplowanalytics.com/analytics/catalog-analytics/market-basket-analysis-identifying-products-that-sell-well-together.html

http://prdeepakbabu.wordpress.com/2010/11/13/market-basket-analysisassociation-rule-mining-using-r-package-arules/

http://www.eecs.qmul.ac.uk/~christof/html/courses/ml4dm/week10-association-4pp.pdf

Though this is considered a "poor man's recommendation engine", it is a very useful one. In my next post we will continue to explore how we can do this kind of analysis on large volumes of data.

Introduction to Market Basket Analysis

Market Basket Analysis (Association Analysis) is a mathematical modeling technique based on the theory that if you buy a certain group of items, you are likely to buy another group of items. It is used to analyze consumer purchasing behavior and helps increase sales and manage inventory by focusing on point-of-sale (POS) transaction data. The Apriori algorithm is commonly used to achieve this.

Apriori Algorithm

This algorithm is used to identify patterns in the data. It is based on observing the patterns of items that occur together within transactions.

Example:

If a person goes to a gift shop and purchases a birthday card and a gift, it is likely that they might also purchase a cake, candles or candy. These combinations help the retail shop owner predict likely purchase combinations, so they can club or package them as offers to make better margins. It also helps in understanding consumer behavior.

When we look at the Apriori algorithm, it is essential to understand what association rules are too; that helps put it in the right perspective.

Association rule learning is a popular machine learning technique in data mining. It helps in understanding relationships between variables in large databases. It is primarily applied to point-of-sale data in retail, where large numbers of transactions are recorded.

Reference links for Beginners:

http://en.wikipedia.org/wiki/Apriori_algorithm

http://en.wikipedia.org/wiki/Association_rule_learning

http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=all&_moc.semityn.www&_r=0

http://cran.r-project.org/web/packages/arules/vignettes/arules.pdf

I particularly like http://nikhilvithlani.blogspot.in/2012/03/apriori-algorithm-for-data-mining-made.html, which is very simple and easy for novices or beginners to understand.

Reference links for Researchers and algorithm lovers:

http://learninglover.com/blog/?p=245

http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf

http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap6_basic_association_analysis.pdf

My objective for this post is to serve as a precursor to using R and Big Data for Market Basket Analysis to drive recommendations in the retail point-of-sale domain, or on billions of e-commerce transactions. In upcoming posts we will see how to leverage this algorithm and do the appropriate analysis on point-of-sale data. Keep watching this space.