This post is a small, step-by-step implementation of Market Basket Analysis with the Apriori algorithm in R, using a small dataset so the implementation is easier to follow. It should also show how simply R can be used for this kind of analysis.
I’ve built the transaction set from the foodmart dataset using the composite key of Time_Id and Customer_Id. This gives a unique transaction id, created as Trans_ID, and I incorporated the product name for easier understanding, in a table named POS_Transcations. I exported the data from this table as RetailFoodMartData.csv, which has 86,829 records. For the sake of simplicity and quick understanding, I copied a subset of the transactions, limited the number of rows to 105, and re-ran the whole process with RetailFoodMartDataTest.csv. The final results shown are the output for RetailFoodMartDataTest.csv.
Before we convert the data into transactions for use in the Apriori algorithm, we need to make sure no duplicates exist in the vector or data.frame. Otherwise you will get an error like “cannot coerce list with transactions with duplicated items”. So please remove the duplicates from the CSV source file (in Excel, Data -> Remove Duplicates) before you import the data into R.
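If you prefer to stay inside R, the duplicates can also be dropped after import. This is a sketch, assuming the CSV has the Trans_Id and ProductName columns described above:

```r
# Read the exported CSV (column names assumed from the post: Trans_Id, ProductName)
RetailPosData <- read.csv("RetailFoodMartDataTest.csv", stringsAsFactors = FALSE)

# Keep only the first occurrence of each (Trans_Id, ProductName) pair,
# which is what the "duplicated items" error complains about
RetailPosData <- RetailPosData[!duplicated(RetailPosData[c("Trans_Id", "ProductName")]), ]
```

Either approach works; the point is that no transaction may list the same product twice.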
Hope this would suffice for this exercise.
I’m using R version 3.0.1 for the analysis.
Step 1: Import the CSV file into the R environment. If you would like to know how to import, please refer to my blog post here.
Step 2: Please find the outcomes of the import steps and the summary in R; you can view the first few records using head.
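In code, the import and the first look at the data amount to the following (the file name is the test CSV from above):

```r
# Import the test CSV into a data.frame
RetailPosData <- read.csv("RetailFoodMartDataTest.csv", stringsAsFactors = FALSE)

summary(RetailPosData)   # column-wise summary of the imported data
head(RetailPosData)      # first 6 records (head's default)
```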
Step 3: In the above screenshot you can see that the first 6 items belong to the same transaction, so our objective is to group, or aggregate, the items by transaction id. We can do that using AggPosData <- split(RetailPosData$ProductName, RetailPosData$Trans_Id). This aggregates the product names per transaction. In the example shown below, Transaction ID 396 contains 3 products.
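The grouping step looks like this (the transaction id 396 comes from the example above):

```r
# Group product names by transaction id; the result is a named list,
# one character vector of products per transaction
AggPosData <- split(RetailPosData$ProductName, RetailPosData$Trans_Id)

# Products belonging to transaction 396
AggPosData[["396"]]
```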
Implementation of Association Rules Algorithm:
“Mining frequent item sets and association rules is a popular and well researched method for discovering interesting relations between variables in large databases.” As the next step we need to load the arules library into the R console.
Step 4: If the arules package is not installed when you try to load it, you will see an error as shown in the picture; you can then install the package and load the library.
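A small sketch that installs arules only when it is missing, then loads it:

```r
# Install arules once if it is not already available, then load it
if (!requireNamespace("arules", quietly = TRUE)) {
  install.packages("arules")
}
library(arules)
```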
Step 5: Now we use the aggregated list produced by split in Step 3.
We need to coerce it into transactions so the Apriori algorithm can process the data. We do it as follows: Txns <- as(AggPosData, "transactions").
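Putting the coercion and its inspection together:

```r
# Coerce the per-transaction list into the arules 'transactions' class
# (internally an itemMatrix: one row per transaction, one column per item)
Txns <- as(AggPosData, "transactions")

summary(Txns)   # most frequent items, transaction length distribution, etc.
```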
Now we will quickly review the Apriori Algorithm implementation in R with the picture which shows its process in a simplified manner:
Result of summary(Txns)
Here summary describes the transactions as an itemMatrix, which will be the input to the Apriori algorithm. In this example the most frequent item is Atomic Bubble Gum, with 6 occurrences.
Step 6: Now we will run the algorithm using the following statement:
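A sketch of the mining call; note that the support and confidence thresholds below are assumed for illustration, since the post's exact values appear only in the screenshot:

```r
# Mine association rules from the transactions.
# support = 0.05 and confidence = 0.5 are assumed example thresholds,
# not necessarily the values used in the original post.
Rules <- apriori(Txns, parameter = list(support = 0.05, confidence = 0.5))

inspect(Rules)   # print the generated rules with support, confidence, lift
```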
The results above suggest that if a customer buys Just Right Canned Yams, there is a 100% chance he will also buy Atomic Bubble Gum; similarly, if a customer purchases CDR Hot Chocolate, he is likely to also buy either "Just Right Large Canned Shrimp" or "Atomic Bubble Gum". Confidence refers to the likelihood that the right-hand side is purchased when the left-hand side is, and support refers to the percentage of all transactions in which the itemset appears.
Step 7: Now we decrease the confidence level to 0.2 and see the results given below; the number of rules generated has increased. You can inspect the rules using inspect(Rules).
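Re-running with the lower confidence threshold might look like this (the support value is again an assumed example):

```r
# Lowering confidence to 0.2 admits weaker rules, so more are generated
Rules <- apriori(Txns, parameter = list(support = 0.05, confidence = 0.2))

inspect(Rules)                                     # list every rule
inspect(head(sort(Rules, by = "confidence"), 5))   # strongest 5 rules first
```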
Step 8: Now we visualize the most frequent items using the following statement: itemFrequencyPlot(Txns, topN = 5), which plots the top 5 items by frequency.
Good references are available which cover these steps in more detail:
Though this is considered a “poor man’s recommendation engine”, it is a very useful one. In my next post we will continue with how to do this kind of analysis on large volumes of data.