Top 25 words used in Thirukkural using Hadoop

Tools used:

Hadoop, HDFS, Hive, Eclipse, PuTTY, WinSCP, Excel

The Process:


Data Source:


Java Program:

In addition to the traditional Hadoop WordCount example, I added the line line = line.replaceAll("[\\d\\.\\d\\.\\d]", ""); to eliminate the numbers and decimals in the text file.
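
For context, here is a minimal sketch of where that line might sit in the classic WordCount mapper (the class and variable names here are illustrative, not the exact original source):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Strip digits and periods (the character class matches any digit or '.') before tokenizing.
        line = line.replaceAll("[\\d\\.\\d\\.\\d]", "");
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}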

Using the following command, create an external table that reads the output (part) file from the given location:

create external table thirukural(word String, count bigint) location '/user/hduser/out.txt';
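
One caveat worth hedging: the classic WordCount output is tab-delimited, so depending on your Hive setup you may need to spell out the row format. A variant sketch:

create external table thirukural(word string, count bigint)
row format delimited
fields terminated by '\t'
location '/user/hduser/out.txt';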

Describe the table created:
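
In the Hive shell this is simply:

describe thirukural;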

Order the rows by count and word, and write the result to a file in HDFS:

insert overwrite directory '/user/hduser/out.txt/result.txt' select * from thirukural order by count, word;
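
As a side note, a sketch of a variant query that would let Hive return the top 25 directly (sorted descending) instead of refining later in Excel:

select word, count
from thirukural
order by count desc, word
limit 25;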

Result after completing the map and reduce phases:

Hive output:


After a little refining of the data in Excel, the final result:

Word        Count
படும்        42
தரும்        37
இல்          32
கெடும்       28
என்னும்      24
இல்லை        22
செயல்        22
எல்லாம்      21
தலை          21
காமம்        20
கொளல்        20
பவர்         19
பெறின்       19
அரிது        18
இன்பம்       18
உலகு         18
கண்ணும்      18
தவர்         18
யவர்         18
விடல்        18
விடும்       17
கண்          16
செயின்       16
தற்று        16

Market Basket Analysis with Hadoop: Importing MySQL data to Hive using Sqoop

We have an existing data warehouse in MySQL, and from it we will need the following tables: the product table and the sales fact tables for the years 1997 and 1998. We will import these into HDFS for further analysis with Hive. Please go through the previous blog post to understand how to establish connectivity with MySQL using Sqoop.

We can start import using the following statement:

sqoop import --connect jdbc:mysql://192.168.1.10/foodmart --table product --username root

  • Now you can see that it has imported the data to HDFS in 50.2663 seconds, at 3.0000 KB/sec. If you issue the command hadoop dfs -ls, it will show a new item, /user/hduser/product.

A subsequent hadoop dfs -ls /user/hduser/product reveals the following:

Since we will use Hive to analyze the data, we will import the data again into Hive using the --hive-import option. If we do that, the following sequence of things happens:

  1. First, the data is imported into HDFS.
  2. Sqoop then generates Hive scripts to load the data from HDFS into Hive.

So we need to remove the product folder that was already imported into HDFS through Sqoop; otherwise Sqoop will find that the folder exists when it tries to import into Hive. We remove it with the following statement:

hadoop dfs -rmr /user/hduser/product

Now we will import the data using Sqoop with the Hive option:

sqoop import --connect jdbc:mysql://192.168.1.10/foodmart --table product --username root --hive-import

Once the import is complete, you will see something like the output below:

Now we will go ahead and check the data in Hive using show tables and describe product:
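
In the Hive shell, the checks look roughly like this (output omitted):

show tables;
describe product;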

In my next post we will import the remaining tables to be used for market basket analysis and start querying with Hive.

Why move data from an RDBMS or an existing data warehouse to HDFS?

In continuation of my previous post, many would have questions like the following:

  1. When there are traditional ways of doing market basket analysis with cubes on an Enterprise Data Warehouse (EDW), why do we need to adopt this route of moving data to HDFS?
  2. What kind of benefits does someone get by moving this RDBMS data to HDFS?
  3. Will it provide any cost savings?

This post is an analysis of possible use cases that could complement your data warehousing and analytics strategy. With the big hype around big data and related technologies, it's important to understand what to use and when to use it. Here are some good reasons for using Hadoop to complement the data warehouse:

  1. Using tools like Sqoop to move the data into an HDFS environment provides the following benefits:
    1. Storage of extremely high-volume data on the Hadoop infrastructure.
    2. Acceleration of nightly batch data movements with the help of MapReduce tasks.
    3. Automatic, redundant backup thanks to HDFS's naturally fault-tolerant data nodes.
    4. Scalability on low-cost commodity hardware.
  2. Moving the structured data into HDFS enables you to analyze it alongside unstructured or semi-structured data such as tweets, blogs, etc.
  3. There is no need to model the data up front, since the schema can be applied while the data is being read.
  4. It also gives you the ability to do quick exploratory analytics before moving the data to the warehouse for final analysis.

You can look at the following papers for more information and a more detailed understanding:

http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Using_Cloudera_to_Improve_Data_Processing_Whitepaper.pdf

http://www.b-eye-network.com/blogs/eckerson/archives/2013/02/what_is_unique.php

http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/?type=WP

Introduction to Market Basket Analysis

Market Basket Analysis (Association Analysis) is a mathematical modeling technique based on the theory that if you buy a certain group of items, you are likely to buy another group of items. It is used to analyze consumer purchasing behavior and helps increase sales and manage inventory by focusing on point-of-sale (POS) transaction data. The Apriori algorithm is commonly used to achieve this.

Apriori Algorithm

This algorithm is used to identify patterns in data. It is based on observing which items occur together within transactions.

Example:

If a person goes to a gift shop and purchases a birthday card and a gift, it's likely that they will also purchase a cake, candles, or candy. Such combinations help the retail shop owner predict likely purchases and club or package them as offers to make better margins. They also help in understanding consumer behavior.

When we look at the Apriori algorithm, it's essential to understand association rules too; that puts the algorithm in the right perspective.

Association rule learning is a popular machine learning technique in data mining. It helps in understanding relationships between variables in large databases. It is primarily applied to retail point-of-sale data, where large numbers of transactions are recorded.
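
As a quick illustration with hypothetical numbers: if 1,000 transactions were recorded, 100 of them contain a birthday card, and 60 of those 100 also contain a cake, then the rule {birthday card} => {cake} has support 60/1000 = 6% and confidence 60/100 = 60%. Apriori works by keeping only the itemsets and rules that clear the minimum support and confidence thresholds you choose.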

Reference links for beginners:

http://en.wikipedia.org/wiki/Apriori_algorithm

http://en.wikipedia.org/wiki/Association_rule_learning

http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=all&_moc.semityn.www&_r=0

http://cran.r-project.org/web/packages/arules/vignettes/arules.pdf

I particularly like http://nikhilvithlani.blogspot.in/2012/03/apriori-algorithm-for-data-mining-made.html because it is very simple and easy for novices and beginners to understand.

Reference links for Researchers and algorithm lovers:

http://learninglover.com/blog/?p=245

http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf

http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap6_basic_association_analysis.pdf

My objective with this post is to serve as a precursor to using R and big data for market basket analysis, to drive recommendations in the retail point-of-sale domain or over billions of e-commerce transactions. In the upcoming posts we will see how to leverage this algorithm and do the appropriate analysis on point-of-sale data. Keep watching this space.

Fundamentals of Pig: Working with Tuples

In the previous blog we uploaded the Windows event log to the Hadoop environment and started analyzing it using Pig. In this blog we will see how we can work with tuples.

Filtering Data:

In the script below there is no filter applied, so it fetches all the tuples.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);

Describe Events;

Result = FOREACH Events GENERATE Level,EventID, TaskDescription;

Dump Result;

You can see one such example highlighted in the picture below.

Tuples of data can be filtered using the FILTER operator in Pig.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);

Describe Events;

Result = FILTER Events BY EventID is not null;

Dump Result;

In the above code snippet the events are filtered so that only those with a non-null EventID remain; you can see the results.
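
As a further sketch, the same FILTER operator can test values as well (assuming the Level column holds strings such as 'Error'):

Errors = FILTER Events BY Level == 'Error';
Dump Errors;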


More to come..

Pig: Exploring more on Schema and data models

Schema in Pig:

Schemas apply to both simple and complex data types and can be used wherever required. They can be used with the LOAD, STREAM, and FOREACH operators using the AS clause. We will see a case and an example further on.

When we specify a schema, we can give each field a name and also a data type. If no data type is given, the field is automatically treated as a bytearray and, if required, can be cast to a different data type later. A field specified in the schema can be accessed by its name or by positional notation. We will see that in the example going forward.
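
In the meantime, here is a minimal sketch of a typed schema with access by name and by position (the types are illustrative assumptions about the columns):

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level:chararray, DateTime:chararray, Source:chararray, EventID:int, TaskCategory:chararray, TaskDescription:chararray);
-- the same fields referenced by name and by positional notation
ByName = FOREACH Events GENERATE Level, EventID;
ByPosition = FOREACH Events GENERATE $0, $3;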

Case:

I would like to do some analysis on the Event Viewer log from my PC in the Pig environment, while also exploring tuples further. So I exported my events from Event Viewer and uploaded them to my Hortonworks environment as a file named 'MyAppEvents.csv'.

In the sample Pig script given below, Pig is unable to determine the schema, as you can see in the output window below with the message "Schema for events unknown".

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',');
Describe Events;
Dump Events;

Now we will provide a schema to the same Pig script and see what happens with the schema definition in place.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);
Describe Events;

Now assume we would like to access only Level, EventID, and TaskDescription; for that we need to use FOREACH.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);
Describe Events;
Result = FOREACH Events GENERATE Level,EventID, TaskDescription;
Dump Result;

This will produce results like those shown, and now we will move on to understanding tuples.

Tuple:

Now we will understand more about tuples.

A tuple is an ordered set of fields. It’s most often used as a row in a relation. It’s represented by fields separated by commas, all enclosed by parentheses.

Each field in a tuple can be of a different data type. Constants are written in single quotes, and fields are delimited by commas.

Example:

(Siva,33,'M',Chennai)

Other Note:

A tuple can have its own schema, if required, to describe the fields in it. This helps the end user determine the data types expected in the tuple.
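
A minimal sketch of declaring such a schema for a tuple field (the file name, field names, and types are illustrative):

-- each record of people.txt is loaded as a single tuple field with named, typed parts
People = LOAD 'people.txt' as (info:tuple(name:chararray, age:int, gender:chararray, city:chararray));
Names = FOREACH People GENERATE info.name;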

Fundamentals of Pig

Introduction:

Pig is a high-level scripting platform for analyzing large data sets. It is an abstraction built on top of Hadoop. It consists of the domain-specific dataflow language Pig Latin and a translation engine that converts Pig Latin into MapReduce jobs. It uses familiar keywords such as JOIN, GROUP, and FILTER. It has been a Hadoop subproject since 2007.

What do I need to work with Pig:

You need a Windows or Linux environment with Hadoop and Java 1.6 or above. It is easiest to get started with the Cloudera or Hortonworks distribution of Hadoop.

Running Pig:

You can run Pig commands or statements in local mode or MapReduce mode. In local mode, everything runs on the local host against the local filesystem. In MapReduce mode, Pig needs access to a Hadoop cluster and an HDFS installation. MapReduce is the default mode of execution.
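
For example, the mode is chosen with the -x flag when launching Pig (the script name here is illustrative):

pig -x local myscript.pig       # local mode: single JVM, local filesystem
pig -x mapreduce myscript.pig   # MapReduce mode (the default): uses the Hadoop cluster and HDFS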

Big Picture in a simple way:


Structure of the Pig Latin Script: