Connecting to MySQL via Sqoop – List Tables

Continuing my posts on market basket analysis, the next step toward the analytics is to use the data available in the FoodMart dataset, which you can download from this url. Before moving on, it is important to understand a few things about connecting to MySQL from Sqoop, since we are focusing on big data and retail data is always big. Here are the steps.

    1. Download the JDBC driver that Sqoop needs to interact with MySQL using the following URL:
    2. Make sure you have downloaded mysql-connector-java-5.1.25.tar.gz, either with wget or from your Windows machine if you are connected via VirtualBox or VMware.
    3. Extract the archive to get the mysql-connector-java-5.1.25-bin.jar file and place it under the sqoop/lib folder.
    4. Make sure you have the necessary MySQL server information: hostname, username and password with the necessary access.
    5. Once you have that, make sure you have granted the privileges needed for other hosts to access the MySQL server using the following statement:

grant all privileges on *.* to 'username'@'%' identified by 'userpassword';

  • Then you can get the list of tables from the MySQL database foodmart using the following command:

sqoop list-tables --connect jdbc:mysql:// --username root

Note: I ran this experiment with Sqoop 1.4.3 and Ubuntu 12.04 LTS on VirtualBox, and MySQL 5.5.24 with WAMP.

Caution: In my example I have used root as the username for convenience; please do not use the root account in practice.

Recommendation in Retail

So you go to a shop, you see that a specific brand of deodorant and a bathing bar are bundled as one product and displayed with a specific discount, and you pick it up with immense happiness and the satisfaction of a good deal. How do the shopkeepers come to know about this? Intuition, analytics, case-based reasoning, pattern matching, and so on.

Whether it is Walmart, Target, Macy's, TESCO or even a small self-owned retail outlet, it is important that they understand customer/consumer behavior correctly to make a good profit at the end of the day. And this is not only useful in the retail industry: understanding consumer behavior is just as important for services-based organizations.

For ease of understanding, and to move toward the practical aspects of such an implementation, let us look at some of the factors that could influence a recommendation.

  • Demography (city, locality, country, etc.) (transactions)
  • Culture (transactions)
  • Product mix based on past sales history (transactions)
  • Social recommendations (Twitter and Facebook posts) (social analytics/NoSQL/semi-structured)
  • Product reviews (blog/review/semi-structured data)
  • Post-sales experience (transactions)

The challenge is to relate these data sources and to make good recommendations through the system in a very short span of time, so as to influence customers' buying decisions. In my next post we will evaluate some of the datasets available on the internet for further experiments.

My aim is to understand and implement a recommendation system, or at least arrive at the right steps for building one that is reliable and can handle the complexity involved in the data.

Watch for the next post.

Introduction to Market Basket Analysis

Market basket analysis (also known as association analysis) is a mathematical modeling technique based on the theory that if you buy a certain group of items, you are likely to buy another group of items. It is used to analyze consumer purchasing behavior, and it helps increase sales and maintain inventory by focusing on point-of-sale (POS) transaction data. The Apriori algorithm is commonly used to achieve this.

Apriori Algorithm

This algorithm is used to identify patterns in data. It is based on observing which items occur together within transactions.


If a person goes to a gift shop and purchases a birthday card and a gift, it is likely that he might also purchase a cake, candles or candy. Such combinations help the retail shop owner predict likely purchase combinations, so as to club or package them as offers and make better margins. This also helps in understanding consumer behavior.

When we look at the Apriori algorithm, it is essential to understand association rules too; that will put the algorithm in the right perspective.

Association rule learning is a popular machine learning technique in data mining. It helps to understand relationships between variables in large databases, and it is primarily applied to point-of-sale data in retail, where large numbers of transactions are recorded.
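The two basic measures behind association rules are support (how often an itemset appears in the transactions) and confidence (how often the consequent appears given the antecedent). A minimal sketch of both, using hand-made toy transactions in the spirit of the gift-shop example above (the items and baskets are invented for illustration):

```python
# Toy transactions, one set of items per basket (invented data).
transactions = [
    {"birthday card", "gift", "cake"},
    {"birthday card", "gift", "candles"},
    {"birthday card", "candy"},
    {"gift", "cake"},
    {"birthday card", "gift", "cake", "candles"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent: support(both) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

# Rule: birthday card -> gift
print(support({"birthday card", "gift"}))       # 0.6: the pair is in 3 of 5 baskets
print(confidence({"birthday card"}, {"gift"}))  # 0.75: 3 of the 4 card baskets have a gift
```

Apriori builds on exactly these measures: it repeatedly extends itemsets whose support stays above a chosen threshold, then keeps the rules whose confidence is high enough.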

Reference links for beginners:

I like this url; it is very simple and easy to understand for novices and beginners.

Reference links for Researchers and algorithm lovers:

This post is a precursor to using R and big data to apply market basket analysis for recommendation in the retail point-of-sale domain, or on billions of e-commerce transactions. In the upcoming posts we will see how to leverage this algorithm and do appropriate analysis on point-of-sale data. Keep watching this space.

Fundamentals of Pig: Working with Tuples

In the previous blog we uploaded the Windows event log to the Hadoop environment and started analyzing it using Pig. In this blog we will see how to work with tuples.

Filtering Data:

In the script below there is no filter applied, so it fetches all the tuples.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);

Describe Events;

Result = FOREACH Events GENERATE Level,EventID, TaskDescription;

Dump Result;

You can see one such example is highlighted in the picture given below.

Tuples of data can be filtered using the FILTER option in Pig.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);

Describe Events;

Result = FILTER Events BY EventID is not null;

Dump Result;

In the code snippet above, the events are filtered so that only the rows where EventID is not null remain; you can see the results.
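Conceptually, the FILTER above keeps each row only when its EventID field has a value. The same row-dropping logic can be sketched in plain Python over a few invented CSV lines (the sample events below are made up; the field names follow the schema used in this post):

```python
import csv
import io

# Invented sample rows in the same shape as MyAppEvents.csv.
# The second row has an empty EventID, standing in for Pig's null.
sample = """Error,01/01/2013,MyApp,1000,None,Crash on startup
Warning,01/02/2013,MyApp,,None,Slow response
Information,01/03/2013,MyApp,1001,None,Service started
"""

fields = ["Level", "DateTime", "Source", "EventID", "TaskCategory", "TaskDescription"]
events = [dict(zip(fields, row)) for row in csv.reader(io.StringIO(sample))]

# Equivalent of: Result = FILTER Events BY EventID is not null;
result = [e for e in events if e["EventID"]]

for e in result:
    print(e["Level"], e["EventID"], e["TaskDescription"])
```

The list comprehension plays the role of FILTER: rows with an empty EventID are discarded, and everything else flows through unchanged.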


More to come.

Pig: Exploring more on Schema and data models

Schema in Pig:

Schemas apply to both simple and complex data types and can be used wherever required. A schema can be specified with the LOAD, STREAM and FOREACH operators using the AS clause. We will see a case and an example below.

When we specify a schema, we can mention the field name and also its data type. If the data type is not mentioned while providing the schema, the field is automatically treated as bytearray, which can be cast to a different data type later if required. A field specified in the schema can be accessed by its name or by positional notation. We will see that in the example below.


I would like to do some analysis of the Event Viewer logs on my PC in the Pig environment, along with exploring tuples further. So I have exported my events from the Event Viewer and uploaded them to my Hortonworks environment as a file named 'MyAppEvents.csv'.

In the sample Pig script given below, Pig is unable to determine the schema, as you can see in the output window with the message "Schema for events unknown".

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',');
Describe Events;
Dump Events;

Now we will provide a schema to the same Pig script and see what happens with the new schema definition.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);
Describe Events;

Now assume we would like to access only the Level, EventID and TaskDescription fields; for that we need FOREACH.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);
Describe Events;
Result = FOREACH Events GENERATE Level,EventID, TaskDescription;
Dump Result;

This produces results like the ones shown, and now we will move on to understanding tuples.


Now, let us understand more about tuples.

A tuple is an ordered set of fields. It is most often used as a row in a relation. It is represented by fields separated by commas, all enclosed in parentheses.

Each field in a tuple can be of a different data type. Constants are written in single quotes, and fields are delimited by commas.
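These properties of a Pig tuple map quite naturally onto a Python tuple, which is also ordered, parenthesised, comma-delimited and positionally indexed. A tiny sketch with an invented event row:

```python
# A Pig tuple such as (Error,1000,Crash on startup) behaves much like
# a Python tuple: ordered fields, each possibly of a different type.
event = ("Error", 1000, "Crash on startup")

# Fields are addressed by position, like Pig's positional notation.
level = event[0]
event_id = event[1]
print(level, event_id)

# Mixed types within the one tuple are fine.
print(type(event[0]).__name__, type(event[1]).__name__)  # str int
```

The analogy is only conceptual, of course: in Pig the tuple lives inside a relation and is processed by the dataflow operators, but the ordered, mixed-type structure is the same.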



Other Note:

A tuple can have its own schema, if required, to describe the fields in it. This helps the end user determine the data types expected in the tuple.

Fundamentals of Pig


Pig is a high-level scripting platform for analyzing large data sets. It is an abstraction built on top of Hadoop. It comprises the domain-specific dataflow language Pig Latin and a translation engine that converts Pig Latin into MapReduce jobs. It uses familiar keywords such as JOIN, GROUP and FILTER. Pig has been a Hadoop subproject since 2007.

What do I need to work with Pig?

You need a Windows or Linux environment with Hadoop and Java 1.6 or above. The easiest way to get started is with the Cloudera or Hortonworks distribution of Hadoop.

Running Pig:

You can run Pig commands or statements in local mode or MapReduce mode. In local mode, all files are accessed from the local host and local filesystem. In MapReduce mode, we need access to a Hadoop cluster and an HDFS installation. MapReduce is the default mode of execution.

Big Picture in a simple way:

Structure of the Pig Latin Script: