Market Basket Analysis with Hadoop: Importing MySQL data to Hive using Sqoop

We have an existing data warehouse in MySQL, and from it we need the Product table and the sales fact tables for 1997 and 1998. In this post we will import these into HDFS for further analysis with Hive. Please go through the previous blog post to understand how to establish connectivity to MySQL using Sqoop.

We can start the import using the following statement:

sqoop import --connect jdbc:mysql://192.168.1.10/foodmart --table product --username root
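By default this writes the rows as comma-delimited text files under your HDFS home directory, using several parallel mappers. Both are easy to override with the standard --target-dir and -m options; the path and mapper count below are illustrative:

sqoop import --connect jdbc:mysql://192.168.1.10/foodmart --table product --username root --target-dir /user/hduser/product -m 1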

  • Now you can see that it has imported the data to HDFS in 50.2663 seconds, at 3.0000 KB/sec. If you issue the command hadoop dfs -ls, it will show an item added at /user/hduser/product.

A subsequent listing with hadoop dfs -ls /user/hduser/product reveals the following:

Since we will use Hive to analyze the data, we will import the data again using the --hive-import option. When we do, the following sequence of things happens:

  1. First, the data is imported to HDFS.
  2. Sqoop then generates a Hive script to load the data from HDFS into Hive.
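Concretely, the generated script is roughly equivalent to the HiveQL below. This is a simplified sketch: the real script derives the full column list from the MySQL table definition, while here it is shortened to two columns, and the path is the one from our earlier import.

CREATE TABLE IF NOT EXISTS product (
  product_id INT,
  product_name STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'   -- Sqoop defaults to Hive-friendly delimiters here
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/hduser/product' INTO TABLE product;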

Because step 1 writes to the same HDFS location we used earlier, the import will fail when Sqoop finds that the product folder already exists. So we first remove it with the following statement:

hadoop dfs -rmr /user/hduser/product

Now we will import the data using Sqoop with the Hive option:

sqoop import --connect jdbc:mysql://192.168.1.10/foodmart --table product --username root --hive-import
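If you need more control over the Hive side of the import, the standard Sqoop 1.4.x options can be combined with --hive-import; the values below are illustrative:

sqoop import --connect jdbc:mysql://192.168.1.10/foodmart --table product --username root --hive-import --hive-table product --hive-overwrite

Here --hive-table names the target Hive table explicitly, and --hive-overwrite replaces the table contents instead of appending when the import is re-run.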

Once the import is complete you will see output like the following:

Now we will go ahead and check the data in Hive using show tables and describe product:
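With the import done, a quick sanity check from the hive shell might look like this (product_name is a column of the FoodMart product table; the count will be whatever your import produced):

show tables;
describe product;
select count(*) from product;
select product_name from product limit 5;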

In my next post we will import the remaining tables to be used for market basket analysis and start querying with Hive.


Why move data from RDBMS or existing data warehouse to HDFS?

In continuation of my previous post, many of you would have questions like the following:

  1. When there are traditional ways of doing market basket analysis with cubes on an Enterprise Data Warehouse (EDW), why do we need to move the data to HDFS?
  2. What kind of benefits does moving this RDBMS data to HDFS provide?
  3. Will it provide any cost savings?

This post is an analysis of possible use cases that could complement your data warehousing and analytics strategy. With all the hype around big data and related technologies, it's important to understand what to use and when. Here are some good reasons for using Hadoop to complement the data warehouse:

  1. Using tools like Sqoop to move the data into an HDFS environment provides the following benefits:
    1. Storage of extremely high-volume data on Hadoop infrastructure.
    2. Accelerated data movement via nightly batches running as MapReduce tasks.
    3. Automatic, redundant backup thanks to HDFS's naturally fault-tolerant data nodes.
    4. Low-cost commodity hardware for scalability.
  2. Moving the structured data to HDFS enables you to analyze it in relation to unstructured or semi-structured data such as tweets, blogs, etc.
  3. There is no need to model the data up front, as a schema can be applied while the data is being read (schema-on-read; see the sketch after this list).
  4. It also gives you the ability to do quick exploratory analytics before moving the data to the warehouse for final analysis.
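Point 3 above is the schema-on-read idea: the raw files stay exactly as they landed, and a table definition is only projected over them at query time. A minimal Hive sketch, where the path and the single-column layout are purely illustrative:

-- Project a table over raw files already sitting in HDFS; nothing is copied or transformed
CREATE EXTERNAL TABLE tweets_raw (raw_line STRING)
LOCATION '/data/raw/tweets';

-- Explore immediately, with no upfront modeling or ETL
SELECT count(*) FROM tweets_raw WHERE raw_line LIKE '%discount%';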

You can look at the following papers for more information and a detailed understanding:

http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Using_Cloudera_to_Improve_Data_Processing_Whitepaper.pdf

http://www.b-eye-network.com/blogs/eckerson/archives/2013/02/what_is_unique.php

http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/?type=WP

Connecting to MySQL via Sqoop – List Tables

In continuation of my posts on market basket analysis, I will take the next steps toward the analytics using the data available in the FoodMart dataset, which you can download from https://sites.google.com/a/dlpage.phi-integration.com/pentaho/mondrian/mysql-foodmart-database. Before moving on, it is important to understand a few things about connecting to MySQL from Sqoop, since retail data grows big quickly and is a natural fit for the big data stack. Here are the steps:

    1. Download the JDBC driver that Sqoop needs to interact with MySQL from the following URL: http://dev.mysql.com/downloads/connector/j/
    2. Fetch mysql-connector-java-5.1.25.tar.gz either using wget, or download it from your Windows machine if you are connected via VirtualBox or VMware.
    3. Extract the archive to get the mysql-connector-java-5.1.25-bin.jar file and place it under the sqoop/lib folder.
    4. Make sure you have the necessary MySQL server information: hostname, username and a password with the necessary access.
    5. Once you have that, make sure you have granted the necessary privileges for other hosts to access the MySQL server using the following statement:

grant all privileges on *.* to 'username'@'%' identified by 'userpassword';
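Before involving Sqoop, it is worth confirming that the grant works from the Hadoop machine with the stock mysql client (the host, user and database below are the ones from this walkthrough; substitute your own):

mysql -h 192.168.1.32 -u username -p -e "show tables" foodmart

If that prompts for the password and prints the FoodMart table list, Sqoop should be able to connect as well.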

  • Then you can get the list of tables from the MySQL database foodmart using the following command:

sqoop list-tables --connect jdbc:mysql://192.168.1.32:3306/foodmart --username root
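Note that the command above supplies no password at all, which only works because of the wide-open grant; a slightly safer habit is to let Sqoop prompt for the password on the console with the -P flag:

sqoop list-tables --connect jdbc:mysql://192.168.1.32:3306/foodmart --username root -P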

Note: I ran this experiment with Sqoop 1.4.3 on Ubuntu 12.04 LTS in VirtualBox, against MySQL 5.5.24 running under WAMP.

Caution: In my example I have used root as the username; please don't use the root account in your own setup.

Other links for your reference:

http://www.devx.com/Java/hadoop-sqoop-to-import-data-from-mysql.html

http://www.datastax.com/docs/datastax_enterprise2.0/sqoop/sqoop_demo