Market Basket Analysis with Hadoop: Importing MySQL Data to Hive Using Sqoop

We have an existing data warehouse in MySQL, from which we will need the following tables: the product table and the sales fact tables for the years 1997 and 1998. We will import them into HDFS for further analysis with Hive. Please go through the previous blog post to understand how to establish connectivity with MySQL using Sqoop.
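If you have not pointed Sqoop at this database before, you can first confirm connectivity and list the available tables with sqoop list-tables, using the same connection string as the imports below:

sqoop list-tables --connect jdbc:mysql://192.168.1.10/foodmart --username root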

We can start the import with the following statement:

sqoop import --connect jdbc:mysql://192.168.1.10/foodmart --table product --username root

  • Sqoop reports that it imported the data to HDFS in 50.2663 seconds at 3.0000 KB/sec (your timings will differ). If you issue the command hadoop dfs -ls, it will show a new item at /user/hduser/product.
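If your MySQL account needs a password, or you want to control the degree of parallelism, the same import also accepts -P (prompt for the password at the console) and -m (number of map tasks); for example:

sqoop import --connect jdbc:mysql://192.168.1.10/foodmart --table product --username root -P -m 4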

Listing the directory with hadoop dfs -ls /user/hduser/product shows the part files written by the individual map tasks (part-m-00000, part-m-00001, and so on).
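To spot-check the imported rows, you can print the beginning of the first part file; with Sqoop's defaults the records are plain comma-delimited text:

hadoop dfs -cat /user/hduser/product/part-m-00000 | head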

Since we will use Hive to analyze the data, we will import the data again, this time with the --hive-import option. With that option, the following sequence of things happens:

  1. First, the data is imported to HDFS.
  2. Sqoop then generates a Hive script to create the table and load the data from HDFS into Hive (sketched below).
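Roughly, the generated script amounts to something like the following; the column list here is abbreviated and illustrative, since Sqoop derives it from the MySQL schema, and \001 (Ctrl-A) is Sqoop's default field delimiter for Hive imports:

CREATE TABLE IF NOT EXISTS product (product_id INT, product_name STRING, ...) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
LOAD DATA INPATH '/user/hduser/product' INTO TABLE product;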

So, before re-importing, we need to remove the product folder that the earlier Sqoop run created in HDFS; otherwise the import will fail, because Sqoop will find that the output directory already exists when it tries to import to Hive. We will remove it with the following statement:

hadoop dfs -rmr /user/hduser/product
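Note that on newer Hadoop releases the -rmr switch is deprecated; the equivalent command there is:

hdfs dfs -rm -r /user/hduser/product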

Now we will import the data with Sqoop using the --hive-import option:

sqoop import --connect jdbc:mysql://192.168.1.10/foodmart --table product --username root --hive-import
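Sqoop also accepts --hive-table to load into a Hive table with a different name, and --create-hive-table to make the job fail if the target table already exists instead of appending to it. For example (the table name here is just an illustration):

sqoop import --connect jdbc:mysql://192.168.1.10/foodmart --table product --username root --hive-import --hive-table foodmart_product --create-hive-table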

Once the import is complete, Sqoop's output will confirm that the table was created in Hive and the data loaded into it.

Now we will go ahead and check the data in Hive by running show tables and describe product from the Hive shell:
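For example, from the Hive shell (the final count query is an optional extra sanity check on the number of rows imported):

hive> show tables;
hive> describe product;
hive> select count(*) from product;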

In my next post we will import the remaining tables to be used for the market basket analysis and start querying with Hive.