Top 25 words in the Thirukkural, using Hadoop

Tools used:

Hadoop, HDFS, Hive, Eclipse, PuTTY, WinSCP, Excel

The Process:


Data Source:


Java Program:

In addition to the logic in the traditional Hadoop WordCount example, the mapper adds the line line = line.replaceAll("[\\d.]", ""); to strip the numbers and decimal points from the text file.
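For reference, here is a minimal sketch of the mapper with that line in place, following the structure of the standard Hadoop WordCount example (in which the mapper is a static nested class of the job driver; class and field names here are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Strip digits and decimal points (the couplet numbers) before tokenizing
        String line = value.toString().replaceAll("[\\d.]", "");
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);  // emit (word, 1) for the reducer to sum
        }
    }
}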

Using the following command, create an external table over the job output at the given location:

create external table thirukural(word string, count bigint) row format delimited fields terminated by '\t' location '/user/hduser/out.txt';

The row format clause is needed because the WordCount output is tab-separated, while Hive's default field delimiter is Ctrl-A.

Describe the table created:
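For example:

describe thirukural;

This should list the two columns defined above: word (string) and count (bigint).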

Order the rows by count and word, and write the result to a file in HDFS:

insert overwrite directory '/user/hduser/out.txt/result.txt' select * from thirukural order by count, word;

Result after completing the map and reduce phases:

Hive output:


After a little refining of the data in Excel, the final result:

Word        Count
படும்        42
தரும்        37
இல்         32
கெடும்       28
என்னும்      24
இல்லை       22
செயல்       22
எல்லாம்      21
தலை        21
காமம்       20
கொளல்       20
பவர்        19
பெறின்       19
அரிது       18
இன்பம்       18
உலகு        18
கண்ணும்      18
தவர்        18
யவர்        18
விடல்       18
விடும்       17
கண்         16
செயின்       16
தற்று        16

Why move data from an RDBMS or existing data warehouse to HDFS?

In continuation of my previous post, many would have questions like the following:

  1. When there are traditional ways of doing market-basket analysis with cubes on an Enterprise Data Warehouse (EDW), why do we need to adopt this route of moving data to HDFS?
  2. What kind of benefits does someone get by moving RDBMS data to HDFS?
  3. Will it provide any cost savings?

This post is merely an analysis of possible use cases that could complement your data warehousing and analytics strategy. With all the hype around big data and related technologies, it is important to understand what to use and when to use it. Here are good reasons for using Hadoop to complement the data warehouse:

  1. Using tools like Sqoop to move the data into an HDFS environment provides the following benefits (see the example command after this list):
    1. Storage of extremely high-volume data on Hadoop infrastructure.
    2. Accelerated data movement in nightly batches, run as MapReduce tasks.
    3. Automatic, redundant backup through HDFS's naturally fault-tolerant data nodes.
    4. Low-cost commodity hardware for scalability.
  2. Moving structured data to HDFS enables analyzing it in relation to unstructured or semi-structured data such as tweets, blogs, etc.
  3. There is no need to model the data up front, since it can be handled while it is being read (schema-on-read).
  4. This also gives you the ability to do quick exploratory analytics before moving data to the warehouse for final analysis.
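As a concrete sketch of point 1, a typical Sqoop import of an RDBMS table into HDFS looks like the following; the JDBC URL, credentials, table, and target directory are hypothetical placeholders:

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table transactions \
  --target-dir /user/hduser/sales/transactions \
  --num-mappers 4

Sqoop runs the import as a map-only MapReduce job, which is what makes the parallel, batch-oriented data movement described above possible.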

You can look at the following papers for more information and a detailed understanding:

http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Using_Cloudera_to_Improve_Data_Processing_Whitepaper.pdf

http://www.b-eye-network.com/blogs/eckerson/archives/2013/02/what_is_unique.php

http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/?type=WP