NLTK – Use Cases

I was starting new research on chatbots and ended up putting effort into understanding NLTK, the Natural Language Toolkit for Python. This toolkit helps to simplify the effort involved in NLP. There are excellent YouTube video tutorials on NLTK; you can look at the series by sentdex, which covers a great deal, from installation through sentiment analysis.

In this post I will attempt to share my thoughts on how we can use NLTK to solve different use cases.

Recommendation: Content recommendations can be made based on similarity, which in turn can be calculated using semantic similarity and lexical similarity. A lot more can be explored, such as cosine similarity.
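As a minimal sketch of the lexical side, here is how cosine similarity between two texts can be computed from bag-of-words counts. Plain str.split is used for tokenization here to keep the example self-contained; in practice NLTK's word_tokenize and stopword lists would do a better job, and the two texts are made-up examples:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

score = cosine_similarity("the cat sat on the mat", "the cat lay on the rug")
print(round(score, 3))  # shared words push the score well above 0
```

The same scoring works on any vector representation, so the tokenizer can later be swapped for an NLTK one without touching the similarity function.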

Sentiment Analysis: Sentiment analysis can be used to determine the author's attitude toward the content, for example by scoring the text against a dictionary of positive words using simple scoring techniques.
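A simple version of that scoring idea can be sketched as follows. The word lists here are tiny, made-up stand-ins; a real analysis would tokenize with NLTK's word_tokenize and use a full opinion lexicon:

```python
# Hypothetical mini-lexicons; real analyses use much larger opinion word lists.
POSITIVE = {"good", "great", "excellent", "happy", "love"}
NEGATIVE = {"bad", "poor", "terrible", "sad", "hate"}

def sentiment_score(text):
    """+1 for each positive word, -1 for each negative word."""
    words = text.lower().split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

print(sentiment_score("the tutorial was great and I love the examples"))  # → 2
```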

N-Gram Analysis: By tokenizing the content into n-grams, we can analyze word sequences and co-occurrence patterns in large bodies of text.
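N-grams are produced by sliding a window over the token list. NLTK offers nltk.ngrams for this; a plain-Python equivalent is shown here:

```python
def ngrams(tokens, n):
    """Return successive n-token windows from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```

Counting how often each tuple appears then gives the frequency profile used in large-text analysis.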

NLTK is a very powerful tool, which can be used for extensive programming on natural-language text. It also has a package, nltk.chat, which can be used for building simple chatbots.

Step by Step: Accessing a Wikipedia dump using MongoDB and PHP for data analytics

In an effort to evaluate how to handle unstructured data using MongoDB, this post covers extracting data from a Wikipedia dump, importing it into MongoDB, and accessing it using the PHP MongoDB client library, with the eventual goal of doing behavior analytics on the data.

Objective: Retrieve the wiki dump, import it into MongoDB, and retrieve the data using MongoClient.

Prerequisites: The following software is needed to follow the steps in this post.

Step by Step Procedure: Importing the Tamil Wikipedia dump into MongoDB.

Step 1: First, download the Wikipedia dump from the website. In this case I downloaded the Tamil wiki dump from the following location:

Step 2: I wanted to analyze the articles in the Tamil Wikipedia, so I downloaded the file tawiki-20131204-pages-articles.xml.bz2, which is 68.5 MB in size.

Step 3: You will also need the appropriate libraries for the MongoClient and bzopen functionality. Please download and set them up as follows.

Step 3a: MongoClient library installation: look at the Amazon S3 link location from which I downloaded the driver.

Note: Extract that zip file and copy the appropriate version of the DLL to the ext folder of your PHP installation. Make sure to add extension=php_mongo-1.4.5-5.4-vc9-x86_64.dll to php.ini (I was running this on a 64-bit OS; choose appropriately).

Step 3b: BZip installation: refer to this link for setting up the BZip library in PHP.


a. BZip2.dll has to be copied from the BZip installation folder to the PHP folder (refer to the picture given below).

b. Make sure to add extension=php_bz2.dll to php.ini or uncomment the same if it exists.

Step 4: Make sure MongoDB is running, started from the command-line interface as shown in the picture below:

Step 5: Enough of the setup; now it is time to import the data from the wiki dump into MongoDB. Please follow the instructions for executing the PHP import file from the command line. We will use the PHP code from James Linden, from the URL provided here.

Download the PHP script and place it under your WAMP/XAMPP folder. Make sure you change the following settings in the script:

$dsname = 'mongodb://localhost/wp20130708';
$file = 'enwiki-20130708-pages-articles.xml.bz2';
$log = './';

Then go to the command prompt and run the script with the PHP executable.

I am running the file from the PHPScript folder but executing it with the PHP binary from the PHP bin folder. A log file will also be created when the execution is complete.

Step 6: We will verify the imported data using the mongo client:

Step 7: The following PHP code connects to the MongoDB database that holds the wiki dump.

The following is the code, which I have modified:


<meta http-equiv="content-type" content="text/html;charset=utf-8" />

<?php
// PHP code to look at the Wikipedia data imported from the Wikipedia dump into MongoDB

$dbhost = 'mongodb://localhost/tawiki';
$dbname = 'tawiki';

// Connect to the localhost
$m = new MongoClient($dbhost);

// Select the database "tawiki"
$db = $m->selectDB($dbname);

// Display the database name
echo ('<b>Collections in </b>' . $dbname . '<br/><hr/>');

// Let's retrieve the data in the collection "page"
$collection = new MongoCollection($db, 'page');

//$query = array(); // uncomment to match all documents instead
$query = array('revision.contributor.username' => 'Seesiva');

$cursor = $collection->find($query);
$mycount = $cursor->count();

echo ('<b>Title contributions for the query</b> - total results found: ' . strval($mycount) . '<br/><hr/>');

// Retrieve the titles contributed by the user
foreach ($cursor as $doc) {
    echo ($doc['title'] . '<br/>');
}
?>

Step 8: Find the results

Top 25 words used in Thirukkural using Hadoop

Tools used:

Hadoop, HDFS, Hive, Eclipse, PuTTY, WinSCP, Excel

The Process:

Data Source:

Java Program:

In addition to the traditional WordCount Hadoop example, I also added the line line = line.replaceAll("[\\d\\.]", ""); to eliminate the numbers and decimals in the text file (the character class matches any digit or dot).
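The overall pipeline (strip digits and dots, count words, take the top k) can be sketched outside Hadoop in a few lines of Python, which is handy for sanity-checking the MapReduce output on a small sample (the sample text here is made up):

```python
import re
from collections import Counter

def top_words(text, k):
    """Strip digits/decimals, split into words, return the k most common."""
    cleaned = re.sub(r"[\d.]", "", text)  # same effect as the Java replaceAll
    return Counter(cleaned.lower().split()).most_common(k)

sample = "verse 1.1 speaks of virtue and verse 1.2 speaks of virtue again"
print(top_words(sample, 3))
```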

Using the following command, create an external table that reads the MapReduce output in the given location:

create external table thirukural(word string, count bigint) row format delimited fields terminated by '\t' location '/user/hduser/out.txt';

Describe the table created:

Order the words by count and words and write to a file in HDFS:

insert overwrite directory '/user/hduser/out.txt/result.txt' select * from thirukural order by count, word;

Result for completing the Map and Reduce:

Hive output:

After a little bit of refining the data in Excel, the final result:

Word Count

Market Basket Analysis with Hadoop: Importing MySQL data to Hive using Sqoop

We have an existing data warehouse in MySQL, from which we will need the following tables: the product table and the sales fact tables for the years 1997 and 1998. We will take steps to import these into HDFS for further analysis using Hive. Please go through my previous blog post to understand how to establish connectivity with MySQL using Sqoop.

We can start import using the following statement:

sqoop import --connect jdbc:mysql:// --table product --username root

  • Now you can see that it has imported the data to HDFS in 50.2663 seconds, at 3.0000 KB/sec. If you issue the command hadoop dfs -ls, it will show an item added: /user/hduser/product

A subsequent query with hadoop dfs -ls /user/hduser/product reveals the following:

Since we will use Hive to analyze the data, we will import the data again using the --hive-import option. If we do that, the following sequence of things will happen:

  1. First, the data will be imported to HDFS.
  2. Then Sqoop generates Hive scripts to load the data from HDFS into Hive.

So we need to remove the product folder that was already imported to HDFS through Sqoop, as the Hive import will fail when it finds that the folder exists. We will remove it using the following statement:

hadoop dfs -rmr /user/hduser/product

Now we will import the data using Sqoop using the hive option:

sqoop import --connect jdbc:mysql:// --table product --username root --hive-import

Once the import is complete you will see something like the below:

Now we will go ahead and check the data in Hive using show tables and describe product:

In my next post we will import the remaining tables to be used for market basket analysis and start querying with Hive.

Why move data from RDBMS or existing data warehouse to HDFS?

In continuation of my previous post, many would have questions like the following:

  1. When there are traditional ways of doing market basket analysis with cubes on an Enterprise Data Warehouse (EDW), why do we need to adopt this route of moving data to HDFS?
  2. What kind of benefits will someone get by moving this RDBMS data to HDFS?
  3. Will it provide any cost savings?

This post is a brief analysis of possible use cases that could complement your data warehousing and analytics strategy. With the big hype around big data and related technologies, it is important to understand what to use and when. Here are good reasons for using Hadoop to complement the data warehouse:

  1. Using tools like Sqoop to move the data to an HDFS infrastructure environment will provide you the following benefits:
    1. Storage of extremely high-volume data with the help of the Hadoop infrastructure.
    2. Accelerated data movement in nightly batches with the help of MapReduce tasks.
    3. Automatic and redundant backup with the help of HDFS’s naturally fault-tolerant data nodes.
    4. Low-cost commodity hardware for scalability.
  2. Moving structured data to HDFS enables analyzing it in relation to unstructured or semi-structured data such as tweets, blogs, etc.
  3. It is not necessary to model the data up front, as the schema can be applied while the data is being read.
  4. It also provides the capability to do quick exploratory analytics before moving data to the warehouse for final analysis.

You can look at the following papers for more information and a detailed understanding:

Connecting to MySQL via Sqoop – List Tables

In continuation of my posts on market basket analysis, I will continue toward the analytics using the data available in the FoodMart dataset, which you can download from this URL. Before moving on to the next steps, it is important that we understand certain things about connecting to MySQL from Sqoop, since we are focusing on big data, and retail data is always big. Here are the steps.

    1. Download the JDBC driver for Sqoop to interact with MySQL using the following URL:
    2. Make sure you download mysql-connector-java-5.1.25.tar.gz, either using wget or from your Windows machine if you are connected via VirtualBox or VMware.
    3. Extract the files to get mysql-connector-java-5.1.25-bin.jar and place it under the sqoop/lib folder.
    4. Make sure you have the necessary MySQL server information, such as hostname, username and password, with the necessary access.
    5. Once you have that, make sure you have granted the necessary privileges for other hosts to access the MySQL server using the following statement:

grant all privileges on *.* to 'username'@'%' identified by 'userpassword';

  • Then you can get the list of tables from the MySQL database foodmart using the following command:

sqoop list-tables --connect jdbc:mysql:// --username root

Note: I did this experiment with Sqoop 1.4.3 and MySQL 5.5.24 (with WAMP), on Ubuntu 12.04 LTS in VirtualBox.

Caution: In my example I used root as the username; in practice, please don't use the root username.

Other Links for your references:

Recommendation in Retail

So you go to a shop, you see that a specific brand of deodorant and a bathing bar have been bundled as one product and displayed with a specific discount, and you hand-pick it with immense happiness and the satisfaction of a good deal. How did the shopkeeper come to know about this? Intuition, analytics, case-based reasoning, pattern matching, etc.

Whether it is Walmart, Target, Macy's, TESCO or even a small self-owned retail outlet, it is important that they understand customer/consumer behavior correctly to make a good profit at the end of the day. And it is not only the retail industry; it is very important for services-based organizations, too, to understand consumer behavior.

For ease of understanding, and to move toward the practical aspects of such an implementation, let us look at some of the factors that could influence a recommendation:

  • Demography (city, locality, country, etc.) (transactions)
  • Culture (transactions)
  • Product mix based on past sales history (transactions)
  • Social recommendations (Twitter and Facebook posts) (social analytics/NoSQL/semi-structured)
  • Product reviews (blog/review/semi-structured data)
  • Post-sales experience (transactions)

The challenge is to relate these data sources and to make good recommendations through the system in a very short span of time, so as to influence the customer's buying decision. In my next post we will evaluate some of the datasets available on the internet for further experiments.

My aim is to understand and implement a recommendation system, or at least arrive at the right steps for building one that is reliable and can handle the complexity involved in the data.

Stay tuned for the next post.