NLTK – Use Cases

I was venturing into new research on chatbots, and ended up putting effort into understanding NLTK, the Natural Language Toolkit for Python. This toolkit helps simplify the effort involved in NLP. There are excellent YouTube video tutorials on NLTK; one you can look at is by sentdex, which covers a lot of sentiment analysis right from installation.

In this post I will attempt to share my thoughts on how we can use NLTK to solve different use cases.

Recommendation: Content recommendations can be made based on similarity. This similarity can be calculated using semantic similarity or lexical similarity, and further measures such as cosine similarity can also be explored.
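As a toy illustration of one such measure, here is cosine similarity computed over simple word-count vectors. This is a plain-Python sketch; in practice NLTK tokenizers and proper preprocessing would be used, and the function name and example texts are my own invention:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Build word-count vectors and compute the cosine of the angle between them.
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("the cat sat", "the cat ran"))
```

A score near 1 means the texts share most of their vocabulary; 0 means no overlap at all.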

Sentiment Analysis: Sentiment analysis can be used to determine the author's attitude toward the content, for example by scoring words against a dictionary of positive and negative words using simple scoring techniques.
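A minimal sketch of such dictionary-based scoring follows. The word lists here are made up for illustration; NLTK ships a far more robust analyzer (VADER) for real use:

```python
# Toy positive/negative word lists for illustration only.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def sentiment_score(text):
    # +1 for each positive word, -1 for each negative word.
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("Great camera, but the battery is terrible"))
```

A positive total suggests a positive attitude, a negative total the opposite; real scoring would also handle negation, punctuation and intensity.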

N-Gram Analysis: By tokenizing the content into n-grams, we can analyze frequent word sequences in large bodies of text.
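NLTK provides this via nltk.util.ngrams; the idea itself is a sliding window over tokens, which a stdlib sketch can show:

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
bigrams = ngrams(tokens, 2)
# Count the most frequent bigram in the text.
print(Counter(bigrams).most_common(1))  # [(('to', 'be'), 2)]
```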

NLTK is a very powerful tool which can be used for extensive programming pertaining to natural text. It also has a package, nltk.chat, which can be used for building chatbots.
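In the spirit of NLTK's chat utilities, a rule-based chatbot can be sketched as regular-expression pattern/response pairs. This is a toy illustration with invented rules, not the NLTK API itself:

```python
import re

# (pattern, response) pairs; the first pattern that matches the whole message wins.
RULES = [
    (r"hi|hello|hey", "Hello! How can I help you?"),
    (r"my name is (.+)", r"Nice to meet you, \1!"),
    (r".*", "Sorry, I did not understand that."),
]

def respond(message):
    for pattern, response in RULES:
        match = re.fullmatch(pattern, message.lower())
        if match:
            return match.expand(response)

print(respond("my name is Ravi"))
```

NLTK's own chat module works on the same pattern/response principle, with reflection of pronouns added on top.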


Apache NiFi for Real-time scenarios in IoT – Part 1

In this two-part series, I will try to share why we chose Apache NiFi for handling real-time data flow in an IoT implementation.

One of our IoT projects had near real-time data processing requirements with multiple integration touch points, so I evaluated different options such as Kapacitor, Apache Storm and Apache Kafka. When I encountered Apache NiFi, a data flow engine originally used by the NSA, I was curious to explore it. Initially I thought it might be a complex attempt, but it turned out to be easy going once we started exploring. So before sharing my use cases on when to use Apache NiFi from my own context, let's quickly establish what real-time data processing is.

Real-Time Data Processing:

Real-time data processing deals with streams of data arriving at a very high rate that must be processed to gain insights. The term “real-time” itself is subjective and depends on the context or usage; ideally, we process the data with near-zero latency.


The following were some of the challenges we were encountering in a typical IoT Implementation:

  1. Need to track the flow of data across the information value chain
  2. Once data is ingested into the processing flow there could be different data processing requirements such as:
    1. Validation
    2. Threshold Checks
    3. Initiating business events
    4. Alerting
    5. Aggregation
  3. Need to make sure the data flow is seamless, and if there are problems they can be isolated without flows impacting each other
  4. Ability to handle different protocols and formats such as MQTT, HTTP and JSON
  5. Integration requirements through APIs, and validation needs with regular expressions
  6. Handling DB operations along the data flow was also a key requirement
  7. The performance needs to be optimal to manage the flow requirements
  8. Parallelism across different data flow points was also a key consideration
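To make the validation, threshold-check and alerting requirements above concrete, here is a toy pipeline sketch in plain Python (not NiFi; device names, record shapes and the threshold value are all invented for illustration):

```python
THRESHOLD = 75.0  # hypothetical sensor limit

def validate(reading):
    # A reading must carry a device id and a numeric value.
    return "device" in reading and isinstance(reading.get("value"), (int, float))

def process(readings):
    alerts = []
    for r in readings:
        if not validate(r):
            continue  # isolate bad data without stopping the rest of the flow
        if r["value"] > THRESHOLD:
            alerts.append(f"ALERT {r['device']}: {r['value']}")
    return alerts

stream = [{"device": "s1", "value": 80.2}, {"value": 12}, {"device": "s2", "value": 60.0}]
print(process(stream))  # ['ALERT s1: 80.2']
```

In NiFi each of these stages would be a processor on the canvas, with invalid records routed to a separate relationship instead of being silently skipped.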

Needless to say, there were other constraints on resources such as time and people. In the next part we will discuss what Apache NiFi is and how it handles these challenges.

Dashboard Analysis: Google Fit

This post examines the various aspects of the user interface of a web-based dashboard. It is an attempt to understand the little details and insights a dashboard can provide visually, which should help increase knowledge of dashboard design and development. The analysis is AS I SEE IT.

This dashboard shows excellent design using Material Design concepts such as paper cards. It also happens to be a good example of gamification for maintaining good health. The data is obtained from Android mobile phones via accelerometer sensor data integration.

As soon as you log in you will see the dashboard shown below, which gives the following insights.

  1. How much you have walked against the goal: Goals vs Achieved
  2. How much each activity contributes toward the goal, shown in a circle with different color indicators
  3. Data in different units of measurement on a given day, such as Minutes, Distance, Calories & Steps
  4. A list of recent activities, ordered by date and combined with the activity type
  5. Personal records showing the key achievements
  6. A summary of distance travelled

Dashboard across periods:

  • If we look further down, it offers the ability to analyse data by Day, Week and Month.
  • You can also analyse the data by Time, Steps, Distance, Calories, Heart Rate and Weight.
  • The data can also be filtered by activity, such as Walk, Running, Cycling, etc.

Dashboard over a monthly calendar:

  • The day with the maximum activity is flagged nicely.
  • The activity summary is shown over a line bar.
  • The current date is displayed in a different color for easy identification.
  • The weekly total is provided on the extreme right of each week.

Beginners guide to ARIMA: Problem Identification & Data Gathering – Part 2

In continuation of my earlier post, I'm exploring ARIMA using an example. In this post we will go through each step in detail to see how we can accomplish ARIMA-based forecasting for a problem.

Step 1: Problem Identification or Our Scenario

We are going to consider past time series data on household power consumption and use that data to forecast with ARIMA. There is also a research paper, published in the Proceedings of the International MultiConference of Engineers and Computer Scientists 2013 Vol I, comparing the performance of ARMA and ARIMA on the same dataset. Our post will focus on accomplishing the forecast step by step using R on the same dataset, for the ease of beginners. Sooner or later we will also evaluate tools such as AutoBox and R that can be used for solving such problems.

Step 2: Data Gathering or Identification of dataset

The dataset we are going to use is the Individual household electric power consumption dataset, available in the UCI Repository under the URL: This is a multivariate dataset. Please check this link to understand the difference between univariate, bivariate and multivariate data.

Quick Summary of the Dataset:

  • Dataset contains data between December 2006 and November 2010
  • It has around 19.7 MB of Data
  • File is in .TXT Format
  • Columns in the Dataset are:
    • Date (DD/MM/YYYY)
    • Time (HH:MM:SS)
    • Global active power
    • Global Reactive power
    • Voltage
    • Global Intensity
    • Sub Metering 1
    • Sub Metering 2
    • Sub Metering 3

You can open this semicolon-delimited text file in Excel; after following the steps in the import wizard you will have an Excel sheet with the data as given below. I was able to load only up to 1,048,576 rows, which is Excel's per-sheet row limit, while the actual total number of rows in the text file is 2,075,260. Whoa..
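As an alternative to Excel, the full semicolon-delimited file can be read programmatically; a sketch with Python's csv module (the function name and the optional row limit are my own additions, and the column names are as listed above):

```python
import csv

def read_power_data(path, limit=None):
    # Parse the semicolon-delimited household power consumption file
    # into a list of dicts keyed by the header row's column names.
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter=";")
        rows = []
        for i, row in enumerate(reader):
            if limit is not None and i >= limit:
                break
            rows.append(row)
        return rows
```

Reading row by row like this avoids Excel's row limit entirely, and a `limit` can be passed while experimenting so the whole 2-million-row file need not be loaded.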


Next, for Step 3 (preliminary analysis), we can use R as a tool. To use R we need to load this data into it for analysis. For this, I can save the Excel sheet in CSV or XLS format and then import it into R as outlined in my other post or using this link. I'm using RStudio for the purpose and demonstrate the data loading process in the screenshots in the subsequent sections.

The first installation attempt showed an error, but the subsequent attempt to install gdata was successful. Now we can load the library using the command library(gdata). After that, we load the powerData variable with the data available in the CSV file for further analysis, and we can view the data using View. Please check the console window for the code.

In the next post we will do some preliminary analysis on this data which we have loaded.

Beginners guide to ARIMA: learn the ARIMA forecasting technique by example

The word “ARIMA” in the Tamil language means Lion.

Everybody is curious to know what the future holds, and it's always exciting to find out. Though there are various forecasting models available, in this post we will look at ARIMA. Welcome to the world of forecasting with ARIMA.

What is ARIMA?

ARIMA (Auto Regressive Integrated Moving Average) is a forecasting technique and a key tool in time series analysis. This link from Penn State University gives a good introduction to time series fundamentals.

What is the purpose?

To forecast. The book Forecasting: Principles and Practice gives a very good understanding of the whole subject. You can read it online.

What kind of business problems can it solve?

The following are some example use cases of ARIMA:

  • Forecast revenue
  • Forecast whether to buy a new asset or not
  • Forecast of currency exchange rates
  • Forecast consumption of energy or utilities
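To give a feel for the AR (auto-regressive) and I (integrated, i.e. differencing) components, here is a deliberately tiny sketch: difference the series once, fit an AR(1) coefficient by least squares, and forecast one step ahead. This is a toy illustration of the idea only, not real ARIMA estimation, which should be done with a proper package such as R's forecast functions; all numbers below are invented:

```python
def difference(series):
    # The "I" part: first-order differencing removes a linear trend.
    return [b - a for a, b in zip(series, series[1:])]

def fit_ar1(diffs):
    # The "AR" part: least-squares AR(1) coefficient through the origin,
    # phi = sum(x_t * x_{t-1}) / sum(x_{t-1}^2).
    num = sum(x * y for x, y in zip(diffs, diffs[1:]))
    den = sum(x * x for x in diffs[:-1])
    return num / den

def forecast_next(series):
    diffs = difference(series)
    phi = fit_ar1(diffs)
    # Predict the next difference, then undo the differencing.
    return series[-1] + phi * diffs[-1]

usage = [10.0, 12.0, 13.0, 15.0, 16.0, 18.0]  # made-up consumption values
print(round(forecast_next(usage), 2))  # 19.6
```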

What is mandatory to get started?

  1. It is very important to have clarity on what to forecast. For example, if you want to forecast revenue, whether it is for a product line, a demography, etc. has to be analysed before venturing into the actual task.
  2. The period or horizon for which the forecast is to be made is also crucial, for example monthly, quarterly or half-yearly.

What are the preferred pre-requisites on data for Time series forecasting?

Updated after comment from tomdireill:

  1. Data should be part of a time series, that is, data observed sequentially over time.
  2. It can be seasonal, meaning it has recurring highs and lows. As per the notes from Duke University, ARIMA can also be applied to flat, patternless data.
  3. It may have an increasing or decreasing trend.
  4. Outliers can be handled as outlined here.
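On the outlier point, one common way to flag candidates before modelling is a z-score check; a stdlib sketch (the 2-sigma cutoff and sample values are my own illustrative choices, not from the post):

```python
import statistics

def find_outliers(series, cutoff=2.0):
    # Flag points more than `cutoff` standard deviations from the mean.
    mean = statistics.mean(series)
    stdev = statistics.pstdev(series)
    if stdev == 0:
        return []
    return [x for x in series if abs(x - mean) / stdev > cutoff]

data = [10, 11, 9, 10, 12, 10, 11, 95]  # made-up readings with one spike
print(find_outliers(data))  # [95]
```

Flagged points would then be inspected and treated (removed, capped or imputed) before fitting the model.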

OK, now we understand what is essential to get started on forecasting. Before we delve in, let's work through the steps.

5 Steps towards forecasting:

  1. Problem identification: be clear about what to forecast and why.
  2. Data gathering: collect the relevant historical data.
  3. Preliminary analysis: plot the data and look for trends, seasonality and outliers.
  4. Choosing and fitting models: fit candidate models such as ARIMA.
  5. Using and evaluating the model: produce the forecast and measure its accuracy.

In the next post we will take up an example and work on the above steps one by one. Stay tuned.

Database synchronization needs of multi-location enterprises

Recently, during an interaction with one of our colleagues, a discussion came up about replicating the same database, or making it available, across multiple locations. With the various connectivity options that exist these days, and when everyone is talking about cloud-based apps and implementations, why does this need arise? This post is a search for an answer.

Business needs for multi-location enterprise solutions:

  1. Using one application across the enterprise ensures data integrity and a single version of the truth.
  2. Get visibility into what happens at other locations, manufacturing units or outlets.
  3. Helps to plan and react better based on the data insights available from other locations.
  4. Process control and improvement across the enterprise with a single solution.
  5. Lower training costs.

Challenges in accomplishing these business needs:

  1. Lack of connectivity or poor connectivity between the locations
  2. Higher bandwidth costs or complex internet solutions required to support the enterprise needs
  3. No control or process enablement in the locations or facilities
  4. Enterprise applications do not support multi-location scenarios with good control over data and process
  5. Processes and applications established at locations without understanding the impact of connectivity and process issues
  6. Limited accountability and responsibility at the locations in comparison with corporate or headquarters

Solutions or options are available for us:

  1. If we are very sure about connectivity and availability, we can adopt a cloud-based solution, which resolves the problem once and for all
  2. When there are connectivity issues, we might need to resort to database synchronization options, which would be more feasible for managing enterprise applications
  3. The key in these kinds of scenarios is to identify the following with respect to the data:
    1. Who is the data owner?
    2. Who has to create it?
    3. Where does it have to be created?
    4. Who is going to consume it?
    5. Is it required in real time?
    6. What controls are to be established on the data?


Near Realtime data in charts using AngularJS

Whenever we develop a web application for displaying real-time or near real-time charts, there is a need to refresh the page at regular intervals to fetch and display the data. This is essential for the following reasons:

  1. To ensure that we display the latest details to the end users
  2. To provide interactive updates, which give a good look & feel for the end users
  3. Helps avoid wrong decisions based on delayed information
  4. Helps in analyzing the current situation instead of past history

In one of the recent initiatives I was working on, I had to refresh a chart displaying the current data from a manufacturing facility in near real time. The user interface was developed with HTML5, CSS3 and AngularJS. I came across this post by Shahid Shaikh, which helps to refresh the DIV using an interval. A very useful post indeed.

I used ChartJS + AngularJS for my own implementation, with a REST API built on Web API 2 and the Dapper ORM to interact with my data warehouse.