Apache NiFi for Real-time Scenarios in IoT – Part 1

In this two-part series, I will share why we chose Apache NiFi for handling real-time data flow in an IoT implementation.

One of our IoT projects had near real-time data processing requirements with multiple integration touch points, and I was evaluating options such as Kapacitor, Apache Storm, and Apache Kafka. When I encountered Apache NiFi, a data flow engine originally developed at the NSA, I was curious to explore it. Initially I thought it might be a complex undertaking, but it turned out to be easy once we started exploring. So before sharing, from my own context, the use cases where Apache NiFi fits, let us quickly understand what real-time data processing is.

Real-Time Data Processing:

Real-time data processing typically deals with a stream of data arriving at a very high rate that needs to be processed to gain insights. The term “real-time” itself is subjective and depends on the context or usage; in practice, it means processing the data with near-zero latency.

Challenges:

The following are some of the challenges we encountered in a typical IoT implementation:

  1. Need to track the flow of data across the information value chain
  2. Once data is ingested into the processing flow, there could be different data processing requirements such as the following (a small sketch of such processing follows this list):
    1. Validation
    2. Threshold Checks
    3. Initiating business events
    4. Alerting
    5. Aggregation
  3. Need to make sure that the data flow is seamless, and that any problems can be isolated without flows impacting each other
  4. Need to handle different protocols and formats such as MQTT, HTTP, and JSON
  5. Integration requirements through APIs, and validation needs using regular expressions
  6. Handling DB operations along the data flow was also a key requirement
  7. The performance needs to be optimal to meet the flow requirements
  8. Parallelism across the different data flow points was also a key consideration
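
To make the processing requirements above more concrete, here is a minimal, NiFi-independent Python sketch of what a single message has to go through: JSON validation, a threshold check, and a simple alert. The field name "temperature", the topic names, and the threshold value are assumptions purely for illustration; in the real flow this kind of logic would sit behind an MQTT or HTTP listener.

```python
# A minimal sketch (not NiFi itself) of the per-message processing described above:
# validate a JSON payload, run a threshold check, and emit an alert.
# Field name, topics, and threshold are illustrative assumptions.
import json

TEMPERATURE_THRESHOLD = 75.0  # assumed alerting threshold

def process_message(topic: str, payload: bytes) -> None:
    # 1. Validation: the payload must be JSON with a numeric "temperature" field
    try:
        reading = json.loads(payload)
        temperature = float(reading["temperature"])
    except (ValueError, KeyError, TypeError):
        print(f"Invalid payload on {topic}: {payload!r}")
        return

    # 2. Threshold check and alerting
    if temperature > TEMPERATURE_THRESHOLD:
        print(f"ALERT on {topic}: {temperature} exceeds {TEMPERATURE_THRESHOLD}")

# Quick demonstration with two hypothetical sensor readings
process_message("sensors/boiler-1/telemetry", b'{"temperature": 82.4}')
process_message("sensors/boiler-2/telemetry", b'not-json')
```

Doing this by hand for every flow point is exactly the kind of plumbing we wanted the data flow engine to take over.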

Needless to say, there were also constraints on resources such as time and people. In the next part, we will discuss what Apache NiFi is and how it handles these challenges.


Identify duplicates and null values in a Column using Talend Open Studio for Data Quality

Writing this post after quite a long time, this time on verifying the quality of data in a column using data profiling techniques. Let’s take a simple example of a table called Country which has the columns Code and Description.

CODE    DESCRIPTION
IND     India
US      United States of America
UK      United Kingdom
IND     INDIA
GER     Germany
AUS     <<null>>
AF      Afganistan
DZ      Algeria
Alb     Albania
Arg     Argentina

 

Possible problems in data:

What could be the possible problems in this table with respect to data quality?

  1. The application might have been designed such that Code must be 3 characters in length, whereas during data migration some codes might have come in with only 2 characters
  2. Code might have been intended to be all caps, which might have been compromised
  3. There should not be any null values in the Description (these checks are sketched below)
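
Before looking at how Talend Open Studio surfaces these issues, here is a plain-Python sketch of the same three checks run against the sample rows above; the in-memory list of tuples stands in for the Country table, and None represents the <<null>> description.

```python
# Plain-Python sketch of the three rules above, run against the sample rows.
# The in-memory list stands in for the Country table; None represents <<null>>.
rows = [
    ("IND", "India"), ("US", "United States of America"),
    ("UK", "United Kingdom"), ("IND", "INDIA"), ("GER", "Germany"),
    ("AUS", None), ("AF", "Afganistan"), ("DZ", "Algeria"),
    ("Alb", "Albania"), ("Arg", "Argentina"),
]

for code, description in rows:
    if len(code) != 3:              # problem 1: Code should be 3 characters long
        print(f"Length issue: {code!r}")
    if code != code.upper():        # problem 2: Code should be all caps
        print(f"Case issue:   {code!r}")
    if description is None:         # problem 3: Description should not be null
        print(f"Null issue:   Description missing for {code!r}")
```

Talend lets us get the same answers without writing any code, which is what the rest of this post walks through.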

 

Now let us see how we can identify these problems in the Country table using this example.

 

Step 1: Connect to the Oracle database using the DQ Repository

 

Step 2: Now we will add a simple column analysis on the Code column of the Country table

 

Step 3: Select the indicators for each column to analyze; this is essential, since without indicators we will not be able to analyze the issues. For this example I have chosen some of the indicators as given below.

 

Step 4: Run the Analysis

When you run the analysis you get a graphical chart as depicted in the snapshot provided below:

 

In the above picture we can see that the row count is 10, there is one duplicate value, and there are 9 distinct values.
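
As a sanity check, the basic indicators in that chart can also be recomputed by hand. Here is a small sketch using the same ten codes; the extra "unique count" is an assumption about one more simple-statistics indicator that is typically available.

```python
# Recomputing the basic indicators from the chart by hand, using the same codes.
from collections import Counter

codes = ["IND", "US", "UK", "IND", "GER", "AUS", "AF", "DZ", "Alb", "Arg"]
counts = Counter(codes)

row_count       = len(codes)                                # 10 rows
distinct_count  = len(counts)                               # 9 distinct codes
duplicate_count = sum(1 for n in counts.values() if n > 1)  # 1 code (IND) occurs more than once
unique_count    = sum(1 for n in counts.values() if n == 1) # 8 codes occur exactly once

print(row_count, distinct_count, duplicate_count, unique_count)
```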

In this picture you can also find the length-related metrics and the text statistics, which even take care of the case-related issues.
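
Those length metrics and text statistics can be reproduced the same way if you want to verify the numbers; this sketch computes the minimum, maximum, and average code length plus a simple upper-case versus mixed-case count on the sample codes.

```python
# Reproducing simple length metrics and case statistics for the Code column.
codes = ["IND", "US", "UK", "IND", "GER", "AUS", "AF", "DZ", "Alb", "Arg"]

lengths = [len(c) for c in codes]
print("min length :", min(lengths))                  # 2
print("max length :", max(lengths))                  # 3
print("avg length :", sum(lengths) / len(lengths))   # 2.6

all_caps   = sum(1 for c in codes if c.isupper())    # 8 codes are all upper case
mixed_case = len(codes) - all_caps                   # 2 codes (Alb, Arg) are not
print("upper case :", all_caps, "| mixed case :", mixed_case)
```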

 

This way we can easily identify issues in a specific column. I hope this post gives a simple example which might be useful in different contexts for ensuring data quality.

 

Article on GCompris published in today’s Tamil Hindu by Shrini

Great post

Going GNU

I wrote an article for Tamil Hindu on GCompris, a free educational suite for kids on GNU/Linux.

It has been published today.

Thanks to Mr. Valliappan and the Tamil Hindu team for showing interest in Free Software related articles in Tamil Hindu.

Thanks to KathirVel for the images.

It is a good sign that print media in Tamil is showing interest in publishing news about Free/Open Source Software.

Planning to write more in Tamil Hindu about Free Software available for education and other fields.

Let me come out of my own laziness and start writing more.
