Apache Nifi for Real-time scenarios in IoT – Part 1

In this two-part series, I will share why we chose Apache Nifi to handle real-time data flow in an IoT implementation.

One of our IoT projects, which has multiple integration touch points, needed near real-time data processing. While evaluating options such as Kapacitor, Apache Storm, and Apache Kafka, I came across Apache Nifi, a data flow engine that was used by the NSA, and was curious to explore it. Initially I thought it might be complex, but it turned out to be easy to work with once we started exploring. Before sharing my use cases for Apache Nifi from my own context, let us quickly establish what real-time data processing is.

Real-Time Data Processing:

Real-time data processing typically deals with a stream of data that arrives at a very high rate and needs to be processed to gain insights. The term “real-time” is itself subjective and depends on the context or usage. In general, the goal is to process the data with as little latency as possible, ideally close to zero.

Challenges:

The following were some of the challenges we encountered in a typical IoT implementation:

  1. Need to track the flow of data across the information value chain
  2. Once data is ingested into the processing flow there could be different data processing requirements such as:
    1. Validation
    2. Threshold Checks
    3. Initiating business events
    4. Alerting
    5. Aggregation
  3. Need to make sure that the data flow is seamless and that, if there are problems, they can be isolated without impacting other parts of the flow
  4. Need to handle different protocols and formats such as MQTT, JSON, and HTTP
  5. Need to integrate through APIs and validate data with regular expressions
  6. Need to handle DB operations along the data flow
  7. Performance needs to be good enough to keep up with the flow requirements
  8. Need for parallelism across different points in the data flow
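
To make the processing requirements in item 2 more concrete, here is a minimal sketch in plain Python. It is not Apache Nifi code and not our actual implementation; it only illustrates validation, a threshold check, alerting, and aggregation on a batch of readings. The device id format, the temperature limit, and the function names are all assumptions made up for this example.

```python
import re
from statistics import mean

# Hypothetical device-id format and temperature limit, chosen only for illustration.
DEVICE_ID_PATTERN = re.compile(r"sensor-\d{3}")
TEMPERATURE_LIMIT = 75.0


def validate(reading):
    """Return True when the reading has a well-formed device id and a numeric value."""
    device_id = reading.get("device_id", "")
    value = reading.get("temperature")
    return (
        isinstance(device_id, str)
        and DEVICE_ID_PATTERN.fullmatch(device_id) is not None
        and isinstance(value, (int, float))
    )


def process(readings):
    """Validate readings, collect alerts on threshold breaches, and average per device."""
    rejected, alerts, values_by_device = [], [], {}
    for reading in readings:
        if not validate(reading):
            # Set bad records aside so they do not block the rest of the flow.
            rejected.append(reading)
            continue
        if reading["temperature"] > TEMPERATURE_LIMIT:
            alerts.append(f"{reading['device_id']} exceeded {TEMPERATURE_LIMIT}")
        values_by_device.setdefault(reading["device_id"], []).append(reading["temperature"])
    averages = {device: mean(values) for device, values in values_by_device.items()}
    return rejected, alerts, averages
```

Calling process with a list of dictionaries returns the rejected records, the alert messages, and the average temperature per device. In a real flow each of these steps would be a separate stage so that a failure in one can be isolated from the others.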

Needless to say, there were other constraints on resources such as time and people. In the next part, we will discuss what Apache Nifi is and how it handles these challenges.

Open source tools for Data Profiling

Data profiling is the analysis of the existing data available in a data source to identify the metadata about it. This post is a high-level introduction to data profiling and provides some pointers on the topic.

What is the use of doing data profiling?

  1. To understand the metadata characteristics of the data under purview.
  2. To have an enterprise view of the data for the purpose of Master Data Management and Data Governance
  3. To help identify the right candidates for Source-Target mapping.
  4. To ensure the data is fit for its intended purpose.
  5. To identify data issues and quantify them.
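
As a rough illustration of what such metadata can look like, the sketch below profiles a small set of rows using only the Python standard library. It is not taken from any profiling tool; the function names and the choice of metrics (row count, nulls, distinct values, value types) are assumptions for this example, and it assumes the column values are hashable.

```python
from collections import Counter


def profile_column(values):
    """Compute simple profiling metadata for one column of values."""
    # Treat None and empty strings as missing values.
    non_null = [v for v in values if v is not None and v != ""]
    type_counts = Counter(type(v).__name__ for v in non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": dict(type_counts),
    }


def profile_rows(rows):
    """Profile every column found in a list of dict rows."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}
```

Running profile_rows over a list of dictionaries gives one summary per column, which is enough to spot and quantify issues such as missing values or mixed types before doing any Source-Target mapping.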

Typical project types where it is put to use:

  • Data warehousing/Business Intelligence Projects
  • Research Engagements
  • Data research projects
  • Data Conversion/Migration Projects
  • Source System Quality initiatives.

Some of the open source tools which can be used for Data Profiling:

Some links that cover the various commercial players, along with their comparison and evaluation:

In the next post, we will evaluate certain aspects of data profiling with one of the tools mentioned in this blog post.