Identify duplicates and null values in a Column using Talend Open Studio for Data Quality

I am writing this post after quite a long time, this time on verifying the quality of data in a column using data profiling techniques. Let's take a simple example of a table called Country, which has two columns: Code and Description.

CODE   DESCRIPTION
IND    India
US     United States of America
UK     United Kingdom
IND    INDIA
GER    Germany
AUS    <<null>>
AF     Afghanistan
DZ     Algeria
Alb    Albania
Arg    Argentina

 

Possible problems in the data:

Looking at this sample, the following problems could exist in the data (a quick scripted check follows the list):

  1. The application might have been designed so that Code must be exactly 3 characters long, yet data migration could have introduced codes that are only 2 characters long
  2. Code might have been intended to be all caps, a rule that could have been violated along the way
  3. There should not be any null values in the Description column
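To make these rules concrete, here is a minimal pandas sketch that flags the violating rows. The file name is a placeholder, and the column names are assumptions matching the sample table above:

```python
import pandas as pd

# Hypothetical extract of the Country table with columns CODE and DESCRIPTION.
df = pd.read_csv("country.csv")

# Rule 1: CODE must be exactly 3 characters long.
bad_length = df[df["CODE"].str.len() != 3]

# Rule 2: CODE must be all caps.
bad_case = df[df["CODE"] != df["CODE"].str.upper()]

# Rule 3: DESCRIPTION must not be null.
null_description = df[df["DESCRIPTION"].isna()]

print("Length violations:\n", bad_length)
print("Case violations:\n", bad_case)
print("Null descriptions:\n", null_description)
```

Run against the sample above, this would flag US and UK for length, Alb and Arg for case, and AUS for the null description.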

 

Now let us see how we can identify these problems in the Country table using this example.

 

Step 1: Connect to the Oracle database using the DQ Repository

 

Step 2: Now we will add a simple column analysis on the Code column of the Country table

 

Step 3: Select the indicators for each column to analyze; they are essential, since without indicators we cannot analyze the issues. For this example I have chosen some of the indicators for analysis, as given below.

 

Step 4: Run the Analysis

When you run the analysis, you get the graphical chart depicted in the snapshot provided below:

 

In the above picture we can see that the row count is 10, with 9 distinct values and one duplicate value.


In this picture you can also find the length-related metrics and the text statistics, which also take care of the case-related issues.
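Outside Talend, the same indicators can be approximated with a few lines of pandas. This is a rough sketch, continuing the hypothetical df from the earlier snippet:

```python
code = df["CODE"]

# Simple statistics, as in the first chart.
row_count = len(code)                         # 10
distinct_count = code.nunique()               # 9 distinct values
duplicate_count = code.duplicated().sum()     # 1 (the second IND row)

# Length and case indicators, as in the second chart.
min_length = code.str.len().min()             # 2
max_length = code.str.len().max()             # 3
upper_case = (code == code.str.upper()).sum() # codes already in all caps

print(row_count, distinct_count, duplicate_count,
      min_length, max_length, upper_case)
```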

 

This way we can easily identify issues in a specific column. I hope this post gives a simple example that proves useful in different contexts for ensuring data quality.

 

Data Profiling: Step by Step connection analysis using Talend

In continuation of my previous post, in this post we will look at some sample data and use Talend Open Studio for Data Quality for data profiling. You can also refer to this link for an alternative tutorial. We will evaluate how to do a connection analysis, which helps us answer the following key questions about a database (a programmatic sketch follows the list):

  • How many tables exist?
  • How many views exist?
  • How many rows exist in total?
  • How many rows exist in each table?
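As a rough programmatic equivalent, the sketch below pulls the same overview from MySQL's information_schema using PyMySQL. The connection details are placeholders, and note that TABLE_ROWS is only an estimate for InnoDB tables:

```python
import pymysql

# Placeholder connection details for the sample database.
conn = pymysql.connect(host="localhost", user="root",
                       password="secret", database="classicmodels")

with conn.cursor() as cur:
    # Count base tables and views in the schema.
    cur.execute(
        "SELECT TABLE_TYPE, COUNT(*) FROM information_schema.TABLES "
        "WHERE TABLE_SCHEMA = %s GROUP BY TABLE_TYPE",
        ("classicmodels",))
    for table_type, count in cur.fetchall():
        print(table_type, count)

    # Approximate row count per table.
    cur.execute(
        "SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES "
        "WHERE TABLE_SCHEMA = %s AND TABLE_TYPE = 'BASE TABLE'",
        ("classicmodels",))
    for name, rows in cur.fetchall():
        print(name, rows)

conn.close()
```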

Why do we need to do Connection Analysis:

A connection analysis helps you quickly get an overview of the database behind a connection in the current context.

Creating a DB connection

Step 1:

Step 2: Installing the MySQL drivers

Step 3: Checking the connectivity

Starting Connection Analysis:

Step 1: Create the Database Structure Overview analysis as given in the steps below.

Step 2: Select the DB Connections

Step 3: Select the tables you want to analyze.

Step 4: Now run the ClassicModel Connection Analysis

Step 5: Check the execution status after the run

Step 6: The analysis now returns statistical information such as 3864 rows across 8 tables.

Open source tools for Data Profiling

Data profiling is nothing but analyzing the existing data available in a data source and identifying its metadata. This post is a high-level introduction to data profiling and simply provides pointers on the topic.

What is the use of doing data profiling?

  1. To understand the metadata characteristics of the data under purview.
  2. To have an enterprise view of the data for the purposes of Master Data Management and Data Governance.
  3. To help identify the right candidates for source-to-target mapping.
  4. To ensure the data is fit for its intended purpose.
  5. To identify data issues and quantify them.

Typical project types where it is put to use:

  • Data warehousing/Business Intelligence Projects
  • Research Engagements
  • Data research projects
  • Data Conversion/Migration Projects
  • Source System Quality initiatives.

Some of the open source tools that can be used for data profiling:

Some links that help in understanding the various commercial players that exist, along with their comparison and evaluation:

In the next post we will evaluate certain aspects of data profiling with any of the tools mentioned in this blog post.

Step by Step Correlation Matrix using RapidMiner on the Fuel Consumption Data of Cars in Canada

A correlation matrix helps you understand the correlation between various variables. It is a symmetric matrix where the ij-th element is equal to the correlation coefficient between variables i and j. The diagonal elements are always equal to 1. (Thanks to http://www.statistics.com/glossary&term_id=310.)
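For reference, the coefficient behind each cell of the matrix is the Pearson correlation, written in LaTeX notation as:

```latex
% Pearson correlation between variables i and j; r_{ii} = 1 and r_{ij} = r_{ji}
r_{ij} = \frac{\operatorname{cov}(X_i, X_j)}{\sigma_i \sigma_j}
       = \frac{\sum_k (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}
              {\sqrt{\sum_k (x_{ik} - \bar{x}_i)^2}\,\sqrt{\sum_k (x_{jk} - \bar{x}_j)^2}}
```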

Purpose of using Correlation Matrix:

  • To identify outliers
  • To identify collinearity between variables
  • To support regression analysis

Simple understanding:

Correlation is a number between +1 and -1 that measures the linear relationship between two variables. When the two move together (e.g., the higher the income, the higher the tax) the correlation is positive, approaching +1; when one moves against the other (e.g., every item sold reduces your inventory) the correlation is negative, approaching -1. A correlation near zero means no linear relationship exists (e.g., average temperature in summer vs. average sales of news magazines), reflecting linear independence between the variables. It is also important to understand that correlation is not affected by the scale of the variables or how they are measured.
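A tiny NumPy sketch (with made-up numbers) illustrates all three cases, as well as the scale invariance just mentioned:

```python
import numpy as np

income = np.array([30, 40, 50, 60, 70])
tax = income * 0.2 + 1                  # rises with income       -> correlation +1
inventory = 100 - income                # falls as sales increase -> correlation -1
unrelated = np.array([1, 3, 1, 3, 1])   # no linear link          -> correlation ~0

print(np.corrcoef(income, tax)[0, 1])         # ~1.0
print(np.corrcoef(income, inventory)[0, 1])   # ~-1.0
print(np.corrcoef(income, unrelated)[0, 1])   # ~0.0
print(np.corrcoef(income * 1000, tax)[0, 1])  # still ~1.0: scaling does not change it
```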

About the Dataset:

Source: Thanks to Fuel Consumption Ratings from data.gc.ca , Link: http://data.gc.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64

The dataset I have used is a refined version of the data from the above link. It has the following attributes (a loading sketch follows the list):

  • Make – Car Make
  • Class – coded as given in the table below

CLASS            CODE
COMPACT          C
SPECIAL PURPOSE  SP
MID-SIZE         M
SUBCOMPACT       S
TWO-SEATER       T
STATION WAGON    W
FULL-SIZE        L
PICKUP TRUCK     PU
LARGE VAN        F
MINIVAN          V

  • Engine
  • Transmission
  • Fuel Type
  • City (fuel consumption during city driving, in miles per gallon)
  • Hwy (fuel consumption during highway driving, in miles per gallon)
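To follow along outside RapidMiner, here is a minimal pandas sketch; the file name is hypothetical and the column names are assumed from the list above. It loads the sheet and expands the Class codes:

```python
import pandas as pd

# Hypothetical local copy of the refined dataset (reading .xlsx requires openpyxl).
df = pd.read_excel("CanadaCarsFuelConsumption2013.xlsx")

# Expand the short Class codes into readable labels.
class_labels = {
    "C": "COMPACT", "SP": "SPECIAL PURPOSE", "M": "MID-SIZE",
    "S": "SUBCOMPACT", "T": "TWO-SEATER", "W": "STATION WAGON",
    "L": "FULL-SIZE", "PU": "PICKUP TRUCK", "F": "LARGE VAN",
    "V": "MINIVAN",
}
df["ClassLabel"] = df["Class"].map(class_labels)
print(df.head())
```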

Tool Usage:

In this post we will use the RapidMiner tool to explore the variables in the 2013 fuel consumption data for cars in Canada.

Steps to evaluate the correlation matrix:

Step 1: Open RapidMiner, which you can download from rapidminer.com

 

Step 2: Import the data from your local drive. In my case the data is in Excel format, so click “Import Excel Sheet…” under the Repository tab. You can also see a repository named “SivaRepository” which I created previously.

Step 3: After you import the data and click Finish, you will see something like the screen given below. You can also check the log to identify any errors. I have named the dataset CanadaCarsFuelConsumption2013.


Step 4: Now we will build the correlation matrix from this data. Select the Correlation Matrix operator from the Operators panel, under the Modeling/Correlation and Dependency Computation section.

Step 5: Now drag the CanadaCarsFuelConsumption2013 dataset into the process area and connect its out port to the exa port of the Correlation Matrix operator. Then connect the mat output port to the process res port to get the output.

Step 6: Now let’s run the process to see the results.


Result:

Based on this outcome we can see that the City, Hwy and Fuel-related variables correlate closely with one another; their relationships are strongly positive compared to those of the other variables. We can also look at the pairwise tables for a better understanding.
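For comparison, the same matrix can be reproduced in pandas on the numeric columns, continuing the hypothetical df from the loading sketch above (which columns are numeric is an assumption):

```python
# Pearson correlation matrix: symmetric, with 1s on the diagonal.
corr = df[["Engine", "City", "Hwy"]].corr()
print(corr)
```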