Identify duplicates and null values in a Column using Talend Open Studio for Data Quality

Writing this post after quite a long time on verifying the quality of data in a column using Data profiling techniques. Let’s take a simple example of a table called Country which has Code and Description.

CODE    DESCRIPTION
IND    India

US    United States of America

UK    United Kingdom

IND    INDIA

GER    Germany

AUS    <<null>>

AF    Afganistan

DZ    Algeria

Alb    Albania

Arg    Argentina

 

Possible problems in data:

What could be the possible problems in this with respect to data:

  1. The application might have been designed in such a way that Code must be 3 characters in length where in during data migration there could be some code which might of 2 characters in length
  2. Code might have been destined to be All caps which might have been compromised
  3. There should not be any null values in the Description

 

Now let us see how we can identify these problems in the table Country with the example.

 

Step 1: Connect to the oracle database using the DQ Repository

 

Step 2: Now we will add simple Column analysis on the column Code for the table Country

 

Step 3: Select the Indicators for each column to analyze, that is essential for analysis. Without indicators we will not be analyze the issues. For this example I have chosen some of the parameters for analysis as given below.

 

Step 4: Run the Analysis

When you run the analysis you can get to the graphical chart as depicted in the snapshot provided below:

 

In the above picture we can realize the row count is 10 and it has one duplicate values and there are 9 distinct values.

 


 

In this picture you can also find the length related metrics and Text statistics which even takes care of the case related issues.

 

This way we can easily identify issues in a specific column. Hope this post gives a simple example which might be useful in different context for ensuring data quality.