Data Profiling: Step by Step connection analysis using Talend

In continuation to my previous post, in this post we will look at some sample data and use Talend Open Studio for Data Quality for Data profiling. You can also refer to this link as an alternative tutorial. In this blog post we will evaluate how to do Connection analysis which will help us to the following key parameters in a database:

  • How many tables exists?
  • How many rows exists?
  • How many views exists?
  • How many rows exists in each table etc.,?

Why do we need to do Connection Analysis:

The connection analysis helps you to get a overview of the Database connection in the context quickly.

Creating a DB connection

Step 1:

Step 2: Installing mysql drivers

Step 3: Checking the connectivity

Starting Connection Analysis:

Step 1: Create the Database Structure Overview as given in the steps.

Step 2: Select the DB Connections

Step 3: Select the tables you need to analyze on.

Step 4: Now run the ClassicModel Connection Analysis

Step 5: Execution status after run

Step 6: Now it results in data analysis with statistical information such as 3864 rows and 8 tables.

Open source tools for Data Profiling

Data Profiling is nothing but analyzing the existing data available in a data source and identifying the meta data on the same. This post is an high level introduction to data profiling and just provide pointers to data profiling.

What is the use of doing data profiling?

  1. To understand the metadata characteristics of the data under purview.
  2. To have an enterprise view of the data for the purpose of Master Data Management and Data Governance
  3. Helps in identifying the right candidates for Source-Target mapping.
  4. Ensure data fits for the intended purpose
  5. It helps to identify the Data issues and quantify them.

Typical project types its being put to use:

  • Data warehousing/Business Intelligence Projects
  • Research Engagements
  • Data research projects
  • Data Conversion/Migration Projects
  • Source System Quality initiatives.

Some of the open source tools which can be used for Data Profiling:

Some links which points to understand various commercial players exists and there comparison and evaluation:

In the next post we will evaluate certain aspects of data profiling with any of the tools mentioned in this blog post.

India voter turnout for 2014: Using simple Excel Linear regression model

Wondering what would be the voter turnout for the year 2014 based the past history of elections in India. Here is an attempt in my post to evaluate the Linear regression model and fitting a line to data to determine what could be the possible voter turnout percentage for the year 2014.

Source of Data:

International Institute for Democracy and Electoral Assistance has the data for the voter turnout in the elections in the past in India. http://www.idea.int/vt/countryview.cfm?CountryCode=IN. We will use that as the source for this analysis. Though the elections are not held linearly this is only an attempt to predict what will the line of fit within the given data.

Step 1: Let’s put the data in a spreadsheet. For the purpose of simplicity and to bring some linearity we have added Index column along with the data.

Year

Index

Voter turnout Percentage

1952

0

61.17%

1957

1

62.23%

1962

2

55.42%

1967

3

61.04%

1971

4

55.25%

1977

5

60.49%

1980

6

56.92%

1984

7

63.56%

1989

8

61.98%

1991

9

56.73%

1996

10

57.94%

1998

11

61.97%

1999

12

59.99%

2004

13

58.07%

2009

14

58.17%

 

Step 2: Now let’s plot a scatter plot against this data as shown below. Make sure you select both the columns Index and Voter turnout percentage and select “Scatter with Only Markers” type (I’m using Office 2007). This highest voter turnout seems to be in the year 1984 with 63.56%

 

Step 3: Now let’s bring the linear regression equation to this chart, which will probably help us to evaluate what possibly could be the turnout in 2014? After selecting the chart select the “Design” Tab and look for “layout 9” which has fx along with a trend line and select that. It’s given in the picture below:

 

Step 4: After selecting the Layout 9, you will have the line of fit and its relevant equation. To have better clarity this equation has been moved on to the right.


 

Step 5: So putting things into the equation we are expecting a 59% overall turnout for 2014. Not sure, I’m also awaiting for the results. See the screenshots with the updated data in the spreadsheet.

Year to Predict

Equivalent Index

Equation

Resolving

2014

15

y = -0.0005x + 0.5974, R² = 0.0069

59%

 


Lets exercise our democratic rights and await for good governance. Also lets check the voter turnout for the year 2014. 🙂

Step by Step connecting to MySQL from R

Many of you would like to explore the data from the MySql database with R. This would help to analyze data with relationship which exists in R. This post will talk about the steps to connect to MySQL from R. This post is a step by step approach similar to the to the steps outlined in this PDF Document, though there are some additional information and easy to follow sequence.

Assumptions:

  1. MySQL is already installed on your System
  2. Operating System is Windows 7 (64-Bit)

Dataset:

We will use the Employee dataset which is available in the http://www.eclipse.org/birt/phoenix/db/ which can be used for testing applications and database with the fictious names “classicmodels”. (approximately 3.1 MB). Make sure you import the dataset to MySQL as outlined here. It has the following tables Customers, Employees, Orders, OrderDetails, Payments, Products and Product Lines. You can look at what is available in the database using the SQLYOG Community Edition.

Various ways to connect to MySQL from R:

There following are the ways in which we can connect to MySQL from R:

  • Using RODBC Library
  • Using RMySQL Library

In this post we will see the steps for connecting to MySQL from R using ODBC. Also refer to this PDF Attachment on the ODBC connectivity with R.

Step 1:

Download the ODBC Driver from the site https://dev.mysql.com/downloads/connector/odbc/ to make sure if you have the driven if you don’t have one. I have downloaded mysql-connector-odbc-5.3.2-winx64.msi as My operating and System is 64-bit. Please proceed with the installation of the same.

ODBC Setup:

Going forward we will do the ODBC setup related steps.

Step 2:

Goto Control Panel->Administrative Tools->ODBC


Step 3: Click Add to add a new ODBC Setup for MySQL


 

Step 4: Once you click finish you will get the below given screen where in you need to feed the IP Address/Hostname of the MySQL Server, Username and Password credentials and database and click on the Test button to make sure you are able to connect to the database without any problems. Once the Test is successful click OK to added to the List of ODBC connections.


After the MySQL ODBC connection is added:


 

Now we will move to R to invoke this Datasource and try to access any one of the table in R

Step 5: Now load the library RODBC using the following command, if its not getting loaded you can look at the instructions in the link to install the RODBC Library.

library(RODBC)


Step 6: Now having the RODBC installed now we can connect to the ClassicModels database in the MySQL and test the RODBC Library

Hope now we have got an idea of how to bring on the MySQL database to R for further processing, in the next post we will try to attempt of using RMySQL Library. Thanks, please share your feedback.

Social Network Analysis: Calculating Degree with Gephi

In continuation to my earlier post on creating a Simple social network analysis using Gephi on the simple friends network. We will focus on the terminology “Degree” related to the Social Network Analysis in this post. In the previous example I used to Edge List in a CSV format to import the data to Gephi to obtain this social network which is very simple.

Degree:

Degree is the edges incident on a node. According to Wikipedia, The degree of a node in a network (sometimes referred to incorrectly as the connectivity) is the number of connections or edges the node has to other nodes.

In this outcome are social network diagram you can see Siva has more number of Degrees which can be looked from the gephi as given in below given screenshot.

This shows the following attributes of Siva:

In-Degree: 4 (Vijay, Gopikrishna, Aditya, Kumar) – Head End Points at the Node

Out-Degree:7 (Ilango, Ramesh, Kannan, Aditya, Kumar, Vijay and GopiKrishna) – Tail End points at the Node

Degree: 11

According to Wikipedia, For a node, the number of head endpoints adjacent to a node is called the indegree of the node and the number of tail endpoints adjacent to a node is its outdegree.

Degree Distribution:

The degree distribution displays the count of nodes with the appropriate distribution. This example there are 3 nodes with the degree of 4 and one node with degree 1.

Step by Step Correlation Matrix using Rapid miner on the Fuel Consumption Data of cars in Canada

Correlation Matrix will help you understand the co-relation between various variables. It is a Symmetrical Matrix where ij element in the matrix is equal to the correlation co-efficient between the variable i and j. The diagonal element are always equivalent to 1. (Thanks http://www.statistics.com/glossary&term_id=310).

Purpose of using Correlation Matrix:

  • To identify the outliers
  • To identify the co-linearity exists between the variables.
  • Used for regression analysis

Simple understanding:

Correlation is a number between +1 and -1 that helps you to measure the relationship between two variables which are being linear(e.g., Higher the income, Higher the Tax) where correlation is +1 or positive, on the other hand (e.g., every item sold will reduce your inventory) where the correlation is -1 or Negative. If it’s near to zero it means that co-relation doesn’t exists (e.g., Average temperature in summer, Average sales of news magazines) which would reflect linear independence between variables. Also its very important to understand the correlation would not affect by the scale of the variables and how its measured.

About the Dataset:

Source: Thanks to Fuel Consumption Ratings from data.gc.ca , Link: http://data.gc.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64

The dataset is which I have used is a refined one of the data from the above link. The dataset I use in this post has the following attributes:

  • Make – Car Make
  • Class – Referred as given below

COMPACT

C

SPECIAL PURPOSE

SP

MID-SIZE

M

SUBCOMPACT

S

TWO-SEATER

T

STATION WAGON

W

FULL-SIZE

L

PICKUP TRUCK

PU

LARGE VAN

F

MINIVAN

V

  • Engine
  • Transmission
  • Fuel Type
  • City (Fuel Consumption during City drive in mi/gallons)
  • Hwy (Fuel Consumption during Highway drive in mi/gallons)

Tool Usage:

In this post we will use Rapid Miner tool to understand the Fuel Consumption of cars in Canada for the Year 2013 data related variables.

Steps to evaluate correlation Matrix:

Step 1: Open Rapid Miner which you can download from rapidminer.com

 

Step 2: Import the data from the local drive. In my case I have kept it in excel format, for that you have to click “Import Excel Sheet…” under the Repository Tab. Also you can look at a repository named “SivaRepository” which I have created previously.

Step 3: After you import and click Finish you will something like this as given below, also you can see the log to identify if there are any errors, I have given the name for the dataset as CanadaCarsFuelConsumption2013.


Step 4: Now we will do the correlation matrix from this data. Select the correlation matrix operator from the Operators under Modeling/Correlation and Dependency Computation Section.

Step 5: Now also drag the CanadaCarsFuelConsumption2013 dataset to the process area and connect the out to the exa of the CorrelationMatrix Operator. Then connect the mat output to the process res for the output.

Step 6: Now let’s run the process to the see the results.


Result:

Based on this outcome we can realize that City, Hwy and Fuel related variables have a close correlation that other parameters as the relationship is very positive when compared to other variables. We can also look at the pair wise tables to have better understanding.