Step-by-Step Correlation Matrix Using RapidMiner on the Fuel Consumption Data of Cars in Canada

A correlation matrix helps you understand the correlation between pairs of variables. It is a symmetric matrix in which element ij is equal to the correlation coefficient between variables i and j. The diagonal elements are always equal to 1. (Thanks http://www.statistics.com/glossary&term_id=310).
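The two properties mentioned above (symmetry and a diagonal of 1s) can be checked in a few lines of R. This is only a sketch: it uses R's built-in mtcars dataset rather than the fuel consumption data from this post, and the choice of the four columns is arbitrary.

```r
# Built-in example data; NOT the Canadian fuel consumption dataset.
vars <- c("mpg", "disp", "hp", "wt")
m <- cor(mtcars[, vars])   # Pearson correlation matrix (the default method)

print(round(m, 2))

# Element [i, j] equals element [j, i], so the matrix is symmetric.
is_symmetric <- isSymmetric(m)

# Every variable is perfectly correlated with itself, so the diagonal is 1.
diag_is_one <- isTRUE(all.equal(unname(diag(m)), rep(1, length(vars))))

print(c(symmetric = is_symmetric, unit_diagonal = diag_is_one))
```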

Purpose of using Correlation Matrix:

  • To identify outliers
  • To identify any collinearity that exists between the variables
  • To support regression analysis

Simple understanding:

Correlation is a number between -1 and +1 that measures the strength of the linear relationship between two variables. When both variables move in the same direction (e.g., the higher the income, the higher the tax), the correlation is positive and approaches +1. When they move in opposite directions (e.g., every item sold reduces your inventory), the correlation is negative and approaches -1. If it is near zero, there is no linear relationship between the variables (e.g., average temperature in summer and average sales of news magazines). It is also important to understand that the correlation is not affected by the scale of the variables or the units in which they are measured.
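A small R sketch of these cases, using made-up numbers chosen so that the relationships are exactly linear. The variable names and values are assumptions for illustration only.

```r
# Made-up numbers for illustration only.
income <- c(30000, 45000, 60000, 75000, 90000)
tax    <- 0.2 * income            # tax rises in step with income

items_sold <- c(0, 5, 10, 15, 20)
inventory  <- 100 - items_sold    # every sale reduces the stock

r_pos <- cor(income, tax)            # exact increasing linear relationship
r_neg <- cor(items_sold, inventory)  # exact decreasing linear relationship

# Changing the unit (income in thousands) leaves the correlation unchanged.
r_scaled <- cor(income / 1000, tax)

print(c(positive = r_pos, negative = r_neg, rescaled = r_scaled))
```

Real data is rarely exactly linear, so in practice the coefficient lands somewhere strictly between -1 and +1.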

About the Dataset:

Source: Thanks to Fuel Consumption Ratings from data.gc.ca. Link: http://data.gc.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64

The dataset I have used is a refined version of the data from the above link. The dataset I use in this post has the following attributes:

  • Make – Car Make
  • Class – Referred as given below

      COMPACT – C
      SPECIAL PURPOSE – SP
      MID-SIZE – M
      SUBCOMPACT – S
      TWO-SEATER – T
      STATION WAGON – W
      FULL-SIZE – L
      PICKUP TRUCK – PU
      LARGE VAN – F
      MINIVAN – V

  • Engine
  • Transmission
  • Fuel Type
  • City (fuel consumption during city driving, in miles per gallon)
  • Hwy (fuel consumption during highway driving, in miles per gallon)

Tool Usage:

In this post we will use the RapidMiner tool to understand how the variables in the 2013 fuel consumption data for cars in Canada relate to one another.

Steps to evaluate correlation Matrix:

Step 1: Open RapidMiner, which you can download from rapidminer.com

 

Step 2: Import the data from the local drive. In my case I have kept it in Excel format, so you have to click “Import Excel Sheet…” under the Repository tab. You can also see a repository named “SivaRepository”, which I created previously.

Step 3: After you import the data and click Finish, you will see something like the screen shown below. You can also check the log to identify any errors. I have named the dataset CanadaCarsFuelConsumption2013.


Step 4: Now we will build the correlation matrix from this data. Select the Correlation Matrix operator from the Operators panel, under the Modeling/Correlation and Dependency Computation section.

Step 5: Now drag the CanadaCarsFuelConsumption2013 dataset into the process area and connect its out port to the exa port of the CorrelationMatrix operator. Then connect the mat output to the process res port to produce the output.

Step 6: Now let’s run the process to see the results.
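For readers who prefer code to the GUI, a roughly analogous computation can be sketched in R. This is not what RapidMiner runs internally. The data frame below is a tiny made-up stand-in with invented values, the Make values are placeholders, and the column names simply follow the attribute list in this post. Only the numeric columns are passed to cor() in this sketch.

```r
# Hypothetical stand-in for CanadaCarsFuelConsumption2013; values are invented.
cars <- data.frame(
  Make   = c("A", "B", "C", "D", "E", "F"),
  Class  = c("C", "M", "S", "PU", "V", "L"),
  Engine = c(1.6, 2.4, 1.4, 5.3, 3.6, 3.5),
  City   = c(38, 30, 40, 18, 22, 25),
  Hwy    = c(52, 44, 55, 26, 33, 38),
  stringsAsFactors = FALSE
)

# Keep only the numeric attributes, then build the correlation matrix.
numeric_cols <- sapply(cars, is.numeric)
corr <- cor(cars[, numeric_cols])

print(round(corr, 3))
```

With the real dataset, the data frame would instead come from whatever import step you use for the Excel file.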


Result:

Based on this outcome we can see that City, Hwy and the fuel-related variables are more closely correlated with one another than with the other parameters, as their relationship is strongly positive compared with the other variables. We can also look at the pairwise table for a better understanding.
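A pairwise view like this can also be produced in R by flattening a correlation matrix into one row per pair of variables. The sketch below uses R's built-in mtcars dataset as a stand-in (not the fuel consumption data), and the names pair_table, var1, var2 and correlation are my own choices.

```r
# Built-in example data; NOT the fuel consumption dataset from this post.
m <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])

# Flatten the matrix into one row per (row variable, column variable) cell.
pair_table <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(pair_table) <- c("var1", "var2", "correlation")

# Keep each unordered pair once by selecting the upper triangle of the matrix.
pair_table <- pair_table[as.vector(upper.tri(m)), ]

# Strongest relationships (by absolute value) first.
pair_table <- pair_table[order(-abs(pair_table$correlation)), ]
print(pair_table, row.names = FALSE)
```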

Correlation made simple using R

When we have a dataset, it’s always useful to understand how the variables in it are correlated with each other. In this blog we will take the example of the studentsdata dataset, which we discussed in my previous blog, to understand how well the scores in individual subjects correlate with the total scores.

What is correlation?

In simple terms, correlation describes how two sets of data relate to each other. For example, you might want to know whether your salary has increased with your age, how much your expenditure has risen as your salary has increased, or how much umbrella sales have increased with the level or frequency of rainfall.

Before we begin, if you want to know how to import data from Excel into the R environment, please read my blog.

We will have a look at the data we have now. The studentsdata dataset has 8 columns, which were imported from an Excel sheet.

We can use the cor(var1, var2) function to determine the correlation; by default it returns Pearson’s correlation coefficient. We will first find the correlation between the scores in the Tamil subject and TotalScores. As you can see in the picture below, we have used cor(studentsdata$Tamil, studentsdata$TotalScores), which returns 0.4370992 (about 43.71%), indicating a fairly low positive correlation. We have also plotted the two variables against each other using plot. If you want to learn how the correlation is calculated, please refer to this link for a simple example.
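As a rough sketch of what cor() computes, the snippet below compares the built-in result with Pearson’s formula written out by hand. The marks are invented for illustration (the real studentsdata is not reproduced here), so the coefficient will not match the 0.4370992 reported above.

```r
# Invented marks for illustration; the real studentsdata is not reproduced here.
scores <- data.frame(
  Tamil       = c(65, 72, 58, 90, 77, 81),
  TotalScores = c(410, 395, 372, 450, 388, 431)
)

# cor() uses method = "pearson" unless told otherwise.
r_builtin <- cor(scores$Tamil, scores$TotalScores)

# The same coefficient from the definition: the summed products of deviations
# from the means, divided by the square root of the product of the summed
# squared deviations.
x <- scores$Tamil
y <- scores$TotalScores
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

print(c(builtin = r_builtin, manual = r_manual))

# Scatter plot of the two variables.
plot(x, y, xlab = "Tamil", ylab = "TotalScores")
```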

Interestingly, in this dataset the correlation coefficient between the English subject and TotalScores is 0.7475341 (74.75%). This indicates a strong relationship between the total scores secured and the marks in English. If one variable increases when the other increases, there is a positive correlation, and the stronger that relationship, the closer the coefficient is to 1. In this instance, as the English marks increase, TotalScores tends to increase as well, since the two are strongly positively correlated.


The same scatter plot has been drawn with the ggplot2 library using the qplot method. Please find the picture below:
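A sketch of such a plot, assuming the ggplot2 package is installed. It uses ggplot() with geom_point() rather than qplot(), since qplot() is marked as deprecated in recent ggplot2 releases. The marks are invented, and the code falls back to base graphics if ggplot2 is missing.

```r
# Invented marks for illustration; not the real studentsdata.
scores <- data.frame(
  English     = c(55, 68, 74, 81, 90, 62),
  TotalScores = c(372, 395, 410, 431, 450, 388)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(scores, ggplot2::aes(x = English, y = TotalScores)) +
    ggplot2::geom_point() +
    ggplot2::labs(title = "English vs TotalScores")
  print(p)
} else {
  # Fall back to base graphics when ggplot2 is not available.
  plot(scores$English, scores$TotalScores,
       xlab = "English", ylab = "TotalScores")
}
```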


Have a look at how the correlation fares with the other subjects. R made it so simple, didn’t it?
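One way to compare every subject against the total in a single pass is sketched below. The marks are invented, the Maths column is a hypothetical subject name, and TotalScores is computed here as the row sum purely as an assumption for the example.

```r
# Invented marks for illustration; "Maths" is a hypothetical subject name.
scores <- data.frame(
  Tamil   = c(65, 72, 58, 90, 77, 81),
  English = c(55, 68, 74, 81, 90, 62),
  Maths   = c(70, 45, 88, 92, 60, 79)
)
# Assumption for this sketch: the total is the sum of the subject marks.
scores$TotalScores <- rowSums(scores)

# Correlate each subject column with TotalScores.
subjects <- setdiff(names(scores), "TotalScores")
by_subject <- sapply(scores[subjects],
                     function(marks) cor(marks, scores$TotalScores))

print(round(sort(by_subject, decreasing = TRUE), 3))
```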