# Beginner's Guide to ARIMA: Problem Identification & Data Gathering – Part 2

Continuing from my earlier post, I'm going to explore ARIMA using an example. In this post we will go through each step in detail and see how to accomplish ARIMA-based forecasting for a problem.

Step 1: Problem Identification or Our Scenario

We are going to consider the historical time series data on household power consumption and use it to forecast with ARIMA. There is also a research paper, published in the Proceedings of the International MultiConference of Engineers and Computer Scientists 2013 Vol I, analysing the performance of ARMA versus ARIMA on the same dataset. This post will focus on accomplishing the forecast step by step in R on the same dataset, for the ease of beginners. Sooner or later we will also evaluate tools such as AutoBox which can be used for solving these problems.

Step 2: Data Gathering or Identification of dataset

The dataset we are going to use is the Individual household electric power consumption dataset available in the UCI Repository under the URL: https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption. This is a multivariate dataset. Please check this link to understand the difference between univariate, bivariate and multivariate data.

Quick Summary of the Dataset:

• Dataset contains data between December 2006 and November 2010
• It has around 19.7 MB of data
• The file is in .TXT format
• Columns in the dataset are:
  • Date (DD/MM/YYYY)
  • Time (HH:MM:SS)
  • Global active power
  • Global reactive power
  • Voltage
  • Global intensity
  • Sub metering 1
  • Sub metering 2
  • Sub metering 3

You can open this semicolon-delimited text file in Excel; after completing the necessary steps in the import wizard you will have an Excel sheet with the data as given below. I was able to load rows only up to 1048576, which is Excel's row limit, whereas the actual total number of rows in the text file is 2075260. Whoa..

Next, for Step 3 (preliminary analysis), we can use R as the tool. To use R we need to load this data into it for analysis. For this I can save the Excel sheet in CSV or XLS format and then import it into R, as outlined in my other post or using this link. I'm using RStudio for the purpose and demonstrate the data loading process in the screenshots in the subsequent sections.

The installation of gdata had shown an error at first, but the subsequent attempt was successful. Now we can load the library using the command library(gdata). After that we load the powerData variable with the data available in the CSV file for further analysis, and we can view the data using View(powerData). Please check the console window for the code.
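The post itself loads the data with R's gdata package; as an alternative sketch, the same semicolon-delimited layout can be parsed with Python's standard library. The column names below follow the UCI dataset description above, and `?` marks missing values:

```python
import csv
import io

# A few sample lines in the UCI file's layout: semicolon-delimited,
# with '?' marking missing values.
sample = """Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;?;?;?;?;?;?;?
"""

def load_power_rows(text):
    """Parse the semicolon-delimited file, turning '?' into None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=';')
    return [{k: (None if v == '?' else v) for k, v in row.items()}
            for row in reader]

rows = load_power_rows(sample)
print(len(rows))                       # 2 data rows
print(rows[0]['Global_active_power'])  # '4.216'
print(rows[1]['Voltage'])              # None (was '?')
```

The same approach scales to the full file by passing an open file handle instead of the inline sample, with no Excel row limit to worry about.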

In the next post we will do some preliminary analysis on this data which we have loaded.

# Beginner's Guide to ARIMA: Learning the ARIMA Forecasting Technique by Example

The word “Arima” in the Tamil language means lion.

Everybody is curious to know what the future holds, and it's always exciting to peek into it. Though there are various forecasting models available, in this post we will look at ARIMA. Welcome to the world of forecasting with ARIMA.

## What is ARIMA?

ARIMA is a forecasting technique. ARIMA – Auto-Regressive Integrated Moving Average – is a key tool in time series analysis. This link from Penn State University gives a good introduction to time series fundamentals.

## What is the purpose?

To forecast. The book Forecasting: Principles and Practice gives a very good understanding of the whole subject. You can read it online.

## What kind of business problems can it solve?

To give examples, the following are some of the use cases of ARIMA.

• Forecast revenue
• Forecast whether to buy a new asset or not
• Forecast of currency exchange rates
• Forecast consumption of energy or utilities

## What is mandatory to get started?

1. It is very important to have clarity on what to forecast. For example, if you want to forecast revenue, whether it is for a product line, a demography, etc. has to be analysed before venturing on to the actual task.
2. The period or horizon over which the forecast is to be done is also crucial. Example: monthly, quarterly, half-yearly, etc.

## What are the preferred pre-requisites on data for Time series forecasting?

### Updated after comment from tomdireill:

1. The data should be part of a time series, that is, data observed sequentially over time.
2. It can be seasonal, meaning it has periodic highs and lows. As per the notes from Duke University, it can also be applied to flat, pattern-less data.
3. It should have a trend of increasing or decreasing values.
4. Outliers can be handled as outlined here: http://www.unc.edu/~jbhill/tsay.pdf
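The trend prerequisite above is what the “Integrated” part of ARIMA deals with: the series is differenced until it becomes stationary. As a minimal illustration (a Python sketch with made-up numbers, not the post's R workflow), first-order differencing removes a linear trend:

```python
def difference(series, order=1):
    """Apply first-order differencing `order` times."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# A made-up series with a steady upward trend of +2 per step.
trend_series = [10, 12, 14, 16, 18, 20]

print(difference(trend_series))     # [2, 2, 2, 2, 2] - trend removed, now flat
print(difference(trend_series, 2))  # [0, 0, 0, 0]    - second difference is zero
```

The number of differencing passes needed is the "d" in the usual ARIMA(p, d, q) notation.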

Ok, now that we understand what is essential to get started on forecasting, before we delve deeper let's work on the steps.

5 Steps towards forecasting:

In the next post we will take up an example and work on these steps one by one. Stay tuned.

# Database synchronization needs of multi-location enterprises

Recently, during an interaction with one of my colleagues, a discussion came up about replicating the same database, or making it available, across multiple locations. With the various connectivity options that exist these days, and when everyone is talking about cloud-based apps and implementations, why does this need arise at all? This post is a search for an answer to that.

Business needs for multi-location enterprise solutions:

1. Requires using one application across the enterprise to ensure data integrity and a single version of truth.
2. Get to know what happens in the other locations, manufacturing facilities or outlets.
3. Helps to plan and react better based on the data insights available from other locations.
4. Process control and improvement across the enterprise with a single solution.
5. Lower training cost.

Challenges in accomplishing these business needs:

1. Lack of connectivity or poor connectivity between the locations
2. Higher bandwidth costs, or complex internet solutions required to support the enterprise needs
3. No control or process enablement in the locations or facilities
4. Enterprise applications that do not support multi-location scenarios with better control over data and process
5. Processes and applications established at locations without understanding the impact of connectivity and process issues
6. Limited accountability and responsibility at the locations in comparison with corporate headquarters

Solutions or options available to us:

1. If we are very sure about the connectivity and availability, we can adopt a cloud-based solution, which resolves the problem once and for all
2. When there are connectivity issues, we might need to resort to database synchronization options, which would be more feasible for managing enterprise applications
3. The key to these kinds of scenarios is to identify the following with respect to the data:
   1. Who is the data owner?
   2. Who has to create it?
   3. Where does it have to be created?
   4. Who is going to consume it?
   5. Is it required in real time?
   6. What controls are to be established on the data?
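Once those ownership questions are answered, the synchronization layer still has to decide which version of a row wins when two locations edit it. As a toy sketch (Python, with a hypothetical row format invented for illustration), a simple last-writer-wins merge keyed on an update timestamp looks like this:

```python
def merge_rows(local, remote):
    """Last-writer-wins merge of two copies of the same table.

    Each table maps a primary key to a row dict carrying an
    'updated_at' ISO timestamp (hypothetical format for illustration).
    """
    merged = dict(local)
    for key, row in remote.items():
        if key not in merged or row['updated_at'] > merged[key]['updated_at']:
            merged[key] = row
    return merged

# Head office and a branch both edited customer 42; the branch edit is newer.
head_office = {42: {'name': 'Acme', 'updated_at': '2015-04-01T10:00'}}
branch      = {42: {'name': 'Acme Ltd', 'updated_at': '2015-04-02T09:30'},
               43: {'name': 'Beta', 'updated_at': '2015-04-02T11:00'}}

merged = merge_rows(head_office, branch)
print(merged[42]['name'])  # Acme Ltd (the newer branch edit wins)
print(sorted(merged))      # [42, 43]
```

Real synchronization products (such as the SQL Azure Data Sync and Oracle replication options linked below) offer richer conflict-resolution policies than this, but the data-ownership questions above determine which policy is appropriate.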

http://blogs.msdn.com/b/sync/archive/2009/12/14/how-to-synchronize-multiple-geographically-distributed-sql-server-databases-using-sql-azure-data-sync.aspx

http://www.comp.dit.ie/btierney/Oracle11gDoc/server.111/b28324/tdpii_repcont.htm

# Power BI Limitations and Issues

I’ve been experimenting with Power BI recently for a proof-of-concept development: building a simple dashboard to display the consumption of electricity across different equipment. This dashboard was built on a data warehouse DB. Following are some of the observations from my initial experimentation with it. Nevertheless, there could be a lot of promise in the latest releases; I leave it to readers to check Power BI at their own discretion.

1. Unable to provide custom labels for the axes
2. Drill-down capabilities are limited
3. Close integration with Power View only
4. Limited documentation on deployment scenarios
5. Inability to export to multiple formats
6. No facility to drive colors or formatting from expressions (as of 30th April 2015 they have started working on it)
7. Limitations exist on the supported data sources

These are some of the limitations based on my exploration.

# Google Fusion Tables step by step: Table in cloud

• Viewers can see the table or charts from any where once shared
• API capability to program the table
• Ability to integrate tables from the internet
• Keep the data online and make it single version of truth across the board

In this blog post we will create a table for sharing the homework given in a school, using Google Fusion Tables. This table also helps to keep track of the various homework assignments with their due dates. It can be shared across the parents or students of a specific class so that everybody is on the same page. This would also help the students who were absent on that particular day.

Step 1: Getting Started

You can navigate to the Fusion Tables creation page using the following URL:

Step 2: Creating Fusion Table

Step 3: Change Columns: To add or modify the columns

You can add new columns by clicking the New button. Columns have the following data types: Text, Number, Date and Location.

Also, one of the key features is that we can define a drop-down list of items using comma-separated values or key-value pairs based on JSON.
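As a sketch of those two ways of expressing the same drop-down choices (the exact JSON shape expected by the Fusion Tables UI may differ; the `value`/`label` keys here are only illustrative):

```python
import json

# The same three drop-down choices expressed the two ways the post mentions:
# as comma-separated values and as JSON key-value pairs.
as_csv = "Activity,Subject,Research"
as_json = json.dumps([{"value": "1", "label": "Activity"},
                      {"value": "2", "label": "Subject"},
                      {"value": "3", "label": "Research"}])

print(as_csv.split(","))       # ['Activity', 'Subject', 'Research']
print(json.loads(as_json)[0])  # {'value': '1', 'label': 'Activity'}
```

The comma-separated form is quicker to type; the key-value form lets the stored value differ from the label shown in the drop-down.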

Step 4: Adding multiple values for dropdown

In this example I have added Activity, Subject and Research as the 3 category items for the column Category.

Click the Add row button from the Edit Menu to add a new row to the table.

Now you will be presented with a form to capture the inputs which will collect the details for the table.

This makes the data entry process easier; it helps you to select a value from the drop-down control and a date from a date-time picker. You can click either Save or Save and add another to add another row.

# Open source tools for Data Profiling

Data profiling is nothing but analyzing the existing data available in a data source and identifying the metadata of the same. This post is a high-level introduction to data profiling and just provides some pointers.

What is the use of doing data profiling?

1. To understand the metadata characteristics of the data under purview.
2. To have an enterprise view of the data for the purposes of Master Data Management and Data Governance.
3. Helps in identifying the right candidates for source-to-target mapping.
4. Ensures the data fits its intended purpose.
5. Helps to identify data issues and quantify them.
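As a minimal sketch of what a profiler computes (a Python illustration with a made-up two-column dataset, not any particular tool's output), the core per-column measures are row count, null count and distinct count:

```python
def profile(rows):
    """Compute simple per-column profiling stats: nulls and distinct values."""
    columns = rows[0].keys() if rows else []
    stats = {}
    for col in columns:
        values = [r[col] for r in rows]
        stats[col] = {
            'rows': len(values),
            'nulls': sum(1 for v in values if v is None),
            'distinct': len({v for v in values if v is not None}),
        }
    return stats

# Made-up customer records with one missing city.
data = [{'id': 1, 'city': 'Chennai'},
        {'id': 2, 'city': None},
        {'id': 3, 'city': 'Chennai'}]

print(profile(data)['city'])  # {'rows': 3, 'nulls': 1, 'distinct': 1}
```

Real profiling tools add pattern analysis, value-frequency distributions and cross-column dependency checks on top of these basics.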

Typical project types where it is being put to use:

• Research Engagements
• Data research projects
• Data Conversion/Migration Projects
• Source System Quality initiatives.

Some of the open source tools which can be used for Data Profiling:

Some links that point to the various commercial players that exist, and their comparison and evaluation:

In the next post we will evaluate certain aspects of data profiling with any of the tools mentioned in this blog post.

# India voter turnout for 2014: Using simple Excel Linear regression model

Wondering what the voter turnout would be for the year 2014, based on the past history of elections in India? Here is an attempt in my post to evaluate a linear regression model, fitting a line to the data to determine what the possible voter turnout percentage for the year 2014 could be.

Source of Data:

The International Institute for Democracy and Electoral Assistance has data on voter turnout in past Indian elections: http://www.idea.int/vt/countryview.cfm?CountryCode=IN. We will use that as the source for this analysis. Though elections are not held at regular intervals, this is only an attempt to find the line of best fit within the given data.

Step 1: Let’s put the data in a spreadsheet. For simplicity, and to bring some linearity, we have added an Index column along with the data.

| Year | Index | Voter turnout percentage |
|------|-------|--------------------------|
| 1952 | 0  | 61.17% |
| 1957 | 1  | 62.23% |
| 1962 | 2  | 55.42% |
| 1967 | 3  | 61.04% |
| 1971 | 4  | 55.25% |
| 1977 | 5  | 60.49% |
| 1980 | 6  | 56.92% |
| 1984 | 7  | 63.56% |
| 1989 | 8  | 61.98% |
| 1991 | 9  | 56.73% |
| 1996 | 10 | 57.94% |
| 1998 | 11 | 61.97% |
| 1999 | 12 | 59.99% |
| 2004 | 13 | 58.07% |
| 2009 | 14 | 58.17% |

Step 2: Now let’s plot a scatter chart of this data as shown below. Make sure you select both the Index and Voter turnout percentage columns and choose the “Scatter with Only Markers” chart type (I’m using Office 2007). The highest voter turnout appears to be in the year 1984, with 63.56%.

Step 3: Now let’s bring the linear regression equation into this chart, which will help us evaluate what the turnout in 2014 could possibly be. After selecting the chart, go to the “Design” tab, look for “Layout 9” (which shows fx along with a trend line) and select it. It’s shown in the picture below:

Step 4: After selecting Layout 9, you will have the line of best fit and its equation. For better clarity, the equation has been moved to the right.

Step 5: So, putting things into the equation, we expect about a 59% overall turnout for 2014. Not sure; I’m also awaiting the results. See the screenshots with the updated data in the spreadsheet.

| Year to predict | Equivalent index | Equation | Resolving |
|-----------------|------------------|----------|-----------|
| 2014 | 15 | y = -0.0005x + 0.5974, R² = 0.0069 | 59% |
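The post derives the trend line in Excel, but the same slope and intercept can be recomputed from the turnout data with a few lines of Python (ordinary least squares on the Index column), which also reproduces the 59% forecast:

```python
# Voter turnout (as fractions) for the 15 elections, indexed 0..14.
turnout = [0.6117, 0.6223, 0.5542, 0.6104, 0.5525, 0.6049, 0.5692,
           0.6356, 0.6198, 0.5673, 0.5794, 0.6197, 0.5999, 0.5807, 0.5817]
index = list(range(len(turnout)))

n = len(index)
mean_x = sum(index) / n
mean_y = sum(turnout) / n

# Ordinary least squares: slope = Sxy / Sxx, intercept from the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(index, turnout))
sxx = sum((x - mean_x) ** 2 for x in index)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(round(slope, 4), round(intercept, 4))  # -0.0005 0.5974, matching Excel
print(round(slope * 15 + intercept, 2))      # 0.59 -> the 59% forecast for 2014
```

The near-zero slope (and the tiny R² of 0.0069 Excel reports) confirms that the index explains almost none of the variation in turnout, so the "forecast" is essentially the historical average.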

Let’s exercise our democratic rights and await good governance. And let’s check the actual voter turnout for the year 2014. 🙂