Beginner’s guide to ARIMA: Problem Identification & Data Gathering – Part 2

Continuing from my earlier post, I’ll explore ARIMA through an example. In this post we will walk through, step by step, how to approach ARIMA-based forecasting for a concrete problem.

Step 1: Problem Identification or Our Scenario

We will take historical time series data on household power consumption and use it to build a forecast with ARIMA. A research paper in the Proceedings of the International MultiConference of Engineers and Computer Scientists 2013, Vol I, compares the performance of ARMA and ARIMA on the same dataset. This post focuses on producing the forecast step by step in R on that dataset, to keep things easy for beginners. Later on we will also evaluate tools such as AutoBox and R that can be used to solve this kind of problem.

Step 2: Data Gathering or Identification of dataset

The dataset we will use is the Individual household electric power consumption dataset, available in the UCI Repository at https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption. It is a multivariate dataset. Please check this link to understand the difference between univariate, bivariate and multivariate.

Quick Summary of the Dataset:

  • Dataset contains data between December 2006 and November 2010
  • It has around 19.7 MB of Data
  • File is in .TXT Format
  • Columns in the Dataset are:
    • Date (DD/MM/YYYY)
    • Time (HH:MM:SS)
    • Global active power
    • Global reactive power
    • Voltage
    • Global intensity
    • Sub Metering 1
    • Sub Metering 2
    • Sub Metering 3

You can open this semicolon-delimited text file in Excel: work through the steps of the import wizard and you will end up with a worksheet containing the data, as given below. I was able to load only the first 1048576 rows, which is the maximum number of rows an Excel worksheet can hold. The text file actually contains 2075260 rows in total. Whoa..
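As an alternative to Excel, the text file can also be read straight into R, which is not bound by the worksheet row limit. The sketch below is only a minimal example under some assumptions: the file name household_power_consumption.txt, the underscore-style column names and the "?" marker for missing readings are assumed here rather than taken from this post, and a three-row stand-in file with made-up values is written to a temporary location so the snippet runs on its own.

```r
# Write a tiny stand-in file so the example is self-contained.
# The rows below are made-up values, not real measurements.
sample_lines <- c(
  "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3",
  "16/12/2006;17:24:00;4.2;0.4;234.8;18.4;0;1;17",
  "16/12/2006;17:25:00;?;?;?;?;?;?;?",
  "16/12/2006;17:26:00;5.3;0.5;233.2;23.0;0;2;17"
)
sample_path <- tempfile(fileext = ".txt")
writeLines(sample_lines, sample_path)

# For the real data, point this at the downloaded file instead
# (assumed name: "household_power_consumption.txt").
powerRaw <- read.table(
  sample_path,
  header = TRUE,            # first line holds the column names
  sep = ";",                # fields are semicolon delimited
  na.strings = "?",         # assumption: "?" marks a missing reading
  stringsAsFactors = FALSE  # keep Date and Time as plain text
)

nrow(powerRaw)   # number of data rows read (header excluded)
str(powerRaw)    # column names and types
```

Reading the file this way keeps every row available for the later analysis, at the cost of a longer load time for the full file.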

 

For Step 3, the preliminary analysis, we can use R. That means loading the data into R first. To do this I can save the Excel sheet in CSV or XLS format and then import it into R as outlined in my other post or using this link. I’m using RStudio for this, and the screenshots in the following sections demonstrate the data loading process.

The first attempt to install gdata showed an error, but on the next attempt the installation succeeded. Now we can load the library with the command library(gdata). After that, I loaded the data from the CSV file into the powerData variable for further analysis; we can inspect it using View. Please check the console window for the code.
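For readers who cannot see the console, the CSV loading step amounts to something like the following sketch. The file name powerData.csv is an assumption (use whatever name you gave the file when saving it from Excel), and a two-row stand-in CSV with made-up values is created first so the snippet runs as written. Note that read.csv ships with base R, so gdata is only needed when reading the .xls version.

```r
# Stand-in for the CSV saved from Excel (made-up values, hypothetical file).
csv_path <- file.path(tempdir(), "powerData.csv")
writeLines(c(
  "Date,Time,Global_active_power,Voltage",
  "16/12/2006,17:24:00,4.2,234.8",
  "16/12/2006,17:25:00,5.3,233.6"
), csv_path)

# Load the CSV into the powerData variable used in this post.
powerData <- read.csv(csv_path, stringsAsFactors = FALSE)

head(powerData)   # first rows, printed in the console
dim(powerData)    # number of rows and columns that were loaded

# In RStudio, View() opens the data in a spreadsheet-style tab.
if (interactive()) View(powerData)
```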

In the next post we will do some preliminary analysis on this data which we have loaded.

Importing Excel data using R: Step by Step

Update: Please refer to this blog for various other ways to import Excel data: http://www.milanor.net/blog/?p=779. The steps I outline here depend on Perl.

In this blog I’m going to share the steps involved in importing Excel data into R and getting a first look at it.

Prerequisites:

  1. Windows XP
  2. Completed the installation of R (In this example I’m using R version 2.15.2)

Step 1: Keep the Excel file ready. In this example I’ve prepared my own sample data: a table of employees who have shown interest in a stream change. The column labeled “Agreed” captures whether they have agreed or not. Snapshot of the worksheet:

Step 2: I have named the file as datasource.xls

Step 3: To read the Excel file we need the “gdata” package. If it is not installed yet, install it with the command install.packages("gdata"). Make sure you have an internet connection so the package can be downloaded.


Step 4: Then issue the command library(gdata), which enables support for .xls files in R.
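Steps 3 and 4 can be combined so that the package is downloaded only when it is missing. The helper below is a hypothetical convenience function, not something gdata provides, and it assumes a reasonably recent R release in which requireNamespace() is available.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs an internet connection
  }
  library(pkg, character.only = TRUE)   # attach the package by its name
  invisible(TRUE)
}

# Usage for this walkthrough:
# ensure_package("gdata")
```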


Step 5: We are going to use the command read.xls. If you need more help with it, issue help(read.xls), which starts the help server and loads the relevant documentation.

Step 6: I have saved datasource.xls in the My Documents folder.

Step 7: Issue the command mydata=read.xls("datasource.xls"), then type mydata to see the Excel data loaded into the R environment. If the file is missing or there is a problem with the path, you will get an error.
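Because a wrong path is the most likely failure here, it can help to check for the file before calling read.xls. The wrapper below is a minimal sketch: load_xls is a hypothetical name introduced for illustration, and it assumes gdata and Perl are already set up when a real file is passed in.

```r
# Hypothetical helper: load an .xls file with gdata, failing early with a
# clear message when the file or the package is missing.
load_xls <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (working directory: ", getwd(), ")")
  }
  if (!requireNamespace("gdata", quietly = TRUE)) {
    stop("Package 'gdata' is not installed; run install.packages(\"gdata\") first")
  }
  gdata::read.xls(path)   # reads the first worksheet by default; needs Perl
}

# Usage, assuming datasource.xls sits in the working directory:
# mydata <- load_xls("datasource.xls")
# mydata
```

If the error reports an unexpected working directory, either move the file there or change the directory with setwd() before loading.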

Step 8: In this final step we can see a summary of the data using the command summary(mydata). Note that R is case sensitive, so the command is summary, not Summary.
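The summary step can be tried even without the spreadsheet at hand. The sketch below builds a small stand-in for the worksheet: the “Agreed” column comes from this post, while the Employee column and all the values are hypothetical.

```r
# Stand-in for the worksheet: hypothetical employees and their answers.
mydata <- data.frame(
  Employee = c("A", "B", "C", "D"),
  Agreed   = factor(c("Yes", "No", "Yes", "Yes"))
)

summary(mydata)        # per-column overview; factor columns show counts
table(mydata$Agreed)   # how many agreed and how many did not
```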