Fundamentals of Pig: Working with Tuples

In the previous blog we uploaded the Windows Event log to the Hadoop environment and started analyzing it using Pig. In this blog we will see how to work with tuples.

Filtering Data:

In the script below there is no filter applied, so it fetches all the tuples.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory,TaskDescription);

Describe Events;

Result = FOREACH Events GENERATE Level,EventID, TaskDescription;

Dump Result;

One such example is highlighted in the picture given below.

Tuples of data can be filtered using the FILTER option in Pig.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory,TaskDescription);

Describe Events;

Result = FILTER Events BY EventID IS NOT NULL;

Dump Result;

In the code snippet above, the events are filtered so that only tuples with a non-null EventID are kept; you can see the results below.
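As an aside, the effect of FILTER and GENERATE on the tuples can be mimicked in plain Python. The rows below are made-up samples in the same field order as the schema, not the real MyAppEvents.csv:

```python
# Made-up event rows in the schema's field order:
# (Level, DateTime, Source, EventID, TaskCategory, TaskDescription)
events = [
    ("Error", "2014-01-01 10:00", "MyApp", 1000, "Startup", "Crash on start"),
    ("Information", "2014-01-01 10:05", "MyApp", None, "General", "Heartbeat"),
    ("Warning", "2014-01-01 10:10", "MyApp", 1026, "Runtime", "Slow response"),
]

# FILTER Events BY EventID IS NOT NULL
result = [e for e in events if e[3] is not None]

# FOREACH Result GENERATE Level, EventID, TaskDescription
projected = [(e[0], e[3], e[5]) for e in result]
print(projected)
```

The second row is dropped because its EventID is null, just as Pig drops tuples that fail the FILTER condition.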

 

More to come..


Time Series Analysis with R: Part I

A time series is a collection of well-defined data observations recorded over a period of time.

Examples:

  • Measuring retail sales every month for a year
  • Number of births per month in a city over a specific period
  • Age at death of successive kings of England
  • Production of gold over the years, based on the US Geological Survey (in metric tons)

Analysis of Gold Production with R:

Please refer to DataMarket.com to get the gold-related data; it's free. You can refer to my previous blog to see how to import data from Excel. In this example I have imported the data using the read.csv method with the gdata library.

The run above indicates start = 1 and end = 111 with a frequency of 1.

Plotting the time series data:

> plot.ts(GoldDataTS)
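If you would like to follow along without R, the shape of that ts object (start = 1, end = 111, frequency = 1) can be sketched in plain Python. The values here are made up, not the DataMarket gold figures:

```python
# A yearly series of 111 observations, mirroring the ts object's
# start/end/frequency attributes. The production figures are hypothetical.
values = [1500 + 10 * i for i in range(111)]

start, frequency = 1, 1
end = start + (len(values) - 1) // frequency

print(start, end, frequency)
```

With one observation per year, the end point is simply the start plus the number of observations minus one, which is what R reports for this dataset.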

Pig: Exploring More on Schemas and Data Models

Schema in Pig:

Schemas apply to both simple and complex data types and can be used wherever required. A schema can be specified with the LOAD, STREAM and FOREACH operators using the AS clause. We will see a case and an example further on.

When we specify a schema, we can mention the field name and also its data type. If the data type is not mentioned while providing the schema, it is automatically treated as bytearray, which can be cast to a different data type later if required. A field specified in the schema can be accessed by its name or by positional notation. We will see that in the example going forward.
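As a Python analogy (a namedtuple, not Pig itself), here is what access by field name versus positional notation looks like; the field names mirror the schema used in this post, and the values are made up:

```python
from collections import namedtuple

# Field names mirror the schema declared in this post's Pig script.
Event = namedtuple(
    "Event", "Level DateTime Source EventID TaskCategory TaskDescription"
)

e = Event("Error", "2014-01-01 10:00", "MyApp", 1000, "Startup", "Crash on start")

print(e.EventID)  # access by field name, like Events.EventID in Pig
print(e[3])       # access by position, like $3 in Pig (0-based)
```

Both forms refer to the same field; named access is simply easier to read once a schema is declared.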

Case:

I would like to do some analysis of the Event Viewer logs on my PC in the Pig environment, along with exploring tuples further. So I have exported my events from the Event Viewer and uploaded them to my Hortonworks environment as a file named 'MyAppEvents.csv'.

In the sample Pig script given below, Pig is unable to determine the schema, as you can see in the output window with the message "Schema for events unknown".

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',');
Describe Events;
Dump Events;

Now we will provide a schema to the same Pig script and see what happens with the schema definition in place.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);
Describe Events;

Now assume we would like to access only Level, EventID and TaskDescription; for that we need to use FOREACH.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);
Describe Events;
Result = FOREACH Events GENERATE Level,EventID, TaskDescription;
Dump Result;

This produces the projected results, and now we will move on to understanding tuples.

Tuple:

Now we will understand more about tuples.

A tuple is an ordered set of fields. It’s most often used as a row in a relation. It’s represented by fields separated by commas, all enclosed by parentheses.

Each field in a tuple can be of a different data type. Constants are written in single quotes, and fields are delimited by commas.

Example:

(Siva,33,'M',Chennai)

Other Note:

A tuple can have its own schema to describe its fields if required, which helps the end user determine the data types expected in the tuple.

Fundamentals of Pig

Introduction:

Pig is a high-level scripting platform for analyzing large data sets. It is an abstraction built on top of Hadoop. It consists of a domain-specific dataflow language, Pig Latin, and a translation engine that converts Pig Latin into MapReduce jobs. It uses familiar keywords such as JOIN, GROUP and FILTER. Pig has been a Hadoop subproject since 2007.

What do I need to work with Pig?

You need a Windows or Linux environment with Hadoop and Java 1.6 or above. It is easiest to get started with the Cloudera or Hortonworks distribution of Hadoop.

Running Pig:

You can run Pig commands or statements in local mode or MapReduce mode. In local mode, all files are installed on the local host and accessed from the local filesystem. In MapReduce mode, we need access to a Hadoop cluster and an HDFS installation. MapReduce is the default mode of execution.

Big Picture in a simple way:


Structure of the Pig Latin Script:


Linear Regression using R: Part II

In the previous post we saw a quick introduction to linear regression. In this post we will see how to implement linear regression using R. We will use the data from our first post, studentsdata. If you would like to know more about how to load the data, please refer to my other blog post.

Step 1: Let's have a look at the data by typing the command studentsdata, which is loaded as described in my other blog post.

Step 2: With the data stored in the table studentsdata, we will now plot the Tamil marks against the total scores and see how they come along, using the command plot(studentsdata$Tamil, studentsdata$TotalScores).

Step 3: It is practically difficult to fit a perfect straight line through these points. So we will calculate and plot the line of best fit, the least squares regression line. We will use the lm command to compute the linear model.

> res=lm(studentsdata$TotalScores~studentsdata$Tamil)

> abline(res)

> res

Call:
lm(formula = studentsdata$TotalScores ~ studentsdata$Tamil)

Coefficients:
       (Intercept)  studentsdata$Tamil
           233.477               1.311
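For the curious, lm arrives at these coefficients by ordinary least squares: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). The Python sketch below applies those formulas to made-up marks (not the actual studentsdata):

```python
# Least-squares slope and intercept, computed the way lm does.
# x and y are hypothetical marks, not the real studentsdata values.
x = [40, 55, 60, 70, 85]
y = [290, 300, 310, 330, 345]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# slope = covariance(x, y) / variance(x)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
        sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x

print(round(slope, 3), round(intercept, 3))
```

Running lm on these same made-up numbers would report the same two coefficients.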

Step 4: You can see the plot done by abline here for the line of best fit.

Step 5: Prediction of the total score using linear regression. Now we have the line of best fit:

TotalScore = studentsdata$Tamil * 1.311 + 233.477

If you wish to predict the total score of a student who scores 75 in Tamil, it would be as follows:

> 1.311*75+233.477

[1] 331.802

He/she would score about 331.80 total marks.
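The prediction is plain arithmetic on the line of best fit; as a sanity check, here is the same calculation in Python using the coefficients lm reported:

```python
# Coefficients reported by lm above
slope, intercept = 1.311, 233.477

# Predicted total score for a student scoring 75 in Tamil
predicted = slope * 75 + intercept
print(round(predicted, 3))
```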

Linear Regression using R: Part I

In this blog we are going to see what linear regression is, and in my next blog we will see how to implement the same in R.

What is linear regression?

A very simple definition from the internet goes like this:

A technique in which a straight line is fitted to a set of data points to measure the effect of a single independent variable. The slope of the line is the measured impact of that variable. It is one of the most widely used statistical techniques. It is the study of the linear (straight-line) relationship between variables, under an assumption of normally distributed errors.

Why?

To determine the effect of one variable on another. Technically, linear regression estimates how much Y changes when X changes by one unit.
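That "one unit" interpretation can be checked numerically with any fitted line; the slope and intercept below are arbitrary illustrative numbers:

```python
slope, intercept = 1.311, 233.477  # arbitrary illustrative coefficients

def predict(x):
    return slope * x + intercept

# Increasing X by one unit changes the predicted Y by exactly the slope
print(round(predict(11) - predict(10), 3))
```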

Examples?

  1. A change in fuel prices increases/decreases inflation
  2. A change in raw material cost increases/decreases the product price
  3. A change in class size increases/decreases student participation in events

How do I do it manually?

If you want to check manually how it's done, please refer to this link, which provides a clear explanation of how linear regression works.

In the next blog we will see how we can use R for linear regression, using the population of India between 1901 and 2011.

Power Pivot with Test playing nations

I have attempted to create a Power Pivot using data from ESPN Cricinfo. Power Pivot seems very powerful and makes slicing and dicing easy. Try your hand at it. Based on the historic records, the numbers of matches won and lost appear to be roughly equal. Australia and India have each drawn more than 200 matches, with Australia having drawn 200 of its 752 matches.

Interesting info. 🙂