Fundamentals of Pig: Workings with Tuples

In the previous blog we uploaded the Windows Event log to the Hadoop environment and started analyzing it using Pig. We will see in this blog how we can work with the tuples.

Filtering Data:

In the script below there is no filter applied, so it fetches all the tuples.

Events = LOAD ‘MyAppEvents.csv’ USING PigStorage(‘,’) as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);

Describe Events;

Result = FOREACH Events GENERATE Level,EventID, TaskDescription;

Dump Result;

You can see one such example is highlighted in the picture given below.

Tuples of data can be filtered using the FILTER option in Pig.

Events = LOAD ‘MyAppEvents.csv’ USING PigStorage(‘,’) as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);

Describe Events;

Result = Filter Events by EventID is not null

Dump Result;

In this above code snipped the events are filtered when the EventID is not null, you can see the results.

 

More to come..

Advertisements

Pig: Exploring more on Schema and data models

Schema in Pig:

Schemas are for both simple and complex types of data and can be used appropriately wherever required. It can be used with LOAD, STREAM and FOREACH operations using the AS Clause. We will see a case and example further.

When we specify a schema we can mention about the field name and also its data type. If there is no mention about the data type while we are providing the schema it’s automatically considered as bytearrray if required can be casted to a different datatype later. The fieldname specified in the schema can be accessed by its name or positional notation. We will see that in the example going forward.

Case:

I would like to do some analysis on the EventViewer in my PC with the Pig Environment along with exploring more on Tuple. So I have exported my events from the Event viewer and uploaded to my Hortonworks environment as a filename ‘MyAppEvents.csv’.

In this sample pig Script given below, it’s unable to determine the schema as you can see in the output window below with a message “Schema for events unknown”.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',');
Describe Events;
Dump Events;

Now we will try to provide schema to this same pig script and see what happens with the new code with schema definition.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);
Describe Events;

Now assume we would like to only access the Level, EventId and TaskDescription we would need to use FOREACH.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);
Describe Events;
Result = FOREACH Events GENERATE Level,EventID, TaskDescription;
Dump Result;

This will provide results like this and now we will move on to understanding tuple.

Tuple:

Now, we will understand more about tuple.

A tuple is an ordered set of fields. It’s most often used as a row in a relation. It’s represented by fields separated by commas, all enclosed by parentheses.

Each field can be of different data type in a tuple. Constants are referred in single quotes and they are delimited by commas.

Example:

(Siva,33,’M’,Chennai)

Other Note:

Tuple can have its own schema if required to describe the fields in it. So it might help the end user in determining the data types expected in a tuple.