Pig: Exploring more on Schema and data models

Schema in Pig:

A schema can describe both simple and complex data types. Schemas are declared with the AS clause of the LOAD, STREAM and FOREACH operators. A case with an example follows below.

When we specify a schema, we can give each field a name and, optionally, a data type. If no data type is given, the field defaults to bytearray and can be cast to a different data type later if required. A field named in the schema can be accessed either by its name or by positional notation ($0 for the first field, $1 for the second, and so on). The example further down accesses fields by name.
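The default-to-bytearray rule and the later cast can be sketched as follows. This is only an illustration: the file 'people.csv' and the field names name, age and city are hypothetical and do not come from the case below.

```pig
-- Hypothetical input file and field names, for illustration only.
-- name is typed explicitly; age and city have no type, so they default to bytearray.
People = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray, age, city);

-- Cast the untyped field once a specific type is needed.
Typed = FOREACH People GENERATE name, (int)age AS age, city;

-- Compare the schema before and after the cast.
DESCRIBE People;
DESCRIBE Typed;
```

Declaring the type directly in the AS clause (for example age:int) would avoid the explicit cast; the version above is written this way only to show the default.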

Case:

I would like to analyse the Event Viewer logs from my PC in the Pig environment and, along the way, explore tuples. So I exported my events from the Event Viewer and uploaded them to my Hortonworks environment as a file named 'MyAppEvents.csv'.

The sample Pig script below does not supply a schema, so Pig is unable to determine one and DESCRIBE reports the message “Schema for Events unknown.”

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',');
Describe Events;
Dump Events;

Now we add a schema definition to the same Pig script and see what DESCRIBE reports.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);
Describe Events;

Now assume we only want the Level, EventID and TaskDescription fields. For that we need to use FOREACH.

Events = LOAD 'MyAppEvents.csv' USING PigStorage(',') as (Level,DateTime,Source,EventID,TaskCategory, TaskDescription);
Describe Events;
Result = FOREACH Events GENERATE Level,EventID, TaskDescription;
Dump Result;
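The same projection can also be written with positional notation. The sketch below assumes the CSV columns are in the order given in the schema above, so that Level is $0, EventID is $3 and TaskDescription is $5; the relation names Raw and ByPosition are made up for the illustration.

```pig
-- Positional notation works even when no schema is declared.
Raw = LOAD 'MyAppEvents.csv' USING PigStorage(',');

-- Assumed column order: $0 = Level, $3 = EventID, $5 = TaskDescription.
ByPosition = FOREACH Raw GENERATE $0, $3, $5;
Dump ByPosition;
```

Named fields are usually easier to read and survive a change in column order, so positional notation is mostly useful when no schema is available.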

This returns only the three selected fields for each event. Next we move on to understanding tuples.

Tuple:

Let us look at the tuple in more detail.

A tuple is an ordered set of fields. It’s most often used as a row in a relation. It’s represented by fields separated by commas, all enclosed by parentheses.

Each field in a tuple can have a different data type. String (chararray) constants are written in single quotes, and the fields are delimited by commas.

Example:

('Siva',33,'M','Chennai')

Note:

A tuple can have its own schema that describes the fields inside it. This helps the end user know which data types to expect in the tuple.
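A tuple schema might look like the sketch below. The file 'people.txt', the field name person and the inner field names are all hypothetical; the file is assumed to hold one tuple per line, written in Pig's text form such as (Siva,33,M,Chennai).

```pig
-- Hypothetical file: each line holds one tuple, e.g. (Siva,33,M,Chennai)
People = LOAD 'people.txt'
         AS (person:tuple(name:chararray, age:int, gender:chararray, city:chararray));

-- Fields inside the tuple are reached with the dot operator.
Names = FOREACH People GENERATE person.name, person.city;

DESCRIBE People;
Dump Names;
```

With the inner fields named and typed, DESCRIBE shows what each position of the tuple is expected to contain.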
