PIG INTERVIEW QUESTIONS
Frequently asked Pig Interview Questions:
1) What is Pig?
Pig is a data flow language whose programs run in parallel on Hadoop. It is an Apache open source project. It includes a language called Pig Latin, which is used to express these data flows.
It provides operations such as join, sort, and filter, and supports user-defined functions (UDFs) for processing, reading, and writing data. Pig uses both HDFS and MapReduce, i.e. HDFS for storage and MapReduce for processing.
2) What is Dataflow language?
In a conventional control-flow language, the instructions drive the program: execution moves from one instruction to the next through control statements (conditions, jumps, loops), while the data itself does not move.
In a data flow language, a stream of data passes from one operation to the next to be processed. Pig Latin follows this model: a script describes a pipeline of transformations and has no if statements or loops of its own. When control flow is needed, Pig is embedded in a host language.
3) Can you define Pig in 2 lines?
Pig is a platform for analyzing large data sets, whether structured or unstructured, using Pig Latin scripts. It is designed to process data streams and unstructured data in parallel.
4) Why Pig?
• Ease of programming
• Optimization opportunities
• Extensibility
Ease of programming:
It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities:
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility:
Users can create their own functions to do special-purpose processing.
5) What are the features of Pig?
The user specifies a sequence of steps, where each step specifies only a single high-level data transformation.
• Data flow language
• User-defined functions (UDFs)
• Debugging environment
• Nested data model
6) What are the advantages of using Pig?
i) Pig can be treated as a higher-level language
• Increases programming productivity
• Decreases duplication of effort
• Opens the M/R programming system to more users
ii) Pig insulates against Hadoop complexity
• Hadoop version upgrades
• Job configuration tuning
7) What is Pig useful for?
Pig is used in three categories:
• ETL data pipelines
• Research on raw data
• Iterative processing
The most common use case for Pig is the data pipeline.
For example, web-based companies collect web logs, and before storing the data in a warehouse they perform operations such as cleaning and aggregation on it, i.e. transformations on the data.
8) What is Pig Engine?
The Pig engine is the execution environment that runs Pig Latin programs. It converts Pig Latin operators and transformations into a series of MapReduce jobs that run in parallel.
9) Why Pig instead of MapReduce?
Compared with MapReduce, Apache Pig offers many built-in features. In MapReduce, joining multiple data sets is difficult and the development cycle is very long. Depending on the task, Pig automatically converts the script into map and reduce phases. Pig makes it easy to join multiple data sets and to run SQL-like operations such as join, filter, group by, order by, union, and many more.
10) Why do we need MapReduce when programming in Pig?
Pig is a high-level platform that makes many Hadoop data analysis tasks easier to execute. The language used on this platform is Pig Latin. A program written in Pig Latin is like a query written in SQL: it needs an execution engine to run it. So when a program is written in Pig Latin, the Pig compiler converts it into MapReduce jobs. Here, MapReduce acts as the execution engine.
11) How does Pig integrate with MapReduce to process data?
When a programmer writes a script to analyze data sets, the Pig compiler converts the script into MapReduce jobs. The Pig engine submits these jobs to Hadoop, where MapReduce processes the data and generates the output. MapReduce does not return the output to Pig; it is stored directly in HDFS.
12) What is the difference between MapReduce and Pig?
MapReduce | Pig |
1. The entire logic for operations such as join, group, filter, and sum has to be written by hand. | 1. Built-in functions are available for these operations. |
2. Many lines of code are required even for simple functionality. | 2. About 10 lines of Pig Latin are equivalent to about 200 lines of Java. |
3. Coding takes a lot of time and effort. | 3. What took 4 hours to write in Java took about 15 minutes in Pig Latin (approx.). |
4. Lower productivity. | 4. Higher productivity. |
13) What is the difference between logical and physical plan?
Pig goes through several steps when a Pig Latin script is converted into MapReduce jobs. After basic parsing and semantic checking, it produces a logical plan, which describes the logical operators that Pig has to execute. Pig then produces a physical plan, which describes the physical operators needed to execute the script.
14) In how many ways can we run Pig programs?
Pig programs or commands can be executed in three ways:
• Script: batch method
• Grunt shell: interactive method
• Embedded mode
All three can be used in both the local and the MapReduce mode of execution.
15) What is Grunt in Pig?
Grunt is Pig's interactive shell. Its major features are:
• The Ctrl-E key combination moves the cursor to the end of the line.
• Grunt remembers command history and can recall lines in the history buffer with the up and down cursor keys.
• Grunt supports auto-completion: it tries to complete Pig Latin keywords and functions when you press the Tab key.
16) What are the modes of Pig execution?
Local mode:
Execution in a single JVM. All files are installed and run using the local host and the local file system.
MapReduce mode:
Distributed execution on a Hadoop cluster. This is the default mode.
17) What are the main differences between local mode and MapReduce mode?
Local mode:
There is no need to install or start Hadoop. Pig scripts run on the local system, and by default Pig stores data in the local file system. The commands are exactly the same in local and MapReduce mode, so nothing in the script needs to change.
MapReduce mode:
Hadoop must be running. Pig scripts run on the cluster, and the data is stored in HDFS.
In both modes, Java and Pig must be installed.
18) Can we process vast amounts of data in local mode? Why?
No. A single system has a limited, fixed amount of storage, whereas Hadoop can handle vast amounts of data. So MapReduce mode (pig -x mapreduce) is the best choice for processing vast amounts of data.
19) What is Pig Latin?
Pig Latin is a data flow Scripting Language for exploring large data sets. A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output.
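As a minimal sketch of such a series of transformations, the script below loads a web log, filters it, groups it, and stores a count per user. The input name 'weblogs', its tab-separated layout, the field names, and the output name are assumptions made for illustration.

```pig
-- Hypothetical input: tab-separated lines of (user, url, status).
logs    = load 'weblogs' as (user:chararray, url:chararray, status:int);
-- Keep only the successful requests.
ok      = filter logs by status == 200;
-- Collect the records of each user into one group.
by_user = group ok by user;
-- Count the records in each group.
hits    = foreach by_user generate group as user, COUNT(ok) as requests;
-- Write the result; Pig runs the pipeline when it reaches a store or dump.
store hits into 'hits_per_user';
```

Each statement names the relation produced by one transformation, and the next statement consumes it, which is what makes the script read as a data flow.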
20) Does Pig support multi-line commands?
Yes. A Pig Latin statement can span multiple lines; it ends at the semicolon.
21) Hive doesn’t support multi-line commands, what about Pig?
Pig supports both single-line and multi-line commands, and it also supports single-line and multi-line comments.
Single-line comments:
dump B; -- executes the data flow and displays the result, but does not store it in the file system
Multi-line comments:
store B into '/output'; /* stores (persists) the data in HDFS or the local file system.
store is the command most often used in production */
22) What are the Pig Latin Features?
• Pig Latin script is made up of a series of operations, or transformations, that are applied to the input data to produce output
• Pig Latin programs can be executed either in interactive mode through the Grunt shell or in batch mode via Pig Latin scripts
• Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.)
• User Defined Functions (UDF)
• Debugging Environment
23) Is Pig Latin Case Sensitive?
The names (aliases) of relations and fields are case sensitive. The names of Pig Latin functions are case sensitive. The names of parameters and all other Pig Latin keywords are case insensitive.
So Pig Latin is case sensitive in some places and not in others.
For example:
Load is equivalent to load.
A = load 'b' is not equivalent to a = load 'b'.
UDF names are also case sensitive: count is not equivalent to COUNT.
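The rules above can be sketched in a short script. The input name 'b' is taken from the example and stands for any existing file; the other aliases are made up for illustration.

```pig
-- Keywords are case insensitive: both statements load the same data.
A = load 'b';
a = LOAD 'b';
-- Aliases are case sensitive: A and a are two different relations.
grouped = group A all;
-- Function names are case sensitive: COUNT is the built-in function,
-- whereas count(A) would fail because no function of that name exists.
total = foreach grouped generate COUNT(A);
dump total;
```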
24) What is the difference between Pig Latin and HiveQL ?
Pig Latin:
• Procedural language
• Nested relational data model
• Schema is optional
HiveQL:
• Declarative language
• Flat relational data model
• Schema is required
25) What is the difference between Pig and SQL?
Pig | SQL |
1. Procedural. | 1. Declarative. |
2. Nested relational data model. | 2. Flat relational data model. |
3. Schema is optional. | 3. Schema is required. |
4. Supports OLAP workloads. | 4. Supports OLAP and OLTP workloads. |
5. Limited query optimization. | 5. Significant opportunity for query optimization. |
26) How do you debug in Pig?
Describe: displays the schema of a relation.
Explain: displays the logical, physical, and MapReduce execution plans.
Illustrate: shows step by step how sample data passes through each operator.
These commands are used to debug Pig Latin scripts.
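A minimal sketch of the three commands applied to one relation follows. The input name 'daily' and the field names, in particular the price field, are assumptions made for illustration.

```pig
-- Hypothetical input with a declared schema.
daily  = load 'daily' as (exchanges:chararray, stocks:chararray, price:double);
costly = filter daily by price > 100.0;
-- Print the schema of the relation.
describe costly;
-- Print the logical, physical, and MapReduce plans for the relation.
explain costly;
-- Run the steps on a small sample of the data and show the rows at each step.
illustrate costly;
```

describe and explain do not read the input data, so they are a cheap first check before running a long job.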
27) Can you tell me a little about Hive and Pig?
- Pig uses Pig Latin, which is a procedural language. Schema is optional, and there is no metastore concept, whereas Hive uses a database as its metastore.
- Hive uses a special language called HQL, which is similar to SQL. Schema is mandatory to process data. Hive is designed for queries.
- Both Pig and Hive run on top of MapReduce and convert their commands into MapReduce jobs. Both are used to analyze data and can eventually generate the same output.
28) What are the important data types in Pig?
Primitive data types: int, long, float, double, chararray, bytearray.
Complex data types: tuple, bag, map.
29) What are the simple data types supported by Pig?
Simple type | Description | Example |
int | Signed 32-bit integer | 10 |
long | Signed 64-bit integer | 10L or 10l |
float | 32-bit floating point | 10.5F or 10.5f or 10.5e2f |
double | 64-bit floating point | 10.5 or 10.5e2 or 10.5E2 |
chararray | Character array | hello world |
bytearray | Byte array | |
boolean | boolean | true/false (case insensitive) |
datetime | datetime | 1970-01-01T00:00:00.000+00:00 |
biginteger | Java BigInteger | 200000000000 |
bigdecimal | Java BigDecimal | 33.4567833213 |
30) What are the complex data types in Pig?
Map:
A map in Pig is a mapping from chararray keys to data elements, where an element can be of any Pig data type, including a complex type.
Example:
['city'#'hyd','pin'#500086]
In the example above, city and pin are the keys, and they map to the values hyd and 500086.
Tuple:
A tuple is a fixed-length, ordered collection of fields, and each field can be of any data type.
Example:
('hyd',500086), which contains two fields.
Bag:
A bag is an unordered collection of tuples. Bag constants are constructed using braces, with the tuples in the bag separated by commas.
Example:
{('hyd', 500086), ('chennai', 510071), ('bombay', 500185)}
31) What is a tuple?
A tuple is an ordered set of fields, and a field is a piece of data.
32) What is a bag?
A bag is one of the data models in Pig. It is an unordered collection of tuples, possibly with duplicates. Bags are used to store collections while grouping.
A bag does not have to fit completely into memory: when it grows too large, Pig spills it to local disk and keeps only part of it in memory. The size of a bag is therefore limited by the size of the local disk. Bags are written with braces, “{}”.
33) What are the features of bag?
• A bag can have duplicate tuples.
• A bag can have tuples with differing numbers of fields. However, if Pig tries to access a field that does not exist, a null value is substituted.
• A bag can have tuples with fields that have different data types. However, for Pig to effectively process bags, the schemas of the tuples within those bags should be the same.
34) What is an outer bag?
An outer bag is nothing but a relation.
35) What is an inner bag?
An inner bag is a relation inside any other bag.
Example: (4,{(4,2,1),(4,3,3)})
In the above example, the complete relation is an outer bag and {(4,2,1),(4,3,3)} is an inner bag.
36) What is the purpose of the 'dump' keyword in Pig?
dump displays the contents of a relation on the screen. For a relation with the alias processed:
dump processed;
37) What is the purpose of ‘Store’ keyword?
After you have finished processing your data, you will want to write it out somewhere.
Pig provides the store statement for this purpose. In many ways it is the mirror image of the load statement. By default, Pig stores your data on HDFS in a tab-delimited file using PigStorage.
38) What is the difference between the store and dump commands?
The dump command processes the data and displays the result on the terminal, but does not store it anywhere. The store command writes the result to a folder in the local file system or HDFS. In production environments, Hadoop developers most often use store to save data in HDFS.
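The difference can be sketched as follows. The input and output names and the schema are assumptions made for illustration.

```pig
-- Hypothetical input and output locations.
B = load 'input' as (name:chararray, city:chararray);
-- Print the tuples of B on the terminal; nothing is kept afterwards.
dump B;
-- Write B to the directory 'output' as comma-separated text.
store B into 'output' using PigStorage(',');
```

The output location is a directory of part files rather than a single file, and the store fails if that directory already exists.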
39) Name a few important operators for working with data in Pig.
Filter: works on tuples (rows) to filter the data.
Foreach: works on columns of data, generating a set of columns for each record.
Group: groups the data in a single relation.
Cogroup and Join: group or join the data in multiple relations.
Union: merges the data of multiple relations.
Split: partitions the content of a relation into multiple relations.
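split and union have no example elsewhere in this list, so here is a minimal sketch of both. The input name 'people' and its fields are assumptions made for illustration.

```pig
-- Hypothetical input: one (name, age) pair per line.
people = load 'people' as (name:chararray, age:int);
-- Partition the relation into two relations by condition.
split people into minors if age < 18, adults if age >= 18;
-- Merge the two relations back into one; union does not preserve order.
everyone = union minors, adults;
dump everyone;
```

A record whose age is null satisfies neither condition, so it ends up in neither relation.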
40) What are the relational operations in Pig Latin?
• foreach
• order by
• filter
• group
• distinct
• join
• limit
41) How do you use the 'foreach' operation in Pig scripts?
foreach takes a set of expressions and applies them to every record in the data pipeline.
A = load 'input' as (user:chararray, id:long, address:chararray, phone:chararray, preferences:map[]);
B = foreach A generate user, id;
Positional references are preceded by a $ (dollar sign) and start from 0:
C = foreach D generate $2 - $1;
Here D stands for any relation whose fields at positions 1 and 2 are numeric.
42) How do you write a 'foreach' statement for the map data type in Pig scripts?
For a map, use the hash sign ('#') followed by the key:
bball = load 'baseball' as (name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]);
avg = foreach bball generate bat#'batting_average';
43) How do you write a 'foreach' statement for the tuple data type in Pig scripts?
For a tuple, use the dot ('.') followed by the field name or position:
A = load 'input' as (t:tuple(x:int, y:int));
B = foreach A generate t.x, t.$1;
44) How do you write a 'foreach' statement for the bag data type in Pig scripts?
When you project fields in a bag, you are creating a new bag with only those fields:
A = load 'input' as (b:bag{t:(x:int, y:int)});
B = foreach A generate b.x;
We can also project multiple fields in a bag:
A = load 'input' as (b:bag{t:(x:int, y:int)});
B = foreach A generate b.(x, y);
45) Why should we use 'filter' in Pig scripts?
filter selects tuples from a relation based on some condition.
A filter is similar to the where clause in SQL. It contains a predicate: if the predicate evaluates to true for a given record, that record is passed down the pipeline; otherwise it is not. Predicates can use comparison operators such as ==, !=, >=, and <=. Of these, == and != can also be applied to maps and tuples.
A = load 'inputs' as (name:chararray, address:chararray);
B = filter A by name matches 'CM.*';
46) What does the GROUP operator do in Pig?
group groups the data in one or more relations.
The group statement collects together records with the same key. In SQL, the group by clause creates a group that must feed directly into one or more aggregate functions. In Pig Latin there is no direct connection between group and aggregate functions.
input2 = load 'daily' as (exchanges, stocks);
grpds = group input2 by stocks;
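Because group and aggregation are separate steps, an aggregate is applied afterwards in a foreach. The sketch below assumes an extra numeric field, volume, that is not part of the original example.

```pig
-- Hypothetical input with a numeric volume field.
daily  = load 'daily' as (exchanges:chararray, stocks:chararray, volume:long);
-- Each output record has two fields: group (the key) and a bag named daily.
grpd   = group daily by stocks;
-- The aggregation is a separate step applied to that bag.
totals = foreach grpd generate group as stocks, SUM(daily.volume) as total, COUNT(daily) as days;
dump totals;
```

The bag produced by group is an inner bag: it holds all tuples of daily that share the same key.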
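47) What does COGROUP do in Pig?
cogroup groups rows by a column, as group does, but it works on multiple relations at once: the rows of each relation are grouped on the given column, and the groups that share a key are brought together in one output record.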
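A minimal sketch of cogroup on two relations follows. The input names 'daily' and 'week' and their fields are assumptions that mirror the other examples in this list.

```pig
-- Hypothetical inputs that share the key field stocks.
daily  = load 'daily' as (exchanges:chararray, stocks:chararray);
weekly = load 'week' as (exchanges:chararray, stocks:chararray);
-- One output record per key: (group, {matching daily tuples}, {matching weekly tuples}).
both   = cogroup daily by stocks, weekly by stocks;
-- Count how many records each input contributed for every key.
counts = foreach both generate group, COUNT(daily), COUNT(weekly);
dump counts;
```

A key that appears in only one input still produces an output record; the bag for the other input is simply empty.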
48) Why should we use the 'order by' keyword in Pig scripts?
The order statement sorts your data for you, producing a total order of your output data. The syntax of order is similar to group: you indicate a key or set of keys by which you wish to order your data.
input2 = load 'daily' as (exchanges, stocks);
grpds = order input2 by exchanges;
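order also accepts several keys and a sort direction per key. The sketch below assumes an extra numeric field, volume, that is not part of the original example.

```pig
-- Hypothetical input with a numeric volume field.
daily  = load 'daily' as (exchanges:chararray, stocks:chararray, volume:long);
-- Sort by exchanges ascending, then by volume descending within each exchange.
sorted = order daily by exchanges, volume desc;
-- A limit directly after order returns the first rows of the sorted data.
top10  = limit sorted 10;
dump top10;
```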
49) Why should we use the 'distinct' keyword in Pig scripts?
distinct removes duplicate tuples from a relation. It works only on entire records, not on individual fields.
input2 = load 'daily' as (exchanges, stocks);
grpds = distinct input2;
50) Is it possible to join on multiple fields in Pig scripts?
Yes. join selects records from one input and joins them with records from another input. This is done by indicating keys for each input; when those keys are equal, the two rows are joined.
input2 = load 'daily' as (exchanges, stocks);
input3 = load 'week' as (exchanges, stocks);
grpds = join input2 by stocks, input3 by stocks;
We can also join on multiple keys.
Example:
input2 = load 'daily' as (exchanges, stocks);
input3 = load 'week' as (exchanges, stocks);
grpds = join input2 by (exchanges, stocks), input3 by (exchanges, stocks);
51) Is it possible to display a limited number of results?
Yes. Sometimes you want to see only a limited number of results. limit allows you to do this:
input2 = load 'daily' as (exchanges, stocks);
first10 = limit input2 10;
52) What is a relation in Pig?
A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don’t require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
53) What does FLATTEN do in Pig?
Syntactically, flatten looks like a UDF, but it is more powerful than a UDF. Its purpose is to change the structure of tuples and bags, which UDFs cannot do. flatten un-nests tuples and bags; it is the opposite of TOBAG and TOTUPLE.
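As a minimal sketch, the script below un-nests a bag with flatten and then nests fields again with TOTUPLE. The input name 'baseball' and its schema are assumptions made for illustration.

```pig
-- Hypothetical input: each record holds a name and a bag of one-field tuples.
players = load 'baseball' as (name:chararray, positions:bag{t:(p:chararray)});
-- flatten un-nests the bag: one output record per (name, position) pair.
pairs   = foreach players generate name, flatten(positions) as position;
-- TOTUPLE goes the other way and nests two fields into a single tuple.
nested  = foreach pairs generate TOTUPLE(name, position);
dump nested;
```

A record whose bag is empty produces no output rows when the bag is flattened.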