MapReduce Interview Questions
1.What is MapReduce?
It is a framework or a programming model that is used for processing large data sets over clusters of computers using distributed programming.
2.What are ‘maps’ and ‘reduces’?
‘Maps’ and ‘Reduces’ are the two phases of processing a job in MapReduce. The ‘Map’ phase is responsible for reading data from the input location and, based on the input type, generating key-value pairs, that is, an intermediate output on the local machine. The ‘Reducer’ is responsible for processing the intermediate output received from the mapper and generating the final output.
3.What are the four basic parameters of a mapper?
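The four basic parameters of a mapper are the types of its input key, input value, output key and output value. In a typical word-count job they are LongWritable, Text, Text and IntWritable. The first two represent input parameters and the second two represent intermediate output parameters.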
4.What are the four basic parameters of a reducer?
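The four basic parameters of a reducer are the types of its input key, input value, output key and output value. In a typical word-count job they are Text, IntWritable, Text and IntWritable. The first two represent intermediate output parameters and the second two represent final output parameters.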
5.What do the master class and the output class do?
The master class is defined to update the master, that is, the JobTracker, and the output class is defined to write data to the output location.
6.What is the input type/format in MapReduce by default?
By default, the input type in MapReduce is ‘text’ (TextInputFormat).
7.Is it mandatory to set input and output type/format in MapReduce?
No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input and the output type as ‘text’.
8.What does a MapReduce partitioner do?
A MapReduce partitioner makes sure that all the values of a single key go to the same reducer, which also helps to distribute the map output evenly over the reducers. It redirects the mapper output to the reducers by determining which reducer is responsible for a particular key.
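The routing described above can be sketched in a few lines of Python. This is a minimal illustration, not Hadoop's own code: the function names `partition` and `route` are hypothetical, and `zlib.crc32` is assumed here as a stand-in for the key's hash code.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index for a key, in the spirit of a hash partitioner."""
    # Assumption: crc32 stands in for the key's hash code. It is stable
    # across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Place each (key, value) pair from the mappers into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]
buckets = route(pairs, 3)
```

Because the reducer index depends only on the key, every pair with key "a" lands in the same bucket no matter which mapper emitted it.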
9.How is Hadoop different from other data processing tools?
In Hadoop, based upon your requirements, you can increase or decrease the number of mappers without worrying about the volume of data to be processed. This is the beauty of parallel processing in contrast to the other data processing tools available.
10.Can we rename the output file?
Yes, we can rename the output file by implementing a multiple-output format class such as MultipleOutputFormat.
11.Why can we not do aggregation (addition) in a mapper? Why do we require a reducer for that?
We cannot do a complete aggregation (addition) in a mapper because the values of a key are brought together only after the shuffle and sort, on the reducer side. A mapper is initialized for a single input split, and its map method is called once for each row, so by default it does not keep track of the previous row's value. It also never sees rows with the same key that belong to other input splits. Only the reducer receives all the values of a key together.
12.What is Streaming?
Streaming is a feature of the Hadoop framework that allows us to write MapReduce programs in any programming language that can accept standard input and produce standard output. It could be Perl, Python or Ruby, and need not be Java. However, customization in MapReduce can only be done using Java and not any other programming language.
13.What is a Combiner?
A ‘Combiner’ is a mini reducer that performs the local reduce task. It receives the input from the mapper on a particular node and sends the output to the reducer. Combiners help in enhancing the efficiency of MapReduce by reducing the quantum of data that is required to be sent to the reducers.
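As a rough illustration of that saving, the sketch below sums word counts locally before they would be sent on. The function name `combine` and the sample data are hypothetical.

```python
from collections import defaultdict


def combine(mapper_output):
    """Locally sum the counts emitted by one mapper, key by key."""
    totals = defaultdict(int)
    for key, count in mapper_output:
        totals[key] += count
    return sorted(totals.items())


mapper_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
combined = combine(mapper_output)
# Four records from the mapper shrink to two before leaving the node.
```

This works for addition because partial sums can be added again by the reducer without changing the result. An operation without that property, such as taking an average directly, cannot be reused as a combiner in the same way.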
14.What is the difference between an HDFS Block and Input Split?
HDFS Block is the physical division of the data and Input Split is the logical division of the data.
15.What happens in TextInputFormat?
In TextInputFormat, each line in the text file is a record. The key is the byte offset of the line and the value is the content of the line. For instance, Key: LongWritable, value: Text.
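To make the byte-offset keys concrete, here is a small Python sketch that produces the same kind of (offset, line) records from an in-memory byte string. The function name `text_records` is hypothetical, and UTF-8 text is assumed.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line) pairs, one per line of the input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line
        # without its trailing newline characters.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


records = list(text_records(b"hello world\nfoo\nbar baz\n"))
```

The first line starts at offset 0, the second at 12 (eleven characters plus the newline), and the third at 16.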
16.What do you know about KeyValueTextInputFormat?
In KeyValueTextInputFormat, each line in the text file is a ‘record’. The first separator character (a tab by default) divides each line. Everything before the separator is the key and everything after the separator is the value. For instance, Key: Text, value: Text.
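The split on the first separator can be sketched as follows. The function name `key_value_record` is hypothetical, and the handling of a line without any separator (whole line as key, empty value) is an assumption of this sketch.

```python
def key_value_record(line: str, separator: str = "\t"):
    """Split a line into (key, value) at the first separator only."""
    # str.partition splits at the first occurrence; later separators
    # stay inside the value. With no separator, the value is empty.
    key, _, value = line.partition(separator)
    return key, value


first = key_value_record("user42\tclicked\thome")
second = key_value_record("a,b,c", separator=",")
```

Only the first separator matters, so a value may itself contain the separator character.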
17.What do you know about SequenceFileInputFormat?
SequenceFileInputFormat is an input format for reading sequence files. The key and value types are user defined. A sequence file is a binary file format with support for compression, optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
18.What do you know about NLineInputFormat?
NLineInputFormat treats every ‘n’ lines of input as one split, so each mapper receives exactly ‘n’ lines (the last split may have fewer).
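Grouping ‘n’ lines per split amounts to simple chunking, as this minimal Python sketch shows. The function name `n_line_splits` is hypothetical and the input is assumed to be an in-memory list of lines.

```python
def n_line_splits(lines, n):
    """Group lines into consecutive splits of at most n lines each."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]


splits = n_line_splits(["a", "b", "c", "d", "e"], 2)
```

With five lines and n = 2 there are three splits, and the last one holds the single leftover line.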
19.How does Hadoop MapReduce work?
Taking word count as an example, the map phase counts the words in each document, while the reduce phase aggregates the per-document counts across the entire collection. During the map phase the input data is divided into splits that are analysed by map tasks running in parallel across the Hadoop framework.
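The word-count flow can be simulated in plain Python on a single machine. This is only a sketch of the programming model, not of Hadoop itself: all function names are hypothetical, and the "documents" are short in-memory strings.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all values by key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: add up all the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(intermediate))


counts = word_count(["the cat", "the dog the end"])
```

In a real cluster each call to `map_phase` would run as a separate map task on its own split, and the grouping step would move data between machines.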
20.Explain what shuffling is in MapReduce.
The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is known as the shuffle.
21.Explain what Distributed Cache is in the MapReduce framework.
Distributed Cache is an important feature provided by the MapReduce framework. When you want to share some files across all nodes in a Hadoop cluster, DistributedCache is used. The files could be executable jar files or simple properties files.
22.Where do you specify the Mapper Implementation?
Generally, the mapper implementation is specified in the Job itself, for example through Job.setMapperClass().
23.How is a Mapper instantiated in a running job?
The Mapper itself is instantiated in the running job, and will be passed a MapContext object which it can use to configure itself.
24.Which are the methods in the Mapper interface?
The Mapper contains the run() method, which calls its own setup() method only once, then calls the map() method for each input record, and finally calls its cleanup() method. You can override all of these methods in your code.
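The calling order can be sketched with a simplified Python analog of that class. This is not the Hadoop Java API: the `context` here is assumed to be a plain list that collects output, and `UpperCaseMapper` is a hypothetical subclass for illustration.

```python
class Mapper:
    """Simplified analog of a mapper with setup/map/cleanup hooks."""

    def setup(self, context):
        pass  # called once, before any record

    def map(self, key, value, context):
        context.append((key, value))  # default: pass the record through

    def cleanup(self, context):
        pass  # called once, after the last record

    def run(self, records, context):
        self.setup(context)
        for key, value in records:
            self.map(key, value, context)
        self.cleanup(context)


class UpperCaseMapper(Mapper):
    def setup(self, context):
        self.seen = 0

    def map(self, key, value, context):
        self.seen += 1
        context.append((key, value.upper()))

    def cleanup(self, context):
        context.append(("records_seen", self.seen))


output = []
UpperCaseMapper().run([(0, "a"), (2, "b")], output)
```

State set in setup() survives across map() calls within one task, which is why cleanup() can report how many records that task saw.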
25.What are the primary phases of the Reducer?
Shuffle, Sort and Reduce
26.Is it possible for a Job to have 0 reducers?
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
27.What happens if the number of reducers is 0?
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
28.How many instances of JobTracker can run on a Hadoop Cluster?
29.What is partitioning?
Partitioning is the process of identifying the reducer instance that will receive the mapper's output. Before the mapper emits a (key, value) pair to a reducer, it identifies the reducer that is the recipient of that output. All values for a given key, no matter which mapper generated them, must go to the same reducer.
30.How to set which framework would be used to run mapreduce program?
mapreduce.framework.name. it can be
31.What are the key differences between Pig and MapReduce?
Pig is a data flow language. Its key focus is to manage the flow of data from the input source to the output store. As part of managing this data flow, it moves data by feeding it to process1, taking the output and feeding it to process2. Its core features are preventing the execution of subsequent stages if a previous stage fails, managing temporary storage of data and, most importantly, compressing and rearranging processing steps for faster processing. While this can be done for any kind of processing task, Pig is written specifically for managing the data flow of MapReduce-type jobs. Most, if not all, jobs in Pig are MapReduce jobs or data movement jobs. Pig allows custom functions to be added and used for processing, and it ships with default ones such as ordering, grouping, distinct and count.
MapReduce, on the other hand, is a data processing paradigm. It is a framework in which application developers write code so that it scales easily to petabyte-sized workloads, which creates a separation between the developer who writes the application and the developer who scales it. Not all applications can be migrated to MapReduce, but a good few can, from complex ones like k-means to simple ones like counting unique values in a dataset.
32. Can a MapReduce program be written in any language other than Java?
Yes, MapReduce programs can be written in many programming languages: Java, R, C++ and scripting languages (Python, PHP). Any language able to read from stdin, write to stdout and parse tab and newline characters should work. Hadoop Streaming (a Hadoop utility) allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
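A streaming-style word count might look like the Python sketch below. The function names `map_lines` and `reduce_lines` are hypothetical. In an actual streaming job each would live in its own script that reads lines from `sys.stdin` and prints to stdout; here they take and return lines in memory so the logic can be tried without a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word read."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each run of identical keys.

    Assumes the input is already sorted by key, as it is when the
    framework hands map output to a reducer.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


mapped = sorted(map_lines(["b a\n", "a c a\n"]))
reduced = list(reduce_lines(mapped))
```

The `sorted` call stands in for the sort that happens between the two phases. The reducer relies on that ordering to see all lines for one key next to each other.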
33.What is OutputCommitter?
OutputCommitter describes the commit of a MapReduce task. FileOutputCommitter is the default class available for OutputCommitter in MapReduce. It performs the following operations:
• Creates a temporary output directory for the job during initialization
• Cleans up the job, that is, removes the temporary output directory after job completion
• Sets up the task's temporary output
• Identifies whether a task needs a commit. The commit is applied if required.
JobSetup, JobCleanup and TaskCleanup are important tasks during output commit.
34. What does a Mapper do?
The Mapper is the first step of the Map phase and processes the map task. It reads key/value pairs and emits key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
35.What if the JobTracker machine is down?
In Hadoop 1.0, the JobTracker is a single point of failure, which means that if the JobTracker fails, all jobs must restart and the overall execution flow is interrupted. Due to this limitation, in Hadoop 2.0 the JobTracker concept is replaced by YARN. In YARN, the terms JobTracker and TaskTracker have disappeared entirely. YARN splits the two major functionalities of the JobTracker, that is, resource management and job scheduling/monitoring, into 2 separate daemons (components).
Node Manager(node specific)
36.What happens when a datanode fails?
When a datanode fails:
• The JobTracker and the NameNode detect the failure
• All tasks on the failed node are re-scheduled
• The NameNode replicates the user's data to another node