Apache Spark Interview Questions
1. What is Apache Spark?
Apache Spark is “an open source cluster computing framework originally developed in the AMPLab at University of California, Berkeley but was later donated to the Apache Software Foundation where it remains today. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s multi-stage in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well-suited to machine learning algorithms.”
Spark is essentially a fast and flexible data processing framework. It has an advanced DAG (directed acyclic graph) execution engine with in-memory computing. Apache Spark can run on Hadoop, as a standalone system or in the cloud. Spark can access diverse data sources, including HDFS, HBase and Cassandra, among others.
2. Explain the key features of Spark?
Spark allows Integration with Hadoop and files included in HDFS.
It has an independent language (Scala) interpreter and hence comes with an interactive language shell.
It is built around RDDs (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
It supports multiple analytic tools that are used for interactive query analysis, real-time analysis and graph processing. Additionally, some of the salient features of Spark include:
Lightning-fast processing: When it comes to Big Data processing, speed always matters. Spark can run workloads on Hadoop clusters much faster than MapReduce by reducing the number of read/write operations to disk: it keeps intermediate processing data in memory.
Support for sophisticated analytics: In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms. This allows users to combine all these capabilities in a single workflow.
Real-time stream processing: Spark can handle real-time streaming. MapReduce primarily processes previously stored data, and real-time streaming on Hadoop requires other frameworks. Spark handles streams through its Spark Streaming library.
3. What is “RDD”?
RDD stands for Resilient Distributed Datasets: a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data in an RDD is immutable and distributed across the cluster.
4. How does one create RDDs in Spark?
In Spark, parallelized collections are created by calling the SparkContext “parallelize” method on an existing collection in your driver program.
val data = Array(4,6,7,8)
val distData = sc.parallelize(data)
Text file RDDs can be created using SparkContext’s “textFile” method. Spark has the ability to create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, among others. Spark supports text files, “SequenceFiles”, and any other Hadoop “InputFormat” components.
val inputfile = sc.textFile("input.txt")
5. What does the Spark Engine do?
Spark Engine is responsible for scheduling, distributing and monitoring the data application across the cluster.
6. Define “Partitions”?
A “Partition” is a smaller, logical division of data, similar to a “split” in MapReduce. Partitioning is the process of deriving logical units of data in order to speed up data processing.
Here’s an example: val someRDD = sc.parallelize(1 to 100, 4)
Here an RDD of 100 elements is created in four partitions. Each partition can then be processed by a separate task, for example a map task, before the results are collected back to the driver program.
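The partitioning described above can be inspected directly. The following is a minimal sketch, assuming Spark 2.x or later on the classpath and a local master; the application name and the variable names are illustrative and not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("partitions-sketch"))

// 100 elements spread over 4 partitions, as in the example above.
val someRDD = sc.parallelize(1 to 100, 4)

// Number of partitions backing the RDD.
val numPartitions = someRDD.getNumPartitions

// glom() turns each partition into an array, so we can count elements per partition.
val sizes = someRDD.glom().map(_.length).collect()
println(s"partitions = $numPartitions, sizes = ${sizes.mkString(", ")}")
```

The second argument to parallelize is only a request for that many partitions; for file-based RDDs the count usually follows the input splits instead.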
7. What operations does the “RDD” support?
Transformations
Actions
8. Define “Transformations” in Spark?
“Transformations” are functions applied to an RDD that result in a new RDD. They do not execute until an action occurs. map() and filter() are examples of “transformations”: map() applies the function passed to it to each element of the RDD and returns another RDD, while filter() creates a new RDD by selecting the elements of the current RDD that satisfy a predicate.
9. Define “Action” in Spark?
An “action” brings data from the RDD back to the local machine. Executing an “action” triggers all the transformations declared before it. reduce() is an action that applies the function passed to it repeatedly until only one value is left. The take(n) action returns the first n values of the RDD to the local node, while collect() returns all of them.
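To make the split between transformations and actions concrete, here is a small sketch. It assumes Spark 2.x or later and a local master; the data and names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("transformations-actions-sketch"))

val nums = sc.parallelize(1 to 10)

// Transformations: each returns a new RDD and nothing is computed yet.
val evensSquared = nums.filter(_ % 2 == 0).map(n => n * n)

// Actions: each one triggers the computation and returns a value to the driver.
val total = evensSquared.reduce(_ + _)  // 4 + 16 + 36 + 64 + 100 = 220
val firstTwo = evensSquared.take(2)     // only the first two elements
val everything = evensSquared.collect() // all elements
```

Note that take(2) returns only two elements, whereas collect() brings the whole RDD to the driver and should be reserved for small results.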
10. What are the functions of “Spark Core”?
“Spark Core” performs an array of critical functions like memory management, monitoring jobs, fault tolerance, job scheduling and interaction with storage systems.
It is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic input and output functionalities. RDDs in Spark Core make it fault tolerant. An RDD is a collection of items distributed across many nodes that can be manipulated in parallel. Spark Core provides many APIs for building and manipulating these collections.
11. What is an “RDD Lineage”?
By default, Spark does not replicate data in memory. If any data is lost, it is rebuilt using the “RDD Lineage”: the record of transformations that produced an RDD, which Spark replays to reconstruct lost data partitions.
12. What is a “Spark Driver”?
The “Spark Driver” is the program that runs on the master node and declares transformations and actions on RDDs. The driver also delivers RDD graphs to the “Master”, where the standalone cluster manager runs.
13. What is SparkContext?
“SparkContext” is the main entry point for Spark functionality. A “SparkContext” represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
14. What is Hive on Spark?
Hive is an Apache data warehouse project that ships with, among other distributions, the Hortonworks Data Platform (HDP). Hive provides an SQL-like interface to data stored in the HDP. With Hive on Spark, Hive queries run on Spark as the execution engine, so Spark users automatically get the complete set of Hive’s rich features, including any new features that Hive might introduce in the future.
The main task in implementing the Spark execution engine for Hive lies in query planning, where Hive operator plans from the semantic analyzer are translated to a task plan that Spark can execute. It also includes query execution, where the generated Spark plan is actually executed in the Spark cluster.
15. Name a few commonly used components of the Spark ecosystem?
Spark SQL (successor to Shark)
Spark Streaming
GraphX
MLlib
SparkR
16. What is “Spark Streaming”?
Spark Streaming is an extension to the Spark API that allows stream processing of live data streams. Data from sources like Flume and HDFS is streamed in, processed, and pushed out to file systems, live dashboards and databases. It is similar to batch processing in that the input stream is divided into small batches.
Business use cases for Spark Streaming: each Spark component has its own use case. Use Spark Streaming when you want to analyze data in near real time, for example with a latency of less than 15 minutes and greater than 2 minutes.
17. What is “GraphX” in Spark?
“GraphX” is a component in Spark which is used for graph processing. It helps to build and transform interactive graphs.
18. What is the function of “MLlib”?
“MLlib” is Spark’s machine learning library. It aims at making machine learning easy and scalable, with common learning algorithms and use cases including clustering, regression, collaborative filtering and dimensionality reduction, among others.
19. What is “Spark SQL”?
Spark SQL is a Spark interface for working with structured as well as semi-structured data. It can load data from multiple structured sources like text files, JSON files and Parquet files, among others. Spark SQL provides a special type of RDD called SchemaRDD (renamed DataFrame in Spark 1.3). It is made up of row objects, where each object represents a record.
Here’s how you can create an SQL context in Spark SQL:
SQL context: scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
HiveContext: scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
20. What is a “Parquet” in Spark?
“Parquet” is a columnar format file supported by many data processing systems. Spark SQL performs both read and write operations with the “Parquet” file.
21. What is an “Accumulator”?
“Accumulators” are shared variables that tasks can only add to, which makes them useful for debugging and monitoring a job. Similar to “Hadoop Counters”, “Accumulators” can count the number of “events” in a program.
Accumulators are variables that are added to through an associative operation. Spark natively supports accumulators of numeric value types and standard mutable collections. “aggregateByKey()” and “combineByKey()” rely on a similar idea of accumulating values per key, although they do not use accumulator variables.
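As an illustration of counting “events” with an accumulator, here is a minimal sketch. It assumes Spark 2.x or later, where SparkContext.longAccumulator is available (Spark 1.x used a different accumulator API); the input lines and names are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("accumulator-sketch"))

// A named accumulator; tasks can only add to it, the driver reads the total.
val blankLines = sc.longAccumulator("blankLines")

val lines = sc.parallelize(Seq("a", "", "b", "", ""))

// foreach is an action, so the updates are applied when it runs.
lines.foreach(line => if (line.isEmpty) blankLines.add(1))

// Read the accumulated total on the driver.
val blankCount = blankLines.value.longValue
println(s"blank lines: $blankCount")
```

Updating accumulators inside actions, as here, is the safer pattern: updates made inside transformations can be applied more than once if a task is re-executed.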
22. Which file systems does Spark support?
Hadoop Distributed File System (HDFS)
Local File system
S3
23. What is “YARN”?
“YARN” (Yet Another Resource Negotiator) is Hadoop’s large-scale, distributed resource management layer for big data applications. Spark can run on YARN, which provides a central resource management platform to deliver scalable operations across the cluster.
24. List the benefits of Spark over MapReduce?
Due to the availability of in-memory processing, Spark can run processing around 10-100x faster than Hadoop MapReduce.
MapReduce is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
Spark supports iterative computation efficiently by keeping data in memory between iterations, while MapReduce must write to and re-read from disk between jobs.
Additionally, Spark stores data in-memory whereas Hadoop stores data on the disk. Hadoop uses replication to achieve fault tolerance while Spark uses a different data storage model, resilient distributed datasets (RDD). It also uses a clever way of guaranteeing fault tolerance that minimizes network input and output.
25. What is a “Spark Executor”?
When “SparkContext” connects to a cluster manager, it acquires “Executors” on the cluster nodes. “Executors” are Spark processes that run computations and store data on the worker nodes. “SparkContext” then sends the application’s tasks to the executors to run.
26. List the various types of “Cluster Managers” in Spark?
The Spark framework supports three kinds of Cluster Managers:
Standalone
Apache Mesos
YARN
27. What is a “worker node”?
“Worker node” refers to any node that can run the application code in a cluster.
28. Define “PageRank”?
“PageRank” is a measure of the importance of each vertex in a graph, based on the edges that point to it. GraphX ships with a PageRank implementation.
29. Can we do real-time processing using Spark SQL?
Not directly, but we can register an existing RDD as a SQL table and run SQL queries on top of it.
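Registering an RDD as a table and querying it might look like the sketch below. It assumes Spark 2.x or later with the spark-sql module available and uses SparkSession rather than the older SQLContext; the table name, column names and rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the shell's session if one exists; otherwise start a local one (assumed setup).
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("sql-on-rdd-sketch")
  .getOrCreate()
import spark.implicits._

// An ordinary RDD of tuples (made-up data).
val peopleRDD = spark.sparkContext.parallelize(Seq(("Ann", 34), ("Bob", 19), ("Cy", 52)))

// Give the RDD column names and register it as a temporary SQL table.
peopleRDD.toDF("name", "age").createOrReplaceTempView("people")

// Run SQL on top of the registered table.
val adults = spark.sql("SELECT name FROM people WHERE age >= 21 ORDER BY name")
  .collect()
  .map(_.getString(0))
```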
30. What is the biggest shortcoming of Spark?
Spark utilizes more storage space compared to Hadoop and MapReduce.
Also, Spark Streaming is not true streaming: it processes data in micro-batches, so some window functions cannot work properly on top of it.
31. What are the advantages of using Apache Spark over Hadoop MapReduce for big data processing?
Simplicity, Flexibility and Performance are the major advantages of using Spark over Hadoop.
Spark can be up to 100 times faster than Hadoop for big data processing, as it stores data in memory in Resilient Distributed Datasets (RDDs).
Spark is easier to program as it comes with an interactive mode.
It provides complete recovery using lineage graph whenever something goes wrong.
32) What is Shark?
Most data users know only SQL and are not good at programming. Shark is a tool developed for people from a database background to access Scala MLlib capabilities through a Hive-like SQL interface. Shark helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data.
33) List some use cases where Spark outperforms Hadoop in processing?
Sensor Data Processing – Apache Spark’s ‘In-memory computing’ works best here, as data is retrieved and combined from different sources.
Real-Time Querying – Spark is preferred over Hadoop for real-time querying of data.
Stream Processing – For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best solution.
34) What are the languages supported by Apache Spark for developing big data applications?
Scala, Java, Python and R are supported directly. Clojure can be used through third-party bindings.
35) Can you use Spark to access and analyse data stored in Cassandra databases?
Yes, it is possible if you use Spark Cassandra Connector.
36) Is it possible to run Apache Spark on Apache Mesos?
Yes, Apache Spark can be run on the hardware clusters managed by Mesos.
37) How can Spark be connected to Apache Mesos?
To connect Spark with Mesos:
Configure the Spark driver program to connect to Mesos. The Spark binary package should be in a location accessible by Mesos. (or)
Install Apache Spark in the same location as that of Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.
38) How can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps write spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and large RDDs.
Using Accumulators – Accumulators help update the values of variables in parallel while executing.
The most common way is to avoid ByKey operations, repartition, or any other operations that trigger shuffles.
39) Why is there a need for broadcast variables when working with Apache Spark?
Broadcast variables are read-only variables cached in memory on every machine. Using broadcast variables eliminates the need to ship a copy of a variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table in memory, which enhances retrieval efficiency compared to an RDD lookup().
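The lookup-table use case can be sketched as follows, assuming Spark 2.x or later and a local master. The country codes and the fallback value are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("broadcast-sketch"))

// A small lookup table, shipped once per executor instead of once per task.
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

val codes = sc.parallelize(Seq("IN", "US", "IN", "FR"))

// Tasks read the broadcast value locally; no shuffle or join is needed.
val names = codes.map(code => countryNames.value.getOrElse(code, "unknown")).collect()
```

This is effectively a map-side join: it works well when one side is small enough to fit in memory on every executor.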
40) Is it possible to run Spark and Mesos along with Hadoop?
Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
41) What is lineage graph?
RDDs in Spark depend on one or more other RDDs. The representation of the dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of a persisted RDD is lost, the lost data can be recovered using the lineage graph information.
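The lineage of an RDD can be printed from the driver. A minimal sketch, assuming Spark 2.x or later and a local master, with illustrative names:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lineage-sketch"))

val base = sc.parallelize(1 to 10)
val derived = base.map(_ * 2).filter(_ > 5)

// toDebugString describes the chain of RDDs that `derived` was built from.
val lineage = derived.toDebugString
println(lineage)

// dependencies lists the direct parents in the lineage graph.
val parentCount = derived.dependencies.size
```

The exact text printed by toDebugString varies between Spark versions, so it is best treated as a diagnostic aid rather than parsed.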
42) How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.
43) Explain about the major libraries that constitute the Spark Ecosystem?
Spark MLlib – Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
Spark Streaming – This library is used to process real-time streaming data.
Spark GraphX – Spark API for graph-parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.
Spark SQL – Helps execute SQL-like queries on Spark data using standard visualization or BI tools.
44) What are the benefits of using Spark with Apache Mesos?
It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
45) What is the significance of Sliding Window operation?
The term comes from networking, where a sliding window controls the transmission of data packets between computers. In Spark, the Spark Streaming library provides windowed computations, where transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within that window are combined and operated upon to produce the RDDs of the windowed DStream.
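The window mechanics can be illustrated without a running stream. The sketch below uses plain Scala collections, not the DStream API, to show how a window length and a slide interval select batches; the batch contents and the two parameters are made-up values. In Spark Streaming itself, the corresponding operations are methods such as window(windowLength, slideInterval) on a DStream.

```scala
// Each inner Seq stands for one batch (one RDD) of a DStream; values are invented.
val batches = Seq(Seq(1, 2), Seq(3), Seq(4, 5), Seq(6), Seq(7, 8))

val windowLength = 3  // number of batches covered by each window
val slideInterval = 2 // number of batches between consecutive window starts

// Combine the batches that fall inside each window and reduce them to one value.
val windowSums = batches
  .sliding(windowLength, slideInterval)
  .map(window => window.flatten.sum)
  .toList

println(windowSums) // List(15, 30)
```

Here the first window covers batches 1 to 3 and the second covers batches 3 to 5, so the third batch is counted in both windows.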
46) What is a DStream?
A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represent a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two kinds of operations:
Transformations that produce a new DStream.
Output operations that write data to an external system.
47) When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?
Spark need not be installed on every node when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without effecting any change to the cluster.
48) What is Catalyst framework?
Catalyst framework is a new optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.
49) Name a few companies that use Apache Spark in production?
Pinterest, Conviva, Shopify, Open Table
50) Which spark library allows reliable file sharing at memory speed across different cluster frameworks?
Tachyon (since renamed Alluxio)
51) Why is BlinkDB used?
BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time.
52) How can you compare Hadoop and Spark in terms of ease of use?
Hadoop MapReduce requires programming in Java, which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax takes time. Spark has interactive APIs for different languages like Java, Python or Scala and also includes Spark SQL (formerly Shark) for SQL lovers, making it comparatively easier to use than Hadoop.
53) What are the common mistakes developers make when running Spark applications?
Developers often make the mistake of:
Hitting the web service several times by using multiple clusters.
Running everything on the local node instead of distributing it.
Developers need to be careful with this, as Spark makes use of memory for processing.
54) What is the advantage of a Parquet file?
A Parquet file is a columnar format file that helps to:
Limit I/O operations
Consume less space
Fetch only the required columns
55) What are the various data sources available in SparkSQL?
Parquet file
JSON Datasets
Hive tables
56) How does Spark use Hadoop?
Spark has its own cluster management and computation engine, and mainly uses Hadoop for storage.
57) What do you understand by Pair RDD?
RDDs whose elements are key/value pairs are referred to as Pair RDDs, and Spark provides special operations for them. Pair RDDs allow users to act on each key in parallel. They have a reduceByKey() method that aggregates data for each key and a join() method that combines two RDDs based on elements having the same key.
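The two Pair RDD methods mentioned above can be tried on a toy dataset. This is a sketch assuming Spark 2.x or later and a local master; the keys, quantities and prices are invented.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("pair-rdd-sketch"))

// Two Pair RDDs keyed by product name (made-up data).
val sales = sc.parallelize(Seq(("apple", 3), ("pear", 1), ("apple", 2)))
val prices = sc.parallelize(Seq(("apple", 0.5), ("pear", 0.8)))

// reduceByKey merges all values that share a key.
val totals = sales.reduceByKey(_ + _)

// join pairs up the values of both RDDs for each common key.
val joined = totals.join(prices).collect().toMap
```

The result maps each key to a (quantity, price) tuple, for example "apple" to (5, 0.5).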
58) Which one will you choose for a project –Hadoop MapReduce or Apache Spark?
The answer to this question depends on the given project scenario. Spark makes use of memory instead of network and disk I/O, which makes it fast. However, Spark uses a large amount of RAM and requires dedicated machines to produce effective results. So the decision to use Hadoop or Spark depends on the requirements of the project and the budget of the organization.
59) Explain about the different types of transformations on DStreams?
Stateless Transformations – Processing of the batch does not depend on the output of the previous batch. Examples – map(), reduceByKey(), filter().
Stateful Transformations – Processing of the batch depends on the intermediary results of the previous batch. Examples – transformations that depend on sliding windows.
60) Explain about the popular use cases of Apache Spark?
Apache Spark is mainly used for
Iterative machine learning.
Interactive data analytics and processing.
Sensor data processing
61) Is Apache Spark a good fit for Reinforcement learning?
No. Apache Spark works well only for simple machine learning algorithms like clustering, regression, classification.
62) What is Spark Core?
It has all the basic functionalities of Spark, like – memory management, fault recovery, interacting with storage systems, scheduling tasks, etc.
63) How can you remove the elements with a key present in any other RDD?
Use the subtractByKey() function.
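A short sketch of subtractByKey(), assuming Spark 2.x or later and a local master; the two RDDs and their contents are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("subtract-by-key-sketch"))

val allUsers = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val blocked = sc.parallelize(Seq((2, "x")))

// Keep only the pairs whose key does not appear in `blocked`;
// the values in `blocked` are ignored.
val remaining = allUsers.subtractByKey(blocked).collect().sortBy(_._1)
```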
64) What is the difference between persist() and cache()?
persist() allows the user to specify the storage level, whereas cache() uses the default storage level.
65) What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on the RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk, in memory, or as a combination of both, with different replication levels.
The various storage/persistence levels in Spark are –
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP
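The difference between cache() and persist(), and the use of an explicit storage level, can be sketched as below. It assumes Spark 2.x or later and a local master; the RDD contents and names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuse the shell's context if one exists; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("persistence-sketch"))

// cache() uses the default storage level for RDDs.
val cached = sc.parallelize(1 to 1000).cache()

// persist() lets the caller pick a level from the list above.
val persisted = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK_SER)

val cachedLevel = cached.getStorageLevel
val persistedLevel = persisted.getStorageLevel

// Release the storage once the RDD is no longer needed.
persisted.unpersist()
```

For RDDs, the default level applied by cache() is MEMORY_ONLY. The data is only materialized in storage when the first action runs on the RDD.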
66) How does Spark handle monitoring and logging in Standalone mode?
Spark has a web based user interface for monitoring the cluster in standalone mode that shows the cluster and job statistics. The log output for each job is written to the work directory of the slave nodes.
69) Does Apache Spark provide check pointing?
Lineage graphs are always useful to recover RDDs from a failure, but this is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing: set a directory with SparkContext.setCheckpointDir() and call checkpoint() on the RDD. However, the decision on which data to checkpoint is left to the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
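Checkpointing an RDD with a long lineage chain could look like the following sketch. It assumes Spark 2.x or later and a local master, and it writes to a temporary local directory chosen purely for illustration; on a cluster the checkpoint directory would normally be on a shared file system such as HDFS.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("checkpoint-sketch"))

// Directory where checkpointed RDD data is written (temporary, for the example only).
sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

// Build an artificially long lineage chain of 50 map steps.
val longChain = (1 to 50).foldLeft(sc.parallelize(1 to 100)) { (rdd, _) => rdd.map(_ + 1) }

// Mark the RDD for checkpointing; this must happen before the first action on it.
longChain.checkpoint()

// The action runs the job, after which the checkpoint data is written.
val n = longChain.count()
val done = longChain.isCheckpointed
```

Once checkpointed, the RDD is read back from the saved files instead of being recomputed through all 50 steps.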
70) How can you launch Spark jobs inside Hadoop MapReduce?
Using SIMR (Spark in MapReduce) users can run any spark job inside MapReduce without requiring any admin rights.
71) How does Spark use Akka?
Spark uses Akka basically for scheduling. After registering, all the workers request tasks from the master, and the master simply assigns them. Spark uses Akka for messaging between the workers and the master.
72) How can you achieve high availability in Apache Spark?
Implementing single node recovery with local file system
Using StandBy Masters with Apache ZooKeeper.
73) Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
The data storage model in Apache Spark is based on RDDs. RDDs achieve fault tolerance through lineage: an RDD always has the information on how it was built from other datasets. If any partition of an RDD is lost due to failure, lineage helps rebuild only that particular lost partition.
74) Explain about the core components of a distributed Spark application.
Driver – The process that runs the main() method of the program to create RDDs and perform transformations and actions on them.
Executor – The worker processes that run the individual tasks of a Spark job.
Cluster Manager – A pluggable component in Spark, to launch Executors and Drivers. The cluster manager allows Spark to run on top of other external managers like Apache Mesos or YARN.
75) What do you understand by Lazy Evaluation?
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it makes a note of the instructions but does nothing until it is asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
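Lazy evaluation can be observed by counting how often the function passed to map() has actually run. The sketch below assumes Spark 2.x or later and a local master; the accumulator is used only as a probe, and the names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one exists; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation-sketch"))

// Counts how many elements the map function has processed.
val touched = sc.longAccumulator("touched")

// Declaring the transformation only records it; the function has not run yet.
val mapped = sc.parallelize(1 to 5).map { n => touched.add(1); n * 10 }
val before = touched.sum

// The action triggers the actual computation.
val result = mapped.collect()
val after = touched.sum

println(s"before action: $before, after action: $after")
```

In a run without task failures, the counter stays at zero until collect() is called and equals the number of elements afterwards. Accumulator updates inside transformations can be repeated if a task is retried, so this is a demonstration aid rather than a reliable metric.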
76) Define a worker node.
A node that can run the Spark application code in a cluster is called a worker node. A worker node can have more than one worker, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if the SPARK_WORKER_INSTANCES property is not defined.
77) What do you understand by SchemaRDD?
An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.
78) What is a Parquet file?
Parquet is a columnar file format supported by many other data processing systems. Spark SQL performs both read and write operations on Parquet files and considers it to be one of the best big data analytics formats so far.
79) List the functions of Spark SQL.
Spark SQL is capable of:
Loading data from a variety of structured sources
Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau
Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more
80) Illustrate some demerits of using Spark?
Since Spark utilizes more storage space compared to Hadoop and MapReduce, there may arise certain problems. Developers need to be careful while running their applications in Spark. Instead of running everything on a single node, the work must be distributed over multiple clusters.
81) Features of Spark Streaming?
Second scale latencies
Scalability and efficient fault tolerance
“Once and only once” processing
Integrates batch and realtime processing
Easy to develop: uses Spark’s high-level API