Apache Spark Interview Questions and Answers

Q. CAN YOU EXPLAIN THE MAIN FEATURES OF APACHE SPARK?
Supports several programming languages – Spark can be coded in four programming languages, i.e. Java, Python, R, and Scala. It also offers high-level APIs for them. Additionally, Apache Spark supplies Python and Scala shells.
Lazy Evaluation – Apache Spark uses lazy evaluation to postpone computation until it becomes absolutely necessary.
Machine Learning – The MLlib machine learning component of Apache Spark is useful for large-scale data processing. It removes the need for separate engines for processing and machine learning.
Multiple Format Support – Apache Spark supports multiple data sources, like Cassandra, Hive, JSON, and Parquet. The Data Sources API provides a pluggable framework for accessing structured data through Spark SQL.
Real-Time Computation – Spark is designed to meet massive scalability requirements. Thanks to in-memory computing, Spark’s computation is close to real time and has low latency.
Speed – Spark can be up to 100x faster than Hadoop MapReduce for large-scale data processing. It achieves this speed through controlled partitioning: the general-purpose cluster-computing architecture manages data in partitions, which helps parallelize distributed data processing with minimal network traffic.
Hadoop Integration – Spark provides seamless access to Hadoop and is a possible substitute for the Hadoop MapReduce functions. Spark is capable of operating on top of the existing Hadoop cluster using YARN for scheduling resources.

Q. WHAT IS APACHE SPARK?
Apache Spark is a data processing framework that can perform processing tasks on extensive data sets quickly. This is one of the most frequently asked Apache Spark interview questions.

Q. EXPLAIN THE CONCEPT OF SPARSE VECTOR.
A vector is a one-dimensional array of elements. In many applications, most of the vector’s elements are zero; such a vector is said to be sparse.

Q. WHAT IS THE METHOD FOR CREATING A DATA FRAME?
A DataFrame can be created from Hive tables and from structured data files.

Q. EXPLAIN WHAT SCHEMARDD IS.
A SchemaRDD is similar to a table in a traditional relational database. A SchemaRDD can be created from an existing RDD, Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.

Q. EXPLAIN WHAT ACCUMULATORS ARE.
Accumulators are variables used to aggregate information across the executors.

Q. EXPLAIN WHAT THE CORE OF SPARK IS.
Spark Core is a basic execution engine on the Spark platform.

Q. EXPLAIN HOW DATA IS REPRESENTED IN SPARK?
Data can be represented in Apache Spark in three ways: RDD, DataFrame, and Dataset.

Q. HOW MANY FORMS OF TRANSFORMATIONS ARE THERE?
There are two forms of transformation: narrow transformations and wide transformations.

Q. WHAT’S PAIRED RDD?
A paired RDD is an RDD whose elements are key-value pairs.

Q. WHAT IS MEANT BY IN-MEMORY PROCESSING IN SPARK?
In in-memory computing, we keep data in random access memory instead of on slow disk drives.

Q. EXPLAIN THE DIRECTED ACYCLIC GRAPH.
A Directed Acyclic Graph (DAG) is a finite directed graph with no directed cycles.

Q. EXPLAIN THE LINEAGE GRAPH.
The lineage graph is the graph of all the parent RDDs of an RDD.

Q. EXPLAIN LAZY EVALUATION IN SPARK.
Lazy evaluation, also known as call-by-need, is a strategy that defers computation until a value is actually needed.
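
The idea of deferring work until a result is requested can be sketched in plain Python without a cluster. This is only a toy illustration, not Spark's API: the LazyPipeline class and the square helper are hypothetical names invented for the example. Transformations are merely recorded, and nothing runs until collect() is called.

```python
class LazyPipeline:
    """Toy stand-in for an RDD: records transformations, runs them only on collect()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: nothing is computed here, the step is only recorded.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded chain applied to the data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def square(x):
    calls.append(x)  # side effect so we can observe when the work happens
    return x * x


pipeline = LazyPipeline(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
calls_before_action = len(calls)  # nothing has run yet
result = pipeline.collect()       # triggers the computation
```

Because the whole chain is known before anything executes, an engine built this way is free to reorder or skip work, which is the motivation usually given for lazy evaluation.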

Q. EXPLAIN THE ADVANTAGE OF A LAZY EVALUATION.
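It makes the program more manageable. Because Spark sees the whole chain of transformations before running anything, it can also avoid unnecessary work.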

Q. EXPLAIN THE CONCEPT OF “PERSISTENCE”.
RDD persistence is an optimization technique that saves the result of evaluating an RDD so that it can be reused.

Q. WHAT IS THE MAPREDUCE MODEL?
MapReduce is a programming model used for processing vast amounts of data.

Q. WHEN PROCESSING INFORMATION FROM HDFS, IS THE CODE PERFORMED NEAR THE DATA?
Yes, in most situations, it is. Spark creates executors close to the nodes that contain the data.

Q. DOES SPARK ALSO CONTAIN THE STORAGE LAYER?
No, it doesn’t have its own storage layer, but it lets you use many data sources.

Q. WHERE DOES THE SPARK DRIVER OPERATE ON YARN?
In client mode, the Spark driver runs on the client machine. In cluster mode, it runs inside the YARN ApplicationMaster on the cluster.

Q. HOW IS MACHINE LEARNING CARRIED OUT IN SPARK?
Machine learning is carried out in Spark with the help of MLlib. It’s a scalable machine learning library provided by Spark.

Q. EXPLAIN WHAT A PARQUET FILE IS.
Parquet is a columnar file format that is supported by many other data processing systems.

Q. EXPLAIN THE LINEAGE OF THE RDD.
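Spark does not replicate data in memory by default. Instead, each RDD keeps its lineage, the chain of transformations used to build it, so lost data can be rebuilt from that lineage.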
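
A minimal sketch of how lineage allows lost data to be rebuilt, written in plain Python. ToyRDD and all of its methods are hypothetical names for this illustration and do not correspond to Spark classes. Each object remembers its parent and the function that derives it, so a dropped partition can be recomputed from the source.

```python
class ToyRDD:
    """Toy RDD: remembers its parent and the function that derives it (its lineage)."""

    def __init__(self, parent=None, fn=None, source_partitions=None):
        self.parent = parent
        self.fn = fn
        self.source_partitions = source_partitions
        self.cache = {}  # partition index -> computed rows (may be lost)

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def lineage(self):
        # Walk back to the source, listing every ancestor.
        node, chain = self, []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

    def compute(self, index):
        # Use the cached partition if present, otherwise rebuild it from the parent.
        if index not in self.cache:
            if self.parent is None:
                rows = list(self.source_partitions[index])
            else:
                rows = [self.fn(r) for r in self.parent.compute(index)]
            self.cache[index] = rows
        return self.cache[index]


base = ToyRDD(source_partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

first = plus_one.compute(1)
plus_one.cache.clear()           # simulate losing the in-memory partitions
doubled.cache.clear()
recovered = plus_one.compute(1)  # rebuilt from the source via the lineage
```

Only the lost partition is recomputed here; partition 0 is never touched, which mirrors why lineage-based recovery can be cheaper than keeping replicas.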

Q. EXPLAIN THE SPARK EXECUTOR.
Executors are worker nodes’ processes in charge of running individual tasks in a given Spark job.

Q. EXPLAIN THE MEANING OF A WORKER NODE.
A worker node is any node in a cluster that can run the application code.

Q. EXPLAIN THE SPARSE VECTOR.
A sparse vector is stored as two parallel arrays, one for indices and the other for values.
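
The indices/values layout can be sketched in plain Python. Spark's MLlib ships its own sparse vector type; the helpers below (to_sparse, to_dense, sparse_dot) are hypothetical names written for this illustration and are not that API.

```python
def to_sparse(dense):
    """Return (size, indices, values) for the non-zero entries of a dense list."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the dense list from the two parallel arrays."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def sparse_dot(a, b):
    """Dot product of two sparse vectors, touching only stored entries."""
    _, a_idx, a_val = a
    _, b_idx, b_val = b
    b_lookup = dict(zip(b_idx, b_val))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a_idx, a_val))


dense = [0.0, 3.0, 0.0, 0.0, 5.0, 0.0]
sparse = to_sparse(dense)  # size 6, indices [1, 4], values [3.0, 5.0]
other = to_sparse([1.0, 2.0, 0.0, 0.0, 4.0, 0.0])
product = sparse_dot(sparse, other)
```

Storing only two of the six entries is the whole point: memory and arithmetic scale with the number of non-zero values rather than with the vector length.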

Q. IS IT POSSIBLE TO RUN APACHE SPARK ON APACHE MESOS?
Yes, Spark can run on hardware clusters that are managed by Mesos.

Q. EXPLAIN THE APACHE SPARK ACCUMULATORS.
Accumulators are variables that are only added to, through an associative and commutative operation.
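
Why the operation has to be associative and commutative can be shown with a small plain-Python sketch. It does not use Spark's accumulator API; run_task and the sample partitions are made up for the example. Each task produces a partial count, and the partial counts are merged in two different orders.

```python
import random
from functools import reduce


def run_task(partition):
    """Each task counts the blank lines it sees and returns a partial count."""
    return sum(1 for line in partition if not line.strip())


partitions = [
    ["a", "", "b"],
    ["", ""],
    ["c"],
    ["", "d", ""],
]

partials = [run_task(p) for p in partitions]

# Addition is associative and commutative, so the merge order does not matter.
in_order = reduce(lambda acc, x: acc + x, partials, 0)

shuffled = partials[:]
random.Random(7).shuffle(shuffled)  # tasks may finish in any order
out_of_order = reduce(lambda acc, x: acc + x, shuffled, 0)
```

If the merge operation were, say, subtraction, the two totals could differ depending on which task finished first, which is why such operations are not suitable for accumulators.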

Q. WHY IS THERE A NEED FOR BROADCAST VARIABLES WHILE USING APACHE SPARK?
Because broadcast variables are read-only variables cached in memory on every machine, a copy does not have to be shipped with every task.

Q. EXPLAIN THE IMPORTANCE OF THE SLIDING WINDOW OPERATION.
In networking, a sliding window controls the transmission of data packets between computers. In Spark Streaming, windowed computations apply transformations over a sliding window of data.
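
A windowed computation over a stream of micro-batches can be sketched in plain Python. The function windowed_sums is a hypothetical helper, not part of Spark; its two parameters are named after the window length and slide interval used in Spark Streaming's window operations, and both are counted in batches here rather than in seconds.

```python
from collections import deque


def windowed_sums(batches, window_length, slide_interval):
    """Sum over the last `window_length` batches, emitted every `slide_interval` batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    results = []
    for batch_number, batch in enumerate(batches, start=1):
        window.append(sum(batch))
        if batch_number % slide_interval == 0:
            results.append(sum(window))
    return results


batches = [[1, 2], [3], [4, 5], [6], [7, 8], [9]]
sums = windowed_sums(batches, window_length=3, slide_interval=2)
```

With a window of three batches and a slide of two, consecutive windows overlap by one batch, so some data contributes to more than one result.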

Q. EXPLAIN THE DISCRETIZED STREAM OF APACHE SPARK.
A Discretized Stream (DStream) is the fundamental abstraction provided by Spark Streaming.

Make sure you revise these Spark Streaming interview questions before moving on to the next set of questions.

Q. STATE THE DISTINCTION BETWEEN SPARK SQL AND HQL.
Spark SQL is a component that runs on top of the Spark Core engine, whereas HQL combines object-oriented programming (OOP) concepts with relational database concepts.

Q. EXPLAIN THE USE OF BLINK DB.
BlinkDB is a query engine that lets you run interactive SQL queries on large volumes of data.

Q. EXPLAIN THE APACHE SPARK WORKER NODE.
A worker node is any node that can run the application code in a cluster.

Q. EXPLAIN THE CATALYST FRAMEWORK.
Catalyst is the query optimization framework in Spark SQL.

Q. DOES SPARK USE HADOOP?
Spark has its own cluster management and uses Hadoop only for storage.

Q. WHY DOES SPARK USE AKKA?
Spark simply uses Akka for scheduling.

Q. EXPLAIN THE WORKER NODE.
A node that can run the Spark application code in a cluster is called a worker node.

Q. EXPLAIN WHAT YOU UNDERSTAND ABOUT SCHEMA RDD?
A Schema RDD consists of row objects together with schema information about the type of data in each column.

Q. What can you say about Spark Datasets?
Spark Datasets are data structures of Spark SQL that give JVM objects all the benefits of RDDs (such as data manipulation using lambda functions) together with the Spark SQL optimised execution engine. Datasets were introduced in Spark 1.6.

Spark datasets are strongly typed structures that represent the structured queries along with their encoders.
They provide type safety to the data and also give an object-oriented programming interface.
Datasets are more structured and use lazy query expressions, so nothing is computed until an action is triggered. Datasets combine the strengths of both RDDs and DataFrames. Internally, each Dataset represents a logical plan that describes the computation required to produce the data. Once the logical plan has been analyzed and resolved, a physical query plan is formed, and it performs the actual query execution.
Datasets have the following features:

Optimized Query feature: Spark Datasets provide optimized queries using the Tungsten and Catalyst Query Optimizer frameworks. The Catalyst Query Optimizer represents and manipulates a data flow graph (a graph of expressions and relational operators). Tungsten improves the execution speed of Spark jobs by optimizing for the hardware architecture of the platform Spark runs on.
Compile-Time Analysis: Datasets are checked for syntax and type errors at compile time, which is not possible with DataFrames or regular SQL queries.
Interconvertible: Type-safe Datasets can be converted to “untyped” DataFrames by using the following methods provided by DatasetHolder:
toDS(): Dataset[T]
toDF(): DataFrame
toDF(colNames: String*): DataFrame
Faster Computation: Dataset implementations are much faster than their RDD counterparts, which improves system performance.
Persistent storage qualified: Since Datasets are both queryable and serializable, they can easily be stored in any persistent storage.
Less Memory Consumed: Spark uses the feature of caching to create a more optimal data layout. Hence, less memory is consumed.
Single Interface Multiple Languages: Single API is provided for both Java and Scala languages. These are widely used languages for using Apache Spark. This results in a lesser burden of using libraries for different types of inputs.

Q. Define Spark DataFrames.
A Spark DataFrame is a distributed collection of data organized into named columns, similar to a SQL table. It is equivalent to a table in a relational database and is mainly optimized for big data operations.
Dataframes can be created from an array of data from different data sources such as external databases, existing RDDs, Hive Tables, etc. Following are the features of Spark Dataframes:

Spark DataFrames can process data ranging in size from kilobytes to petabytes, on anything from a single node to large clusters.
They support different data formats like CSV, Avro, Elasticsearch, etc., and various storage systems like HDFS, Cassandra, MySQL, etc.
By making use of the Spark SQL Catalyst optimizer, state-of-the-art optimization is achieved.
It is possible to easily integrate Spark Dataframes with major Big Data tools using SparkCore.

Q. Define Executor Memory in Spark
Each Spark application has a fixed core count and a fixed heap size for its executors. The heap size is the memory of the Spark executor and is controlled by the spark.executor.memory property, which corresponds to the --executor-memory flag. Every Spark application has one executor on each worker node it runs on. The executor memory is a measure of how much of the worker node’s memory the application uses.

Q. What are the functions of SparkCore?
Spark Core is the main engine for large-scale distributed and parallel data processing. It consists of the distributed execution engine, which offers APIs in Java, Python, and Scala for developing distributed ETL applications.
Spark Core performs important functions such as memory management, job monitoring, fault tolerance, storage system interactions, job scheduling, and support for all the basic I/O functionality. Various additional libraries built on top of Spark Core allow diverse workloads for SQL, streaming, and machine learning. Spark Core is responsible for:

Fault recovery
Memory management and Storage system interactions
Job monitoring, scheduling, and distribution
Basic I/O functions

Q. Define Piping in Spark.
Apache Spark provides the pipe() method on RDDs, which makes it possible to compose parts of a job in any language, as long as they communicate through UNIX standard streams. With pipe(), you can write an RDD transformation that reads each element of the RDD as a String, manipulates it as required, and returns the results as Strings.
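
The standard-streams mechanism behind piping can be sketched with Python's subprocess module. This is not Spark's pipe() itself: the pipe function below is a hypothetical stand-in that handles a single list of elements, and the child process is simply another Python interpreter chosen so the example does not depend on any particular shell tool.

```python
import subprocess
import sys


def pipe(elements, command):
    """Send each element as a line to `command`'s stdin; return its stdout lines."""
    payload = "".join(f"{e}\n" for e in elements)
    completed = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return completed.stdout.splitlines()


# The external program upper-cases whatever it reads; any executable that
# reads stdin and writes stdout could be used instead.
upper_cmd = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: print(line.strip().upper())",
]
piped = pipe(["spark", "rdd", "pipe"], upper_cmd)
```

Every element crosses the process boundary as text, which is why both the input and the output of a piped transformation are Strings.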

Q. What API is used for Graph Implementation in Spark?
Spark provides a powerful API called GraphX that extends the Spark RDD to support graphs and graph-based computations. The extension is called the Resilient Distributed Property Graph, a directed multigraph that can have multiple parallel edges. Each edge and vertex has user-defined properties associated with it. Parallel edges indicate multiple relationships between the same pair of vertices. GraphX has a set of operators such as subgraph, mapReduceTriplets, joinVertices, etc. that support graph computation. It also includes a large collection of graph builders and algorithms that simplify graph analytics tasks.

Q. What is a DStream?

A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets (RDDs) that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two kinds of operations –

Transformations that produce a new DStream.
Output operations that write data to an external system.

Q. When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?

No. Spark need not be installed on every node when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.

Q. What is Catalyst framework?

Catalyst framework is a new optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.

Q. Name a few companies that use Apache Spark in production.

Pinterest, Conviva, Shopify, OpenTable

Q. Which spark library allows reliable file sharing at memory speed across different cluster frameworks?

Tachyon (since renamed Alluxio)

Q. Why is BlinkDB used?

BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time. BlinkDB builds a few stratified samples of the original data and then executes the queries on the samples, rather than the original data in order to reduce the time taken for query execution. The sizes and numbers of the stratified samples are determined by the storage availability specified when importing the data. BlinkDB consists of two main components:

Sample building engine: determines the stratified samples to be built based on workload history and data distribution.

Dynamic sample selection module: selects the correct sample files at runtime based on the time and/or accuracy requirements of the query.

Q. How can you compare Hadoop and Spark in terms of ease of use?

Hadoop MapReduce requires programming in Java which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax takes time. Spark has interactive APIs for different languages like Java, Python or Scala and also includes Shark i.e. Spark SQL for SQL lovers – making it comparatively easier to use than Hadoop.

Q. What are the common mistakes developers make when running Spark applications?

Developers often make the mistake of –

Hitting the web service several times by using multiple clusters.
Running everything on the local node instead of distributing it.
Developers need to be careful with both, as Spark makes use of memory for processing.

Q. What is the advantage of a Parquet file?

A Parquet file is a columnar format file that helps –

Limit I/O operations
Consume less space
Fetch only the required columns.
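
Why a columnar layout limits I/O can be seen in a small plain-Python comparison. This is only a sketch of the layout idea, not of the Parquet format or any Parquet library; the sample records and the to_columns helper are invented for the example.

```python
rows = [
    {"id": 1, "name": "ann", "city": "Oslo", "amount": 10.0},
    {"id": 2, "name": "bob", "city": "Rome", "amount": 25.5},
    {"id": 3, "name": "cy", "city": "Oslo", "amount": 4.5},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return {key: [r[key] for r in records] for key in records[0]}


columns = to_columns(rows)

# Row layout: whole records are read even though only one field is needed.
fields_read_row_layout = sum(len(r) for r in rows)
total_from_rows = sum(r["amount"] for r in rows)

# Column layout: only the "amount" column is read.
fields_read_column_layout = len(columns["amount"])
total_from_columns = sum(columns["amount"])
```

Both layouts give the same total, but the columnar one reads a quarter of the fields. Values in one column also share a type, which is what makes columnar data compress well.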

Q. What are the various data sources available in SparkSQL?

Parquet file
JSON Datasets
Hive tables

Q. How does Spark use Hadoop?

Spark has its own cluster management and mainly uses Hadoop for storage.

Q. What are the key features of Apache Spark that you like?

Spark provides advanced analytic options like graph algorithms, machine learning, streaming data, etc.
It has built-in APIs in multiple languages like Java, Scala, Python and R.
It offers good performance gains, as it helps run an application in a Hadoop cluster up to ten times faster on disk and up to 100 times faster in memory.

Q. What do you understand by Pair RDD?

Special operations can be performed on RDDs in Spark that contain key/value pairs; such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that aggregates data for each key and a join() method that combines different RDDs based on elements that have the same key.
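
The semantics of these two operations can be sketched on ordinary Python lists of (key, value) tuples. The functions reduce_by_key and join below are hypothetical single-machine stand-ins written for this illustration; they are not Spark's implementations and do no parallel work.

```python
from collections import defaultdict


def reduce_by_key(pairs, fn):
    """Combine all values that share a key using fn."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return sorted(merged.items())


def join(left, right):
    """Inner join: emit (key, (left_value, right_value)) for matching keys."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return sorted((k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, []))


sales = [("apple", 3), ("pear", 2), ("apple", 4), ("fig", 1)]
prices = [("apple", 0.5), ("pear", 0.8)]

totals = reduce_by_key(sales, lambda a, b: a + b)
joined = join(totals, prices)
```

Keys without a match on the other side ("fig" here) are dropped by the inner join. The results are sorted only to make the output easy to read.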
