Hadoop Administration Interview Questions

Q. How Will You Decide Whether You Need To Use The Capacity Scheduler Or The Fair Scheduler?

Fair Scheduling is the process in which resources are assigned to jobs so that all jobs get an equal share of resources over time. The Fair Scheduler can be used under the following circumstances:
i) If you want the jobs to make equal progress instead of following the FIFO order, use Fair Scheduling.
ii) If you have slow connectivity and data locality plays a vital role and makes a significant difference to the job runtime, use Fair Scheduling.
iii) Use Fair Scheduling if there is a lot of variability in utilization between pools.

The Capacity Scheduler allows the Hadoop MapReduce cluster to run as a shared, multi-tenant cluster to maximize the utilization and throughput of the cluster. The Capacity Scheduler can be used under the following circumstances:
i) If the jobs require scheduler determinism, the Capacity Scheduler can be useful.
ii) The Capacity Scheduler's memory-based scheduling method is useful if the jobs have varying memory requirements.
iii) If you want to enforce resource allocation because you know the cluster utilization and workload very well, use the Capacity Scheduler.
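To make the idea of "an equal share of resources" concrete, here is a minimal sketch of a max-min fair allocation. It is only an illustration of the principle, not the Fair Scheduler's actual implementation; the function name `fair_shares`, the slot counts and the job names are all made up for this example.

```python
def fair_shares(total_slots, demands):
    """Split total_slots evenly between jobs, never giving a job more than it asks for.

    demands maps a (hypothetical) job name to the number of slots it wants.
    """
    shares = {job: 0 for job in demands}
    remaining = total_slots
    active = {job for job, want in demands.items() if want > 0}
    while remaining > 0 and active:
        quota = remaining / len(active)  # equal share of what is left
        # Jobs whose outstanding demand fits inside the equal share are fully served.
        satisfied = {j for j in active if demands[j] - shares[j] <= quota}
        if not satisfied:
            # Every remaining job wants more than its equal share: split evenly and stop.
            for j in active:
                shares[j] += quota
            remaining = 0
        else:
            for j in satisfied:
                remaining -= demands[j] - shares[j]
                shares[j] = demands[j]
            active -= satisfied
    return shares


# A small job gets everything it asks for; the two large jobs split the rest equally.
print(fair_shares(100, {"small": 10, "big1": 60, "big2": 60}))
# {'small': 10, 'big1': 45.0, 'big2': 45.0}
```

Under FIFO ordering, by contrast, whichever job arrived first would take the slots it needs before later jobs get any.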

Q. What Are The Daemons Required To Run A Hadoop Cluster?

NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker.

Q. How Will You Restart A Namenode?

The easiest way of doing this is to stop the running daemons with the stop-all.sh shell script. Once this is done, restart the NameNode by running start-all.sh.

Q. Explain About The Different Schedulers Available In Hadoop?

FIFO Scheduler: This scheduler does not consider the heterogeneity in the system but orders the jobs based on their arrival times in a queue.
COSHH: This scheduler considers the workload, cluster and user heterogeneity for scheduling decisions.
Fair Sharing: This Hadoop scheduler defines a pool for each user. The pool contains a number of map and reduce slots on a resource. Each user can use their own pool to execute jobs.

Q. List A Few Hadoop Shell Commands That Are Used To Perform A Copy Operation?

  • fs -put
  • fs -copyToLocal
  • fs -copyFromLocal

Q. What Is Jps Command Used For?

The jps command is used to verify whether the daemons that run the Hadoop cluster are working or not. The output of the jps command shows the status of the NameNode, Secondary NameNode, DataNode, TaskTracker and JobTracker.
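As a rough illustration of how this check can be automated, the sketch below parses text in the "pid ClassName" form that jps prints and reports which expected daemons are absent. The helper name `missing_daemons` and the sample output (including the process IDs) are invented for this example; on a real node the text would come from running jps itself.

```python
def missing_daemons(jps_output, expected):
    """Return the expected daemon names that do not appear in jps-style output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split(maxsplit=1)  # "<pid> <main class name>"
        if len(parts) == 2:
            running.add(parts[1].strip())
    return sorted(set(expected) - running)


# Daemon class names for a Hadoop 1.x style cluster, as listed in the answer above.
HADOOP1_DAEMONS = ["NameNode", "SecondaryNameNode", "DataNode", "JobTracker", "TaskTracker"]

# Made-up sample output for illustration only.
sample = """4821 NameNode
4950 DataNode
5102 SecondaryNameNode
5290 JobTracker
5533 Jps"""

print(missing_daemons(sample, HADOOP1_DAEMONS))  # ['TaskTracker']
```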

Q. What Are The Important Hardware Considerations When Deploying Hadoop In Production Environment?

Memory: The system's memory requirements vary between the worker services and the management services, depending on the application.
Operating System: A 64-bit operating system avoids restrictions on the amount of memory that can be used on worker nodes.
Storage: It is preferable to design a Hadoop platform by moving the compute activity to the data, to achieve scalability and high performance.
Capacity: Large Form Factor (3.5") disks cost less and store more than Small Form Factor disks.
Network: Two TOR (top-of-rack) switches per rack provide better redundancy.
Computational Capacity: This is determined by the total number of MapReduce slots available across all the nodes in a Hadoop cluster.

Q. How Many Namenodes Can You Run On A Single Hadoop Cluster?

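Only one in a classic Hadoop 1.x cluster. In an HA (High Availability) setup, a standby NameNode runs alongside the active one.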

Q. What Happens When The Namenode On The Hadoop Cluster Goes Down?

The file system goes offline whenever the NameNode is down.

Q. What Is The conf/hadoop-env.sh File And Which Variable In The File Should Be Set For Hadoop To Work?

This file provides the environment for Hadoop to run and contains the following variables: HADOOP_CLASSPATH, JAVA_HOME and HADOOP_LOG_DIR. The JAVA_HOME variable must be set for Hadoop to run.

Q. Apart From Using The Jps Command, Is There Any Other Way That You Can Check Whether The Namenode Is Working Or Not?

Use the command /etc/init.d/hadoop-0.20-namenode status.

Q. In A Mapreduce System, If The Hdfs Block Size Is 64 MB And There Are 3 Files Of Size 127 MB, 64 KB And 65 MB With FileInputFormat, How Many Input Splits Are Likely To Be Made By The Hadoop Framework?

2 splits each for the 127 MB and 65 MB files and 1 split for the 64 KB file, counting one split per HDFS block. (FileInputFormat allows a 10% slop on the last split, so in practice a small tail such as the extra 1 MB of the 65 MB file may be folded into the previous split.)
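The arithmetic can be checked with a small sketch. It walks a file in split-sized steps, the way a block-by-block count does, and takes a `slop` parameter so the effect of a tolerance on the last split can be compared. The function name `count_splits` is hypothetical, and the value 1.1 used in the second call is an assumption modelled on FileInputFormat's split slop; the sketch only handles non-empty files.

```python
MB = 1024 * 1024
KB = 1024


def count_splits(file_size, split_size=64 * MB, slop=1.0):
    """Count input splits for a non-empty file of file_size bytes.

    slop=1.0 counts one split per block; a larger slop lets a small
    tail be merged into the final split instead of forming its own.
    """
    splits = 0
    remaining = file_size
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1  # whatever is left becomes the final split
    return splits


sizes = [127 * MB, 64 * KB, 65 * MB]
print([count_splits(s) for s in sizes])            # [2, 1, 2] -> 5 splits
print([count_splits(s, slop=1.1) for s in sizes])  # [2, 1, 1] -> 4 splits
```

The first line reproduces the usual interview answer of 5 splits; the second shows why a 65 MB file can end up as a single split when a 10% tolerance is applied.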

Q. Which Command Is Used To Verify If The Hdfs Is Corrupt Or Not?

The hadoop fsck (File System Check) command is used to check for missing or corrupt blocks.

Q. List Some Use Cases Of The Hadoop Ecosystem?

Text Mining, Graph Analysis, Semantic Analysis, Sentiment Analysis, Recommendation Systems.

Q. How Can You Kill A Hadoop Job?

hadoop job -kill jobID

Q. In what modes can Hadoop run?

Hadoop can run in standalone mode, pseudo-distributed mode or fully distributed mode. Hadoop was designed to be deployed on a multi-node cluster. However, it can also be deployed on a single machine and run as a single process for testing.

Q. How will the administrator deploy the various components of Hadoop in production?

Deploy the NameNode and JobTracker on the master node, and deploy the DataNodes and TaskTrackers on multiple slave nodes. Only one NameNode and one JobTracker are needed on the system. The number of DataNodes depends on the available hardware.

Q. What are the best practices for deploying the secondary NameNode?

Deploy the secondary NameNode on a separate standalone machine, so that it does not interfere with the operation of the primary NameNode. The secondary NameNode has the same memory requirements as the primary NameNode.

Q. Is there a standard procedure for deploying Hadoop?

No, there are some differences between the various distributions. However, they all require the Hadoop JARs to be installed on the machine. All Hadoop distributions share some common requirements, but the specific procedures differ between vendors, since they all include some degree of proprietary software.

Q. What is the role of the secondary NameNode?

The secondary NameNode performs the CPU-intensive operation of merging the edit log into a snapshot of the current file system. Because of the extra CPU requirements of this operation and of the metadata backup, the secondary NameNode runs as a separate process.

Q. What are the side effects of not running a secondary NameNode?

Cluster performance will degrade over time, since the edit log will grow bigger and bigger. If the secondary NameNode is not running at all, the edit log will grow significantly and will slow the system down. In addition, the system will stay in safe mode for an extended time, since the NameNode needs to combine the edit log with the current file system checkpoint image.

Q. What happens if a DataNode loses its network connection for a few minutes?

The NameNode will detect that the DataNode is not responding and will start replicating the data from the remaining replicas. The NameNode monitors the status of all DataNodes and keeps track of which blocks are located on each node, so when a DataNode becomes unavailable it triggers replication of the data from the existing replicas. When the DataNode comes back online, the NameNode, which actively maintains the replication factor, will delete the over-replicated data. Note: the data might be deleted from the original DataNode.

Q. What happens if one of the DataNodes has a much slower CPU?

The job will run only as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact, because Hadoop is designed specifically with commodity hardware in mind. Speculative execution helps offset slow workers: multiple instances of the same task are created, the JobTracker takes the first result, and the second instance of the task is killed.

Q. What is speculative execution?

If speculative execution is enabled, the JobTracker will issue multiple instances of the same task on multiple nodes and will take the result of the task that finishes first. The other instances of the task will be killed.

Speculative execution is used to offset the impact of slow workers in the cluster. The JobTracker creates multiple instances of the same task and takes the result of the first successful one. The remaining instances are discarded.
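The effect on job runtime can be seen in a toy model. In the sketch below, each task has one or more attempts with a given duration in seconds; without speculation only the first attempt runs, and with speculation the fastest attempt wins. The function name `job_time` and all durations are invented, and the model simplifies things by assuming the backup attempt starts at the same moment as the original.

```python
def job_time(tasks, speculative):
    """Return when the job finishes: the finish time of its slowest task.

    tasks is a list of lists; each inner list holds the durations (seconds)
    of the attempts available for one task. The first entry is the original attempt.
    """
    finish_times = []
    for attempts in tasks:
        if speculative:
            finish_times.append(min(attempts))  # first attempt to finish wins
        else:
            finish_times.append(attempts[0])    # only the original attempt runs
    return max(finish_times)


# Third task landed on a slow node (95 s); its backup on a healthy node takes 14 s.
tasks = [[10], [12], [95, 14]]
print(job_time(tasks, speculative=False))  # 95
print(job_time(tasks, speculative=True))   # 14
```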

Q. How many racks does a Hadoop cluster need to run reliably?

To ensure reliable operation, at least 2 racks configured with rack placement are recommended. Hadoop has a built-in rack awareness mechanism that allows data to be distributed between different racks based on the configuration.

Q. Does the NameNode have any special requirements?

Yes. The NameNode holds information about all the files in the system and needs to be extra reliable. It is a single point of failure, so its metadata needs to be replicated in multiple places. Note that the community is working on solving the NameNode single point of failure.

Q. If you have a 128 MB file and the replication factor is set to 3, how many blocks can you find on the cluster corresponding to this file (assuming the default Apache and Cloudera configuration)?

Based on the configuration settings, the file will be divided into blocks using the default block size of 64 MB: 128 MB / 64 MB = 2 blocks. Each block is replicated according to the replication factor (3 by default): 2 * 3 = 6 blocks.
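The same calculation as a short sketch, using the 64 MB block size and replication factor of 3 from the answer as defaults. The function name `block_replicas` is made up for this example; the ceiling division covers files that are not an exact multiple of the block size.

```python
import math


def block_replicas(file_mb, block_mb=64, replication=3):
    """Return (number of blocks, total block replicas stored on the cluster)."""
    blocks = math.ceil(file_mb / block_mb)  # a partial last block still counts as a block
    return blocks, blocks * replication


print(block_replicas(128))  # (2, 6)
print(block_replicas(130))  # (3, 9)
```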

Q. What is distributed copy (distcp)?

DistCp is a Hadoop utility that launches a MapReduce job to copy data. Its primary use is copying large amounts of data. One of the major challenges in a Hadoop environment is copying data across multiple clusters, and distcp allows the data to be copied in parallel using multiple DataNodes.

Q. What is the replication factor?

The replication factor controls how many times each individual block is replicated. Data is replicated across the Hadoop cluster according to the replication factor. A high replication factor keeps data available in the event of a failure.

Q. Which daemons run on the master node?

NameNode, Secondary NameNode and JobTracker.

Hadoop consists of five separate daemons, each running in its own JVM. The NameNode, Secondary NameNode and JobTracker run on the master node. The DataNode and TaskTracker run on each slave node.

Q. What is rack awareness?

Rack awareness is the way in which the NameNode decides how to place blocks based on the rack definitions. Hadoop tries to minimize the network traffic between DataNodes within the same rack and only contacts remote racks if it has to. Thanks to rack awareness, the NameNode can control this.

Hadoop Interview Questions And Answers


1)Explain “Big Data” and what are five V’s of Big Data?
“Big data” is the term for a collection of data sets so large and complex that they are difficult to process using relational database management tools or traditional data processing applications. It is difficult to capture, curate, store, search, share, transfer, analyze, and visualize Big Data. Big Data has also emerged as an opportunity for companies: they can now derive value from their data and gain a distinct advantage over their competitors through better business decision-making.

♣ Tip: It will be a good idea to talk about the 5Vs in such questions, whether it is asked specifically or not!

Volume: The volume represents the amount of data which is growing at an exponential rate i.e. in Petabytes and Exabytes.
Velocity: Velocity refers to the rate at which data is growing, which is very fast. Today, yesterday’s data is already considered old. Nowadays, social media is a major contributor to the velocity of growing data.
Variety: Variety refers to the heterogeneity of data types. In other words, the data gathered comes in a variety of formats such as video, audio, CSV, etc.
Veracity: Veracity refers to doubt or uncertainty about the available data due to inconsistency and incompleteness. Available data can sometimes get messy and may be difficult to trust. With many forms of big data, quality and accuracy are difficult to control. The volume is often the reason behind the lack of quality and accuracy in the data.
Value: It is all well and good to have access to big data, but unless we can turn it into value it is useless. Turning it into value means asking: is it adding to the benefits of the organization? Is the organization working on Big Data achieving a high ROI (Return On Investment)? Unless working on Big Data adds to their profits, it is useless.

2)What is Hadoop and its components.
When “Big Data” emerged as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a framework which provides us various services or tools to store and process Big Data. It helps in analyzing Big Data and making business decisions out of it, which can’t be done efficiently and effectively using traditional systems.

♣ Tip: Now, while explaining Hadoop, you should also explain the main components of Hadoop, i.e.:

Storage unit– HDFS (NameNode, DataNode)
Processing framework– YARN (ResourceManager, NodeManager)

3)Name some companies that use Hadoop.?

Yahoo (one of the biggest users and the contributor of more than 80% of the Hadoop code)

4)What are active and passive “NameNodes”?
In HA (High Availability) architecture, we have two NameNodes – Active “NameNode” and Passive “NameNode”.

Active “NameNode” is the “NameNode” which works and runs in the cluster.
Passive “NameNode” is a standby “NameNode”, which has similar data as active “NameNode”.
When the active “NameNode” fails, the passive “NameNode” replaces the active “NameNode” in the cluster. Hence, the cluster is never without a “NameNode” and so it never fails.

5)What is a checkpoint?
In brief, “Checkpointing” is a process that takes an FsImage, edit log and compacts them into a new FsImage. Thus, instead of replaying an edit log, the NameNode can load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by Secondary NameNode.
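The compaction step can be pictured with a toy model in which the FsImage is a dictionary of paths and the edit log is a list of operations. This is only a sketch of the idea; the operation names, the dictionary layout and the functions `apply_edit` and `checkpoint` are assumptions made for illustration and do not reflect the real on-disk formats.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (toy model)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = {"replication": args[0]}
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "set_replication":
        namespace[path]["replication"] = args[0]
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Replay the edit log onto a copy of the image; return (new image, empty log)."""
    new_image = copy.deepcopy(fsimage)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {"/a": {"replication": 3}}
edits = [("create", "/b", 3), ("set_replication", "/a", 2), ("delete", "/b")]

new_image, new_log = checkpoint(fsimage, edits)
print(new_image)  # {'/a': {'replication': 2}}
print(new_log)    # []
```

After the checkpoint, loading `new_image` gives the final state directly, with no edits left to replay, which is where the startup-time saving comes from.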

6)What is the port number for NameNode, Task Tracker and Job Tracker?

NameNode 50070

Job Tracker 50030

Task Tracker 50060

7)What does ‘jps’ command do?
The ‘jps’ command helps us to check if the Hadoop daemons are running or not. It shows all the Hadoop daemons i.e namenode, datanode, resourcemanager, nodemanager etc. that are running on the machine.

8) Explain about the indexing process in HDFS?
Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that further points to the address where the next part of data chunk is stored.

9)Whenever a client submits a hadoop job, who receives it?

The JobTracker receives the Hadoop job. The NameNode looks up the data requested by the client and provides the block information, and the JobTracker takes care of resource allocation for the job to ensure timely completion.

10)What are the main configuration parameters in a “MapReduce” program?
The main configuration parameters which users need to specify in “MapReduce” framework are:

Job’s input locations in the distributed file system
Job’s output location in the distributed file system
Input format of data
Output format of data
Class containing the map function
Class containing the reduce function
JAR file containing the mapper, reducer and driver classes

11)What is the purpose of “RecordReader” in Hadoop?
The “InputSplit” defines a slice of work, but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “Input Format”.
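As a simplified picture of what a line-oriented record reader does, the sketch below turns the byte range of one split into (byte offset, line) pairs. The function name `read_records` is hypothetical, and the boundary rule it uses (a split that does not start at byte 0 skips its first, possibly partial, line, while the previous split reads past its end to finish its last line) is an assumption modelled on how line-based readers avoid losing or duplicating records.

```python
def read_records(data: bytes, start: int, length: int):
    """Yield (byte_offset, line) pairs for the split data[start:start + length]."""
    end = start + length
    pos = start
    if start != 0:
        # The line containing 'start' belongs to the previous split: skip it.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    while pos < len(data) and pos <= end:
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        yield pos, data[pos:line_end].decode("utf-8")  # key = offset, value = line text
        pos = line_end + 1


data = b"alpha\nbeta\ngamma\ndelta\n"
print(list(read_records(data, 0, 12)))   # [(0, 'alpha'), (6, 'beta'), (11, 'gamma')]
print(list(read_records(data, 12, 11)))  # [(17, 'delta')]
```

The split boundary at byte 12 falls in the middle of "gamma", yet every line is produced exactly once across the two splits.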

12)How do “reducers” communicate with each other?
This is a tricky question. The “MapReduce” programming model does not allow “reducers” to communicate with each other. “Reducers” run in isolation.

13)What is a “Combiner”?
A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the quantum of data that is required to be sent to the “reducers”.
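A word-count sketch shows the saving. Map output is produced per node, optionally combined locally, and then reduced; the number of pairs sent to the reducer is counted in both cases. The node names, input lines and function names are all made up for this illustration, and the whole pipeline runs in one process rather than on a cluster.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts locally on one node before anything is sent."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def run(node_inputs, use_combiner):
    """Return (final counts, number of pairs sent to the reducer)."""
    shuffled = []
    for lines in node_inputs.values():
        pairs = [pair for line in lines for pair in map_words(line)]
        shuffled.extend(combine(pairs) if use_combiner else pairs)
    totals = Counter()
    for word, count in shuffled:  # reducer: sum counts per word
        totals[word] += count
    return dict(totals), len(shuffled)


node_inputs = {"node1": ["a b a", "a c"], "node2": ["b b c"]}
print(run(node_inputs, use_combiner=False))  # ({'a': 3, 'b': 3, 'c': 2}, 8)
print(run(node_inputs, use_combiner=True))   # ({'a': 3, 'b': 3, 'c': 2}, 5)
```

The final counts are identical, but the combiner cuts the pairs shipped to the reducer from 8 to 5. This works here because addition is associative and commutative, so partial sums can be merged safely.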

14) What are the different relational operations in “Pig Latin” you worked with?
Different relational operators are:

for each
order by

15) What is a UDF?
If some functions are unavailable in built-in operators, we can programmatically create User Defined Functions (UDF) to bring those functionalities using other languages like Java, Python, Ruby, etc. and embed it in Script file.

16)What are the components of Apache HBase?
HBase has three major components, i.e. HMaster Server, HBase RegionServer and Zookeeper.

Region Server: A table can be divided into several regions. A group of regions is served to the clients by a Region Server.
HMaster: It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
ZooKeeper: ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps maintain server state inside the cluster by communicating through sessions.

17) Explain about the different catalog tables in HBase?

The two important catalog tables in HBase, are ROOT and META. ROOT table tracks where the META table is and META table stores all the regions in the system.

18)Differentiate between Sqoop and distCP.

DistCP utility can be used to transfer data between clusters whereas Sqoop can be used to transfer data only between Hadoop and RDBMS.

19)How would you check whether your NameNode is working or not?

There are several ways to check the status of the NameNode. Most commonly, one uses the jps command to check the status of all the daemons running in HDFS.

20)What is checkpointing in Hadoop?
Checkpointing is the process of combining the Edit Logs with the FsImage (File system Image). It is performed by the Secondary NameNode.