Q. How Will You Decide Whether You Need To Use The Capacity Scheduler Or The Fair Scheduler?
Fair Scheduling assigns resources to jobs so that, over time, every job gets an equal share of resources. The Fair Scheduler can be used under the following circumstances:
i) If you want jobs to make equal progress rather than follow FIFO order, you should use Fair Scheduling.
ii) If you have slow connectivity and data locality plays a crucial role, making a significant difference to job runtime, you should use Fair Scheduling.
iii) Use Fair Scheduling if there is a lot of variability in utilization between pools.
The Capacity Scheduler runs the Hadoop MapReduce cluster as a shared, multi-tenant cluster to maximize cluster utilization and throughput. The Capacity Scheduler can be used under the following circumstances:
i) If jobs require scheduler determinism, the Capacity Scheduler can be useful.
ii) The Capacity Scheduler's memory-based scheduling approach is useful if jobs have varying memory requirements.
iii) If you want to enforce resource allocation because you know the cluster utilization and workload well, use the Capacity Scheduler.
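To illustrate the pool-based side, the MR1 Fair Scheduler reads pool definitions from an allocations file (conf/fair-scheduler.xml). The pool names and slot counts below are assumed examples for illustration, not recommended values:

```xml
<?xml version="1.0"?>
<!-- Hypothetical fair-scheduler allocations file; pool names are examples -->
<allocations>
  <pool name="analytics">
    <minMaps>10</minMaps>       <!-- guaranteed minimum map slots -->
    <minReduces>5</minReduces>  <!-- guaranteed minimum reduce slots -->
    <weight>2.0</weight>        <!-- relative share vs. other pools -->
  </pool>
  <pool name="etl">
    <minMaps>5</minMaps>
    <minReduces>2</minReduces>
  </pool>
</allocations>
```

Pools not listed here fall back to the scheduler's defaults; the weight element is what lets one pool claim a larger share when the cluster is contended.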
Q. What Are The Daemons Required To Run A Hadoop Cluster?
NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker.
Q. How Will You Restart A Namenode?
The simplest way to do this is to run the stop script, stop-all.sh. Once it has finished, restart the NameNode by running start-all.sh.
Q. Explain About The Different Schedulers Available In Hadoop.?
FIFO Scheduler – does not consider the heterogeneity of the system; it simply orders jobs by their arrival time in a queue.
COSHH – considers the workload, the cluster and user heterogeneity when making scheduling decisions.
Fair Sharing – defines a pool for each user. A pool contains a number of map and reduce slots on a resource, and each user can use their own pool to execute jobs.
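As a hedged sketch of how a pluggable scheduler is selected in classic MR1 (property name per the Hadoop 1.x configuration; verify against your version):

```xml
<!-- mapred-site.xml (MR1): select the task scheduler; Fair Scheduler shown -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```

Substituting org.apache.hadoop.mapred.CapacityTaskScheduler as the value selects the Capacity Scheduler instead; with no value set, the default FIFO scheduler is used.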
Q. List Few Hadoop Shell Commands That Are Used To Perform A Copy Operation.?
- fs -put
- fs -copyToLocal
- fs -copyFromLocal
Q. What Is Jps Command Used For?
The jps command is used to verify whether the daemons that run the Hadoop cluster are running or not. The output of the jps command shows the status of the NameNode, Secondary NameNode, DataNode, TaskTracker and JobTracker.
Q. What Are The Important Hardware Considerations When Deploying Hadoop In Production Environment?
Memory – memory requirements vary between the worker services and management services, based on the application.
Operating System – a 64-bit operating system avoids restrictions on the amount of memory that can be used on worker nodes.
Storage – it is preferable to design a Hadoop platform by moving the compute activity to the data, to achieve scalability and high performance.
Capacity – Large Form Factor (3.5") disks cost less and store more than Small Form Factor disks.
Network – two TOR switches per rack provide better redundancy.
Computational Capacity – determined by the total number of MapReduce slots available across all the nodes in the Hadoop cluster.
Q. How Many Namenodes Can You Run On A Single Hadoop Cluster?
Only one NameNode can run on a single Hadoop cluster, which is why it is a single point of failure.
Q. What Happens When The Namenode On The Hadoop Cluster Goes Down?
The file system goes offline whenever the NameNode is down.
Q. What Is The conf/hadoop-env.sh File And Which Variable In The File Should Be Set For Hadoop To Work?
This file provides the environment for Hadoop to run and contains variables such as HADOOP_CLASSPATH, JAVA_HOME and HADOOP_LOG_DIR. The JAVA_HOME variable must be set for Hadoop to run.
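A minimal conf/hadoop-env.sh sketch; the JDK path and directories below are illustrative assumptions, not required values:

```shell
# conf/hadoop-env.sh -- only JAVA_HOME is mandatory for Hadoop to run
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk      # assumed JDK location
export HADOOP_LOG_DIR=/var/log/hadoop             # assumed log directory
export HADOOP_CLASSPATH=/opt/hadoop/extra-jars/*  # assumed extra classpath
```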
Q. Apart From Using The Jps Command Is There Any Other Way That You Can Check Whether The Namenode Is Working Or Not.?
Use the command /etc/init.d/hadoop-0.20-namenode status.
Q. In A MapReduce System, If The HDFS Block Size Is 64 MB And There Are 3 Files Of Size 127 MB, 64 KB And 65 MB With FileInputFormat, How Many Input Splits Are Likely To Be Made By The Hadoop Framework?
2 splits each for the 127 MB and 65 MB files, and 1 split for the 64 KB file.
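The per-block arithmetic behind this answer can be sketched in Python. This is a simplification: it assumes one split per HDFS block and ignores FileInputFormat's split-slop tolerance, which can merge a slightly-over-one-block file into a single split.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default HDFS block size
MB, KB = 1024 * 1024, 1024

def input_splits(file_size, block_size=BLOCK_SIZE):
    """One split per block; a file smaller than a block still gets one split."""
    return max(1, math.ceil(file_size / block_size))

splits = [input_splits(s) for s in (127 * MB, 64 * KB, 65 * MB)]
# → [2, 1, 2]: five splits in total
```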
Q. Which Command Is Used To Verify If The Hdfs Is Corrupt Or Not?
The hadoop fsck (File System Check) command is used to check for missing or corrupt blocks, e.g. hadoop fsck /.
Q. List Some Use Cases Of The Hadoop Ecosystem?
Text Mining, Graph Analysis, Semantic Analysis, Sentiment Analysis, Recommendation Systems.
Q. How Can You Kill A Hadoop Job?
hadoop job -kill jobID
Q. In Which Modes Can Hadoop Code Run?
Hadoop can run in standalone mode, pseudo-distributed mode or fully distributed mode. Hadoop was designed for deployment on multi-node clusters; however, it can also be deployed on a single machine and tested as a single process.
Q. How Will A Hadoop Administrator Deploy The Various Components Of Hadoop In Production?
Deploy the namenode and jobtracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes. Only one namenode and one jobtracker are needed in the system; the number of datanodes depends on the available hardware.
Q. What Are The Best Practices For Deploying A Secondary Namenode?
Deploy the secondary namenode on a separate, standalone machine so that it does not interfere with the operation of the primary namenode. The secondary namenode must have the same memory requirements as the primary namenode.
Q. Is There A Standard Procedure For Deploying Hadoop?
No, there are some differences between the various distributions. However, they all require installing the Hadoop jars on the machines. All Hadoop distributions have some common requirements, but the specific procedures differ from vendor to vendor, because each has some degree of proprietary software.
Q. What Is The Role Of The Secondary Namenode?
The secondary namenode performs a CPU-intensive operation: merging the edit logs with the current file system snapshot. Because of the additional CPU requirements and the metadata backup involved, the secondary namenode runs as a separate process.
Q. What Are The Side Effects Of Not Running A Secondary Namenode?
Cluster performance will degrade over time, because the edit log grows larger and larger. If the secondary namenode is not running at all, the edit log grows significantly and slows the system down. In addition, on restart the system will stay in safe mode for a long time, because the namenode has to merge the edit log into the current file system checkpoint image.
Q. What Happens If A Datanode Loses Its Network Connection For A Few Minutes?
The namenode detects that the datanode is not responding and starts replicating its data from the remaining replicas. When the datanode comes back online, the extra replicas are removed, because the namenode actively maintains the replication factor. The namenode monitors the status of all datanodes and tracks which blocks are located on each node; when a datanode becomes unavailable, it triggers replication of that node's data from the existing replicas. If the datanode returns to service, the over-replicated data is deleted. Note: the data may be deleted from the original datanode.
Q. What Happens If The CPU Of One Of The Datanodes Is Much Slower?
A job runs only as fast as its slowest worker. However, if speculative execution is enabled, the slowest worker has far less impact; Hadoop is specifically designed for commodity hardware, and speculative execution helps offset slow workers. Multiple instances of the same task are created, and the jobtracker takes the first result and terminates the other instances of the task.
Q. What Is Speculative Execution?
If speculative execution is enabled, the jobtracker launches multiple instances of the same task on multiple nodes and takes the result of the task that completes first. The other instances of the task are killed.
Speculative execution is used to offset the impact of slow workers in the cluster. The jobtracker creates multiple instances of the same task, takes the result of the first one to succeed and discards the remaining attempts.
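A hedged sketch of the MR1 knobs involved (property names per the classic mapred-site.xml; both default to true, but verify against your version):

```xml
<!-- mapred-site.xml (MR1): enable/disable speculative execution -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>
```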
Q. How Many Racks Do You Need To Run A Hadoop Cluster Reliably?
For reliable operation, the recommendation is at least 2 racks with rack placement configured. Hadoop has a built-in rack-awareness mechanism that allows data to be distributed between different racks based on the configuration.
Q. Does The Namenode Have Any Special Requirements?
Yes. The namenode holds information about all the files in the system, so it needs to be extra reliable. The namenode is a single point of failure: it needs to be extra reliable, and its metadata needs to be replicated in multiple places. Note that the community is working on resolving the namenode single point of failure.
Q. If You Have A 128 MB File And The Replication Factor Is Set To 3, How Many Blocks Corresponding To This File Will You Find On The Cluster (Assuming The Default Apache And Cloudera Configuration)?
With the default configuration, the file is divided into blocks according to the default block size of 64 MB: 128 MB / 64 MB = 2 blocks. Each block is then replicated according to the replication factor (3 by default): 2 * 3 = 6 blocks.
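The arithmetic in this answer can be sketched as:

```python
import math

def replicated_blocks(file_size_mb, block_size_mb=64, replication=3):
    """Physical block replicas stored on the cluster for one file."""
    return math.ceil(file_size_mb / block_size_mb) * replication

# 128 MB file with 64 MB blocks and replication 3:
# ceil(128 / 64) = 2 blocks, 2 * 3 = 6 replicas on the cluster
```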
Q. What Is Distributed Copy (distcp)?
Distcp is a Hadoop utility that launches a MapReduce job to copy data; its main purpose is copying large amounts of data. One of the main challenges in a Hadoop environment is copying data across multiple clusters, and distcp allows data to be copied in parallel using multiple datanodes.
Q. What Is The Replication Factor?
The replication factor controls how many times each individual block is replicated.
Data is replicated across the Hadoop cluster according to the replication factor; a high replication factor guarantees data availability in the event of failures.
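As a sketch, the cluster-wide default is set in hdfs-site.xml (3 is the stock default), and existing files can be changed afterwards with hadoop fs -setrep:

```xml
<!-- hdfs-site.xml: default replication factor for new files -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```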
Q. Which Daemons Run On The Master Node?
NameNode, Secondary NameNode and JobTracker.
Hadoop consists of five independent daemons, each running in its own JVM. The NameNode, Secondary NameNode and JobTracker run on the master node, while a DataNode and TaskTracker run on every slave node.
Q. What Is Rack Awareness?
Rack awareness is the way the namenode decides how to place blocks based on the rack definitions. Hadoop minimizes the network traffic between datanodes within the same rack and contacts remote racks only when necessary. Thanks to rack awareness, the namenode can control replica placement accordingly.
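One common way rack definitions are supplied is a topology script: the namenode invokes a configured script with datanode addresses as arguments and reads one rack path per line from stdout. A minimal sketch, where the subnet-to-rack mapping is purely an assumed example:

```python
#!/usr/bin/env python
# Hypothetical Hadoop topology script: receives datanode IPs/hostnames as
# command-line arguments and prints one rack path per argument to stdout.
import sys

RACKS = {
    "10.1.1.": "/dc1/rack1",  # assumed subnet for rack 1
    "10.1.2.": "/dc1/rack2",  # assumed subnet for rack 2
}
DEFAULT_RACK = "/default-rack"

def resolve(host):
    """Map a host/IP to a rack path by matching a known subnet prefix."""
    for prefix, rack in RACKS.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(resolve(host))
```

The script is wired in via the topology script property (topology.script.file.name in Hadoop 1.x) in core-site.xml; anything the script cannot classify should fall back to a default rack, as above.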