Hadoop installation steps for pseudo-distributed mode
Steps for setting up a pseudo-distributed Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux.
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware, and it incorporates features similar to those of the Google File System (GFS) and the MapReduce computing paradigm.
Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.
Name node :
The Name node is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Secondary Name node :
The Secondary Name node's whole purpose is to take checkpoints of the file system metadata in HDFS. It is just a helper node for the Name node, which is why it is also known within the community as the checkpoint node.
So, to sum up: all the Secondary Name node does is put a checkpoint in the file system, which helps the Name node function better. It is not a replacement or backup for the Name node, so make a habit of calling it the checkpoint node.
Data node :
A Data node stores data in the Hadoop file system. A functional file system has more than one Data node, with data replicated across them.
Resource Manager :
Resource Manager (RM) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system.
Node Manager :
The Node Manager (NM) is YARN’s per-node agent and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the Resource Manager (RM), overseeing containers’ life-cycle management, monitoring the resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and running auxiliary services that may be exploited by different YARN applications.
Node Managers take instructions from the Resource Manager and manage resources available on a single node.
Setting up and starting your single-node cluster
Before starting the cluster, we need to give the required permissions to the Hadoop user's home directory with the following command:
$ sudo chmod -R 777 /home/geouser
Open the terminal (CTRL + ALT + T)
Next, run the following commands one by one, starting with updating the package source list:
$ sudo apt-get update
Install the latest version of Java 8.
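On newer Ubuntu releases this package is not in the default repositories; it was provided by the WebUpd8 PPA (assumed here), which must be added first:
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update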
$ sudo apt-get install oracle-java8-installer
Check the Java Version whether java JDK is correctly installed or not, with the following command.
$ java -version
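If the JDK is installed correctly, the output looks roughly like the following (the build numbers here are illustrative and will differ on your machine):
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)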
The Hadoop control scripts rely on SSH to perform cluster-wide operations. For example, there is a script for stopping and starting all the daemons in the cluster. To work seamlessly, SSH needs to be set up to allow password-less login for the Hadoop user from machines in the cluster. The simplest way to achieve this is to generate a public/private key pair that will be shared across the cluster.
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the geouser we created earlier. We have to generate an SSH key for the geouser.
$ sudo apt-get install openssh-server
# generate an SSH key
$ ssh-keygen -t rsa (press ‘Enter’ at each prompt to accept the defaults and create a password-less key)
$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
$ ssh localhost
Step 5 :
Copy the hadoop-2.7.1.tar.gz file to your preferred location, extract it into your home folder or /usr/local, and go to the directory [hadoop-2.7.1/etc/hadoop].
Use the following command in the terminal to extract the compressed file into your home folder or /usr/local/:
$ tar -xvf hadoop-2.7.1.tar.gz
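If you do not have the tarball yet, it can be downloaded from the Apache archive (URL assumed; any Apache mirror works as well):
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz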
Step 6 :
Open core-site.xml by right-clicking it and choosing “Open With gedit”.
Step 7 :
Within core-site.xml, paste the following code between the <configuration> tags.
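A minimal pseudo-distributed core-site.xml sets the default file system to HDFS on localhost; the value below uses the namespace port 9000 referenced at the end of this guide:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>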
Step 8 :
Open hdfs-site.xml and paste the following code between the <configuration> tags.
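For a single-node cluster the standard setting is a replication factor of 1; the storage directories are left at their defaults here (add dfs.namenode.name.dir and dfs.datanode.data.dir if you want them elsewhere):
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>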
Step 9 :
Rename mapred-site.xml.template to mapred-site.xml and paste the following code between the <configuration> tags.
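The rename can be done from the terminal, and the standard Hadoop 2.x value points MapReduce at YARN:
$ mv mapred-site.xml.template mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>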
Step 10 :
Open yarn-site.xml and paste the following code between the <configuration> tags.
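The standard entry enables the MapReduce shuffle service on the Node Manager:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>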
Step 11 : Set the Java path
Open yarn-env.sh and set the Java path.
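The line takes the following form; the path below assumes the Oracle installer's default install location and may differ on your machine:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle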
Step 12 :
Open hadoop-env.sh and set the Java path.
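hadoop-env.sh ships with the placeholder line export JAVA_HOME=${JAVA_HOME}; replace it with the same explicit path used in yarn-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle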
Step 13 : format namenode
Go to the hadoop/bin directory, type the following command and press Enter.
./hadoop namenode -format
Note : i) Format the namenode only once, the first time after the setup.
ii) The ./ prefix runs the command from the current directory; it is needed if you have not set the path in your .bashrc or .profile file.
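In Hadoop 2.x the hadoop namenode form prints a deprecation warning; the current equivalent of the same command is:
$ ./hdfs namenode -format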
Then go to the hadoop/sbin directory and start the Hadoop daemons (services).
Step 14 :
Set the path in .bashrc as an environment variable so that you can run and access the commands from anywhere in the terminal.
$ sudo gedit .bashrc
Then paste the lines below at the bottom of the .bashrc file in gedit, then save and close the file.
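The lines take the following form; the Hadoop path assumes the archive was extracted into geouser's home folder and the Java path assumes the Oracle installer's default, so adjust both to match your setup:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/home/geouser/hadoop-2.7.1
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin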
Run the following command in the terminal to refresh the .bashrc setup:
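$ source ~/.bashrc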
After setting up the path variables in .bashrc, you can run the commands to start the daemons from anywhere.
To start all daemons with a single command (note that this command is deprecated):
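$ start-all.sh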
This will start up a Name node, Secondary Name node, Data node, Resource Manager and Node Manager on the machine.
To start the DFS daemons and YARN daemons separately:
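$ start-dfs.sh
$ start-yarn.sh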
To start an individual daemon:
$ hadoop-daemon.sh start <daemon name>
$ hadoop-daemon.sh start namenode
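The YARN daemons have a matching per-daemon script, for example:
$ yarn-daemon.sh start resourcemanager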
To check whether all the daemons have started, use the “jps” command (Java Process Status, a Java command, not a Hadoop command).
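If everything started correctly, jps lists the five Hadoop daemons plus itself; the process IDs below are illustrative and will differ on your machine:
$ jps
2305 NameNode
2422 DataNode
2608 SecondaryNameNode
2754 ResourceManager
2870 NodeManager
3012 Jps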
Web console of Hadoop :
Check the dfshealth page in your browser once the daemons are started.
Namenode port no : 50070
Resource manager port no : 8088
Hadoop namespace port no : 9000
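Assuming the default ports on localhost, the consoles are reachable at:
http://localhost:50070 (Name node dfshealth page)
http://localhost:8088 (Resource Manager)
Note that port 9000 is the HDFS RPC port set in core-site.xml; it serves file system requests, not a web page.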
Stopping your single-node cluster
Run the following commands to stop all the daemons running on your machine.
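The stop scripts mirror the start scripts (stop-all.sh also exists but, like start-all.sh, is deprecated):
$ stop-dfs.sh
$ stop-yarn.sh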
# Hadoop installation completed