Hadoop installation steps for pseudo-distributed mode

Pseudo-Distributed Installation

Steps for setting up a pseudo-distributed Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux.

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware; it incorporates features similar to those of the Google File System (GFS) and the MapReduce computing paradigm.

Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

Name node:

The Name node is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
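
Once the daemons from the steps below are running, one way to see the view the Name node keeps of the cluster is the dfsadmin report, which lists the cluster capacity and the Data nodes it knows about (shown here only as an illustrative check, not an installation step):

$ hdfs dfsadmin -report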

Secondary Name node:

The Secondary Name node's whole purpose is to keep a checkpoint of the HDFS metadata. It is just a helper node for the Name node, which is why it is also known as the checkpoint node within the community.

So, to sum up, all the Secondary Name node does is write a checkpoint of the file system, which helps the Name node function better. It is not a replacement or backup for the Name node, so make a habit of calling it a checkpoint node.

Data node:

A Data node stores data in the Hadoop file system. A functional file system has more than one Data node, with data replicated across them.
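
Once the cluster is running and holds some data, the block replicas stored on the Data nodes can be inspected with fsck (again only an illustrative check; the path / here is just an example):

$ hdfs fsck / -files -blocks -locations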

Resource Manager:

Resource Manager (RM) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system.

Node Manager:

The Node Manager (NM) is YARN's per-node agent and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the Resource Manager (RM), overseeing the containers' life-cycle management, monitoring the resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and providing auxiliary services which may be exploited by different YARN applications.

Node Managers take instructions from the Resource Manager and manage resources available on a single node.
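
After the YARN daemons are started (later in this guide), the Node Managers registered with the Resource Manager can be listed with the following command (illustrative only):

$ yarn node -list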

Starting your single-node cluster

Before starting the cluster, we need to give the required permissions to the Hadoop user's home directory with the following command:

$ sudo chmod -R 777 /home/geouser

Installation Steps:

Step 1:

Open the terminal (CTRL + ALT + T)

Step 2:

Next, run the following command to update the package source list.

$ sudo apt-get update

Step 3:

Install the latest version of Java 8.

$ sudo apt-get install oracle-java8-installer

Step 4:

Check the Java version to verify whether the JDK is correctly installed, with the following command.

$ java -version

Configuring SSH:

The Hadoop control scripts rely on SSH to perform cluster-wide operations. For example, there is a script for stopping and starting all the daemons in the cluster. To work seamlessly, SSH needs to be set up to allow password-less login for the Hadoop user from machines in the cluster. The simplest way to achieve this is to generate a public/private key pair, which will be shared across the cluster.

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the geouser account created earlier. We have to generate an SSH key for the geouser.

$ sudo apt-get install openssh-server

# generate ssh key

$ ssh-keygen -t rsa (press 'Enter' four times to create a passwordless SSH key)

$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys

$ ssh localhost

Step 5:

Copy the hadoop-2.7.1.tar.gz file to your required location, extract it into your home folder or /usr/local, and go to the directory [hadoop-2.7.1/etc/hadoop].

or

use the following command in the terminal to extract the compressed file into your home folder or /usr/local/

$ tar -xvf hadoop-2.7.1.tar.gz
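
If you prefer the /usr/local location mentioned above, a sketch (assuming sudo rights; the chown step is an assumption so that geouser can write to the directory, and note that the .bashrc entries later in this guide assume the home-folder location):

$ sudo tar -xvf hadoop-2.7.1.tar.gz -C /usr/local/
$ sudo chown -R geouser /usr/local/hadoop-2.7.1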

Step 6:

Open core-site.xml by right-clicking it and opening it with gedit.

Step 7:

Within core-site.xml, paste the following code between the <configuration> tags:

<property>

<name>fs.defaultFS</name>

<value>hdfs://localhost:9000</value>

</property>

Step 8:

Open hdfs-site.xml and paste the following code between the <configuration> tags.

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

<property>

<name>dfs.namenode.name.dir</name>

<value>file:/home/geouser/yarn/namenode</value>

</property>

<property>

<name>dfs.datanode.data.dir</name>

<value>file:/home/geouser/yarn/datanode</value>

</property>
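
The namenode and datanode directories referenced above should exist and be writable by the Hadoop user; if they do not, a sketch for creating them:

$ mkdir -p /home/geouser/yarn/namenode
$ mkdir -p /home/geouser/yarn/datanode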

Step 9:

Rename mapred-site.xml.template to mapred-site.xml (a command for this is shown after the snippet below) and paste the following code between the <configuration> tags.

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>
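
A sketch of the rename mentioned above, run from the hadoop-2.7.1/etc/hadoop directory (cp keeps the original template as a backup):

$ cp mapred-site.xml.template mapred-site.xml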

Step 10:

In yarn-site.xml, paste the following between the <configuration> tags:

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

Step 11: Set the Java path

Open yarn-env.sh and set the Java path:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

Step 12:

Open hadoop-env.sh and set the Java path:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

Step 13: Format the namenode

Go to the hadoop/bin directory, type the following command, and press Enter.

./hadoop namenode -format

Note: i) Format the namenode only once, the first time after the setup.

ii) ./ runs the command from the current directory, in case you have not set the path in the .bashrc or .profile file.

Then go to the hadoop/sbin directory and start the Hadoop daemons (services):

cd ../sbin

./start-all.sh

or

./start-dfs.sh

./start-yarn.sh

Step 14:

Set the path and environment variables in .bashrc so that the commands can be run from anywhere in the terminal.

$ sudo gedit .bashrc

Then paste the lines below at the bottom of the .bashrc file in gedit, then save and close the file.

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

export HADOOP_HOME=$HOME/hadoop-2.7.1

export HADOOP_INSTALL=$HOME/hadoop-2.7.1

export HADOOP_HDFS_HOME=$HOME/hadoop-2.7.1

export HADOOP_COMMON_HOME=$HOME/hadoop-2.7.1

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_MAPRED_HOME=$HOME/hadoop-2.7.1

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Run the following command in the terminal to reload the .bashrc setup.

$ bash
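
To confirm that the new PATH is picked up, the Hadoop version can be checked from any directory:

$ hadoop version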

After setting up the path variables in .bashrc, run the commands below to start the daemons from anywhere.

To start all daemons with a single command (note that this command is deprecated):

start-all.sh

This will start up a Name node, Secondary Name node, Data node, Resource Manager and Node Manager on the machine.

To start the DFS daemons and the YARN daemons separately:

start-dfs.sh

start-yarn.sh

To start an individual daemon:

hadoop-daemon.sh start <daemon name>

e.g

hadoop-daemon.sh start namenode
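
The YARN daemons have equivalent per-daemon scripts in the same sbin directory, e.g.

yarn-daemon.sh start resourcemanager

yarn-daemon.sh start nodemanager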

To check whether all the daemons have started, use the "jps" command (Java Process Status, a Java command, not a Hadoop command).
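
On a healthy pseudo-distributed setup, the output should look similar to the following (the process IDs will differ):

2817 NameNode
2945 DataNode
3126 SecondaryNameNode
3319 ResourceManager
3458 NodeManager
3710 Jps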

Web console of Hadoop:

Check the dfshealth page in your browser once the daemons are started.

Link: http://localhost:50070

Namenode port no: 50070

Resource Manager port no: 8088

HDFS namespace (fs.defaultFS) port no: 9000
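
As a quick smoke test once the daemons and web consoles are up, a couple of HDFS commands can be run from the terminal (the directory name here is only an illustrative example):

$ hdfs dfs -mkdir -p /user/geouser
$ hdfs dfs -ls /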

Stopping your single-node cluster

Run the command to stop all the daemons running on your machine.

stop-all.sh

or

stop-dfs.sh

stop-yarn.sh

# hadoop installation completed