Quickstart

Preparation

Create a Hadoop User

sudo addgroup hadoop

# example password is hadoop
sudo adduser --ingroup hadoop hadoopuser

su - hadoopuser
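
To confirm the account was created as expected, a quick check (using the names chosen above):

# the user should exist and belong to the hadoop group
id hadoopuser
getent group hadoop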

Java Environment

Install Java on every node.

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk
sudo apt-get install openjdk-8-jre
sudo apt-get install openjdk-8-source #this is optional, the jdk source code
java -version

Note that Java 9 or newer can cause unexpected errors when running Hadoop. If you need to switch back to an older Java version, use the following commands and select the desired one.

sudo update-alternatives --config java
sudo update-alternatives --config javac

# or run --config for every alternative at once
sudo update-alternatives --all

If an error message like the following appears:

dpkg: error processing archive /var/cache/apt/archives/openjdk-8-jdk_9~b115-1ubuntu1_amd64.deb (--install):
 trying to overwrite '/usr/lib/jvm/java-8-openjdk-amd64/include/linux/jawt_md.h', which is also in package openjdk-8-jdk-headless:amd64 9~b115-1ubuntu1

then use the following command.

sudo apt-get -o Dpkg::Options::="--force-overwrite" install openjdk-8-jdk

Switch to the Hadoop user with su - hadoopuser, then append the following environment variables to the end of ~/.bashrc.

# add JAVA_HOME
# assume the JDK is installed at /usr/lib/jvm/java-1.8.0-openjdk-amd64
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

Run source ~/.bashrc, then echo $JAVA_HOME to check that the environment variable is set.
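
To confirm the shell really resolves Java from the new JAVA_HOME (the path assumed above), you can compare the two:

$JAVA_HOME/bin/java -version   # should report a 1.8.x runtime
which java                     # the system default, managed by update-alternatives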

Install Hadoop

Install Hadoop on every node.

wget http://www-eu.apache.org/dist/hadoop/common/hadoop-3.1.0/hadoop-3.1.0.tar.gz
tar -zxvf hadoop-3.1.0.tar.gz
sudo mv hadoop-3.1.0 /opt/hadoop
sudo chown -R hadoopuser:hadoop /opt/hadoop

# make sure you have already added the hadoop user
# run `su - hadoopuser` first
su - hadoopuser
cd ~
mkdir hdfs
mkdir /home/hadoopuser/hdfs/name
mkdir /home/hadoopuser/hdfs/data
mkdir /home/hadoopuser/tmp
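
As a sanity check, the NameNode, DataNode, and temp directories should be owned by hadoopuser so the daemons can write to them:

ls -ld /home/hadoopuser/hdfs/name /home/hadoopuser/hdfs/data /home/hadoopuser/tmp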

The following configuration assumes 3 VMs: 1 master and 2 slaves. Set the hostname of each VM in /etc/hostname (for example, master, slave01, slave02), and map every hostname to its IP address in /etc/hosts as shown below. Make sure each VM corresponds to exactly one hostname.

192.168.56.101  master
192.168.56.103  slave01
192.168.56.102  slave02
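
With the host entries in place, a quick reachability check from the master (hostnames as in the example above):

# every hostname should resolve and answer a ping
for h in master slave01 slave02; do ping -c 1 "$h"; done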

Append the following environment variables to the end of ~/.bashrc.

# set HADOOP
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME 
export HADOOP_COMMON_HOME=$HADOOP_HOME 
export HADOOP_HDFS_HOME=$HADOOP_HOME 
export YARN_HOME=$HADOOP_HOME 
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native 
export HADOOP_INSTALL=$HADOOP_HOME

Run source ~/.bashrc, then echo $HADOOP_HOME to check that the environment variable is set.
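
You can also confirm that the Hadoop binaries and scripts are now on the PATH:

which hadoop          # expect /opt/hadoop/bin/hadoop
which start-dfs.sh    # expect /opt/hadoop/sbin/start-dfs.sh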

Auto Login SSH

Hadoop uses SSH to control the slave nodes, so passwordless (auto) login from the master to the slaves is required.

ssh-keygen -t rsa -P ""
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
scp -r ~/.ssh slave01:~/
scp -r ~/.ssh slave02:~/

# test login
ssh slave01
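
To verify that the login really is passwordless, you can force non-interactive mode; if a password were still required, the command would fail instead of prompting:

# should print each slave's hostname without asking for a password
ssh -o BatchMode=yes slave01 hostname
ssh -o BatchMode=yes slave02 hostname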

Additional Configuration

Two shell scripts need the Java environment setting. Note that you have to apply this configuration on all nodes.

vim /opt/hadoop/etc/hadoop/hadoop-env.sh
vim /opt/hadoop/etc/hadoop/yarn-env.sh

Add JAVA_HOME to both scripts.

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
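
If you prefer not to edit the files by hand, a non-interactive sketch that appends the same line to both scripts (run as a user that can write to /opt/hadoop, e.g. hadoopuser after the chown above):

# append JAVA_HOME to hadoop-env.sh and yarn-env.sh in one go
echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64' | \
  tee -a /opt/hadoop/etc/hadoop/hadoop-env.sh /opt/hadoop/etc/hadoop/yarn-env.sh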

Configure Hadoop

Four files are key to configuring Hadoop: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. Note that you have to apply the same configuration on all nodes.

  • Edit /opt/hadoop/etc/hadoop/core-site.xml.
<configuration>
<property>
   <name>fs.defaultFS</name>
   <value>hdfs://master:9000</value>
</property>
<property>
   <name>io.file.buffer.size</name>
   <value>131072</value>
</property>
<property>
   <name>hadoop.tmp.dir</name>
   <value>/home/hadoopuser/tmp</value>
</property>
<property>
   <name>hadoop.proxyuser.root.hosts</name>
   <value>master</value>
</property>
<property>
   <name>hadoop.proxyuser.root.groups</name>
   <value>*</value>
</property>
</configuration>
  • Edit /opt/hadoop/etc/hadoop/hdfs-site.xml.
<configuration>  
<property>
   <name>dfs.namenode.secondary.http-address</name>
   <value>master:9001</value>
</property>
<property>
   <name>dfs.namenode.name.dir</name>
   <value>/home/hadoopuser/hdfs/name</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>/home/hadoopuser/hdfs/data</value>
</property>
<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>
<property>
   <name>dfs.webhdfs.enabled</name>
   <value>true</value>
</property>
<property>
   <name>dfs.permissions.enabled</name>
   <value>false</value>
</property>    
</configuration>
  • Edit /opt/hadoop/etc/hadoop/mapred-site.xml.
<configuration>
<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
<property>
   <name>mapreduce.jobhistory.address</name>
   <value>master:10020</value>
</property>
<property>
   <name>mapreduce.jobhistory.webapp.address</name>
   <value>master:19888</value>
</property>

<!-- fix configure error -->
<property>
   <name>mapreduce.application.classpath</name>
   <value>
    /opt/hadoop/etc/hadoop,
    /opt/hadoop/share/hadoop/common/*,
    /opt/hadoop/share/hadoop/common/lib/*,
    /opt/hadoop/share/hadoop/hdfs/*,
    /opt/hadoop/share/hadoop/hdfs/lib/*,
    /opt/hadoop/share/hadoop/mapreduce/*,
    /opt/hadoop/share/hadoop/mapreduce/lib/*,
    /opt/hadoop/share/hadoop/yarn/*,
    /opt/hadoop/share/hadoop/yarn/lib/*
   </value>
</property>
</configuration>
  • Edit /opt/hadoop/etc/hadoop/yarn-site.xml.
<configuration>

<!-- Site specific YARN configuration properties -->
<property>
   <name>yarn.resourcemanager.hostname</name>
   <value>master</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
   <name>yarn.resourcemanager.address</name>
   <value>master:8032</value>
</property>
<property>
   <name>yarn.resourcemanager.scheduler.address</name>
   <value>master:8030</value>
</property>
<property>
   <name>yarn.resourcemanager.resource-tracker.address</name>
   <value>master:8031</value>
</property>
<property>
   <name>yarn.resourcemanager.admin.address</name>
   <value>master:8033</value>
</property>
<property>
   <name>yarn.resourcemanager.webapp.address</name>
   <value>master:8088</value>
</property>

<!-- fix configure error -->    
<property>
  <name>yarn.application.classpath</name>
  <value>
    /opt/hadoop/etc/hadoop,
    /opt/hadoop/share/hadoop/common/*,
    /opt/hadoop/share/hadoop/common/lib/*,
    /opt/hadoop/share/hadoop/hdfs/*,
    /opt/hadoop/share/hadoop/hdfs/lib/*,
    /opt/hadoop/share/hadoop/mapreduce/*,
    /opt/hadoop/share/hadoop/mapreduce/lib/*,
    /opt/hadoop/share/hadoop/yarn/*,
    /opt/hadoop/share/hadoop/yarn/lib/*
   </value>
</property>

<!-- Container [containerID is running 256072192B beyond the 'VIRTUAL' memory limit.  -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
</configuration>

After configuring all nodes, add the slave hostnames to the file /opt/hadoop/etc/hadoop/workers (Hadoop 3.x and later):

localhost
slave01
slave02

or, for Hadoop 2.x, edit the file /opt/hadoop/etc/hadoop/slaves:

slave01
slave02
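
Since every node needs identical configuration, one way to propagate it is to copy the whole configuration directory from the master to each slave. A sketch, assuming the hostnames above and that Hadoop is already unpacked at /opt/hadoop on the slaves with the same ownership:

# push the finished configuration from the master to both slaves
scp -r /opt/hadoop/etc/hadoop/* slave01:/opt/hadoop/etc/hadoop/
scp -r /opt/hadoop/etc/hadoop/* slave02:/opt/hadoop/etc/hadoop/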

Activate Hadoop

Check the Hadoop version.

hadoop version

Format the NameNode (on the master).

hdfs namenode -format

On the master node, start Hadoop.

You can open http://192.168.56.101:9870 (Hadoop 3.x and later) to see the HDFS status, or http://192.168.56.101:50070 for Hadoop versions before 3.

You can browse http://192.168.56.101:8088 (the YARN ResourceManager web UI) to see all running applications.

When you start Hadoop on the master, all services, including those on the slave nodes, are started at the same time.

$ /opt/hadoop/sbin/start-all.sh
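
Note that start-all.sh is a convenience wrapper; you can also start HDFS and YARN separately, which makes it easier to see which part fails:

$ /opt/hadoop/sbin/start-dfs.sh
$ /opt/hadoop/sbin/start-yarn.sh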

Check the running Hadoop processes.

$ jps
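
On the master (with localhost listed in the workers file) the output would typically look something like the following; the PIDs will differ, and a slave shows only DataNode and NodeManager:

12051 NameNode
12240 DataNode
12493 SecondaryNameNode
12745 ResourceManager
12901 NodeManager
13120 Jps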

Stop all services.

$ /opt/hadoop/sbin/stop-all.sh
