HBase Quickstart Guide

Overview

Your company, or personal interest, has led you to take HBase for a test drive. By test drive, I don’t mean installing HBase on one computer and using your local filesystem. I mean a basic cluster composed of both HBase and an underlying HDFS. Something that looks like this:

  • Computer 1: HBase Master, HDFS NameNode
  • Computer 2-4: HBase Region Server, HDFS DataNode, ZooKeeper

If you are like me, the first thing you do is Google “HBase installation guide”. You immediately find yourself buried in results that don’t get you to where you want to be: installing HBase from scratch, on a Linux machine, in fully distributed mode, including HDFS. So here it goes…

The Guide

Installing Java JDK

  1. Download the Java JDK RPM from Oracle
  2. Install it locally on your machine as root
  3. Register it as the default JVM, instead of the stock OpenJDK, by using alternatives. In my example I’ve downloaded JDK 1.6.0_30, which installs under /usr/java/jdk1.6.0_30. Replace this version with your own in every path below for the commands to work.
    alternatives --install /usr/bin/java java /usr/java/jdk1.6.0_30/bin/java 2000 --slave /usr/bin/keytool keytool /usr/java/jdk1.6.0_30/bin/keytool --slave /usr/bin/rmiregistry rmiregistry /usr/java/jdk1.6.0_30/bin/rmiregistry --slave /usr/bin/wsimport wsimport /usr/java/jdk1.6.0_30/bin/wsimport
     
    alternatives --install /usr/bin/javac javac /usr/java/jdk1.6.0_30/bin/javac 120 --slave /usr/bin/jar jar /usr/java/jdk1.6.0_30/bin/jar --slave /usr/bin/rmic rmic /usr/java/jdk1.6.0_30/bin/rmic --slave /usr/bin/javadoc javadoc /usr/java/jdk1.6.0_30/bin/javadoc
     
    alternatives --set java /usr/java/jdk1.6.0_30/bin/java
     
    alternatives --set javac /usr/java/jdk1.6.0_30/bin/javac
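
Since the JDK version shows up in every path above, it is easy to miss one when you adapt the commands. Here is a minimal sketch that prints the same four commands for whatever JDK directory you pass in. The function name print_alternatives_cmds is my own invention for illustration; it only prints, it does not change anything on the machine.

```shell
# Sketch: print the alternatives commands from this guide for a given JDK directory.
# print_alternatives_cmds is a hypothetical helper name, not part of any tool.
print_alternatives_cmds() {
  local jdk_home="$1"          # e.g. /usr/java/jdk1.6.0_30
  local bin="$jdk_home/bin"

  # Register java and its companion tools (priority 2000, as in the guide).
  echo "alternatives --install /usr/bin/java java $bin/java 2000" \
       "--slave /usr/bin/keytool keytool $bin/keytool" \
       "--slave /usr/bin/rmiregistry rmiregistry $bin/rmiregistry" \
       "--slave /usr/bin/wsimport wsimport $bin/wsimport"
  # Register javac and its companion tools (priority 120, as in the guide).
  echo "alternatives --install /usr/bin/javac javac $bin/javac 120" \
       "--slave /usr/bin/jar jar $bin/jar" \
       "--slave /usr/bin/rmic rmic $bin/rmic" \
       "--slave /usr/bin/javadoc javadoc $bin/javadoc"
  # Select the newly registered entries as the defaults.
  echo "alternatives --set java $bin/java"
  echo "alternatives --set javac $bin/javac"
}
```

Run print_alternatives_cmds /usr/java/jdk1.6.0_30, review the output, and then execute the printed lines as root. Afterwards, java -version should report the Oracle JDK rather than OpenJDK.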

Installing HDFS

Installing NameNode

On your designated NameNode machine (in our example, computer1):

  1. Download the latest Hadoop RPM from the Hadoop site. In this article I’ve downloaded hadoop-1.0.3-1.x86_64.
  2. Install the RPM:
    sudo yum localinstall RPM-file-name --nogpgcheck
  3. Run the following to set up Hadoop. Replace <your-name-node-host> with the domain name of your NameNode host (for example, namenode.example.com):
    sudo hadoop-setup-conf.sh --auto --conf-dir=/etc/hadoop --datanode-dir=/var/lib/hadoop/hdfs/datanode --group=hadoop --hdfs-user=hdfs --jobtracker-host=<your-name-node-host> --namenode-host=<your-name-node-host> --log-dir=/var/log/hadoop --pid-dir=/var/run/hadoop --hdfs-dir=/var/lib/hadoop/hdfs --mapred-dir=/var/lib/hadoop/mapred --mapreduce-user=mapred --namenode-dir=/var/lib/hadoop/hdfs/namenode --dfs-support-append=true
    
  4. Log out and log in again so that hadoop-env.sh gets sourced
  5. Run the following command to set up the HDFS file system:
    sudo hadoop-setup-hdfs.sh --format --hdfs-user=hdfs --mapreduce-user=mapred
  6. Add a user for the GUI:
    sudo groupadd webgroup
    sudo adduser webuser -G webgroup

Installing DataNodes

On each computer designated as Data Node, do the following:

  1. Install the Java JDK as instructed above in the “Installing Java JDK” section.
  2. Download and install the Hadoop RPM as instructed in the NameNode section.
  3. Run the following command to set up the DataNode. Replace <name-node-host-name> with the domain name of the NameNode you’ve installed earlier:
    sudo hadoop-setup-conf.sh --auto --conf-dir=/etc/hadoop --datanode-dir=/var/lib/hadoop/hdfs/datanode --group=hadoop --hdfs-user=hdfs --jobtracker-host=<name-node-host-name> --namenode-host=<name-node-host-name> --log-dir=/var/log/hadoop --pid-dir=/var/run/hadoop --hdfs-dir=/var/lib/hadoop/hdfs --mapred-dir=/var/lib/hadoop/mapred --mapreduce-user=mapred --namenode-dir=/var/lib/hadoop/hdfs/namenode --dfs-support-append=true
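
The setup command is identical on the NameNode and on every DataNode; only the host name you substitute matters. As a rough sketch, you can keep the flags in one place and print the full command for a given NameNode host. The function name hadoop_setup_conf_cmd is hypothetical; the flags are exactly the ones used above.

```shell
# Sketch: print the hadoop-setup-conf.sh invocation from this guide.
# hadoop_setup_conf_cmd is a hypothetical helper name; it only prints the command.
hadoop_setup_conf_cmd() {
  local nn_host="$1"   # NameNode host, e.g. namenode.example.com
  if [ -z "$nn_host" ]; then
    echo "usage: hadoop_setup_conf_cmd <name-node-host>" >&2
    return 1
  fi
  # printf repeats the '%s ' format for every argument, joining them with spaces.
  printf '%s ' \
    hadoop-setup-conf.sh --auto \
    --conf-dir=/etc/hadoop \
    --datanode-dir=/var/lib/hadoop/hdfs/datanode \
    --group=hadoop \
    --hdfs-user=hdfs \
    "--jobtracker-host=$nn_host" \
    "--namenode-host=$nn_host" \
    --log-dir=/var/log/hadoop \
    --pid-dir=/var/run/hadoop \
    --hdfs-dir=/var/lib/hadoop/hdfs \
    --mapred-dir=/var/lib/hadoop/mapred \
    --mapreduce-user=mapred \
    --namenode-dir=/var/lib/hadoop/hdfs/namenode \
    --dfs-support-append=true
  echo
}
```

Calling hadoop_setup_conf_cmd namenode.example.com prints the command on one line; check it and then run it with sudo on each machine.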

Configuring HDFS

  1. On the NameNode machine, edit /etc/hadoop/hdfs-site.xml, and add the following property:
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>
    

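    This raises the upper bound on the number of files a DataNode will serve at any one time; the default is quite low for HBase. Since it is the DataNodes that read this setting, make the same change in /etc/hadoop/hdfs-site.xml on each DataNode machine as well.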

  2. On both the NameNode and on each DataNode machine, we’ll need to run the following:
    1. Create a user we’ll use to administer both HDFS and HBase. We’ll call it hadoop.
      sudo useradd -g hadoop -m hadoop
      
    2. Set a password for this user.
       sudo passwd hadoop
      
    3. Each machine in the cluster needs to be able to access every other machine via ssh, to make it easy to administer the cluster. Therefore we’ll create a private/public key pair for this user on each machine:
       su - hadoop
       ssh-keygen -t rsa -f ~/.ssh/id_rsa
      
  3. Copy each public key created (~/.ssh/id_rsa.pub on each machine) into a single file. Save it on the NameNode, under the user hadoop, as “~/.ssh/authorized_keys”.
  4. Copy the authorized_keys file, manually or via a script, to ~/.ssh/ of the hadoop user on all machines in the cluster.
  5. Change the permissions of ~/.ssh/authorized_keys on each machine:
    chmod 600 ~/.ssh/authorized_keys
    
  6. Set a password for the “hdfs” user, which is the root user of the HDFS file system. You’ll need it later when resolving issues.
    sudo passwd hdfs
    
  7. Shut down the firewall. I presume this is done in a lab and not in production, so I’m allowing myself to lower the security of these computers. If you can’t do this, you’ll have to track down all the ports used by HDFS and HBase and allow only them through the firewall.
    == CentOS ==
    sudo /etc/init.d/iptables stop   (depends on OS)
    
    == Fedora Core ==
    sudo service iptables stop
    sudo chkconfig iptables off
    
  8. Make sure DNS resolution works properly by issuing the following command:
    ping -c 1 $(hostname)

    Make sure it reports your internal IP (such as 172.22.22.45) and not 127.0.0.1

  9. Increase the maximum number of open file handles allowed by your operating system:
    sudo vi /etc/sysctl.conf

    Append:

    fs.file-max = 100000

    Then run the following for the setting to take effect immediately:

    sudo sysctl -p 
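
The last few steps are easy to get wrong on one machine out of four, so I find it useful to check them in one go. Below is a minimal sketch of such a check, assuming a Linux machine with getent and /proc available. All function names are hypothetical helpers of mine; the thresholds and file paths are the ones used in the steps above.

```shell
# Sketch: preflight checks for the steps above. All helper names are hypothetical.

# is_loopback IP: succeeds when IP is in the 127.0.0.0/8 range.
is_loopback() {
  case "$1" in
    127.*) return 0 ;;
    *)     return 1 ;;
  esac
}

# resolved_ip HOST: prints the first IPv4 address the system resolver returns for HOST.
resolved_ip() {
  getent ahostsv4 "$1" | awk 'NR==1 {print $1}'
}

# has_property FILE NAME: succeeds when FILE contains a <name>NAME</name> element.
has_property() {
  grep -qF "<name>$2</name>" "$1" 2>/dev/null
}

# file_max_at_least MIN: succeeds when the kernel-wide file handle limit is >= MIN.
file_max_at_least() {
  [ "$(cat /proc/sys/fs/file-max)" -ge "$1" ]
}

# preflight: runs all checks on the local machine and prints one line per check.
preflight() {
  local host ip
  host="$(hostname)"
  ip="$(resolved_ip "$host")"
  if [ -z "$ip" ] || is_loopback "$ip"; then
    echo "FAIL: $host resolves to '$ip', expected the internal IP"
  else
    echo "OK:   $host resolves to $ip"
  fi
  if file_max_at_least 100000; then
    echo "OK:   fs.file-max is at least 100000"
  else
    echo "FAIL: fs.file-max is below 100000"
  fi
  if has_property /etc/hadoop/hdfs-site.xml dfs.datanode.max.xcievers; then
    echo "OK:   dfs.datanode.max.xcievers is set"
  else
    echo "FAIL: dfs.datanode.max.xcievers is missing from hdfs-site.xml"
  fi
}
```

Run preflight on each machine; any FAIL line points back to the step that needs another look.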

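Collecting the public keys by hand is tedious, so here is a rough sketch of the merge step. It assumes you have already gathered each machine’s id_rsa.pub into one local directory, one file per machine with a .pub extension. The function name merge_pubkeys and the directory layout are my own assumptions.

```shell
# Sketch: merge collected public keys into one authorized_keys file.
# merge_pubkeys is a hypothetical helper; KEY_DIR holds one *.pub file per machine.
merge_pubkeys() {
  local key_dir="$1" out="$2"
  mkdir -p "$(dirname "$out")"
  # sort -u drops duplicate keys if the same machine was collected twice.
  cat "$key_dir"/*.pub | sort -u > "$out"
  # sshd ignores authorized_keys files that are writable by others.
  chmod 600 "$out"
}
```

For example, after something like scp hadoop@comp2.example.com:.ssh/id_rsa.pub keys/comp2.pub for each machine, run merge_pubkeys keys ~/.ssh/authorized_keys on the NameNode and then copy the result to the other machines with scp.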
Running HDFS

  • On the NameNode, ssh in as the hadoop user and run:
    sudo service hadoop-namenode start
    
  • Check NameNode is running successfully by visiting http://[name-node-host]:50070
  • On each DataNode, ssh in as the hadoop user and run:
    sudo service hadoop-datanode start
  • Check in the NameNode GUI (address above) that you have 3 live nodes.
  • Run a quick sanity check, by copying a file into /tmp of hdfs. From the NameNode machine, run:
    hadoop fs -copyFromLocal [path-to-a-local-file] /tmp/justmyfile

    This checks that HDFS is able to copy a file from your local file system into HDFS (into file path /tmp/justmyfile)
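
If you want a slightly stronger check than “the copy did not fail”, you can read the file back and compare it with the original. This is a minimal sketch using hadoop fs -copyFromLocal and hadoop fs -cat together with the standard cksum utility. The function name hdfs_roundtrip is hypothetical, and the target path must not already exist in HDFS, since copyFromLocal refuses to overwrite.

```shell
# Sketch: copy a file into HDFS, read it back, and compare checksums.
# hdfs_roundtrip is a hypothetical helper name.
hdfs_roundtrip() {
  local src="$1" dest="$2" local_sum remote_sum
  # Upload; fails if dest already exists in HDFS.
  hadoop fs -copyFromLocal "$src" "$dest" || return 1
  # cksum prints a CRC and a byte count; both must match.
  local_sum="$(cksum < "$src")"
  remote_sum="$(hadoop fs -cat "$dest" | cksum)"
  if [ "$local_sum" = "$remote_sum" ]; then
    echo "OK: $dest matches $src"
  else
    echo "MISMATCH: $dest differs from $src"
    return 1
  fi
}
```

For example: hdfs_roundtrip /etc/hosts /tmp/roundtrip-check. Remove the file afterwards with hadoop fs -rm /tmp/roundtrip-check.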

Installing HBase

  1. Download the tar file from the HBase site. At the time of writing this blog post, I’ve downloaded version 0.94.0.
  2. Copy the downloaded tar file to the NameNode machine. This is where we will configure HBase and use it as the HBase Master
  3. ssh to NameNode machine as hadoop user.
  4. Untar the tar file into your chosen location. I chose to install it under /usr/share, so I untarred it there. After untarring you will have a /usr/share/hbase-0.94.0 directory with the HBase files.
  5. Set JAVA_HOME in /usr/share/hbase-0.94.0/conf/hbase-env.sh. At the beginning of the article we installed the JDK to /usr/java/jdk1.6.0_30, so I’ve made a symbolic link named /usr/java/default that points to it:
    sudo ln -sfn /usr/java/jdk1.6.0_30 /usr/java/default
    

    So my hbase-env.sh has the following value:

    export JAVA_HOME=/usr/java/default/
    
  6. Add the following two properties at /usr/share/hbase-0.94.0/conf/hbase-site.xml:
    <property>
      <!-- This is the location in HDFS that HBase will use to store its files -->
      <name>hbase.rootdir</name>
      <value>hdfs://[name-node-host]:8020/hbase</value>
    </property>
    <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
    </property>
    
  7. Edit /usr/share/hbase-0.94.0/conf/regionservers, and place the region server host names in it, each host name on its own line. It’s common to install a region server on each server hosting a DataNode (this gives us data locality), so simply list the host names of your DataNode servers (computer2 – computer4). For example:
    comp2.example.com
    comp3.example.com
    comp4.example.com
    
  8. Configure the built-in ZooKeeper by adding the following lines in /usr/share/hbase-0.94.0/conf/hbase-site.xml:
    <property>
      <name>hbase.zookeeper.quorum</name>
      <!-- In our example I've chosen to run 3 ZooKeeper instances, one on each Region Server. So this value should be:
           comp2.example.com,comp3.example.com,comp4.example.com -->
      <value>[comma-separated list of all region servers]</value>
    </property>
    <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/var/zookeeper</value>
    </property>
    
  9. On each Region Server, do the following as the hadoop user:
    sudo mkdir /var/zookeeper
    sudo chown -R hadoop:hadoop /var/zookeeper
    	

    This correlates with the ZooKeeper configuration we’ve placed in hbase-site.xml

  10. Now we need to create the hbase directory in HDFS as configured in hbase-site.xml.
    1. First login as the root of HDFS which is the user hdfs:
      su - hdfs
      

      This is needed because HDFS permissions are determined by the operating system user that runs the command.

    2. Create the directory
      hadoop fs -mkdir /hbase
      
    3. Change the ownership of this directory from its creator, hdfs, to the user hadoop and the group hadoop.
      hadoop fs -chown hadoop:hadoop /hbase
      
  11. The clocks of all machines must be in sync with one another, so on each machine we run a standard service called ntp, which synchronizes the clock to a world clock:
    sudo yum install ntp
    sudo systemctl restart ntpd.service
    sudo systemctl enable ntpd.service
    
  12. Copy /usr/share/hbase-0.94.0 to all region server machines (comp2 – comp4). I usually copy it to the home directory of the hadoop user using scp and then move the directory to /usr/share using sudo. To make sure the permissions are right after this step, run on each machine:
    sudo chown -R hadoop:hadoop /usr/share/hbase-0.94.0/ 
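
In this layout the ZooKeeper quorum is simply the list of region servers, so the regionservers file and the hbase.zookeeper.quorum value can drift apart if you edit them separately. As a small sketch, you can derive the quorum value from the regionservers file instead of typing it twice. The function names quorum_value and quorum_property are hypothetical helpers of mine.

```shell
# Sketch: derive hbase.zookeeper.quorum from the regionservers file.
# quorum_value and quorum_property are hypothetical helper names.

# quorum_value FILE: joins the host names in FILE with commas,
# skipping blank lines and lines starting with '#'.
quorum_value() {
  grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | paste -sd, -
}

# quorum_property FILE: prints the matching <property> element for hbase-site.xml.
quorum_property() {
  printf '<property>\n  <name>hbase.zookeeper.quorum</name>\n  <value>%s</value>\n</property>\n' \
    "$(quorum_value "$1")"
}
```

Running quorum_property /usr/share/hbase-0.94.0/conf/regionservers prints an element you can paste into hbase-site.xml in place of the hand-written one.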
    

Running HBase

  1. On the HBase master server (comp1), log in as hadoop and go to /usr/share/hbase-0.94.0/bin
  2. Run:
    ./start-hbase.sh
    

    This effectively runs the HBase master process on comp1 and then using ssh it runs the Region Server process and ZooKeeper process on each Region Server machine as defined in regionservers.

  3. Check that HBase is up by visiting http://[hbase-master-host]:60010
  4. You can also check it by using the HBase shell:
    /usr/share/hbase-0.94.0/bin/hbase shell
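
Once the shell opens, a quick way to confirm that the master and the region servers are really working is to create a table, write a row, read it back, and clean up. Here is a minimal sketch that prints those shell commands so they can be piped into the HBase shell. The function name hbase_smoke_commands and the default table name smoke_test are my own choices; pick a table name that does not exist yet.

```shell
# Sketch: print a short sequence of HBase shell commands for a smoke test.
# hbase_smoke_commands and the table name smoke_test are hypothetical choices.
hbase_smoke_commands() {
  local table="${1:-smoke_test}"
  cat <<EOF
create '$table', 'cf'
put '$table', 'row1', 'cf:a', 'value1'
scan '$table'
disable '$table'
drop '$table'
exit
EOF
}
```

To use it, run hbase_smoke_commands | /usr/share/hbase-0.94.0/bin/hbase shell and check that the scan output lists row1 with the value you wrote.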