BIG DATA : HADOOP : I was able to set up Hadoop – here are the steps

A few months ago I was following http://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm, and it took me around two weeks of nights after office before it finally worked. And I was lost as to what I had done in what sequence. So this weekend I tried it once again. The following are the steps I followed this time:

1. After installing Ubuntu 14.04, update the package lists from the terminal
sudo apt-get update

2. Install Java
sudo apt-get install openjdk-7-jre
The following commands return the paths where Java and Python are installed on Ubuntu:
which java
which python

Check which version of Java is installed:
java -version

3. Add a dedicated user for Hadoop
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

This will create a new user hduser with the home directory “/home/hduser”.
Use the command below to add this user to the sudo group:
sudo usermod -a -G sudo hduser
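To confirm the group memberships took effect, you can list hduser’s groups; the exact output depends on your system, but hadoop and sudo should appear:
groups hduser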

4. Installing SSH
SSH (“Secure Shell”) is a protocol for securely accessing one machine from another. Hadoop uses SSH to access its slave nodes and to start and manage all the HDFS and MapReduce daemons.
sudo apt-get install openssh-server

5. Generate a passwordless SSH connection
a. Run the following as hduser; rsa is the algorithm used to generate the key pair:
ssh-keygen -t rsa
b. After the key pair is generated, append the public key to the authorized keys on localhost:
cat ~/.ssh/id_rsa.pub | ssh hduser@localhost 'cat>>.ssh/authorized_keys'
c. Check that the call to localhost is passwordless:
ssh localhost
This should not ask for a password.
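As an aside, if ssh-copy-id is available on your machine, it performs the same public-key copy as the cat pipeline above in a single step:
# equivalent to the manual append above, using the default identity
ssh-copy-id hduser@localhost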

6. Go back to the root user with the ‘exit’ command
a. Install gksu on Ubuntu. It provides gksudo, a graphical frontend for running programs as root (used with gedit in the next step).
sudo apt-get install gksu
b. Install vim on Ubuntu. It is a terminal text editor.
sudo apt-get install vim

7. Disabling IPv6
Since Hadoop doesn’t work on IPv6, we should disable it. Another reason is that Hadoop has been developed and tested on IPv4 stacks, so Hadoop nodes can only communicate over an IPv4 cluster. (Once you have disabled IPv6 on your machine, you need to reboot for the change to take effect. If you don’t know how to reboot from the command line, use sudo reboot.)
To disable IPv6 on your Linux machine, you need to update /etc/sysctl.conf by adding the following lines at the end of the file.
a. Open a terminal, switch to the root user, run gksudo gedit /etc/sysctl.conf to open the configuration file, and add the following lines at the end:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

b. After that, run:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
If it reports ‘1’, you have disabled IPv6. If it reports ‘0’, then follow Step c and Step d.
c. Run sudo sysctl -p and you will see this in the terminal:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

d. Repeat “Step b” above and it will now report 1.
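If you want a single scripted check after the reboot, a minimal shell test (assuming the sysctl entries above were added) is:
# prints OK when IPv6 is disabled, WARN otherwise
[ "$(cat /proc/sys/net/ipv6/conf/all/disable_ipv6)" = "1" ] && echo "IPv6 OK (disabled)" || echo "WARN: IPv6 still enabled"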

8. Copy the Hadoop package folder from the root user’s home to the current hduser home
scp -r hadoop hduser@localhost:/home/hduser

9. Change ownership and mode on the hadoop folder:
sudo chown hduser:hadoop -R /home/hduser/hadoop
sudo chmod -R 777 /home/hduser/hadoop
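You can verify the new ownership and permissions with ls; the listing should show hduser as owner and hadoop as group:
ls -ld /home/hduser/hadoop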

10. Edit the .bashrc file to add the Hadoop environment variables:
vi ~/.bashrc
Open it in the editor and press Shift+G to go to the end of the file.
Then paste the lines below:
# -- HADOOP ENVIRONMENT VARIABLES START -- #
export JAVA_HOME=/usr/
export PATH=$PATH:$JAVA_HOME
export HADOOP_HOME=/home/hduser/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin/:$HADOOP_CONF_DIR
export HIVE_HOME=/home/hadoop/hive
export PATH=$PATH:$HIVE_HOME/bin
# -- HADOOP ENVIRONMENT VARIABLES END -- #

11. Check all the config files in hadoop/etc/hadoop, such as core-site.xml, hadoop-env.sh, and hdfs-site.xml.

12. Create a new directory hdfs in the hadoop folder, and two folders named name and data inside it:
cd ~/hadoop
mkdir hdfs
cd hdfs
mkdir name
mkdir data
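Equivalently, a single command creates the whole tree in one shot (assuming hadoop lives in hduser’s home as set up in step 8):
mkdir -p ~/hadoop/hdfs/name ~/hadoop/hdfs/data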

13. Reload the .bashrc file to pick up the updated PATH
source ~/.bashrc
echo $PATH

This will list the updated PATH as per the current user’s settings.
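A quick sanity check that the PATH changes worked is to ask Hadoop for its version; if the command is found, $HADOOP_HOME/bin is on the PATH:
hadoop version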

14. Format the Hadoop filesystem (this formats the namenode)
hadoop namenode -format

15. Start all services
start-all.sh

16. Run jps to see the list of running Java processes. Six processes should be listed:
10894 NameNode
11045 DataNode
11228 SecondaryNameNode
12055 Jps
11503 NodeManager
11377 ResourceManager
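If you prefer a scripted check, a rough one-liner that counts the five Hadoop daemons (everything except Jps itself) looks like this:
# should print 5 when all daemons are up; rerun start-all.sh if not
jps | grep -cE 'NameNode|DataNode|SecondaryNameNode|ResourceManager|NodeManager'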

17. To see the namenode, resourcemanager, and nodemanager web UIs, use:
http://localhost:50070 - namenode
http://localhost:8088/cluster - resourcemanager
http://localhost:8042/node - nodemanager
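To check the same endpoints from the terminal instead of a browser, curl can report the HTTP status code (200 means the UI is up); this is just a sketch using the ports listed above:
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088/cluster
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8042/node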

18. Go to the root user:
sudo -i

19. Rename the folder from datanode to data:
gvfs-move /home/hduser/hadoop/hdfs/datanode /home/hduser/hadoop/hdfs/data

20. For the newly set up hadoop folder, update core-site.xml and hdfs-site.xml
as below:
hdfs-site.xml
-------------------------

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hduser/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hduser/hadoop/hdfs/data</value>
  </property>
</configuration>

core-site.xml
-----------------------------------

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost</value>
  </property>
</configuration>
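After updating the configs (and restarting the daemons with stop-all.sh and then start-all.sh), a small smoke test confirms HDFS is accepting commands; the /user/hduser path here is just a conventional home-directory choice:
hdfs dfs -mkdir -p /user/hduser
hdfs dfs -ls /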

21. Using the HIVE commands below, let’s import some data into HDFS:
create table CountryTable(id int,name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';

load data local inpath '/home/hduser/country.txt' overwrite into table CountryTable;
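For illustration only, country.txt is assumed here to be a tab-separated file with an id and a name per line, matching the table definition above, e.g.:
1	India
2	USA
3	Japan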

22. Some more HIVE scripts, very similar to SQL Server scripts (note the trailing semicolons, which the Hive CLI requires):
Select * from CountryTable;
Select Count(*) from CountryTable;
Select Count(ID) from CountryTable;
Select SUM(ID) from CountryTable;
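You can also run any of these from the shell without opening the Hive prompt, using Hive’s -e flag:
hive -e 'select count(*) from CountryTable;'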
