Friday, September 6, 2013

HBase Installation

A. Pre-Requisite

1. Hadoop

B. HBase Installation
1. Download a stable version of HBase from : http://apache.mirrors.hoobly.com/hbase/stable/ (download the latest stable version !)

2. Go to the Downloads folder: launch a terminal -> cd <download path> (ex: cd /home/hduser/Downloads)

3. Extract: tar -xzf hbase-<version>.tar.gz (ex: tar -xzf hbase-0.94.11.tar.gz)

4. Change directory name: mv hbase-<version>/ hbase/

5. Move hbase to the directory where all hadoop tools are installed: sudo mv hbase/ /usr/hadoop/.

6. Change Ownership: sudo chown -R hduser:hadoop /usr/hadoop/hbase/

7. For convenience, add export HBASE_HOME=/usr/hadoop/hbase & export PATH=$PATH:$HBASE_HOME/bin to your .bashrc (so that you will be able to call hbase commands from anywhere)

8. Edit $HBASE_HOME/conf/hbase-env.sh to specify JAVA_HOME. Save & exit from the editor
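For example (the JDK path below is an assumption; point it at whatever JDK is installed on your machine):

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64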

9. Start HBase: start-hbase.sh
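To confirm it started, run jps; in standalone mode HBase runs the Master, RegionServer & ZooKeeper inside a single JVM, so you should see one HMaster process listed:

jps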

10. Launch the hbase shell to work with HBase (create, alter, drop tables & insert data, etc ...): hbase shell

That's it; your HBase installation is done.

To test your installation, use the statements below at the hbase shell prompt to create & describe a table:

1. hbase(main):001:0> create 'emptable', 'empno', 'empname', 'salary', 'deptno'

2. hbase(main):002:0> describe 'emptable'
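As a further check, you can insert & read back a row (the row key & values below are made up; each column family created above takes a qualifier after the colon):

hbase(main):003:0> put 'emptable', '1001', 'empname:first', 'John'
hbase(main):004:0> put 'emptable', '1001', 'salary:amount', '50000'
hbase(main):005:0> scan 'emptable'
hbase(main):006:0> get 'emptable', '1001'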


Monday, August 19, 2013

Public Datasets

http://www.scaleunlimited.com/datasets/public-datasets/ - Sample data for various sectors. This sample data can be used for POCs, prototypes, pre-production tests, etc ...


http://www.grouplens.org/node/12 - Sample data for various sectors. This sample data can be used for POCs, prototypes, pre-production tests, etc ...


http://stat-computing.org/dataexpo/2009/the-data.html - Airline on-time data, useful for analyzing & predicting delays:
       - e.g. to calculate the average departure delay by month for each airline
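As a rough non-Hadoop sketch of that calculation (assuming the dataset's documented column order, where field 2 is Month, field 9 is UniqueCarrier & field 16 is DepDelay, and using one of the per-year files, 2008.csv; the header row & NA values are skipped):

awk -F, 'NR > 1 && $16 != "NA" { sum[$9 "," $2] += $16; cnt[$9 "," $2]++ }
         END { for (k in sum) print k, sum[k] / cnt[k] }' 2008.csv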


Saturday, August 17, 2013

Zookeeper Installation

A. Pre-Requisite

1. Hadoop

B. Zookeeper Installation

1. Download ZooKeeper from : http://www.apache.org/dyn/closer.cgi/zookeeper/ (download the latest stable version !)

2. Go to the Downloads folder: launch a terminal -> cd <download path> (ex: cd /home/hduser/Downloads)

3. Extract: tar -xzf zookeeper-<version>.tar.gz (ex: tar -xzf zookeeper-3.4.5.tar.gz)

4. Change directory name: mv zookeeper-<version>/ zookeeper/

5. Move zookeeper to the common directory where all hadoop tools are installed: sudo mv zookeeper/ /usr/hadoop/.

6. Change Ownership: sudo chown -R hduser:hadoop /usr/hadoop/zookeeper/

7. Create a directory under /tmp and change its ownership to the user under which zookeeper is installed: cd /tmp -> sudo mkdir zookeeper -> sudo chown -R hduser:hadoop zookeeper (this directory stores the in-memory database snapshots & the transaction logs)

8. Edit the zoo config: the distribution ships a zoo_sample.cfg. Create a copy of it as zoo.cfg & edit it: cd <Zookeeper installation path>/conf; cp zoo_sample.cfg zoo.cfg; vi zoo.cfg to configure dataDir & the other settings as below:
tickTime=2000
dataDir=/tmp/zookeeper
clientPort=2181
(dataDir must point at the directory created in step 7, /tmp/zookeeper in this setup)

Upon any change save the file & exit (esc :wq)

9. Now log in as / switch to hduser (the zookeeper installation user)

10. cd (to go to the home directory)

11. vi .bashrc
add below environment variables:
ZOOKEEPER_HOME=/usr/hadoop/zookeeper
export ZOOKEEPER_HOME
PATH=$PATH:$ZOOKEEPER_HOME/bin
export PATH
Upon adding above lines, save & exit from .bashrc (esc :wq)

12. source .bashrc

13. Start ZooKeeper server: zkServer.sh start

14. Zookeeper's command-line tool:

zkCli.sh -server localhost:2181

(at the resulting prompt you can run commands such as ls / to list znodes)
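A quick smoke test, assuming the server is up on the default client port 2181 (the znode /zk_test below is just a scratch name):

echo ruok | nc localhost 2181   (a healthy server answers imok)
zkServer.sh status              (should report it is running in standalone mode)

Then inside zkCli.sh:
create /zk_test mydata
ls /
delete /zk_test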

Thursday, August 15, 2013

Sqoop Installation

A. Pre-Requisite:

I.  Hadoop 
II. RDBMS (MySQL, Oracle, DB2, etc ...)
III. RDBMS Connector

I have installed MySQL for testing purposes.

Here is a simple way to install MySQL on Ubuntu:

1. Launch Terminal

2. sudo apt-get install mysql-server (it will prompt for your sudo password)

3. While installing, it will prompt you to key in a root password for mysql (not for your system). Key in the password for the mysql root user (new password & re-type password)

4. Upon successful installation, check the status using the command below:

5. sudo netstat -tap | grep mysql

The result should be: tcp        0      0 localhost:mysql         *:*                     LISTEN      10444/mysqld

If it shows the above output, then your mysql database is ready.

6. Upon installation, download the respective JDBC connector. In my case I downloaded mysql-connector-java-5.1.25.jar & added it to the CLASSPATH.
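If you need test data, a minimal session like the one below creates a database & table matching the Sqoop examples further down (the columns are an assumption, chosen to line up with the Employee table used there):

mysql -u root -p
mysql> CREATE DATABASE hadoop_test;
mysql> USE hadoop_test;
mysql> CREATE TABLE Employee (empno INT PRIMARY KEY, empname VARCHAR(50), salary INT, deptno INT);
mysql> INSERT INTO Employee VALUES (1001, 'John', 50000, 10);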

B. Sqoop Installation

1. Download Sqoop from http://mirror.sdunix.com/apache/sqoop/1.4.4/sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz (check your hadoop version & download the matching version of sqoop; make sure the file name has "bin")

2. Extract Sqoop : tar -xzf sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz (it will extract to a folder sqoop-1.4.4.bin__hadoop-1.0.0)

3. sudo mv sqoop-1.4.4.bin__hadoop-1.0.0/ sqoop

4. sudo mv sqoop/ /usr/hadoop/.

5. cd /usr/hadoop

6. sudo chown -R hduser:hadoop sqoop/

7. cd sqoop

8. cp *.jar $HADOOP_HOME/lib/. (copy sqoop jar files to hadoop lib directory)
9. Set the environment variables below (under hduser):
export SQOOP_HOME=/usr/hadoop/sqoop
export PATH=$SQOOP_HOME/bin:$PATH
(Make sure hadoop is started. To start Hadoop, ssh localhost; start-all.sh)
 
10. Type sqoop at the command prompt (it will tell you to type "sqoop help" to get help).
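A quick way to confirm Sqoop can reach MySQL through the connector (-P prompts for the password instead of putting it on the command line):

sqoop list-databases --connect jdbc:mysql://localhost --username xxxxx -P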

11. Below are sample sqoop statements for importing & exporting data between a mysql db & hdfs

a. sqoop import --connect jdbc:mysql://localhost/hadoop_test --username xxxxx --password ******** --table Employee --target-dir /data/emp1 -m 1

b. sqoop import --connect jdbc:mysql://localhost/hadoop_test --username xxxxx --password ******** --table Employee --target-dir /data/emp2/ --split-by deptno;

Import as Avro:
c. sqoop import --connect jdbc:mysql://localhost/hadoop_test --username xxxxx --password ****** --table Employee --target-dir /data/emp3/ --as-avrodatafile -m 1;

d. sqoop export --connect jdbc:mysql://localhost/hadoop_test --table Employee --username xxxxx --password ******** --export-dir /data/emp --input-fields-terminated-by '\t';

** While exporting, one should specify the absolute path of the file. In the case of part files, give the full path, ex: part-00000
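To eyeball what an import wrote, list the target directory & cat a part file (with -m 1, the import in example (a) produces a single part-m-00000):

hadoop fs -ls /data/emp1
hadoop fs -cat /data/emp1/part-m-00000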

(Hadoop: The Definitive Guide has a very good & simple example to work through!)

"Sqoop is mainly used to transport data from RDBMS to HDFS & vis-a-vis"

**************** End of Sqoop Installation ******************************

PIG Installation

A. Pre-Requisite: 
Hadoop

B. Pig Installation steps:

1. Download Pig from an Apache mirror (I used pig-0.11.1.tar.gz)

2. Extract Pig : tar -xzf pig-0.11.1.tar.gz (it will extract to a folder pig-0.11.1)

3. sudo mv pig-0.11.1/ pig

4. sudo mv pig/ /usr/hadoop/.

5. cd /usr/hadoop

6. sudo chown -R hduser:hadoop pig/

7. Set the environment variables below (under hduser):
export PIG_HOME=/usr/hadoop/pig
export PATH=$PIG_HOME/bin:$PATH
(Make sure hadoop is started. To start Hadoop, ssh localhost; start-all.sh)
8. At the command prompt in the terminal: pig (type pig & enter). It will lead you to the grunt> prompt, where you can run pig statements / scripts.
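For a first test at the grunt> prompt, a sketch like the one below works (it assumes the comma-delimited /data/emp1 directory from the Sqoop examples above; adjust the path & schema to whatever you have in HDFS):

grunt> emp = LOAD '/data/emp1' USING PigStorage(',') AS (empno:int, empname:chararray, salary:int, deptno:int);
grunt> DUMP emp;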

Note: hduser is the user under which I have installed hadoop & pig. In your case it may be different

********** End of Pig Installation - Enjoy statement based HDFS tool ***************************

Tuesday, August 13, 2013

Hive Installation

A. Pre-Requisite: 

Hadoop

B. Hive Installation steps:

1. Download Hive from an Apache mirror (I used hive-0.10.0-bin.tar.gz)

2. Extract Hive : tar -xzf hive-0.10.0-bin.tar.gz (it will extract to a folder hive-0.10.0-bin)

3. sudo mv hive-0.10.0-bin/ hive

4. sudo mv hive/ /usr/hadoop/.

5. cd /usr/hadoop

6. sudo chown -R hduser:hadoop hive/

7. Set the environment variables below (under hduser):
export HIVE_HOME=/usr/hadoop/hive
export PATH=$HIVE_HOME/bin:$PATH
(Make sure hadoop is started. To start Hadoop, ssh localhost; start-all.sh)
8. hive

It will give you the hive prompt, so that you can start accessing the default DB or create a new DB :)
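For example (the table name & columns below are made up):

hive> SHOW DATABASES;
hive> CREATE TABLE emp (empno INT, empname STRING, salary INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> SHOW TABLES;
hive> DESCRIBE emp;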

Note: hduser is the user under which I have installed hadoop & hive. In your case it may be different

********** End of Hive Installation - Enjoy SQL based HDFS tool ***************************

Oozie Installation

Hi All,

It took me a few hours to install Oozie :(, so I thought of making it simple for others. Below are the steps & links for installing oozie.

A. Pre-Requisite:
1. Unix (Ubuntu, Red Hat, HP-UX, etc ...)
2. Java 1.6+
3. Apache Hadoop (0.20 onwards)
4. Maven (mvn command should work from your terminal). To install Maven refer to : http://www.mkyong.com/maven/how-to-install-maven-in-ubuntu/
5. ExtJS Library (optional, it is to enable webconsole) - http://extjs.com/deploy/ext-2.2.zip (Download to /tmp)

B. Installation Steps:

1. Download the latest version of Oozie from http://mirror.symnds.com/software/Apache/oozie/ (I have downloaded version 3.3.2)

2. Extract the downloaded tar.gz (tar -zxf oozie-3.3.2.tar.gz)

3. cd oozie-3.3.2/bin (Your version may be different!! )

4. ./mkdistro.sh -DskipTests (run this command to build Oozie - you have downloaded the oozie source & now need to build it into a binary). Upon a successful build it will create the Oozie binary. Note: it needs an internet connection & will take some time.
A successful build will show the message below:
Oozie distro created, DATE[2013.08.13-12:54:51GMT] VC-REV[unavailable], available at [<Oozie downloaded path>/oozie-3.3.2/distro/target]

5. Change owner: sudo chown -R hduser:hadoop <oozie folder path>

6. Change the directory name to oozie: sudo mv oozie-3.3.2 oozie

7. Move the oozie folder from Downloads to another folder. In my case I moved it to /home/hduser: sudo mv oozie /home/hduser/.

8. su hduser (login as hduser to continue setup) & cd

9. export OOZIE_HOME=<oozie path> (in my case: export OOZIE_HOME=/home/hduser/oozie)

10. Create a libext folder under $OOZIE_HOME (cd $OOZIE_HOME; mkdir libext)

11. Copy all jars from $HADOOP_HOME & $HADOOP_HOME/lib to the newly created libext folder (cp $HADOOP_HOME/lib/*.jar $OOZIE_HOME/libext/. ; cp $HADOOP_HOME/*.jar $OOZIE_HOME/libext/.)

12. Copy ext-2.2.zip to $OOZIE_HOME/libext (before copying, make sure its owner is the same as oozie's)

13. Copy the oozie war file:
- cp $OOZIE_HOME/distro/target/oozie-3.3.2-distro/oozie-3.3.2/oozie.war $OOZIE_HOME/webapp/src/main/webapp/.

14. Change the configuration as below:
vi ./distro/target/oozie-3.3.2-distro/oozie-3.3.2/conf/oozie-site.xml (by default the value is false; change it to true)

<property>
        <name>oozie.service.JPAService.create.db.schema</name>
        <value>true</value>
        <description>
            Creates Oozie DB.
            If set to true, it creates the DB schema if it does not exist. If the DB schema exists is a NOP.
            If set to false, it does not create the DB schema. If the DB schema does not exist it fails start up.
        </description>
    </property>

15. Run the following command to create OOZIE DB:
- $OOZIE_HOME/distro/target/oozie-3.3.2-distro/oozie-3.3.2/bin/ooziedb.sh create -sqlfile oozie.sql -run

If it is successful, the output should be:
setting CATALINA_OPTS="$CATALINA_OPTS -Xmx1024m"

Validate DB Connection
DONE
Check DB schema does not exist
DONE
Check OOZIE_SYS table does not exist
DONE
Create SQL schema
DONE
Create OOZIE_SYS table
DONE

Oozie DB has been created for Oozie version '3.3.2'

16. To enable the webconsole, we need to install the ExtJS library. Also, the oozie war file requires a few other jar files like hadoop-core-<version>.jar & commons-configuration-<version>.jar

- ./bin/addtowar.sh -inputwar oozie.war -outputwar oozie1.war -jars /home/hduser/oozie/hadooplibs/target/oozie-3.3.2-hadooplibs/oozie-3.3.2/hadooplibs/hadooplib-1.1.1.oozie-3.3.2/*.jar -extjs $OOZIE_HOME/libext/ext-2.2.zip (make sure hadoop-core-<YOUR DISTRIBUTION VERSION>.jar & commons-configuration-<VERSION>.jar are added)

- Run the commands below to rename the jar & deploy it:
   i) rm -f oozie.war
   ii) mv oozie1.war oozie.war
   iii) cp oozie.war $OOZIE_HOME/distro/target/oozie-3.3.2-distro/oozie-3.3.2/oozie-server/webapps/.

17. Start the Oozie Server (oozied.sh start)

18. Run the command below to check the status of the oozie server:
bin/oozie admin -oozie http://localhost:11000/oozie -status
(It should return "System Mode: NORMAL")
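Once the server is up (and after the core-site.xml proxyuser change in step 21 below), you can try submitting one of the bundled examples; this sketch assumes you have extracted oozie-examples.tar.gz, copied the examples directory to HDFS, & edited examples/apps/map-reduce/job.properties with your namenode/jobtracker addresses:

bin/oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run
bin/oozie job -oozie http://localhost:11000/oozie -info <job-id printed by the previous command>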
19. To access the oozie webconsole, open http://localhost:11000/oozie - you will be able to view workflow statuses there.

20. To view the log file: tail -100f logs/catalina.out


21. Finally, configure hadoop's core-site.xml. Open core-site.xml (located under $HADOOP_HOME/conf) and add the configuration below:
<!-- OOZIE -->
  <property>
    <name>hadoop.proxyuser.hduser.hosts</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hduser.groups</name>
    <value>*</value>
  </property>
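The proxyuser settings are only picked up on restart, so bounce Hadoop after editing core-site.xml:

stop-all.sh
start-all.sh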

*************End of Installation - Enjoy Workflow Scheduler*****************************