Saturday, December 27, 2014

Tables in Hive

A Hive table is logically made up of the data being stored and the associated metadata describing the layout of the data in the table. The data typically resides in HDFS, although it may reside in any Hadoop-compatible filesystem, including the local filesystem or S3. Hive stores the metadata in a relational database, not in HDFS.

Each table has a corresponding directory in HDFS, and the data is serialized and stored in files within that directory.

Different types of Tables

Managed Tables
When you create a table in Hive, by default Hive manages the data, which means that Hive controls the lifecycle of the data in its warehouse directory.
Hive stores the data for these tables in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.

hive> CREATE TABLE workshop.sample(key int, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

If you’re not currently working in the target database, you can prefix the table name with the database name. In the statement above, workshop is the database.

If you add the option IF NOT EXISTS, Hive will silently ignore the statement if the table already exists. This is useful in scripts that should create a table the first time they run.
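For example, a minimal sketch reusing the table definition above; re-running it is a no-op if the table already exists:

hive> CREATE TABLE IF NOT EXISTS workshop.sample(key int, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;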

Note: 
When a managed table gets dropped, both the metadata and data get dropped. However, managed tables are less convenient for sharing with other tools.
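As an illustration (assuming the workshop.sample table created above and the default warehouse location):

hive> DROP TABLE IF EXISTS workshop.sample;

This removes the table definition from the metastore and also deletes the table's data directory (e.g., /user/hive/warehouse/workshop.db/sample) from HDFS.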


External Tables
Suppose we have data that is created and used primarily by Pig or other tools, and we want to run some queries against it without giving Hive ownership of the data. We can define an external table that points to that data but doesn't take ownership of it.

With an external table, when you create the table in Hive, the data stays at an existing location outside the warehouse directory.

Create an external table and point it to the location of the data:

hive> CREATE EXTERNAL TABLE sample (key int, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data/sample';

An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.


Note:
  • An external table is essentially metadata; Hive does not create a directory for it under the warehouse in HDFS.
  • An external table doesn't load data; it just points to the location of the data.
  • When an external table is dropped, the data associated with it doesn't get deleted; only the metadata (number of columns, type of columns, terminators, etc.) gets dropped from the Hive Metastore.
  • External tables are useful when you want to process data produced by another module or tool, since they let Hive query the data in place without modifying the original data (see the example below).
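A minimal sketch, assuming the external table sample created above over /data/sample; dropping it leaves the underlying files untouched:

hive> DROP TABLE IF EXISTS sample;
$ hadoop fs -ls /data/sample

The DROP removes only the metadata; the hadoop fs -ls listing still shows the original data files.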


How do you check whether an existing table is managed or external?

To check, use the DESCRIBE FORMATTED command as below:

hive> DESCRIBE FORMATTED tablename;

It displays the complete metadata of the table.
Look for the row labeled Table Type, which will show either MANAGED_TABLE or EXTERNAL_TABLE.

Eg:

If it is managed table, you will see
Table Type:             MANAGED_TABLE

If it is external table ,you will see
Table Type:             EXTERNAL_TABLE




Very Useful Hive CLI settings

hive.cli.print.current.db [Show the current database as part of the prompt]


The USE command sets a database as your working database, analogous to changing working directories in a filesystem:
hive> USE workshop;

There is no command to show you which database is your current working database.


To show the current database as part of the prompt, set the property hive.cli.print.current.db to true:

hive> set hive.cli.print.current.db=true;
hive (workshop)> USE default;
hive (default)> set hive.cli.print.current.db=false;
hive> 

hive.cli.print.header [Print table column headers]

The CLI can print column headers, but this is disabled by default. We can enable this feature by setting:

hive> set hive.cli.print.header=true;
hive> SELECT * FROM employee LIMIT 2;
employee.id employee.name employee.age
101 Satish 30
102 kumar 29





Friday, December 26, 2014

Databases in Hive

In Hive, a database is just a catalog or namespace of tables. However, databases are very useful for larger clusters with multiple teams and users, as a way of avoiding table name collisions. It’s also common to use databases to organize production tables into logical groups.
If you don’t specify a database, the default database is used.

Create a DataBase

hive> CREATE DATABASE workshop;
This throws an error if the workshop database already exists.

You can suppress the error with the following variation:
hive> CREATE DATABASE IF NOT EXISTS workshop;
The IF NOT EXISTS clause is useful for scripts that should create a database on the fly, if necessary, before proceeding.

Note:
Hive will create a directory for each database under a top-level directory specified by the property hive.metastore.warehouse.dir. (default value is /user/hive/warehouse).

Eg: when the 'workshop' database is created, Hive will create the directory /user/hive/warehouse/workshop.db. (Note the .db extension).

All the tables of the database will be stored in sub-directories of the database directory. The exception is tables in the default database, which doesn't have its own directory.

You can override the default location for the new directory as shown in this example:
hive> CREATE DATABASE workshop LOCATION '/hive_workshop';

You can add a descriptive comment to the database, which will be shown by DESCRIBE DATABASE:
hive> CREATE DATABASE workshop COMMENT 'Holds all my exercises in Hive';

Database Schema

hive> DESCRIBE DATABASE workshop;

OK
workshop Holds all my exercises in Hive hdfs://satishkumar/user/hive/warehouse/workshop.db satishkumar
Time taken: 0.026 seconds, Fetched: 1 row(s)

DESCRIBE DATABASE also shows the directory location for the database. If you are running in pseudo-distributed mode, then the master server will be localhost. For local mode, the path will be a local path, file:///user/hive/warehouse/workshop.db.

You can associate key-value properties with the database, although their only function currently is to provide a way of adding information to the output of DESCRIBE DATABASE EXTENDED <database>:

hive> CREATE DATABASE workshop WITH DBPROPERTIES ('creator' = 'Satish Kumar', 'date' = '2014-12-28');

hive> DESCRIBE DATABASE workshop;                                                                    
OK
workshop hdfs://satishkumar/user/hive/warehouse/workshop.db satishkumar
Time taken: 0.026 seconds, Fetched: 1 row(s)
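To see those properties, a sketch using the EXTENDED form mentioned above; the DBPROPERTIES are appended to the end of the output row:

hive> DESCRIBE DATABASE EXTENDED workshop;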

List of DataBases

At any time, you can see the databases that already exist as follows:
hive> SHOW DATABASES;
default
workshop

If you have a lot of databases, you can restrict the ones listed using a regular expression like
hive> SHOW DATABASES LIKE 'w.*';
workshop 
It lists only those databases that start with the letter 'w', followed by any other characters.

Set Working Database

The USE command sets a database as your working database, analogous to changing working directories in a filesystem:

hive> USE employee_db;

Note: There is no command to show you which database is your current working database.

Drop a DataBase

hive> DROP DATABASE IF EXISTS workshop;
The IF EXISTS is optional and suppresses warnings if workshop doesn't exist.

By default, Hive won’t permit you to drop a database if it contains tables. You can either drop the tables first or append the CASCADE keyword to the command, which causes Hive to drop the tables in the database first:

hive> DROP DATABASE IF EXISTS financials CASCADE;

Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior, where existing tables must be dropped before dropping the database. When a database is dropped, its directory is also deleted.
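For completeness, a sketch of the explicit RESTRICT form, which fails if the database still contains tables:

hive> DROP DATABASE IF EXISTS workshop RESTRICT;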

Alter a DataBase

You can set key-value pairs in the DBPROPERTIES associated with a database using the
ALTER DATABASE command. No other metadata about the database can be changed,
including its name and directory location:

hive> ALTER DATABASE workshop SET DBPROPERTIES ('edited-by' = 'Teja');

There is no way to delete or “unset” a DBPROPERTY.

Setup and Configure Hive

Step 1: 
Download latest Hive from Apache Download Mirrors and derby db.
https://archive.apache.org/dist/hive/hive-0.13.1/
http://db.apache.org/derby/releases/release-10.10.2.0.cgi

Step 2: 
Un-tar the files to your working directory.

Step 3: 
Set an environment variable for the Hive home directory in the .bashrc file.
export HIVE_HOME="/home/satishkumar/WORK/apache-hive-0.13.1-bin"
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin

Step 4: 
Close the terminal and open a new one to verify that the environment variable for Hive is configured:
$echo $HIVE_HOME

Step 5: Start derby
Here we are using Derby as the default database for storing the Hive metadata. Start Derby from its bin directory using the following commands:
./startNetworkServer -h 0.0.0.0 &   
    or in case of any exception  
./startNetworkServer -noSecurityManager 
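As a quick sanity check that Derby is up (assuming the default network port 1527, which matches the connection URL used below):

$ netstat -an | grep 1527   # the Derby network server should be listening on this port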

Step 6: 
Add/Update Configurations in hive-site.xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby://localhost:1527/myderby1;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.ClientDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>

Step 7:
Copy the Derby libraries to hive/lib. (Here we are using Derby for the metastore, so we copy the required Derby libraries into Hive.)
1) Delete any Derby libs already present in the hive/lib folder.
2) Copy derby.jar, derbyclient.jar, and derbytools.jar from derby/lib into hive/lib.

Step 8:
Start Hive with the following command
$> bin/hive 
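As a quick smoke test against the fresh metastore (a minimal sketch; a new installation should list only the default database):

hive> SHOW DATABASES;
default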

Why go for Hive When Pig is There?


  • Pig 
    • Procedural data-flow language 
    • Pig is used by Programmers and Researchers. 
    • Pig is on the client side. 
    • For managing and querying unstructured data. 
      
  • Hive 
    • Declarative SQLish Language 
    • Hive is used by analysts generating data reports. 
    • Hive is on the cluster side. 
    • For managing and querying structured data. 

Features: Hive vs. Pig

  • Language: Hive uses a SQL-like language; Pig uses PigLatin.
  • Schemas/Types: Hive has explicit schemas; we have to create "tables" beforehand and the schema is stored in either a shared or local metadata database. Pig has implicit schemas; there is no need to create tables.
  • Partitions: Hive supports partitions; Pig does not.
  • Server: Hive offers an optional server (Thrift); Pig has none.
  • UDFs: both support user-defined functions in Java.
  • Custom Serializer/Deserializer: supported by both.
  • DFS Direct Access: Hive accesses HDFS implicitly; we never point to the actual HDFS folder. Pig accesses HDFS explicitly; we point directly to the HDFS folder.
  • Join/Order/Sort: supported by both.
  • Shell: both provide a shell.
  • Streaming: supported by both.
  • Web Interface: Hive has one; Pig does not.
  • JDBC/ODBC: Hive has limited support; Pig has none.

Configure HADOOP in Pseudo Distributed Mode

In this post I will illustrate the steps to configure and install Hadoop in Pseudo distributed mode.

Step 01: Verify Java is installed or not, If not install and configure.
You can verify java installation using the following command
$ java -version

On executing this command, you should see output similar to the following:
java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.6) (7u51-2.4.6-1ubuntu4)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

If Java is not installed, use the below command to begin the installation of Java
sudo apt-get install openjdk-7-jdk

This will install the full JDK under a subdirectory of /usr/lib/jvm.

Configure JAVA_HOME
Hadoop needs the Java installation path to work; for this we will set the JAVA_HOME environment variable to point to our Java installation directory.

JAVA_HOME can be configured in ~/.bashrc file

Use the below command to set JAVA_HOME on Ubuntu (adjust the path to match your Java installation)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
or
Add below statement in .bashrc file.
# set to the root of your Java installation
JAVA_HOME=/usr/lib/jvm/jdk1.7.0_60

Step 02: SSH Configuration (To login into remote PC to execute commands)

Install SSH using the command
sudo apt-get install openssh-server

Check whether it is installed:
$ ssh localhost
If it is not installed, you will get an error message like: ssh: connect to host localhost port 22: Connection refused.
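Hadoop's start scripts also expect passwordless SSH to localhost. A minimal sketch of setting that up, assuming the default key file locations:

# generate an RSA key with an empty passphrase
$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# authorize the key for logins to this machine
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
# should now log in without prompting for a password
$ ssh localhost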

Step 03: Download & Setup Hadoop

Download the latest stable release of Apache Hadoop from Apache Download Mirrors.
http://apache.mirrors.lucidnetworks.net/hadoop/common/.

Un-tar the file to an appropriate location.
$ tar xzvf <tar-filename>.tar.gz

Use the following command to create an environment variable that points to the Hadoop installation directory (HADOOP_HOME)
export HADOOP_HOME=/home/user/Hadoop
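To run the hadoop command from anywhere, as the verification step below assumes, the bin directory can also be appended to PATH (a small sketch for the .bashrc file):

export PATH=$PATH:$HADOOP_HOME/bin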

Step 04: Verify the environment variables

Close the terminal and open a new one to check whether the JAVA_HOME and HADOOP_HOME paths are set.

JAVA_HOME can be verified with the command
echo $JAVA_HOME

HADOOP_HOME can be verified with the command
echo $HADOOP_HOME

Use the following command to verify that Hadoop is installed: hadoop version
The output should be similar to the one below
Hadoop 1.2.1

Step 05: Configurations

1. Edit the file /conf/hadoop-env.sh to set the java home path.
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_60

2. Edit the file /conf/core-site.xml and add the following parameters
1) fs.default.name - points to the default URI for all filesystem requests in Hadoop.

So if the user makes a request for a file by specifying only its path, Hadoop tries to find that path on the filesystem defined by fs.default.name. If fs.default.name is set to an HDFS URI like
hdfs://<hostname>:<port>, then Hadoop tries to find the path on the HDFS instance whose namenode is running at <hostname>:<port>.

Note: To get the hostname, use the command 'hostname'.

2) hadoop.tmp.dir - used as the base for temporary directories, both on the local filesystem and in HDFS.

Specify the absolute path for the temporary directory.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<host-name>:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/satishkumar/HADOOP/SETUPS/temp</value>
  </property>
</configuration>
            
3. Edit the file /conf/hdfs-site.xml and add the following parameters

dfs.replication - the replication factor
dfs.name.dir - the path where the NameNode persists its data
dfs.data.dir - the path used by the cluster to persist the DataNode's data

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/satishkumar/HADOOP/SETUPS/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/satishkumar/HADOOP/SETUPS/dfs/data</value>
  </property>
</configuration>

4. Edit the files /conf/masters and /conf/slaves with the hostname

5. Format Hadoop Name Node
After these changes, you will have to format the filesystem
$ bin/hadoop namenode -format

6. Start Hadoop daemons
Start NameNode daemon and DataNode daemon
$ bin/start-dfs.sh

7. Check whether all the daemons are running or not
$ jps
4720 NameNode
5160 SecondaryNameNode
4936 DataNode

Troubleshoot
Note: If your master server fails to start due to the dfs safe mode issue, execute this on the Hadoop command line:
hadoop dfsadmin -safemode leave
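If you are unsure whether safe mode is the cause, its status can be checked first with the same dfsadmin tool (a small sketch):

$ hadoop dfsadmin -safemode get
Safe mode is ON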

Also make sure to format the namenode again if you make changes to your configuration.

8. Browse the web interface for the NameNode; by default it is available at:
http://localhost:50070/

Now you have successfully installed and configured Hadoop on a single node.