Saturday, December 27, 2014

Tables in Hive

A Hive table is logically made up of the data being stored and the associated metadata describing the layout of the data in the table. The data typically associated resides in HDFS, although it may reside in any Hadoop file system, including the local file system or s3. Hive stores the metadata in relational databases and not in HDFS.

Each table has a corresponding directory in HDFS and the data is Serialized and stores in files within the directory.

Different types of Tables

Managed Tables
Managed Tables are nothing but when you create the table in hive, by default hive will manage the data, which means that hive controls the lifecycle of the data into its warehouse directory.
Hive stores the data for these tables in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.

hive> CREATE TABLE workshop.sample(key int, value string)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

If you’re not currently working in the target database, then you can prefix the database name. In the above statement we mentioned workshop as database.

If you add the option IF NOT EXISTS, Hive will silently ignore the statement if the table already exists. This is useful in scripts that should create a table the first time they run.

Note: 
When a managed table gets dropped, both the metadata and data get dropped. However, managed tables are less convenient for sharing with other tools.


External Tables
Suppose we have data that is created and used primarily by Pig or other tools, but we want to run some queries against it, but not give Hive ownership of the data. We can define an external table that points to that data, but doesn't take ownership of it.

External table is nothing but when you create the table in hive, the data stored at an existing location outside the warehouse directory.

Create an External table and points the location of the data like

hive>CREATE EXTERNAL TABLE sample (key int, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data/sample';

An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.


Note:
  • External table are virtual tables, not physically shown in HDFS.
  • External table won’t load data; just it points the location of the data.
  • When an external table is dropped, the data associated with it doesn’t get deleted, only the metadata (number of columns, type of columns, terminators, etc.) gets dropped from the Hive Metastore.
  • When you want to do process on other module data, external table are useful. Since External table create a virtual table points to data and process on the data without modifying the original data.


How do you check existing table is managed or external table?

To check that we use describe command like below

describe formatted tablename;

It displays complete meta data of a table. 
You can find row key as Table Type which will display either MANAGED_TABLE OR EXTERNAL_TABLE

Eg:

If it is managed table, you will see
Table Type:             MANAGED_TABLE

If it is external table ,you will see
Table Type:             EXTERNAL_TABLE




1 comment: