Friday, November 4, 2016

UBER mode in Hadoop2 and its configuration

By default, the ResourceManager allocates a separate container for each mapper and reducer. Uber mode allows the mappers and reducers to run in the same process as the ApplicationMaster.

Jobs running in uber mode are called Uber Jobs. Uber jobs are executed within the ApplicationMaster itself: rather than asking the ResourceManager for separate mapper and reducer containers, the ApplicationMaster runs the map and reduce tasks within its own process and avoids the overhead of launching and communicating with remote containers.

Why go for UBER mode?
If you have a small dataset or want to run MapReduce on a small amount of data, the Uber configuration helps by cutting out the extra time MapReduce normally spends launching and managing separate containers for the map and reduce phases.

Uber mode supports only map-only jobs and jobs with a single reducer.

Configurations to enable jobs to run in UBER Mode
There are four core settings that control Uber Jobs, all configured in mapred-site.xml.

Configuration options for Uber Jobs:

mapreduce.job.ubertask.enable (Default = false)
Whether to enable the small-jobs "ubertask" optimization, which runs "sufficiently small" jobs sequentially within a single JVM. 

mapreduce.job.ubertask.maxmaps (Default = 9)
Threshold value for the number of maps beyond which a job is considered too large for the ubertasking optimization. Users can override this value, but only downward.

mapreduce.job.ubertask.maxreduces (Default = 1)
Threshold value for the number of reduces beyond which a job is considered too large for the ubertasking optimization. 
Note: Currently the code can't support more than one Reducer and will ignore larger values.

mapreduce.job.ubertask.maxbytes (Default = HDFS Block Size)
Threshold value for the number of input bytes beyond which a job is considered too large for the ubertasking optimization.
If no value is specified, dfs.block.size is used as the default. Be sure to specify a default value in mapred-site.xml if the underlying file system is not HDFS.
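
As an illustration, a minimal mapred-site.xml snippet that turns the optimization on while keeping the default thresholds explicit might look like this (the threshold values shown are simply the defaults described above):

<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>

The enable flag can also be passed per job, for example with -D mapreduce.job.ubertask.enable=true on a job that uses ToolRunner. When the optimization kicks in, the job client output reports "running in uber mode : true".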



Friday, October 14, 2016

Saving Hive Query Result to HIVE TABLE or FILESYSTEM

In this post, we will see how to save hive query result to a HIVE TABLE or FILE SYSTEM.

We can achieve this by using the INSERT clause.

Let's say we have a customers table:
hive> select * from customers;
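
For illustration, assume the customers table has a simple schema along the following lines; the exact columns are hypothetical, except for city, which the queries below filter on:

hive> CREATE TABLE customers (id INT, name STRING, city STRING);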

Saving Query Results to a Table
If we want to load the customers who are from city_1 into a table:
hive> INSERT OVERWRITE TABLE customers_city1 SELECT * FROM customers WHERE city='city_1';

hive> select * from customers_city1;

If we don't want to delete the existing data, then use the query below:
hive> INSERT INTO TABLE customers_city1 SELECT * FROM customers WHERE city='city_1';

Note: The destination table must already exist.
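
If the destination table does not exist yet, a quick way to create it with the same schema as the source (a minimal sketch, assuming we simply want to mirror the customers schema) is:

hive> CREATE TABLE customers_city1 LIKE customers;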

Saving Query Results to a FileSystem

If we want to save the query result to the Hadoop filesystem (HDFS), use the query below:
hive> INSERT OVERWRITE DIRECTORY '/home/cloudera/result/customers' SELECT * FROM customers WHERE city='city_1';

Add the LOCAL keyword to save the query result to the local filesystem:
hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/cloudera/customers' SELECT * FROM customers WHERE city='city_1';


Note: The query result is written as text files; by default, columns are separated by '\001' (Ctrl-A).
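
If a friendlier delimiter is needed and you are on Hive 0.11 or later, the directory insert also accepts a ROW FORMAT clause; a sketch reusing the same local directory as above:

hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/cloudera/customers' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM customers WHERE city='city_1';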

Thursday, October 13, 2016

Passing values to a Hive script at runtime to make a reusable script

When we play around with data warehousing, it's very common to pass values at runtime. Normally we write our series of Hive queries in a file and run it using the hive -f option from a UNIX shell (or bash), or schedule it with workflow management systems like Oozie, Airflow, etc.

Let's say we want to see the top 10 records of a table; we write the below query to a file, say get_top_records.hql.
$ cat get_top_records.hql
SELECT * FROM CUSTOMERS LIMIT 10;

Run the Hive script using the hive -f option from the UNIX shell:
$ hive -f get_top_records.hql

In the above script the table name and limit are hard-coded; if you want to retrieve the top 10 records of a different table, or change the limit, you have to write a new script or modify this one. To make the script reusable, the table name and limit should be passed in when the script is run.

We can achieve this with the help of -hiveconf. We will see how to use -hiveconf to avoid hard-coding and make the script reusable.

Change the above script so that these values are parameterized and supplied while running the script.
$ cat get_top_records.hql
SELECT * FROM ${hiveconf:tablename} LIMIT ${hiveconf:limit};

Now we need to pass the two parameters, tablename and limit, while running the script:
$ hive -f get_top_records.hql -hiveconf tablename=CUSTOMERS -hiveconf limit=5

Now it is a reusable script; you can get the top records of any table by passing the table name and the record count:
$ hive -f get_top_records.hql -hiveconf tablename=SALES -hiveconf limit=10

We can also set Hive configuration parameters using -hiveconf. Let's say we need to enable compression and set the compression codec to Snappy:
$ hive -f create_table.hql  -hiveconf hive.exec.compress.output=true -hiveconf mapred.output.compress=true -hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
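
As a side note, newer Hive releases also provide --hivevar for user-defined variables, which keeps them separate from Hive configuration properties; a sketch of the same top-records script rewritten to use it (reusing the get_top_records.hql example from above):

$ cat get_top_records.hql
SELECT * FROM ${hivevar:tablename} LIMIT ${hivevar:limit};

$ hive -f get_top_records.hql --hivevar tablename=CUSTOMERS --hivevar limit=5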