Thursday, October 13, 2016

Passing values to a Hive script at runtime to make a reusable script

When working with a data warehouse, it is very common to pass values at runtime. Normally we write a series of Hive queries in a file and run it with the hive -f option from a UNIX shell such as bash, or schedule it with a workflow management system like Oozie or Airflow.

Let's say we want to see the top 10 records of a table. We write the query below to a file, say get_top_records.hql.
$ cat get_top_records.hql
SELECT * FROM CUSTOMERS LIMIT 10;

Run the Hive script using the hive -f option from the UNIX shell:
$ hive -f get_top_records.hql

In the script above, the table name and limit are hard coded. To retrieve the top 10 records of a different table, or to change the limit, we would have to write a new script or modify this one. To make the script reusable, the table name and limit should be passed in when the script is run.

We can achieve this with the help of -hiveconf. We will see how to use -hiveconf to avoid hard coding and make the script reusable.

Change the script above so that the values are parameterised and passed in while running the script.
$ cat get_top_records.hql
SELECT * FROM ${hiveconf:tablename} LIMIT ${hiveconf:limit};

Now we pass the two parameters, tablename and limit, while running the script:
$ hive -f get_top_records.hql -hiveconf tablename=CUSTOMERS -hiveconf limit=5

Now it is a reusable script: you can get the top records of any table by passing in the table name and the record count.
$ hive -f get_top_records.hql -hiveconf tablename=SALES -hiveconf limit=10
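If you run this often, the invocation can be wrapped in a small shell script. The sketch below is a hypothetical wrapper (the script name get_top_records.sh and the helper function build_hive_cmd are my own; it assumes hive is on the PATH and get_top_records.hql is in the current directory):

```shell
#!/bin/bash
# Hypothetical wrapper: take the table name and record count as
# positional arguments and pass them through as -hiveconf values.
# Usage: ./get_top_records.sh CUSTOMERS 10

# Build the hive command line from a table name and a row count.
build_hive_cmd() {
  echo "hive -f get_top_records.hql -hiveconf tablename=$1 -hiveconf limit=$2"
}

# Print the command that would run (a dry run; pipe to sh to execute).
build_hive_cmd "${1:-CUSTOMERS}" "${2:-10}"
```

Printing the command first is just a convenience for checking the substitution; replacing the last line with `$(build_hive_cmd "$1" "$2")` would execute it directly.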

We can also set Hive configuration parameters using -hiveconf. Let's say we need to enable compression and set the compression codec to Snappy:
$ hive -f create_table.hql  -hiveconf hive.exec.compress.output=true -hiveconf mapred.output.compress=true -hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
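If the same settings apply every time the script runs, they can equally be placed at the top of the script itself with SET statements; a sketch of how create_table.hql might begin (these are the same property names used on the command line above):

```sql
-- Equivalent in-script settings; -hiveconf on the command line
-- is only needed when the values should vary per run.
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```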
