Tuesday, June 30, 2015

Load XML file into Hive Table using xpath

Here is a sample input XML file (note that there is no single root element; each <employee> block is an independent record):

$cat employees.xml
<employee>
<id>1</id>
<name>Satish Kumar</name>
<designation>Technical Lead</designation>
</employee>
<employee>
<id>2</id>
<name>Ramya</name>
<designation>Testing</designation>
</employee>

Step 1: Bring each record onto a single line by executing the command below:


$cat employees.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed 's|</employee>|</employee>\n|g' | grep -v '^\s*$' > employees_records.xml

$cat employees_records.xml
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
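
An equivalent one-liner with GNU awk (a sketch, assuming gawk, whose multi-character record separator is an extension; unlike the tr pipeline above it does not strip '&' characters):

$ awk 'BEGIN { RS="</employee>"; ORS="" } NF { gsub(/\n|\r/, " "); print $0 "</employee>\n" }' employees.xml > employees_records.xml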

Step 2: Load the file into HDFS

$hadoop fs -mkdir /user/hive/sample-xml-inputs

$hadoop fs -put employees_records.xml /user/hive/sample-xml-inputs

$hadoop fs -cat /user/hive/sample-xml-inputs/employees_records.xml
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>

Step 3: Create a Hive staging table pointing to the XML file

hive> create external table xml_table_org (xmldata string) LOCATION '/user/hive/sample-xml-inputs/';

hive> select * from xml_table_org;
OK
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>

Time taken: 0.179 seconds
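
The xpath UDFs can be tried directly against the staging table before building the final table. A quick check (a sketch, assuming Hive's built-in xpath() and xpath_string() functions): xpath() returns an array of all matching text nodes, while xpath_string() returns the first match as a string.

hive> select xpath_string(xmldata, 'employee/name'), xpath(xmldata, 'employee/*/text()') from xml_table_org;

For the first row this should return Satish Kumar and ["1","Satish Kumar","Technical Lead"].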

Step 4: Query the elements from the staging table using the xpath UDFs and load them into another table.

hive> CREATE TABLE xml_table AS
    > SELECT xpath_int(xmldata, 'employee/id') AS id,
    >        xpath_string(xmldata, 'employee/name') AS name,
    >        xpath_string(xmldata, 'employee/designation') AS designation
    > FROM xml_table_org;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201506301103_0001, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201506301103_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_201506301103_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-06-30 11:23:10,969 Stage-1 map = 0%,  reduce = 0%
2015-06-30 11:23:18,040 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:19,058 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:20,067 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:21,079 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 0.95 sec
MapReduce Total cumulative CPU time: 950 msec
Ended Job = job_201506301103_0001
Ended Job = 1716336105, job is filtered out (removed at runtime).
Ended Job = -578743888, job is filtered out (removed at runtime).
Moving data to: hdfs://localhost.localdomain:8020/tmp/hive-cloudera/hive_2015-06-30_11-22-59_851_2446083450544691385-1/-ext-10001
Moving data to: hdfs://localhost.localdomain:8020/user/hive/warehouse/xml_table
chgrp: changing ownership of 'hdfs://localhost.localdomain:8020/user/hive/warehouse/xml_table': User does not belong to hive
Table default.xml_table stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 46, raw_data_size: 0]
2 Rows loaded to hdfs://localhost.localdomain:8020/tmp/hive-cloudera/hive_2015-06-30_11-22-59_851_2446083450544691385-1/-ext-10000
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 0.95 sec   HDFS Read: 435 HDFS Write: 46 SUCCESS
Total MapReduce CPU Time Spent: 950 msec
OK
Time taken: 21.649 seconds

hive> select * from xml_table;
OK
1 Satish Kumar Technical Lead
2 Ramya Testing

Time taken: 0.143 seconds
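
As an alternative to CREATE TABLE AS SELECT, the target table can also be created first with explicit column names and types and then populated with INSERT OVERWRITE. A sketch (the table name xml_table_named is just an example) using the same staging table:

hive> create table xml_table_named (id int, name string, designation string);
hive> insert overwrite table xml_table_named
    > select xpath_int(xmldata, 'employee/id'),
    >        xpath_string(xmldata, 'employee/name'),
    >        xpath_string(xmldata, 'employee/designation')
    > from xml_table_org;

This keeps the schema of the result table under your control rather than letting CTAS derive it from the select expressions.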