BigData PlayBook: Load XML file into Hive Table using xpath

Here is a sample input XML file:

$cat employees.xml
<employee>
<id>1</id>
<name>Satish Kumar</name>
<designation>Technical Lead</designation>
</employee>
<employee>
<id>2</id>
<name>Ramya</name>
<designation>Testing</designation>
</employee>

Step:1 Bring each record to one line, by executing below command

$cat employees.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed 's|</employee>|</employee>\n|g' | grep -v '^\s*$' > employees_records.xml

$cat employees_records.xml

<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>

<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>

Step:2 Load the file to HDFS

$hadoop fs -mkdir /user/hive/sample-xml-inputs

$hadoop fs -put employees_records.xml /user/hive/sample-xml-inputs

$hadoop fs -cat /user/hive/sample-xml-inputs/employees_records.xml
<employee> <id>1</id> <name>Satish Kumar</name><designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>

Step:3 Create a Hive table and point to xml file

hive>create external table xml_table_org( xmldata string) LOCATION '/user/hive/sample-xml-inputs/';

hive> select * from xml_table_org;
OK
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>

Time taken: 0.179 seconds

Step 4: From the stage table we can query the elements and load it to other table.

hive> CREATE TABLE xml_table AS SELECT xpath_int(xmldata,'employee/id'),xpath_string(xmldata,'employee/name'),xpath_string(xmldata,'employee/designation') FROM xml_table_org;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201506301103_0001, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201506301103_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201506301103_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-06-30 11:23:10,969 Stage-1 map = 0%, reduce = 0%
2015-06-30 11:23:18,040 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:19,058 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:20,067 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:21,079 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.95 sec
MapReduce Total cumulative CPU time: 950 msec
Ended Job = job_201506301103_0001
Ended Job = 1716336105, job is filtered out (removed at runtime).
Ended Job = -578743888, job is filtered out (removed at runtime).
Moving data to: hdfs://localhost.localdomain:8020/tmp/hive-cloudera/hive_2015-06-30_11-22-59_851_2446083450544691385-1/-ext-10001
Moving data to: hdfs://localhost.localdomain:8020/user/hive/warehouse/xml_table
chgrp: changing ownership of 'hdfs://localhost.localdomain:8020/user/hive/warehouse/xml_table': User does not belong to hive
Table default.xml_table stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 46, raw_data_size: 0]
2 Rows loaded to hdfs://localhost.localdomain:8020/tmp/hive-cloudera/hive_2015-06-30_11-22-59_851_2446083450544691385-1/-ext-10000
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.95 sec HDFS Read: 435 HDFS Write: 46 SUCCESS
Total MapReduce CPU Time Spent: 950 msec
OK
Time taken: 21.649 seconds

hive> select * from xml_table; OK
1 Satish Kumar Technical Lead
2 Ramya Testing

Time taken: 0.143 seconds

5 comments:

UnknownJuly 4, 2016 at 11:45 PM
Hi Sathish thanks for the much informative article. However I have a scenario where there are many children for one parent and I need to load this into a Hive table. Below is the XML file:

INDIAKarnatakaJharkand
UnitedKingdomEnglandScotland

As you can see, there is only one Parent tag called Country and several child tags called State and I would like to have the output in Hive as mentioned below:

Country State
India Karnataka
India Jharkand
UnitedKingdom England
UnitedKingdom Scotland

Your help would be much appreciated.

Thanks in advance,
Abhishek
UnknownNovember 23, 2016 at 10:27 PM
Satish is very good example,really helpful for our project.

Thanks,
Bhushan
LhitoJuly 21, 2019 at 5:40 PM
Satish is it possible to do the clensing inside the hdfs and not in linux?
UnknownApril 14, 2022 at 11:49 PM
Hi Sathish thanks for the much informative article.I make the hive table using xml in the above mentiond Xpath method.can you please explain if i need to only 1 st column( i.e. id ) or 3 rd column (i.e.designation)how to make a query.

BigData PlayBook

Tuesday, June 30, 2015

Load XML file into Hive Table using xpath

5 comments:

Labels