Here is a sample input XML file:
$cat employees.xml
<employee>
<id>1</id>
<name>Satish Kumar</name>
<designation>Technical Lead</designation>
</employee>
<employee>
<id>2</id>
<name>Ramya</name>
<designation>Testing</designation>
</employee>
Step:1 Bring each record to one line, by executing below command
$cat employees_records.xml
Step:2 Load the file to HDFS
$hadoop fs -mkdir /user/hive/sample-xml-inputs
$hadoop fs -put employees_records.xml /user/hive/sample-xml-inputs
$hadoop fs -cat /user/hive/sample-xml-inputs/employees_records.xml
<employee> <id>1</id> <name>Satish Kumar</name><designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
Step:3 Create a Hive table and point to xml file
hive>create external table xml_table_org( xmldata string) LOCATION '/user/hive/sample-xml-inputs/';
hive> select * from xml_table_org;
OK
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
Time taken: 0.179 seconds
Step 4: From the stage table we can query the elements and load it to other table.
hive> CREATE TABLE xml_table AS SELECT xpath_int(xmldata,'employee/id'),xpath_string(xmldata,'employee/name'),xpath_string(xmldata,'employee/designation') FROM xml_table_org;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201506301103_0001, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201506301103_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201506301103_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-06-30 11:23:10,969 Stage-1 map = 0%, reduce = 0%
2015-06-30 11:23:18,040 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:19,058 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:20,067 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:21,079 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.95 sec
MapReduce Total cumulative CPU time: 950 msec
Ended Job = job_201506301103_0001
Ended Job = 1716336105, job is filtered out (removed at runtime).
Ended Job = -578743888, job is filtered out (removed at runtime).
Moving data to: hdfs://localhost.localdomain:8020/tmp/hive-cloudera/hive_2015-06-30_11-22-59_851_2446083450544691385-1/-ext-10001
Moving data to: hdfs://localhost.localdomain:8020/user/hive/warehouse/xml_table
chgrp: changing ownership of 'hdfs://localhost.localdomain:8020/user/hive/warehouse/xml_table': User does not belong to hive
Table default.xml_table stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 46, raw_data_size: 0]
2 Rows loaded to hdfs://localhost.localdomain:8020/tmp/hive-cloudera/hive_2015-06-30_11-22-59_851_2446083450544691385-1/-ext-10000
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.95 sec HDFS Read: 435 HDFS Write: 46 SUCCESS
Total MapReduce CPU Time Spent: 950 msec
OK
Time taken: 21.649 seconds
hive> select * from xml_table; OK
1 Satish Kumar Technical Lead
2 Ramya Testing
Time taken: 0.143 seconds
$cat employees.xml
<employee>
<id>1</id>
<name>Satish Kumar</name>
<designation>Technical Lead</designation>
</employee>
<employee>
<id>2</id>
<name>Ramya</name>
<designation>Testing</designation>
</employee>
Step:1 Bring each record to one line, by executing below command
$cat employees.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed 's|</employee>|</employee>\n|g' | grep -v '^\s*$' > employees_records.xml
$cat employees_records.xml
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
$hadoop fs -mkdir /user/hive/sample-xml-inputs
$hadoop fs -put employees_records.xml /user/hive/sample-xml-inputs
$hadoop fs -cat /user/hive/sample-xml-inputs/employees_records.xml
<employee> <id>1</id> <name>Satish Kumar</name><designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
Step:3 Create a Hive table and point to xml file
hive>create external table xml_table_org( xmldata string) LOCATION '/user/hive/sample-xml-inputs/';
hive> select * from xml_table_org;
OK
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
Time taken: 0.179 seconds
Step 4: From the stage table we can query the elements and load it to other table.
hive> CREATE TABLE xml_table AS SELECT xpath_int(xmldata,'employee/id'),xpath_string(xmldata,'employee/name'),xpath_string(xmldata,'employee/designation') FROM xml_table_org;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201506301103_0001, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201506301103_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201506301103_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-06-30 11:23:10,969 Stage-1 map = 0%, reduce = 0%
2015-06-30 11:23:18,040 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:19,058 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:20,067 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:21,079 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.95 sec
MapReduce Total cumulative CPU time: 950 msec
Ended Job = job_201506301103_0001
Ended Job = 1716336105, job is filtered out (removed at runtime).
Ended Job = -578743888, job is filtered out (removed at runtime).
Moving data to: hdfs://localhost.localdomain:8020/tmp/hive-cloudera/hive_2015-06-30_11-22-59_851_2446083450544691385-1/-ext-10001
Moving data to: hdfs://localhost.localdomain:8020/user/hive/warehouse/xml_table
chgrp: changing ownership of 'hdfs://localhost.localdomain:8020/user/hive/warehouse/xml_table': User does not belong to hive
Table default.xml_table stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 46, raw_data_size: 0]
2 Rows loaded to hdfs://localhost.localdomain:8020/tmp/hive-cloudera/hive_2015-06-30_11-22-59_851_2446083450544691385-1/-ext-10000
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.95 sec HDFS Read: 435 HDFS Write: 46 SUCCESS
Total MapReduce CPU Time Spent: 950 msec
OK
Time taken: 21.649 seconds
hive> select * from xml_table; OK
1 Satish Kumar Technical Lead
2 Ramya Testing
Time taken: 0.143 seconds
Hi Sathish thanks for the much informative article. However I have a scenario where there are many children for one parent and I need to load this into a Hive table. Below is the XML file:
ReplyDeleteINDIAKarnatakaJharkand
UnitedKingdomEnglandScotland
As you can see, there is only one Parent tag called Country and several child tags called State and I would like to have the output in Hive as mentioned below:
Country State
India Karnataka
India Jharkand
UnitedKingdom England
UnitedKingdom Scotland
Your help would be much appreciated.
Thanks in advance,
Abhishek
Hi sathish kumar,
DeleteReally thank you, It is very good example and helped me alot.
Thanks,
Swathi
Satish is very good example,really helpful for our project.
ReplyDeleteThanks,
Bhushan
Satish is it possible to do the clensing inside the hdfs and not in linux?
ReplyDeleteHi Sathish thanks for the much informative article.I make the hive table using xml in the above mentiond Xpath method.can you please explain if i need to only 1 st column( i.e. id ) or 3 rd column (i.e.designation)how to make a query.
ReplyDelete