Tuesday, June 30, 2015

Load XML file into Hive Table using xpath

Here is a sample input XML file: 

$cat employees.xml
<employee>
<id>1</id>
<name>Satish Kumar</name>
<designation>Technical Lead</designation>
</employee>
<employee>
<id>2</id>
<name>Ramya</name>
<designation>Testing</designation>
</employee>

Step:1 Bring each record to one line, by executing below command


$cat employees.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed 's|</employee>|</employee>\n|g' | grep -v '^\s*$' > employees_records.xml

$cat employees_records.xml
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>

Step:2 Load the file to HDFS

$hadoop fs -mkdir /user/hive/sample-xml-inputs

$hadoop fs -put employees_records.xml /user/hive/sample-xml-inputs

$hadoop fs -cat /user/hive/sample-xml-inputs/employees_records.xml
<employee> <id>1</id> <name>Satish Kumar</name><designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>

Step:3 Create a Hive table and point to xml file

hive>create external table xml_table_org( xmldata string) LOCATION '/user/hive/sample-xml-inputs/';

hive> select * from xml_table_org;
OK
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>

Time taken: 0.179 seconds

Step 4: From the stage table we can query the elements and load it to other table.

hive> CREATE TABLE xml_table AS SELECT xpath_int(xmldata,'employee/id'),xpath_string(xmldata,'employee/name'),xpath_string(xmldata,'employee/designation') FROM xml_table_org;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201506301103_0001, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201506301103_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_201506301103_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-06-30 11:23:10,969 Stage-1 map = 0%,  reduce = 0%
2015-06-30 11:23:18,040 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:19,058 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:20,067 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.95 sec
2015-06-30 11:23:21,079 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 0.95 sec
MapReduce Total cumulative CPU time: 950 msec
Ended Job = job_201506301103_0001
Ended Job = 1716336105, job is filtered out (removed at runtime).
Ended Job = -578743888, job is filtered out (removed at runtime).
Moving data to: hdfs://localhost.localdomain:8020/tmp/hive-cloudera/hive_2015-06-30_11-22-59_851_2446083450544691385-1/-ext-10001
Moving data to: hdfs://localhost.localdomain:8020/user/hive/warehouse/xml_table
chgrp: changing ownership of 'hdfs://localhost.localdomain:8020/user/hive/warehouse/xml_table': User does not belong to hive
Table default.xml_table stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 46, raw_data_size: 0]
2 Rows loaded to hdfs://localhost.localdomain:8020/tmp/hive-cloudera/hive_2015-06-30_11-22-59_851_2446083450544691385-1/-ext-10000
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 0.95 sec   HDFS Read: 435 HDFS Write: 46 SUCCESS
Total MapReduce CPU Time Spent: 950 msec
OK
Time taken: 21.649 seconds

hive> select * from xml_table;                                       OK
1 Satish Kumar Technical Lead
2 Ramya Testing

Time taken: 0.143 seconds


5 comments:

  1. Hi Sathish thanks for the much informative article. However I have a scenario where there are many children for one parent and I need to load this into a Hive table. Below is the XML file:


    INDIAKarnatakaJharkand
    UnitedKingdomEnglandScotland

    As you can see, there is only one Parent tag called Country and several child tags called State and I would like to have the output in Hive as mentioned below:

    Country State
    India Karnataka
    India Jharkand
    UnitedKingdom England
    UnitedKingdom Scotland

    Your help would be much appreciated.

    Thanks in advance,
    Abhishek

    ReplyDelete
    Replies
    1. Hi sathish kumar,

      Really thank you, It is very good example and helped me alot.

      Thanks,
      Swathi

      Delete
  2. Satish is very good example,really helpful for our project.

    Thanks,
    Bhushan

    ReplyDelete
  3. Satish is it possible to do the clensing inside the hdfs and not in linux?

    ReplyDelete
  4. Hi Sathish thanks for the much informative article.I make the hive table using xml in the above mentiond Xpath method.can you please explain if i need to only 1 st column( i.e. id ) or 3 rd column (i.e.designation)how to make a query.

    ReplyDelete