Tuesday, January 3, 2017

Integrate third-party packages with your Spark application

In this post, I’ll show you how to integrate third-party packages (like spark-avro, spark-csv, spark-redshift, spark-cassandra-connector, hbase) into your Spark application.

Let’s take spark-avro as an example; it allows you to read and write data in the Avro format using Spark.

Different ways to integrate a third-party package with a Spark application

Include the package in your Spark shell/application using --jars
Download the jar file (spark-avro_2.11-3.1.0.jar) from the URL below:
https://spark-packages.org/package/databricks/spark-avro

Launch the spark shell with the jar file:
$SPARK_HOME/bin/spark-shell --jars <DOWNLOAD_PATH>/spark-avro_2.11-3.1.0.jar

Run a Spark application with the jar file:
spark-submit --jars <DOWNLOAD_PATH>/spark-avro_2.11-3.1.0.jar <SPARK_SCRIPT>.jar
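
If your application needs more than one local jar, --jars also accepts a comma-separated list. A minimal sketch, with a placeholder for the second jar:
spark-submit --jars <DOWNLOAD_PATH>/spark-avro_2.11-3.1.0.jar,<DOWNLOAD_PATH>/<OTHER_JAR>.jar <SPARK_SCRIPT>.jar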

Include the package in your Spark shell/application using --packages
Pass the Maven coordinate as an argument to --packages; the package will then be downloaded and made available to your Spark application. To pass multiple packages, list their coordinates separated by commas (see the example after the spark-submit command below).

Launch the spark shell with --packages:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-avro_2.11:3.1.0

Run a Spark application with --packages:
spark-submit --packages com.databricks:spark-avro_2.11:3.1.0 <SPARK_SCRIPT>.jar
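
To pull in more than one package at once, separate the Maven coordinates with commas. For example (the spark-csv coordinate here is only an illustration; substitute whatever packages your application actually needs):
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-avro_2.11:3.1.0,com.databricks:spark-csv_2.11:1.5.0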

Try the following script to test the package:
// import packages 
import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

// Read Avro data 
val df = spark.read.format("com.databricks.spark.avro").load("<INPUT_DIR>")

// Write Avro
df.write.format("com.databricks.spark.avro").save("<OUT_DIR>")

Replace <INPUT_DIR> and <OUT_DIR> with the locations of the Avro data you want to read and write.
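
As a quick sanity check that the package was picked up, you can inspect the DataFrame after reading. A minimal sketch:
// Print the inferred Avro schema and a few rows
df.printSchema()
df.show(5)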

Enjoy Spark!
