In this post, I’ll show you how to integrate third-party packages (like spark-avro, spark-csv, spark-redshift, spark-cassandra-connector, and hbase) into your Spark application.
Let's take spark-avro as an example; it allows you to read and write data in the Avro format using Spark.
Different ways to integrate a third-party package with a Spark application
Include the package in the Spark shell/application using --jars
Download the jar file (spark-avro_2.11-3.1.0.jar) from the URL below:
https://spark-packages.org/package/databricks/spark-avro
Launch the spark shell with the jar file:
$SPARK_HOME/bin/spark-shell --jars <DOWNLOAD_PATH>/spark-avro_2.11-3.1.0.jar
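If you need more than one jar, --jars accepts a comma-separated list. A minimal sketch, where another-library.jar is only a hypothetical placeholder for a second jar:
$SPARK_HOME/bin/spark-shell --jars <DOWNLOAD_PATH>/spark-avro_2.11-3.1.0.jar,<DOWNLOAD_PATH>/another-library.jar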
Run a Spark application with the jar file:
spark-submit --jars <DOWNLOAD_PATH>/spark-avro_2.11-3.1.0.jar <SPARK_SCRIPT>.jar
Include the package in the Spark shell/application using --packages
Pass the Maven coordinate as an argument to --packages; Spark will download the package and its dependencies and make them available to your application. To pass multiple packages, list them with a comma as the separator (an example with two packages is shown below).
Launch the spark shell with --packages:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-avro_2.11:3.1.0
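To pass more than one package, separate the coordinates with commas. For example (spark-csv is used only as an illustrative second coordinate; check the exact version you need):
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-avro_2.11:3.1.0,com.databricks:spark-csv_2.11:1.5.0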
Run a Spark application with --packages:
spark-submit --packages com.databricks:spark-avro_2.11:3.1.0 <SPARK_SCRIPT>.jar
Try the following script to test the package:
// import packages
import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession
// Create a SparkSession (in spark-shell, one named `spark` is already available)
val spark = SparkSession.builder().master("local").getOrCreate()
// Read Avro data
val df = spark.read.format("com.databricks.spark.avro").load("<INPUT_DIR>")
// Write Avro
df.write.format("com.databricks.spark.avro").save("<OUT_DIR>")
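The com.databricks.spark.avro._ import also provides implicit helpers that give a shorter read/write syntax in the spark-avro versions I'm aware of; treat this as a sketch and verify it against the package version you use:
// Shorter syntax via the implicits from com.databricks.spark.avro._ (availability depends on the package version)
val df2 = spark.read.avro("<INPUT_DIR>")
df2.write.avro("<OUT_DIR>")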
Replace <INPUT_DIR> and <OUT_DIR> with the locations of the Avro data you want to read and write.
Enjoy Spark!