Saturday, January 14, 2017

Setup Environment for Spark Development on Windows

In this post, I am going to show you how to set up Spark in standalone mode on Windows, without installing Hadoop.

Step 1: Install JDK (Java Development Kit)
Download JDK 7 or later from http://www.oracle.com/technetwork/java/javase/downloads/index.html and note the installation path.

Step 2: Download Apache Spark
Download a pre-built version of the Apache Spark archive from https://spark.apache.org/downloads.html. Extract the downloaded archive and note the extraction path (for example, C:\dev_tools\spark).

Step 3: Download winutils.exe for Hadoop
Even though we are not using Hadoop, Spark on Windows throws the error 'Failed to locate the winutils binary in the hadoop binary path' when this binary is missing. So download winutils.exe and place it into a bin folder (for example, C:\dev_tools\winutils\bin\winutils.exe).

Note: the winutils.exe binary varies with the Hadoop version and OS. If the one you downloaded does not work on your system, find a matching build from the winutils repository and use that instead.

Step 4: Create Environment Variables
Open Control Panel -> System and Security -> Click on 'Advanced System Settings' -> Click on 'Environment Variables' button.
Add the following new USER variables:
JAVA_HOME   <JAVA_INSTALLED_PATH>  (for example, C:\Program Files\Java\jdk1.8.0_101)
SPARK_HOME  <SPARK_EXTRACTED_PATH> (for example, C:\dev_tools\spark)
HADOOP_HOME <WINUTILS_PATH>        (for example, C:\dev_tools\winutils)
Note that HADOOP_HOME must point to the folder that contains bin\winutils.exe, not to the bin folder itself.
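If you prefer the command line, the same variables can be set from a command prompt with setx. This is a minimal sketch assuming the example paths used in this post; adjust them to your actual install locations:

```bat
:: setx writes the values into your user environment;
:: open a NEW command prompt afterwards for them to take effect
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_101"
setx SPARK_HOME "C:\dev_tools\spark"
setx HADOOP_HOME "C:\dev_tools\winutils"
```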

Step 5: Update the PATH Variable
Add the following paths to your PATH user variable:
%SPARK_HOME%\bin
%JAVA_HOME%\bin
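Once PATH is updated, you can open a new command prompt and quickly check that the tools resolve (assuming the variables from the previous steps are set):

```bat
:: should print the installed Java version
java -version
:: should print the path to spark-shell under %SPARK_HOME%\bin
where spark-shell
```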

Step 6: Now Test it out!
1. Open a command prompt in administrator mode.
2. Change to the directory where you extracted Spark (i.e., C:\dev_tools\spark).
3. Make sure there is a text file to play with, such as README.md.
4. Type spark-shell to launch the Spark shell.
5. Execute the following statements:
val rdd = sc.textFile("README.md")
rdd.count()
You should get the number of lines in that file.
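Once the line count works, you can extend the same example a little, still inside spark-shell (sc is the SparkContext that the shell creates for you):

```scala
// build an RDD of lines, count them, then count only lines mentioning "Spark"
val rdd = sc.textFile("README.md")
rdd.count()                                          // total number of lines
rdd.filter(line => line.contains("Spark")).count()   // lines containing "Spark"
```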

Congratulations, your setup is done and you have successfully run your first Spark program :)

Enjoy Spark!
