Wednesday, January 25, 2017

Resilient Distributed Dataset (RDD)

In this post, I am going to provide a brief introduction to Spark RDDs, the different ways to create them, and the different operations you can perform on them.

What is an RDD?
Resilient
i.e., fault-tolerant: if the data in memory (or on a node) is lost, it can be recreated with the help of the lineage graph.
Distributed
data is chunked into partitions and stored in memory across the cluster.
Dataset
initial data can come from a file or be created programmatically.

Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. An RDD is an immutable (it does not change once created), partitioned collection of objects. Each RDD is split into multiple partitions (similar to InputSplits in MapReduce), which allows in-memory computations to be performed on large clusters in a fault-tolerant manner; all Spark operations are performed on RDDs.

Additional traits:
  • An RDD does not actually contain the data itself; it just describes the pipeline for computing it.
  • Lazy evaluated - the data inside an RDD is not available or transformed until an action is executed, which triggers the execution (see the sketch after this list).
  • Cacheable - RDDs can be cached in memory and reused across parallel operations.
  • In-memory computing - by reusing intermediate in-memory results across multiple data-intensive workloads, RDDs let iterative algorithms (such as machine learning and graph computations) and ad-hoc queries run efficiently on the same dataset, with no need to copy large amounts of data over the network.
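
To illustrate lazy evaluation and caching, here is a minimal sketch you can run later in the Spark shell (introduced below, where sc is predefined); the numbers are just an example, and nothing is computed until the action at the end runs:

// Minimal sketch of lazy evaluation and caching (assumes the Spark shell's sc).
val numbers = sc.parallelize(1 to 10)      // just records the data source; nothing runs yet
val doubled = numbers.map(_ * 2)           // transformation: only extends the pipeline
doubled.cache()                            // mark the RDD to be kept in memory once computed
doubled.collect()                          // action: triggers the computation and
                                           // returns Array(2, 4, 6, ..., 20) to the driver
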
Creating RDDs
RDDs can be created in two different ways:
  • Hadoop datasets:
    • Referencing an external dataset in any external storage system supported by Hadoop (e.g., local file system, Amazon S3, HBase, Cassandra, or any data source offering a Hadoop InputFormat) in the driver program.
  • Parallelized collections:
    • By parallelizing a collection of Scala objects (such as a List).
Before moving further, let's open the Spark shell by switching to Spark's home directory and typing the following command. It will launch a Scala shell and load the SparkContext as sc.

$ ./bin/spark-shell

Now you can start Spark programming in Scala.

Creating an RDD from a Collection Object
You can create an RDD from an existing Scala collection object (such as an Array, List, or Tuple) by calling SparkContext's parallelize method on the collection in the driver program:

scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:29

In the above program, I first created a variable holding an array of 5 elements, and then created an RDD named 'distData' (uniquely identified inside the SparkContext by an id) by calling SparkContext's 'parallelize' method on the array.
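
parallelize also accepts an optional second argument that controls how many partitions the data is split into. The following is a minimal sketch rather than a captured shell transcript, and the partition count of 4 is just an example:

// Minimal sketch: split the same array into 4 partitions (the count is an example).
val distData4 = sc.parallelize(data, 4)
distData4.partitions.size   // returns 4 - the number of partitions backing the RDD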

You can view the content of any RDD by using the 'collect' method. Let's see the content of distData by typing the command below:

scala> distData.collect()
res0: Array[Int] = Array(1, 2, 3, 4, 5)

Each RDD is uniquely identified by an id inside its SparkContext. To view it:
scala> distData.id
res1: Int = 0

An RDD can optionally have a friendly name accessible via its name field; let's see how to set it:
scala> distData.name = "Sample Data"
distData.name: String = Sample Data

scala> distData.name
res3: String = Sample Data


Creating an RDD from External Sources
You can create an RDD from an external dataset in any storage system supported by Hadoop (e.g., local file system, Amazon S3, HBase, Cassandra, or any data source offering a Hadoop InputFormat) in the driver program.

Let's create an RDD by loading a file:
scala> val lines = sc.textFile("sample.txt")

In the above program, we created an RDD using SparkContext's textFile method. This method takes a URI for the file, which can be a local path (file://), an HDFS path (hdfs://), or an S3 path (s3://).
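
Note that textFile is also lazy: the file is only read when an action runs on the RDD. A minimal sketch, assuming sample.txt exists in the working directory:

// Assumes sample.txt exists; these actions trigger the actual read of the file.
lines.count()   // action: returns the number of lines in the file
lines.first()   // action: returns the first line of the file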

Note: RDDs reside in a SparkContext, and the SparkContext creates a logical boundary: RDDs can't be shared between SparkContexts.

Operations on RDD
Spark supports two kinds of operations on RDDs:

  • Transformations - which create a new dataset from an existing one
  • Actions - which return a value to the driver program after performing the computation on the dataset.
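
As a quick preview of the distinction, here is a minimal sketch using the distData RDD created earlier (the particular operations are just examples):

// Transformations build a new RDD lazily; actions trigger computation and return a value.
val evens = distData.filter(_ % 2 == 0)   // transformation: nothing runs yet
evens.reduce(_ + _)                       // action: computes 2 + 4 = 6 and returns it to the driver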

In the next post, we will play with RDDs by applying different operations. If you have any questions or doubts, feel free to post them in the comments section.

Happy Learning :)
