Wednesday, January 4, 2017

Create a DataFrame from a list of tuples using PySpark

In this post I explain how to create a DataFrame from a list of tuples in PySpark. I am using Python 2 for scripting and Spark 2.0.1.

Create a list of tuples
listOfTuples = [(101, "Satish", 2012, "Bangalore"),
                (102, "Ramya", 2013, "Bangalore"),
                (103, "Teja", 2014, "Bangalore"),
                (104, "Kumar", 2012, "Hyderabad")]

Create a DataFrame from listOfTuples
# spark is the SparkSession that the pyspark shell creates automatically
df = spark.createDataFrame(listOfTuples, ["id", "name", "year", "city"])
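
The column names are matched to the tuple fields by position, so each tuple should carry one value per name. Here is a minimal plain-Python sketch of that pairing, with no Spark involved; the names columns and records are introduced here only for illustration and are not part of the PySpark API.

```python
# Plain-Python illustration of how names pair with tuple fields by position.
listOfTuples = [(101, "Satish", 2012, "Bangalore"),
                (102, "Ramya", 2013, "Bangalore"),
                (103, "Teja", 2014, "Bangalore"),
                (104, "Kumar", 2012, "Hyderabad")]
columns = ["id", "name", "year", "city"]  # hypothetical helper name

# Every tuple should have exactly one value per column name.
assert all(len(t) == len(columns) for t in listOfTuples)

# Pair each name with the value at the same position in the tuple.
records = [dict(zip(columns, t)) for t in listOfTuples]
print(records[0]["name"])  # Satish
```

A check like this before calling createDataFrame can catch a tuple with a missing or extra field early.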

Check the schema
df.printSchema()
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- year: long (nullable = true)
 |-- city: string (nullable = true)

Print data
df.show()
+---+------+----+---------+
| id|  name|year|     city|
+---+------+----+---------+
|101|Satish|2012|Bangalore|
|102| Ramya|2013|Bangalore|
|103|  Teja|2014|Bangalore|
|104| Kumar|2012|Hyderabad|
+---+------+----+---------+

Enjoy Spark!
