Creating PySpark DataFrames
There are a few ways to manually create PySpark DataFrames:
- createDataFrame
- create_df
- toDF
This post shows how to use each approach and explains when each one is advantageous.
createDataFrame
Here’s how to create a DataFrame with createDataFrame:
df = spark.createDataFrame([("joe", 34), ("luisa", 22)], ["first_name", "age"])
df.show()
+----------+---+
|first_name|age|
+----------+---+
| joe| 34|
| luisa| 22|
+----------+---+
Check out the DataFrame schema with df.printSchema():
root
|-- first_name: string (nullable = true)
|-- age: long (nullable = true)
You can also pass createDataFrame an RDD and a schema to construct DataFrames with more precision:
from pyspark.sql import Row
from pyspark.sql.types import *
rdd = spark.sparkContext.parallelize([
    Row(name='Allie', age=2),
    Row(name='Sara', age=33),
    Row(name='Grace', age=31)])
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), False)])
df = spark.createDataFrame(rdd, schema)
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
Run df.printSchema() to verify the schema is exactly as specified:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = false)
createDataFrame is nice because it allows for terse syntax (with limited schema control) or verbose syntax (with full schema control).
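There’s also a middle ground worth knowing about: createDataFrame accepts a DDL-style schema string, which sets column names and types without the StructType boilerplate (though it doesn’t let you control nullability). A minimal sketch:
df = spark.createDataFrame([("joe", 34), ("luisa", 22)], "first_name string, age int")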
Let’s look at another option that’s not quite as verbose as createDataFrame, but with the same level of fine-grained control.
create_df
The create_df method defined in quinn allows for precise schema definition when creating DataFrames.
from pyspark.sql.types import *
from quinn.extensions import *
df = spark.create_df(
    [("jose", "a"), ("li", "b"), ("sam", "c")],
    [("name", StringType(), True), ("blah", StringType(), True)]
)
df.show()
+----+----+
|name|blah|
+----+----+
|jose| a|
| li| b|
| sam| c|
+----+----+
Run df.printSchema() to confirm the schema is exactly as specified:
root
|-- name: string (nullable = true)
|-- blah: string (nullable = true)
create_df is generally the best option in your test suite.
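Here’s a rough sketch of what that can look like in a pytest-style test. The module-level spark session is assumed to already exist, the withColumn transformation stands in for whatever function you’re actually testing, and the collect() comparison is just one simple way to assert equality:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from quinn.extensions import *
def test_adds_greeting_column():
    # build the input DataFrame with an explicit schema
    source_df = spark.create_df(
        [("jose",), ("li",)],
        [("name", StringType(), True)]
    )
    # stand-in for the transformation under test: append a constant greeting column
    actual_df = source_df.withColumn("greeting", F.lit("hi"))
    # expected result, also built with create_df
    expected_df = spark.create_df(
        [("jose", "hi"), ("li", "hi")],
        [("name", StringType(), True), ("greeting", StringType(), True)]
    )
    assert expected_df.collect() == actual_df.collect()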
See here for more information on testing PySpark code.
toDF
You can also create an RDD and convert it to a DataFrame with toDF:
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([
    Row(name='Allie', age=2),
    Row(name='Sara', age=33),
    Row(name='Grace', age=31)])
df = rdd.toDF()
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
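If the RDD holds plain tuples instead of Row objects, toDF can also take a list of column names; a quick sketch, reusing the same SparkSession:
rdd = spark.sparkContext.parallelize([("Allie", 2), ("Sara", 33), ("Grace", 31)])
df = rdd.toDF(["name", "age"])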
It’s usually easier to use createDataFrame than toDF.
Conclusion
There are multiple ways to manually create PySpark DataFrames.
create_df is the best option when you’re working in a test suite and can easily add an external dependency. For quick experimentation, use createDataFrame.