PySpark RDD to DataFrame: Convert RDD to DataFrame with toDF()
In PySpark, 'rdd.toDF()' converts an RDD (Resilient Distributed Dataset) to a DataFrame. An RDD is the fundamental data structure in Apache Spark: an immutable, distributed collection of objects. Note that 'toDF()' is only available on an RDD once an active SparkSession exists, since the conversion needs a session to build the DataFrame.
Here is an example of how to use 'rdd.toDF()':
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create an RDD
rdd = spark.sparkContext.parallelize([(1, 'Alice', 25), (2, 'Bob', 30), (3, 'Charlie', 35)])
# Convert RDD to DataFrame
df = rdd.toDF(['ID', 'Name', 'Age'])
# Show the DataFrame
df.show()
Output:
+---+-------+---+
| ID| Name|Age|
+---+-------+---+
| 1| Alice| 25|
| 2| Bob| 30|
| 3|Charlie| 35|
+---+-------+---+
In this example, we create an RDD containing tuples representing people's ID, name, and age. We then use 'rdd.toDF(['ID', 'Name', 'Age'])' to convert the RDD to a DataFrame and specify column names. Finally, we show the resulting DataFrame using 'df.show()'.
Source: https://www.cveoy.top/t/topic/o4l3 — copyright belongs to the author. Please do not reproduce or scrape!