PySpark RDD to DataFrame: Convert RDD to DataFrame with toDF()
In PySpark, 'rdd.toDF()' converts an RDD (Resilient Distributed Dataset) to a DataFrame. An RDD is the fundamental data structure in Apache Spark: an immutable, distributed collection of objects. Note that 'toDF()' is only available on an RDD once an active SparkSession exists, since the conversion needs a session to build the DataFrame.
Here is an example of how to use 'rdd.toDF()':
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create an RDD
rdd = spark.sparkContext.parallelize([(1, 'Alice', 25), (2, 'Bob', 30), (3, 'Charlie', 35)])
# Convert RDD to DataFrame
df = rdd.toDF(['ID', 'Name', 'Age'])
# Show the DataFrame
df.show()
Output:
+---+-------+---+
| ID| Name|Age|
+---+-------+---+
| 1| Alice| 25|
| 2| Bob| 30|
| 3|Charlie| 35|
+---+-------+---+
In this example, we create an RDD containing tuples representing people's ID, name, and age. We then use 'rdd.toDF(['ID', 'Name', 'Age'])' to convert the RDD to a DataFrame and specify column names. Finally, we show the resulting DataFrame using 'df.show()'.
Source: https://www.cveoy.top/t/topic/o4l3 — copyright belongs to the author. Please do not reproduce or scrape!