Use PySpark to compute the following programmatically:
1. How many students are in the department?
2. How many courses does the department offer?
3. What is Tom's average score across his courses?
4. How many courses has each student taken?
5. How many students are enrolled in the DataBase course?
6. What is the average score for each course?

Assume we have the following data:
students = [("Tom", "Math", 80), ("Tom", "DataBase", 90), ("Tom", "Java", 70),
("Jerry", "Math", 90), ("Jerry", "Python", 85), ("Jerry", "Java", 75),
("Lucy", "Math", 95), ("Lucy", "Java", 85), ("Lucy", "DataBase", 80),
("John", "Math", 70), ("John", "Python", 75), ("John", "DataBase", 85)]
We can compute the answers with the following code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
spark = SparkSession.builder.appName("example").getOrCreate()
# Create the DataFrame
df = spark.createDataFrame(students, ["name", "course", "score"])
# (1) How many students are in the department
num_students = df.select("name").distinct().count()
print("The department has {} students.".format(num_students))
# (2) How many courses does the department offer
num_courses = df.select("course").distinct().count()
print("The department has {} courses.".format(num_courses))
# (3) Tom's average score across his courses
tom_avg_score = df.filter(df.name == "Tom").agg(avg("score")).collect()[0][0]
print("Tom's average score is {}.".format(tom_avg_score))
# (4) Number of courses taken by each student
num_courses_per_student = df.groupBy("name").count().orderBy("name")
num_courses_per_student.show()
# (5) How many students are enrolled in the DataBase course
num_students_taking_database = df.filter(df.course == "DataBase").select("name").distinct().count()
print("There are {} students taking the DataBase course.".format(num_students_taking_database))
# (6) Average score for each course
avg_score_per_course = df.groupBy("course").agg(avg("score")).orderBy("course")
avg_score_per_course.show()
The output is as follows:
The department has 4 students.
The department has 4 courses.
Tom's average score is 80.0.
+-----+-----+
| name|count|
+-----+-----+
|Jerry|    3|
| John|    3|
| Lucy|    3|
|  Tom|    3|
+-----+-----+
There are 3 students taking the DataBase course.
+--------+-----------------+
|  course|       avg(score)|
+--------+-----------------+
|DataBase|             85.0|
|    Java|76.66666666666667|
|    Math|            83.75|
|  Python|             80.0|
+--------+-----------------+
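As a quick sanity check that requires no Spark installation, the same six statistics can be recomputed in plain Python over the same list. The variable names below are just for illustration; the numbers agree with the Spark output above (note that DataBase has 3 enrollments and Python averages 80.0):

```python
from collections import defaultdict

students = [("Tom", "Math", 80), ("Tom", "DataBase", 90), ("Tom", "Java", 70),
            ("Jerry", "Math", 90), ("Jerry", "Python", 85), ("Jerry", "Java", 75),
            ("Lucy", "Math", 95), ("Lucy", "Java", 85), ("Lucy", "DataBase", 80),
            ("John", "Math", 70), ("John", "Python", 75), ("John", "DataBase", 85)]

# (1) Distinct student names
num_students = len({name for name, _, _ in students})               # 4

# (2) Distinct course names
num_courses = len({course for _, course, _ in students})            # 4

# (3) Tom's average score
tom_scores = [s for name, _, s in students if name == "Tom"]
tom_avg = sum(tom_scores) / len(tom_scores)                         # 80.0

# (4) Number of courses per student
per_student = defaultdict(int)
for name, _, _ in students:
    per_student[name] += 1                                          # 3 for everyone

# (5) Distinct students enrolled in DataBase
db_students = len({name for name, course, _ in students
                   if course == "DataBase"})                        # 3

# (6) Average score per course
scores_by_course = defaultdict(list)
for _, course, s in students:
    scores_by_course[course].append(s)
avg_per_course = {c: sum(v) / len(v) for c, v in scores_by_course.items()}
# DataBase: 85.0, Java: 76.66..., Math: 83.75, Python: 80.0
```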