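PySpark Interactive Programming: A Dataset Analysis Example
This article uses a grade dataset from a university computer science department to demonstrate interactive programming with PySpark and to work through a series of data analysis tasks.
First, load the dataset into PySpark: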
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('InteractiveProgramming').getOrCreate()
data = spark.read.text('chapter4-data1.txt')
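Every query in this article relies on each line of the file having the form name,course,score. That layout is inferred from the split(value, ',') calls; the file itself is not shown in the source. A minimal plain-Python sketch of the assumed layout, where the sample line and the helper name parse_record are hypothetical:

```python
# Hypothetical sample line; the real contents of chapter4-data1.txt are not shown here.
sample_line = "Tom,DataBase,80"

def parse_record(line):
    """Split one 'name,course,score' line into typed fields (assumed layout)."""
    student, course, score = line.strip().split(",")
    return student, course, int(score)

print(parse_record(sample_line))  # ('Tom', 'DataBase', 80)
```

In the Spark expressions, split(value, ',')[0], [1], and [2] correspond to these three fields, and the score remains a string until it is cast.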
Next, we use PySpark operations to answer each of the following questions.
(1) How many students are there in the department in total
students = data.selectExpr("split(value, ',')[0] as student").distinct()
student_count = students.count()
print('Total number of students: ', student_count)
Output:
Total number of students: 4
(2) How many courses does the department offer
courses = data.selectExpr("split(value, ',')[1] as course").distinct()
course_count = courses.count()
print('Total number of courses: ', course_count)
Output:
Total number of courses: 3
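(3) What is Tom's average score across all his courses
# Cast the score to int so the average is computed on a numeric column
tom_data = data.filter("value LIKE 'Tom,%'").selectExpr("cast(split(value, ',')[2] as int) as score")
tom_avg = tom_data.agg({'score': 'avg'}).collect()[0][0]
print("Tom's average score: ", tom_avg)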
Output:
Tom's average score: 63.333333333333336
(4) How many courses has each student taken
student_course_count = data.selectExpr("split(value, ',')[0] as student", "split(value, ',')[1] as course") \
    .distinct() \
    .groupBy('student') \
    .count()
student_course_count.show()
Output:
+-------+-----+
|student|count|
+-------+-----+
|    Jim|    3|
|    Bob|    2|
|    Tom|    3|
|    Tim|    2|
+-------+-----+
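(5) How many students take the DataBase course
# The leading % lets the pattern match the course field after the student name
db_students = data.filter("value LIKE '%,DataBase,%'").selectExpr("split(value, ',')[0] as student").distinct()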
db_student_count = db_students.count()
print('Number of students taking DataBase: ', db_student_count)
Output:
Number of students taking DataBase: 3
(6) What is the average score of each course
course_avg_score = data.selectExpr("split(value, ',')[1] as course", "cast(split(value, ',')[2] as int) as score") \
    .groupBy('course') \
    .agg({'score': 'avg'})
course_avg_score.show()
Output:
+-------------+----------+
|       course|avg(score)|
+-------------+----------+
|      Algebra|      80.0|
|DataStructure|      73.0|
|     DataBase|      83.0|
+-------------+----------+
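One way to sanity-check a grouped average like this is to recompute it in plain Python on a few lines. The sketch below assumes the same name,course,score layout. The sample records and the function name course_averages are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

def course_averages(lines):
    """Return {course: mean score} for 'name,course,score' lines (assumed layout)."""
    totals = defaultdict(lambda: [0, 0])  # course -> [sum of scores, record count]
    for line in lines:
        _, course, score = line.strip().split(",")
        totals[course][0] += int(score)
        totals[course][1] += 1
    return {course: s / n for course, (s, n) in totals.items()}

# Hypothetical records, not the actual file contents
sample = ["Tom,DataBase,80", "Jim,DataBase,90", "Tom,Algebra,50"]
print(course_averages(sample))  # {'DataBase': 85.0, 'Algebra': 50.0}
```

Running the same function over a handful of lines copied from the real file should reproduce the corresponding rows of the Spark result.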
(7) Use an accumulator to count how many students take the DataBase course
# An accumulator is updated on the executors and read back on the driver
db_student_count_accumulator = spark.sparkContext.accumulator(0)
def count_db_students(row):
    # row[0] is the array [student, course, score]; index 1 is the course name
    fields = row[0]
    if len(fields) > 1 and fields[1] == 'DataBase':
        db_student_count_accumulator.add(1)
data.selectExpr("split(value, ',') as row") \
    .foreach(count_db_students)
print('Number of students taking DataBase: ', db_student_count_accumulator.value)
Output:
Number of students taking DataBase: 3
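Note that the accumulator in (7) counts matching lines, whereas (5) counts distinct student names, so the two agree only when no student appears twice for the same course. A small plain-Python sketch of the difference, using hypothetical records with one deliberate duplicate (the function name database_counts is also made up for illustration):

```python
# Hypothetical records; the second "Tom,DataBase,80" line is a deliberate duplicate.
sample = ["Tom,DataBase,80", "Jim,DataBase,90", "Tom,DataBase,80", "Tom,Algebra,50"]

def database_counts(lines):
    """Return (matching line count, distinct student count) for the DataBase course."""
    names = [line.split(",")[0] for line in lines if line.split(",")[1] == "DataBase"]
    return len(names), len(set(names))

print(database_counts(sample))  # (3, 2)
```

If the file may contain repeated records, deduplicating before the foreach call, for example with distinct(), would bring the accumulator result in line with (5).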