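Pandas Data Analysis in Practice: Series and DataFrame Operations, Data Cleaning, and Visualization
I. Objectives
(1) Learn to create Series and DataFrame objects;
(2) Become familiar with common pandas operations for data cleaning and data analysis;
(3) Learn the basics of plotting with the matplotlib library.
II. Environment
(1) Operating system: Windows;
(2) Python version: 3.8.7
III. Procedure
1. Basic exercises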
import pandas as pd
import numpy as np
# 1. Create a Series
language = pd.Series(['Python', 'C', 'Scala', 'Java', 'GO', 'Scala', 'SQL', 'PHP', 'Python'])
print(language)
# 2. Create a Series of random integers in [60, 100), one per language entry
score = pd.Series(np.random.randint(60, 100, len(language)))
print(score)
# 3. Create a DataFrame
df = pd.DataFrame({'language': language, 'score': score})
print(df)
# 4. Print the first 4 rows
print(df.head(4))
# 5. Print the rows whose language field is Python
print(df[df['language'] == 'Python'])
# 6. Sort by the score field in ascending order
print(df.sort_values('score'))
# 7. Count how many times each programming language appears
print(df['language'].value_counts())
Output (the scores are random, so the values differ from run to run):
0    Python
1         C
2     Scala
3      Java
4        GO
5     Scala
6       SQL
7       PHP
8    Python
dtype: object
0    87
1    75
2    89
3    68
4    84
5    88
6    86
7    99
8    92
dtype: int32
   language  score
0    Python     87
1         C     75
2     Scala     89
3      Java     68
4        GO     84
5     Scala     88
6       SQL     86
7       PHP     99
8    Python     92
   language  score
0    Python     87
1         C     75
2     Scala     89
3      Java     68
   language  score
0    Python     87
8    Python     92
   language  score
3      Java     68
1         C     75
4        GO     84
6       SQL     86
0    Python     87
5     Scala     88
2     Scala     89
8    Python     92
7       PHP     99
Scala     2
Python    2
C         1
SQL       1
GO        1
PHP       1
Java      1
Name: language, dtype: int64
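Because the scores come from np.random.randint, the numbers printed in this exercise change on every run. The following is a minimal sketch of one way to make the exercise repeatable. It assumes that NumPy's Generator API (np.random.default_rng) is an acceptable substitute for the legacy np.random.randint call; the helper name make_scores and the seed value 42 are illustrative choices, not part of the original exercise.

```python
import numpy as np
import pandas as pd

language = pd.Series(['Python', 'C', 'Scala', 'Java', 'GO', 'Scala', 'SQL', 'PHP', 'Python'])


def make_scores(n, seed=42):
    """Return n random integer scores in [60, 100) from a seeded generator."""
    # Hypothetical seed: any fixed integer gives a repeatable sequence.
    rng = np.random.default_rng(seed)
    return pd.Series(rng.integers(60, 100, size=n))


score_a = make_scores(len(language))
score_b = make_scores(len(language))

# The same seed yields the same scores, so the printed tables are repeatable.
print(score_a.equals(score_b))

df = pd.DataFrame({'language': language, 'score': score_a})
# ascending=False sorts from the highest score to the lowest.
print(df.sort_values('score', ascending=False).head(3))
```

With a fixed seed, the filtered, sorted, and counted outputs can be compared between runs, which makes it easier to check each step by hand.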
2. Data cleaning exercises
import pandas as pd
import numpy as np
# 1. Create a DataFrame
df = pd.DataFrame({'name': ['Tom', 'Jerry', 'Mike', 'Zhangsan', 'Lisi', 'Wangwu'],
                   'age': [20, 21, 22, 23, 24, 25],
                   'gender': ['male', 'male', 'female', 'male', 'female', 'male'],
                   'score': [98, 85, 90, 75, 67, 88],
                   'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Wuhan', 'Chengdu'],
                   'nation': ['China', 'China', 'USA', 'China', 'China', 'China']})
# 2. Drop the row whose name is Wangwu
df = df[df['name'] != 'Wangwu']
print(df)
# 3. Rename the city field to location
df = df.rename(columns={'city': 'location'})
print(df)
# 4. Compute the mean score for each nation
print(df.groupby('nation')['score'].mean())
# 5. Map the age field to a new age field using this rule:
# ages 20-22 map to young, ages 23-25 map to middle
# include_lowest=True keeps age 20 in the first bin; without it the bins are (20, 22] and (22, 25]
df['age'] = pd.cut(df['age'], bins=[20, 22, 25], labels=['young', 'middle'], include_lowest=True)
print(df)
Output:
       name  age  gender  score       city  nation
0       Tom   20    male     98    Beijing   China
1     Jerry   21    male     85   Shanghai   China
2      Mike   22  female     90  Guangzhou     USA
3  Zhangsan   23    male     75   Shenzhen   China
4      Lisi   24  female     67      Wuhan   China
       name  age  gender  score   location  nation
0       Tom   20    male     98    Beijing   China
1     Jerry   21    male     85   Shanghai   China
2      Mike   22  female     90  Guangzhou     USA
3  Zhangsan   23    male     75   Shenzhen   China
4      Lisi   24  female     67      Wuhan   China
nation
China    81.25
USA      90.00
Name: score, dtype: float64
       name     age  gender  score   location  nation
0       Tom   young    male     98    Beijing   China
1     Jerry   young    male     85   Shanghai   China
2      Mike   young  female     90  Guangzhou     USA
3  Zhangsan  middle    male     75   Shenzhen   China
4      Lisi  middle  female     67      Wuhan   China
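The age mapping depends on how pd.cut treats bin edges. By default each bin is open on the left and closed on the right, so bins=[20, 22, 25] means (20, 22] and (22, 25], and an age of exactly 20 falls outside both bins. The sketch below compares the default with include_lowest=True on the ages from this exercise; the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([20, 21, 22, 23, 24])
bins = [20, 22, 25]
labels = ['young', 'middle']

# Default: intervals are (20, 22] and (22, 25], so 20 gets no label (NaN).
default_cut = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 20 is 'young'.
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())   # [True, False, False, False, False]
print(inclusive_cut.tolist())        # ['young', 'young', 'young', 'middle', 'middle']
```

An alternative with the same effect is to lower the first edge, for example bins=[19, 22, 25], which keeps the default right-closed intervals.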
3. Data analysis exercises
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# 1. Create a DataFrame
df = pd.DataFrame({'name': ['Tom', 'Jerry', 'Mike', 'Zhangsan', 'Lisi', 'Wangwu'],
                   'age': [20, 21, 22, 23, 24, 25],
                   'gender': ['male', 'male', 'female', 'male', 'female', 'male'],
                   'score': [98, 85, 90, 75, 67, 88],
                   'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Wuhan', 'Chengdu'],
                   'nation': ['China', 'China', 'USA', 'China', 'China', 'China']})
# 2. Plot a histogram of the score distribution
# The bin edges run from 60 to 100 in steps of 5 so that the top score, 98, is included
plt.hist(df['score'], bins=range(60, 105, 5))
plt.xlabel('score')
plt.ylabel('count')
plt.title('score distribution')
plt.show()
# 3. Draw box plots of score grouped by gender
df.boxplot(column='score', by='gender')
plt.xlabel('gender')
plt.ylabel('score')
# Clear the titles that boxplot adds automatically
plt.title('')
plt.suptitle('')
plt.show()
# 4. Plot a line chart of mean score by age
df.groupby('age')['score'].mean().plot(kind='line')
plt.xlabel('age')
plt.ylabel('score')
plt.title('age-score relationship')
plt.show()
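Output: three figures (the score histogram, the box plot of score by gender, and the line chart of score by age).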
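Bin edges decide which scores a histogram can show. range(60, 100, 5) stops at 95, so a score of 98 lies beyond the last edge and is dropped silently, whereas range(60, 105, 5) ends at 100 and covers it. The sketch below checks this with np.histogram, which plt.hist relies on for its binning, using the scores from this exercise; the variable names are illustrative.

```python
import numpy as np

scores = [98, 85, 90, 75, 67, 88]

# Edges 60, 65, ..., 95: the last bin is [90, 95], so 98 is not counted.
short_counts, short_edges = np.histogram(scores, bins=list(range(60, 100, 5)))

# Edges 60, 65, ..., 100: the last bin is [95, 100], so 98 is counted.
full_counts, full_edges = np.histogram(scores, bins=list(range(60, 105, 5)))

print(int(short_counts.sum()), len(scores))  # 5 6
print(full_counts.tolist())                  # [0, 1, 0, 1, 0, 2, 1, 1]
```

Comparing the sum of the counts with the number of records is a quick way to confirm that no value fell outside the bins.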


