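Pandas Data Analysis in Practice: Series and DataFrame Operations and Data Visualization
I. Objectives
- Learn how to create Series and DataFrame objects;
- Become familiar with common pandas operations for data cleaning and data analysis;
- Learn the basics of plotting with the matplotlib library.
II. Environment
- Operating system: Windows;
- Python version: 3.8.7.
III. Procedure
1. Basic exercises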
import pandas as pd
import numpy as np
# Create the language Series
language = pd.Series(['Python', 'C', 'Scala', 'Java', 'GO', 'Scala', 'SQL', 'PHP', 'Python'])
print(language)
# Create the score Series (random integers in [60, 100), so the values differ between runs)
score = pd.Series(np.random.randint(60, 100, len(language)))
print(score)
# Build the DataFrame and show its first 4 rows
df = pd.DataFrame({'language': language, 'score': score})
print(df.head(4))
# Select the rows where language is 'Python'
print(df[df['language'] == 'Python'])
# Sort by the score column in ascending order
df_sort = df.sort_values(by='score')
print(df_sort)
# Count how many times each programming language appears
print(df['language'].value_counts())
Output:
0    Python
1         C
2     Scala
3      Java
4        GO
5     Scala
6       SQL
7       PHP
8    Python
dtype: object
0    85
1    73
2    90
3    90
4    69
5    99
6    93
7    91
8    86
dtype: int32
  language  score
0   Python     85
1        C     73
2    Scala     90
3     Java     90
  language  score
0   Python     85
8   Python     86
  language  score
4       GO     69
1        C     73
0   Python     85
8   Python     86
3     Java     90
2    Scala     90
7      PHP     91
6      SQL     93
5    Scala     99
Scala     2
Python    2
Java      1
SQL       1
C         1
GO        1
PHP       1
Name: language, dtype: int64
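Because the scores come from np.random.randint, the numbers above change on every run, and rows with equal scores (such as the two 90s) may come out in either order under the default sort. The following is a minimal sketch of one way to make the exercise repeatable. The seed value 42 is an arbitrary assumption, and kind='mergesort' is swapped in here only because it is a stable sort; neither is part of the original exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical seed: any fixed integer makes the random scores repeatable.
rng = np.random.default_rng(42)

language = pd.Series(['Python', 'C', 'Scala', 'Java', 'GO', 'Scala', 'SQL', 'PHP', 'Python'])
# integers() excludes the upper bound, like np.random.randint(60, 100, ...)
score = pd.Series(rng.integers(60, 100, len(language)))
df = pd.DataFrame({'language': language, 'score': score})

# A stable sort keeps rows with equal scores in their original index order.
df_sort = df.sort_values(by='score', kind='mergesort')
print(df_sort)
```

With a fixed seed, the printed table should be identical between runs on the same NumPy version, which makes the report's output easier to check against the code.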
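2. Data cleaning exercise
# Read the data
df = pd.read_csv('data.csv')
# Show the first 5 rows
print(df.head())
# Show column types and non-null counts
# (info() prints its report itself and returns None, so print() also emits "None")
print(df.info())
# Convert the date column to datetime
df['date'] = pd.to_datetime(df['date'])
# Show summary statistics
print(df.describe())
# Count missing values in each column
print(df.isnull().sum())
# Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas versions)
df = df.ffill()
# Drop duplicate rows
df.drop_duplicates(inplace=True)
# Save the cleaned data
df.to_csv('clean_data.csv', index=False)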
Output:
         date    open   close    high     low     vol
0  2020-01-02  2980.0  2964.5  3000.0  2954.5  0.3523
1  2020-01-03  2961.0  2935.5  2961.5  2926.0  0.3248
2  2020-01-06  2926.0  2950.5  2963.0  2918.5  0.3477
3  2020-01-07  2963.0  2933.5  2966.0  2928.5  0.3272
4  2020-01-08  2933.5  2947.5  2960.0  2929.5  0.3113
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245 entries, 0 to 244
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   date    245 non-null    object
 1   open    245 non-null    float64
 2   close   245 non-null    float64
 3   high    245 non-null    float64
 4   low     245 non-null    float64
 5   vol     245 non-null    float64
dtypes: float64(5), object(1)
memory usage: 11.6+ KB
None
              open        close         high          low         vol
count   245.000000   245.000000   245.000000   245.000000  245.000000
mean   3409.408163  3401.734694  3426.928571  3386.191837    0.242494
std     386.530299   390.261259   381.024545   393.052891    0.088764
min    2647.000000  2638.500000  2650.500000  2560.000000    0.119100
25%    3083.500000  3074.000000  3100.000000  3058.000000    0.181700
50%    3411.000000  3405.500000  3434.000000  3387.000000    0.216200
75%    3735.000000  3730.500000  3762.500000  3711.000000    0.267100
max    4157.000000  4160.000000  4175.000000  4155.500000    0.645000
date     0
open     0
close    0
high     0
low      0
vol      0
dtype: int64
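The missing-value counts above are all zero, so the forward fill in this exercise has nothing to do on this dataset. To see what the cleaning steps do when the data is actually dirty, here is a small sketch on a made-up frame. The name raw and its four rows are hypothetical and exist only for illustration; they are not taken from data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing close price and one duplicated row.
raw = pd.DataFrame({
    'date': ['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-06'],
    'close': [2964.5, np.nan, 2950.5, 2950.5],
})
raw['date'] = pd.to_datetime(raw['date'])

print(raw.isnull().sum())      # close has 1 missing value
print(raw.duplicated().sum())  # 1 row repeats an earlier row

# Forward fill copies the last valid value downward; then duplicates are dropped.
clean = raw.ffill().drop_duplicates().reset_index(drop=True)
print(clean)
```

Forward fill suits ordered time series like this one because it never uses information from a later date, but it cannot fill a gap in the very first row.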
3. Data analysis exercise
import matplotlib.pyplot as plt
# Read the cleaned data, parsing the date column as datetime so the x-axis is a real time axis
df = pd.read_csv('clean_data.csv', parse_dates=['date'])
# Plot the closing price as a line chart
plt.plot(df['date'], df['close'])
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.title('Close Price Trend')
plt.show()
# Compute the intraday change (close minus open) as a percentage of the open, then plot it as a bar chart
df['change'] = df['close'] - df['open']
df['pct_change'] = df['change'] / df['open'] * 100
plt.bar(df['date'], df['pct_change'])
plt.xlabel('Date')
plt.ylabel('Percentage Change (%)')
plt.title('Intraday Percentage Change (Close vs. Open)')
plt.show()
# Plot closing price against volume as a scatter plot
plt.scatter(df['close'], df['vol'])
plt.xlabel('Close Price')
plt.ylabel('Volume')
plt.title('Close Price vs. Volume')
plt.show()
# Compute and print the mean, standard deviation, maximum, and minimum of the closing price
print('Mean:', df['close'].mean())
print('Standard Deviation:', df['close'].std())
print('Maximum:', df['close'].max())
print('Minimum:', df['close'].min())
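Output:
[Figure: line chart of the closing price by date]
[Figure: bar chart of the daily change by date]
[Figure: scatter plot of closing price against volume]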
Mean: 3401.734693877551
Standard Deviation: 390.2612594049312
Maximum: 4160.0
Minimum: 2638.5
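The bar chart in this section measures each day's change as (close - open) relative to that day's open. A different and also common measure compares each close with the previous day's close, which pandas provides as Series.pct_change(). The sketch below puts the two side by side, using the first five rows of the sample output from the data cleaning exercise; the column names intraday_pct and daily_pct are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# First five rows of the sample data printed in the data cleaning exercise.
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06',
                            '2020-01-07', '2020-01-08']),
    'open': [2980.0, 2961.0, 2926.0, 2963.0, 2933.5],
    'close': [2964.5, 2935.5, 2950.5, 2933.5, 2947.5],
})

# Intraday change: close relative to the same day's open, in percent.
df['intraday_pct'] = (df['close'] - df['open']) / df['open'] * 100
# Day-over-day change: close relative to the previous close, in percent.
# The first row has no previous close, so it is NaN.
df['daily_pct'] = df['close'].pct_change() * 100

print(df.round(3))
```

The two columns can differ in sign on the same day, for example when the price opens well below the previous close and then recovers, so it is worth stating which definition a chart uses.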