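Pandas Data Analysis in Practice: Series, DataFrame Operations, and Data Visualization

I. Objectives

Learn how to create Series and DataFrame objects;
Become familiar with common pandas operations for data cleaning and data analysis;
Learn the basics of plotting with the matplotlib library.

II. Environment

Operating system: Windows;
Python version: 3.8.7

III. Procedure

1. Basic exercises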

import pandas as pd
import numpy as np

# Create the language Series
language = pd.Series(['Python', 'C', 'Scala', 'Java', 'GO', 'Scala', 'SQL', 'PHP', 'Python'])
print(language)

# Create the score Series: one random integer in [60, 100) per language entry
score = pd.Series(np.random.randint(60, 100, len(language)))
print(score)

# Build the DataFrame and print its first 4 rows
df = pd.DataFrame({'language': language, 'score': score})
print(df.head(4))

# Print the rows whose language is Python
print(df[df['language'] == 'Python'])

# Sort by the score column in ascending order
df_sort = df.sort_values(by='score')
print(df_sort)

# Count how many times each programming language appears
print(df['language'].value_counts())

Output:

0    Python
1         C
2     Scala
3      Java
4        GO
5     Scala
6       SQL
7       PHP
8    Python
dtype: object
0    85
1    73
2    90
3    90
4    69
5    99
6    93
7    91
8    86
dtype: int32
  language  score
0   Python     85
1        C     73
2    Scala     90
3     Java     90
  language  score
0   Python     85
8   Python     86
  language  score
4        GO     69
1         C     73
0    Python     85
8    Python     86
3      Java     90
2     Scala     90
7       PHP     91
6       SQL     93
5     Scala     99
Scala     2
Python    2
Java      1
SQL       1
C         1
GO        1
PHP       1
Name: language, dtype: int64
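
The scores above come from np.random.randint, so every run prints different numbers, and the default sort does not promise any particular order for rows with equal scores. A minimal sketch, assuming a fixed seed is acceptable for the exercise (the value 42 is arbitrary and not part of the original experiment) and using a stable merge sort so that tied scores keep their original index order:

```python
import numpy as np
import pandas as pd

language = pd.Series(['Python', 'C', 'Scala', 'Java', 'GO', 'Scala', 'SQL', 'PHP', 'Python'])

# Assumption: seed 42 is arbitrary; any fixed seed makes the run repeatable.
rng = np.random.RandomState(42)
score = pd.Series(rng.randint(60, 100, len(language)))

df = pd.DataFrame({'language': language, 'score': score})

# mergesort is a stable algorithm: rows with equal scores keep their
# original relative order.
df_sort = df.sort_values(by='score', kind='mergesort')
print(df_sort)

# Mean score per language, as a follow-up to value_counts().
mean_by_language = df.groupby('language')['score'].mean()
print(mean_by_language)
```

The seeded scores will not match the numbers printed above, which came from an unseeded run.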

2. Data cleaning exercises

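import pandas as pd

# Read the data
df = pd.read_csv('data.csv')

# Show the first 5 rows
print(df.head())

# Show column types and non-null counts
# (df.info() prints its report itself and returns None, so wrapping it in
# print() also emits a trailing "None")
print(df.info())

# Convert the date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Show summary statistics
print(df.describe())

# Count the missing values in each column
print(df.isnull().sum())

# Fill missing values with the previous valid value (forward fill);
# df.ffill() is equivalent to fillna(method='ffill'), which newer pandas versions deprecate
df = df.ffill()

# Drop duplicate rows
df = df.drop_duplicates()

# Save the cleaned data
df.to_csv('clean_data.csv', index=False)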

Output:

         date    open   close    high     low     vol
0  2020-01-02  2980.0  2964.5  3000.0  2954.5  0.3523
1  2020-01-03  2961.0  2935.5  2961.5  2926.0  0.3248
2  2020-01-06  2926.0  2950.5  2963.0  2918.5  0.3477
3  2020-01-07  2963.0  2933.5  2966.0  2928.5  0.3272
4  2020-01-08  2933.5  2947.5  2960.0  2929.5  0.3113
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245 entries, 0 to 244
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    245 non-null    object 
 1   open    245 non-null    float64
 2   close   245 non-null    float64
 3   high    245 non-null    float64
 4   low     245 non-null    float64
 5   vol     245 non-null    float64
dtypes: float64(5), object(1)
memory usage: 11.6+ KB
None
              open        close         high          low         vol
count   245.000000   245.000000   245.000000   245.000000  245.000000
mean   3409.408163  3401.734694  3426.928571  3386.191837    0.242494
std     386.530299   390.261259   381.024545   393.052891    0.088764
min    2647.000000  2638.500000  2650.500000  2560.000000    0.119100
25%    3083.500000  3074.000000  3100.000000  3058.000000    0.181700
50%    3411.000000  3405.500000  3434.000000  3387.000000    0.216200
75%    3735.000000  3730.500000  3762.500000  3711.000000    0.267100
max    4157.000000  4160.000000  4175.000000  4155.500000    0.645000
date     0
open     0
close    0
high     0
low      0
vol      0
dtype: int64
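
Since isnull().sum() reports zero missing values for every column, the forward fill has nothing to fill on this file. A minimal sketch of what the two cleaning steps do when the data is dirty, using a hypothetical toy frame (its values are made up for illustration and are not taken from data.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two missing closes and, after filling, one duplicated row.
toy = pd.DataFrame({
    'date': ['2020-01-02', '2020-01-03', '2020-01-03', '2020-01-06'],
    'close': [10.0, np.nan, np.nan, 12.0],
})
toy['date'] = pd.to_datetime(toy['date'])

# Missing values per column before cleaning
print(toy.isnull().sum())

# Forward fill: each NaN takes the last valid value above it
cleaned = toy.ffill()

# Rows 1 and 2 are now identical, so one of them is dropped
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Forward fill leaves a NaN in place when the first row of a column is missing, because there is no earlier value to copy.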

3. Data analysis exercises

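import pandas as pd
import matplotlib.pyplot as plt

# Read the cleaned data; parse 'date' as datetime so the x-axis is a real time axis
# (without parse_dates the column is read back from CSV as plain strings)
df = pd.read_csv('clean_data.csv', parse_dates=['date'])

# Line chart of the closing price
plt.plot(df['date'], df['close'])
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.title('Close Price Trend')
plt.show()

# Compute the daily open-to-close change and plot it as a bar chart;
# multiplying by 100 expresses it in percent of the opening price
df['change'] = df['close'] - df['open']
df['pct_change'] = df['change'] / df['open'] * 100
plt.bar(df['date'], df['pct_change'])
plt.xlabel('Date')
plt.ylabel('Percentage Change')
plt.title('Percentage Change of Close Price')
plt.show()

# Scatter plot of closing price against volume
plt.scatter(df['close'], df['vol'])
plt.xlabel('Close Price')
plt.ylabel('Volume')
plt.title('Close Price vs. Volume')
plt.show()

# Print the mean, standard deviation, maximum, and minimum of the closing price
print('Mean:', df['close'].mean())
print('Standard Deviation:', df['close'].std())
print('Maximum:', df['close'].max())
print('Minimum:', df['close'].min())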

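Output:

[Figure: line chart of the closing price]

[Figure: bar chart of the daily change]

[Figure: scatter plot of closing price vs. volume]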

Mean: 3401.734693877551
Standard Deviation: 390.2612594049312
Maximum: 4160.0
Minimum: 2638.5
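
The four summary values can also be computed in a single Series.agg call. A minimal sketch, using only the five rows shown in the head() output of section 2 as sample data, so its numbers differ from the full-file statistics above; the column name close_to_close is introduced here for illustration only:

```python
import pandas as pd

# Sample data: the first five rows printed by df.head() in section 2.
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06',
                            '2020-01-07', '2020-01-08']),
    'open': [2980.0, 2961.0, 2926.0, 2963.0, 2933.5],
    'close': [2964.5, 2935.5, 2950.5, 2933.5, 2947.5],
})

# One call returns a Series indexed by the statistic names.
stats = df['close'].agg(['mean', 'std', 'max', 'min'])
print(stats)

# Open-to-close change relative to the opening price, here in percent.
df['pct_change'] = (df['close'] - df['open']) / df['open'] * 100

# Close-to-close change in percent, for comparison; the first row is NaN
# because it has no previous close.
df['close_to_close'] = df['close'].pct_change() * 100
print(df[['date', 'pct_change', 'close_to_close']])
```

The two change columns measure different things: the first compares each close with the same day's open, the second with the previous day's close.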