Comparing Classification Models: Distance, Bayes, and Fisher Discriminant Analysis

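The dataset contains 10,992 observations and 17 variables. The variable 'V17' is the response; it has 10 levels, corresponding to the digits 0-9. Missing values have already been imputed.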

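The goal of this article is to build distance discriminant, Bayes discriminant, and Fisher discriminant models from the observed values of the predictors 'V1'-'V16' and the response 'V17', and to use them to classify observations whose class is unknown.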

Data Preprocessing

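Because the missing values have already been imputed, we can go straight to modeling. We first split the data into a training set and a test set so that each model's accuracy can be evaluated on data it has not seen.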

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

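# Load the data (the last column is taken to be the response 'V17')
df = pd.read_csv('data.csv')

# Split into training (80%) and test (20%) sets; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df.iloc[:, -1], test_size=0.2, random_state=42)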

Distance Discriminant

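The idea of distance discriminant analysis is to represent each class by a single point, its centroid (the mean vector of its training observations), and to classify an unknown observation by its distance to each class. Concretely, we compute the distance from the observation to every class centroid and assign it to the class whose centroid is nearest.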

# Class labels and the centroid (mean vector) of each class
classes = np.unique(y_train)
centers = np.array([X_train[y_train == c].mean(axis=0) for c in classes])

# Euclidean distance from every test point to every centroid,
# computed with broadcasting; result has shape (n_test, n_classes)
diffs = X_test.values[:, None, :] - centers[None, :, :]
distances = np.linalg.norm(diffs, axis=2)

# Map the index of the nearest centroid back to its class label
y_pred = classes[np.argmin(distances, axis=1)]

# Misclassification rate on the test set
error_rate = np.mean(y_pred != y_test.values)
print('Distance discriminant error rate:', error_rate)
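The nearest-centroid rule can be checked on a tiny example. The sketch below is illustrative only: the function name `nearest_centroid_predict` and the toy data are made up, and the labels 3 and 7 are chosen on purpose so that the position of the nearest centroid differs from the class label it stands for.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_new):
    """Assign each row of X_new the label of the closest class centroid."""
    classes = np.unique(y_train)
    # One centroid (mean vector) per class, stacked row-wise
    centers = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    # Pairwise Euclidean distances, shape (n_new, n_classes)
    distances = np.linalg.norm(X_new[:, None, :] - centers[None, :, :], axis=2)
    # argmin gives a position in `classes`, so map it back to the label
    return classes[np.argmin(distances, axis=1)]

# Hypothetical toy data: two well-separated classes labelled 3 and 7
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
y_toy = np.array([3, 3, 7, 7])
X_query = np.array([[0.5, 0.2], [9.0, 9.5]])

toy_pred = nearest_centroid_predict(X_toy, y_toy, X_query)
print(toy_pred)
```

Returning `classes[...]` rather than the raw `argmin` index keeps the rule correct even when the labels are not the integers 0 to K-1.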

Bayes Discriminant

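The idea of Bayes discriminant analysis is to use Bayes' theorem to compute the posterior probability that an unknown observation belongs to each class, and then to assign the observation to the class with the highest posterior probability.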
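Written out, the rule takes the following form. The notation is an assumption introduced here for illustration: $\pi_k$ is the prior probability of class $k$, $f_k(\boldsymbol{x})$ is the density (or likelihood) of $\boldsymbol{x}$ within class $k$, and $K$ is the number of classes.

$$P(k \mid \boldsymbol{x}) = \frac{\pi_k f_k(\boldsymbol{x})}{\sum_{j=1}^{K} \pi_j f_j(\boldsymbol{x})}, \qquad \hat{k} = \arg\max_{k} \; \pi_k f_k(\boldsymbol{x})$$

The denominator is the same for every class, so comparing the products $\pi_k f_k(\boldsymbol{x})$ is enough to pick the class.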

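from sklearn.naive_bayes import MultinomialNB

# Fit a naive Bayes classifier. MultinomialNB requires non-negative features;
# for continuous features GaussianNB is the more usual choice.
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Misclassification rate on the test set
error_rate = np.mean(y_pred != y_test)
print('Bayes discriminant error rate:', error_rate)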
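To make the role of the prior concrete, here is a minimal sketch that computes posteriors by hand for one feature and two classes. It assumes univariate normal class densities, which is a choice made for this illustration and not the model fitted above; the function name and all numbers are hypothetical.

```python
import numpy as np

def gaussian_posteriors(x, means, stds, priors):
    """Posterior class probabilities for a scalar x under normal class densities."""
    # Class-conditional density f_k(x) for every class k
    densities = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    # Bayes' theorem: posterior is proportional to prior times density
    joint = priors * densities
    return joint / joint.sum()

means = np.array([0.0, 2.0])
stds = np.array([1.0, 1.0])

# x = 1.0 is equally far from both class means, so the densities are equal
post_equal = gaussian_posteriors(1.0, means, stds, np.array([0.5, 0.5]))
post_skewed = gaussian_posteriors(1.0, means, stds, np.array([0.8, 0.2]))
print(post_equal, post_skewed)
```

When the densities tie, the posterior simply reproduces the prior, so the more common class wins.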

Fisher Discriminant

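The idea of Fisher discriminant analysis is to find the direction that maximizes the ratio of between-class scatter to within-class scatter, and to use it to define the decision boundary. Concretely, we compute the mean and covariance matrix of each class and build the Fisher discriminant function from them. With two classes, an unknown observation is plugged into the function and classified by the sign of the result; with more than two classes, as in this dataset, one linear function is computed per class and the observation is assigned to the class with the largest value.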
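The criterion can be stated as follows. The symbols are assumptions introduced for illustration: $\boldsymbol{\mu}_k$ and $n_k$ are the mean and size of class $k$, $\boldsymbol{\mu}$ is the overall mean, $S_w$ is the within-class scatter matrix, and $S_b$ is the between-class scatter matrix.

$$J(\boldsymbol{w}) = \frac{\boldsymbol{w}^T S_b \boldsymbol{w}}{\boldsymbol{w}^T S_w \boldsymbol{w}}, \qquad S_w = \sum_{k}\sum_{i \in k} (\boldsymbol{x}_i - \boldsymbol{\mu}_k)(\boldsymbol{x}_i - \boldsymbol{\mu}_k)^T, \qquad S_b = \sum_{k} n_k (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T$$

For two classes the maximizer has a closed form, up to a scale factor:

$$\boldsymbol{w} \propto S_w^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)$$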

# Class labels, class sizes, class means, and class covariance matrices
classes = np.unique(y_train)
counts = np.array([(y_train == c).sum() for c in classes])
means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
covs = [np.cov(X_train[y_train == c].values.T) for c in classes]

# Pooled within-class covariance matrix (weighted by degrees of freedom)
Sw = sum((n - 1) * S for n, S in zip(counts, covs)) / (counts.sum() - len(classes))
Sw_inv = np.linalg.pinv(Sw)

# One linear discriminant function per class: f_k(x) = w_k^T x + b_k,
# with w_k = Sw^{-1} mu_k and b_k = -0.5 * mu_k^T Sw^{-1} mu_k + log(prior_k)
W = means @ Sw_inv
b = -0.5 * np.sum(W * means, axis=1) + np.log(counts / counts.sum())

# Assign each test point to the class with the largest discriminant value
scores = X_test.values @ W.T + b
y_pred = classes[np.argmax(scores, axis=1)]

# Misclassification rate on the test set
error_rate = np.mean(y_pred != y_test.values)
print('Fisher discriminant error rate:', error_rate)

The Fisher discriminant function is:

$$f(x) = \boldsymbol{w}^T\boldsymbol{x} + b$$

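Here $\boldsymbol{w}$ is the normal vector of the decision boundary and $b$ is the intercept. In the two-class case, the sign of $f(\boldsymbol{x})$ separates the two classes. With $K$ classes, one function $f_k(\boldsymbol{x}) = \boldsymbol{w}_k^T\boldsymbol{x} + b_k$ is evaluated per class and the observation is assigned to the class with the largest value; the two-class function for classes 0 and 1 is then the difference $f_0(\boldsymbol{x}) - f_1(\boldsymbol{x})$.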
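The link between the two-class sign rule and the multi-class "largest value" rule can be seen on a small example. This is a sketch under stated assumptions: the function name `linear_discriminant_scores`, the three class means, the shared covariance matrix, and the equal priors are all hypothetical values chosen for illustration.

```python
import numpy as np

def linear_discriminant_scores(means, Sw, priors, X):
    """Return f_k(x) = w_k^T x + b_k for every row x of X and every class k."""
    Sw_inv = np.linalg.inv(Sw)
    W = means @ Sw_inv                      # row k holds w_k = Sw^{-1} mu_k
    b = -0.5 * np.sum(W * means, axis=1) + np.log(priors)
    return X @ W.T + b

# Hypothetical setup: three classes in two dimensions with a shared covariance
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
Sw = np.array([[1.0, 0.2], [0.2, 1.0]])
priors = np.array([1 / 3, 1 / 3, 1 / 3])
X_query = np.array([[0.5, 0.5], [3.5, 0.2], [0.1, 5.0]])

scores = linear_discriminant_scores(means, Sw, priors, X_query)
toy_pred = np.argmax(scores, axis=1)        # multi-class rule: largest score wins

# Two-class function for classes 0 and 1: positive favors class 0
f01 = scores[:, 0] - scores[:, 1]
print(toy_pred, f01)
```

Each query point lies close to a different class mean, so the three predictions differ, and the sign of `f01` agrees with which of classes 0 and 1 scores higher.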

The complete code is as follows:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load the data (the last column is taken to be the response 'V17')
df = pd.read_csv('data.csv')

# Split into training (80%) and test (20%) sets; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df.iloc[:, -1], test_size=0.2, random_state=42)

# Class labels, class sizes, and class mean vectors, shared by the models below
classes = np.unique(y_train)
counts = np.array([(y_train == c).sum() for c in classes])
means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])

# Distance discriminant
# Euclidean distance from every test point to every class centroid
diffs = X_test.values[:, None, :] - means[None, :, :]
distances = np.linalg.norm(diffs, axis=2)

# Map the index of the nearest centroid back to its class label
y_pred = classes[np.argmin(distances, axis=1)]

# Misclassification rate on the test set
error_rate = np.mean(y_pred != y_test.values)
print('Distance discriminant error rate:', error_rate)

# Bayes discriminant
# Fit a naive Bayes classifier. MultinomialNB requires non-negative features;
# for continuous features GaussianNB is the more usual choice.
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Misclassification rate on the test set
error_rate = np.mean(y_pred != y_test)
print('Bayes discriminant error rate:', error_rate)

# Fisher discriminant
# Pooled within-class covariance matrix (weighted by degrees of freedom)
covs = [np.cov(X_train[y_train == c].values.T) for c in classes]
Sw = sum((n - 1) * S for n, S in zip(counts, covs)) / (counts.sum() - len(classes))
Sw_inv = np.linalg.pinv(Sw)

# One linear discriminant function per class: f_k(x) = w_k^T x + b_k,
# with w_k = Sw^{-1} mu_k and b_k = -0.5 * mu_k^T Sw^{-1} mu_k + log(prior_k)
W = means @ Sw_inv
b = -0.5 * np.sum(W * means, axis=1) + np.log(counts / counts.sum())

# Assign each test point to the class with the largest discriminant value
scores = X_test.values @ W.T + b
y_pred = classes[np.argmax(scores, axis=1)]

# Misclassification rate on the test set
error_rate = np.mean(y_pred != y_test.values)
print('Fisher discriminant error rate:', error_rate)