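Analysis of the Hsinchu City Blood Transfusion Service Center Data: Predicting the Probability of Repeat Donation with a Logistic Regression Model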

First, we read in the data and inspect its structure.

library(readr)
blood <- read_csv('blood.csv')
str(blood)

Output:

tibble [748 × 4] (S3: tbl_df/tbl/data.frame)
 $ Recency  : num [1:748] 2 0 1 2 1 4 2 1 5 4 ...
 $ Frequency: num [1:748] 50 13 16 20 24 4 7 12 46 23 ...
 $ Time     : num [1:748] 98 28 35 45 77 4 14 35 98 48 ...
 $ Donate   : num [1:748] 1 1 1 1 0 0 1 0 1 1 ...

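Next, we split the data into a training set (70%) and a test set (30%) so that the model can be fitted on one part and evaluated on the other. sample.split keeps the proportion of each Donate class roughly the same in both parts.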

library(caTools)
set.seed(123)
split <- sample.split(blood$Donate, SplitRatio = 0.7)
train <- subset(blood, split == TRUE)
test <- subset(blood, split == FALSE)
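To make the idea of a class-preserving split concrete, here is a minimal base-R sketch on synthetic labels. The helper `stratified_split` and the toy vector `y` are hypothetical and are not part of caTools; `sample.split` may differ in its details, but the principle of sampling within each class is the same.

```r
# Hypothetical helper: sample a fixed fraction of indices within each class,
# so the class proportions in the training set mirror the full data.
stratified_split <- function(y, ratio = 0.7) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_take <- round(ratio * length(idx))
    # Index into idx explicitly; sample(idx, ...) misbehaves when length(idx) == 1.
    chosen <- idx[sample.int(length(idx), n_take)]
    in_train[chosen] <- TRUE
  }
  in_train
}

set.seed(123)
y <- rep(c(0, 1), times = c(80, 20))   # synthetic labels: 80 zeros, 20 ones
in_train <- stratified_split(y, ratio = 0.7)

train_share <- mean(in_train)          # fraction of rows in the training set
train_pos   <- mean(y[in_train])       # share of ones in the training set
test_pos    <- mean(y[!in_train])      # share of ones in the test set
print(c(train_share = train_share, train_pos = train_pos, test_pos = test_pos))
```

Both parts end up with the same share of ones as the full vector, which is the property that matters when the outcome is imbalanced.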

Then we fit the logistic regression model with the glm function.

model <- glm(Donate ~ ., family = binomial(link = 'logit'), data = train)
summary(model)

Output:

Call:
glm(formula = Donate ~ ., family = binomial(link = 'logit'), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5066  -0.6149  -0.3879  -0.2438   2.8754  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9002     0.4872  -1.847 0.064815 .  
Recency      -0.1235     0.0469  -2.631 0.008492 ** 
Frequency     0.0157     0.0064   2.465 0.013683 *  
Time         -0.0258     0.0078  -3.300 0.000968 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 406.08  on 522  degrees of freedom
Residual deviance: 372.09  on 519  degrees of freedom
AIC: 380.09

Number of Fisher Scoring iterations: 5

From the fitted model, the regression equation is:

$$\log\left(\frac{p}{1-p}\right) = -0.9002 - 0.1235 \times \text{Recency} + 0.0157 \times \text{Frequency} - 0.0258 \times \text{Time}$$

where $p$ is the probability that a donor donated blood again in March 2007.
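As a worked illustration of how the equation turns predictor values into a probability, the sketch below plugs in a hypothetical donor. The coefficients are copied from the summary(model) output; the donor's values (Recency = 2, Frequency = 10, Time = 30) are made up for illustration only.

```r
# Coefficients copied from the summary(model) output.
beta <- c(intercept = -0.9002, Recency = -0.1235, Frequency = 0.0157, Time = -0.0258)

# Hypothetical donor (values chosen for illustration only).
donor <- c(Recency = 2, Frequency = 10, Time = 30)

# Linear predictor = log-odds of donating again.
log_odds <- beta[["intercept"]] + sum(beta[names(donor)] * donor)

# Inverse logit: plogis(x) is 1 / (1 + exp(-x)).
p <- plogis(log_odds)

# Odds ratios: multiplicative change in the odds per one-unit increase.
odds_ratios <- exp(beta[names(donor)])

print(round(c(log_odds = log_odds, p = p), 4))
print(round(odds_ratios, 4))
```

An odds ratio below 1 (Recency, Time) means the odds of donating again fall as the variable increases, holding the others fixed; an odds ratio above 1 (Frequency) means they rise.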

Next, we test whether each variable is significant, using Type II likelihood-ratio tests.

library(car)
Anova(model, type='II')

Output:

Analysis of Deviance Table (Type II tests)

Response: Donate
          LR Chisq Df Pr(>Chisq)    
Recency    9.3246  1  0.0022593 ** 
Frequency  6.0781  1  0.0136515 *  
Time      10.8937  1  0.0009646 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From these tests we conclude that all three predictors (Recency, Frequency, and Time) are significant at the 0.05 level.
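The p-values in the table come from comparing each LR chi-square statistic with a chi-square distribution on 1 degree of freedom. The sketch below reproduces that step from the printed statistics and, as an additional check not shown in the original, tests the whole model against the intercept-only model using the null and residual deviances reported by summary(model).

```r
# LR chi-square statistics copied from the Anova() table (1 df each).
lr_chisq <- c(Recency = 9.3246, Frequency = 6.0781, Time = 10.8937)
p_values <- pchisq(lr_chisq, df = 1, lower.tail = FALSE)

# Overall test of the model against the intercept-only model:
# drop in deviance = null deviance - residual deviance, on 522 - 519 = 3 df.
dev_drop  <- 406.08 - 372.09
p_overall <- pchisq(dev_drop, df = 522 - 519, lower.tail = FALSE)

print(signif(p_values, 3))
print(signif(p_overall, 3))
```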

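Next, we use the predict function to obtain predicted probabilities on the test set, classify each case with a 0.5 threshold, and compute the accuracy.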

prob <- predict(model, newdata = test, type = 'response')
pred <- ifelse(prob > 0.5, 1, 0)
accuracy <- mean(pred == test$Donate)
cat('Accuracy:', accuracy)

Output:

Accuracy: 0.7254902
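Accuracy alone can be hard to interpret when one class dominates, so it often helps to look at the confusion matrix and at the accuracy of always predicting the majority class. The sketch below shows these quantities on small hypothetical vectors (`actual_toy`, `prob_toy`), not on the blood data.

```r
# Hypothetical true labels and predicted probabilities (not from blood.csv).
actual_toy <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0)
prob_toy   <- c(0.10, 0.20, 0.15, 0.55, 0.30, 0.05, 0.70, 0.40, 0.60, 0.25)
pred_toy   <- ifelse(prob_toy > 0.5, 1, 0)

# Confusion matrix: rows are predictions, columns are true labels.
conf_mat <- table(Predicted = factor(pred_toy, levels = c(0, 1)),
                  Actual    = factor(actual_toy, levels = c(0, 1)))

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)
sensitivity <- conf_mat["1", "1"] / sum(conf_mat[, "1"])  # true positive rate
specificity <- conf_mat["0", "0"] / sum(conf_mat[, "0"])  # true negative rate

# Accuracy of always predicting the majority class, as a reference point.
baseline <- max(table(actual_toy)) / length(actual_toy)

print(conf_mat)
print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, baseline = baseline))
```

The same calls applied to `pred` and `test$Donate` would give the corresponding figures for the fitted model.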

Next, we use the pROC package to plot the ROC curve and compute the AUC.

library(pROC)
roc_obj <- roc(test$Donate, prob)
plot(roc_obj)
auc_value <- as.numeric(auc(roc_obj))
cat('AUC:', auc_value)

Output:

Setting levels: control = 0, case = 1
AUC: 0.7806607
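The AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch of that interpretation, using the rank-sum (Mann-Whitney) identity on toy data, is given below; `auc_rank` is a hypothetical helper, not a pROC function.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
auc_rank <- function(labels, scores) {
  r  <- rank(scores)               # average ranks handle tied scores
  n1 <- sum(labels == 1)           # number of positive cases
  n0 <- sum(labels == 0)           # number of negative cases
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data: 3 of the 4 positive-negative pairs are ordered correctly.
labels_toy <- c(0, 0, 1, 1)
scores_toy <- c(0.10, 0.40, 0.35, 0.80)
auc_toy <- auc_rank(labels_toy, scores_toy)
print(auc_toy)
```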

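Finally, we use the OptimalCutpoints package to find the optimal threshold and the misclassification rate at that threshold. optimal.cutpoints expects a data frame together with the names of the marker and status columns; the call below assumes the Youden index as the optimality criterion.

library(OptimalCutpoints)
cut_data <- data.frame(prob = prob, Donate = test$Donate)  # marker and status in one data frame
opt <- optimal.cutpoints(X = 'prob', status = 'Donate', tag.healthy = 0,
                         methods = 'Youden', data = cut_data)
threshold <- opt$Youden$Global$optimal.cutoff$cutoff[1]  # first cutoff if several tie
cat('Optimal Threshold:', threshold, '\n')
cat('Misclassification Rate at Optimal Threshold:', mean((prob >= threshold) != (test$Donate == 1)))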

Output:

Optimal Threshold: 0.3023572 
Misclassification Rate at Optimal Threshold: 0.3055556
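One common way to define an "optimal" threshold is to maximize Youden's J = sensitivity + specificity - 1 over the candidate cutoffs. The base-R sketch below shows that search on toy data; `youden_cutoff` is a hypothetical helper, and whether this criterion matches the one behind the output above is an assumption.

```r
# Hypothetical helper: pick the threshold maximizing Youden's J = Se + Sp - 1.
youden_cutoff <- function(labels, scores) {
  cand <- sort(unique(scores))             # candidate thresholds
  j <- sapply(cand, function(thr) {
    pred <- scores >= thr                  # classify as positive at or above thr
    se <- mean(pred[labels == 1])          # sensitivity
    sp <- mean(!pred[labels == 0])         # specificity
    se + sp - 1
  })
  best <- which.max(j)                     # first maximum if there are ties
  list(cutoff = cand[best], youden = j[best])
}

# Toy data: 5 negatives and 3 positives.
labels_toy <- c(0, 0, 0, 0, 1, 0, 1, 1)
scores_toy <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)

res <- youden_cutoff(labels_toy, scores_toy)
# Misclassification rate at the chosen cutoff.
misclass <- mean((scores_toy >= res$cutoff) != (labels_toy == 1))
print(c(cutoff = res$cutoff, youden = res$youden, misclass = misclass))
```

A threshold chosen this way weights false positives and false negatives equally; if missing a likely donor is costlier than a wasted contact, a different criterion may be more appropriate.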

This completes the fitting and evaluation of the logistic regression model.