Python Machine Learning Code: Selecting the Best Random Forest Estimator and Saving the Model

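The following example shows how to build a random forest model with Python's scikit-learn (sklearn) library, use cross-validated grid search to select the best hyperparameters (including the number of estimators), and then save the best model.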

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import load_iris
import joblib

# Load the iris dataset
iris = load_iris()

# Split the data into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Create the random forest model
rf = RandomForestClassifier(random_state=42)

# Define the hyperparameter search grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 3, 5],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Use 5-fold cross-validation to select the best hyperparameter combination, including the number of estimators
cv = GridSearchCV(rf, param_grid=param_grid, cv=5, n_jobs=-1)
cv.fit(X_train, y_train)

# Print the best hyperparameter combination, including the number of estimators
print('Best hyperparameters:', cv.best_params_)

# Save the best model (GridSearchCV refits it on the full training set by default, refit=True)
best_model = cv.best_estimator_
joblib.dump(best_model, 'best_random_forest_model.pkl')

# Predict on the test set
y_pred = best_model.predict(X_test)

# Compute the accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print('Random forest accuracy on the test set:', accuracy)

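In this example, sklearn's GridSearchCV runs cross-validation over every hyperparameter combination in the grid and picks the combination, including the number of estimators, with the best mean validation score. The best model is then saved to the local file 'best_random_forest_model.pkl' for later use.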
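To see what the search actually compared, the fitted GridSearchCV object exposes the mean cross-validated score of each combination. The following is a minimal sketch, not part of the original example: it assumes a deliberately smaller grid so that it runs quickly, and the names small_grid and search are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Assumption: a reduced grid (4 combinations) chosen only to keep this sketch fast
small_grid = {'n_estimators': [10, 50], 'max_depth': [None, 3]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid=small_grid, cv=5
)
search.fit(X_train, y_train)

# Mean cross-validated accuracy of the winning combination
print('Best CV score:', search.best_score_)

# Mean cross-validated accuracy of every combination that was tried
results = search.cv_results_
for params, mean_score in zip(results['params'], results['mean_test_score']):
    print(params, round(mean_score, 4))
```

Note that best_score_ is measured on the validation folds of the training set, so it is a separate number from the test-set accuracy printed in the main example.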

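Code explanation:

Load the dataset: load_iris() loads the iris dataset.
Split the dataset: train_test_split() splits the data into a training set and a test set.
Create the random forest model: RandomForestClassifier() creates a random forest model.
Define the hyperparameter search grid: the param_grid dictionary lists the candidate values for each hyperparameter.
Cross-validation: GridSearchCV runs the cross-validated search; it receives the random forest model, the search grid, and the number of cross-validation folds (cv=5).
Save the best model: the best model found by GridSearchCV is saved to the local file 'best_random_forest_model.pkl' for later use.
Predict on the test set: the best model predicts labels for the test set.
Compute accuracy: the model's accuracy on the test set is computed.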
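The cost of the search follows directly from the grid. As a small sketch of the arithmetic, sklearn's ParameterGrid can count the combinations in the param_grid used above; the variable names n_combinations and n_folds are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the example above
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 3, 5],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# 3 * 3 * 3 * 3 candidate combinations
n_combinations = len(ParameterGrid(param_grid))
n_folds = 5  # matches cv=5 in the example

print('Combinations:', n_combinations)                    # 81
print('Cross-validation fits:', n_combinations * n_folds)  # 405
```

With the default refit=True, GridSearchCV trains one more model on the full training set after the search, so the example performs 406 fits in total. Adding one more value to any hyperparameter list grows this number multiplicatively.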

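Notes:

n_jobs=-1 uses all available CPU cores to run the cross-validation fits in parallel.
The joblib library used to save the model serializes Python objects efficiently, including machine learning models that hold large NumPy arrays.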
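A saved model can be restored with joblib.load. The sketch below illustrates the save and load round trip under some assumptions: it trains a small stand-in model instead of rerunning the grid search, and it writes to a temporary directory rather than the working directory; the names model, loaded_model, and same are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# Stand-in for cv.best_estimator_ from the main example
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(iris.data, iris.target)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'best_random_forest_model.pkl')
    joblib.dump(model, path)          # save the fitted model
    loaded_model = joblib.load(path)  # restore it from disk

# The restored model should reproduce the original predictions
same = bool((loaded_model.predict(iris.data) == model.predict(iris.data)).all())
print('Predictions match:', same)
```

Because joblib relies on pickle, only load model files from sources you trust, and preferably load them with the same scikit-learn version that saved them.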

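Cross-validated grid search selects the hyperparameters that performed best within the searched grid, and saving the model makes it quick to reload and reuse later without retraining.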