Baseline Regression Model Selection for Predicting LD50 Values from AtomPairs2D Molecular Fingerprints
First, split the dataset into 10 equal folds; in each round, use 9 of them as the training set and the remaining 1 as the test set, then average the results across rounds. This is 10-fold cross-validation.
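Before the full implementation, the split itself can be illustrated with scikit-learn's KFold on toy data (a minimal sketch; the array below is a stand-in, not the real fingerprint matrix):

```python
# Illustration of the 10-fold split described above: KFold partitions the
# sample indices into 10 disjoint test folds, so every sample is used for
# testing exactly once and for training in the other 9 rounds.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)  # toy data: 50 samples, 2 features
kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in kf.split(X)]
print(fold_sizes)  # 10 folds of 5 test samples each
```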
The code is as follows:
import numpy as np
from sklearn.model_selection import cross_val_score
# Split the dataset into 10 folds; each round trains on 9 and tests on 1
scores = cross_val_score(estimator, X, y, cv=10, scoring='neg_mean_squared_error')
mse_scores = -scores  # flip the sign: the scorer returns negative MSE
# Compute R2 and the mean MSE across folds
r2_scores = cross_val_score(estimator, X, y, cv=10, scoring='r2')
mean_mse = np.mean(mse_scores)
mean_r2 = np.mean(r2_scores)
print('MSE:', mean_mse)
print('R2:', mean_r2)
Here, X is the input AtomPairs2D molecular fingerprint matrix, y is the LD50 target value, and estimator is the regression model under evaluation, e.g. LinearRegression (LR).
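Since the fingerprint matrix itself is not shown here, the same evaluation can be run end to end on synthetic data (a minimal sketch; make_regression stands in for the real AtomPairs2D features, which would come from a fingerprint toolkit):

```python
# Self-contained sketch of the 10-fold CV evaluation above, with
# synthetic data replacing the real AtomPairs2D fingerprints.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=50, noise=0.1, random_state=0)
estimator = LinearRegression()

mse_scores = -cross_val_score(estimator, X, y, cv=10,
                              scoring='neg_mean_squared_error')
r2_scores = cross_val_score(estimator, X, y, cv=10, scoring='r2')
print('MSE:', np.mean(mse_scores))
print('R2:', np.mean(r2_scores))
```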
Next, train a broad set of baseline regression models (linear models, kernel methods, tree ensembles, boosting, etc.) and select the best-performing one. The code is as follows:
import numpy as np
import xgboost as xgb
from lightgbm import LGBMRegressor
from sklearn import linear_model, svm, tree
from sklearn.linear_model import TweedieRegressor, ElasticNet
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              ExtraTreesRegressor, GradientBoostingRegressor,
                              HistGradientBoostingRegressor, RandomForestRegressor)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
# Kernel for the Gaussian process regressor (was undefined in the original)
kernel = RBF()
# Candidate baseline regressors. Note: LogisticRegression, Perceptron and
# RandomTreesEmbedding from the original list are omitted because they are
# classifiers/transformers and cannot fit a continuous LD50 target.
models = [
    ('LR', linear_model.LinearRegression()),
    ('Ri', linear_model.Ridge(alpha=0.5)),
    ('La', linear_model.Lasso(alpha=0.1)),
    ('LaL', linear_model.LassoLars(alpha=0.1)),
    ('Bay', linear_model.BayesianRidge()),
    ('Tw0', TweedieRegressor(power=0, alpha=0.5, link='log')),
    ('Tw1', TweedieRegressor(power=1, alpha=0.5, link='log')),
    ('Tw2', TweedieRegressor(power=2, alpha=0.5, link='log')),
    ('Tw3', TweedieRegressor(power=3, alpha=0.5, link='log')),
    ('El', ElasticNet(random_state=100)),
    ('SGD', linear_model.SGDRegressor(random_state=100)),
    ('Pa', linear_model.PassiveAggressiveRegressor(random_state=100)),
    ('huber', linear_model.HuberRegressor()),
    ('krr', KernelRidge(alpha=1.0)),
    ('SVR', svm.SVR()),
    ('neigh', KNeighborsRegressor(n_neighbors=2)),
    ('gpr', GaussianProcessRegressor(kernel=kernel, random_state=100)),
    ('DT', tree.DecisionTreeRegressor(random_state=100)),
    ('MLP', MLPRegressor(random_state=100)),
    ('Bag', BaggingRegressor(random_state=100)),  # renamed from duplicate 'Ba'
    ('RF', RandomForestRegressor(random_state=100)),
    ('ET', ExtraTreesRegressor(random_state=100)),
    ('AB', AdaBoostRegressor(random_state=100)),
    ('HGB', HistGradientBoostingRegressor(random_state=100)),
    ('GB', GradientBoostingRegressor(random_state=100)),
    ('XGB', xgb.XGBRegressor(random_state=100)),
    ('LGB', LGBMRegressor(random_state=100))
]
# Standardize and grid-search each regressor
results = []
names = []
for name, model in models:
    pipeline = Pipeline([('scaler', StandardScaler()), (name, model)])
    parameters = {}  # empty grid: baseline models are evaluated with default hyperparameters
    grid_search = GridSearchCV(pipeline, parameters, cv=10, scoring='neg_mean_squared_error')
    grid_search.fit(X, y)
    results.append(grid_search.best_score_)
    names.append(name)
# Pick the best baseline model (argmax of negative MSE = lowest MSE)
best_index = np.argmax(results)
best_name = names[best_index]
best_score = -results[best_index]  # flip the sign back to a positive MSE
print("Best baseline model is:", best_name)
print("Best mean squared error of baseline model is:", best_score)
Here, models is the list of candidate regressors. Each one is wrapped in a Pipeline that standardizes the features, evaluated with 10-fold cross-validated grid search, and the model with the lowest mean squared error is selected as the baseline.
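The selection loop above can be condensed into a runnable sketch with three representative regressors and synthetic data (make_regression again stands in for the real AtomPairs2D fingerprints):

```python
# Condensed, self-contained version of the baseline-selection loop.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=50, noise=0.1, random_state=0)

models = [('Ri', Ridge()),
          ('SVR', SVR()),
          ('RF', RandomForestRegressor(random_state=100))]
results, names = [], []
for name, model in models:
    pipe = Pipeline([('scaler', StandardScaler()), (name, model)])
    score = cross_val_score(pipe, X, y, cv=10,
                            scoring='neg_mean_squared_error').mean()
    results.append(score)
    names.append(name)

best = int(np.argmax(results))     # argmax of negative MSE = lowest MSE
print('Best baseline model:', names[best])
print('Best MSE:', -results[best])  # flip sign back to a positive MSE
```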