First, split the dataset into 10 equal folds; in each round, use 9 folds as the training set and the remaining fold as the test set, then average the results across rounds. This is 10-fold cross-validation.

The code is as follows:

import numpy as np
from sklearn.model_selection import cross_val_score

# Split the data into 10 folds; each round uses 9 folds for training and 1 for testing
scores = cross_val_score(estimator, X, y, cv=10, scoring='neg_mean_squared_error')
mse_scores = -scores  # scikit-learn returns negated MSE, so flip the sign

# Compute R2 and the mean of both metrics across the 10 folds
r2_scores = cross_val_score(estimator, X, y, cv=10, scoring='r2')
mean_mse = np.mean(mse_scores)
mean_r2 = np.mean(r2_scores)

print('MSE:', mean_mse)
print('R2:', mean_r2)

Here, X is the input AtomPairs2D molecular fingerprint matrix, y is the LD50 target value, and estimator is the regression model to evaluate, e.g. LinearRegression (LR).
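The workflow above can be tried end to end on synthetic data. The example below is a minimal sketch: it uses scikit-learn's `make_regression` as a stand-in for the real fingerprint matrix and LD50 values, since the actual AtomPairs2D data is not shown here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the fingerprint matrix and LD50 target:
# 200 "molecules" with 100 features each (shapes are illustrative only)
X, y = make_regression(n_samples=200, n_features=100, noise=10.0, random_state=0)

estimator = LinearRegression()

# 10-fold cross-validation; negate the scores to recover MSE
mse_scores = -cross_val_score(estimator, X, y, cv=10, scoring='neg_mean_squared_error')
r2_scores = cross_val_score(estimator, X, y, cv=10, scoring='r2')

print('Mean MSE:', mse_scores.mean())
print('Mean R2:', r2_scores.mean())
```

Each call returns one score per fold (10 values here); averaging them gives the cross-validated estimate of generalization error.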

Next, train baseline versions of 30 machine-learning regression models and select the best-performing baseline. The code is as follows:

import numpy as np
import xgboost as xgb
from lightgbm import LGBMRegressor
from sklearn import linear_model, svm, tree
from sklearn.linear_model import TweedieRegressor, ElasticNet
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, HistGradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Define the candidate regression models
models = [
    ('LR', linear_model.LinearRegression()),
    ('Ri', linear_model.Ridge(alpha=0.5)),
    ('La', linear_model.Lasso(alpha=0.1)),
    ('LaL', linear_model.LassoLars(alpha=0.1)),
    ('Ba', linear_model.BayesianRidge()),
    ('Tw0', TweedieRegressor(power=0, alpha=0.5, link='log')),
    ('Tw1', TweedieRegressor(power=1, alpha=0.5, link='log')),
    ('Tw2', TweedieRegressor(power=2, alpha=0.5, link='log')),
    ('Tw3', TweedieRegressor(power=3, alpha=0.5, link='log')),
    ('El', ElasticNet(random_state=100)),
    ('ARD', linear_model.ARDRegression()),  # replaces LogisticRegression, which is a classifier
    ('SGD', linear_model.SGDRegressor(random_state=100)),
    ('TS', linear_model.TheilSenRegressor(random_state=100)),  # replaces Perceptron, which is a classifier
    ('Pa', linear_model.PassiveAggressiveRegressor(random_state=100)),
    ('huber', linear_model.HuberRegressor()),
    ('krr', KernelRidge(alpha=1.0)),
    ('SVR', svm.SVR()),
    ('neigh', KNeighborsRegressor(n_neighbors=2)),
    ('gpr', GaussianProcessRegressor(random_state=100)),  # uses the default RBF kernel
    ('DT', tree.DecisionTreeRegressor()),
    ('MLP', MLPRegressor(random_state=100)),
    ('Bag', BaggingRegressor(random_state=100)),  # renamed to avoid clashing with 'Ba' above
    ('RF', RandomForestRegressor(random_state=100)),
    ('ET', ExtraTreesRegressor(random_state=100)),
    ('ETS', tree.ExtraTreeRegressor(random_state=100)),  # replaces RandomTreesEmbedding, a transformer with no predict()
    ('AB', AdaBoostRegressor(random_state=100)),
    ('HGB', HistGradientBoostingRegressor()),
    ('GB', GradientBoostingRegressor(random_state=100)),
    ('XGB', xgb.XGBRegressor(random_state=100)),
    ('LGB', LGBMRegressor(random_state=100))
]

# Standardize features and run a grid search for each model
results = []
names = []
for name, model in models:
    pipeline = Pipeline([('scaler', StandardScaler()), (name, model)])
    parameters = {}  # empty grid: cross-validates each model at its default hyperparameters
    grid_search = GridSearchCV(pipeline, parameters, cv=10, scoring='neg_mean_squared_error')
    grid_search.fit(X, y)
    results.append(grid_search.best_score_)
    names.append(name)

# Find the best baseline model (scores are negated MSE, so larger is better)
best_index = np.argmax(results)
best_name = names[best_index]
best_score = results[best_index]

print("Best baseline model is:", best_name)
print("Best mean squared error of baseline model is:", -best_score)

Here, models is the list of candidate regression models, 30 machine-learning regressors in total. Each model is wrapped in a Pipeline that standardizes the features before fitting, evaluated with GridSearchCV, and the best baseline regression model is then selected.
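Note that the empty `parameters = {}` dict above means GridSearchCV only cross-validates each model's default settings. To actually search hyperparameters for a pipeline step, the grid keys use the `<step_name>__<param>` convention. The following is a minimal sketch for the Ridge model (the `alpha` values shown are illustrative, not taken from the original):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with the real fingerprint matrix and LD50 values
X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

# Pipeline step named 'Ri', so grid keys are prefixed with 'Ri__'
pipeline = Pipeline([('scaler', StandardScaler()), ('Ri', Ridge())])
parameters = {'Ri__alpha': [0.1, 0.5, 1.0, 10.0]}

grid_search = GridSearchCV(pipeline, parameters, cv=10, scoring='neg_mean_squared_error')
grid_search.fit(X, y)

print('Best alpha:', grid_search.best_params_['Ri__alpha'])
print('Best CV MSE:', -grid_search.best_score_)
```

With a non-empty grid, `best_score_` reflects the best hyperparameter combination for that model rather than just its defaults, which makes the comparison between baselines more meaningful.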

Baseline regression model selection for predicting LD50 values from AtomPairs2D molecular fingerprints

Original source: https://www.cveoy.top/t/topic/ncJ2 Copyright belongs to the author. Do not reproduce or scrape!
