First, split the dataset into 10 equal folds; in each round, use 9 folds as the training set and the remaining fold as the test set, then average the results across rounds. This is 10-fold cross-validation.

The code is as follows:

import numpy as np
from sklearn.model_selection import cross_val_score

# Split the data into 10 folds; each round uses 9 folds for training and 1 for testing
scores = cross_val_score(estimator, X, y, cv=10, scoring='neg_mean_squared_error')
mse_scores = -scores  # scikit-learn returns negated MSE, so flip the sign

# Compute R2 and the mean of both metrics across the 10 folds
r2_scores = cross_val_score(estimator, X, y, cv=10, scoring='r2')
mean_mse = np.mean(mse_scores)
mean_r2 = np.mean(r2_scores)

print('MSE:', mean_mse)
print('R2:', mean_r2)

Here, X is the input AtomPairs2D molecular fingerprint matrix, y is the LD50 target value, and estimator is the regression model to evaluate, e.g. LinearRegression (LR).
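The workflow above can be tried end to end on synthetic data. The example below is a minimal sketch: it uses scikit-learn's `make_regression` as a stand-in for the real fingerprint matrix and LD50 values, since the actual AtomPairs2D data is not shown here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the fingerprint matrix and LD50 target:
# 200 "molecules" with 100 features each (shapes are illustrative only)
X, y = make_regression(n_samples=200, n_features=100, noise=10.0, random_state=0)

estimator = LinearRegression()

# 10-fold cross-validation; negate the scores to recover MSE
mse_scores = -cross_val_score(estimator, X, y, cv=10, scoring='neg_mean_squared_error')
r2_scores = cross_val_score(estimator, X, y, cv=10, scoring='r2')

print('Mean MSE:', mse_scores.mean())
print('Mean R2:', r2_scores.mean())
```

Each call returns one score per fold (10 values here); averaging them gives the cross-validated estimate of generalization error.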

Next, train baseline versions of 30 machine-learning regression models and select the best-performing baseline. The code is as follows:

import numpy as np
import xgboost as xgb
from lightgbm import LGBMRegressor
from sklearn import linear_model, svm, tree
from sklearn.linear_model import TweedieRegressor, ElasticNet
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, HistGradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Define the candidate regression models
models = [
    ('LR', linear_model.LinearRegression()),
    ('Ri', linear_model.Ridge(alpha=0.5)),
    ('La', linear_model.Lasso(alpha=0.1)),
    ('LaL', linear_model.LassoLars(alpha=0.1)),
    ('Ba', linear_model.BayesianRidge()),
    ('Tw0', TweedieRegressor(power=0, alpha=0.5, link='log')),
    ('Tw1', TweedieRegressor(power=1, alpha=0.5, link='log')),
    ('Tw2', TweedieRegressor(power=2, alpha=0.5, link='log')),
    ('Tw3', TweedieRegressor(power=3, alpha=0.5, link='log')),
    ('El', ElasticNet(random_state=100)),
    ('ARD', linear_model.ARDRegression()),  # replaces LogisticRegression, which is a classifier
    ('SGD', linear_model.SGDRegressor(random_state=100)),
    ('TS', linear_model.TheilSenRegressor(random_state=100)),  # replaces Perceptron, which is a classifier
    ('Pa', linear_model.PassiveAggressiveRegressor(random_state=100)),
    ('huber', linear_model.HuberRegressor()),
    ('krr', KernelRidge(alpha=1.0)),
    ('SVR', svm.SVR()),
    ('neigh', KNeighborsRegressor(n_neighbors=2)),
    ('gpr', GaussianProcessRegressor(random_state=100)),  # uses the default RBF kernel
    ('DT', tree.DecisionTreeRegressor()),
    ('MLP', MLPRegressor(random_state=100)),
    ('Bag', BaggingRegressor(random_state=100)),  # renamed to avoid clashing with 'Ba' above
    ('RF', RandomForestRegressor(random_state=100)),
    ('ET', ExtraTreesRegressor(random_state=100)),
    ('ETS', tree.ExtraTreeRegressor(random_state=100)),  # replaces RandomTreesEmbedding, a transformer with no predict()
    ('AB', AdaBoostRegressor(random_state=100)),
    ('HGB', HistGradientBoostingRegressor()),
    ('GB', GradientBoostingRegressor(random_state=100)),
    ('XGB', xgb.XGBRegressor(random_state=100)),
    ('LGB', LGBMRegressor(random_state=100))
]

# Standardize features and run a grid search for each model
results = []
names = []
for name, model in models:
    pipeline = Pipeline([('scaler', StandardScaler()), (name, model)])
    parameters = {}  # empty grid: cross-validates each model at its default hyperparameters
    grid_search = GridSearchCV(pipeline, parameters, cv=10, scoring='neg_mean_squared_error')
    grid_search.fit(X, y)
    results.append(grid_search.best_score_)
    names.append(name)

# Find the best baseline model (scores are negated MSE, so larger is better)
best_index = np.argmax(results)
best_name = names[best_index]
best_score = results[best_index]

print("Best baseline model is:", best_name)
print("Best mean squared error of baseline model is:", -best_score)

Here, models is the list of candidate regression models, 30 machine-learning regressors in total. Each model is wrapped in a Pipeline that standardizes the features before fitting, evaluated with GridSearchCV, and the best baseline regression model is then selected.
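Note that the empty `parameters = {}` dict above means GridSearchCV only cross-validates each model's default settings. To actually search hyperparameters for a pipeline step, the grid keys use the `<step_name>__<param>` convention. The following is a minimal sketch for the Ridge model (the `alpha` values shown are illustrative, not taken from the original):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with the real fingerprint matrix and LD50 values
X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

# Pipeline step named 'Ri', so grid keys are prefixed with 'Ri__'
pipeline = Pipeline([('scaler', StandardScaler()), ('Ri', Ridge())])
parameters = {'Ri__alpha': [0.1, 0.5, 1.0, 10.0]}

grid_search = GridSearchCV(pipeline, parameters, cv=10, scoring='neg_mean_squared_error')
grid_search.fit(X, y)

print('Best alpha:', grid_search.best_params_['Ri__alpha'])
print('Best CV MSE:', -grid_search.best_score_)
```

With a non-empty grid, `best_score_` reflects the best hyperparameter combination for that model rather than just its defaults, which makes the comparison between baselines more meaningful.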

Baseline regression model selection for predicting LD50 values from AtomPairs2D molecular fingerprints

Original source: https://www.cveoy.top/t/topic/ncJ2 Copyright belongs to the author. Do not reproduce or scrape!
