Building a Neural Network in Python to Predict Disease Status from Gene Expression

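This project uses Python and the PyTorch framework to build a neural network that predicts whether a patient has the disease from gene expression data. The model is a binary classifier. It is fitted on a training set and evaluated on a separate test set during training (a hold-out check), and it outputs a disease probability for every sample.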

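Data preparation

Read the Excel spreadsheets:
- The first row is the header: the patient status flag 'state' (1 = diseased, 0 = normal) followed by 8 gene names.
- Column 0 holds the ground-truth disease status; the remaining columns hold the gene expression values.
Training set path: 'C:\Users\lenovo\Desktop\HIV\PAH三个数据集\selected_genes.xlsx'
Test set path: 'C:\Users\lenovo\Desktop\HIV\PAH三个数据集\GSE53408 对应lasso基因.xlsx'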
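
The layout described above can be checked before any tensors are built. The following is a minimal sketch, not the project's own code: the gene names, the expression values, and the helper split_features_labels are hypothetical and exist only to mimic a sheet with a 'state' column followed by 8 gene columns.

```python
import pandas as pd

# Hypothetical gene names and values, used only to mimic the described layout.
genes = [f"gene_{i}" for i in range(1, 9)]
rows = [
    [1, 5.1, 3.2, 7.7, 2.0, 4.4, 6.1, 1.9, 3.3],
    [0, 4.8, 3.0, 6.9, 2.4, 4.1, 5.8, 2.2, 3.0],
    [1, 5.5, 3.6, 8.1, 1.8, 4.9, 6.4, 1.7, 3.9],
]
df = pd.DataFrame(rows, columns=["state"] + genes)


def split_features_labels(frame):
    """Return (features, labels) after checking the expected layout."""
    if frame.columns[0] != "state":
        raise ValueError("expected the first column to be 'state'")
    labels = frame.iloc[:, 0]
    if not labels.isin([0, 1]).all():
        raise ValueError("'state' must contain only 0 or 1")
    features = frame.iloc[:, 1:]
    if features.shape[1] != 8:
        raise ValueError("expected 8 gene columns")
    return features, labels


X, y = split_features_labels(df)
print(X.shape, y.tolist())
```

Running the same check on both spreadsheets would catch a mismatched column order or a missing gene before training starts.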

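Model construction

Model type: binary classifier that predicts whether a patient has the disease.
Hidden layers: 1 hidden layer with 4 neurons. Both the hidden layer and the output layer use a sigmoid activation.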
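
To give a sense of how small this network is, the number of trainable parameters can be worked out from the layer sizes alone. The helper below is an illustrative sketch (count_parameters is a hypothetical name); it assumes fully connected layers with one bias per output unit, which matches nn.Linear with its default settings.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of each layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weight matrix
        total += n_out         # bias vector
    return total


# 8 gene inputs -> 4 hidden units -> 1 output
n_params = count_parameters([8, 4, 1])
print(n_params)  # 8*4 + 4 + 4*1 + 1 = 41
```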

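Training procedure

Data preprocessing: Convert the training and test data to tensors.
Model initialization: Build the network as a subclass of PyTorch's nn.Module.
Loss function and optimizer: Use binary cross-entropy loss (nn.BCELoss) and the Adam optimizer (optim.Adam) with a learning rate of 0.01.
Test set in the forward pass: The test set is concatenated with the training set so that both pass through the network together, but only the training-set loss is backpropagated. The test metrics are used for monitoring only, which makes this a hold-out check rather than k-fold cross-validation.
Training loop: Train for 1000 epochs and print the training and test accuracy and loss every 100 epochs.
Result output: Print the predicted disease probability for every sample (training and test) from the final model.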
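
The binary cross-entropy loss named above can be written out by hand to show what is being minimized. This is a plain-Python sketch of the formula with mean reduction; the function name and the eps clamp are choices made for this illustration, and nn.BCELoss guards against log(0) in its own way.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over all samples."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite (sketch-only safeguard)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


# Two confident, correct predictions give a small loss.
loss_good = binary_cross_entropy([0.9, 0.2], [1, 0])
# The same predictions with swapped labels give a much larger loss.
loss_bad = binary_cross_entropy([0.9, 0.2], [0, 1])
print(loss_good, loss_bad)
```

The loss falls as the predicted probability moves toward the true label, which is why a sigmoid output in (0, 1) pairs naturally with it.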

Code implementation

import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd

# Read the Excel spreadsheets (raw strings keep the backslashes in Windows paths literal)
df_train = pd.read_excel(r'C:\Users\lenovo\Desktop\HIV\PAH三个数据集\selected_genes.xlsx')
df_test = pd.read_excel(r'C:\Users\lenovo\Desktop\HIV\PAH三个数据集\GSE53408 对应lasso基因.xlsx')

# Convert the training and test sets to tensors.
# Labels are reshaped to (N, 1) so they match the network output expected by nn.BCELoss.
x_train = torch.tensor(df_train.iloc[:, 1:].values, dtype=torch.float32)
y_train = torch.tensor(df_train.iloc[:, 0].values, dtype=torch.float32).unsqueeze(1)
x_test = torch.tensor(df_test.iloc[:, 1:].values, dtype=torch.float32)
y_test = torch.tensor(df_test.iloc[:, 0].values, dtype=torch.float32).unsqueeze(1)

# Define the network: 8 inputs -> 4 hidden units -> 1 output probability
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(8, 4)
        self.fc2 = nn.Linear(4, 1)
        
    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x

net = Net()

# Define the loss function and the optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(net.parameters(), lr=0.01)

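# Concatenate the training and test sets so both go through one forward pass
# (only the training-set loss is backpropagated below)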
x = torch.cat((x_train, x_test), dim=0)
y = torch.cat((y_train, y_test), dim=0)

# Train the network
for epoch in range(1000):
    optimizer.zero_grad()
    output = net(x)
    loss = criterion(output[:len(y_train)], y_train)  # loss on the training rows only
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        # Every 100 epochs, report accuracy and loss on the training and test sets
        pred_train = (output[:len(y_train)] > 0.5).float()
        train_acc = (pred_train == y_train).float().mean().item()
        train_loss = loss.item()
        pred_test = (output[len(y_train):] > 0.5).float()
        test_acc = (pred_test == y_test).float().mean().item()
        test_loss = criterion(output[len(y_train):], y_test).item()
        print('Epoch [{}/1000], Train Loss: {:.4f}, Train Acc: {:.4f}, Test Loss: {:.4f}, Test Acc: {:.4f}' 
              .format(epoch+1, train_loss, train_acc, test_loss, test_acc))

# Output the per-sample probabilities from the final model
output_prob = net(x).detach().numpy()
print('Per-sample probabilities:')
print(output_prob)
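
Once training is finished, the test set can also be scored on its own, without building a computation graph. The sketch below is illustrative only: the untrained nn.Sequential network and the random inputs are hypothetical stand-ins for the trained model and the real test data, and serve just to show the net.eval() and torch.no_grad() pattern.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the illustrative run reproducible

# Hypothetical stand-in for the trained network: 8 inputs, 4 hidden units, 1 output.
net = nn.Sequential(nn.Linear(8, 4), nn.Sigmoid(), nn.Linear(4, 1), nn.Sigmoid())

# Hypothetical held-out data: 5 samples with 8 expression values each.
x_test = torch.randn(5, 8)
y_test = torch.tensor([1.0, 0.0, 1.0, 0.0, 1.0])

net.eval()  # switch to inference mode
with torch.no_grad():  # gradients are not needed for evaluation
    prob = net(x_test).squeeze(1)  # shape (5,), each value in (0, 1)

pred = (prob > 0.5).float()  # threshold probabilities at 0.5
accuracy = (pred == y_test).float().mean().item()
print(prob.tolist(), accuracy)
```

Here the output is squeezed to shape (5,) so that it lines up with a one-dimensional label vector; comparing tensors of shape (5, 1) and (5,) would broadcast and give a misleading accuracy.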

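Summary

This project uses Python and the PyTorch framework to build a neural network that predicts disease status from gene expression data. The model is trained on one dataset, monitored against a held-out test set, and outputs a disease probability for every sample. It can serve as a reference example for gene expression analysis and disease prediction.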