The program is probably running out of memory because the dataset is too large. The following changes may help:

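  1. Add a chunk_size parameter and read the FASTA file chunk by chunk, so that only one chunk of records is held in memory at a time. SeqIO.parse has no max_records argument, so the chunks are taken from its lazy iterator with itertools.islice. The function becomes a generator: the caller processes each chunk before asking for the next one. For example:
from itertools import islice
from Bio import SeqIO

def read_fasta_file(file_path, chunk_size=10000):
    with open(file_path) as f:
        # SeqIO.parse is lazy; records are only parsed as they are consumed.
        records = SeqIO.parse(f, 'fasta')
        while True:
            # Pull at most chunk_size records from the iterator.
            chunk = list(islice(records, chunk_size))
            if not chunk:
                break
            sequences = [str(record.seq) for record in chunk]
            labels = [record.id for record in chunk]
            # Yield this chunk instead of accumulating the whole file.
            yield sequences, labels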
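The same chunked-reading idea can also be sketched with the standard library alone, which may help when Biopython is not available. This is a simplified parser under stated assumptions (header lines start with '>', the record ID is the first whitespace-delimited token after it), and iter_fasta and iter_batches are hypothetical names, not part of the original program.

```python
from itertools import islice


def iter_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    record_id, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(parts)
            # Assumption: the ID is the first token after '>'.
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            parts = []
        elif record_id is not None:
            parts.append(line)
    if record_id is not None:
        yield record_id, "".join(parts)


def iter_batches(records, chunk_size=10000):
    """Group any iterable of records into lists of at most chunk_size items."""
    records = iter(records)
    while True:
        batch = list(islice(records, chunk_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    demo = [">seq1 first", "ACGT", "AC", ">seq2", "GGTT", ">seq3", "NNAC"]
    for batch in iter_batches(iter_fasta(demo), chunk_size=2):
        print(batch)
```

An open file object can be passed as lines, since iterating over a file yields one line at a time; only the current batch is then held in memory.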
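  2. Rewrite kmer_to_feature so that the feature vector is allocated once up front instead of being grown repeatedly. For example:
import numpy as np

BASE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def kmer_to_feature(sequence, k=3):
    feature_size = 4 ** k
    n_kmers = max(len(sequence) - k + 1, 0)
    # Preallocate a flat one-hot array: one block of feature_size slots per k-mer position.
    features = np.zeros(n_kmers * feature_size, dtype=np.uint8)
    for i in range(n_kmers):
        kmer = sequence[i:i+k]
        # Skip k-mers containing N or any other character outside A, C, G, T.
        if all(base in BASE_INDEX for base in kmer):
            index = 0
            for base in kmer:
                # Read the k-mer as a base-4 number.
                index = index * 4 + BASE_INDEX[base]
            features[i*feature_size+index] = 1
    return features

The function now returns a flat one-dimensional NumPy array with one byte per entry rather than a Python list of integer objects, which uses less memory. The array is still dense, though: it holds (len(sequence) - k + 1) * 4**k entries, almost all of them zero.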
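To make the indexing concrete: each k-mer is read as a base-4 number with A=0, C=1, G=2, T=3, and its slot in the flat vector is position * 4**k + index. The sketch below works through that arithmetic with the standard library only; kmer_index and dense_length are hypothetical helper names introduced for illustration.

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_index(kmer):
    """Interpret a k-mer over A, C, G, T as a base-4 number."""
    index = 0
    for base in kmer:
        index = index * 4 + BASE_INDEX[base]
    return index


def dense_length(sequence_length, k=3):
    """Number of entries in the flat one-hot vector for one sequence."""
    return max(sequence_length - k + 1, 0) * 4 ** k


if __name__ == "__main__":
    print(kmer_index("ACG"))   # 0*16 + 1*4 + 2 = 6
    print(kmer_index("TTT"))   # 3*16 + 3*4 + 3 = 63
    print(dense_length(1000))  # 998 positions * 64 slots = 63872
```

For a 1000-base sequence with k=3, at most 998 of those 63872 entries are non-zero, so the dense layout spends nearly all of its memory on zeros.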

  3. Store the features in a sparse matrix to reduce memory use further. For example:
from scipy.sparse import csr_matrix

BASE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def kmer_to_feature(sequence, k=3):
    feature_size = 4 ** k
    n_kmers = max(len(sequence) - k + 1, 0)
    row = []
    col = []
    for i in range(n_kmers):
        kmer = sequence[i:i+k]
        # Skip k-mers containing N or any other character outside A, C, G, T.
        if all(base in BASE_INDEX for base in kmer):
            index = 0
            for base in kmer:
                # Read the k-mer as a base-4 number.
                index = index * 4 + BASE_INDEX[base]
            # Record only the coordinates of the non-zero entries.
            row.append(i)
            col.append(index)
    data = [1] * len(row)
    return csr_matrix((data, (row, col)), shape=(n_kmers, feature_size))

The function now returns a sparse matrix, which uses less memory. Note, however, that the downstream algorithm must accept sparse input. sklearn.neighbors.KNeighborsClassifier, for example, does.
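A minimal sketch of passing sparse features to KNeighborsClassifier, assuming scipy and scikit-learn are installed. Because the classifier expects one fixed-length row per sample, the sketch collapses each sequence into a single 1 x 4**k row of k-mer counts. That summarization, the helper name kmer_count_row, and the toy sequences are assumptions made for illustration, not part of the original program.

```python
from scipy.sparse import csr_matrix, vstack
from sklearn.neighbors import KNeighborsClassifier

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_count_row(sequence, k=3):
    """Return a 1 x 4**k sparse row of k-mer counts for one sequence."""
    counts = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Skip k-mers containing anything other than A, C, G, T.
        if all(base in BASE_INDEX for base in kmer):
            index = 0
            for base in kmer:
                index = index * 4 + BASE_INDEX[base]
            counts[index] = counts.get(index, 0) + 1
    cols = sorted(counts)
    data = [float(counts[c]) for c in cols]
    rows = [0] * len(cols)
    return csr_matrix((data, (rows, cols)), shape=(1, 4 ** k))


# Toy data, invented purely for illustration.
train_seqs = ["AAAAAAAA", "AAAAAAAC", "GGGGGGGG", "GGGGGGGT"]
train_labels = ["a_rich", "a_rich", "g_rich", "g_rich"]

# Stack the per-sequence rows into one sparse sample-by-feature matrix.
X_train = vstack([kmer_count_row(s) for s in train_seqs]).tocsr()
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, train_labels)

X_test = vstack([kmer_count_row(s) for s in ["AAAAAACA", "GGGTGGGG"]]).tocsr()
predictions = knn.predict(X_test)
print(list(predictions))
```

The feature matrix stays in CSR form from construction through fit and predict, so no dense sample-by-feature array is built in this code.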

KNN DNA sequence classification: fixing out-of-memory errors

Original URL: https://www.cveoy.top/t/topic/lJoo Copyright belongs to the author. Do not repost or scrape.
