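KNN DNA Sequence Classification: Fixing Out-of-Memory Errors
The program is probably running out of memory because the dataset is too large. The following changes can help: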
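- Add a chunk_size parameter and change the function that reads the FASTA file so that it reads the file in chunks, which reduces memory usage. For example:
    from itertools import islice
    from Bio import SeqIO

    def read_fasta_file(file_path, chunk_size=10000):
        # Yield (sequences, labels) for at most chunk_size records at a time,
        # so the whole file is never held in memory at once.
        with open(file_path) as f:
            records = SeqIO.parse(f, 'fasta')  # lazy iterator over records
            while True:
                chunk = list(islice(records, chunk_size))
                if not chunk:
                    break
                sequences = [str(record.seq) for record in chunk]
                labels = [record.id for record in chunk]
                yield sequences, labels
  The function is a generator, so callers process one batch at a time instead of receiving two lists that cover the whole file.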
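The chunked-reading idea can also be sketched without Biopython, which makes the batching logic easier to see. The following is a minimal sketch, not a replacement for a real FASTA parser: the name iter_fasta_chunks is hypothetical, and it assumes a plain FASTA file in which each header line starts with '>' and the record ID is the first whitespace-delimited token of the header.

```python
import os
import tempfile

def iter_fasta_chunks(file_path, chunk_size=10000):
    """Yield (sequences, labels) lists holding at most chunk_size records each."""
    sequences, labels = [], []
    label, parts = None, []
    with open(file_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                # A new header closes the previous record, if there was one.
                if label is not None:
                    sequences.append(''.join(parts))
                    labels.append(label)
                    if len(sequences) == chunk_size:
                        yield sequences, labels
                        sequences, labels = [], []
                # Assumption: the ID is the first token after '>'.
                label, parts = (line[1:].split() or [''])[0], []
            else:
                parts.append(line)
    # Flush the last record and any partially filled chunk.
    if label is not None:
        sequences.append(''.join(parts))
        labels.append(label)
    if sequences:
        yield sequences, labels

# Small demonstration on a temporary three-record file.
with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
    tmp.write('>s1 first record\nAC\nGT\n>s2\nGGCC\n>s3\nTTAA\n')
chunks = list(iter_fasta_chunks(tmp.name, chunk_size=2))
os.remove(tmp.name)
print(chunks)
```

Only one chunk of records is alive at a time, provided the caller does not collect all chunks into a list as the demonstration does for printing.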
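- Optimize the kmer_to_feature function by preallocating the output, so that it is never resized while it is being filled. For example:
    import numpy as np

    BASE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    def kmer_to_feature(sequence, k=3):
        feature_size = 4 ** k
        n_kmers = max(len(sequence) - k + 1, 0)
        # Preallocate one byte per entry: a block of feature_size slots per position.
        features = np.zeros(n_kmers * feature_size, dtype=np.uint8)
        for i in range(n_kmers):
            kmer = sequence[i:i+k]
            # Skip k-mers that contain 'N' or any other non-ACGT character.
            if all(base in BASE_INDEX for base in kmer):
                index = 0
                for base in kmer:
                    index = index * 4 + BASE_INDEX[base]
                features[i*feature_size+index] = 1
        return features
  Here kmer_to_feature returns a flat one-dimensional NumPy array rather than a Python list. Each entry takes one byte instead of an 8-byte object reference (on 64-bit CPython), so memory usage is lower. The array still has (len(sequence) - k + 1) * 4**k entries, almost all of them zero.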
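To get a feel for the sizes involved, the rough estimate below compares a Python list with a NumPy uint8 array of the same length. The helper dense_feature_length and the sequence length of 1000 are illustrative assumptions, not part of the original program.

```python
import sys
import numpy as np

def dense_feature_length(seq_len, k=3):
    """Number of entries in the flattened per-position one-hot encoding."""
    return max(seq_len - k + 1, 0) * 4 ** k

# 998 k-mer positions, each with 4**3 = 64 slots.
n = dense_feature_length(1000, k=3)

as_list = [0] * n
as_array = np.zeros(n, dtype=np.uint8)

print(n)                       # number of entries
print(sys.getsizeof(as_list))  # size of the list container (references only)
print(as_array.nbytes)         # one byte per entry
```

The entry count grows linearly with sequence length and by a factor of 4 for each increment of k, so even the compact dense form becomes large quickly.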
- Store the features in a sparse matrix to reduce memory usage further. For example:
    from scipy.sparse import csr_matrix

    BASE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    def kmer_to_feature(sequence, k=3):
        feature_size = 4 ** k
        n_kmers = max(len(sequence) - k + 1, 0)
        row, col, data = [], [], []
        for i in range(n_kmers):
            kmer = sequence[i:i+k]
            # Skip k-mers that contain 'N' or any other non-ACGT character.
            if all(base in BASE_INDEX for base in kmer):
                index = 0
                for base in kmer:
                    index = index * 4 + BASE_INDEX[base]
                # Record only the single nonzero entry for position i.
                row.append(i)
                col.append(index)
                data.append(1)
        return csr_matrix((data, (row, col)), shape=(n_kmers, feature_size))
  Here kmer_to_feature returns a sparse matrix, which stores only the nonzero entries and therefore uses less memory. Note, however, that the downstream algorithm must accept sparse input. sklearn.neighbors.KNeighborsClassifier, for example, does.
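As a rough illustration of passing sparse features to KNeighborsClassifier, the sketch below makes one assumption beyond the text above: because the classifier needs one fixed-length row per sample, each sequence is reduced to a 1 x 4**k row of k-mer counts instead of a per-position matrix. The helper kmer_count_vector and the toy sequences and labels are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.neighbors import KNeighborsClassifier

BASE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def kmer_count_vector(sequence, k=3):
    """Hypothetical helper: sparse 1 x 4**k row of k-mer counts."""
    counts = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Skip k-mers with any non-ACGT character.
        if all(base in BASE_INDEX for base in kmer):
            index = 0
            for base in kmer:
                index = index * 4 + BASE_INDEX[base]
            counts[index] = counts.get(index, 0) + 1
    cols = list(counts.keys())
    data = list(counts.values())
    rows = [0] * len(cols)
    return csr_matrix((data, (rows, cols)), shape=(1, 4 ** k), dtype=np.float64)

# Toy training data (made up for illustration).
train_seqs = ['AAAAAAAA', 'AAAAAACA', 'GGGGGGGG', 'GGGGGGTG']
train_labels = ['polyA', 'polyA', 'polyG', 'polyG']

# Stack one sparse row per sequence into a samples x features matrix.
X_train = vstack([kmer_count_vector(s) for s in train_seqs]).tocsr()

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, train_labels)

X_test = vstack([kmer_count_vector('AAAAAAAT'),
                 kmer_count_vector('CGGGGGGG')]).tocsr()
pred = knn.predict(X_test)
print(list(pred))
```

With sparse input scikit-learn uses brute-force neighbor search, so prediction time still grows with the number of training samples even though memory usage drops.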
Original source: https://www.cveoy.top/t/topic/lJoo. Copyright belongs to the author. Do not repost or scrape.