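Python FASTA Sequence Splitting Tool: Creating Training and Test Sets
The following example code randomly splits the sequences in a FASTA file into a training set and a test set, and writes each set to a separate file.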
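import random


# Read the sequences and their header labels from a FASTA file.
# Assumes the file starts with a '>' header line.
def read_fasta_file(file_path):
    seqs = []
    labels = []
    with open(file_path, 'r') as f:
        seq = ''
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                # Store the previous record first, so that each sequence
                # is paired with its own header and not with the next one
                if seq != '':
                    seqs.append(seq)
                    labels.append(label)
                label = line[1:]
                seq = ''
            else:
                # Sequence data may span several lines; join them
                seq += line
        # Store the last record in the file
        if seq != '':
            seqs.append(seq)
            labels.append(label)
    return seqs, labels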

# Randomly split the sequences into a training set and a test set
def split_dataset(seqs, labels, test_ratio=0.2):
    num_seqs = len(seqs)
    # int() truncates, so the test set may be slightly smaller than the ratio
    test_size = int(num_seqs * test_ratio)
    test_indices = set(random.sample(range(num_seqs), test_size))
    train_seqs = []
    train_labels = []
    test_seqs = []
    test_labels = []
    for i in range(num_seqs):
        if i in test_indices:
            test_seqs.append(seqs[i])
            test_labels.append(labels[i])
        else:
            train_seqs.append(seqs[i])
            train_labels.append(labels[i])
    return train_seqs, train_labels, test_seqs, test_labels

# Write the sequences to a file in FASTA format
def write_seqs_to_file(seqs, labels, file_path):
    with open(file_path, 'w') as f:
        for label, seq in zip(labels, seqs):
            f.write('>' + label + '\n')
            f.write(seq + '\n')

# Example usage
if __name__ == '__main__':
    file_path = 'sequences.fasta'
    seqs, labels = read_fasta_file(file_path)
    train_seqs, train_labels, test_seqs, test_labels = split_dataset(seqs, labels)
    write_seqs_to_file(train_seqs, train_labels, 'train.fasta')
    write_seqs_to_file(test_seqs, test_labels, 'test.fasta')
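This code reads the sequences and headers from a FASTA file into the seqs and labels lists, randomly splits the records into a training set and a test set, and writes each set to its own file. You can change the file path and the test ratio as needed, and add other processing logic to the functions.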
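One possible adjustment is to make the split reproducible, so that the same input always gives the same training and test sets. The sketch below is only an illustration, not part of the original code: the function name split_dataset_seeded and its seed parameter are hypothetical. It uses a private random.Random instance instead of the module-level functions, and it otherwise follows the same index-sampling approach as split_dataset.

```python
import random


def split_dataset_seeded(seqs, labels, test_ratio=0.2, seed=42):
    """Reproducible variant of split_dataset (hypothetical helper)."""
    if not 0.0 <= test_ratio <= 1.0:
        raise ValueError('test_ratio must be between 0 and 1')
    if len(seqs) != len(labels):
        raise ValueError('seqs and labels must have the same length')

    # A private generator leaves the global random state untouched
    rng = random.Random(seed)
    num_seqs = len(seqs)
    test_size = int(num_seqs * test_ratio)
    test_indices = set(rng.sample(range(num_seqs), test_size))

    train_seqs, train_labels, test_seqs, test_labels = [], [], [], []
    for i in range(num_seqs):
        if i in test_indices:
            test_seqs.append(seqs[i])
            test_labels.append(labels[i])
        else:
            train_seqs.append(seqs[i])
            train_labels.append(labels[i])
    return train_seqs, train_labels, test_seqs, test_labels


# Made-up records, used only to show that the split is repeatable
example_seqs = ['ACGT', 'GGCC', 'TTAA', 'CCGG', 'ATAT']
example_labels = ['s1', 's2', 's3', 's4', 's5']
first = split_dataset_seeded(example_seqs, example_labels, test_ratio=0.4, seed=0)
second = split_dataset_seeded(example_seqs, example_labels, test_ratio=0.4, seed=0)
print(first == second)  # True: the same seed yields the same split
```

Passing a different seed gives a different split, which can be useful when a result should be checked across several random partitions.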