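Python: Batch Extraction of Sequences from a FASTA File - Filtering by ID

This guide shows how to use Python to extract a batch of sequences from a FASTA file based on their IDs.

Steps: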

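  1. Read the FASTA file:
def read_fasta_file(file_path):
    # Read the whole file into a single string
    with open(file_path, 'r') as file:
        content = file.read()
    return content
  2. Split the file content into records:
def split_fasta_content(content):
    # Each record starts with '>'; drop empty pieces such as the text before the first '>'
    sequences = content.split('>')
    sequences = [seq.strip() for seq in sequences if seq.strip() != '']
    return sequences
  3. Extract the ID and sequence of each record:
def extract_sequences(sequences):
    extracted_sequences = {}
    for seq in sequences:
        seq_lines = seq.splitlines()
        # The whole header line (without '>') is used as the ID
        seq_id = seq_lines[0]
        # Join the remaining lines into one continuous sequence
        seq_content = ''.join(seq_lines[1:])
        extracted_sequences[seq_id] = seq_content
    return extracted_sequences
  4. Extract the sequences whose IDs are in the ID list:
def get_sequences_by_ids(extracted_sequences, id_list):
    sequences = {}
    for seq_id in id_list:
        # IDs that are not present in the file are skipped
        if seq_id in extracted_sequences:
            sequences[seq_id] = extracted_sequences[seq_id]
    return sequences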
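
The steps above read the whole file into one string, which can use a lot of memory for large FASTA files. The following is a minimal sketch of a line-by-line alternative; it is not part of the original code. The function name `iter_fasta_records` is hypothetical, and the sketch assumes that the ID is the first whitespace-delimited token of each header line, which differs from the steps above, where the whole header line is the ID.

```python
import io

def iter_fasta_records(handle):
    """Yield (seq_id, sequence) pairs from an open text handle, one record at a time."""
    seq_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there was one
            if seq_id is not None:
                yield seq_id, ''.join(chunks)
            # Assumption: the ID is the first token after '>'; the rest is a description
            header = line[1:].split()
            seq_id = header[0] if header else ''
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    # Emit the last record in the file
    if seq_id is not None:
        yield seq_id, ''.join(chunks)

# Usage with in-memory data; for a real file, pass open(file_path, 'r') instead
example = io.StringIO('>seq1 first record\nACGT\nTTGA\n>seq2\nGGCC\n>seq3\nAAAA\n')
wanted = {'seq1', 'seq3'}
selected = {sid: seq for sid, seq in iter_fasta_records(example) if sid in wanted}
print(selected)
```

Keeping the wanted IDs in a set makes each membership check cheap, and only the selected records are held in memory at the end.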

Complete code example:

def read_fasta_file(file_path):
    with open(file_path, 'r') as file:
        content = file.read()
    return content

def split_fasta_content(content):
    sequences = content.split('>')
    sequences = [seq.strip() for seq in sequences if seq.strip() != '']
    return sequences

def extract_sequences(sequences):
    extracted_sequences = {}
    for seq in sequences:
        seq_lines = seq.splitlines()
        # The whole header line (without '>') is used as the ID
        seq_id = seq_lines[0]
        # Join the remaining lines into one continuous sequence
        seq_content = ''.join(seq_lines[1:])
        extracted_sequences[seq_id] = seq_content
    return extracted_sequences

def get_sequences_by_ids(extracted_sequences, id_list):
    sequences = {}
    for seq_id in id_list:
        if seq_id in extracted_sequences:
            sequences[seq_id] = extracted_sequences[seq_id]
    return sequences

file_path = 'example.fasta'
# Replace with the path to your FASTA file
id_list = ['seq1', 'seq3']
# Replace with the list of sequence IDs to extract

content = read_fasta_file(file_path)
sequences = split_fasta_content(content)
extracted_sequences = extract_sequences(sequences)
selected_sequences = get_sequences_by_ids(extracted_sequences, id_list)

for seq_id, seq_content in selected_sequences.items():
    print('ID:', seq_id)
    print('Sequence:', seq_content)
    print('---')
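
The example above only prints the result. If the extracted sequences should be saved as a new FASTA file instead, something like the following sketch may help. The function name `write_fasta` and the 60-character line width are assumptions made for illustration, not part of the original code.

```python
import io

def write_fasta(selected_sequences, handle, width=60):
    """Write a {seq_id: sequence} dict to a text handle in FASTA format."""
    for seq_id, seq_content in selected_sequences.items():
        handle.write('>' + seq_id + '\n')
        # Wrap the sequence into lines of at most `width` characters
        for start in range(0, len(seq_content), width):
            handle.write(seq_content[start:start + width] + '\n')

# Usage with an in-memory buffer; for a real file, use
# open('selected.fasta', 'w') as the handle (hypothetical file name)
buffer = io.StringIO()
write_fasta({'seq1': 'ACGT' * 20, 'seq3': 'GGCC'}, buffer)
print(buffer.getvalue(), end='')
```

Here the 80-character sequence of seq1 is written as one 60-character line followed by one 20-character line.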

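Notes:

  • Replace example.fasta in the code with the path to your FASTA file.
  • Replace id_list with the list of sequence IDs you want to extract. Each ID must match the full header line (the text after '>') exactly, and IDs that do not appear in the file are skipped.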

After you run the code, it prints the ID and sequence of each extracted record.
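
Because get_sequences_by_ids skips unknown IDs without any message, it can be useful to check which requested IDs were not found, and for long lists to load the IDs from a text file. The sketch below is one possible way to do both; the helper names `read_id_list` and `find_missing_ids` are hypothetical, and the ID file is assumed to contain one ID per line.

```python
import io

def read_id_list(handle):
    """Return IDs from a text handle, one per line, skipping blanks and duplicates."""
    ids = []
    seen = set()
    for line in handle:
        seq_id = line.strip()
        if seq_id and seq_id not in seen:
            seen.add(seq_id)
            ids.append(seq_id)
    return ids

def find_missing_ids(extracted_sequences, id_list):
    """Return the requested IDs that are not keys of extracted_sequences."""
    return [seq_id for seq_id in id_list if seq_id not in extracted_sequences]

# Usage with in-memory data; for a real ID file, pass open(path, 'r') instead
id_list = read_id_list(io.StringIO('seq1\n\nseq3\nseq9\nseq1\n'))
extracted_sequences = {'seq1': 'ACGT', 'seq2': 'GGCC', 'seq3': 'AAAA'}
missing = find_missing_ids(extracted_sequences, id_list)
print('Requested:', id_list)
print('Missing:', missing)
```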


Original source: https://www.cveoy.top/t/topic/pcWA. Copyright belongs to the author. Do not repost or scrape.
