Python Fasta 文件分析: 计算状态转移矩阵和相邻碱基转移计数

该脚本用于读取 Fasta 文件并计算 DNA 序列中状态转移矩阵和相邻碱基转移计数。

import numpy as np
from Bio import SeqIO

def read_fasta_file(file_path):
    sequences = []
    with open(file_path) as f:
        sequence = ''
        for line in f:
            if line.startswith('>'):
                if sequence:
                    sequences.append(sequence)
                sequence = ''
            else:
                sequence += line.strip()
        if sequence:
            sequences.append(sequence)
    return sequences

def calc_transition_matrix(sequences):
    alphabet = list(set(''.join(sequences)))
    states = len(alphabet)
    matrix = np.zeros((states, states))
    for seq in sequences:
        for i in range(len(seq) - 1):
            from_state = alphabet.index(seq[i])
            to_state = alphabet.index(seq[i + 1])
            matrix[from_state, to_state] += 1
    matrix = matrix / np.sum(matrix, axis=1, keepdims=True)
    return matrix

def count_adjacent_transitions(transition_matrix):
    transition_count = {
        'AA': 0,
        'AC': 0,
        'AG': 0,
        'AT': 0,
        'CA': 0,
        'CC': 0,
        'CG': 0,
        'CT': 0,
        'GA': 0,
        'GC': 0,
        'GG': 0,
        'GT': 0,
        'TA': 0,
        'TC': 0,
        'TG': 0,
        'TT': 0,
    }
    alphabet = list('ACGT')
    for i in range(len(alphabet)):
        for j in range(len(alphabet)):
            transition_count[alphabet[i] + alphabet[j]] = transition_matrix[i, j]
    return transition_count

# 读取fasta文件，得到序列列表
sequences = read_fasta_file('sequences.fasta')
# 计算状态转移概率矩阵
transition_matrix = calc_transition_matrix(sequences)
# 计算相邻状态转移计数器
transition_count = count_adjacent_transitions(transition_matrix)
print(transition_count)

代码解释

read_fasta_file(file_path) 函数读取 Fasta 文件并返回一个包含所有序列的列表。
calc_transition_matrix(sequences) 函数根据序列列表计算状态转移矩阵。
count_adjacent_transitions(transition_matrix) 函数根据状态转移矩阵计算相邻碱基转移计数。

使用方法

将 Fasta 文件命名为 sequences.fasta 并放置在脚本所在目录。
运行脚本。
脚本将输出一个字典，包含每个相邻碱基转移的计数。

例子

假设 sequences.fasta 文件包含以下序列:

>seq1
ACGT
>seq2
CGTA

运行脚本后，将输出以下结果：

{'AA': 0.0, 'AC': 0.5, 'AG': 0.0, 'AT': 0.0, 'CA': 0.0, 'CC': 0.0, 'CG': 0.5, 'CT': 0.0, 'GA': 0.0, 'GC': 0.0, 'GG': 0.0, 'GT': 0.5, 'TA': 0.5, 'TC': 0.0, 'TG': 0.0, 'TT': 0.0}

应用

该脚本可以用于分析 DNA 序列的结构和演化关系。例如，可以利用状态转移矩阵和相邻碱基转移计数来识别序列中的重复序列、预测序列的二级结构、或分析序列的演化路径。