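The University of Tokyo, Graduate School of Frontier Sciences, Department of Computational Biology and Medical Sciences: August 2023 Exam, Problem 12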

Author

Description

(1) The iterative equations below are for calculation of the score of global alignment of two sequences , , where is the match score of character and , and . The initial values are not shown here.

(1-1) Show the formula of the penalty for a gap of length .

(1-2) Suppose that some of the initial values are

, , ,

for : ,

for : .

Show the initial values and .

(1-3) Show the iterative equations to calculate the maximum score of local alignment using the same type of gap penalty.

(1-4) Explain a method to get a local alignment with the maximum score using the calculation of (1-3).

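(2) There is a sequence $x = x_1 x_2 \cdots x_m$ consisting of the characters a, c, g, and t. Define the complementary character $\mathrm{comp}(\sigma)$ of a character $\sigma$ as

$$\mathrm{comp}(\mathrm{a}) = \mathrm{t},\quad \mathrm{comp}(\mathrm{c}) = \mathrm{g},\quad \mathrm{comp}(\mathrm{g}) = \mathrm{c},\quad \mathrm{comp}(\mathrm{t}) = \mathrm{a}.$$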

(2-1) Explain what is reported by the following algorithm.

(2-2) Let us define the 'reverse complementary alignment score' of two subsequences and of length and as the maximum score of global alignment of and . Note that is reverse ordered.

Also define the substitution matrix of the alignment as

$$s(a, b) = \begin{cases} 1 & \text{if } b = \mathrm{comp}(a) \\ -1 & \text{otherwise} \end{cases}$$

and the gap penalty is the number of gaps (a gap of length $k$ has penalty $k$).

Show an algorithm to report a pair of (possibly empty) subsequences of $x$ with the maximum reverse complementary alignment score.


Kai

Written by zephyr

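Solution Approach

This problem concerns global and local alignment of two sequences. It gives the iterative equations for global alignment and asks for the related formulas and algorithms. Alignment scores typically combine a match score, a mismatch score, and a penalty for insertions and deletions (gaps). The gap penalty consists of a gap opening penalty and a gap extension penalty.

The reverse complement is a key concept from the double-stranded structure of DNA. In DNA, A pairs with T and C pairs with G, and the two strands run in opposite directions, so the sequence of one strand determines the sequence of the other. This concept is central to the second half of the problem.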

1. Global Alignment with Affine Gap Penalty

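1-1: Formula for the Penalty of a Gap of Length $k$

Let $\gamma(k)$ denote the penalty for a gap of length $k$. Write $d$ for the amount the given equations subtract when a gap is opened and $e$ for the amount they subtract each time a gap is extended by one position. Then:

Opening a gap costs $d$.
Extending a gap costs $e$ for each additional position.

Therefore, the formula for the penalty of a gap of length $k \ge 1$ is:

$$\gamma(k) = d + (k - 1)\,e$$

Note: This is known as the affine gap penalty model.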
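As a small numerical check of the affine form, the sketch below evaluates the penalty for a few gap lengths. The function name `gap_penalty`, the parameter names `d` (cost charged when a gap is opened) and `e` (cost per additional gap position), and the values 5 and 1 are illustrative assumptions, not values taken from the problem.

```python
def gap_penalty(k: int, d: float, e: float) -> float:
    """Affine penalty for a gap of length k: d for the first position, e for each further one."""
    if k < 0:
        raise ValueError("gap length must be non-negative")
    if k == 0:
        return 0  # no gap, no penalty
    return d + (k - 1) * e


# Penalties for gap lengths 1..4 with the assumed values d = 5, e = 1.
penalties = [gap_penalty(k, d=5, e=1) for k in range(1, 5)]
```

With these assumed values, one long gap is cheaper than several short gaps covering the same number of positions, which is the behaviour the affine model is meant to capture.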

1-2: Initial Values for and

Given the initial conditions:

We need to determine and .

For :

represents a gap in sequence at the beginning. According to the recurrence relation:

Therefore, .

For :

represents a gap in sequence at the beginning. It's symmetrical to :

Hence, the initial values of and are as follows:
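To illustrate how boundary values of this kind interact with the recurrence, here is a minimal sketch of a three-matrix global alignment with affine gaps. It assumes one common convention (a match matrix `M` and two gap matrices `Ix` and `Iy`, with unreachable states set to negative infinity); the names `global_affine_score`, `M`, `Ix`, `Iy`, `d`, `e` and the default scores are assumptions for illustration and may differ from the notation of the problem.

```python
import math


def global_affine_score(x, y, d=5, e=1, match=1, mismatch=-1):
    """Score of the best global alignment of x and y under an affine gap penalty."""
    neg = -math.inf
    m, n = len(x), len(y)
    # M: last column aligns x[i-1] with y[j-1]
    # Ix: last column aligns x[i-1] with a gap; Iy: last column aligns y[j-1] with a gap
    M = [[neg] * (n + 1) for _ in range(m + 1)]
    Ix = [[neg] * (n + 1) for _ in range(m + 1)]
    Iy = [[neg] * (n + 1) for _ in range(m + 1)]

    # Boundary: the empty alignment scores 0; a prefix aligned only
    # against gaps costs one gap of the corresponding length.
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = -(d + (i - 1) * e)
    for j in range(1, n + 1):
        Iy[0][j] = -(d + (j - 1) * e)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)  # open or extend a gap in y
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)  # open or extend a gap in x
    return max(M[m][n], Ix[m][n], Iy[m][n])
```

In this convention, the boundary entries of the gap matrices equal minus the penalty of a single gap spanning the whole prefix, and every other boundary entry except `M[0][0]` is negative infinity.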

1-3: Iterative Equations for Local Alignment

To compute the local alignment, the iterative equations are modified to allow for the possibility of starting a new alignment, indicated by a score of 0:
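One way to write such a modification, assuming a three-matrix formulation with a match matrix $M$, gap matrices $I_x$ and $I_y$, a gap-opening cost $d$, and a gap-extension cost $e$ (these symbols are assumptions and may differ from the notation of the problem), is sketched below. The only change from the global case is the extra option 0 in the match matrix, together with a zero boundary.

```latex
\begin{aligned}
M(i,j)   &= \max\bigl\{\,0,\ \max\{M(i-1,j-1),\ I_x(i-1,j-1),\ I_y(i-1,j-1)\} + s(x_i, y_j)\bigr\} \\
I_x(i,j) &= \max\bigl\{\,M(i-1,j) - d,\ I_x(i-1,j) - e\,\bigr\} \\
I_y(i,j) &= \max\bigl\{\,M(i,j-1) - d,\ I_y(i,j-1) - e\,\bigr\} \\
M(i,0)   &= M(0,j) = 0, \qquad I_x(i,0) = I_x(0,j) = I_y(i,0) = I_y(0,j) = -\infty \\
\text{score} &= \max_{i,j} M(i,j)
\end{aligned}
```

Under these assumptions an optimal local alignment never starts or ends with a gap, which is why the maximum is taken over $M$ only.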

1-4: Obtaining a Local Alignment with Maximum Score

To obtain a local alignment with the maximum score:

Initialize all cells in the first row and column to 0.
Fill the dynamic programming matrix using the equations from (1-3).
Find the cell with the maximum score in the entire matrix.
Perform a traceback from that cell until reaching a cell with score 0 or the matrix boundary.
The path of this traceback gives the optimal local alignment.
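The steps above can be sketched as follows for an affine gap penalty. This is a minimal sketch under assumptions: it uses a three-matrix formulation (`M`, `Ix`, `Iy`), and the function name `local_affine_align`, the parameters `d` and `e`, and the default scores are hypothetical choices for illustration rather than the notation of the problem.

```python
import math


def local_affine_align(x, y, d=5, e=1, match=2, mismatch=-1):
    """Return (best_score, aligned_x, aligned_y) for a local alignment with affine gaps."""
    neg = -math.inf
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]      # ends in a match/mismatch; boundary is 0
    Ix = [[neg] * (n + 1) for _ in range(m + 1)]   # ends with x[i-1] against a gap
    Iy = [[neg] * (n + 1) for _ in range(m + 1)]   # ends with y[j-1] against a gap

    def score(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            prev = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            M[i][j] = max(0, prev + score(i, j))   # 0 means "start a new alignment here"
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Traceback from the best cell, remembering which matrix we are in.
    i, j = best_pos
    state = "M"
    ax, ay = [], []
    while i > 0 and j > 0:
        if state == "M":
            if M[i][j] == 0:                       # start of the local alignment
                break
            prev = M[i][j] - score(i, j)
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
            if prev == M[i][j]:
                state = "M"
            elif prev == Ix[i][j]:
                state = "Ix"
            else:
                state = "Iy"
        elif state == "Ix":
            opened_here = Ix[i][j] == M[i - 1][j] - d
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
            state = "M" if opened_here else "Ix"
        else:  # state == "Iy"
            opened_here = Iy[i][j] == M[i][j - 1] - d
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
            state = "M" if opened_here else "Iy"
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```

For example, `local_affine_align("acgtcatg", "acgtggcatg")` aligns the two shared blocks across a gap of length 2, since the eight matches outweigh the gap cost under the assumed scores.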

2. Reverse Complementary Sequence Analysis

2-1: Explanation of the Algorithm

The algorithm scans the sequence for pairs of subsequences that are reverse complements of each other. It fills a matrix whose entry for positions $i$ and $j$ records the length of the reverse complementary match found so far, and whenever that length is at least the given threshold it reports the corresponding pair of ranges.

Specifically:

The entry for $(i, j)$ stores the length of the reverse complementary match that ends at position $i$ on one side and at position $j$ on the other.
For each $i$, the algorithm compares $x_i$ with the complement of $x_j$ while $j$ runs in decreasing order.
If they match, it extends the previous match, stored at $(i-1, j+1)$, by 1; otherwise the length is reset to 0.
If the length of the match is at least the threshold, it reports the corresponding ranges.

The two reported ranges give the start and end positions of a pair of reverse complementary subsequences whose length is at least the threshold.
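The behaviour described above can be sketched in Python as follows. This is a reconstruction from the description, not the pseudocode of the problem: the names `report_reverse_complement_pairs`, `L`, and `k`, the 1-based reported ranges, and the restriction to $j > i$ are all assumptions.

```python
COMP = {"a": "t", "c": "g", "g": "c", "t": "a"}


def report_reverse_complement_pairs(x, k):
    """List ((start1, end1), (start2, end2)) for reverse complementary matches of length >= k."""
    m = len(x)
    # L[i][j]: length of the match whose left arm ends at x_i and whose
    # right arm starts at x_j (1-based); row 0 and column m+1 stay 0.
    L = [[0] * (m + 2) for _ in range(m + 2)]
    reports = []
    for i in range(1, m + 1):
        for j in range(m, i, -1):                  # j runs from m down to i+1
            if x[i - 1] == COMP[x[j - 1]]:
                L[i][j] = L[i - 1][j + 1] + 1      # extend the match on the anti-diagonal
            else:
                L[i][j] = 0
            if L[i][j] >= k:
                length = L[i][j]
                reports.append(((i - length + 1, i), (j, j + length - 1)))
    return reports
```

For instance, in `"acgaaacgt"` the prefix `acg` and the suffix `cgt` are reverse complements, so with `k = 3` the sketch reports the ranges `(1, 3)` and `(7, 9)`.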

2-2: Algorithm for Maximum Reverse Complementary Alignment Score

Algorithm

The algorithm to find the maximum reverse complementary alignment score is as follows:

Initialize a dynamic programming matrix dp, where dp[i][j] is the best score of an alignment whose first subsequence ends at position $i$ and whose second, reverse-ordered subsequence starts at position $j$ of $x$; the empty alignment gives a score of 0.
Fill the matrix with a modified Smith-Waterman recurrence that moves forward in $i$ and backward in $j$: dp[i][j] = max(0, dp[i-1][j+1] + s(x_i, x_j), dp[i-1][j] - 1, dp[i][j+1] - 1), where the substitution score s rewards complementary characters.
Keep track of the maximum score and its position.
Perform a traceback from the position of the maximum score until a cell with score 0 is reached, to reconstruct the aligned subsequences.
Return the pair of subsequences with the maximum reverse complementary alignment score.

The time complexity of this algorithm is $O(m^2)$, where $m$ is the length of the sequence $x$. The space complexity is also $O(m^2)$ because of the dynamic programming matrix.

Code Implementation

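def max_reverse_complementary_alignment(x):
    """Return the two aligned strings of a best-scoring reverse complementary pair in x."""
    m = len(x)
    # Rows 0..m and columns 0..m+1: row 0 and column m+1 are the zero
    # boundary, so dp[i-1][j+1] is always a valid index.
    dp = [[0 for _ in range(m + 2)] for _ in range(m + 1)]
    max_score = 0
    max_pos = (0, 0)

    # Complementary base pairs (lowercase alphabet, as in the problem)
    def comp(a):
        return {'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}[a]

    # Substitution score: +1 for a complementary pair, -1 otherwise
    def s(a, b):
        return 1 if comp(a) == b else -1

    # Fill the matrix: i moves forward and j moves backward, so the second
    # subsequence is read in reverse order. Positions i and j are not
    # constrained relative to each other, so the two subsequences may overlap.
    for i in range(1, m + 1):
        for j in range(m, 0, -1):
            match = dp[i-1][j+1] + s(x[i-1], x[j-1])
            delete = dp[i-1][j] - 1   # x[i-1] aligned with a gap
            insert = dp[i][j+1] - 1   # x[j-1] aligned with a gap
            dp[i][j] = max(0, match, delete, insert)
            if dp[i][j] > max_score:
                max_score = dp[i][j]
                max_pos = (i, j)

    # Traceback from the best cell until a cell with score 0 is reached
    i, j = max_pos
    seq1, seq2 = [], []
    while dp[i][j] > 0:
        if dp[i][j] == dp[i-1][j+1] + s(x[i-1], x[j-1]):
            seq1.append(x[i-1])
            seq2.append(x[j-1])
            i -= 1
            j += 1
        elif dp[i][j] == dp[i-1][j] - 1:
            seq1.append(x[i-1])
            seq2.append('-')
            i -= 1
        elif dp[i][j] == dp[i][j+1] - 1:
            seq1.append('-')
            seq2.append(x[j-1])
            j += 1
    # Both strings read left to right along x ('-' marks a gap); reverse the
    # second string to line it up column by column with the first.
    return ''.join(reversed(seq1)), ''.join(seq2)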

Knowledge

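Key Difficulties

The main difficulty of this problem lies in understanding and designing an alignment algorithm for reverse complementary sequences. The standard local alignment algorithm (the Smith-Waterman algorithm) has to be modified to fit this requirement. The key points are how to compare the right sequence elements in the dynamic programming matrix and how to trace back to reconstruct the optimal pair of subsequences.

Problem-Solving Tips

For sequence alignment problems, dynamic programming is usually the method to consider.
When designing a dynamic programming algorithm, pay attention to the initial conditions, which are often critical to correctness.
For sequence alignment with gap penalties, the affine gap penalty model is commonly used.
When working with DNA sequences, keep complementary base pairs in mind (A-T, C-G).
The main difference between local and global alignment is whether the alignment may start and end in the middle of the sequences.
When handling reverse complementary sequences, traverse one sequence in reverse order to model the reversal, and use a mapping of complementary base pairs to model the complement.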

Key Vocabulary

global alignment
local alignment
affine gap penalty
reverse complementary
dynamic programming
traceback
subsequence
palindromic sequence
nucleotide
base pair
DNA strand
complementary base pairing

