東京大学新領域創成科学研究科メディカル情報生命専攻 2023年8月実施問題12

Author

Description

(1) The iterative equations below are for calculation of the score of global alignment of two sequences \(x = x_1 \cdots x_m\), \(y = y_1 \cdots y_n\), where \(s(a, b)\) is the match score of character \(a\) and \(b\), and \(0 < e < d\). The initial values are not shown here.

\[ \begin{align*} &\text{for } i = 1, \cdots, m \\ &\quad \text{for } j = 1, \cdots, n \\ &\quad \quad M_{i,j} = \max \begin{cases} M_{i-1,j-1} + s(x_i, y_j) \\ X_{i-1,j-1} + s(x_i, y_j) \\ Y_{i-1,j-1} + s(x_i, y_j) \end{cases} \\ &\quad \quad X_{i,j} = \max \begin{cases} M_{i-1,j} - d \\ X_{i-1,j} - e \\ Y_{i-1,j} - d \end{cases} \\ &\quad \quad Y_{i,j} = \max \begin{cases} M_{i,j-1} - d \\ X_{i,j-1} - d \\ Y_{i,j-1} - e \end{cases} \end{align*} \]

(1-1) Show the formula of the penalty for a gap of length \(k\).

(1-2) Suppose that some of the initial values are

\(M_{0,0} = 0\), \(X_{0,0} = -\infty\), \(Y_{0,0} = -\infty\),

for \(j = 1, \cdots, n\): \(M_{0,j} = X_{0,j} = -\infty\),

for \(i = 1, \cdots, m\): \(M_{i,0} = Y_{i,0} = -\infty\).

Show the initial values \(X_{i,0}\) \((i = 1, \cdots, m)\) and \(Y_{0,j}\) \((j = 1, \cdots, n)\).

(1-3) Show the iterative equations to calculate the maximum score of local alignment using the same type of gap penalty.

(1-4) Explain a method to get a local alignment with the maximum score using the calculation of (1-3).

(2) There is a sequence \(x = x_1 \cdots x_m\) consisting of \(a, c, g, t\). Define the complementary character of \(a, c, g, t\) as

\[ \begin{align*} \text{comp}(a) &= t, \\ \text{comp}(c) &= g, \\ \text{comp}(g) &= c, \\ \text{comp}(t) &= a. \end{align*} \]

(2-1) Explain what is reported by the following algorithm.

\[ \begin{align*} &\text{for } i = 1, \cdots, m: H_{i,m+1} = 0 \\ &\text{for } j = 1, \cdots, m + 1: H_{0,j} = 0 \\ &\text{for } i = 1, \cdots, m \\ &\quad \text{for } j = m, \cdots, i \\ &\quad \quad \text{if } x_i = \text{comp}(x_j) \text{ then } \\ &\quad \quad \quad H_{i,j} = H_{i-1,j+1} + 1 \\ &\quad \quad \text{else } H_{i,j} = 0 \\ &\quad \quad \text{if } H_{i-1,j+1} \ge k \text{ then } \\ &\quad \quad \quad \text{report a pair of ranges } [i - H_{i-1,j+1} + 1, i - 1] \text{ and } [j + 1, j + H_{i-1,j+1}]. \end{align*} \]

(2-2) Let us define the 'reverse complementary alignment score' of two subsequences \(x_{i} \cdots x_{i+p}\) and \(x_{j-q} \cdots x_{j}\) of length \(p + 1\) and \(q + 1\) as the maximum score of global alignment of \(x_{i} \cdots x_{i+p}\) and \(x_{j-q} \cdots x_{j}\). Note that \(x_{j-q} \cdots x_{j}\) is reverse ordered.

Also define the substitution matrix of the alignment as

\[ s(a, b) = \begin{cases} 1 & \text{if } \text{comp}(a) = b \\ -1 & \text{otherwise} \end{cases} \]

and the gap penalty is the number of gaps (a gap of length \(k\) has penalty \(k\)).

Show an algorithm to report a pair of (possibly empty) subsequences of \(x = x_1 \cdots x_m\) with the maximum reverse complementary alignment score.

(1) 下列迭代方程用于计算两个序列 \(x = x_1 \cdots x_m\) 和 \(y = y_1 \cdots y_n\) 的全局比对得分，其中 \(s(a, b)\) 是字符 \(a\) 和 \(b\) 的匹配得分，且 \(0 < e < d\)。初始值未显示。

\[ \begin{align*} &\text{对于 } i = 1, \cdots, m \\ &\quad \text{对于 } j = 1, \cdots, n \\ &\quad \quad M_{i,j} = \max \begin{cases} M_{i-1,j-1} + s(x_i, y_j) \\ X_{i-1,j-1} + s(x_i, y_j) \\ Y_{i-1,j-1} + s(x_i, y_j) \end{cases} \\ &\quad \quad X_{i,j} = \max \begin{cases} M_{i-1,j} - d \\ X_{i-1,j} - e \\ Y_{i-1,j} - d \end{cases} \\ &\quad \quad Y_{i,j} = \max \begin{cases} M_{i,j-1} - d \\ X_{i,j-1} - d \\ Y_{i,j-1} - e \end{cases} \end{align*} \]

(1-1) 展示长度为 \(k\) 的空隙的惩罚公式。

(1-2) 假设一些初始值为

\(M_{0,0} = 0\), \(X_{0,0} = -\infty\), \(Y_{0,0} = -\infty\),

对于 \(j = 1, \cdots, n\): \(M_{0,j} = X_{0,j} = -\infty\),

对于 \(i = 1, \cdots, m\): \(M_{i,0} = Y_{i,0} = -\infty\)。

展示初始值 \(X_{i,0}\) \((i = 1, \cdots, m)\) 和 \(Y_{0,j}\) \((j = 1, \cdots, n)\)。

(1-3) 使用相同类型的空隙惩罚展示计算局部比对最大得分的迭代方程。

(1-4) 解释一种使用 (1-3) 的计算方法获取最大得分的局部比对的方法。

(2) 有一个由 \(a, c, g, t\) 组成的序列 \(x = x_1 \cdots x_m\)。定义 \(a, c, g, t\) 的互补字符为

\[ \begin{align*} \text{comp}(a) &= t, \\ \text{comp}(c) &= g, \\ \text{comp}(g) &= c, \\ \text{comp}(t) &= a. \end{align*} \]

(2-1) 解释以下算法报告的内容。

\[ \begin{align*} &\text{对于 } i = 1, \cdots, m: H_{i,m+1} = 0 \\ &\text{对于 } j = 1, \cdots, m + 1: H_{0,j} = 0 \\ &\text{对于 } i = 1, \cdots, m \\ &\quad \text{对于 } j = m, \cdots, i \\ &\quad \quad \text{如果 } x_i = \text{comp}(x_j) \text{ 那么 } \\ &\quad \quad \quad H_{i,j} = H_{i-1,j+1} + 1 \\ &\quad \quad \text{否则 } H_{i,j} = 0 \\ &\quad \quad \text{如果 } H_{i-1,j+1} \ge k \text{ 那么 } \\ &\quad \quad \quad \text{报告一对范围 } [i - H_{i-1,j+1} + 1, i - 1] \text{ 和 } [j + 1, j + H_{i-1,j+1}]. \end{align*} \]

(2-2) 定义两个子序列 \(x_{i} \cdots x_{i+p}\) 和 \(x_{j-q} \cdots x_{j}\) 的“反向互补对齐得分”为 \(p + 1\) 和 \(q + 1\) 的全局对齐的最大得分。注意 \(x_{j-q} \cdots x_{j}\) 是反向排列的。

同样，定义对齐的替换矩阵为

\[ s(a, b) = \begin{cases} 1 & \text{如果 } \text{comp}(a) = b \\ -1 & \text{否则} \end{cases} \]

并且间隙惩罚是间隙的数量（长度为 \(k\) 的间隙有惩罚 \(k\)）。

展示一个算法报告 \(x = x_1 \cdots x_m\) 的一对（可能为空）子序列，具有最大反向互补对齐得分。

Kai

Written by zephyr

解题思路

本题涉及两个序列的全局和局部比对问题。题目给出了全局比对的迭代公式，并要求推导出相关的公式和算法。序列比对中，常用的评分包括匹配分、错配分和插入/删除（gap）的罚分。罚分由 gap opening penalty 和 gap extension penalty 组成。

反向互补序列是 DNA 双链结构中的一个重要概念。在 DNA 中,A 与 T 配对,C 与 G 配对,两条链的方向相反。因此,一条链的序列可以决定另一条链的序列。这个概念在本题的后半部分起到了关键作用。

1. Global Alignment with Affine Gap Penalty

1-1: Formula for the Penalty of a Gap of Length \(k\)

Let's denote the penalty for a gap of length \(k\) as \(P(k)\). From the given equations, we can see that:

Opening a gap costs \(d\)
Extending a gap costs \(e\) for each additional position

Therefore, the formula for the penalty of a gap of length \(k\) is:

\[ P(k) = d + (k-1)e \]

Note: This is known as an affine gap penalty model.

1-2: Initial Values for \(X_{i,0}\) and \(Y_{0,j}\)

Given the initial conditions:

\(M_{0,0} = 0\)
\(X_{0,0} = -\infty\), \(Y_{0,0} = -\infty\)
\(M_{0,j} = X_{0,j} = -\infty\) for \(j = 1, \cdots, n\)
\(M_{i,0} = Y_{i,0} = -\infty\) for \(i = 1, \cdots, m\)

We need to determine \(X_{i,0}\) \((i = 1, \cdots, m)\) and \(Y_{0,j}\) \((j = 1, \cdots, n)\).

For \(X_{i,0}\) \((i = 1, \cdots, m)\):

\(X_{i,0}\) represents a gap in sequence \(y\) at the beginning. According to the recurrence relation:

\[ X_{i,0} = \max \begin{cases} M_{i-1,0} - d = -\infty \\ X_{i-1,0} - e \\ Y_{i-1,0} - d = -\infty \end{cases} \]

Therefore, \(X_{i,0} = X_{i-1,0} - e = -d - (i-1)e\).

For \(Y_{0,j}\) \((j = 1, \cdots, n)\):

\(Y_{0,j}\) represents a gap in sequence \(x\) at the beginning. It's symmetrical to \(X_{i,0}\):

\[ Y_{0,j} = -d - (j-1)e \]

Hence, the initial values of \(X_{i,0}\) and \(Y_{0,j}\) are as follows:

\(X_{i,0} = -d - (i - 1)e\)
\(Y_{0,j} = -d - (j - 1)e\)

1-3: Iterative Equations for Local Alignment

To compute the local alignment, the iterative equations are modified to allow for the possibility of starting a new alignment, indicated by a score of 0:

\[ M_{i,j} = \max \begin{cases} 0 \\ M_{i-1,j-1} + s(x_i, y_j) \\ X_{i-1,j-1} + s(x_i, y_j) \\ Y_{i-1,j-1} + s(x_i, y_j) \end{cases} \]

\[ X_{i,j} = \max \begin{cases} 0 \\ M_{i-1,j} - d \\ X_{i-1,j} - e \\ Y_{i-1,j} - d \end{cases} \]

\[ Y_{i,j} = \max \begin{cases} 0 \\ M_{i,j-1} - d \\ X_{i,j-1} - d \\ Y_{i,j-1} - e \end{cases} \]

1-4: Obtaining a Local Alignment with Maximum Score

To obtain a local alignment with the maximum score:

Initialize all cells in the first row and column to 0.
Fill the dynamic programming matrix using the equations from (1-3).
Find the cell \((i^*, j^*)\) with the maximum score in the entire matrix.
Perform a traceback from \((i^*, j^*)\) until reaching a cell with score 0 or the matrix boundary.
The path of this traceback gives the optimal local alignment.

2. Reverse Complementary Sequence Analysis

2-1: Explanation of the Algorithm

The algorithm scans a sequence \(x\) for reverse complementary matches. It uses a matrix \(H\) to record the length of matching substrings that are reverse complements. If the length of the match exceeds a threshold \(k\), the algorithm reports the corresponding subsequences.

Specifically:

\(H_{i,j}\) stores the length of the reverse complementary match ending at \(x_i\) and \(x_j\).
The algorithm compares \(x_i\) with the complement of \(x_j\) for \(j\) from \(m\) down to \(i\).
If a match is found, it extends the previous match (\(H_{i-1,j+1}\)) by 1.
If the length of the match (\(H_{i-1,j+1}\)) is at least \(k\), it reports the corresponding ranges.

The reported ranges \([i - H_{i-1,j+1} + 1, i - 1]\) and \([j + 1, j + H_{i-1,j+1}]\) represent the start and end positions of reverse complementary subsequences of length at least \(k\).

2-2: Algorithm for Maximum Reverse Complementary Alignment Score

Algorithm

The algorithm to find the maximum reverse complementary alignment score is as follows:

Initialize a dynamic programming matrix dp where dp[i][j] represents the maximum reverse complementary alignment score ending at positions \(i\) and \(j\).
Fill the matrix using a modified Smith-Waterman algorithm, considering reverse complementary matches.
Keep track of the maximum score and its position.
Perform a traceback from the position of the maximum score to reconstruct the aligned subsequences.
Return the pair of subsequences with the maximum reverse complementary alignment score.

The expected time complexity of this algorithm is \(O(m^2)\), where \(m\) is the length of the sequence \(x\). The expected space complexity is also \(O(m^2)\) due to the dynamic programming matrix.

Code Implementation

def max_reverse_complementary_alignment(x):
    m = len(x)
    # Initialize the dynamic programming matrix
    dp = [[0 for * in range(m+1)] for * in range(m+1)]
    max_score = 0
    max_pos = (0, 0)

    # Define complementary base pairs
    def comp(a):
        return {'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}[a]

    # Define scoring function
    def s(a, b):
        return 1 if comp(a) == b else -1

    # Fill the dynamic programming matrix
    for i in range(1, m+1):
        for j in range(m, 0, -1):  # Note: reverse order, as we're looking for reverse complements
            match = dp[i-1][j+1] + s(x[i-1], x[j-1])
            delete = dp[i-1][j] - 1
            insert = dp[i][j+1] - 1
            dp[i][j] = max(0, match, delete, insert)
            if dp[i][j] > max_score:
                max_score = dp[i][j]
                max_pos = (i, j)

    # Traceback process, reconstruct optimal alignment
    i, j = max_pos
    seq1, seq2 = [], []
    while dp[i][j] > 0:
        if dp[i][j] == dp[i-1][j+1] + s(x[i-1], x[j-1]):
            seq1.append(x[i-1])
            seq2.append(x[j-1])
            i -= 1
            j += 1
        elif dp[i][j] == dp[i-1][j] - 1:
            seq1.append(x[i-1])
            seq2.append('-')
            i -= 1
        elif dp[i][j] == dp[i][j+1] - 1:
            seq1.append('-')
            seq2.append(x[j-1])
            j += 1
    return ''.join(reversed(seq1)), ''.join(seq2)

Knowledge

难点思路

这道题目的难点主要在于理解和设计反向互补序列的比对算法。我们需要修改传统的局部比对算法 (Smith-Waterman 算法) 来适应这个特殊的需求。关键是要理解如何在动态规划矩阵中正确地比较序列元素,以及如何进行回溯以重构最优的子序列对。

解题技巧和信息

对于序列比对问题,通常可以考虑使用动态规划方法。
在设计动态规划算法时,要注意初始条件的设置,这往往对算法的正确性至关重要。
对于带有间隔惩罚的序列比对,通常使用仿射间隔惩罚模型 (affine gap penalty model)。
在处理 DNA 序列时,要注意互补碱基对的概念 (A-T, C-G)。
局部比对和全局比对的主要区别在于是否允许比对从序列中间开始和结束。
在处理反向互补序列时,可以通过逆序遍历一个序列来模拟反向操作,同时使用互补碱基对的映射来处理互补关系。

重点词汇

global alignment 全局比对
local alignment 局部比对
affine gap penalty 仿射间隔惩罚
reverse complementary 反向互补
dynamic programming 动态规划
traceback 回溯
subsequence 子序列
palindromic sequence 回文序列
nucleotide 核苷酸
base pair 碱基对
DNA strand DNA 链
complementary base pairing 互补碱基配对

参考资料

Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press. Chapter 2-3.
Gusfield, D. (1997). Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge university press. Chapter 11-12.

東京大学 新領域創成科学研究科 メディカル情報生命専攻 2023年8月実施 問題12