東京大学新領域創成科学研究科メディカル情報生命専攻 2015年8月実施問題12

Author

Description

Assume that a global alignment of two sequences, \(x_1, \dots, x_m\) and \(y_1, \dots, y_n\), is calculated by a dynamic programming using the iterative formula (A).

\[ F(i, j) = \max \begin{cases} F(i - 1, j - 1) + s(x_i, y_j) \\ F(i - 1, j) - d \\ F(i, j - 1) - d \end{cases} \quad \text{(A)} \]

where \(s(x_i, y_j)\) is the score of aligning \(x_i\) and \(y_j\), \(F(i, j)\) is the maximum score of the alignments of \(x_1, \dots, x_i\) and \(y_1, \dots, y_j\). Assume that \(d > 0\) and that score \(g(k)\) is given to a gap of length \(k\). Answer the following questions (1) – (5).

(1) Show the general form of \(g(k)\).

(2) Show the initial values \(F(i,0)\) for \(i = 1, \dots, m\) and \(F(0, j)\) for \(j = 1, \dots, n\), so that the maximum score of the alignments of the two sequences \(x_1, \dots, x_m\) and \(y_1, \dots, y_n\) is obtained as \(F(m,n)\) after updating the iterative formula (A) for \(i = 1, \dots, m\) and \(j = 1, \dots, n\). Notice that \(F(0,0) = 0\).

(3) Evaluate the computational time of calculating the maximum score of the alignments of the two sequences \(x_1, \dots, x_m\) and \(y_1, \dots, y_n\), and describe it by using \(m\) and \(n\).

(4) When updating formula (A) for \(i = 1, \dots, m\) and \(j = 1, \dots, n\), \(\pi(i,j)\) is defined as follows:

Among the values of \(F(i - 1, j - 1) + s(x_i, y_j)\), \(F(i - 1, j) - d\) and \(F(i, j - 1) - d\), when \(F(i - 1, j - 1) + s(x_i, y_j)\) is the maximum, \(\pi(i, j) = (i - 1, j - 1)\), otherwise, when \(F(i - 1, j) - d\) is the maximum, \(\pi(i, j) = (i - 1, j)\), otherwise, \(\pi(i, j) = (i, j - 1)\). Briefly explain, within five lines, about the role of \(\pi(i, j)\) in the alignment algorithm.

假设通过动态规划计算两个序列 \(x_1, \dots, x_m\) 和 \(y_1, \dots, y_n\) 的全局比对，使用迭代公式 (A)。

\[ F(i, j) = \max \begin{cases} F(i - 1, j - 1) + s(x_i, y_j) \\ F(i - 1, j) - d \\ F(i, j - 1) - d \end{cases} \quad \text{(A)} \]

其中 \(s(x_i, y_j)\) 是比对 \(x_i\) 和 \(y_j\) 的得分，\(F(i, j)\) 是 \(x_1, \dots, x_i\) 和 \(y_1, \dots, y_j\) 的比对的最大得分。假设 \(d > 0\) 并且对于长度为 \(k\) 的空隙给定得分 \(g(k)\)。回答以下问题 (1) – (5)。

(1) 展示 \(g(k)\) 的一般形式。

(2) 展示初始值 \(F(i,0)\) 对于 \(i = 1, \dots, m\) 和 \(F(0, j)\) 对于 \(j = 1, \dots, n\)，使得在更新迭代公式 (A) 之后，对于 \(i = 1, \dots, m\) 和 \(j = 1, \dots, n\)，两个序列 \(x_1, \dots, x_m\) 和 \(y_1, \dots, y_n\) 的比对最大得分为 \(F(m,n)\)。注意 \(F(0,0) = 0\)。

(3) 评估计算 \(x_1, \dots, x_m\) 和 \(y_1, \dots, y_n\) 两个序列的比对最大得分的计算时间，并用 \(m\) 和 \(n\) 描述。

(4) 在更新公式 (A) 时，对于 \(i = 1, \dots, m\) 和 \(j = 1, \dots, n\)，\(\pi(i,j)\) 定义如下：

在 \(F(i - 1, j - 1) + s(x_i, y_j)\), \(F(i - 1, j) - d\) 和 \(F(i, j - 1) - d\) 的值中，当 \(F(i - 1, j - 1) + s(x_i, y_j)\) 是最大值时，\(\pi(i, j) = (i - 1, j - 1)\), 否则，当 \(F(i - 1, j) - d\) 是最大值时，\(\pi(i, j) = (i - 1, j)\), 否则，\(\pi(i, j) = (i, j - 1)\)。简要解释 \(\pi(i, j)\) 在比对算法中的作用，限制在五行以内。

Kai

(1)

The general form of \(g(k)\), which is the score given to a gap of length \(k\), is typically represented as \(g(k) = -dk\) where \(d\) is a positive penalty for each gap. This linear form assumes a constant penalty for each gap, reflecting the simple gap penalty model.

(2)

The initial values are determined based on the penalty for gaps. Specifically:

\[ F(i, 0) = -di \quad \text{for} \quad i = 1, \dots, m \]

\[ F(0, j) = -dj \quad \text{for} \quad j = 1, \dots, n \]

This initialization reflects the cumulative penalty for introducing gaps in either sequence up to length \(i\) or \(j\).

(3)

The computational time of calculating the maximum score of the alignments involves filling in an \(m \times n\) matrix \(F\). For each cell \((i, j)\) in this matrix, we compute the value using the iterative formula, which takes constant time \(O(1)\). Since there are \(m \times n\) cells, the overall time complexity is:

\[ O(mn) \]

(4)

The role of \(\pi(i, j)\) is to record the source of the maximum value for \(F(i, j)\), indicating the optimal move that leads to the current cell. This allows us to trace back from \(F(m, n)\) to \(F(0, 0)\) to reconstruct the optimal alignment path between the two sequences.

Knowledge

动态规划序列比对全局比对路径回溯复杂度分析

解题技巧和信息

在序列比对中，理解递推公式的三个选择对应的不同操作（匹配/错配、插入间隙）是非常重要的。
初始化边界条件可以帮助你理解和构建完整的动态规划表。
通过路径回溯可以重构出最优的对齐方式，而不仅仅是求得最大得分。

重点词汇

alignment 对齐
sequence 序列
dynamic programming 动态规划
gap penalty 间隙惩罚
traceback 回溯

参考资料

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press. Chap. 2

東京大学 新領域創成科学研究科 メディカル情報生命専攻 2015年8月実施 問題12