東京大学新領域創成科学研究科メディカル情報生命専攻 2014年8月実施問題12

Author

Description

Suppose that the maximum score of the global alignment of the two sequences, \(x = x_1, \ldots, x_m\) and \(y = y_1, \ldots, y_n\), is calculated by a dynamic programming using the following iterations (A).

\[ \begin{aligned} M(i, j) &= \max[M(i-1, j-1), X(i-1, j-1), Y(i-1, j-1)] + s(x_i, y_j), \\ X(i, j) &= \max[M(i-1, j) - d, X(i-1, j) - e], \\ Y(i, j) &= \max[M(i, j-1) - d, Y(i, j-1) - e]. \end{aligned} \tag{A} \]

Here we assume that \(d > 0\), \(e > 0\) and that \(s(x_i, y_j)\) is the score of alignment of \(x_i\) and \(y_j\).

We also assume that the same score of \(g(k)\) is given to any gap of length \(k\).

Answer the following problems.

(1) Show the general form of \(g(k)\).

(2) In order to get the maximum score of the alignment as \(\max[M(m, n), X(m, n), Y(m, n)]\) after iterations using (A) for \(i = 1, \ldots, m\) and \(j = 1, \ldots, n\), answer what initial values should be assigned to the following variables.

\(X(i, 0)\) for \(i = 1, \ldots, m\)
\(Y(0, j)\) for \(j = 1, \ldots, n\)

You may suppose that

\[ \begin{aligned} M(0, 0) &= 0, \\ X(0, 0) &= Y(0, 0) = -\infty, \\ M(i, 0) &= Y(i, 0) = -\infty \quad (i = 1, \ldots, m), \\ M(0, j) &= X(0, j) = -\infty \quad (j = 1, \ldots, n). \end{aligned} \]

(3) For the alignment of biological sequences, usually we set \(d > e\). Describe the reason briefly.

(4) In order to treat the score of long gaps more accurately, let us consider alignments using the following iterations. Show the general form of the gap score of length \(k\), when \(d_2 > d_1\), \(e_2 > e_1\) holds. Note that classification on the value of \(k\) is necessary.

\[ \begin{aligned} M(i, j) &= \max[M(i-1, j-1), X_1(i-1, j-1), Y_1(i-1, j-1),X_2(i-1, j-1), Y_2(i-1, j-1)] + s(x_i, y_j), \\ X_1(i, j) &= \max[M(i-1, j) - d_1, X_1(i-1, j) - e_1], \\ Y_1(i, j) &= \max[M(i, j-1) - d_1, Y_1(i, j-1) - e_1], \\ X_2(i, j) &= \max[M(i-1, j) - d_2, X_2(i-1, j) - e_2], \\ Y_2(i, j) &= \max[M(i, j-1) - d_2, Y_2(i, j-1) - e_2]. \end{aligned} \]

假设通过使用以下迭代 (A) 计算两个序列 \(x = x_1, \ldots, x_m\) 和 \(y = y_1, \ldots, y_n\) 的全局比对的最大得分。

\[ \begin{aligned} M(i, j) &= \max[M(i-1, j-1), X(i-1, j-1), Y(i-1, j-1)] + s(x_i, y_j), \\ X(i, j) &= \max[M(i-1, j) - d, X(i-1, j) - e], \\ Y(i, j) &= \max[M(i, j-1) - d, Y(i, j-1) - e]. \end{aligned} \tag{A} \]

这里我们假设 \(d > 0\)，\(e > 0\) 并且 \(s(x_i, y_j)\) 是 \(x_i\) 和 \(y_j\) 的比对得分。

我们还假设对于任意长度为 \(k\) 的间隙给出了相同的得分 \(g(k)\)。

回答以下问题。

(1) 展示 \(g(k)\) 的一般形式。

(2) 为了在使用 (A) 迭代后获得最大比对得分 \(\max[M(m, n), X(m, n), Y(m, n)]\)，回答应为以下变量分配什么初始值。

\(X(i, 0)\) 对于 \(i = 1, \ldots, m\)
\(Y(0, j)\) 对于 \(j = 1, \ldots, n\)

你可以假设

\[ \begin{aligned} M(0, 0) &= 0, \\ X(0, 0) &= Y(0, 0) = -\infty, \\ M(i, 0) &= Y(i, 0) = -\infty \quad (i = 1, \ldots, m), \\ M(0, j) &= X(0, j) = -\infty \quad (j = 1, \ldots, n). \end{aligned} \]

(3) 对于生物序列的比对，通常我们设置 \(d > e\)。简要描述原因。

(4) 为了更准确地处理长间隙的得分，让我们考虑使用以下迭代进行比对。展示当 \(d_2 > d_1\)，\(e_2 > e_1\) 时长度为 \(k\) 的间隙得分的一般形式。注意对 \(k\) 值的分类是必要的。

\[ \begin{aligned} M(i, j) &= \max[M(i-1, j-1), X_1(i-1, j-1), Y_1(i-1, j-1), X_2(i-1, j-1), Y_2(i-1, j-1)] + s(x_i, y_j), \\ X_1(i, j) &= \max[M(i-1, j) - d_1, X_1(i-1, j) - e_1], \\ Y_1(i, j) &= \max[M(i, j-1) - d_1, Y_1(i, j-1) - e_1], \\ X_2(i, j) &= \max[M(i-1, j) - d_2, X_2(i-1, j) - e_2], \\ Y_2(i, j) &= \max[M(i, j-1) - d_2, Y_2(i, j-1) - e_2]. \end{aligned} \]

Kai

(1)

To find the general form of \(g(k)\), we need to analyze the recursive formulas for \(X(i,j)\) and \(Y(i,j)\).

For a gap of length \(k\), the score will be:

\[ g(k) = -d - (k-1)e \]

Explanation:

The first gap opening costs \(d\)
Each subsequent gap extension costs \(e\)
For a gap of length \(k\), we have 1 opening and \(k-1\) extensions

Therefore, the general form of \(g(k)\) is:

\[ g(k) = -d - (k-1)e \]

(2)

Given the initial conditions:

\[ \begin{aligned} M(0, 0) &= 0, \\ X(0, 0) &= Y(0, 0) = -\infty, \\ M(i, 0) &= Y(i, 0) = -\infty \quad (i = 1, \ldots, m), \\ M(0, j) &= X(0, j) = -\infty \quad (j = 1, \ldots, n). \end{aligned} \]

We need to determine \(X(i,0)\) for \(i = 1, \ldots, m\) and \(Y(0,j)\) for \(j = 1, \ldots, n\).

For \(X(i,0)\):

\[ \begin{aligned} X(1,0) &= \max[M(0,0) - d, X(0,0) - e] = \max[-d, -\infty] = -d \\ X(2,0) &= \max[M(1,0) - d, X(1,0) - e] = \max[-\infty, -d-e] = -d-e \\ X(3,0) &= \max[M(2,0) - d, X(2,0) - e] = \max[-\infty, -d-2e] = -d-2e \\ &\vdots \end{aligned} \]

Therefore, \(X(i,0) = -d-(i-1)e\) for \(i = 1, \ldots, m\).

For \(Y(0,j)\):

\[ \begin{aligned} Y(0,1) &= \max[M(0,0) - d, Y(0,0) - e] = \max[-d, -\infty] = -d \\ Y(0,2) &= \max[M(0,1) - d, Y(0,1) - e] = \max[-\infty, -d-e] = -d-e \\ Y(0,3) &= \max[M(0,2) - d, Y(0,2) - e] = \max[-\infty, -d-2e] = -d-2e \\ &\vdots \end{aligned} \]

Therefore, \(Y(0,j) = -d-(j-1)e\) for \(j = 1, \ldots, n\).

(3)

In biological sequence alignment, we usually set \(d > e\) (i.e., the gap opening penalty is larger than the gap extension penalty) for the following reasons:

Biological relevance: In evolutionary processes, it's more likely for a single mutation event to cause a contiguous gap (insertion or deletion) of multiple residues than for multiple independent single-residue indel events to occur adjacently. Setting \(d > e\) reflects this biological reality.
Alignment quality: This setting encourages fewer, longer gaps rather than many short gaps in the alignment. This often leads to more biologically meaningful alignments, as it better represents the evolutionary events that may have occurred.
Computational efficiency: By discouraging the introduction of many small gaps, this setting can reduce the search space for optimal alignments, potentially improving the efficiency of alignment algorithms.

(4)

For the more complex model with two sets of gap penalties \((d_1, e_1)\) and \((d_2, e_2)\), where \(d_2 > d_1\) and \(e_2 > e_1\), the general form of the gap score \(g(k)\) will be:

\[ g(k) = \begin{cases} -d_1 - (k-1)e_1 & \text{if } 1 \leq k \leq k^* \\ -d_2 - (k-1)e_2 & \text{if } k > k^* \end{cases} \]

where \(k^*\) is the break-even point where the two scoring schemes give the same result.

To find \(k^*\), we solve:

\[ -d_1 - (k^*-1)e_1 = -d_2 - (k^*-1)e_2 \]

Solving for \(k^*\):

\[ k^* = \frac{d_2 - d_1}{e_2 - e_1} + 1 \]

Note that \(k^*\) may not be an integer. In practice, we would use the floor or ceiling of this value, depending on the specific requirements of the alignment algorithm.

This more complex model allows for a more nuanced treatment of gap penalties, potentially leading to more accurate alignments, especially for sequences with varying gap length distributions.

Knowledge

动态规划序列比对全局比对生物信息学

难点思路

理解动态规划在序列比对中的应用
分析递推公式以推导空位罚分函数
理解不同空位罚分模型的生物学意义
处理更复杂的空位罚分模型，并推导其一般形式

解题技巧和信息

分析递推公式时，关注 gap opening 和 gap extension 的区别
初始值的设置对于动态规划算法的正确性至关重要
在生物序列比对中，参数的设置常常基于生物学知识和经验
处理分段函数时，寻找分界点（break-even point）是关键

重点词汇

Global alignment: 全局比对
Dynamic programming: 动态规划
Gap penalty: 空位罚分
Gap opening penalty: 空位开放罚分
Gap extension penalty: 空位延伸罚分
Sequence alignment: 序列比对
Affine gap penalty: 仿射空位罚分

参考资料

Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Chap. 3.

東京大学 新領域創成科学研究科 メディカル情報生命専攻 2014年8月実施 問題12

Author

Description

Kai

(1)

(2)

(3)

(4)

Knowledge

难点思路

解题技巧和信息

重点词汇

参考资料

東京大学新領域創成科学研究科メディカル情報生命専攻 2014年8月実施問題12