東京大学新領域創成科学研究科メディカル情報生命専攻 2017年8月実施問題12

Author

Description

For two strings \(\mathbf{x} = x[1] \cdots x[n]\) and \(\mathbf{y} = y[1] \cdots y[m]\) of lengths \(n\) and \(m\) (\(n \geq 0, m \geq 0\)), we define the set of common subsequences as

\[ S = \left\{ \begin{pmatrix} i_1, \cdots, i_r \\ j_1, \cdots, j_r \end{pmatrix} \mid r: \text{positive integer}, \begin{matrix} 1 \leq i_1 < \cdots < i_r \leq n \\ 1 \leq j_1 < \cdots < j_r \leq m \end{matrix}, x[i_k] = y[j_k], k = 1, \cdots, r \right\} \]

Here, \(r\) represents the length of common subsequence \(\begin{pmatrix} i_1, \cdots, i_r \\ j_1, \cdots, j_r \end{pmatrix}\). Let \(l(x, y)\) denote the length of the longest common subsequences. If \(S = \emptyset\), we define \(l(x, y) = 0\).

(1) Let \(z_k^p\) and \(z_k^s\) denote the length-\(k\) prefix and suffix of string \(z\), respectively. We define \(\alpha[i, j] = l(x_i^p, y_j^p)\) and \(\beta[i, j] = l(x_{n-i+1}^s, y_{m-j+1}^s)\) for \(1 \leq i \leq n, 1 \leq j \leq m\). Describe a recurrence for computing \(\alpha[i, j]\) from \(\alpha[i-1, j-1]\), \(\alpha[i-1, j]\), and \(\alpha[i, j-1]\) when \(1 < i, 1 < j\). Also, describe a recurrence for computing \(\beta[i, j]\) from \(\beta[i+1, j]\), \(\beta[i, j+1]\), and \(\beta[i+1, j+1]\) when \(i < n, j < m\).

(2) Compute matrices \(\alpha\) and \(\beta\) for \(\mathbf{x} = \text{ACTGG}\) and \(\mathbf{y} = \text{ACACG}\).

(3) Suppose matrix \(\alpha\) is given. Write a pseudocode for obtaining one of the longest common subsequences.

(4) Suppose matrices \(\alpha\) and \(\beta\) are given. Write a pseudocode for computing the maximal length of common subsequences that contain the \(i\)-th position of \(\mathbf{x}\) (\(1 \leq i \leq n\)).

对于长度为 \(n\) 和 \(m\) 的两个字符串 \(\mathbf{x} = x[1] \cdots x[n]\) 和 \(\mathbf{y} = y[1] \cdots y[m]\) (\(n \geq 0, m \geq 0\))，我们定义公共子序列集为

\[ S = \left\{ \begin{pmatrix} i_1, \cdots, i_r \\ j_1, \cdots, j_r \end{pmatrix} \mid r: \text{正整数}, \begin{matrix} 1 \leq i_1 < \cdots < i_r \leq n \\ 1 \leq j_1 < \cdots < j_r \leq m \end{matrix}, x[i_k] = y[j_k], k = 1, \cdots, r \right\} \]

这里，\(r\) 表示公共子序列 \(\begin{pmatrix} i_1, \cdots, i_r \\ j_1, \cdots, j_r \end{pmatrix}\) 的长度。令 \(l(x, y)\) 表示最长公共子序列的长度。如果 \(S = \emptyset\)，我们定义 \(l(x, y) = 0\)。

(1) 令 \(z_k^p\) 和 \(z_k^s\) 分别表示字符串 \(z\) 的长度为 \(k\) 的前缀和后缀。我们定义 \(\alpha[i, j] = l(x_i^p, y_j^p)\) 和 \(\beta[i, j] = l(x_{n-i+1}^s, y_{m-j+1}^s)\) 对于 \(1 \leq i \leq n, 1 \leq j \leq m\)。描述从 \(\alpha[i-1, j-1]\)，\(\alpha[i-1, j]\)，和 \(\alpha[i, j-1]\) 计算 \(\alpha[i, j]\) 的递推关系，当 \(1 < i, 1 < j\)。同时，描述从 \(\beta[i+1, j]\)，\(\beta[i, j+1]\)，和 \(\beta[i+1, j+1]\) 计算 \(\beta[i, j]\) 的递推关系，当 \(i < n, j < m\)。

(2) 计算 \(\mathbf{x} = \text{ACTGG}\) 和 \(\mathbf{y} = \text{ACACG}\) 的矩阵 \(\alpha\) 和 \(\beta\)。

(3) 假设给定矩阵 \(\alpha\)。编写伪代码以获取一个最长的公共子序列。

(4) 假设给定矩阵 \(\alpha\) 和 \(\beta\)。编写伪代码以计算包含 \(\mathbf{x}\) 的第 \(i\) 个位置的最大长度的公共子序列 (\(1 \leq i \leq n\))。

Kai

(1)

Recurrence for \(\alpha[i, j]\)

The value of \(\alpha[i, j] = l(x_i^p, y_j^p)\) can be computed as follows:

If \(x[i] = y[j]\), then \(\alpha[i, j] = \alpha[i-1, j-1] + 1\).
If \(x[i] \neq y[j]\), then \(\alpha[i, j] = \max(\alpha[i-1, j], \alpha[i, j-1])\).

This can be summarized as:

\[ \alpha[i, j] = \begin{cases} \alpha[i-1, j-1] + 1 & \text{if } x[i] = y[j] \\ \max(\alpha[i-1, j], \alpha[i, j-1]) & \text{if } x[i] \neq y[j] \end{cases} \]

Recurrence for \(\beta[i, j]\)

The value of \(\beta[i, j] = l(x_{n-i+1}^s, y_{m-j+1}^s)\) can be computed as follows:

If \(x[n-i+1] = y[m-j+1]\), then \(\beta[i, j] = \beta[i+1, j+1] + 1\).
If \(x[n-i+1] \neq y[m-j+1]\), then \(\beta[i, j] = \max(\beta[i+1, j], \beta[i, j+1])\).

This can be summarized as:

\[ \beta[i, j] = \begin{cases} \beta[i+1, j+1] + 1 & \text{if } x[n-i+1] = y[m-j+1] \\ \max(\beta[i+1, j], \beta[i, j+1]) & \text{if } x[n-i+1] \neq y[m-j+1] \end{cases} \]

(2)

Matrix \(\alpha\)

Let's fill the matrix \(\alpha\) for \(\mathbf{x} = \text{ACTGG}\) and \(\mathbf{y} = \text{ACACG}\):

x\y	A	C	A	C	G
A	1	1	1	1	1
C	1	2	2	2	2
T	1	2	2	2	2
G	1	2	2	2	3
G	1	2	2	2	3

Matrix \(\beta\)

Let's fill the matrix \(\beta\) for \(\mathbf{x} = \text{ACTGG}\) and \(\mathbf{y} = \text{ACACG}\):

x\y	G	C	A	C	A
G	1	1	1	1	1
G	1	1	1	1	1
T	1	1	1	1	1
C	1	2	2	2	2
A	1	2	3	3	3

(3)

To extract one of the longest common subsequences from the matrix \(\alpha\), we use the following algorithm. This algorithm traces back from the bottom-right corner of the matrix to the top-left corner, reconstructing the longest common subsequence by following the path of optimal choices recorded in \(\alpha\).

Explanation

Start from the bottom-right corner of the matrix \(\alpha\), i.e., \(\alpha[n, m]\).
Compare characters of \(\mathbf{x}\) and \(\mathbf{y}\):
If \(x[i] = y[j]\), include \(x[i]\) in the result and move diagonally to \(\alpha[i-1, j-1]\).
If \(x[i] \neq y[j]\), move in the direction that gives the larger value (either up or left).
Continue this process until reaching the top-left corner of the matrix.
The result will be one of the longest common subsequences.

Pseudocode

function getLCS(x, y, alpha)
    i = length(x)
    j = length(y)
    lcs = ""
    while i > 0 and j > 0
        if x[i] == y[j]
            lcs = x[i] + lcs
            i = i - 1
            j = j - 1
        else if alpha[i-1, j] >= alpha[i, j-1]
            i = i - 1
        else
            j = j - 1
    return lcs

(4)

Given matrices \(\alpha\) and \(\beta\), we can compute the maximal length of common subsequences that include the \(i\)-th position of \(\mathbf{x}\). This is done by evaluating the length of subsequences that start from the beginning and end at the \(i\)-th position, combined with subsequences that start at the \(i\)-th position and extend to the end.

Explanation

For each position \(j\) in \(\mathbf{y}\):
Combine the length of the prefix up to \(i\) (\(\alpha[i, j]\)) with the length of the suffix from \(i\) onwards (\(\beta[n-i+1, m-j+1]\)) as the length of the common subsequence.
If \(x[i] = y[j]\), since \(x[i]\) is included in both the prefix and suffix, subtract 1 from the total length.
The maximum value obtained through this process gives the desired length.

Pseudocode

function maxLengthWithPosition(x, y, alpha, beta, i)
    maxLength = 0
    for j = 1 to length(y)
        currentLength = alpha[i, j] + beta[n-i+1, m-j+1]
        if x[i] == y[j]
            currentLength = currentLength - 1
        maxLength = max(maxLength, currentLength)
    return maxLength

Knowledge

动态规划最长公共子序列递归

难点思路

对于递归关系的理解和矩阵填充的具体实现可能会比较复杂，需要仔细考虑每一步的递推关系。

解题技巧和信息

动态规划表格填充方法：先初始化，然后按照递推关系逐步填充。
递归关系需要对字符串字符的匹配情况进行详细考虑，以确保递推关系的正确性。

重点词汇

common subsequence 公共子序列
recurrence relation 递推关系
prefix 前缀
suffix 后缀
dynamic programming 动态规划

参考资料

Introduction to Algorithms, 3^rd Edition, Cormen et al., Chapter 15.

東京大学 新領域創成科学研究科 メディカル情報生命専攻 2017年8月実施 問題12

Author

Description

Kai

(1)

Recurrence for \(\alpha[i, j]\)

Recurrence for \(\beta[i, j]\)

(2)

Matrix \(\alpha\)

Matrix \(\beta\)

(3)

Explanation

Pseudocode

(4)

Explanation

Pseudocode

Knowledge

难点思路

解题技巧和信息

重点词汇

参考资料

東京大学新領域創成科学研究科メディカル情報生命専攻 2017年8月実施問題12