東京大学 新領域創成科学研究科 メディカル情報生命専攻 2017年8月実施 問題12
Author
Description
For two strings \(\mathbf{x} = x[1] \cdots x[n]\) and \(\mathbf{y} = y[1] \cdots y[m]\) of lengths \(n\) and \(m\) (\(n \geq 0, m \geq 0\)), we define the set of common subsequences as
Here, \(r\) represents the length of common subsequence \(\begin{pmatrix} i_1, \cdots, i_r \\ j_1, \cdots, j_r \end{pmatrix}\). Let \(l(x, y)\) denote the length of the longest common subsequences. If \(S = \emptyset\), we define \(l(x, y) = 0\).
(1) Let \(z_k^p\) and \(z_k^s\) denote the length-\(k\) prefix and suffix of string \(z\), respectively. We define \(\alpha[i, j] = l(x_i^p, y_j^p)\) and \(\beta[i, j] = l(x_{n-i+1}^s, y_{m-j+1}^s)\) for \(1 \leq i \leq n, 1 \leq j \leq m\). Describe a recurrence for computing \(\alpha[i, j]\) from \(\alpha[i-1, j-1]\), \(\alpha[i-1, j]\), and \(\alpha[i, j-1]\) when \(1 < i, 1 < j\). Also, describe a recurrence for computing \(\beta[i, j]\) from \(\beta[i+1, j]\), \(\beta[i, j+1]\), and \(\beta[i+1, j+1]\) when \(i < n, j < m\).
(2) Compute matrices \(\alpha\) and \(\beta\) for \(\mathbf{x} = \text{ACTGG}\) and \(\mathbf{y} = \text{ACACG}\).
(3) Suppose matrix \(\alpha\) is given. Write a pseudocode for obtaining one of the longest common subsequences.
(4) Suppose matrices \(\alpha\) and \(\beta\) are given. Write a pseudocode for computing the maximal length of common subsequences that contain the \(i\)-th position of \(\mathbf{x}\) (\(1 \leq i \leq n\)).
对于长度为 \(n\) 和 \(m\) 的两个字符串 \(\mathbf{x} = x[1] \cdots x[n]\) 和 \(\mathbf{y} = y[1] \cdots y[m]\) (\(n \geq 0, m \geq 0\)),我们定义公共子序列集为
这里,\(r\) 表示公共子序列 \(\begin{pmatrix} i_1, \cdots, i_r \\ j_1, \cdots, j_r \end{pmatrix}\) 的长度。令 \(l(x, y)\) 表示最长公共子序列的长度。如果 \(S = \emptyset\),我们定义 \(l(x, y) = 0\)。
(1) 令 \(z_k^p\) 和 \(z_k^s\) 分别表示字符串 \(z\) 的长度为 \(k\) 的前缀和后缀。我们定义 \(\alpha[i, j] = l(x_i^p, y_j^p)\) 和 \(\beta[i, j] = l(x_{n-i+1}^s, y_{m-j+1}^s)\) 对于 \(1 \leq i \leq n, 1 \leq j \leq m\)。描述从 \(\alpha[i-1, j-1]\),\(\alpha[i-1, j]\),和 \(\alpha[i, j-1]\) 计算 \(\alpha[i, j]\) 的递推关系,当 \(1 < i, 1 < j\)。同时,描述从 \(\beta[i+1, j]\),\(\beta[i, j+1]\),和 \(\beta[i+1, j+1]\) 计算 \(\beta[i, j]\) 的递推关系,当 \(i < n, j < m\)。
(2) 计算 \(\mathbf{x} = \text{ACTGG}\) 和 \(\mathbf{y} = \text{ACACG}\) 的矩阵 \(\alpha\) 和 \(\beta\)。
(3) 假设给定矩阵 \(\alpha\)。编写伪代码以获取一个最长的公共子序列。
(4) 假设给定矩阵 \(\alpha\) 和 \(\beta\)。编写伪代码以计算包含 \(\mathbf{x}\) 的第 \(i\) 个位置的最大长度的公共子序列 (\(1 \leq i \leq n\))。
Kai
(1)
Recurrence for \(\alpha[i, j]\)
The value of \(\alpha[i, j] = l(x_i^p, y_j^p)\) can be computed as follows:
- If \(x[i] = y[j]\), then \(\alpha[i, j] = \alpha[i-1, j-1] + 1\).
- If \(x[i] \neq y[j]\), then \(\alpha[i, j] = \max(\alpha[i-1, j], \alpha[i, j-1])\).
This can be summarized as:
Recurrence for \(\beta[i, j]\)
The value of \(\beta[i, j] = l(x_{n-i+1}^s, y_{m-j+1}^s)\) can be computed as follows:
- If \(x[n-i+1] = y[m-j+1]\), then \(\beta[i, j] = \beta[i+1, j+1] + 1\).
- If \(x[n-i+1] \neq y[m-j+1]\), then \(\beta[i, j] = \max(\beta[i+1, j], \beta[i, j+1])\).
This can be summarized as:
(2)
Matrix \(\alpha\)
Let's fill the matrix \(\alpha\) for \(\mathbf{x} = \text{ACTGG}\) and \(\mathbf{y} = \text{ACACG}\):
x\y | A | C | A | C | G |
---|---|---|---|---|---|
A | 1 | 1 | 1 | 1 | 1 |
C | 1 | 2 | 2 | 2 | 2 |
T | 1 | 2 | 2 | 2 | 2 |
G | 1 | 2 | 2 | 2 | 3 |
G | 1 | 2 | 2 | 2 | 3 |
Matrix \(\beta\)
Let's fill the matrix \(\beta\) for \(\mathbf{x} = \text{ACTGG}\) and \(\mathbf{y} = \text{ACACG}\):
x\y | G | C | A | C | A |
---|---|---|---|---|---|
G | 1 | 1 | 1 | 1 | 1 |
G | 1 | 1 | 1 | 1 | 1 |
T | 1 | 1 | 1 | 1 | 1 |
C | 1 | 2 | 2 | 2 | 2 |
A | 1 | 2 | 3 | 3 | 3 |
(3)
To extract one of the longest common subsequences from the matrix \(\alpha\), we use the following algorithm. This algorithm traces back from the bottom-right corner of the matrix to the top-left corner, reconstructing the longest common subsequence by following the path of optimal choices recorded in \(\alpha\).
Explanation
- Start from the bottom-right corner of the matrix \(\alpha\), i.e., \(\alpha[n, m]\).
- Compare characters of \(\mathbf{x}\) and \(\mathbf{y}\):
- If \(x[i] = y[j]\), include \(x[i]\) in the result and move diagonally to \(\alpha[i-1, j-1]\).
- If \(x[i] \neq y[j]\), move in the direction that gives the larger value (either up or left).
- Continue this process until reaching the top-left corner of the matrix.
- The result will be one of the longest common subsequences.
Pseudocode
(4)
Given matrices \(\alpha\) and \(\beta\), we can compute the maximal length of common subsequences that include the \(i\)-th position of \(\mathbf{x}\). This is done by evaluating the length of subsequences that start from the beginning and end at the \(i\)-th position, combined with subsequences that start at the \(i\)-th position and extend to the end.
Explanation
- For each position \(j\) in \(\mathbf{y}\):
- Combine the length of the prefix up to \(i\) (\(\alpha[i, j]\)) with the length of the suffix from \(i\) onwards (\(\beta[n-i+1, m-j+1]\)) as the length of the common subsequence.
- If \(x[i] = y[j]\), since \(x[i]\) is included in both the prefix and suffix, subtract 1 from the total length.
- The maximum value obtained through this process gives the desired length.
Pseudocode
Knowledge
动态规划 最长公共子序列 递归
难点思路
对于递归关系的理解和矩阵填充的具体实现可能会比较复杂,需要仔细考虑每一步的递推关系。
解题技巧和信息
- 动态规划表格填充方法:先初始化,然后按照递推关系逐步填充。
- 递归关系需要对字符串字符的匹配情况进行详细考虑,以确保递推关系的正确性。
重点词汇
- common subsequence 公共子序列
- recurrence relation 递推关系
- prefix 前缀
- suffix 后缀
- dynamic programming 动态规划
参考资料
- Introduction to Algorithms, 3rd Edition, Cormen et al., Chapter 15.