Skip to content

東京大学 新領域創成科学研究科 メディカル情報生命専攻 2017年8月実施 問題12

Author

zephyr

Description

For two strings \(\mathbf{x} = x[1] \cdots x[n]\) and \(\mathbf{y} = y[1] \cdots y[m]\) of lengths \(n\) and \(m\) (\(n \geq 0, m \geq 0\)), we define the set of common subsequences as

\[ S = \left\{ \begin{pmatrix} i_1, \cdots, i_r \\ j_1, \cdots, j_r \end{pmatrix} \mid r: \text{positive integer}, \begin{matrix} 1 \leq i_1 < \cdots < i_r \leq n \\ 1 \leq j_1 < \cdots < j_r \leq m \end{matrix}, x[i_k] = y[j_k], k = 1, \cdots, r \right\} \]

Here, \(r\) represents the length of common subsequence \(\begin{pmatrix} i_1, \cdots, i_r \\ j_1, \cdots, j_r \end{pmatrix}\). Let \(l(x, y)\) denote the length of the longest common subsequences. If \(S = \emptyset\), we define \(l(x, y) = 0\).

(1) Let \(z_k^p\) and \(z_k^s\) denote the length-\(k\) prefix and suffix of string \(z\), respectively. We define \(\alpha[i, j] = l(x_i^p, y_j^p)\) and \(\beta[i, j] = l(x_{n-i+1}^s, y_{m-j+1}^s)\) for \(1 \leq i \leq n, 1 \leq j \leq m\). Describe a recurrence for computing \(\alpha[i, j]\) from \(\alpha[i-1, j-1]\), \(\alpha[i-1, j]\), and \(\alpha[i, j-1]\) when \(1 < i, 1 < j\). Also, describe a recurrence for computing \(\beta[i, j]\) from \(\beta[i+1, j]\), \(\beta[i, j+1]\), and \(\beta[i+1, j+1]\) when \(i < n, j < m\).

(2) Compute matrices \(\alpha\) and \(\beta\) for \(\mathbf{x} = \text{ACTGG}\) and \(\mathbf{y} = \text{ACACG}\).

(3) Suppose matrix \(\alpha\) is given. Write a pseudocode for obtaining one of the longest common subsequences.

(4) Suppose matrices \(\alpha\) and \(\beta\) are given. Write a pseudocode for computing the maximal length of common subsequences that contain the \(i\)-th position of \(\mathbf{x}\) (\(1 \leq i \leq n\)).


对于长度为 \(n\)\(m\) 的两个字符串 \(\mathbf{x} = x[1] \cdots x[n]\)\(\mathbf{y} = y[1] \cdots y[m]\) (\(n \geq 0, m \geq 0\)),我们定义公共子序列集为

\[ S = \left\{ \begin{pmatrix} i_1, \cdots, i_r \\ j_1, \cdots, j_r \end{pmatrix} \mid r: \text{正整数}, \begin{matrix} 1 \leq i_1 < \cdots < i_r \leq n \\ 1 \leq j_1 < \cdots < j_r \leq m \end{matrix}, x[i_k] = y[j_k], k = 1, \cdots, r \right\} \]

这里,\(r\) 表示公共子序列 \(\begin{pmatrix} i_1, \cdots, i_r \\ j_1, \cdots, j_r \end{pmatrix}\) 的长度。令 \(l(x, y)\) 表示最长公共子序列的长度。如果 \(S = \emptyset\),我们定义 \(l(x, y) = 0\)

(1) 令 \(z_k^p\)\(z_k^s\) 分别表示字符串 \(z\) 的长度为 \(k\) 的前缀和后缀。我们定义 \(\alpha[i, j] = l(x_i^p, y_j^p)\)\(\beta[i, j] = l(x_{n-i+1}^s, y_{m-j+1}^s)\) 对于 \(1 \leq i \leq n, 1 \leq j \leq m\)。描述从 \(\alpha[i-1, j-1]\)\(\alpha[i-1, j]\),和 \(\alpha[i, j-1]\) 计算 \(\alpha[i, j]\) 的递推关系,当 \(1 < i, 1 < j\)。同时,描述从 \(\beta[i+1, j]\)\(\beta[i, j+1]\),和 \(\beta[i+1, j+1]\) 计算 \(\beta[i, j]\) 的递推关系,当 \(i < n, j < m\)

(2) 计算 \(\mathbf{x} = \text{ACTGG}\)\(\mathbf{y} = \text{ACACG}\) 的矩阵 \(\alpha\)\(\beta\)

(3) 假设给定矩阵 \(\alpha\)。编写伪代码以获取一个最长的公共子序列。

(4) 假设给定矩阵 \(\alpha\)\(\beta\)。编写伪代码以计算包含 \(\mathbf{x}\) 的第 \(i\) 个位置的最大长度的公共子序列 (\(1 \leq i \leq n\))。

Kai

(1)

Recurrence for \(\alpha[i, j]\)

The value of \(\alpha[i, j] = l(x_i^p, y_j^p)\) can be computed as follows:

  • If \(x[i] = y[j]\), then \(\alpha[i, j] = \alpha[i-1, j-1] + 1\).
  • If \(x[i] \neq y[j]\), then \(\alpha[i, j] = \max(\alpha[i-1, j], \alpha[i, j-1])\).

This can be summarized as:

\[ \alpha[i, j] = \begin{cases} \alpha[i-1, j-1] + 1 & \text{if } x[i] = y[j] \\ \max(\alpha[i-1, j], \alpha[i, j-1]) & \text{if } x[i] \neq y[j] \end{cases} \]

Recurrence for \(\beta[i, j]\)

The value of \(\beta[i, j] = l(x_{n-i+1}^s, y_{m-j+1}^s)\) can be computed as follows:

  • If \(x[n-i+1] = y[m-j+1]\), then \(\beta[i, j] = \beta[i+1, j+1] + 1\).
  • If \(x[n-i+1] \neq y[m-j+1]\), then \(\beta[i, j] = \max(\beta[i+1, j], \beta[i, j+1])\).

This can be summarized as:

\[ \beta[i, j] = \begin{cases} \beta[i+1, j+1] + 1 & \text{if } x[n-i+1] = y[m-j+1] \\ \max(\beta[i+1, j], \beta[i, j+1]) & \text{if } x[n-i+1] \neq y[m-j+1] \end{cases} \]

(2)

Matrix \(\alpha\)

Let's fill the matrix \(\alpha\) for \(\mathbf{x} = \text{ACTGG}\) and \(\mathbf{y} = \text{ACACG}\):

x\y A C A C G
A 1 1 1 1 1
C 1 2 2 2 2
T 1 2 2 2 2
G 1 2 2 2 3
G 1 2 2 2 3

Matrix \(\beta\)

Let's fill the matrix \(\beta\) for \(\mathbf{x} = \text{ACTGG}\) and \(\mathbf{y} = \text{ACACG}\):

x\y G C A C A
G 1 1 1 1 1
G 1 1 1 1 1
T 1 1 1 1 1
C 1 2 2 2 2
A 1 2 3 3 3

(3)

To extract one of the longest common subsequences from the matrix \(\alpha\), we use the following algorithm. This algorithm traces back from the bottom-right corner of the matrix to the top-left corner, reconstructing the longest common subsequence by following the path of optimal choices recorded in \(\alpha\).

Explanation

  • Start from the bottom-right corner of the matrix \(\alpha\), i.e., \(\alpha[n, m]\).
  • Compare characters of \(\mathbf{x}\) and \(\mathbf{y}\):
  • If \(x[i] = y[j]\), include \(x[i]\) in the result and move diagonally to \(\alpha[i-1, j-1]\).
  • If \(x[i] \neq y[j]\), move in the direction that gives the larger value (either up or left).
  • Continue this process until reaching the top-left corner of the matrix.
  • The result will be one of the longest common subsequences.

Pseudocode

function getLCS(x, y, alpha)
    i = length(x)
    j = length(y)
    lcs = ""
    while i > 0 and j > 0
        if x[i] == y[j]
            lcs = x[i] + lcs
            i = i - 1
            j = j - 1
        else if alpha[i-1, j] >= alpha[i, j-1]
            i = i - 1
        else
            j = j - 1
    return lcs

(4)

Given matrices \(\alpha\) and \(\beta\), we can compute the maximal length of common subsequences that include the \(i\)-th position of \(\mathbf{x}\). This is done by evaluating the length of subsequences that start from the beginning and end at the \(i\)-th position, combined with subsequences that start at the \(i\)-th position and extend to the end.

Explanation

  • For each position \(j\) in \(\mathbf{y}\):
  • Combine the length of the prefix up to \(i\) (\(\alpha[i, j]\)) with the length of the suffix from \(i\) onwards (\(\beta[n-i+1, m-j+1]\)) as the length of the common subsequence.
  • If \(x[i] = y[j]\), since \(x[i]\) is included in both the prefix and suffix, subtract 1 from the total length.
  • The maximum value obtained through this process gives the desired length.

Pseudocode

1
2
3
4
5
6
7
8
function maxLengthWithPosition(x, y, alpha, beta, i)
    maxLength = 0
    for j = 1 to length(y)
        currentLength = alpha[i, j] + beta[n-i+1, m-j+1]
        if x[i] == y[j]
            currentLength = currentLength - 1
        maxLength = max(maxLength, currentLength)
    return maxLength

Knowledge

动态规划 最长公共子序列 递归

难点思路

对于递归关系的理解和矩阵填充的具体实现可能会比较复杂,需要仔细考虑每一步的递推关系。

解题技巧和信息

  • 动态规划表格填充方法:先初始化,然后按照递推关系逐步填充。
  • 递归关系需要对字符串字符的匹配情况进行详细考虑,以确保递推关系的正确性。

重点词汇

  • common subsequence 公共子序列
  • recurrence relation 递推关系
  • prefix 前缀
  • suffix 后缀
  • dynamic programming 动态规划

参考资料

  1. Introduction to Algorithms, 3rd Edition, Cormen et al., Chapter 15.