跳到主要内容

東京大学 新領域創成科学研究科 メディカル情報生命専攻 2017年8月実施 問題12

Author

zephyr

Description

For two strings and of lengths and (), we define the set of common subsequences as

Here, represents the length of common subsequence . Let denote the length of the longest common subsequences. If , we define .

(1) Let and denote the length- prefix and suffix of string , respectively. We define and for . Describe a recurrence for computing from , , and when . Also, describe a recurrence for computing from , , and when .

(2) Compute matrices and for and .

(3) Suppose matrix is given. Write a pseudocode for obtaining one of the longest common subsequences.

(4) Suppose matrices and are given. Write a pseudocode for computing the maximal length of common subsequences that contain the -th position of ().


对于长度为 的两个字符串 (),我们定义公共子序列集为

这里, 表示公共子序列 的长度。令 表示最长公共子序列的长度。如果 ,我们定义

(1) 令 分别表示字符串 的长度为 的前缀和后缀。我们定义 对于 。描述从 ,和 计算 的递推关系,当 。同时,描述从 ,和 计算 的递推关系,当

(2) 计算 的矩阵

(3) 假设给定矩阵 。编写伪代码以获取一个最长的公共子序列。

(4) 假设给定矩阵 。编写伪代码以计算包含 的第 个位置的最大长度的公共子序列 ()。

Kai

(1)

Recurrence for

The value of can be computed as follows:

  • If , then .
  • If , then .

This can be summarized as:

Recurrence for

The value of can be computed as follows:

  • If , then .
  • If , then .

This can be summarized as:

(2)

Matrix

Let's fill the matrix for and :

x\yACACG
A11111
C12222
T12222
G12223
G12223

Matrix

Let's fill the matrix for and :

x\yGCACA
G11111
G11111
T11111
C12222
A12333

(3)

To extract one of the longest common subsequences from the matrix , we use the following algorithm. This algorithm traces back from the bottom-right corner of the matrix to the top-left corner, reconstructing the longest common subsequence by following the path of optimal choices recorded in .

Explanation

  • Start from the bottom-right corner of the matrix , i.e., .
  • Compare characters of and :
    • If , include in the result and move diagonally to .
    • If , move in the direction that gives the larger value (either up or left).
  • Continue this process until reaching the top-left corner of the matrix.
  • The result will be one of the longest common subsequences.

Pseudocode

function getLCS(x, y, alpha)
i = length(x)
j = length(y)
lcs = ""
while i > 0 and j > 0
if x[i] == y[j]
lcs = x[i] + lcs
i = i - 1
j = j - 1
else if alpha[i-1, j] >= alpha[i, j-1]
i = i - 1
else
j = j - 1
return lcs

(4)

Given matrices and , we can compute the maximal length of common subsequences that include the -th position of . This is done by evaluating the length of subsequences that start from the beginning and end at the -th position, combined with subsequences that start at the -th position and extend to the end.

Explanation

  • For each position in :
    • Combine the length of the prefix up to () with the length of the suffix from onwards () as the length of the common subsequence.
    • If , since is included in both the prefix and suffix, subtract 1 from the total length.
  • The maximum value obtained through this process gives the desired length.

Pseudocode

function maxLengthWithPosition(x, y, alpha, beta, i)
maxLength = 0
for j = 1 to length(y)
currentLength = alpha[i, j] + beta[n-i+1, m-j+1]
if x[i] == y[j]
currentLength = currentLength - 1
maxLength = max(maxLength, currentLength)
return maxLength

Knowledge

动态规划 最长公共子序列 递归

难点思路

对于递归关系的理解和矩阵填充的具体实现可能会比较复杂,需要仔细考虑每一步的递推关系。

解题技巧和信息

  • 动态规划表格填充方法:先初始化,然后按照递推关系逐步填充。
  • 递归关系需要对字符串字符的匹配情况进行详细考虑,以确保递推关系的正确性。

重点词汇

  • common subsequence 公共子序列
  • recurrence relation 递推关系
  • prefix 前缀
  • suffix 后缀
  • dynamic programming 动态规划

参考资料

  1. Introduction to Algorithms, 3rd Edition, Cormen et al., Chapter 15.