Skip to content

東京大学 新領域創成科学研究科 メディカル情報生命専攻 2022年8月実施 問題12

Author

zephyr

Description

We would like to know the probability that the string \(\mathbf{FRED}\) appears in a random protein sequence. Assume the sequence has random independent letters from the protein alphabet \(A\), and letter \(a \in A\) has probability \(p(a)\). Define \(Q(i, X)\) to be the probability that \(\mathbf{FRED}\) appears in a random sequence of length \(i\), given that the sequence ends with \(X\), where \(X\) is a length-4 string.

(1) What is \(Q(4, X)\) when \(X \neq \mathbf{FRED}\)?

(2) What is \(Q(4, X)\) when \(X = \mathbf{FRED}\)?

Define \(X'\) to be \(X\) without its final letter. Define \(a \cdot X\) to be \(X\) with letter \(a\) prepended to it. These definitions may be useful to answer the following questions.

(3) What is \(Q(i + 1, X)\), in terms of \(Q(i, Y)\), where \(Y\) is any length-4 string, when \(X \neq \mathbf{FRED}\)?

(4) What is \(Q(i + 1, X)\) when \(X = \mathbf{FRED}\)?

Define \(P(X)\) to be the product of the probabilities of the letters in \(X\). Define \(A^4\) to be the set of all possible length-4 strings. These definitions may be useful to answer the following question.

(5) What is the probability that \(\mathbf{FRED}\) appears in a random sequence of length \(n\), in terms of \(Q(n, X)\)?

Kai

(1)

\(Q(4, X)\) when \(X \neq \mathbf{FRED}\):

If \(X \neq \mathbf{FRED}\), then \(\mathbf{FRED}\) has not appeared by the time the sequence reaches length 4 with ending \(X\). Therefore, the probability \(Q(4, X) = 0\).

\[ Q(4, X) = 0 \quad \text{for} \quad X \neq \mathbf{FRED} \]

(2)

\(Q(4, X)\) when \(X = \mathbf{FRED}\):

If \(X = \mathbf{FRED}\), then the sequence exactly matches \(\mathbf{FRED}\), meaning \(\mathbf{FRED}\) has appeared. Therefore, the probability \(Q(4, \mathbf{FRED}) = 1\).

\[ Q(4, \mathbf{FRED}) = 1 \]

(3)

\(Q(i + 1, X)\) in terms of \(Q(i, Y)\) when \(X \neq \mathbf{FRED}\):

To find \(Q(i + 1, X)\), we need to consider all possible sequences of length \(i\) that could lead to a sequence ending with \(X\) when an additional letter is appended. Specifically, \(Q(i + 1, X)\) depends on \(Q(i, a \cdot X')\) for all possible letters \(a \in A\).

Since \(X \neq \mathbf{FRED}\), we have:

\[ Q(i + 1, X) = \sum_{a \in A} p(a) Q(i, a \cdot X') \]

(4)

\(Q(i + 1, X)\) when \(X = \mathbf{FRED}\):

If the sequence ends in \(\mathbf{FRED}\) at length \(i+1\), then \(\mathbf{FRED}\) has appeared, so \(Q(i + 1, \mathbf{FRED}) = 1\).

\[ Q(i + 1, \mathbf{FRED}) = 1 \]

(5)

The probability that \(\mathbf{FRED}\) appears in a random sequence of length \(n\):

To find the total probability that \(\mathbf{FRED}\) appears in a sequence of length \(n\), we will consider 3 cases:

  1. \(n < 4\): In this case, \(\mathbf{FRED}\) cannot appear, so the probability is 0.

  2. \(n = 4\): The probability that \(\mathbf{FRED}\) appears in a sequence of length 4 is given by \(Q(4, \mathbf{FRED}) = P(\mathbf{FRED}) = p(F) \cdot p(R) \cdot p(E) \cdot p(D)\).

  3. \(n > 4\): The probability that \(\mathbf{FRED}\) appears in a sequence of length \(n\) is given by:

\[ P(\mathbf{FRED}) = \sum_{X \in A^4} Q(n, X) \]

where \(P(X)\) is the product of the probabilities of the letters in \(X\).

Therefore, the probability that \(\mathbf{FRED}\) appears in a random sequence of length \(n\) can be concluded as follows:

\[ P(\mathbf{FRED}) = \begin{cases} 0 & \text{if} \quad n < 4 \\ p(F) \cdot p(R) \cdot p(E) \cdot p(D) & \text{if} \quad n = 4 \\ \sum_{X \in A^4} Q(n, X) & \text{if} \quad n > 4 \end{cases} \]

Knowledge

概率论 随机序列 字符串出现概率

难点解题思路

对于随机序列中特定字符串的出现概率问题,关键在于递推关系的构建以及边界条件的处理。通过定义恰当的递推公式,可以逐步推导出所需概率。

解题技巧和信息

  1. 构建递推关系,考虑当前状态如何从前一个状态转移。
  2. 确定边界条件,并基于这些条件初始化递推公式。
  3. 注意概率的加总,确保所有可能的转移情况都被考虑在内。

重点词汇

  • Probability: 概率
  • Sequence: 序列
  • Random: 随机的
  • Independent: 独立的
  • Recurrence relation: 递推关系

参考资料

  1. Introduction to Probability Models, Chapter 3
  2. Probability and Statistics for Engineers and Scientists, Chapter 5