
University of Tokyo, Graduate School of Frontier Sciences, Department of Computational Biology and Medical Sciences, August 2019 Examination, Problem 11

Author

zephyr

Description

Machine A produces a sequence \(x = x_1 x_2 \cdots x_n\) with the following output probabilities \(\{\beta_k\}\).

Output probabilities: \(P(x_s = k) = \beta_k\) for \(k \in \{a, c, g, t\}\), \(s = 1, 2, \ldots, n\).

Machine B produces a sequence \(x = x_1 x_2 \cdots x_n\) by a stationary first-order Markov model with the following initial probabilities \(\{\pi_k\}\) and transition probabilities \(\{\lambda_{ij}\}\).

Initial Probabilities: \(P(x_1 = k) = \pi_k\) for \(k \in \{a, c, g, t\}\).

Transition Probabilities: \(P(x_s = j \mid x_{s-1} = i) = \lambda_{ij}\) for \(i, j \in \{a, c, g, t\}\), \(s = 2, \ldots, n\).

Machine C converts each occurrence of ac in an input sequence independently to gt with probability \(\epsilon\).

Assume that \(P(X)\) is the probability that \(X\) is true, and that \(P(X \mid Y)\) is the conditional probability that \(X\) is true when \(Y\) is true. Assume also that \(P_A(x)\) is the probability that Machine A outputs a sequence \(x\), and \(P_B(x)\) is the probability that Machine B outputs a sequence \(x\).

Solve the following problems. In (2) to (4), you may use \(P_A(x)\) and \(P_B(x)\) with any concrete sequence substituted for \(x\) (for example, \(P_A(\text{actgt})\)) without expanding them.

(1) When \(n = 5\), show the four probabilities, \(P_A(\text{actgt})\), \(P_A(\text{actac})\), \(P_B(\text{actgt})\), \(P_B(\text{actac})\).

(2) When \(n = 5\), three sequences were produced by Machine A and two by Machine B. A sequence selected at random from these five turned out to be actgt. Show the probability that this sequence was one of the three sequences produced by Machine A.

(3) When \(n = 5\), show the probability that the output sequence of Machine C is actgt when the input of Machine C was an output of Machine A.

(4) Five sequences were produced in the same manner as (2), and two of them were selected at random. One of the two selected sequences was used as the input of Machine C; both the output sequence of Machine C and the remaining selected sequence were actgt. Show the probability that those two sequences were originally produced by the same machine (both by Machine A or both by Machine B).



Solution

(1)

For Machine A (each position is generated independently):

\[ P_A(x) = \prod_{s=1}^n \beta_{x_s} \]

For sequence actgt:

\[ P_A(\text{actgt}) = \beta_a \beta_c \beta_t \beta_g \beta_t \]

For sequence actac:

\[ P_A(\text{actac}) = \beta_a \beta_c \beta_t \beta_a \beta_c \]

For Machine B:

\[ P_B(x) = \pi_{x_1} \prod_{s=2}^n \lambda_{x_{s-1}, x_s} \]

For sequence actgt:

\[ P_B(\text{actgt}) = \pi_a \lambda_{a, c} \lambda_{c, t} \lambda_{t, g} \lambda_{g, t} \]

For sequence actac:

\[ P_B(\text{actac}) = \pi_a \lambda_{a, c} \lambda_{c, t} \lambda_{t, a} \lambda_{a, c} \]
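
As a quick sanity check, these likelihoods can be evaluated numerically. The sketch below is a minimal Python illustration; the values chosen for \(\beta_k\), \(\pi_k\), and \(\lambda_{ij}\) are arbitrary placeholders (not given in the problem), picked only so that each distribution sums to 1.

```python
# Minimal sketch: evaluate P_A(x) and P_B(x) for concrete sequences.
# All numeric parameters are assumed placeholder values.

beta = {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}          # Machine A output probs
pi = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}        # Machine B initial probs
lam = {i: {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}   # Machine B transition probs
       for i in "acgt"}

def p_A(x: str) -> float:
    """P_A(x) = prod_s beta_{x_s}: positions are independent."""
    prob = 1.0
    for ch in x:
        prob *= beta[ch]
    return prob

def p_B(x: str) -> float:
    """P_B(x) = pi_{x_1} * prod_{s>=2} lambda_{x_{s-1} x_s}: first-order Markov."""
    prob = pi[x[0]]
    for prev, cur in zip(x, x[1:]):
        prob *= lam[prev][cur]
    return prob

print(p_A("actgt"), p_A("actac"))
print(p_B("actgt"), p_B("actac"))
```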

(2)

Three sequences come from Machine A and two from Machine B, five sequences in total. Let \(M_A\) denote the event that the randomly selected sequence came from Machine A, and \(M_B\) the event that it came from Machine B; the priors are \(P(M_A) = \frac{3}{5}\) and \(P(M_B) = \frac{2}{5}\).

Using Bayes' Theorem, the probability that the sequence was produced by Machine A given that the sequence was actgt is:

\[ P(M_A \mid \text{actgt}) = \frac{P(\text{actgt} \mid M_A) \cdot P(M_A)}{P(\text{actgt})} \]

Where:

\[ P(\text{actgt}) = P(\text{actgt} \mid M_A) \cdot P(M_A) + P(\text{actgt} \mid M_B) \cdot P(M_B) \]
\[ P(\text{actgt}) = P_A(\text{actgt}) \cdot \frac{3}{5} + P_B(\text{actgt}) \cdot \frac{2}{5} \]

So:

\[ P(M_A \mid \text{actgt}) = \frac{P_A(\text{actgt}) \cdot \frac{3}{5}}{P_A(\text{actgt}) \cdot \frac{3}{5} + P_B(\text{actgt}) \cdot \frac{2}{5}} \]
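
Numerically, the posterior is a two-term weighted ratio. A minimal sketch, reusing the hypothetical `p_A` and `p_B` from the snippet in (1):

```python
def posterior_A(x: str) -> float:
    """P(M_A | x) via Bayes' theorem with priors P(M_A) = 3/5, P(M_B) = 2/5."""
    joint_A = p_A(x) * 3 / 5
    joint_B = p_B(x) * 2 / 5
    return joint_A / (joint_A + joint_B)

print(posterior_A("actgt"))
```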

(3)

We compute the probability by considering the only two inputs that Machine C can map to actgt: actgt itself and actac. No other length-5 sequence works, because Machine C changes nothing except occurrences of ac, each of which either stays ac or becomes gt.

Case 1: Input Sequence is actgt

The input actgt contains exactly one occurrence of ac (at positions 1-2). For the output to remain actgt, this occurrence must not be converted, which happens with probability \(1 - \epsilon\).

\[ P_C(\text{actgt} \mid \text{input } \text{actgt}) = 1 - \epsilon \]

Case 2: Input Sequence is actac

The input actac contains two occurrences of ac (at positions 1-2 and 4-5). Machine C converts each occurrence independently to gt with probability \(\epsilon\). To obtain actgt, the first occurrence must stay unchanged and the second must be converted:

  • Probability of converting actac to actgt: \((1 - \epsilon)\epsilon\)

Now we compute the total probability that the output of Machine C is actgt when the input is an output of Machine A, by weighting both cases with the probability that Machine A produces each input:

  1. Machine A produces actgt:
\[ P_A(\text{actgt}) = \beta_a \beta_c \beta_t \beta_g \beta_t \]

The single occurrence of ac must then remain unconverted.

  2. Machine A produces actac:
\[ P_A(\text{actac}) = \beta_a \beta_c \beta_t \beta_a \beta_c \]

The first ac must remain unconverted, and the second must be converted.

The total probability is given by:

\[ P_C(\text{actgt} \mid \text{input from A}) = P_A(\text{actgt}) \cdot (1 - \epsilon) + P_A(\text{actac}) \cdot (1 - \epsilon) \epsilon \]

Therefore, the probability that the output sequence of Machine C is actgt when the input of Machine C was an output of Machine A is:

\[ P_C(\text{actgt} \mid \text{input from A}) = \beta_a \beta_c \beta_t \beta_g \beta_t (1 - \epsilon) + (1 - \epsilon) \epsilon \beta_a \beta_c \beta_t \beta_a \beta_c \]
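
The closed form can be cross-checked by simulation. Below is a minimal Monte Carlo sketch, reusing the placeholder `beta` and `p_A` from (1); it assumes Machine C scans left to right and flips each non-overlapping occurrence of `ac` independently with probability \(\epsilon\), and the value of `eps` is an arbitrary assumption.

```python
import random

def machine_C(x: str, eps: float) -> str:
    """Scan left to right; each occurrence of 'ac' is independently
    replaced by 'gt' with probability eps."""
    out, i = [], 0
    while i < len(x):
        if x[i:i + 2] == "ac":
            out.append("gt" if random.random() < eps else "ac")
            i += 2
        else:
            out.append(x[i])
            i += 1
    return "".join(out)

def sample_A(n: int) -> str:
    """Draw a length-n sequence from Machine A's i.i.d. model."""
    letters, weights = zip(*beta.items())
    return "".join(random.choices(letters, weights=weights, k=n))

eps, trials = 0.3, 500_000
hits = sum(machine_C(sample_A(5), eps) == "actgt" for _ in range(trials))
closed_form = p_A("actgt") * (1 - eps) + p_A("actac") * (1 - eps) * eps
print(hits / trials, closed_form)  # the two values should roughly agree
```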

(4)

Given:

  • Five sequences were produced: three from Machine A and two from Machine B.
  • Two sequences were randomly selected from these five.
  • One selected sequence was input to Machine C, resulting in actgt.
  • The other selected sequence is also actgt.

We need to find the probability that both sequences were originally produced by the same machine (either both from A or both from B).

Let's approach this step by step.

First, define the events:

  • Let \(E\) be the event that both observed sequences are actgt (after one of them passes through Machine C).
  • Let \(S\) be the event that both sequences were produced by the same machine.

We need \(P(S \mid E)\), which follows from Bayes' theorem:

\[ P(S \mid E) = \frac{P(E \mid S) P(S)}{P(E)} \]

a) \(P(S)\): the probability that both selected sequences come from the same machine.

  • Probability of both from A: \(\frac{\binom{3}{2}}{\binom{5}{2}} = \frac{3}{10}\)
  • Probability of both from B: \(\frac{\binom{2}{2}}{\binom{5}{2}} = \frac{1}{10}\)
  • \(P(S) = \frac{3}{10} + \frac{1}{10} = \frac{4}{10} = \frac{2}{5}\)

b) \(P(E \mid S)\): the probability of observing two actgt sequences given that both come from the same machine. Let \(S_A\) be the event that both are from A. In that case the remaining sequence is actgt with probability \(P_A(\text{actgt})\), and the sequence passed through Machine C yields actgt with the probability derived in (3):
\[ P(E \mid S_A) = P_A(\text{actgt}) \cdot [P_A(\text{actgt})(1-\epsilon) + P_A(\text{actac})(1-\epsilon)\epsilon] \]

If both are from B (event \(S_B\)):

\[ P(E \mid S_B) = P_B(\text{actgt}) \cdot [P_B(\text{actgt})(1-\epsilon) + P_B(\text{actac})(1-\epsilon)\epsilon] \]

Conditioned on \(S\), both sequences come from A with probability \(\frac{3/10}{4/10} = \frac{3}{4}\) and both from B with probability \(\frac{1}{4}\), so

\[ P(E \mid S) = \frac{3}{4} P(E \mid S_A) + \frac{1}{4} P(E \mid S_B) \]

c) \(P(E)\): the overall probability of the observed event.

\[ P(E) = P(E \mid S) \cdot P(S) + P(E \mid \neg S) \cdot P(\neg S) \]

where \(P(E \mid \neg S)\) is the probability of observing two actgt sequences given that they come from different machines. In that case one sequence is from A and one from B, and each is equally likely to be the one passed through Machine C:

\[ P(E \mid \neg S) = \frac{1}{2} \left[ P_A(\text{actgt}) \left( P_B(\text{actgt})(1-\epsilon) + P_B(\text{actac})(1-\epsilon)\epsilon \right) + P_B(\text{actgt}) \left( P_A(\text{actgt})(1-\epsilon) + P_A(\text{actac})(1-\epsilon)\epsilon \right) \right] \]

and

\[ P(\neg S) = 1 - P(S) = \frac{3}{5} \]
Putting it all together:

\[ P(S \mid E) = \frac{P(E \mid S) \cdot P(S)}{P(E)} = \frac{\frac{3}{10} P(E \mid S_A) + \frac{1}{10} P(E \mid S_B)}{\frac{3}{10} P(E \mid S_A) + \frac{1}{10} P(E \mid S_B) + \frac{3}{5} P(E \mid \neg S)} \]

with \(P(E \mid S_A)\), \(P(E \mid S_B)\), and \(P(E \mid \neg S)\) as derived above.
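
The full posterior can also be assembled numerically. A minimal sketch, reusing the hypothetical `p_A`, `p_B`, and `eps` from the earlier snippets:

```python
def p_C_actgt(p) -> float:
    """P(Machine C outputs actgt | input drawn from the machine with likelihood p)."""
    return p("actgt") * (1 - eps) + p("actac") * (1 - eps) * eps

E_SA = p_A("actgt") * p_C_actgt(p_A)              # both sequences from A
E_SB = p_B("actgt") * p_C_actgt(p_B)              # both sequences from B
E_diff = 0.5 * (p_A("actgt") * p_C_actgt(p_B)     # A-sequence kept, B-sequence through C
                + p_B("actgt") * p_C_actgt(p_A))  # B-sequence kept, A-sequence through C

joint_same = 3 / 10 * E_SA + 1 / 10 * E_SB        # P(E and S)
total = joint_same + 3 / 5 * E_diff               # P(E)
print(joint_same / total)                         # P(S | E)
```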

This completes the solution for each problem.

Knowledge

Probability Theory · Markov Chains · Bayes' Theorem · Transition Probabilities

  • Probability Theory: the branch of mathematics that studies the laws governing random phenomena.
  • Markov Model: a stochastic process model with the Markov property, i.e., the future state depends only on the current state, not on past states.
  • Bayes' Theorem: a formula for computing posterior probabilities, updating prior probabilities with likelihoods.
  • Transition Probability: the probability of moving from one state to another.

Approach to the Difficulties

The main difficulties in this type of problem are understanding and applying probability calculations under a Markov model, and using Bayes' theorem to update probabilities. Each sub-problem can be solved with the following steps:

  1. Compute the base probabilities: use the given model parameters to compute the probability of generating each sequence under each machine.
  2. Apply Bayes' theorem: update probabilities with the observed information to obtain the posterior.
  3. Account for conversions: when a sequence may have been transformed, compute the probabilities before and after the transformation and combine them with the law of total probability.

Techniques and Tips

  • Applying Bayes' theorem:

    • Decompose complex probability calculations into products of conditional probabilities.
    • Normalize, so that the computed probabilities sum to 1.

  • Working with Markov models:

    • Distinguish the initial probability distribution from the transition probability matrix.
    • Compute the probability of a whole sequence step by step from the transition probabilities.

  • Combining probabilities:

    • Weight the individual probabilities by their priors to obtain the overall probability.

Key Vocabulary

  • Probability Theory 概率论
  • Markov Model 马尔可夫模型
  • Bayes' Theorem 贝叶斯定理
  • Transition Probability 转移概率
  • Posterior Probability 后验概率
