We would like to know the probability that a target string $S$ appears in a random protein sequence. Assume the sequence consists of independent random letters from the protein alphabet $\Sigma$, and that letter $a$ has probability $p_a$. Define $P(n, w)$ to be the probability that $S$ appears in a random sequence of length $n$, given that the sequence ends with $w$, where $w$ is a length-4 string.
(1) What is when ?
(2) What is when ?
Define $w'$ to be $w$ without its final letter. Define $aw'$ to be $w'$ with letter $a$ prepended to it. These definitions may be useful for answering the following questions.
(3) What is , in terms of , where is any length-4 string, when ?
(4) What is when ?
Define $p(w)$ to be the product of the probabilities of the letters in $w$. Define $W$ to be the set of all possible length-4 strings. These definitions may be useful for answering the following question.
(5) What is the probability that appears in a random sequence of length , in terms of ?
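To find $P(n, w)$, the probability for a sequence of length $n$ that ends with $w$, we need to consider all sequences of length $n-1$ that produce a sequence ending with $w$ when one more letter is appended. Each such sequence ends with $aw'$, that is, $w$ without its final letter and with some letter $a$ prepended, so $P(n, w)$ depends on $P(n-1, aw')$ for all possible letters $a$.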
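The dependence described above can be turned into a small dynamic program. The following is only a sketch under stated assumptions: the target is taken to be one letter longer than the conditioning suffix (five letters for a length-4 suffix), the two-letter alphabet, its probabilities, and the target `"ABABA"` are hypothetical stand-ins for the protein alphabet and the real target, and the function names are invented for illustration. The table `P[w]` holds the conditional probability given that the sequence ends with `w`; a brute-force enumeration is included as an independent check.

```python
from itertools import product
from math import prod, isclose


def appearance_probability(target, probs, n):
    """Probability that `target` occurs in an i.i.d. sequence of length n.

    Assumes len(target) >= 2. The state is the last k = len(target) - 1
    letters (k = 4 for a five-letter target).
    """
    k = len(target) - 1
    if n <= k:
        return 0.0  # too short to contain the target
    letters = sorted(probs)
    states = ["".join(t) for t in product(letters, repeat=k)]
    # Base case: a length-k sequence cannot contain a (k+1)-letter string.
    P = {w: 0.0 for w in states}
    for _ in range(k + 1, n + 1):
        nxt = {}
        for w in states:
            w_prime = w[:-1]  # w without its final letter
            total = 0.0
            for a in letters:
                if a + w == target:
                    # The last k+1 letters spell the target outright.
                    total += probs[a]
                else:
                    # Otherwise the target must already lie in the first
                    # n-1 letters, which end with a + w_prime.
                    total += probs[a] * P[a + w_prime]
            nxt[w] = total
        P = nxt
    # Remove the conditioning: weight each suffix by its own probability.
    return sum(prod(probs[c] for c in w) * P[w] for w in states)


def brute_force(target, probs, n):
    """Enumerate every length-n sequence; feasible only for tiny cases."""
    letters = sorted(probs)
    return sum(
        prod(probs[c] for c in seq)
        for seq in product(letters, repeat=n)
        if target in "".join(seq)
    )


# Hypothetical toy inputs, not taken from the problem statement.
toy_probs = {"A": 0.3, "B": 0.7}
toy_target = "ABABA"
```

Calling `appearance_probability(toy_target, toy_probs, 9)` and `brute_force(toy_target, toy_probs, 9)` should agree up to floating-point error. With the 20-letter protein alphabet the table has 20^4 = 160,000 suffix states, so each step of the loop does about 3.2 million multiply-adds, which is slow but workable in pure Python for moderate `n`.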