東京大学新領域創成科学研究科メディカル情報生命専攻 2024年1月実施問題10

Author

Description

Suppose we have many DNA sequences. Each sequence has length \(L\), and the number of sequences is \(N\). The sequences are stored in an array \(\mathbf{A}\), where \(\mathbf{A}[i]\) holds the \(i\)-th sequence \((1 \leq i \leq N)\). Assume that copying one sequence from one memory location to another takes constant time.

We wish to sort the sequences in alphabetical order of their left-most base. Write pseudocode of an algorithm that puts the sorted sequences into another array \(\mathbf{B}\). The running time of the algorithm should be linearly proportional to \(N\). The sequences may contain any of the bases a, c, g, or t.

Below this point, assume that the sequences only contain bases a and c, never g or t.

We wish to sort the sequences in alphabetical order of their \(j\)-th base \((1 \leq j \leq L)\), using less memory. The result should go in the array \(\mathbf{A}\), and we are not allowed to use any other array. Write pseudocode of an algorithm to perform this sort. The running time should be linearly proportional to \(N\).
We wish to sort the sequences in alphabetical order of the whole sequences. Write pseudocode of an algorithm to do this sort, using the answer to question (2) as a subroutine. The worst-case running time should be linearly proportional to \(L \times N\).

假设我们有许多DNA序列。每个序列的长度为 \(L\)，序列的数量为 \(N\)。序列存储在数组 \(\mathbf{A}\) 中，其中 \(\mathbf{A}[i]\) 保存第 \(i\) 个序列 \((1 \leq i \leq N)\)。假设将一个序列从一个内存位置复制到另一个位置的时间是恒定的。

我们希望按序列最左边的碱基的字母顺序排序。编写一个算法的伪代码，将排序后的序列放入另一个数组 \(\mathbf{B}\)。算法的运行时间应与 \(N\) 成线性比例。序列可能包含 a、c、g 或 t 中的任意碱基。

以下假设序列仅包含碱基 a 和 c，不包含 g 或 t。

我们希望按序列的第 \(j\) 个碱基的字母顺序排序 \((1 \leq j \leq L)\)，使用较少的内存。结果应存放在数组 \(\mathbf{A}\) 中，不允许使用任何其他数组。编写一个算法的伪代码来执行此排序。运行时间应与 \(N\) 成线性比例。
我们希望按整个序列的字母顺序对序列进行排序。编写一个算法的伪代码来执行此排序，使用问题（2）的答案作为子例程。最坏情况下的运行时间应与 \(L \times N\) 成线性比例。

Kai

解题思路

第一问：针对第一个问题，我们的目标是按序列最左侧碱基的字母顺序对 DNA 序列进行排序。由于碱基的种类有限（只有a, c, g, t四种），我们可以使用计数排序（Counting Sort），这种排序算法的时间复杂度为 \(O(N)\) 并且适用于这种小范围有限值排序的问题。具体做法是先统计每种碱基的数量，然后根据这些计数值确定每个序列在排序后的数组中的位置。最终将结果存储在新的数组 \(\mathbf{B}\) 中。
第二问：在第二个问题中，我们限定了 DNA 序列仅包含碱基 a 和 c，要求我们按某个特定位置的碱基对序列进行排序，并且不得使用额外的数组。在这种情况下，可以使用荷兰国旗问题中的双指针方法。这种方法通过维护两个指针，一个从左至右移动，寻找需要交换的元素；另一个从右至左移动，同样寻找需要交换的元素。这种方法能够在 \(O(N)\) 的时间复杂度内完成排序，并且不需要额外的存储空间。
第三问：最后一个问题要求我们按整个序列的字母顺序进行排序。此时，我们可以使用基数排序（Radix Sort），一种非比较型排序算法。该算法的关键在于从最低位（最不重要位）开始，对每一位进行排序。对于每一位的排序，我们可以使用第二问中实现的按某个位置排序的方法作为子程序。通过这种方式，我们可以保证整体排序的时间复杂度是 O(L \times N)，其中 L 是序列长度，N 是序列数量。

Question 1: Sorting by the Left-Most Base

Pseudocode:

// Pseudocode for Counting Sort based on the first base

function countingSortFirstBase(A, N):
    // Step 1: Initialize count array
    count = [0, 0, 0, 0] // for 'a', 'c', 'g', 't'
    B = array of size N

    // Step 2: Count occurrences of each base
    for i = 1 to N:
        if A[i][0] == 'a':
            count[0] += 1
        else if A[i][0] == 'c':
            count[1] += 1
        else if A[i][0] == 'g':
            count[2] += 1
        else if A[i][0] == 't':
            count[3] += 1

    // Step 3: Calculate starting index for each base in sorted order
    for j = 1 to 3:
        count[j] += count[j - 1]

    // Step 4: Place elements in sorted order
    for i = N downto 1:
        if A[i][0] == 'a':
            B[count[0] - 1] = A[i]
            count[0] -= 1
        else if A[i][0] == 'c':
            B[count[1] - 1] = A[i]
            count[1] -= 1
        else if A[i][0] == 'g':
            B[count[2] - 1] = A[i]
            count[2] -= 1
        else if A[i][0] == 't':
            B[count[3] - 1] = A[i]
            count[3] -= 1

    return B

Explanation:

Initialize Count Array: We first initialize a count array of size 4 to count the occurrences of the bases 'a', 'c', 'g', and 't'. The array indices represent the bases: count[0] for 'a', count[1] for 'c', count[2] for 'g', and count[3] for 't'.
Count Occurrences: We iterate over the array \(\mathbf{A}\) and count the occurrences of each base in the left-most position of the sequences.
Calculate Starting Indices: We compute the starting index for each base in the sorted array by accumulating the counts. This step helps determine where to place the sequences in array \(\mathbf{B}\).
Place Elements in Sorted Order: We iterate over the array \(\mathbf{A}\) from right to left and place each sequence in its correct position in array \(\mathbf{B}\) based on the left-most base. The count array helps track the next available position for each base.

Question 2: Sorting by the \(j\)-th Base (Only 'a' and 'c')

Pseudocode:

## Pseudocode for In-Place Sorting by the j-th base using Two-Pointer Method

function sortByJthBase(A, N, j):
    left = 1
    right = N
    while left <= right:
        if A[left][j] == 'a':
            left += 1
        else if A[right][j] == 'c':
            right -= 1
        else:
            swap A[left] with A[right]
            left += 1
            right -= 1

Explanation:

Initialization: We initialize two pointers, left starting from the beginning and right starting from the end of the array \(\mathbf{A}\).
Partitioning: The goal is to move all sequences starting with 'a' to the left and all sequences starting with 'c' to the right.
Sorting Process:
If A[left][j] is 'a', we move the left pointer to the right.
If A[right][j] is 'c', we move the right pointer to the left.
If A[left][j] is 'c' and A[right][j] is 'a', we swap these two sequences, move the left pointer to the right, and the right pointer to the left.

This in-place sorting ensures that all sequences are sorted based on the \(j\)-th base without using additional memory.

Question 3: Sorting by Whole Sequence

Pseudocode:

// Pseudocode for Radix Sort using Question 2's Sort as Subroutine

function radixSortSequences(A, N, L):
    // Sort by each base position starting from the least significant
    for j = L downto 1:
        sortByJthBase(A, N, j)

Explanation:

Radix Sort: We use the Radix Sort algorithm to sort the sequences by considering each base position, starting from the least significant (rightmost) to the most significant (leftmost).
Subroutine Usage: For each base position \(j\), we call sortByJthBase(A, N, j) to sort the sequences based on the base at the \(j\)-th position.
Efficiency: Since sortByJthBase runs in O(N) time, and we do this for all \(L\) base positions, the overall time complexity of Radix Sort is \(O(L \times N)\), which is efficient for this problem. This ensures that the entire sorting process is stable and the final order respects the order dictated by all bases in the sequence.

Knowledge

难点思路

基数排序的稳定性：需要确保每次排序时相同的元素顺序不变，这样最终排序结果是正确的。
双指针法的应用：在有限的内存情况下进行排序，尤其是排序特定的基因序列位置。

解题技巧和信息

在有限内存情况下排序时，双指针法非常有用。
基数排序对于多位数的排序非常高效，尤其在数据量较大时。

重点词汇

Counting Sort 计数排序
Radix Sort 基数排序
In-Place Sort 原地排序

参考资料

Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to Algorithms (3^rd ed.). MIT Press. Chap. 8, 9.
Sedgewick, R., & Wayne, K. (2011). Algorithms (4^th ed.). Addison-Wesley. Chap. 2.

東京大学 新領域創成科学研究科 メディカル情報生命専攻 2024年1月実施 問題10

Author

Description

Kai

解题思路

Question 1: Sorting by the Left-Most Base

Question 2: Sorting by the \(j\)-th Base (Only 'a' and 'c')

Question 3: Sorting by Whole Sequence

Knowledge

难点思路

解题技巧和信息

重点词汇

参考资料

東京大学新領域創成科学研究科メディカル情報生命専攻 2024年1月実施問題10