東京大学情報理工学系研究科コンピュータ科学専攻 2017年2月実施問題3

Author

kainoj, 祭音Myyura

Description

Consider the problem of sorting an array of integers using the heapsort algorithm. Assume that heapsort brings the minimum element to the front.

Answer the following questions.

(1) Heapsort consists of two phases. Explain what is to be performed in each phase.

(2) Consider sorting of the following array using heapsort.

\[ (3, 8, 1, 5, 4, 9, 7) \]

Draw the tree structure of the heap just after the first phase.

(3) Answer the time complexity of each phase when sorting an array of length $n$. Explain the reason.

(4) Answer the time complexity of obtaining a sorted list of the smallest $k$ elements from an array of length $n$ using heapsort. Explain the reason.

(5) The execution time of heapsort on modern computer systems is often longer than that of some other sort algorithms such as quicksort and mergesort when sorting a large array. Explain a possible reason.

Kai

(1)

Building the heap – level-by-level, starting from the deepest level.
Extracting minimum $n$ times. Extracting means printing the root value, replacing it with the last heap element, and then fixing heap order. If we represent the heap in an array of length $n$, we can swap first element (minimum) with the last element, fix heap order and call recursively on first $n-1$ elements of array.Resulting array is sorted in descending order. Reverse it for ascending order.

(2)

Let's put elements of $[3,8,1,5,4,9,7]$ into a binary tree:

        3
      /    \
     8      1
   /  \    /  \
  5    4  9    7         ] proper heaps

Note that the deepest level consists of $4$ proper, one-element heaps. Let's fix the heap order on second to last deepest level:

        3
      /    \
     4      1             ] 
   /  \    /  \           ] proper heaps
  5    8  9    7          ]

Now we have two proper heaps of height $1$. Now, swap $3$ and $1$ to get the final answer:

        1
      /    \
     4      3
   /  \    /  \
  5    8  9    7

(3)

For simplicity, assume that $n=2^k - 1$, i.e. the heap the deepest level is fully occupied.

Phase 1, building the heap is $O(n)$ when building the heap from the bottom. On the deepest level we do nothing: it consists of $\lceil \frac{n}{2} \rceil$ nodes, each of them is a proper one-element heap. On second-to-the-last level, we perform two comparisons, and possibly one swap. In general we'll perform, ceil omitted for simplicity:

\[ \begin{aligned} &0\cdot \frac{n}{2} + 1 \cdot \frac{n}{4} + 2\cdot \frac{n}{8} + \cdots + O(h-1)\cdot 2 + O(h)\cdot 1 \\ &= \sum_{h=1}^{\lceil \log_2 n \rceil} \lceil\frac{n}{2^{h+1}}\rceil\cdot O(h) \\ &= O(n\sum_{h=1}^{\lceil \log_2 n\rceil} \frac{h}{2^{h+1}}) \end{aligned} \]

Now, let's evaluate the right-hand side:

\[ \begin{aligned} n\sum_{h=1}^{\lceil \log_2 n\rceil} \frac{h}{2^{h+1}} &\leq \frac{n}{4} \sum_{h=1}^{\infty} \frac{h}{2^{h-1}} && \text{substitute $x=\frac{1}{2}$} \\ &= \frac{n}{4} \sum_{h=1}^{\infty} hx^{h-1} \\ &= \frac{n}{4} \frac{d}{dx} \sum_{h=1}^{\infty} x^h\\ &= \frac{n}{4} \frac{d}{dx} \frac{x}{1-x} \\ &= \frac{n}{4} \frac{1}{(1-x)^2} \\ &= \frac{n}{4} \cdot 4 = n \end{aligned} \]

Thus, we can bound the running time as:

\[ \begin{aligned} O(n\sum_{h=1}^{\lceil \log_2 n\rceil} \frac{h}{2^{h+1}}) = O(n) \end{aligned} \]

See Cormen's Introduction to Algorithms, 3^rd ed, Chapter 6.3 for more.

Phase 2 is $O(n\log_2n)$: for each element out of $n$ we perform extract-min, which takes $O(\log_2n)$.

Heapsort pseudocode:

heapsort(T[1..n])
    build-heap(T[1..n])
    for i = n..2
        swap(T[1], T[i])
        shift-down(T[1..(i-1)], 1)

(4)

$O(n+k \log_2 n)$. First, we build heap in $O(n)$, then we perform ext tract-min $k$ times.

(5)

Let's look at an example. If input array is already sorted, then mergesort will just scan through the array with no swaps. Same with partition in quicksort. On the other hand heapsort, after building a heap ($O(n)$), will extract minimum and replace it with the last element of array – which is maximum. The maximum will be later "bubbled" down back to the deepest level. This means two comparisons and a swap at every of $O(\log_2 n)$ iterations. This is pricey.

The other reason might be caching. In heapsort, accessing a parent might result in memory accesses which are distant more than $\lceil \frac{n}{2}\rceil$ (in particular leaves' parents). If $n$ is big, we might have a lot of cache misses.

東京大学 情報理工学系研究科 コンピュータ科学専攻 2017年2月実施 問題3