跳到主要内容

東京大学 情報理工学系研究科 電子情報学専攻 2010年8月実施 専門 第3問

Author

adj-matrix

Description

Consider counting the number of occurrences of each word in a large set of documents on a large cluster of computers. The set of documents is partitioned into sets, and they are processed in parallel by machines. Answer the following questions.

(1) Each machine splits a partial set of documents into words. Then the list of words is sorted, and translated into the list of (word, frequency) pairs. The following pseudo code reads a sorted list of words, and outputs (word, frequency) pairs. Find errors in this code, and explain how to correct the code.

count_frequency(File sorted_words_file) {
int num = 0;
String word, previous_word = null;
while (read a word from sorted_words_file) {
if (word is not previous_word AND previous_word is not null) {
output_pair(word, num);
}
num = num + 1;
previous_word = word;
}
}

(2) Show the pseudo code for the function that takes two lists of (word, frequency) pairs, and merge them into a single list.

(3) Explain how to merge lists of (word, frequency) pairs generated by machines.

(4) Consider distributing the list of words into machines before sorting and counting for avoiding the time consuming merge process. It requires a function that maps each word into an integer from to . Show a concrete example of this mapping function.

(5) Describe how to balance the load across the machines when the distribution of word frequency is not uniform, and show a concrete mapping function to be used.

Kai

(1)

  • Error 1: output_pair(word, num) outputs the current word with previous count.

    Correction: output_pair(previous_word, num)

  • Error 2: The counter num is not reset after outputting.

    Correction: Set num = 0 after the output block.

  • Error 3: The last group of words is not outputted after while loop finishes.

    Correction: Add an if (num > 0) output_pair(previous_word, num);

(2)

merge (list A, list B) {
pa = A.head, pb = B.head;
result =;
while (pa is not null AND pb is not null) {
if (pa.word < pb.word) {
result.append(pa);
pa = pa.next;
} else if (pa.word > pb.word) {
result.append(pb);
pb = pb.next;
} else {
result.append(pa.word, pa.frequency + pb.frequency);
pa = pa.next, pb = pb.next;
}
}
while (pa is not null) {
result.append(pa);
pa = pa.next;
}
while (pb is not null) {
result.append(pb);
pb = pb.next;
}
}

(3)

Using a Min-Heap of size to perform a multi-way merge.

  • ① Extract the minimum element from the heap.
  • ② Check the new minimum. If it has the same word, extract it and add its frequency.
  • ③ Repeat ② until output the merged.
  • ④ Insert the next elements from the lists that provided the extracted words into the heap.
  • ⑤ Repeat until empty.

(4)

Use Hash Partitioning to map.

Example: machine_id = hash(word) MOD N.

hash(word) can be or MD5 and so on.

(5)

Use Range Partitioning based on Sampling.

Select pivot words () such that the total frequency in each range is approximately equal.

Then Mapping Function: