東京大学 情報理工学系研究科 電子情報学専攻 2010年8月実施 専門 第3問
Author
Description
Consider counting the number of occurrences of each word in a large set of documents on a large cluster of computers. The set of documents is partitioned into
(1) Each machine splits a partial set of documents into words. Then the list of words is sorted, and translated into the list of (word, frequency) pairs. The following pseudo code reads a sorted list of words, and outputs (word, frequency) pairs. Find errors in this code, and explain how to correct the code.
count_frequency(File sorted_words_file) {
int num = 0;
String word, previous_word = null;
while (read a word from sorted_words_file) {
if (word is not previous_word AND previous_word is not null) {
output_pair(word, num);
}
num = num + 1;
previous_word = word;
}
}
(2) Show the pseudo code for the function that takes two lists of (word, frequency) pairs, and merge them into a single list.
(3) Explain how to merge
(4) Consider distributing the list of words into
(5) Describe how to balance the load across the machines when the distribution of word frequency is not uniform, and show a concrete mapping function to be used.
Kai
(1)
-
Error 1:
output_pair(word, num)outputs the current word with previous count.Correction:
output_pair(previous_word, num) -
Error 2: The counter
numis not reset after outputting.Correction: Set
num = 0after the output block. -
Error 3: The last group of words is not outputted after while loop finishes.
Correction: Add an
if (num > 0) output_pair(previous_word, num);
(2)
merge (list A, list B) {
pa = A.head, pb = B.head;
result = ∅;
while (pa is not null AND pb is not null) {
if (pa.word < pb.word) {
result.append(pa);
pa = pa.next;
} else if (pa.word > pb.word) {
result.append(pb);
pb = pb.next;
} else {
result.append(pa.word, pa.frequency + pb.frequency);
pa = pa.next, pb = pb.next;
}
}
while (pa is not null) {
result.append(pa);
pa = pa.next;
}
while (pb is not null) {
result.append(pb);
pb = pb.next;
}
}
(3)
Using a Min-Heap of size
- ① Extract the minimum element from the heap.
- ② Check the new minimum. If it has the same word, extract it and add its frequency.
- ③ Repeat ② until output the merged.
- ④ Insert the next elements from the lists that provided the extracted words into the heap.
- ⑤ Repeat until empty.
(4)
Use Hash Partitioning to map.
Example: machine_id = hash(word) MOD N.
hash(word) can be
(5)
Use Range Partitioning based on Sampling.
Select
Then Mapping Function: