跳到主要内容

東京大学 情報理工学系研究科 電子情報学専攻 2016年8月実施 専門 第3問

Author

adj-matrix

Description

You want to find users from an access log that extremely frequently access to a Web service. Unique user IDs are assigned to individual users, and the access log records IDs of the users that have accessed to the service in chronological order. Answer the following questions.

(1) You want to verify whether there exists a user that occupies the majority of the accesses in the access log, without creating a kind of a frequency histogram of all the users. You have therefore conceived the following algorithm.

  • i. Prepare a list that is initialized to an empty list .
  • ii. Access each element in array from the beginning, and perform either of the following operations for the th element depending on the value of at that time.
    • ii-(a). (If is empty) add to .
    • ii-(b). (If is not empty) if is included in , add to . Otherwise, remove one arbitrary element from .
  • iii. output .

Use this algorithm to process the following sequence of user IDs from the first element, and show the values of in order after processing in ii.

(2) Prove that the number of varieties of user IDs in is at most one in the algorithm in (1).

(3) If there exists a majority user in the access log, prove that is a unique element included in list output by the algorithm in (1).

(4) In the algorithm in (1), when the size of array is very large, the size of list may cause an issue. Keeping this in mind, show a pseudo code that implements the algorithm in (1) using a function read_log() that reads a user ID one by one from the access log, while improving its space efficiency by not explicitly expressing as a list.

Here, read_log() returns the first user ID in the access log at the first call and returns the following user IDs from the access log one by one for the following calls. read_log() returns when it reaches the end of the access log.

Kai

(1)

011{11}
110{ }
211{11}
311{11, 11}
47{11}
511{11, 11}
611{11, 11, 11}
73{11, 11}
88{11}

(2)

Initially, is empty. The number of varieties is 0. Condition satisfied.

Assume at step , contains only user ID , consider the next input .

  • Case 1: : Add to , variety is still 1. Condition satisfied.
  • Case 2: : Remove an . becomes smaller or empty, variety is still 1 or 0. Condition satisfied.
  • Case 3: is empty : Add to , the variety is 1. Condition satisfied.

Therefore, the number of varieties of user IDs in is at most one.

(3)

This is the Boyer-Moore Voting Algorithm.

Let the majority element be . Its count is . The count of all other elements is .

The algorithm effectively pairs two distinct elements and eliminates both.

In the remove step, effectively one and one are discarded.

In the worst case for , every non- element pairs with an element and eliminates it.

Since , even if every non- element eliminates one , there will be copies of remaining.

Therefore, the list cannot be empty at the end, the element inside must be .

(4)

find_majority():
candidate_id = -1
count = 0
while True:
id = read_log()
if id == -1:
break

if count == 0:
candidate_id = id
count = 1
else if id == candidate_id:
count = count + 1
else:
count = count - 1

return candidate_id