A data set D consisting of eight data is given as in Figure 1. Each datum is of the form (x_1, x_2, ..., y), where the attributes x_1, x_2, ... and the label y each take the value 0 or 1.
Below we consider how to construct a rule from D for classifying y into 0 or 1. Answer the following questions.
(1) Let us consider the code-length required for encoding a binary string y^n = y_1 y_2 ... y_n of length n (n >= 1).
In general, for a finite set A, the code-length required for encoding one element of A is given by log|A| (bit), where the logarithm is to the base 2.
Show that, for a binary string y^n of length n in which the number of occurrences of 1 is k, the code-length required for encoding k and y^n itself is at most

log(n + 1) + log C(n, k),    (1)

where C(n, k) denotes the binomial coefficient "n choose k", the value of the number n is given in advance, and the code-length can be non-integer valued.
Below, the value of Eq.(1) for a binary string y^n is denoted by L(y^n).
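As a small illustration (not part of the original problem), the bound of Eq.(1) can be evaluated numerically. The function name codelength is our own choice, and the example string is hypothetical rather than taken from Figure 1.

```python
from math import comb, log2

def codelength(bits):
    """Two-part code length of Eq.(1): log2(n+1) bits to encode k, the
    number of 1s (one of the n+1 values 0..n), plus log2(C(n, k)) bits
    to identify the string among all length-n strings with exactly k ones."""
    n = len(bits)
    k = sum(bits)
    return log2(n + 1) + log2(comb(n, k))

# Hypothetical example: a string of length 8 containing three 1s.
print(codelength([1, 0, 0, 1, 0, 1, 0, 0]))  # log2(9) + log2(56), about 8.98
```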
(2) For a datum with x_i = 1 (the attribute x_i being as specified in Figure 1), let the probability that y = 1 be p. Then calculate the least squares estimate of p from Figure 1.
The least squares estimate of p is the value of p that minimizes Σ_j (y_j - p)^2, where y_j denotes the value of y for the j-th datum and the sum Σ_j is taken over all the data such that x_i = 1.
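Although the concrete value depends on the data in Figure 1 (not reproduced here), the form of the estimate follows directly. Setting the derivative of the objective to zero,

d/dp Σ_j (y_j - p)^2 = -2 Σ_j (y_j - p) = 0,   hence   p̂ = (1/m) Σ_j y_j,

where m is the number of data with x_i = 1; that is, the least squares estimate of p is the fraction of those data whose y value equals 1.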
(3) Let y^n be the binary string obtained by concatenating the values of y for all the data in D in Figure 1.
For each attribute index i, let y_i^0 be the binary string obtained by concatenating the values of y for all the data in D such that x_i = 0, and let y_i^1 be the binary string obtained by concatenating the values of y for all the data in D such that x_i = 1 (i = 1, 2, ...).
For example, if the indexes of the data in D such that x_i = 1 are j_1 < j_2 < ... < j_m, then y_i^1 is the binary string obtained by concatenating the values of y of the data j_1, j_2, ..., j_m in this order.
We define the measure of goodness of classifying the data by partitioning y^n based on whether x_i = 0 or x_i = 1 as follows:

L(y_i^0) + L(y_i^1).    (2)

We consider that the smaller the value of Eq.(2) is, the more the value of x_i contributes to the classification of y.
Find the i that minimizes Eq.(2).
Hereinafter, when there is more than one i that minimizes the value of Eq.(2), one is chosen randomly from among them.
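A sketch of how the criterion of Eq.(2) could be evaluated for every attribute, reusing the codelength function from the sketch above. The data are assumed to be given as (x, y) pairs with x a tuple of binary attribute values; the names split_cost and best_split are ours, and the actual values of Figure 1 are not reproduced here.

```python
def split_cost(data, i):
    """Eq.(2): L(y_i^0) + L(y_i^1), where the y values are split
    according to whether attribute x_i equals 0 or 1."""
    y0 = [y for x, y in data if x[i] == 0]
    y1 = [y for x, y in data if x[i] == 1]
    return codelength(y0) + codelength(y1)

def best_split(data):
    """Attribute index i minimizing Eq.(2); ties resolved arbitrarily
    (the problem allows a random choice among minimizers)."""
    d = len(data[0][0])  # number of attributes per datum
    return min(range(d), key=lambda i: split_cost(data, i))
```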
(4) Let the value of i obtained in Question (3) be i*. D is partitioned into two sets according to whether x_{i*} = 0 or x_{i*} = 1, and y^n is then also partitioned into two strings: y_{i*}^0 and y_{i*}^1.
This partitioning can be represented using a tree structure as shown in Figure 2.
We call it a partitioning tree.
We call y_{i*}^0 and y_{i*}^1 partitioned strings.
We further partition each of y_{i*}^0 and y_{i*}^1 by finding, for each of them, the attribute index that minimizes the value of Eq.(2) computed on the data reaching the corresponding leaf. Let this partitioning of a leaf be repeated until the following stopping rule is fulfilled: the depth of a leaf (the number of partitionings from the root to the leaf) is two, or the partitioned string arriving at a leaf consists entirely of 0s or entirely of 1s.
Find the partitioning tree that is finally obtained.
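A possible sketch of the recursive construction of the partitioning tree, assuming the best_split helper above; the dictionary-based tree representation is an illustrative choice, not one prescribed by the problem.

```python
def grow_tree(data, depth=0, max_depth=2):
    """Build a partitioning tree under the stopping rule of Question (4):
    stop when the depth reaches max_depth or the y values reaching the
    node are all 0 or all 1.  A leaf stores the y values reaching it;
    an inner node stores its splitting attribute and two children."""
    ys = [y for _, y in data]
    if depth == max_depth or sum(ys) in (0, len(ys)):
        return {"leaf": True, "ys": ys}
    i = best_split(data)  # attribute minimizing Eq.(2) on this subset
    left = [(x, y) for x, y in data if x[i] == 0]
    right = [(x, y) for x, y in data if x[i] == 1]
    return {"leaf": False, "attr": i,
            "children": {0: grow_tree(left, depth + 1, max_depth),
                         1: grow_tree(right, depth + 1, max_depth)}}
```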
(5) For the resulting partitioning tree, for the partitioned string arriving at each leaf, we assign 1 to the leaf if the number of occurrences of 1 in this string is larger than that of 0, and assign 0 to the leaf if the number of occurrences of 1 is smaller than that of 0.
When the number of occurrences of 1 is the same as that of 0, we assign 0 or 1 randomly to the leaf.
This tree can be used for predicting the value of y for any new datum.
That is, when the attribute vector (x_1, x_2, ...) of the new datum is given
and arrives at a leaf, the tree predicts the value of its corresponding y as the value assigned to that leaf.
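A sketch of the majority-label assignment and the prediction rule of Question (5), under the same tree representation as above; ties are broken randomly, as the problem states.

```python
import random

def leaf_label(ys):
    """Label assigned to a leaf: the majority value of y among the data
    reaching it, with ties broken randomly."""
    ones = sum(ys)
    zeros = len(ys) - ones
    if ones == zeros:
        return random.choice([0, 1])
    return 1 if ones > zeros else 0

def predict(tree, x):
    """Route the attribute vector x of a new datum down the tree and
    return the label assigned to the leaf it reaches."""
    while not tree["leaf"]:
        tree = tree["children"][x[tree["attr"]]]
    return leaf_label(tree["ys"])
```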
Here, even if we change the stopping rule in Question (4) so as to construct a larger tree from a training data set, one in which the values of y for the data reaching each leaf are all 0 or all 1, such a tree does not necessarily predict the value of y for a new datum with higher accuracy. Explain the reason.
(6) Consider a general case where, for a positive integer d, a set S of multi-dimensional data in the form (x_1, ..., x_d, y) and a partitioning tree T are given.
Let 𝒯 be the set of all subtrees which share the root of T and are obtained by pruning T starting
from its leaves.
We define the following penalized criterion for evaluating the goodness of a subtree T' in 𝒯 for the given S:

λ_1 K(T') + λ_2 M(T') + Σ_t L(y_t),    (3)

where K(T') is the total number of leaves in T' and M(T') is the total number of inner nodes in T'.
λ_1 and λ_2 are given positive constants.
The sum in the third term in Eq.(3) is taken over all the leaves t in T', and y_t is the binary string obtained by concatenating the values of y for all the data in S which reach the leaf t.
The smaller the value of Eq.(3) is, the better T' is.
Give an algorithm that finds the T' in 𝒯 minimizing the criterion of Eq.(3) for the given S and T, and that runs as efficiently as possible in computation time.
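For concreteness, a sketch of how the criterion of Eq.(3) could be evaluated for a given subtree, reusing codelength and the tree representation from the earlier sketches; lam1 and lam2 stand for the two positive constants of Eq.(3), and the variable names are ours.

```python
def criterion(tree, data, lam1, lam2):
    """Eq.(3): lam1 * (number of leaves) + lam2 * (number of inner nodes)
    + sum over leaves t of L(y_t), where y_t is the string of y values of
    the data that reach leaf t."""
    if tree["leaf"]:
        return lam1 + codelength([y for _, y in data])
    left = [(x, y) for x, y in data if x[tree["attr"]] == 0]
    right = [(x, y) for x, y in data if x[tree["attr"]] == 1]
    return (lam2
            + criterion(tree["children"][0], left, lam1, lam2)
            + criterion(tree["children"][1], right, lam1, lam2))
```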
(Answer to Question (5)) In machine learning there are the concepts of bias and variance. Bias measures the systematic error of the learned prediction function, i.e., how far its predictions lie, on average, from the true relationship between the data and the target. Variance measures how strongly the prediction function depends on the particular training set it was fitted to, and hence how erratically it behaves on new (test) data. When the prediction function fits the training data too closely, this is called "overfitting", and overfitting leads to low bias but high variance. In the case mentioned above, extending the tree until every leaf is pure would overfit the prediction function to the training set: the tree would fit the noise in the training data as well, so on new data the variance is large and the overall prediction accuracy does not necessarily improve.
(Answer to Question (6)) Assume that the set S is independent of the training set that was used to generate the tree T.
The algorithm works in the style of Reduced Error Pruning, i.e., bottom-up pruning driven by the error on the independent set S. For every bottom-level inner node (a node all of whose children are leaves), calculate the error on S after pruning, meaning that its leaves are discarded and the node itself becomes a leaf whose prediction is:
1 if the number of data reaching the node with y = 1 is larger than the number with y = 0,
0 if it is smaller,
1 if the two numbers are equal.
If the error on the set S is smaller than that of the tree prior to pruning, prune the node and continue to the next candidate.
The algorithm considers at most |T| nodes, and for each candidate node it evaluates the error of the pruned tree on S, which takes time on the order of |S| times the depth of T (each datum in S is routed from the root to a leaf once); altogether this gives a running time of at most O(|T| · |S| · depth(T)).
The algorithm stops when pruning none of the bottom-level nodes leads to a decrease in the error.
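Below is a minimal sketch of the bottom-up, error-driven pruning just described, reusing predict and the tree representation from the earlier sketches. It is one way to realize the procedure on the set S, not presented as the unique answer to Question (6); note that leaf_label breaks ties randomly, so a deterministic variant would fix all leaf labels once before pruning.

```python
import copy

def error(tree, data):
    """Number of data in `data` misclassified by the tree."""
    return sum(predict(tree, x) != y for x, y in data)

def bottom_nodes(tree, path=()):
    """Yield the paths (sequences of 0/1 branch choices) leading to inner
    nodes whose children are both leaves -- the pruning candidates."""
    if tree["leaf"]:
        return
    if tree["children"][0]["leaf"] and tree["children"][1]["leaf"]:
        yield path
    else:
        for b in (0, 1):
            yield from bottom_nodes(tree["children"][b], path + (b,))

def prune_at(tree, path):
    """Return a copy of the tree in which the node at `path` becomes a
    leaf holding the concatenated y values of its former leaves."""
    new = copy.deepcopy(tree)
    node = new
    for b in path:
        node = node["children"][b]
    ys = node["children"][0]["ys"] + node["children"][1]["ys"]
    node.clear()
    node.update({"leaf": True, "ys": ys})
    return new

def prune(tree, S):
    """Repeatedly prune a bottom-level node while doing so strictly
    decreases the error on the independent set S."""
    improved = True
    while improved:
        improved = False
        base = error(tree, S)
        for path in bottom_nodes(tree):
            candidate = prune_at(tree, path)
            if error(candidate, S) < base:
                tree, improved = candidate, True
                break
    return tree
```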