Kyoto University, Graduate School of Informatics, Department of Intelligence Science and Technology, August 2022 Examination, Specialized Subject S-3
Author
祭音Myyura
Description
Question 1
Let us consider a fully-connected feed-forward neural network that has an input of \(d\) dimensions, an output of \(c\) classes, and \(m\) intermediate layers, each having \(n\) nodes. A sigmoid function is used in all nodes including output nodes. A weight between node \(i\) and node \(j\) is denoted as \(w_{ij}\), and there are no bias terms.
(1) Draw this network, indicating \(d\), \(c\), \(m\), and \(n\). Give the total number of network weights.
(2) Show the output \(g_k\) of an output node (indexed with \(k\)) using the output \(g_j\) of the nodes of the preceding layer (indexed with \(j\)).
(3) Consider the problem of detecting the source(s) in a music recording composed of the sounds of one or more of the following: violin, flute, piano, and singing voice. Describe how the training label \(t_k\) should be given for the output nodes (indexed with \(k\)). Explain why it is not appropriate to use a softmax function in the output nodes for this problem.
(4) Show the binary cross-entropy of the output \(g_k\) and the training label \(t_k\) of an output node (indexed with \(k\)).
(5) Show the formula to update the weight \(w_{jk}\) between an output node (indexed with \(k\)) and a node of the preceding layer (indexed with \(j\)) using the gradient descent method, with the objective function being the sum over all classes of the binary cross-entropy defined above. Show how you derive the formula.
(6) Show the formula to update the weight \(w_{ij}\) between two nodes (indexed with \(i\) and \(j\)), neither of which is in the output layer, based on the error back-propagation method. You do not have to show how you derive it.
(7) Explain why it is difficult to update the weights effectively as the number of network layers becomes large. Describe the methods to mitigate this problem.
Question 2
Let us consider \(n\) training samples of a \(d\)-dimensional vector \(X = (x_1, \dots, x_d)^T\), with their mean vector and covariance matrix denoted as \(M = (m_1, \dots, m_d)^T\) and \(\Sigma\), respectively, where \(T\) denotes the transpose.
(1) Show the formula to compute the component \(\sigma_{ij}\) of the covariance matrix \(\Sigma\).
(2) Show the formula of the Mahalanobis distance between a sample \(X\) and this training sample distribution.
(3) In neural network training, we often conduct normalization of inputs so that the distribution for each dimension has a mean of \(0\) and a variance of \(1\). Let \(X\) and \(Z\) be an original input and its normalized one. Discuss the relationship between the square root of the sum of the squared values in each dimension of \(Z\), which is regarded as the Euclidean norm \(\|Z\|_2\), and the above Mahalanobis distance.
Kai
Question 1
(1)
The total number of network weights is
\[
dn + (m-1)n^{2} + nc,
\]
counting \(dn\) weights from the input to the first intermediate layer, \((m-1)n^{2}\) between consecutive intermediate layers, and \(nc\) from the last intermediate layer to the output layer (there are no bias terms).
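As a quick sanity check, here is a minimal sketch (the sizes \(d = 10\), \(n = 16\), \(m = 3\), \(c = 4\) are hypothetical, not taken from the problem) that counts the weights both with the closed-form expression and by summing the sizes of the per-layer weight matrices:

```python
import numpy as np

# Hypothetical network size, chosen only for illustration.
d, n, m, c = 10, 16, 3, 4

# Closed form: input->first hidden, (m-1) hidden->hidden, last hidden->output.
closed_form = d * n + (m - 1) * n ** 2 + n * c

# Count by enumerating the weight matrices of each layer (no bias terms).
layer_sizes = [d] + [n] * m + [c]
by_matrices = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

print(closed_form, by_matrices)  # both give the same total
```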
(2)
The output \(g_k\) is
\[
g_k = \sigma\left(\sum_{j=1}^{n} w_{jk}\, g_j\right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}},
\]
where \(\sigma\) is the sigmoid function and \(g_j\) is the output of node \(j\) in the preceding layer.
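A minimal NumPy sketch of this computation for a single layer, assuming `g_prev` holds the outputs \(g_j\) of the preceding layer and `W[j, k]` stores \(w_{jk}\) (variable names and sizes are my own choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, c = 16, 4                          # hypothetical layer sizes
g_prev = rng.random(n)                # outputs g_j of the preceding layer
W = rng.normal(0, 0.1, size=(n, c))   # W[j, k] = w_{jk}, no bias terms

h = g_prev @ W                        # h_k = sum_j w_{jk} g_j
g = sigmoid(h)                        # g_k = sigma(h_k)
print(g)
```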
(3)
The classification classes consist of four categories: violin, flute, piano, and singing voice, so \(c = 4\). For \(k = 1, \dots, c\), set the training label \(t_k = 1\) if source \(k\) is present in the music recording and \(t_k = 0\) otherwise (a multi-hot label vector). A softmax function in the output layer is inappropriate here because its outputs are constrained to sum to \(1\), which treats the classes as mutually exclusive; the network could then not represent recordings that contain multiple sources simultaneously. Independent sigmoid outputs, each giving the probability that its source is present, suit this multi-label detection problem.
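A small illustration of the difference, with hypothetical output-node inputs: sigmoid outputs give each class an independent probability, so a multi-hot label such as "violin and piano" is representable, while softmax outputs are forced to sum to 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Classes: [violin, flute, piano, voice]; a recording with violin AND piano.
t = np.array([1.0, 0.0, 1.0, 0.0])    # multi-hot training label
h = np.array([3.0, -2.0, 2.5, -1.0])  # hypothetical output-node inputs

print(sigmoid(h))  # ~[0.95, 0.12, 0.92, 0.27]: two classes can be "on" at once
print(softmax(h))  # sums to 1, so it cannot assign ~1 to both violin and piano
```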
(4)
The binary cross-entropy of the output \(g_k\) and the training label \(t_k\) is
\[
E_k = -\left( t_k \log g_k + (1 - t_k) \log (1 - g_k) \right).
\]
(5)
Let \(h_j = \sum_{i} w_{ij} g_i\) denote the total input of node \(j\), where \(i\) runs over the nodes of the preceding layer, so that \(g_j = \sigma(h_j)\); in particular, \(g_k = \sigma(h_k)\) with \(h_k = \sum_{j=1}^{n} w_{jk} g_j\) for an output node \(k\). The loss function \(L\) is the sum of the binary cross-entropy over all classes:
\[
L = \sum_{k=1}^{c} E_k = -\sum_{k=1}^{c}\left( t_k \log g_k + (1 - t_k)\log (1 - g_k) \right).
\]
The weight \(w_{jk}\) is updated as follows:
\[
w_{jk} \leftarrow w_{jk} - \eta \frac{\partial L}{\partial w_{jk}},
\]
where \(\eta\) is the learning rate. Since
\[
\frac{\partial L}{\partial w_{jk}}
= \frac{\partial L}{\partial g_k}\frac{\partial g_k}{\partial h_k}\frac{\partial h_k}{\partial w_{jk}}
= \left( -\frac{t_k}{g_k} + \frac{1 - t_k}{1 - g_k} \right) g_k (1 - g_k)\, g_j
= (g_k - t_k)\, g_j,
\]
we have
\[
w_{jk} \leftarrow w_{jk} - \eta\, (g_k - t_k)\, g_j.
\]
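A minimal numerical check of the derived gradient under the same notation (all values are arbitrary): a finite-difference estimate of \(\partial L / \partial w_{jk}\) should match \((g_k - t_k)\, g_j\).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(W, g_prev, t):
    g = sigmoid(g_prev @ W)                 # output layer, W[j, k] = w_{jk}
    return -np.sum(t * np.log(g) + (1 - t) * np.log(1 - g))

rng = np.random.default_rng(1)
n, c = 5, 3                                 # hypothetical sizes
g_prev = rng.random(n)
W = rng.normal(0, 0.5, size=(n, c))
t = np.array([1.0, 0.0, 1.0])

g = sigmoid(g_prev @ W)
analytic = np.outer(g_prev, g - t)          # (g_k - t_k) * g_j

# Finite-difference estimate for one weight w_{jk}.
j, k, eps = 2, 1, 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[j, k] += eps
Wm[j, k] -= eps
numeric = (bce_loss(Wp, g_prev, t) - bce_loss(Wm, g_prev, t)) / (2 * eps)

print(analytic[j, k], numeric)              # should agree closely

# One gradient-descent step with learning rate eta.
eta = 0.1
W -= eta * analytic
```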
(6)
\[
w_{ij} \leftarrow w_{ij} - \eta\, \delta_j\, g_i,
\qquad
\delta_j = g_j (1 - g_j) \sum_{k} \delta_k\, w_{jk},
\]
where the sum runs over the nodes \(k\) of the layer following node \(j\), \(\delta_k = g_k - t_k\) if \(k\) is an output node, and \(\delta_k\) is given recursively by the same expression otherwise.
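A sketch of one full update step for a network with a single intermediate layer of sigmoid units (shapes, data, and the names `W1`, `W2` are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
d, n, c, eta = 4, 6, 3, 0.1               # hypothetical sizes and learning rate
x = rng.random(d)                         # input, i.e. the outputs g_i of the input layer
t = np.array([0.0, 1.0, 1.0])             # multi-hot label
W1 = rng.normal(0, 0.5, size=(d, n))      # W1[i, j] = w_{ij}
W2 = rng.normal(0, 0.5, size=(n, c))      # W2[j, k] = w_{jk}

# Forward pass.
g_hidden = sigmoid(x @ W1)                # g_j
g_out = sigmoid(g_hidden @ W2)            # g_k

# Backward pass: delta_k = g_k - t_k, then propagate to the hidden layer.
delta_out = g_out - t
delta_hidden = g_hidden * (1 - g_hidden) * (W2 @ delta_out)

# Gradient-descent updates of w_{jk} and w_{ij}.
W2 -= eta * np.outer(g_hidden, delta_out)
W1 -= eta * np.outer(x, delta_hidden)
```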
(7)
Vanishing and exploding gradients. With sigmoid activations, the gradient that reaches an early layer is a product of one factor per layer, each containing the sigmoid derivative \(\sigma'(h) = \sigma(h)(1 - \sigma(h)) \le 0.25\) and the layer's weights. As the number of layers grows, this product tends to shrink exponentially (vanishing gradients), so the weights of the early layers are barely updated; if the weights are large, the product can instead grow exponentially (exploding gradients), making training unstable. A small numerical illustration follows the list of mitigations below.
Methods to mitigate the above issue:
- Weight initialization
  - Xavier initialization
  - He initialization
- Activation functions
  - ReLU (Rectified Linear Unit)
- Batch Normalization
- Gradient Clipping
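A small numerical illustration of the vanishing effect with arbitrary depth, width, and weight scale: an error signal propagated backwards through many sigmoid layers shrinks roughly geometrically, since each layer multiplies it by factors containing \(\sigma'(h) \le 0.25\).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
depth, n = 20, 16                            # hypothetical depth and width
Ws = [rng.normal(0, 0.5, size=(n, n)) for _ in range(depth)]

# Forward pass through a deep all-sigmoid network.
g = rng.random(n)
activations = []
for W in Ws:
    g = sigmoid(g @ W)
    activations.append(g)

# Backward pass: start from an arbitrary gradient w.r.t. the output activations
# and propagate it layer by layer (grad_prev = W @ (g * (1 - g) * grad)).
delta = np.ones(n)
norms = []
for W, g in zip(reversed(Ws), reversed(activations)):
    delta = W @ (g * (1 - g) * delta)
    norms.append(np.linalg.norm(delta))

print(norms[::5])  # the signal shrinks rapidly as it approaches the early layers
```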
Question 2
(1)
The component is
\[
\sigma_{ij} = \frac{1}{n} \sum_{k=1}^{n} \left( x_i^{(k)} - m_i \right)\left( x_j^{(k)} - m_j \right),
\]
where \(x_i^{(k)}\) and \(x_j^{(k)}\) denote the \(i\)-th and \(j\)-th components of the \(k\)-th training sample \(X^{(k)}\).
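A NumPy sketch of this computation on hypothetical data (array names are my own); the built-in `np.cov` with `bias=True` uses the same \(1/n\) normalization:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 3                      # hypothetical sample count and dimension
samples = rng.normal(size=(n, d))  # row k is the k-th training sample X^(k)

M = samples.mean(axis=0)           # mean vector M
centered = samples - M
Sigma = centered.T @ centered / n  # Sigma[i, j] = sigma_ij with 1/n normalization

# Cross-check against NumPy's built-in (bias=True also divides by n).
assert np.allclose(Sigma, np.cov(samples, rowvar=False, bias=True))
```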
(2)
The Mahalanobis distance between a sample \(X\) and the training sample distribution is
\[
D_M(X) = \sqrt{(X - M)^{T}\, \Sigma^{-1}\, (X - M)}.
\]
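A sketch of the distance computation on hypothetical data; solving a linear system instead of forming \(\Sigma^{-1}\) explicitly is a common numerical choice:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 3
samples = rng.normal(size=(n, d))          # hypothetical training samples
M = samples.mean(axis=0)
Sigma = np.cov(samples, rowvar=False, bias=True)

def mahalanobis(x, mean, cov):
    """sqrt((x - M)^T Sigma^{-1} (x - M)); solve() avoids an explicit inverse."""
    diff = x - mean
    return np.sqrt(diff @ np.linalg.solve(cov, diff))

x_new = rng.normal(size=d)                 # a hypothetical query sample X
print(mahalanobis(x_new, M, Sigma))
```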
(3)
If the covariance matrix \(\Sigma\) is diagonal, the Euclidean norm \(\|Z\|_2\) of the normalized input coincides with the Mahalanobis distance between \(X\) and the training sample distribution; when \(\Sigma\) has non-zero off-diagonal elements, the two differ, because per-dimension normalization ignores the correlations between dimensions.
Suppose that
\[
z_i = \frac{x_i - m_i}{\sqrt{\sigma_{ii}}}, \qquad i = 1, \dots, d.
\]
Since \(Z\) is obtained by normalizing \(X\) in this way, when \(\Sigma\) is diagonal the covariance matrix of \(Z\) is the identity matrix. Hence
\[
\|Z\|_2 = \sqrt{\sum_{i=1}^{d} z_i^2} = \sqrt{\sum_{i=1}^{d} \frac{(x_i - m_i)^2}{\sigma_{ii}}},
\]
and the Mahalanobis distance is
\[
D_M(X) = \sqrt{(X - M)^{T}\, \Sigma^{-1}\, (X - M)} = \sqrt{\sum_{i=1}^{d} \frac{(x_i - m_i)^2}{\sigma_{ii}}} = \|Z\|_2.
\]
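A small numerical check of this equality, assuming data whose dimensions are generated independently so that \(\Sigma\) is approximately diagonal; the diagonal part of \(\Sigma\) is used for the Mahalanobis distance, so the equality holds exactly:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 10_000, 3
scales = np.array([1.0, 5.0, 0.2])                # independent dims of different scales
samples = rng.normal(size=(n, d)) * scales

M = samples.mean(axis=0)
Sigma = np.cov(samples, rowvar=False, bias=True)
Sigma_diag = np.diag(np.diag(Sigma))              # keep only the diagonal part

x = rng.normal(size=d) * scales                   # a query sample X
z = (x - M) / np.sqrt(np.diag(Sigma))             # per-dimension normalization Z

euclidean_norm_z = np.linalg.norm(z)
diff = x - M
mahalanobis = np.sqrt(diff @ np.linalg.solve(Sigma_diag, diff))

print(euclidean_norm_z, mahalanobis)              # equal when Sigma is treated as diagonal
```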