Let us consider a fully connected feed-forward neural network that has an input of D dimensions, an output of K classes, and L intermediate layers, each having M nodes.
A sigmoid function is used in all nodes including output nodes.
The weight between node i and node j is denoted as w_ji, and there are no bias terms.
(1) Draw this network and specify D, K, L, and M. Give the total number of network weights.
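As a sanity check for part (1), the weight count can be computed directly. This is a minimal sketch assuming the notation D (input dimension), K (output classes), L (intermediate layers), and M (nodes per intermediate layer):

```python
def total_weights(D, K, L, M):
    """Count weights in a fully connected net with no bias terms:
    input (D) -> L intermediate layers of M nodes each -> output (K)."""
    if L == 0:
        return D * K                      # direct input-to-output connections
    return D * M + (L - 1) * M * M + M * K

# e.g. D=3, K=4, L=2, M=5 -> 3*5 + 1*5*5 + 5*4 = 60
```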
(2) Show the output y_j of an output node (indexed with j) using the outputs y_i of the nodes of the preceding layer (indexed with i).
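The expected answer, y_j = sigmoid(sum_i w_ji y_i) with no bias, can be sketched as code (the symbols y_i and w_ji follow the assumed notation above):

```python
import math

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + math.exp(-a))

def output_node(y_prev, w_j):
    """y_j = sigmoid(sum_i w_ji * y_i), no bias term."""
    a_j = sum(w * y for w, y in zip(w_j, y_prev))
    return sigmoid(a_j)
```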
(3) Consider the problem of detecting the source(s) in a music recording composed of the sounds of one or more of the following: violin, flute, piano, and singing voice.
Describe how the training label t_j will be given for the output nodes (indexed with j).
Explain why it is not appropriate to use a softmax function in the output nodes for this problem.
(4) Show the binary cross-entropy of the output y_j and the training label t_j of an output node (indexed with j).
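The per-node binary cross-entropy, E_j = -(t_j log y_j + (1 - t_j) log(1 - y_j)), can be written directly (symbol names follow the assumed notation above):

```python
import math

def bce(y, t):
    """Binary cross-entropy of output y in (0, 1) against label t in {0, 1}:
    E = -(t log y + (1 - t) log(1 - y))."""
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))
```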
(5) Show the formula to update the weight w_ji between an output node (indexed with j) and a node of the preceding layer (indexed with i) based on the gradient descent method, with the objective function being the sum of the binary cross-entropy defined above over all classes.
Show how you derive the formula.
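The key step of the derivation is that sigmoid and binary cross-entropy combine to give dE/dw_ji = (y_j - t_j) y_i, so the update is w_ji <- w_ji - eta (y_j - t_j) y_i. A finite-difference check of this derivative, under the assumed notation (hypothetical numeric values):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss(w, y_prev, t):
    """BCE of a single sigmoid output node with weights w, no bias."""
    y = sigmoid(sum(wi * yi for wi, yi in zip(w, y_prev)))
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

w, y_prev, t = [0.3, -0.2], [0.5, 0.8], 1.0
y = sigmoid(sum(wi * yi for wi, yi in zip(w, y_prev)))

analytic = (y - t) * y_prev[0]        # dE/dw_0 = (y_j - t_j) * y_i

eps = 1e-6                            # central finite difference on w_0
numeric = (loss([w[0] + eps, w[1]], y_prev, t)
           - loss([w[0] - eps, w[1]], y_prev, t)) / (2 * eps)
```

The two gradients agree to numerical precision, confirming the simplification.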
(6) Show the formula to update the weight w_ji between two nodes (indexed with j and i), neither of which is in the output layer, based on the error back-propagation method. You do not have to show how you derive it.
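Parts (5) and (6) together can be sketched on a tiny two-layer network. This is an illustrative example with hypothetical sizes and random values, not the exam's expected answer: the hidden delta is obtained by propagating the output delta back through the weights and multiplying by the sigmoid derivative y(1 - y).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bce_sum(y, t):
    """Sum of binary cross-entropy over all output nodes."""
    return float(np.sum(-(t * np.log(y) + (1 - t) * np.log(1 - y))))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # input
W1 = rng.normal(size=(4, 3))           # weights not in the output layer
W2 = rng.normal(size=(2, 4))           # output-layer weights
t = np.array([1.0, 0.0])               # multi-label targets
eta = 0.05                             # learning rate

h = sigmoid(W1 @ x)                    # hidden activations
y = sigmoid(W2 @ h)                    # outputs
loss_before = bce_sum(y, t)

delta_out = y - t                             # output delta (sigmoid + BCE)
delta_hid = (W2.T @ delta_out) * h * (1 - h)  # back-propagated hidden delta

W2 = W2 - eta * np.outer(delta_out, h)  # update as in part (5)
W1 = W1 - eta * np.outer(delta_hid, x)  # update as in part (6)

loss_after = bce_sum(sigmoid(W2 @ sigmoid(W1 @ x)), t)
```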
(7) Explain why it becomes difficult to update the weights effectively as the number of network layers grows. Describe methods to mitigate this problem.
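The difficulty behind part (7) is the vanishing gradient: the sigmoid derivative y(1 - y) is at most 0.25, so the back-propagated error shrinks by at least a factor of 4 per layer. A one-line illustration of this bound:

```python
# Sigmoid derivative s'(a) = s(a)(1 - s(a)) peaks at a = 0, where s(a) = 0.5.
max_deriv = 0.5 * (1 - 0.5)           # 0.25, the largest possible per-layer factor
shrink_10_layers = max_deriv ** 10    # upper bound on gradient scale after 10 layers
```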
Let us consider N training samples x_1, ..., x_N of a d-dimensional vector, with their mean vector and covariance matrix denoted as μ and Σ, respectively, where x^T denotes the transpose of x.
(1) Show the formula to compute the (i, j) component Σ_ij of the covariance matrix Σ.
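The expected formula, Σ_ij = (1/N) Σ_n (x_n,i - μ_i)(x_n,j - μ_j), can be checked against numpy's biased estimator on hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))          # N = 100 samples, d = 3

mu = X.mean(axis=0)                    # mean vector
Sigma = (X - mu).T @ (X - mu) / len(X) # Sigma_ij = (1/N) sum_n (x_ni - mu_i)(x_nj - mu_j)

Sigma_np = np.cov(X, rowvar=False, bias=True)  # same estimator from numpy
```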
(2) Show the formula of the Mahalanobis distance between a sample x and this training sample distribution.
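The expected answer, d_M(x) = sqrt((x - μ)^T Σ^{-1} (x - μ)), as a short sketch (using a linear solve rather than an explicit inverse):

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """d_M(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))
```

With Σ equal to the identity matrix, this reduces to the ordinary Euclidean distance.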
(3) In neural network training, we often normalize the inputs so that the distribution for each dimension has a mean of 0 and a variance of 1.
Let x and z be an original input and its normalized version, respectively.
Discuss the relationship between the square root of the sum of the squared values in each dimension of z, regarded as the Euclidean norm ||z||, and the above Mahalanobis distance.
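The point of part (3) is that ||z|| equals the Mahalanobis distance only when Σ is diagonal, since per-dimension normalization ignores correlations between dimensions. A numerical illustration with hypothetical correlated data (z here denotes the per-dimension normalized input):

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[1.0, 0.8], [0.0, 0.6]])
X = rng.normal(size=(500, 2)) @ A      # correlated 2-d samples

mu, sd = X.mean(axis=0), X.std(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)

x = X[0]
z = (x - mu) / sd                      # per-dimension normalization
euclid = float(np.linalg.norm(z))      # ||z||

diff = x - mu
mahal = float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

# Mahalanobis distance computed with the off-diagonal terms of Sigma
# zeroed out; this coincides with ||z|| exactly.
diag_mahal = float(np.sqrt(diff @ np.linalg.solve(np.diag(sd**2), diff)))
```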
The classification classes consist of four categories: violin, flute, piano, and vocals; thus K = 4.
Let t_j ∈ {0, 1}, and construct the training label so that t_j = 1 if the corresponding source is present in the music recording, and t_j = 0 otherwise.
Using the softmax function in the output layer is inappropriate here because its outputs sum to 1, forcing the network to solve a single-label four-class classification problem and making it unable to represent recordings containing multiple sources.
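The contrast can be shown numerically with hypothetical output activations: softmax scores must sum to 1, while independent sigmoids can exceed 0.5 for several classes at once, matching a multi-label target.

```python
import numpy as np

a = np.array([2.0, 1.5, -3.0, -2.0])     # hypothetical activations for the 4 classes

p_softmax = np.exp(a) / np.exp(a).sum()  # sums to 1: only one class can "win"
p_sigmoid = 1.0 / (1.0 + np.exp(-a))     # independent per-class scores

present = p_sigmoid > 0.5                # threshold each class separately;
# here both violin and flute are detected, matching a label t = [1, 1, 0, 0]
```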