Solutions To Problems of Chapter 18
18.1. Prove that the perceptron algorithm, in its pattern-by-pattern mode of
operation, converges in a finite number of iteration steps. Assume that
θ(0) = 0.
Hint: Note that since the classes are assumed to be linearly separable, there
exists a normalized hyperplane, θ_*, and a γ > 0, so that
γ ≤ y_n θ_*^T x_n,  n = 1, 2, . . . , N,
where y_n is the respective label, being +1 for ω1 and −1 for ω2. By the
term normalized hyperplane, we mean that
θ_* = [θ̂_*^T, θ_{0*}]^T,  with ‖θ̂_*‖ = 1.
In this case, y_n θ_*^T x_n is the distance of x_n from the hyperplane θ_* ([164]).
Solution: Let us assume that we are currently at iteration i. This means
that so far i − 1 update corrections have been performed by the algorithm.
The ith update will be performed only if the current sample, x_(i), is
misclassified, i.e., only if
y_(i) x_(i)^T θ(i−1) ≤ 0,
and the update will be
θ(i) = θ(i−1) + μ y_(i) x_(i).    (1)
Also, from (1) we have that
‖θ(i)‖² = ‖θ(i−1)‖² + 2μ y_(i) x_(i)^T θ(i−1) + μ² y_(i)² ‖x_(i)‖²    (4)
≤ ‖θ(i−1)‖² + μ² R²,
since the update takes place only when y_(i) x_(i)^T θ(i−1) ≤ 0, y_(i)² = 1, and R := max_n ‖x_n‖. Starting from θ(0) = 0, this gives ‖θ(i)‖² ≤ i μ² R². On the other hand, taking the inner product of (1) with θ_* and using the margin bound from the hint, θ(i)^T θ_* ≥ θ(i−1)^T θ_* + μγ, so that θ(i)^T θ_* ≥ i μγ. Combining the two bounds via the Cauchy–Schwarz inequality,
i² μ² γ² ≤ (θ(i)^T θ_*)² ≤ ‖θ(i)‖² ‖θ_*‖² ≤ i μ² R² ‖θ_*‖²,
and hence i ≤ R² ‖θ_*‖² / γ²; that is, only a finite number of updates can be performed.
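The finiteness of the number of updates can also be observed numerically. The following sketch (plain NumPy; the data, margin threshold, and learning rate are illustrative choices, not taken from the text) runs the pattern-by-pattern perceptron with θ(0) = 0 on a linearly separable set and counts the updates until a full pass produces no correction.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2-D points in extended coordinates x = [x1, x2, 1]^T, labeled by a known
# hyperplane (illustrative); points too close to it are discarded so that
# a strictly positive margin gamma exists, as the hint requires.
X = np.hstack([rng.uniform(-1, 1, size=(200, 2)), np.ones((200, 1))])
theta_true = np.array([1.0, -2.0, 0.3])
X = X[np.abs(X @ theta_true) > 0.2]
y = np.sign(X @ theta_true)

mu = 0.5                        # learning rate (illustrative)
theta = np.zeros(3)             # theta(0) = 0, as the problem assumes
updates = 0
converged = False
for epoch in range(1000):
    changed = False
    for xn, yn in zip(X, y):    # pattern-by-pattern mode
        if yn * (xn @ theta) <= 0:          # misclassified: apply update (1)
            theta = theta + mu * yn * xn
            updates += 1
            changed = True
    if not changed:             # a full pass without updates: converged
        converged = True
        break

print(updates, converged)
```

The bound i ≤ R²‖θ_*‖²/γ² guarantees that, for this margin, convergence occurs well within the epoch budget.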
18.2. The derivative of the sigmoid function has been computed in Problem
7.6. Compute the derivative of the hyperbolic tangent activation function
and show that it is equal to
f′(z) = (c/a)(a² − f²(z)).
Solution: We have that
f(z) = a tanh(cz) ⇒ f′(z) = ac cosh⁻²(cz),
or
f′(z) = ac(1 − tanh²(cz)) = ac(1 − f²(z)/a²) = (c/a)(a² − f²(z)).
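The closed form f′(z) = (c/a)(a² − f²(z)) can be verified against a central finite difference; the constants a, c below are illustrative choices.

```python
import numpy as np

a, c = 1.7159, 2.0 / 3.0            # illustrative constants
f = lambda z: a * np.tanh(c * z)    # the hyperbolic tangent activation

z = np.linspace(-3, 3, 61)
h = 1e-6
numeric = (f(z + h) - f(z - h)) / (2 * h)      # central difference
analytic = (c / a) * (a**2 - f(z)**2)          # claimed closed form

err = np.max(np.abs(numeric - analytic))
print(err)
```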
18.3. Show that the effect of the momentum term in the gradient descent back-
propagation scheme is to effectively increase the learning convergence rate
of the algorithm.
Hint: Assume that the gradient is approximately constant over I successive
iterations.
Solution: Let i denote the current iteration. Then,
Δθ_j^r(i) = a Δθ_j^r(i−1) − μ g,    (8)
where a is the momentum factor, 0 < a < 1, and g denotes the (approximately constant) gradient. Applying (8) recursively, with Δθ_j^r(0) = −μ g, we obtain
Δθ_j^r(i) = −μ g Σ_{t=0}^{i} a^t → −μ g / (1 − a),
as i grows. That is, the correction effectively applied is that of gradient descent with learning rate μ/(1 − a) > μ, which justifies the claim.
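Under the constant-gradient assumption of the hint, the recursion (8) can be iterated directly; the sketch below (values are illustrative) shows the momentum correction settling at the geometric-series limit −μg/(1 − a), i.e., an effective learning rate of μ/(1 − a).

```python
import numpy as np

mu, alpha, g = 0.01, 0.9, 2.0     # learning rate, momentum factor, constant gradient
delta = 0.0                       # Delta-theta(0) = 0
for _ in range(500):
    delta = alpha * delta - mu * g    # recursion (8) with constant g

effective = -mu * g / (1 - alpha)     # geometric-series limit: effective rate mu/(1-alpha)
print(delta, effective)
```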
18.4. Show that if a) the activation function is the hyperbolic tangent, b) the
input variables are normalized to zero mean and unit variance, then in
order to guarantee that all the outputs of the neurons are zero mean and
unit variance, the weights must be drawn from a distribution of zero mean
and standard deviation equal to
σ = m^{−1/2},
where m is the number of synaptic weights associated with the correspond-
ing neuron.
Hint: For simplicity, consider the bias to be zero, and also that the inputs
to each neuron are mutually uncorrelated.
Solution: Let us focus on the weights of a specific neuron, and denote
by y_i, i = 1, 2, . . . , m, the respective inputs. Then, the input to the
activation function will be given by
z = Σ_{i=1}^m θ_i y_i.
Assume that the inputs to the neuron are zero mean and unit variance
and are also uncorrelated; then
σ_i² = E[y_i²] = 1,    (12)
and
E[y_i y_j] = δ_ij.    (13)
Hence, if the weights are drawn independently of the inputs, with zero mean and variance σ², we get
E[z²] = Σ_{i=1}^m E[θ_i²] E[y_i²] = m σ².
Requiring E[z²] = 1, so that z (and hence the output, in the approximately linear regime of the tanh) is of unit variance, leads to σ = m^{−1/2}.
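A quick Monte-Carlo sketch (fan-in and sample sizes are illustrative) confirms that drawing the weights with standard deviation m^{−1/2} keeps z = Σ_i θ_i y_i at zero mean and unit variance.

```python
import numpy as np

rng = np.random.default_rng(1)
m, trials = 50, 50_000              # fan-in and Monte-Carlo size (illustrative)

# Zero-mean, unit-variance, mutually uncorrelated inputs.
y = rng.standard_normal((trials, m))
# Weights: zero mean, std m^{-1/2}, drawn independently of the inputs.
theta = rng.normal(0.0, m ** -0.5, size=(trials, m))

z = np.sum(theta * y, axis=1)       # neuron input z = sum_i theta_i y_i
print(z.mean(), z.var())
```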
18.5. Consider the sum of error squares cost function
J = (1/2) Σ_{n=1}^N Σ_{m=1}^{k_L} (ŷ_nm − y_nm)².    (15)
Compute the elements of the Hessian matrix
∂²J / (∂θ_kj^r ∂θ_{k′j′}^{r′}).    (16)
Near the optimum, show that the second order derivatives can be approx-
imated by
∂²J / (∂θ_kj^r ∂θ_{k′j′}^{r′}) = Σ_{n=1}^N Σ_{m=1}^{k_L} (∂ŷ_nm/∂θ_kj^r)(∂ŷ_nm/∂θ_{k′j′}^{r′}).    (17)
In other words, the second order derivatives can be approximated as prod-
ucts of the first order derivatives. The derivatives can be computed by fol-
lowing similar arguments as in the gradient descent backpropagation scheme,
[71].
Solution: Starting from
J = (1/2) Σ_{n=1}^N Σ_{m=1}^{k_L} (ŷ_nm − y_nm)²,
the first derivative is
∂J/∂θ_kj^r = Σ_{n=1}^N Σ_{m=1}^{k_L} (∂ŷ_nm/∂θ_kj^r)(ŷ_nm − y_nm).
Differentiating once more,
∂²J/(∂θ_kj^r ∂θ_{k′j′}^{r′}) = Σ_{n=1}^N Σ_{m=1}^{k_L} (∂²ŷ_nm/(∂θ_kj^r ∂θ_{k′j′}^{r′}))(ŷ_nm − y_nm)
+ Σ_{n=1}^N Σ_{m=1}^{k_L} (∂ŷ_nm/∂θ_kj^r)(∂ŷ_nm/∂θ_{k′j′}^{r′}).
Near the optimum, ŷ_nm ≈ y_nm, so the first term becomes negligible and the approximation (17) results.
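The Gauss–Newton-type approximation (17) can be checked numerically on a toy model; the sketch below uses a single tanh neuron with targets generated by the model itself, so that residuals vanish at the chosen point (model, data, and sizes are illustrative, not from the text).

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 200, 3
X = rng.standard_normal((N, d))
theta_opt = np.array([0.5, -1.0, 0.8])    # illustrative "true" weights
y = np.tanh(X @ theta_opt)                # targets from the model: zero residuals
                                          # at theta_opt, i.e., we sit at the optimum

def cost(theta):
    e = np.tanh(X @ theta) - y
    return 0.5 * np.sum(e ** 2)

# Exact Hessian at the optimum via central finite differences.
h = 1e-5
H = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
        H[i, j] = (cost(theta_opt + ei + ej) - cost(theta_opt + ei - ej)
                   - cost(theta_opt - ei + ej) + cost(theta_opt - ei - ej)) / (4 * h * h)

# Approximation (17): sum over samples of outer products of first derivatives.
yhat = np.tanh(X @ theta_opt)
G = (1 - yhat ** 2)[:, None] * X          # row n: d yhat_n / d theta
H_approx = G.T @ G

err = np.max(np.abs(H - H_approx))
print(err)
```

At the optimum the residual term vanishes exactly, so the two matrices agree up to finite-difference error.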
18.6. It is common, when computing the Hessian matrix, to assume that it is
diagonal. Show that, under this assumption, the quantities
∂²E/∂(θ_kj^r)²
can be computed by the following backpropagation-type recursions:
• ∂²E/∂(z_j^L)² = (f′(z_j^L))² + f″(z_j^L) e_j,
• ∂²E/∂(z_j^{r−1})² = (f′(z_j^{r−1}))² Σ_{k=1}^{k_r} (∂²E/∂(z_k^r)²)(θ_kj^r)² + f″(z_j^{r−1}) Σ_{k=1}^{k_r} θ_kj^r δ_k^r,
• ∂²E/∂(θ_kj^r)² = (∂²E/∂(z_k^r)²)(y_j^{r−1})².    (20)
Solution: Since z_k^r = Σ_j θ_kj^r y_j^{r−1}, the weight θ_kj^r affects E only through z_k^r, so that
∂²E/∂(θ_kj^r)² = (∂²E/∂(z_k^r)²)(y_j^{r−1})².    (20)
For the backpropagation of the involved second derivatives, we have:
I: r = L: With E = (1/2) Σ_{j=1}^{k_L} e_j² and e_j = f(z_j^L) − y_j,
δ_j^L = ∂E/∂z_j^L = f′(z_j^L) e_j,
and, differentiating once more,
∂²E/∂(z_j^L)² = (f′(z_j^L))² + f″(z_j^L) e_j.
II: r < L: We know from the text that
δ_j^{r−1} = Σ_{k=1}^{k_r} δ_k^r (∂z_k^r/∂z_j^{r−1}) = Σ_{k=1}^{k_r} δ_k^r θ_kj^r f′(z_j^{r−1}),    (21)
or, differentiating (21) with respect to z_j^{r−1},
∂²E/∂(z_j^{r−1})² = f″(z_j^{r−1}) Σ_{k=1}^{k_r} θ_kj^r δ_k^r + f′(z_j^{r−1}) Σ_{k=1}^{k_r} θ_kj^r (∂δ_k^r/∂z_j^{r−1}),    (22)
where, under the diagonal (cross-derivatives neglected) assumption,
∂δ_k^r/∂z_j^{r−1} = (∂²E/∂(z_k^r)²) θ_kj^r f′(z_j^{r−1}).
Thus, finally, we get
∂²E/∂(z_j^{r−1})² = (f′(z_j^{r−1}))² Σ_{k=1}^{k_r} (∂²E/∂(z_k^r)²)(θ_kj^r)² + f″(z_j^{r−1}) Σ_{k=1}^{k_r} θ_kj^r δ_k^r.
18.7. Show that if the activation function is the logistic sigmoid and the loss
function in (18.44) is used, then δ_nj^L in (18.21) becomes
δ_nj^L = a(ŷ_nj − y_nj).
If the cross-entropy in Eq. (18.41) is used, then it is equal to
δ_nj^L = a y_nj (ŷ_nj − 1).
Solution: The cost function is
J = Σ_{n=1}^N J_n,
and δ_nj^L = ∂J_n/∂z_nj^L = (∂J_n/∂ŷ_nj) f′(z_nj^L). For the logistic sigmoid, ŷ_nj = f(z_nj^L) with f(z) = 1/(1 + exp(−az)), so that
f′(z) = a f(z)(1 − f(z)).
Substituting this derivative, together with ∂J_n/∂ŷ_nj for the respective loss function, and simplifying, leads directly to the two stated expressions.
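Both expressions can be checked numerically. The sketch below assumes (an assumption, since the referenced equations are not quoted here) that the loss in (18.44) has the binary cross-entropy form −Σ[y ln ŷ + (1 − y) ln(1 − ŷ)] and that (18.41) is −Σ y ln ŷ; these are the forms for which the stated deltas hold with the logistic sigmoid.

```python
import numpy as np

a = 1.5                                   # illustrative sigmoid slope
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-a * z))

rng = np.random.default_rng(3)
K = 20
z = rng.uniform(-2, 2, size=K)
y = rng.uniform(0.05, 0.95, size=K)       # illustrative target values

# Assumed loss forms (not quoted from the text):
J_bce = lambda z: -np.sum(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))
J_ce  = lambda z: -np.sum(y * np.log(sigmoid(z)))

# Central-difference gradients w.r.t. each z-component.
h = 1e-6
num_bce = np.array([(J_bce(z + h * e) - J_bce(z - h * e)) / (2 * h) for e in np.eye(K)])
num_ce  = np.array([(J_ce(z + h * e) - J_ce(z - h * e)) / (2 * h) for e in np.eye(K)])

yhat = sigmoid(z)
err1 = np.max(np.abs(num_bce - a * (yhat - y)))      # claim: delta = a(yhat - y)
err2 = np.max(np.abs(num_ce - a * y * (yhat - 1)))   # claim: delta = a y (yhat - 1)
print(err1, err2)
```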
18.8. Show that if the activation function is the logistic sigmoid and the relative
entropy loss function is used, then δ_nj^L in (18.21) becomes
δ_nj^L = a(ŷ_nj − 1) y_nj.
Solution: From the theory, we have that, for the relative entropy,
J_n = −Σ_j y_nj ln(ŷ_nj/y_nj),
so that ∂J_n/∂ŷ_nj = −y_nj/ŷ_nj; multiplying by f′(z_nj^L) = a ŷ_nj (1 − ŷ_nj) gives
δ_nj^L = −(y_nj/ŷ_nj) a ŷ_nj (1 − ŷ_nj) = a(ŷ_nj − 1) y_nj.
18.9. Show that the cross-entropy loss function depends on the relative output
errors.
Solution: We will prove it for one of the versions; the proof is similar
for the other. Taking into consideration the limiting value 0 ln 0 = 0, we
can write the cross-entropy cost as
J = −Σ_{n=1}^N Σ_{k=1}^{k_L} [ y_nk ln(ŷ_nk/y_nk) + (1 − y_nk) ln((1 − ŷ_nk)/(1 − y_nk)) ],
which depends on the outputs only through the ratios ŷ_nk/y_nk and (1 − ŷ_nk)/(1 − y_nk), i.e., on the relative output errors.
18.10. Show that if the activation function is the softmax and the loss function
is the cross-entropy (or the loss in (18.44)), then δ_nj^L in (18.21) does not
depend on the derivatives of the nonlinearity.
Solution: We will show it for Eq. (18.44); for the cross-entropy, the proof is similar. By definition,
δ_nj^L = ∂J_n/∂z_nj^L = (∂J_n/∂ŷ_nj)(∂ŷ_nj/∂z_nj^L).
Now,
∂ŷ_nj/∂z_nj^L = ∂/∂z_nj^L [ exp(z_nj^L) / Σ_{m=1}^{k_L} exp(z_nm^L) ]
= exp(z_nj^L)/Σ_{m=1}^{k_L} exp(z_nm^L) − (exp(z_nj^L))² / (Σ_{m=1}^{k_L} exp(z_nm^L))²
= ŷ_nj (1 − ŷ_nj).    (23)
Also,
∂J_n/∂ŷ_nj = −y_nj/ŷ_nj + (1 − y_nj)/(1 − ŷ_nj) = (ŷ_nj − y_nj) / (ŷ_nj(1 − ŷ_nj)).
Hence,
δ_nj^L = ŷ_nj − y_nj,
and the derivative of the nonlinearity has canceled out.
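The claim δ_nj^L = ŷ_nj − y_nj can be verified numerically for the softmax with the cross-entropy −Σ_k y_k ln ŷ_k (targets summing to one; sizes and values below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(4)
K = 5
z = rng.uniform(-2, 2, size=K)
y = rng.dirichlet(np.ones(K))            # illustrative targets summing to one

softmax = lambda z: np.exp(z) / np.sum(np.exp(z))
Jn = lambda z: -np.sum(y * np.log(softmax(z)))    # cross-entropy

# Central-difference gradient of Jn w.r.t. each z-component.
h = 1e-6
num = np.array([(Jn(z + h * e) - Jn(z - h * e)) / (2 * h) for e in np.eye(K)])

delta = softmax(z) - y                   # claim: delta = yhat - y
err = np.max(np.abs(num - delta))
print(err)
```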
18.11. As in the previous problem, use the relative entropy as the cost function
and the softmax activation function. Then show that
δ_nj^L = ŷ_nj − y_nj.
Solution:
J_n = −Σ_{k=1}^{k_L} y_nk ln(ŷ_nk/y_nk),
with ŷ_nk = exp(z_nk^L)/Σ_m exp(z_nm^L). Taking into account the cross terms of the softmax, ∂ŷ_nk/∂z_nj^L = ŷ_nk(δ_kj − ŷ_nj), so that
δ_nj^L = Σ_{k=1}^{k_L} (−y_nk/ŷ_nk) ŷ_nk (δ_kj − ŷ_nj) = −y_nj + ŷ_nj Σ_{k=1}^{k_L} y_nk = ŷ_nj − y_nj,
since Σ_k y_nk = 1.
18.12. Derive the backpropagation through time algorithm for training recurrent
neural networks.
Solution: The starting set of equations are those defining an RNN, i.e.,
h_n = f(U x_n + W h_{n−1} + b),
z_n = V h_n + c,  ŷ_n = g(z_n),
where f is the tanh function, acting element-wise.
• For the gradient w.r. to W: each element w_ij affects the cost through the hidden states, so
∂J/∂w_ij = Σ_{n=1}^N (∂J/∂h_ni)(∂h_ni/∂w_ij).    (25)
Note, from the defining RNN equations, that all the w_ij, j = 1, 2, . . . , k_h, elements enter h_ni through the tanh nonlinearity; collecting terms in matrix form,
∂J/∂W = Σ_{n=1}^N diag(1 − h_ni², i = 1, . . . , k_h) (∂J/∂h_n) h_{n−1}^T.    (27)
To use (27), we need ∂J/∂h_N (where the recursion starts), ∂ŷ_n/∂h_n and ∂h_{n+1}/∂h_n. The gradient of the cost w.r. to the outputs is straightforward, and depends on the specific loss function; see, e.g.,
Problem 18.10.
Also, following similar arguments as before,
∂h_{n+1}/∂h_n = diag(1 − h_{(n+1)i}², i = 1, . . . , k_h) W,
so the backward recursion for ∂J/∂h_n is straightforward.
The above complete the calculation of the gradient of J w.r. to W.
• For the gradient w.r. to V, we get
∂J/∂V = Σ_{n=1}^N (∂J/∂z_n) h_n^T.
• For the gradient w.r. to c,
∂J/∂c = Σ_{n=1}^N (∂z_n/∂c^T)(∂J/∂z_n) = Σ_{n=1}^N ∂J/∂z_n.
• The gradient w.r. to b is equal to
∂J/∂b = Σ_{n=1}^N diag(1 − h_ni², i = 1, . . . , k_h) (∂J/∂h_n).
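The BPTT recursion can be validated on a tiny tanh RNN against finite differences. The sketch below (sizes, loss, and initialization are illustrative assumptions: a linear output layer z_n = V h_n + c and a squared-error loss) accumulates ∂J/∂W with the backward recursion and compares entry-wise with numerical gradients.

```python
import numpy as np

rng = np.random.default_rng(5)
kh, kx, ky, N = 4, 3, 2, 6        # sizes and sequence length (illustrative)
U = rng.standard_normal((kh, kx)) * 0.3
W = rng.standard_normal((kh, kh)) * 0.3
V = rng.standard_normal((ky, kh)) * 0.3
b, c = rng.standard_normal(kh) * 0.1, rng.standard_normal(ky) * 0.1
xs = rng.standard_normal((N, kx))
ts = rng.standard_normal((N, ky))   # targets, squared-error loss

def forward(W):
    h = np.zeros(kh)
    hs, J = [h], 0.0
    for n in range(N):
        h = np.tanh(U @ xs[n] + W @ h + b)
        hs.append(h)
        J += 0.5 * np.sum((V @ h + c - ts[n]) ** 2)
    return hs, J

# BPTT: backward recursion for dJ/dh_n, then the analogue of (27) for dJ/dW.
hs, _ = forward(W)
dW = np.zeros_like(W)
dh = np.zeros(kh)                   # holds the contribution from future steps
for n in range(N, 0, -1):
    h = hs[n]
    dh = V.T @ (V @ h + c - ts[n - 1]) + dh        # dJ/dh_n = local + future
    dW += np.outer((1 - h ** 2) * dh, hs[n - 1])   # diag(1-h^2) dJ/dh_n h_{n-1}^T
    dh = W.T @ ((1 - h ** 2) * dh)                 # propagate to h_{n-1}

# Finite-difference check of every entry of dJ/dW.
eps, err = 1e-6, 0.0
for i in range(kh):
    for j in range(kh):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num = (forward(Wp)[1] - forward(Wm)[1]) / (2 * eps)
        err = max(err, abs(num - dW[i, j]))
print(err)
```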
18.13. Derive the gradient of the log-likelihood in (18.74).
Solution: To simplify slightly the notation, we will consider only one
term of the sum. Thus, we have
∂ ln P(v; Θ)/∂θ_ij = ∂/∂θ_ij [ ln Σ_h exp(−E(v, h)) − ln Σ_v Σ_h exp(−E(v, h)) ].    (29)
18.14. Prove that for the case of RBMs, the conditional probabilities are given
by the following factorized form,
P(h|v) = Π_{i=1}^I [ exp((Σ_{j=1}^J θ_ij v_j + b_i) h_i) / (1 + exp(Σ_{j=1}^J θ_ij v_j + b_i)) ],
for binary hidden units, h_i ∈ {0, 1}.
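The factorization can be checked by brute force on a small binary RBM: enumerate all hidden configurations, normalize the joint exp(−E) directly, and compare with the product form. The energy convention below, −E = h^T Θ v + a^T v + b^T h, and all sizes are illustrative assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(6)
I, J = 3, 4                         # hidden and visible sizes (small, illustrative)
Theta = rng.standard_normal((I, J))
a = rng.standard_normal(J)          # visible biases
b = rng.standard_normal(I)          # hidden biases

v = rng.integers(0, 2, size=J).astype(float)    # a fixed visible vector

# Unnormalized joint: exp(-E(v, h)) with -E = h^T Theta v + a^T v + b^T h.
def unnorm(h):
    return np.exp(h @ Theta @ v + a @ v + b @ h)

# Exact conditional by enumerating all 2^I hidden configurations.
hs = [np.array(h, dtype=float) for h in product([0, 1], repeat=I)]
Z = sum(unnorm(h) for h in hs)
exact = {tuple(h): unnorm(h) / Z for h in hs}

# Claimed factorized form: product over i of
# exp((sum_j theta_ij v_j + b_i) h_i) / (1 + exp(sum_j theta_ij v_j + b_i)).
s = Theta @ v + b
factorized = {tuple(h): np.prod(np.exp(s * h) / (1 + np.exp(s))) for h in hs}

err = max(abs(exact[k] - factorized[k]) for k in exact)
print(err)
```

The agreement is exact up to floating-point error, since Σ_h Π_i exp(s_i h_i) = Π_i (1 + exp(s_i)).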
18.15. Derive Eq. (18.83).
Solution: The integrand is equal to