Solutions To Problems of Chapter 7
7.1. Show that the Bayesian classifier is optimal, in the sense that it minimizes the probability of error.
Hint: Consider a classification task of M classes and start with the probability of correct label prediction, P(C). Then the probability of error will be P(e) = 1 − P(C).
Solution: Let P(C) be the probability of correct classification. Then
\[
P(C) = \sum_{i=1}^{M} P(x \in R_i, \omega_i) = \sum_{i=1}^{M} P(\omega_i)\, P(x \in R_i \mid \omega_i),
\]
or
\[
P(C) = \sum_{i=1}^{M} P(\omega_i) \int_{R_i} p(x \mid \omega_i)\, dx = \sum_{i=1}^{M} \int_{R_i} P(\omega_i \mid x)\, p(x)\, dx.
\]
The probability of error is P(e) = 1 − P(C), so minimizing P(e) is equivalent to maximizing P(C). Since p(x) ≥ 0, the sum of integrals is maximized if each point x is assigned to the region R_i for which P(ω_i | x) is maximum. This is exactly the Bayesian classification rule, which is therefore optimal.
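The optimality argument can also be checked numerically. The sketch below (illustrative numbers, assuming two equiprobable one-dimensional Gaussian classes with equal variances, where the Bayes rule reduces to the midpoint threshold) verifies that no other threshold on a grid achieves a smaller error probability:

```python
import numpy as np
from math import erf, sqrt

# Two equiprobable 1-D Gaussian classes N(0, 1) and N(2, 1) (made-up numbers).
mu1, mu2, sigma = 0.0, 2.0, 1.0

def phi(z):  # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def error(t):
    # P(e) = P(w1) P(x > t | w1) + P(w2) P(x < t | w2) for threshold rule at t
    return 0.5 * (1.0 - phi((t - mu1) / sigma)) + 0.5 * phi((t - mu2) / sigma)

bayes_t = 0.5 * (mu1 + mu2)   # Bayes threshold: midpoint of the two means
for t in np.linspace(-1.0, 3.0, 41):
    assert error(bayes_t) <= error(t) + 1e-12
print(f"Bayes error at t={bayes_t}: {error(bayes_t):.4f}")
```

The minimum error here equals Φ(−1) ≈ 0.1587, attained only at the midpoint threshold.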
7.2. Show that if the data follow the Gaussian distribution in an M-class task, with equal covariance matrices in all classes, the regions formed by the Bayesian classifier are convex.
Solution: Consider two points, x_1, x_2, lying in R_i. Then any point lying on the line segment which connects these two points can be written as
\[
x = \lambda x_1 + (1-\lambda)x_2, \qquad 0 \le \lambda \le 1.
\]
For Gaussian classes with equal covariance matrices, the Bayesian classifier is realized via linear discriminant functions, g_j(x) = θ_j^T x + θ_{j0}, and x ∈ R_i if and only if g_i(x) ≥ g_j(x), ∀ j ≠ i. However, since by the definition of the partition,
\[
g_i(x_1) \ge g_j(x_1) \quad \text{and} \quad g_i(x_2) \ge g_j(x_2), \qquad \forall j \ne i,
\]
the linearity of the discriminants yields
\[
g_i(x) = \lambda g_i(x_1) + (1-\lambda) g_i(x_2) \ge \lambda g_j(x_1) + (1-\lambda) g_j(x_2) = g_j(x), \qquad \forall j \ne i.
\]
Hence x ∈ R_i as well, and R_i is convex.
7.3. Derive the form of the Bayesian classifier for the case of two equiprobable classes, when the data follow the Gaussian distribution with the same covariance matrix. Furthermore, derive the equation that describes the LS linear classifier. Compare and comment on the results.
Solution: For the scenario of this problem, we already know that the Bayesian classifier is equivalent to the minimum Mahalanobis distance classifier; that is, assign x to ω_1 if
\[
(x-\mu_1)^T \Sigma^{-1}(x-\mu_1) < (x-\mu_2)^T \Sigma^{-1}(x-\mu_2),
\]
which, after expanding and cancelling the common quadratic term, is equivalent to the linear rule
\[
(\mu_1-\mu_2)^T \Sigma^{-1}\Big(x - \frac{\mu_1+\mu_2}{2}\Big) > 0.
\]
Let us now turn our attention to the LS solution. By extending the involved training feature vectors by one, in order to accommodate the bias term, and assuming N/2 samples per class with labels y_n = ±1, the LS classifier is given by the solution of the following set of linear equations:
\[
\sum_{n=1}^{N} x_n x_n^T\, \theta + \theta_0 \sum_{n=1}^{N} x_n = \sum_{n=1}^{N} y_n x_n = \frac{N}{2}(\mu_1-\mu_2),
\qquad
\theta^T \sum_{n=1}^{N} x_n + N\theta_0 = 0, \quad (1)
\]
where the zero comes from Σ_n y_n, in which half of the terms are +1 and the rest −1. In the previous formula we have that
\[
\frac{1}{N}\sum_{n=1}^{N} x_n = \mu := \frac{\mu_1+\mu_2}{2}, \qquad
\frac{1}{N}\sum_{n=1}^{N} x_n x_n^T = \Sigma_b + \mu\mu^T,
\]
and Σ_b is the covariance matrix for the two classes together; that is, around the overall mean value µ. Substituting the previous definitions in (1), and solving the system of equations, it is trivial to see that θ_0 = −θ^T µ and
\[
\theta = \frac{1}{2}\Sigma_b^{-1}(\mu_1-\mu_2).
\]
Combining the previous findings we obtain the LS optimal linear classifier: assign x to ω_1 if θ^T(x − µ) > 0. Moreover, for equiprobable classes, Σ_b = Σ + (1/4)(µ_1 − µ_2)(µ_1 − µ_2)^T, and by the matrix inversion lemma Σ_b^{-1}(µ_1 − µ_2) is parallel to Σ^{-1}(µ_1 − µ_2). Hence the LS hyperplane has the same direction and the same midpoint threshold as the Bayesian one: for this Gaussian scenario, the LS linear classifier coincides with the optimal Bayesian classifier.
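The relation θ = (1/2) Σ_b^{-1}(µ_1 − µ_2) holds exactly when sample means and the pooled sample covariance are used with balanced ±1 labels, so it can be verified directly. A minimal numerical sketch (illustrative data, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200                                     # N/2 samples per class
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])  # common (made-up) covariance
L = np.linalg.cholesky(Sigma)
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])

X1 = rng.standard_normal((N // 2, 2)) @ L.T + mu1   # class w1, label +1
X2 = rng.standard_normal((N // 2, 2)) @ L.T + mu2   # class w2, label -1
X = np.vstack([X1, X2])
y = np.hstack([np.ones(N // 2), -np.ones(N // 2)])

# LS solution with an extended (bias) coordinate.
Xe = np.hstack([X, np.ones((N, 1))])
w = np.linalg.lstsq(Xe, y, rcond=None)[0]
theta, theta0 = w[:2], w[2]

# Closed form derived above: theta = (1/2) Sigma_b^{-1} (m1 - m2), with Sigma_b
# the pooled covariance around the overall sample mean, and theta0 = -theta^T m.
m1, m2, m = X1.mean(0), X2.mean(0), X.mean(0)
Sigma_b = (X - m).T @ (X - m) / N
theta_cf = 0.5 * np.linalg.solve(Sigma_b, m1 - m2)

assert np.allclose(theta, theta_cf)
assert np.isclose(theta0, -theta @ m)
```

The agreement is exact (up to floating point), since the derivation only used the balanced-label structure of the normal equations.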
7.4. Show that the ML estimate of the covariance matrix of a Gaussian distribution, based on N i.i.d. observations, x_n, n = 1, 2, . . . , N, is given by
\[
\hat{\Sigma}_{ML} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu}_{ML})(x_n - \hat{\mu}_{ML})^T,
\]
where
\[
\hat{\mu}_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n.
\]
Solution: We shall focus on the simplest case where Σ = σ²I. The log-likelihood function is
\[
L(\mu, \sigma^2) = \sum_{n=1}^{N} \ln p(x_n; \mu, \sigma^2)
= -\frac{Nl}{2}\ln(2\pi) - \frac{Nl}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{n=1}^{N}\|x_n - \mu\|^2,
\]
where l is the dimensionality of x. Setting the gradient with respect to µ equal to zero,
\[
\frac{\partial L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n - \mu) = 0
\;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n,
\]
and with respect to σ²,
\[
\frac{\partial L}{\partial \sigma^2} = -\frac{Nl}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{n=1}^{N}\|x_n - \hat{\mu}_{ML}\|^2 = 0
\;\Rightarrow\; \hat{\sigma}^2_{ML} = \frac{1}{Nl}\sum_{n=1}^{N}\|x_n - \hat{\mu}_{ML}\|^2,
\]
which is the claimed formula specialized to Σ = σ²I. The general case follows along the same lines, by differentiating with respect to the elements of Σ.
7.5. Prove that the covariance estimate
\[
\hat{\Sigma} = \frac{1}{N-1}\sum_{k=1}^{N} (x_k - \hat{\mu})(x_k - \hat{\mu})^T
\]
corresponds to an unbiased estimator, where
\[
\hat{\mu} = \frac{1}{N}\sum_{k=1}^{N} x_k.
\]
Solution: We have that
\[
E[\hat{\Sigma}] = \frac{1}{N-1}\sum_{k=1}^{N}
E\Big[\big((x_k-\mu)-(\hat{\mu}-\mu)\big)\big((x_k-\mu)-(\hat{\mu}-\mu)\big)^T\Big]
\]
\[
= \frac{1}{N-1}\sum_{k=1}^{N}\Big(\Sigma
- E\big[(x_k-\mu)(\hat{\mu}-\mu)^T\big]
- E\big[(\hat{\mu}-\mu)(x_k-\mu)^T\big]
+ E\big[(\hat{\mu}-\mu)(\hat{\mu}-\mu)^T\big]\Big).
\]
However,
\[
E\big[(\hat{\mu}-\mu)(\hat{\mu}-\mu)^T\big]
= E\Big[\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)\,\frac{1}{N}\sum_{j=1}^{N}(x_j-\mu)^T\Big]
= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} E\big[(x_i-\mu)(x_j-\mu)^T\big]
= \frac{1}{N}\Sigma,
\]
where independence among the samples has been assumed, i.e.,
\[
E\big[(x_i-\mu)(x_j-\mu)^T\big] = \delta_{ij}\,\Sigma.
\]
Similarly, E[(x_k − µ)(µ̂ − µ)^T] = (1/N)Σ. Hence,
\[
E[\hat{\Sigma}] = \frac{1}{N-1}\sum_{k=1}^{N}\Big(\Sigma - \frac{2}{N}\Sigma + \frac{1}{N}\Sigma\Big)
= \frac{N}{N-1}\cdot\frac{N-1}{N}\,\Sigma = \Sigma,
\]
which proves that the estimator is unbiased.
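The bias of the two normalizations can be illustrated with a short Monte Carlo sketch (illustrative parameters): averaging over many small samples, the 1/(N−1) estimator centers on Σ, while the 1/N (ML) estimator centers on (N−1)/N · Σ:

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])            # made-up true covariance
L = np.linalg.cholesky(Sigma)
N, trials = 5, 200_000                    # many small samples of size N

# trials independent samples of size N from N(0, Sigma)
X = rng.standard_normal((trials, N, 2)) @ L.T
D = X - X.mean(axis=1, keepdims=True)     # deviations from each sample mean
S = np.einsum('tni,tnj->tij', D, D)       # per-trial scatter matrices

unbiased = (S / (N - 1)).mean(axis=0)     # 1/(N-1): unbiased
ml = (S / N).mean(axis=0)                 # 1/N: ML, biased by (N-1)/N

assert np.allclose(unbiased, Sigma, atol=0.02)
assert np.allclose(ml, Sigma * (N - 1) / N, atol=0.02)
```

With N = 5 the ML estimate underestimates the covariance by 20% on average, exactly the (N−1)/N factor from the proof.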
7.6. Show that the derivative of the logistic link function is given by
\[
\frac{d\sigma(t)}{dt} = \sigma(t)\big(1 - \sigma(t)\big).
\]
Solution: We have that
\[
\sigma(t) = \frac{1}{1 + \exp(-t)}.
\]
Differentiating,
\[
\frac{d\sigma(t)}{dt} = \frac{\exp(-t)}{\big(1+\exp(-t)\big)^2}
= \frac{1}{1+\exp(-t)}\cdot\frac{\exp(-t)}{1+\exp(-t)}
= \sigma(t)\big(1-\sigma(t)\big),
\]
since 1 − σ(t) = exp(−t)/(1 + exp(−t)).
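A quick finite-difference check of this identity:

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-5.0, 5.0, 101)
h = 1e-6
numeric = (sigma(t + h) - sigma(t - h)) / (2.0 * h)   # central difference
analytic = sigma(t) * (1.0 - sigma(t))                 # derived formula
assert np.allclose(numeric, analytic, atol=1e-8)
```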
7.7. Derive the gradient of the negative log-likelihood function associated with
the two-class logistic regression.
Solution: The negative log-likelihood is given by
\[
L(\theta) = -\sum_{n=1}^{N}\big( y_n \ln s_n + (1-y_n)\ln(1-s_n) \big), \qquad s_n := \sigma(\theta^T x_n).
\]
Using the chain rule, together with dσ/dt = σ(1 − σ) (Problem 7.6) and ∇_θ(θ^T x_n) = x_n,
\[
\nabla_\theta L(\theta) = -\sum_{n=1}^{N}\Big(\frac{y_n}{s_n} - \frac{1-y_n}{1-s_n}\Big)\, s_n(1-s_n)\, x_n
= \sum_{n=1}^{N}(s_n - y_n)\, x_n.
\]
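The gradient formula can be checked against central finite differences (random illustrative data):

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def nll(theta, X, y):
    s = sigma(X @ theta)
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))

def grad(theta, X, y):
    return X.T @ (sigma(X @ theta) - y)   # sum_n (s_n - y_n) x_n

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.standard_normal(3) * 0.5

# Compare against a central finite difference, coordinate by coordinate.
h = 1e-6
fd = np.array([
    (nll(theta + h * e, X, y) - nll(theta - h * e, X, y)) / (2 * h)
    for e in np.eye(3)
])
assert np.allclose(grad(theta, X, y), fd, atol=1e-5)
```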
7.8. Derive the Hessian matrix of the negative log-likelihood function associated with the two-class logistic regression.
Solution: From the previous problem, we have already derived the gradient of each per-sample term a_n := y_n ln s_n + (1 − y_n) ln(1 − s_n), i.e.,
\[
\nabla_\theta a_n = (y_n - s_n)\, x_n, \qquad \text{so that} \quad \nabla_\theta L(\theta) = \sum_{n=1}^{N}(s_n - y_n)\, x_n.
\]
Differentiating once more, and using ∇_θ s_n = s_n(1 − s_n) x_n,
\[
\nabla^2_\theta L(\theta) = \sum_{n=1}^{N} s_n(1-s_n)\, x_n x_n^T = X^T R X,
\]
where X is the N × l matrix with rows x_n^T and R is the N × N diagonal matrix with diagonal elements s_n(1 − s_n).
7.9. Show that the Hessian matrix of the negative log-likelihood function of
the two-class logistic regression is a positive definite matrix.
Solution: We have seen that
\[
\nabla^2_\theta L(\theta) = X^T R X,
\]
with R the diagonal matrix of elements r_n = s_n(1 − s_n), which are strictly positive since 0 < s_n < 1. Thus, for any vector x ∈ R^l, the following is true:
\[
x^T X^T R X\, x = \sum_{n=1}^{N} r_n\, (x_n^T x)^2 \ge 0,
\]
and the inequality is strict unless x_n^T x = 0 for all n. Hence, provided X has full column rank, the Hessian matrix is positive definite.
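A quick numerical confirmation that X^T R X has strictly positive eigenvalues when X has full column rank (random illustrative data):

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 4))     # full column rank with probability one
theta = rng.standard_normal(4)

s = sigma(X @ theta)
R = np.diag(s * (1.0 - s))           # diagonal weights s_n (1 - s_n) > 0
H = X.T @ R @ X                      # Hessian of the negative log-likelihood

eigvals = np.linalg.eigvalsh(H)
assert np.all(eigvals > 0)           # positive definite
```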
7.10. Show that if
\[
\phi_m = \frac{\exp(t_m)}{\sum_{j=1}^{M}\exp(t_j)},
\]
the derivative with respect to t_j, j = 1, 2, . . . , M, is given by
\[
\frac{\partial \phi_m}{\partial t_j} = \phi_m(\delta_{mj} - \phi_j).
\]
Solution: a) Let j = m. Then we have
\[
\frac{\partial \phi_m}{\partial t_m}
= \frac{\exp(t_m)\sum_{j=1}^{M}\exp(t_j) - \exp(t_m)\exp(t_m)}{\Big(\sum_{j=1}^{M}\exp(t_j)\Big)^2}
= \phi_m(1 - \phi_m).
\]
b) Let j ≠ m. Then
\[
\frac{\partial \phi_m}{\partial t_j}
= -\frac{\exp(t_m)\exp(t_j)}{\Big(\sum_{i=1}^{M}\exp(t_i)\Big)^2}
= -\phi_m\phi_j.
\]
Combining the two cases yields ∂φ_m/∂t_j = φ_m(δ_{mj} − φ_j).
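The combined formula gives the full softmax Jacobian, which can be verified by finite differences (illustrative inputs):

```python
import numpy as np

def softmax(t):
    e = np.exp(t - t.max())               # stabilized softmax
    return e / e.sum()

t = np.array([0.5, -1.0, 2.0, 0.0])
phi = softmax(t)

# Analytic Jacobian: J[m, j] = phi_m (delta_mj - phi_j)
J = np.diag(phi) - np.outer(phi, phi)

# Central finite differences, column by column
h = 1e-6
J_fd = np.empty((4, 4))
for j in range(4):
    e = np.zeros(4); e[j] = h
    J_fd[:, j] = (softmax(t + e) - softmax(t - e)) / (2 * h)

assert np.allclose(J, J_fd, atol=1e-8)
```

Note that each column of the Jacobian sums to zero, reflecting the constraint Σ_m φ_m = 1.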
7.11. Derive the gradient of the negative log-likelihood for the multiclass logistic
regression case.
Solution: Let
\[
a_n := \sum_{m=1}^{M} y_{nm} \ln \phi_{nm}, \qquad
\phi_{nm} = \frac{\exp(\theta_m^T x_n)}{\sum_{j=1}^{M}\exp(\theta_j^T x_n)},
\]
so that L(θ_1, . . . , θ_M) = −Σ_n a_n. Then, using the chain rule and the result of Problem 7.10 (with t_j = θ_j^T x_n), we have that
\[
\nabla_{\theta_j} a_n
= \sum_{m=1}^{M} y_{nm}\frac{1}{\phi_{nm}}\frac{\partial \phi_{nm}}{\partial t_j}\, x_n
= \sum_{m=1}^{M} y_{nm}(\delta_{mj} - \phi_{nj})\, x_n
= (y_{nj} - \phi_{nj})\, x_n,
\]
where Σ_m y_{nm} = 1 has been used. Hence,
\[
\nabla_{\theta_j} L(\theta_1, \ldots, \theta_M) = \sum_{n=1}^{N} (\phi_{nj} - y_{nj})\, x_n.
\]
7.12. Derive the j, k block element of the Hessian matrix of the negative log-
likelihood function for the multiclass logistic regression.
Solution: From the previous problem, we have already obtained that
\[
\nabla_{\theta_j} L(\theta_1, \ldots, \theta_M) = \sum_{n=1}^{N} (\phi_{nj} - y_{nj})\, x_n.
\]
Differentiating with respect to θ_k, and using ∂φ_{nj}/∂t_k = φ_{nj}(δ_{jk} − φ_{nk}) once more,
\[
\nabla_{\theta_k}\nabla_{\theta_j} L = \sum_{n=1}^{N} \phi_{nj}(\delta_{jk} - \phi_{nk})\, x_n x_n^T,
\]
which is the j, k block element of the Hessian matrix.
7.13. Consider the so-called Rayleigh ratio,
\[
R = \frac{\theta^T A \theta}{\|\theta\|^2},
\]
where A is a symmetric matrix. Show that R is maximized, with respect to θ, if θ is the eigenvector corresponding to the maximum eigenvalue of A.
Solution: Since R is invariant to scalings of θ, the maximization task is equivalent to
\[
\text{maximize w.r.t. } \theta: \ \theta^T A \theta, \qquad \text{s.t. } \|\theta\|^2 = 1.
\]
Introducing a Lagrange multiplier λ for the constraint and equating the gradient of the Lagrangian, θ^T A θ − λ(θ^T θ − 1), to zero gives
\[
A\theta = \lambda\theta,
\]
so every stationary point is an eigenvector of A, at which R = θ^T A θ = λ. The maximum is therefore attained at the eigenvector associated with the maximum eigenvalue of A.
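A numerical illustration with `numpy.linalg.eigh` (random symmetric matrix): the top eigenvector attains the ratio λ_max, and no random direction exceeds it:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2.0                       # random symmetric matrix

def rayleigh(A, theta):
    return (theta @ A @ theta) / (theta @ theta)

lam, V = np.linalg.eigh(A)                # eigenvalues in ascending order
v_max = V[:, -1]

# The ratio at the top eigenvector equals the largest eigenvalue ...
assert np.isclose(rayleigh(A, v_max), lam[-1])
# ... and no random direction does better.
for _ in range(1000):
    theta = rng.standard_normal(5)
    assert rayleigh(A, theta) <= lam[-1] + 1e-10
```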
7.14. Consider the generalized Rayleigh quotient,
\[
R_g = \frac{\theta^T B \theta}{\theta^T A \theta},
\]
where A and B are symmetric matrices, with A positive definite. Show that R_g is maximized with respect to θ if θ is the eigenvector which corresponds to the maximum eigenvalue of A^{-1}B, assuming that the inversion is possible.
Solution: For the case of our problem (generalized Rayleigh quotient), let
\[
y := A^{1/2}\theta \;\Rightarrow\; \theta = A^{-1/2}y.
\]
Then the problem becomes equivalent to maximizing
\[
\frac{y^T A^{-1/2} B A^{-1/2} y}{\|y\|^2}.
\]
By Problem 7.13, the maximum is attained when y is the eigenvector of the symmetric matrix A^{-1/2}BA^{-1/2} associated with its maximum eigenvalue λ, i.e., A^{-1/2}BA^{-1/2}y = λy. Substituting y = A^{1/2}θ and multiplying from the left by A^{-1/2}, we obtain
\[
A^{-1}B\theta = \lambda\theta;
\]
that is, θ is the eigenvector of A^{-1}B corresponding to its maximum eigenvalue.
7.15. Show that the between-classes scatter matrix Σ_b for an M-class problem is of rank M − 1.
Solution: We will show it for M = 2 and M = 3; the result is easily generalized. We use P(ω_i) = P_i and recall that
\[
\Sigma_b = \sum_{i=1}^{M} P_i (\mu_i - \mu_0)(\mu_i - \mu_0)^T, \qquad \mu_0 = \sum_{i=1}^{M} P_i \mu_i.
\]
For M = 2, we have P_1 + P_2 = 1. Thus,
\[
\mu_1 - \mu_0 = \mu_1 - P_1\mu_1 - P_2\mu_2 = P_2(\mu_1 - \mu_2),
\]
and similarly µ_2 − µ_0 = P_1(µ_2 − µ_1). Hence,
\[
\Sigma_b = (P_1 P_2^2 + P_2 P_1^2)(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = P_1 P_2 (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T,
\]
which is of rank one, i.e., M − 1. For M = 3, Σ_b is a sum of three rank-one terms; however, the vectors µ_i − µ_0 satisfy the single linear constraint Σ_i P_i(µ_i − µ_0) = 0, so at most two of them are linearly independent and the rank is (in general position) equal to two. The same constraint holds for any M, so Σ_b is of rank M − 1.
7.16. Derive the arithmetic rule for combination, by minimizing the average KL
divergence.
Solution: Let the output from each classifier be P_j(ω_i|x) := P_j(ω_i), j = 1, . . . , L, and let P(ω_i|x) := P(ω_i) be the output of the combiner to be estimated. Minimizing the average Kullback-Leibler divergence
\[
\frac{1}{L}\sum_{j=1}^{L}\sum_{i=1}^{M} P_j(\omega_i)\ln\frac{P_j(\omega_i)}{P(\omega_i)}
\]
and employing the obvious constraint Σ_{i=1}^M P(ω_i) = 1 via a Lagrange multiplier λ, taking the derivative with respect to P(ω_i) and equating to zero, it turns out that
\[
P(\omega_i) = -\frac{1}{\lambda L}\sum_{j=1}^{L} P_j(\omega_i). \quad (9)
\]
Substituting into the constraint equation, we obtain
\[
\sum_{i=1}^{M} P(\omega_i) = -\frac{1}{\lambda L}\sum_{j=1}^{L}\sum_{i=1}^{M} P_j(\omega_i) = -\frac{1}{\lambda} = 1,
\]
since Σ_i P_j(ω_i) = 1 for each j; hence λ = −1, and (9) becomes the arithmetic rule
\[
P(\omega_i) = \frac{1}{L}\sum_{j=1}^{L} P_j(\omega_i).
\]
7.17. Derive the product rule via the minimization of the Kullback-Leibler divergence, as pointed out in the text.
Solution: We will estimate P(ω_i) by minimizing the average Kullback-Leibler divergence taken in the opposite direction, i.e.,
\[
\frac{1}{L}\sum_{j=1}^{L}\sum_{i=1}^{M} P(\omega_i)\ln\frac{P(\omega_i)}{P_j(\omega_i)}, \quad (12)
\]
subject to Σ_{i=1}^M P(ω_i) = 1. Adjoining the constraint with a Lagrange multiplier λ, taking the derivative with respect to P(ω_i) and equating to zero gives
\[
\frac{1}{L}\sum_{j=1}^{L}\Big(\ln\frac{P(\omega_i)}{P_j(\omega_i)} + 1\Big) - \lambda = 0
\;\Rightarrow\; \ln P(\omega_i) = \lambda - 1 + \frac{1}{L}\sum_{j=1}^{L}\ln P_j(\omega_i),
\]
or
\[
P(\omega_i) = \exp\{\lambda - 1\}\prod_{j=1}^{L}\big(P_j(\omega_i)\big)^{1/L}.
\]
The multiplier is eliminated via the constraint, which finally yields the product (geometric mean) rule
\[
P(\omega_i) = \frac{\prod_{j=1}^{L}\big(P_j(\omega_i)\big)^{1/L}}{\sum_{i'=1}^{M}\prod_{j=1}^{L}\big(P_j(\omega_{i'})\big)^{1/L}}.
\]
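As a quick illustration of the two combination rules derived in Problems 7.16 and 7.17 (made-up posteriors for L = 3 classifiers over M = 2 classes):

```python
import numpy as np

# Posteriors P_j(w_i | x) from L = 3 classifiers over M = 2 classes
# (illustrative numbers; each row sums to one).
P = np.array([[0.7, 0.3],
              [0.6, 0.4],
              [0.9, 0.1]])

# Arithmetic rule: plain average across classifiers.
arith = P.mean(axis=0)

# Product rule: normalized geometric mean across classifiers.
geo = np.prod(P, axis=0) ** (1.0 / P.shape[0])
prod_rule = geo / geo.sum()

assert np.isclose(arith.sum(), 1.0)
assert np.isclose(prod_rule.sum(), 1.0)
print(arith, prod_rule)
```

The product rule penalizes classes to which any single classifier assigns low probability more heavily than the arithmetic rule does; here both rules favor class ω_1.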
7.18. Show that the error rate on the training set of the final classifier, obtained
by boosting, tends to zero exponentially fast.
Solution: The error rate on the training data set is given by
\[
P_e = \frac{1}{N}\sum_{n=1}^{N} I\big(1 - y_n f(x_n)\big)
\le \frac{1}{N}\sum_{n=1}^{N}\exp\big(-y_n F(x_n)\big),
\]
where f(x) = sign(F(x)), F(x) = Σ_{k=1}^K a_k f_k(x) is the function obtained after K boosting rounds, and I(·) equals one if its argument is positive and zero otherwise; the bound holds since exp(−y_n F(x_n)) ≥ 1 whenever y_n f(x_n) ≤ 0, and all terms are nonnegative. Unrolling the AdaBoost weight-update recursion, w_n^{(k+1)} = w_n^{(k)} exp(−y_n a_k f_k(x_n))/Z_k with Z_k the normalizing factor, starting from w_n^{(1)} = 1/N, we get
\[
w_n^{(K+1)} = \frac{\exp(-y_n F(x_n))}{N\prod_{k=1}^{K} Z_k}. \quad (20)
\]
Since the weights are normalized at every step,
\[
\sum_{n=1}^{N} w_n^{(K+1)} = \frac{\sum_{n=1}^{N}\exp(-y_n F(x_n))}{N\prod_{k=1}^{K} Z_k} = 1, \quad (21)
\]
hence
\[
P_e \le \frac{1}{N}\sum_{n=1}^{N}\exp(-y_n F(x_n)) = \prod_{k=1}^{K} Z_k.
\]
By the respective definition we have
\[
Z_k = \sum_{n=1}^{N} w_n^{(k)}\exp(-y_n a_k f_k(x_n))
= (1-\epsilon_k)e^{-a_k} + \epsilon_k e^{a_k}
= 2\sqrt{\epsilon_k(1-\epsilon_k)},
\]
where ε_k is the weighted training error of the k-th base classifier and the last equality follows for the optimizing value a_k = (1/2) ln((1 − ε_k)/ε_k). Writing ε_k = 1/2 − γ_k and using 1 − x ≤ exp(−x),
\[
Z_k = \sqrt{1 - 4\gamma_k^2} \le \exp(-2\gamma_k^2)
\;\Rightarrow\;
P_e \le \exp\Big(-2\sum_{k=1}^{K}\gamma_k^2\Big).
\]
Thus, as long as each base classifier performs slightly better than random guessing (γ_k ≥ γ > 0), the training error of the boosted classifier tends to zero exponentially fast in K.
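A small AdaBoost sketch with decision stumps on constructed toy data (all numbers illustrative) verifies the key identity (1/N) Σ_n exp(−y_n F(x_n)) = Π_k Z_k and the resulting bound on the training error:

```python
import numpy as np

# 1-D toy data: labels follow sign(x), but every 10th label is flipped,
# so no single stump is perfect and every weighted error stays positive.
x = np.linspace(-1.0, 1.0, 100)
y = np.where(x > 0, 1.0, -1.0)
y[::10] *= -1.0
N = len(x)

def stump_pred(x, t, s):
    return s * np.where(x >= t, 1.0, -1.0)

def best_stump(x, y, w):
    """Return (weighted error, threshold, sign) of the best decision stump."""
    best = None
    for t in x:
        for s in (1.0, -1.0):
            err = w[stump_pred(x, t, s) != y].sum()
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

w = np.full(N, 1.0 / N)   # initial weights w_n^(1) = 1/N
F = np.zeros(N)
Z_prod = 1.0
for k in range(10):       # K = 10 boosting rounds
    eps, t, s = best_stump(x, y, w)
    a = 0.5 * np.log((1.0 - eps) / eps)
    pred = stump_pred(x, t, s)
    F += a * pred
    u = w * np.exp(-y * a * pred)
    Z = u.sum()           # normalizer Z_k
    w = u / Z
    Z_prod *= Z

train_err = np.mean(y * F <= 0)
bound = np.mean(np.exp(-y * F))
assert np.isclose(bound, Z_prod)     # (1/N) sum_n exp(-y_n F(x_n)) = prod_k Z_k
assert train_err <= bound + 1e-12    # the exponential bound holds
```

Since each Z_k < 1 whenever the stump beats random guessing, the product, and with it the training error bound, shrinks with every round.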