Solutions To Problems of Chapter 7
7.1. Show that the Bayesian classifier is optimal, in the sense that it minimizes the probability of error.
Hint: Consider a classification task of M classes and start with the probability of correct label prediction, P(C). Then the probability of error will be P(e) = 1 − P(C).
Solution: Let P(C) be the probability of correct classification. Then
P(C) = ∑_{i=1}^{M} P(x ∈ R_i, ω_i) = ∑_{i=1}^{M} P(ω_i) P(x ∈ R_i | ω_i),
or
P(C) = ∑_{i=1}^{M} P(ω_i) ∫_{R_i} p(x | ω_i) dx = ∑_{i=1}^{M} ∫_{R_i} P(ω_i) p(x | ω_i) dx.
Since P(e) = 1 − P(C), minimizing the probability of error is equivalent to maximizing P(C). The integrand is nonnegative, so P(C) is maximized if every x is assigned to the region R_i for which P(ω_i) p(x | ω_i) is largest; this is exactly the assignment performed by the Bayesian classifier, which is therefore optimal.
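As a numerical illustration of the decision rule, the following minimal Python/NumPy sketch assigns a point to the class with the largest P(ω_i) p(x | ω_i) for two Gaussian class-conditional densities; the priors, means and covariances are only illustrative, and NumPy/SciPy are assumed to be available.

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative two-class example: priors and Gaussian class-conditional densities.
priors = np.array([0.4, 0.6])
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]

def bayes_classify(x):
    # Assign x to the class with the largest P(omega_i) * p(x | omega_i).
    scores = [priors[i] * multivariate_normal.pdf(x, means[i], covs[i])
              for i in range(len(priors))]
    return int(np.argmax(scores))

print(bayes_classify(np.array([0.5, 0.3])))   # class 0
print(bayes_classify(np.array([1.8, 2.1])))   # class 1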
7.2. Show that if the data follow the Gaussian distribution in an M-class task, with equal covariance matrices in all classes, the regions formed by the Bayesian classifier are convex.
Solution: Consider two points, x_1, x_2, lying in R_i. Then any point lying on the line segment which connects these two points can be written as
x = λx_1 + (1 − λ)x_2,  λ ∈ [0, 1].
Since the covariance matrices are equal, the quadratic term x^T Σ^{-1} x is common to all classes and the Bayesian decision can be based on the affine discriminants
g_i(x) = ln P(ω_i) + µ_i^T Σ^{-1} x − (1/2) µ_i^T Σ^{-1} µ_i.
However, since by the definition of the partition g_i(x_1) ≥ g_j(x_1) and g_i(x_2) ≥ g_j(x_2) for every j ≠ i, the affinity of the discriminants gives
g_i(x) = λ g_i(x_1) + (1 − λ) g_i(x_2) ≥ λ g_j(x_1) + (1 − λ) g_j(x_2) = g_j(x),
so x ∈ R_i as well; hence R_i is convex.
7.3. Derive the form of the Bayesian classifier for the case of two equiprobable classes, when the data follow the Gaussian distribution with the same covariance matrix. Furthermore, derive the equation that describes the LS linear classifier. Compare and comment on the results.
Solution: For the scenario of this problem, we already know that the Bayesian classifier is equivalent to the minimum Mahalanobis distance classifier, i.e., x is assigned to ω_1 if
(µ_1 − µ_2)^T Σ^{-1} (x − (µ_1 + µ_2)/2) > 0,
and to ω_2 otherwise.
Let us now turn our attention to the LS solution. By extending the involved training feature vectors by one, in order to accommodate the bias term, the LS classifier is given by the solution of the following set of linear equations:
(∑_{n=1}^{N} x̄_n x̄_n^T) θ̄ = ∑_{n=1}^{N} y_n x̄_n,  with x̄_n := [x_n^T, 1]^T, θ̄ := [θ^T, θ_0]^T,   (1)
or, splitting into the θ and θ_0 components,
(∑_{n=1}^{N} x_n x_n^T) θ + (∑_{n=1}^{N} x_n) θ_0 = ∑_{n=1}^{N} y_n x_n,
(∑_{n=1}^{N} x_n^T) θ + N θ_0 = ∑_{n=1}^{N} y_n = 0,
where the zero comes from ∑_n y_n, in which half of the terms are +1 and the rest −1. In the previous formulas, for large N we may replace the sample averages by their mean values,
(1/N) ∑_{n=1}^{N} x_n → µ := (µ_1 + µ_2)/2,  (1/N) ∑_{n=1}^{N} y_n x_n → (µ_1 − µ_2)/2,  (1/N) ∑_{n=1}^{N} x_n x_n^T → Σ_b + µµ^T,
and Σ_b is the covariance matrix for the two classes together; that is, around the overall mean value µ. Substituting the previous definitions in (1), and solving the system of equations, it is trivial to see that
θ = (1/2) Σ_b^{-1} (µ_1 − µ_2),  θ_0 = −θ^T µ.
Combining the previous findings, we obtain the LS optimal linear classifier: classify x to ω_1 if
θ^T x + θ_0 = (1/2) (µ_1 − µ_2)^T Σ_b^{-1} (x − µ) > 0.
Since Σ_b = Σ + (1/4)(µ_1 − µ_2)(µ_1 − µ_2)^T, the vector Σ_b^{-1}(µ_1 − µ_2) is collinear with Σ^{-1}(µ_1 − µ_2); hence the LS hyperplane has the same direction as the Bayesian one and also passes through the midpoint (µ_1 + µ_2)/2 of the two means. In other words, for two equiprobable Gaussian classes sharing a covariance matrix, the LS linear classifier coincides with the Bayesian classifier, up to an irrelevant scaling of the discriminant function.
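The agreement between the two classifiers can also be observed empirically. A minimal Python/NumPy sketch (the means, covariance and sample size are only illustrative) fits the LS classifier on ±1 labels and compares its direction with Σ^{-1}(µ_1 − µ_2):

import numpy as np

rng = np.random.default_rng(0)
N = 2000
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

# Equiprobable classes with a common covariance matrix.
X1 = rng.multivariate_normal(mu1, Sigma, N // 2)
X2 = rng.multivariate_normal(mu2, Sigma, N // 2)
X = np.vstack([X1, X2])
y = np.concatenate([np.ones(N // 2), -np.ones(N // 2)])

# LS classifier with an extended (bias) coordinate.
X_ext = np.hstack([X, np.ones((N, 1))])
theta_ls, *_ = np.linalg.lstsq(X_ext, y, rcond=None)

# Bayesian direction: Sigma^{-1} (mu1 - mu2).
w_bayes = np.linalg.solve(Sigma, mu1 - mu2)

# The two weight vectors should be (nearly) parallel.
w_ls = theta_ls[:2]
cosine = w_ls @ w_bayes / (np.linalg.norm(w_ls) * np.linalg.norm(w_bayes))
print("cosine similarity of directions:", cosine)   # close to 1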
7.4. Show that the ML estimate of the covariance matrix of a Gaussian distribution, based on N i.i.d. observations, x_n, n = 1, 2, . . . , N, is given by
Σ̂_ML = (1/N) ∑_{n=1}^{N} (x_n − µ̂_ML)(x_n − µ̂_ML)^T,
where
µ̂_ML = (1/N) ∑_{n=1}^{N} x_n.
Solution: We shall focus on the simplest case where Σ = σ²I. The log-likelihood function is
L(µ, σ²) = ∑_{n=1}^{N} ln p(x_n; µ, σ²) = −(Nl/2) ln(2π) − (Nl/2) ln σ² − (1/(2σ²)) ∑_{n=1}^{N} ||x_n − µ||²,
where l is the dimensionality of x. Setting the gradient with respect to µ equal to zero,
∂L/∂µ = (1/σ²) ∑_{n=1}^{N} (x_n − µ) = 0  ⟹  µ̂_ML = (1/N) ∑_{n=1}^{N} x_n,
and setting the derivative with respect to σ² equal to zero,
∂L/∂σ² = −(Nl)/(2σ²) + (1/(2σ⁴)) ∑_{n=1}^{N} ||x_n − µ̂_ML||² = 0  ⟹  σ̂²_ML = (1/(Nl)) ∑_{n=1}^{N} ||x_n − µ̂_ML||².
This is the Σ = σ²I special case of the claimed estimate; repeating the computation for a full covariance matrix yields
Σ̂_ML = (1/N) ∑_{n=1}^{N} (x_n − µ̂_ML)(x_n − µ̂_ML)^T.
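A minimal Python/NumPy sketch of the ML estimates (the distribution parameters are only illustrative); note the division by N rather than N − 1:

import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 1.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=500)
N = X.shape[0]

mu_ml = X.mean(axis=0)            # (1/N) sum_n x_n
D = X - mu_ml
Sigma_ml = (D.T @ D) / N          # (1/N) sum_n (x_n - mu)(x_n - mu)^T

# np.cov with bias=True uses the same 1/N normalization.
print(np.allclose(Sigma_ml, np.cov(X, rowvar=False, bias=True)))   # True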
7.5. Prove that the covariance estimate
Σ̂ = (1/(N − 1)) ∑_{k=1}^{N} (x_k − µ̂)(x_k − µ̂)^T
corresponds to an unbiased estimator, where
µ̂ = (1/N) ∑_{k=1}^{N} x_k.
Solution: We have that
E[Σ̂] = (1/(N − 1)) ∑_{k=1}^{N} E[((x_k − µ) − (µ̂ − µ)) ((x_k − µ) − (µ̂ − µ))^T]
      = (1/(N − 1)) ∑_{k=1}^{N} (E[(x_k − µ)(x_k − µ)^T] − E[(x_k − µ)(µ̂ − µ)^T] − E[(µ̂ − µ)(x_k − µ)^T] + E[(µ̂ − µ)(µ̂ − µ)^T]).
However,
E[(µ̂ − µ)(µ̂ − µ)^T] = E[((1/N) ∑_{i=1}^{N} (x_i − µ)) ((1/N) ∑_{j=1}^{N} (x_j − µ))^T] = (1/N²) ∑_{i=1}^{N} ∑_{j=1}^{N} E[(x_i − µ)(x_j − µ)^T] = Σ/N,
where independence among the samples has been assumed, i.e., E[(x_i − µ)(x_j − µ)^T] = δ_{ij} Σ. By the same argument, E[(x_k − µ)(µ̂ − µ)^T] = E[(µ̂ − µ)(x_k − µ)^T] = Σ/N. Hence
E[Σ̂] = (1/(N − 1)) ∑_{k=1}^{N} (Σ − Σ/N − Σ/N + Σ/N) = (N/(N − 1)) (1 − 1/N) Σ = Σ,
which proves that the estimator is unbiased.
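The unbiasedness can also be checked by simulation. In the following illustrative Python/NumPy sketch, the average of the 1/(N − 1) estimator over many independent samples approaches the true Σ, whereas the 1/N (ML) estimator is biased by the factor (N − 1)/N:

import numpy as np

rng = np.random.default_rng(2)
Sigma_true = np.array([[1.0, 0.4], [0.4, 2.0]])
N, trials = 5, 20000

acc_unbiased = np.zeros((2, 2))
acc_ml = np.zeros((2, 2))
for _ in range(trials):
    X = rng.multivariate_normal([0.0, 0.0], Sigma_true, N)
    D = X - X.mean(axis=0)
    acc_unbiased += (D.T @ D) / (N - 1)
    acc_ml += (D.T @ D) / N

print(acc_unbiased / trials)   # close to Sigma_true
print(acc_ml / trials)         # close to (N-1)/N * Sigma_true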
7.6. Show that the derivative of the logistic link function is given by
dσ(t)/dt = σ(t)(1 − σ(t)).
Solution: We have that
σ(t) = 1/(1 + exp(−t)).
Differentiating,
dσ(t)/dt = exp(−t)/(1 + exp(−t))² = [1/(1 + exp(−t))] · [exp(−t)/(1 + exp(−t))] = σ(t)(1 − σ(t)),
since 1 − σ(t) = exp(−t)/(1 + exp(−t)).
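A quick numerical check of the identity, comparing the analytic derivative with a central finite difference (illustrative Python/NumPy sketch):

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-5, 5, 11)
h = 1e-6
analytic = sigma(t) * (1.0 - sigma(t))
numeric = (sigma(t + h) - sigma(t - h)) / (2 * h)
print(np.max(np.abs(analytic - numeric)))   # on the order of 1e-10 or smaller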
7.7. Derive the gradient of the negative log-likelihood function associated with the two-class logistic regression.
Solution: The negative log-likelihood is given by
L(θ) = −∑_{n=1}^{N} [y_n ln s_n + (1 − y_n) ln(1 − s_n)] = −∑_{n=1}^{N} a_n,
where s_n := σ(θ^T x_n) and a_n := y_n ln s_n + (1 − y_n) ln(1 − s_n). Using the result of Problem 7.6, ∂s_n/∂θ = s_n(1 − s_n) x_n, so
∇_θ a_n = (y_n/s_n − (1 − y_n)/(1 − s_n)) s_n(1 − s_n) x_n = (y_n − s_n) x_n.
Hence the gradient of the negative log-likelihood is
∇_θ L(θ) = ∑_{n=1}^{N} (s_n − y_n) x_n.
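In matrix notation the gradient reads X^T(s − y). The following illustrative Python/NumPy sketch evaluates it and verifies one coordinate against a finite-difference approximation of the negative log-likelihood (the data are randomly generated and only illustrative):

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def nll(theta, X, y):
    s = sigma(X @ theta)
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))

def grad_nll(theta, X, y):
    s = sigma(X @ theta)
    return X.T @ (s - y)          # sum_n (s_n - y_n) x_n

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
theta = rng.normal(size=3)

# Finite-difference check of the first gradient coordinate.
h = 1e-6
e0 = np.zeros(3); e0[0] = h
fd = (nll(theta + e0, X, y) - nll(theta - e0, X, y)) / (2 * h)
print(grad_nll(theta, X, y)[0], fd)   # the two values should agree closely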
7.8. Derive the Hessian matrix of the negative log-likelihood function associated with the two-class logistic regression.
Solution: From the previous problem, we have already derived the gradient, i.e.,
∇_θ a_n = (y_n − s_n) x_n,  so that  ∇_θ L(θ) = ∑_{n=1}^{N} (s_n − y_n) x_n.
Differentiating once more and using ∂s_n/∂θ = s_n(1 − s_n) x_n, we obtain
∇²_θ L(θ) = ∑_{n=1}^{N} s_n(1 − s_n) x_n x_n^T = X^T R X,
where X := [x_1, . . . , x_N]^T and R := diag(s_1(1 − s_1), . . . , s_N(1 − s_N)).
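The Hessian X^T R X can be formed directly; its eigenvalues also illustrate the positive definiteness shown in the next problem. An illustrative Python/NumPy sketch:

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
theta = rng.normal(size=3)

s = sigma(X @ theta)
R = np.diag(s * (1.0 - s))
H = X.T @ R @ X                 # Hessian of the negative log-likelihood

print(np.linalg.eigvalsh(H))    # all eigenvalues > 0 when X has full column rank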
7.9. Show that the Hessian matrix of the negative log-likelihood function of the two-class logistic regression is a positive definite matrix.
Solution: We have seen that
∇²_θ L(θ) = X^T R X,  R = diag(s_1(1 − s_1), . . . , s_N(1 − s_N)).
Thus, for any nonzero vector z ∈ R^l, the following is true:
z^T X^T R X z = (Xz)^T R (Xz) = ∑_{n=1}^{N} s_n(1 − s_n) (x_n^T z)² ≥ 0,
where every term is nonnegative because 0 < s_n < 1. The sum is strictly positive whenever Xz ≠ 0, i.e., whenever X has full column rank; in that case the Hessian is positive definite (and positive semidefinite otherwise).
7.10. Show that if
φ_m = exp(t_m) / ∑_{j=1}^{M} exp(t_j),
the derivative with respect to t_j, j = 1, 2, . . . , M, is given by
∂φ_m/∂t_j = φ_m(δ_mj − φ_j).
Solution: a) Let j = m. Then we have
∂φ_m/∂t_m = [exp(t_m) ∑_{j=1}^{M} exp(t_j) − exp(t_m) exp(t_m)] / (∑_{j=1}^{M} exp(t_j))² = φ_m − φ_m² = φ_m(1 − φ_m).
b) Let j ≠ m. Then
∂φ_m/∂t_j = −exp(t_m) exp(t_j) / (∑_{i=1}^{M} exp(t_i))² = −φ_m φ_j.
Combining the two cases, ∂φ_m/∂t_j = φ_m(δ_mj − φ_j).
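The identity ∂φ_m/∂t_j = φ_m(δ_mj − φ_j) can be verified numerically against a finite-difference Jacobian (illustrative Python/NumPy sketch):

import numpy as np

def softmax(t):
    e = np.exp(t - np.max(t))       # subtract the max for numerical stability
    return e / e.sum()

t = np.array([0.5, -1.0, 2.0, 0.1])
phi = softmax(t)

# Analytic Jacobian: J[m, j] = phi_m * (delta_mj - phi_j).
J_analytic = np.diag(phi) - np.outer(phi, phi)

# Finite-difference Jacobian.
h = 1e-6
J_numeric = np.empty((4, 4))
for j in range(4):
    e_j = np.zeros(4); e_j[j] = h
    J_numeric[:, j] = (softmax(t + e_j) - softmax(t - e_j)) / (2 * h)

print(np.max(np.abs(J_analytic - J_numeric)))   # on the order of 1e-10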
7.11. Derive the gradient of the negative log-likelihood for the multiclass logistic regression case.
Solution: Let
a_n := ∑_{m=1}^{M} y_nm ln φ_nm,
where φ_nm is the softmax output of class m for the n-th sample, with t_nm = θ_m^T x_n. Then we have that
∇_{θ_j} a_n = ∑_{m=1}^{M} y_nm (1/φ_nm) (∂φ_nm/∂t_nj) x_n = ∑_{m=1}^{M} y_nm (δ_mj − φ_nj) x_n = (y_nj − φ_nj ∑_{m=1}^{M} y_nm) x_n = (y_nj − φ_nj) x_n,
where the result of Problem 7.10 and the fact that ∑_{m=1}^{M} y_nm = 1 have been used. The negative log-likelihood is L(θ_1, . . . , θ_M) = −∑_{n=1}^{N} a_n, hence
∇_{θ_j} L(θ_1, . . . , θ_M) = ∑_{n=1}^{N} (φ_nj − y_nj) x_n.
7.12. Derive the j, k block element of the Hessian matrix of the negative log-likelihood function for the multiclass logistic regression.
Solution: From the previous problem, we have already obtained that
∇_{θ_j} L(θ_1, . . . , θ_M) = ∑_{n=1}^{N} (φ_nj − y_nj) x_n.
Differentiating with respect to θ_k, and using ∂φ_nj/∂t_nk = φ_nj(δ_jk − φ_nk) together with ∂t_nk/∂θ_k = x_n, the (j, k) block of the Hessian is
∇_{θ_k} ∇_{θ_j} L(θ_1, . . . , θ_M) = ∑_{n=1}^{N} φ_nj (δ_jk − φ_nk) x_n x_n^T.
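In matrix form, the gradient with respect to θ_j is X^T(φ_{·j} − y_{·j}) and the (j, k) Hessian block is ∑_n φ_nj(δ_jk − φ_nk) x_n x_n^T. The following Python/NumPy sketch evaluates both; the one-hot label matrix Y, the parameter matrix Theta and the sizes are only illustrative:

import numpy as np

def softmax_rows(T):
    E = np.exp(T - T.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
N, l, M = 40, 3, 4
X = rng.normal(size=(N, l))
Y = np.eye(M)[rng.integers(0, M, size=N)]   # one-hot labels
Theta = rng.normal(size=(l, M))             # one parameter vector per class

Phi = softmax_rows(X @ Theta)               # Phi[n, m] = phi_nm

# Gradient blocks: column j equals sum_n (phi_nj - y_nj) x_n.
grad = X.T @ (Phi - Y)

# (j, k) Hessian block: sum_n phi_nj (delta_jk - phi_nk) x_n x_n^T.
def hessian_block(j, k):
    w = Phi[:, j] * ((1.0 if j == k else 0.0) - Phi[:, k])
    return X.T @ (w[:, None] * X)

print(grad.shape, hessian_block(0, 0).shape, hessian_block(0, 1).shape)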
7.13. Consider the so-called Rayleigh ratio,
R = θ^T A θ / ||θ||²,
where A is a symmetric matrix. Show that R is maximized, with respect to θ, if θ is the eigenvector corresponding to the maximum eigenvalue of A.
Solution: Since R is invariant to a scaling of θ, the maximization task is equivalent to
maximize with respect to θ: θ^T A θ,  subject to ||θ||² = 1.
Introducing a Lagrange multiplier λ for the constraint, the Lagrangian is θ^T A θ − λ(θ^T θ − 1), and setting its gradient to zero gives
Aθ = λθ,
i.e., every stationary point is an eigenvector of A; at such a point R = θ^T A θ = λ. Hence the maximum of R equals the largest eigenvalue of A and is attained at the corresponding eigenvector.
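Numerically, the maximizer is obtained from an eigendecomposition (illustrative Python/NumPy sketch with a randomly generated symmetric A):

import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(4, 4))
A = (A + A.T) / 2                      # make A symmetric

eigvals, eigvecs = np.linalg.eigh(A)
theta_opt = eigvecs[:, -1]             # eigenvector of the largest eigenvalue

def rayleigh(theta):
    return (theta @ A @ theta) / (theta @ theta)

print(rayleigh(theta_opt), eigvals[-1])              # equal
print(rayleigh(rng.normal(size=4)) <= eigvals[-1])   # True for any theta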
7.14. Consider the generalized Rayleigh quotient,
R_g = θ^T B θ / θ^T A θ,
where A and B are symmetric matrices. Show that R_g is maximized with respect to θ, if θ is the eigenvector which corresponds to the maximum eigenvalue of A^{-1}B, assuming that the inversion is possible.
Solution: For the case of our problem (generalized Rayleigh quotient), assume that A is positive definite, so that A^{1/2} and its inverse exist, and let
y := A^{1/2} θ  ⟺  θ = A^{-1/2} y.
Then the problem becomes equivalent to maximizing
(y^T A^{-1/2} B A^{-1/2} y) / ||y||²,
which, by Problem 7.13, is achieved when y is the eigenvector of the symmetric matrix A^{-1/2} B A^{-1/2} associated with its maximum eigenvalue λ_max. If A^{-1/2} B A^{-1/2} y = λ_max y, then multiplying from the left by A^{-1/2} gives A^{-1}B θ = λ_max θ with θ = A^{-1/2} y; that is, the maximizing θ is the eigenvector of A^{-1}B which corresponds to its maximum eigenvalue, and the maximum value of R_g equals λ_max.
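An illustrative Python/NumPy sketch, with B symmetric and A chosen symmetric positive definite so that the inversion is possible:

import numpy as np

rng = np.random.default_rng(7)
B = rng.normal(size=(4, 4)); B = (B + B.T) / 2
M = rng.normal(size=(4, 4)); A = M @ M.T + 4 * np.eye(4)   # symmetric positive definite

eigvals, eigvecs = np.linalg.eig(np.linalg.solve(A, B))    # eigen-pairs of A^{-1} B
idx = np.argmax(eigvals.real)
theta_opt = eigvecs[:, idx].real

def generalized_rayleigh(theta):
    return (theta @ B @ theta) / (theta @ A @ theta)

print(generalized_rayleigh(theta_opt), eigvals[idx].real)                     # equal
print(generalized_rayleigh(rng.normal(size=4)) <= eigvals[idx].real + 1e-12)  # True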
7.15. Show that the between-classes scatter matrix Σ_b for an M-class problem is of rank M − 1.
Solution: We will show it for M = 2 and M = 3; the result is easily generalized. We use P(ω_i) = P_i and denote by µ_0 := ∑_i P_i µ_i the overall mean. For M = 2, we have P_1 + P_2 = 1. Thus,
µ_1 − µ_0 = µ_1 − P_1 µ_1 − P_2 µ_2 = P_2(µ_1 − µ_2),
and similarly µ_2 − µ_0 = −P_1(µ_1 − µ_2). Hence
Σ_b = ∑_{i=1}^{2} P_i (µ_i − µ_0)(µ_i − µ_0)^T = P_1 P_2 (µ_1 − µ_2)(µ_1 − µ_2)^T,
which is the outer product of a single nonzero vector (provided µ_1 ≠ µ_2) and therefore has rank 1 = M − 1. For M = 3, the vectors µ_i − µ_0, i = 1, 2, 3, satisfy the single linear constraint ∑_{i=1}^{3} P_i(µ_i − µ_0) = 0, so they span a subspace of dimension at most 2; consequently Σ_b, being a sum of outer products of these vectors, has rank at most 2 = M − 1 (with equality when the class means are in general position). The same argument applies for any M.
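The rank deficiency is easily checked numerically. An illustrative Python/NumPy sketch builds Σ_b from M randomly chosen class means (in general position) and verifies that its rank equals M − 1:

import numpy as np

rng = np.random.default_rng(8)
M, l = 4, 6
P = np.full(M, 1.0 / M)                  # class priors
mu = rng.normal(size=(M, l))             # class means (general position)
mu0 = P @ mu                             # overall mean

Sigma_b = sum(P[i] * np.outer(mu[i] - mu0, mu[i] - mu0) for i in range(M))
print(np.linalg.matrix_rank(Sigma_b))    # M - 1 = 3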
7.16. Derive the arithmetic rule for combination, by minimizing the average KL divergence.
Solution: Let the output from each classifier be P_j(ω_i | x) := P_j(ω_i), j = 1, . . . , L, and let P(ω_i | x) := P(ω_i) be the output of the combiner to be estimated. Minimizing the average Kullback-Leibler divergence
(1/L) ∑_{j=1}^{L} ∑_{i=1}^{M} P_j(ω_i) ln (P_j(ω_i)/P(ω_i)),
and employing the obvious constraint ∑_{i=1}^{M} P(ω_i) = 1 via a Lagrange multiplier λ, we form the Lagrangian
(1/L) ∑_{j=1}^{L} ∑_{i=1}^{M} P_j(ω_i) ln (P_j(ω_i)/P(ω_i)) + λ(∑_{i=1}^{M} P(ω_i) − 1).
Taking the derivative with respect to P(ω_i) and equating to zero, it turns out that
P(ω_i) = (1/(λL)) ∑_{j=1}^{L} P_j(ω_i).   (9)
Substituting into the constraint equation, we obtain
∑_{i=1}^{M} (1/(λL)) ∑_{j=1}^{L} P_j(ω_i) = (1/(λL)) ∑_{j=1}^{L} ∑_{i=1}^{M} P_j(ω_i) = 1/λ = 1,
hence λ = 1 and
P(ω_i) = (1/L) ∑_{j=1}^{L} P_j(ω_i),
which is the arithmetic averaging rule.
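The resulting combiner is simply the average of the individual posteriors. A minimal Python/NumPy sketch, where each row of P_mat holds the posterior vector P_j(ω_i) of one of the L classifiers (the numbers are only illustrative):

import numpy as np

# Posteriors of L = 3 classifiers over M = 3 classes (each row sums to one).
P_mat = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.5, 0.2, 0.3]])

P_arithmetic = P_mat.mean(axis=0)           # P(omega_i) = (1/L) sum_j P_j(omega_i)
print(P_arithmetic, P_arithmetic.sum())     # still a valid probability vector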
7.17. Derive the product rule via the minimization of the Kullback-Leibler divergence, as pointed out in the text.
Solution: We will estimate P(ω_i) by minimizing the Kullback-Leibler average divergence, i.e.,
(1/L) ∑_{j=1}^{L} ∑_{i=1}^{M} P(ω_i) ln (P(ω_i)/P_j(ω_i)),   (12)
subject to ∑_{i=1}^{M} P(ω_i) = 1. Adding the constraint via a Lagrange multiplier λ and setting the derivative with respect to P(ω_i) equal to zero gives
(1/L) ∑_{j=1}^{L} [ln (P(ω_i)/P_j(ω_i)) + 1] + λ = 0,
or
ln P(ω_i) = (1/L) ∑_{j=1}^{L} ln P_j(ω_i) − λ − 1,
so that
P(ω_i) = exp{−λ − 1} ∏_{j=1}^{L} (P_j(ω_i))^{1/L}.
Substituting into the constraint equation determines exp{−λ − 1}, and we finally obtain
P(ω_i) = ∏_{j=1}^{L} (P_j(ω_i))^{1/L} / ∑_{k=1}^{M} ∏_{j=1}^{L} (P_j(ω_k))^{1/L},
that is, the product (normalized geometric mean) rule.
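The product rule amounts to a normalized geometric mean of the individual posteriors. A minimal Python/NumPy sketch, with the same illustrative P_mat layout as in the previous sketch:

import numpy as np

P_mat = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.5, 0.2, 0.3]])
L = P_mat.shape[0]

geo = np.prod(P_mat, axis=0) ** (1.0 / L)   # prod_j P_j(omega_i)^(1/L)
P_product = geo / geo.sum()                 # normalize so the probabilities sum to one
print(P_product)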
7.18. Show that the error rate on the training set of the final classifier, obtained by boosting, tends to zero exponentially fast.
Solution: The error rate on the training data set is given by
P_e = (1/N) ∑_{n=1}^{N} I(1 − y_n f(x_n)) ≤ (1/N) ∑_{n=1}^{N} exp(−y_n F(x_n)),
where F(x) = ∑_{k=1}^{K} a_k φ(x; θ_k) is the weighted sum of the K base classifiers, f(x) = sign(F(x)) is the final classifier, and I(·) equals one for a misclassified sample and zero otherwise; the bound holds because exp(−y_n F(x_n)) ≥ 1 whenever y_n F(x_n) ≤ 0. From the weight-update recursion of the boosting algorithm it follows that
exp(−y_n F(x_n)) = N (∏_{k=1}^{K} Z_k) w_n^{(K+1)},   (20)
where Z_k is the normalizing factor of round k and w_n^{(K+1)} are the (normalized) sample weights after the final round. Since the weights sum to one,
(1/N) ∑_{n=1}^{N} exp(−y_n F(x_n)) = (∏_{k=1}^{K} Z_k) ∑_{n=1}^{N} w_n^{(K+1)} = ∏_{k=1}^{K} Z_k.   (21)
By the respective definition we have
Z_k = ∑_{n=1}^{N} w_n^{(k)} exp(−a_k y_n φ(x_n; θ_k)) = 2√(ε_k(1 − ε_k)),
where ε_k is the weighted error of the k-th base classifier and a_k = (1/2) ln((1 − ε_k)/ε_k). If every base classifier is better than random guessing by a margin γ > 0, i.e., ε_k ≤ 1/2 − γ, then
Z_k = 2√(ε_k(1 − ε_k)) ≤ √(1 − 4γ²) ≤ exp(−2γ²),
and therefore
P_e ≤ ∏_{k=1}^{K} Z_k ≤ exp(−2γ²K),
i.e., the training error tends to zero exponentially fast as the number of boosting rounds K increases.
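The exponential decay of the bound ∏_k Z_k can be illustrated numerically: if every base classifier satisfies ε_k ≤ 1/2 − γ, each factor 2√(ε_k(1 − ε_k)) is at most exp(−2γ²). An illustrative Python/NumPy sketch (γ, K and the errors are arbitrary choices):

import numpy as np

rng = np.random.default_rng(9)
gamma = 0.05
K = 200

# Weighted errors of the base classifiers, all at least gamma better than chance.
eps = 0.5 - gamma - 0.04 * rng.random(K)

Z = 2.0 * np.sqrt(eps * (1.0 - eps))   # per-round factors
bound = np.cumprod(Z)                  # bound on the training error after k rounds

print(bound[[9, 49, 199]])             # decays at least like exp(-2 * gamma**2 * k)
print(np.exp(-2 * gamma**2 * np.array([10, 50, 200])))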