or
$$\|x - T_2T_1(x)\|^2 \le \frac{2\mu_1}{2-\mu_1}\Big(\|x-y\|^2 - \|T_1(x)-y\|^2\Big) + \frac{2\mu_2}{2-\mu_2}\Big(\|T_1(x)-y\|^2 - \|T_2T_1(x)-y\|^2\Big).$$
Let
23. Show the fundamental POCS theorem for the case of closed subspaces in
a Hilbert space, H.
Solution: Fact 1: The relaxed projection operator is self-adjoint, i.e.,
$$\langle x, T_{C_i}(y)\rangle = \langle T_{C_i}(x), y\rangle, \quad \forall x, y \in H.$$
$$\langle x, T(y)\rangle = \langle x, T_{C_K}\cdots T_{C_1}(y)\rangle = \langle T_{C_K}(x), T_{C_{K-1}}\cdots T_{C_1}(y)\rangle = \cdots = \langle T_{C_1}\cdots T_{C_K}(x), y\rangle = \langle T^*(x), y\rangle,$$
Hence,
$$(I - T^*)(x) = x - T^*(x) = 0,$$
or
$$T^*(x) = x,$$
and since $T^*$ and $T$ have the same fixed-point set (the proof is trivial),
$$S^\perp \subseteq C.$$
We are now ready to establish strong convergence. The repeated application of $T$ on any $x \in H$ leads to $T^n(x) = (T_{C_K}\cdots T_{C_1})^n(x)$. We know that $\forall x \in H$ there is a unique decomposition with respect to a pair of orthogonally complementary (closed) subspaces, i.e.,
$$x = y + z, \quad y \in C, \ z \in C^\perp, \quad \forall x \in H.$$
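As a numerical illustration of the theorem (an addition to the solution, not part of the original text), the following Python sketch repeatedly applies the composition of projections onto two closed subspaces of $\mathbb{R}^3$, chosen here as the illustrative hyperplanes $C_1 = \{z : z_1 = 0\}$ and $C_2 = \{z : z_1 + z_2 = 0\}$; the iterates converge to the projection of the starting point onto $C_1 \cap C_2$.

import numpy as np

def project(x, a):
    # Projection onto the closed subspace {z : a^T z = 0}.
    return x - (a @ x) / (a @ a) * a

a1 = np.array([1.0, 0.0, 0.0])   # C1 = {z : z[0] = 0}
a2 = np.array([1.0, 1.0, 0.0])   # C2 = {z : z[0] + z[1] = 0}

x = np.array([3.0, -1.0, 2.0])
for _ in range(50):              # repeated application of T = P_C2 P_C1
    x = project(project(x, a1), a2)

# The limit is the projection of the starting point onto C1 ∩ C2
# (here the z-axis), i.e., (0, 0, 2).
print(np.round(x, 6))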
24. Derive the subdifferential of the metric distance function $d_C(x)$, where $C \subseteq \mathbb{R}^l$ is a closed convex set and $x \in \mathbb{R}^l$.
Solution: By definition we have
$$\partial d_C(x) = \big\{g : g^T(y - x) + d_C(x) \le d_C(y), \ \forall y \in \mathbb{R}^l\big\}.$$
Thus, let $g$ be a subgradient. Since $d_C$ is $1$-Lipschitz,
$$g^T(y - x) \le d_C(y) - d_C(x) \le \|y - x\|, \quad \forall y \in \mathbb{R}^l.$$
Since this is true $\forall y$, let
$$y : y - x = g \ \Rightarrow \ \|g\|^2 \le \|g\| \ \Rightarrow \ \|g\| \le 1,$$
or
$$g \in B[0, 1].$$
a) Let $x \notin C$ and $g$ any subgradient. For any $y \in \mathbb{R}^l$,
$$g^T\big(y + P_C(x) - x\big) \le d_C\big(y + P_C(x)\big) - d_C(x).$$
Setting $y = 0$, and since $d_C(P_C(x)) = 0$ and $d_C(x) = \|x - P_C(x)\|$,
$$-g^T\big(x - P_C(x)\big) \le -\|x - P_C(x)\|,$$
or
$$g^T\big(x - P_C(x)\big) \ge \|x - P_C(x)\|.$$
However,
$$\|g\| \le 1,$$
and recalling the Cauchy-Schwarz inequality, we obtain
$$\|x - P_C(x)\| \le g^T\big(x - P_C(x)\big) \le \|g\|\,\|x - P_C(x)\| \le \|x - P_C(x)\|,$$
so that equality holds throughout and
$$g = \frac{x - P_C(x)}{\|x - P_C(x)\|}.$$
b) Let $x \in C$, so that $d_C(x) = 0$. Then, by the previous arguments, for any $y \in C$,
$$g^T(y - x) \le 0, \quad \|g\| \le 1. \tag{20}$$
If in addition $x$ is an interior point, there will be $\varepsilon > 0$ such that $\forall z \in \mathbb{R}^l$,
$$g^T\big(x - \varepsilon(z - x) - x\big) \le 0,$$
since $x - \varepsilon(z - x) \in C$ and condition (20) has been used. Since $z$ runs over all of $\mathbb{R}^l$, this forces $g^T(z - x) = 0$, $\forall z$; thus,
$$g = 0.$$
This completes the proof.
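As a numerical check (again an addition to the solution), the sketch below takes the illustrative choice of $C$ as the closed unit ball in $\mathbb{R}^2$, for which $P_C$ has a closed form, and verifies both the subgradient inequality and that $\|g\| = 1$ for $x \notin C$:

import numpy as np

def P_C(x):                       # projection onto the closed unit ball
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def d_C(x):
    return np.linalg.norm(x - P_C(x))

x = np.array([3.0, 4.0])          # a point outside C
g = (x - P_C(x)) / d_C(x)         # the subgradient derived above

rng = np.random.default_rng(0)
ok = all(g @ (y - x) + d_C(x) <= d_C(y) + 1e-12
         for y in rng.normal(size=(1000, 2)) * 5.0)
print(ok, np.linalg.norm(g))      # True 1.0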
25. Derive the bound in (8.55).
Solution: Subtracting $\theta_*$ from both sides of the recursion, squaring, and taking into account the definition of the subgradient, it is readily shown that
$$\|\theta^{(i)} - \theta_*\|^2 \le \|\theta^{(i-1)} - \theta_*\|^2 - 2\mu_i\big(J(\theta^{(i-1)}) - J(\theta_*)\big) + \mu_i^2\|g_i\|^2.$$
Applying the previous inequality recursively for $k = 1, \ldots, i$, and rearranging terms,
$$\min_{k=1,\ldots,i} J(\theta^{(k-1)}) - J(\theta_*) \le \frac{\|\theta^{(0)} - \theta_*\|^2 + \sum_{k=1}^{i}\mu_k^2\|g_k\|^2}{2\sum_{k=1}^{i}\mu_k},$$
which is the bound in (8.55).
26. Show that if a function is γ-Lipschitz, then any of its subgradients is
bounded.
Solution: By the definition of the subgradient we have that, $\forall u, v$,
$$g^T(u - v) \le f(u) - f(v) \le \gamma\|u - v\|, \quad g \in \partial f(v),$$
where the second inequality is the Lipschitz property. Setting $u = v + g$ results in $\|g\|^2 \le \gamma\|g\|$, or $\|g\| \le \gamma$.
27. Show the convergence of the generic projected subgradient algorithm in
(8.61).
Solution: Let us break the iteration into two steps,
$$z^{(i)} = \theta^{(i-1)} - \mu_i J'(\theta^{(i-1)}), \tag{22}$$
$$\theta^{(i)} = P_C(z^{(i)}). \tag{23}$$
Since $\theta_* \in C$, the nonexpansiveness of the projection gives $\|\theta^{(i)} - \theta_*\| = \|P_C(z^{(i)}) - P_C(\theta_*)\| \le \|z^{(i)} - \theta_*\|$, so the convergence analysis of the unprojected subgradient step carries over.
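To make the two-step structure concrete, here is a minimal Python sketch of (22)-(23) under illustrative assumptions not taken from the text: $J(\theta) = \|\theta\|_1$, $C$ the Euclidean ball of radius $r$, and a diminishing step-size $\mu_i = 1/\sqrt{i}$:

import numpy as np

def subgrad(theta):               # a subgradient of J(θ) = ||θ||_1
    return np.sign(theta)

def P_C(theta, r=1.0):            # projection onto C = {θ : ||θ|| <= r}
    n = np.linalg.norm(theta)
    return theta if n <= r else theta * (r / n)

theta = np.array([4.0, -3.0])
for i in range(1, 2001):
    mu_i = 1.0 / np.sqrt(i)
    z = theta - mu_i * subgrad(theta)   # step (22)
    theta = P_C(z)                      # step (23)
print(theta)                            # oscillates ever closer to θ* = 0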
28. Derive equation (8.100).
Solution: By the definition
$$J_n(\theta) = \frac{1}{n}\sum_{k=1}^{n} L(y_k, x_k, \theta).$$
Splitting off the last term of the sum,
$$J_n(\theta) = \frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{k=1}^{n-1} L(y_k, x_k, \theta) + \frac{1}{n}L(y_n, x_n, \theta) = \frac{n-1}{n}J_{n-1}(\theta) + \frac{1}{n}L(y_n, x_n, \theta),$$
which yields the recursion in (8.100).
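The recursion is readily checked numerically; the stand-in loss values below are arbitrary:

import numpy as np

rng = np.random.default_rng(1)
losses = rng.random(100)              # stand-ins for L(y_k, x_k, θ)

J = 0.0
for n, L_n in enumerate(losses, start=1):
    J = (n - 1) / n * J + L_n / n     # recursive update of J_n
print(np.isclose(J, losses.mean()))   # True: equals (1/n) Σ_k L_k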
29. Consider the online version of PDMb in (8.64), i.e.,
$$\theta_n = \begin{cases} P_C\Big(\theta_{n-1} - \mu_n\dfrac{J(\theta_{n-1})}{\|J'(\theta_{n-1})\|^2}J'(\theta_{n-1})\Big), & \text{if } J'(\theta_{n-1}) \ne 0,\\[4pt] P_C(\theta_{n-1}), & \text{if } J'(\theta_{n-1}) = 0,\end{cases} \tag{26}$$
where we have assumed that $J_* = 0$. If this is not the case, a shift can accommodate the difference. Thus we assume that we know the minimum. For example, this is the case for a number of tasks, such as the hinge loss function, assuming linearly separable classes, or the linear $\epsilon$-insensitive loss function, for bounded noise. Assume that
$$\mathcal{L}_n(\theta) = \sum_{k=n-q+1}^{n}\frac{\omega_k d_{C_k}(\theta_{n-1})}{\sum_{j=n-q+1}^{n}\omega_j d_{C_j}(\theta_{n-1})}\,d_{C_k}(\theta).$$
Then derive the APSM algorithm of (8.39).
Solution: Let the loss function be
$$\mathcal{L}_n(\theta) = \sum_{k=n-q+1}^{n}\frac{\omega_k d_{C_k}(\theta_{n-1})}{\sum_{j=n-q+1}^{n}\omega_j d_{C_j}(\theta_{n-1})}\,d_{C_k}(\theta) = \sum_{k=n-q+1}^{n}\beta_k d_{C_k}(\theta).$$
Recalling (from Problem 24) that, for $\theta_{n-1} \notin C_k$,
$$d_{C_k}'(\theta_{n-1}) = \frac{\theta_{n-1} - P_{C_k}(\theta_{n-1})}{d_{C_k}(\theta_{n-1})},$$
or
$$\mathcal{L}_n'(\theta_{n-1}) = \frac{1}{L}\sum_{k=n-q+1}^{n}\omega_k\big(\theta_{n-1} - P_{C_k}(\theta_{n-1})\big), \quad \text{with } L = \sum_{k=n-q+1}^{n}\omega_k d_{C_k}(\theta_{n-1}).$$
Moreover,
$$\mathcal{L}_n(\theta_{n-1}) = \frac{1}{L}\sum_{k=n-q+1}^{n}\omega_k\|\theta_{n-1} - P_{C_k}(\theta_{n-1})\|^2.$$
Substituting into (26), the factors involving $L$ cancel and we obtain
$$\theta_n = P_C\Big(\theta_{n-1} + \mu_n' M_n\sum_{k=n-q+1}^{n}\omega_k\big(P_{C_k}(\theta_{n-1}) - \theta_{n-1}\big)\Big), \quad \mu_n' := \mu_n,$$
with the extrapolation parameter
$$M_n = \frac{\sum_{k=n-q+1}^{n}\omega_k\|P_{C_k}(\theta_{n-1}) - \theta_{n-1}\|^2}{\big\|\sum_{k=n-q+1}^{n}\omega_k\big(P_{C_k}(\theta_{n-1}) - \theta_{n-1}\big)\big\|^2},$$
which is the APSM recursion of (8.39).
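A minimal Python sketch of the resulting recursion, under illustrative assumptions not taken from the text: the property sets are the hyperplanes $C_k = \{\theta : x_k^T\theta = y_k\}$ built from noiseless linear measurements, uniform weights $\omega_k = 1/q$, $\mu_n' = 1$, and $C = \mathbb{R}^l$ (so the outer projection is the identity):

import numpy as np

def P_hyp(theta, x, y):           # projection onto C_k = {θ : x^T θ = y}
    return theta + (y - x @ theta) / (x @ x) * x

rng = np.random.default_rng(2)
theta_true = rng.normal(size=5)
X = rng.normal(size=(200, 5))
y = X @ theta_true                # noiseless linear measurements

q = 8                             # number of most recent property sets
w = np.full(q, 1.0 / q)           # uniform weights ω_k
theta = np.zeros(5)
for n in range(q - 1, len(y)):
    projs = np.array([P_hyp(theta, X[k], y[k])
                      for k in range(n - q + 1, n + 1)])
    diff = (w[:, None] * (projs - theta)).sum(axis=0)
    den = diff @ diff
    num = (w * ((projs - theta) ** 2).sum(axis=1)).sum()
    M_n = num / den if den > 0 else 1.0     # extrapolation parameter
    theta = theta + M_n * diff              # APSM update, μ'_n = 1
print(np.linalg.norm(theta - theta_true))   # small residual error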
30. Derive the regret bound for the subgradient algorithm in (8.83).
Solution: From the text, we have that
$$\mathcal{L}_n(\theta_{n-1}) - \mathcal{L}_n(h) \le g_n^T(\theta_{n-1} - h) \le \frac{1}{2\mu_n}\big(\|\theta_{n-1} - h\|^2 - \|\theta_n - h\|^2\big) + \frac{\mu_n}{2}\|g_n\|^2.$$
Summing up both sides results in
$$\sum_{n=1}^{N}\mathcal{L}_n(\theta_{n-1}) - \sum_{n=1}^{N}\mathcal{L}_n(h) \le \sum_{n=1}^{N}\frac{1}{2\mu_n}\big(\|\theta_{n-1} - h\|^2 - \|\theta_n - h\|^2\big) + \sum_{n=1}^{N}\frac{\mu_n}{2}\|g_n\|^2.$$
Writing the first sum on the right-hand side term by term,
$$\frac{1}{2\mu_1}\|\theta_0 - h\|^2 - \frac{1}{2\mu_1}\|\theta_1 - h\|^2 +$$
$$\frac{1}{2\mu_2}\|\theta_1 - h\|^2 - \frac{1}{2\mu_2}\|\theta_2 - h\|^2 +$$
$$\cdots$$
$$\frac{1}{2\mu_N}\|\theta_{N-1} - h\|^2 - \frac{1}{2\mu_N}\|\theta_N - h\|^2,$$
and grouping adjacent terms, we obtain
$$\sum_{n=1}^{N}\mathcal{L}_n(\theta_{n-1}) - \sum_{n=1}^{N}\mathcal{L}_n(h) \le \frac{1}{2\mu_1}\|\theta_0 - h\|^2 + \sum_{n=2}^{N}\Big(\frac{1}{2\mu_n} - \frac{1}{2\mu_{n-1}}\Big)\|\theta_{n-1} - h\|^2 + \frac{G^2}{2}\sum_{n=1}^{N}\mu_n,$$
where $\|g_n\| \le G$. Taking into account the bound $\|\theta_n - h\|^2 \le F^2$, the first two terms add up to at most $\frac{F^2}{2\mu_N}$, and selecting the step-size $\mu_n = \frac{F}{G\sqrt{n}}$, the right-hand side becomes of order $FG\sqrt{N}$, which is the regret bound in (8.83).
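The $\sqrt{N}$ growth can be observed empirically; the sketch below runs the online subgradient method on the illustrative losses $\mathcal{L}_n(\theta) = |\theta - a_n|$ with $\mu_n = 1/\sqrt{n}$ and reports the regret normalized by $\sqrt{N}$:

import numpy as np

rng = np.random.default_rng(3)
N = 10000
a = rng.normal(size=N)
h = np.median(a)                  # best fixed θ in hindsight for Σ |θ - a_n|

theta, regret = 0.0, 0.0
for n in range(1, N + 1):
    regret += abs(theta - a[n - 1]) - abs(h - a[n - 1])
    g = np.sign(theta - a[n - 1])        # subgradient of L_n at θ_{n-1}
    theta -= g / np.sqrt(n)              # μ_n = 1/sqrt(n)
print(regret / np.sqrt(N))        # stays bounded as N grows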
31. Show that a function $f(x)$ is $\sigma$-strongly convex if and only if the function $f(x) - \frac{\sigma}{2}\|x\|^2$ is convex.
Solution:
a) Assume that
$$f(x) - \frac{\sigma}{2}\|x\|^2$$
is convex. Then, by the definition of the subgradient at $x$, for $\tilde{g} \in \partial\big(f - \frac{\sigma}{2}\|\cdot\|^2\big)(x)$ and any $y$,
$$f(y) - \frac{\sigma}{2}\|y\|^2 \ge f(x) - \frac{\sigma}{2}\|x\|^2 + \tilde{g}^T(y - x).$$
Setting $g = \tilde{g} + \sigma x$ and using the identity $\frac{\sigma}{2}\|y\|^2 - \frac{\sigma}{2}\|x\|^2 - \sigma x^T(y - x) = \frac{\sigma}{2}\|y - x\|^2$, we get
$$f(y) \ge f(x) + g^T(y - x) + \frac{\sigma}{2}\|y - x\|^2, \tag{32}$$
from which the strong convexity of $f(x)$ is deduced.
b) Assume that $f(x)$ is strongly convex. Then by its definition we have, for some $g$,
$$f(y) \ge f(x) + g^T(y - x) + \frac{\sigma}{2}\|y - x\|^2,$$
and running the same identity backwards with $\tilde{g} = g - \sigma x$,
$$f(y) - \frac{\sigma}{2}\|y\|^2 \ge f(x) - \frac{\sigma}{2}\|x\|^2 + \tilde{g}^T(y - x),$$
i.e., $f(x) - \frac{\sigma}{2}\|x\|^2$ is convex.
32. Show that if the loss function is $\sigma$-strongly convex, then if $\mu_n = \frac{1}{\sigma n}$, the regret bound for the subgradient algorithm becomes
$$\frac{1}{N}\sum_{n=1}^{N}\mathcal{L}_n(\theta_{n-1}) \le \frac{1}{N}\sum_{n=1}^{N}\mathcal{L}_n(\theta_*) + \frac{G^2(1 + \ln N)}{2\sigma N}. \tag{35}$$
Solution: Taking into account the strong convexity we have that,
$$\mathcal{L}_n(\theta_{n-1}) - \mathcal{L}_n(\theta_*) \le g_n^T(\theta_{n-1} - \theta_*) - \frac{\sigma}{2}\|\theta_{n-1} - \theta_*\|^2, \tag{36}$$
and following similar arguments as for Problem 30, with $\mu_n = \frac{1}{\sigma n}$, we get
$$\sum_{n=1}^{N}\big(\mathcal{L}_n(\theta_{n-1}) - \mathcal{L}_n(\theta_*)\big) \le \sum_{n=1}^{N}\Big[\frac{\sigma n}{2}\big(\|\theta_{n-1} - \theta_*\|^2 - \|\theta_n - \theta_*\|^2\big) - \frac{\sigma}{2}\|\theta_{n-1} - \theta_*\|^2\Big] + \frac{G^2}{2}\sum_{n=1}^{N}\frac{1}{\sigma n}.$$
Writing the first sum term by term,
$$\frac{\sigma}{2}\big(\|\theta_0 - \theta_*\|^2 - \|\theta_1 - \theta_*\|^2\big) - \frac{\sigma}{2}\|\theta_0 - \theta_*\|^2 +$$
$$\sigma\big(\|\theta_1 - \theta_*\|^2 - \|\theta_2 - \theta_*\|^2\big) - \frac{\sigma}{2}\|\theta_1 - \theta_*\|^2 +$$
$$\cdots$$
$$\frac{N\sigma}{2}\big(\|\theta_{N-1} - \theta_*\|^2 - \|\theta_N - \theta_*\|^2\big) - \frac{\sigma}{2}\|\theta_{N-1} - \theta_*\|^2,$$
the coefficient of each $\|\theta_{n-1} - \theta_*\|^2$ equals $\frac{\sigma n}{2} - \frac{\sigma}{2} - \frac{\sigma(n-1)}{2} = 0$, and the only surviving term, $-\frac{\sigma N}{2}\|\theta_N - \theta_*\|^2$, is non-positive. Hence,
$$\sum_{n=1}^{N}\big(\mathcal{L}_n(\theta_{n-1}) - \mathcal{L}_n(\theta_*)\big) \le \frac{G^2}{2\sigma}\sum_{n=1}^{N}\frac{1}{n} \le \frac{G^2(1 + \ln N)}{2\sigma},$$
and dividing by $N$ yields (35).
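The logarithmic growth can also be observed empirically; the sketch below uses the illustrative $\sigma$-strongly convex losses $\mathcal{L}_n(\theta) = \frac{\sigma}{2}(\theta - a_n)^2$ with $\mu_n = 1/(\sigma n)$:

import numpy as np

rng = np.random.default_rng(4)
sigma, N = 1.0, 10000
a = rng.normal(size=N)
h = a.mean()                      # minimizer of Σ L_n in hindsight

theta, regret = 0.0, 0.0
for n in range(1, N + 1):
    regret += 0.5 * sigma * ((theta - a[n - 1]) ** 2 - (h - a[n - 1]) ** 2)
    g = sigma * (theta - a[n - 1])       # gradient of L_n at θ_{n-1}
    theta -= g / (sigma * n)             # μ_n = 1/(σ n)
print(regret / np.log(N))         # stays bounded as N grows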
33. Consider a batch algorithm that computes the minimum of the empirical loss function, $\theta_*(N)$, having a quadratic convergence rate, i.e.,
$$\ln\ln\frac{1}{\|\theta^{(i)} - \theta_*(N)\|^2} \sim i.$$
Show that an online algorithm, running for $n$ time instants so as to spend the same computational processing resources as the batch one, achieves for large values of $N$ better performance than the batch algorithm, i.e., [12],
$$\|\theta_n - \theta_*\|^2 \sim \frac{1}{N\ln\ln N} \ll \frac{1}{N} \sim \|\theta_*(N) - \theta_*\|^2.$$
Hint: Use the fact that
$$\|\theta_n - \theta_*\|^2 \sim \frac{1}{n}, \quad \text{and} \quad \|\theta_*(N) - \theta_*\|^2 \sim \frac{1}{N}.$$
Solution: Let $K$ be the number of operations per iteration for the online algorithm. This amounts to a total of $Kn$ operations. The batch algorithm, in order to make sense, should perform $O(\ln\ln N)$ iterations, so that $\|\theta^{(i)} - \theta_*(N)\|^2$ reaches the statistical accuracy $\sim 1/N$; since each batch iteration runs over all $N$ training points, this costs $O(KN\ln\ln N)$ operations in total. Spending the same resources, the online algorithm can run for $n \sim N\ln\ln N$ time instants, and the hint then gives
$$\|\theta_n - \theta_*\|^2 \sim \frac{1}{n} = \frac{1}{N\ln\ln N} \ll \frac{1}{N} \sim \|\theta_*(N) - \theta_*\|^2.$$
34. Show property (8.111) for the proximal operator.
Solution: Assume first that $p = \mathrm{Prox}_{\lambda f}(x)$. By definition,
$$f(p) + \frac{1}{2\lambda}\|x - p\|^2 \le f(v) + \frac{1}{2\lambda}\|x - v\|^2, \quad \forall v \in \mathbb{R}^l.$$
Since the previous inequality holds true for any $v \in \mathbb{R}^l$, it also holds true for $p + \alpha(v - p)$, $\alpha \in (0, 1)$:
$$f(p) + \frac{1}{2\lambda}\|x - p\|^2 \le f\big(p + \alpha(v - p)\big) + \frac{1}{2\lambda}\|x - p\|^2 + \frac{\alpha^2}{2\lambda}\|v - p\|^2 - \frac{\alpha}{\lambda}\langle x - p, v - p\rangle.$$
By the convexity of $f$, $f(p + \alpha(v - p)) \le (1 - \alpha)f(p) + \alpha f(v)$. After re-arranging terms in the previous relation and dividing by $\alpha$,
$$\lambda f(p) \le \lambda f(v) + \frac{\alpha}{2}\|v - p\|^2 - \langle x - p, v - p\rangle, \quad \forall\alpha \in (0, 1).$$
Application of $\lim_{\alpha\to 0}$ on both sides of the previous inequality results in
$$\langle x - p, v - p\rangle \le \lambda\big(f(v) - f(p)\big), \quad \forall v \in \mathbb{R}^l.$$
Conversely, if the previous characterization holds, then for any $v$,
$$f(v) + \frac{1}{2\lambda}\|x - v\|^2 = f(v) + \frac{1}{2\lambda}\|x - p\|^2 + \frac{1}{2\lambda}\|v - p\|^2 - \frac{1}{\lambda}\langle x - p, v - p\rangle$$
$$\ge f(p) + \frac{1}{2\lambda}\|x - p\|^2 + \frac{1}{2\lambda}\|v - p\|^2 \ge f(p) + \frac{1}{2\lambda}\|x - p\|^2,$$
so that $p = \mathrm{Prox}_{\lambda f}(x)$, which establishes (8.111).
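The characterization is easy to test numerically. The sketch below uses the illustrative choice $f = \|\cdot\|_1$, whose proximal operator is the soft-thresholding rule, and checks the inequality $\langle x - p, v - p\rangle \le \lambda(f(v) - f(p))$ on random points:

import numpy as np

lam = 0.7
f = lambda v: np.abs(v).sum()     # f = ||.||_1
prox = lambda x: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(5)
x = rng.normal(size=4)
p = prox(x)                       # p = Prox_{λf}(x)
ok = all((x - p) @ (v - p) <= lam * (f(v) - f(p)) + 1e-10
         for v in rng.normal(size=(1000, 4)) * 3.0)
print(ok)                         # True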
35. Show property (8.112) for the proximal operator.
Solution: For compact notations, define $p_j := \mathrm{Prox}_{\lambda f}(x_j)$, $j = 1, 2$. Then, by the characterization (8.111),
$$\langle x_1 - p_1, p_2 - p_1\rangle \le \lambda\big(f(p_2) - f(p_1)\big), \quad \langle x_2 - p_2, p_1 - p_2\rangle \le \lambda\big(f(p_1) - f(p_2)\big).$$
Adding the two inequalities,
$$\langle x_1 - x_2 - (p_1 - p_2), p_2 - p_1\rangle \le 0,$$
or
$$\|p_1 - p_2\|^2 \le \langle p_1 - p_2, x_1 - x_2\rangle \le \|p_1 - p_2\|\,\|x_1 - x_2\|,$$
where the last step is Cauchy-Schwarz; hence
$$\|\mathrm{Prox}_{\lambda f}(x_1) - \mathrm{Prox}_{\lambda f}(x_2)\| \le \|x_1 - x_2\|,$$
i.e., the proximal operator is (firmly) nonexpansive.
36. Prove that the recursion in (8.118) converges to a minimizer of f.
Solution: Define the mapping $R := 2\,\mathrm{Prox}_{\lambda f} - I$. Then, (8.118) takes the following form:
$$x_{k+1} = x_k + \frac{\mu_k}{2}\big(R(x_k) - x_k\big).$$
Notice that $R$ is non-expansive: $\forall x_1, x_2 \in \mathbb{R}^l$,
$$\|R(x_1) - R(x_2)\|^2 = \big\|2\big(\mathrm{Prox}_{\lambda f}(x_1) - \mathrm{Prox}_{\lambda f}(x_2)\big) - (x_1 - x_2)\big\|^2$$
$$= 4\|\mathrm{Prox}_{\lambda f}(x_1) - \mathrm{Prox}_{\lambda f}(x_2)\|^2 + \|x_1 - x_2\|^2 - 4\langle \mathrm{Prox}_{\lambda f}(x_1) - \mathrm{Prox}_{\lambda f}(x_2), x_1 - x_2\rangle$$
$$\le \|x_1 - x_2\|^2,$$
where the last step follows from the firm nonexpansiveness established in Problem 35.
In turn, let $z$ be a fixed point of $R$; then
$$\|x_{k+1} - z\|^2 = \Big\|\Big(1 - \frac{\mu_k}{2}\Big)(x_k - z) + \frac{\mu_k}{2}\big(R(x_k) - z\big)\Big\|^2$$
$$= \Big(1 - \frac{\mu_k}{2}\Big)\|x_k - z\|^2 + \frac{\mu_k}{2}\|R(x_k) - z\|^2 - \frac{\mu_k}{2}\Big(1 - \frac{\mu_k}{2}\Big)\|R(x_k) - x_k\|^2,$$
where the convexity identity $\|(1 - t)a + tb\|^2 = (1 - t)\|a\|^2 + t\|b\|^2 - t(1 - t)\|a - b\|^2$ was used.
Hence, using $\|R(x_k) - z\| = \|R(x_k) - R(z)\| \le \|x_k - z\|$ and $\|R(x_k) - x_k\| = 2\|\mathrm{Prox}_{\lambda f}(x_k) - x_k\|$, we have, $\forall k$,
$$\|x_{k+1} - z\|^2 \le \|x_k - z\|^2 - \mu_k(2 - \mu_k)\|\mathrm{Prox}_{\lambda f}(x_k) - x_k\|^2.$$
Given any non-negative integer $k_0$, the previous telescoping inequality is utilized for all $k \in \{0, \ldots, k_0\}$ to produce
$$\sum_{k=0}^{k_0}\mu_k(2 - \mu_k)\|\mathrm{Prox}_{\lambda f}(x_k) - x_k\|^2 \le \|x_0 - z\|^2 - \|x_{k_0+1} - z\|^2 \le \|x_0 - z\|^2.$$
Since the previous relation holds for any $k_0$, applying $\lim_{k_0\to\infty}$ on both sides of the inequality results into
$$\sum_{k=0}^{+\infty}\mu_k(2 - \mu_k)\|\mathrm{Prox}_{\lambda f}(x_k) - x_k\|^2 \le \|x_0 - z\|^2 < +\infty. \tag{39}$$
Moreover, notice that
$$\|\mathrm{Prox}_{\lambda f}(x_{k+1}) - x_{k+1}\| = \frac{1}{2}\|R(x_{k+1}) - x_{k+1}\|$$
$$= \frac{1}{2}\Big\|R(x_{k+1}) - R(x_k) + \Big(1 - \frac{\mu_k}{2}\Big)\big(R(x_k) - x_k\big)\Big\|$$
$$\le \frac{1}{2}\|x_{k+1} - x_k\| + \frac{1}{2}\Big(1 - \frac{\mu_k}{2}\Big)\|R(x_k) - x_k\| = \frac{1}{2}\|R(x_k) - x_k\| = \|\mathrm{Prox}_{\lambda f}(x_k) - x_k\|,$$
where $\|x_{k+1} - x_k\| = \frac{\mu_k}{2}\|R(x_k) - x_k\|$ was used.
Since $\big(\|\mathrm{Prox}_{\lambda f}(x_k) - x_k\|\big)_{k\in\mathbb{N}}$ is monotonically non-increasing and bounded from below, it converges. Necessarily, $\lim_{k\to\infty}\|\mathrm{Prox}_{\lambda f}(x_k) - x_k\|^2 = 0$. Otherwise, there exists an $\epsilon > 0$ and a subsequence $(k_m)_{m\in\mathbb{N}}$ such that $\|\mathrm{Prox}_{\lambda f}(x_{k_m}) - x_{k_m}\|^2 \ge \epsilon$, $\forall m$; by monotonicity, $\|\mathrm{Prox}_{\lambda f}(x_i) - x_i\|^2 \ge \epsilon$ for all $i \le k_m$. This together with the fact that $\lim_{m\to\infty}\sum_{i=0}^{k_m}\mu_i(2 - \mu_i) = +\infty$, and (39) imply that
$$+\infty > \sum_{k=0}^{+\infty}\mu_k(2 - \mu_k)\|\mathrm{Prox}_{\lambda f}(x_k) - x_k\|^2 \ge \epsilon\lim_{m\to\infty}\sum_{i=0}^{k_m}\mu_i(2 - \mu_i) = +\infty,$$
which is a contradiction. Moreover, by the monotonicity of $(\|x_k - z\|)_{k\in\mathbb{N}}$, the sequence $(x_k)_{k\in\mathbb{N}}$ is bounded, so it possesses a cluster point $x_*$, i.e., a subsequence $x_{k_m}\to x_*$. Then,
$$\|x_* - \mathrm{Prox}_{\lambda f}(x_*)\|^2 \le \|x_* - x_{k_m}\|^2 + \|x_{k_m} - \mathrm{Prox}_{\lambda f}(x_{k_m})\|^2 + \|x_{k_m} - x_*\|^2$$
$$+ 2\langle x_{k_m} - \mathrm{Prox}_{\lambda f}(x_{k_m}), \mathrm{Prox}_{\lambda f}(x_{k_m}) - \mathrm{Prox}_{\lambda f}(x_*)\rangle - 2\|x_* - x_{k_m}\|^2 + 2\langle x_* - x_{k_m}, x_* - \mathrm{Prox}_{\lambda f}(x_*)\rangle$$
$$\le \|x_{k_m} - \mathrm{Prox}_{\lambda f}(x_{k_m})\|^2 + 2\|x_{k_m} - \mathrm{Prox}_{\lambda f}(x_{k_m})\|\,\|\mathrm{Prox}_{\lambda f}(x_{k_m}) - \mathrm{Prox}_{\lambda f}(x_*)\| + 2\|x_* - x_{k_m}\|\,\|x_* - \mathrm{Prox}_{\lambda f}(x_*)\|,$$
where the nonexpansiveness of $\mathrm{Prox}_{\lambda f}$ and Cauchy-Schwarz were used. Letting $m\to\infty$, the right-hand side vanishes; hence $x_* = \mathrm{Prox}_{\lambda f}(x_*)$, i.e., every cluster point is a minimizer of $f$.
It remains to show that the sequence has a unique cluster point. To this end, assume two cluster points $x, y$ of $(x_k)_{k\in\mathbb{N}}$. This means that there exist subsequences $(x_{k_m})_{m\in\mathbb{N}}$ and $(x_{l_m})_{m\in\mathbb{N}}$ which converge to $x$ and $y$, respectively. Moreover, notice that
$$\langle x_k, x - y\rangle = \frac{1}{2}\big(\|x_k - y\|^2 - \|x_k - x\|^2 + \|x\|^2 - \|y\|^2\big).$$
Since $x$ and $y$ are fixed points, the sequences $(\|x_k - x\|)_{k\in\mathbb{N}}$ and $(\|x_k - y\|)_{k\in\mathbb{N}}$ are non-increasing, hence convergent, and so $\lim_{k\to\infty}\langle x_k, x - y\rangle$ exists. Evaluating the limit along the two subsequences gives $\langle x, x - y\rangle = \langle y, x - y\rangle$, i.e., $\|x - y\|^2 = 0$, or $x = y$. Hence, $(x_k)_{k\in\mathbb{N}}$ converges to a minimizer of $f$.
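A minimal numerical sketch of the iteration, for the illustrative choice $f = \|\cdot\|_1$ (so that $\mathrm{Prox}_{\lambda f}$ is soft thresholding) and the constant step $\mu_k = 1 \in (0, 2)$:

import numpy as np

lam = 0.5
prox = lambda x: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
R = lambda x: 2.0 * prox(x) - x   # R = 2 Prox_{λf} - I

mu_k = 1.0                        # constant step μ_k in (0, 2)
x = np.array([5.0, -2.0, 0.3])
for _ in range(200):
    x = x + (mu_k / 2.0) * (R(x) - x)
print(x)                          # converges to the minimizer of ||.||_1, i.e., 0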
37. Derive (8.122) from (8.121).
Solution: Use the matrix inversion lemma
$$\big(A + BD^{-1}C\big)^{-1} = A^{-1} - A^{-1}B\big(D + CA^{-1}B\big)^{-1}CA^{-1}.$$
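The lemma itself is easy to verify numerically; the dimensions below are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(5, 5)) + 5.0 * np.eye(5)   # kept well conditioned
B = rng.normal(size=(5, 2))
C = rng.normal(size=(2, 5))
D = rng.normal(size=(2, 2)) + 5.0 * np.eye(2)

lhs = np.linalg.inv(A + B @ np.linalg.inv(D) @ C)
Ai = np.linalg.inv(A)
rhs = Ai - Ai @ B @ np.linalg.inv(D + C @ Ai @ B) @ C @ Ai
print(np.allclose(lhs, rhs))      # True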