
Introduction to Online Convex Optimization, Chapter 2

2.1 Basic definitions and setup

Definition

f is said to be G-bounded if

$$\|\nabla f(x)\| \le G,$$

where $\nabla f(x)$ is a subgradient of $f$ at $x$.

Definition positive definite

We write $A \preceq B$ if $B - A$ is positive semidefinite.

Proposition property of positive definite

(1) If $\alpha A \preceq B$ and $A$ is positive semidefinite, then for all $\beta < \alpha$,

$$\beta A \preceq B.$$

proof

(1)

Since $A$ is positive semidefinite, $x^T A x \ge 0$. Then for all $x \in \mathbb{R}^n$,

$$x^T(B - \beta A)x = x^T B x - \beta\, x^T A x \ge x^T B x - \alpha\, x^T A x \ge 0.$$

Definition alpha strongly convex and beta smooth

f is said to be α-strongly convex if

$$f(x) + \nabla f(x)^T(y - x) + \frac{\alpha}{2}\|y - x\|^2 \le f(y), \quad \text{i.e.,} \quad \nabla f(x)^T(y - x) + \frac{\alpha}{2}\|y - x\|^2 \le f(y) - f(x). \tag{1}$$

f is said to be β-smooth if

$$f(x) + \nabla f(x)^T(y - x) + \frac{\beta}{2}\|y - x\|^2 \ge f(y), \quad \text{i.e.,} \quad \nabla f(x)^T(y - x) + \frac{\beta}{2}\|y - x\|^2 \ge f(y) - f(x). \tag{2}$$

Proposition

For a differentiable f, the following conditions are equivalent:

(1) f is β-smooth.

(2) The gradient of f is β-Lipschitz:

$$\|\nabla f(x) - \nabla f(y)\| \le \beta\|x - y\|.$$

(3) If f is twice differentiable,

$$\nabla^2 f(x) \preceq \beta I.$$

proof

(1) ⇒ (2):

Let $x \in \mathbb{R}^n$ and $h > 0$ be fixed, and let $y := x + h(1, \dots, 1)^T$.

Definition gamma-well-conditioned

f is said to be γ-well-conditioned if f is both α-strongly convex and β-smooth, where

$$\gamma := \frac{\alpha}{\beta} \le 1.$$

γ is called the condition number of f.

2.1.1 Projections onto convex sets

Definition. projection onto convex sets

The projection onto a convex set $K$ is defined as

$$\Pi_K(y) := \arg\min_{x \in K} \|x - y\|.$$

Theorem 2.1

Let $x := \Pi_K(y)$. Then for all $z \in K$,

$$\|y - z\| \ge \|x - z\|.$$
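As a quick numerical sanity check (a sketch, not from the source: we take $K$ to be the Euclidean unit ball, for which the projection has a closed form):

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto K = {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

rng = np.random.default_rng(0)
y = 3.0 * rng.normal(size=5)             # a point (likely) outside the unit ball
x = project_ball(y)                      # x := Pi_K(y)
for _ in range(1000):
    z = project_ball(rng.normal(size=5))  # an arbitrary point z in K
    assert np.linalg.norm(y - z) >= np.linalg.norm(x - z) - 1e-12
```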

2.1.2 Introduction to optimality conditions

In this section, we assume

$$x^\star \in \arg\min_{x \in K} f(x).$$

Theorem 2.2 KKT

Then, for all $y \in K$,

$$\nabla f(x^\star)^T(y - x^\star) \ge 0.$$

proof.

2.2 Gradient / subgradient descent

In this section, we prove convergence rates of gradient descent and accelerated gradient descent for each of the following cases: f is γ-well-conditioned, f is β-smooth, and f is α-strongly convex.

Gradient descent

Accelerated GD

2.2.1 Basic Gradient descent

Algorithm 2 Gradient Descent Method

Step1. For $t = 1, \dots, T$:

Step2. Update

$$y_{t+1} := x_t - \eta_t \nabla f(x_t), \qquad x_{t+1} := \Pi_K(y_{t+1}).$$

Step3. Repeat Step2 until $t = T$.

Step4. Return $x_{T+1}$.
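A minimal NumPy sketch of Algorithm 2 (assuming the caller supplies a gradient oracle `grad` and a projection oracle `proj`; both names are ours):

```python
import numpy as np

def gradient_descent(grad, proj, x1, etas, T):
    """Algorithm 2: projected gradient descent.

    grad(x) -- gradient (or subgradient) of f at x
    proj(y) -- projection Pi_K(y) onto the feasible set K
    etas    -- sequence of step sizes eta_1, ..., eta_T
    """
    x = np.asarray(x1, dtype=float)
    for t in range(T):
        y = x - etas[t] * grad(x)  # y_{t+1} := x_t - eta_t grad f(x_t)
        x = proj(y)                # x_{t+1} := Pi_K(y_{t+1})
    return x                       # x_{T+1}
```

For unconstrained problems, `proj` can simply be the identity.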

Lemma

$$x - \frac{1}{b}a = \arg\min_z f(z), \qquad \text{where } f(z) := \sum_{i=1}^d a_i(z_i - x_i) + \frac{b}{2}\|z - x\|^2.$$

proof

$$\frac{\partial f}{\partial z_i} = 0 \iff a_i + b(z_i - x_i) = 0 \iff z_i = -\frac{a_i}{b} + x_i. \tag{3}$$

Then,

$$f\left(x - \frac{1}{b}a\right) = -\sum_{i=1}^d \frac{a_i^2}{b} + \frac{1}{2b}\|a\|^2 = -\frac{1}{2b}\|a\|^2. \tag{4}$$
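The closed forms (3) and (4) are easy to verify numerically (a sketch with randomly chosen $a$, $b$, $x$):

```python
import numpy as np

rng = np.random.default_rng(1)
d, b = 4, 2.5
a, x = rng.normal(size=d), rng.normal(size=d)

f = lambda z: a @ (z - x) + (b / 2) * np.linalg.norm(z - x) ** 2
z_star = x - a / b  # the minimizer from (3)

# the value at the minimizer matches (4) ...
assert np.isclose(f(z_star), -np.linalg.norm(a) ** 2 / (2 * b))
# ... and no nearby point does better
for _ in range(1000):
    assert f(z_star + 0.1 * rng.normal(size=d)) >= f(z_star)
```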

Theorem 2.3

Let f be γ-well-conditioned, let gradient descent be run unconstrained ($K = \mathbb{R}^n$) with $\eta_t := 1/\beta$, and write $h_t := f(x_t) - f(x^\star)$. Then

$$h_{t+1} \le h_1 e^{-\gamma t}.$$

proof

By strong convexity, for all $x, y \in K$,

$$\begin{aligned} f(y) &\ge f(x) + \nabla f(x)^T(y - x) + \frac{\alpha}{2}\|y - x\|^2 \quad (\alpha\text{-strong convexity}) \\ &\ge \min_{z}\left\{f(x) + \nabla f(x)^T(z - x) + \frac{\alpha}{2}\|z - x\|^2\right\} \\ &= f(x) - \frac{1}{2\alpha}\|\nabla f(x)\|^2. \quad \left(z := x - \tfrac{1}{\alpha}\nabla f(x), \text{ by the lemma above}\right) \end{aligned}$$

Hence, applying this with $y := x^\star$ and $x := x_t$ gives $f(x^\star) \ge f(x_t) - \frac{1}{2\alpha}\|\nabla f(x_t)\|^2$, that is,

$$\|\nabla f(x_t)\|^2 \ge 2\alpha\left(f(x_t) - f(x^\star)\right) = 2\alpha h_t. \tag{5}$$

Therefore

$$\begin{aligned} h_{t+1} - h_t &= f(x_{t+1}) - f(x_t) \\ &\le \nabla f(x_t)^T(x_{t+1} - x_t) + \frac{\beta}{2}\|x_{t+1} - x_t\|^2 \quad (\beta\text{-smoothness}) \\ &= -\eta_t\|\nabla f(x_t)\|^2 + \frac{\beta}{2}\eta_t^2\|\nabla f(x_t)\|^2 \\ &= -\frac{1}{2\beta}\|\nabla f(x_t)\|^2 \quad (\eta_t = 1/\beta) \\ &\le -\frac{\alpha}{\beta}h_t \quad (\text{by }(5)) \\ &= -\gamma h_t. \tag{6} \end{aligned}$$

Thus,

$$h_{t+1} \le h_t(1 - \gamma) \le h_1(1 - \gamma)^t \le h_1 e^{-\gamma t}.$$
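The geometric rate of Theorem 2.3 can be observed on a simple quadratic (a sketch; the diagonal quadratic is our hypothetical example, with α and β its extreme eigenvalues):

```python
import numpy as np

alpha, beta = 1.0, 10.0
gamma = alpha / beta
Q = np.diag(np.linspace(alpha, beta, 5))  # alpha I <= Q <= beta I
f = lambda x: 0.5 * x @ Q @ x             # minimizer x* = 0, f(x*) = 0

x = np.ones(5)
h1 = f(x)
for t in range(1, 51):
    x = x - (1.0 / beta) * (Q @ x)          # unconstrained step, eta_t = 1/beta
    assert f(x) <= h1 * np.exp(-gamma * t)  # h_{t+1} <= h_1 e^{-gamma t}
```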

Theorem 2.4

Let f be γ-well-conditioned and let Algorithm 2 be run with $\eta_t := 1/\beta$. Then

$$h_{t+1} \le h_1 \exp\left(-\frac{\gamma t}{4}\right).$$

proof

By α-strong convexity, for every $x, x_t \in K$,

$$\nabla f(x_t)^T(x - x_t) \le f(x) - f(x_t) - \frac{\alpha}{2}\|x - x_t\|^2. \tag{7}$$

By the algorithm's definition, we have

$$x_{t+1} = \arg\min_{x \in K}\left\{\nabla f(x_t)^T(x - x_t) + \frac{\beta}{2}\|x - x_t\|^2\right\}. \tag{8}$$

To see this, let $\nabla_t := \nabla f(x_t)$. Then

$$\begin{aligned} \arg\min_{x \in K} \|x - (x_t - \eta_t \nabla_t)\|^2 &= \arg\min_{x \in K}\left\{\|x - x_t\|^2 + 2\eta_t \nabla_t^T(x - x_t) + \eta_t^2\|\nabla_t\|^2\right\} \\ &= \arg\min_{x \in K}\left\{\frac{1}{2\eta_t}\|x - x_t\|^2 + \nabla_t^T(x - x_t)\right\}, \tag{9} \end{aligned}$$

since the last term does not depend on $x$ and positive scaling does not change the minimizer. Hence

$$x_{t+1} = \Pi_K(x_t - \eta_t \nabla f(x_t)) = \arg\min_{x \in K}\left\{\nabla f(x_t)^T(x - x_t) + \frac{1}{2\eta_t}\|x - x_t\|^2\right\}, \tag{10}$$

which is (8) when $\eta_t = 1/\beta$.

Thus, for all $\eta \in [0, 1]$,

$$\begin{aligned} h_{t+1} - h_t &= f(x_{t+1}) - f(x_t) \\ &\le \nabla f(x_t)^T(x_{t+1} - x_t) + \frac{\beta}{2}\|x_{t+1} - x_t\|^2 \quad (\text{smoothness}) \\ &= \min_{x \in K}\left\{\nabla f(x_t)^T(x - x_t) + \frac{\beta}{2}\|x - x_t\|^2\right\} \quad (\text{by }(8)) \\ &\le \min_{x \in K}\left\{f(x) - f(x_t) + \frac{\beta - \alpha}{2}\|x - x_t\|^2\right\} \quad (\text{by }(7)) \\ &\le \min_{x \in [x_t, x^\star]}\left\{f(x) - f(x_t) + \frac{\beta - \alpha}{2}\|x - x_t\|^2\right\} \\ &\le f\left((1 - \eta)x_t + \eta x^\star\right) - f(x_t) + \frac{\beta - \alpha}{2}\left\|(1 - \eta)x_t + \eta x^\star - x_t\right\|^2 \\ &\le (1 - \eta)f(x_t) + \eta f(x^\star) - f(x_t) + \frac{(\beta - \alpha)\eta^2}{2}\|x_t - x^\star\|^2 \quad (\text{convexity}) \\ &= -\eta h_t + \frac{(\beta - \alpha)\eta^2}{2}\|x_t - x^\star\|^2, \tag{11} \end{aligned}$$

where

$$[x_t, x^\star] := \left\{(1 - \eta)x_t + \eta x^\star \mid \eta \in [0, 1]\right\}.$$

For the second term,

$$\begin{aligned} \frac{\alpha}{2}\|x^\star - x_t\|^2 &\le \frac{\alpha}{2}\|x^\star - x_t\|^2 + \nabla f(x^\star)^T(x_t - x^\star) \quad (\text{KKT condition, Theorem 2.2}) \\ &\le f(x_t) - f(x^\star) \quad (\alpha\text{-strong convexity}) \\ &= h_t. \end{aligned}$$

Combining the above results, for all $\eta \in [0, 1]$,

$$\begin{aligned} h_{t+1} - h_t &\le -\eta h_t + \frac{(\beta - \alpha)\eta^2}{\alpha}h_t \le -\eta h_t + \frac{\beta\eta^2}{\alpha}h_t = \left(\frac{\beta}{\alpha}\eta^2 - \eta\right)h_t \\ &= \left\{\frac{\beta}{\alpha}\left(\eta - \frac{\alpha}{2\beta}\right)^2 - \frac{\alpha}{4\beta}\right\}h_t. \tag{12} \end{aligned}$$

Moreover, since f is γ-well-conditioned, the minimizing choice of η is feasible:

$$0 \le \eta = \frac{\alpha}{2\beta} = \frac{\gamma}{2} \le \frac{1}{2}.$$

Thus the minimum is attained and

$$h_{t+1} - h_t \le \min_{\eta \in [0,1]}\left\{-\eta h_t + \frac{\beta\eta^2}{\alpha}h_t\right\} = -\frac{\alpha}{4\beta}h_t = -\frac{\gamma}{4}h_t. \tag{13}$$

Thus,

$$h_{t+1} \le h_t\left(1 - \frac{\gamma}{4}\right) \le \cdots \le h_1 e^{-\gamma t/4}. \tag{14}$$

2.3 Reductions to non-smooth and non-strongly convex functions

2.3.1 Reductions to smooth and non-strongly convex functions

Algorithm 3 gradient descent, reduction to beta-smooth functions

Step1. Let

$$g(x) := f(x) + \frac{\tilde\alpha}{2}\|x - x_1\|^2. \tag{15}$$

Step2. Apply Algorithm 2 with parameters $g$, $T$, $\eta_t := 1/\beta$, and $x_1$, obtaining $x_T$.

Step3. Return the $x_T$ from Step2.
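In code, the reduction only changes the gradient oracle (a sketch reusing the `gradient_descent` function from the Algorithm 2 sketch above; `grad_f` is an assumed gradient oracle for f):

```python
import numpy as np

def smooth_reduction(grad_f, proj, x1, T, beta, alpha_tilde):
    """Algorithm 3: strongly-convexify a beta-smooth f, then run Algorithm 2.

    g(x) = f(x) + (alpha_tilde / 2) ||x - x1||^2, so that
    grad g(x) = grad f(x) + alpha_tilde (x - x1).
    """
    x1 = np.asarray(x1, dtype=float)
    grad_g = lambda x: grad_f(x) + alpha_tilde * (x - x1)
    return gradient_descent(grad_g, proj, x1, [1.0 / beta] * T, T)
```

Lemma 2.5 below chooses the parameter $\hat\alpha$ to balance the two resulting error terms.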

Proposition property of the norm

Let

$$h(x) := \frac{\alpha}{2}\|x - x_1\|^2.$$

Then (1) h is convex, and (2) h is α-strongly convex.

proof

(1)

The Hessian of h is $\nabla^2 h(x) = \alpha I$, a diagonal matrix with nonnegative entries, hence positive semidefinite, so h is convex.

(2)

Since $\nabla h(x) = \alpha(x - x_1)$ and $\|y - x_1\|^2 = \|y - x\|^2 + 2\sum_{i=1}^n (x_i - x_{1,i})(y_i - x_i) + \|x - x_1\|^2$,

$$\begin{aligned} h(y) - h(x) &= \frac{\alpha}{2}\left(\|y - x_1\|^2 - \|x - x_1\|^2\right) \\ &= \alpha\sum_{i=1}^n (x_i - x_{1,i})(y_i - x_i) + \frac{\alpha}{2}\|y - x\|^2 \\ &= \nabla h(x)^T(y - x) + \frac{\alpha}{2}\|y - x\|^2. \tag{16} \end{aligned}$$

Proposition strong convexification

Let f be convex and

$$g(x) := f(x) + \frac{\hat\alpha}{2}\|x - x_1\|^2.$$

Then g is $\hat\alpha$-strongly convex.

proof

$$\nabla g(x) = \nabla f(x) + \hat\alpha(x - x_1).$$

Thus,

$$\begin{aligned} g(y) - g(x) &= f(y) - f(x) + \frac{\hat\alpha}{2}\|y - x_1\|^2 - \frac{\hat\alpha}{2}\|x - x_1\|^2 \\ &\ge \nabla f(x)^T(y - x) + \frac{\hat\alpha}{2}\left(\|y - x_1\|^2 - \|x - x_1\|^2\right) \quad (\text{convexity of } f) \\ &= \nabla g(x)^T(y - x) - \hat\alpha(x - x_1)^T(y - x) + \frac{\hat\alpha}{2}\left(\|y - x\|^2 + 2(x - x_1)^T(y - x)\right) \\ &= \nabla g(x)^T(y - x) + \frac{\hat\alpha}{2}\|y - x\|^2. \tag{17} \end{aligned}$$

Proposition additivity of β-smoothness

If $h_1$ is $\beta_1$-smooth and $h_2$ is $\beta_2$-smooth, then $h_1 + h_2$ is $(\beta_1 + \beta_2)$-smooth.

proof

$$\begin{aligned} h_1(y) + h_2(y) - h_1(x) - h_2(x) &\le \nabla h_1(x)^T(y - x) + \frac{\beta_1}{2}\|y - x\|^2 + \nabla h_2(x)^T(y - x) + \frac{\beta_2}{2}\|y - x\|^2 \\ &= \left(\nabla h_1(x) + \nabla h_2(x)\right)^T(y - x) + \frac{\beta_1 + \beta_2}{2}\|y - x\|^2. \end{aligned}$$

Lemma 2.5

Assume $\|x - y\| \le D$ for all $x, y \in K$. Then Algorithm 3 with parameter $\hat\alpha := \frac{\beta\log t}{D^2 t}$ converges as

$$h_{t+1} \le h_1^g \exp\left(-\frac{t\log t}{4(\log t + D^2 t)}\right) + \frac{\beta\log t}{2t}.$$

proof

The function g defined in (15) is $\hat\alpha$-strongly convex by the strong convexification proposition, and $(\beta + \hat\alpha)$-smooth by the additivity proposition. Thus g is γ-well-conditioned with $\gamma = \frac{\hat\alpha}{\hat\alpha + \beta}$. Moreover,

$$h_t = f(x_t) - f(x^\star) = g(x_t) - g(x^\star) + \frac{\hat\alpha}{2}\left(\|x^\star - x_1\|^2 - \|x_t - x_1\|^2\right) \le h_t^g + \frac{\hat\alpha}{2}D^2,$$

where $h_t^g := g(x_t) - g(x^\star)$. Since g is $\frac{\hat\alpha}{\hat\alpha + \beta}$-well-conditioned,

$$\begin{aligned} h_{t+1} &\le h_{t+1}^g + \frac{\hat\alpha}{2}D^2 \\ &\le h_1^g \exp\left(-\frac{\hat\alpha t}{4(\hat\alpha + \beta)}\right) + \frac{\hat\alpha}{2}D^2 \quad (\text{Theorem 2.4}) \\ &= h_1^g \exp\left(-\frac{t\beta\log t}{4\beta(\log t + D^2 t)}\right) + \frac{\beta\log t}{2t} \\ &= h_1^g \exp\left(-\frac{t\log t}{4(\log t + D^2 t)}\right) + \frac{\beta\log t}{2t}. \end{aligned}$$

2.3.2 Reduction to strongly convex, non-smooth functions

Algorithm 4 Gradient descent reduction to non-smooth functions

$$\hat f_\delta(x) := \int_B f(x + \delta u)\, du,$$

where the integral is with respect to the uniform probability measure on the unit ball $B := \{u \in \mathbb{R}^d : \|u\| \le 1\}$.

Step1. Apply Algorithm 2 to $\hat f_\delta$ with $x_1$, $T$, and $\{\eta_t = \delta\}$, and return $x_T$.
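The smoothed function and its gradient admit simple Monte Carlo estimators; the following sketch (our construction, not part of the source's algorithm) estimates $\hat f_\delta$ by sampling the unit ball, and $\nabla \hat f_\delta$ via the sphere identity (18) proved below:

```python
import numpy as np

def sample_ball(d, rng):
    """Uniform sample from the unit ball B in R^d."""
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    return u * rng.uniform() ** (1.0 / d)  # radius correction for uniformity

def smoothed_value(f, x, delta, rng, n=1000):
    """Monte Carlo estimate of f_delta(x) = E_{u ~ B}[f(x + delta u)]."""
    return np.mean([f(x + delta * sample_ball(len(x), rng)) for _ in range(n)])

def smoothed_grad(f, x, delta, rng, n=1000):
    """Estimate of grad f_delta(x) = (d/delta) E_{u ~ S}[f(x + delta u) u]; cf. (18)."""
    d = len(x)
    g = np.zeros(d)
    for _ in range(n):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)  # uniform on the unit sphere S
        g += f(x + delta * u) * u
    return (d / delta) * g / n
```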

Lemma 2.6

Suppose f is α-strongly convex and G-Lipschitz. Then

(1) $\hat f_\delta$ is α-strongly convex,

(2) $\hat f_\delta$ is $\frac{dG}{\delta}$-smooth, and

(3) $|\hat f_\delta(x) - f(x)| \le \delta G$ for all $x$.

proof

(1)

$$\begin{aligned} \hat f_\delta(y) - \hat f_\delta(x) &= \int_B \left(f(y + \delta u) - f(x + \delta u)\right) du \\ &\ge \int_B \left(\frac{\alpha}{2}\|y - x\|^2 + \nabla f(x + \delta u)^T(y - x)\right) du \quad (\text{strong convexity}) \\ &= \frac{\alpha}{2}\|y - x\|^2 + \left(\int_B \nabla f(x + \delta u)\, du\right)^T(y - x) \\ &= \frac{\alpha}{2}\|y - x\|^2 + \nabla \hat f_\delta(x)^T(y - x). \end{aligned}$$

The last equality, exchanging the gradient with the integral, follows from the Lebesgue dominated convergence theorem, since f is differentiable and integrable over the unit ball.

(2)

Let

$$S := \{y \in \mathbb{R}^d \mid \|y\| = 1\}.$$

By Stokes' theorem,

$$\int_S f(x + \delta u)\, u\, du = \frac{\delta}{d}\,\nabla \int_B f(x + \delta u)\, du. \tag{18}$$

Hence

$$\begin{aligned} \|\nabla \hat f_\delta(x) - \nabla \hat f_\delta(y)\| &= \frac{d}{\delta}\left\|\int_S f(x + \delta u)\, u\, du - \int_S f(y + \delta u)\, u\, du\right\| \\ &= \frac{d}{\delta}\left\|\int_S \left(f(x + \delta u) - f(y + \delta u)\right) u\, du\right\| \\ &\le \frac{d}{\delta}\int_S \left|f(x + \delta u) - f(y + \delta u)\right| \|u\|\, du \quad (\text{Jensen's inequality}) \\ &\le \frac{d}{\delta}\, G\|x - y\| \int_S \|u\|\, du \quad (\text{Lipschitz continuity}) \\ &= \frac{d}{\delta}\, G\|x - y\|. \quad (\|u\| = 1 \text{ on } S) \end{aligned}$$

By the smoothness proposition above ((2) ⟹ (1)), $\hat f_\delta$ is $\frac{dG}{\delta}$-smooth.

(3)

$$\begin{aligned} \left|\hat f_\delta(x) - f(x)\right| &= \left|\int_B \left(f(x + \delta u) - f(x)\right) du\right| \\ &\le \int_B \left|f(x + \delta u) - f(x)\right| du \quad (\text{Jensen's inequality}) \\ &\le G\int_B \delta\|u\|\, du \quad (\text{Lipschitz}) \\ &\le G\delta \int_B 1\, du = G\delta, \tag{19} \end{aligned}$$

since the uniform measure of $B$ is 1.
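Part (3) can be checked numerically with the `smoothed_value` sketch above (our example: $f(x) = \|x\|_1$, which is $\sqrt{d}$-Lipschitz in the Euclidean norm):

```python
import numpy as np

rng = np.random.default_rng(2)
d, delta = 5, 0.3
G = np.sqrt(d)                 # ell_2 Lipschitz constant of the ell_1 norm
f = lambda x: np.abs(x).sum()

x = rng.normal(size=d)
f_delta = smoothed_value(f, x, delta, rng, n=20000)
assert abs(f_delta - f(x)) <= delta * G  # Lemma 2.6 (3), up to Monte Carlo error
```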

Lemma 2.7

Set

$$\delta := \frac{dG}{\alpha}\cdot\frac{\log t}{t}.$$

Then Algorithm 4 converges as

$$h_{t+1} \le h_1\exp\left(-\frac{\log t}{4}\right) + \frac{2dG^2\log t}{\alpha t}.$$

proof

By Lemma 2.6, the function $\hat f_\delta$ is γ-well-conditioned for

$$\gamma := \frac{\alpha\delta}{dG}.$$

Hence

$$\begin{aligned} h_{t+1} &= f(x_{t+1}) - f(x^\star) \\ &= \hat f_\delta(x_{t+1}) + \left(f(x_{t+1}) - \hat f_\delta(x_{t+1})\right) - \hat f_\delta(x^\star) + \left(\hat f_\delta(x^\star) - f(x^\star)\right) \\ &\le \hat f_\delta(x_{t+1}) - \hat f_\delta(x^\star) + 2\delta G \quad (\text{Lemma 2.6 (3)}) \\ &\le h_1 e^{-\gamma t/4} + 2\delta G \quad (\text{Theorem 2.4}) \\ &= h_1 \exp\left(-\frac{\alpha\delta t}{4dG}\right) + 2\delta G \\ &= h_1 \exp\left(-\frac{\log t}{4}\right) + \frac{2dG^2\log t}{\alpha t}. \tag{20} \end{aligned}$$

There is an algorithm for α-strongly convex functions that does not rely on randomized smoothing and hence involves no expectation. Moreover, its bound does not depend on the dimension d.

Theorem 2.8

Let f be α-strongly convex and G-bounded, and run (projected) gradient descent with

$$\eta_t := \frac{2}{\alpha(t + 1)}.$$

Then

$$f\left(\sum_{s=1}^t \frac{2s}{t(t+1)}\, x_s\right) - f(x^\star) \le \frac{2G^2}{\alpha(t+1)}.$$
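A sketch of the step-size schedule and weighted averaging in Theorem 2.8 (`subgrad` and `proj` are assumed oracles, as before):

```python
import numpy as np

def sgd_strongly_convex(subgrad, proj, x1, alpha, T):
    """Projected subgradient descent with eta_t = 2 / (alpha (t + 1)),
    returning the weighted average sum_t 2t / (T(T+1)) x_t of Theorem 2.8."""
    x = np.asarray(x1, dtype=float)
    avg = np.zeros_like(x)
    for t in range(1, T + 1):
        avg += (2.0 * t / (T * (T + 1))) * x  # weight x_t by 2t / (T(T+1))
        x = proj(x - (2.0 / (alpha * (t + 1))) * subgrad(x))
    return avg
```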

proof

2.3.3 Reduction to general convex function

We can apply both techniques at once:

$$g(x) := f(x) + \frac{\alpha}{2}\|x - x_1\|^2, \qquad \hat f_\delta(x) := \int_B g(x + \delta u)\, du = \int_B \left(f(x + \delta u) + \frac{\alpha}{2}\|x + \delta u - x_1\|^2\right) du.$$

Then, by the strong convexification proposition, g is α-strongly convex, and by Lemma 2.6, $\hat f_\delta$ is $\frac{dG}{\delta}$-smooth. Thus we obtain the same inequality as in Lemma 2.7.
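In code, the two reductions compose directly (a sketch reusing the `smoothed_grad` estimator above):

```python
import numpy as np

def general_reduction_grad(f, x, x1, alpha, delta, rng, n=1000):
    """Gradient estimate for the smoothed, strongly-convexified objective:
    first g(z) = f(z) + (alpha/2) ||z - x1||^2, then smooth g as in Algorithm 4."""
    g = lambda z: f(z) + 0.5 * alpha * np.linalg.norm(z - x1) ** 2
    return smoothed_grad(g, x, delta, rng, n)
```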

However, there is an algorithm with a better bound than that of Lemma 2.7, namely $O(1/\sqrt{t})$. We will show this in the next chapter.

2.4 Example: Support Vector Machine training

The general linear classification problem is stated as follows: given n examples $(a_i, b_i)$ with $a_i \in \mathbb{R}^d$ and labels $b_i \in \{-1, +1\}$, find a hyperplane, described by its normal vector $x \in \mathbb{R}^d$, that classifies the examples as accurately as possible.

Many loss functions can be considered for this classification problem, but here we consider one of the simplest. The problem to solve is

$$\min_{x \in \mathbb{R}^d} \sum_{i=1}^n \mathbb{1}\left[\operatorname{sign}(x^T a_i) \ne b_i\right]. \tag{21}$$

The Support Vector Machine is widely used in practice for classification. The soft-margin SVM relaxation replaces the 0/1 loss in (21) with a convex surrogate, the hinge loss, given by

$$\ell_{a,b}(x) := \operatorname{hinge}(b\, x^T a) := \max\{0,\ 1 - b\, x^T a\}.$$

Here we consider the hinge loss with regularization:

$$\min_{x \in \mathbb{R}^d}\left(\lambda\frac{1}{n}\sum_{i=1}^n \ell_{a_i, b_i}(x) + \frac{1}{2}\|x\|^2\right). \tag{22}$$

Algorithm 5 SVM training via subgradient descent

Step1.

$$\nabla_t := \lambda\frac{1}{n}\sum_{i=1}^n \nabla\ell_{a_i, b_i}(x_t) + x_t, \qquad \nabla\ell_{a,b}(x) = \begin{cases} 0 & (b\, x^T a > 1) \\ -b\, a & \text{otherwise,} \end{cases}$$

where $\nabla\ell_{a,b}$ denotes a subgradient of the hinge loss.

Step2. Update

$$x_{t+1} := x_t - \eta_t \nabla_t.$$

Step3. end for

Step4. Return $\tilde{x} := \sum_{t=1}^T \frac{2t}{T(T+1)}\, x_t$.
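A NumPy sketch of Algorithm 5 (the step size $\eta_t = \frac{2}{t+1}$ is our choice, taken from Theorem 2.8 with α = 1, since the objective (22) is 1-strongly convex; the source leaves $\eta_t$ unspecified):

```python
import numpy as np

def svm_subgradient_descent(A, b, lam, T):
    """Algorithm 5: subgradient descent on the regularized hinge loss (22).

    A   -- (n, d) matrix whose rows are the examples a_i
    b   -- (n,) labels in {-1, +1}
    lam -- regularization trade-off lambda in (22)
    """
    n, d = A.shape
    x = np.zeros(d)
    x_tilde = np.zeros(d)
    for t in range(1, T + 1):
        x_tilde += (2.0 * t / (T * (T + 1))) * x  # accumulate Step 4's average
        margin = b * (A @ x)                      # b_i x^T a_i for all i
        active = margin <= 1.0                    # rows with subgradient -b_i a_i
        grad_hinge = -(A[active] * b[active][:, None]).sum(axis=0) / n
        grad = lam * grad_hinge + x               # Step 1: nabla_t
        x = x - (2.0 / (t + 1)) * grad            # Step 2
    return x_tilde                                # Step 4
```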

Theorem

In Algorithm 5, if $T > \frac{1}{\epsilon}$, then

$$f(\tilde{x}) - f(x^\star) \le O(\epsilon).$$

proof