Introduction to Online Convex Optimization, Chapter 5
In online convex optimization, one of the simplest algorithms is the Follow The Leader (FTL) method, which simply chooses the action that has performed best so far, that is, we choose $x_{t+1}$ as
\[x_{t+1} := \arg \min_{x \in \mathcal{K}} \sum_{s=1}^{t} f_{s}(x) .\]We will give an example in which the FTL method does not perform well.
Example Follow The Leader
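A minimal numerical sketch of the standard bad case for FTL, which I assume is the intended example here: $\mathcal{K} = [-1, 1]$ and linear losses $f_{t}(x) = z_{t} x$ whose coefficients alternate in sign after the first round, so FTL keeps jumping between the two endpoints and suffers regret that grows linearly in $T$.

```python
import numpy as np

# Classic bad case for Follow The Leader (assumed example):
# K = [-1, 1], f_t(x) = z_t * x with z_1 = 0.5 and z_t alternating -1, +1, ...
T = 100
z = np.array([0.5] + [(-1.0) ** (t - 1) for t in range(2, T + 1)])

x, cum_z, ftl_loss = 0.0, 0.0, 0.0   # x_1 = 0 (any point of K works)
for t in range(T):
    ftl_loss += z[t] * x             # suffer f_t(x_t)
    cum_z += z[t]
    x = -1.0 if cum_z > 0 else 1.0   # FTL: argmin_{x in [-1,1]} (sum_{s<=t} z_s) x

best_fixed = -abs(np.sum(z))         # best fixed point in hindsight is an endpoint
print(f"FTL loss: {ftl_loss:.1f}, best fixed: {best_fixed:.1f}, regret: {ftl_loss - best_fixed:.1f}")
```

The printed regret is close to $T$, which is exactly what the regularization introduced below is designed to avoid.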
5.1 Regularization functions
- $R: \mathcal{K} \rightarrow \mathbb{R}$,
- regularization function
- strongly convex and smooth
- twice differentiable in $\mathrm{Int}(\mathcal{K})$,
- we assume for simplicity
- $D_{R} \in \mathbb{R}$,
- diameter of the set $\mathcal{K}$ relative to the function $R$,
The dual norm to a norm is
\[\| y \|^{*} := \max_{ \|x\| \le 1 } \langle x, y \rangle .\]For instance, if the norm is the matrix norm defined by a positive definite matrix $A$, $\| x \|_{A} := \sqrt{ x^{\mathrm{T}} A x }$, the dual norm is
\[\norm{x}_{A}^{*} = \norm{x}_{A^{-1}} .\]Definition 5.1
- \(B_{R}(x \parallel y) \in \mathbb{R}\),
$B_{R}(x \parallel y) \in \mathbb{R}$ is called the Bregman divergence with respect to the function $R$, defined as
\[B_{R}(x \parallel y) := R(x) - R(y) - \nabla R(y)^{\mathrm{T}}(x - y) .\]Remark
Suppose that $R$ is twice differentiable. Then, by Taylor's theorem, for some $c_{x, y} \in [0, 1]$,
\[R(x) = R(y) + \nabla R(y)^{\mathrm{T}}(x - y) + \frac{1}{2} (x - y)^{\mathrm{T}} \nabla^{2} R(y + c_{x, y}(x - y)) (x - y) .\]Hence
\[B_{R}(x \parallel y) = \frac{1}{2} (x - y)^{\mathrm{T}} \nabla^{2} R(y + c_{x, y}(x - y)) (x - y) .\]We use the following inner product and norms defined by the positive definite matrix $\nabla^{2} R(z)$.
\[\begin{eqnarray} \langle x, y \rangle_{z, R} & := & x^{\mathrm{T}} \nabla^{2} R(z) y \label{definition_inner_product} \\ \norm{x}_{z, R} & := & \left( \langle x, x \rangle_{z, R} \right)^{1/2} \nonumber \\ & = & \sqrt{ x^{\mathrm{T}} \nabla^{2} R(z) x } \label{definition_norm_local} \\ \norm{y}_{z, R}^{*} & := & \sup_{\norm{x}_{z, R} \le 1} \langle x, y \rangle \label{definition_dual_norm_local} . \end{eqnarray}\]If we take $z_{x, y} := y + c_{x, y}(x - y)$,
\[\begin{equation} B_{R}(x \parallel y) = \frac{1}{2} (\norm{x - y}_{z_{x, y}})^{2} \label{equation_bregman_divergence_relation_to_local_norm} \end{equation} .\]5.2 The RFTL algorithm and its analysis
By the convexity of $f_{t}$,
\(\begin{equation} \sum_{t=1}^{T} f_{t}(x_{t}) - \sum_{t=1}^{T} f_{t}(x) \le \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x) \label{equation_05_01} \end{equation}\) .
5.2.1 Meta-algorithm definition
Algorithm 10 Regularized Follow The Leader
- $\eta > 0$,
- $\mathcal{K}$,
- convex compact
- $R: \mathcal{K} \rightarrow \mathbb{R}$,
- regularization function
Step1. for $t=1$ to $T$
Step2. Predict $x_{t}$
Step3. Observe $f_{t} \in \mathcal{F}$
Step4. Update
\[\begin{eqnarray} x_{t+1} & := & \arg\min_{x \in \mathcal{K}} \left( \eta \sum_{s=1}^{t} \nabla f_{s}(x_{s})^{\mathrm{T}} x + R(x) \right) \end{eqnarray}\]Step5. end for
Step6. return $x_{T+1}$
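A minimal Python sketch of Algorithm 10 under assumptions of my own choosing: $\mathcal{K}$ is the Euclidean unit ball and $R(x) = \frac{1}{2} \norm{x}_{2}^{2}$, so the inner $\arg\min$ has a closed form, namely the projection of $-\eta \sum_{s \le t} \nabla f_{s}(x_{s})$ onto the ball. The helper names are illustrative only.

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def rftl(grad_fn, T, dim, eta=0.1):
    """RFTL (Algorithm 10) with R(x) = 0.5 * ||x||^2 and K = unit ball.

    grad_fn(t, x) returns the gradient of f_t at x.  With this regularizer,
    argmin_{x in K} (eta * <g_sum, x> + R(x)) is the projection of
    -eta * g_sum onto K.
    """
    x = np.zeros(dim)                # x_1 = argmin_K R(x)
    g_sum = np.zeros(dim)
    xs = []
    for t in range(1, T + 1):
        xs.append(x.copy())          # predict x_t
        g_sum += grad_fn(t, x)       # observe f_t and accumulate its gradient
        x = project_ball(-eta * g_sum)
    return xs

# Toy run: f_t(x) = ||x - b_t||^2 for random targets b_t.
rng = np.random.default_rng(0)
targets = rng.normal(size=(50, 3)) * 0.3
xs = rftl(lambda t, x: 2.0 * (x - targets[t - 1]), T=50, dim=3)
print("last iterate:", np.round(xs[-1], 3))
```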
5.2.2 The regret bound
Theorem 5.1
- $R$,
- convex function
In Algorithm 10, for every $u \in \mathcal{K}$,
\[\mathrm{Regret}_{T} \le 2 \eta \sum_{t=1}^{T} \left( \norm{ \nabla f_{t}(x_{t}) }_{z_{t}, R}^{*} \right)^{2} + \frac{ R(u) - R(x_{1}) }{ \eta } ,\]where $z_{t}$ is an intermediate point between $x_{t}$ and $x_{t + 1}$ defined in the proof below.
proof
Let
\[\begin{eqnarray} \Psi_{t}(x) & := & \left( \eta \sum_{s=1}^{t} \nabla f_{s}(x_{s})^{\mathrm{T}} x + R(x) \right) \nonumber \\ \nabla \Psi_{t}(x) & = & \left( \eta \sum_{s=1}^{t} \nabla f_{s}(x_{s}) + \nabla R(x) \right) \nonumber \\ \nabla^{2} \Psi_{t}(x) & = & \nabla^{2} R(x) \nonumber \end{eqnarray} .\]By the Taylor expansion, for some \(z_{t} \in \{x_{t} + c(x_{t+1} - x_{t}) \mid c \in [0, 1]\}\),
\[\begin{eqnarray} \Psi_{t}(x_{t}) & = & \left( \eta \sum_{s=1}^{t} \nabla f_{s}(x_{s})^{\mathrm{T}} x_{t} + R(x_{t}) \right) \nonumber \\ & = & \eta \sum_{s=1}^{t} \nabla f_{s}(x_{s})^{\mathrm{T}} x_{t + 1} + \eta \sum_{s=1}^{t} \nabla f_{s}(x_{s})^{\mathrm{T}} (x_{t} - x_{t + 1}) + R(x_{t}) \nonumber \\ & = & \eta \sum_{s=1}^{t} \nabla f_{s}(x_{s})^{\mathrm{T}} x_{t + 1} + \eta \sum_{s=1}^{t} \nabla f_{s}(x_{s})^{\mathrm{T}} (x_{t} - x_{t + 1}) + R(x_{t + 1}) + \nabla R(x_{t + 1})^{\mathrm{T}} (x_{t} - x_{t + 1}) + \frac{1}{2} (x_{t} - x_{t + 1})^{\mathrm{T}} \nabla^{2} R(z_{t}) (x_{t} - x_{t + 1}) \nonumber \\ & = & \Psi_{t}(x_{t + 1}) + \eta \sum_{s=1}^{t} \nabla f_{s}(x_{s})^{\mathrm{T}} (x_{t} - x_{t + 1}) + \nabla R(x_{t + 1})^{\mathrm{T}} (x_{t} - x_{t + 1}) + B_{R}(x_{t} \parallel x_{t + 1}) \nonumber \\ & = & \Psi_{t}(x_{t + 1}) + \nabla \Psi_{t}(x_{t + 1})^{\mathrm{T}} (x_{t} - x_{t + 1}) + B_{R}(x_{t} \parallel x_{t + 1}) \nonumber \\ & \ge & \Psi_{t}(x_{t + 1}) + B_{R}(x_{t} \parallel x_{t + 1}) \quad (\because \text{theorem 2.2 KKT condition}) \end{eqnarray}\]The last inequality holds since $x_{t + 1}$ is the minimum point over $\mathcal{K}$ in Algorithm 10. Thus,
\[\begin{eqnarray} B_{R}(x_{t} \parallel x_{t + 1}) & \le & \Psi_{t}(x_{t}) - \Psi_{t}(x_{t + 1}) \nonumber \\ & = & \eta \sum_{s=1}^{t-1} \nabla f_{s}(x_{s})^{\mathrm{T}} x_{t} + R(x_{t}) + \eta \nabla f_{t}(x_{t})^{\mathrm{T}} x_{t} - \eta \sum_{s=1}^{t-1} \nabla f_{s}(x_{s})^{\mathrm{T}} x_{t+1} - R(x_{t+1}) - \eta \nabla f_{t}(x_{t})^{\mathrm{T}} x_{t+1} \nonumber \\ & = & \Psi_{t - 1}(x_{t}) - \Psi_{t - 1}(x_{t+1}) + \eta \nabla f_{t}(x_{t})^{\mathrm{T}} \left( x_{t} - x_{t + 1} \right) \nonumber \\ & \le & \eta \nabla f_{t}(x_{t})^{\mathrm{T}} \left( x_{t} - x_{t + 1} \right) \quad (\because x_{t}\text{ is the minimizer of }\Psi_{t -1}) \label{equation_05_02} . \end{eqnarray}\]We use the norm defined in \(\eqref{definition_norm_local}\),
\[\begin{eqnarray} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x_{t + 1}) & \le & \norm{ \nabla f_{t}(x_{t}) }_{z_{t}, R}^{*} \norm{ (x_{t} - x_{t + 1}) }_{z_{t}, R} \nonumber \\ & = & \norm{ \nabla f_{t}(x_{t}) }_{z_{t}, R}^{*} \sqrt{ 2 B_{R}(x_{t} \parallel x_{t+1}) } \quad (\because \eqref{equation_bregman_divergence_relation_to_local_norm}) \nonumber \\ & \le & \norm{ \nabla f_{t}(x_{t}) }_{z_{t}, R}^{*} \sqrt{ 2 \eta \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x_{t + 1}) } \quad (\because \eqref{equation_05_02}) \end{eqnarray}\]Thus,
\[\begin{eqnarray} & & \sqrt{ \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x_{t + 1}) } \le \norm{ \nabla f_{t}(x_{t}) }_{z_{t}, R}^{*} \sqrt{ 2 \eta } \nonumber \\ & \Leftrightarrow & \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x_{t + 1}) \le \left( \norm{ \nabla f_{t}(x_{t}) }_{z_{t}, R}^{*} \right)^{2} 2 \eta \label{equation_05_02_modified} \end{eqnarray}\]By lemma 5.2,
\[\begin{eqnarray} \mathrm{Regret}_{T} & \le & \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x_{t + 1}) + \frac{ R(u) - R(x_{1}) }{ \eta } \quad (\because \text{lemma 5.2}) \nonumber \\ & \le & 2 \eta \sum_{t=1}^{T} \left( \norm{ \nabla f_{t}(x_{t}) }_{z_{t}, R}^{*} \right)^{2} + \frac{ R(u) - R(x_{1}) }{ \eta } \quad \eqref{equation_05_02_modified} \nonumber \end{eqnarray}\]Lemma 5.2
- $u \in \mathcal{K}$,
In Algorithm 10,
\[\begin{eqnarray} \sum_{t=1}^{T} f_{t}(x_{t}) - \sum_{t=1}^{T} f_{t}(u) & \le & \mathrm{Regret}_{T} \nonumber \\ & \le & \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x_{t+1}) + \frac{ R(u) - R(x_{1}) }{ \eta } \nonumber \\ & \le & \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x_{t+1}) + \frac{ D_{R}^{2} }{ \eta } . \nonumber \end{eqnarray}\]proof
Let
\[\begin{eqnarray} g_{0}(x) & := & \frac{1}{\eta} R(x) \nonumber \\ g_{t}(x) & := & \nabla f_{t}(x_{t})^{\mathrm{T}} x . \nonumber \end{eqnarray}\]By convexity of $f$,
\[\begin{eqnarray} \sum_{t=1}^{T} f_{t}(x_{t}) - \sum_{t=1}^{T} f_{t}(x^{*}) & \le & \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x^{*}) \quad (\because \text{convexity of }f) \nonumber \\ & = & \sum_{t=1}^{T} \left( g_{t}(x_{t}) - g_{t}(x^{*}) \right) \nonumber \end{eqnarray}\]where \(x^{*} := \arg\min_{x \in \mathcal{K}}\sum_{t=1}^{T}f_{t}(x)\). Hence it suffices to show the following equation.
\[\begin{eqnarray} \sum_{t=1}^{T} g_{t}(x_{t}) - \sum_{t=1}^{T} g_{t}(x^{*}) \le \sum_{t=1}^{T} \left( g_{t}(x_{t}) - g_{t}(x_{t + 1}) \right) + \frac{D_{R}^{2}}{\eta} . \nonumber \end{eqnarray}\] \[\begin{eqnarray} \sum_{t=1}^{T} g_{t}(x_{t}) - \sum_{t=1}^{T} g_{t}(x^{*}) & \le & \sum_{t=1}^{T} g_{t}(x_{t}) - \sum_{t=1}^{T} g_{t}(x^{*}) + \sum_{t=0}^{T} \left( g_{t}(x^{*}) - g_{t}(x_{t + 1}) \right) \quad (\because \text{lemma 5.3}) \nonumber \\ & = & \sum_{t=1}^{T} g_{t}(x_{t}) - \sum_{t=1}^{T} g_{t}(x_{t + 1}) + g_{0}(x^{*}) - g_{0}(x_{1}) \nonumber \\ & = & \sum_{t=1}^{T} \left( g_{t}(x_{t}) - g_{t}(x_{t + 1}) \right) + \frac{1}{\eta} \left( R(x^{*}) - R(x_{1}) \right) \nonumber \\ & \le & \sum_{t=1}^{T} \left( g_{t}(x_{t}) - g_{t}(x_{t + 1}) \right) + \frac{1}{\eta} D_{R}^{2} \nonumber \end{eqnarray}\]lemma 5.3
- $u \in \mathcal{K}$,
Then
\[\sum_{t=0}^{T} g_{t}(u) \ge \sum_{t=0}^{T} g_{t}(x_{t + 1}) .\]
proof.
We will show by induction on $T$.
Suppose $T=0$. Since \(x_{1} := \arg\min_{x \in \mathcal{K}}R(x)\),
\[\begin{eqnarray} & & g_{0}(u) \ge g_{0}(x_{1}) \nonumber \\ & \Leftrightarrow & R(u) \ge R(x_{1}) . \nonumber \end{eqnarray}\]Suppose that the inequality holds up to $T$. We will show that it holds for $T + 1$. Since \(x_{T + 2} := \arg\min_{x \in \mathcal{K}} \sum_{t=0}^{T + 1} g_{t}(x)\),
\[\begin{eqnarray} \sum_{t=0}^{T + 1} g_{t}(u) & \ge & \sum_{t=0}^{T + 1} g_{t}(x_{T + 2}) \nonumber \\ & = & \sum_{t=0}^{T} g_{t}(x_{T + 2}) + g_{T+1}(x_{T + 2}) \nonumber \\ & \ge & \sum_{t=0}^{T} g_{t}(x_{t + 1}) + g_{T+1}(x_{T + 2}) \quad (\because \text{by induction}) \nonumber \\ & = & \sum_{t=0}^{T + 1} g_{t}(x_{t + 1}) . \nonumber \end{eqnarray}\]5.3 Online Mirrored Descent
Algorithm 11 Lazy/Agile Online Mirrored Descent
- $\eta > 0$,
- $\mathcal{K}$,
- convex compact
- $R: \mathcal{K} \rightarrow \mathbb{R}$,
- regularization function
- $y_{1}$,
- $y_{1}$ satisfies $\nabla R(y_{1}) = 0$,
Step1. for $t=1$ to $T$
Step2. Play $x_{t}$
Step3. Observe $f_{t}(x_{t})$,
Step4. Update $y_{t}$ according to the rule (a) or (b),
Rule for lazy OMD
\[\begin{eqnarray} \nabla R(y_{t + 1}) & = & \nabla R(y_{t}) - \eta \nabla f_{t}(x_{t}) \end{eqnarray}\]Rule for Agile OMD
\[\begin{eqnarray} \nabla R(y_{t + 1}) & = & \nabla R(x_{t}) - \eta \nabla f_{t}(x_{t}) \end{eqnarray}\]Step5. Project according to $B_{R}:$
\[\begin{eqnarray} x_{t + 1} := \arg\min_{x \in \mathcal{K}} B_{R}(x \parallel y_{t + 1}) \nonumber \end{eqnarray}\]Step6. end for
Step7. Return $x_{T + 1}$,
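A minimal sketch of Algorithm 11, again with $R(x) = \frac{1}{2} \norm{x}_{2}^{2}$ and $\mathcal{K}$ the unit ball (my simplifying choices), so $\nabla R$ is the identity and the Bregman projection is the Euclidean projection. The two update rules differ only in whether the gradient step starts from $y_{t}$ or from $x_{t}$.

```python
import numpy as np

def project_ball(y, radius=1.0):
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def omd(grad_fn, T, dim, eta=0.1, lazy=True):
    """Online mirror descent (Algorithm 11) with R(x) = 0.5 * ||x||^2.

    Since nabla R(y) = y, the mirror step is a plain gradient step and the
    Bregman projection reduces to Euclidean projection onto K (unit ball).
    lazy=True  uses rule (a): y_{t+1} = y_t - eta * grad,
    lazy=False uses rule (b): y_{t+1} = x_t - eta * grad.
    """
    y = np.zeros(dim)                 # y_1 with nabla R(y_1) = 0
    xs = []
    for t in range(1, T + 1):
        x = project_ball(y)           # x_t = argmin_K B_R(x || y_t)
        xs.append(x)
        g = grad_fn(t, x)             # observe f_t
        y = (y if lazy else x) - eta * g
    return xs

rng = np.random.default_rng(1)
targets = rng.normal(size=(50, 3)) * 0.3
for lazy in (True, False):
    xs = omd(lambda t, x: 2.0 * (x - targets[t - 1]), T=50, dim=3, lazy=lazy)
    print("lazy" if lazy else "agile", np.round(xs[-1], 3))
```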
5.3.1 Equivalence of lazy OMD and RFTL.
Lemma 5.4
- $f_{1}, \ldots, f_{T}$,
- linear cost functions
- TODO:Is this assumption required?
- \(\{y_{t}\}\),
- produced by Algorithm 11
- \(\{x_{t}\}\),
- produced by Algorithm 10
Then both algorithms produce identical predictions, that is, for all $t$, \(\arg\min_{x \in \mathcal{K}} B_{R}(x \parallel y_{t}) = \arg\min_{x \in \mathcal{K}} \left( \eta \sum_{s=1}^{t-1} \nabla f_{s}(x_{s})^{\mathrm{T}} x + R(x) \right)\).
proof
In Lazy OMD,
\[\begin{eqnarray} \nabla R(y_{t}) & = & \nabla R(y_{t - 1}) - \eta \nabla f_{t-1}(x_{t-1}) \nonumber \\ & = & \nabla R(y_{1}) - \sum_{s=1}^{t-1} \eta \nabla f_{s}(x_{s}) \nonumber \\ & = & - \sum_{s=1}^{t-1} \eta \nabla f_{s}(x_{s}) . \nonumber \end{eqnarray}\]Thus,
\[\begin{eqnarray} B_{R}(x \parallel y_{t}) & = & R(x) - R(y_{t}) - (\nabla R(y_{t}))^{\mathrm{T}} (x - y_{t}) \nonumber \\ & = & R(x) - R(y_{t}) + \eta \sum_{s=1}^{t-1} \nabla f_{s}(x_{s})^{\mathrm{T}} (x - y_{t}) \nonumber \\ & = & R(x) + \eta \sum_{s=1}^{t-1} \nabla f_{s}(x_{s})^{\mathrm{T}} x - \left( R(y_{t}) + \eta \sum_{s=1}^{t-1} \nabla f_{s}(x_{s})^{\mathrm{T}} y_{t} \right) \nonumber \end{eqnarray}\]Since the third term is independent of $x$,
\[\begin{eqnarray} \arg\min_{x \in \mathcal{K}} \left( R(x) + \eta \sum_{s=1}^{t-1} \nabla f_{s}(x_{s})^{\mathrm{T}} x - \left( R(y_{t}) + \eta \sum_{s=1}^{t-1} \nabla f_{s}(x_{s})^{\mathrm{T}} y_{t} \right) \right) & = & \arg\min_{x \in \mathcal{K}} \left( R(x) + \eta \sum_{s=1}^{t-1} \nabla f_{s}(x_{s})^{\mathrm{T}} x \right) . \nonumber \end{eqnarray}\]5.4 Application and special cases
5.4.1 Deriving online gradient descent
We can derive the online gradient descent algorithm by taking
\[R(x) := \frac{1}{2} \norm{ x - x_{0} }_{2}^{2}\]where $x_{0} \in \mathcal{K}$ is a parameter of the online gradient descent algorithm. It is easy to confirm that \(\nabla R(x) = x - x_{0}\). Moreover, we have
\[\begin{eqnarray} B_{R}(x \parallel y_{t}) & = & \frac{1}{2} \norm{x - x_{0}}_{2}^{2} - \frac{1}{2} \norm{y_{t} - x_{0}}_{2}^{2} - (y_{t} - x_{0})^{\mathrm{T}} (x - y_{t}) \nonumber \\ & = & \frac{1}{2} \norm{x - y_{t}}_{2}^{2} . \nonumber \end{eqnarray}\]Thus, the minimizer of \(B_{R}(x \parallel y_{t})\) over $\mathcal{K}$ is the projection of $y_{t}$ onto $\mathcal{K}$.
In Lazy OMD,
\[\begin{eqnarray} x_{t} & = & \Pi_{\mathcal{K}}(y_{t}) \nonumber \\ y_{t} & = & y_{t - 1} - \eta \nabla f_{t-1}(x_{t-1}) \nonumber \end{eqnarray}\]In agile OMD,
\[\begin{eqnarray} x_{t} & = & \Pi_{\mathcal{K}}(y_{t}) \nonumber \\ y_{t} & = & x_{t - 1} - \eta \nabla f_{t-1}(x_{t-1}) . \nonumber \end{eqnarray}\]Agile OMD with the above regularization function is identical to the online gradient descent algorithm.
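A small numerical check of this identification (with $x_{0} = 0$ and $\mathcal{K}$ the unit ball, my own choices): the agile OMD iterates and the projected online gradient descent iterates coincide step by step.

```python
import numpy as np

def project_ball(y, radius=1.0):
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

rng = np.random.default_rng(2)
targets = rng.normal(size=(30, 3)) * 0.3
grad = lambda t, x: 2.0 * (x - targets[t - 1])   # gradient of f_t(x) = ||x - b_t||^2
eta, dim, T = 0.1, 3, 30

y = np.zeros(dim)        # agile OMD state, nabla R(y_1) = y_1 - x0 = 0
x_ogd = np.zeros(dim)    # OGD starts at x0 = 0
for t in range(1, T + 1):
    x_omd = project_ball(y)                   # Bregman projection = Euclidean projection
    assert np.allclose(x_omd, x_ogd)          # identical predictions every round
    g = grad(t, x_omd)
    y = x_omd - eta * g                       # agile rule (b)
    x_ogd = project_ball(x_ogd - eta * g)     # online gradient descent step
print("agile OMD and online gradient descent produce identical iterates")
```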
5.4.2 Deriving multiplicative updates
Let
\[\begin{eqnarray} R(x) & = & \sum_{i=1}^{n} x_{i} \log x_{i} \nonumber \\ \nabla R(x) & = & \left( \begin{array}{c} 1 + \log x_{1} \\ \vdots \\ 1 + \log x_{n} \end{array} \right) \nonumber \end{eqnarray}\]TODO.
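As an informal fill-in for the TODO, a sketch of the update this entropic regularizer produces (a standard fact about mirror descent with negative entropy on the simplex): since $\nabla R(x)_{i} = 1 + \log x_{i}$, the agile mirror step gives $y_{t+1, i} = x_{t, i} e^{-\eta \nabla f_{t}(x_{t})_{i}}$, and the Bregman (KL) projection onto $\Delta_{n}$ is renormalization, i.e. a multiplicative-weights update.

```python
import numpy as np

def multiplicative_weights(grad_fn, T, n, eta=0.1):
    """Agile OMD with R(x) = sum_i x_i log x_i over the simplex Delta_n.

    Mirror step:  y_{t+1, i} = x_{t, i} * exp(-eta * g_{t, i})
    Projection:   KL projection onto the simplex = renormalization.
    """
    x = np.full(n, 1.0 / n)           # uniform start = argmin of R on the simplex
    xs = []
    for t in range(1, T + 1):
        xs.append(x.copy())
        g = grad_fn(t, x)
        y = x * np.exp(-eta * g)      # multiplicative update
        x = y / y.sum()               # renormalize onto Delta_n
    return xs

# Toy run with linear losses g_t in [0, 1]^n: the weight mass drifts toward
# the coordinate with the smallest cumulative loss.
rng = np.random.default_rng(3)
costs = rng.uniform(size=(200, 5))
costs[:, 2] *= 0.5                    # coordinate 2 is better on average
xs = multiplicative_weights(lambda t, x: costs[t - 1], T=200, n=5, eta=0.3)
print("final weights:", np.round(xs[-1], 3))
```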
5.5 Randomized regularization
5.5.1 Perturbation for convex losses
Definition
- $N$,
- $\mathbb{R}^{n}$-valued r.v.
- $F_{N}$ is c.d.f. of $N$,
- $a \in \mathbb{R}$,
- $\sigma \in \mathbb{R}$,
- $L \in \mathbb{R}$,
$F_{N}$ is said to be $(\sigma_{a}, L_{a})$ stable with respect to the norm $\norm{\cdot}_{a}$ if
\[\begin{eqnarray} \mathrm{E} \left[ \norm{ N }_{a}^{*} \right] & = & \sigma_{a} \nonumber \\ \int_{\mathbb{R}^{n}} \abs{ F_{N}(n) - F_{N}(n - u) } \ dn & \le & L_{a} \norm{u}_{a}^{*} \quad \forall u \in \mathbb{R}^{n} \nonumber \end{eqnarray}\]If $N$ is uniformly distributed on $[0, 1]^{n}$ and the norm is the Euclidean norm,
\[\begin{eqnarray} \mathrm{E} \left[ \norm{N}_{2}^{*} \right] & = & \mathrm{E} \left[ \max_{\norm{x}_{2} \le 1} \langle x, N \rangle \right] \nonumber \\ & \le & \mathrm{E} \left[ \max_{\norm{x}_{2} \le 1} \norm{x}_{2} \norm{N}_{2} \right] \nonumber \\ & \le & \mathrm{E} \left[ \norm{N}_{2} \right] \nonumber \\ & = & \int_{[0, 1]^{n}} \sqrt{ \sum_{i=1}^{n} n_{i}^{2} } \ dn \nonumber \\ & \le & \sqrt{n} \nonumber \end{eqnarray}\]Algorithm 13 Follow-the-perturbed-leader for convex losses
- $\eta > 0$,
- $N$,
- r.v.
- $\mathcal{K}$,
- $x_{1} \in \mathcal{K}$,
Step1. for $t=1$ to $T$
Step2. Predict $x_{t}$
Step3. Observe $f_{t} \in \mathcal{F}$ and suffer loss $f_{t}(x_{t})$.
Step4. Update
\[\begin{eqnarray} x_{t + 1} := \mathrm{E} \left[ \arg\min_{x \in \mathcal{K}} \left( \eta \sum_{s=1}^{t} \nabla f_{s}(x_{s})^{\mathrm{T}} x + N^{\mathrm{T}} x \right) \right] \label{equation_05_03} \end{eqnarray}\]Step5. end for
Step6. return $x_{T+1}$
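A sketch of Algorithm 13 under my own simplifications: $\mathcal{K}$ is the Euclidean unit ball (so the perturbed $\arg\min$ of a linear objective has a closed form), $N$ is uniform on $[0, 1]^{n}$ as in the example above, and the expectation in \eqref{equation_05_03} is approximated by a Monte Carlo average over fresh samples.

```python
import numpy as np

def linear_argmin_ball(c):
    """argmin_{||x||_2 <= 1} c^T x = -c / ||c||_2."""
    n = np.linalg.norm(c)
    return np.zeros_like(c) if n == 0 else -c / n

def fpl_convex(grad_fn, T, dim, eta=0.1, n_samples=200, rng=None):
    """Sketch of Algorithm 13 on K = unit ball with N ~ Uniform[0, 1]^dim.

    The expectation over N in the update is replaced by a sample average;
    the average stays in K because K is convex.
    """
    rng = rng or np.random.default_rng(0)
    x = np.zeros(dim)                 # x_1
    g_sum = np.zeros(dim)
    xs = []
    for t in range(1, T + 1):
        xs.append(x.copy())
        g_sum += grad_fn(t, x)        # observe f_t
        noise = rng.uniform(size=(n_samples, dim))
        x = np.mean([linear_argmin_ball(eta * g_sum + nz) for nz in noise], axis=0)
    return xs

rng = np.random.default_rng(4)
targets = rng.normal(size=(50, 3)) * 0.3
xs = fpl_convex(lambda t, x: 2.0 * (x - targets[t - 1]), T=50, dim=3, rng=rng)
print("last prediction:", np.round(xs[-1], 3))
```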
Lemma
- $\mathcal{K} \subseteq \mathbb{R}^{n}$,
- $g_{t}: \mathcal{K} \times \Omega \rightarrow \mathbb{R} \ (t=0, \ldots, T)$,
- function-valued r.v.
- convex
- $x_{0} \in \mathcal{K}$,
Then for all $u \in \mathcal{K}$,
\[\begin{eqnarray} \mathrm{E} \left[ \sum_{t=0}^{T} g_{t}(u) \right] & \ge & \mathrm{E} \left[ \sum_{t=0}^{T} g_{t}(x_{t + 1}) \right] \nonumber \end{eqnarray}\]proof
Suppose that $T=1$. By the definitions of $x_{1}$ and $x_{2}$,
\[\begin{eqnarray} \mathrm{E} \left[ g_{0}(u) + g_{1}(u) \right] & \ge & \mathrm{E} \left[ g_{0}(x_{2}) + g_{1}(x_{2}) \right] \nonumber \\ & \ge & \mathrm{E} \left[ g_{0}(x_{1}) + g_{1}(x_{2}) \right] \end{eqnarray} .\]Suppose that the inequality holds up to $T$.
\[\begin{eqnarray} \mathrm{E} \left[ \sum_{t=0}^{T+1} g_{t}(u) \right] & \ge & \mathrm{E} \left[ \sum_{t=0}^{T+1} g_{t}(x_{T + 2}) \right] \nonumber \\ & = & \mathrm{E} \left[ \sum_{t=0}^{T} g_{t}(x_{T + 2}) \right] + \mathrm{E} \left[ g_{T+1}(x_{T + 2}) \right] \nonumber \\ & \ge & \mathrm{E} \left[ \sum_{t=0}^{T} g_{t}(x_{t + 1}) \right] + \mathrm{E} \left[ g_{T+1}(x_{T + 2}) \right] \quad (\because \text{by assumption of induction}) \nonumber \\ & = & \mathrm{E} \left[ \sum_{t=0}^{T+1} g_{t}(x_{t + 1}) \right] \nonumber \end{eqnarray}\]Theorem 5.6
- $\sigma \in \mathbb{R}$,
- $L \in \mathbb{R}$,
- $N$,
- $N$ is $(\sigma, L)$-stable with respect to the norm $\norm{\cdot}_{a}$,
- $x_{t}$,
- \(\eqref{equation_05_03}\),
In Algorithm 13,
\[\mathrm{Regret}_{T} \le \eta D (G^{*})^{2} LT + \frac{1}{\eta} \sigma D\]proof
Let
\[\begin{eqnarray} g_{0}(x) & := & \frac{1}{\eta} N_{0}^{\mathrm{T}} x \nonumber \\ g_{t}(x) & := & \nabla f_{t}(x_{t})^{\mathrm{T}} x \nonumber \end{eqnarray} .\]By the lemma above, for $T > 0$,
\[\begin{eqnarray} \mathrm{E} \left[ \sum_{t=0}^{T} g_{t}(u) \right] \ge \mathrm{E} \left[ \sum_{t=0}^{T} g_{t}(x_{t + 1}) \right] . \nonumber \end{eqnarray}\] \[\begin{eqnarray} \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - u) & = & \sum_{t=1}^{T} g_{t}(x_{t}) - \sum_{t=1}^{T} g_{t}(u) \nonumber \\ & \le & \sum_{t=1}^{T} g_{t}(x_{t}) - \sum_{t=1}^{T} g_{t}(u) + \sum_{t=0}^{T} g_{t}(u) - \sum_{t=0}^{T} g_{t}(x_{t + 1}) \nonumber \\ & = & \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x_{t + 1}) + \frac{1}{\eta} N_{0}^{\mathrm{T}} \left( u - x_{1} \right) \nonumber \\ & \le & \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x_{t + 1}) + \frac{1}{\eta} \norm{N_{0}}^{*} \norm{ u - x_{1} } \nonumber \\ & \le & \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x_{t + 1}) + \frac{1}{\eta} \norm{N_{0}}^{*} D \nonumber \end{eqnarray}\]Hence
\[\begin{eqnarray} \mathrm{E} \left[ \sum_{t=1}^{T} f_{t}(x_{t}) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_{t}(x) \right] & \le & \mathrm{E} \left[ \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - u) \right] \nonumber \\ & \le & \mathrm{E} \left[ \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x_{t + 1}) + \frac{1}{\eta} \norm{N_{0}}^{*} D \right] \nonumber \\ & \le & \mathrm{E} \left[ \sum_{t=1}^{T} \norm{ \nabla f_{t}(x_{t}) }^{*} \norm{ x_{t} - x_{t + 1} } \right] + \frac{1}{\eta} \sigma D \nonumber \\ & \le & G^{*} \, \mathrm{E} \left[ \sum_{t=1}^{T} \norm{ x_{t} - x_{t + 1} } \right] + \frac{1}{\eta} \sigma D \end{eqnarray}\]where $G^{*}$ is an upper bound on $\norm{ \nabla f_{t}(x_{t}) }^{*}$. It remains to show that $\norm{x_{t} - x_{t+1}} \in O(\eta)$. Let
\[h_{t}(N) := \arg\min_{x \in \mathcal{K}} \left( \eta \sum_{s=1}^{t-1} \nabla f_{s}(x_{s})^{\mathrm{T}} x + N^{\mathrm{T}} x \right) .\]Then $x_{t} = \mathrm{E} \left[ h_{t}(N) \right]$ and $h_{t + 1}(N) = h_{t}(N + \eta \nabla f_{t}(x_{t}))$, so
\[\begin{eqnarray} \norm{ x_{t} - x_{t + 1} } & = & \norm{ \mathrm{E} \left[ h_{t}(N) \right] - \mathrm{E} \left[ h_{t}(N + \eta \nabla f_{t}(x_{t})) \right] } \nonumber \\ & = & \norm{ \int_{\mathbb{R}^{n}} \left( h_{t}(n) - x_{1} \right) \left( F_{N}(n) - F_{N}(n - \eta \nabla f_{t}(x_{t})) \right) \ dn } \quad (\because \text{the density difference integrates to } 0) \nonumber \\ & \le & D \int_{\mathbb{R}^{n}} \abs{ F_{N}(n) - F_{N}(n - \eta \nabla f_{t}(x_{t})) } \ dn \nonumber \\ & \le & \eta D L \norm{ \nabla f_{t}(x_{t}) }^{*} \quad (\because N \text{ is } (\sigma, L)\text{-stable}) \nonumber \end{eqnarray}\]5.5.2 Perturbation for linear cost functions
Here we assume cost functions are linear; $f_{t}(x) = g_{t}^{\mathrm{T}}x$.
\[\begin{eqnarray} w_{t}(N) := \arg\min_{x \in \mathcal{K}} \left( \eta \sum_{s=1}^{t} g_{s}^{\mathrm{T}} x + N^{\mathrm{T}}x \right) \nonumber . \end{eqnarray}\]By linearity of the expectation,
\[f_{t}(x_{t}) = f_{t} \left( \mathrm{E} \left[ w_{t}(N) \right] \right) = \mathrm{E} \left[ f_{t} \left( w_{t}(N) \right) \right] .\]Algorithm 14 FPL for linear losses
- $\eta > 0$,
- $N$,
- $N_{0}$,
- sample of $N$ \(\hat{x}_{1} := \arg\min_{x \in \mathcal{K}} -N_{0}^{\mathrm{T}} x .\)
Step1. for $t=1$ to $T$ do
Step2. Predict $\hat{x}_{t}$
Step3. Observe the loss function $f_{t}$ and suffer loss $f_{t}(\hat{x}_{t})$
Step4. Update
\[\begin{eqnarray} \hat{x}_{t + 1} := \arg\min_{x \in \mathcal{K}} \left( \eta \sum_{s=1}^{t} g_{s}^{\mathrm{T}} x + N_{0}^{\mathrm{T}} x \right) \end{eqnarray}\]Step5. end for
Step6. Return $\hat{x}_{T + 1}$.
Corollary 5.7
proof
5.5.3 Follow-the-perturbed-leader for expert advice
\[\begin{eqnarray} \mathcal{F} := \left\{ f(x) := g^{\mathrm{T}}x \mid g \in [0, 1]^{n} \right\} \label{definition_cost_function_algorithm_15} \end{eqnarray} .\]Algorithm 15 FPL for prediction from expert advice
- $\eta > 0$,
- $N_{0}$,
- $\mathbb{R}^{n}$-valued r.v. whose components are i.i.d. with density proportional to $e^{-\eta x}$ (the exponential distribution),
Step1. for $t=1$ to $T$
Step2. Predict using expert $i_{t} := \arg\max_{i} \hat{x}_{t, i}$. See the Remark below.
Step3. Observe the loss $f_{t} \in \mathcal{F}$ and suffer loss $f_{t}(\hat{x}_{t})$
Draw an independent sample $N_{t}$ with density proportional to $e^{-\eta x}$,
Step4. Update
\[\begin{eqnarray} \hat{x}_{t + 1} & := & \arg\min_{x \in \Delta_{n}} \left( \sum_{s=1}^{t} f_{s}(x) - N_{t}^{\mathrm{T}} x \right) \nonumber \\ & = & \arg\min_{x \in \Delta_{n}} \left( \sum_{s=1}^{t} g_{s}^{\mathrm{T}} x - N_{t}^{\mathrm{T}} x \right) \nonumber \end{eqnarray}\]Step5. end for
Step6. Return $\hat{x}_{T +1}$.
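A sketch of Algorithm 15 on synthetic expert losses (the data is made up for illustration): since the perturbed minimizer over $\Delta_{n}$ is a vertex, the update reduces to picking the expert with the smallest perturbed cumulative loss.

```python
import numpy as np

def fpl_experts(loss_matrix, eta=0.1, rng=None):
    """Follow the perturbed leader over n experts (sketch of Algorithm 15).

    loss_matrix[t, i] in [0, 1] is the loss of expert i at round t.  Each round
    the chosen expert minimizes (cumulative loss - exponential perturbation),
    which is exactly the vertex solving the Step 4 update over the simplex.
    """
    rng = rng or np.random.default_rng(0)
    T, n = loss_matrix.shape
    cum = np.zeros(n)
    total = 0.0
    expert = int(rng.integers(n))              # arbitrary first prediction
    for t in range(T):
        total += loss_matrix[t, expert]        # suffer f_t(x_hat_t)
        cum += loss_matrix[t]                  # observe g_t
        noise = rng.exponential(scale=1.0 / eta, size=n)
        expert = int(np.argmin(cum - noise))   # perturbed leader
    return total, cum.min()

rng = np.random.default_rng(5)
losses = rng.uniform(size=(500, 10))
losses[:, 0] *= 0.6                            # expert 0 is best on average
alg_loss, best = fpl_experts(losses, eta=0.2, rng=rng)
print(f"FPL loss: {alg_loss:.1f}, best expert in hindsight: {best:.1f}")
```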
Remark
- $y \in [0, 1]^{n}$,
Since a linear function attains its minimum over the simplex $\Delta_{n}$ at a vertex, each $\hat{x}_{t}$ is a standard basis vector, and predicting with $\hat{x}_{t}$ amounts to following the single expert $i_{t}$.
Theorem 5.8
\[(1 - \eta) \mathrm{E} \left[ \sum_{t=1}^{T} g_{t}^{\mathrm{T}} \hat{x}_{t} \right] \le \min_{x^{*} \in \Delta_{n}} \sum_{t=1}^{T} g_{t}^{\mathrm{T}} x^{*} + \frac{ 4 \log n }{ \eta } .\]proof
Let $g_{0} := -N$. We will show that
\[\begin{eqnarray} \mathrm{E} \left[ \sum_{t=0}^{T} g_{t}^{\mathrm{T}} u \right] \ge \mathrm{E} \left[ \sum_{t=0}^{T} g_{t}^{\mathrm{T}} \hat{x}_{t + 1} \right] \label{theorem_05_08_inequality_g} . \end{eqnarray}\]In case of $T = 0$, by definition of $x_{1}$,
\[\begin{eqnarray} \mathrm{E} \left[ -N_{0}^{\mathrm{T}} u \right] \ge \mathrm{E} \left[ -N_{0}^{\mathrm{T}} x_{1} \right] . \nonumber \end{eqnarray}\]In case of $T = 1$,
\[\begin{eqnarray} \mathrm{E} \left[ -N_{0}^{\mathrm{T}} u + g_{1}^{\mathrm{T}} u \right] & \ge & \mathrm{E} \left[ -N_{0}^{\mathrm{T}} x_{2} + g_{1}^{\mathrm{T}} x_{2} \right] \quad (\because \text{by definition of }x_{2}) \nonumber \\ & \ge & \mathrm{E} \left[ -N_{0}^{\mathrm{T}} x_{1} + g_{1}^{\mathrm{T}} x_{2} \right] \quad (\because \text{by definition of }x_{1}) \nonumber \end{eqnarray}\]Now, suppose that the equation holds up to $T - 1$.
\[\begin{eqnarray} \mathrm{E} \left[ -N_{0}^{\mathrm{T}} u + \sum_{s=1}^{T} g_{s}^{\mathrm{T}} u \right] & = & \mathrm{E} \left[ -N_{T}^{\mathrm{T}} u + \sum_{s=1}^{T} g_{s}^{\mathrm{T}} u \right] \quad (\because \text{i.i.d.}) \nonumber \\ & \ge & \mathrm{E} \left[ -N_{T}^{\mathrm{T}} x_{T + 1} + \sum_{s=1}^{T} g_{s}^{\mathrm{T}} x_{T + 1} \right] \quad (\because \text{by definition of }x_{T + 1}) \nonumber \\ & \ge & \mathrm{E} \left[ -N_{T}^{\mathrm{T}} x_{1} + \sum_{s=1}^{T-1} g_{s}^{\mathrm{T}} x_{s + 1} + g_{T}^{\mathrm{T}} x_{T + 1} \right] \quad (\because \text{by assumption of induction}) \nonumber \\ & = & \mathrm{E} \left[ -N_{0}^{\mathrm{T}} x_{1} + \sum_{s=1}^{T} g_{s}^{\mathrm{T}} x_{s + 1} \right] \quad (\because \text{i.i.d.}) . \nonumber \end{eqnarray}\]This proves \(\eqref{theorem_05_08_inequality_g}\).
\[\begin{eqnarray} \mathrm{E} \left[ \sum_{t=1}^{T} g_{t}^{\mathrm{T}} (\hat{x}_{t} - u) \right] & \le & \mathrm{E} \left[ \sum_{t=1}^{T} g_{t}^{\mathrm{T}} (\hat{x}_{t} - u) \right] + \mathrm{E} \left[ \sum_{t=0}^{T} g_{t}^{\mathrm{T}} (u - \hat{x}_{t+1}) \right] \quad (\because \eqref{theorem_05_08_inequality_g}) \nonumber \\ & = & \mathrm{E} \left[ \sum_{t=1}^{T} g_{t}^{\mathrm{T}} (\hat{x}_{t} - \hat{x}_{t + 1}) \right] + \mathrm{E} \left[ g_{0}^{\mathrm{T}} (u - x_{1}) \right] \nonumber \\ & \le & \mathrm{E} \left[ \sum_{t=1}^{T} g_{t}^{\mathrm{T}} (\hat{x}_{t} - \hat{x}_{t + 1}) \right] + \mathrm{E} \left[ \norm{g_{0}}_{\infty} \norm{u - x_{1}}_{1} \right] \quad (\because \text{Hölder's inequality}) \nonumber \\ & \le & \sum_{t=1}^{T} \mathrm{E} \left[ \norm{g_{t}}_{\infty} \norm{ \hat{x}_{t} - \hat{x}_{t + 1} }_{1} \right] + 2 \mathrm{E} \left[ \norm{N}_{\infty} \right] \quad (\because u, x_{1} \in \Delta_{n}) \nonumber \\ & \le & \sum_{t=1}^{T} \mathrm{E} \left[ \norm{ \hat{x}_{t} - \hat{x}_{t + 1} }_{1} \right] + \frac{ 2 (1 + \log n) }{ \eta } \quad (\because \eqref{definition_cost_function_algorithm_15} \text{ and the lemma below}) \nonumber \\ & = & \sum_{t=1}^{T} \mathrm{E} \left[ \norm{ \hat{x}_{t} - \hat{x}_{t + 1} }_{1} 1_{\{\hat{x}_{t} \neq \hat{x}_{t + 1}\}} \right] + \frac{ 2 (1 + \log n) }{ \eta } \nonumber \\ & \le & 2 \sum_{t=1}^{T} \mathrm{E} \left[ 1_{\{\hat{x}_{t} \neq \hat{x}_{t + 1}\}} \right] + \frac{ 2 (1 + \log n) }{ \eta } \end{eqnarray}\]The last inequality follows from the fact that only one element of $\hat{x}_{t}$ has value 1, that is,
\[\forall \omega \in \Omega, \ i_{t}(\omega) \in \{1, \ldots, n\}, \ x_{t, i_{t}(\omega)}(\omega) = 1, \ x_{t, j}(\omega) = 0 \quad (j \neq i_{t}(\omega)) ,\]see Remark above.
Lemma
- $Q_{i}$,
- $\mathbb{R}$-valued r.v. following the exponential distribution with rate $\eta$,
- $P(Q_{i} \le x) = 1 - e^{-\eta x}$,
- $Q := (Q_{1}, \ldots, Q_{n})$,
Then
\[\mathrm{E} \left[ \max_{i \in \{1, \ldots, n\}} Q_{i} \right] \le \frac{ 1 + \log n }{ \eta } .\]
proof
\[\begin{eqnarray} \mathrm{E} \left[ \max_{i \in \{1, \ldots, n\}} Q_{i} \right] & = & \int_{0}^{\infty} P \left( \max_{i \in \{1, \ldots, n\}} Q_{i} > x \right) \ dx \nonumber \\ & \le & \int_{0}^{\log n / \eta} 1 \ dx + \int_{\log n / \eta}^{\infty} n e^{-\eta x} \ dx \quad (\because \text{union bound}) \nonumber \\ & = & \frac{ \log n }{ \eta } + \frac{ n }{ \eta } e^{- \log n} \nonumber \\ & = & \frac{ 1 + \log n }{ \eta } \nonumber \end{eqnarray}\]5.6 Optimal regularization
\[\begin{eqnarray} \mathrm{Regret}_{T} \le \max_{u \in \mathcal{K}} \sqrt{ 2 \sum_{t=1}^{T} (\norm{\nabla f_{t}(x_{t})}_{z_{t}, R}^{*})^{2} B_{R}(u \parallel x_{1}) } . \end{eqnarray}\] \[\begin{eqnarray} \mathcal{H} & := & \{ A \in \mathbb{R}^{n \times n} \mid \mathrm{tr} \left( A \right) \le 1, \ 0 \preccurlyeq A \} \nonumber \\ \mathcal{R} & := & \{ R: \mathcal{K} \rightarrow \mathbb{R} \mid \nabla^{2} R(x) \in \mathcal{H}, \ \forall x \in \mathcal{K} \} \nonumber \end{eqnarray}\]Algorithm 16 AdaGrad
- $\eta > 0$,
- $x_{1} \in \mathcal{K}$,
- $S_{0} := 0 \in \mathbb{R}^{n \times n}$,
- $G_{0} := 0 \in \mathbb{R}^{n \times n}$,
Step1. For $t=1$ to $T$
Step2. Play $x_{t}$
Step3. Suffer loss $f_{t}(x_{t})$,
Step4.
\[\begin{eqnarray} S_{t} & := & S_{t - 1} + \nabla f_{t}(x_{t}) \nabla f_{t}(x_{t})^{\mathrm{T}}, \nonumber \\ G_{t} & := & S_{t}^{1/2} \nonumber \\ y_{t + 1} & := & x_{t} - \eta G_{t}^{\dagger} \nabla f_{t}(x_{t}) \nonumber \\ x_{t + 1} & := & \arg\min_{x \in \mathcal{K}} \norm{ y_{t + 1} - x }_{G_{t}}^{2} \nonumber \end{eqnarray}\]Step5. end for
Step6. Return $x_{T + 1}$.
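A sketch of Algorithm 16 under a simplification of my own: $\mathcal{K} = \mathbb{R}^{n}$, so the projection step is the identity. $G_{t}$ and its pseudoinverse are computed from the eigendecomposition of the positive semidefinite matrix $S_{t}$.

```python
import numpy as np

def adagrad_full(grad_fn, T, dim, eta=0.1):
    """Full-matrix AdaGrad (Algorithm 16) with K = R^n, i.e. no projection.

    G_t = S_t^{1/2} and G_t^dagger are formed from the eigendecomposition of
    S_t = sum_s grad_s grad_s^T.
    """
    x = np.zeros(dim)                       # x_1
    S = np.zeros((dim, dim))                # S_0
    xs = []
    for t in range(1, T + 1):
        xs.append(x.copy())
        g = grad_fn(t, x)
        S += np.outer(g, g)                 # S_t = S_{t-1} + g g^T
        vals, vecs = np.linalg.eigh(S)
        vals = np.clip(vals, 0.0, None)     # guard tiny negative round-off
        inv_sqrt = np.where(vals > 1e-12, 1.0 / np.sqrt(np.maximum(vals, 1e-12)), 0.0)
        G_dagger = (vecs * inv_sqrt) @ vecs.T   # pseudoinverse of S_t^{1/2}
        x = x - eta * (G_dagger @ g)        # y_{t+1}; projection omitted since K = R^n
    return xs

rng = np.random.default_rng(6)
targets = rng.normal(size=(100, 3)) * 0.5
xs = adagrad_full(lambda t, x: 2.0 * (x - targets[t - 1]), T=100, dim=3)
print("last iterate:", np.round(xs[-1], 3))
```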
In the algorithm definition and throughout this section, the notation $A^{\dagger}$ refers to the Moore-Penrose pseudoinverse of the matrix $A$. The regret of AdaGrad is at most a constant factor larger than the minimum regret of all RFTL algorithms with regularization functions whose Hessian is fixed and belongs to the class $\mathcal{H}$.
Remark Algorithm 16
\[\begin{eqnarray} (G_{t})^{\mathrm{T}} & = & G_{t} \nonumber \\ (G_{t}^{\dagger})^{\mathrm{T}} & = & (G_{t}^{\dagger}) \nonumber \end{eqnarray}\]By definition,
\[\begin{eqnarray} S_{1} & = & \nabla f_{1}(x_{1}) \nabla f_{1}(x_{1})^{\mathrm{T}} \label{algorithm_16_ada_grad_s1} \\ S_{t} & = & \sum_{s=1}^{t} \nabla f_{s}(x_{s}) \nabla f_{s}(x_{s})^{\mathrm{T}} . \label{algorithm_16_ada_grad_st} \end{eqnarray}\]From proposition 9, $S_{t}$ is positive semidefinite, and hence so is $G_{t}$. Similarly, by proposition 9,
\[\begin{eqnarray} S_{t} - S_{t - 1} & = & \nabla f_{t}(x_{t}) \nabla f_{t}(x_{t})^{\mathrm{T}} \nonumber \end{eqnarray}\]is positive semidefinite.
Theorem 5.9
- $x_{1} \in \mathcal{K}$,
- $\eta := D$,
In algorithm 16,
\[\begin{eqnarray} \mathrm{Regret}_{T} \le 2 D \sqrt{ \min_{H \in \mathcal{H}} \sum_{t=1}^{T} \left( \norm{ \nabla f_{t}(x_{t}) }_{H}^{*} \right)^{2} } \label{equation_05_06} \end{eqnarray}\]proof
Proposition 5.1
The minimizer and the minimum of the following minimization problem, where
- $A$,
- positive semidefinite
\[\min_{Y \succcurlyeq 0, \ \mathrm{tr}(Y) \le 1} \mathrm{tr} \left( Y^{\dagger} A \right),\]are
\[\begin{eqnarray} Y^{*} & = & \frac{1}{ \mathrm{tr}(A^{1/2}) } A^{1/2} \nonumber \\ \mathrm{tr} \left( (Y^{*})^{\dagger} A \right) & = & \left( \mathrm{tr}(A^{1/2}) \right)^{2} \nonumber \end{eqnarray}\]proof
Corollary 5.10
\[\begin{eqnarray} \sqrt{ \min_{H \in \mathcal{H}} \sum_{t=1}^{T} \left( \norm{ \nabla f_{t}(x_{t}) }_{H}^{*} \right)^{2} } & = & \sqrt{ \min_{H \in \mathcal{H}} \mathrm{tr} \left( H^{\dagger} \left( \sum_{t=1}^{T} \nabla f_{t}(x_{t}) \nabla f_{t}(x_{t})^{\mathrm{T}} \right) \right) } \nonumber \\ & = & \mathrm{tr} \left( \left( \sum_{t=1}^{T} \nabla f_{t}(x_{t}) \nabla f_{t}(x_{t})^{\mathrm{T}} \right)^{1/2} \right) \nonumber \\ & = & \mathrm{tr}(G_{T}) \nonumber \end{eqnarray}\]proof
Lemma 5.11
In Algorithm 16,
\[\begin{eqnarray} \mathrm{Regret}_{T} & \le & 2D \mathrm{tr}(G_{T}) \nonumber \\ & = & 2 D \sqrt{ \min_{H \in \mathcal{H}} \sum_{t=1}^{T} \left( \norm{ \nabla f_{t}(x_{t}) }_{H}^{*} \right)^{2} } \nonumber \end{eqnarray}\]proof
By the definition of $y_{t + 1}$,
\[\begin{eqnarray} & & y_{t + 1} - u = x_{t} - u - \eta G_{t}^{\dagger}\nabla f_{t}(x_{t}) \label{equation_05_07} \\ & \Leftrightarrow & G_{t}(y_{t + 1} - u) = G_{t} (x_{t} - u) - \eta \nabla f_{t}(x_{t}) \label{equation_05_08} \end{eqnarray}\]Multiplying the transpose of \(\eqref{equation_05_07}\) by \(\eqref{equation_05_08}\),
\[\begin{eqnarray} (y_{t + 1} - u)^{\mathrm{T}} G_{t}(y_{t + 1} - u) & = & (x_{t} - u)^{\mathrm{T}} G_{t} (x_{t} - u) - \eta (x_{t} - u)^{\mathrm{T}} \nabla f_{t}(x_{t}) - \eta (G_{t}^{\dagger}\nabla f_{t}(x_{t})) ^{\mathrm{T}} G_{t} (x_{t} - u) + \eta (G_{t}^{\dagger}\nabla f_{t}(x_{t})) ^{\mathrm{T}} \eta \nabla f_{t}(x_{t}) \nonumber \\ & = & (x_{t} - u)^{\mathrm{T}} G_{t} (x_{t} - u) - 2\eta (x_{t} - u)^{\mathrm{T}} \nabla f_{t}(x_{t}) + \eta^{2} \nabla f_{t}(x_{t})^{\mathrm{T}} (G_{t}^{\dagger}) \nabla f_{t}(x_{t}) \label{equation_05_09} \end{eqnarray}\]Since $x_{t + 1}$ is the projection of $y_{t + 1}$, by theorem 2.1
\[\begin{eqnarray} (y_{t + 1} - u)^{\mathrm{T}} G_{t} (y_{t + 1} - u) & = & \norm{ y_{t+1} - u }_{G_{t}}^{2} \nonumber \\ & \ge & \norm{ x_{t + 1} - u }_{G_{t}}^{2} \nonumber \end{eqnarray}\]From \(\eqref{equation_05_09}\),
\[\begin{eqnarray} & & \norm{ x_{t + 1} - u }_{G_{t}}^{2} \le \norm{ x_{t} - u }_{G_{t}}^{2} - 2\eta (x_{t} - u)^{\mathrm{T}} \nabla f_{t}(x_{t}) + \eta^{2} \nabla f_{t}(x_{t})^{\mathrm{T}} (G_{t}^{\dagger}) \nabla f_{t}(x_{t}) \nonumber \\ & \Leftrightarrow & 2\eta (x_{t} - u)^{\mathrm{T}} \nabla f_{t}(x_{t}) \le \norm{ x_{t} - u }_{G_{t}}^{2} - \norm{ x_{t + 1} - u }_{G_{t}}^{2} + \eta^{2} \nabla f_{t}(x_{t})^{\mathrm{T}} (G_{t}^{\dagger}) \nabla f_{t}(x_{t}) \nonumber \\ & \Leftrightarrow & \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - u) \le \frac{1}{2\eta} \left( \norm{ x_{t} - u }_{G_{t}}^{2} - \norm{ x_{t + 1} - u }_{G_{t}}^{2} \right) + \frac{\eta}{2} \nabla f_{t}(x_{t})^{\mathrm{T}} (G_{t}^{\dagger}) \nabla f_{t}(x_{t}) . \nonumber \end{eqnarray}\]Now, summing up over $t=1$ to $T$ we get
\[\begin{eqnarray} \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - u) & \le & \sum_{t=1}^{T} \frac{1}{2\eta} \norm{ x_{t} - u }_{G_{t}}^{2} - \sum_{t=1}^{T} \frac{1}{2\eta} \left( \norm{ x_{t + 1} - u }_{G_{t}}^{2} \right) + \sum_{t=1}^{T} \frac{\eta}{2} \nabla f_{t}(x_{t})^{\mathrm{T}} (G_{t}^{\dagger}) \nabla f_{t}(x_{t}) \nonumber \\ & = & \sum_{t=1}^{T} \frac{1}{2\eta} \norm{ x_{t} - u }_{G_{t}}^{2} - \sum_{t=1}^{T+1} \frac{1}{2\eta} \left( \norm{ x_{t} - u }_{G_{t-1}}^{2} \right) + \frac{1}{2\eta} \norm{ x_{1} - u }_{G_{0}}^{2} + \sum_{t=1}^{T} \frac{\eta}{2} \nabla f_{t}(x_{t})^{\mathrm{T}} (G_{t}^{\dagger}) \nabla f_{t}(x_{t}) \nonumber \\ & = & \sum_{t=1}^{T} \frac{1}{2\eta} \left( \norm{ x_{t} - u }_{G_{t}}^{2} - \norm{ x_{t} - u }_{G_{t-1}}^{2} \right) + \frac{1}{2\eta} \norm{ x_{1} - u }_{G_{0}}^{2} - \frac{1}{2\eta} \norm{ x_{T + 1} - u }_{G_{T}}^{2} + \sum_{t=1}^{T} \frac{\eta}{2} \nabla f_{t}(x_{t})^{\mathrm{T}} (G_{t}^{\dagger}) \nabla f_{t}(x_{t}) \nonumber \\ & = & \frac{1}{2\eta} \sum_{t=1}^{T} (x_{t} - u)^{\mathrm{T}} (G_{t} - G_{t - 1}) (x_{t} - u) - \frac{1}{2\eta} \norm{ x_{T + 1} - u }_{G_{T}}^{2} + \sum_{t=1}^{T} \frac{\eta}{2} \nabla f_{t}(x_{t})^{\mathrm{T}} (G_{t}^{\dagger}) \nabla f_{t}(x_{t}) \quad (\because G_{0} = 0) \nonumber \\ & = & \frac{1}{2\eta} \sum_{t=1}^{T} (x_{t} - u)^{\mathrm{T}} (G_{t} - G_{t - 1}) (x_{t} - u) + \sum_{t=1}^{T} \frac{\eta}{2} \nabla f_{t}(x_{t})^{\mathrm{T}} (G_{t}^{\dagger}) \nabla f_{t}(x_{t}) \label{equation_05_10} \end{eqnarray}\]Lemma 5.12
In Algorithm 16,
\[\begin{eqnarray} \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} G_{t}^{\dagger} \nabla f_{t}(x_{t}) & \le & 2 \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} G_{T}^{\dagger} \nabla f_{t}(x_{t}) \nonumber \\ & \le & 2 \mathrm{tr} \left( G_{T} \right) \nonumber \end{eqnarray}\]proof
We prove the lemma by induction. Suppose that $T = 1$. Since $S_{0} = 0$,
\[\begin{eqnarray} S_{1} & = & \nabla f_{1}(x_{1}) \nabla f_{1}(x_{1})^{\mathrm{T}} \nonumber \\ G_{1} & = & \left( \nabla f_{1}(x_{1}) \nabla f_{1}(x_{1})^{\mathrm{T}} \right)^{1/2} \nonumber \end{eqnarray}\]Hence
\[\begin{eqnarray} \nabla f_{1}(x_{1})^{\mathrm{T}} G_{1}^{\dagger} \nabla f_{1}(x_{1}) & = & \mathrm{tr} \left( G_{1}^{\dagger} \nabla f_{1}(x_{1}) \nabla f_{1}(x_{1})^{\mathrm{T}} \right) \nonumber \\ & = & \mathrm{tr} \left( G_{1}^{\dagger} G_{1}^{2} \right) \nonumber \\ & = & \mathrm{tr} \left( G_{1} \right) \nonumber \end{eqnarray}\]The first equality holds since $x^{\mathrm{T}} A x = \mathrm{tr} \left( A x x^{\mathrm{T}} \right)$.
Suppose that the inequalities hold up to $T-1$.
\[\begin{eqnarray} \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} G_{t}^{\dagger} \nabla f_{t}(x_{t}) & \le & 2 \mathrm{tr} \left( G_{T - 1} \right) + \nabla f_{T}(x_{T})^{\mathrm{T}} G_{T}^{\dagger} \nabla f_{T}(x_{T}) \nonumber \\ & = & 2 \mathrm{tr} \left( \left( G_{T}^{2} - \nabla f_{T}(x_{T}) \nabla f_{T}(x_{T})^{\mathrm{T}} \right)^{1/2} \right) + \mathrm{tr} \left( G_{T}^{\dagger} \nabla f_{T}(x_{T}) \nabla f_{T}(x_{T})^{\mathrm{T}} \right) \nonumber \\ & \le & 2 \mathrm{tr} \left( G_{T} \right) \nonumber \end{eqnarray}\]where the last inequality is due to the proposition below:
\[\begin{eqnarray} 2 \mathrm{tr} \left( (A - B)^{1/2} \right) + \mathrm{tr} \left( A^{-1/2}B \right) \le 2 \mathrm{tr}(A^{1/2}) . \end{eqnarray}\]lemma 5.13
In Algorithm 16,
\[\begin{eqnarray} \sum_{t=1}^{T} (x_{t} - x^{*})^{\mathrm{T}} (G_{t} - G_{t -1}) (x_{t} - x^{*}) & \le & D^{2} \mathrm{tr} \left( G_{T} \right) \nonumber \end{eqnarray}\]proof
By definition
\[S_{t} - S_{t - 1} = \nabla f_{t}(x_{t}) \nabla f_{t}(x_{t})^{\mathrm{T}} .\]That is, $S_{t} \succcurlyeq S_{t - 1}$. See the remark above. Moreover, $G_{t} \succcurlyeq G_{t - 1}$ by the proposition below, so $G_{t} - G_{t-1}$ is symmetric positive semidefinite and hence orthogonally diagonalizable. Let
\[\begin{eqnarray} L_{t} & := & (l_{j}^{i})_{i,j=1,\ldots,n} \nonumber \\ & := & \mathrm{diag}(\lambda_{1}(G_{t} - G_{t-1}), \ldots, \lambda_{n}(G_{t} - G_{t-1})), \nonumber \end{eqnarray}\]and let $Q_{t}$ be the corresponding orthogonal matrix. For simplicity, letting $y_{t} := (y_{t, i}) := Q_{t}^{\mathrm{T}}(x_{t} - u)$,
\[\begin{eqnarray} \sum_{t=1}^{T} (x_{t} - u)^{\mathrm{T}} (G_{t} - G_{t - 1}) (x_{t} - u) & = & \sum_{t=1}^{T} (x_{t} - u)^{\mathrm{T}} Q_{t} L_{t} Q_{t}^{\mathrm{T}} (x_{t} - u) \nonumber \\ & = & \sum_{t=1}^{T} \left( Q_{t}^{\mathrm{T}} (x_{t} - u) \right)^{\mathrm{T}} L_{t} Q_{t}^{\mathrm{T}} (x_{t} - u) \nonumber \\ & = & \sum_{t=1}^{T} y_{t}^{\mathrm{T}} L_{t} y_{t} \nonumber \\ & = & \sum_{t=1}^{T} \sum_{k=1}^{n} \sum_{k^{\prime}=1}^{n} y_{t, k} l_{t, k}^{k^{\prime}} y_{t, k^{\prime}} \nonumber \\ & = & \sum_{t=1}^{T} \sum_{k=1}^{n} l_{t, k}^{k} y_{t, k}^{2} \quad (\because \text{diagonal}) \nonumber \\ & \le & \sum_{t=1}^{T} \lambda_{1}(G_{t} - G_{t-1}) \sum_{k=1}^{n} y_{t, k}^{2} \nonumber \\ & = & \sum_{t=1}^{T} \lambda_{1}(G_{t} - G_{t-1}) \norm{ y_{t} }_{2}^{2} \nonumber \\ & \le & \sum_{t=1}^{T} \lambda_{1}(G_{t} - G_{t-1}) \norm{ Q_{t} }_{2}^{2} \norm{ x_{t} - u }_{2}^{2} \nonumber \\ & \le & \sum_{t=1}^{T} \lambda_{1}(G_{t} - G_{t-1}) D^{2} \nonumber \\ & \le & \sum_{t=1}^{T} \mathrm{tr} \left( G_{t} - G_{t-1} \right) D^{2} \nonumber \\ & = & D^{2} \sum_{t=1}^{T} \left( \mathrm{tr} \left( G_{t} \right) - \mathrm{tr} \left( G_{t-1} \right) \right) \nonumber \\ & = & D^{2} \left( \mathrm{tr} \left( G_{T} \right) - \mathrm{tr} \left( G_{0} \right) \right) \nonumber \\ & = & D^{2} \mathrm{tr} \left( G_{T} \right) \nonumber \end{eqnarray}\]Proposition
- $A$,
- positive definite
- $B$,
- positive definite
- $A - B$,
- positive semidefinite
(1) $A^{1/2} \succcurlyeq B^{1/2}$,
(2)
\[2 \mathrm{tr} \left( (A - B)^{1/2} \right) + \mathrm{tr} \left( A^{-1/2}B \right) \le 2 \mathrm{tr}(A^{1/2}) .\]proof
TODO: proof.
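Since the proof is left as a TODO, here is a randomized numerical sanity check of both claims (not a proof); the matrix square root is computed via an eigendecomposition.

```python
import numpy as np

def sqrtm_psd(M):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

rng = np.random.default_rng(7)
for _ in range(1000):
    n = int(rng.integers(2, 6))
    # B positive definite and A = B + (PSD), so A, B are PD and A - B is PSD.
    X = rng.normal(size=(n, n)); B = X @ X.T + 0.1 * np.eye(n)
    Y = rng.normal(size=(n, n)); A = B + Y @ Y.T
    # Claim (1): A^{1/2} - B^{1/2} is positive semidefinite.
    assert np.linalg.eigvalsh(sqrtm_psd(A) - sqrtm_psd(B)).min() >= -1e-8
    # Claim (2): 2 tr((A - B)^{1/2}) + tr(A^{-1/2} B) <= 2 tr(A^{1/2}).
    lhs = 2 * np.trace(sqrtm_psd(A - B)) + np.trace(np.linalg.inv(sqrtm_psd(A)) @ B)
    rhs = 2 * np.trace(sqrtm_psd(A))
    assert lhs <= rhs + 1e-8, (lhs, rhs)
print("no counterexample found in 1000 random trials")
```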
Proposition
- $A \in \mathbb{R}^{n \times n}$,
- positive definite
- $\mathcal{S} \subseteq \mathbb{R}^{n \times n}$,
- all symmetric matrices
Then the minimizer and the minimum of the problem \(\min_{X \in \mathcal{S}, \ X \succcurlyeq 0, \ \mathrm{tr}(X) \le 1} \mathrm{tr} \left( X^{\dagger} A \right)\) are given by
\[\begin{eqnarray} X^{*} & = & \frac{ A^{1/2} }{ \mathrm{tr}(A^{1/2}) } \label{proposition_minimizer} \\ \mathrm{tr} \left( (X^{*})^{\dagger} A \right) & = & (\mathrm{tr}(A^{1/2}))^{2} \label{proposition_minimum} \end{eqnarray}\]proof
TODO: proof
We first prove that the minimizer is given by \eqref{proposition_minimizer}. Note that the trace is the sum of the eigenvalues. Let
\[\begin{eqnarray} \lambda_{i} & := & \lambda_{i}(A) \quad (i = 1, \ldots, n) . \nonumber \end{eqnarray}\]We consider the following problem.
\[\begin{align} \min_{x \in \mathbb{R}_{> 0}^{n}} & & & \sum_{i=1}^{n} \frac{ \lambda_{i} }{ x_{i} } \nonumber \\ \mathrm{subject\ to} & & & \sum_{i=1}^{n} x_{i} \le 1 \nonumber \end{align}\]By the KKT conditions for convex problems, we can solve the above problem by the Lagrange multiplier method. Let
\[x \in \mathbb{R}^{n}, \ L(x, \mu) := \sum_{i=1}^{n} \frac{ \lambda_{i} }{ x_{i} } - \mu \left( 1 - \sum_{i=1}^{n} x_{i} \right) .\]Then
\[\begin{eqnarray} \frac{\partial L}{\partial x_{j}} & = & - \frac{ \lambda_{j} }{ x_{j}^{2} } + \mu = 0 \nonumber \\ \frac{\partial L}{\partial \mu} & = & 1 - \sum_{i=1}^{n} x_{i} = 0 \nonumber \end{eqnarray}\]From the first equation,
\[\begin{eqnarray} x_{j} & = & \sqrt{ \frac{ \lambda_{j} }{ \mu } } \nonumber \end{eqnarray}\]By substituting this into the second equation, we obtain
\[\begin{eqnarray} & & 1 - \sum_{i=1}^{n} \sqrt{ \frac{ \lambda_{i} }{ \mu } } = 0 \nonumber \\ & \Leftrightarrow & \mu^{1/2} = \sum_{i=1}^{n} \sqrt{ \lambda_{i} } . \nonumber \end{eqnarray}\]Back to the first equation, we have
\[\begin{eqnarray} x_{j} & = & \frac{ \sqrt{ \lambda_{j} } }{ \sum_{i=1}^{n} \sqrt{ \lambda_{i} } } \label{proposition_minimizer_alternative_problem} \end{eqnarray}\]Now if we construct a matrix $X \in \mathcal{S}$ whose eigenvalues are given by \eqref{proposition_minimizer_alternative_problem}, it is the minimizer of the original problem. Let $X^{*}$ be given by \eqref{proposition_minimizer}. Since $A$ is positive definite, the Moore-Penrose pseudoinverse of $X^{*}$ equals $(X^{*})^{-1}$. Then
\[\begin{eqnarray} \mathrm{tr} \left( (X^{*})^{-1} A \right) & = & \mathrm{tr} \left( \mathrm{tr}(A^{1/2}) A^{-1/2} A \right) \nonumber \\ & = & \mathrm{tr}(A^{1/2}) \, \mathrm{tr} \left( A^{1/2} \right) \nonumber \\ & = & \left( \mathrm{tr}(A^{1/2}) \right)^{2} \nonumber \\ & = & \left( \sum_{i=1}^{n} \lambda_{i}(A)^{1/2} \right)^{2} \nonumber \end{eqnarray}\]which agrees with \eqref{proposition_minimum}.
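A small numerical illustration of the proposition (a sanity check with a random positive definite $A$ of my choosing, not a proof): the claimed minimizer attains $(\mathrm{tr}(A^{1/2}))^{2}$, and random feasible competitors do no better.

```python
import numpy as np

def sqrtm_psd(M):
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

rng = np.random.default_rng(8)
n = 4
Z = rng.normal(size=(n, n))
A = Z @ Z.T + 0.1 * np.eye(n)          # positive definite A

A_half = sqrtm_psd(A)
X_star = A_half / np.trace(A_half)     # claimed minimizer, tr(X*) = 1
opt = np.trace(np.linalg.inv(X_star) @ A)
print("claimed minimum (tr A^{1/2})^2:", np.trace(A_half) ** 2)
print("objective at X*              :", opt)

# Random feasible competitors (PSD, trace 1) should not beat X*.
for _ in range(2000):
    W = rng.normal(size=(n, n))
    X = W @ W.T + 1e-3 * np.eye(n)
    X /= np.trace(X)
    assert np.trace(np.linalg.inv(X) @ A) >= opt - 1e-8
print("no feasible X beat X* in 2000 random trials")
```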