
Introduction to Online Convex Optimization Chapter03


The regret of an online algorithm after $T$ rounds is defined as

\[\mathrm{Regret}_{T} := \sum_{t=1}^{T} f_{t}(x_{t}) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_{t}(x) .\]

The average regret is $\mathrm{Regret}_{T} / T$; an online algorithm performs well if the average regret vanishes as $T \rightarrow \infty$.

3.1 Online gradient descent

Algorithm 6 online gradient descent

Input: convex set $\mathcal{K}$, time horizon $T$, initial point $x_{1} \in \mathcal{K}$, step sizes $\{\eta_{t}\}$.

Step1. for $t=1, \ldots, T$

Step2. Play $x_{t}$ and observe the cost $f_{t}(x_{t})$, where $f_{t} \in \mathcal{F}$.

Step3. Update and project

\[\begin{eqnarray} y_{t+1} & := & x_{t} - \eta_{t}\nabla f_{t}(x_{t}) \nonumber \\ x_{t+1} & := & \Pi_{\mathcal{K}} (y_{t+1}) \nonumber . \end{eqnarray}\]

Step4. end for

Step5. Return $x_{T + 1}$.

We denote this algorithm by \(\mathcal{A}_{OGD}(f_{1}, \ldots, f_{T}; \{\eta_{t}\})\), that is,

\[\forall t \in \{0, \ldots, T\}, \ \mathcal{A}_{OGD}(f_{1}, \ldots, f_{t}; \{\eta_{s}\}_{s=1,\ldots,t}) = x_{t + 1} .\]
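For reference, here is a minimal NumPy sketch of Algorithm 6. It is a sketch under an illustrative assumption: $\mathcal{K}$ is taken to be a Euclidean ball so that the projection $\Pi_{\mathcal{K}}$ has a closed form, and the names `project_ball` and `ogd` are made up for this note.

```python
import numpy as np

def project_ball(y, R=1.0):
    """Projection onto the Euclidean ball {x : ||x|| <= R} (an example of Pi_K)."""
    norm = np.linalg.norm(y)
    return y if norm <= R else (R / norm) * y

def ogd(grads, x1, etas, project=project_ball):
    """Online gradient descent (Algorithm 6).

    grads[t - 1](x) returns the gradient of the cost f_t revealed in round t,
    and etas contains the step sizes eta_1, ..., eta_T.
    Returns the iterates x_1, ..., x_{T+1}.
    """
    xs = [np.asarray(x1, dtype=float)]
    for g, eta in zip(grads, etas):
        y_next = xs[-1] - eta * g(xs[-1])   # gradient step on the cost just observed
        xs.append(project(y_next))          # project back onto K
    return xs
```

With `etas[t - 1] = D / (G * np.sqrt(t))` this is exactly the schedule analyzed in Theorem 3.1 below.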

Lemma 1 difference between optimal points

In Algorithm 6, for any $x \in \mathcal{K}$,

\[\begin{eqnarray} \| x_{t + 1} - x \|^{2} & \le & \| x_{t} - x \|^{2} + \eta_{t}^{2} \| \nabla f_{t}(x_{t}) \|^{2} - 2 \eta_{t} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x) \label{lemma1_difference_between_optimal_points_01} \\ & \le & \| x_{t} - x \|^{2} + \eta_{t}^{2} G^{2} - 2 \eta_{t} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x) \label{lemma1_difference_between_optimal_points_02} \end{eqnarray}\]

proof

\[\begin{eqnarray} \| x_{t+1} - x \|^{2} & = & \| \Pi_{\mathcal{K}} (x_{t} - \eta_{t}\nabla f_{t}(x_{t})) - x \|^{2} \nonumber \\ & \le & \| (x_{t} - \eta_{t}\nabla f_{t}(x_{t})) - x \|^{2} \quad (\because \text{theorem 2.1}) \nonumber \\ & = & \| x_{t} - x \|^{2} - 2 \eta_{t} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x) + \eta_{t}^{2} \| \nabla f_{t}(x_{t}) \|^{2} \nonumber \\ & \le & \| x_{t} - x \|^{2} - 2 \eta_{t} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x) + \eta_{t}^{2} G^{2} \nonumber \end{eqnarray}\]
$\Box$

Lemma 2 bound of the linear approximation

In Algorithm 6 with non-increasing step sizes $\{\eta_{t}\}$, for any $x \in \mathcal{K}$,

\[\begin{eqnarray} 2 \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x) & \le & \sum_{t=1}^{T} \|x_{t} - x\|^{2} \left( \frac{ 1 }{ \eta_{t} } - \frac{ 1 }{ \eta_{t-1} } \right) + G^{2} \sum_{t=1}^{T} \eta_{t} \quad (\because \frac{1}{\eta_{0}} := 0) \label{lemma2_bound_of_the_linear_approximation_equation_01} \\ & \le & D^{2} \sum_{t=1}^{T} \left( \frac{ 1 }{ \eta_{t} } - \frac{ 1 }{ \eta_{t-1} } \right) + G^{2} \sum_{t=1}^{T} \eta_{t} \label{lemma2_bound_of_the_linear_approximation_equation_02} \end{eqnarray}\]

proof

By Lemma 1, inequality \(\eqref{lemma1_difference_between_optimal_points_02}\) holds. Rearranging it, dividing by $\eta_{t}$, and summing from $t=1$ to $T$,

\[\begin{eqnarray} 2 \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x) & \le & \sum_{t=1}^{T} \frac{ \|x_{t} - x\|^{2} - \|x_{t + 1} - x\|^{2} }{ \eta_{t} } + G^{2} \sum_{t=1}^{T} \eta_{t} \nonumber \\ & \le & \sum_{t=1}^{T} \frac{ \|x_{t} - x\|^{2} }{ \eta_{t} } - \sum_{t=2}^{T} \frac{ \|x_{t} - x\|^{2} }{ \eta_{t-1} } + G^{2} \sum_{t=1}^{T} \eta_{t} \quad (\because \|x_{T+1} - x\|^{2} / \eta_{T} \ge 0) \nonumber \\ & = & \sum_{t=1}^{T} \|x_{t} - x\|^{2} \left( \frac{ 1 }{ \eta_{t} } - \frac{ 1 }{ \eta_{t-1} } \right) + G^{2} \sum_{t=1}^{T} \eta_{t} \quad (\because \frac{1}{\eta_{0}} := 0) \nonumber \\ & \le & D^{2} \sum_{t=1}^{T} \left( \frac{ 1 }{ \eta_{t} } - \frac{ 1 }{ \eta_{t-1} } \right) + G^{2} \sum_{t=1}^{T} \eta_{t} \nonumber \end{eqnarray}\]
$\Box$

Theorem 3.1

With step sizes $\eta_{t} = \frac{D}{G\sqrt{t}}$, the regret of Algorithm 6 is bounded by

\[\mathrm{Regret}_{T} = \sum_{t=1}^{T} f_{t}(x_{t}) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_{t}(x) \le \frac{3}{2} GD \sqrt{T} .\]

proof

Let

\[x^{*} \in \arg\min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_{t}(x) .\]

By convexity, we have

\[\begin{eqnarray} f_{t}(x^{*}) - f_{t}(x_{t}) & \ge & \nabla f_{t}(x_{t})^{\mathrm{T}} (x^{*} - x_{t}) \nonumber \\ \Leftrightarrow \ f_{t}(x_{t}) - f_{t}(x^{*}) & \le & \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x^{*}) . \label{equation_03_01} \end{eqnarray}\]

Summing \(\eqref{equation_03_01}\) from $t=1$ to $T$ and applying Lemma 2,

\[\begin{eqnarray} 2 \sum_{t=1}^{T} \left( f_{t}(x_{t}) - f_{t}(x^{*}) \right) & \le & 2 \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x^{*}) \nonumber \\ & \le & D^{2} \sum_{t=1}^{T} \left( \frac{ 1 }{ \eta_{t} } - \frac{ 1 }{ \eta_{t-1} } \right) + G^{2} \sum_{t=1}^{T} \eta_{t} \quad (\because \eqref{lemma2_bound_of_the_linear_approximation_equation_02}) \nonumber \\ & = & \frac{ D^{2} }{ \eta_{T} } + G^{2} \sum_{t=1}^{T} \eta_{t} \nonumber \\ & = & GD \sqrt{T} + G D \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \quad (\because \eta_{t} = \frac{D}{G\sqrt{t}}) \nonumber \\ & \le & 3 GD \sqrt{T} \nonumber . \end{eqnarray}\]

The last inequality follows from

\[\begin{eqnarray} \sum_{t=1}^{T} \frac{1}{\sqrt{t}} & \le & \int_{0}^{T} \frac{1}{\sqrt{t}} \ dt \nonumber \\ & = & [ 2 \sqrt{t} ]_{0}^{T} = 2\sqrt{T} \nonumber . \end{eqnarray}\]

Dividing both sides by $2$ gives $\mathrm{Regret}_{T} \le \frac{3}{2} GD \sqrt{T}$.
$\Box$
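As a quick numerical sanity check of the bound, the following sketch runs online gradient descent with the $\eta_{t} = \frac{D}{G\sqrt{t}}$ schedule on random linear losses over the unit Euclidean ball (so $D = 2$ and $G = \sqrt{n}$); the whole setup is an illustrative assumption, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 10_000
D, G = 2.0, np.sqrt(n)                     # diameter and gradient bound on the unit ball

def project_ball(y, R=1.0):
    norm = np.linalg.norm(y)
    return y if norm <= R else (R / norm) * y

V = rng.choice([-1.0, 1.0], size=(T, n))   # linear losses f_t(x) = v_t^T x
x = np.zeros(n)
total_loss = 0.0
for t in range(1, T + 1):
    v = V[t - 1]
    total_loss += v @ x                    # play x_t and pay f_t(x_t)
    eta = D / (G * np.sqrt(t))             # eta_t = D / (G sqrt(t))
    x = project_ball(x - eta * v)          # the gradient of f_t is v_t everywhere

s = V.sum(axis=0)
best_in_hindsight = -np.linalg.norm(s)     # min_{||x|| <= 1} s^T x = -||s||
regret = total_loss - best_in_hindsight
print(regret, 1.5 * G * D * np.sqrt(T))    # observed regret vs. the (3/2) G D sqrt(T) bound
```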

3.2 Lower bounds

Theorem 3.2

Any algorithm for online convex optimization incurs $\Omega(DG\sqrt{T})$ regret in the worst case.

proof

Consider the following construction:

\[\begin{eqnarray} \mathcal{K} & := & \{ x \in \mathbb{R}^{n} \mid \|x\|_{\infty} \le 1 \} \nonumber \\ \mathcal{A} & := & \{ -1, 1 \}^{n} . \nonumber \end{eqnarray}\]

For each $v \in \mathcal{A}$, define a linear cost function

\[f_{v}: \mathbb{R}^{n} \rightarrow \mathbb{R}, \quad f_{v}(x) := v^{\mathrm{T}} x ,\]

and let

\[\mathcal{F} := \{ f_{v} \mid v \in \mathcal{A} \} .\]

For any $x, y \in \mathcal{K}$ and $v \in \mathcal{A}$,

\[\begin{eqnarray} \| x - y \| & \le & \sqrt{ \sum_{i=1}^{n} 2^{2} } = 2 \sqrt{ n } \nonumber \\ | f_{v}(x) - f_{v}(y) | & \le & \| v \| \| x - y \| = \sqrt{ n } \| x - y \| , \nonumber \end{eqnarray}\]

so the diameter is $D = 2\sqrt{n}$ and the gradients are bounded by $G = \sqrt{n}$.

Now, we show that

\[\min_{x \in \mathcal{K}} \sum_{i=1}^{n} \sum_{t=1}^{T} v_{t, i} x_{i} = \sum_{i=1}^{n} - \left| \sum_{t=1}^{T} v_{t, i} \right| .\]

Indeed, since each summand is linear in $x_{i}$ and $x_{i} \in [-1, 1]$, the minimum over each coordinate is attained at $x_{i} \in \{\pm 1\}$, namely $x_{i} = -\mathrm{sign}(\sum_{t=1}^{T} v_{t, i})$. Moreover, since each coordinate $x_{i}$ can be chosen independently, the minimization decomposes coordinate-wise and is equivalent to

\[\sum_{i=1}^{n} \min_{z \in \{\pm 1\}} \sum_{t=1}^{T} v_{t, i} z = \sum_{i=1}^{n} \left( - \left| \sum_{t=1}^{T} v_{t, i} \right| \right) .\]
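Since the objective is linear, its minimum over $\mathcal{K}$ is attained at a vertex of the cube, so the identity can be verified by brute force on small instances; a tiny illustrative sketch:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, T = 3, 7
V = rng.choice([-1, 1], size=(T, n))       # v_{t,i}
col_sums = V.sum(axis=0)                   # sum_t v_{t,i} for each coordinate i

# brute-force minimum of sum_i sum_t v_{t,i} x_i over the vertices {-1, 1}^n
brute = min(sum(col_sums * np.array(x)) for x in itertools.product([-1, 1], repeat=n))

closed_form = -np.abs(col_sums).sum()      # the right-hand side of the identity
assert brute == closed_form
```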

Let \(\{\Lambda_{t, i}\}\) be i.i.d. Rademacher random variables, i.e., each takes the values $\pm 1$ with probability $1/2$, independently across $t = 1, \ldots, T$ and $i = 1, \ldots, n$. Let

\[V_{t} := ( \Lambda_{t, 1}, \ldots, \Lambda_{t, n} ), \quad t = 1, \ldots, T .\]

Then

\[\begin{eqnarray} \mathrm{E} \left[ \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_{V_{t}}(x) \right] & = & \mathrm{E} \left[ \min_{x \in \mathcal{K}} \sum_{i=1}^{n} \sum_{t=1}^{T} V_{t, i} x_{i} \right] \nonumber \\ & = & \mathrm{E} \left[ \sum_{i=1}^{n} \left( - \left| \sum_{t=1}^{T} V_{t, i} \right| \right) \right] \nonumber \\ & = & - \sum_{i=1}^{n} \mathrm{E} \left[ \left| \sum_{t=1}^{T} V_{t, i} \right| \right] \nonumber \\ & = & - n \, \mathrm{E} \left[ \left| \sum_{t=1}^{T} V_{t, 1} \right| \right] , \nonumber \end{eqnarray}\]

and, writing the sum in terms of the number of $+1$'s (which is binomially distributed),

\[\begin{eqnarray} \mathrm{E} \left[ \left| \sum_{t=1}^{T} V_{t, 1} \right| \right] & = & \mathrm{E} \left[ \left| \sum_{t=1}^{T} 1_{\{V_{t, 1} = 1\}} - \left( T - \sum_{t=1}^{T} 1_{\{V_{t, 1} = 1\}} \right) \right| \right] \nonumber \\ & = & \mathrm{E} \left[ \left| 2 \sum_{t=1}^{T} 1_{\{V_{t, 1} = 1\}} - T \right| \right] \nonumber \\ & = & \sum_{k=0}^{T} | 2k - T | \frac{1}{2^{T}} \binom{T}{k} = \Omega(\sqrt{T}) , \nonumber \end{eqnarray}\]

where the last estimate is the standard bound $\mathrm{E} \left| \sum_{t=1}^{T} V_{t, 1} \right| = \Theta(\sqrt{T})$ for a sum of i.i.d. Rademacher variables. On the other hand, for any online algorithm $\mathrm{E} \left[ \sum_{t=1}^{T} f_{V_{t}}(x_{t}) \right] = 0$, since $V_{t}$ is independent of $x_{t}$ and has zero mean. Hence the expected regret is at least $n \cdot \Omega(\sqrt{T}) = \Omega(DG\sqrt{T})$, because $DG = 2\sqrt{n} \cdot \sqrt{n} = 2n$.
$\Box$

3.3 Logarithmic regret

3.3.1 Online gradient descent for strongly convex functions

Theorem 3.3

For $\alpha$-strongly convex loss functions, Algorithm 6 with step sizes $\eta_{t} = \frac{1}{\alpha t}$ guarantees

\[\mathrm{Regret}_{T} \le \frac{G^{2}}{2\alpha} \left( 1 + \log T \right) .\]

proof

Let

\[x^{*} := \arg\min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_{t}(x) .\]

By $\alpha$-strong convexity,

\[\begin{equation} 2(f_{t}(x_{t}) - f_{t}(x^{*})) \le 2 \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x^{*}) - \alpha \| x^{*} - x_{t} \|^{2} \label{equation_03_04} . \end{equation}\]

By theorem 2.1,

\[\begin{eqnarray} \| x_{t + 1} - x^{*} \|^{2} & = & \| \Pi_{\mathcal{K}} \left( x_{t} - \eta_{t} \nabla f_{t}(x_{t}) \right) - x^{*} \|^{2} \nonumber \\ & \le & \| x_{t} - \eta_{t} \nabla f_{t}(x_{t}) - x^{*} \|^{2} . \nonumber \end{eqnarray}\]

Summing \(\eqref{equation_03_04}\) from $t=1$ to $T$ and choosing $\eta_{t} = \frac{1}{\alpha t}$,

\[\begin{eqnarray} 2\sum_{t=1}^{T} (f_{t}(x_{t}) - f_{t}(x^{*})) & \le & 2 \sum_{t=1}^{T} \nabla f_{t}(x_{t})^{\mathrm{T}} (x_{t} - x^{*}) - \alpha \sum_{t=1}^{T} \| x^{*} - x_{t} \|^{2} \quad (\because \eqref{equation_03_04}) \nonumber \\ & \le & \sum_{t=1}^{T} \|x_{t} - x^{*}\|^{2} \left( \frac{ 1 }{ \eta_{t} } - \frac{ 1 }{ \eta_{t-1} } \right) + G^{2} \sum_{t=1}^{T} \eta_{t} - \alpha \sum_{t=1}^{T} \| x^{*} - x_{t} \|^{2} \quad (\because \eqref{lemma2_bound_of_the_linear_approximation_equation_01}) \nonumber \\ & = & \sum_{t=1}^{T} \|x_{t} - x^{*}\|^{2} \left( \alpha t - \alpha (t - 1) - \alpha \right) + G^{2} \sum_{t=1}^{T} \frac{1}{\alpha t} \quad (\because \eta_{t} = \frac{1}{\alpha t}) \nonumber \\ & = & \frac{G^{2}}{\alpha} \sum_{t=1}^{T} \frac{1}{t} \nonumber \\ & \le & \frac{G^{2}}{\alpha} \left( 1 + \int_{1}^{T} \frac{1}{x} \ dx \right) \nonumber \\ & = & \frac{G^{2}}{\alpha} \left( 1 + \log T \right) \nonumber . \end{eqnarray}\]

Dividing both sides by $2$ gives $\mathrm{Regret}_{T} \le \frac{G^{2}}{2\alpha} (1 + \log T)$.
$\Box$
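Relative to Section 3.1, the only algorithmic change for $\alpha$-strongly convex losses is the step-size schedule $\eta_{t} = \frac{1}{\alpha t}$. A minimal self-contained sketch follows; the quadratic example and the identity projection (i.e. taking $\mathcal{K} = \mathbb{R}^{d}$) are illustrative assumptions only.

```python
import numpy as np

def ogd_strongly_convex(grads, x1, alpha, project):
    """OGD with eta_t = 1 / (alpha * t), the schedule used in Theorem 3.3."""
    x = np.asarray(x1, dtype=float)
    for t, g in enumerate(grads, start=1):
        x = project(x - g(x) / (alpha * t))   # gradient step with eta_t = 1/(alpha t), then project
    return x

# example: f_t(x) = (alpha / 2) * ||x - a_t||^2 is alpha-strongly convex with gradient alpha * (x - a_t);
# for these losses the iterates reduce to the running mean of the a_t's
alpha = 1.0
A = np.random.default_rng(2).normal(size=(1000, 4))
grads = [lambda x, a=a: alpha * (x - a) for a in A]
x_final = ogd_strongly_convex(grads, np.zeros(4), alpha, project=lambda y: y)
print(np.allclose(x_final, A.mean(axis=0)))   # True
```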

3.4 Application: stochastic gradient descent

The stochastic optimization problem is the problem of minimizing a given function $f: \mathcal{K} \rightarrow \mathbb{R}$ when we only have access to random estimates of its gradient. That is, we want to solve

\[\min_{x \in \mathcal{K}} f(x) .\]

At each point $x$, we can query a random gradient estimate $\tilde{\nabla}_{x}$ satisfying

\[\begin{eqnarray} \mathrm{E} \left[ \tilde{\nabla}_{x} \right] & = & \nabla f(x) \label{stochastic_optimization_problem_expectation_random_gradient} \\ \mathrm{E} \left[ \| \tilde{\nabla}_{x} \|^{2} \right] & \le & G^{2} . \nonumber \end{eqnarray}\]

The first condition says that the gradient estimate is unbiased: it equals the true gradient in expectation.

We will show that stochastic gradient descent, viewed as online gradient descent applied to the linear functions $f_{t}(x) := \langle \tilde{\nabla}_{t}, x \rangle$, attains

\[\mathrm{Regret}_{T} = O(DG\sqrt{T}) ,\]

which translates into an $O(DG / \sqrt{T})$ bound on the optimization error (Theorem 3.4).

Algorithm 7 stochastic gradient descent

Step1. for $t=1$ to $T$ do

Step2. Let \(\tilde{\nabla}_{t} := \tilde{\nabla}_{x_{t}}\) be a fresh, independent gradient estimate at $x_{t}$, and define

\[\begin{eqnarray} f_{t}(x) & := & \langle \tilde{\nabla}_{t}, x \rangle \end{eqnarray}\]

Step3. Update and project

\[\begin{eqnarray} y_{t+1} & := & x_{t} - \eta_{t} \tilde{\nabla}_{t} \nonumber \\ x_{t+1} & := & \Pi_{\mathcal{K}}(y_{t+1}) \nonumber \end{eqnarray}\]

Step4. end for

Step5. Return \(\tilde{x}_{T} := \frac{1}{T}\sum_{t=1}^{T}x_{t}\).
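A minimal sketch of Algorithm 7, assuming the stochastic gradient oracle and the projection are passed in as callables (all names here are illustrative, not from the text):

```python
import numpy as np

def sgd(grad_oracle, x1, etas, project, rng):
    """Stochastic gradient descent (Algorithm 7).

    grad_oracle(x, rng) should return an unbiased random estimate of grad f(x).
    Returns the averaged iterate x_tilde_T = (1/T) * sum_t x_t.
    """
    x = np.asarray(x1, dtype=float)
    avg = np.zeros_like(x)
    for t, eta in enumerate(etas, start=1):
        avg += (x - avg) / t        # running average of x_1, ..., x_t
        g = grad_oracle(x, rng)     # fresh stochastic gradient tilde_nabla_t at x_t
        x = project(x - eta * g)    # update and project
    return avg
```

With $\eta_{t} = \frac{D}{G\sqrt{t}}$ this is exactly the setting analyzed in Theorem 3.4.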

Theorem 3.4

In Algorithm 7,

\[\begin{eqnarray} \mathrm{E} \left[ f(\tilde{x}_{T}) \right] & \le & \min_{x \in \mathcal{K}} f(x) + \frac{ 3 GD }{ 2 \sqrt{T} } \nonumber \end{eqnarray}\]

proof

Let

\[x^{*} := \arg\min_{x \in \mathcal{K}} f(x)\]

By the regret guarantee of OGD, we have

\[\begin{eqnarray} \mathrm{E} \left[ f(\tilde{x}_{T}) \right] - f(x^{*}) & = & \mathrm{E} \left[ f \left( \frac{1}{T} \sum_{t=1}^{T} x_{t} \right) \right] - f(x^{*}) \nonumber \\ & \le & \frac{1}{T} \mathrm{E} \left[ \sum_{t=1}^{T} f \left( x_{t} \right) \right] - \frac{1}{T} \sum_{t=1}^{T} f(x^{*}) \quad (\because \text{convexity of }f) \nonumber \\ & \le & \frac{1}{T} \sum_{t=1}^{T} \mathrm{E} \left[ \nabla f(x_{t})^{\mathrm{T}} (x_{t} - x^{*}) \right] \quad (\because \text{convexity of }f) \nonumber \\ & = & \frac{1}{T} \sum_{t=1}^{T} \mathrm{E} \left[ \tilde{\nabla}_{t}^{\mathrm{T}} (x_{t} - x^{*}) \right] \quad (\because \eqref{stochastic_optimization_problem_expectation_random_gradient} \text{ and the tower property of expectation}) \nonumber \\ & = & \frac{1}{T} \sum_{t=1}^{T} \mathrm{E} \left[ f_{t}(x_{t}) - f_{t}(x^{*}) \right] \quad (\because \text{ definition of }f_{t}) \nonumber \\ & \le & \frac{\mathrm{E} \left[ \mathrm{Regret}_{T} \right]}{T} \quad (\because \text{ definition of regret}) \nonumber \\ & \le & \frac{3GD}{2\sqrt{T}} \quad (\because \text{ theorem 3.1}) \nonumber \end{eqnarray}\]
$\Box$

3.4.1 Example: stochastic gradient descent for SVM training

We again consider SVM with hinge loss and regularization.

\[\begin{eqnarray} \min_{x \in \mathbb{R}^{d}} f(x), \quad f(x) & := & \lambda \frac{1}{n} \sum_{i=1}^{n} \ell_{a_{i}, b_{i}}(x) + \frac{1}{2} \| x \|^{2} , \nonumber \\ \ell_{a, b}(x) & := & \max \{ 0, 1 - b x^{\mathrm{T}}a \} . \nonumber \end{eqnarray}\]

Algorithm 8 SGD for SVM training

Step1. For $t = 1$ to $T$

Step2. Pick an index $K_{t} \in \{1, \ldots, n\}$ uniformly at random.

Step3. Let

\[\begin{eqnarray} \tilde{\nabla}_{t} & := & \lambda \nabla \ell_{a_{K_{t}}, b_{K_{t}}}(x_{t}) + x_{t} \nonumber \\ \nabla\ell_{a, b}(x) & = & \begin{cases} 0 & (b x^{\mathrm{T}}a > 1) \\ -b a & \text{otherwise} \end{cases} \nonumber \end{eqnarray}\]

Step4. Update

\[\begin{eqnarray} x_{t + 1} := x_{t} - \eta_{t} \tilde{\nabla}_{t} \nonumber \end{eqnarray}\]

Step5. end for

Step6. Return $\tilde{x} := \frac{1}{T} \sum_{t=1}^{T} x_{t}$.
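A hedged NumPy sketch of Algorithm 8 on synthetic data; the data generation, the value of $\lambda$, and the step size $\eta_{t} = \frac{1}{t}$ (natural here because the $\frac{1}{2}\|x\|^{2}$ term makes the objective $1$-strongly convex) are illustrative choices, not specified in the text.

```python
import numpy as np

def svm_sgd(A, b, lam, T, rng):
    """SGD for the regularized hinge-loss objective (Algorithm 8) with eta_t = 1/t."""
    n, d = A.shape
    x = np.zeros(d)
    avg = np.zeros(d)
    for t in range(1, T + 1):
        avg += (x - avg) / t                          # running average of x_1, ..., x_t
        k = rng.integers(n)                           # pick an example uniformly at random
        a_k, b_k = A[k], b[k]
        grad_loss = np.zeros(d) if b_k * (x @ a_k) > 1 else -b_k * a_k
        x = x - (lam * grad_loss + x) / t             # tilde_nabla_t = lam * grad_loss + x_t
    return avg

# toy usage on linearly separable synthetic data
rng = np.random.default_rng(0)
d, n = 5, 200
w_true = rng.normal(size=d)
A = rng.normal(size=(n, d))
b = np.sign(A @ w_true)
x_hat = svm_sgd(A, b, lam=10.0, T=5000, rng=rng)
print(np.mean(np.sign(A @ x_hat) == b))               # fraction of training points classified correctly
```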

Theorem

In Algorithm 8, for a suitable choice of step sizes $\eta_{t}$ and $T = O(\frac{1}{\epsilon^{2}})$ iterations,

\[\mathrm{E} \left[ f(\tilde{x}) \right] - f(x^{*}) = O(\epsilon) .\]

proof

By Theorem 3.4, $\mathrm{E} \left[ f(\tilde{x}) \right] - f(x^{*}) \le \frac{3GD}{2\sqrt{T}}$, which is at most $\epsilon$ once $T \ge \frac{9 G^{2} D^{2}}{4 \epsilon^{2}} = O(\frac{1}{\epsilon^{2}})$.

$\Box$