1.2 Metrics
1.2.1 Partition function and free energy
\[\begin{eqnarray} \int_{\mathbb{R}} \ dx_{1} \int_{\mathbb{R}} \ dx_{2} \cdots \int_{\mathbb{R}} \ dx_{n} Z_{n}(x^{n}; 1) & = & \int_{\mathbb{R}} \ dx_{1} \int_{\mathbb{R}} \ dx_{2} \cdots \int_{\mathbb{R}} \ dx_{n} \left( \int_{\mathbb{R}} \varphi(w) \ dw \prod_{i = 1}^{n} p(x_{i} \mid w) \right) \nonumber \\ & = & \int_{\mathbb{R}} \varphi(w) \ dw \int_{\mathbb{R}} \ dx_{1} \int_{\mathbb{R}} \ dx_{2} \cdots \int_{\mathbb{R}} \ dx_{n} \left( \prod_{i = 1}^{n} p(x_{i} \mid w) \right) \nonumber \\ & = & \int_{\mathbb{R}} \varphi(w) \ dw \prod_{i = 1}^{n} \left( \int_{\mathbb{R}} \ dx_{i} p(x_{i} \mid w) \right) \nonumber \\ & = & \int_{\mathbb{R}} \varphi(w) \ dw \nonumber \\ & = & 1 . \nonumber \end{eqnarray}\]By our assumption (even though not clearly mentioned in the book), true distribution of $X^{n}$ is given by
\[q_{X^{n}} (X^{n}) = \prod_{i = 1}^{n} q(X_{i}) .\]On the other hand, $Z_{n}(X^{n}; 1)$ is the distribution of $X^{n}$ estimated by our statistical model and our prior distribution. That is
\[\begin{eqnarray} p(x^{n}) & := & Z_{n}(x^{n}; 1) \nonumber \\ & = & \int_{W} \varphi(w) \prod_{i=1}^{n} p(x_{i} \mid w) \ dw \nonumber \\ & = & \int_{W} \varphi(w) \prod_{i=1}^{n} \frac{ p(x_{i}, w) }{ \varphi(w) } \ dw \nonumber \\ & = & \prod_{i=1}^{n} p(x_{i}) \nonumber \end{eqnarray}\]Definition. free energy
Let
\[F_{n}(\beta) := - \frac{1}{\beta} \log Z_{n}(\beta) .\]$F_{n}(\beta)$ is called free energy. In particular, if $\beta = 1$, free energy is equal to the additive inverse of the logarithmic likelihood:
\[\log Z_{n}(X^{n}; 1) .\]Definition. Entropy
- $q$,
- true dsitribution
Definition emprical entropy
Let
\[S_{n} := - \frac{1}{n} \sum_{i=1}^{n} \log q(X_{i}) .\]$S_{n}$ is called emrical entropy.
By definition,
\[\begin{eqnarray} \mathrm{E} \left[ S_{n} \right] & = & S \label{equation_01_16} \\ F_{n}(X^{n}; 1) & = & - \log Z_{n}(X^{n}; 1) \nonumber \\ & = & - \log \left( \int_{W} \varphi(w) \prod_{i=1}^{n} p(X_{i} \mid w) \ dw \right) \nonumber \\ & = & - \log \left( \int_{W} \varphi(w) \prod_{i=1}^{n} \frac{ p(X_{i}, w) }{ \varphi(w) } \ dw \right) \nonumber \\ & = & - \log \left( \int_{W} \prod_{i=1}^{n} p(X_{i}, w) \ dw \right) \nonumber \\ & = & - \log \left( \prod_{i=1}^{n} p(X_{i}) \right) \nonumber \\ & = & - \log \left( \prod_{i=1}^{n} q(X_{i}) \right) + \log \left( \prod_{i=1}^{n} q(X_{i}) \right) - \log \left( \prod_{i=1}^{n} p(X_{i}) \right) \nonumber \\ & = & n S_{n} + \log \left( \frac{ \prod_{i=1}^{n} q(X_{i}) }{ \prod_{i=1}^{n} p(X_{i}) } \right) \label{equation_01_18} \\ \mathrm{E} \left[ F_{n}(X^{n}; 1) \right] & = & n S + \int_{\mathbb{R}^{N}} \cdots \int_{\mathbb{R}^{N}} \log \left( \frac{ \prod_{i=1}^{n} q(x_{i}) }{ \prod_{i=1}^{n} p(x_{i}) } \right) \prod_{i=1}^{n} q(x_{i}) \ dx_{1} \cdots \ dx_{n} \nonumber \end{eqnarray}\]The first term is the entropy of true distribution. The second term is the Kullback-Leibler divergence from $ \prod_{i=1}^{n} q(x_{i}) $ to $ \prod_{i=1}^{n} p(x_{i}) $.
1.2.2 Estimation and Generalization
Definition. Generalization losses
Let
\[\begin{eqnarray} G_{n}(X^{n}) & := & - \int_{\mathbb{R}^{n}} q(x) \log p^{*}(x \mid X^{n}) \ dx \nonumber \\ & = & \int_{\mathbb{R}^{n}} - q(x) \log q(x) + q(x) \log q(x) - q(x) \log p^{*}(x \mid X^{n}) \ dx \nonumber \\ & = & S + \int_{\mathbb{R}^{n}} q(x) \log \frac{ q(x) }{ p^{*}(x \mid X^{n}) } \ dx \label{equation_01_22} \end{eqnarray}\]where $p^{*}$ is prediction density function. $G_{n}(X^{n})$ is called generalization losses.
Definition. training losses
Let
\[\begin{eqnarray} T_{n} & := & - \frac{ 1 }{ n } \sum_{i=1}^{n} \log p(X_{i} \mid X^{n}) \nonumber \\ & = & - \frac{ 1 }{ n } \sum_{i=1}^{n} \log p^{*}(X_{i} \mid X^{n}) \nonumber \end{eqnarray}\]where $p^{*}$ is prediction density function. $T_{n}$ is called traning losses.
- In practical, it is hard to calculate $G_{n}$ because $G_{n}$ depends on the true distribution $q$ which will never be known,
- On the other hand, $T_{n}(\omega)$ is observable,
Here we will consider the problem: how $T_{n}$ approximates $G_{n}$.
\[\begin{eqnarray} p(X_{n + 1} \mid X^{n}) & = & \frac{ p(X_{n + 1}, X^{n}) }{ p(X^{n}) } \nonumber \\ & = & \frac{ \int_{W} p(X_{n + 1}, X^{n} \mid w) \varphi(w) \ dw }{ Z_{n}(X^{n}; 1) } \nonumber \\ & = & \frac{ \int_{W} p(X_{n + 1} \mid w) p(X^{n} \mid w) \varphi(w) \ dw }{ Z_{n}(X^{n}; 1) } \quad (\because \text{conditional independence}) \nonumber \\ & = & \frac{ \int_{W} p(X_{n + 1} \mid w) \prod_{i = 1}^{n} p(X_{i} \mid w) \varphi(w) \ dw }{ Z_{n}(X^{n}; 1) } \nonumber \\ & = & \frac{ Z_{n + 1}(X^{n + 1}; 1) }{ Z_{n}(X^{n}; 1) } \nonumber \end{eqnarray}\]Note that if we assume I.I.D. of $X_{1}, \ldots, X_{n + 1}$, $p(X_{n + 1} \mid X^{n}) = p(X_{n + 1})$.
\[\begin{eqnarray} - \log p^{*}(X_{n + 1} \mid X^{n}) & = & - \log p^{*}(X_{n + 1} \mid X^{n}) \quad (\because \text{By definition. See } \eqref{equation_01_09}) \nonumber \\ & = & - \log Z_{n + 1}(X^{n + 1}; 1) + \log Z_{n}(X^{n}; 1) \nonumber \\ & = & F_{n + 1}(X^{n+1}; 1) - F_{n}(X^{n}; 1) \nonumber \end{eqnarray}\]Taking integral of the both sides,
\[\begin{eqnarray} & & \int \cdots \int -\log p^{*}(x^{n + 1} \mid x^{n}) \prod_{i=1}^{n+1} q(x_{i}) \ dx_{1} \cdots \ dx_{n + 1} = \int \cdots \int F_{n+1}(x^{n+1}; 1) \prod_{i=1}^{n+1} q(x_{i}) \ dx_{1} \cdots \ dx_{n + 1} + \int \cdots \int F_{n}(x^{n}; 1) \prod_{i=1}^{n} q(x_{i}) \ dx_{1} \cdots \ dx_{n} \nonumber \\ & \Leftrightarrow & \int \cdots \int G_{n}(x^{n}) \prod_{i=1}^{n} q(x_{i}) \ dx_{1} \cdots \ dx_{n + 1} = \mathrm{E} \left[ F_{n+1}(X^{n+1}; 1) \right] + \mathrm{E} \left[ F_{n}(X^{n}; 1) \right] \nonumber \\ & \Leftrightarrow & \mathrm{E} \left[ G_{n}(X^{n}) \right] = \mathrm{E} \left[ F_{n+1}(X^{n+1}; 1) \right] + \mathrm{E} \left[ F_{n}(X^{n}; 1) \right] . \nonumber \end{eqnarray}\]1.2.3. Computable example
In Baysian estimation, it is hard to calculate posterior distribution or prediction density function analytically in many cases.
Definition exponential family
- $v: \mathbb{R}^{N} \rightarrow \mathbb{R}$,
- $f: W \rightarrow \mathbb{R}^{J}$,
- $g: \mathbb{R}^{N} \rightarrow \mathbb{R}^{J}$,
- $J \in \mathbb{N}$,
Statistical model $p(x \mid w)$ is said to be exponential family with $v, f, g$ if
\[\begin{equation} p(x \mid w) = v(x) \exp \left( f(w)^{\mathrm{T}} g(x) \right) \label{01_02_example_exponential_family_definition_statistical_model} \end{equation} .\]Since
\[\int_{\mathbb{R}^{N}} p(x \mid w) \ dx = 1,\] \[\int_{\mathbb{R}^{N}} v(x) \exp \left( f(w)^{\mathrm{T}} g(x) \right) \ dx .\]Suppose that statistical model is exponential family and prior distribution is given by
\[\begin{equation} \varphi(w \mid \phi) = \frac{1}{z(\phi)} \exp \left( \phi^{\mathrm{T}} f(w) \right) \label{01_02_example_exponential_family_definition_prior} \end{equation}\]where $\phi \in \mathbb{R}^{J}$ and
\[\begin{equation} z(\phi) := \int_{W} \exp \left( \phi^{\mathrm{T}} f(w) \right) \ dw \label{01_02_example_exponential_family_definition_z} \end{equation} .\]Partition function is given by
\[\begin{eqnarray} Z_{n}(X^{n}; \beta) & = & \int_{W} \varphi(w \mid \phi) \prod_{i=1}^{n} p(X_{i} \mid w)^{\beta} \ dw \quad (\because \text{by definition}) \nonumber \\ & = & \frac{1}{z(\phi)} \int_{W} \exp \left( \phi^{\mathrm{T}} f(w) \right) \prod_{i=1}^{n} v(X_{i})^{\beta} \exp \left( \beta f(w)^{\mathrm{T}} g(X_{i}) \right) \ dw \nonumber \\ & = & \frac{1}{z(\phi)} \left( \prod_{i=1}^{n} v(X_{i})^{\beta} \right) \int_{W} \exp \left( f(w)^{\mathrm{T}} \left( \phi + \sum_{i=1}^{T} \beta g(X_{i}) \right) \right) \ dw \label{01_02_example_exponential_family_partition_function} \\ & = & \frac{1}{z(\phi)} \left( \prod_{i=1}^{n} v(X_{i})^{\beta} \right) z(\hat{\phi}(X^{n})) \nonumber \\ \hat{\phi}(X^{n}) & := & \phi + \sum_{i=1}^{n} \beta g(X_{i}) \nonumber \end{eqnarray}\]Free energy is given by
\[\begin{eqnarray} F_{n}(X^{n}; \beta) & = & - \log Z_{n}(X^{n}; \beta) \nonumber \\ & = & - \log \left( \frac{1}{z(\phi)} \left( \prod_{i=1}^{n} v(X_{i})^{\beta} \right) z(\hat{\phi}(X^{n})) \right) \\ & = & - \beta \sum_{t=1}^{T} \log v(X_{i}) - \log \frac{ z(\hat{\phi}(X^{n})) }{ z(\phi) } . \nonumber \end{eqnarray}\]Posterior distribution is given by
\[\begin{eqnarray} p(w \mid X^{n}) & = & \frac{ 1 }{ Z_{n}(X^{n};\beta) } \varphi(w \mid \phi) \prod_{i=1}^{n} p(X_{i} \mid w)^{\beta} \quad (\because \text{by definition of posterior distribution}) \\ & = & \frac{ 1 }{ Z_{n}(X^{n};\beta) } \frac{1}{z(\phi)} \exp(\phi^{\mathrm{T}}f(w)) \prod_{i=1}^{n} v(x)^{\beta} \exp(f(w)^{\mathrm{T}}g(X_{i})^{\beta} \nonumber \\ & = & \frac{ 1 }{ Z_{n}(X^{n};\beta) } \frac{1}{z(\phi)} \left( \prod_{i=1}^{n} v(x)^{\beta} \right) \exp \left( f(w)^{\mathrm{T}} \left( \phi + \beta \sum_{i=1}^{n} g(X_{i}) \right) \right) \nonumber \\ & = & \frac{ 1 }{ Z_{n}(X^{n};\beta) } \frac{1}{z(\phi)} \left( \prod_{i=1}^{n} v(x)^{\beta} \right) \exp \left( \hat{\phi}(X^{n})^{\mathrm{T}} f(w) \right) \nonumber \\ & = & \frac{ 1 }{ Z_{n}(X^{n};\beta) } \frac{z(\hat{\phi)(X^{n})}}{z(\phi)} \left( \prod_{i=1}^{n} v(x)^{\beta} \right) \varphi(w \mid \hat{\phi}(X^{n})) \quad (\because \eqref{01_02_example_exponential_family_definition_prior}) \nonumber \\ & = & \frac{ z(\phi) }{ \left( \prod_{i=1}^{n} v(x)^{\beta} \right) z(\hat{\phi}(X^{n})) } \frac{z(\hat{\phi}(X^{n}))}{z(\phi)} \left( \prod_{i=1}^{n} v(x)^{\beta} \right) \varphi(w \mid \hat{\phi}(X^{n})) \quad (\because \eqref{01_02_example_exponential_family_partition_function}) \nonumber \\ & = & \varphi(w \mid \hat{\phi}(X^{n})) . \label{01_02_example_exponential_family_definition_posterior} \end{eqnarray}\]Posterior distribution is the same as prior distribution except for changing hyperparameter from $\phi$ to $\hat{\phi}(X^{n})$. In general, prior distribution is said to be conjugate prior distribution if posterior distribution is one of the family of prior distribution.
Prediction density function is given by
\[\begin{eqnarray} p(x \mid X^{n}) & = & \int_{W} p(x \mid w) p(w \mid X^{n}) \ dw \nonumber \\ & = & \int_{W} p(x \mid w) \varphi(w \mid \hat{\phi}(X^{n})) \ dw \quad (\because \eqref{01_02_example_exponential_family_definition_posterior}) \nonumber \\ & = & v(x) \int_{W} \exp \left( f(w)^{\mathrm{T}} g(x) \right) \frac{1}{z(\hat{\phi}(X^{n}))} \exp \left( f(w)^{\mathrm{T}} \hat{\phi}(X^{n}) \right) \ dw \nonumber \\ & = & \frac{ v(x) }{ z(\hat{\phi}(X^{n})) } \int_{W} \exp \left( f(w)^{\mathrm{T}} \left( g(x) + \hat{\phi}(X^{n}) \right) \right) \ dw \nonumber \\ & = & \frac{v(x)}{z(\hat{\phi}(X^{n}))} z(\hat{\phi}(X^{n}) + g(x)) . \label{equation_01_24} \end{eqnarray}\]In particular, if $\beta = 1$,
\[\begin{eqnarray} p(x \mid X^{n}) & = & \frac{v(x)}{z(\hat{\phi}(X^{n}))} \frac{ z \left( \phi + \sum_{i=1}^{n} \phi(X_{i}) + g(x) \right) }{ z \left( \phi + \sum_{i=1}^{n} \phi(X_{i}) \right) } \nonumber \end{eqnarray}\]Example1 normal distribtion
- $\sigma^{2} > 0$,
- $m \in \mathbb{R}$,
Statistical model
\[\begin{eqnarray} p(x \mid m) & = & \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left( - \frac{1}{2\sigma^{2}} (x - m)^{2} \right) \label{equatin_01_25} \\ & = & \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left( - \frac{x^{2}}{2\sigma^{2}} \right) \exp \left( + \frac{m}{2\sigma^{2}} x - \frac{m^{2}}{2\sigma^{2}} \right) nonumber . \end{eqnarray}\]Thus, by letting
\[\begin{eqnarray} v(x) & := & p(x \mid 0) \nonumber \\ f(w) & := & \left( \begin{array}{c} \frac{m}{\sigma^{2}} \\ \frac{-m^{2}}{2\sigma^{2}} \end{array} \right) \nonumber \\ g(x) & := & \left( \begin{array}{c} x \\ 1 \end{array} \right) , \nonumber \end{eqnarray}\]$p(x \mid m)$ is exponential family. We obtain conjugate prior distribution by substituting hyperparameters $\phi := (\phi_{1}, \phi_{2})$ into $(x, 1)$ in $p(x \mid m)$
\[\begin{eqnarray} \varphi(m \mid \phi) & = & \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left( - \frac{x^{2}}{2\sigma^{2}} \right) \exp \left( \frac{m}{2\sigma^{2}} \phi_{1} - \frac{m^{2}}{2\sigma^{2}} \phi_{2} \right) \nonumber \\ & = & \frac{1}{z(\phi)} \exp \left( \frac{m}{2\sigma^{2}} \phi_{1} - \frac{m^{2}}{2\sigma^{2}} \phi_{2} \right) \nonumber \\ \frac{1}{z(\phi)} & := & \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left( - \frac{x^{2}}{2\sigma^{2}} \right) \nonumber \end{eqnarray}\]