
memo

Chapter 1. Mathematical background

1.1 The definition of Bayesian estimation

Remark 2

In statistics, estimating the true distribution based on samples is called statistical inference or statistical estimation. In information theory, however, such estimation is called statistical learning.

Let $f: (\mathbb{R}^{N})^{n} \rightarrow \mathbb{R}$.

\[\mathrm{E} \left[ f(X^{n}) \right] = \int_{\mathbb{R}^{N}} \cdots \int_{\mathbb{R}^{N}} f(x_{1}, \ldots, x_{n}) \ \prod_{i=1}^{n} q(x_{i}) \ dx_{i} .\]
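As a quick numerical check, this expectation can be approximated by Monte Carlo: draw $n$ i.i.d. samples from $q$ many times and average $f$. A minimal sketch, under hypothetical choices not fixed by the text ($N = 1$, $q$ standard normal, $f$ the sample mean, so $\mathrm{E}[f(X^{n})] = 0$):

```python
import numpy as np

rng = np.random.default_rng(0)

def expectation(f, sample_q, n, n_trials=20_000):
    """Monte Carlo estimate of E[f(X^n)] for X_1, ..., X_n i.i.d. ~ q."""
    return float(np.mean([f(sample_q(n)) for _ in range(n_trials)]))

# Hypothetical choices: N = 1, q = standard normal, f = sample mean,
# so the true value E[f(X^n)] is 0.
est = expectation(np.mean, lambda size: rng.standard_normal(size), n=10)
```

The estimate `est` converges to $\mathrm{E}[f(X^{n})]$ as the number of trials grows, at the usual $O(1/\sqrt{\text{trials}})$ Monte Carlo rate.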

Given $p(x \mid w)$ and $\varphi(w)$, we can define random variables whose p.d.f. is

\[\begin{eqnarray} x \in \mathbb{R}^{N}, \ \int_{W} p(x \mid w) \varphi(w) \ dw & = & \int_{W} \frac{ p(x, w) }{ \varphi(w) } \varphi(w) \ dw \nonumber \\ & = & p(x) . \nonumber \end{eqnarray}\]

We denote the random variables defined by the above p.d.f. by $X_{i}$, an $\mathbb{R}^{N}$-valued random variable for each $i = 1, \ldots, n$. In addition, we assume these variables are conditionally independent given the parameter, that is,

\[\begin{eqnarray} p(x^{n} \mid w) = p(x_{1}, \ldots, x_{n} \mid w) = \prod_{i=1}^{n} p(x_{i} \mid w) \nonumber \end{eqnarray}\]

Note that the random variables defined by the true distribution are assumed to be i.i.d. with distribution $q$; the random variables defined above, however, need not be independent, only conditionally independent given the parameter.
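This distinction can be illustrated numerically. In the sketch below (a hypothetical model, not from the text), $\Theta \sim N(0,1)$ plays the role of the prior $\varphi$ and $X_{i} \mid \Theta = w \sim N(w, 1)$ are conditionally i.i.d.; marginally, however, $\mathrm{Cov}(X_{1}, X_{2}) = \mathrm{Var}(\Theta) = 1 \neq 0$, so the $X_{i}$ are not independent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model (not from the text): Theta ~ N(0, 1) plays the role of varphi,
# and X_i | Theta = w ~ N(w, 1) are conditionally i.i.d.
n_trials = 200_000
theta = rng.standard_normal(n_trials)
x1 = theta + rng.standard_normal(n_trials)   # draw of X_1 given Theta
x2 = theta + rng.standard_normal(n_trials)   # draw of X_2 given Theta

# Marginally, Cov(X_1, X_2) = Var(Theta) = 1, so X_1 and X_2 are NOT independent,
# even though they are independent conditionally on Theta.
cov = float(np.cov(x1, x2)[0, 1])
```

The empirical covariance comes out close to $1$, confirming the marginal dependence.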

Remark

$p(x \mid w)$ is the conditional p.d.f. of $X$ given $\Theta = w$, where $\Theta$ is a random variable whose p.d.f. is $\varphi$.

$p(w \mid X^{n})$ is the conditional p.d.f. of $\Theta$ given $X^{n}$, where $\Theta$ is a random variable whose p.d.f. is $\varphi$.

The posterior distribution with inverse temperature $\beta > 0$ is defined by

\[\begin{eqnarray} p(w \mid X^{n}) & := & \frac{ 1 }{ Z_{n}(X^{n}; \beta) } \varphi(w) \prod_{i=1}^{n} p(X_{i} \mid w)^{\beta} \label{equation_01_05} \\ Z_{n}(X^{n}; \beta) & := & \int_{W} \varphi(w) \prod_{i=1}^{n} p(X_{i} \mid w)^{\beta} \ dw \label{equation_01_06} \end{eqnarray}\]

$Z_{n}(X^{n}; \beta)$ is called the partition function. In Bayes theory, the case $\beta = 1$ is the most important; in that case $Z_{n}(X^{n}; 1)$ is called the marginal likelihood.
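For a concrete instance, the tempered posterior and the partition function can be evaluated on a grid over $W$. The sketch below assumes a hypothetical Bernoulli model $p(x \mid w) = w^{x}(1-w)^{1-x}$ with uniform prior $\varphi(w) = 1$ on $W = [0, 1]$; for $\beta = 1$ this reproduces the usual conjugate Beta posterior:

```python
import numpy as np

# Hypothetical model: Bernoulli p(x|w) = w^x (1-w)^(1-x),
# uniform prior varphi(w) = 1 on W = [0, 1].
w = np.linspace(0.0, 1.0, 10_001)   # grid over the parameter space W
dw = w[1] - w[0]
prior = np.ones_like(w)

def tempered_posterior(x_n, beta):
    """Return p(w | X^n) on the grid and the partition function Z_n(X^n; beta)."""
    likelihood = np.prod([w**x * (1.0 - w)**(1 - x) for x in x_n], axis=0)
    unnormalized = prior * likelihood**beta
    # Z_n(X^n; beta) = int_W varphi(w) prod_i p(X_i|w)^beta dw, via a Riemann sum
    Z = float(np.sum(unnormalized) * dw)
    return unnormalized / Z, Z

x_n = [1, 1, 0, 1, 0, 1, 1]          # five ones, two zeros
posterior, Z = tempered_posterior(x_n, beta=1.0)
```

With $\beta = 1$ and this data, the posterior is $\mathrm{Beta}(6, 3)$ and $Z_{n}(X^{n}; 1) = B(6, 3) = 1/168$; note that $\beta$ tempers the likelihood only, not the prior.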

The expectation over the posterior distribution is defined by

\[\begin{eqnarray} \mathrm{E}_{w} \left[ f(w) \right] & := & \int_{W} f(w) p(w \mid X^{n}) \ dw \label{equation_01_07} \\ & = & \mathrm{E} \left[ f(\Theta) \mid X^{n} \right] \nonumber \end{eqnarray}\]

The predictive density function is defined by

\[\begin{eqnarray} p^{*}(x) & := & p(x \mid X^{n}) \nonumber \\ & := & \mathrm{E}_{w} \left[ p(x \mid w) \right] \nonumber \\ & = & \int_{W} p(x \mid w) \ p(w \mid X^{n}) \ dw . \label{equation_01_08} \end{eqnarray}\]

$p^{*}$ is an estimate of the true distribution $q$ in the Bayesian sense, called the Bayesian estimation.
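Continuing the grid approach, the predictive density is just the posterior average of $p(x \mid w)$. The sketch below (same hypothetical Bernoulli model with uniform prior, $\beta = 1$) computes $p^{*}(x)$ for $x \in \{0, 1\}$:

```python
import numpy as np

# Hypothetical example: Bernoulli p(x|w) = w^x (1-w)^(1-x), uniform prior on W = [0, 1].
w = np.linspace(0.0, 1.0, 10_001)
dw = w[1] - w[0]
x_n = [1, 1, 0, 1, 0, 1, 1]

# Posterior at beta = 1: p(w | X^n) proportional to varphi(w) * prod_i p(X_i | w).
unnorm = np.prod([w**x * (1.0 - w)**(1 - x) for x in x_n], axis=0)
posterior = unnorm / (np.sum(unnorm) * dw)

def predictive(x):
    """p*(x) = E_w[p(x | w)] = int_W p(x | w) p(w | X^n) dw, via a grid Riemann sum."""
    return float(np.sum(w**x * (1.0 - w)**(1 - x) * posterior) * dw)

p_star_1 = predictive(1)   # predictive probability of observing x = 1
```

Here the posterior is $\mathrm{Beta}(6, 3)$, so $p^{*}(1)$ equals the posterior mean $6/9 = 2/3$, and $p^{*}(0) + p^{*}(1) = 1$ as a density must satisfy.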

What we want to consider is the following: