1.1 The definition of bayesian estimation
- $(\Omega, \mathcal{F}, P)$,
- probability sp.
- $N \in \mathbb{N}$,
- the dimention of the inputs
- $n \in \mathbb{N}$,
- the number of input data
- $x_{i} \in \mathbb{R}^{N}$ $(i = 1, \ldots, n)$,
- we call \(\{x_{i}\}_{i=1, \ldots, n}\) a sample,
- $x_{i}$ is the actual observed data
- $x^{n} := (x_{1}, \ldots, x_{n})$,
- $X_{i, q} \ (i = 1, \ldots, n)$,
- i.i.d. of true r.v. $X$ whose p.d.f. is $q$,
- $q_{X^{n}}$,
- p.d.f. over $(\mathbb{R}^{N})^{n}$,
- $X_{q}^{n} = (X_{1}, \ldots, X_{n})$,
- the r.v.s whose distribution is $q_{X_{q}^{n}}$,
Remark 2
In statitstics, estimating true distribution based on samples is called statistical inference/estimation. However, in information theory, such the estimation is called statistical learning.
Let $f: (\mathbb{R}^{N})^{n} \rightarrow \mathbb{R}$.
\[\mathrm{E} \left[ f(X^{n}) \right] = \int_{\mathbb{R}^{N}} \cdots \int_{\mathbb{R}^{N}} f(x_{1}, \ldots, x_{n}) \ \prod_{i=1}^{n} q(x) dx_{i} .\]- $d \in \mathbb{N}$,
- the dimension of the parameters of given model
- $W \subseteq \mathbb{R}^{d}$,
- parameter space
- $\Theta$,
- r.v. of parameter
- $p(x \mid w)$,
- the p.d.f. of $X$ given $\Theta = w$,
- statistical model
- $\varphi: W \rightarrow \mathbb{R}$,
- the p.d.f. of $\Theta$,
- prior distribution
- $\beta \in \mathbb{R}_{> 0}$,
- constant
- inverse temperature
Given $p(x \mid w)$, $\varphi(x)$, we can define random variables whose p.d.f. is
\[\begin{eqnarray} x \in \mathbb{R}^{N}, \ \int_{W} p(x \mid w) \varphi(x) \ dx & = & \int_{W} p(x \mid w) \frac{ p(x, w) }{ \varphi(w) } \varphi(x) \ dx \nonumber \\ & = & p(x) \nonumber \end{eqnarray}\]We denote the random variables defined the above p.d.f. by $X_{i}$ which is $\mathbb{R}^{N}$-valued function for each $i = 1, \ldots, n$. In addition, we assume this variables is conditional independent to parameters, that is,
\[\begin{eqnarray} p(x^{n} \mid w) = p(x_{1}, \ldots, x_{n} \mid w) = \prod_{i=1}^{n} p(x_{i} \mid w) \nonumber \end{eqnarray}\]Note that we assume that the random variables defined by the true distribution is I.I.D. of $q$, however, the random variables defiened above don’t have to be independent, but conditionally independent to parameters.
Remark
$p(x \mid w)$ is the p.d.f. of
\[\mathrm{E} \left[ X \mid \Theta = w \right]\]where $\Theta$ is a random variable whose p.d.f. is $\varphi$.
$p(w \mid X^{n})$ is the p.d.f. of
\[\mathrm{E} \left[ \Theta \mid X^{n} \right]\]where $\Theta$ is a random variable whose p.d.f. is $\varphi$.
Posterior distribution with inverse temparature $\beta$ is defined by
\[\begin{eqnarray} p(w \mid X^{n}) & := & \frac{ 1 }{ Z_{n}(X^{n}; \beta) } \varphi(w) \prod_{i=1}^{n} p(X_{i} \mid w)^{\beta} \label{equation_01_05} \\ Z_{n}(X^{n}; \beta) & := & \int_{W} \varphi(w) \prod_{i=1}^{n} p(X_{i} \mid w)^{\beta} \ dw \label{equation_01_06} \end{eqnarray}\]$Z_{n}(X^{n}; \beta)$ is called the partition function. In Bayes theory, the case of $\beta = 1$ is the most important. If $\beta=1$, $Z_{n}(X^{n}; 1)$ is called the marginal likelihood.
\[\begin{eqnarray} \mathrm{E}_{w} \left[ f(w) \right] & := & \int_{W} f(w) p(w \mid X^{n}) \ dw \label{equation_01_07} \\ & = & \mathrm{E} \left[ f(\Theta) \mid X^{n} \right] \nonumber \end{eqnarray}\]The predictive density function is defined by
\[\begin{eqnarray} p^{*}(x) & := & p(x \mid X^{n}) \nonumber \\ & := & \mathrm{E}_{w} \left[ p(x \mid w) \right] \nonumber \\ & = & \int_{W} p(x \mid w) \ p(w \mid X^{n}) \ dw . \label{equation_01_08} \end{eqnarray}\]$p^{*}$ is an estimation of the true distribution $q$ in the Bayesian sense, called Baysian estimation.
What we want to consider is the following;
- 1: how accurately predictive density function $p^{*}$ approximates the true distribution, $q$,
- 2: how $p^{*}$ efficiently approximates $q$,
- 3: mathematical formulation of the above concerns