
memo

Chapter 1. Mathematical background

1.1 The definition of Bayesian estimation

Remark 2

In statistics, estimating the true distribution based on samples is called statistical inference or statistical estimation. In information theory, however, such estimation is called statistical learning.

Let $f: (\mathbb{R}^{N})^{n} \rightarrow \mathbb{R}$.

\[\mathrm{E} \left[ f(X^{n}) \right] = \int_{\mathbb{R}^{N}} \cdots \int_{\mathbb{R}^{N}} f(x_{1}, \ldots, x_{n}) \ \prod_{i=1}^{n} q(x_{i}) \ dx_{i} .\]
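As a quick numerical check, this expectation can be approximated by Monte Carlo: draw $n$ i.i.d. samples from $q$ many times and average $f$. A minimal sketch, under hypothetical choices not fixed by the text ($N = 1$, $q$ standard normal, $f$ the sample mean, so $\mathrm{E}[f(X^{n})] = 0$):

```python
import numpy as np

rng = np.random.default_rng(0)

def expectation(f, sample_q, n, n_trials=20_000):
    """Monte Carlo estimate of E[f(X^n)] for X_1, ..., X_n i.i.d. ~ q."""
    return float(np.mean([f(sample_q(n)) for _ in range(n_trials)]))

# Hypothetical choices: N = 1, q = standard normal, f = sample mean,
# so the true value E[f(X^n)] is 0.
est = expectation(np.mean, lambda size: rng.standard_normal(size), n=10)
```

The estimate `est` converges to $\mathrm{E}[f(X^{n})]$ as the number of trials grows, at the usual $O(1/\sqrt{\text{trials}})$ Monte Carlo rate.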

Given $p(x \mid w)$ and $\varphi(w)$, we can define random variables whose p.d.f. is

\[\begin{eqnarray} x \in \mathbb{R}^{N}, \ \int_{W} p(x \mid w) \varphi(w) \ dw & = & \int_{W} \frac{ p(x, w) }{ \varphi(w) } \varphi(w) \ dw \nonumber \\ & = & p(x) . \nonumber \end{eqnarray}\]

We denote the random variables defined by the above p.d.f. by $X_{i}$, an $\mathbb{R}^{N}$-valued random variable for each $i = 1, \ldots, n$. In addition, we assume these variables are conditionally independent given the parameter, that is,

\[\begin{eqnarray} p(x^{n} \mid w) = p(x_{1}, \ldots, x_{n} \mid w) = \prod_{i=1}^{n} p(x_{i} \mid w) \nonumber \end{eqnarray}\]

Note that the random variables defined by the true distribution are assumed to be i.i.d. with distribution $q$; the random variables defined above, however, need not be independent, only conditionally independent given the parameter.
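This distinction can be illustrated numerically. In the sketch below (a hypothetical model, not from the text), $\Theta \sim N(0,1)$ plays the role of the prior $\varphi$ and $X_{i} \mid \Theta = w \sim N(w, 1)$ are conditionally i.i.d.; marginally, however, $\mathrm{Cov}(X_{1}, X_{2}) = \mathrm{Var}(\Theta) = 1 \neq 0$, so the $X_{i}$ are not independent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model (not from the text): Theta ~ N(0, 1) plays the role of varphi,
# and X_i | Theta = w ~ N(w, 1) are conditionally i.i.d.
n_trials = 200_000
theta = rng.standard_normal(n_trials)
x1 = theta + rng.standard_normal(n_trials)   # draw of X_1 given Theta
x2 = theta + rng.standard_normal(n_trials)   # draw of X_2 given Theta

# Marginally, Cov(X_1, X_2) = Var(Theta) = 1, so X_1 and X_2 are NOT independent,
# even though they are independent conditionally on Theta.
cov = float(np.cov(x1, x2)[0, 1])
```

The empirical covariance comes out close to $1$, confirming the marginal dependence.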

Remark

$p(x \mid w)$ is the conditional p.d.f. of $X$ given $\Theta = w$, where $\Theta$ is a random variable whose p.d.f. is $\varphi$.

$p(w \mid X^{n})$ is the conditional p.d.f. of $\Theta$ given $X^{n}$, where $\Theta$ is a random variable whose p.d.f. is $\varphi$.

The posterior distribution with inverse temperature $\beta > 0$ is defined by

\[\begin{eqnarray} p(w \mid X^{n}) & := & \frac{ 1 }{ Z_{n}(X^{n}; \beta) } \varphi(w) \prod_{i=1}^{n} p(X_{i} \mid w)^{\beta} \label{equation_01_05} \\ Z_{n}(X^{n}; \beta) & := & \int_{W} \varphi(w) \prod_{i=1}^{n} p(X_{i} \mid w)^{\beta} \ dw \label{equation_01_06} \end{eqnarray}\]

$Z_{n}(X^{n}; \beta)$ is called the partition function. In Bayes theory, the case $\beta = 1$ is the most important; in that case $Z_{n}(X^{n}; 1)$ is called the marginal likelihood.
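For a concrete instance, the tempered posterior and the partition function can be evaluated on a grid over $W$. The sketch below assumes a hypothetical Bernoulli model $p(x \mid w) = w^{x}(1-w)^{1-x}$ with uniform prior $\varphi(w) = 1$ on $W = [0, 1]$; for $\beta = 1$ this reproduces the usual conjugate Beta posterior:

```python
import numpy as np

# Hypothetical model: Bernoulli p(x|w) = w^x (1-w)^(1-x),
# uniform prior varphi(w) = 1 on W = [0, 1].
w = np.linspace(0.0, 1.0, 10_001)   # grid over the parameter space W
dw = w[1] - w[0]
prior = np.ones_like(w)

def tempered_posterior(x_n, beta):
    """Return p(w | X^n) on the grid and the partition function Z_n(X^n; beta)."""
    likelihood = np.prod([w**x * (1.0 - w)**(1 - x) for x in x_n], axis=0)
    unnormalized = prior * likelihood**beta
    # Z_n(X^n; beta) = int_W varphi(w) prod_i p(X_i|w)^beta dw, via a Riemann sum
    Z = float(np.sum(unnormalized) * dw)
    return unnormalized / Z, Z

x_n = [1, 1, 0, 1, 0, 1, 1]          # five ones, two zeros
posterior, Z = tempered_posterior(x_n, beta=1.0)
```

With $\beta = 1$ and this data, the posterior is $\mathrm{Beta}(6, 3)$ and $Z_{n}(X^{n}; 1) = B(6, 3) = 1/168$; note that $\beta$ tempers the likelihood only, not the prior.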

The expectation over the posterior distribution is defined by

\[\begin{eqnarray} \mathrm{E}_{w} \left[ f(w) \right] & := & \int_{W} f(w) p(w \mid X^{n}) \ dw \label{equation_01_07} \\ & = & \mathrm{E} \left[ f(\Theta) \mid X^{n} \right] \nonumber \end{eqnarray}\]

The predictive density function is defined by

\[\begin{eqnarray} p^{*}(x) & := & p(x \mid X^{n}) \nonumber \\ & := & \mathrm{E}_{w} \left[ p(x \mid w) \right] \nonumber \\ & = & \int_{W} p(x \mid w) \ p(w \mid X^{n}) \ dw . \label{equation_01_08} \end{eqnarray}\]

$p^{*}$ is an estimate of the true distribution $q$ in the Bayesian sense, called the Bayesian estimation.
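Continuing the grid approach, the predictive density is just the posterior average of $p(x \mid w)$. The sketch below (same hypothetical Bernoulli model with uniform prior, $\beta = 1$) computes $p^{*}(x)$ for $x \in \{0, 1\}$:

```python
import numpy as np

# Hypothetical example: Bernoulli p(x|w) = w^x (1-w)^(1-x), uniform prior on W = [0, 1].
w = np.linspace(0.0, 1.0, 10_001)
dw = w[1] - w[0]
x_n = [1, 1, 0, 1, 0, 1, 1]

# Posterior at beta = 1: p(w | X^n) proportional to varphi(w) * prod_i p(X_i | w).
unnorm = np.prod([w**x * (1.0 - w)**(1 - x) for x in x_n], axis=0)
posterior = unnorm / (np.sum(unnorm) * dw)

def predictive(x):
    """p*(x) = E_w[p(x | w)] = int_W p(x | w) p(w | X^n) dw, via a grid Riemann sum."""
    return float(np.sum(w**x * (1.0 - w)**(1 - x) * posterior) * dw)

p_star_1 = predictive(1)   # predictive probability of observing x = 1
```

Here the posterior is $\mathrm{Beta}(6, 3)$, so $p^{*}(1)$ equals the posterior mean $6/9 = 2/3$, and $p^{*}(0) + p^{*}(1) = 1$ as a density must satisfy.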

What we want to consider is the following: