Universal Algorithms in Signal Processing and Communications by Denver Greene


Chapter 4 Universal coding for classes of sources

This chapter is licensed under the Creative Commons Attribution License: http://creativecommons.org/licenses/by/3.0/


Summary

We have discussed several parametric sources; we now develop mathematical tools for investigating properties of universal codes that offer compression with respect to an entire class of parametric sources.

4.1 Preliminaries

Consider a class Λ of parametric models, where the parameter vector θ characterizes the distribution $p_\theta(x)$ of a specific source within this class, $\theta \in \Lambda$.

Example 4.1

Consider the class of memoryless sources over an alphabet α={1,2,...,r}. Here we have

(4.1) $\theta = \{ p(1), p(2), \ldots, p(r-1) \}.$

The goal is to find a fixed-to-variable-length lossless code that does not depend on the unknown θ, yet achieves

(4.2)
$\lim_{n \to \infty} \frac{1}{n} E_\theta\left[ l(X_1^n) \right] = H_\theta(X),$

where expectation is taken w.r.t. the distribution implied by θ. We have seen for

(4.3)
$p(x) = \frac{1}{2}\left[ p_1(x) + p_2(x) \right]$

that a code that is good for both sources (distributions) $p_1$ and $p_2$ exists, modulo the one-bit loss of Equation 3.2. Expanding beyond this idea, consider

(4.4) $p(x) = \int_\Lambda dw(\theta)\, p_\theta(x),$

where w(θ) is a prior.

Example 4.2

Let us revisit the memoryless source, choose r=2, and define the scalar parameter

(4.5)
$\theta = p(1) = 1 - p(0).$

Then

(4.6) $p_\theta(x) = \theta^{n_x(1)} \cdot (1-\theta)^{n_x(0)},$ where $n_x(a)$ denotes the number of occurrences of symbol $a$ in $x$,

and

(4.7) $p(x) = \int_0^1 d\theta \cdot \theta^{n_x(1)} \cdot (1-\theta)^{n_x(0)}.$

Moreover, it can be shown that

(4.8)
$p(x) = \frac{n_x(0)!\; n_x(1)!}{(n+1)!};$

this result appears in Krichevsky and Trofimov [2].
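As a quick numerical check, the closed form in Equation 4.8 can be compared against a direct discretization of the integral in Equation 4.7. The Python sketch below is my own illustration, not part of the original text; the counts $n_x(1)=3$, $n_x(0)=5$ and the grid size are arbitrary choices.

    import math

    def mixture_uniform_prior(n1, n0, grid=100000):
        # midpoint-rule approximation of the integral in Equation 4.7
        total = 0.0
        for i in range(grid):
            theta = (i + 0.5) / grid
            total += theta ** n1 * (1.0 - theta) ** n0
        return total / grid

    def closed_form(n1, n0):
        # Equation 4.8: p(x) = n_x(0)! n_x(1)! / (n+1)!
        n = n0 + n1
        return math.factorial(n0) * math.factorial(n1) / math.factorial(n + 1)

    n1, n0 = 3, 5
    print(mixture_uniform_prior(n1, n0))   # ~ 0.00198
    print(closed_form(n1, n0))             # 3! * 5! / 9! ~ 0.00198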

Is the source X implied by the distribution p(x) an ergodic source? Consider the event $\left\{ \frac{n_x(1)}{n} < \frac{1}{2} \right\}$. Owing to symmetry, in the limit of large n the probability of this event under p(x) must be $\frac{1}{2}$,

(4.9)
$\lim_{n \to \infty} p\left( \frac{n_x(1)}{n} < \frac{1}{2} \right) = \frac{1}{2}.$

On the other hand, recall that an ergodic source must allocate probability 0 or 1 to this flavor of event. Therefore, the source implied by p(x) is not ergodic.

Recall the definitions of pθ(x) and p(x) in Equation 4.6 and Equation 4.7, respectively. Based on these definitions, consider the following,

(4.10)
$D(p_\theta \| p) = \sum_x p_\theta(x) \log \frac{p_\theta(x)}{p(x)}.$

We get the following quantity for the mutual information between the random variable Θ and the random sequence $X_1^N$,

(4.11)
$I(\Theta; X_1^N) = \int_\Lambda dw(\theta)\, D(p_\theta \| p) = H(X_1^N) - H(X_1^N \mid \Theta).$

Note that this quantity represents the gain, in bits, from knowing the parameter θ; more will be said about it later.
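To make this gain concrete, $I(\Theta; X_1^N)$ can be approximated numerically for the Bernoulli class with a uniform prior. The sketch below is my own illustration, not part of the original text; it discretizes θ on a grid, uses the number of ones as a sufficient statistic for the sequence, and the grid resolution and block lengths are arbitrary.

    import math

    def log_binom_pmf(n, k, theta):
        # log P(K = k | theta) for K ~ Binomial(n, theta)
        return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(theta) + (n - k) * math.log(1.0 - theta))

    def mutual_information_bits(n, grid=200):
        # I(Theta; X_1^n) = I(Theta; K), since the number of ones K is a
        # sufficient statistic for theta under a memoryless Bernoulli source
        thetas = [(i + 0.5) / grid for i in range(grid)]
        w = 1.0 / grid                       # uniform prior weight per grid point
        pk_given = [[math.exp(log_binom_pmf(n, k, t)) for k in range(n + 1)]
                    for t in thetas]
        pk = [sum(w * pk_given[i][k] for i in range(grid)) for k in range(n + 1)]
        info = 0.0
        for i in range(grid):
            for k in range(n + 1):
                p = pk_given[i][k]
                if p > 0.0:
                    info += w * p * math.log2(p / pk[k])
        return info

    for n in (10, 100, 1000):
        print(n, round(mutual_information_bits(n), 3))   # grows roughly like 0.5*log2(n)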

4.2 Redundancy

We now define the conditional redundancy,

(4.12)
$R_n(l, \theta) = E_\theta\left[ l(X) \right] - H_\theta(X_1^n);$

this quantifies how far a coding length function l is from the entropy when the parameter θ is known. Note that

(4.13)
$R_n(l, \theta) = \sum_x p_\theta(x) \left[ l(x) + \log p_\theta(x) \right] \geq 0.$

Denote by $\mathcal{C}_n$ the collection of lossless codes for length-n inputs, and define the expected redundancy of a code $l \in \mathcal{C}_n$ with respect to a prior w by

(4.14)
$R_n(l, w) = \int_\Lambda dw(\theta)\, R_n(l, \theta).$

The asymptotic expected redundancy follows,

(4.15)
$R(l, w) = \lim_{n \to \infty} R_n(l, w),$

assuming that the limit exists.
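As an illustration of these definitions (my own sketch, not part of the original text), the conditional redundancy $R_n(l, \theta)$ can be evaluated for the Bernoulli source when l is the ideal code length $-\log_2 p(x)$ induced by the uniform-prior mixture of Equation 4.7; integer-length rounding is ignored, and the parameter value and block lengths are arbitrary.

    import math

    LOG2 = math.log(2.0)

    def h2(p):
        # binary entropy in bits
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def mixture_code_length(n, k):
        # -log2 p(x) for the uniform-prior mixture of Equation 4.7;
        # depends on x only through its number of ones k (Equation 4.8)
        return -(math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)) / LOG2

    def conditional_redundancy(n, theta):
        # R_n(l, theta) = E_theta[l(X)] - H_theta(X_1^n), in bits
        expected_length = 0.0
        for k in range(n + 1):
            log_pk = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                      + k * math.log(theta) + (n - k) * math.log(1.0 - theta))
            expected_length += math.exp(log_pk) * mixture_code_length(n, k)
        return expected_length - n * h2(theta)

    for n in (100, 1000, 10000):
        print(n, round(conditional_redundancy(n, 0.3), 2))  # grows like 0.5*log2(n) + const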

We can also define the minimum redundancy that incorporates the worst prior for the parameter,

(4.16)
$R_n^- = \sup_{w} \inf_{l \in \mathcal{C}_n} R_n(l, w),$

while keeping the best code. Similarly,

(4.17)
$R_n^+ = \inf_{l \in \mathcal{C}_n} \sup_{\theta \in \Lambda} R_n(l, \theta).$

Let us derive $R_n^-$,

(4.18)
$R_n^- = \sup_w \inf_{l \in \mathcal{C}_n} \int_\Lambda dw(\theta)\, R_n(l, \theta) = \sup_w I_w(\Theta; X_1^n) = C_n,$

where $C_n$ is the capacity of a channel from the sequence x to the parameter [4]. That is, we try to estimate the parameter from the output of the noisy channel.
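The capacity interpretation suggests a numerical route: discretize the parameter, treat the conditional law of the sequence given θ as a channel (with the number of ones as a sufficient statistic), and run the Blahut-Arimoto algorithm. The sketch below is my own illustration of this idea; the grid size, block length, and iteration count are arbitrary, and the discretization only approximates the capacity over the continuous parameter set.

    import math

    def binomial_channel(n, grid=50):
        # rows[i][k] = P(K = k | theta_i) with theta_i on a uniform grid in (0, 1)
        rows = []
        for i in range(grid):
            t = (i + 0.5) / grid
            rows.append([math.exp(math.lgamma(n + 1) - math.lgamma(k + 1)
                                  - math.lgamma(n - k + 1)
                                  + k * math.log(t) + (n - k) * math.log(1.0 - t))
                         for k in range(n + 1)])
        return rows

    def blahut_arimoto(W, iters=300):
        # capacity (in bits) of the discrete channel W[i][k] = P(output k | input i)
        m, nk = len(W), len(W[0])
        q = [1.0 / m] * m                        # prior over channel inputs
        for _ in range(iters):
            r = [sum(q[i] * W[i][k] for i in range(m)) for k in range(nk)]
            d = [sum(W[i][k] * math.log(W[i][k] / r[k])
                     for k in range(nk) if W[i][k] > 0.0) for i in range(m)]
            z = sum(q[i] * math.exp(d[i]) for i in range(m))
            q = [q[i] * math.exp(d[i]) / z for i in range(m)]
        r = [sum(q[i] * W[i][k] for i in range(m)) for k in range(nk)]
        return sum(q[i] * W[i][k] * math.log2(W[i][k] / r[k])
                   for i in range(m) for k in range(nk) if W[i][k] > 0.0)

    n = 100
    print(round(blahut_arimoto(binomial_channel(n)), 3))   # approximates C_n in bits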

In an analogous manner, we define

(4.19)
$R_n^+ = \inf_{Q} \sup_{\theta \in \Lambda} D(p_\theta \| Q),$

where Q is the distribution induced by the coding length function l, that is, $Q(x) = 2^{-l(x)}$.

4.3 Minimal redundancy

Note that

(4.20)
$\sup_{\theta \in \Lambda} D(p_\theta \| Q) \geq \int_\Lambda dw(\theta)\, D(p_\theta \| Q) \geq I_w(\Theta; X_1^n)$ for every prior w and every distribution Q.

Therefore,

(4.21)
$R_n^+ \geq \sup_w I_w(\Theta; X_1^n) = R_n^-.$

In fact, Gallager showed that $R_n^+ = R_n^-$. That is, the min-max and max-min redundancies are equal.

Let us revisit the Bernoulli source $p_\theta$, where $\theta \in \Lambda = [0,1]$. Using the mixture of Equation 4.7, which relies on a uniform prior over the sources, i.e., $w(\theta) = 1$ for all $\theta \in \Lambda$, it can be shown that there exists a universal code with length function l such that

(4.22)
$l(x) \leq n\, h_2\!\left( \frac{n_x(1)}{n} \right) + \log(n+1) + O(1),$

where $h_2(p) = -p\log(p) - (1-p)\log(1-p)$ is the binary entropy. That is, the redundancy is approximately log(n) bits.
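The log(n) figure can be checked directly from the mixture of Equation 4.8: since $\binom{n}{k} \leq 2^{n h_2(k/n)}$, the ideal code length $-\log_2 p(x)$ exceeds $n\, h_2(n_x(1)/n)$ by at most $\log_2(n+1)$ bits. A small sketch of this check (my own illustration; the block length and counts are arbitrary):

    import math

    def h2(p):
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def laplace_code_length(n, k):
        # -log2 p(x) for the uniform-prior mixture, p(x) = 1 / ((n+1) * C(n, k))
        return math.log2(n + 1) + math.log2(math.comb(n, k))

    n = 1000
    for k in (0, 10, 250, 500):
        slack = n * h2(k / n) + math.log2(n + 1) - laplace_code_length(n, k)
        print(k, round(slack, 2))   # always >= 0, so the excess is at most log2(n+1) bits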

Clarke and Barron [1] studied the weighting approach,

(4.23) $p(x) = \int_\Lambda dw(\theta)\, p_\theta(x),$

and constructed a prior that achieves $R_n^- = R_n^+$ precisely for memoryless sources.

Theorem 5 [1] For a memoryless source with an alphabet of size r and $\theta = (p(0), p(1), \cdots, p(r-1))$,

(4.24)
$R_n^+ = \frac{r-1}{2}\log\frac{n}{2\pi e} + \log \int_\Lambda \sqrt{\det I(\theta)}\, d\theta + O_n(1),$

where On(1) vanishes uniformly as n→∞ for any compact subset of Λ, and

(4.25)
$I_{jk}(\theta) = E_\theta\!\left[ \frac{\partial \log p_\theta(X)}{\partial \theta_j} \cdot \frac{\partial \log p_\theta(X)}{\partial \theta_k} \right]$

is Fisher's information.

Note that when the distribution is sensitive to changes in the parameter, I(θ) is large, which increases the redundancy. That is, greater sensitivity to the parameter means worse universal compression.

Denote

(4.26)
$J(\theta) = \frac{\sqrt{\det I(\theta)}}{\int_\Lambda \sqrt{\det I(\theta')}\, d\theta'};$

this is known as Jeffreys' prior. Using w(θ) = J(θ), it can be shown that $R_n^- = R_n^+$.

Example 4.3

Let us derive the Fisher information I(θ) for the Bernoulli source,

(4.27)
$E_\theta\!\left[ -\frac{\partial^2}{\partial \theta^2} \ln p_\theta(X) \right] = E_\theta\!\left[ \frac{X}{\theta^2} + \frac{1-X}{(1-\theta)^2} \right] = \frac{1}{\theta} + \frac{1}{1-\theta}.$

Therefore, the Fisher information satisfies $I(\theta) = \frac{1}{\theta(1-\theta)}$.
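With I(θ) in hand, the quantities appearing in Theorem 5 and in Jeffreys' prior can be evaluated numerically for the Bernoulli case (r = 2). The sketch below is my own illustration, not part of the original text; it checks that the normalizer $\int_0^1 \sqrt{I(\theta)}\, d\theta$ equals π and evaluates the leading-order redundancy of Equation 4.24 for a few arbitrary block lengths.

    import math

    def sqrt_fisher(theta):
        # sqrt(I(theta)) for the Bernoulli source, I(theta) = 1/(theta*(1-theta))
        return 1.0 / math.sqrt(theta * (1.0 - theta))

    # normalizer of Jeffreys' prior: integral of sqrt(I(theta)) over [0, 1]
    grid = 100000
    normalizer = sum(sqrt_fisher((i + 0.5) / grid) for i in range(grid)) / grid
    print(normalizer, math.pi)       # both approximately 3.14

    # leading-order minimax redundancy from Theorem 5, with r = 2
    for n in (100, 1000, 10000):
        rn = 0.5 * math.log2(n / (2 * math.pi * math.e)) + math.log2(normalizer)
        print(n, round(rn, 2))       # redundancy in bits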

Example 4.4

Recall the Krichevsky–Trofimov coding, which was mentioned in Example 4.2. Using the definition of Jeffreys' prior in Equation 4.26, we see that $J(\theta) = \frac{1}{\pi\sqrt{\theta(1-\theta)}}$, because $\int_0^1 \frac{d\theta}{\sqrt{\theta(1-\theta)}} = \pi$. Taking the integral over Jeffreys' prior,

(4.28)
$p(x) = \int_0^1 \frac{\theta^{n_x(1)}\,(1-\theta)^{n_x(0)}}{\pi\sqrt{\theta(1-\theta)}}\, d\theta = \frac{\Gamma\!\left(n_x(0)+\frac{1}{2}\right)\,\Gamma\!\left(n_x(1)+\frac{1}{2}\right)}{\pi\,\Gamma(n+1)},$

where we used the gamma function. It can be shown that

(4.29)
$p(x) = \prod_{t=1}^{n} p\left( x_t \mid x_1^{t-1} \right),$

where

(4.30)
$p\left( x_t = 1 \mid x_1^{t-1} \right) = \frac{n_{x_1^{t-1}}(1) + \frac{1}{2}}{t}.$

As before, this universal code can be implemented sequentially. It is due to Krichevsky and Trofimov [2], its redundancy satisfies Theorem 5 of Clarke and Barron [1], and it is commonly used in universal lossless compression.
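A minimal sketch of the sequential implementation (my own illustration, with an arbitrary test sequence): the Krichevsky–Trofimov assignment is accumulated one symbol at a time via Equation 4.30, and the resulting probability agrees with the Gamma-function closed form of Equation 4.28.

    import math

    def kt_sequential_log2prob(x):
        # accumulate log2 p(x) symbol by symbol using Equation 4.30
        n0 = n1 = 0
        log2p = 0.0
        for bit in x:
            p_one = (n1 + 0.5) / (n0 + n1 + 1.0)
            log2p += math.log2(p_one if bit else 1.0 - p_one)
            if bit:
                n1 += 1
            else:
                n0 += 1
        return log2p

    def kt_closed_form_log2prob(n0, n1):
        # log2 of Gamma(n0 + 1/2) Gamma(n1 + 1/2) / (pi * Gamma(n + 1)), Equation 4.28
        ln_p = (math.lgamma(n0 + 0.5) + math.lgamma(n1 + 0.5)
                - math.log(math.pi) - math.lgamma(n0 + n1 + 1))
        return ln_p / math.log(2.0)

    x = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1]
    print(kt_sequential_log2prob(x))
    print(kt_closed_form_log2prob(x.count(0), x.count(1)))   # the two values agree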

4.4 Rissanen's bound

Let us consider, on an intuitive level, why the redundancy behaves like $\frac{r-1}{2}\log(n)$ bits. Expending $\frac{r-1}{2}\log(n)$ bits allows us to differentiate between $n^{(r-1)/2}$ parameter vectors. That is, we would differentiate between each of the r−1 parameters with $\sqrt{n}$ levels. Now consider a Bernoulli RV with (unknown) parameter θ.

One perspective is that with n drawings of the RV, the standard deviation in the number of 1's is $\sqrt{n\theta(1-\theta)} = O(\sqrt{n})$. That is, $\sqrt{n}$ levels differentiate between parameter values up to a resolution that reflects the randomness of the experiment.
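A quick simulation (my own illustration; the parameter, block length, and number of trials are arbitrary) confirms this $\sqrt{n\theta(1-\theta)}$ scale of the fluctuations.

    import math
    import random

    random.seed(0)
    n, theta, trials = 2500, 0.3, 1000
    counts = [sum(1 for _ in range(n) if random.random() < theta) for _ in range(trials)]
    mean = sum(counts) / trials
    std = math.sqrt(sum((c - mean) ** 2 for c in counts) / trials)
    print(std, math.sqrt(n * theta * (1 - theta)))   # both close to 23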

A second perspective is that of coding a sequence of Bernoulli outcomes with an imprecise parameter, where it is convenient to think of a universal code in terms of first quantizing the parameter and then using that (imprecise) parameter to encode the input x. For the Bernoulli example, the maximum likelihood parameter $\theta_{ML}$ satisfies

(4.31)
$\theta_{ML} = \arg\max_{\theta \in \Lambda} p_\theta(x) = \frac{n_x(1)}{n},$

and plugging this parameter $\theta = \theta_{ML}$ into $p_\theta(x)$ minimizes the coding length among all possible parameters $\theta \in \Lambda$. It is readily seen that

(4.32)
$l_{\theta_{ML}}(x) = -\log p_{\theta_{ML}}(x) = n\, h_2\!\left( \frac{n_x(1)}{n} \right).$

Suppose, however, that we were to encode with $\theta' = \theta_{ML} + \Delta$. Then the coding length would be

(4.33)
$l_{\theta'}(x) = -n_x(1)\log\left(\theta_{ML}+\Delta\right) - n_x(0)\log\left(1-\theta_{ML}-\Delta\right).$
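The cost of such a mismatch can be computed directly. The sketch below is my own illustration (the counts and the offsets Δ are arbitrary); it shows that when Δ is on the order of $1/\sqrt{n}$ the excess over the maximum-likelihood coding length is only a few bits, whereas a coarser mismatch is far more expensive, consistent with the $\sqrt{n}$-level intuition above.

    import math

    def code_length_bits(n1, n0, theta):
        # coding length -n1*log2(theta) - n0*log2(1 - theta) under parameter theta
        return -n1 * math.log2(theta) - n0 * math.log2(1.0 - theta)

    n, n1 = 10000, 3000
    n0 = n - n1
    theta_ml = n1 / n
    optimal = code_length_bits(n1, n0, theta_ml)          # equals n * h2(n1/n)
    for delta in (1.0 / n, 1.0 / math.sqrt(n), 0.05):
        excess = code_length_bits(n1, n0, theta_ml + delta) - optimal
        print(delta, round(excess, 3))                    # excess bits due to the mismatch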

It can be shown that this coding length is suboptimal w.r.t. lθML(