Universal Algorithms in Signal Processing and Communications by Denver Greene


Chapter 6 Beyond lossless compression

This chapter is licensed under the Creative Commons Attribution License: http://creativecommons.org/licenses/by/3.0/



6.1 Universal lossy compression

Consider $x^n \in \alpha^n$. The goal of lossy compression [3] is to describe $\hat{x}^n$, also of length n but possibly defined over another reconstruction alphabet $\hat{\alpha}$, such that the description requires few bits and the distortion

(6.1)
$d(x^n, \hat{x}^n) \;\triangleq\; \frac{1}{n} \sum_{i=1}^{n} d(x_i, \hat{x}_i)$

is small, where d(·,·) is some distortion metric. It is well known that for every d(·,·) and distortion level D there is a minimum rate R(D) at which $\hat{x}^n$ can be described while keeping the distortion at most D. The rate R(D) is known as the rate distortion (RD) function; it is the fundamental information theoretic limit of lossy compression [3], [9].
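
As a concrete point of reference (a standard closed-form example, not derived in this chapter): for a Bernoulli(p) source under Hamming distortion, the RD function is

$R(D) = h_2(p) - h_2(D)$ for $0 \le D \le \min(p, 1-p)$, and $R(D) = 0$ otherwise, where $h_2(q) \triangleq -q \log_2 q - (1-q) \log_2(1-q)$.

For example, a Bernoulli(0.5) source can be described at about 0.53 bits per symbol if a bit-error rate of D = 0.1 is tolerated, compared to 1 bit per symbol for lossless coding.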

The design of good lossy compression algorithms has been a challenging problem for decades. Despite numerous applications such as image compression [24], [16], video compression [23], and speech coding [18], [4], [20], there is a significant gap between theory and practice: practical lossy compressors do not achieve the RD function, while theoretical constructions that achieve the RD function are impractical.

A promising recent algorithm by Jalali and Weissman [13] is universal in the limit of infinite runtime, and its RD performance is reasonable even with modest runtime. The main idea is that the distorted version $\hat{x}^n$ of the input $x^n$ is computed as follows,

(6.2)
$\hat{x}^n \;=\; \arg\min_{w^n \in \hat{\alpha}^n} \left[ \, H_k(w^n) - \beta\, d(x^n, w^n) \, \right]$

where β < 0 is the slope at the particular point of interest on the RD function, and $H_k(\cdot)$ is the empirical conditional entropy of order k,

(6.3)
$H_k(w^n) \;\triangleq\; \frac{1}{n} \sum_{u^k \in \hat{\alpha}^k} \sum_{a \in \hat{\alpha}} n_w(u^k, a) \, \log_2\!\left( \frac{\sum_{b \in \hat{\alpha}} n_w(u^k, b)}{n_w(u^k, a)} \right)$

where $u^k$ is a context of order k, and as before $n_w(u^k, a)$ is the number of times that the symbol a appears following the context $u^k$ in $w^n$. Jalali and Weissman proved [14] that when k = o(log(n)), the RD pair $\left( H_k(\hat{x}^n),\, d(x^n, \hat{x}^n) \right)$ converges to the RD function asymptotically in n. Therefore, an excellent lossy compression technique is to compute $\hat{x}^n$ and then compress it losslessly. Moreover, this compression can be universal. In particular, the choice of context order k = o(log(n)) ensures that universal compressors for context tree sources can emulate the coding length of the empirical conditional entropy $H_k(\hat{x}^n)$.
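
As an illustration of Equation 6.3, the following Python sketch computes $H_k(w^n)$ directly from the counts $n_w(u^k, a)$. The function name and the simplified handling of the first k symbols (which are skipped rather than given a boundary context) are choices of this sketch, not from the text.

```python
import math
from collections import defaultdict

def empirical_conditional_entropy(w, k):
    """Order-k empirical conditional entropy H_k(w^n) in bits per symbol (Equation 6.3)."""
    n = len(w)
    counts = defaultdict(lambda: defaultdict(int))   # counts[context][symbol] = n_w(u^k, a)
    for i in range(k, n):
        counts[tuple(w[i - k:i])][w[i]] += 1
    total = 0.0
    for sym_counts in counts.values():
        ctx_total = sum(sym_counts.values())         # sum over b of n_w(u^k, b)
        for c in sym_counts.values():
            total += c * math.log2(ctx_total / c)
    return total / n

# A highly repetitive sequence is almost perfectly predictable from its order-2 context,
# so its empirical conditional entropy is (near) zero.
print(empirical_conditional_entropy([0, 1] * 500, k=2))
```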

Despite this excellent potential performance, there is still a tremendous challenge. Brute-force computation of the globally minimum energy solution $\hat{x}^n$ involves an exhaustive search over exponentially many sequences and is thus infeasible. Therefore, Jalali and Weissman rely on Markov chain Monte Carlo (MCMC) [12], a stochastic relaxation approach to optimization. The crux of the matter is to define an energy function,

(6.4)
$\mathcal{E}(w^n) \;\triangleq\; n \left[ \, H_k(w^n) - \beta\, d(x^n, w^n) \, \right]$

The Boltzmann probability mass function (pmf) is

(6.5)
$f_t(w^n) \;\triangleq\; \frac{1}{Z_t} \exp\!\left( - \frac{\mathcal{E}(w^n)}{t} \right)$

where t > 0 is related to temperature in simulated annealing, and $Z_t$ is the normalization constant, which does not need to be computed.

Because it is difficult to sample from the Boltzmann pmf (Equation 6.5) directly, we instead use a Gibbs sampler, which sweeps through the n locations and resamples one symbol at a time while the rest of $w^n$ is kept fixed. For each location i, the Gibbs sampler resamples $w_i$ from its distribution conditioned on $w^{n \setminus i}$ (the rest of the sequence) as induced by the joint pmf in Equation 6.5, which is computed as follows,

(6.6)
$f_t\!\left( w_i = a \mid w^{n \setminus i} \right) \;=\; \frac{ f_t\!\left( w^{i-1}\, a\, w_{i+1}^n \right) }{ \sum_{b \in \hat{\alpha}} f_t\!\left( w^{i-1}\, b\, w_{i+1}^n \right) }$

We can now write,

(6.7)
$\frac{ f_t\!\left( w^{i-1}\, b\, w_{i+1}^n \right) }{ f_t\!\left( w^{i-1}\, a\, w_{i+1}^n \right) } \;=\; \exp\!\left( - \frac{1}{t} \left[ \, n\, \Delta H_k(w^n, i, a, b) - \beta\, \Delta d(w^n, i, a, b) \, \right] \right)$

where $\Delta H_k(w^n, i, a, b)$ is the change in $H_k(w^n)$ when $w_i = a$ is replaced by b, and $\Delta d(w^n, i, a, b) \triangleq d(x_i, b) - d(x_i, a)$ is the corresponding change in the total distortion $n\, d(x^n, w^n)$. Combining Equation 6.6 and Equation 6.7,

(6.8)
$f_t\!\left( w_i = a \mid w^{n \setminus i} \right) \;=\; \left( \sum_{b \in \hat{\alpha}} \exp\!\left( - \frac{1}{t} \left[ \, n\, \Delta H_k(w^n, i, a, b) - \beta\, \Delta d(w^n, i, a, b) \, \right] \right) \right)^{-1}$

The maximum change in the energy within an iteration of the MCMC algorithm is then bounded by

(6.9)
$\max_{i, a, b} \left| \, \mathcal{E}\!\left( w^{i-1}\, b\, w_{i+1}^n \right) - \mathcal{E}\!\left( w^{i-1}\, a\, w_{i+1}^n \right) \right| \;\le\; n \max_{i, a, b} \left| \Delta H_k(w^n, i, a, b) \right| + |\beta| \max_{i, a, b} \left| \Delta d(w^n, i, a, b) \right|$
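
As a minimal sketch of how Equation 6.8 is used, the snippet below turns per-candidate energy differences into a resampling pmf. The arrays `delta_H` (change in $H_k(w^n)$) and `delta_d` (change in total distortion) are assumed to be precomputed for each candidate symbol; the function name and interface are illustrative, not the authors' implementation.

```python
import numpy as np

def gibbs_resample_pmf(delta_H, delta_d, n, beta, t):
    """Resampling pmf over candidate symbols for one location (Equation 6.8).

    delta_H[b]: change in H_k(w^n) if w_i is replaced by candidate b
    delta_d[b]: change in the total distortion, d(x_i, b) - d(x_i, w_i)
    Entries for the current symbol are zero, since nothing changes.
    """
    delta_energy = n * np.asarray(delta_H) - beta * np.asarray(delta_d)
    weights = np.exp(-delta_energy / t)      # Boltzmann weights
    return weights / weights.sum()           # normalize to a pmf

# Two candidates: the one with the smaller energy change gets the larger probability.
print(gibbs_resample_pmf([0.0, 0.002], [0.0, -1.0], n=1000, beta=-1.0, t=0.5))
```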

We refer to the resampling of a single location as an iteration, and we group n consecutive iterations, one per location, into a super-iteration.[2]

During the simulated annealing, the temperature t is gradually decreased; in super-iteration i we use t = O(1/log(i)) [12], [13]. In each iteration, the Gibbs sampler modifies $w^n$ in a random manner that resembles heat bath concepts in statistical physics. Although MCMC can get stuck in a local minimum, we decrease the temperature slowly enough that the randomness of Gibbs sampling eventually drives MCMC out of local minima toward the set of minimal energy solutions, which includes $\hat{x}^n$, because low temperature t favors low-energy $w^n$. Pseudo-code for our encoder appears in Algorithm 1 below.

Algorithm 1: Lossy encoder with fixed reproduction alphabet
Input: $x^n \in \alpha^n$, $\hat{\alpha}$, $\beta$, $c$, $r$
Output: bit-stream
Procedure:

  1. Initialize $w^n$ by quantizing $x^n$ with $\hat{\alpha}$

  2. Initialize $n_w(\cdot,\cdot)$ using $w^n$

  3. for $i = 1$ to $r$ do // super-iteration

  4.      $t \leftarrow c / \log(i + 1)$ // temperature

  5.      Draw a permutation of $\{1, \ldots, n\}$ at random

  6.      for $i' = 1$ to $n$ do

  7.         Let $j$ be component $i'$ of the permutation

  8.         Generate new $w_j$ by sampling from $f_t\!\left( w_j \mid w^{n \setminus j} \right)$ in Equation 6.8

  9.         Update $n_w(\cdot,\cdot)$

  10. Apply CTW to $w^n$ // compress outcome
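
The following is a minimal Python sketch of Algorithm 1 under several simplifying assumptions: square error distortion, naive recomputation of $H_k$ for every candidate symbol (Section 6.2 explains how Jalali and Weissman reduce this to O(k) updates), and the final CTW step omitted, so the function returns $w^n$ rather than a bit-stream. All identifiers are illustrative, not from the text.

```python
import math
import random
from collections import defaultdict

def empirical_cond_entropy(w, k):
    """Order-k empirical conditional entropy H_k(w^n) in bits per symbol (Equation 6.3)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(w)):
        counts[tuple(w[i - k:i])][w[i]] += 1
    total = 0.0
    for sym_counts in counts.values():
        ctx_total = sum(sym_counts.values())
        for c in sym_counts.values():
            total += c * math.log2(ctx_total / c)
    return total / len(w)

def energy(x, w, k, beta):
    """Energy of Equation 6.4: E(w^n) = n * [H_k(w^n) - beta * d(x^n, w^n)], square error."""
    n = len(x)
    dist = sum((xi - wi) ** 2 for xi, wi in zip(x, w)) / n
    return n * (empirical_cond_entropy(w, k) - beta * dist)

def mcmc_lossy_encode(x, alphabet, beta, c, r, k=2, seed=0):
    """Sketch of Algorithm 1: simulated annealing with random-scan Gibbs sampling."""
    rng = random.Random(seed)
    n = len(x)
    # Step 1: initialize w^n by quantizing each x_i to the nearest reproduction level
    w = [min(alphabet, key=lambda a: (xi - a) ** 2) for xi in x]
    for i in range(1, r + 1):                    # super-iterations
        t = c / math.log(i + 1)                  # temperature schedule, t = O(1/log(i))
        order = list(range(n))
        rng.shuffle(order)                       # random permutation of the n locations
        for j in order:                          # one iteration per location
            # Energy of every candidate symbol at location j (naive full recomputation)
            energies = []
            for b in alphabet:
                w[j] = b
                energies.append(energy(x, w, k, beta))
            # Resample w_j from the Boltzmann conditional of Equation 6.8
            base = min(energies)
            weights = [math.exp(-(e - base) / t) for e in energies]
            u, acc = rng.random() * sum(weights), 0.0
            for b, wt in zip(alphabet, weights):
                acc += wt
                if u <= acc:
                    w[j] = b
                    break
    return w

# Toy usage: map a noisy two-level signal onto five reproduction levels.
x = [0.1, 0.0, -0.1, 1.1, 0.9, 1.0, 1.05, 0.02]
levels = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(mcmc_lossy_encode(x, levels, beta=-5.0, c=0.5, r=20, k=1))
```

The toy call is only meant to show the interface; in practice n, r, and the alphabet are much larger, which is exactly why the computational issues discussed next matter.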

6.2 Computational issues

Looking at the pseudo-code, it is clear that the following could be computational bottlenecks:

  1. Initializing $n_w(\cdot,\cdot)$ - a naive implementation needs to scan the sequence $w^n$ (complexity O(n)) and initialize a data structure with $|\hat{\alpha}|^{k+1}$ elements. Unless $|\hat{\alpha}|^{k+1} = O(n)$, this is super-linear in n. Therefore, we recall that k = o(log(n)), which keeps $|\hat{\alpha}|^{k+1}$ sub-linear in n, so that initializing $n_w(\cdot,\cdot)$ requires linear complexity O(n) (a sketch of this initialization appears after this list).

  2. The inner loop is run rn times, and each time computing $f_t\!\left( w_j \mid w^{n \setminus j} \right)$ for all possible $w_j \in \hat{\alpha}$ might be challenging. In particular, let us consider computation of Δd and ΔH.

    1. Computation of Δd requires constant time, and is not burdensome.

    2. Computation of ΔH requires modifying the symbol counts for each context affected by the change. A key contribution by Jalali and Weissman was to recognize that the array of symbol counts, $n_w(\cdot,\cdot)$, changes in only O(k) locations, where k is the context order. Therefore, each computation of ΔH requires O(k) time. Seeing that $|\hat{\alpha}|$ such computations are needed per iteration, and there are rn iterations, this step costs $O(r n k |\hat{\alpha}|)$ overall (a sketch of the count update appears after this list).

    3. Updating $n_w(\cdot,\cdot)$ after $w_j$ is re-sampled from the Boltzmann distribution also requires O(k) time. However, this step is performed only once per iteration, and not $|\hat{\alpha}|$ times. Therefore, this step requires less computation than step (b).
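
A sketch of the bookkeeping behind these steps, assuming the counts $n_w(\cdot,\cdot)$ are stored in a hash table keyed by (context, symbol) pairs: changing $w_j$ affects only the positions whose length-k context window contains j, so at most k+1 entries lose a count and at most k+1 gain one. The helper names are mine, and turning the touched entries into ΔH is left out.

```python
from collections import defaultdict

def build_counts(w, k):
    """n_w(context, symbol): number of times `symbol` follows `context` in w (linear scan)."""
    counts = defaultdict(int)
    for i in range(k, len(w)):
        counts[(tuple(w[i - k:i]), w[i])] += 1
    return counts

def update_counts(counts, w, j, b, k):
    """Replace w[j] by b, touching only the O(k) affected (context, symbol) counts:
    position j itself plus the up-to-k later positions whose context window contains j.
    Returns the touched keys, from which Delta H can be computed."""
    touched = []
    positions = [i for i in range(j, min(j + k + 1, len(w))) if i >= k]
    for i in positions:                          # remove the old contributions
        key = (tuple(w[i - k:i]), w[i])
        counts[key] -= 1
        touched.append(key)
    w[j] = b                                     # apply the change
    for i in positions:                          # add the new contributions
        key = (tuple(w[i - k:i]), w[i])
        counts[key] += 1
        touched.append(key)
    return touched

# Consistency check: the incremental update matches a full O(n) rebuild.
w, k = [0, 1, 1, 0, 1, 0, 0, 1], 2
counts = build_counts(list(w), k)
update_counts(counts, w, 3, 1, k)
assert {key: v for key, v in counts.items() if v} == dict(build_counts(w, k))
```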

6.3 Lossy compression of an analog input

So far, we have described the approach of Jalali and Weissman for compressing $x^n \in \alpha^n$. Suppose instead that $x^n \in \mathbb{R}^n$ is analog. Seeing that MCMC optimizes $w^n$ over a discrete alphabet, it would be convenient to keep doing so. However, because $\hat{\alpha}$ is finite, and assuming that square error distortion is used, i.e., $d(x, \hat{x}) = (x - \hat{x})^2$, we see that the distortion could be large when a fixed finite alphabet cannot cover the range of the analog input.

Fortunately, Baron and Weissman showed [6] that the following quantizer has favorable properties,

(6.10)
$\hat{\alpha} \;\triangleq\; \left\{ -\gamma, \, -\gamma + \tfrac{1}{\gamma}, \, -\gamma + \tfrac{2}{\gamma}, \, \ldots, \, \gamma - \tfrac{1}{\gamma}, \, \gamma \right\}$

where γ is an integer that grows with n. That is, as γ grows, the set $\hat{\alpha}$ of reproduction levels quantizes a wider interval more finely. Therefore, the probability that some $x_i$ is an outlier, i.e., $|x_i| > \gamma$, vanishes with n, and the effect of outliers on the distortion $d(x^n, \hat{x}^n)$ vanishes. Moreover, because the interval is sampled more finely as γ increases with n, this quantizer can emulate any codebook with continuous levels, and so in the limit its RD performance converges to the RD function.
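
A small Python sketch of such a quantizer, assuming the reproduction levels are uniformly spaced with step 1/γ over the interval [-γ, γ]; the function names are mine.

```python
import numpy as np

def reproduction_levels(gamma):
    """Reproduction alphabet of Equation 6.10: levels spaced 1/gamma apart on [-gamma, gamma]."""
    num_levels = 2 * gamma * gamma + 1           # (2*gamma) / (1/gamma) + 1 points
    return np.linspace(-gamma, gamma, num_levels)

def quantize(x, levels):
    """Map each sample of x to its nearest reproduction level (square error)."""
    x = np.asarray(x)
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

# As gamma grows, the alphabet covers a wider interval more finely, so both outlier
# clipping and quantization error shrink.
x = np.random.default_rng(0).normal(size=5)
for gamma in (1, 2, 4):
    levels = reproduction_levels(gamma)
    print(gamma, len(levels), float(np.mean((x - quantize(x, levels)) ** 2)))
```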

6.4 Rate distortion performance

To evaluate the RD performance, we must first define the RD function. Consider a source X that generates a sequence $x^n \in \mathbb{R}^n$. A lossy encoder is a mapping $e: \mathbb{R}^n \to \mathcal{C}$, where $\mathcal{C} \subset \mathbb{R}^n$ is a codebook, i.e., a set of codewords in $\mathbb{R}^n$. Let $\mathcal{C}_n(x, D)$ be the smallest-cardinality codebook for inputs of length n generated by X such that the expected per-symbol distortion between the input $x^n$ and the codeword $e(x^n) \in \mathcal{C}_n(x, D)$ is at most D. The rate $R_n(x, D)$ is the per-symbol log-cardinality of the codebook,

(6.11)
$R_n(x, D) \;\triangleq\; \frac{1}{n} \log_2 \left| \mathcal{C}_n(x, D) \right|$

This is an operational definition of the RD function in terms of the best block code. In contrast, it can be shown that the RD function equals a minimization of mutual information over conditional distributions that satisfy the distortion constraint, yielding a different flavor of definition [9], [3].

Having defined the RD function, we can describe the RD performance of the compression algorithm by Baron and Weissman for continuous-valued inputs.

Theorem 8 Consider square error distortion, $d(x, \hat{x}) = (x - \hat{x})^2$, and let X be a finite-variance stationary and ergodic source with RD function