letter frequencies, but nothing else. The maximum entropy source is then the first approximation to English
and its entropy determines the required channel capacity.
As a simple example of some of these results consider a source which produces a sequence of letters
chosen from among A, B, C, D with probabilities $\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{8}$, successive symbols being chosen independently.
We have
$$H = -\left(\tfrac{1}{2}\log\tfrac{1}{2} + \tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{2}{8}\log\tfrac{1}{8}\right) = \tfrac{7}{4} \text{ bits per symbol.}$$
Thus we can approximate a coding system to encode messages from this source into binary digits with an
average of $\frac{7}{4}$ binary digits per symbol. In this case we can actually achieve the limiting value by the following
code (obtained by the method of the second proof of Theorem 9):
A    0
B    10
C    110
D    111
The average number of binary digits used in encoding a sequence of N symbols will be
$$N\left(\tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{2}{8}\cdot 3\right) = \tfrac{7}{4}N.$$
It is easily seen that the binary digits 0, 1 have probabilities $\frac{1}{2}, \frac{1}{2}$ so the $H$ for the coded sequences is one bit per symbol. Since, on the average, we have $\frac{7}{4}$ binary symbols per original letter, the entropies on a time basis are the same. The maximum possible entropy for the original set is $\log 4 = 2$, occurring when A, B, C, D have probabilities $\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4}$. Hence the relative entropy is $\frac{7}{8}$. We can translate the binary sequences into
the original set of symbols on a two-to-one basis by the following table:
00   A'
01   B'
10   C'
11   D'
This double process then encodes the original message into the same symbols but with an average compression ratio $\frac{7}{8}$.
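As a quick numerical check of this example, here is a small Python sketch (an illustration of mine, not part of the paper) that computes the source entropy and the average codeword length of the code above; both come out to 7/4, as stated.

```python
import math

# Source probabilities and the code from the example above.
probs = {'A': 1/2, 'B': 1/4, 'C': 1/8, 'D': 1/8}
code  = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

# Entropy H = -sum p log2 p  (expected: 7/4 bits per symbol).
H = -sum(p * math.log2(p) for p in probs.values())

# Average code length = sum p * len(codeword)  (expected: 7/4 binary digits per symbol).
avg_len = sum(probs[s] * len(code[s]) for s in probs)

print(H, avg_len)   # 1.75 1.75
```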
As a second example consider a source which produces a sequence of A's and B's with probability $p$ for A and $q$ for B. If $p \ll q$ we have
$$
\begin{aligned}
H &= -\log p^p (1-p)^{1-p} \\
  &= -p \log p (1-p)^{(1-p)/p} \\
  &\doteq p \log \frac{e}{p}.
\end{aligned}
$$
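As a numerical illustration (mine, not the paper's), the sketch below compares the exact entropy of such a source with the approximation $p \log(e/p)$ for a small value of $p$, using base-2 logarithms so the result is in bits; the value $p = 0.01$ is an arbitrary choice.

```python
import math

p = 0.01  # an arbitrary small probability for the infrequent symbol A

# Exact entropy of the two-symbol source, in bits.
H_exact = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Approximation H ~ p log2(e/p), valid when p is small.
H_approx = p * math.log2(math.e / p)

print(H_exact, H_approx)   # about 0.0808 vs 0.0809
```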
In such a case one can construct a fairly good coding of the message on a 0, 1 channel by sending a special
sequence, say 0000, for the infrequent symbol A and then a sequence indicating the number of B’s following it. This could be indicated by the binary representation with all numbers containing the special sequence
deleted. All numbers up to 16 are represented as usual; 16 is represented by the next binary number after 16 which does not contain four zeros, namely 17 = 10001, etc.
It can be shown that as $p \to 0$ the coding approaches ideal provided the length of the special sequence is properly adjusted.
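The counting rule described above can be made concrete with a short sketch (my illustration; the helper name `encode_count` and its handling of counts are assumptions, since the paper only gives the idea): a run length $n$ is transmitted as the $n$-th binary numeral that does not contain the special sequence 0000, which reproduces the example 17 = 10001 for the count 16.

```python
def encode_count(n, special='0000'):
    """Return the binary numeral used to transmit the count n:
    integers whose binary representation contains the special
    sequence are skipped, and n indexes the survivors."""
    m, kept = 0, 0
    while kept < n:
        m += 1
        if special not in format(m, 'b'):
            kept += 1
    return format(m, 'b')

print(encode_count(15))   # '1111'  -- numbers up to 16 are represented as usual
print(encode_count(16))   # '10001' -- 16 itself (10000) contains four zeros and is skipped
```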
PART II: THE DISCRETE CHANNEL WITH NOISE
11. REPRESENTATION OF A NOISY DISCRETE CHANNEL
We now consider the case where the signal is perturbed by noise during transmission or at one or the other
of the terminals. This means that the received signal is not necessarily the same as that sent out by the
transmitter. Two cases may be distinguished. If a particular transmitted signal always produces the same
received signal, i.e., the received signal is a definite function of the transmitted signal, then the effect may be called distortion. If this function has an inverse — no two transmitted signals producing the same received
signal — distortion may be corrected, at least in principle, by merely performing the inverse functional
operation on the received signal.
The case of interest here is that in which the signal does not always undergo the same change in trans-
mission. In this case we may assume the received signal E to be a function of the transmitted signal S and a second variable, the noise N.
$$E = f(S, N).$$
The noise is considered to be a chance variable just as the message was above. In general it may be repre-
sented by a suitable stochastic process. The most general type of noisy discrete channel we shall consider
is a generalization of the finite state noise-free channel described previously. We assume a finite number of
states and a set of probabilities
$$p_{\alpha,i}(\beta, j).$$
This is the probability, if the channel is in state $\alpha$ and symbol $i$ is transmitted, that symbol $j$ will be received and the channel left in state $\beta$. Thus $\alpha$ and $\beta$ range over the possible states, $i$ over the possible transmitted signals and $j$ over the possible received signals. In the case where successive symbols are independently perturbed by the noise there is only one state, and the channel is described by the set of transition probabilities $p_i(j)$, the probability of transmitted symbol $i$ being received as $j$.
If a noisy channel is fed by a source there are two statistical processes at work: the source and the noise.
Thus there are a number of entropies that can be calculated. First there is the entropy $H(x)$ of the source or of the input to the channel (these will be equal if the transmitter is non-singular). The entropy of the output of the channel, i.e., the received signal, will be denoted by $H(y)$. In the noiseless case $H(y) = H(x)$. The joint entropy of input and output will be $H(x, y)$. Finally there are two conditional entropies $H_x(y)$ and $H_y(x)$, the entropy of the output when the input is known and conversely. Among these quantities we have the relations
$$H(x, y) = H(x) + H_x(y) = H(y) + H_y(x).$$
All of these entropies can be measured on a per-second or a per-symbol basis.
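These relations are easy to verify numerically. The following sketch (mine, not from the paper) computes all five entropies from an arbitrary joint distribution $p(x, y)$ and checks that $H(x,y) = H(x) + H_x(y) = H(y) + H_y(x)$; the particular 2 by 2 distribution is made up for illustration.

```python
import math

def H(ps):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# An arbitrary joint distribution p(x, y) over a 2x2 alphabet (rows = x, columns = y).
joint = [[0.4, 0.1],
         [0.2, 0.3]]

px = [sum(row) for row in joint]               # marginal distribution of x
py = [sum(col) for col in zip(*joint)]         # marginal distribution of y

Hx, Hy = H(px), H(py)
Hxy = H([p for row in joint for p in row])     # joint entropy H(x, y)

# Conditional entropies computed directly from p(y|x) and p(x|y).
Hx_y = -sum(joint[i][j] * math.log2(joint[i][j] / px[i])
            for i in range(2) for j in range(2))        # H_x(y)
Hy_x = -sum(joint[i][j] * math.log2(joint[i][j] / py[j])
            for i in range(2) for j in range(2))        # H_y(x)

assert abs(Hxy - (Hx + Hx_y)) < 1e-12 and abs(Hxy - (Hy + Hy_x)) < 1e-12
print(Hx, Hy, Hxy, Hx_y, Hy_x)
```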
12. EQUIVOCATION AND CHANNEL CAPACITY
If the channel is noisy it is not in general possible to reconstruct the original message or the transmitted
signal with certainty by any operation on the received signal E. There are, however, ways of transmitting the information which are optimal in combating noise. This is the problem which we now consider.
Suppose there are two possible symbols 0 and 1, and we are transmitting at a rate of 1000 symbols per
second with probabilities $p_0 = p_1 = \frac{1}{2}$. Thus our source is producing information at the rate of 1000 bits
per second. During transmission the noise introduces errors so that, on the average, 1 in 100 is received
incorrectly (a 0 as 1, or 1 as 0). What is the rate of transmission of information? Certainly less than 1000
bits per second since about 1% of the received symbols are incorrect. Our first impulse might be to say
the rate is 990 bits per second, merely subtracting the expected number of errors. This is not satisfactory
since it fails to take into account the recipient’s lack of knowledge of where the errors occur. We may carry
it to an extreme case and suppose the noise so great that the received symbols are entirely independent of
the transmitted symbols. The probability of receiving 1 is 1 whatever was transmitted and similarly for 0.
2
Then about half of the received symbols are correct due to chance alone, and we would be giving the system
credit for transmitting 500 bits per second while actually no information is being transmitted at all. Equally
“good” transmission would be obtained by dispensing with the channel entirely and flipping a coin at the
receiving point.
Evidently the proper correction to apply to the amount of information transmitted is the amount of this
information which is missing in the received signal, or alternatively the uncertainty when we have received
a signal of what was actually sent. From our previous discussion of entropy as a measure of uncertainty it
seems reasonable to use the conditional entropy of the message, knowing the received signal, as a measure
of this missing information. This is indeed the proper definition, as we shall see later. Following this idea
the rate of actual transmission, R, would be obtained by subtracting from the rate of production (i.e., the entropy of the source) the average rate of conditional entropy.
$$R = H(x) - H_y(x).$$
The conditional entropy $H_y(x)$ will, for convenience, be called the equivocation. It measures the average ambiguity of the received signal.
In the example considered above, if a 0 is received the a posteriori probability that a 0 was transmitted is .99, and that a 1 was transmitted is .01. These figures are reversed if a 1 is received. Hence
$$H_y(x) = -[.99 \log .99 + .01 \log .01] = .081 \text{ bits/symbol}$$
or 81 bits per second. We may say that the system is transmitting at a rate $1000 - 81 = 919$ bits per second.
In the extreme case where a 0 is equally likely to be received as a 0 or 1 and similarly for 1, the a posteriori probabilities are $\frac{1}{2}, \frac{1}{2}$ and
$$H_y(x) = -\left[\tfrac{1}{2}\log\tfrac{1}{2} + \tfrac{1}{2}\log\tfrac{1}{2}\right] = 1 \text{ bit per symbol}$$
or 1000 bits per second. The rate of transmission is then 0 as it should be.
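Both of these figures are easy to reproduce. The sketch below (an illustration of mine) evaluates the equivocation for the 1% error case and for the useless channel in which the received symbol is independent of the transmitted one, and converts each to a transmission rate at 1000 symbols per second.

```python
import math

def equivocation(p_correct):
    """H_y(x) per received symbol when the a posteriori probabilities
    of the two possible transmitted symbols are p_correct and 1 - p_correct."""
    q = 1 - p_correct
    return -(p_correct * math.log2(p_correct) + q * math.log2(q))

rate = 1000  # symbols per second

Hyx = equivocation(0.99)
print(Hyx, rate - rate * Hyx)   # about 0.081 bits/symbol, about 919 bits/second transmitted

Hyx = equivocation(0.5)
print(Hyx, rate - rate * Hyx)   # 1 bit/symbol, 0 bits/second transmitted
```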
The following theorem gives a direct intuitive interpretation of the equivocation and also serves to justify
it as the unique appropriate measure. We consider a communication system and an observer (or auxiliary
device) who can see both what is sent and what is recovered (with errors due to noise). This observer notes
the errors in the recovered message and transmits data to the receiving point over a “correction channel” to
enable the receiver to correct the errors. The situation is indicated schematically in Fig. 8.
Theorem 10: If the correction channel has a capacity equal to $H_y(x)$ it is possible to so encode the correction data as to send it over this channel and correct all but an arbitrarily small fraction of the errors. This is not possible if the channel capacity is less than $H_y(x)$.
[Figure: source, transmitter, receiver and correcting device, with an observer comparing the original message M and the received message M' and sending correction data to the correcting device.]
Fig. 8 — Schematic diagram of a correction system.
Roughly then, $H_y(x)$ is the amount of additional information that must be supplied per second at the
receiving point to correct the received message.
To prove the first part, consider long sequences of received message $M'$ and corresponding original message $M$. There will be logarithmically $T H_y(x)$ of the $M$'s which could reasonably have produced each $M'$. Thus we have $T H_y(x)$ binary digits to send each $T$ seconds. This can be done with $\epsilon$ frequency of errors on a channel of capacity $H_y(x)$.
The second part can be proved by noting, first, that for any discrete chance variables $x$, $y$, $z$
$$H_y(x, z) \geq H_y(x).$$
The left-hand side can be expanded to give
$$H_y(z) + H_{yz}(x) \geq H_y(x)$$
$$H_{yz}(x) \geq H_y(x) - H_y(z) \geq H_y(x) - H(z).$$
If we identify x as the output of the source, y as the received signal and z as the signal sent over the correction channel, then the right-hand side is the equivocation less the rate of transmission over the correction channel.
If the capacity of this channel is less than the equivocation the right-hand side will be greater than zero and $H_{yz}(x) > 0$. But this is the uncertainty of what was sent, knowing both the received signal and the correction signal. If this is greater than zero the frequency of errors cannot be arbitrarily small.
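The inequality used in this argument holds for any joint distribution and can be spot-checked numerically. The sketch below (mine, not the paper's) draws a random joint distribution over three binary variables and verifies $H_{yz}(x) \geq H_y(x) - H(z)$.

```python
import math, random, itertools

def H(ps):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Random joint distribution p(x, y, z) over three binary variables.
random.seed(0)
keys = list(itertools.product([0, 1], repeat=3))
raw = [random.random() for _ in keys]
p = {k: v / sum(raw) for k, v in zip(keys, raw)}

def marginal(*axes):
    """Marginal distribution over the named coordinates (0 = x, 1 = y, 2 = z)."""
    out = {}
    for k, v in p.items():
        key = tuple(k[a] for a in axes)
        out[key] = out.get(key, 0) + v
    return out

Hz    = H(marginal(2).values())
Hy_x  = H(marginal(0, 1).values()) - H(marginal(1).values())   # H_y(x) = H(x,y) - H(y)
Hyz_x = H(p.values()) - H(marginal(1, 2).values())             # H_yz(x) = H(x,y,z) - H(y,z)

assert Hyz_x >= Hy_x - Hz - 1e-12
print(Hyz_x, Hy_x - Hz)
```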
Example:
Suppose the errors occur at random in a sequence of binary digits: probability $p$ that a digit is wrong and $q = 1 - p$ that it is right. These errors can be corrected if their position is known. Thus the
correction channel need only send information as to these positions. This amounts to transmitting
from a source which produces binary digits with probability p for 1 (incorrect) and q for 0 (correct).
This requires a channel of capacity
$$-[p \log p + q \log q]$$
which is the equivocation of the original system.
The rate of transmission $R$ can be written in two other forms due to the identities noted above. We have
$$
\begin{aligned}
R &= H(x) - H_y(x) \\
  &= H(y) - H_x(y) \\
  &= H(x) + H(y) - H(x, y).
\end{aligned}
$$
The first defining expression has already been interpreted as the amount of information sent less the uncer-
tainty of what was sent. The second measures the amount received less the part of this which is due to noise.
The third is the sum of the two amounts less the joint entropy and therefore in a sense is the number of bits
per second common to the two. Thus all three expressions have a certain intuitive significance.
The capacity C of a noisy channel should be the maximum possible rate of transmission, i.e., the rate when the source is properly matched to the channel. We therefore define the channel capacity by
$$C = \operatorname{Max}\bigl(H(x) - H_y(x)\bigr)$$
where the maximum is with respect to all possible information sources used as input to the channel. If the
channel is noiseless, $H_y(x) = 0$. The definition is then equivalent to that already given for a noiseless channel
since the maximum entropy for the channel is its capacity.
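As an illustration of this definition (a sketch of mine, not from the paper), consider a memoryless binary channel that inverts each transmitted symbol with probability $\epsilon$. A crude search over input sources maximizing $H(x) - H_y(x)$ recovers the well-known closed form $1 - H(\epsilon)$ for this channel's capacity; the grid resolution and the value $\epsilon = 0.1$ are arbitrary choices.

```python
import math

def H(ps):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def rate(t, eps):
    """H(x) - H_y(x) for a binary channel with crossover probability eps
    when the input symbol 0 is used with probability t."""
    joint = [t * (1 - eps), t * eps, (1 - t) * eps, (1 - t) * (1 - eps)]  # p(x, y)
    Hx   = H([t, 1 - t])
    Hy   = H([joint[0] + joint[2], joint[1] + joint[3]])
    Hy_x = H(joint) - Hy          # H_y(x) = H(x, y) - H(y)
    return Hx - Hy_x

eps = 0.1
C = max(rate(t / 1000, eps) for t in range(1, 1000))   # crude search over input sources
print(C, 1 - H([eps, 1 - eps]))                        # both about 0.531 bits per symbol
```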
13. THE FUNDAMENTAL THEOREM FOR A DISCRETE CHANNEL WITH NOISE
It may seem surprising that we should define a definite capacity C for a noisy channel since we can never send certain information in such a case. It is clear, however, that by sending the information in a redundant
form the probability of errors can be reduced. For example, by repeating the message many times and by a
statistical study of the different received versions of the message the probability of errors could be made very small. One would expect, however, that to make this probability of errors approach zero, the redundancy
of the encoding must increase indefinitely, and the rate of transmission therefore approach zero. This is by
no means true. If it were, there would not be a very well defined capacity, but only a capacity for a given
frequency of errors, or a given equivocation; the capacity going down as the error requirements are made
more stringent. Actually the capacity C defined above has a very definite significance. It is possible to send information at the rate C through the channel with as small a frequency of errors or equivocation as desired by proper encoding. This statement is not true for any rate greater than C. If an attempt is made to transmit at a higher rate than C, say $C + R_1$, then there will necessarily be an equivocation equal to or greater than the excess $R_1$. Nature takes payment by requiring just that much uncertainty, so that we are not actually getting any more than C through correctly.
The situation is indicated in Fig. 9. The rate of information into the channel is plotted horizontally and
the equivocation vertically. Any point above the heavy line in the shaded region can be attained and those
below cannot. The points on the line cannot in general be attained, but there will usually be two points on
the line that can.
These results are the main justification for the definition of C and will now be proved.
Theorem 11: Let a discrete channel have the capacity C and a discrete source the entropy per second H.
If $H \leq C$ there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrarily small frequency of errors (or an arbitrarily small equivocation). If $H > C$ it is possible to encode the source so that the equivocation is less than $H - C + \epsilon$ where $\epsilon$ is arbitrarily small. There is no method of encoding which gives an equivocation less than $H - C$.
The method of proving the first part of this theorem is not by exhibiting a coding method having the
desired properties, but by showing that such a code must exist in a certain group of codes. In fact we will
[Figure: the attainable region of equivocation $H_y(x)$ versus input entropy $H(x)$, bounded by a line of slope 1.0 starting at $H(x) = C$.]
Fig. 9 — The equivocation possible for a given input entropy to a channel.
average the frequency of errors over this group and show that this average can be made less than $\epsilon$. If the average of a set of numbers is less than $\epsilon$ there must exist at least one in the set which is less than $\epsilon$. This
will establish the desired result.
The capacity C of a noisy channel has been defined as
$$C = \operatorname{Max}\bigl(H(x) - H_y(x)\bigr)$$
where x is the input and y the output. The maximization is over all sources which might be used as input to the channel.
Let $S_0$ be a source which achieves the maximum capacity $C$. If this maximum is not actually achieved by any source let $S_0$ be a source which approximates to giving the maximum rate. Suppose $S_0$ is used as input to the channel. We consider the possible transmitted and received sequences of a long duration $T$. The following will be true:
1. The transmitted sequences fall into two classes, a high probability gr