
Chapter 7

Winograd’s Short DFT Algorithms

In 1976, S. Winograd [406] presented a new DFT algorithm which had significantly fewer multiplications than the Cooley-Tukey FFT which had been published eleven years earlier. This new Winograd Fourier Transform Algorithm (WFTA) is based on the type-one index map from Multidimensional Index Mapping (Chapter 3) with each of the relatively prime length short DFT’s calculated by very efficient special algorithms. It is these short algorithms that this section will develop. They use the index permutation of Rader described in another module to convert the prime length short DFT’s into cyclic convolutions. Winograd developed a method for calculating digital convolution with the minimum number of multiplications. These optimal algorithms are based on the polynomial residue reduction techniques of Polynomial Description of Signals: Equation 1 (4.1) to break the convolution into multiple small ones [29], [235], [263], [416], [408], [197].



The operation of discrete convolution defined by

y(n) = \sum_k h(n - k) x(k)    (7.1)

is called a bilinear operation because, for a fixed h(n), y(n) is a linear function of x(n) and for a fixed x(n) it is a linear function of h(n). The operation of cyclic convolution is the same but with all indices evaluated modulo N.
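
As a concrete illustration (an addition to this preview, not part of the original text), cyclic convolution can be implemented directly from this definition; the short Python sketch below serves only as a reference against which the fast algorithms of this chapter can be checked.

import numpy as np

def cyclic_convolution(h, x):
    """Length-N cyclic convolution: y(n) = sum_k h((n - k) mod N) x(k)."""
    N = len(x)
    return np.array([sum(h[(n - k) % N] * x[k] for k in range(N))
                     for n in range(N)])

# Example: cyclic_convolution([1, 2, 3, 4], [1, 0, 0, 0]) returns [1, 2, 3, 4].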

Recall from Polynomial Description of Signals: Equation 3 (4.3) that length-N cyclic convolution of x(n) and h(n) can be represented by polynomial multiplication

Y(s) = X(s) H(s)  mod (s^N - 1)    (7.2)

This bilinear operation of (7.1) and (7.2) can also be expressed in terms of linear matrix operators and a simpler bilinear operator denoted by o which may be only a simple element-by-element multiplication of the two vectors [235], [197], [212]. This matrix formulation is

Y = C [AX o BH]    (7.3)

where X, H and Y are length-N vectors with elements of x(n), h(n) and y(n) respectively. The matrices A and B have dimension M x N, and C is N x M with M ≥ N. The elements of A, B, and C are constrained to be simple; typically small integers or rational numbers. It will be these matrix operators that do the equivalent of the residue reduction on the polynomials in (7.2).
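
To make the structure of (7.3) concrete, the following sketch (an added illustration; the specific A, B, and C are assumptions chosen for a well-known length-2 cyclic-convolution algorithm, with M = N = 2) evaluates Y = C [AX o BH] and compares it with direct cyclic convolution.

import numpy as np

A = np.array([[1, 1],
              [1, -1]])              # reduction operator applied to X
B = A                                # same reduction applied to H
C = 0.5 * np.array([[1, 1],
                    [1, -1]])        # recombination operator

def bilinear_cyclic_conv(x, h):
    # Y = C [A X o B H], where o is element-by-element multiplication
    return C @ ((A @ x) * (B @ h))

x = np.array([1.0, 2.0])
h = np.array([3.0, 4.0])
direct = np.array([x[0]*h[0] + x[1]*h[1],      # length-2 cyclic convolution
                   x[0]*h[1] + x[1]*h[0]])
assert np.allclose(bilinear_cyclic_conv(x, h), direct)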

In order to derive a useful algorithm of the form (7.3) to calculate (7.1), consider the polynomial formulation (7.2) again. To use the residue reduction scheme, the modulus is factored into relatively prime factors. Fortunately the factoring of this particular polynomial, s^N - 1, has been extensively studied and it has considerable structure. When factored over the rationals, which means that the only coefficients allowed are rational numbers, the factors are called cyclotomic polynomials [29], [235], [263]. The most interesting property for our purposes is that most of the coefficients of cyclotomic polynomials are zero and the others are plus or minus unity for degrees up to over one hundred. This means the residue reduction will generally require no multiplications.
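
This structure is easy to inspect directly. The sketch below (an added illustration, assuming SymPy is available) factors s^N - 1 over the rationals for a few small N; every factor that appears is cyclotomic with coefficients of only 0 and ±1.

import sympy as sp

s = sp.symbols('s')
for N in (4, 5, 8, 9):
    print(N, sp.factor(s**N - 1))
# N = 4 prints (s - 1)*(s + 1)*(s**2 + 1), the factorization used in (7.12)-(7.14)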

The operations of reducing X(s) and H(s) in (7.2) are carried out by the matrices A and B in (7.3). The convolution of the residue polynomials is carried out by the o operator and the recombination by the CRT is done by the C matrix. More details are in [29], [235], [263], [197], [212], but the important fact is the A and B matrices usually contain only zero and plus or minus unity entries and the C matrix only contains rational numbers. The only general multiplications are those represented by o. Indeed, in the theoretical results from computational complexity theory, these real or complex multiplications are usually the only ones counted. In practical algorithms, the rational multiplications represented by C could be a limiting factor.

The h(n) terms are fixed for a digital filter, or they represent the W terms from Multidimensional Index Mapping: Equation 1 (3.1) if the convolution is being used to calculate a DFT. Because of this, d = BH in (7.3) can be precalculated and only the A and C operators represent the mathematics done at execution of the algorithm. In order to exploit this feature, it was shown [416], [197] that the properties of (7.3) allow the exchange of the more complicated operator C with the simpler operator B. Specifically this is given by

Y = C [AX o BH]    (7.4)

Y' = B^T [AX o C^T H']    (7.5)

where H' has the same elements as H, but in a permuted order, and likewise Y' and Y. This very important property allows precomputing the more complicated C^T H' in (7.5) rather than BH as in (7.3).

Because BH or C^T H' can be precomputed, the bilinear form of (7.3) and (7.5) can be written as a linear form. If an M x M diagonal matrix D is formed from d = C^T H', or in the case of (7.3), d = BH, assuming a commutative property for o, (7.5) becomes

Y' = B^T D A X    (7.6)

and (7.3) becomes

Y = C D A X    (7.7)

In most cases there is no reason not to use the same reduction operations on X and H; therefore, B can be the same as A and (7.6) then becomes

Y' = A^T D A X    (7.8)
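
Continuing the hypothetical length-2 example from above (an added sketch, not from the text), the linear form (7.7) separates the work done once for a fixed h(n) from the work done for each new x(n): D = diag(BH) is precomputed and only Y = C D A X remains at run time.

import numpy as np

A = np.array([[1, 1], [1, -1]])
B = A
C = 0.5 * np.array([[1, 1], [1, -1]])

h = np.array([3.0, 4.0])
D = np.diag(B @ h)                    # precomputed once for the fixed h(n)

def conv_with_precomputed_D(x):
    return C @ D @ (A @ x)            # equation (7.7): Y = C D A X

x = np.array([1.0, 2.0])
direct = np.array([x[0]*h[0] + x[1]*h[1], x[0]*h[1] + x[1]*h[0]])
assert np.allclose(conv_with_precomputed_D(x), direct)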

In order to illustrate how the residue reduction is carried out and how the A matrix is obtained, the length-5 DFT algorithm started in The DFT as Convolution or Filtering: Matrix 1 (5.12) will be continued. The DFT is first converted to a length-4 cyclic convolution by the index permutation from The DFT as Convolution or Filtering: Equation 3 (5.3) to give the cyclic convolution in The DFT as Convolution or Filtering (Chapter 5). To avoid confusion from the permuted order of the data x(n) in The DFT as Convolution or Filtering (Chapter 5), the cyclic convolution will first be developed without the permutation, using the polynomial U(s)

U(s) = x(1) + x(3) s + x(4) s^2 + x(2) s^3    (7.9)

U(s) = u(0) + u(1) s + u(2) s^2 + u(3) s^3    (7.10)

and then the results will be converted back to the permuted x(n).

The length-4 cyclic convolution in terms of polynomials is

Y(s) = U(s) H(s)  mod (s^4 - 1)    (7.11)

and the modulus factors into three cyclotomic polynomials

s^4 - 1 = (s^2 - 1)(s^2 + 1)    (7.12)

        = (s - 1)(s + 1)(s^2 + 1)    (7.13)

        = P_1 P_2 P_3    (7.14)

Both U(s) and H(s) are reduced modulo these three polynomials. The reduction modulo P_1 and P_2 is done in two stages. First it is done modulo (s^2 - 1), then that residue is further reduced modulo (s - 1) and (s + 1).

U(s) = u_0 + u_1 s + u_2 s^2 + u_3 s^3    (7.15)

U'(s) = ((U(s)))_{s^2 - 1} = (u_0 + u_2) + (u_1 + u_3) s    (7.16)

U_1(s) = ((U'(s)))_{P_1} = (u_0 + u_1 + u_2 + u_3)    (7.17)

U_2(s) = ((U'(s)))_{P_2} = (u_0 - u_1 + u_2 - u_3)    (7.18)

U_3(s) = ((U(s)))_{P_3} = (u_0 - u_2) + (u_1 - u_3) s    (7.19)

The reduction in (7.16) of the data polynomial (7.15) can be denoted by a matrix operation on a vector which has the data as entries.

\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} u_0 + u_2 \\ u_1 + u_3 \end{bmatrix}    (7.20)

and the reduction in (7.19) is

\begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} u_0 - u_2 \\ u_1 - u_3 \end{bmatrix}    (7.21)

Combining (7.20) and (7.21) gives one operator

\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} u_0 + u_2 \\ u_1 + u_3 \\ u_0 - u_2 \\ u_1 - u_3 \end{bmatrix} = \begin{bmatrix} w_0 \\ w_1 \\ v_0 \\ v_1 \end{bmatrix}    (7.22)


Further reduction of v_0 + v_1 s is not possible because P_3 = s^2 + 1 cannot be factored over the rationals. However s^2 - 1 can be factored into P_1 P_2 = (s - 1)(s + 1) and, therefore, w_0 + w_1 s can be further reduced as was done in (7.17) and (7.18) by

\begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = w_0 + w_1 = u_0 + u_2 + u_1 + u_3    (7.23)

\begin{bmatrix} 1 & -1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = w_0 - w_1 = u_0 + u_2 - u_1 - u_3    (7.24)

Combining (7.22), (7.23) and (7.24) gives

\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} r_0 \\ r_1 \\ v_0 \\ v_1 \end{bmatrix}    (7.25)
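
As a check on (7.25) (an added illustration assuming SymPy; not part of the original text), the rows of the combined operator reproduce exactly the residues of U(s) modulo (s - 1), (s + 1), and (s^2 + 1).

import sympy as sp

s = sp.symbols('s')
u0, u1, u2, u3 = sp.symbols('u0 u1 u2 u3')
U = u0 + u1*s + u2*s**2 + u3*s**3

A2 = sp.Matrix([[1, 1, 0, 0], [1, -1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
A3 = sp.Matrix([[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, -1, 0], [0, 1, 0, -1]])
r0, r1, v0, v1 = A2 * A3 * sp.Matrix([u0, u1, u2, u3])

assert sp.expand(sp.rem(U, s - 1, s) - r0) == 0              # residue modulo P1
assert sp.expand(sp.rem(U, s + 1, s) - r1) == 0              # residue modulo P2
assert sp.expand(sp.rem(U, s**2 + 1, s) - (v0 + v1*s)) == 0  # residue modulo P3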

The same reduction is done to H(s) and then the convolution of (7.11) is done by multiplying each residue polynomial of X(s) and H(s) modulo each corresponding cyclotomic factor of P(s) and finally a recombination using the polynomial Chinese Remainder Theorem (CRT) as in Polynomial Description of Signals: Equation 9 (4.9) and Polynomial Description of Signals: Equation 13 (4.13).

Y(s) = K_1(s) U_1(s) H_1(s) + K_2(s) U_2(s) H_2(s) + K_3(s) U_3(s) H_3(s)   mod (s^4 - 1)    (7.26)

where U_1(s) = r_0 and U_2(s) = r_1 are constants and U_3(s) = v_0 + v_1 s is a first-degree polynomial. U_1 times H_1 and U_2 times H_2 are easy, but multiplying U_3 times H_3 modulo (s^2 + 1) is more difficult.
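
The recombination in (7.26) can also be verified symbolically. In the sketch below (an added illustration assuming SymPy; the K_i are computed from the factors with the extended Euclidean algorithm rather than quoted from the text), the recombined result agrees with U(s) H(s) modulo s^4 - 1.

import sympy as sp

s = sp.symbols('s')
u = sp.symbols('u0 u1 u2 u3')
h = sp.symbols('h0 h1 h2 h3')
U = sum(c * s**i for i, c in enumerate(u))
H = sum(c * s**i for i, c in enumerate(h))
P = [s - 1, s + 1, s**2 + 1]                 # P1, P2, P3

# CRT basis polynomials: K_i = 1 mod P_i and K_i = 0 mod the other factors.
K = []
for i, Pi in enumerate(P):
    Q = sp.Mul(*[Pj for j, Pj in enumerate(P) if j != i])
    inv, _, g = sp.gcdex(Q, Pi)              # inv*Q + (...)*Pi = g, with g = 1 here
    K.append(sp.expand(inv * Q / g))

Y = sum(Ki * sp.rem(U, Pi, s) * sp.rem(H, Pi, s) for Ki, Pi in zip(K, P))
assert sp.rem(sp.expand(Y - U * H), s**4 - 1, s) == 0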

The multiplication of U_3(s) times H_3(s) can be done by the Toom-Cook algorithm [29], [235], [263] which can be viewed as Lagrange interpolation or polynomial multiplication modulo a special polynomial with three arbitrary coefficients. To simplify the arithmetic, the constants are chosen to be plus and minus one and zero. The details of this can be found in [29], [235], [263]. For this example it can be verified that

((v_0 + v_1 s)(h_0 + h_1 s))_{s^2 + 1} = (v_0 h_0 - v_1 h_1) + (v_0 h_1 + v_1 h_0) s    (7.27)

which by the Toom-Cook algorithm or inspection is

\begin{bmatrix} 1 & -1 & 0 \\ -1 & -1 & 1 \end{bmatrix} \left[ \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} v_0 \\ v_1 \end{bmatrix} \; o \; \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} h_0 \\ h_1 \end{bmatrix} \right] = \begin{bmatrix} y_0 \\ y_1 \end{bmatrix}    (7.28)

where o signifies point-by-point multiplication.
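
A quick numerical check of this three-multiplication form (an added sketch; the matrices are those reconstructed in (7.28) above):

import numpy as np

A = np.array([[1, 0], [0, 1], [1, 1]])       # applied to both v and h
C = np.array([[1, -1, 0], [-1, -1, 1]])      # recombination

def mult_mod_s2_plus_1(v, h):
    return C @ ((A @ v) * (A @ h))           # o = point-by-point multiplication

v = np.array([2.0, 5.0])
h = np.array([3.0, 7.0])
expected = np.array([v[0]*h[0] - v[1]*h[1],  # constant term in (7.27)
                     v[0]*h[1] + v[1]*h[0]]) # coefficient of s in (7.27)
assert np.allclose(mult_mod_s2_plus_1(v, h), expected)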

The total A matrix in (7.3) is a combination of (7.25) and (7.28) giving

AX = A_1 A_2 A_3 X    (7.29)

= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} = A_1 \begin{bmatrix} r_0 \\ r_1 \\ v_0 \\ v_1 \end{bmatrix}    (7.30)

where the matrix A_3 gives the residue reduction modulo (s^2 - 1) and (s^2 + 1), the upper left-hand part of A_2 gives the reduction modulo (s - 1) and (s + 1), and the lower right-hand part of A_1 carries out the Toom-Cook algorithm modulo (s^2 + 1) with the multiplication in (7.5). Notice that by calculating (7.30) in the three stages, seven additions are required. Also notice that A_1 is not square. It is this “expansion” that causes more than N multiplications to be required in o in (7.5) or D in (7.6). This staged reduction will derive the A operator for (7.5).
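
The staged operator can be assembled explicitly (an added sketch, not from the text) to see that A = A_1 A_2 A_3 is 5 x 4, i.e. M = 5 > N = 4, so five general multiplications are needed for this length-4 cyclic convolution.

import numpy as np

A1 = np.array([[1, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 1],
               [0, 0, 1, 1]])        # Toom-Cook expansion of (v0, v1)
A2 = np.array([[1, 1, 0, 0],
               [1, -1, 0, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 1]])        # reduction modulo (s - 1) and (s + 1)
A3 = np.array([[1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, -1, 0],
               [0, 1, 0, -1]])       # reduction modulo (s^2 - 1) and (s^2 + 1)

A = A1 @ A2 @ A3
print(A.shape)                       # (5, 4): A1 is not square
u = np.array([1.0, 2.0, 3.0, 4.0])
print(A @ u)                         # [r0, r1, v0, v1, v0 + v1]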

The method described above is very straightforward for the shorter DFT lengths. For N = 3, both of the residue polynomials are constants and the multiplication given by o in (7.3) is trivial. For N = 5, which is the example used here, there is one first-degree polynomial multiplication required, but the Toom-Cook algorithm uses simple constants and, therefore, works well as indicated in (7.28). For N = 7, there are two first-degree residue polynomials which can each be multiplied by the same techniques used in the N = 5 example. Unfortunately, for any longer lengths, the residue polynomials have an order of three or greater, which causes the Toom-Cook algorithm to require constants of plus and minus two and worse. For that reason, the Toom-Cook method is not used, and other techniques such as index mapping are used that require more than the minimum number of multiplications but do not require an excessive number of additions. The resulting algorithms still have the structure of (7.8). Blahut [29] and Nussbaumer [263] have a good collection of algorithms for polynomial multiplication that can be used with the techniques discussed here to construct a wide variety of DFT algorithms.

The constants in the diagonal matrix D can be found from the CRT matrix C in (7.5) using d = C^T H' for the diagonal terms in D. As mentioned above, for the smaller prime lengths of 3, 5, and 7 this works well, but for longer lengths the CRT becomes very complicated. An alternate method for finding D uses the fact that since the linear form (7.6) or (7.8) calculates the DFT, it is possible to calculate a known DFT of a given x(n) from the definition of the DFT in Multidimensional Index Mapping: Equation 1 (3.1) and, given the A matrix in (7.8), solve for D by solving a set of simultaneous equations. The details of this procedure are described in [197].
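
The idea behind that procedure can be sketched briefly. Since Y = C D A X = C diag(AX) d is linear in the diagonal entries d, a few inputs with known outputs give a solvable linear system. The example below is an added illustration, not the procedure of [197]; it reuses the hypothetical length-2 cyclic-convolution operators from earlier (with direct convolution standing in for the known DFT) and recovers d = BH by least squares.

import numpy as np

A = np.array([[1.0, 1.0], [1.0, -1.0]])
C = 0.5 * np.array([[1.0, 1.0], [1.0, -1.0]])
h = np.array([3.0, 4.0])

rows, rhs = [], []
rng = np.random.default_rng(0)
for _ in range(4):                          # several known input/output pairs
    x = rng.standard_normal(2)
    y = np.array([x[0]*h[0] + x[1]*h[1],    # "known" transform of x
                  x[0]*h[1] + x[1]*h[0]])
    rows.append(C @ np.diag(A @ x))         # y = C diag(Ax) d is linear in d
    rhs.append(y)

d, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
assert np.allclose(d, A @ h)                # recovers d = BH (here B = A)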

A modification of this approach also works for a length which is an odd prime raised to some power: N = P^M. This is a bit more complicated [235], [416] but has been done for lengths of 9 and 25. For longer lengths, the conventional Cooley-Tukey type-two index map algorithm seems to be more efficient. For powers of two, there is no primitive root, and therefore, no simple conversion of the DFT into convolution. It is possible to use two generators [235], [263], [408] to make the conversion and there exists a set of length 4, 8, and 16 DFT algorithms of the form in (7.8) in [235].


In Table 7.1 an operation count of several short DFT algorithms is presented. These are practical algorithms that can be used alone or in conjunction with the index mapping to give longer DFT’s as shown in The Prime Factor and Winograd Fourier Transform Algorithms (Chapter 10). Most are optimized in having either the theoretical minimum number of multiplications or the minimum number of multiplications without requiring a very large number of additions. Some allow other reasonable trade-offs between numbers of multiplications and additions. There are two lists of the number of multiplications. The first is the number of actual floating-point multiplications that must be done for that length DFT. Some of these (one or two in most cases) will be by rational constants and the others will be by irrational constants. The second list is the total number of multiplications given in the diagonal matrix D in (7.8). At least one of these will be unity (the one associated with X(0)) and in some cases several will be unity (for N = 2^M). The second list is important in programming the WFTA in The Prime Factor and Winograd Fourier Transform Algorithm: The Winograd Fourier Transform Algorithm (Section 10.2: The Winograd Fourier Transform Algorithm).


Length N   Mult Non-one   Mult Total   Adds
2          0              4            4
3          4              6            12
4          0              8            16
5          10             12           34
7          16             18           72
8          4              16           52
9          20             22           84
11         40             42           168
13