Fast Fourier Transforms (6x9 Version) by C. Sidney Burrus - HTML preview

PLEASE NOTE: This is an HTML preview only and some elements such as links or page numbers may be incorrect.
Download the book in PDF, ePub for a complete version.

Chapter 9

The Cooley-Tukey Fast

Fourier Transform

Algorithm1

The publication by Cooley and Tukey [90] in 1965 of an efficient

algorithm for the calculation of the DFT was a major turning point

in the development of digital signal processing. During the five or

so years that followed, various extensions and modifications were

made to the original algorithm [95]. By the early 1970’s the prac-

tical programs were basically in the form used today. The standard

development presented in [274], [299], [38] shows how the DFT of

a length-N sequence can be simply calculated from the two length-

N/2 DFT’s of the even index terms and the odd index terms. This

is then applied to the two half-length DFT’s to give four quarter-

length DFT’s, and repeated until N scalars are left which are the

DFT values. Because of alternately taking the even and odd index

terms, two forms of the resulting programs are called decimation-

in-time and decimation-in-frequency. For a length of 2M, the di-

viding process is repeated M = log2N times and requires N multi-

plications each time. This gives the famous formula for the com-

1This content is available online at <http://cnx.org/content/m16334/1.13/>.

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

109

CHAPTER 9. THE COOLEY-TUKEY

110

FAST FOURIER TRANSFORM

ALGORITHM

putational complexity of the FFT of Nlog2N which was derived in

Multidimensional Index Mapping: Equation 34 (3.34).

Although the decimation methods are straightforward and easy to

understand, they do not generalize well. For that reason it will be

assumed that the reader is familiar with that description and this

chapter will develop the FFT using the index map from Multidi-

mensional Index Mapping (Chapter 3).

The Cooley-Tukey FFT always uses the Type 2 index map from

Multidimensional Index Mapping: Equation 11 (3.11). This is

necessary for the most popular forms that have N = RM, but is

also used even when the factors are relatively prime and a Type 1

map could be used. The time and frequency maps from Multidi-

mensional Index Mapping: Equation 6 (3.6) and Multidimensional

Index Mapping: Equation 12 (3.12) are

n = ((K1n1 + K2n2))

(9.1)

N

k = ((K3k1 + K4k2))

(9.2)

N

Type-2 conditions Multidimensional Index Mapping: Equation 8

(3.8) and Multidimensional Index Mapping: Equation 11 (3.11)

become

K1 = aN2 or K2 = bN1 but not both

(9.3)

and

K3 = cN2 or K4 = dN1 but not both

(9.4)

The row and column calculations in Multidimensional Index Map-

ping: Equation 15 (3.15) are uncoupled by Multidimensional Index

Mapping: Equation 16 (3.16) which for this case are

((K1K4)) = 0 or ((K

= 0 but not both

(9.5)

N

2K3))N

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

111

To make each short sum a DFT, the Ki must satisfy

((K1K3)) = N

= N

N

2

and

((K2K4))N

1

(9.6)

In order to have the smallest values for Ki the constants in (9.3)

are chosen to be

a = d = K2 = K3 = 1

(9.7)

which makes the index maps of (9.1) become

n = N2n1 + n2

(9.8)

k = k1 + N1k2

(9.9)

These index maps are all evaluated modulo N, but in (9.8), explicit

reduction is not necessary since n never exceeds N. The reduction

notation will be omitted for clarity. From Multidimensional In-

dex Mapping: Equation 15 (3.15) and example Multidimensional

Index Mapping: Equation 19 (3.19), the DFT is

N2−1N1−1

X = ∑ ∑ x W n1k1 W n2k1 W n2k2

(9.10)

N1

N

N2

n2=0 n1=0

This map of (9.8) and the form of the DFT in (9.10) are the fun-

damentals of the Cooley-Tukey FFT.

The order of the summations using the Type 2 map in (9.10) cannot

be reversed as it can with the Type-1 map. This is because of the

WN terms, the twiddle factors.

Turning (9.10) into an efficient program requires some care. From

Multidimensional Index Mapping: Efficiencies Resulting from In-

dex Mapping with the DFT (Section 3.3: Efficiencies Resulting

from Index Mapping with the DFT) we know that all the factors

should be equal. If N = RM , with R called the radix, N1 is first

set equal to R and N2 is then necessarily RM−1. Consider n1 to

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 9. THE COOLEY-TUKEY

112

FAST FOURIER TRANSFORM

ALGORITHM

be the index along the rows and n2 along the columns. The inner

sum of (9.10) over n1 represents a length-N1 DFT for each value

of n2. These N2 length-N1 DFT’s are the DFT’s of the rows of the

x (n1, n2) array. The resulting array of row DFT’s is multiplied by

an array of twiddle factors which are the WN terms in (9.10). The

twiddle-factor array for a length-8 radix-2 FFT is

W 0 W 0

1

1

 W 0

W 1 

 1

W

T F :

W n2k1 = 

 = 

(9.11)

8

 W 0

W 2 

 1

− j 

W 0 W 3

1 − jW

The twiddle factor array will always have unity in the first row and

first column.

To complete (9.10) at this point, after the row DFT’s are multiplied

by the TF array, the N1 length-N2 DFT’s of the columns are calcu-

lated. However, since the columns DFT’s are of length RM−1, they

can be posed as a RM−2 by R array and the process repeated, again

using length-R DFT’s. After M stages of length-R DFT’s with TF

multiplications interleaved, the DFT is complete. The flow graph

of a length-2 DFT is given in Figure 1 (7.18) and is called a butter-

fly because of its shape. The flow graph of the complete length-8

radix-2 FFT is shown in Figure 2 (7.19) .

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

113

x(0)

X(0) = x(0) + x(1)

x(1)

X(0) = x(0) - x(1)

Figure 9.1: A Radix-2 Butterfly

x

x

0

0

x

x

1

-

4

0

W

x

x

2

2

-

2

W

x

x

3

-

-

6

0

W

x

x

4

-

1

1

W

x

x

5

-

-

5

2

0

W

W

x

x

6

-

-

3

3

W

2

W

x

x

7

-

-

-

7

Figure 9.2: Length-8 Radix-2 FFT Flow Graph

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 9. THE COOLEY-TUKEY

114

FAST FOURIER TRANSFORM

ALGORITHM

This flow-graph, the twiddle factor map of (9.11), and the basic

equation (9.10) should be completely understood before going fur-

ther.

A very efficient indexing scheme has evolved over the years that

results in a compact and efficient computer program. A FORTRAN

program is given below that implements the radix-2 FFT. It should

be studied [64] to see how it implements (9.10) and the flow-graph

representation.

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

115

◆✷ ❂ ◆

❉❖ ✶✵ ❑ ❂ ✶✱ ▼

◆✶ ❂ ◆✷

◆✷ ❂ ◆✷✴✷

❊ ❂ ✻✳✷✽✸✶✽✴◆✶

❆ ❂ ✵

❉❖ ✷✵ ❏ ❂ ✶✱ ◆✷

❈ ❂ ❈❖❙ ✭❆✮

❙ ❂✲❙■◆ ✭❆✮

❆ ❂ ❏✯❊

❉❖ ✸✵ ■ ❂ ❏✱ ◆✱ ◆✶

▲ ❂ ■ ✰ ◆✷

❳❚

❂ ❳✭■✮ ✲ ❳✭▲✮

❳✭■✮ ❂ ❳✭■✮ ✰ ❳✭▲✮

❨❚

❂ ❨✭■✮ ✲ ❨✭▲✮

❨✭■✮ ❂ ❨✭■✮ ✰ ❨✭▲✮

❳✭▲✮ ❂ ❳❚✯❈ ✲ ❨❚✯❙

❨✭▲✮ ❂ ❳❚✯❙ ✰ ❨❚✯❈

✸✵

❈❖◆❚■◆❯❊

✷✵

❈❖◆❚■◆❯❊

✶✵

❈❖◆❚■◆❯❊

Listing 9.1: A Radix-2 Cooley-Tukey FFT Program

This discussion, the flow graph of Winograd’s Short DFT Algo-

rithms: Figure 2 (Figure 7.2) and the program of p. ?? are all based

on the input index map of Multidimensional Index Mapping: Equa-

tion 6 (3.6) and (9.1) and the calculations are performed in-place.

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 9. THE COOLEY-TUKEY

116

FAST FOURIER TRANSFORM

ALGORITHM

According to Multidimensional Index Mapping: In-Place Calcu-

lation of the DFT and Scrambling (Section 3.2: In-Place Calcula-

tion of the DFT and Scrambling), this means the output is scram-

bled in bit-reversed order and should be followed by an unscram-

bler to give the DFT in proper order. This formulation is called

a decimation-in-frequency FFT [274], [299], [38]. A very similar

algorithm based on the output index map can be derived which is

called a decimation-in-time FFT. Examples of FFT programs are

found in [64] and in the Appendix of this book.

9.1 Modifications to the Basic Cooley-Tukey

FFT

Soon after the paper by Cooley and Tukey, there were improve-

ments and extensions made. One very important discovery was the

improvement in efficiency by using a larger radix of 4, 8 or even 16.

For example, just as for the radix-2 butterfly, there are no multipli-

cations required for a length-4 DFT, and therefore, a radix-4 FFT

would have only twiddle factor multiplications. Because there are

half as many stages in a radix-4 FFT, there would be half as many

multiplications as in a radix-2 FFT. In practice, because some of

the multiplications are by unity, the improvement is not by a fac-

tor of two, but it is significant. A radix-4 FFT is easily developed

from the basic radix-2 structure by replacing the length-2 butter-

fly by a length-4 butterfly and making a few other modifications.

Programs can be found in [64] and operation counts will be given

in "Evaluation of the Cooley-Tukey FFT Algorithms" (Section 9.3:

Evaluation of the Cooley-Tukey FFT Algorithms).

Increasing the radix to 8 gives some improvement but not as much

as from 2 to 4. Increasing it to 16 is theoretically promising but the

small decrease in multiplications is somewhat offset by an increase

in additions and the program becomes rather long. Other radices

are not attractive because they generally require a substantial num-

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

117

ber of multiplications and additions in the butterflies.

The second method of reducing arithmetic is to remove the un-

necessary TF multiplications by plus or minus unity or by plus or

minus the square root of minus one. This occurs when the expo-

nent of WN is zero or a multiple of N/4. A reduction of additions as

well as multiplications is achieved by removing these extraneous

complex multiplications since a complex multiplication requires at

least two real additions. In a program, this reduction is usually

achieved by having special butterflies for the cases where the TF is

one or j. As many as four special butterflies may be necessary to

remove all unnecessary arithmetic, but in many cases there will be

no practical improvement above two or three.

In addition to removing multiplications by one or j, there can

be a reduction in multiplications by using a special butterfly for

TFs with WN/8, which have equal real and imaginary parts. Also,

for computers or hardware with multiplication considerably slower

than addition, it is desirable to use an algorithm for complex mul-

tiplication that requires three multiplications and three additions

rather than the conventional four multiplications and two additions.

Note that this gives no reduction in the total number of arithmetic

operations, but does give a trade of multiplications for additions.

This is one reason not to use complex data types in programs but

to explicitly program complex arithmetic.

A time-consuming and unnecessary part of the execution of a FFT

program is the calculation of the sine and cosine terms which are

the real and imaginary parts of the TFs. There are basically three

approaches to obtaining the sine and cosine values. They can be

calculated as needed which is what is done in the sample program

above. One value per stage can be calculated and the others recur-

sively calculated from those. That method is fast but suffers from

accumulated round-off errors. The fastest method is to fetch pre-

calculated values from a stored table. This has the disadvantage of

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 9. THE COOLEY-TUKEY

118

FAST FOURIER TRANSFORM

ALGORITHM

requiring considerable memory space.

If all the N DFT values are not needed, special forms of the FFT

can be developed using a process called pruning [226] which re-

moves the operations concerned with the unneeded outputs.

Special algorithms are possible for cases with real data or with

symmetric data [82]. The decimation-in-time algorithm can be

easily modified to transform real data and save half the arithmetic

required for complex data [357]. There are numerous other mod-

ifications to deal with special hardware considerations such as an

array processor or a special microprocessor such as the Texas In-

struments TMS320. Examples of programs that deal with some of

these items can be found in [299], [64], [82].

9.2 The Split-Radix FFT Algorithm

Recently several papers [228], [106], [393], [350], [102] have been

published on algorithms to calculate a length-2M DFT more ef-

ficiently than a Cooley-Tukey FFT of any radix. They all have

the same computational complexity and are optimal for lengths up

through 16 and until recently was thought to give the best total

add-multiply count possible for any power-of-two length. Yavne

published an algorithm with the same computational complexity

in 1968 [421], but it went largely unnoticed. Johnson and Frigo

have recently reported the first improvement in almost 40 years

[201]. The reduction in total operations is only a few percent, but

it is a reduction.

The basic idea behind the split-radix FFT (SRFFT) as derived by

Duhamel and Hollmann [106], [102] is the application of a radix-2

index map to the even-indexed terms and a radix-4 map to the odd-

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

119

indexed terms. The basic definition of the DFT

N−1

Ck = ∑ xn W nk

(9.12)

n=0

with W = e− j2π/N gives

N/2−1

C2k = ∑

xn + xn+N/2 W 2nk

(9.13)

n=0

for the even index terms, and

N/4−1

C4k+1 = ∑

xn − xn+N/2 − j xn+N/4 − xn+3N/4

W n W 4nk

(9.14)

n=0

and

N/4−1

C4k+3 = ∑

xn − xn+N/2 + j xn+N/4 − xn+3N/4

W 3n W 4nk

(9.15)

n=0

for the odd index terms. This results in an L-shaped “butterfly"

shown in Figure 9.3 which relates a length-N DFT to one length-

N/2 DFT and two length-N/4 DFT’s with twiddle factors. Re-

peating this process for the half and quarter length DFT’s until

scalars result gives the SRFFT algorithm in much the same way

the decimation-in-frequency radix-2 Cooley-Tukey FFT is derived

[274], [299], [38]. The resulting flow graph for the algorithm cal-

culated in place looks like a radix-2 FFT except for the location

of the twiddle factors. Indeed, it is the location of the twiddle fac-

tors that makes this algorithm use less arithmetic. The L- shaped

SRFFT butterfly Figure 9.3 advances the calculation of the top half

by one of the M stages while the lower half, like a radix-4 butter-

fly, calculates two stages at once. This is illustrated for N = 8 in

Figure 9.4.

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 9. THE COOLEY-TUKEY

120

FAST FOURIER TRANSFORM

ALGORITHM

- j

-

-

Figure 9.3: SRFFT Butterfly

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

121

j

w

j

j

w3

Figure 9.4: Length-8 SRFFT

Unlike the fixed radix, mixed radix or variable radix Cooley-Tukey

FFT or even the prime factor algorithm or Winograd Fourier trans-

form algorithm , the Split-Radix FFT does not progress completely

stage by stage, or, in terms of indices, does not complete each

nested sum in order. This is perhaps better seen from the polyno-

mial formulation of Martens [228]. Because of this, the indexing is

somewhat more complicated than the conventional Cooley-Tukey

program.

A FORTRAN program is given below which implements the basic

decimation-in-frequency split-radix FFT algorithm. The indexing

scheme [350] of this program gives a structure very similar to the

Cooley-Tukey programs in [64] and allows the same modifications

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 9. THE COOLEY-TUKEY

122

FAST FOURIER TRANSFORM

ALGORITHM

and improvements such as decimation-in-time, multiple butterflies,

table look-up of sine and cosine values, three real per complex

multiply methods, and real data versions [102], [357].

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

123

❙❯❇❘❖❯❚■◆❊ ❋❋❚✭❳✱❨✱◆✱▼✮

◆✷ ❂ ✷✯◆

❉❖ ✶✵ ❑ ❂ ✶✱ ▼✲✶

◆✷ ❂ ◆✷✴✷

◆✹ ❂ ◆✷✴✹

❊ ❂ ✻✳✷✽✸✶✽✺✸✵✼✶✼✾✺✽✻✴◆✷

❆ ❂ ✵

❉❖ ✷✵ ❏ ❂ ✶✱ ◆✹

❆✸ ❂ ✸✯❆

❈❈✶ ❂ ❈❖❙✭❆✮

❙❙✶ ❂ ❙■◆✭❆✮

❈❈✸ ❂ ❈❖❙✭❆✸✮

❙❙✸ ❂ ❙■◆✭❆✸✮

❂ ❏✯❊

■❙ ❂ ❏

■❉ ❂ ✷✯◆✷

✹✵

❉❖ ✸✵ ■✵ ❂ ■❙✱ ◆✲✶✱ ■❉

■✶ ❂ ■✵ ✰ ◆✹

■✷ ❂ ■✶ ✰ ◆✹

■✸ ❂ ■✷ ✰ ◆✹

❘✶

❂ ❳✭■✵✮ ✲ ❳✭■✷✮

❳✭■✵✮ ❂ ❳✭■✵✮ ✰ ❳✭■✷✮

❘✷

❂ ❳✭■✶✮ ✲ ❳✭■✸✮

❳✭■✶✮ ❂ ❳✭■✶✮ ✰ ❳✭■✸✮

❙✶

❂ ❨✭■✵✮ ✲ ❨✭■✷✮

❨✭■✵✮ ❂ ❨✭■✵✮ ✰ ❨✭■✷✮

❙✷

❂ ❨✭■✶✮ ✲ ❨✭■✸✮

❨✭■✶✮ ❂ ❨✭■✶✮ ✰ ❨✭■✸✮

❙✸

❂ ❘✶ ✲ ❙✷

❘✶

❂ ❘✶ ✰ ❙✷

❙✷

❂ ❘✷ ✲ ❙✶

❘✷

❂ ❘✷ ✰ ❙✶

❳✭■✷✮ ❂ ❘✶✯❈❈✶ ✲ ❙✷✯❙❙✶

❨✭■✷✮ ❂✲❙✷✯❈❈✶ ✲ ❘✶✯❙❙✶

Av

❳✭■✸✮

ailable for

free ❙✸✯❈❈✸

at Conne

xions ❘✷✯❙❙✸

❨✭■✸✮

<http://cnx.or

❂ ❘✷✯❈❈✸ ✲ ❙✸✯❙❙✸

g/content/col10683/1.5>

✸✵

❈❖◆❚■◆❯❊

■❙ ❂ ✷✯■❉ ✲ ◆✷ ✰ ❏

■❉ ❂ ✹✯■❉

■❋ ✭■❙✳▲❚✳◆✮ ●❖❚❖ ✹✵

✷✵

❈❖◆❚■◆❯❊

✶✵

❈❖◆❚■◆❯❊

■❙ ❂ ✶

■❉ ❂ ✹

✺✵

❉❖ ✻✵ ■✵ ❂ ■❙✱ ◆✱ ■❉

■✶

❂ ■✵ ✰ ✶

❘✶

❂ ❳✭■✵✮

❳✭■✵✮ ❂ ❘✶ ✰ ❳✭■✶✮

❳✭■✶✮ ❂ ❘✶ ✲ ❳✭■✶✮

❘✶

❂ ❨✭■✵✮

❨✭■✵✮ ❂ ❘✶ ✰ ❨✭■✶✮

✻✵

❨✭■✶✮ ❂ ❘✶ ✲ ❨✭■✶✮

■❙ ❂ ✷✯■❉ ✲ ✶

■❉ ❂ ✹✯■❉

■❋ ✭■❙✳▲❚✳◆✮ ●❖❚❖ ✺✵

Listing 9.2: Split-Radix FFT FORTRAN Subroutine

CHAPTER 9. THE COOLEY-TUKEY

124

FAST FOURIER TRANSFORM

ALGORITHM

As was done for the other decimation-in-frequency algorithms, the

input index map is used and the calculations are done in place re-

sulting in the output being in bit-reversed order. It is the three state-

ments following label 30 that do the special indexing required by

the SRFFT. The last stage is length- 2 and, therefore, inappropriate

for the standard L-shaped butterfly, so it is calculated separately in

the DO 60 loop. This program is considered a one-butterfly ver-

sion. A second butterfly can be added just before statement 40 to