Fourier Transform
Algorithm1
The publication by Cooley and Tukey [90] in 1965 of an efficient
algorithm for the calculation of the DFT was a major turning point
in the development of digital signal processing. During the five or
so years that followed, various extensions and modifications were
made to the original algorithm [95]. By the early 1970’s the prac-
tical programs were basically in the form used today. The standard
development presented in [274], [299], [38] shows how the DFT of
a length-N sequence can be simply calculated from the two length-
N/2 DFT’s of the even index terms and the odd index terms. This
is then applied to the two half-length DFT’s to give four quarter-
length DFT’s, and repeated until N scalars are left which are the
DFT values. Because of alternately taking the even and odd index
terms, two forms of the resulting programs are called decimation-
in-time and decimation-in-frequency. For a length of 2M, the di-
viding process is repeated M = log2N times and requires N multi-
plications each time. This gives the famous formula for the com-
1This content is available online at <http://cnx.org/content/m16334/1.13/>.
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
109
CHAPTER 9. THE COOLEY-TUKEY
110
FAST FOURIER TRANSFORM
ALGORITHM
putational complexity of the FFT of Nlog2N which was derived in
Multidimensional Index Mapping: Equation 34 (3.34).
Although the decimation methods are straightforward and easy to
understand, they do not generalize well. For that reason it will be
assumed that the reader is familiar with that description and this
chapter will develop the FFT using the index map from Multidi-
mensional Index Mapping (Chapter 3).
The Cooley-Tukey FFT always uses the Type 2 index map from
Multidimensional Index Mapping: Equation 11 (3.11). This is
necessary for the most popular forms that have N = RM, but is
also used even when the factors are relatively prime and a Type 1
map could be used. The time and frequency maps from Multidi-
mensional Index Mapping: Equation 6 (3.6) and Multidimensional
Index Mapping: Equation 12 (3.12) are
n = ((K1n1 + K2n2))
(9.1)
N
k = ((K3k1 + K4k2))
(9.2)
N
Type-2 conditions Multidimensional Index Mapping: Equation 8
(3.8) and Multidimensional Index Mapping: Equation 11 (3.11)
become
K1 = aN2 or K2 = bN1 but not both
(9.3)
and
K3 = cN2 or K4 = dN1 but not both
(9.4)
The row and column calculations in Multidimensional Index Map-
ping: Equation 15 (3.15) are uncoupled by Multidimensional Index
Mapping: Equation 16 (3.16) which for this case are
((K1K4)) = 0 or ((K
= 0 but not both
(9.5)
N
2K3))N
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
111
To make each short sum a DFT, the Ki must satisfy
((K1K3)) = N
= N
N
2
and
((K2K4))N
1
(9.6)
In order to have the smallest values for Ki the constants in (9.3)
are chosen to be
a = d = K2 = K3 = 1
(9.7)
which makes the index maps of (9.1) become
n = N2n1 + n2
(9.8)
k = k1 + N1k2
(9.9)
These index maps are all evaluated modulo N, but in (9.8), explicit
reduction is not necessary since n never exceeds N. The reduction
notation will be omitted for clarity. From Multidimensional In-
dex Mapping: Equation 15 (3.15) and example Multidimensional
Index Mapping: Equation 19 (3.19), the DFT is
N2−1N1−1
X = ∑ ∑ x W n1k1 W n2k1 W n2k2
(9.10)
N1
N
N2
n2=0 n1=0
This map of (9.8) and the form of the DFT in (9.10) are the fun-
damentals of the Cooley-Tukey FFT.
The order of the summations using the Type 2 map in (9.10) cannot
be reversed as it can with the Type-1 map. This is because of the
WN terms, the twiddle factors.
Turning (9.10) into an efficient program requires some care. From
Multidimensional Index Mapping: Efficiencies Resulting from In-
dex Mapping with the DFT (Section 3.3: Efficiencies Resulting
from Index Mapping with the DFT) we know that all the factors
should be equal. If N = RM , with R called the radix, N1 is first
set equal to R and N2 is then necessarily RM−1. Consider n1 to
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 9. THE COOLEY-TUKEY
112
FAST FOURIER TRANSFORM
ALGORITHM
be the index along the rows and n2 along the columns. The inner
sum of (9.10) over n1 represents a length-N1 DFT for each value
of n2. These N2 length-N1 DFT’s are the DFT’s of the rows of the
x (n1, n2) array. The resulting array of row DFT’s is multiplied by
an array of twiddle factors which are the WN terms in (9.10). The
twiddle-factor array for a length-8 radix-2 FFT is
W 0 W 0
1
1
W 0
W 1
1
W
T F :
W n2k1 =
=
(9.11)
8
W 0
W 2
1
− j
W 0 W 3
1 − jW
The twiddle factor array will always have unity in the first row and
first column.
To complete (9.10) at this point, after the row DFT’s are multiplied
by the TF array, the N1 length-N2 DFT’s of the columns are calcu-
lated. However, since the columns DFT’s are of length RM−1, they
can be posed as a RM−2 by R array and the process repeated, again
using length-R DFT’s. After M stages of length-R DFT’s with TF
multiplications interleaved, the DFT is complete. The flow graph
of a length-2 DFT is given in Figure 1 (7.18) and is called a butter-
fly because of its shape. The flow graph of the complete length-8
radix-2 FFT is shown in Figure 2 (7.19) .
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
113
x(0)
X(0) = x(0) + x(1)
x(1)
X(0) = x(0) - x(1)
Figure 9.1: A Radix-2 Butterfly
x
x
0
0
x
x
1
-
4
0
W
x
x
2
2
-
2
W
x
x
3
-
-
6
0
W
x
x
4
-
1
1
W
x
x
5
-
-
5
2
0
W
W
x
x
6
-
-
3
3
W
2
W
x
x
7
-
-
-
7
Figure 9.2: Length-8 Radix-2 FFT Flow Graph
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 9. THE COOLEY-TUKEY
114
FAST FOURIER TRANSFORM
ALGORITHM
This flow-graph, the twiddle factor map of (9.11), and the basic
equation (9.10) should be completely understood before going fur-
ther.
A very efficient indexing scheme has evolved over the years that
results in a compact and efficient computer program. A FORTRAN
program is given below that implements the radix-2 FFT. It should
be studied [64] to see how it implements (9.10) and the flow-graph
representation.
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
115
◆✷ ❂ ◆
❉❖ ✶✵ ❑ ❂ ✶✱ ▼
◆✶ ❂ ◆✷
◆✷ ❂ ◆✷✴✷
❊ ❂ ✻✳✷✽✸✶✽✴◆✶
❆ ❂ ✵
❉❖ ✷✵ ❏ ❂ ✶✱ ◆✷
❈ ❂ ❈❖❙ ✭❆✮
❙ ❂✲❙■◆ ✭❆✮
❆ ❂ ❏✯❊
❉❖ ✸✵ ■ ❂ ❏✱ ◆✱ ◆✶
▲ ❂ ■ ✰ ◆✷
❳❚
❂ ❳✭■✮ ✲ ❳✭▲✮
❳✭■✮ ❂ ❳✭■✮ ✰ ❳✭▲✮
❨❚
❂ ❨✭■✮ ✲ ❨✭▲✮
❨✭■✮ ❂ ❨✭■✮ ✰ ❨✭▲✮
❳✭▲✮ ❂ ❳❚✯❈ ✲ ❨❚✯❙
❨✭▲✮ ❂ ❳❚✯❙ ✰ ❨❚✯❈
✸✵
❈❖◆❚■◆❯❊
✷✵
❈❖◆❚■◆❯❊
✶✵
❈❖◆❚■◆❯❊
Listing 9.1: A Radix-2 Cooley-Tukey FFT Program
This discussion, the flow graph of Winograd’s Short DFT Algo-
rithms: Figure 2 (Figure 7.2) and the program of p. ?? are all based
on the input index map of Multidimensional Index Mapping: Equa-
tion 6 (3.6) and (9.1) and the calculations are performed in-place.
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 9. THE COOLEY-TUKEY
116
FAST FOURIER TRANSFORM
ALGORITHM
According to Multidimensional Index Mapping: In-Place Calcu-
lation of the DFT and Scrambling (Section 3.2: In-Place Calcula-
tion of the DFT and Scrambling), this means the output is scram-
bled in bit-reversed order and should be followed by an unscram-
bler to give the DFT in proper order. This formulation is called
a decimation-in-frequency FFT [274], [299], [38]. A very similar
algorithm based on the output index map can be derived which is
called a decimation-in-time FFT. Examples of FFT programs are
found in [64] and in the Appendix of this book.
9.1 Modifications to the Basic Cooley-Tukey
FFT
Soon after the paper by Cooley and Tukey, there were improve-
ments and extensions made. One very important discovery was the
improvement in efficiency by using a larger radix of 4, 8 or even 16.
For example, just as for the radix-2 butterfly, there are no multipli-
cations required for a length-4 DFT, and therefore, a radix-4 FFT
would have only twiddle factor multiplications. Because there are
half as many stages in a radix-4 FFT, there would be half as many
multiplications as in a radix-2 FFT. In practice, because some of
the multiplications are by unity, the improvement is not by a fac-
tor of two, but it is significant. A radix-4 FFT is easily developed
from the basic radix-2 structure by replacing the length-2 butter-
fly by a length-4 butterfly and making a few other modifications.
Programs can be found in [64] and operation counts will be given
in "Evaluation of the Cooley-Tukey FFT Algorithms" (Section 9.3:
Evaluation of the Cooley-Tukey FFT Algorithms).
Increasing the radix to 8 gives some improvement but not as much
as from 2 to 4. Increasing it to 16 is theoretically promising but the
small decrease in multiplications is somewhat offset by an increase
in additions and the program becomes rather long. Other radices
are not attractive because they generally require a substantial num-
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
117
ber of multiplications and additions in the butterflies.
The second method of reducing arithmetic is to remove the un-
necessary TF multiplications by plus or minus unity or by plus or
minus the square root of minus one. This occurs when the expo-
nent of WN is zero or a multiple of N/4. A reduction of additions as
well as multiplications is achieved by removing these extraneous
complex multiplications since a complex multiplication requires at
least two real additions. In a program, this reduction is usually
achieved by having special butterflies for the cases where the TF is
one or j. As many as four special butterflies may be necessary to
remove all unnecessary arithmetic, but in many cases there will be
no practical improvement above two or three.
In addition to removing multiplications by one or j, there can
be a reduction in multiplications by using a special butterfly for
TFs with WN/8, which have equal real and imaginary parts. Also,
for computers or hardware with multiplication considerably slower
than addition, it is desirable to use an algorithm for complex mul-
tiplication that requires three multiplications and three additions
rather than the conventional four multiplications and two additions.
Note that this gives no reduction in the total number of arithmetic
operations, but does give a trade of multiplications for additions.
This is one reason not to use complex data types in programs but
to explicitly program complex arithmetic.
A time-consuming and unnecessary part of the execution of a FFT
program is the calculation of the sine and cosine terms which are
the real and imaginary parts of the TFs. There are basically three
approaches to obtaining the sine and cosine values. They can be
calculated as needed which is what is done in the sample program
above. One value per stage can be calculated and the others recur-
sively calculated from those. That method is fast but suffers from
accumulated round-off errors. The fastest method is to fetch pre-
calculated values from a stored table. This has the disadvantage of
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 9. THE COOLEY-TUKEY
118
FAST FOURIER TRANSFORM
ALGORITHM
requiring considerable memory space.
If all the N DFT values are not needed, special forms of the FFT
can be developed using a process called pruning [226] which re-
moves the operations concerned with the unneeded outputs.
Special algorithms are possible for cases with real data or with
symmetric data [82]. The decimation-in-time algorithm can be
easily modified to transform real data and save half the arithmetic
required for complex data [357]. There are numerous other mod-
ifications to deal with special hardware considerations such as an
array processor or a special microprocessor such as the Texas In-
struments TMS320. Examples of programs that deal with some of
these items can be found in [299], [64], [82].
9.2 The Split-Radix FFT Algorithm
Recently several papers [228], [106], [393], [350], [102] have been
published on algorithms to calculate a length-2M DFT more ef-
ficiently than a Cooley-Tukey FFT of any radix. They all have
the same computational complexity and are optimal for lengths up
through 16 and until recently was thought to give the best total
add-multiply count possible for any power-of-two length. Yavne
published an algorithm with the same computational complexity
in 1968 [421], but it went largely unnoticed. Johnson and Frigo
have recently reported the first improvement in almost 40 years
[201]. The reduction in total operations is only a few percent, but
it is a reduction.
The basic idea behind the split-radix FFT (SRFFT) as derived by
Duhamel and Hollmann [106], [102] is the application of a radix-2
index map to the even-indexed terms and a radix-4 map to the odd-
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
119
indexed terms. The basic definition of the DFT
N−1
Ck = ∑ xn W nk
(9.12)
n=0
with W = e− j2π/N gives
N/2−1
C2k = ∑
xn + xn+N/2 W 2nk
(9.13)
n=0
for the even index terms, and
N/4−1
C4k+1 = ∑
xn − xn+N/2 − j xn+N/4 − xn+3N/4
W n W 4nk
(9.14)
n=0
and
N/4−1
C4k+3 = ∑
xn − xn+N/2 + j xn+N/4 − xn+3N/4
W 3n W 4nk
(9.15)
n=0
for the odd index terms. This results in an L-shaped “butterfly"
shown in Figure 9.3 which relates a length-N DFT to one length-
N/2 DFT and two length-N/4 DFT’s with twiddle factors. Re-
peating this process for the half and quarter length DFT’s until
scalars result gives the SRFFT algorithm in much the same way
the decimation-in-frequency radix-2 Cooley-Tukey FFT is derived
[274], [299], [38]. The resulting flow graph for the algorithm cal-
culated in place looks like a radix-2 FFT except for the location
of the twiddle factors. Indeed, it is the location of the twiddle fac-
tors that makes this algorithm use less arithmetic. The L- shaped
SRFFT butterfly Figure 9.3 advances the calculation of the top half
by one of the M stages while the lower half, like a radix-4 butter-
fly, calculates two stages at once. This is illustrated for N = 8 in
Figure 9.4.
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 9. THE COOLEY-TUKEY
120
FAST FOURIER TRANSFORM
ALGORITHM
- j
-
-
Figure 9.3: SRFFT Butterfly
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
121
j
w
j
j
w3
Figure 9.4: Length-8 SRFFT
Unlike the fixed radix, mixed radix or variable radix Cooley-Tukey
FFT or even the prime factor algorithm or Winograd Fourier trans-
form algorithm , the Split-Radix FFT does not progress completely
stage by stage, or, in terms of indices, does not complete each
nested sum in order. This is perhaps better seen from the polyno-
mial formulation of Martens [228]. Because of this, the indexing is
somewhat more complicated than the conventional Cooley-Tukey
program.
A FORTRAN program is given below which implements the basic
decimation-in-frequency split-radix FFT algorithm. The indexing
scheme [350] of this program gives a structure very similar to the
Cooley-Tukey programs in [64] and allows the same modifications
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 9. THE COOLEY-TUKEY
122
FAST FOURIER TRANSFORM
ALGORITHM
and improvements such as decimation-in-time, multiple butterflies,
table look-up of sine and cosine values, three real per complex
multiply methods, and real data versions [102], [357].
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
123
❙❯❇❘❖❯❚■◆❊ ❋❋❚✭❳✱❨✱◆✱▼✮
◆✷ ❂ ✷✯◆
❉❖ ✶✵ ❑ ❂ ✶✱ ▼✲✶
◆✷ ❂ ◆✷✴✷
◆✹ ❂ ◆✷✴✹
❊ ❂ ✻✳✷✽✸✶✽✺✸✵✼✶✼✾✺✽✻✴◆✷
❆ ❂ ✵
❉❖ ✷✵ ❏ ❂ ✶✱ ◆✹
❆✸ ❂ ✸✯❆
❈❈✶ ❂ ❈❖❙✭❆✮
❙❙✶ ❂ ❙■◆✭❆✮
❈❈✸ ❂ ❈❖❙✭❆✸✮
❙❙✸ ❂ ❙■◆✭❆✸✮
❆
❂ ❏✯❊
■❙ ❂ ❏
■❉ ❂ ✷✯◆✷
✹✵
❉❖ ✸✵ ■✵ ❂ ■❙✱ ◆✲✶✱ ■❉
■✶ ❂ ■✵ ✰ ◆✹
■✷ ❂ ■✶ ✰ ◆✹
■✸ ❂ ■✷ ✰ ◆✹
❘✶
❂ ❳✭■✵✮ ✲ ❳✭■✷✮
❳✭■✵✮ ❂ ❳✭■✵✮ ✰ ❳✭■✷✮
❘✷
❂ ❳✭■✶✮ ✲ ❳✭■✸✮
❳✭■✶✮ ❂ ❳✭■✶✮ ✰ ❳✭■✸✮
❙✶
❂ ❨✭■✵✮ ✲ ❨✭■✷✮
❨✭■✵✮ ❂ ❨✭■✵✮ ✰ ❨✭■✷✮
❙✷
❂ ❨✭■✶✮ ✲ ❨✭■✸✮
❨✭■✶✮ ❂ ❨✭■✶✮ ✰ ❨✭■✸✮
❙✸
❂ ❘✶ ✲ ❙✷
❘✶
❂ ❘✶ ✰ ❙✷
❙✷
❂ ❘✷ ✲ ❙✶
❘✷
❂ ❘✷ ✰ ❙✶
❳✭■✷✮ ❂ ❘✶✯❈❈✶ ✲ ❙✷✯❙❙✶
❨✭■✷✮ ❂✲❙✷✯❈❈✶ ✲ ❘✶✯❙❙✶
Av
❳✭■✸✮
ailable for
❂
free ❙✸✯❈❈✸
at Conne
✰
xions ❘✷✯❙❙✸
❨✭■✸✮
<http://cnx.or
❂ ❘✷✯❈❈✸ ✲ ❙✸✯❙❙✸
g/content/col10683/1.5>
✸✵
❈❖◆❚■◆❯❊
■❙ ❂ ✷✯■❉ ✲ ◆✷ ✰ ❏
■❉ ❂ ✹✯■❉
■❋ ✭■❙✳▲❚✳◆✮ ●❖❚❖ ✹✵
✷✵
❈❖◆❚■◆❯❊
✶✵
❈❖◆❚■◆❯❊
■❙ ❂ ✶
■❉ ❂ ✹
✺✵
❉❖ ✻✵ ■✵ ❂ ■❙✱ ◆✱ ■❉
■✶
❂ ■✵ ✰ ✶
❘✶
❂ ❳✭■✵✮
❳✭■✵✮ ❂ ❘✶ ✰ ❳✭■✶✮
❳✭■✶✮ ❂ ❘✶ ✲ ❳✭■✶✮
❘✶
❂ ❨✭■✵✮
❨✭■✵✮ ❂ ❘✶ ✰ ❨✭■✶✮
✻✵
❨✭■✶✮ ❂ ❘✶ ✲ ❨✭■✶✮
■❙ ❂ ✷✯■❉ ✲ ✶
■❉ ❂ ✹✯■❉
■❋ ✭■❙✳▲❚✳◆✮ ●❖❚❖ ✺✵
Listing 9.2: Split-Radix FFT FORTRAN Subroutine
CHAPTER 9. THE COOLEY-TUKEY
124
FAST FOURIER TRANSFORM
ALGORITHM
As was done for the other decimation-in-frequency algorithms, the
input index map is used and the calculations are done in place re-
sulting in the output being in bit-reversed order. It is the three state-
ments following label 30 that do the special indexing required by
the SRFFT. The last stage is length- 2 and, therefore, inappropriate
for the standard L-shaped butterfly, so it is calculated separately in
the DO 60 loop. This program is considered a one-butterfly ver-
sion. A second butterfly can be added just before statement 40 to