Winograd Fourier Transform
Algorithms1
The prime factor algorithm (PFA) and the Winograd Fourier trans-
form algorithm (WFTA) are methods for efficiently calculating the
DFT which use, and in fact, depend on the Type-1 index map from
Multidimensional Index Mapping: Equation 10 (3.10) and Mul-
tidimensional Index Mapping: Equation 6 (3.6). The use of this
index map preceded Cooley and Tukey’s paper [150], [302] but its
full potential was not realized until it was combined with Wino-
grad’s short DFT algorithms. The modern PFA was first presented
in [213] and a program given in [57]. The WFTA was first pre-
sented in [407] and programs given in [236], [83].
The number theoretic basis for the indexing in these algorithms
may, at first, seem more complicated than in the Cooley-Tukey
FFT; however, if approached from the general index mapping point
of view of Multidimensional Index Mapping (Chapter 3), it is
straightforward, and part of a common approach to breaking large
1This content is available online at <http://cnx.org/content/m16335/1.9/>.
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
137
CHAPTER 10. THE PRIME FACTOR
138
AND WINOGRAD FOURIER
TRANSFORM ALGORITHMS
problems into smaller ones. The development in this section will
parallel that in The Cooley-Tukey Fast Fourier Transform Algo-
rithm (Chapter 9).
The general index maps of Multidimensional Index Mapping:
Equation 6 (3.6) and Multidimensional Index Mapping: Equa-
tion 12 (3.12) must satisfy the Type-1 conditions of Multidimen-
sional Index Mapping: Equation 7 (3.7) and Multidimensional In-
dex Mapping: Equation 10 (3.10) which are
K1 = aN2 and K2 = bN1 with (K1, N1) = (K2, N2) = 1 (10.1)
K3 = cN2 and K4 = dN1 with (K3, N1) = (K4, N2) = 1 (10.2)
The row and column calculations in Multidimensional Index Map-
ping: Equation 15 (3.15) are uncoupled by Multidimensional Index
Mapping: Equation 16 (3.16) which for this case are
((K1K4)) = ((K
= 0
(10.3)
N
2K3))N
In addition, to make each short sum a DFT, the Ki must also satisfy
((K1K3)) = N
= N
N
2
and ((K2K4))N
1
(10.4)
In order to have the smallest values for Ki, the constants in (10.1)
are chosen to be
a = b = 1, c =
N−1
, d =
N−1
(10.5)
2
N
1
N
which gives for the index maps in (10.1)
n = ((N2n1 + N1n2))
(10.6)
N
k = ((K3k1 + K4k2))
(10.7)
N
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
139
The frequency index map is a form of the Chinese remainder theo-
rem. Using these index maps, the DFT in Multidimensional Index
Mapping: Equation 15 (3.15) becomes
N2−1N1−1
X = ∑ ∑ x W n1k1 W n2k2
(10.8)
N1
N2
n2=0 n1=0
which is a pure two-dimensional DFT with no twiddle factors and
the summations can be done in either order. Choices other than
(10.5) could be used. For example, a = b = c = d = 1 will cause
the input and output index map to be the same and, therefore, there
will be no scrambling of the output order. The short summations
in (96), however, will no longer be short DFT’s [57].
An important feature of the short Winograd DFT’s described in
Winograd’s Short DFT Algorithms (Chapter 7) that is useful for
both the PFA and WFTA is the fact that the multiplier constants
in Winograd’s Short DFT Algorithms: Equation 6 (7.6) or Wino-
grad’s Short DFT Algorithms: Equation 8 (7.8) are either real or
imaginary, never a general complex number. For that reason, mul-
tiplication by complex data requires only two real multiplications,
not four. That is a very significant feature. It is also true that the
j multiplier can be commuted from the D operator to the last part
of the AT operator. This means the D operator has only real mul-
tipliers and the calculations on real data remains real until the last
stage. This can be seen by examining the short DFT modules in
[65], [198] and in the appendices.
10.1 The Prime Factor Algorithm
If the DFT is calculated directly using (10.8), the algorithm is
called a prime factor algorithm [150], [302] and was discussed
in Winograd’s Short DFT Algorithms (Chapter 7) and Multidi-
mensional Index Mapping: In-Place Calculation of the DFT and
Scrambling (Section 3.2: In-Place Calculation of the DFT and
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 10. THE PRIME FACTOR
140
AND WINOGRAD FOURIER
TRANSFORM ALGORITHMS
Scrambling). When the short DFT’s are calculated by the very
efficient algorithms of Winograd discussed in Factoring the Signal
Processing Operators (Chapter 6), the PFA becomes a very pow-
erful method that is as fast or faster than the best Cooley-Tukey
FFT’s [57], [213].
A flow graph is not as helpful with the PFA as it was with the
Cooley-Tukey FFT, however, the following representation in Fig-
ure 10.1 which combines Figures Multidimensional Index Map-
ping: Figure 1 (Figure 3.1) and Winograd’s Short DFT Algorithms:
Figure 2 (Figure 7.2) gives a good picture of the algorithm with the
example of Multidimensional Index Mapping: Equation 25 (3.25)
10
5
5
10
11
0
0
8
1
2
3
6
11
7
8
6
12
14
13
14
9
3
2
4
12
9
Figure 10.1: A Prime Factor FFT for N = 15
If N is factored into three factors, the DFT of (10.8) would have
three nested summations and would be a three-dimensional DFT.
This principle extends to any number of factors; however, recall
that the Type-1 map requires that all the factors be relatively prime.
A very simple three-loop indexing scheme has been developed [57]
which gives a compact, efficient PFA program for any number of
factors. The basic program structure is illustrated in p. ?? with
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
141
the short DFT’s being omitted for clarity. Complete programs are
given in [65] and in the appendices.
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 10. THE PRIME FACTOR
142
AND WINOGRAD FOURIER
TRANSFORM ALGORITHMS
❈✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲P❋❆ ■◆❉❊❳■◆● ▲❖❖P❙✲✲✲✲✲✲✲✲✲✲✲✲✲✲
❉❖ ✶✵ ❑ ❂ ✶✱ ▼
◆✶ ❂ ◆■✭❑✮
◆✷ ❂ ◆✴◆✶
■✭✶✮ ❂ ✶
❉❖ ✷✵ ❏ ❂ ✶✱ ◆✷
❉❖ ✸✵ ▲❂✷✱ ◆✶
■✭▲✮ ❂ ■✭▲✲✶✮ ✰ ◆✷
■❋ ✭■✭▲ ✳●❚✳◆✮ ■✭▲✮ ❂ ■✭▲✮ ✲ ◆
✸✵
❈❖◆❚■◆❯❊
●❖❚❖ ✭✷✵✱✶✵✷✱✶✵✸✱✶✵✹✱✶✵✺✮✱ ◆✶
■✭✶✮ ❂ ■✭✶✮ ✰ ◆✶
✷✵
❈❖◆❚■◆❯❊
✶✵
❈❖◆❚■◆❯❊
❘❊❚❯❘◆
❈✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲▼❖❉❯▲❊ ❋❖❘ ◆❂✷✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲
✶✵✷
❘✶
❂ ❳✭■✭✶✮✮
❳✭■✭✶✮✮ ❂ ❘✶ ✰ ❳✭■✭✷✮✮
❳✭■✭✷✮✮ ❂ ❘✶ ✲ ❳✭■✭✷✮✮
❘✶
❂ ❨✭■✭✶✮✮
❨✭■✭✶✮✮ ❂ ❘✶ ✰ ❨✭■✭✷✮✮
❨✭■✭✷✮✮ ❂ ❘✶ ✲ ❨✭■✭✷✮✮
●❖❚❖ ✷✵
❈✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲❖❚❍❊❘ ▼❖❉❯▲❊❙✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲
✶✵✸
▲❡♥❣t❤✲✸ ❉❋❚
✶✵✹
▲❡♥❣t❤✲✹ ❉❋❚
✶✵✺
▲❡♥❣t❤✲✺ ❉❋❚
❡t❝✳
Listing 10.1: Part of a FORTRAN PFA Program
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
143
As in the Cooley-Tukey program, the DO 10 loop steps through
the M stages (factors of N) and the DO 20 loop calculates the N/N1
length-N1 DFT’s. The input index map of (10.6) is implemented in
the DO 30 loop and the statement just before label 20. In the PFA,
each stage or factor requires a separately programmed module or
butterfly. This lengthens the PFA program but an efficient Cooley-
Tukey program will also require three or more butterflies.
Because the PFA is calculated in-place using the input index map,
the output is scrambled. There are five approaches to dealing with
this scrambled output. First, there are some applications where the
output does not have to be unscrambled as in the case of high-speed
convolution. Second, an unscrambler can be added after the PFA
to give the output in correct order just as the bit-reversed-counter
is used for the Cooley-Tukey FFT. A simple unscrambler is given
in [65], [57] but it is not in place. The third method does the un-
scrambling in the modules while they are being calculated. This is
probably the fastest method but the program must be written for a
specific length [65], [57]. A fourth method is similar and achieves
the unscrambling by choosing the multiplier constants in the mod-
ules properly [198]. The fifth method uses a separate indexing
method for the input and output of each module [65], [320].
10.2 The Winograd Fourier Transform Al-
gorithm
The Winograd Fourier transform algorithm (WFTA) uses a very
powerful property of the Type-1 index map and the DFT to give
a further reduction of the number of multiplications in the PFA.
Using an operator notation where F1 represents taking row DFT’s
and F2 represents column DFT’s, the two-factor PFA of (10.8) is
represented by
X = F2 F1 x
(10.9)
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 10. THE PRIME FACTOR
144
AND WINOGRAD FOURIER
TRANSFORM ALGORITHMS
It has been shown [410], [190] that if each operator represents
identical operations on each row or column, they commute. Since
F1 and F2 represent length N1 and N2 DFT’s, they commute and
(10.9) can also be written
X = F1 F2 x
(10.10)
If each short DFT in F is expressed by three operators as in Wino-
grad’s Short DFT Algorithms: Equation 8 (7.8) and Winograd’s
Short DFT Algorithms: Figure 2 (Figure 7.2), F can be factored as
F = AT DA
(10.11)
where A represents the set of additions done on each row or col-
umn that performs the residue reduction as Winograd’s Short DFT
Algorithms: Equation 30 (7.30). Because of the appearance of the
flow graph of A and because it is the first operator on x, it is called a
preweave operator [236]. D is the set of M multiplications and AT
(or BT or CT ) from Winograd’s Short DFT Algorithms: Equation 5
(7.5) or Winograd’s Short DFT Algorithms: Equation 6 (7.6) is the
reconstruction operator called the postweave. Applying (10.11) to
(10.9) gives
X = AT2 D2 A2 AT1 D1 A1 x
(10.12)
This is the PFA of (10.8) and Figure 10.1 where A1D1A1 represents
the row DFT’s on the array formed from x. Because these operators
commute, (10.12) can also be written as
X = AT2 AT1 D2 D1 A2 A1 x
(10.13)
or
X = AT1 AT2 D2 D1 A2 A1 x
(10.14)
but the two adjacent multiplication operators can be premultiplied
and the result represented by one operator D = D2 D1 which is no
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
145
longer the same for each row or column. Equation (10.14) becomes
X = AT1 AT2 D A2 A1 x
(10.15)
This is the basic idea of the Winograd Fourier transform algorithm.
The commuting of the multiplication operators together in the cen-
ter of the algorithm is called nesting and it results in a significant
decrease in the number of multiplications that must be done at the
execution of the algorithm. Pictorially, the PFA of Figure 10.1 be-
comes [213] the WFTA in Figure 10.2.
Figure 10.2: A Length-15 WFTA with Nested Multiplica-
tions
The rectangular structure of the preweave addition operators
causes an expansion of the data in the center of the algorithm. The
15 data points in Figure 10.2 become 18 intermediate values. This
expansion is a major problem in programming the WFTA because
it prevents a straightforward in-place calculation and causes an in-
crease in the number of required additions and in the number of
multiplier constants that must be precalculated and stored.
From Figure 10.2 and the idea of premultiplying the individual
multiplication operators, it can be seen why the multiplications by
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 10. THE PRIME FACTOR
146
AND WINOGRAD FOURIER
TRANSFORM ALGORITHMS
unity had to be considered in Winograd’s Short DFT Algorithms:
Table 1 (Table 7.1). Even if a multiplier in D1 is unity, it may not
be in D2D1. In Figure 10.2 with factors of three and five, there
appear to be 18 multiplications required because of the expansion
of the length-5 preweave operator, A2, however, one of multipliers
in each of the length three and five operators is unity, so one of
the 18 multipliers in the product is unity. This gives 17 required
multiplications - a rather impressive reduction from the 152 = 225
multiplications required by direct calculation. This number of 17
complex multiplications will require only 34 real multiplications
because, as mentioned earlier, the multiplier constants are purely
real or imaginary while the 225 complex multiplications are gen-
eral and therefore will require four times as many real multiplica-
tions.
The number of additions depends on the order of the pre- and
postweave operators. For example in the length-15 WFTA in Fig-
ure 10.2, if the length-5 had been done first and last, there would
have been six row addition preweaves in the preweave operator
rather than the five shown. It is difficult to illustrate the algorithm
for three or more factors of N, but the ideas apply to any number
of factors. Each length has an optimal ordering of the pre- and
postweave operators that will minimize the number of additions.
A program for the WFTA is not as simple as for the FFT or PFA
because of the very characteristic that reduces the number of mul-
tiplications, the nesting. A simple two-factor example program is
given in [65] and a general program can be found in [236], [83].
The same lengths are possible with the PFA and WFTA and the
same short DFT modules can be used, however, the multiplies in
the modules must occur in one place for use in the WFTA.
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
147
10.3 Modifications of the PFA and WFTA
Type Algorithms
In the previous section it was seen how using the permutation prop-
erty of the elementary operators in the PFA allowed the nesting of
the multiplications to reduce their number. It was also seen that
a proper ordering of the operators could minimize the number of
additions. These ideas have been extended in formulating a more
general algorithm optimizing problem. If the DFT operator F in
(10.11) is expressed in a still more factored form obtained from
Winograd’s Short DFT Algorithms: Equation 30 (7.30), a greater
variety of ordering can be optimized. For example if the A opera-
tors have two factors
F1 = AT1 A’T
1
D1 A’1A1
(10.16)
The DFT in (10.10) becomes
X = AT
T
T
2 A’2 D2A’2A2AT
1 A’1 D1A’1A1x
(10.17)
The operator notation is very helpful in understanding the central
ideas, but may hide some important facts. It has been shown [410],
[198] that operators in different Fi commute with each other, but
the order of the operators within an Fi cannot be changed. They
represent the matrix multiplications in Winograd’s Short DFT Al-
gorithms: Equation 30 (7.30) or Winograd’s Short DFT Algo-
rithms: Equation 8 (7.8) which do not commute.
This formulation allows a very large set of possible orderings, in
fact, the number is so large that some automatic technique must
be used to find the “best". It is possible to set up a criterion of
optimality that not only includes the number of multiplications but
the number of additions as well. The effects of relative multiply-
add times, data transfer times, CPU register and memory sizes,
and other hardware characteristics can be included in the criterion.
Dynamic programming can then be applied to derive an optimal
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 10. THE PRIME FACTOR
148
AND WINOGRAD FOURIER
TRANSFORM ALGORITHMS
algorithm for a particular application [190]. This is a very interest-
ing idea as there is no longer a single algorithm, but a class and an
optimizing procedure. The challenge is to generate a broad enough
class to result in a solution that is close to a global optimum and to
have a practical scheme for finding the solution.
Results obtained applying the dynamic programming method to
the design of fairly long DFT algorithms gave algorithms that
had fewer multiplications and additions than either a pure PFA
or WFTA [190]. It seems that some nesting is desirable but not
total nesting for four or more factors. There are also some interest-
ing possibilities in mixing the Cooley-Tukey with this formulation.
Unfortunately, the twiddle factors are not the same for all rows and
columns, therefore, operations cannot commute past a twiddle fac-
tor operator. There are ways of breaking the total algorithm into
horizontal paths and using different orderings along the different
paths [264], [198]. In a sense, this is what the split-radix FFT does
with its twiddle factors when compared to a conventional Cooley-
Tukey FFT.
There are other modifications of the basic structure of the Type-1
index map DFT algorithm. One is to use the same index struc-
ture and conversion of the short DFT’s to convolution as the PFA
but to use some other method for the high-speed convolution. Ta-
ble look-up of partial products based on distributed arithmetic to
eliminate all multiplications [78] looks promising for certain very
specific applications, perhaps for specialized VLSI implementa-
tion. Another possibility is to calculate the short convolutions us-
ing number-theoretic transforms [30], [236], [264]. This would
also require special hardware. Direct calculation of short convolu-
tions is faster on certain pipelined processor such as the TMS-320
microprocessor [216].
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
149
10.4 Evaluation of the PFA and WFTA
As for the Cooley-Tukey FFT’s, the first evaluation of these al-
gorithms will be on the number of multiplications and additions
required. The number of multiplications to compute the PFA in
(10.8) is given by Multidimensional Index Mapping: Equation 3
(3.3). Using the notation that T (N ) is the number of multiplica-
tions or additions necessary to calculate a length-N DFT, the total
number for a four-factor PFA of length-N, where N = N1N2N3N4
is
T (N) = N1N2N3T (N4) + N2N3N4T (N1) + (10.18)
N3N4N1T (N2) + N4N1N2T (N3)
The count of multiplies and adds in Table 10.1 are calculated from
(105) with the counts of the factors taken from Winograd’s Short
DFT Algorithms: Table 1 (Table 7.1). The list of lengths are those
possible with modules in the program of length 2, 3, 4, 5, 7, 8,
9 and 16 as is true for the PFA in [65], [57] and the WFTA in
[236], [83]. A maximum of four relatively prime lengths can be
used from this group giving 59 different lengths over the range
from 2 to 5040. The radix-2 or split-radix FFT allows 12 different
lengths over the same range. If modules of length 11 and 13 from
[188] are added, the maximum length becomes 720720 and the
number of different lengths becomes 239. Adding modules for 17,
19 and 25 from [188] gives a maximum length of 1163962800 and
a very large and dense number of possible lengths. The length of
the code for the longer modules becomes excessive and should not
be included unless needed.
The number of multiplications necessary for the WFTA is simply
the product of those necessary for the required modules, includ-
ing multiplications by unity. The total number may contain some
unity multipliers but it is difficult to remove them in a practical
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 10. THE PRIME FACTOR
150
AND WINOGRAD FOURIER
TRANSFORM ALGORITHMS
program. Table 10.1 contains both the total number (MULTS) and
the number with the unity multiplies removed (RMULTS).
Calculating the number of additions for the WFTA is more compli-
cated than for the PFA because of the expansion of the data moving
through the algorithm. For example the number of additions, TA,
for the length-15 example in Figure 10.2 is given by
TA (N) = N2TA (N1) + T M1TA (N2)
(10.19)
where N1 = 3, N2 = 5, T M1 = the number of multiplies for the
length-3 module and hence the expansion factor. As mentioned
earlier there is an optimum ordering to minimize additions. The
ordering used to calculate Table 10.1 is the ordering used in [236],
[83] which is optimal in most cases and close to optimal in the
others.
Length
PFA
PFA
WFTA
WFTA
WFTA
N
Mults
Adds
Mults
RMults
Adds
10
20
88
24
20
88
12
16
96
24
16
96
14
32
172
36
32
172<