Algorithms1
In 1976, S. Winograd [406] presented a new DFT algorithm which
had significantly fewer multiplications than the Cooley-Tukey FFT
which had been published eleven years earlier. This new Wino-
grad Fourier Transform Algorithm (WFTA) is based on the type-
one index map from Multidimensional Index Mapping (Chapter 3)
with each of the relatively prime length short DFT’s calculated by
very efficient special algorithms. It is these short algorithms that
this section will develop. They use the index permutation of Rader
described in the another module to convert the prime length short
DFT’s into cyclic convolutions. Winograd developed a method for
calculating digital convolution with the minimum number of mul-
tiplications. These optimal algorithms are based on the polynomial
residue reduction techniques of Polynomial Description of Signals:
Equation 1 (4.1) to break the convolution into multiple small ones
[29], [235], [263], [416], [408], [197].
1This content is available online at <http://cnx.org/content/m16333/1.14/>.
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
57
CHAPTER 7. WINOGRAD’S SHORT
58
DFT ALGORITHMS
The operation of discrete convolution defined by
y (n) = ∑h(n − k) x(k)
(7.1)
k
is called a bilinear operation because, for a fixed h (n), y (n) is a
linear function of x (n) and for a fixed x (n) it is a linear function of
h (n). The operation of cyclic convolution is the same but with all
indices evaluated modulo N.
Recall from Polynomial Description of Signals: Equation 3 (4.3)
that length-N cyclic convolution of x (n) and h (n) can be repre-
sented by polynomial multiplication
Y (s) = X (s) H (s) mod
sN − 1
(7.2)
This bilinear operation of (7.1) and (7.2) can also be expressed
in terms of linear matrix operators and a simpler bilinear opera-
tor denoted by o which may be only a simple element-by-element
multiplication of the two vectors [235], [197], [212]. This matrix
formulation is
Y = C [AX oBH]
(7.3)
where X , H and Y are length-N vectors with elements of x (n),
h (n) and y (n) respectively. The matrices A and B have dimension
M x N , and C is N x M with M ≥ N. The elements of A, B, and
C are constrained to be simple; typically small integers or rational
numbers. It will be these matrix operators that do the equivalent of
the residue reduction on the polynomials in (7.2).
In order to derive a useful algorithm of the form (7.3) to calculate
(7.1), consider the polynomial formulation (7.2) again. To use the
residue reduction scheme, the modulus is factored into relatively
prime factors. Fortunately the factoring of this particular poly-
nomial, sN − 1, has been extensively studied and it has consider-
able structure. When factored over the rationals, which means that
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
59
the only coefficients allowed are rational numbers, the factors are
called cyclotomic polynomials [29], [235], [263]. The most inter-
esting property for our purposes is that most of the coefficients of
cyclotomic polynomials are zero and the others are plus or minus
unity for degrees up to over one hundred. This means the residue
reduction will generally require no multiplications.
The operations of reducing X (s) and H (s) in (7.2) are carried out
by the matrices A and B in (7.3). The convolution of the residue
polynomials is carried out by the o operator and the recombination
by the CRT is done by the C matrix. More details are in [29], [235],
[263], [197], [212] but the important fact is the A and B matrices
usually contain only zero and plus or minus unity entries and the
C matrix only contains rational numbers. The only general mul-
tiplications are those represented by o. Indeed, in the theoretical
results from computational complexity theory, these real or com-
plex multiplications are usually the only ones counted. In practical
algorithms, the rational multiplications represented by C could be
a limiting factor.
The h (n) terms are fixed for a digital filter, or they represent the W
terms from Multidimensional Index Mapping: Equation 1 (3.1) if
the convolution is being used to calculate a DFT. Because of this,
d = BH in (7.3) can be precalculated and only the A and C opera-
tors represent the mathematics done at execution of the algorithm.
In order to exploit this feature, it was shown [416], [197] that the
properties of (7.3) allow the exchange of the more complicated op-
erator C with the simpler operator B. Specifically this is given by
Y = C [AX oBH]
(7.4)
Y ’ = BT AX oCT H’
(7.5)
where H’ has the same elements as H, but in a permuted order,
and likewise Y ’ and Y . This very important property allows pre-
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 7. WINOGRAD’S SHORT
60
DFT ALGORITHMS
computing the more complicated CT H’ in (7.5) rather than BH as
in (7.3).
Because BH or CT H’ can be precomputed, the bilinear form of
(7.3) and (7.5) can be written as a linear form. If an M x M diagonal
matrix D is formed from d = CT H, or in the case of (7.3), d = BH,
assuming a commutative property for o, (7.5) becomes
Y ’ = BT DAX
(7.6)
and (7.3) becomes
Y = CDAX
(7.7)
In most cases there is no reason not to use the same reduction
operations on X and H, therefore, B can be the same as A and (7.6)
then becomes
Y ’ = AT DAX
(7.8)
In order to illustrate how the residue reduction is carried out and
how the A matrix is obtained, the length-5 DFT algorithm started
in The DFT as Convolution or Filtering: Matrix 1 (5.12) will be
continued. The DFT is first converted to a length-4 cyclic convo-
lution by the index permutation from The DFT as Convolution or
Filtering: Equation 3 (5.3) to give the cyclic convolution in The
DFT as Convolution or Filtering (Chapter 5). To avoid confusion
from the permuted order of the data x (n) in The DFT as Convo-
lution or Filtering (Chapter 5), the cyclic convolution will first be
developed without the permutation, using the polynomial U (s)
U (s) = x (1) + x (3) s + x (4) s2 + x (2) s3
(7.9)
U (s) = u (0) + u (1) s + u (2) s2 + u (3) s3
(7.10)
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
61
and then the results will be converted back to the permuted x (n).
The length-4 cyclic convolution in terms of polynomials is
Y (s) = U (s) H (s) mod
s4 − 1
(7.11)
and the modulus factors into three cyclotomic polynomials
s4 − 1 = s2 − 1
s2 + 1
(7.12)
= (s − 1) (s + 1) s2 + 1
(7.13)
= P1 P2 P3
(7.14)
Both U (s) and H (s) are reduced modulo these three polynomials.
The reduction modulo P1 and P2 is done in two stages. First it is
done modulo s2 − 1 , then that residue is further reduced modulo
(s − 1) and (s + 1).
U (s) = u0 + u1s + u2s2 + u3s3
(7.15)
U ’ (s) = ((U (s)))(s2−1) = (u0 + u2) + (u1 + u3)s
(7.16)
U 1 (s) =
U ’ (s)
= (u0 + u1 + u2 + u3)
(7.17)
P1
U 2 (s) =
U ’ (s)
= (u0 − u1 + u2 − u3)
(7.18)
P2
U 3 (s) = ((U (s)))
= (u
P
0 − u2) + (u1 − u3) s
(7.19)
3
The reduction in (7.16) of the data polynomial (7.15) can be de-
noted by a matrix operation on a vector which has the data as en-
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 7. WINOGRAD’S SHORT
62
DFT ALGORITHMS
tries.
u0
1 0 1 0
u1
u0 + u2
=
(7.20)
0 1 0 1
u2
u1 + u3
u3
and the reduction in (7.19) is
u0
1 0 −1
0
u1
u0 − u2
=
(7.21)
0 1
0
−1
u2
u1 − u3
u3
Combining (7.20) and (7.21) gives one operator
1 0
1
0
u0 + u2
0 1
0
1 u
1 + u3
=
(7.22)
1 0 −1
0 u
0 − u2
0 1
0
−1
u1 − u3
u0 + u2
w0
u
w
1 + u3
1
=
u
v
0 − u2
0
u1 − u3
v1
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
63
Further reduction of v0 + v1s is not possible because P3 = s2 + 1
cannot be factored over the rationals. However s2 − 1 can be fac-
tored into P1P2 = (s − 1) (s + 1) and, therefore, w0 + w1s can be
further reduced as was done in (7.17) and (7.18) by
w0
1 1
= w0 + w1 = u0 + u2 + u1 + u3
(7.23)
w1
w0
1 −1
= w0 − w1 = u0 + u2 − u1 − u3
(7.24)
w1
Combining (7.22), (7.23) and (7.24) gives
1
1
0 0
1 0
1
0
u0
r0
1
−1 0 0 0 1
0
1 u1
r1
=
0
0
1 0 1 0 −1
0 u2
v0
0
0
0 1
0 1
0
−1
u3
v1
(7.25)
The same reduction is done to H (s) and then the convolution of
(7.11) is done by multiplying each residue polynomial of X (s) and
H (s) modulo each corresponding cyclotomic factor of P (s) and
finally a recombination using the polynomial Chinese Remainder
Theorem (CRT) as in Polynomial Description of Signals: Equation
9 (4.9) and Polynomial Description of Signals: Equation 13 (4.13).
Y (s)
=
K1 (s)U1 (s) H1 (s)
+
(7.26)
K2 (s)U2 (s) H2 (s) + K3 (s)U3 (s) H3 (s)
mod s4 − 1
where U1 (s) = r1 and U2 (s) = r2 are constants and U3 (s) = v0 +
v1s is a first degree polynomial. U1 times H1 and U2 times H2 are
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 7. WINOGRAD’S SHORT
64
DFT ALGORITHMS
easy, but multiplying U3 time H3 modulo s2 + 1 is more difficult.
The multiplication of U3 (s) times H3 (s) can be done by the Toom-
Cook algorithm [29], [235], [263] which can be viewed as La-
grange interpolation or polynomial multiplication modulo a spe-
cial polynomial with three arbitrary coefficients. To simplify the
arithmetic, the constants are chosen to be plus and minus one and
zero. The details of this can be found in [29], [235], [263]. For this
example it can be verified that
((v0 + v1s) (h0 + h1s))
= (v
s2+1
0h0 − v1h1) +
(7.27)
(v0h1 + v1h0) s
which by the Toom-Cook algorithm or inspection is
1 0
1 0
1
−1 0
v0
h0
0 1
o
0 1
(7.28) =
−
1 −1 1
v1
h1
1 1
1 1
y0
y1
where o signifies point-by-point multiplication. The total A matrix
in (7.3) is a combination of (7.25) and (7.28) giving
AX = A1A2A3X
(7.29)
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
65
1 0 0 0
1
1
0 0
1 0
1
0
u0
0 1 0 0
1 −1 0 0
0 1
0
1
u1
=
0 0 1 0
(7.30)
=
0
0
1 0
1 0 −1
0
u
2
0 0 0 1
0
0
0 1
0 1
0
−1
u3
0 0 1 1
r0
r
1
v
0
v1
where the matrix A3 gives the residue reduction s2 − 1 and s2 + 1,
the upper left-hand part of A2 gives the reduction modulo s − 1 and
s + 1, and the lower right-hand part of A1 carries out the Toom-
Cook algorithm modulo s2 + 1 with the multiplication in (7.5). No-
tice that by calculating (7.30) in the three stages, seven additions
are required. Also notice that A1 is not square. It is this “expan-
sion" that causes more than N multiplications to be required in o in
(7.5) or D in (7.6). This staged reduction will derive the A operator
for (7.5)
The method described above is very straight-forward for the
shorter DFT lengths. For N = 3, both of the residue polynomials
are constants and the multiplication given by o in (7.3) is trivial.
For N = 5, which is the example used here, there is one first degree
polynomial multiplication required but the Toom-Cook algorithm
uses simple constants and, therefore, works well as indicated in
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 7. WINOGRAD’S SHORT
66
DFT ALGORITHMS
(7.28). For N = 7, there are two first degree residue polynomials
which can each be multiplied by the same techniques used in the
N = 5 example. Unfortunately, for any longer lengths, the residue
polynomials have an order of three or greater which causes the
Toom-Cook algorithm to require constants of plus and minus two
and worse. For that reason, the Toom-Cook method is not used,
and other techniques such as index mapping are used that require
more than the minimum number of multiplications, but do not re-
quire an excessive number of additions. The resulting algorithms
still have the structure of (7.8). Blahut [29] and Nussbaumer [263]
have a good collection of algorithms for polynomial multiplication
that can be used with the techniques discussed here to construct a
wide variety of DFT algorithms.
The constants in the diagonal matrix D can be found from the CRT
matrix C in (7.5) using d = CT H’ for the diagonal terms in D.
As mentioned above, for the smaller prime lengths of 3, 5, and
7 this works well but for longer lengths the CRT becomes very
complicated. An alternate method for finding D uses the fact that
since the linear form (7.6) or (7.8) calculates the DFT, it is possible
to calculate a known DFT of a given x (n) from the definition of
the DFT in Multidimensional Index Mapping: Equation 1 (3.1)
and, given the A matrix in (7.8), solve for D by solving a set of
simultaneous equations. The details of this procedure are described
in [197].
A modification of this approach also works for a length which is
an odd prime raised to some power: N = PM. This is a bit more
complicated [235], [416] but has been done for lengths of 9 and
25. For longer lengths, the conventional Cooley-Tukey type- two
index map algorithm seems to be more efficient. For powers of
two, there is no primitive root, and therefore, no simple conversion
of the DFT into convolution. It is possible to use two generators
[235], [263], [408] to make the conversion and there exists a set of
length 4, 8, and 16 DFT algorithms of the form in (7.8) in [235].
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
67
In Table 7.1 an operation count of several short DFT algorithms is
presented. These are practical algorithms that can be used alone
or in conjunction with the index mapping to give longer DFT’s as
shown in The Prime Factor and Winograd Fourier Transform Al-
gorithms (Chapter 10). Most are optimized in having either the
theoretical minimum number of multiplications or the minimum
number of multiplications without requiring a very large number of
additions. Some allow other reasonable trade-offs between num-
bers of multiplications and additions. There are two lists of the
number of multiplications. The first is the number of actual float-
ing point multiplications that must be done for that length DFT.
Some of these (one or two in most cases) will be by rational con-
stants and the others will be by irrational constants. The second
list is the total number of multiplications given in the diagonal ma-
trix D in (7.8). At least one of these will be unity ( the one as-
sociated with X (0)) and in some cases several will be unity ( for
N = 2M ). The second list is important in programming the WFTA
in The Prime Factor and Winograd Fourier Transform Algorithm:
The Winograd Fourier Transform Algorithm (Section 10.2: The
Winograd Fourier Transform Algorithm).
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 7. WINOGRAD’S SHORT
68
DFT ALGORITHMS
Length N
Mult Non-one
Mult Total
Adds
2
0
4
4
3
4
6
12
4
0
8
16
5
10
12
34
7
16
18
72
8
4
16
52
9
20
22
84
11
40
42
168
13