Fast Fourier Transforms (6x9 Version) by C. Sidney Burrus - HTML preview

PLEASE NOTE: This is an HTML preview only and some elements such as links or page numbers may be incorrect.
Download the book in PDF, ePub for a complete version.

Chapter 10

The Prime Factor and

Winograd Fourier Transform

Algorithms1

The prime factor algorithm (PFA) and the Winograd Fourier trans-

form algorithm (WFTA) are methods for efficiently calculating the

DFT which use, and in fact, depend on the Type-1 index map from

Multidimensional Index Mapping: Equation 10 (3.10) and Mul-

tidimensional Index Mapping: Equation 6 (3.6). The use of this

index map preceded Cooley and Tukey’s paper [150], [302] but its

full potential was not realized until it was combined with Wino-

grad’s short DFT algorithms. The modern PFA was first presented

in [213] and a program given in [57]. The WFTA was first pre-

sented in [407] and programs given in [236], [83].

The number theoretic basis for the indexing in these algorithms

may, at first, seem more complicated than in the Cooley-Tukey

FFT; however, if approached from the general index mapping point

of view of Multidimensional Index Mapping (Chapter 3), it is

straightforward, and part of a common approach to breaking large

1This content is available online at <http://cnx.org/content/m16335/1.9/>.

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

137

CHAPTER 10. THE PRIME FACTOR

138

AND WINOGRAD FOURIER

TRANSFORM ALGORITHMS

problems into smaller ones. The development in this section will

parallel that in The Cooley-Tukey Fast Fourier Transform Algo-

rithm (Chapter 9).

The general index maps of Multidimensional Index Mapping:

Equation 6 (3.6) and Multidimensional Index Mapping: Equa-

tion 12 (3.12) must satisfy the Type-1 conditions of Multidimen-

sional Index Mapping: Equation 7 (3.7) and Multidimensional In-

dex Mapping: Equation 10 (3.10) which are

K1 = aN2 and K2 = bN1 with (K1, N1) = (K2, N2) = 1 (10.1)

K3 = cN2 and K4 = dN1 with (K3, N1) = (K4, N2) = 1 (10.2)

The row and column calculations in Multidimensional Index Map-

ping: Equation 15 (3.15) are uncoupled by Multidimensional Index

Mapping: Equation 16 (3.16) which for this case are

((K1K4)) = ((K

= 0

(10.3)

N

2K3))N

In addition, to make each short sum a DFT, the Ki must also satisfy

((K1K3)) = N

= N

N

2

and ((K2K4))N

1

(10.4)

In order to have the smallest values for Ki, the constants in (10.1)

are chosen to be

a = b = 1, c =

N−1

, d =

N−1

(10.5)

2

N

1

N

which gives for the index maps in (10.1)

n = ((N2n1 + N1n2))

(10.6)

N

k = ((K3k1 + K4k2))

(10.7)

N

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

139

The frequency index map is a form of the Chinese remainder theo-

rem. Using these index maps, the DFT in Multidimensional Index

Mapping: Equation 15 (3.15) becomes

N2−1N1−1

X = ∑ ∑ x W n1k1 W n2k2

(10.8)

N1

N2

n2=0 n1=0

which is a pure two-dimensional DFT with no twiddle factors and

the summations can be done in either order. Choices other than

(10.5) could be used. For example, a = b = c = d = 1 will cause

the input and output index map to be the same and, therefore, there

will be no scrambling of the output order. The short summations

in (96), however, will no longer be short DFT’s [57].

An important feature of the short Winograd DFT’s described in

Winograd’s Short DFT Algorithms (Chapter 7) that is useful for

both the PFA and WFTA is the fact that the multiplier constants

in Winograd’s Short DFT Algorithms: Equation 6 (7.6) or Wino-

grad’s Short DFT Algorithms: Equation 8 (7.8) are either real or

imaginary, never a general complex number. For that reason, mul-

tiplication by complex data requires only two real multiplications,

not four. That is a very significant feature. It is also true that the

j multiplier can be commuted from the D operator to the last part

of the AT operator. This means the D operator has only real mul-

tipliers and the calculations on real data remains real until the last

stage. This can be seen by examining the short DFT modules in

[65], [198] and in the appendices.

10.1 The Prime Factor Algorithm

If the DFT is calculated directly using (10.8), the algorithm is

called a prime factor algorithm [150], [302] and was discussed

in Winograd’s Short DFT Algorithms (Chapter 7) and Multidi-

mensional Index Mapping: In-Place Calculation of the DFT and

Scrambling (Section 3.2: In-Place Calculation of the DFT and

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 10. THE PRIME FACTOR

140

AND WINOGRAD FOURIER

TRANSFORM ALGORITHMS

Scrambling). When the short DFT’s are calculated by the very

efficient algorithms of Winograd discussed in Factoring the Signal

Processing Operators (Chapter 6), the PFA becomes a very pow-

erful method that is as fast or faster than the best Cooley-Tukey

FFT’s [57], [213].

A flow graph is not as helpful with the PFA as it was with the

Cooley-Tukey FFT, however, the following representation in Fig-

ure 10.1 which combines Figures Multidimensional Index Map-

ping: Figure 1 (Figure 3.1) and Winograd’s Short DFT Algorithms:

Figure 2 (Figure 7.2) gives a good picture of the algorithm with the

example of Multidimensional Index Mapping: Equation 25 (3.25)

10

5

5

10

11

0

0

8

1

2

3

6

11

7

8

6

12

14

13

14

9

3

2

4

12

9

Figure 10.1: A Prime Factor FFT for N = 15

If N is factored into three factors, the DFT of (10.8) would have

three nested summations and would be a three-dimensional DFT.

This principle extends to any number of factors; however, recall

that the Type-1 map requires that all the factors be relatively prime.

A very simple three-loop indexing scheme has been developed [57]

which gives a compact, efficient PFA program for any number of

factors. The basic program structure is illustrated in p. ?? with

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

141

the short DFT’s being omitted for clarity. Complete programs are

given in [65] and in the appendices.

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 10. THE PRIME FACTOR

142

AND WINOGRAD FOURIER

TRANSFORM ALGORITHMS

❈✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲P❋❆ ■◆❉❊❳■◆● ▲❖❖P❙✲✲✲✲✲✲✲✲✲✲✲✲✲✲

❉❖ ✶✵ ❑ ❂ ✶✱ ▼

◆✶ ❂ ◆■✭❑✮

◆✷ ❂ ◆✴◆✶

■✭✶✮ ❂ ✶

❉❖ ✷✵ ❏ ❂ ✶✱ ◆✷

❉❖ ✸✵ ▲❂✷✱ ◆✶

■✭▲✮ ❂ ■✭▲✲✶✮ ✰ ◆✷

■❋ ✭■✭▲ ✳●❚✳◆✮ ■✭▲✮ ❂ ■✭▲✮ ✲ ◆

✸✵

❈❖◆❚■◆❯❊

●❖❚❖ ✭✷✵✱✶✵✷✱✶✵✸✱✶✵✹✱✶✵✺✮✱ ◆✶

■✭✶✮ ❂ ■✭✶✮ ✰ ◆✶

✷✵

❈❖◆❚■◆❯❊

✶✵

❈❖◆❚■◆❯❊

❘❊❚❯❘◆

❈✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲▼❖❉❯▲❊ ❋❖❘ ◆❂✷✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲

✶✵✷

❘✶

❂ ❳✭■✭✶✮✮

❳✭■✭✶✮✮ ❂ ❘✶ ✰ ❳✭■✭✷✮✮

❳✭■✭✷✮✮ ❂ ❘✶ ✲ ❳✭■✭✷✮✮

❘✶

❂ ❨✭■✭✶✮✮

❨✭■✭✶✮✮ ❂ ❘✶ ✰ ❨✭■✭✷✮✮

❨✭■✭✷✮✮ ❂ ❘✶ ✲ ❨✭■✭✷✮✮

●❖❚❖ ✷✵

❈✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲❖❚❍❊❘ ▼❖❉❯▲❊❙✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲✲

✶✵✸

▲❡♥❣t❤✲✸ ❉❋❚

✶✵✹

▲❡♥❣t❤✲✹ ❉❋❚

✶✵✺

▲❡♥❣t❤✲✺ ❉❋❚

❡t❝✳

Listing 10.1: Part of a FORTRAN PFA Program

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

143

As in the Cooley-Tukey program, the DO 10 loop steps through

the M stages (factors of N) and the DO 20 loop calculates the N/N1

length-N1 DFT’s. The input index map of (10.6) is implemented in

the DO 30 loop and the statement just before label 20. In the PFA,

each stage or factor requires a separately programmed module or

butterfly. This lengthens the PFA program but an efficient Cooley-

Tukey program will also require three or more butterflies.

Because the PFA is calculated in-place using the input index map,

the output is scrambled. There are five approaches to dealing with

this scrambled output. First, there are some applications where the

output does not have to be unscrambled as in the case of high-speed

convolution. Second, an unscrambler can be added after the PFA

to give the output in correct order just as the bit-reversed-counter

is used for the Cooley-Tukey FFT. A simple unscrambler is given

in [65], [57] but it is not in place. The third method does the un-

scrambling in the modules while they are being calculated. This is

probably the fastest method but the program must be written for a

specific length [65], [57]. A fourth method is similar and achieves

the unscrambling by choosing the multiplier constants in the mod-

ules properly [198]. The fifth method uses a separate indexing

method for the input and output of each module [65], [320].

10.2 The Winograd Fourier Transform Al-

gorithm

The Winograd Fourier transform algorithm (WFTA) uses a very

powerful property of the Type-1 index map and the DFT to give

a further reduction of the number of multiplications in the PFA.

Using an operator notation where F1 represents taking row DFT’s

and F2 represents column DFT’s, the two-factor PFA of (10.8) is

represented by

X = F2 F1 x

(10.9)

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 10. THE PRIME FACTOR

144

AND WINOGRAD FOURIER

TRANSFORM ALGORITHMS

It has been shown [410], [190] that if each operator represents

identical operations on each row or column, they commute. Since

F1 and F2 represent length N1 and N2 DFT’s, they commute and

(10.9) can also be written

X = F1 F2 x

(10.10)

If each short DFT in F is expressed by three operators as in Wino-

grad’s Short DFT Algorithms: Equation 8 (7.8) and Winograd’s

Short DFT Algorithms: Figure 2 (Figure 7.2), F can be factored as

F = AT DA

(10.11)

where A represents the set of additions done on each row or col-

umn that performs the residue reduction as Winograd’s Short DFT

Algorithms: Equation 30 (7.30). Because of the appearance of the

flow graph of A and because it is the first operator on x, it is called a

preweave operator [236]. D is the set of M multiplications and AT

(or BT or CT ) from Winograd’s Short DFT Algorithms: Equation 5

(7.5) or Winograd’s Short DFT Algorithms: Equation 6 (7.6) is the

reconstruction operator called the postweave. Applying (10.11) to

(10.9) gives

X = AT2 D2 A2 AT1 D1 A1 x

(10.12)

This is the PFA of (10.8) and Figure 10.1 where A1D1A1 represents

the row DFT’s on the array formed from x. Because these operators

commute, (10.12) can also be written as

X = AT2 AT1 D2 D1 A2 A1 x

(10.13)

or

X = AT1 AT2 D2 D1 A2 A1 x

(10.14)

but the two adjacent multiplication operators can be premultiplied

and the result represented by one operator D = D2 D1 which is no

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

145

longer the same for each row or column. Equation (10.14) becomes

X = AT1 AT2 D A2 A1 x

(10.15)

This is the basic idea of the Winograd Fourier transform algorithm.

The commuting of the multiplication operators together in the cen-

ter of the algorithm is called nesting and it results in a significant

decrease in the number of multiplications that must be done at the

execution of the algorithm. Pictorially, the PFA of Figure 10.1 be-

comes [213] the WFTA in Figure 10.2.

Figure 10.2: A Length-15 WFTA with Nested Multiplica-

tions

The rectangular structure of the preweave addition operators

causes an expansion of the data in the center of the algorithm. The

15 data points in Figure 10.2 become 18 intermediate values. This

expansion is a major problem in programming the WFTA because

it prevents a straightforward in-place calculation and causes an in-

crease in the number of required additions and in the number of

multiplier constants that must be precalculated and stored.

From Figure 10.2 and the idea of premultiplying the individual

multiplication operators, it can be seen why the multiplications by

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 10. THE PRIME FACTOR

146

AND WINOGRAD FOURIER

TRANSFORM ALGORITHMS

unity had to be considered in Winograd’s Short DFT Algorithms:

Table 1 (Table 7.1). Even if a multiplier in D1 is unity, it may not

be in D2D1. In Figure 10.2 with factors of three and five, there

appear to be 18 multiplications required because of the expansion

of the length-5 preweave operator, A2, however, one of multipliers

in each of the length three and five operators is unity, so one of

the 18 multipliers in the product is unity. This gives 17 required

multiplications - a rather impressive reduction from the 152 = 225

multiplications required by direct calculation. This number of 17

complex multiplications will require only 34 real multiplications

because, as mentioned earlier, the multiplier constants are purely

real or imaginary while the 225 complex multiplications are gen-

eral and therefore will require four times as many real multiplica-

tions.

The number of additions depends on the order of the pre- and

postweave operators. For example in the length-15 WFTA in Fig-

ure 10.2, if the length-5 had been done first and last, there would

have been six row addition preweaves in the preweave operator

rather than the five shown. It is difficult to illustrate the algorithm

for three or more factors of N, but the ideas apply to any number

of factors. Each length has an optimal ordering of the pre- and

postweave operators that will minimize the number of additions.

A program for the WFTA is not as simple as for the FFT or PFA

because of the very characteristic that reduces the number of mul-

tiplications, the nesting. A simple two-factor example program is

given in [65] and a general program can be found in [236], [83].

The same lengths are possible with the PFA and WFTA and the

same short DFT modules can be used, however, the multiplies in

the modules must occur in one place for use in the WFTA.

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

147

10.3 Modifications of the PFA and WFTA

Type Algorithms

In the previous section it was seen how using the permutation prop-

erty of the elementary operators in the PFA allowed the nesting of

the multiplications to reduce their number. It was also seen that

a proper ordering of the operators could minimize the number of

additions. These ideas have been extended in formulating a more

general algorithm optimizing problem. If the DFT operator F in

(10.11) is expressed in a still more factored form obtained from

Winograd’s Short DFT Algorithms: Equation 30 (7.30), a greater

variety of ordering can be optimized. For example if the A opera-

tors have two factors

F1 = AT1 A’T

1

D1 A’1A1

(10.16)

The DFT in (10.10) becomes

X = AT

T

T

2 A’2 D2A’2A2AT

1 A’1 D1A’1A1x

(10.17)

The operator notation is very helpful in understanding the central

ideas, but may hide some important facts. It has been shown [410],

[198] that operators in different Fi commute with each other, but

the order of the operators within an Fi cannot be changed. They

represent the matrix multiplications in Winograd’s Short DFT Al-

gorithms: Equation 30 (7.30) or Winograd’s Short DFT Algo-

rithms: Equation 8 (7.8) which do not commute.

This formulation allows a very large set of possible orderings, in

fact, the number is so large that some automatic technique must

be used to find the “best". It is possible to set up a criterion of

optimality that not only includes the number of multiplications but

the number of additions as well. The effects of relative multiply-

add times, data transfer times, CPU register and memory sizes,

and other hardware characteristics can be included in the criterion.

Dynamic programming can then be applied to derive an optimal

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 10. THE PRIME FACTOR

148

AND WINOGRAD FOURIER

TRANSFORM ALGORITHMS

algorithm for a particular application [190]. This is a very interest-

ing idea as there is no longer a single algorithm, but a class and an

optimizing procedure. The challenge is to generate a broad enough

class to result in a solution that is close to a global optimum and to

have a practical scheme for finding the solution.

Results obtained applying the dynamic programming method to

the design of fairly long DFT algorithms gave algorithms that

had fewer multiplications and additions than either a pure PFA

or WFTA [190]. It seems that some nesting is desirable but not

total nesting for four or more factors. There are also some interest-

ing possibilities in mixing the Cooley-Tukey with this formulation.

Unfortunately, the twiddle factors are not the same for all rows and

columns, therefore, operations cannot commute past a twiddle fac-

tor operator. There are ways of breaking the total algorithm into

horizontal paths and using different orderings along the different

paths [264], [198]. In a sense, this is what the split-radix FFT does

with its twiddle factors when compared to a conventional Cooley-

Tukey FFT.

There are other modifications of the basic structure of the Type-1

index map DFT algorithm. One is to use the same index struc-

ture and conversion of the short DFT’s to convolution as the PFA

but to use some other method for the high-speed convolution. Ta-

ble look-up of partial products based on distributed arithmetic to

eliminate all multiplications [78] looks promising for certain very

specific applications, perhaps for specialized VLSI implementa-

tion. Another possibility is to calculate the short convolutions us-

ing number-theoretic transforms [30], [236], [264]. This would

also require special hardware. Direct calculation of short convolu-

tions is faster on certain pipelined processor such as the TMS-320

microprocessor [216].

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

149

10.4 Evaluation of the PFA and WFTA

As for the Cooley-Tukey FFT’s, the first evaluation of these al-

gorithms will be on the number of multiplications and additions

required. The number of multiplications to compute the PFA in

(10.8) is given by Multidimensional Index Mapping: Equation 3

(3.3). Using the notation that T (N ) is the number of multiplica-

tions or additions necessary to calculate a length-N DFT, the total

number for a four-factor PFA of length-N, where N = N1N2N3N4

is

T (N) = N1N2N3T (N4) + N2N3N4T (N1) + (10.18)

N3N4N1T (N2) + N4N1N2T (N3)

The count of multiplies and adds in Table 10.1 are calculated from

(105) with the counts of the factors taken from Winograd’s Short

DFT Algorithms: Table 1 (Table 7.1). The list of lengths are those

possible with modules in the program of length 2, 3, 4, 5, 7, 8,

9 and 16 as is true for the PFA in [65], [57] and the WFTA in

[236], [83]. A maximum of four relatively prime lengths can be

used from this group giving 59 different lengths over the range

from 2 to 5040. The radix-2 or split-radix FFT allows 12 different

lengths over the same range. If modules of length 11 and 13 from

[188] are added, the maximum length becomes 720720 and the

number of different lengths becomes 239. Adding modules for 17,

19 and 25 from [188] gives a maximum length of 1163962800 and

a very large and dense number of possible lengths. The length of

the code for the longer modules becomes excessive and should not

be included unless needed.

The number of multiplications necessary for the WFTA is simply

the product of those necessary for the required modules, includ-

ing multiplications by unity. The total number may contain some

unity multipliers but it is difficult to remove them in a practical

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 10. THE PRIME FACTOR

150

AND WINOGRAD FOURIER

TRANSFORM ALGORITHMS

program. Table 10.1 contains both the total number (MULTS) and

the number with the unity multiplies removed (RMULTS).

Calculating the number of additions for the WFTA is more compli-

cated than for the PFA because of the expansion of the data moving

through the algorithm. For example the number of additions, TA,

for the length-15 example in Figure 10.2 is given by

TA (N) = N2TA (N1) + T M1TA (N2)

(10.19)

where N1 = 3, N2 = 5, T M1 = the number of multiplies for the

length-3 module and hence the expansion factor. As mentioned

earlier there is an optimum ordering to minimize additions. The

ordering used to calculate Table 10.1 is the ordering used in [236],

[83] which is optimal in most cases and close to optimal in the

others.

Length

PFA

PFA

WFTA

WFTA

WFTA

N

Mults

Adds

Mults

RMults

Adds

10

20

88

24

20

88

12

16

96

24

16

96

14

32

172

36

32

172<