Fast Fourier Transforms (6x9 Version) by C. Sidney Burrus - HTML preview

PLEASE NOTE: This is an HTML preview only and some elements such as links or page numbers may be incorrect.
Download the book in PDF, ePub for a complete version.

choice because a loop of small transforms is always required. Is

it better to implement a hard-coded FFT of size 64, for example,

or an unrolled loop of four size-16 FFTs, both of which operate

on the same amount of data? The former should be more efficient

because it performs more computations with the same amount of

data, thanks to the logn factor in the FFT’s nlogn complexity.)

10One practical difficulty is that some “optimizing” compilers will tend to

greatly re-order the code, destroying FFTW’s optimal schedule. With GNU

gcc, we circumvent this problem by using compiler flags that explicitly disable certain stages of the optimizer.

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

177

In addition, there are many other techniques that FFTW employs

to supplement the basic recursive strategy, mainly to address the

fact that cache implementations strongly favor accessing consec-

utive data—thanks to cache lines, limited associativity, and direct

mapping using low-order address bits (accessing data at power-of-

two intervals in memory, which is distressingly common in FFTs,

is thus especially prone to cache-line conflicts). Unfortunately, the

known FFT algorithms inherently involve some non-consecutive

access (whether mixed with the computation or in separate bit-

reversal/transposition stages). There are many optimizations in

FFTW to address this. For example, the data for several butter-

flies at a time can be copied to a small buffer before computing

and then copied back, where the copies and computations involve

more consecutive access than doing the computation directly in-

place. Or, the input data for the subtransform can be copied from

(discontiguous) input to (contiguous) output before performing the

subtransform in-place (see "Indirect plans" (Section 11.5.2.4: Indirect plans)), rather than performing the subtransform directly out-

of-place (as in algorithm 1 (p. ??)). Or, the order of loops can

be interchanged in order to push the outermost loop from the first

radix step [the 2 loop in (11.2)] down to the leaves, in order to

make the input access more consecutive (see "Discussion" (Sec-

tion 11.5.2.6: Discussion)). Or, the twiddle factors can be com-

puted using a smaller look-up table (fewer memory loads) at the

cost of more arithmetic (see "Numerical Accuracy in FFTs" (Sec-

tion 11.7: Numerical Accuracy in FFTs)). The choice of whether

to use any of these techniques, which come into play mainly for

moderate n (213 < n < 220), is made by the self-optimizing planner

as described in the next section.

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 11. IMPLEMENTING FFTS IN

178

PRACTICE

11.5 Adaptive Composition of FFT Algo-

rithms

As alluded to several times already, FFTW implements a wide va-

riety of FFT algorithms (mostly rearrangements of Cooley-Tukey)

and selects the “best” algorithm for a given n automatically. In this

section, we describe how such self-optimization is implemented,

and especially how FFTW’s algorithms are structured as a com-

position of algorithmic fragments. These techniques in FFTW are

described in greater detail elsewhere [134], so here we will focus

only on the essential ideas and the motivations behind them.

An FFT algorithm in FFTW is a composition of algorithmic steps

called a plan. The algorithmic steps each solve a certain class of

problems (either solving the problem directly or recursively break-

ing it into sub-problems of the same type). The choice of plan for

a given problem is determined by a planner that selects a compo-

sition of steps, either by runtime measurements to pick the fastest

algorithm, or by heuristics, or by loading a pre-computed plan.

These three pieces: problems, algorithmic steps, and the planner,

are discussed in the following subsections.

11.5.1 The problem to be solved

In early versions of FFTW, the only choice made by the planner

was the sequence of radices [131], and so each step of the plan took

a DFT of a given size n, possibly with discontiguous input/output,

and reduced it (via a radix r) to DFTs of size n/r, which were

solved recursively. That is, each step solved the following prob-

lem: given a size n, an input pointer I, an input stride ι, an out-

put pointer O, and an output stride o, it computed the DFT of

I [ ι] for 0 ≤

< n and stored the result in O [ko] for 0 ≤ k < n.

However, we soon found that we could not easily express many in-

teresting algorithms within this framework; for example, in-place

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

179

(I = O) FFTs that do not require a separate bit-reversal stage [195],

[375], [297], [166]. It became clear that the key issue was not the

choice of algorithms, as we had first supposed, but the definition of

the problem to be solved. Because only problems that can be ex-

pressed can be solved, the representation of a problem determines

an outer bound to the space of plans that the planner can explore,

and therefore it ultimately constrains FFTW’s performance.

The difficulty with our initial (n, I, ι, O, o) problem definition was

that it forced each algorithmic step to address only a single DFT. In

fact, FFTs break down DFTs into multiple smaller DFTs, and it is

the combination of these smaller transforms that is best addressed

by many algorithmic choices, especially to rearrange the order of

memory accesses between the subtransforms. Therefore, we rede-

fined our notion of a problem in FFTW to be not a single DFT, but

rather a loop of DFTs, and in fact multiple nested loops of DFTs.

The following sections describe some of the new algorithmic steps

that such a problem definition enables, but first we will define the

problem more precisely.

DFT problems in FFTW are expressed in terms of structures called

I/O tensors,11 which in turn are described in terms of ancillary

structures called I/O dimensions. An I/O dimension d is a triple

d = (n, ι, o), where n is a non-negative integer called the length,

ι is an integer called the input stride, and o is an integer called

the output stride. An I/O tensor t = {d1, d2, ..., d } is a set of I/O

ρ

dimensions. The non-negative integer ρ = |t| is called the rank

of the I/O tensor. A DFT problem, denoted by d f t (N, V, I, O),

consists of two I/O tensors N and V, and of two pointers I and

O. Informally, this describes |V| nested loops of |N|-dimensional

DFTs with input data starting at memory location I and output data

starting at O.

11I/O tensors are unrelated to the tensor-product notation used by some other authors to describe FFT algorithms [389], [296].

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 11. IMPLEMENTING FFTS IN

180

PRACTICE

For simplicity, let us consider only one-dimensional DFTs, so that

N = {(n, ι, o)} implies a DFT of length n on input data with stride

ι and output data with stride o, much like in the original FFTW as

described above. The main new feature is then the addition of zero

or more “loops” V. More formally, d f t (N, {(n, ι, o)} ∪ V, I, O)

is recursively defined as a “loop” of n problems: for all 0 ≤ k <

n, do all computations in d f t (N, V, I + k · ι, O + k · o). The case

of multi-dimensional DFTs is defined more precisely elsewhere

[134], but essentially each I/O dimension in N gives one dimen-

sion of the transform.

We call N the size of the problem. The rank of a problem is de-

fined to be the rank of its size (i.e., the dimensionality of the DFT).

Similarly, we call V the vector size of the problem, and the vector

rank of a problem is correspondingly defined to be the rank of its

vector size. Intuitively, the vector size can be interpreted as a set

of “loops” wrapped around a single DFT, and we therefore refer to

a single I/O dimension of V as a vector loop. (Alternatively, one

can view the problem as describing a DFT over a |V|-dimensional

vector space.) The problem does not specify the order of execution

of these loops, however, and therefore FFTW is free to choose the

fastest or most convenient order.

11.5.1.1 DFT problem examples

A more detailed discussion of the space of problems in FFTW can

be found in [134] , but a simple understanding can be gained by

examining a few examples demonstrating that the I/O tensor repre-

sentation is sufficiently general to cover many situations that arise

in practice, including some that are not usually considered to be

instances of the DFT.

A single one-dimensional DFT of length n, with stride-1 in-

put X and output Y, as in (11.1), is denoted by the problem

d f t ({(n, 1, 1)}, {}, X, Y) (no loops: vector-rank zero).

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

181

As a more complicated example, suppose we have an n1 × n2

matrix X stored as n1 consecutive blocks of contiguous length-

n2 rows (this is called row-major format). The in-place DFT

of all the rows of this matrix would be denoted by the prob-

lem d f t ({(n2, 1, 1)}, {(n1, n2, n2)}, X, X): a length-n1 loop of size-

n2 contiguous DFTs, where each iteration of the loop offsets

its input/output data by a stride n2.

Conversely, the in-place

DFT of all the columns of this matrix would be denoted by

d f t ({(n1, n2, n2)}, {(n2, 1, 1)}, X, X): compared to the previous

example, N and V are swapped. In the latter case, each DFT

operates on discontiguous data, and FFTW might well choose to

interchange the loops: instead of performing a loop of DFTs com-

puted individually, the subtransforms themselves could act on n2-

component vectors, as described in "The space of plans in FFTW"

(Section 11.5.2: The space of plans in FFTW).

A size-1 DFT is simply a copy Y [0] = X [0], and here this can

also be denoted by N = {} (rank zero, a “zero-dimensional”

DFT). This allows FFTW’s problems to represent many kinds

of copies and permutations of the data within the same prob-

lem framework, which is convenient because these sorts of

operations arise frequently in FFT algorithms.

For exam-

ple, to copy n consecutive numbers from I to O, one would

use the rank-zero problem d f t ({}, {(n, 1, 1)}, I, O).

More in-

terestingly, the in-place transpose of an n1 × n2 matrix X

stored in row-major format, as described above, is denoted

by d f t ({}, {(n1, n2, 1) , (n2, 1, n1)}, X, X) (rank zero, vector-rank

two).

11.5.2 The space of plans in FFTW

Here, we describe a subset of the possible plans considered by

FFTW; while not exhaustive [134], this subset is enough to illus-

trate the basic structure of FFTW and the necessity of including

the vector loop(s) in the problem definition to enable several inter-

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 11. IMPLEMENTING FFTS IN

182

PRACTICE

esting algorithms. The plans that we now describe usually perform

some simple “atomic” operation, and it may not be apparent how

these operations fit together to actually compute DFTs, or why cer-

tain operations are useful at all. We shall discuss those matters in

"Discussion" (Section 11.5.2.6: Discussion).

Roughly speaking, to solve a general DFT problem, one must per-

form three tasks. First, one must reduce a problem of arbitrary vec-

tor rank to a set of loops nested around a problem of vector rank

0, i.e., a single (possibly multi-dimensional) DFT. Second, one

must reduce the multi-dimensional DFT to a sequence of of rank-1

problems, i.e., one-dimensional DFTs; for simplicity, however, we

do not consider multi-dimensional DFTs below. Third, one must

solve the rank-1, vector rank-0 problem by means of some DFT

algorithm such as Cooley-Tukey. These three steps need not be

executed in the stated order, however, and in fact, almost every

permutation and interleaving of these three steps leads to a correct

DFT plan. The choice of the set of plans explored by the planner is

critical for the usability of the FFTW system: the set must be large

enough to contain the fastest possible plans, but it must be small

enough to keep the planning time acceptable.

11.5.2.1 Rank-0 plans

The rank-0 problem d f t ({}, V, I, O) denotes a permutation of the

input array into the output array. FFTW does not solve arbitrary

rank-0 problems, only the following two special cases that arise in

practice.

• When |V| = 1 and I = O, FFTW produces a plan that copies

the input array into the output array.

Depending on the

strides, the plan consists of a loop or, possibly, of a call to

the ANSI C function ♠❡♠❝♣②, which is specialized to copy

contiguous regions of memory.

• When |V| = 2, I = O, and the strides denote a matrix-

transposition problem, FFTW creates a plan that transposes

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

183

the array in-place. FFTW implements the square transposi-

tion d f t ({}, {(n, ι, o) , (n, o, ι)}, I, O) by means of the cache-

oblivious algorithm from [137], which is fast and, in theory,

uses the cache optimally regardless of the cache size (using

principles similar to those described in the section "FFTs and

the Memory Hierarchy" (Section 11.4: FFTs and the Mem-

ory Hierarchy)). A generalization of this idea is employed

for non-square transpositions with a large common factor or

a small difference between the dimensions, adapting algo-

rithms from [100].

11.5.2.2 Rank-1 plans

Rank-1 DFT problems denote ordinary one-dimensional Fourier

transforms. FFTW deals with most rank-1 problems as follows.

11.5.2.2.1 Direct plans

When the DFT rank-1 problem is “small enough” (usually, n ≤ 64),

FFTW produces a direct plan that solves the problem directly.

These plans operate by calling a fragment of C code (a codelet)

specialized to solve problems of one particular size, whose genera-

tion is described in "Generating Small FFT Kernels" (Section 11.6:

Generating Small FFT Kernels). More precisely, the codelets com-

pute a loop (|V| ≤ 1) of small DFTs.

11.5.2.2.2 Cooley-Tukey plans

For problems of the form d f t ({(n, ι, o)}, V, I, O) where n = rm,

FFTW generates a plan that implements a radix-r Cooley-Tukey

algorithm "Review of the Cooley-Tukey FFT" (Section 11.2: Re-

view of the Cooley-Tukey FFT). Both decimation-in-time and

decimation-in-frequency plans are supported, with both small fixed

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 11. IMPLEMENTING FFTS IN

184

PRACTICE

radices (usually, r ≤ 64) produced by the codelet generator "Gen-

erating Small FFT Kernels" (Section 11.6: Generating Small FFT

Kernels) and also arbitrary radices (e.g. radix- n).

The most common case is a decimation in time (DIT)

plan,

corresponding to a radix r = n2 (and thus m =

n1) in the notation of "Review of the Cooley-Tukey FFT"

(Section 11.2:

Review of the Cooley-Tukey FFT): it first

solves d f t ({(m, r · ι, o)}, V ∪ {(r, ι, m · o)}, I, O), then multiplies

the output array O by the twiddle factors, and finally solves

d f t ({(r, m · o, m · o)}, V ∪ {(m, o, o)}, O, O). For performance, the

last two steps are not planned independently, but are fused together

in a single “twiddle” codelet—a fragment of C code that multiplies

its input by the twiddle factors and performs a DFT of size r, oper-

ating in-place on O.

11.5.2.3 Plans for higher vector ranks

These plans extract a vector loop to reduce a DFT problem to a

problem of lower vector rank, which is then solved recursively.

Any of the vector loops of V could be extracted in this way, lead-

ing to a number of possible plans corresponding to different loop

orderings.

Formally, to solve d f t (N, V, I, O), where V = {(n, ι, o)} ∪ V1,

FFTW generates a loop that, for all k such that 0 ≤ k < n, invokes

a plan for d f t (N, V1, I + k · ι, O + k · o).

11.5.2.4 Indirect plans

Indirect plans transform a DFT problem that requires some data

shuffling (or discontiguous operation) into a problem that requires

no shuffling plus a rank-0 problem that performs the shuffling.

Formally, to solve d f t (N, V, I, O) where |N| > 0, FFTW gener-

ates a plan that first solves d f t ({}, N ∪ V, I, O), and then solves

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

185

d f t (copy − o (N) , copy − o (V) , O, O).

Here we define copy −

o (t) to be the I/O tensor {(n, o, o) | (n, ι, o) ∈ t}: that is, it replaces

the input strides with the output strides. Thus, an indirect plan first

rearranges/copies the data to the output, then solves the problem in

place.

11.5.2.5 Plans for prime sizes

As discussed in "Goals and Background of the FFTW Project"

(Section 11.3: Goals and Background of the FFTW Project), it

turns out to be surprisingly useful to be able to handle large prime

n (or large prime factors). Rader plans implement the algorithm

from [309] to compute one-dimensional DFTs of prime size in

Θ (nlogn) time. Bluestein plans implement Bluestein’s “chirp-z”

algorithm, which can also handle prime n in Θ (nlogn) time [35],

[305], [278]. Generic plans implement a naive Θ n2 algorithm

(useful for n

100).

11.5.2.6 Discussion

Although it may not be immediately apparent, the combination

of the recursive rules in "The space of plans in FFTW" (Sec-

tion 11.5.2: The space of plans in FFTW) can produce a number

of useful algorithms. To illustrate these compositions, we discuss

three particular issues: depth- vs. breadth-first, loop reordering,

and in-place transforms.

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

index-192_1.png

CHAPTER 11. IMPLEMENTING FFTS IN

186

PRACTICE

Figure 11.3: Two possible decompositions for a size-30 DFT,

both for the arbitrary choice of DIT radices 3 then 2 then

5, and prime-size codelets. Items grouped by a "{" result

from the plan for a single sub-problem. In the depth-first

case, the vector rank was reduced to zero as per "Plans for

higher vector ranks" (Section 11.5.2.3: Plans for higher vec-

tor ranks) before decomposing sub-problems, and vice-versa

in the breadth-first case.

As discussed previously in sections "Review of the Cooley-Tukey

FFT" (Section 11.2: Review of the Cooley-Tukey FFT) and "Un-

derstanding FFTs with an ideal cache" (Section 11.4.1: Under-

standing FFTs with an ideal cache), the same Cooley-Tukey de-

composition can be executed in either traditional breadth-first or-

der or in recursive depth-first order, where the latter has some

theoretical cache advantages. FFTW is explicitly recursive, and

thus it can naturally employ a depth-first order. Because its sub-

problems contain a vector loop that can be executed in a variety of

orders, however, FFTW can also employ breadth-first traversal. In

particular, a 1d algorithm resembling the traditional breadth-first

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

187

Cooley-Tukey would result from applying "Cooley-Tukey plans"

(Section 11.5.2.2.2: Cooley-Tukey plans) to completely factorize

the problem size before applying the loop rule "Plans for higher

vector ranks" (Section 11.5.2.3: Plans for higher vector ranks) to

reduce the vector ranks, whereas depth-first traversal would result

from applying the loop rule before factorizing each subtransform.

These two possibilities are illustrated by an example in Figure 11.3.

Another example of the effect of loop reordering is a style of plan

that we sometimes call vector recursion (unrelated to “vector-

radix” FFTs [114]). The basic idea is that, if one has a loop

(vector-rank 1) of transforms, where the vector stride is smaller

than the transform size, it is advantageous to push the loop towards

the leaves of the transform decomposition, while otherwise main-

taining recursive depth-first ordering, rather than looping “outside”

the transform; i.e., apply the usual FFT to “vectors” rather than

numbers. Limited forms of this idea have appeared for computing

multiple FFTs on vector processors (where the loop in question

maps directly to a hardware vector) [372]. For example, Cooley-

Tukey produces a unit input-stride vector loop at the top-level DIT

decomposition, but with a large output stride; this difference in

strides makes it non-obvious whether vector recursion is advanta-

geous for the sub-problem, but for large transforms we often ob-

serve the planner to choose this possibility.

In-place 1d transforms (with no separate bit reversal pass) can be

obtained as follows by a combination DIT and DIF plans "Cooley-

Tukey plans" (Section 11.5.2.2.2: Cooley-Tukey plans) with trans-

poses "Rank-0 plans" (Section 11.5.2.1: Rank-0 plans). First, the

transform is decomposed via a radix-p DIT plan into a vector of

p transforms of size qm, then these are decomposed in turn by a

radix-q DIF plan into a vector (rank 2) of p × q transforms of size

m. These transforms of size m have input and output at differ-

ent places/strides in the original array, and so cannot be solved

independently.

Instead, an indirect plan "Indirect plans" (Sec-

Available for free at Connexions

<http://cnx.org/content/col10683/1.5>

CHAPTER 11. IMPLEMENTING FFTS IN

188

PRACTICE

tion 11.5.2.4: Indirect plans) is used to express the sub-problem

as pq in-place transforms of size m, followed or preceded by an

m × p×q rank-0 transform. The latter sub-problem is easily seen to

be m in-place p × q transposes (ideally square, i.e. p = q). Related

strategies for in-place transforms based on small transposes were

described in [195], [375], [297], [166]; alternating DIT/DIF, with-

out concern for in-place operation, was also considered in [255],

[322].

11.5.3 The FFTW planner

Given a problem and a set of possible plans, the basic principle

behind the FFTW planner is straightforward: construct a plan for

each applicable algorithmic step, time the execution of these plans,

and select the fastest one. Each algorithmic step may break the

problem into subproblems, and the fastest plan for each subprob-

lem is constructed in the same way. These timing measurements

can either be performed at runtime, or alternatively the plans for a

given set of sizes can be precomputed and loaded at a later time.

A direct implementation of this approach, however, faces an expo-

nential explosion of the number of possible plans, and hence of the

planning time, as n increases. In order to reduce the planning time

to a manageable level, we employ several heuristics to reduce the

space of possible plans that must be compared. The most important

of these heuristics is dynamic programming [96]: it optimizes

each sub-problem locally, independently of the larger context (so

that the “best” plan for a given sub-problem is re-used whenever

that sub-problem is encountered). Dynamic programming is not