13.1 Fast Convolution by the FFT
One of the main applications of the FFT is to do convolution more
efficiently than the direct calculation from the definition, which is:

y (n) = ∑ h(m) x(n − m)    (13.1)

which, with a change of variables, can also be written as:

y (n) = ∑ x(m) h(n − m)    (13.2)
This is often used to filter a signal x (n) with a filter whose impulse
response is h (n). Each output value y (n) requires N multiplications
and N − 1 additions if y (n) and h (n) have N terms. So, for N output
values, on the order of N² arithmetic operations are required.
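A direct evaluation of (13.1) in NumPy makes the N² operation count explicit. This is a minimal sketch; the function name and the test signals are illustrative, not from the text:

```python
import numpy as np

def direct_convolution(h, x):
    """Direct evaluation of y(n) = sum_m h(m) x(n - m), eq. (13.1)."""
    N = len(x)
    M = len(h)
    y = np.zeros(N + M - 1)
    for n in range(len(y)):          # each output needs on the order of M multiply-adds
        for m in range(M):
            if 0 <= n - m < N:
                y[n] += h[m] * x[n - m]
    return y

h = np.array([1.0, 0.5, 0.25])
x = np.array([1.0, 2.0, 3.0, 4.0])
y = direct_convolution(h, x)
assert np.allclose(y, np.convolve(h, x))
```

The double loop is exactly the nested sum of (13.1), which is what the FFT approach below avoids.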
The DFT converts convolution to multiplication:

DFT {y (n)} = DFT {h (n)} DFT {x (n)}    (13.3)
¹ This content is available online at <http://cnx.org/content/m16339/1.10/>.
Available for free at Connexions
<http://cnx.org/content/col10683/1.5>
CHAPTER 13. CONVOLUTION ALGORITHMS
Equation (13.3) can be calculated with the FFT, bringing the order of
arithmetic operations down to N log(N), which can be significant for
large N.
This approach, which is called “fast convolution”, is a form of block
processing since a whole block or segment of x (n) must be available
to calculate even one output value, y (n). So, a time delay of one
block length is always required. Another problem is that the filtering
use of convolution is usually non-cyclic, while the convolution
implemented with the DFT is cyclic. This is dealt with by appending
zeros to x (n) and h (n) such that the output of the cyclic convolution
gives one block of the output of the desired non-cyclic convolution.
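The zero-padding trick can be sketched in a few lines of NumPy: padding both sequences to the full output length before the FFTs makes the cyclic convolution computed by the DFT equal the desired non-cyclic one. The function name and the test signals are illustrative:

```python
import numpy as np

def fft_convolution(h, x):
    """Non-cyclic (linear) convolution via the cyclic-convolution
    property of the DFT, eq. (13.3).

    Both sequences are zero-padded to length len(x) + len(h) - 1, so
    the cyclic convolution has no wrap-around aliasing and equals the
    desired non-cyclic convolution.
    """
    n = len(x) + len(h) - 1
    Y = np.fft.fft(h, n) * np.fft.fft(x, n)   # DFT{y} = DFT{h} DFT{x}
    return np.fft.ifft(Y).real                # inputs real: drop round-off imaginary part

h = np.array([1.0, -1.0, 0.5])
x = np.random.default_rng(0).standard_normal(64)
assert np.allclose(fft_convolution(h, x), np.convolve(h, x))
```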
For filtering and some other applications, one wants “ongoing”
convolution where the filter response h (n) may be finite in length
or duration, but the input x (n) is of arbitrary length. Two methods
have traditionally been used to break the input into blocks and use the
FFT to convolve the blocks so that the output that would have been
calculated by directly implementing (13.1) or (13.2) can be
constructed efficiently. These are called “overlap-add” and
“overlap-save”.
13.1.1 Fast Convolution by Overlap-Add
In order to use the FFT to convolve (or filter) a long input sequence
x (n) with a finite length-M impulse response, h (n), we partition
the input sequence into segments or blocks of length L. Because
convolution (or filtering) is linear, the output is a linear sum of
the result of convolving the first block with h (n) plus the result of
convolving the second block with h (n), plus the rest. Each of these
block convolutions can be calculated by using the FFT. The output
is the inverse FFT of the product of the FFT of x (n) and the FFT
of h (n). Since the number of arithmetic operations to calculate the
convolution directly is on the order of M² and, if done with the
FFT, is on the order of M log(M), there can be a great savings by
using the FFT for large M.
The reason this procedure is not totally straightforward is that the
length of the output of convolving a length-L block with a length-M
filter is L + M − 1. This means the output blocks cannot simply be
concatenated but must be overlapped and added, hence the name of
this algorithm: “Overlap-Add”.
The second issue that must be taken into account is the fact that the
overlap-add steps need non-cyclic convolution and convolution by
the FFT is cyclic. This is easily handled by appending L − 1 zeros
to the impulse response and M − 1 zeros to each input block so that
all FFTs are of length M + L − 1. This means there is no aliasing
and the implemented cyclic convolution gives the same output as
the desired non-cyclic convolution.
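The whole procedure can be sketched in NumPy. This is a minimal illustration, not a tuned implementation: the filter FFT is computed once, each zero-padded block is convolved independently, and the length-(L + M − 1) results are overlapped by M − 1 samples and added. Names and test values are illustrative:

```python
import numpy as np

def overlap_add(h, x, L):
    """Overlap-add FIR filtering: length-L input blocks convolved with
    the length-M filter h via zero-padded FFTs of length L + M - 1."""
    M = len(h)
    nfft = L + M - 1
    H = np.fft.fft(h, nfft)              # FFT of the padded filter, computed once
    y = np.zeros(len(x) + M - 1)
    for start in range(0, len(x), L):
        block = x[start:start + L]       # the last block may be shorter than L
        yb = np.fft.ifft(np.fft.fft(block, nfft) * H).real
        # overlap the length-(len(block)+M-1) result with its neighbors and add
        y[start:start + len(block) + M - 1] += yb[:len(block) + M - 1]
    return y

h = np.array([1.0, 0.5, 0.25, -0.1, 0.05])
x = np.random.default_rng(2).standard_normal(50)
assert np.allclose(overlap_add(h, x, L=16), np.convolve(h, x))
```

The test length is deliberately not a multiple of L, to exercise the shorter final block.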
The savings in arithmetic can be considerable when implementing
convolution or performing FIR digital filtering. However, there are
two penalties. The use of blocks introduces a delay of one block
length. None of the first block of output can be calculated until
all of the first block of input is available. This is not a problem
for “off line" or “batch" processing but can be serious for real-time processing. The second penalty is the memory required to store
and process the blocks. The continuing reduction of memory cost
often removes this problem.
The efficiency in terms of the number of arithmetic operations per
output point increases for large blocks because of the M log(M)
requirements of the FFT. However, if the blocks become very large
(L >> M), much of the input block will be the appended zeros and
efficiency is lost. For any particular application, taking the
particular filter and FFT algorithm being used and the particular
hardware being used, a plot of efficiency vs. block length L should
be made and L chosen to maximize efficiency given any other
constraints that are applicable.
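Short of measuring on the target hardware, one can at least sketch the trade-off with a simple operation-count model. The constants below (about 5 n log2(n) flops per complex FFT, 6 flops per complex multiply, filter FFT precomputed) are rough modeling assumptions, not measured costs, so the chosen L is only indicative:

```python
import math

def ops_per_output(L, M):
    """Rough flop count per output sample for overlap-add with FFT
    length nfft = L + M - 1.  The cost constants are assumptions:
    ~5*nfft*log2(nfft) flops per complex FFT, 6 flops per complex
    multiply; the filter FFT is assumed precomputed."""
    nfft = L + M - 1
    fft_cost = 5 * nfft * math.log2(nfft)
    return (2 * fft_cost + 6 * nfft) / L   # one forward + one inverse FFT per block

M = 64                                      # illustrative filter length
# candidate block lengths chosen so nfft = L + M - 1 is a power of two
candidates = [2 ** k - M + 1 for k in range(7, 15)]
best_L = min(candidates, key=lambda L: ops_per_output(L, M))
```

Under this model, very small L amortizes the FFT poorly and very large L pays the growing log factor, so the cost per output point has an interior minimum, as the text's recommended plot would show.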
Usually, the block convolutions are done by the FFT, but they could
be done by any efficient, finite-length method. One could use
“rectangular transforms” or “number-theoretic transforms”. A
generalization of this method is presented later in the notes.
13.1.2 Fast Convolution by Overlap-Save
An alternative approach to the Overlap-Add can be developed by
starting with segmenting the output rather than the input. If one
considers the calculation of a block of output, it is seen that not
only the corresponding input block is needed, but part of the
preceding input block is also needed. Indeed, one can show that a
length M + L − 1 segment of the input is needed for each output block.
So, one saves the last part of the preceding block and concatenates
it with the current input block, then convolves that with h (n) to
calculate the current output.
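A corresponding NumPy sketch follows. Prepending M − 1 zeros stands in for the “previous” samples of the very first block; the first M − 1 outputs of each cyclic convolution are time-aliased and are the ones discarded (“saved over”). Names and test values are illustrative:

```python
import numpy as np

def overlap_save(h, x, L):
    """Overlap-save: each FFT input is the current length-L block
    preceded by the last M-1 samples of the previous block; the first
    M-1 outputs of each cyclic convolution are aliased and discarded."""
    M = len(h)
    nfft = L + M - 1
    H = np.fft.fft(h, nfft)
    xp = np.concatenate([np.zeros(M - 1), x])   # zeros stand in for pre-start samples
    y = []
    for start in range(0, len(x), L):
        seg = xp[start:start + nfft]            # overlapping length-(L+M-1) segment
        yb = np.fft.ifft(np.fft.fft(seg, nfft) * H).real
        keep = min(L, len(x) - start)
        y.append(yb[M - 1:M - 1 + keep])        # keep only the non-aliased outputs
    return np.concatenate(y)

h = np.array([1.0, -0.5, 0.25])
x = np.random.default_rng(3).standard_normal(40)
assert np.allclose(overlap_save(h, x, L=10), np.convolve(h, x)[:len(x)])
```

Unlike overlap-add, nothing is added: valid output samples are simply concatenated, which is why only the first len(x) samples of the full convolution are produced here.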
13.2 Block Processing, a Generalization of Overlap Methods
Convolution is intimately related to the DFT. It was shown in The
DFT as Convolution or Filtering (Chapter 5) that a prime length
DFT could be converted to cyclic convolution. It has been long
known [276] that convolution can be calculated by multiplying the
DFTs of signals.
An important question is what is the fastest method for calculating
digital convolution. There are several methods that each have some
advantage. The earliest method for fast convolution was the use
of sectioning with overlap-add or overlap-save and the FFT [276],
[300], [66]. In most cases the convolution is of real data and,
therefore, real-data FFTs should be used. That approach is still
probably the fastest method for longer convolution on a general purpose
computer or microprocessor. The shorter convolutions should simply
be calculated directly.
13.3 Introduction
The partitioning of long or infinite strings of data into shorter
sections or blocks has been used to allow application of the FFT to
realize on-going or continuous convolution [368], [181]. This section
develops the idea of block processing and shows that it is a
generalization of the overlap-add and overlap-save methods [368],
[147]. The idea is further generalized to a multidimensional
formulation of convolution [3], [47]. Moving in the opposite
direction, it is shown that, rather than partitioning a string of
scalars into blocks and then into blocks of blocks, one can partition
a scalar number into blocks of bits and then include the operation of
multiplication in the signal processing formulation. This is called
distributed arithmetic [45] and, since it describes operations at the
bit level, is completely general. These notes try to present a
coherent development of these ideas.
13.4 Block Signal Processing
In this section the usual convolution and recursion that implement
FIR and IIR discrete-time filters are reformulated in terms of
vectors and matrices. Because the same data is partitioned and
grouped in a variety of ways, it is important to have a consistent
notation in order to be clear. The nth element of a data sequence
is expressed h (n) or, in some cases to simplify, hn. A block or
finite-length column vector is denoted h̄n, with n indicating the nth
block or section of a longer vector. A matrix, square or rectangular,
is indicated by an upper-case letter such as H, with a subscript if
appropriate.
13.4.1 Block Convolution
The operation of a finite impulse response (FIR) filter is described
by a finite convolution as
y (n) = ∑_{k=0}^{L−1} h(k) x(n − k)    (13.4)
where x (n) is causal, h (n) is causal and of length L, and the time
index n goes from zero to infinity or some large value. With a
change of index variables this becomes
y (n) = ∑ h(n − k) x(k)    (13.5)
which can be expressed as a matrix operation by
⎡ y0 ⎤   ⎡ h0  0   0  ···  0 ⎤ ⎡ x0 ⎤
⎢ y1 ⎥   ⎢ h1  h0  0         ⎥ ⎢ x1 ⎥
⎢ y2 ⎥ = ⎢ h2  h1  h0        ⎥ ⎢ x2 ⎥    (13.6)
⎣ ⋮  ⎦   ⎣ ⋮             ⋱  ⎦ ⎣ ⋮  ⎦
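The matrix form (13.6) can be checked numerically by building the lower-triangular Toeplitz matrix explicitly. A small illustrative sketch (signals chosen arbitrarily):

```python
import numpy as np

h = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, -1.0, 0.5, 2.0])

n_out = len(h) + len(x) - 1
# Lower-triangular Toeplitz matrix of (13.6): H[i, j] = h(i - j)
H = np.zeros((n_out, len(x)))
for i in range(n_out):
    for j in range(len(x)):
        if 0 <= i - j < len(h):
            H[i, j] = h[i - j]

y = H @ x
assert np.allclose(y, np.convolve(h, x))
```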
The H matrix of impulse response values is partitioned into N by
N square submatrices and the X and Y vectors are partitioned into
length-N blocks or sections. This is illustrated for N = 3 by
H0 = ⎡ h0  0   0  ⎤    H1 = ⎡ h3  h2  h1 ⎤    etc.    (13.7)
     ⎢ h1  h0  0  ⎥         ⎢ h4  h3  h2 ⎥
     ⎣ h2  h1  h0 ⎦         ⎣ h5  h4  h3 ⎦

x̄0 = ⎡ x0 ⎤    x̄1 = ⎡ x3 ⎤    ȳ0 = ⎡ y0 ⎤    etc.    (13.8)
     ⎢ x1 ⎥         ⎢ x4 ⎥         ⎢ y1 ⎥
     ⎣ x2 ⎦         ⎣ x5 ⎦         ⎣ y2 ⎦
Substituting these definitions into (13.6) gives
⎡ ȳ0 ⎤   ⎡ H0  0   0  ···  0 ⎤ ⎡ x̄0 ⎤
⎢ ȳ1 ⎥   ⎢ H1  H0  0         ⎥ ⎢ x̄1 ⎥
⎢ ȳ2 ⎥ = ⎢ H2  H1  H0        ⎥ ⎢ x̄2 ⎥    (13.9)
⎣ ⋮  ⎦   ⎣ ⋮             ⋱  ⎦ ⎣ ⋮  ⎦
The general expression for the nth output block is

ȳn = ∑_{k=0}^{n} H_{n−k} x̄k    (13.10)
which is a vector or block convolution. Since the matrix-vector
multiplication within the block convolution is itself a convolution,
(13.10) is a sort of convolution of convolutions and the finite length
matrix-vector multiplication can be carried out using the FFT or
other fast convolution methods.
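A small numerical check of (13.10), building the N × N submatrices of (13.7) directly; the block size and test signals are illustrative:

```python
import numpy as np

N = 3                                             # block size
h = np.array([1.0, 0.5, -0.25, 0.1])
x = np.random.default_rng(1).standard_normal(9)   # three length-N blocks

def Hblk(k):
    """N x N submatrix H_k of (13.7): (H_k)[i, j] = h(k*N + i - j)."""
    B = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            idx = k * N + i - j
            if 0 <= idx < len(h):
                B[i, j] = h[idx]
    return B

x_blocks = x.reshape(-1, N)
# block convolution, eq. (13.10): y_bar_n = sum_k H_{n-k} x_bar_k
y_blocks = [sum(Hblk(n - k) @ x_blocks[k] for k in range(n + 1))
            for n in range(len(x_blocks))]
y = np.concatenate(y_blocks)

assert np.allclose(y, np.convolve(h, x)[:len(y)])
```

Each term Hblk(n − k) @ x_blocks[k] is itself a small convolution, which is the “convolution of convolutions” structure noted above; in practice each matrix-vector product would be done by the FFT.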
The equation for one output block can be written as the product

                   ⎡ x̄0 ⎤
ȳ2 = [ H2  H1  H0 ] ⎢ x̄1 ⎥    (13.11)
                   ⎣ x̄2 ⎦
and the effects of one input block can be written

⎡ H0 ⎤       ⎡ ȳ0 ⎤
⎢ H1 ⎥ x̄1 = ⎢ ȳ1 ⎥    (13.12)
⎣ H2 ⎦       ⎣ ȳ2 ⎦
These are generalized statements of overlap-save and overlap-add
[368], [147]. The block length can be longer, shorter, or equal to
the filter length.
13.4.2 Block Recursion
Although less well-known, IIR filters can also be implemented
with block processing [145], [74], [396], [43], [44]. The block
form of an IIR filter is developed in much the same way as for the
block convolution implementation of the FIR filter. The general
constant coefficient difference equation which describes an IIR
filter with recursive coefficients a_l, convolution coefficients b_k,
input signal x (n), and output signal y (n) is given by

y (n) = ∑_{l=1}^{N−1} a_l y_{n−l} + ∑_{k=0}^{M−1} b_k x_{n−k}    (13.13)
using both functional notation and subscripts, depending on which
is easier and clearer. The impulse response h (n) is
h (n) = ∑_{l=1}^{N−1} a_l h(n − l) + ∑_{k=0}^{M−1} b_k δ (n − k)    (13.14)
which can be written in matrix operator form
⎡ 1   0   0  ···  0 ⎤ ⎡ h0 ⎤   ⎡ b0 ⎤
⎢ a1  1   0         ⎥ ⎢ h1 ⎥   ⎢ b1 ⎥
⎢ a2  a1  1         ⎥ ⎢ h2 ⎥ = ⎢ b2 ⎥    (13.15)
⎢ a3  a2  a1        ⎥ ⎢ h3 ⎥   ⎢ b3 ⎥
⎢ 0   a3  a2        ⎥ ⎢ h4 ⎥   ⎢ 0  ⎥
⎣ ⋮             ⋱  ⎦ ⎣ ⋮  ⎦   ⎣ ⋮  ⎦
In terms of N by N submatrices and length-N blocks, this becomes
⎡ A0  0   0  ···  0 ⎤ ⎡ h̄0 ⎤   ⎡ b̄0 ⎤
⎢ A1  A0  0         ⎥ ⎢ h̄1 ⎥ = ⎢ b̄1 ⎥    (13.16)
⎢ 0   A1  A0        ⎥ ⎢ h̄2 ⎥   ⎢ 0  ⎥
⎣ ⋮             ⋱  ⎦ ⎣ ⋮  ⎦   ⎣ ⋮  ⎦
From this formulation, a block recursive equation can be written
that will generate the impulse response block by block.
A0 h̄n + A1 h̄_{n−1} = 0   for n ≥ 2    (13.17)
h̄n = −A0⁻¹ A1 h̄_{n−1} = K h̄_{n−1}   for n ≥ 2    (13.18)
with initial conditions given by

h̄1 = −A0⁻¹ A1 A0⁻¹ b̄0 + A0⁻¹ b̄1    (13.19)
This can also be written to generate the square partitions of the
impulse response matrix by
Hn = K H_{n−1}   for n ≥ 2    (13.20)
with initial conditions given by
H1 = K A0⁻¹ B0 + A0⁻¹ B1    (13.21)
and K = −A0⁻¹ A1. This recursively generates square submatrices
of H similar to those defined in (13.7) and (13.9) and shows the
basic structure of the dynamic system.
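The block recursion can be checked numerically for a small filter. The sketch below assumes the sign convention implied by the matrix form (13.15), i.e. h(n) + a1 h(n−1) + a2 h(n−2) = bn, and a numerator short enough to fit in the first block, so the recursion h̄n = K h̄_{n−1} starts directly from h̄0 = A0⁻¹ b̄0; all coefficient values are illustrative:

```python
import numpy as np

# Small IIR filter in the sign convention of the banded matrix (13.15):
#   h(n) + a1*h(n-1) + a2*h(n-2) = b0*delta(n) + b1*delta(n-1)
a = np.array([0.5, -0.25])        # a1, a2 (illustrative)
b = np.array([1.0, 0.3])          # b0, b1 (illustrative)

N = 3                             # block length, larger than both orders

# Diagonal block A0 and subdiagonal block A1 of the banded matrix
A0 = np.eye(N)
A0[1, 0] = a[0]
A0[2, 0], A0[2, 1] = a[1], a[0]
A1 = np.zeros((N, N))
A1[0, 1], A1[0, 2] = a[1], a[0]
A1[1, 2] = a[1]

K = -np.linalg.solve(A0, A1)      # K = -A0^{-1} A1

# Reference: direct scalar recursion for the impulse response
n_total = 12
h = np.zeros(n_total)
for n in range(n_total):
    acc = b[n] if n < len(b) else 0.0
    for l in range(1, len(a) + 1):
        if n - l >= 0:
            acc -= a[l - 1] * h[n - l]
    h[n] = acc

# Block recursion: solve A0 h_bar_0 = b_bar_0, then multiply by K
b_bar0 = np.zeros(N)
b_bar0[:len(b)] = b
blocks = [np.linalg.solve(A0, b_bar0)]
for _ in range(n_total // N - 1):
    blocks.append(K @ blocks[-1])
h_block = np.concatenate(blocks)

assert np.allclose(h_block, h)
```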
Next, we develop the recursive formulation for a general input as
described by the scalar difference equation (13.13) and in matrix
operator form by
⎡ 1   0   0  ···  0 ⎤ ⎡ y0 ⎤   ⎡ b0  0   0  ···  0 ⎤ ⎡ x0 ⎤
⎢ a1  1   0         ⎥ ⎢ y1 ⎥   ⎢ b1  b0  0         ⎥ ⎢ x1 ⎥
⎢ a2  a1  1         ⎥ ⎢ y2 ⎥ = ⎢ b2  b1  b0        ⎥ ⎢ x2 ⎥    (13.22)
⎢ a3  a2  a1        ⎥ ⎢ y3 ⎥   ⎢ 0   b2  b1        ⎥ ⎢ x3 ⎥
⎢ 0   a3  a2        ⎥ ⎢ y4 ⎥   ⎢ 0   0   b2        ⎥ ⎢ x4 ⎥
⎣ ⋮             ⋱  ⎦ ⎣ ⋮  ⎦   ⎣ ⋮             ⋱  ⎦ ⎣ ⋮  ⎦
which, after substituting the definitions of the submatrices and
assuming the block length is larger than the order of the numerator
or denominator, becomes
A0 0
0
· · · 0
y0
A
y
<