Chapter 17

Applications of SVD and Pseudo-Inverses
Of all the principles that can be proposed for this purpose, I think there is none more general, more exact, or easier to apply than the one we have used in the preceding research, which consists of minimizing the sum of the squares of the errors. By this means a kind of equilibrium is established among the errors which, by preventing the extremes from prevailing, is well suited to revealing the state of the system closest to the truth.
—Legendre, 1805, Nouvelles Méthodes pour la détermination des Orbites des Comètes
17.1 Least Squares Problems and the Pseudo-inverse
This chapter presents several applications of SVD. The first one is the pseudo-inverse, which
plays a crucial role in solving linear systems by the method of least squares. The second ap-
plication is data compression. The third application is principal component analysis (PCA),
whose purpose is to identify patterns in data and understand the variance–covariance struc-
ture of the data. The fourth application is the best affine approximation of a set of data, a
problem closely related to PCA.
The method of least squares is a way of “solving” an overdetermined system of linear
equations
Ax = b,
i.e., a system in which A is a rectangular m × n matrix with more equations than unknowns
(when m > n). Historically, the method of least squares was used by Gauss and Legendre
to solve problems in astronomy and geodesy. The method was first published by Legendre
in 1805 in a paper on methods for determining the orbits of comets. However, Gauss had
already used the method of least squares as early as 1801 to determine the orbit of the asteroid
Ceres, and he published a paper about it in 1810 after the discovery of the asteroid Pallas.
Incidentally, it is in that same paper that Gaussian elimination using pivots is introduced.
The reason why more equations than unknowns arise in such problems is that repeated
measurements are taken to minimize errors. This produces an overdetermined and often
inconsistent system of linear equations. For example, Gauss solved a system of eleven equa-
tions in six unknowns to determine the orbit of the asteroid Pallas. As a concrete illustration,
suppose that we observe the motion of a small object, assimilated to a point, in the plane.
From our observations, we suspect that this point moves along a straight line, say of equation
y = dx + c. Suppose that we observed the moving point at three different locations (x1, y1),
(x2, y2), and (x3, y3). Then we should have
c + dx1 = y1,
c + dx2 = y2,
c + dx3 = y3.
If there were no errors in our measurements, these equations would be compatible, and c
and d would be determined by only two of the equations. However, in the presence of errors,
the system may be inconsistent. Yet we would like to find c and d!
The idea of the method of least squares is to determine (c, d) such that it minimizes the
sum of the squares of the errors, namely,
$$(c + dx_1 - y_1)^2 + (c + dx_2 - y_2)^2 + (c + dx_3 - y_3)^2.$$
In general, for an overdetermined m×n system Ax = b, what Gauss and Legendre discovered
is that there are solutions x minimizing
$$\|Ax - b\|^2$$
(where $\|u\|^2 = u_1^2 + \cdots + u_n^2$, the square of the Euclidean norm of the vector $u = (u_1, \ldots, u_n)$), and that these solutions are given by the square $n \times n$ system
$$A^\top A x = A^\top b,$$
called the normal equations. Furthermore, when the columns of $A$ are linearly independent, it turns out that $A^\top A$ is invertible, and so $x$ is unique and given by
$$x = (A^\top A)^{-1} A^\top b.$$
Note that $A^\top A$ is a symmetric matrix, one of the nice features of the normal equations of a least squares problem. For instance, the normal equations for the above problem are
$$\begin{pmatrix} 3 & x_1 + x_2 + x_3 \\ x_1 + x_2 + x_3 & x_1^2 + x_2^2 + x_3^2 \end{pmatrix} \begin{pmatrix} c \\ d \end{pmatrix} = \begin{pmatrix} y_1 + y_2 + y_3 \\ x_1 y_1 + x_2 y_2 + x_3 y_3 \end{pmatrix}.$$
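As a quick sanity check, here is a minimal NumPy sketch (the observed values are made up for illustration) that solves this line-fitting problem both through the normal equations and through a library least squares routine; when the columns of A are linearly independent, the two answers agree.

```python
import numpy as np

# Hypothetical observations (x_i, y_i) of the moving point.
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([1.1, 1.9, 3.2])

# Design matrix of the overdetermined system  c + d*x_i = y_i.
A = np.column_stack([np.ones_like(xs), xs])   # columns: [1, x_i]

# Solve the normal equations  A^T A [c, d]^T = A^T y.
c, d = np.linalg.solve(A.T @ A, A.T @ ys)

# Same answer from a least squares solver.
(c2, d2), *_ = np.linalg.lstsq(A, ys, rcond=None)

print(c, d)   # intercept and slope minimizing the sum of squared errors
assert np.allclose([c, d], [c2, d2])
```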
In fact, given any real $m \times n$ matrix $A$, there is always a unique $x^+$ of minimum norm that minimizes $\|Ax - b\|^2$, even when the columns of $A$ are linearly dependent. How do we prove this, and how do we find $x^+$?
Theorem 17.1. Every linear system Ax = b, where A is an m × n matrix, has a unique
least squares solution x+ of smallest norm.
Proof. Geometry offers a nice proof of the existence and uniqueness of x+. Indeed, we can
interpret $b$ as a point in the Euclidean (affine) space $\mathbb{R}^m$, and the image subspace of $A$ (also called the column space of $A$) as a subspace $U$ of $\mathbb{R}^m$ (passing through the origin). Then we claim that $x$ minimizes $\|Ax - b\|^2$ iff $Ax$ is the orthogonal projection $p$ of $b$ onto the subspace $U$, which is equivalent to $\overrightarrow{pb} = b - Ax$ being orthogonal to $U$.
First of all, if $U^\perp$ is the vector space orthogonal to $U$, the affine space $b + U^\perp$ intersects $U$ in a unique point $p$ (this follows from Lemma 19.15 (2)). Next, for any point $y \in U$, the vectors $\overrightarrow{py}$ and $\overrightarrow{bp}$ are orthogonal, which implies that
$$\|\overrightarrow{by}\|^2 = \|\overrightarrow{bp}\|^2 + \|\overrightarrow{py}\|^2.$$
Thus, $p$ is indeed the unique point in $U$ that minimizes the distance from $b$ to any point in $U$.
To show that there is a unique $x^+$ of minimum norm minimizing the (square) error $\|Ax - b\|^2$, we use the fact that
$$\mathbb{R}^n = \operatorname{Ker} A \oplus (\operatorname{Ker} A)^\perp.$$
Indeed, every $x \in \mathbb{R}^n$ can be written uniquely as $x = u + v$, where $u \in \operatorname{Ker} A$ and $v \in (\operatorname{Ker} A)^\perp$, and since $u$ and $v$ are orthogonal,
$$\|x\|^2 = \|u\|^2 + \|v\|^2.$$
Furthermore, since $u \in \operatorname{Ker} A$, we have $Au = 0$, and thus $Ax = p$ iff $Av = p$, which shows that the solutions of $Ax = p$ for which $x$ has minimum norm must belong to $(\operatorname{Ker} A)^\perp$. However, the restriction of $A$ to $(\operatorname{Ker} A)^\perp$ is injective. This is because if $Av_1 = Av_2$, where $v_1, v_2 \in (\operatorname{Ker} A)^\perp$, then $A(v_2 - v_1) = 0$, which implies $v_2 - v_1 \in \operatorname{Ker} A$, and since $v_1, v_2 \in (\operatorname{Ker} A)^\perp$, we also have $v_2 - v_1 \in (\operatorname{Ker} A)^\perp$, and consequently $v_2 - v_1 = 0$. This shows that there is a unique $x$ of minimum norm minimizing $\|Ax - b\|^2$, and that it must belong to $(\operatorname{Ker} A)^\perp$.
The proof also shows that $x$ minimizes $\|Ax - b\|^2$ iff $\overrightarrow{pb} = b - Ax$ is orthogonal to $U$, which can be expressed by saying that $b - Ax$ is orthogonal to every column of $A$. However, this is equivalent to
$$A^\top (b - Ax) = 0, \quad \text{i.e.,} \quad A^\top A x = A^\top b.$$
Finally, it turns out that the minimum norm least squares solution x+ can be found in terms
of the pseudo-inverse A+ of A, which is itself obtained from any SVD of A.
Definition 17.1. Given any $m \times n$ matrix $A$, if $A = V D U^\top$ is an SVD of $A$ with
$$D = \operatorname{diag}(\lambda_1, \ldots, \lambda_r, 0, \ldots, 0),$$
where $D$ is an $m \times n$ matrix and $\lambda_i > 0$, if we let
$$D^+ = \operatorname{diag}(1/\lambda_1, \ldots, 1/\lambda_r, 0, \ldots, 0),$$
an $n \times m$ matrix, the pseudo-inverse of $A$ is defined by
$$A^+ = U D^+ V^\top.$$
Actually, it seems that A+ depends on the specific choice of U and V in an SVD (U, D, V )
for A, but the next theorem shows that this is not so.
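Definition 17.1 translates directly into a few lines of NumPy. The sketch below (not from the text; the test matrix is arbitrary) builds the pseudo-inverse from an SVD by transposing the singular-value factor and inverting its nonzero entries, and compares the result with np.linalg.pinv. Note that NumPy writes an SVD as A = U diag(s) V^T, so its factors play the roles of V and U^T in the text's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # arbitrary 5 x 3 test matrix

# NumPy returns A = U @ diag(s) @ Vt with singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build D^+ : an n x m matrix with 1/sigma_i on the diagonal for the nonzero sigma_i.
tol = max(A.shape) * np.finfo(float).eps * s.max()
r = np.sum(s > tol)                      # numerical rank
Dplus = np.zeros((A.shape[1], A.shape[0]))
Dplus[:r, :r] = np.diag(1.0 / s[:r])

A_plus = Vt.T @ Dplus @ U.T              # A^+ = U D^+ V^T in the text's notation
assert np.allclose(A_plus, np.linalg.pinv(A))
```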
Theorem 17.2. The least squares solution of smallest norm of the linear system Ax = b,
where A is an m × n matrix, is given by
$$x^+ = A^+ b = U D^+ V^\top b.$$
Proof. First, assume that $A$ is a (rectangular) diagonal matrix $D$, as above. Then, since $x$ minimizes $\|Dx - b\|^2$ iff $Dx$ is the projection of $b$ onto the image subspace $F$ of $D$, it is fairly obvious that $x^+ = D^+ b$. Otherwise, we can write
$$A = V D U^\top,$$
where $U$ and $V$ are orthogonal. However, since $V$ is an isometry,
$$\|Ax - b\| = \|V D U^\top x - b\| = \|D U^\top x - V^\top b\|.$$
Letting $y = U^\top x$, we have $\|x\| = \|y\|$, since $U$ is an isometry, and since $U$ is surjective, $\|Ax - b\|$ is minimized iff $\|Dy - V^\top b\|$ is minimized, and we have shown that the least squares solution of smallest norm is
$$y^+ = D^+ V^\top b.$$
Since $y = U^\top x$, with $\|x\| = \|y\|$, we get
$$x^+ = U D^+ V^\top b = A^+ b.$$
Thus, the pseudo-inverse provides the optimal solution to the least squares problem.
By Theorems 17.1 and 17.2, $A^+ b$ is uniquely defined for every $b$, and thus $A^+$ depends only on $A$.
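The following sketch (a made-up, deliberately rank-deficient example) illustrates Theorem 17.2 numerically: among all minimizers of $\|Ax - b\|$, the vector $A^+ b$ has the smallest norm, and it agrees with the minimum-norm solution returned by np.linalg.lstsq.

```python
import numpy as np

# A rank-deficient 4 x 3 matrix: the third column is the sum of the first two,
# so the least squares minimizer is not unique.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0]])
b = np.array([1.0, 2.0, 0.0, 1.0])

x_plus = np.linalg.pinv(A) @ b                   # minimum-norm least squares solution
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # lstsq also returns the minimum-norm one

# Any other minimizer differs from x_plus by an element of Ker(A) and has larger norm.
kernel_dir = np.array([1.0, 1.0, -1.0])          # A @ kernel_dir = 0
x_other = x_plus + 0.5 * kernel_dir

assert np.allclose(x_plus, x_lstsq)
assert np.allclose(A @ x_plus, A @ x_other)      # same residual
assert np.linalg.norm(x_plus) < np.linalg.norm(x_other)
```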
Let $A = U \Sigma V^\top$ be an SVD for $A$. It is easy to check that
$$A A^+ A = A, \qquad A^+ A A^+ = A^+,$$
and both $A A^+$ and $A^+ A$ are symmetric matrices. In fact,
$$A A^+ = U \Sigma V^\top V \Sigma^+ U^\top = U \Sigma \Sigma^+ U^\top = U \begin{pmatrix} I_r & 0 \\ 0 & 0_{m-r} \end{pmatrix} U^\top$$
and
$$A^+ A = V \Sigma^+ U^\top U \Sigma V^\top = V \Sigma^+ \Sigma V^\top = V \begin{pmatrix} I_r & 0 \\ 0 & 0_{n-r} \end{pmatrix} V^\top.$$
We immediately get
$$(A A^+)^2 = A A^+, \qquad (A^+ A)^2 = A^+ A,$$
so both $A A^+$ and $A^+ A$ are orthogonal projections (since they are both symmetric). We claim that $A A^+$ is the orthogonal projection onto the range of $A$ and $A^+ A$ is the orthogonal projection onto $\operatorname{Ker}(A)^\perp = \operatorname{Im}(A^\top)$, the range of $A^\top$.
Obviously, we have $\operatorname{range}(A A^+) \subseteq \operatorname{range}(A)$, and for any $y = Ax \in \operatorname{range}(A)$, since $A A^+ A = A$, we have
$$A A^+ y = A A^+ A x = A x = y,$$
so the image of $A A^+$ is indeed the range of $A$. It is also clear that $\operatorname{Ker}(A) \subseteq \operatorname{Ker}(A^+ A)$, and since $A A^+ A = A$, we also have $\operatorname{Ker}(A^+ A) \subseteq \operatorname{Ker}(A)$, and so
$$\operatorname{Ker}(A^+ A) = \operatorname{Ker}(A).$$
Since $A^+ A$ is symmetric, $\operatorname{range}(A^+ A) = \operatorname{Ker}(A^+ A)^\perp = \operatorname{Ker}(A)^\perp$, as claimed.
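These projection properties are easy to check numerically; the short sketch below (arbitrary test matrix, not from the text) verifies that $AA^+$ and $A^+A$ are symmetric, idempotent, and fix vectors in the range of $A$ and of $A^\top$, respectively.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
P_range = A @ np.linalg.pinv(A)        # projection onto range(A)
P_row = np.linalg.pinv(A) @ A          # projection onto range(A^T) = Ker(A)^perp

# Symmetric and idempotent, hence orthogonal projections.
assert np.allclose(P_range, P_range.T) and np.allclose(P_range @ P_range, P_range)
assert np.allclose(P_row, P_row.T) and np.allclose(P_row @ P_row, P_row)

# P_range fixes vectors of the form A x, and P_row fixes vectors of the form A^T y.
x = rng.standard_normal(4)
y = rng.standard_normal(6)
assert np.allclose(P_range @ (A @ x), A @ x)
assert np.allclose(P_row @ (A.T @ y), A.T @ y)
```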
It will also be useful to see that $\operatorname{range}(A) = \operatorname{range}(A A^+)$ consists of all vectors $y \in \mathbb{R}^m$ such that
$$U^\top y = \begin{pmatrix} z \\ 0 \end{pmatrix},$$
with $z \in \mathbb{R}^r$.
Indeed, if $y = Ax$, then
$$U^\top y = U^\top A x = U^\top U \Sigma V^\top x = \Sigma V^\top x = \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix} V^\top x = \begin{pmatrix} z \\ 0 \end{pmatrix},$$
where $\Sigma_r$ is the $r \times r$ diagonal matrix $\operatorname{diag}(\sigma_1, \ldots, \sigma_r)$. Conversely, if $U^\top y = \begin{pmatrix} z \\ 0 \end{pmatrix}$, then $y = U \begin{pmatrix} z \\ 0 \end{pmatrix}$, and
$$A A^+ y = U \begin{pmatrix} I_r & 0 \\ 0 & 0_{m-r} \end{pmatrix} U^\top y
= U \begin{pmatrix} I_r & 0 \\ 0 & 0_{m-r} \end{pmatrix} U^\top U \begin{pmatrix} z \\ 0 \end{pmatrix}
= U \begin{pmatrix} I_r & 0 \\ 0 & 0_{m-r} \end{pmatrix} \begin{pmatrix} z \\ 0 \end{pmatrix}
= U \begin{pmatrix} z \\ 0 \end{pmatrix} = y,$$
which shows that $y$ belongs to the range of $A$.
Similarly, we claim that $\operatorname{range}(A^+ A) = \operatorname{Ker}(A)^\perp$ consists of all vectors $y \in \mathbb{R}^n$ such that
$$V^\top y = \begin{pmatrix} z \\ 0 \end{pmatrix},$$
with $z \in \mathbb{R}^r$.
If $y = A^+ A u$, then
$$y = A^+ A u = V \begin{pmatrix} I_r & 0 \\ 0 & 0_{n-r} \end{pmatrix} V^\top u = V \begin{pmatrix} z \\ 0 \end{pmatrix},$$
for some $z \in \mathbb{R}^r$. Conversely, if $V^\top y = \begin{pmatrix} z \\ 0 \end{pmatrix}$, then $y = V \begin{pmatrix} z \\ 0 \end{pmatrix}$, and so
$$A^+ A \, V \begin{pmatrix} z \\ 0 \end{pmatrix}
= V \begin{pmatrix} I_r & 0 \\ 0 & 0_{n-r} \end{pmatrix} V^\top V \begin{pmatrix} z \\ 0 \end{pmatrix}
= V \begin{pmatrix} I_r & 0 \\ 0 & 0_{n-r} \end{pmatrix} \begin{pmatrix} z \\ 0 \end{pmatrix}
= V \begin{pmatrix} z \\ 0 \end{pmatrix} = y,$$
which shows that $y \in \operatorname{range}(A^+ A)$.
If $A$ is a symmetric matrix, then in general there is no SVD $U \Sigma V^\top$ of $A$ with $U = V$. However, if $A$ is positive semidefinite, then the eigenvalues of $A$ are nonnegative, and so the nonzero eigenvalues of $A$ are equal to the singular values of $A$, and SVDs of $A$ are of the form
$$A = U \Sigma U^\top.$$
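For instance, for a (made-up) positive semidefinite matrix of the form $M M^\top$, an orthogonal eigendecomposition is already an SVD with $U = V$; the short check below compares eigenvalues with singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = M @ M.T                              # symmetric positive semidefinite

w, Q = np.linalg.eigh(A)                 # A = Q diag(w) Q^T with w >= 0
order = np.argsort(w)[::-1]              # sort eigenvalues in decreasing order
w, Q = w[order], Q[:, order]

s = np.linalg.svd(A, compute_uv=False)   # singular values of A
assert np.allclose(s, w)                 # eigenvalues = singular values here
assert np.allclose(A, Q @ np.diag(w) @ Q.T)
```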
Analogous results hold for complex matrices, but in this case, U and V are unitary
matrices and AA+ and A+A are Hermitian orthogonal projections.
If $A$ is a normal matrix, which means that $A A^\top = A^\top A$, then there is an intimate relationship between SVDs of $A$ and block diagonalizations of $A$. As a consequence, the pseudo-inverse of a normal matrix $A$ can be obtained directly from a block diagonalization of $A$.
If $A$ is a (real) normal matrix, then we know from Theorem 13.16 that $A$ can be block diagonalized with respect to an orthogonal matrix $U$ as
$$A = U \Lambda U^\top,$$
where $\Lambda$ is the (real) block diagonal matrix
$$\Lambda = \operatorname{diag}(B_1, \ldots, B_n),$$
consisting either of $2 \times 2$ blocks of the form
$$B_j = \begin{pmatrix} \lambda_j & -\mu_j \\ \mu_j & \lambda_j \end{pmatrix}$$
with $\mu_j \neq 0$, or of one-dimensional blocks $B_k = (\lambda_k)$. Then we have the following proposition:
Proposition 17.3. For any (real) normal matrix $A$ and any block diagonalization $A = U \Lambda U^\top$ of $A$ as above, the pseudo-inverse of $A$ is given by
$$A^+ = U \Lambda^+ U^\top,$$
where $\Lambda^+$ is the pseudo-inverse of $\Lambda$. Furthermore, if
$$\Lambda = \begin{pmatrix} \Lambda_r & 0 \\ 0 & 0 \end{pmatrix},$$
where $\Lambda_r$ has rank $r$, then
$$\Lambda^+ = \begin{pmatrix} \Lambda_r^{-1} & 0 \\ 0 & 0 \end{pmatrix}.$$
Proof. Assume that $B_1, \ldots, B_p$ are $2 \times 2$ blocks and that $\lambda_{2p+1}, \ldots, \lambda_n$ are the scalar entries. We know that the numbers $\lambda_j \pm i\mu_j$, and the $\lambda_{2p+k}$, are the eigenvalues of $A$. Let $\rho_{2j-1} = \rho_{2j} = \sqrt{\lambda_j^2 + \mu_j^2}$ for $j = 1, \ldots, p$, let $\rho_{2p+j} = \lambda_{2p+j}$ for $j = 1, \ldots, n - 2p$, and assume that the blocks are ordered so that $\rho_1 \geq \rho_2 \geq \cdots \geq \rho_n$. Then it is easy to see that
$$A A^\top = A^\top A = U \Lambda U^\top U \Lambda^\top U^\top = U \Lambda \Lambda^\top U^\top,$$
with
$$\Lambda \Lambda^\top = \operatorname{diag}(\rho_1^2, \ldots, \rho_n^2),$$
so the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n$ of $A$, which are the nonnegative square roots of the eigenvalues of $A A^\top$, are such that
$$\sigma_j = \rho_j, \qquad 1 \leq j \leq n.$$
We can define the diagonal matrices
$$\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r, 0, \ldots, 0),$$
where $r = \operatorname{rank}(A)$ and $\sigma_1 \geq \cdots \geq \sigma_r > 0$, and
$$\Theta = \operatorname{diag}(\sigma_1^{-1} B_1, \ldots, \sigma_{2p}^{-1} B_p, 1, \ldots, 1),$$
so that $\Theta$ is an orthogonal matrix and
$$\Lambda = \Theta \Sigma = \operatorname{diag}(B_1, \ldots, B_p, \lambda_{2p+1}, \ldots, \lambda_r, 0, \ldots, 0).$$
But then we can write
$$A = U \Lambda U^\top = U \Theta \Sigma U^\top,$$
and if we let $V = U \Theta$, then since $U$ is orthogonal and $\Theta$ is also orthogonal, $V$ is also orthogonal and $A = V \Sigma U^\top$ is an SVD for $A$. Now we get
$$A^+ = U \Sigma^+ V^\top = U \Sigma^+ \Theta^\top U^\top.$$
However, since $\Theta$ is an orthogonal matrix, $\Theta^\top = \Theta^{-1}$, and a simple calculation shows that
$$\Sigma^+ \Theta^\top = \Sigma^+ \Theta^{-1} = \Lambda^+,$$
which yields the formula
$$A^+ = U \Lambda^+ U^\top.$$
Also observe that if we write
$$\Lambda_r = \operatorname{diag}(B_1, \ldots, B_p, \lambda_{2p+1}, \ldots, \lambda_r),$$
then $\Lambda_r$ is invertible and
$$\Lambda^+ = \begin{pmatrix} \Lambda_r^{-1} & 0 \\ 0 & 0 \end{pmatrix}.$$
Therefore, the pseudo-inverse of a normal matrix can be computed directly from any block diagonalization of $A$, as claimed.
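As an illustration of Proposition 17.3, the sketch below (a made-up example) builds a small normal matrix from one 2x2 rotation-scaling block and a zero eigenvalue, inverts the invertible block to obtain the pseudo-inverse of the block diagonal factor, and checks the formula against np.linalg.pinv.

```python
import numpy as np

# A 2x2 block of the form [[lam, -mu], [mu, lam]] together with a zero eigenvalue.
lam, mu = 1.0, 2.0
B = np.array([[lam, -mu],
              [mu,  lam]])
Lam = np.zeros((3, 3))
Lam[:2, :2] = B                          # Lambda = diag(B, 0), rank r = 2

# Any orthogonal U gives a normal matrix A = U Lam U^T.
U, _ = np.linalg.qr(np.random.default_rng(3).standard_normal((3, 3)))
A = U @ Lam @ U.T
assert np.allclose(A @ A.T, A.T @ A)     # A is indeed normal

# Pseudo-inverse of Lambda: invert the invertible block, keep the zero block.
Lam_plus = np.zeros((3, 3))
Lam_plus[:2, :2] = np.linalg.inv(B)

assert np.allclose(np.linalg.pinv(A), U @ Lam_plus @ U.T)
```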
The following properties, due to Penrose, characterize the pseudo-inverse of a matrix.
We have already proved that the pseudo-inverse satisfies these equations. For a proof of the
converse, see Kincaid and Cheney [61].
Lemma 17.4. Given any $m \times n$ matrix $A$ (real or complex), the pseudo-inverse $A^+$ of $A$ is the unique $n \times m$ matrix satisfying the following properties:
$$A A^+ A = A, \qquad A^+ A A^+ = A^+, \qquad (A A^+)^* = A A^+, \qquad (A^+ A)^* = A^+ A.$$
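The four Penrose equations are easy to verify numerically for any given matrix; a minimal check on a random rectangular example (not from the text) using np.linalg.pinv:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))
Ap = np.linalg.pinv(A)

assert np.allclose(A @ Ap @ A, A)         # A A+ A = A
assert np.allclose(Ap @ A @ Ap, Ap)       # A+ A A+ = A+
assert np.allclose((A @ Ap).T, A @ Ap)    # A A+ is symmetric
assert np.allclose((Ap @ A).T, Ap @ A)    # A+ A is symmetric
```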
If $A$ is an $m \times n$ matrix of rank $n$ (and so $m \geq n$), it is easily shown that the QR-decomposition in terms of Householder transformations applies as follows: there are $n$ $m \times m$ matrices $H_1, \ldots, H_n$, each either a Householder matrix or the identity, and an upper triangular $m \times n$ matrix $R$ of rank $n$, such that
$$A = H_1 \cdots H_n R.$$
Then, because each $H_i$ is an isometry,
$$\|Ax - b\| = \|Rx - H_n \cdots H_1 b\|,$$
and the least squares problem $Ax = b$ is equivalent to the system
$$Rx = H_n \cdots H_1 b.$$
Now the system $Rx = H_n \cdots H_1 b$ is of the form
$$\begin{pmatrix} R_1 \\ 0_{m-n} \end{pmatrix} x = \begin{pmatrix} c \\ d \end{pmatrix},$$
where $R_1$ is an invertible $n \times n$ matrix (since $A$ has rank $n$), $c \in \mathbb{R}^n$, and $d \in \mathbb{R}^{m-n}$, and the least squares solution of smallest norm is
$$x^+ = R_1^{-1} c.$$
Since $R_1$ is a triangular matrix, $x^+ = R_1^{-1} c$ is easily computed by back-substitution.
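When $A$ has full column rank, this QR approach is how least squares problems are typically solved in practice. A minimal sketch using NumPy's Householder-based QR and a triangular solver (an assumption: np.linalg.qr and scipy.linalg.solve_triangular stand in for the explicit product $H_1 \cdots H_n$):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(5)
A = rng.standard_normal((7, 3))          # full column rank with probability 1
b = rng.standard_normal(7)

# Reduced QR: A = Q R1 with Q having orthonormal columns and R1 upper triangular 3x3.
Q, R1 = np.linalg.qr(A)                  # NumPy uses Householder reflections internally
c = Q.T @ b                              # the "c" part of H_n ... H_1 b

x = solve_triangular(R1, c, lower=False) # back-substitution, no explicit inverse of R1
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```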
The method of least squares is one of the most effective tools of the mathematical sciences.
There are entire books devoted to it. Readers are advised to consult Strang [101], Golub and
Van Loan [47], Demmel [25], and Trefethen and Bau [106], where extensions and applications
of least squares (such as weighted least squares and recursive least squares) are described.
Golub and Van Loan [47] also contains a very extensive bibliography, including a list of
books on least squares.
17.2 Data Compression and SVD
Among the many applications of SVD, a very useful one is data compression, notably for
images. In order to make precise the notion of closeness of matrices, we review briefly the
notion of matrix norm. We assume that the reader is familiar with the concepts of vector norm and matrix norm. These concepts are defined in Chapter 7, and the reader may want to review them before reading any further.
Given an $m \times n$ matrix $A$ of rank $r$, we would like to find a best approximation of $A$ by a matrix $B$ of rank $k \leq r$ (actually, $k < r$) so that $\|A - B\|_2$ (or $\|A - B\|_F$) is minimized.
Proposition 17.5. Let $A$ be an $m \times n$ matrix of rank $r$ and let $V D U^\top = A$ be an SVD for $A$. Write $u_i$ for the columns of $U$, $v_i$ for the columns of $V$, and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p$ for the singular values of $A$ ($p = \min(m, n)$). Then a matrix of rank $k < r$ closest to $A$ (in the $\|\,\|_2$ norm) is given by
$$A_k = \sum_{i=1}^{k} \sigma_i v_i u_i^\top = V \operatorname{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0) U^\top,$$
and $\|A - A_k\|_2 = \sigma_{k+1}$.
Proof. By construction, $A_k$ has rank $k$, and we have
$$\|A - A_k\|_2 = \Big\|\sum_{i=k+1}^{p} \sigma_i v_i u_i^\top\Big\|_2 = \big\|V \operatorname{diag}(0, \ldots, 0, \sigma_{k+1}, \ldots, \sigma_p) U^\top\big\|_2 = \sigma_{k+1}.$$
It remains to show that $\|A - B\|_2 \geq \sigma_{k+1}$ for all rank-$k$ matrices $B$. Let $B$ be any rank-$k$ matrix, so its kernel has dimension $n - k$. The subspace $U_{k+1}$ spanned by $(u_1, \ldots, u_{k+1})$ has dimension $k + 1$, and because the sum of the dimensions of the kernel of $B$ and of $U_{k+1}$ is $(n - k) + k + 1 = n + 1$, these two subspaces must intersect in a subspace of dimension at least 1. Pick any unit vector $h$ in $\operatorname{Ker}(B) \cap U_{k+1}$. Then since $Bh = 0$, we have
$$\|A - B\|_2^2 \geq \|(A - B)h\|_2^2 = \|Ah\|_2^2 = \|V D U^\top h\|_2^2 \geq \sigma_{k+1}^2 \|U^\top h\|_2^2 = \sigma_{k+1}^2,$$
which proves our claim.
Note that $A_k$ can be stored using $(m + n)k$ entries, as opposed to $mn$ entries. When $k \ll m$, this is a substantial gain.
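A minimal sketch of Proposition 17.5 used as a compression recipe (random test matrix, made up for illustration): keep the $k$ largest singular triples, store only the two thin factors, and the spectral-norm error is exactly $\sigma_{k+1}$.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((50, 40))
k = 10

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # best rank-k approximation in the 2-norm

# Storage: roughly (m + n) * k numbers (plus k singular values) instead of m * n.
stored = U[:, :k].size + Vt[:k, :].size + k
print(stored, A.size)                            # 910 vs 2000 here

# The spectral-norm error equals the (k+1)-st singular value.
err = np.linalg.norm(A - A_k, 2)
assert np.isclose(err, s[k])
```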
A nice example of the use of Proposition 17.5 in image compression is given in Demmel
[25], Chapter 3, Section 3.2.3, pages 113–115; see the Matlab demo.
An interesting topic that we have not addressed is the actual computation of an SVD.
This is a very interesting but tricky subject. Most methods reduce the computation of an
SVD to the diagonalization of a well-chosen symmetric matrix (which is not A A). Interested
readers should read Section 5.4 of Demmel’s excellent book [25], which contains an overview
of most known methods and an extensive list of references.
17.3 Principal Components Analysis (PCA)
Suppose we have a set of data consisting of $n$ points $X_1, \ldots, X_n$, with each $X_i \in \mathbb{R}^d$ viewed as a row vector.
Think of the $X_i$'s as persons, and if $X_i = (x_{i1}, \ldots, x_{id})$, each $x_{ij}$ is the value of some feature (or attribute) of that person. For example, the $X_i$'s could be mathematicians, $d = 2$, and the first component, $x_{i1}$, of $X_i$ could be the year that $X_i$ was born, and the second component, $x_{i2}$, the length of the beard of $X_i$ in centimeters. Here is a small data set:
Name                    Year born   Beard length (cm)
Carl Friedrich Gauss    1777         0
Camille Jordan          1838        12
Adrien-Marie Legendre   1752         0
Bernhard Riemann        1826        15
David Hilbert           1862         2
Henri Poincaré          1854         5
Emmy Noether            1882         0
Karl Weierstrass        1815         0
Eugenio Beltrami        1835         2
Hermann Schwarz         1843        20
We usually form the n × d matrix X whose ith row is Xi, with 1 ≤ i ≤ n. Then the
jth column is denoted by Cj (1 ≤ j ≤ d). It is sometimes called a feature vector, but this
terminology is far from being universally accepted. In fact, many people in computer vision
call the data points Xi feature vectors!
The purpose of principal components analysis, for short PCA, is to identify patterns in
data and understand the variance–covariance structure of the data. This is useful for the
following tasks:
1. Data reduction: Often much of the variability of the data can be accounted for by a
smaller number of principal components.
2. Interpretation: PCA can show relationships that were not previously suspected.
Given a vector (a sample of measurements) $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, recall that the mean (or average) $\overline{x}$ of $x$ is given by
$$\overline{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$
We let $x - \overline{x}$ denote the centered data point
$$x - \overline{x} = (x_1 - \overline{x}, \ldots, x_n - \overline{x}).$$
In order to measure the spread of the $x_i$'s around the mean, we define the sample variance (for short, variance) $\operatorname{var}(x)$ (or $s^2$) of the sample $x$ by
$$\operatorname{var}(x) = \frac{\sum_{i=1}^{n} (x_i - \overline{x})^2}{n - 1}.$$
There is a reason for using $n - 1$ instead of $n$.
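A short NumPy sketch computing these quantities for the beard-length column of the table above; note that ddof=1 in NumPy corresponds to the $n - 1$ convention used here.

```python
import numpy as np

# Beard lengths (cm) from the table above, in the same row order.
length = np.array([0, 12, 0, 15, 2, 5, 0, 0, 2, 20], dtype=float)

mean = length.mean()                              # the mean x-bar
centered = length - mean                          # the centered data point x - x-bar
var = (centered ** 2).sum() / (len(length) - 1)   # sample variance with the n - 1 convention

assert np.isclose(var, length.var(ddof=1))        # NumPy's ddof=1 uses the same convention
print(mean, var)
```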