
Chapter 17

Applications of SVD and Pseudo-inverses

Of all the principles that can be proposed for this purpose, I think there is none more general, more exact, or easier to apply than the one we used in the preceding research, which consists of minimizing the sum of the squares of the errors. By this means a kind of equilibrium is established among the errors which, by preventing the extremes from prevailing, is well suited to revealing the state of the system closest to the truth.

—Legendre, 1805, Nouvelles Méthodes pour la détermination des Orbites des Comètes

17.1 Least Squares Problems and the Pseudo-inverse

This chapter presents several applications of SVD. The first one is the pseudo-inverse, which

plays a crucial role in solving linear systems by the method of least squares. The second ap-

plication is data compression. The third application is principal component analysis (PCA),

whose purpose is to identify patterns in data and understand the variance–covariance struc-

ture of the data. The fourth application is the best affine approximation of a set of data, a

problem closely related to PCA.

The method of least squares is a way of “solving” an overdetermined system of linear

equations

Ax = b,

i.e., a system in which A is a rectangular m × n matrix with more equations than unknowns

(when m > n). Historically, the method of least squares was used by Gauss and Legendre

to solve problems in astronomy and geodesy. The method was first published by Legendre

in 1805 in a paper on methods for determining the orbits of comets. However, Gauss had

already used the method of least squares as early as 1801 to determine the orbit of the asteroid


Ceres, and he published a paper about it in 1810 after the discovery of the asteroid Pallas.

Incidentally, it is in that same paper that Gaussian elimination using pivots is introduced.

The reason why more equations than unknowns arise in such problems is that repeated

measurements are taken to minimize errors. This produces an overdetermined and often

inconsistent system of linear equations. For example, Gauss solved a system of eleven equa-

tions in six unknowns to determine the orbit of the asteroid Pallas. As a concrete illustration,

suppose that we observe the motion of a small object, idealized as a point, in the plane.

From our observations, we suspect that this point moves along a straight line, say of equation

y = dx + c. Suppose that we observed the moving point at three different locations $(x_1, y_1)$, $(x_2, y_2)$, and $(x_3, y_3)$. Then we should have
$$c + dx_1 = y_1, \qquad c + dx_2 = y_2, \qquad c + dx_3 = y_3.$$

If there were no errors in our measurements, these equations would be compatible, and c

and d would be determined by only two of the equations. However, in the presence of errors,

the system may be inconsistent. Yet we would like to find c and d!

The idea of the method of least squares is to determine (c, d) so as to minimize the sum of the squares of the errors, namely,
$$(c + dx_1 - y_1)^2 + (c + dx_2 - y_2)^2 + (c + dx_3 - y_3)^2.$$

In general, for an overdetermined m×n system Ax = b, what Gauss and Legendre discovered

is that there are solutions x minimizing

$$\|Ax - b\|^2$$

(where $\|u\|^2 = u_1^2 + \cdots + u_n^2$, the square of the Euclidean norm of the vector $u = (u_1, \ldots, u_n)$),

and that these solutions are given by the square n × n system

$$A^\top A x = A^\top b,$$

called the normal equations. Furthermore, when the columns of A are linearly independent,

it turns out that $A^\top A$ is invertible, and so x is unique and given by
$$x = (A^\top A)^{-1} A^\top b.$$

Note that $A^\top A$ is a symmetric matrix, one of the nice features of the normal equations of a

least squares problem. For instance, the normal equations for the above problem are

$$\begin{pmatrix} 3 & x_1 + x_2 + x_3 \\ x_1 + x_2 + x_3 & x_1^2 + x_2^2 + x_3^2 \end{pmatrix}\begin{pmatrix} c \\ d \end{pmatrix} = \begin{pmatrix} y_1 + y_2 + y_3 \\ x_1 y_1 + x_2 y_2 + x_3 y_3 \end{pmatrix}.$$
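To make this concrete, here is a minimal sketch in Python/NumPy that solves the line-fitting example via the normal equations; the three observed points are hypothetical sample values, and np.linalg.lstsq is used only as a cross-check.

```python
import numpy as np

# Hypothetical sample: three observed points (x_i, y_i) for the model y = c + d*x.
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([1.1, 1.9, 3.2])

# Each observation gives one equation c + d*x_i = y_i.
A = np.column_stack([np.ones_like(xs), xs])   # 3 x 2 design matrix
b = ys

# Normal equations: (A^T A) [c, d]^T = A^T b.
c, d = np.linalg.solve(A.T @ A, A.T @ b)

# Cross-check with a least squares solver.
cd, *_ = np.linalg.lstsq(A, b, rcond=None)
print(c, d)
print(np.allclose([c, d], cd))   # True
```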

In fact, given any real m × n matrix A, there is always a unique x+ of minimum norm

that minimizes $\|Ax - b\|^2$, even when the columns of A are linearly dependent. How do we

prove this, and how do we find x+?


Theorem 17.1. Every linear system Ax = b, where A is an m × n matrix, has a unique

least squares solution x+ of smallest norm.

Proof. Geometry offers a nice proof of the existence and uniqueness of x+. Indeed, we can

interpret b as a point in the Euclidean (affine) space $\mathbb{R}^m$, and the image subspace of A (also called the column space of A) as a subspace U of $\mathbb{R}^m$ (passing through the origin). Then, we claim that x minimizes $\|Ax - b\|^2$ iff Ax is the orthogonal projection p of b onto the subspace U, which is equivalent to $\overrightarrow{pb} = b - Ax$ being orthogonal to U.

First of all, if $U^\perp$ is the vector space orthogonal to U, the affine space $b + U^\perp$ intersects U in a unique point p (this follows from Lemma 19.15 (2)). Next, for any point $y \in U$, the vectors $\overrightarrow{py}$ and $\overrightarrow{bp}$ are orthogonal, which implies that
$$\|\overrightarrow{by}\|^2 = \|\overrightarrow{bp}\|^2 + \|\overrightarrow{py}\|^2.$$
Thus, p is indeed the unique point in U that minimizes the distance from b to any point in U.

To show that there is a unique x+ of minimum norm minimizing the (square) error

$\|Ax - b\|^2$, we use the fact that
$$\mathbb{R}^n = \operatorname{Ker} A \oplus (\operatorname{Ker} A)^\perp.$$
Indeed, every $x \in \mathbb{R}^n$ can be written uniquely as x = u + v, where $u \in \operatorname{Ker} A$ and $v \in (\operatorname{Ker} A)^\perp$, and since u and v are orthogonal,
$$\|x\|^2 = \|u\|^2 + \|v\|^2.$$

Furthermore, since u ∈ Ker A, we have Au = 0, and thus Ax = p iff Av = p, which shows

that the solutions of Ax = p for which x has minimum norm must belong to (Ker A)⊥.

However, the restriction of A to (Ker A)⊥ is injective. This is because if Av1 = Av2, where

$v_1, v_2 \in (\operatorname{Ker} A)^\perp$, then $A(v_2 - v_1) = 0$, which implies $v_2 - v_1 \in \operatorname{Ker} A$, and since $v_1, v_2 \in (\operatorname{Ker} A)^\perp$, we also have $v_2 - v_1 \in (\operatorname{Ker} A)^\perp$, and consequently, $v_2 - v_1 = 0$. This shows that there is a unique x of minimum norm minimizing $\|Ax - b\|^2$, and that it must belong to

(Ker A)⊥.

The proof also shows that x minimizes $\|Ax - b\|^2$ iff $\overrightarrow{pb} = b - Ax$ is orthogonal to U, which can be expressed by saying that b − Ax is orthogonal to every column of A. However, this is equivalent to
$$A^\top(b - Ax) = 0, \quad \text{i.e.,} \quad A^\top A x = A^\top b.$$

Finally, it turns out that the minimum norm least squares solution x+ can be found in terms

of the pseudo-inverse A+ of A, which is itself obtained from any SVD of A.


Definition 17.1. Given any m × n matrix A, if $A = V D U^\top$ is an SVD of A with
$$D = \operatorname{diag}(\lambda_1, \ldots, \lambda_r, 0, \ldots, 0),$$
where D is an m × n matrix and $\lambda_i > 0$, if we let
$$D^+ = \operatorname{diag}(1/\lambda_1, \ldots, 1/\lambda_r, 0, \ldots, 0),$$
an n × m matrix, the pseudo-inverse of A is defined by
$$A^+ = U D^+ V^\top.$$

Actually, it seems that A+ depends on the specific choice of U and V in an SVD (U, D, V )

for A, but the next theorem shows that this is not so.

Theorem 17.2. The least squares solution of smallest norm of the linear system Ax = b,

where A is an m × n matrix, is given by

$$x^+ = A^+ b = U D^+ V^\top b.$$

Proof. First, assume that A is a (rectangular) diagonal matrix D, as above. Then, since x minimizes $\|Dx - b\|^2$ iff Dx is the projection of b onto the image subspace F of D, it is fairly obvious that $x^+ = D^+ b$. Otherwise, we can write
$$A = V D U^\top,$$
where U and V are orthogonal. However, since V is an isometry,
$$\|Ax - b\| = \|V D U^\top x - b\| = \|D U^\top x - V^\top b\|.$$
Letting $y = U^\top x$, we have $\|x\| = \|y\|$, since U is an isometry, and since U is surjective, $\|Ax - b\|$ is minimized iff $\|Dy - V^\top b\|$ is minimized, and we have shown that the least squares solution of smallest norm is
$$y^+ = D^+ V^\top b.$$
Since $y = U^\top x$, with $\|x\| = \|y\|$, we get
$$x^+ = U D^+ V^\top b = A^+ b.$$

Thus, the pseudo-inverse provides the optimal solution to the least squares problem.

By Theorem 17.1 and Theorem 17.2, $A^+ b$ is uniquely defined for every b, and thus $A^+$ depends only on A.
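As a hedged illustration of Definition 17.1 and Theorem 17.2, the following Python/NumPy sketch builds $A^+$ from an SVD by inverting only the nonzero singular values; the matrix and right-hand side are random placeholders, and the rank deficiency is forced on purpose. Note that NumPy's convention A = U diag(s) V^T corresponds to the book's A = V D U^T with the roles of the factors swapped.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
A[:, 2] = A[:, 0] + A[:, 1]          # force linearly dependent columns (rank 2)
b = rng.standard_normal(5)

# numpy writes A = U @ diag(s) @ Vt; in the book's notation A = V D U^T,
# numpy's U plays the role of the book's V, and Vt is the book's U^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# D^+ : invert the nonzero singular values only, leave the zero ones at zero.
tol = max(A.shape) * np.finfo(float).eps * s[0]
s_plus = np.zeros_like(s)
s_plus[s > tol] = 1.0 / s[s > tol]

A_plus = Vt.T @ np.diag(s_plus) @ U.T     # the book's U D^+ V^T
x_plus = A_plus @ b                        # minimum-norm least squares solution

print(np.allclose(A_plus, np.linalg.pinv(A)))                        # True
print(np.allclose(x_plus, np.linalg.lstsq(A, b, rcond=None)[0]))     # True
```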


Let $A = U\Sigma V^\top$ be an SVD for A. It is easy to check that
$$AA^+A = A, \qquad A^+AA^+ = A^+,$$
and both $AA^+$ and $A^+A$ are symmetric matrices. In fact,
$$AA^+ = U\Sigma V^\top V\Sigma^+ U^\top = U\Sigma\Sigma^+ U^\top = U\begin{pmatrix} I_r & 0 \\ 0 & 0_{m-r} \end{pmatrix} U^\top$$
and
$$A^+A = V\Sigma^+ U^\top U\Sigma V^\top = V\Sigma^+\Sigma V^\top = V\begin{pmatrix} I_r & 0 \\ 0 & 0_{n-r} \end{pmatrix} V^\top.$$

We immediately get

$$(AA^+)^2 = AA^+, \qquad (A^+A)^2 = A^+A,$$
so both $AA^+$ and $A^+A$ are orthogonal projections (since they are both symmetric). We claim that $AA^+$ is the orthogonal projection onto the range of A and $A^+A$ is the orthogonal projection onto $\operatorname{Ker}(A)^\perp = \operatorname{Im}(A^\top)$, the range of $A^\top$.

Obviously, we have range(AA+) ⊆ range(A), and for any y = Ax ∈ range(A), since

AA+A = A, we have

AA+y = AA+Ax = Ax = y,

so the image of AA+ is indeed the range of A. It is also clear that Ker(A) ⊆ Ker(A+A), and

since AA+A = A, we also have Ker(A+A) ⊆ Ker(A), and so

Ker(A+A) = Ker(A).

Since A+A is Hermitian, range(A+A) = Ker(A+A)⊥ = Ker(A)⊥, as claimed.

It will also be useful to see that $\operatorname{range}(A) = \operatorname{range}(AA^+)$ consists of all vectors $y \in \mathbb{R}^m$ such that
$$U^\top y = \begin{pmatrix} z \\ 0 \end{pmatrix},$$
with $z \in \mathbb{R}^r$.

Indeed, if y = Ax, then
$$U^\top y = U^\top A x = U^\top U \Sigma V^\top x = \Sigma V^\top x = \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix} V^\top x = \begin{pmatrix} z \\ 0 \end{pmatrix},$$


where $\Sigma_r$ is the $r \times r$ diagonal matrix $\operatorname{diag}(\sigma_1, \ldots, \sigma_r)$. Conversely, if $U^\top y = \begin{pmatrix} z \\ 0 \end{pmatrix}$, then $y = U\begin{pmatrix} z \\ 0 \end{pmatrix}$, and
$$AA^+ y = U\begin{pmatrix} I_r & 0 \\ 0 & 0_{m-r} \end{pmatrix} U^\top y
= U\begin{pmatrix} I_r & 0 \\ 0 & 0_{m-r} \end{pmatrix} U^\top U\begin{pmatrix} z \\ 0 \end{pmatrix}
= U\begin{pmatrix} I_r & 0 \\ 0 & 0_{m-r} \end{pmatrix}\begin{pmatrix} z \\ 0 \end{pmatrix}
= U\begin{pmatrix} z \\ 0 \end{pmatrix} = y,$$
which shows that y belongs to the range of A.

Similarly, we claim that $\operatorname{range}(A^+A) = \operatorname{Ker}(A)^\perp$ consists of all vectors $y \in \mathbb{R}^n$ such that
$$V^\top y = \begin{pmatrix} z \\ 0 \end{pmatrix},$$
with $z \in \mathbb{R}^r$.

If $y = A^+Au$, then
$$y = A^+Au = V\begin{pmatrix} I_r & 0 \\ 0 & 0_{n-r} \end{pmatrix} V^\top u = V\begin{pmatrix} z \\ 0 \end{pmatrix},$$
for some $z \in \mathbb{R}^r$. Conversely, if $V^\top y = \begin{pmatrix} z \\ 0 \end{pmatrix}$, then $y = V\begin{pmatrix} z \\ 0 \end{pmatrix}$, and so
$$A^+A\,V\begin{pmatrix} z \\ 0 \end{pmatrix}
= V\begin{pmatrix} I_r & 0 \\ 0 & 0_{n-r} \end{pmatrix} V^\top V\begin{pmatrix} z \\ 0 \end{pmatrix}
= V\begin{pmatrix} I_r & 0 \\ 0 & 0_{n-r} \end{pmatrix}\begin{pmatrix} z \\ 0 \end{pmatrix}
= V\begin{pmatrix} z \\ 0 \end{pmatrix} = y,$$
which shows that $y \in \operatorname{range}(A^+A)$.
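These projection properties are easy to check numerically. The sketch below (a random test matrix with hypothetical values) verifies that $AA^+$ and $A^+A$ are symmetric idempotents, that $AA^+$ fixes vectors in the range of A, and that $A^+A$ annihilates vectors in Ker(A).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
A[:, 3] = A[:, 0] - 2 * A[:, 1]        # force rank 3, so Ker(A) is nontrivial
A_plus = np.linalg.pinv(A)

P = A @ A_plus      # claimed: orthogonal projection onto range(A)
Q = A_plus @ A      # claimed: orthogonal projection onto Ker(A)^perp

# Symmetric and idempotent:
print(np.allclose(P, P.T), np.allclose(P @ P, P))
print(np.allclose(Q, Q.T), np.allclose(Q @ Q, Q))

# P fixes every A x, and Q annihilates the kernel vector built into column 3:
x = rng.standard_normal(4)
k = np.array([1.0, -2.0, 0.0, -1.0])   # A @ k == 0 by construction
print(np.allclose(P @ (A @ x), A @ x))
print(np.allclose(A @ k, 0), np.allclose(Q @ k, 0))
```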

If A is a symmetric matrix, then in general, there is no SVD $U\Sigma V^\top$ of A with U = V. However, if A is positive semidefinite, then the eigenvalues of A are nonnegative, and so the nonzero eigenvalues of A are equal to the singular values of A and SVDs of A are of the form
$$A = U\Sigma U^\top.$$
Analogous results hold for complex matrices, but in this case, U and V are unitary matrices and $AA^+$ and $A^+A$ are Hermitian orthogonal projections.


If A is a normal matrix, which means that $AA^\top = A^\top A$, then there is an intimate relationship between SVDs of A and block diagonalizations of A. As a consequence, the pseudo-inverse of a normal matrix A can be obtained directly from a block diagonalization of A.

If A is a (real) normal matrix, then we know from Theorem 13.16 that A can be block diagonalized with respect to an orthogonal matrix U as
$$A = U\Lambda U^\top,$$
where Λ is the (real) block diagonal matrix
$$\Lambda = \operatorname{diag}(B_1, \ldots, B_n),$$
consisting either of $2 \times 2$ blocks of the form
$$B_j = \begin{pmatrix} \lambda_j & -\mu_j \\ \mu_j & \lambda_j \end{pmatrix}$$
with $\mu_j \neq 0$, or of one-dimensional blocks $B_k = (\lambda_k)$. Then we have the following proposition:

Proposition 17.3. For any (real) normal matrix A and any block diagonalization $A = U\Lambda U^\top$ of A as above, the pseudo-inverse of A is given by
$$A^+ = U\Lambda^+ U^\top,$$
where $\Lambda^+$ is the pseudo-inverse of Λ. Furthermore, if
$$\Lambda = \begin{pmatrix} \Lambda_r & 0 \\ 0 & 0 \end{pmatrix},$$
where $\Lambda_r$ has rank r, then
$$\Lambda^+ = \begin{pmatrix} \Lambda_r^{-1} & 0 \\ 0 & 0 \end{pmatrix}.$$

Proof. Assume that $B_1, \ldots, B_p$ are $2 \times 2$ blocks and that $\lambda_{2p+1}, \ldots, \lambda_n$ are the scalar entries. We know that the numbers $\lambda_j \pm i\mu_j$ and the $\lambda_{2p+k}$ are the eigenvalues of A. Let $\rho_{2j-1} = \rho_{2j} = \sqrt{\lambda_j^2 + \mu_j^2}$ for $j = 1, \ldots, p$, let $\rho_{2p+j} = \lambda_{2p+j}$ for $j = 1, \ldots, n - 2p$, and assume that the blocks are ordered so that $\rho_1 \geq \rho_2 \geq \cdots \geq \rho_n$. Then it is easy to see that
$$AA^\top = A^\top A = U\Lambda U^\top U\Lambda^\top U^\top = U\Lambda\Lambda^\top U^\top,$$
with
$$\Lambda\Lambda^\top = \operatorname{diag}(\rho_1^2, \ldots, \rho_n^2),$$
so the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n$ of A, which are the nonnegative square roots of the eigenvalues of $AA^\top$, are such that
$$\sigma_j = \rho_j, \qquad 1 \leq j \leq n.$$


We can define the diagonal matrices
$$\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r, 0, \ldots, 0),$$
where $r = \operatorname{rank}(A)$ and $\sigma_1 \geq \cdots \geq \sigma_r > 0$, and
$$\Theta = \operatorname{diag}(\sigma_1^{-1}B_1, \ldots, \sigma_{2p}^{-1}B_p, 1, \ldots, 1),$$
so that Θ is an orthogonal matrix and
$$\Lambda = \Theta\Sigma = \operatorname{diag}(B_1, \ldots, B_p, \lambda_{2p+1}, \ldots, \lambda_r, 0, \ldots, 0).$$
But then we can write
$$A = U\Lambda U^\top = U\Theta\Sigma U^\top,$$
and if we let $V = U\Theta$, since U is orthogonal and Θ is also orthogonal, V is also orthogonal and $A = V\Sigma U^\top$ is an SVD for A. Now we get
$$A^+ = U\Sigma^+ V^\top = U\Sigma^+ \Theta^\top U^\top.$$
However, since Θ is an orthogonal matrix, $\Theta^\top = \Theta^{-1}$, and a simple calculation shows that
$$\Sigma^+ \Theta^\top = \Sigma^+ \Theta^{-1} = \Lambda^+,$$
which yields the formula
$$A^+ = U\Lambda^+ U^\top.$$
Also observe that if we write
$$\Lambda_r = \operatorname{diag}(B_1, \ldots, B_p, \lambda_{2p+1}, \ldots, \lambda_r),$$
then $\Lambda_r$ is invertible and
$$\Lambda^+ = \begin{pmatrix} \Lambda_r^{-1} & 0 \\ 0 & 0 \end{pmatrix}.$$
Therefore, the pseudo-inverse of a normal matrix can be computed directly from any block diagonalization of A, as claimed.
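Here is a small numerical check of Proposition 17.3, assuming nothing beyond NumPy: a normal matrix is built from a hand-chosen block diagonal Λ (one 2 × 2 block, one nonzero scalar, one zero) and a random orthogonal U, and its pseudo-inverse is compared with U Λ+ U^T.

```python
import numpy as np

# Lambda = diag(B1, 3, 0): one 2x2 block of the form [[l, -m], [m, l]], one scalar, one zero.
B1 = np.array([[1.0, -2.0],
               [2.0,  1.0]])
Lam = np.zeros((4, 4))
Lam[:2, :2] = B1
Lam[2, 2] = 3.0                      # rank(Lambda) = 3

# Any orthogonal U gives a (real) normal matrix A = U Lam U^T.
U, _ = np.linalg.qr(np.random.default_rng(2).standard_normal((4, 4)))
A = U @ Lam @ U.T
print(np.allclose(A @ A.T, A.T @ A))          # A is normal

# Lambda^+ : invert the invertible blocks, keep the zero block.
Lam_plus = np.zeros((4, 4))
Lam_plus[:2, :2] = np.linalg.inv(B1)
Lam_plus[2, 2] = 1.0 / 3.0

# Proposition 17.3: A^+ = U Lambda^+ U^T.
print(np.allclose(np.linalg.pinv(A), U @ Lam_plus @ U.T))   # True
```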

The following properties, due to Penrose, characterize the pseudo-inverse of a matrix.

We have already proved that the pseudo-inverse satisfies these equations. For a proof of the

converse, see Kincaid and Cheney [61].

Lemma 17.4. Given any m × n matrix A (real or complex), the pseudo-inverse A+ of A is

the unique n × m matrix satisfying the following properties:

$$AA^+A = A, \qquad A^+AA^+ = A^+, \qquad (AA^+)^\top = AA^+, \qquad (A^+A)^\top = A^+A.$$
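For instance, one can check numerically that the pseudo-inverse returned by a library routine satisfies the four Penrose equations; a minimal sketch (random test matrix) follows.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
Ap = np.linalg.pinv(A)

# The four Penrose conditions.
print(np.allclose(A @ Ap @ A, A))
print(np.allclose(Ap @ A @ Ap, Ap))
print(np.allclose((A @ Ap).T, A @ Ap))
print(np.allclose((Ap @ A).T, Ap @ A))
```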


If A is an m × n matrix of rank n (and so m ≥ n), it is easily shown that the QR-decomposition in terms of Householder transformations applies as follows:

There are n m × m matrices H1, . . . , Hn, Householder matrices or the identity, and an

upper triangular m × n matrix R of rank n such that

A = H1 · · · HnR.

Then, because each $H_i$ is an isometry,
$$\|Ax - b\| = \|Rx - H_n \cdots H_1 b\|,$$
and the least squares problem Ax = b is equivalent to the system
$$Rx = H_n \cdots H_1 b.$$
Now, the system $Rx = H_n \cdots H_1 b$ is of the form
$$\begin{pmatrix} R_1 \\ 0_{m-n} \end{pmatrix} x = \begin{pmatrix} c \\ d \end{pmatrix},$$
where $R_1$ is an invertible $n \times n$ matrix (since A has rank n), $c \in \mathbb{R}^n$, and $d \in \mathbb{R}^{m-n}$, and the least squares solution of smallest norm is
$$x^+ = R_1^{-1} c.$$
Since $R_1$ is a triangular matrix, it is very easy to invert $R_1$.
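A minimal sketch of this QR approach in Python (full column rank assumed, data chosen at random): NumPy's QR factorization is based on Householder reflections, and the triangular system R1 x = c is solved by back substitution rather than by inverting R1 explicitly.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)
m, n = 7, 3
A = rng.standard_normal((m, n))       # full column rank with probability 1
b = rng.standard_normal(m)

# Reduced QR: A = Q1 R1, Q1 (m x n) with orthonormal columns, R1 (n x n) upper triangular.
# Q1^T b plays the role of c, the first n components of H_n ... H_1 b.
Q1, R1 = np.linalg.qr(A, mode='reduced')
c = Q1.T @ b

# Back substitution on the triangular system R1 x = c.
x = solve_triangular(R1, c, lower=False)

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
```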

The method of least squares is one of the most effective tools of the mathematical sciences.

There are entire books devoted to it. Readers are advised to consult Strang [101], Golub and

Van Loan [47], Demmel [25], and Trefethen and Bau [106], where extensions and applications

of least squares (such as weighted least squares and recursive least squares) are described.

Golub and Van Loan [47] also contains a very extensive bibliography, including a list of

books on least squares.

17.2 Data Compression and SVD

Among the many applications of SVD, a very useful one is data compression, notably for

images. In order to make precise the notion of closeness of matrices, we review briefly the

notion of matrix norm. We assume that the reader is familiar with the concepts of vector norm and matrix norm. The concept of a norm is defined in Chapter 7, and the reader

may want to review it before reading any further.

Given an m × n matrix A of rank r, we would like to find a best approximation of A by a matrix B of rank $k \leq r$ (actually, $k < r$), so that $\|A - B\|_2$ (or $\|A - B\|_F$) is minimized.


Proposition 17.5. Let A be an m × n matrix of rank r and let $V D U^\top = A$ be an SVD for A. Write $u_i$ for the columns of U, $v_i$ for the columns of V, and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p$ for the singular values of A ($p = \min(m, n)$). Then a matrix of rank $k < r$ closest to A (in the $\|\;\|_2$ norm) is given by
$$A_k = \sum_{i=1}^{k} \sigma_i v_i u_i^\top = V \operatorname{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\, U^\top$$
and $\|A - A_k\|_2 = \sigma_{k+1}$.

Proof. By construction, $A_k$ has rank k, and we have
$$\|A - A_k\|_2 = \Bigl\|\sum_{i=k+1}^{p} \sigma_i v_i u_i^\top\Bigr\|_2 = \bigl\|V \operatorname{diag}(0, \ldots, 0, \sigma_{k+1}, \ldots, \sigma_p)\, U^\top\bigr\|_2 = \sigma_{k+1}.$$
It remains to show that $\|A - B\|_2 \geq \sigma_{k+1}$ for all rank-$k$ matrices B. Let B be any rank-$k$ matrix, so its kernel has dimension $n - k$. The subspace $U_{k+1}$ spanned by $(u_1, \ldots, u_{k+1})$ has dimension $k + 1$, and because the sum of the dimensions of the kernel of B and of $U_{k+1}$ is $(n - k) + k + 1 = n + 1$, these two subspaces of $\mathbb{R}^n$ must intersect in a subspace of dimension at least 1. Pick any unit vector h in $\operatorname{Ker}(B) \cap U_{k+1}$. Then since Bh = 0, and since $U^\top h$ is a unit vector whose components vanish beyond the first $k + 1$, we have
$$\|A - B\|_2^2 \geq \|(A - B)h\|_2^2 = \|Ah\|_2^2 = \|V D U^\top h\|_2^2 \geq \sigma_{k+1}^2 \|U^\top h\|_2^2 = \sigma_{k+1}^2,$$
which proves our claim.

Note that $A_k$ can be stored using (m + n)k entries, as opposed to mn entries. When $k \ll m$, this is a substantial gain.
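A minimal sketch of Proposition 17.5 in Python/NumPy (random test matrix): truncating the SVD after k singular values gives $A_k$, and the spectral-norm error comes out equal to $\sigma_{k+1}$.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 60, 40, 5
A = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation: keep only the k largest singular values.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The spectral-norm error equals the (k+1)-th singular value (s[k], 0-indexed).
err = np.linalg.norm(A - A_k, ord=2)
print(np.isclose(err, s[k]))          # True

# Storage: roughly (m + n) * k numbers instead of m * n.
print((m + n) * k, m * n)
```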

A nice example of the use of Proposition 17.5 in image compression is given in Demmel

[25], Chapter 3, Section 3.2.3, pages 113–115; see the Matlab demo.

An interesting topic that we have not addressed is the actual computation of an SVD.

This is a very interesting but tricky subject. Most methods reduce the computation of an

SVD to the diagonalization of a well-chosen symmetric matrix (which is not A A). Interested

readers should read Section 5.4 of Demmel’s excellent book [25], which contains an overview

of most known methods and an extensive list of references.

17.3 Principal Components Analysis (PCA)

Suppose we have a set of data consisting of n points $X_1, \ldots, X_n$, with each $X_i \in \mathbb{R}^d$ viewed as a row vector.

Think of the $X_i$'s as persons, and if $X_i = (x_{i\,1}, \ldots, x_{i\,d})$, each $x_{i\,j}$ is the value of some feature (or attribute) of that person. For example, the $X_i$'s could be mathematicians, d = 2, and the first component, $x_{i\,1}$, of $X_i$ could be the year that $X_i$ was born, and the second component, $x_{i\,2}$, the length of the beard of $X_i$ in centimeters. Here is a small data set:


Name                    year   length (cm)
Carl Friedrich Gauss    1777    0
Camille Jordan          1838   12
Adrien-Marie Legendre   1752    0
Bernhard Riemann        1826   15
David Hilbert           1862    2
Henri Poincaré          1854    5
Emmy Noether            1882    0
Karl Weierstrass        1815    0
Eugenio Beltrami        1835    2
Hermann Schwarz         1843   20

We usually form the n × d matrix X whose ith row is Xi, with 1 ≤ i ≤ n. Then the

jth column is denoted by Cj (1 ≤ j ≤ d). It is sometimes called a feature vector, but this

terminology is far from being universally accepted. In fact, many people in computer vision

call the data points Xi feature vectors!

The purpose of principal components analysis, for short PCA, is to identify patterns in

data and understand the variance–covariance structure of the data. This is useful for the

following tasks:

1. Data reduction: Often much of the variability of the data can be accounted for by a

smaller number of principal components.

2. Interpretation: PCA can show relationships that were not previously suspected.

Given a vector (a sample of measurements) $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, recall that the mean (or average) $\overline{x}$ of x is given by
$$\overline{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$
We let $x - \overline{x}$ denote the centered data point
$$x - \overline{x} = (x_1 - \overline{x}, \ldots, x_n - \overline{x}).$$

In order to measure the spread of the xi’s around the mean, we define the sample variance

(for short, variance) $\operatorname{var}(x)$ (or $s^2$) of the sample x by
$$\operatorname{var}(x) = \frac{\sum_{i=1}^{n} (x_i - \overline{x})^2}{n - 1}.$$

There is a reason for using n − 1 instead of n.
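As a small illustration, here is a Python/NumPy sketch that forms the 10 × 2 data matrix X from the table above and computes the column means and sample variances (with the n − 1 denominator, i.e., ddof=1).

```python
import numpy as np

# Rows = mathematicians from the table above; columns = (birth year, beard length in cm).
X = np.array([
    [1777,  0], [1838, 12], [1752,  0], [1826, 15], [1862,  2],
    [1854,  5], [1882,  0], [1815,  0], [1835,  2], [1843, 20],
], dtype=float)

mean = X.mean(axis=0)                  # column means
centered = X - mean                    # centered data points
var = X.var(axis=0, ddof=1)            # sample variance with n - 1 in the denominator

print(mean)
print(var)
print(np.allclose(var, (centered ** 2).sum(axis=0) / (len(X) - 1)))   # same formula
```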