Singular Value Decomposition

Spectral Decomposition Theorem
Assume $A$ is symmetric. Then it has orthonormal eigenvectors $\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_n$ with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$. In this case,

$$A = QDQ^T$$

$$Q = (\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_n), \qquad D = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$$

Singular Value Decomposition

Given an $m \times n$ matrix $A$, we want

$$A = U \Sigma V^T$$

where $U$ ($m \times m$) and $V$ ($n \times n$) are orthogonal matrices and $\Sigma$ ($m \times n$) is diagonal. The diagonal elements of $\Sigma$ are called the singular values.

Assume $m \ge n$. Consider $A^T A$:

$$A^T A = (U\Sigma V^T)^T (U\Sigma V^T) = V \Sigma^T \Sigma V^T$$

where $(\Sigma^T \Sigma)_{n \times n} = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$. So, from the spectral decomposition theorem, we pick

$$V = (\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n), \quad \text{where the } \vec{v}_i \text{ are eigenvectors of } A^T A,$$

and $\sigma_i^2 = \lambda_i$, where $\lambda_i \ge 0$ are the eigenvalues of $A^T A$.

Now consider $AA^T$:

$$AA^T = (U\Sigma V^T)(U\Sigma V^T)^T = U \Sigma \Sigma^T U^T$$

If $\lambda$ is an eigenvalue of $AA^T$, then it is an eigenvalue of $A^T A$ or $\lambda = 0$.

From the theorem, we can pick

$$U = (\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_m), \quad \text{where the } \vec{u}_i \text{ are eigenvectors of } AA^T.$$

To sum it up, we have



$$U = (\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_m), \quad \vec{u}_i \text{ eigenvectors of } AA^T$$

$$V = (\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n), \quad \vec{v}_i \text{ eigenvectors of } A^T A$$

and $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$, $\sigma_i = \sqrt{\lambda_i}$, where the $\lambda_i$ are the eigenvalues of $A^T A$.

Assembling the pieces, it is assumed that:

  1. $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$
  2. $A\vec{v}_i = \sigma_i \vec{u}_i$, which is used to determine the relative signs of $\vec{v}_i$ and $\vec{u}_i$
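
As a quick numerical check on this construction, here is a minimal numpy sketch (the example matrix is arbitrary): it builds $V$ and $\Sigma$ from the eigen-decomposition of $A^TA$, recovers $U$ from $A\vec{v}_i = \sigma_i\vec{u}_i$, and compares against numpy's built-in SVD.

```python
import numpy as np

# Example matrix (m >= n), chosen arbitrarily for illustration.
A = np.array([[2.0, 0.0],
              [1.0, 3.0],
              [0.0, 1.0]])

# Eigen-decomposition of A^T A (symmetric), sorted by decreasing eigenvalue.
lam, V = np.linalg.eigh(A.T @ A)        # eigh returns ascending order
idx = np.argsort(lam)[::-1]
lam, V = lam[idx], V[:, idx]

sigma = np.sqrt(np.maximum(lam, 0.0))   # sigma_i = sqrt(lambda_i)

# Recover u_i from A v_i = sigma_i u_i (assumes all sigma_i > 0).
U = A @ V / sigma

# Check: A = U diag(sigma) V^T (the "economy" SVD with n columns in U).
print(np.allclose(A, U @ np.diag(sigma) @ V.T))                # True
print(np.allclose(sigma, np.linalg.svd(A, compute_uv=False)))  # True
```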

Properties and Mathematical Consequences of SVD

I. Outer Product Expansion

We have

$$A = U\Sigma V^T = \sum_{i=1}^{n} \sigma_i \vec{u}_i \vec{v}_i^T = \sum_{i=1}^{n} \sigma_i W_i, \qquad (W_i)_{m \times n} = \vec{u}_i \vec{v}_i^T$$

  1. $W_i \vec{x} = (\vec{u}_i \vec{v}_i^T)\vec{x} = \vec{u}_i(\vec{v}_i^T \vec{x}) = \vec{u}_i(\vec{v}_i \cdot \vec{x}) = (\vec{v}_i \cdot \vec{x})\,\vec{u}_i$
  2. If $m > n$, then $\vec{u}_{n+1}, \ldots, \vec{u}_m$ make no contribution to $A$.
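
The expansion is easy to verify numerically; a small sketch assuming numpy, with a random example matrix:

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # economy SVD: only n columns of U

# The sum of the rank-1 outer products sigma_i * u_i v_i^T rebuilds A.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_rebuilt))   # True
```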

II. Rank

$\operatorname{rank}(A)$ is the number of linearly independent columns in $A$.

The number of linearly independent columns in $A$ is equal to the number of linearly independent rows in $A$.


$\operatorname{rank}(A) = k$, where $\sigma_k \ne 0$ while $\sigma_{k+1} = 0$. Consequences:

  1. $W_i$ has rank 1. Proof: We have $W_i \vec{v}_i = (\vec{v}_i \cdot \vec{v}_i)\,\vec{u}_i = \vec{u}_i$, while $W_i \vec{v}_j = (\vec{v}_i \cdot \vec{v}_j)\,\vec{u}_i = \vec{0}$ for $j \ne i$. So $\operatorname{rank}(W_i) = 1$.
  2. $\sum_{i=1}^{l} \sigma_i W_i$ has rank $l$ (for $l \le k$).
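
In floating point $\sigma_{k+1}$ is rarely exactly zero, so in practice the rank is taken to be the number of singular values above a small tolerance; a sketch assuming numpy (the tolerance mirrors the default used by numpy.linalg.matrix_rank):

```python
import numpy as np

# Rank-2 example: the third column is the sum of the first two.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 3.0, 5.0],
              [1.0, 1.0, 2.0]])

s = np.linalg.svd(A, compute_uv=False)
tol = max(A.shape) * np.finfo(float).eps * s[0]    # cutoff relative to sigma_1
print(np.sum(s > tol), np.linalg.matrix_rank(A))   # 2 2
```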

III. Norms

$$\|A\|_2 = \sigma_1$$

Proof: Take $\vec{x} = \sum_{i=1}^{n} a_i \vec{v}_i$ where $\sum_{i=1}^{n} a_i^2 = 1$. Notice that

$$A\vec{x} = A\sum_{i=1}^{n} a_i \vec{v}_i = \sum_{i=1}^{n} a_i \sigma_i \vec{u}_i$$

$$\|A\vec{x}\|_2^2 = \sum_{i=1}^{n} a_i^2 \sigma_i^2 \le \sigma_1^2 \sum_{i=1}^{n} a_i^2 = \sigma_1^2$$

Thus we have $\|A\vec{x}\|_2 \le \sigma_1$ for every such $\vec{x}$, and $\|A\vec{x}\|_2 = \sigma_1$ when $\vec{x} = \vec{v}_1$. Since $\|A\|_2 = \max_{\|\vec{x}\|_2 = 1} \|A\vec{x}\|_2$, we have proved that $\|A\|_2 = \sigma_1$.

If $A$ is invertible (so $m = n$ and all $\sigma_i > 0$), then

$$\|A^{-1}\|_2 = \frac{1}{\sigma_n}$$

Proof: We have $A^{-1} = V\Sigma^{-1}U^T$, whose singular values are $1/\sigma_n \ge 1/\sigma_{n-1} \ge \cdots \ge 1/\sigma_1$. The largest of these is $1/\sigma_n$, so $\|A^{-1}\|_2 = 1/\sigma_n$.

Frobenius norm: $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$. In terms of the singular values,

$$\|A\|_F = \sqrt{\sum_{i=1}^{n} \sigma_i^2} = \|\vec{\sigma}\|_2$$
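
These identities are easy to confirm numerically; a sketch with an arbitrary invertible matrix, assuming numpy:

```python
import numpy as np

A = np.random.default_rng(1).normal(size=(4, 4))
s = np.linalg.svd(A, compute_uv=False)

print(np.isclose(np.linalg.norm(A, 2), s[0]))                        # ||A||_2 = sigma_1
print(np.isclose(np.linalg.norm(np.linalg.inv(A), 2), 1 / s[-1]))    # ||A^-1||_2 = 1/sigma_n
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2))))   # Frobenius norm
```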

IV. Transpose and Inverse

Let $A = U\Sigma V^T$, so

$$A^T = V\Sigma^T U^T, \qquad A^{-1} = V\Sigma^{-1}U^T \ \ (\text{if } A \text{ is invertible}).$$

Note this is not a standard SVD for $A^{-1}$, since the diagonal entries $1/\sigma_i$ appear in increasing order, but it can be rearranged into one.

V. Invariance

If $B = PAQ$, where $P$ and $Q$ are orthogonal, then $B$ has the same singular values as $A$.

Proof: The singular values of $A$ come from $\det(A^TA - \lambda I) = 0$. Since $P^TP = I$ and $Q^TQ = I$, we have

$$B^TB = Q^TA^TP^TPAQ = Q^T(A^TA)Q$$

so $\det(B^TB - \lambda I) = \det\!\big(Q^T(A^TA - \lambda I)Q\big) = \det(A^TA - \lambda I)$.

VI. Solve $A\vec{x} = \vec{b}$

If $A$ is invertible, then $A^{-1} = V\Sigma^{-1}U^T$, and

$$A^{-1} = \sum_{i=1}^{n} \frac{1}{\sigma_i} \vec{v}_i \vec{u}_i^T$$

So we can solve for $\vec{x} = A^{-1}\vec{b} = \sum_{i=1}^{n} \frac{\vec{u}_i \cdot \vec{b}}{\sigma_i}\,\vec{v}_i$. If $A$ is not invertible, we can use the pseudo-inverse $A^{+}$ instead.
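
A sketch of both cases, assuming numpy: the invertible system uses the sum formula above, and the rank-deficient case falls back to the pseudo-inverse.

```python
import numpy as np

# Invertible case: x = sum_i (u_i . b / sigma_i) v_i
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])
U, s, Vt = np.linalg.svd(A)
x = sum((U[:, i] @ b) / s[i] * Vt[i, :] for i in range(len(s)))
print(np.allclose(A @ x, b))          # True

# Non-invertible (or rectangular) case: use the pseudo-inverse A^+.
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])            # rank 1
x_ls = np.linalg.pinv(B) @ b          # least-squares / minimum-norm solution
print(x_ls)
```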

VII. Positive / Negative Eigenvalues

If $A$ is symmetric and positive definite, then the eigenvalues of $A$ are positive, and the SVD coincides with the spectral decomposition:

$$A = QDQ^T \quad \text{with} \quad U = V = Q, \ \ \Sigma = D,$$

where the eigenvalues in $D$ are sorted by size. If $A$ is symmetric but has negative eigenvalues, the singular values of $A$ are the absolute values of the eigenvalues of $A$.

For example, if $A\vec{x} = -3\vec{x}$ where $\|\vec{x}\|_2 = 1$, then we have $\sigma_i = 3$, $\vec{v}_i = \vec{x}$, and $\vec{u}_i = -\vec{x}$, so

$$A\vec{v}_i = -3\vec{x} = 3\vec{u}_i = \sigma_i\vec{u}_i$$

VIII. Low Rank Approximation

Let the low rank approximation of $A$ be

$$A_k = \sum_{i=1}^{k} \sigma_i W_i = \sum_{i=1}^{k} \sigma_i \vec{u}_i \vec{v}_i^T$$

where $k = 1, 2, \ldots, r-1$ and $r = \operatorname{rank}(A)$. So $\operatorname{rank}(A_k) = k$, and since $\operatorname{rank}(A) = r$, $A_r = A$.

Eckart-Young Theorem

  1. $\|A\|_2 = \sigma_1$
  2. $\|A - A_k\|_2 = \sigma_{k+1}$ for $k = 1, 2, \ldots, r-1$

The most accurate approximation of $A$ with rank $k$ is $A_k$. In other words,

$$\|A - A_k\|_2 = \min_{\operatorname{rank}(B) \le k} \|A - B\|_2$$

Proof (by contradiction): Suppose there exists $B$ with $\operatorname{rank}(B) \le k$ such that $\|A - B\|_2 < \sigma_{k+1}$. Let $N(B)$ be the null space of $B$. Since $\dim(N(B)) \ge n - k$ and $\dim(\operatorname{span}\{\vec{v}_1, \ldots, \vec{v}_{k+1}\}) = k+1$, we know that $N(B)$ intersects $S_{k+1} = \operatorname{span}\{\vec{v}_1, \ldots, \vec{v}_{k+1}\}$ nontrivially. Let $\vec{x} \in N(B) \cap S_{k+1}$ with $\|\vec{x}\|_2 = 1$, and write $\vec{x} = \alpha_1\vec{v}_1 + \cdots + \alpha_{k+1}\vec{v}_{k+1}$ with $\sum \alpha_i^2 = 1$. Then

$$\|(A - B)\vec{x}\|_2^2 = \|A\vec{x}\|_2^2 = \|\alpha_1\sigma_1\vec{u}_1 + \cdots + \alpha_{k+1}\sigma_{k+1}\vec{u}_{k+1}\|_2^2 = \sum_{i=1}^{k+1}\alpha_i^2\sigma_i^2 \ge \sigma_{k+1}^2$$

This contradicts our assumption that $\|A - B\|_2 < \sigma_{k+1}$.
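
A numerical illustration of the theorem, assuming numpy: form $A_k$ by truncating the SVD and check that the 2-norm error equals $\sigma_{k+1}$.

```python
import numpy as np

A = np.random.default_rng(2).normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]             # rank-k truncation A_k

print(np.linalg.matrix_rank(Ak))                       # k
print(np.isclose(np.linalg.norm(A - Ak, 2), s[k]))     # ||A - A_k||_2 = sigma_{k+1}
```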

Application of SVD: Image Compression

The relative error of our approximation / compression is given by

$$\text{relative error} \approx \frac{\|A - A_k\|_2}{\|A\|_2} = \frac{\sigma_{k+1}}{\sigma_1}$$

Given a grayscale image of size $m$ by $n$, we represent the image in the form of a matrix $A_{m \times n}$ where $0 \le a_{ij} \le 255$. The compression of the image is given by the low rank approximation of the matrix $A$; a code sketch follows the list below.

  1. How well does $A_k$ do in reproducing the original image?
  2. Image compression ratio?

    • original image: $mn$ values
    • compressed image: $k(m+n)$ values
  3. What about RGB images?
    For an RGB image of size $m$ by $n$, we concatenate the three color matrices and get a matrix $A_{3m \times n}$. The procedure is then the same as for grayscale images.

    • complications from using floats (instead of integers)
    • lossy compression
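
A compression sketch along these lines, assuming numpy; the image here is a synthetic placeholder array, so replace `img` with a real grayscale image to experiment.

```python
import numpy as np

def compress_gray(img, k):
    """Rank-k approximation of a grayscale image (2-D array of intensities)."""
    A = img.astype(float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    rel_err = s[k] / s[0] if k < len(s) else 0.0      # sigma_{k+1} / sigma_1
    ratio = k * sum(A.shape) / A.size                 # k(m+n) / (mn) storage ratio
    return np.clip(Ak, 0, 255), rel_err, ratio

# Synthetic stand-in for an image; replace with a real grayscale array.
img = np.random.default_rng(3).integers(0, 256, size=(120, 160))
Ak, rel_err, ratio = compress_gray(img, k=20)
print(rel_err, ratio)
```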

SVD Application: Watermarking / Steganography

The goal is to embed / hide an image in another one (and extract it later). We follow the same procedure as in Liu & Tan 2002.

Let $A$ be the original grayscale image and $B$ be the grayscale image to be embedded (of the same size as $A$).

  1. Find the SVD of $A$: $A = U\Sigma V^T$
  2. Let $\bar{B} = \Sigma + aB$ for $0 < a < 1$, and find the SVD of $\bar{B}$: $\bar{B} = U_{\bar{B}}\Sigma_{\bar{B}}V_{\bar{B}}^T$
  3. Embed $\bar{B}$ to get $\bar{A}$: $\bar{A} = U\Sigma_{\bar{B}}V^T$

Now, given an image $A^*$ (that you think is your $\bar{A}$),

  1. Find the SVD of $A^*$: $A^* = U_{A^*}\Sigma_{A^*}V_{A^*}^T$
  2. Find $\bar{B}^*$ (using $U_{\bar{B}}$ and $V_{\bar{B}}$): $\bar{B}^* = U_{\bar{B}}\Sigma_{A^*}V_{\bar{B}}^T$
  3. $B^* = \frac{1}{a}\left(\bar{B}^* - \Sigma\right)$
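
A minimal sketch of the embedding and extraction steps above, assuming numpy; the host image and watermark are synthetic placeholder arrays, and no distortion is applied between embedding and extraction.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.integers(0, 256, size=(64, 64)).astype(float)   # host image
B = rng.integers(0, 256, size=(64, 64)).astype(float)   # image to hide
a = 0.1                                                  # embedding strength, 0 < a < 1

# Embedding
U, s, Vt = np.linalg.svd(A)
Bbar = np.diag(s) + a * B                                # B_bar = Sigma + a B
Ub, sb, Vbt = np.linalg.svd(Bbar)
A_marked = U @ np.diag(sb) @ Vt                          # A_bar = U Sigma_Bbar V^T

# Extraction (needs U_Bbar, V_Bbar, Sigma, and a as side information)
s_star = np.linalg.svd(A_marked, compute_uv=False)       # Sigma_{A*}
Bbar_star = Ub @ np.diag(s_star) @ Vbt                   # rebuilt B_bar
B_rec = (Bbar_star - np.diag(s)) / a                     # recovered watermark
print(np.allclose(B_rec, B, atol=1e-6))                  # True (no distortion applied)
```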

SVD Application: Principal Component Analysis (PCA)

The goal is to find connections in data.

Example: Find a connection between the length of a word and the number of words used in its definition.

  L = # of letters in the word
  W = # of words in the definition

  bag: L = 3, W = 223
  across: L = 6, W = 117

Assumption: Linear connection:

$$W = \alpha L + \beta \qquad \text{or} \qquad L = \alpha' W + \beta'$$

Error: To make the error function mathematically equivalent for both linear connections, we use the true (perpendicular) distance $D_i$ from each data point to the line:

$$E = \sum_{i=1}^{n} D_i^2, \qquad D_i = \frac{|\alpha L_i + \beta - W_i|}{\sqrt{1 + \alpha^2}}$$
Data preprocessing:

  1. Center the data: Setting $\partial E/\partial\alpha = \partial E/\partial\beta = 0$ requires solving cubic equations. To avoid this, we center the data by

    $$X_i = L_i - \bar{L} \ \text{ where } \bar{L} = \tfrac{1}{n}\sum L_i, \qquad Y_i = W_i - \bar{W} \ \text{ where } \bar{W} = \tfrac{1}{n}\sum W_i,$$

    and with this we assume $Y = \alpha X$ (no intercept is needed).

  2. Scale variables (nondimensionalization): $\frac{1}{1+\alpha^2}$ must be dimensionless (?), so we scale $X$ and $Y$ using $X_C$ and $Y_C$:

    $$x_i = \frac{X_i}{X_C}, \qquad y_i = \frac{Y_i}{Y_C}$$

    Possible choices for $X_C$ and $Y_C$:

    • $X_C = \frac{1}{n}\|\vec{X}\|_1$ and $Y_C = \frac{1}{n}\|\vec{Y}\|_1$
    • $X_C = \|\vec{X}\|_\infty$ and $Y_C = \|\vec{Y}\|_\infty$
    • $X_C = \frac{1}{\sqrt{n}}\|\vec{X}\|_2$ and $Y_C = \frac{1}{\sqrt{n}}\|\vec{Y}\|_2$

To sum it up, we assume $y = \alpha x$ with

$$E(\alpha) = \frac{1}{1+\alpha^2}\sum_{i=1}^{n}(\alpha x_i - y_i)^2 = \frac{1}{1+\alpha^2}\left[\alpha^2\,\vec{x}\cdot\vec{x} - 2\alpha\,\vec{x}\cdot\vec{y} + \vec{y}\cdot\vec{y}\right]$$

Setting $\partial E/\partial\alpha = 0$, we get

$$\alpha = \frac{1}{2}\left(\lambda \pm \sqrt{\lambda^2 + 4}\right) \qquad \begin{cases} + & \text{if } \vec{x}\cdot\vec{y} > 0 \\ - & \text{if } \vec{x}\cdot\vec{y} < 0 \end{cases}$$

where

$$\lambda = \frac{\vec{y}\cdot\vec{y} - \vec{x}\cdot\vec{x}}{\vec{x}\cdot\vec{y}}$$
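
A sketch of this fit, assuming numpy and synthetic $(L, W)$ data: it centers and scales the data, evaluates $\alpha$ from the formula above, and compares with the slope implied by the first right singular vector of the data matrix.

```python
import numpy as np

# Synthetic (L, W) data roughly on a line, as a stand-in for the word-length example.
rng = np.random.default_rng(5)
L = rng.uniform(3, 12, size=50)
W = 20 * L + rng.normal(scale=5, size=50)

# Center and scale.
X, Y = L - L.mean(), W - W.mean()
x, y = X / np.abs(X).mean(), Y / np.abs(Y).mean()     # the X_C = (1/n)||X||_1 choice

# Slope minimizing the true (perpendicular) distance.
lam = (y @ y - x @ x) / (x @ y)
alpha = 0.5 * (lam + np.sign(x @ y) * np.sqrt(lam**2 + 4))

# The same direction comes from the first right singular vector of [x y].
P = np.column_stack([x, y])
v1 = np.linalg.svd(P)[2][0]                           # first row of V^T
print(alpha, v1[1] / v1[0])                           # should agree
```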

Principal Component Analysis

Assume there are $m$ variables (or quantities) $q_1, \ldots, q_m$ and $n$ measurements of them, so we have a data matrix $A_{n \times m}$. We first center the data:

$$Q_{ij} = a_{ij} - \bar{a}_j, \qquad \bar{a}_j = \frac{1}{n}\sum_{i=1}^{n} a_{ij}$$

Then we scale the data:

$$p_{ij} = \frac{Q_{ij}}{S_j}, \qquad S_j = \text{scaling factor for the } j\text{th variable}$$

Our objective is to find a $k$-dimensional approximation of the data. When $k = 1$, $\vec{p} = \alpha_1\vec{v}_1$. When $k = 2$, $\vec{p} = \alpha_1\vec{v}_1 + \alpha_2\vec{v}_2$. Here, $\vec{v}_i \cdot \vec{v}_i = 1$ and $\vec{v}_i \cdot \vec{v}_j = 0$ for $i \ne j$. The error is given by

$$E(\vec{v}_1, \ldots, \vec{v}_k) = \sum_i d_i^2 = \sum_i\left[\vec{p}_i \cdot \vec{p}_i - (\vec{p}_i \cdot \vec{v}_1)^2 - \cdots - (\vec{p}_i \cdot \vec{v}_k)^2\right]$$

Notice that $\sum_{i=1}^{n} \vec{p}_i \cdot \vec{p}_i = \sum_{i,j} P_{ij}^2 = \|P\|_F^2 = \sigma_1^2 + \cdots + \sigma_m^2$.

To minimize the error (take $k = 1$ first), we use a Lagrange multiplier:

$$E(\vec{v}, \lambda) = \sum_i d_i^2 + \lambda\left(\|\vec{v}\|_2^2 - 1\right)$$

and set

$$\nabla_{\vec{v}} E = \vec{0}, \qquad \frac{\partial E}{\partial \lambda} = 0$$

So we have

$$(P^TP)\vec{v} = \lambda\vec{v}$$

and $\lambda$ is an eigenvalue of $P^TP$ with $\vec{v}$ the corresponding eigenvector. For such a $\vec{v}$,

$$\sum_{i=1}^{n}(\vec{p}_i \cdot \vec{v})^2 = \vec{v}^{\,T}P^TP\vec{v} = \lambda$$

so the error is smallest when $\lambda$ is the largest eigenvalue $\lambda_1$, with eigenvector $\vec{v}_1$. With this,

$$E_1 = \sum_i\left[\vec{p}_i \cdot \vec{p}_i - (\vec{p}_i \cdot \vec{v}_1)^2\right] = \sigma_1^2 + \cdots + \sigma_m^2 - \lambda_1$$

So $\lambda_i = \sigma_i^2$: the eigenvalues of $P^TP$ are the squares of the singular values of $P$.

Let $P$ be the $n \times m$ adjusted data matrix (with $n \ge m$ and $\operatorname{rank}(P) = m$). Also, let

$$\vec{p} = \alpha_1\vec{v}_1 + \cdots + \alpha_k\vec{v}_k$$

For each $k$, the best fit is obtained using the first $k$ columns of $V$. The error is

$$E_k = \sigma_{k+1}^2 + \cdots + \sigma_m^2$$

and the relative error is

$$\text{relative error} = \frac{\sigma_{k+1}^2 + \cdots + \sigma_m^2}{\sigma_1^2 + \cdots + \sigma_m^2}$$


  1. For general data, how do we pick $k$?
    • Guttman–Kaiser: pick $k$ so that $\sigma_k \ge \frac{1}{m}\sum_{i=1}^{m}\sigma_i$, i.e., keep the components above the average
    • Cattell scree: the corner (elbow) point on the curve of $\sigma_i$ versus $i$
    • Math solution: use the relative error
  2. How much data is needed?
    Rule of 10: about $10m$ measurements
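
A sketch of the whole procedure, assuming numpy and synthetic data (the choice $S_j = $ standard deviation of the $j$th column is just one option): center and scale the data matrix, take its SVD, and report the relative error.

```python
import numpy as np

def pca(A, k):
    """First k principal directions of an n-by-m data matrix A (rows = measurements)."""
    Q = A - A.mean(axis=0)                       # center each variable (column)
    P = Q / Q.std(axis=0)                        # scale each variable (one possible S_j)
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    rel_err = np.sum(s[k:]**2) / np.sum(s**2)    # (sigma_{k+1}^2+...+sigma_m^2) / sum
    return Vt[:k].T, s, rel_err                  # columns = first k principal directions

# Synthetic data: 3 variables, the third nearly a combination of the first two.
rng = np.random.default_rng(6)
A = rng.normal(size=(200, 2))
A = np.column_stack([A, A @ [1.0, -2.0] + 0.01 * rng.normal(size=200)])

V_k, s, rel_err = pca(A, k=2)
print(s, rel_err)    # the third singular value (and the relative error) is tiny
```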

SVD Application: Independent Component Analysis (ICA)

In this situation, we have two sources and two recorders. Our goal is to determine what source I and source II said, given the two recordings. Our assumption is

$$x_1 = a_{11}s_1 + a_{12}s_2, \qquad x_2 = a_{21}s_1 + a_{22}s_2$$

This can be written as

$$\vec{x} = A\vec{s}$$

The matrix $A$ is known as the mixing matrix. The data matrix can be written as $X_{n \times m}$, where $m \le n$ and $X$ has full rank. Similar to the previous section, we preprocess the data by centering and scaling. First we center the data:

$$\vec{x}_M = \frac{1}{n}\sum_{i=1}^{n}\vec{x}_i, \qquad \vec{x}_i^{\,*} = \vec{x}_i - \vec{x}_M, \qquad \vec{x}_M = A\vec{s}_M, \qquad \vec{s}_i^{\,*} = \vec{s}_i - \vec{s}_M$$

Notice that the solution for $A$ and $\vec{s}_i$ is not unique: if $A$ and $\vec{s}$ are a solution such that $\vec{x} = A\vec{s}$, then for any invertible $B$, $AB$ and $B^{-1}\vec{s}$ is also a solution, since $\vec{x} = A\vec{s} = (AB)(B^{-1}\vec{s})$. To deal with this, we will look for a solution that satisfies

$$\frac{1}{n}\sum_{i=1}^{n}\vec{s}_i^{\,*}(\vec{s}_i^{\,*})^T = I_{m \times m}$$

In statistics, this is known as whitening the data.

Assume $A = U\Sigma V^T$, and notice that

$$\frac{1}{n}(X^*)^TX^* = \frac{1}{n}\sum_{i=1}^{n}\vec{x}_i^{\,*}(\vec{x}_i^{\,*})^T = \frac{1}{n}\sum_{i=1}^{n}(A\vec{s}_i^{\,*})(A\vec{s}_i^{\,*})^T = A\left(\frac{1}{n}\sum_{i=1}^{n}\vec{s}_i^{\,*}(\vec{s}_i^{\,*})^T\right)A^T = AA^T = U\Sigma V^T(U\Sigma V^T)^T = U\Sigma^2 U^T$$

To compute $A$, we need the SVD of $\frac{1}{n}(X^*)^TX^*$. Since it is symmetric positive definite, we have

$$\frac{1}{n}(X^*)^TX^* = U_X\Sigma_X U_X^T$$

So $U = U_X$ and $\Sigma^2 = \Sigma_X$ (which means $\sigma_i = \sqrt{(\sigma_X)_i}$).

To find $V$, we introduce the idea of a contrast function. A usual choice is the kurtosis $\kappa$. Given a single variable $s$, using the fact that we want whitened sources (zero mean, unit variance), we have

$$\kappa(s) = E[s^4] - 3\left(E[s^2]\right)^2 = E[s^4] - 3$$

When $s$ is Gaussian, we have $\kappa = 0$. We want the sources to be as non-Gaussian as possible, so we want $|\kappa|$ (or $\kappa^2$) as large as possible. For multiple sources, we can use

$$\vec{\kappa} = (\kappa_1, \ldots, \kappa_m)$$

so the contrast function would be $\|\vec{\kappa}\|_2^2$.
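
A sketch of the whitening step, assuming numpy and two synthetic uniform (non-Gaussian) sources: the mixing matrix here is made up for illustration, and only $U$ and $\Sigma$ are recovered; the remaining rotation $V$ would come from maximizing the kurtosis-based contrast.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
# Two independent, non-Gaussian, unit-variance sources (uniform distributions).
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, 2))
A_mix = np.array([[1.0, 0.5],
                  [0.3, 1.0]])           # "unknown" mixing matrix (made up)
X = S @ A_mix.T                          # each row is one recording x_i = A s_i

# Center, then whiten using (1/n) X^T X = U Sigma^2 U^T.
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / n
sig2, U = np.linalg.eigh(C)
sig2, U = sig2[::-1], U[:, ::-1]         # sort descending
Sigma = np.diag(np.sqrt(sig2))

W = U @ np.linalg.inv(Sigma)             # whitening transform
Z = Xc @ W                               # whitened data: (1/n) Z^T Z = I
print(np.allclose(Z.T @ Z / n, np.eye(2), atol=1e-8))   # True
# The remaining rotation V (to separate the sources) is fixed by maximizing kurtosis.
```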