
Singular Value Decomposition

Spectral Decomposition Theorem
If $A$ is symmetric, then it has orthonormal eigenvectors
$\vec{u}_1, \vec{u}_2, \dots, \vec{u}_n$ with eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$. In this case,

$$A = QDQ^T$$

where
$$Q = (\vec{u}_1, \vec{u}_2, \dots, \vec{u}_n), \qquad D = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$$
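
As a quick numerical check (a minimal sketch using numpy; the matrix below is just an illustrative example), `numpy.linalg.eigh` returns exactly the $Q$ and $D$ of this theorem for a symmetric matrix:

```python
import numpy as np

# A small symmetric matrix (illustrative example)
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# eigh is for symmetric matrices: eigenvalues (diagonal of D) and
# orthonormal eigenvectors (columns of Q)
eigvals, Q = np.linalg.eigh(A)
D = np.diag(eigvals)

print(np.allclose(A, Q @ D @ Q.T))      # A = Q D Q^T
print(np.allclose(Q.T @ Q, np.eye(3)))  # Q is orthogonal
```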

Singular Value Decomposition

Given an $m \times n$ matrix $A$, we want

$$A = U\Sigma V^T$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is diagonal. The diagonal elements of $\Sigma$ are called the singular values.

Assume $m \ge n$. Consider $A^TA$:

$$A^TA = (V\Sigma^TU^T)(U\Sigma V^T) = V(\Sigma^T\Sigma)V^T$$

where $(\Sigma^T\Sigma)_{n\times n} = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2)$. So, from the spectral decomposition theorem, we pick

$$V = (\vec{v}_1, \vec{v}_2, \dots, \vec{v}_n), \quad \text{where } \vec{v}_i \text{ are eigenvectors of } A^TA,$$

and $\sigma_i^2 = \lambda_i$, where $\lambda_i \ge 0$ are the eigenvalues of $A^TA$.

Now consider $AA^T$:

$$AA^T = U(\Sigma\Sigma^T)U^T$$

Theorem
If $\lambda$ is an eigenvalue of $A^TA$, then it is an eigenvalue of $AA^T$, or $\lambda = 0$.

From the theorem, we can pick

$$U = (\vec{u}_1, \vec{u}_2, \dots, \vec{u}_m), \quad \text{where } \vec{u}_i \text{ are eigenvectors of } AA^T.$$

To sum it up, we have

$$A = U\Sigma V^T$$

where

$$U = (\vec{u}_1, \vec{u}_2, \dots, \vec{u}_m), \quad \text{where } \vec{u}_i \text{ are eigenvectors of } AA^T,$$
$$V = (\vec{v}_1, \vec{v}_2, \dots, \vec{v}_n), \quad \text{where } \vec{v}_i \text{ are eigenvectors of } A^TA,$$

and $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_n)$ with $\sigma_i = \sqrt{\lambda_i}$, where the $\lambda_i$ are the eigenvalues of $A^TA$.

When assembling the pieces, it is assumed that (see the numerical sketch below):

  1. $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n$
  2. $A\vec{v}_i = \sigma_i\vec{u}_i$, which is used to determine the signs of $\vec{v}_i$ and $\vec{u}_i$
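
The recipe above can be carried out directly with numpy. This is only a sketch of the construction in these notes (library SVD routines do not form $A^TA$, for numerical reasons); the matrix `A` is a random illustrative example, and the convention $A\vec{v}_i = \sigma_i\vec{u}_i$ is enforced by computing each $\vec{u}_i$ from $\vec{v}_i$:

```python
import numpy as np

def svd_via_eig(A):
    """Sketch: build U, sigma, V for an m x n matrix A (m >= n, full rank)
    from the eigendecomposition of A^T A, following the recipe above."""
    lam, V = np.linalg.eigh(A.T @ A)          # eigenpairs of A^T A (symmetric)
    order = np.argsort(lam)[::-1]             # sort so sigma_1 >= ... >= sigma_n
    lam, V = lam[order], V[:, order]
    sigma = np.sqrt(np.clip(lam, 0.0, None))  # sigma_i = sqrt(lambda_i)
    U = (A @ V) / sigma                       # A v_i = sigma_i u_i fixes the signs
    return U, sigma, V                        # U is the "thin" m x n factor

A = np.random.rand(5, 3)                      # illustrative example
U, sigma, V = svd_via_eig(A)
print(np.allclose(A, U @ np.diag(sigma) @ V.T))  # True
```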

Properties and Mathematical Consequences of SVD

I. Outer Product Expansion

We have

$$A = \sum_{i=1}^{n}\sigma_i W_i$$

where

$$(W_i)_{m\times n} = \vec{u}_i\vec{v}_i^T$$

Notes:

  1. $W_i\vec{x} = (\vec{u}_i\vec{v}_i^T)\vec{x} = \vec{u}_i(\vec{v}_i^T\vec{x}) = \vec{u}_i(\vec{v}_i\cdot\vec{x}) = (\vec{v}_i\cdot\vec{x})\vec{u}_i$
  2. If $m > n$, then $\vec{u}_{n+1}, \dots, \vec{u}_m$ make no contribution to $A$.

II. Rank

Definition
rank(A) is the number of linearly independent columns in A.

Theorem
The number of linearly independent columns in A is equal to the number of linearly independent rows in A.

Fact
$\operatorname{rank}(A) \le \min\{m, n\}$

$\operatorname{rank}(A) = k$, where $\sigma_k \ne 0$ while $\sigma_{k+1} = 0$ (numerically this is applied with a tolerance; see the sketch after the list below). Consequences:

  1. $W_i$ has rank 1. Proof: We have $W_i\vec{v}_i = (\vec{v}_i\cdot\vec{v}_i)\vec{u}_i = \vec{u}_i$, while $W_i\vec{v}_j = (\vec{v}_i\cdot\vec{v}_j)\vec{u}_i = \vec{0}$ for $j \ne i$. So $\operatorname{rank}(W_i) = 1$.
  2. $\sum_{i=1}^{l}\sigma_i W_i$ has rank $l$ (provided $\sigma_l \ne 0$).
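
In floating-point arithmetic the "zero" singular values are only approximately zero, so the rank is counted with a tolerance. A minimal sketch (the tolerance rule below is an assumption, roughly what `numpy.linalg.matrix_rank` uses by default):

```python
import numpy as np

def rank_from_svd(A, tol=None):
    """Count the singular values of A that are numerically nonzero."""
    sigma = np.linalg.svd(A, compute_uv=False)
    if tol is None:
        tol = max(A.shape) * np.finfo(A.dtype).eps * sigma[0]
    return int(np.sum(sigma > tol))

# A 4 x 3 matrix of rank 2: the third column is the sum of the first two
B = np.random.rand(4, 2)
A = np.column_stack([B, B[:, 0] + B[:, 1]])
print(rank_from_svd(A))           # 2
print(np.linalg.matrix_rank(A))   # 2, for comparison
```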

III. Norms

Theorem
$\|A\|_2 = \sigma_1$

Proof: Take $\vec{x} = \sum_{i=1}^n a_i\vec{v}_i$ where $\sum_{i=1}^n a_i^2 = 1$. Notice that

$$A\vec{x} = A\sum_{i=1}^n a_i\vec{v}_i = \sum_{i=1}^n a_i\sigma_i\vec{u}_i$$

So
$$\|A\vec{x}\|_2^2 = \sum_{i=1}^n a_i^2\sigma_i^2 \le \sigma_1^2\sum_{i=1}^n a_i^2 = \sigma_1^2$$

Thus for every $\vec{x}$ with $\|\vec{x}\|_2 = 1$ we have $\|A\vec{x}\|_2 \le \sigma_1$, and $\|A\vec{x}\|_2 = \sigma_1$ when $\vec{x} = \vec{v}_1$. Since $\|A\|_2 = \max_{\|\vec{x}\|_2 = 1}\|A\vec{x}\|_2$, we have proved that $\|A\|_2 = \sigma_1$.

Theorem
If A is invertible (so m=n) then

$$\kappa_2(A) = \frac{\sigma_1}{\sigma_n}$$

Proof: We have

$$\kappa_2(A) = \|A\|_2\,\|A^{-1}\|_2 = \sigma_1\cdot\frac{1}{\sigma_n} = \frac{\sigma_1}{\sigma_n}$$

Frobenius norm: $\|A\|_F = \sqrt{\sum_i\sum_j a_{ij}^2}$.

Fact

$$\|A\|_F = \sqrt{\sum_{i=1}^n\sigma_i^2} = \|\vec{\sigma}\|_2$$
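
Both norm identities, and the condition-number formula above, are easy to confirm numerically (a minimal check on random matrices, purely for illustration):

```python
import numpy as np

A = np.random.rand(6, 4)                      # illustrative example
sigma = np.linalg.svd(A, compute_uv=False)    # singular values, decreasing

print(np.isclose(np.linalg.norm(A, 2), sigma[0]))                    # ||A||_2 = sigma_1
print(np.isclose(np.linalg.norm(A, 'fro'), np.linalg.norm(sigma)))   # ||A||_F = ||sigma||_2

# For a square invertible matrix, kappa_2(A) = sigma_1 / sigma_n
S = np.random.rand(4, 4) + 4 * np.eye(4)      # kept well away from singular
s = np.linalg.svd(S, compute_uv=False)
print(np.isclose(np.linalg.cond(S, 2), s[0] / s[-1]))
```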

IV. Transpose and Inverse

Let $A = U\Sigma V^T$; then

$$A^T = V\Sigma^T U^T$$
$$A^{-1} = V\Sigma^{-1}U^T \quad (\text{if } A \text{ is invertible})$$

Note this is not a standard SVD for $A^{-1}$ (the diagonal entries $1/\sigma_i$ appear in increasing order), but it can be rearranged into one.

V. Invariance

If $B = PAQ$ where $P$ and $Q$ are orthogonal, then $B$ has the same singular values as $A$.

Proof: The singular values of $A$ are determined by $\det(A^TA - \lambda I) = 0$. Since $P^TP = I$ and $Q^{-1} = Q^T$, we have $B^TB = Q^TA^TP^TPAQ = Q^T(A^TA)Q$, so

$$B^TB - \lambda I = Q^T(A^TA)Q - \lambda I = Q^T(A^TA - \lambda I)Q$$

So $\det(A^TA - \lambda I) = \det(B^TB - \lambda I)$, and the two matrices have the same eigenvalues.

VI. Solve $A\vec{x} = \vec{b}$

If $A$ is invertible, then $A^{-1} = V\Sigma^{-1}U^T$, and

$$A^{-1} = \sum_{i=1}^n\frac{1}{\sigma_i}\vec{v}_i\vec{u}_i^T$$

So we can solve for $\vec{x} = A^{-1}\vec{b}$. If $A$ is not invertible, we could use the pseudo-inverse $A^+$ instead.
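
A sketch of solving $A\vec{x} = \vec{b}$ through this expansion. Dropping the terms with (numerically) zero $\sigma_i$ gives the pseudo-inverse solution, which is what `numpy.linalg.pinv` or `lstsq` would return; the matrix and right-hand side below are made up for illustration:

```python
import numpy as np

def solve_svd(A, b, tol=1e-12):
    """x = sum_i (u_i . b / sigma_i) v_i, dropping numerically zero sigma_i.
    For invertible A this is A^{-1} b; otherwise it is the pseudo-inverse solution."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    c = U.T @ b                         # coefficients u_i . b
    inv = np.zeros_like(sigma)
    nz = sigma > tol * sigma[0]
    inv[nz] = 1.0 / sigma[nz]
    return Vt.T @ (inv * c)

A = np.random.rand(4, 4)                # illustrative example
b = np.random.rand(4)
x = solve_svd(A, b)
print(np.allclose(A @ x, b))            # True when A is invertible
```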

VII. Positive / Negative Eigenvalues

If $A$ is symmetric and positive definite, then the eigenvalues of $A$ are positive, and

$$A = U\Sigma V^T = QDQ^T$$

where the eigenvalues in $D$ are sorted by size. If $A$ is symmetric but has negative eigenvalues, the singular values of $A$ are the absolute values of the eigenvalues of $A$.

For example, if $A\vec{x} = -3\vec{x}$ where $\|\vec{x}\|_2 = 1$, then we have $\sigma_i = 3$, $\vec{v}_i = \vec{x}$, and $\vec{u}_i = -\vec{x}$, so

$$A\vec{v}_i = -3\vec{x} = 3\vec{u}_i$$

VIII. Low Rank Approximation

Let the low rank approximation of A be

$$A_k = \sum_{i=1}^k\sigma_i W_i$$

where $k = 1, 2, \dots, r-1$ and $r = \operatorname{rank}(A)$. So $\operatorname{rank}(A_k) = k$, and since $\operatorname{rank}(A) = r$, we have $A_r = A$.

Eckart-Young Theorem

  1. $\|A\|_2 = \sigma_1$
  2. $\|A - A_k\|_2 = \sigma_{k+1}$ for $k = 1, 2, \dots, r-1$

Theorem
The most accurate approximation of $A$ with rank $k$ is $A_k$. In other words,

$$\min_{\operatorname{rank}(B)\le k}\|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$$

Proof: (by contradiction) Suppose there exists $B$ such that $\operatorname{rank}(B)\le k$ and $\|A - B\|_2 < \sigma_{k+1}$. Let $\mathcal{N}(B)$ be the null space of $B$. Since $\dim(\mathcal{N}(B)) \ge n-k$ and $\dim(\operatorname{span}\{\vec{v}_1, \dots, \vec{v}_{k+1}\}) = k+1$, the null space $\mathcal{N}(B)$ must intersect $S_{k+1} = \operatorname{span}\{\vec{v}_1, \dots, \vec{v}_{k+1}\}$ nontrivially. Let $\vec{x} \in \mathcal{N}(B)\cap S_{k+1}$ with $\|\vec{x}\|_2 = 1$, say $\vec{x} = \alpha_1\vec{v}_1 + \dots + \alpha_{k+1}\vec{v}_{k+1}$. Then

$$\|(A - B)\vec{x}\|_2^2 = \|A\vec{x}\|_2^2 = \|\alpha_1\sigma_1\vec{u}_1 + \dots + \alpha_{k+1}\sigma_{k+1}\vec{u}_{k+1}\|_2^2 \ge \sigma_{k+1}^2$$

Since $\|(A - B)\vec{x}\|_2 \le \|A - B\|_2\|\vec{x}\|_2 = \|A - B\|_2$, this contradicts our assumption that $\|A - B\|_2 < \sigma_{k+1}$.
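
A numerical check of the Eckart-Young statement: the truncated sum $A_k = \sum_{i=1}^k\sigma_i W_i$ satisfies $\|A - A_k\|_2 = \sigma_{k+1}$ (random matrix, purely for illustration):

```python
import numpy as np

A = np.random.rand(8, 6)                       # illustrative example
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

def truncate(k):
    """A_k = sum of the first k terms sigma_i u_i v_i^T."""
    return U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

for k in range(1, 6):
    err = np.linalg.norm(A - truncate(k), 2)
    print(k, np.isclose(err, sigma[k]))        # ||A - A_k||_2 = sigma_{k+1}
```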

Application of SVD: Image Compression

The relative error of our approximation / compression is given by

$$\text{relative error} = \frac{\|A - A_k\|_2}{\|A\|_2} = \frac{\sigma_{k+1}}{\sigma_1}$$

Given a grayscale image of size $m$ by $n$, we represent the image as a matrix $A_{m\times n}$ where $0 \le a_{ij} \le 255$. The compression of the image is given by the low rank approximation of the matrix $A$ (a code sketch follows the list below).

  1. How well does $A_k$ do in reproducing the original image?
    [Figure: relative-error.png, the relative error of the approximation]

  2. Image compression ratio?

    • original image: $mn$ values
    • compressed image: $k(m+n)$ values
  3. What about RGB images?
    For an RGB image of size $m$ by $n$, we concatenate the three color matrices to get a matrix $A_{3m\times n}$. The rest of the procedure is the same as for grayscale images.

  4. Comments:

    • complications from using floats (instead of integers)
    • lossy compression
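
A minimal compression sketch with numpy. The commented-out loading line is hypothetical (any image library that returns a 2-D array of pixel values would do); here a random array stands in for the image:

```python
import numpy as np

def compress(A, k):
    """Rank-k approximation of a grayscale image stored as an m x n array."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
    rel_err = sigma[k] / sigma[0]                        # sigma_{k+1} / sigma_1
    ratio = A.size / (k * (A.shape[0] + A.shape[1]))     # mn vs k(m+n) stored values
    return np.clip(Ak, 0, 255), rel_err, ratio

# A = your_image_loader("photo.png")   # hypothetical: any loader returning a 2-D array
A = np.random.randint(0, 256, size=(480, 640)).astype(float)  # stand-in "image"
Ak, rel_err, ratio = compress(A, k=50)
print(rel_err, ratio)
# For an RGB image, stack the three m x n channels into a 3m x n array first.
```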

SVD Application: Watermarking / Steganography

The goal is to embed / hide an image in another one (and extract it later). We follow the same procedure as in Liu & Tan (2002); a code sketch follows the extraction steps below.

Let $A$ be the original grayscale image and $B$ be the grayscale image to be embedded (of the same size as $A$).

  1. Find the SVD of $A$: $A = U\Sigma V^T$
  2. Let $\bar{B} = \Sigma + aB$ for some $0 < a < 1$, and find the SVD of $\bar{B}$: $\bar{B} = U_{\bar{B}}\Sigma_{\bar{B}}V_{\bar{B}}^T$
  3. Embed $\bar{B}$ to get $\bar{A}$: $\bar{A} = U\Sigma_{\bar{B}}V^T$

Now, given an image $A$ (that you think is your $\bar{A}$),

  1. Find the SVD of $A$: $A = U_A\Sigma_A V_A^T$
  2. Find $\bar{B}$ (using $U_{\bar{B}}$ and $V_{\bar{B}}$): $\bar{B} = U_{\bar{B}}\Sigma_A V_{\bar{B}}^T$
  3. $B = \frac{1}{a}(\bar{B} - \Sigma)$
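
A sketch of the embedding and extraction steps above, with variable names following the notes. For simplicity it assumes square images (so $\Sigma$ can be stored as a square diagonal matrix) and uses random arrays as stand-in images; it is an illustration of the procedure, not the reference implementation from Liu & Tan:

```python
import numpy as np

def embed(A, B, a=0.1):
    """Hide B in A; returns the watermarked image A_bar and the pieces kept for extraction."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    Bbar = np.diag(S) + a * B                    # B_bar = Sigma + a B
    Ub, Sb, Vbt = np.linalg.svd(Bbar, full_matrices=False)
    Abar = U @ np.diag(Sb) @ Vt                  # A_bar = U Sigma_Bbar V^T
    return Abar, (S, Ub, Vbt)

def extract(Astar, keys, a=0.1):
    """Recover an estimate of B from a (possibly modified) watermarked image."""
    S, Ub, Vbt = keys
    Sstar = np.linalg.svd(Astar, compute_uv=False)
    Bbar = Ub @ np.diag(Sstar) @ Vbt             # B_bar = U_Bbar Sigma_A V_Bbar^T
    return (Bbar - np.diag(S)) / a               # B = (B_bar - Sigma) / a

A = np.random.rand(64, 64) * 255                 # stand-in images of the same (square) size
B = np.random.rand(64, 64) * 255
Abar, keys = embed(A, B)
print(np.allclose(extract(Abar, keys), B))       # True when A_bar is unmodified
```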

SVD Application: Principal Component Analysis (PCA)

The goal is to find connections in data.

Example: Find connections between word length and number of words used in its definition.

L = # of letters in the word
W = # of words in the definition

            L     W
  bag       3     223
  across    6     117

Assumption: Linear connection:

$$W = \alpha L + \beta \qquad\text{or}\qquad L = aW + b$$

Error: To make the error function mathematically equivalent for both linear connections, we use the true (perpendicular) distance $D_i$:

$$E(\alpha, \beta) = \sum_i D_i^2 = \frac{1}{1+\alpha^2}\sum_i(\alpha L_i + \beta - W_i)^2$$

Data preprocessing:

  1. Center the data: Setting $\partial E/\partial\alpha = \partial E/\partial\beta = 0$ requires solving cubic equations. To avoid this, we center the data:

    $$X_i = L_i - \bar{L}, \quad\text{where } \bar{L} = \tfrac{1}{n}\sum L_i$$
    $$Y_i = W_i - \bar{W}, \quad\text{where } \bar{W} = \tfrac{1}{n}\sum W_i$$

    With this, we assume $Y = \alpha_0 X$.

  2. Scale the variables: (nondimensionalization) $\frac{1}{1+\alpha_0^2}$ must be dimensionless (i.e. the slope $\alpha_0$ should carry no units), so we scale $X$ and $Y$ using $X_C$ and $Y_C$:

    $$X_i' = X_i/X_C, \qquad Y_i' = Y_i/Y_C$$

    Possible choices for $X_C$ and $Y_C$:

    • $X_C = \frac{1}{n}\|\vec{X}\|_1$ and $Y_C = \frac{1}{n}\|\vec{Y}\|_1$
    • $X_C = \|\vec{X}\|_\infty$ and $Y_C = \|\vec{Y}\|_\infty$
    • $X_C = \frac{1}{n}\|\vec{X}\|_2$ and $Y_C = \frac{1}{n}\|\vec{Y}\|_2$

To sum it up, writing $x_i, y_i$ for the centered and scaled data, we assume $Y = \alpha X$ with

$$E(\alpha) = \frac{1}{1+\alpha^2}\sum_i(\alpha x_i - y_i)^2 = \frac{1}{1+\alpha^2}\left[\alpha^2\,\vec{x}\cdot\vec{x} - 2\alpha\,\vec{x}\cdot\vec{y} + \vec{y}\cdot\vec{y}\right]$$

Setting $\partial E/\partial\alpha = 0$, we have

$$\alpha = \frac{1}{2}\left(-\lambda \pm \sqrt{\lambda^2 + 4}\right), \qquad
\begin{cases} + & \text{if } \vec{x}\cdot\vec{y} > 0 \\ - & \text{if } \vec{x}\cdot\vec{y} < 0 \end{cases}$$

where

$$\lambda = \frac{\vec{x}\cdot\vec{x} - \vec{y}\cdot\vec{y}}{\vec{x}\cdot\vec{y}}$$
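
A sketch of the whole two-variable fit: center, scale (using the 1-norm choice for $X_C$ and $Y_C$), and apply the formula for $\alpha$. The data below is made up for illustration:

```python
import numpy as np

def orthogonal_slope(L, W):
    """Best-fit slope alpha for Y = alpha X using the true (perpendicular) distance,
    after centering and scaling the data."""
    X, Y = L - L.mean(), W - W.mean()                    # center
    X, Y = X / np.abs(X).mean(), Y / np.abs(Y).mean()    # scale: X_C = ||X||_1 / n
    xx, yy, xy = X @ X, Y @ Y, X @ Y
    lam = (xx - yy) / xy
    root = np.sqrt(lam**2 + 4)
    return 0.5 * (-lam + root) if xy > 0 else 0.5 * (-lam - root)

L = np.array([3, 6, 4, 9, 5, 7], dtype=float)             # made-up word lengths
W = np.array([223, 117, 180, 90, 160, 110], dtype=float)  # made-up definition lengths
print(orthogonal_slope(L, W))
```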

Principal Component Analysis

Assume there are $m$ variables (or quantities) $q_1, \dots, q_m$ and $n$ measurements of them, so we have a data matrix $A_{n\times m}$. We first center the data:

$$\bar{q}_j = \frac{1}{n}\sum_{i=1}^n q_{ij}, \qquad Q_{ij} = q_{ij} - \bar{q}_j$$

Then we scale the data:

$$p_{ij} = \frac{Q_{ij}}{S_j}, \qquad S_j = \text{scaling factor for the } j\text{th variable}$$

Our objective is to find a $k$-dimensional approximation of the data. When $k = 1$, $\vec{p} = \alpha_1\vec{v}_1$. When $k = 2$, $\vec{p} = \alpha_1\vec{v}_1 + \alpha_2\vec{v}_2$. Here, $\vec{v}_i\cdot\vec{v}_i = 1$ and $\vec{v}_i\cdot\vec{v}_j = 0$ for $i\ne j$. The error is given by

$$E(\vec{v}) = \sum_i d_i^2 = \sum_i\left[\vec{p}_i\cdot\vec{p}_i - (\vec{p}_i\cdot\vec{v}_1)^2 - \dots - (\vec{p}_i\cdot\vec{v}_k)^2\right]$$

Notice that $\sum_{i=1}^n\vec{p}_i\cdot\vec{p}_i = \sum_{i,j}P_{ij}^2 = \|P\|_F^2 = \sigma_1^2 + \dots + \sigma_m^2$.

To minimize the error (for $k = 1$, subject to $\|\vec{v}\|_2 = 1$), we use a Lagrange multiplier:

$$E(\vec{v}, \lambda) = \sum_i d_i^2 + \lambda\left(\|\vec{v}\|_2^2 - 1\right)$$

and set

$$\begin{cases}\partial E/\partial v_i = 0\\ \partial E/\partial\lambda = 0\end{cases}$$

So we have

$$(P^TP)\vec{v} = \lambda\vec{v}$$

and $\lambda$ is an eigenvalue of $P^TP$ with $\vec{v}$ the corresponding eigenvector.

Fact
$$\sum_{i=1}^n(\vec{p}_i\cdot\vec{v}_1)^2 = \lambda_1$$

With this,

$$E_1 = \sum_i\left[\vec{p}_i\cdot\vec{p}_i - (\vec{p}_i\cdot\vec{v}_1)^2\right] = \sigma_1^2 + \dots + \sigma_m^2 - \lambda_1$$

So $\lambda_i = \sigma_i^2$; the eigenvalues of $P^TP$ are the squares of the singular values of $P$.

Theorem
Let $P$ be the $n\times m$ adjusted data matrix (with $n \ge m$ and $\operatorname{rank}(P) = m$), with SVD $P = U\Sigma V^T$. Also, let

$$\vec{p} = \alpha_1\vec{v}_1 + \dots + \alpha_k\vec{v}_k$$

For each $k$, the best fit is obtained using the first $k$ columns of $V$. The error is
$$E_k = \sigma_{k+1}^2 + \dots + \sigma_m^2$$

and the relative error is
$$\text{relative error} = \frac{\sigma_{k+1}^2 + \dots + \sigma_m^2}{\sigma_1^2 + \dots + \sigma_m^2}$$

Comments:

  1. For general data, how to pick k?
    • Guttman-Kaiser: pick $k$ so that $\sigma_k \ge \frac{1}{m}\sum_{j=1}^m\sigma_j$ (keep the components whose singular value is at least the average)
    • Cattell scree: corner (elbow) point on the curve of $\sigma_i$ versus $i$
    • Math solution: use relative error
  2. How much data is needed?
    Rule of 10: $n \approx 10m$
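
A minimal PCA sketch following the theorem: center, scale, take the SVD of the adjusted matrix $P$, and keep the first $k$ columns of $V$. The scaling by the standard deviation is just one possible choice of $S_j$, and the data is random, purely for illustration:

```python
import numpy as np

def pca(data, k):
    """Center and scale an n x m data matrix, then return the first k
    principal directions (columns of V) and the relative error."""
    Q = data - data.mean(axis=0)       # center each variable
    P = Q / Q.std(axis=0)              # scale; the standard deviation is one choice of S_j
    U, sigma, Vt = np.linalg.svd(P, full_matrices=False)
    rel_err = np.sum(sigma[k:]**2) / np.sum(sigma**2)
    return Vt[:k].T, rel_err           # first k columns of V, and the relative error

data = np.random.rand(100, 5)          # n = 100 measurements of m = 5 variables
Vk, rel_err = pca(data, k=2)
print(Vk.shape, rel_err)               # (5, 2) and the fraction of "lost" variance
```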

SVD Application: Independent Component Analysis (ICA)

In this situation, we have two sources and two recorders. Our goal is to determine what source I and source II said given the two recordings. Our assumption is

$$x_1 = a_{11}s_1 + a_{12}s_2$$
$$x_2 = a_{21}s_1 + a_{22}s_2$$

This can be written as

$$\vec{x} = A\vec{s}$$

The matrix $A$ is known as the mixing matrix. The data matrix can be written as $X_{n\times m}$, where $m \le n$ and $X$ is full rank. Similar to the previous section, we preprocess the data by centering and scaling. First we center the data:

$$\vec{x}_M = \frac{1}{n}\sum_{i=1}^n\vec{x}_i, \qquad \vec{x}_i^{\,*} = \vec{x}_i - \vec{x}_M$$
$$\vec{x}_M = A\vec{s}_M, \qquad \vec{s}_i^{\,*} = \vec{s}_i - \vec{s}_M$$

Notice that the solution for $A$ and $\vec{s}_i$ is not unique. If $A$ and $\vec{s}$ are a solution such that $\vec{x} = A\vec{s}$, then for any invertible $B$, $AB$ and $B^{-1}\vec{s}$ are also a solution, since $\vec{x} = A\vec{s} = (AB)(B^{-1}\vec{s})$. To deal with this, we will look for a solution that satisfies

$$\frac{1}{n}\sum_{i=1}^n\vec{s}_i^{\,*}(\vec{s}_i^{\,*})^T = I_{m\times m}$$

In statistics, this is known as whitening the data.

Assume $A = U\Sigma V^T$, and notice that

$$\frac{1}{n}(X^*)^TX^* = \frac{1}{n}\sum_{i=1}^n\vec{x}_i^{\,*}(\vec{x}_i^{\,*})^T = \frac{1}{n}\sum_{i=1}^n(A\vec{s}_i^{\,*})(A\vec{s}_i^{\,*})^T = A\left(\frac{1}{n}\sum_{i=1}^n\vec{s}_i^{\,*}(\vec{s}_i^{\,*})^T\right)A^T = AA^T = U\Sigma V^T(U\Sigma V^T)^T = U\Sigma^2U^T$$

To compute $A$, we need the SVD of $\frac{1}{n}(X^*)^TX^*$. Since it is symmetric positive definite, we have

$$\frac{1}{n}(X^*)^TX^* = U_X\Sigma_XU_X^T$$

So $U = U_X$ and $\Sigma^2 = \Sigma_X$ (which means $\sigma_i = \sqrt{(\sigma_X)_i}$).

To find $V$, we introduce the idea of a contrast function. A usual choice is the (excess) kurtosis $\kappa$. Given a single variable $s$, using the fact that we want whitened sources, we have

$$\kappa = -3 + \frac{1}{n}\sum_{i=1}^n(s_i^*)^4$$

When $s$ is Gaussian, we have $\kappa = 0$. We want the sources to be as non-Gaussian as possible, so we want $|\kappa|$ or $\kappa^2$ to be as large as possible. For multiple sources, we can use

$$\vec{\kappa} = (\kappa_1, \dots, \kappa_m)$$

So the contrast function would be $\|\vec{\kappa}\|_2^2$.
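
A sketch of the two-source case: whiten the centered recordings using the SVD of $\frac{1}{n}(X^*)^TX^*$, then pick the rotation angle that maximizes the contrast $\|\vec{\kappa}\|_2^2$ by a brute-force search (just to make the idea concrete; the sources and mixing matrix are made up, and practical ICA algorithms such as FastICA use a smarter optimization):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Made-up non-Gaussian sources and mixing matrix
S = np.column_stack([np.sign(rng.standard_normal(n)),   # a binary source
                     rng.uniform(-1, 1, n)])            # a uniform source
A_mix = np.array([[0.8, 0.3],
                  [0.4, 0.7]])
X = S @ A_mix.T                                         # recordings, n x 2

Xc = X - X.mean(axis=0)                                 # center
U, s2, _ = np.linalg.svd(Xc.T @ Xc / n)                 # (1/n) X*^T X* = U Sigma^2 U^T
W = U @ np.diag(1.0 / np.sqrt(s2)) @ U.T                # whitening matrix
Z = Xc @ W.T                                            # whitened recordings

def contrast(theta):
    """||kappa||_2^2 of the sources recovered with rotation angle theta."""
    V = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Srec = Z @ V
    kappa = (Srec**4).mean(axis=0) - 3                  # excess kurtosis of each source
    return np.sum(kappa**2)

thetas = np.linspace(0, np.pi / 2, 500)
best = thetas[np.argmax([contrast(t) for t in thetas])]
print(best)   # rotation angle that best separates the two sources
```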