Spectral Decomposition Theorem
Assume $A$ is symmetric; then it has orthonormal eigenvectors $\vec{u}_1, \vec{u}_2, \dots, \vec{u}_n$ with eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$. In this case,
$$A = Q D Q^T$$
where
$$Q = (\vec{u}_1, \vec{u}_2, \dots, \vec{u}_n), \qquad D = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$$
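As a quick numerical check, here is a minimal sketch (assuming NumPy; the matrix is random and purely illustrative) that verifies $A = QDQ^T$ for a symmetric matrix:

```python
import numpy as np

# Build a random symmetric matrix and verify the spectral decomposition.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                       # symmetrize

lam, Q = np.linalg.eigh(A)              # eigenvalues and orthonormal eigenvectors
D = np.diag(lam)

assert np.allclose(Q @ D @ Q.T, A)      # A = Q D Q^T
assert np.allclose(Q.T @ Q, np.eye(4))  # Q is orthogonal
```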
Singular Value Decomposition
Given an $m \times n$ matrix $A$, we want
$$A = U \Sigma V^T$$
where $U$ and $V$ are orthogonal matrices and $\Sigma$ is diagonal. The diagonal elements of $\Sigma$ are called the singular values.
Assume $m \ge n$. Consider $A^T A$:
$$A^T A = (V \Sigma^T U^T)(U \Sigma V^T) = V (\Sigma^T \Sigma) V^T$$
where $(\Sigma^T \Sigma)_{n \times n} = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2)$. So, from the spectral decomposition theorem, we pick
$$V = (\vec{v}_1, \vec{v}_2, \dots, \vec{v}_n), \qquad \vec{v}_i \equiv \text{eigenvectors of } A^T A$$
and $\sigma_i^2 = \lambda_i$, where $\lambda_i \ge 0$ are the eigenvalues of $A^T A$.
Now consider $A A^T$:
$$A A^T = U (\Sigma \Sigma^T) U^T$$
Theorem
If $\lambda^*$ is an eigenvalue of $A^T A$, then it is an eigenvalue of $A A^T$, or $\lambda^* = 0$.
From the theorem, we can pick
$$U = (\vec{u}_1, \vec{u}_2, \dots, \vec{u}_m), \qquad \vec{u}_i \equiv \text{eigenvectors of } A A^T$$
To sum it up, we have
$$A = U \Sigma V^T$$
where
$$U = (\vec{u}_1, \dots, \vec{u}_m), \quad \vec{u}_i \equiv \text{eigenvectors of } A A^T; \qquad V = (\vec{v}_1, \dots, \vec{v}_n), \quad \vec{v}_i \equiv \text{eigenvectors of } A^T A$$
and $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_n)$ with $\sigma_i = \sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues of $A^T A$.
When assembling the pieces, we additionally assume:
- $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n$
- $A \vec{v}_i = \sigma_i \vec{u}_i$, which is used to determine the relative signs of $\vec{v}_i$ and $\vec{u}_i$
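The construction above can be carried out directly. Here is a sketch (NumPy assumed; illustrative only, since in practice one calls `np.linalg.svd` rather than forming $A^TA$):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
A = rng.standard_normal((m, n))

# Eigen-decomposition of A^T A gives V and the squared singular values.
lam, V = np.linalg.eigh(A.T @ A)      # eigh returns ascending order
idx = np.argsort(lam)[::-1]           # reorder so sigma_1 >= ... >= sigma_n
lam, V = lam[idx], V[:, idx]
sigma = np.sqrt(np.clip(lam, 0, None))

# A v_i = sigma_i u_i fixes the signs of u_i relative to v_i
# (assumes all sigma_i > 0, which holds for a random A).
U = (A @ V) / sigma

assert np.allclose((U * sigma) @ V.T, A)          # A = U Sigma V^T
assert np.allclose(np.linalg.svd(A, compute_uv=False), sigma)
```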
Properties and Mathematical Consequences of SVD
I. Outer Product Expansion
We have
$$A = \sum_{i=1}^{n} \sigma_i W_i$$
where
$$(W_i)_{m \times n} = \vec{u}_i \vec{v}_i^T$$
Notes:
- $W_i \vec{x} = (\vec{u}_i \vec{v}_i^T)\vec{x} = \vec{u}_i (\vec{v}_i^T \vec{x}) = (\vec{v}_i \cdot \vec{x})\, \vec{u}_i$
- If $m > n$, then $\vec{u}_{n+1}, \dots, \vec{u}_m$ make no contribution to $A$.
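A short check of the expansion (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: only u_1, ..., u_n

# A = sum_i sigma_i u_i v_i^T, each W_i = u_i v_i^T being a rank-one matrix
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(A_sum, A)
```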
II. Rank
Definition
rank(A) is the number of linearly independent columns in A.
Theorem
The number of linearly independent columns in A is equal to the number of linearly independent rows in A.
Fact
$$\operatorname{rank}(A) \le \min\{m, n\}$$
$\operatorname{rank}(A) = k$, where $k$ is the largest index with $\sigma_k \ne 0$ (so $\sigma_{k+1} = 0$). Consequences:
- $W_i$ has rank 1. Proof: We have $W_i \vec{v}_i = (\vec{v}_i \cdot \vec{v}_i)\vec{u}_i = \vec{u}_i$, while $W_i \vec{v}_j = (\vec{v}_i \cdot \vec{v}_j)\vec{u}_i = \vec{0}$ for $j \ne i$. So $\operatorname{rank}(W_i) = 1$.
- $\sum_{i=1}^{l} \sigma_i W_i$ has rank $l$ (provided $\sigma_l \ne 0$).
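In floating point the trailing singular values are rarely exactly zero, so in practice the rank is computed by thresholding them. A sketch (NumPy assumed; the tolerance below is one common choice, not the only one):

```python
import numpy as np

rng = np.random.default_rng(3)
# A 6x4 matrix of rank 2: product of 6x2 and 2x4 factors.
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

s = np.linalg.svd(A, compute_uv=False)
tol = max(A.shape) * np.finfo(A.dtype).eps * s[0]  # threshold for "zero"
rank = int(np.sum(s > tol))                        # count nonzero sigma_i
assert rank == 2 == np.linalg.matrix_rank(A)
```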
III. Norms
Theorem
$$\|A\|_2 = \sigma_1$$
Proof: Take $\vec{x} = \sum_{i=1}^n a_i \vec{v}_i$ with $\sum_{i=1}^n a_i^2 = 1$. Notice that
$$A \vec{x} = A \sum_{i=1}^n a_i \vec{v}_i = \sum_{i=1}^n a_i \sigma_i \vec{u}_i$$
So
$$\|A \vec{x}\|_2^2 = \sum_{i=1}^n a_i^2 \sigma_i^2 \le \sigma_1^2 \sum_{i=1}^n a_i^2 = \sigma_1^2$$
Thus $\|A \vec{x}\|_2 \le \sigma_1$ for every unit vector $\vec{x}$, with equality when $\vec{x} = \vec{v}_1$. Since $\|A\|_2 = \max_{\|\vec{x}\|_2 = 1} \|A \vec{x}\|_2$, we have proved that $\|A\|_2 = \sigma_1$.
Theorem
If $A$ is invertible (so $m = n$), then
$$\kappa_2(A) = \frac{\sigma_1}{\sigma_n}$$
Proof: Since the largest singular value of $A^{-1}$ is $1/\sigma_n$, we have
$$\kappa_2(A) = \|A\|_2 \|A^{-1}\|_2 = \sigma_1 \cdot \frac{1}{\sigma_n} = \frac{\sigma_1}{\sigma_n}$$
Frobenius norm: $\|A\|_F = \sqrt{\sum_i \sum_j |a_{ij}|^2}$.
Fact
$$\|A\|_F = \sqrt{\sum_{i=1}^n \sigma_i^2} = \|\vec{\sigma}\|_2$$
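Both norm identities are easy to verify numerically (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))
s = np.linalg.svd(A, compute_uv=False)

assert np.isclose(np.linalg.norm(A, 2), s[0])                   # ||A||_2 = sigma_1
assert np.isclose(np.linalg.norm(A, 'fro'), np.linalg.norm(s))  # ||A||_F = ||sigma||_2

# For an invertible square matrix, kappa_2(A) = sigma_1 / sigma_n.
B = rng.standard_normal((4, 4))
sb = np.linalg.svd(B, compute_uv=False)
assert np.isclose(np.linalg.cond(B, 2), sb[0] / sb[-1])
```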
IV. Transpose and Inverse
Let $A = U \Sigma V^T$, so
$$A^T = V \Sigma^T U^T, \qquad A^{-1} = V \Sigma^{-1} U^T$$
Note that the expression for $A^{-1}$ is not a standard SVD (the diagonal entries of $\Sigma^{-1}$ appear in increasing order), but it can be rearranged into one.
V. Invariance
If $B = PAQ$ where $P$ and $Q$ are orthogonal, then $B$ has the same singular values as $A$.
Proof: The singular values of $A$ are found from $\det(A^T A - \lambda I) = 0$. Since $P^T P = I$ and $Q^{-1} = Q^T$, we have
$$B^T B - \lambda I = Q^T (A^T A) Q - \lambda I = Q^T (A^T A - \lambda I) Q$$
So $\det(B^T B - \lambda I) = \det(A^T A - \lambda I)$.
VI. Solve $A\vec{x} = \vec{b}$
If $A$ is invertible, then $A^{-1} = V \Sigma^{-1} U^T$, and
$$A^{-1} = \sum_{i=1}^n \frac{1}{\sigma_i} \vec{v}_i \vec{u}_i^T$$
So we can solve for $\vec{x} = A^{-1} \vec{b}$. If $A$ is not invertible, we can use the pseudoinverse $A^+$ instead.
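A sketch of the SVD-based solve (NumPy assumed); applying $A^{-1} = V\Sigma^{-1}U^T$ to $\vec{b}$ never forms $A^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)

U, s, Vt = np.linalg.svd(A)
# x = V (Sigma^{-1} (U^T b)) = sum_i (u_i . b) / sigma_i * v_i
x = Vt.T @ ((U.T @ b) / s)
assert np.allclose(A @ x, b)

# For singular or rectangular A, the pseudoinverse drops the sigma_i = 0 terms.
assert np.allclose(x, np.linalg.pinv(A) @ b)
```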
VII. Positive / Negative Eigenvalues
If $A$ is symmetric and positive definite, then the eigenvalues of $A$ are positive, and
$$A = U \Sigma V^T = Q D Q^T$$
where the eigenvalues in $D$ are sorted by size. If $A$ is symmetric but has negative eigenvalues, the singular values of $A$ are the absolute values of the eigenvalues of $A$.
For example, if $A\vec{x} = -3\vec{x}$ with $\|\vec{x}\|_2 = 1$, then we take $\sigma_i = 3$, $\vec{v}_i = \vec{x}$, and $\vec{u}_i = -\vec{x}$, so that
$$A \vec{v}_i = -3\vec{x} = 3\vec{u}_i = \sigma_i \vec{u}_i$$
VIII. Low Rank Approximation
Let the low rank approximation of $A$ be
$$A_k = \sum_{i=1}^k \sigma_i W_i, \qquad k = 1, 2, \dots, r-1$$
So $\operatorname{rank}(A_k) = k$. Since $\operatorname{rank}(A) = r$, we have $A_r = A$.
Eckart–Young Theorem
- $\|A\|_2 = \sigma_1$
- $\|A - A_k\|_2 = \sigma_{k+1}$ for $k = 1, 2, \dots, r-1$
Theorem
The most accurate approximation of $A$ with rank $\le k$ is $A_k$. In other words,
$$\min_{\operatorname{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$$
Proof (by contradiction): Suppose there exists $B$ such that $\operatorname{rank}(B) \le k$ and $\|A - B\|_2 < \sigma_{k+1}$. Let $N(B)$ be the null space of $B$. Since $\dim(N(B)) \ge n - k$ and $\dim(S_{k+1}) = k+1$, where $S_{k+1} = \operatorname{span}\{\vec{v}_1, \dots, \vec{v}_{k+1}\}$, the two subspaces must intersect nontrivially. Let $\vec{x} \in N(B) \cap S_{k+1}$ be a unit vector, so $\vec{x} = \alpha_1 \vec{v}_1 + \dots + \alpha_{k+1}\vec{v}_{k+1}$ with $\sum \alpha_i^2 = 1$. Then
$$\|(A - B)\vec{x}\|_2^2 = \|A\vec{x}\|_2^2 = \|\alpha_1 \sigma_1 \vec{u}_1 + \dots + \alpha_{k+1} \sigma_{k+1} \vec{u}_{k+1}\|_2^2 = \sum_{i=1}^{k+1} \alpha_i^2 \sigma_i^2 \ge \sigma_{k+1}^2$$
This contradicts our assumption that $\|A - B\|_2 < \sigma_{k+1}$.
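A numerical check of the Eckart–Young statement (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]   # A_k: keep the first k outer products

# ||A - A_k||_2 = sigma_{k+1}; with 0-based indexing that is s[k].
assert np.isclose(np.linalg.norm(A - Ak, 2), s[k])
```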
Application of SVD: Image Compression
The relative error of our approximation / compression is given by
$$\text{relative error} \equiv \frac{\|A - A_k\|_2}{\|A\|_2} = \frac{\sigma_{k+1}}{\sigma_1}$$
Given a grayscale image of size $m$ by $n$, we represent the image as a matrix $A_{m \times n}$ where $0 \le a_{ij} \le 255$. The compressed image is given by the low rank approximation of the matrix $A$.

How well does $A_k$ do in reproducing the original image?

Image compression ratio?
- original image: $m \cdot n$ values
- compressed image: $k \cdot (m + n)$ values, i.e., the first $k$ columns of $U$ and $V$ scaled by the $\sigma_i$ (see the sketch below)

What about RGB images?
For an RGB image of size $m$ by $n$, we stack the three color matrices to get a matrix $A_{3m \times n}$. The procedure is then the same as for a grayscale image.

Comments:
 complications from using floats (instead of integers)
 lossy compression
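Putting the pieces together, a minimal compression sketch (NumPy assumed; the image here is synthetic, and a real one would be loaded as a float array first):

```python
import numpy as np

def compress(img, k):
    """Rank-k approximation of a grayscale image matrix."""
    U, s, Vt = np.linalg.svd(img.astype(float), full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]   # in practice, clip to [0, 255] and round

img = np.random.default_rng(7).uniform(0, 255, size=(256, 384))
k = 50
img_k = compress(img, k)

# Relative 2-norm error is sigma_{k+1} / sigma_1; storage ratio is k(m+n) / (mn).
s = np.linalg.svd(img, compute_uv=False)
assert np.isclose(np.linalg.norm(img - img_k, 2) / s[0], s[k] / s[0])
print("storage ratio:", k * (256 + 384) / (256 * 384))
```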
SVD Application: Watermarking / Steganography
The goal is to embed / hide an image in another one (and extract it later). We follow the same procedure as in Liu & Tan 2002.
Let $A$ be the original grayscale image and $B$ be the grayscale image to be embedded (of the same size as $A$).
- Find the SVD of $A$: $A = U \Sigma V^T$
- Let $\bar{B} = \Sigma + aB$ for some $0 < a < 1$ and find the SVD of $\bar{B}$: $\bar{B} = U_{\bar{B}} \Sigma_{\bar{B}} V_{\bar{B}}^T$
- Embed $\bar{B}$ to get $\bar{A}$: $\bar{A} = U \Sigma_{\bar{B}} V^T$
Now, given an image $A^*$ (that you think is your $\bar{A}$):
- Find the SVD of $A^*$: $A^* = U_{A^*} \Sigma_{A^*} V_{A^*}^T$
- Find $\bar{B}^*$ (using the stored $U_{\bar{B}}$ and $V_{\bar{B}}$): $\bar{B}^* = U_{\bar{B}} \Sigma_{A^*} V_{\bar{B}}^T$
- Extract: $B^* = \frac{1}{a}(\bar{B}^* - \Sigma)$
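A sketch of both stages (NumPy assumed; `a` is the embedding strength, and the keys $U_{\bar{B}}$, $V_{\bar{B}}$, $\Sigma$, $a$ must be stored for extraction):

```python
import numpy as np

def embed(A, B, a=0.1):
    """Hide B inside A (Liu & Tan style); returns A_bar and the extraction keys."""
    U, S, Vt = np.linalg.svd(A)
    Bbar = np.diag(S) + a * B                  # B_bar = Sigma + a B
    Ub, Sb, Vbt = np.linalg.svd(Bbar)
    A_bar = (U * Sb) @ Vt                      # A_bar = U Sigma_Bbar V^T
    return A_bar, (Ub, Vbt, S, a)

def extract(A_star, keys):
    Ub, Vbt, S, a = keys
    S_star = np.linalg.svd(A_star, compute_uv=False)
    Bbar_star = (Ub * S_star) @ Vbt            # B_bar* = U_Bbar Sigma_A* V_Bbar^T
    return (Bbar_star - np.diag(S)) / a        # B* = (B_bar* - Sigma) / a

rng = np.random.default_rng(8)
A = rng.uniform(0, 255, (64, 64))
B = rng.uniform(0, 255, (64, 64))
A_bar, keys = embed(A, B)
assert np.allclose(extract(A_bar, keys), B)    # exact recovery when A* = A_bar
```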
SVD Application: Principal Component Analysis (PCA)
The goal is to find connections in data.
Example: Find the connection between the length of a word and the number of words used in its definition.
$L \equiv$ # of letters in the word
$W \equiv$ # of words in its definition

| word | $L$ | $W$ |
|------|-----|-----|
| bag | 3 | 223 |
| across | 6 | 117 |
| ⋮ | ⋮ | ⋮ |
Assumption: a linear connection
$$W = \alpha L + \beta \quad \text{or} \quad L = aW + b$$
Error: To make the error function mathematically equivalent for both forms of the linear relation, we use the true (perpendicular) distance $D_i$:
$$E(\alpha, \beta) = \sum_i D_i^2 = \frac{1}{1 + \alpha^2} \sum_i (\alpha L_i + \beta - W_i)^2$$
Data preprocessing:

Center the data: setting $\partial_\alpha E = \partial_\beta E = 0$ requires solving cubic equations. To avoid this, we center the data:
$$X_i = L_i - \bar{L}, \quad \bar{L} = \frac{1}{n}\sum L_i; \qquad Y_i = W_i - \bar{W}, \quad \bar{W} = \frac{1}{n}\sum W_i$$
With this, we assume $Y = \alpha_0 X$.

Scale the variables (nondimensionalization): $\frac{1}{1 + \alpha_0^2}$ only makes sense if $\alpha_0$ is dimensionless (?), so we scale $X$ and $Y$ using $X_C$ and $Y_C$:
$$x_i = X_i / X_C, \qquad y_i = Y_i / Y_C$$
Possible choices for $X_C$ and $Y_C$:
- $X_C = \frac{1}{n}\|\vec{X}\|_1$ and $Y_C = \frac{1}{n}\|\vec{Y}\|_1$
- $X_C = \|\vec{X}\|_\infty$ and $Y_C = \|\vec{Y}\|_\infty$
- $X_C = \frac{1}{\sqrt{n}}\|\vec{X}\|_2$ and $Y_C = \frac{1}{\sqrt{n}}\|\vec{Y}\|_2$
To sum it up, we assume $y = \alpha x$ with
$$E(\alpha) = \frac{1}{1 + \alpha^2} \sum_i (\alpha x_i - y_i)^2 = \frac{1}{1 + \alpha^2} \left[\alpha^2\, \vec{x}\cdot\vec{x} - 2\alpha\, \vec{x}\cdot\vec{y} + \vec{y}\cdot\vec{y}\right]$$
Setting $\partial_\alpha E = 0$, we have
$$\alpha = \frac{1}{2}\left(-\lambda \pm \sqrt{\lambda^2 + 4}\right), \qquad \begin{cases} + & \vec{x}\cdot\vec{y} > 0 \\ - & \vec{x}\cdot\vec{y} < 0 \end{cases}$$
where
$$\lambda = \frac{\vec{x}\cdot\vec{x} - \vec{y}\cdot\vec{y}}{\vec{x}\cdot\vec{y}}$$
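A sketch of the whole 1-D fit (NumPy assumed; the $(L, W)$ values are made up for illustration):

```python
import numpy as np

# Hypothetical data: word length L vs. number of words W in the definition.
L = np.array([3.0, 6.0, 5.0, 9.0, 4.0])
W = np.array([223.0, 117.0, 160.0, 80.0, 190.0])

# Center, then scale; here using the choice X_C = ||X||_2 / sqrt(n).
X, Y = L - L.mean(), W - W.mean()
x = X / (np.linalg.norm(X) / np.sqrt(len(X)))
y = Y / (np.linalg.norm(Y) / np.sqrt(len(Y)))

# alpha = (-lambda +/- sqrt(lambda^2 + 4)) / 2, sign matching the sign of x . y
lam = (x @ x - y @ y) / (x @ y)
alpha = 0.5 * (-lam + np.sign(x @ y) * np.sqrt(lam**2 + 4))
print("fitted slope alpha =", alpha)
```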
Principal Component Analysis
Assume there are $m$ variables (or quantities) $q_1, \dots, q_m$ and $n$ measurements of each, giving a data matrix $A_{n \times m}$. We first center the data:
$$\bar{q}_j = \frac{1}{n}\sum_{i=1}^n q_{ij}, \qquad Q_{ij} = q_{ij} - \bar{q}_j$$
Then we scale the data:
$$p_{ij} = \frac{Q_{ij}}{S_j}, \qquad S_j \equiv \text{scaling factor for the } j\text{th variable}$$
Our objective is to find a $k$-dimensional approximation of the data. When $k = 1$, $\vec{p} \approx \alpha_1 \vec{v}_1$. When $k = 2$, $\vec{p} \approx \alpha_1 \vec{v}_1 + \alpha_2 \vec{v}_2$. Here, $\vec{v}_i \cdot \vec{v}_i = 1$ and $\vec{v}_i \cdot \vec{v}_j = 0$ for $i \ne j$. The error is given by
$$E(\vec{v}) = \sum_i d_i^2 = \sum_i \left[\vec{p}_i \cdot \vec{p}_i - (\vec{p}_i \cdot \vec{v}_1)^2 - \dots - (\vec{p}_i \cdot \vec{v}_k)^2\right]$$
Notice that $\sum_{i=1}^n \vec{p}_i \cdot \vec{p}_i = \sum_{i,j} P_{ij}^2 = \|P\|_F^2 = \sigma_1^2 + \dots + \sigma_m^2$.
To minimize the error, we use a Lagrange multiplier:
$$E(\vec{v}, \lambda) = \sum_i d_i^2 + \lambda (\|\vec{v}\|_2^2 - 1)$$
and set
$$\partial_{v_i} E = 0 \ \text{(for each component } v_i\text{)}, \qquad \partial_\lambda E = 0$$
So we have
$$(P^T P)\vec{v} = \lambda \vec{v}$$
and $\lambda$ is an eigenvalue of $P^T P$ with $\vec{v}$ the corresponding eigenvector.
Fact
$$\sum_{i=1}^n (\vec{p}_i \cdot \vec{v}_1)^2 = \lambda_1$$
With this,
$$E_1 = \sum_i \left[\vec{p}_i \cdot \vec{p}_i - (\vec{p}_i \cdot \vec{v}_1)^2\right] = \sigma_1^2 + \dots + \sigma_m^2 - \lambda_1$$
So $\lambda_i = \sigma_i^2$, i.e., the eigenvalues are the squared singular values.
Theorem
Let $P$ be the $n \times m$ adjusted data matrix (with $n \ge m$ and $\operatorname{rank}(P) = m$). Also, let
$$\vec{p} \approx \alpha_1 \vec{v}_1 + \dots + \alpha_k \vec{v}_k$$
For each $k$, the best fit is obtained using the first $k$ columns of $V$. The error is
$$E_k = \sigma_{k+1}^2 + \dots + \sigma_m^2$$
and the relative error is
$$\text{relative error} = \frac{\sigma_{k+1}^2 + \dots + \sigma_m^2}{\sigma_1^2 + \dots + \sigma_m^2}$$
Comments:
- For general data, how do we pick $k$?
  - Guttman–Kaiser: pick $k$ so that $\sigma_k \ge \frac{1}{m}\sum_{i=1}^m \sigma_i$
  - Cattell scree: look for the corner point on the curve of $\sigma_i$ vs. $i$
  - Math solution: use the relative error
- How much data is needed? Rule of 10: about $10 \cdot m$ measurements
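A compact PCA sketch via the SVD (NumPy assumed; the per-column standard deviation is used as the scaling factor $S_j$):

```python
import numpy as np

rng = np.random.default_rng(9)
n, m = 200, 5
A = rng.standard_normal((n, m)) @ rng.standard_normal((m, m))  # correlated columns

# Center and scale: P_ij = (a_ij - qbar_j) / S_j.
Q = A - A.mean(axis=0)
P = Q / Q.std(axis=0)

U, s, Vt = np.linalg.svd(P, full_matrices=False)
k = 2
components = Vt[:k, :]          # first k columns of V, stored as rows of V^T
scores = P @ components.T       # coordinates alpha_1, ..., alpha_k of each point

# Relative error of the k-D fit: (sigma_{k+1}^2 + ... + sigma_m^2) / (sigma_1^2 + ...).
rel_err = np.sum(s[k:]**2) / np.sum(s**2)
print("relative error:", rel_err)
```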
SVD Application: Independent Component Analysis (ICA)
In this situation, we have two sources and two recorders. Our goal is to determine what source I and source II said, given the two recordings. Our assumption is
$$x_1 = a_{11} s_1 + a_{12} s_2, \qquad x_2 = a_{21} s_1 + a_{22} s_2$$
This can be written as
$$\vec{x} = A\vec{s}$$
The matrix $A$ is known as the mixing matrix. The data matrix can be written as $X_{n \times m}$, where $m \ge n$ and $X$ is full rank. Similar to the previous section, we preprocess the data by centering and scaling. First we center the data:
$$\vec{x}_M = \frac{1}{n}\sum_{i=1}^n \vec{x}_i, \qquad \vec{x}_i^* = \vec{x}_i - \vec{x}_M, \qquad \vec{x}_M = A\vec{s}_M, \qquad \vec{s}_i^* = \vec{s}_i - \vec{s}_M$$
Notice that the solution for $A$ and $\vec{s}_i$ is not unique: if $A$ and $\vec{s}$ form a solution with $\vec{x} = A\vec{s}$, then for any invertible $B$, $AB$ and $B^{-1}\vec{s}$ is also a solution, since $\vec{x} = A\vec{s} = (AB)(B^{-1}\vec{s})$. To deal with this, we will look for a solution that satisfies
$$\frac{1}{n}\sum_{i=1}^n \vec{s}_i^* (\vec{s}_i^*)^T = I_{m \times m}$$
In statistics, this is known as whitening the data.
Assume $A = U\Sigma V^T$ and notice that
$$\frac{1}{n}(X^*)^T X^* = \frac{1}{n}\sum_{i=1}^n \vec{x}_i^* (\vec{x}_i^*)^T = A\left(\frac{1}{n}\sum_{i=1}^n \vec{s}_i^* (\vec{s}_i^*)^T\right)A^T = AA^T = U\Sigma V^T (U\Sigma V^T)^T = U\Sigma^2 U^T$$
To compute $A$, we need to compute the SVD of $\frac{1}{n}(X^*)^T X^*$. Since it is positive definite, we have
$$\frac{1}{n}(X^*)^T X^* = U_X \Sigma_X U_X^T$$
So $U = U_X$ and $\Sigma^2 = \Sigma_X$ (which means $\sigma_i = \sqrt{\sigma_i^X}$, where $\sigma_i^X$ are the diagonal entries of $\Sigma_X$).
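A sketch of the whitening step (NumPy assumed; the sources and mixing matrix are synthetic), recovering $U$ and $\Sigma$ from the sample covariance of the centered recordings:

```python
import numpy as np

rng = np.random.default_rng(10)
n, m = 1000, 2
S_true = rng.uniform(-1, 1, (n, m))     # hypothetical sources, one row per sample
A_mix = rng.standard_normal((m, m))     # unknown mixing matrix
X = S_true @ A_mix.T                    # recordings: x_i = A s_i

Xc = X - X.mean(axis=0)                 # center the data
C = (Xc.T @ Xc) / n                     # (1/n) (X*)^T X*, an m x m PSD matrix

sigX, UX = np.linalg.eigh(C)            # C = U_X Sigma_X U_X^T (ascending order)
sigX, UX = sigX[::-1], UX[:, ::-1]      # reorder to descending
sigma = np.sqrt(sigX)                   # sigma_i = sqrt(sigma_i^X)

# Whitened data z_i = Sigma^{-1} U^T x_i* has identity sample covariance.
Z = Xc @ UX / sigma
assert np.allclose((Z.T @ Z) / n, np.eye(m))
```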
To find $V$, we introduce the idea of a contrast function. A usual choice is the kurtosis $\kappa$. Given a single variable $s^*$, using the fact that we want whitened sources, we have
$$\kappa = -3 + \frac{1}{n}\sum_{i=1}^n (s_i^*)^4$$
When $s^*$ is Gaussian, we have $\kappa = 0$. We want the sources to be as non-Gaussian as possible, so we want $|\kappa|$ or $\kappa^2$ to be as large as possible. For multiple sources, we can use
$$\vec{\kappa} = (\kappa_1, \dots, \kappa_m)$$
So the contrast function would be $\|\vec{\kappa}\|$