# Mofii.NET

## Singular Value Decomposition

Spectral Decomposition Theorem
Assume $\mathbf{A}$ is symmetric. Then it has orthonormal eigenvectors $\vec{u}_1, \vec{u}_2, \cdots, \vec{u}_n$ with eigenvalues $\lambda_1, \lambda_2, \cdots, \lambda_n$. In this case,

$$\mathbf{A} = \mathbf{Q}\mathbf{D}\mathbf{Q}^T$$

where

$$\mathbf{Q} = \left(\vec{u}_1, \vec{u}_2, \cdots, \vec{u}_n\right), \qquad \mathbf{D} = \operatorname{diag}\left(\lambda_1, \lambda_2, \cdots, \lambda_n\right)$$

## Singular Value Decomposition

Given an $m \times n$ matrix $\mathbf{A}$, we want

$$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$$

where $\mathbf{U}$ and $\mathbf{V}$ are orthogonal matrices and $\mathbf{\Sigma}$ is diagonal. The diagonal elements of $\mathbf{\Sigma}$ are called the singular values.

Assume $m \geq n$. Consider $\mathbf{A}^T\mathbf{A}$:

$$\mathbf{A}^T\mathbf{A} = \left(\mathbf{V}\mathbf{\Sigma}^T\mathbf{U}^T\right)\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T = \mathbf{V}\left(\mathbf{\Sigma}^T\mathbf{\Sigma}\right)\mathbf{V}^T$$

where $\left(\mathbf{\Sigma}^T\mathbf{\Sigma}\right)_{n\times n} = \operatorname{diag}\left(\sigma_1^2, \cdots, \sigma_n^2\right)$. So, from the spectral decomposition theorem, we pick

$$\mathbf{V} = \left(\vec{v}_1, \vec{v}_2, \cdots, \vec{v}_n\right) \quad \text{where the } \vec{v}_i \text{ are eigenvectors of } \mathbf{A}^T\mathbf{A}$$

and $\sigma_i^2 = \lambda_i$, where $\lambda_i \geq 0$ are the eigenvalues of $\mathbf{A}^T\mathbf{A}$.

Now consider $\mathbf{A}\mathbf{A}^T$:

$$\mathbf{A}\mathbf{A}^T = \mathbf{U}\left(\mathbf{\Sigma}\mathbf{\Sigma}^T\right)\mathbf{U}^T$$

Theorem
If $\lambda^*$ is an eigenvalue of $\mathbf{A}\mathbf{A}^T$, then it is an eigenvalue of $\mathbf{A}^T\mathbf{A}$ or $\lambda^* = 0$.

From the theorem, we can pick

$$\mathbf{U} = \left(\vec{u}_1, \vec{u}_2, \cdots, \vec{u}_m\right) \quad \text{where the } \vec{u}_i \text{ are eigenvectors of } \mathbf{A}\mathbf{A}^T$$

To sum it up, we have

$$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$$

where

$$\begin{aligned}
\mathbf{U} &= \left(\vec{u}_1, \vec{u}_2, \cdots, \vec{u}_m\right) \quad \text{where the } \vec{u}_i \text{ are eigenvectors of } \mathbf{A}\mathbf{A}^T \\
\mathbf{V} &= \left(\vec{v}_1, \vec{v}_2, \cdots, \vec{v}_n\right) \quad \text{where the } \vec{v}_i \text{ are eigenvectors of } \mathbf{A}^T\mathbf{A}
\end{aligned}$$

and $\mathbf{\Sigma} = \operatorname{diag}\left(\sigma_1, \sigma_2, \cdots, \sigma_n\right)$ with $\sigma_i = \sqrt{\lambda_i}$, where the $\lambda_i$ are the eigenvalues of $\mathbf{A}^T\mathbf{A}$.

Assembling the pieces, it is assumed that:

1. $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n$
2. $\mathbf{A}\vec{v}_i = \sigma_i\vec{u}_i$, which is used to determine the signs of $\vec{v}_i$ and $\vec{u}_i$
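The construction above can be checked numerically. Below is a minimal NumPy sketch (the variable names are ours, and it assumes $\mathbf{A}$ has full column rank so every $\sigma_i > 0$; in practice `np.linalg.svd` computes the factorization directly):

```python
import numpy as np

# Build a (reduced) SVD of A from the eigendecomposition of A^T A,
# using A v_i = sigma_i u_i to get each u_i with a consistent sign.
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))

lam, V = np.linalg.eigh(A.T @ A)      # eigenvalues in ascending order
order = np.argsort(lam)[::-1]         # reorder so sigma_1 >= ... >= sigma_n
lam, V = lam[order], V[:, order]
sigma = np.sqrt(np.clip(lam, 0.0, None))

U = (A @ V) / sigma                   # assumes all sigma_i > 0 (full column rank)

A_rebuilt = U @ np.diag(sigma) @ V.T
print(np.allclose(A, A_rebuilt))      # True
```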

## Properties and Mathematical Consequences of SVD

### I. Outer Product Expansion

We have

$$\mathbf{A} = \sum_{i=1}^n \sigma_i\mathbf{W}_i$$

where

$$\left(\mathbf{W}_i\right)_{m\times n} = \vec{u}_i\vec{v}_i^T$$

Notes:

1. $\mathbf{W}_i\vec{x} = \left(\vec{u}_i\vec{v}_i^T\right)\vec{x} = \vec{u}_i\left(\vec{v}_i^T\vec{x}\right) = \vec{u}_i\left(\vec{v}_i\cdot\vec{x}\right) = \left(\vec{v}_i\cdot\vec{x}\right)\vec{u}_i$
2. If $m > n$, then $\vec{u}_{n+1}, \cdots, \vec{u}_m$ make no contribution to $\mathbf{A}$.
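A quick numerical illustration of the expansion (a sketch; the example matrix is arbitrary):

```python
import numpy as np

# Rebuild A as the sum of rank-1 outer products sigma_i * u_i v_i^T.
rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))

U, sigma, Vt = np.linalg.svd(A, full_matrices=True)

# Only the first n terms appear; u_{n+1}, ..., u_m contribute nothing.
A_sum = sum(sigma[i] * np.outer(U[:, i], Vt[i, :]) for i in range(n))
print(np.allclose(A, A_sum))  # True
```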

### II. Rank

Definition
$\operatorname{rank}(\mathbf{A})$ is the number of linearly independent columns in $\mathbf{A}$.

Theorem
The number of linearly independent columns in $\mathbf{A}$ is equal to the number of linearly independent rows in $\mathbf{A}$.

Fact
$\operatorname{rank}(\mathbf{A}) \leq \min\{m, n\}$

We have $\operatorname{rank}(\mathbf{A}) = k$ where $\sigma_k \neq 0$ while $\sigma_{k+1} = 0$. Consequences:

1. $\mathbf{W}_i$ has rank $1$. Proof: We have $\mathbf{W}_i\vec{v}_i = \left(\vec{v}_i\cdot\vec{v}_i\right)\vec{u}_i = \vec{u}_i$, while $\mathbf{W}_i\vec{v}_j = \left(\vec{v}_i\cdot\vec{v}_j\right)\vec{u}_i = \vec{0}$ for $j \neq i$. So $\operatorname{rank}\left(\mathbf{W}_i\right) = 1$.
2. $\sum_{i=1}^{l} \sigma_i\mathbf{W}_i$ has rank $l$ (for $l \leq k$, so that $\sigma_1, \cdots, \sigma_l \neq 0$).
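In floating-point arithmetic the rank is read off as the number of singular values above a tolerance; a small sketch (the tolerance choice is our assumption):

```python
import numpy as np

# A rank-2 matrix: the sum of two rank-1 outer products.
rng = np.random.default_rng(2)
A = np.outer(rng.standard_normal(5), rng.standard_normal(4)) \
  + np.outer(rng.standard_normal(5), rng.standard_normal(4))

sigma = np.linalg.svd(A, compute_uv=False)
rank = int(np.sum(sigma > 1e-10 * sigma[0]))  # count sigma_i that are "nonzero"
print(rank)  # 2
```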

### III. Norms

Theorem
$\lVert \mathbf{A} \rVert_2 = \sigma_1$

Proof: Take $\vec{x} = \sum_{i=1}^n a_i\vec{v}_i$ where $\sum_{i=1}^n a_i^2 = 1$. Notice that

$$\mathbf{A}\vec{x} = \mathbf{A}\sum_{i=1}^n a_i\vec{v}_i = \sum_{i=1}^n a_i\sigma_i\vec{u}_i$$

So

$$\lVert \mathbf{A}\vec{x} \rVert_2^2 = \sum_{i=1}^n a_i^2\sigma_i^2 \leq \sigma_1^2\sum_{i=1}^n a_i^2 = \sigma_1^2$$

Thus for every such $\vec{x}$ we have $\lVert \mathbf{A}\vec{x} \rVert_2 \leq \sigma_1$, with equality when $\vec{x} = \vec{v}_1$. Since $\lVert \mathbf{A} \rVert_2 = \max_{\lVert \vec{x} \rVert_2=1} \lVert \mathbf{A}\vec{x} \rVert_2$, we have proved that $\lVert \mathbf{A} \rVert_2 = \sigma_1$.

Theorem
If $\mathbf{A}$ is invertible (so $m = n$), then

$$\kappa_2(\mathbf{A}) = \frac{\sigma_1}{\sigma_n}$$

Proof: We have

$$\kappa_2(\mathbf{A}) = \lVert \mathbf{A} \rVert_2 \lVert \mathbf{A}^{-1} \rVert_2 = \sigma_1 \cdot \frac{1}{\sigma_n} = \frac{\sigma_1}{\sigma_n}$$

Frobenius norm: $\lVert \mathbf{A} \rVert_F = \sqrt{\sum_i \sum_j \lvert a_{ij} \rvert^2}$.

Fact

$$\lVert \mathbf{A} \rVert_F = \sqrt{\sum_{i=1}^n \sigma_i^2} = \lVert \vec{\sigma} \rVert_2$$
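Both norm identities are easy to confirm numerically (a sketch):

```python
import numpy as np

# Check ||A||_2 = sigma_1 and ||A||_F = sqrt(sigma_1^2 + ... + sigma_n^2).
rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
sigma = np.linalg.svd(A, compute_uv=False)

spectral = np.linalg.norm(A, 2)       # 2-norm (largest singular value)
frobenius = np.linalg.norm(A, 'fro')  # Frobenius norm
print(np.isclose(spectral, sigma[0]))                    # True
print(np.isclose(frobenius, np.sqrt(np.sum(sigma**2))))  # True
```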

### IV. Transpose and Inverse

Let $\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$, so

$$\mathbf{A}^T = \mathbf{V}\mathbf{\Sigma}^T\mathbf{U}^T, \qquad \mathbf{A}^{-1} = \mathbf{V}\mathbf{\Sigma}^{-1}\mathbf{U}^T$$

Note this is not a standard SVD for $\mathbf{A}^{-1}$ (the values $1/\sigma_i$ appear in ascending order), but it can be rearranged into one.

### V. Invariance

If $\mathbf{B} = \mathbf{PAQ}$ where $\mathbf{P}$ and $\mathbf{Q}$ are orthogonal, then $\mathbf{B}$ has the same singular values as $\mathbf{A}$.

Proof: The singular values of $\mathbf{A}$ satisfy $\det\left(\mathbf{A}^T\mathbf{A} - \lambda\mathbf{I}\right) = 0$. Since $\mathbf{Q}^{-1} = \mathbf{Q}^T$, we have

$$\mathbf{B}^T\mathbf{B} - \lambda\mathbf{I} = \mathbf{Q}^T\left(\mathbf{A}^T\mathbf{A}\right)\mathbf{Q} - \lambda\mathbf{I} = \mathbf{Q}^T\left(\mathbf{A}^T\mathbf{A} - \lambda\mathbf{I}\right)\mathbf{Q}$$

So $\det\left(\mathbf{A}^T\mathbf{A} - \lambda\mathbf{I}\right) = \det\left(\mathbf{B}^T\mathbf{B} - \lambda\mathbf{I}\right)$.

### VI. Solve $\mathbf{A}\vec{x} = \vec{b}$

If $\mathbf{A}$ is invertible, then $\mathbf{A}^{-1} = \mathbf{V}\mathbf{\Sigma}^{-1}\mathbf{U}^T$, and

$$\mathbf{A}^{-1} = \sum_{i=1}^n \frac{1}{\sigma_i}\vec{v}_i\vec{u}_i^T$$

So we can solve for $\vec{x} = \mathbf{A}^{-1}\vec{b}$. If $\mathbf{A}$ is not invertible, we can use the pseudo-inverse $\mathbf{A}^+$ instead.
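A sketch of solving $\mathbf{A}\vec{x}=\vec{b}$ via the expansion of $\mathbf{A}^{-1}$ (the variable names are ours):

```python
import numpy as np

# Solve A x = b by accumulating x = sum_i (u_i . b) / sigma_i * v_i.
rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))  # invertible with probability 1
b = rng.standard_normal(n)

U, sigma, Vt = np.linalg.svd(A)
x = sum((U[:, i] @ b) / sigma[i] * Vt[i, :] for i in range(n))
print(np.allclose(A @ x, b))  # True
```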

### VII. Positive / Negative Eigenvalues

If $\mathbf{A}$ is symmetric and positive definite, then the eigenvalues of $\mathbf{A}$ are positive, and

$$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T = \mathbf{Q}\mathbf{D}\mathbf{Q}^T$$

where the eigenvalues in $\mathbf{D}$ are sorted by size. If $\mathbf{A}$ is symmetric but has negative eigenvalues, the singular values of $\mathbf{A}$ are the absolute values of the eigenvalues of $\mathbf{A}$.

For example, if $\mathbf{A}\vec{x} = -3\vec{x}$ where $\lVert \vec{x} \rVert_2 = 1$, then we have $\sigma_i = 3$, $\vec{v}_i = \vec{x}$, and $\vec{u}_i = -\vec{x}$, so

$$\mathbf{A}\vec{v}_i = -3\vec{x} = 3\left(-\vec{x}\right) = \sigma_i\vec{u}_i$$

### VIII. Low Rank Approximation

Let the low-rank approximation of $\mathbf{A}$ be

$$\mathbf{A}_k = \sum_{i=1}^k \sigma_i\mathbf{W}_i$$

where $k = 1, 2, \cdots, r-1$. So $\operatorname{rank}(\mathbf{A}_k) = k$. Since $\operatorname{rank}(\mathbf{A}) = r$, $\mathbf{A}_r = \mathbf{A}$.

Eckart-Young Theorem

1. $\lVert \mathbf{A} \rVert_2 = \sigma_1$
2. $\lVert \mathbf{A}-\mathbf{A}_k \rVert_2 = \sigma_{k+1}$ for $k = 1, 2, \cdots, r-1$

Theorem
The most accurate approximation of $\mathbf{A}$ with rank $\leq k$ is $\mathbf{A}_k$. In other words,

$$\min_{\operatorname{rank}(\mathbf{B}) \leq k} \lVert \mathbf{A} - \mathbf{B} \rVert_2 = \lVert \mathbf{A} - \mathbf{A}_k \rVert_2 = \sigma_{k+1}$$

Proof (by contradiction): Suppose there exists $\mathbf{B}$ such that $\operatorname{rank}(\mathbf{B}) \leq k$ and $\lVert \mathbf{A} - \mathbf{B} \rVert_2 < \sigma_{k+1}$. Let $N(\mathbf{B})$ be the null space of $\mathbf{B}$. Since $\dim(N(\mathbf{B})) \geq n-k$ and $\dim(\operatorname{span}\{\vec{v}_1, \cdots, \vec{v}_{k+1}\}) = k+1$, we know that $N(\mathbf{B})$ intersects $S_{k+1} = \operatorname{span}\{\vec{v}_1, \cdots, \vec{v}_{k+1}\}$ nontrivially. Let $\vec{x} \in N(\mathbf{B}) \cap S_{k+1}$ with $\lVert \vec{x} \rVert_2 = 1$, and write $\vec{x} = \alpha_1\vec{v}_1 + \cdots + \alpha_{k+1}\vec{v}_{k+1}$, so

$$\lVert \left(\mathbf{A} - \mathbf{B}\right)\vec{x} \rVert_2^2 = \lVert \mathbf{A}\vec{x} \rVert_2^2 = \lVert \alpha_1\sigma_1\vec{u}_1 + \cdots + \alpha_{k+1}\sigma_{k+1}\vec{u}_{k+1} \rVert_2^2 \geq \sigma_{k+1}^2$$

This contradicts our assumption that $\lVert \mathbf{A} - \mathbf{B} \rVert_2 < \sigma_{k+1}$.
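The identity $\lVert \mathbf{A}-\mathbf{A}_k \rVert_2 = \sigma_{k+1}$ can be checked directly (a sketch):

```python
import numpy as np

# Truncate the SVD at rank k and compare the 2-norm error with sigma_{k+1}.
rng = np.random.default_rng(5)
A = rng.standard_normal((6, 5))
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]  # rank-k approximation
err = np.linalg.norm(A - A_k, 2)
print(np.isclose(err, sigma[k]))  # True (sigma[k] is sigma_{k+1}, 0-indexed)
```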

## Application of SVD: Image Compression

The relative error of our approximation / compression is given by

$$\text{relative error} = \frac{\lVert \mathbf{A} - \mathbf{A}_k \rVert_2}{\lVert \mathbf{A} \rVert_2} = \frac{\sigma_{k+1}}{\sigma_1}$$

Given a grayscale image of size $m$ by $n$, we represent the image in the form of a matrix $\mathbf{A}_{m\times n}$ where $0 \leq a_{ij} \leq 255$. The compression of the image is given by the low-rank approximation of the matrix $\mathbf{A}$.

1. How well does $\mathbf{A}_k$ do in reproducing the original image?
2. What is the compression ratio?
   • original image: $m \cdot n$ values
   • compressed image: $k \cdot (m + n)$ values

For an RGB image of size $m$ by $n$, we concatenate the three color matrices and get a matrix $\mathbf{A}_{3m \times n}$. The procedure is then the same as for grayscale images. Two caveats:

• complications from using floats (instead of integers)
• the compression is lossy
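A compression sketch on a synthetic stand-in for an image (the sizes and the choice of `k` are our assumptions):

```python
import numpy as np

# "Compress" an m x n matrix of pixel-like values with a rank-k truncated SVD,
# then report the storage ratio k(m+n)/(mn) and relative error sigma_{k+1}/sigma_1.
rng = np.random.default_rng(6)
m, n, k = 64, 48, 8
A = rng.integers(0, 256, size=(m, n)).astype(float)

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

storage_ratio = k * (m + n) / (m * n)
rel_err = sigma[k] / sigma[0]
print(f"storage ratio: {storage_ratio:.3f}, relative error: {rel_err:.3f}")
```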

## SVD Application: Watermarking / Steganography

The goal is to embed / hide an image in another one (and extract it later). We follow the same procedure as in Liu & Tan 2002.

Let $\mathbf{A}$ be the original grayscale image and $\mathbf{B}$ be the grayscale image to be embedded (of the same size as $\mathbf{A}$).

1. Find the SVD of $\mathbf{A}$: $\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$
2. Let $\bar{\mathbf{B}} = \mathbf{\Sigma} + a\mathbf{B}$ for some $0 < a < 1$ and find the SVD of $\bar{\mathbf{B}}$: $\bar{\mathbf{B}} = \mathbf{U}_{\bar{\mathbf{B}}}\mathbf{\Sigma}_{\bar{\mathbf{B}}}\mathbf{V}_{\bar{\mathbf{B}}}^T$
3. Embed $\bar{\mathbf{B}}$ to get $\bar{\mathbf{A}}$: $\bar{\mathbf{A}} = \mathbf{U}\mathbf{\Sigma}_{\bar{\mathbf{B}}}\mathbf{V}^T$

Now, given an image $\mathbf{A}^*$ (that you think is your $\bar{\mathbf{A}}$):

1. Find the SVD of $\mathbf{A}^*$: $\mathbf{A}^* = \mathbf{U}_{\mathbf{A}^*}\mathbf{\Sigma}_{\mathbf{A}^*}\mathbf{V}_{\mathbf{A}^*}^T$
2. Find $\bar{\mathbf{B}}^*$ (using the stored $\mathbf{U}_{\bar{\mathbf{B}}}$ and $\mathbf{V}_{\bar{\mathbf{B}}}$): $\bar{\mathbf{B}}^* = \mathbf{U}_{\bar{\mathbf{B}}}\mathbf{\Sigma}_{\mathbf{A}^*}\mathbf{V}_{\bar{\mathbf{B}}}^T$
3. Recover $\mathbf{B}^* = \frac{1}{a}\left(\bar{\mathbf{B}}^* - \mathbf{\Sigma}\right)$
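The embed/extract steps can be sketched on small random matrices standing in for images (all names are ours; when $\mathbf{A}^* = \bar{\mathbf{A}}$ exactly, recovery is exact up to round-off):

```python
import numpy as np

# Embed B into A (Liu & Tan 2002 style), then extract it back.
rng = np.random.default_rng(7)
n, a = 8, 0.1
A = rng.standard_normal((n, n))  # host "image"
B = rng.standard_normal((n, n))  # watermark "image"

# Embedding
U, s, Vt = np.linalg.svd(A)
Sigma = np.diag(s)
B_bar = Sigma + a * B
U_b, s_b, Vt_b = np.linalg.svd(B_bar)
A_bar = U @ np.diag(s_b) @ Vt    # watermarked image

# Extraction (keeps U_b, Vt_b, Sigma, and a from the embedding step)
s_star = np.linalg.svd(A_bar, compute_uv=False)
B_bar_star = U_b @ np.diag(s_star) @ Vt_b
B_star = (B_bar_star - Sigma) / a
print(np.allclose(B_star, B))  # True
```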

## SVD Application: Principal Component Analysis (PCA)

The goal is to find connections in data.

Example: Find connections between word length and the number of words used in its definition.

$L \equiv$ number of letters in a word
$W \equiv$ number of words in its definition

| word | $L$ | $W$ |
| --- | --- | --- |
| bag | 3 | 223 |
| across | 6 | 117 |
| $\vdots$ | $\vdots$ | $\vdots$ |

Assumption: there is a linear connection

$$W = \alpha L + \beta \quad \text{or} \quad L = aW + b$$

Error: To make the error function mathematically equivalent for both linear connections, we use the true (perpendicular) distance $D_i$:

$$E(\alpha,\beta) = \sum D_i^2 = \frac{1}{1+\alpha^2}\sum\left(\alpha L_i + \beta - W_i\right)^2$$

Data preprocessing:

1. Center the data: Setting $\partial_\alpha E = \partial_\beta E = 0$ requires solving cubic equations. To avoid this, we center the data:

$$X_i = L_i - \bar{L} \quad \text{where } \bar{L} = \frac{1}{n}\sum L_i, \qquad Y_i = W_i - \bar{W} \quad \text{where } \bar{W} = \frac{1}{n}\sum W_i$$

With this, we assume $Y = \alpha_0 X$.

2. Scale the variables (nondimensionalization): $\frac{1}{1+\alpha_0^2}$ must be dimensionless, so we scale $X$ and $Y$ by characteristic values $X_C$ and $Y_C$:

$$x_i = X_i/X_C, \qquad y_i = Y_i/Y_C$$

Possible choices for $X_C$ and $Y_C$:

• $X_C = \frac{1}{n}\lVert \vec{X}\rVert_1$ and $Y_C = \frac{1}{n}\lVert \vec{Y}\rVert_1$
• $X_C = \lVert \vec{X} \rVert_\infty$ and $Y_C = \lVert \vec{Y} \rVert_\infty$
• $X_C = \frac{1}{\sqrt{n}}\lVert \vec{X}\rVert_2$ and $Y_C = \frac{1}{\sqrt{n}}\lVert \vec{Y}\rVert_2$

To sum it up, we assume $y = \alpha x$ with

$$E(\alpha) = \frac{1}{1+\alpha^2}\sum\left(\alpha x_i - y_i\right)^2 = \frac{1}{1+\alpha^2}\left[\alpha^2\,\vec{x}\cdot\vec{x} - 2\alpha\,\vec{x}\cdot\vec{y} + \vec{y}\cdot\vec{y}\right]$$

Setting $\partial_\alpha E = 0$, we have

$$\alpha = \frac{1}{2}\left(-\lambda \pm \sqrt{\lambda^2+4}\right) \quad \begin{cases} + & \text{if } \vec{x}\cdot\vec{y} > 0 \\ - & \text{if } \vec{x}\cdot\vec{y} < 0 \end{cases}$$

where

$$\lambda = \frac{\vec{x}\cdot\vec{x} - \vec{y}\cdot\vec{y}}{\vec{x}\cdot\vec{y}}$$
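A sketch of the closed-form orthogonal fit on synthetic data roughly following $y = 2x$ (the data and names are our assumptions):

```python
import numpy as np

# Orthogonal (perpendicular-distance) fit y = alpha * x via the closed form.
rng = np.random.default_rng(8)
x = rng.standard_normal(50)
y = 2.0 * x + 0.1 * rng.standard_normal(50)

lam = (x @ x - y @ y) / (x @ y)
sign = 1.0 if x @ y > 0 else -1.0
alpha = 0.5 * (-lam + sign * np.sqrt(lam**2 + 4))

def E(a):  # error measured by perpendicular distance
    return np.sum((a * x - y)**2) / (1 + a**2)

# alpha should be a minimizer of E
print(E(alpha) <= E(alpha + 0.01) and E(alpha) <= E(alpha - 0.01))  # True
```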

## Principal Component Analysis

Assume there are $m$ variables (or quantities) $q_1, \cdots, q_m$ and $n$ measurements of each, so we have a data matrix $\mathbf{A}_{n \times m}$. We first center the data:

$$\bar{q}_j = \frac{1}{n}\sum_{i=1}^n q_{ij}, \qquad Q_{ij} = q_{ij} - \bar{q}_j$$

Then we scale the data:

$$p_{ij} = \frac{Q_{ij}}{S_j} \quad \text{where } S_j \text{ is the scaling factor for the } j\text{th variable}$$

Our objective is to find a $k$-dimensional approximation of the data. When $k = 1$, $\vec{p} = \alpha_1 \vec{v}_1$. When $k = 2$, $\vec{p} = \alpha_1\vec{v}_1 + \alpha_2\vec{v}_2$. Here, $\vec{v}_i\cdot\vec{v}_i = 1$ and $\vec{v}_i\cdot\vec{v}_j = 0$ for $i \neq j$. The error is given by

$$E(\vec{v}) = \sum d_i^2 = \sum\left[\vec{p}_i\cdot\vec{p}_i - \left(\vec{p}_i\cdot\vec{v}_1\right)^2 - \cdots - \left(\vec{p}_i\cdot\vec{v}_k\right)^2\right]$$

Notice that $\sum_{i=1}^n \vec{p}_i\cdot\vec{p}_i = \sum_{i,j} \mathbf{P}_{ij}^2 = \lVert \mathbf{P} \rVert_F^2 = \sigma_1^2 + \cdots + \sigma_m^2$.

To minimize the error, we use a Lagrange multiplier,

$$E(\vec{v}, \lambda) = \sum d_i^2 + \lambda\left(\lVert \vec{v} \rVert_2^2 - 1\right)$$

and set

$$\begin{cases} \partial_{v_1} E = \cdots = \partial_{v_m} E = 0 \\ \partial_\lambda E = 0 \end{cases}$$

So we have

$$\left(\mathbf{P}^T\mathbf{P}\right)\vec{v} = \lambda\vec{v}$$

and $\lambda$ is an eigenvalue of $\mathbf{P}^T\mathbf{P}$ with $\vec{v}$ the corresponding eigenvector.

Fact
$\sum_{i=1}^n \left(\vec{p}_i \cdot \vec{v}_1\right)^2 = \lambda_1$

With this,

$$E_1 = \sum\left[\vec{p}_i\cdot\vec{p}_i - \left(\vec{p}_i\cdot\vec{v}_1\right)^2\right] = \sigma_1^2 + \cdots + \sigma_m^2 - \lambda_1$$

So $\lambda_i = [\text{singular value}]^2 = \sigma_i^2$.

Theorem
Let $\mathbf{P}$ be the $n \times m$ adjusted data matrix (with $n \geq m$ and $\operatorname{rank}(\mathbf{P}) = m$). Also, let

$$\vec{p} = \alpha_1\vec{v}_1 + \cdots + \alpha_k\vec{v}_k$$

For each $k$, the best fit is obtained using the first $k$ columns of $\mathbf{V}$. The error is

$$E_k = \sigma_{k+1}^2 + \cdots + \sigma_m^2$$

and the relative error is

$$\text{relative error} = \frac{\sigma_{k+1}^2 + \cdots + \sigma_m^2}{\sigma_1^2 + \cdots + \sigma_m^2}$$

1. For general data, how do we pick $k$?
   • Guttman-Kaiser: pick $k$ so that $\sigma_k \geq \frac{1}{m}\sum_{j=1}^m \sigma_j$
   • Cattell scree: look for the corner point on the curve of $\sigma_i$ versus $i$
   • Math solution: use the relative error
2. How much data is needed?
   Rule of 10: about $10 \cdot m$ measurements
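The theorem translates into a short PCA recipe: center (and optionally scale) the data, take the SVD, and project onto the first $k$ right singular vectors. A sketch on synthetic data (sizes and noise level are our assumptions):

```python
import numpy as np

# PCA via SVD of the centered data matrix P (n measurements x m variables).
rng = np.random.default_rng(9)
n, m, k = 200, 3, 2
latent = rng.standard_normal((n, 2))                # two underlying factors
A = latent @ rng.standard_normal((2, m)) + 0.05 * rng.standard_normal((n, m))

P = A - A.mean(axis=0)                              # center each variable
U, sigma, Vt = np.linalg.svd(P, full_matrices=False)

P_k = P @ Vt[:k].T @ Vt[:k]                         # best rank-k fit
rel_err = np.sum(sigma[k:]**2) / np.sum(sigma**2)   # (sigma_{k+1}^2+...)/(sigma_1^2+...)
print(rel_err < 0.01)  # True: two components capture almost everything
```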

## SVD Application: Independent Component Analysis (ICA)

In this situation, we have two sources and two recorders. Our goal is to determine what source I and source II said, given the two recordings. Our assumption is

$$\begin{aligned} x_1 &= a_{11}s_1 + a_{12}s_2 \\ x_2 &= a_{21}s_1 + a_{22}s_2 \end{aligned}$$

This can be written as

$$\vec{x} = \mathbf{A}\vec{s}$$

The matrix $\mathbf{A}$ is known as the mixing matrix. The data matrix can be written as $\mathbf{X}_{n \times m}$ where $m \geq n$ and $\mathbf{X}$ is full rank. Similar to the previous section, we preprocess the data by centering and scaling. First we center the data:

$$\vec{x}_M = \frac{1}{n}\sum_{i=1}^n \vec{x}_i, \qquad \vec{x}_i^* = \vec{x}_i - \vec{x}_M, \qquad \vec{x}_M = \mathbf{A}\vec{s}_M, \qquad \vec{s}_i^* = \vec{s}_i - \vec{s}_M$$

Notice that the solution for $\mathbf{A}$ and $\vec{s}_i$ is not unique. If $\mathbf{A}$ and $\vec{s}$ form a solution such that $\vec{x} = \mathbf{A}\vec{s}$, then for any invertible $\mathbf{B}$, $\mathbf{AB}$ and $\mathbf{B}^{-1}\vec{s}$ is also a solution, since $\vec{x} = \mathbf{A}\vec{s} = \left(\mathbf{AB}\right)\left(\mathbf{B}^{-1}\vec{s}\right)$. To deal with this, we will look for a solution that satisfies

$$\frac{1}{n}\sum_{i=1}^n \vec{s}_i^*\left(\vec{s}_i^*\right)^T = \mathbf{I}_{m\times m}$$

In statistics, this is known as whitening the data.

Assume $\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ and notice that

$$\frac{1}{n}\left(\mathbf{X}^*\right)^T\mathbf{X}^* = \frac{1}{n}\sum_{i=1}^n \vec{x}_i^*\left(\vec{x}_i^*\right)^T = \frac{1}{n}\sum_{i=1}^n \left(\mathbf{A}\vec{s}_i^*\right)\left(\mathbf{A}\vec{s}_i^*\right)^T = \mathbf{A}\left(\frac{1}{n}\sum_{i=1}^n \vec{s}_i^*\left(\vec{s}_i^*\right)^T\right)\mathbf{A}^T = \mathbf{A}\mathbf{A}^T = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T\left(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T\right)^T = \mathbf{U}\mathbf{\Sigma}^2\mathbf{U}^T$$

To compute $\mathbf{A}$, we need to compute the SVD of $\frac{1}{n}\left(\mathbf{X}^*\right)^T\mathbf{X}^*$. Since it is positive definite, we have

$$\frac{1}{n}\left(\mathbf{X}^*\right)^T\mathbf{X}^* = \mathbf{U}_\mathbf{X}\mathbf{\Sigma}_\mathbf{X}\mathbf{U}_\mathbf{X}^T$$

So $\mathbf{U} = \mathbf{U}_\mathbf{X}$ and $\mathbf{\Sigma}^2 = \mathbf{\Sigma}_\mathbf{X}$ (which means $\sigma_i = \sqrt{\sigma_i^*}$).
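The whitening step can be sketched as follows; `A_mix` is a hypothetical mixing matrix and the uniform sources are a stand-in for real recordings:

```python
import numpy as np

# Whitening: from the sample covariance (1/n) X*^T X* = U Sigma^2 U^T,
# the transformed samples z_i = Sigma^{-1} U^T x_i* have identity covariance.
rng = np.random.default_rng(10)
n, m = 5000, 2
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, m))  # non-Gaussian unit-variance sources
A_mix = np.array([[2.0, 0.5], [1.0, 1.5]])             # hypothetical mixing matrix
X = S @ A_mix.T                                        # rows are samples x_i = A s_i

Xc = X - X.mean(axis=0)                                # center
C = Xc.T @ Xc / n                                      # sample covariance ~ A A^T
sig2, U = np.linalg.eigh(C)
order = np.argsort(sig2)[::-1]
sig2, U = sig2[order], U[:, order]
sigma = np.sqrt(sig2)

Z = Xc @ U / sigma                                     # whitened samples
print(np.allclose(Z.T @ Z / n, np.eye(m)))             # True
```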

To find $\mathbf{V}$, we introduce the idea of a contrast function. A usual choice is the kurtosis $\kappa$. Given a single variable $s^*$, using the fact that we want whitened sources, we have

$$\kappa = -3 + \frac{1}{n}\sum_{i=1}^n \left(s_i^*\right)^4$$

When $s^*$ is Gaussian, we have $\kappa = 0$. We want the sources to be as non-Gaussian as possible, so we want $\lvert \kappa \rvert$ or $\kappa^2$ to be as large as possible. For multiple sources, we can use

$$\vec{\kappa} = \left(\kappa_1, \cdots, \kappa_m\right)$$

So the contrast function would be a norm of $\vec{\kappa}$.
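A sketch of the kurtosis contrast on unit-variance samples (the distribution choices are ours): the Gaussian gives $\kappa \approx 0$, while sub- and super-Gaussian sources give nonzero $\kappa$.

```python
import numpy as np

# Empirical excess kurtosis kappa = -3 + (1/n) sum s_i^4 for unit-variance samples.
rng = np.random.default_rng(11)
n = 200_000

def kurtosis(s):
    return -3.0 + np.mean(s**4)

gaussian = rng.standard_normal(n)
uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), n)    # unit variance, kappa = -1.2
laplace = rng.laplace(0.0, 1.0 / np.sqrt(2), n)      # unit variance, kappa = 3

print(abs(kurtosis(gaussian)) < 0.1)                 # True (near zero)
print(kurtosis(uniform) < 0, kurtosis(laplace) > 0)  # True True
```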