This is a technique commonly used in machine learning to reduce data complexity. Based on the concept of projections, the goal is to reduce the total number of variables used in the analysis.
- Leads to smaller datasets while minimizing information loss.
- Makes the data easier to visualize.
The main idea is to project the data onto the directions with the most variance. Combining the concepts of eigenvalues, eigenvectors, projections, and the covariance matrix, it is possible to find these directions. The covariance matrix characterizes the spread of the data; its eigenvectors give the directions along which the matrix acts as a pure stretching, and the largest eigenvalue tells in which direction that stretching is greatest.
Note
The larger the eigenvalue, the larger the variance of the data when projected onto the corresponding eigenvector.
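This relationship can be checked numerically. The sketch below (names and the synthetic data are illustrative) verifies that the variance of the data projected onto an eigenvector of the covariance matrix equals the corresponding eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D synthetic data (illustrative)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

Xc = X - X.mean(axis=0)               # center the data
C = np.cov(Xc, rowvar=False)          # covariance matrix (ddof=1)
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
v = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue

proj = Xc @ v                         # 1-D projection onto that direction
print(np.isclose(proj.var(ddof=1), eigvals[-1]))  # variance equals eigenvalue
```

The equality follows because the projected variance is $v^\top C v = \lambda v^\top v = \lambda$ for a unit eigenvector $v$.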
Mathematical Formulation
- Given a dataset matrix $X \in \mathbb{R}^{n \times p}$, where $n$ is the number of observations and $p$ is the number of variables (features).
- Center the data, calculating $X_c = X - \bar{x}$, where $\bar{x}$ is the vector of column means.
- Calculate the covariance matrix $C = \frac{1}{n-1} X_c^\top X_c$.
- Calculate the eigenvalues and eigenvectors of $C$ and sort them from largest to smallest eigenvalue.
- Create the projection matrix $W \in \mathbb{R}^{p \times k}$, whose columns are the $k$ eigenvectors with the largest eigenvalues.
- Project the centered data: $Z = X_c W$.
- Reconstruct the information: $\hat{X} = Z W^\top + \bar{x}$.
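The steps above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation; the function and variable names are assumptions of this sketch:

```python
import numpy as np

def pca(X, k):
    """Center, compute covariance, eigendecompose, project, reconstruct."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                          # center the data
    C = (Xc.T @ Xc) / (X.shape[0] - 1)      # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # sort largest -> smallest
    W = eigvecs[:, order[:k]]               # projection matrix (p x k)
    Z = Xc @ W                              # project centered data
    X_hat = Z @ W.T + x_bar                 # reconstruct in the original space
    return Z, X_hat

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # illustrative data
Z, X_hat = pca(X, k=3)                      # keep all components
print(np.allclose(X, X_hat))                # reconstruction is lossless when k = p
```

Keeping all $p$ components makes the reconstruction exact; choosing $k < p$ trades some reconstruction error for a smaller representation, in line with the information-loss remark above.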