Principal Component Analysis
Principal component analysis, commonly referred to as PCA, is a linear method to reduce the dimension of large data sets. For instance, in linear regression, one wants to approximate the output by a linear combination of inputs. Two problems then come to mind:
- The dimension \(d\) might be extremely large, resulting in overfitting problems.
- Some of the inputs might be collinear, reducing the explanatory power of one factor with respect to another.
The idea is to perform an orthonormal change of basis under which the transformed inputs become uncorrelated, and to keep the directions associated with the largest eigenvalues.
Theory
Theorem: Spectral Decomposition
Any \(d\times d\) dimensional real-valued symmetric matrix \(\mathbf{A}\) can be decomposed as
\[
\mathbf{A} = \mathbf{\Gamma}\,\mathbf{\Lambda}\,\mathbf{\Gamma}^\top
\]
where
- \(\mathbf{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)\) is the diagonal matrix of the real-valued eigenvalues.
- \(\mathbf{\Gamma}\) is an orthogonal matrix whose columns are normalized eigenvectors satisfying \(\mathbf{\Gamma}^\top \mathbf{\Gamma} = \mathbf{\Gamma} \mathbf{\Gamma}^\top = I\).
Furthermore, if the matrix is positive semi-definite, that is, \(\mathbf{x}^\top \mathbf{A}\mathbf{x} \geq 0\) for any column vector \(\mathbf{x}\), then the eigenvalues are non-negative, and they are strictly positive in the case of positive definiteness.
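As a quick numerical sanity check, the theorem can be verified on a small synthetic symmetric matrix (a minimal sketch; np.linalg.eigh is the symmetric-matrix variant of the eigensolver used later on the data):
# minimal sketch: verify A = Gamma Lambda Gamma^T on a random symmetric matrix
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B + B.T                              # symmetric 4 x 4 matrix
lam, gamma = np.linalg.eigh(A)           # eigenvalues (ascending) and orthonormal eigenvectors
assert np.allclose(gamma @ np.diag(lam) @ gamma.T, A)   # spectral decomposition
assert np.allclose(gamma.T @ gamma, np.eye(4))          # Gamma is orthogonal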
Consider a column vector \(\mathbf{X}\) of random variables and define the mean vector and covariance matrix
\[
\mathbf{\mu} = E\left[\mathbf{X}\right], \qquad \mathbf{\Sigma} = \mathrm{Cov}(\mathbf{X}) = E\left[\left(\mathbf{X} - \mathbf{\mu}\right)\left(\mathbf{X} - \mathbf{\mu}\right)^\top\right].
\]
Properties of Covariance and Mean
Clearly, the covariance is a real-valued symmetric matrix.
Furthermore, for a matrix \(\mathbf{B} \in \mathbb{R}^{l\times d}\) and a column vector \(\mathbf{a}\in \mathbb{R}^l\), it holds that
\[
E\left[\mathbf{B}\mathbf{X} + \mathbf{a}\right] = \mathbf{B}\,E\left[\mathbf{X}\right] + \mathbf{a}, \qquad \mathrm{Cov}\left(\mathbf{B}\mathbf{X} + \mathbf{a}\right) = \mathbf{B}\,\mathrm{Cov}(\mathbf{X})\,\mathbf{B}^\top.
\]
It follows that \(\mathrm{Cov}(\mathbf{X})\) is positive semi-definite. Indeed, for any column vector \(\mathbf{y}\), taking \(\mathbf{B} = \mathbf{y}^\top\) yields
\[
\mathbf{y}^\top \mathrm{Cov}(\mathbf{X})\,\mathbf{y} = \mathrm{Cov}\left(\mathbf{y}^\top \mathbf{X}\right) = \mathrm{VAR}\left(\mathbf{y}^\top \mathbf{X}\right) \geq 0.
\]
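The affine transformation rule above can also be checked empirically on synthetic data (a minimal sketch; the dimensions and sample size are arbitrary):
# minimal sketch: empirical check that Cov(BX + a) is close to B Cov(X) B^T
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((10_000, 3))     # 10,000 samples of a 3-dimensional random vector
B = rng.standard_normal((2, 3))          # B in R^{l x d} with l = 2, d = 3
a = rng.standard_normal(2)
Y = X @ B.T + a                          # samples of BX + a
print(np.cov(Y, rowvar=False))           # empirical Cov(BX + a)
print(B @ np.cov(X, rowvar=False) @ B.T) # B Cov(X) B^T, approximately equal up to sampling error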
Since \(\mathbf{\Sigma}\) is positive semi-definite, let \(\lambda_{1}, \ldots, \lambda_{d}\) be the real-valued eigenvalues and \(\mathbf{\Gamma}_1, \ldots,\mathbf{\Gamma}_d\) the corresponding eigenvectors, for which it holds that
\[
\mathbf{\Sigma}\,\mathbf{\Gamma}_k = \lambda_k \mathbf{\Gamma}_k, \qquad k = 1, \ldots, d.
\]
Without loss of generality, up to a permutation of the columns of \(\mathbf{\Gamma}\), we can assume that \(\lambda_1\geq \ldots \geq \lambda_d \geq 0\).
Permuting the Representation
The spectral decomposition holds under permutation. Indeed, let \(\sigma\) be a permutation of \(\{1, \ldots, d\}\) and define the column permutation matrix \(P\) by \(P_{ij} = 1\) if \(j =\sigma(i)\) and zero otherwise. This permutation matrix is invertible with inverse \(P^\top\), which coincides with the row permutation matrix. In particular, \(P\) is orthogonal.
Given a matrix \(A\), the product \(P^\top A\) permutes the rows of \(A\) while the product \(AP\) permutes the columns of \(A\). Hence, it follows that
\[
\mathbf{A} = \mathbf{\Gamma}\mathbf{\Lambda}\mathbf{\Gamma}^\top = \mathbf{\Gamma}P\,P^\top\mathbf{\Lambda}P\,P^\top\mathbf{\Gamma}^\top = \left(\mathbf{\Gamma}P\right)\left(P^\top\mathbf{\Lambda}P\right)\left(\mathbf{\Gamma}P\right)^\top.
\]
Since the set of orthogonal matrices is a group, \(\tilde{\mathbf{\Gamma}} = \mathbf{\Gamma}P\) is again orthogonal while \(\tilde{\mathbf{\Lambda}}=P^\top \mathbf{\Lambda}P\) is diagonal with the eigenvalues permuted according to \(\sigma\), so that \(\mathbf{A} = \tilde{\mathbf{\Gamma}}\tilde{\mathbf{\Lambda}}\tilde{\mathbf{\Gamma}}^\top\) is again a spectral decomposition.
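This invariance can again be checked numerically (a minimal sketch on a synthetic matrix; the permutation is arbitrary):
# minimal sketch: the spectral decomposition is invariant under a column permutation of Gamma
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = B + B.T
lam, gamma = np.linalg.eigh(A)
sigma = np.array([2, 0, 3, 1])             # a permutation of {0, ..., 3}
P = np.eye(4)[sigma]                       # P[i, j] = 1 if j = sigma[i], zero otherwise
gamma_tilde = gamma @ P                    # permute the columns of Gamma
lam_tilde = P.T @ np.diag(lam) @ P         # still diagonal, with the eigenvalues permuted
assert np.allclose(gamma_tilde @ lam_tilde @ gamma_tilde.T, A)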
PCA Decomposition
The principal component transform of \(\mathbf{X}\) is the random vector
\[
\mathbf{Y} = \mathbf{\Gamma}^\top\left(\mathbf{X} - \mathbf{\mu}\right).
\]
It clearly follows that
\[
E\left[\mathbf{Y}\right] = \mathbf{0}, \qquad \mathrm{Cov}\left(\mathbf{Y}\right) = \mathbf{\Gamma}^\top \mathbf{\Sigma}\,\mathbf{\Gamma} = \mathbf{\Lambda},
\]
that is, the components of \(\mathbf{Y}\) are uncorrelated with \(\mathrm{VAR}(Y_k) = \lambda_k\). In particular, the total variance satisfies \(\sum_{k=1}^d \mathrm{VAR}\left(Y_k\right) = \sum_{k=1}^d \lambda_k = \mathrm{Trace}(\mathbf{\Lambda}) = \mathrm{Trace}(\mathbf{\Sigma}) = \sum_{k=1}^d \mathrm{VAR}(X_k)\), for which we define the ratio
\[
\mathrm{tv}(k) = \frac{\sum_{l=1}^{k} \lambda_l}{\sum_{l=1}^{d} \lambda_l}, \qquad k = 1, \ldots, d,
\]
that can be interpreted as the percentage of variance explained by the first \(k\) components. This ratio increases up to \(100\%\) and is concave since the components are ordered decreasingly. Given a threshold \(\theta\) (for instance \(\theta = 50\%\)), the smallest \(k_0\) such that \(\mathrm{tv}(k_0)\geq \theta\) is the number of leading components needed to explain at least a fraction \(\theta\) of the total variance of \(\mathbf{X}\).
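For instance, with \(d = 4\) and eigenvalues \((\lambda_1, \lambda_2, \lambda_3, \lambda_4) = (4, 3, 2, 1)\), the total variance equals \(10\) and
\[
\mathrm{tv}(1) = 40\%, \quad \mathrm{tv}(2) = 70\%, \quad \mathrm{tv}(3) = 90\%, \quad \mathrm{tv}(4) = 100\%,
\]
so that for \(\theta = 50\%\) the first \(k_0 = 2\) components are enough.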
We then define \(\mathbf{Y}^{(1)}\) and \(\mathbf{Y}^{(2)}\) as the first \(k_0\) and last \(d-k_0\) rows of \(\mathbf{Y}\), respectively, as well as (remember that \(\mathbf{\Gamma}\) is the orthogonal matrix of eigenvectors after permuting the columns) \(\mathbf{\Gamma}^{(1)}\) and \(\mathbf{\Gamma}^{(2)}\) as the matrices with the first \(k_0\) and last \(d-k_0\) eigenvectors, respectively. It holds that
\[
\mathbf{X} = \mathbf{\mu} + \mathbf{\Gamma}\mathbf{Y} = \mathbf{\mu} + \mathbf{\Gamma}^{(1)}\mathbf{Y}^{(1)} + \mathbf{\Gamma}^{(2)}\mathbf{Y}^{(2)} = \mathbf{\mu} + \mathbf{\Gamma}^{(1)}\mathbf{Y}^{(1)} + \mathbf{\varepsilon},
\]
where \(\mathbf{\varepsilon} = \mathbf{\Gamma}^{(2)}\mathbf{Y}^{(2)}\) is a random vector contributing at most a fraction \(1-\theta\) of the total variance of \(\mathbf{X}\).
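The whole construction fits in a few lines of NumPy (a minimal sketch on synthetic data; the threshold and dimensions are arbitrary):
# minimal sketch: PCA transform, explained variance ratio and truncated reconstruction
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5_000, 5)) @ rng.standard_normal((5, 5))  # correlated samples, one row per observation
mu = X.mean(axis=0)
lam, gamma = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(lam)[::-1]                  # sort the eigenvalues decreasingly
lam, gamma = lam[order], gamma[:, order]

Y = (X - mu) @ gamma                           # principal component transform, one row per observation
tv = lam.cumsum() / lam.sum()                  # explained variance ratio tv(k)
k0 = int((tv < 0.5).sum()) + 1                 # smallest k with tv(k) >= 50%
X_hat = mu + Y[:, :k0] @ gamma[:, :k0].T       # reconstruction from the first k0 components only
print(k0, np.abs(X - X_hat).mean())            # average absolute reconstruction error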
Normalization
Note that PCA is not invariant under rescaling of the individual inputs. Depending on the situation (in particular when the inputs have very different amplitudes), a normalization (by the standard deviation, or by quantiles) can be performed. The interpretation is however different, as the resulting projection may differ substantially.
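For instance, normalizing each input by its standard deviation amounts to running the PCA on the correlation matrix rather than on the covariance matrix (a minimal sketch on synthetic data):
# minimal sketch: standardized inputs turn the covariance matrix into the correlation matrix
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((10_000, 3)) * np.array([1.0, 10.0, 100.0])  # inputs with very different scales
Z = (X - X.mean(axis=0)) / X.std(axis=0)                             # standardized inputs
# covariance of Z equals the correlation of X (up to the degrees-of-freedom convention)
assert np.allclose(np.cov(Z, rowvar=False), np.corrcoef(X, rowvar=False), atol=1e-2)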
NumPy Implementation
To perform a PCA decomposition, the functionality of np.linalg.eig is entirely sufficient.
For this we make use of the dataset CSI, which contains the prices of 208 stocks of the CSI Index over 7 years.
We consider as \(\mathbf{X}\) the normalized returns of these stock prices and perform a PCA decomposition thereof (in other words, we run the PCA on the correlation matrix rather than on the covariance matrix).
We first import the data, compute the returns and their correlation matrix, and extract the eigenvalues and eigenvectors.
# import numpy, pandas to load data and plotly to plot
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.express as px
from plotly.subplots import make_subplots
# load the dataset, convert the dates to datetime and set as index
dfstocks = pd.read_csv('./data/CSI.csv')
dfstocks["Date"] = pd.to_datetime(dfstocks["Date"])
dfstocks = dfstocks.set_index("Date")
# compute the returns and drop the rows with NaN values (missing data can destroy positive semi-definiteness)
dfret = dfstocks.pct_change().dropna().copy()
# compute the correlation matrix
Corr = dfret.corr()
# and plot it as a heat map
fig = px.imshow(Corr, color_continuous_scale='Inferno_r')
fig.show()
# use linalg.eig to get eigenvalues and eigenvectors (eigenvectors[:, i] is the ith eigenvector)
eigenvalues, eigenvectors = np.linalg.eig(Corr)
# sort them by decreasing eigenvalue and wrap them into pandas objects
# (Corr is symmetric, so any tiny imaginary parts returned by eig can be dropped)
order = np.argsort(eigenvalues.real)[::-1]
dfeigenval = pd.Series(eigenvalues[order].real, index=[f"PC{i}" for i in range(1, len(order) + 1)])
dfeigenvec = pd.DataFrame(eigenvectors[:, order].real, index=Corr.columns, columns=dfeigenval.index)
# %% plot the relative and cumulative weights of the eigenvalues on two different y-axes
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
    go.Bar(
        x=dfeigenval.index,
        y=dfeigenval / dfeigenval.sum(),
        name="Eigenvalues Relative Weights",
    ),
    secondary_y=False,
)
fig.add_trace(
    go.Scatter(
        x=dfeigenval.index,
        y=dfeigenval.cumsum() / dfeigenval.sum(),
        mode="lines",
        name="Eigenvalues Cumulative Weights",
    ),
    secondary_y=True,
)
# update the axis layout for each axis
fig.update_yaxes(tickformat=",.0%", title="Relative Weights", secondary_y=False)
fig.update_yaxes(tickformat=",.0%", title="Cumulative Weights", secondary_y=True)
fig.show()
Now decide on a threshold (here \(60\%\)) and define the principal factor loading matrix.
# set the threshold
theta = 0.6
# cumulative explained variance ratio tv(k)
cumratio = dfeigenval.cumsum() / dfeigenval.sum()
# keep every component up to (and including) the first one for which tv(k) reaches theta
idx = cumratio.shift(1, fill_value=0.0) < theta
print(f"The factors needed to explain at least {theta:.0%} of the variance are: {list(idx[idx].index)}")
# slice the eigenvector dataframe, keeping only the selected columns
pca = dfeigenvec.loc[:, idx].copy()
display(pca)
print(f"The original dimension of {len(dfeigenvec.columns)} has been reduced to {len(pca.columns)}")
Using Scikit-learn
Scikit-learn provides the class sklearn.decomposition.PCA, which performs the whole PCA pipeline directly.
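A minimal sketch could look as follows (it assumes scikit-learn is installed and reuses dfret from above); note that PCA only centers the data, so the returns are standardized first to mimic the correlation-matrix PCA performed above.
# sketch: the same analysis with scikit-learn (assumes scikit-learn is installed)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# standardize the returns so that the PCA acts on the correlation rather than the covariance structure
Z = StandardScaler().fit_transform(dfret)
# a float in (0, 1) keeps the smallest number of components explaining that share of the variance
model = PCA(n_components=0.6)
scores = model.fit_transform(Z)
print(model.n_components_, model.explained_variance_ratio_.sum())
The attribute explained_variance_ratio_ plays the role of the relative weights computed by hand above.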