Principal Component Analysis (PCA) is a multivariate procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components (or loadings, or factors), which are ordered by decreasing variance. The uncorrelated variables are linear combinations of the original variables, and the last of them can be discarded with minimal loss of real information. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
The main use of PCA is to reduce the dimensionality of a data set while retaining as much information as possible. It computes a compact and optimal description of the data set.
PCA can be viewed as a mathematical procedure that rotates the data so that maximum variability is projected onto the axes. The existing axes are rotated to new positions in the space defined by linear combinations of the original variables, and the new variables defined by this rotation are uncorrelated with each other.
The basic model for PCA of the spectral data matrix is:
T = S.F + E
where T is the n by p matrix of training spectra, S is the matrix of scores, F is the matrix of PCA factors (loadings), and E is the n by p matrix of residual error not modeled by PCA.
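A minimal sketch of this decomposition in Python with NumPy, computing S and F via singular value decomposition; the matrix sizes and the number of retained factors here are illustrative, not prescribed by the model:

```python
import numpy as np

# Hypothetical training matrix T: n = 20 spectra (rows), p = 50 data points each.
rng = np.random.default_rng(0)
T = rng.normal(size=(20, 50))

# Mean-center the spectra (a common, though optional, preprocessing step).
Tc = T - T.mean(axis=0)

# PCA via singular value decomposition: Tc = U * diag(sigma) * Vt.
U, sigma, Vt = np.linalg.svd(Tc, full_matrices=False)

k = 3                      # number of retained factors (assumed for illustration)
S = U[:, :k] * sigma[:k]   # scores matrix S (n by k)
F = Vt[:k]                 # factors/loadings matrix F (k by p)
E = Tc - S @ F             # residual error E not modeled by the k factors
```

With all components retained, E vanishes and T is reconstructed exactly; truncating to k factors keeps the directions of largest variance.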
In the PCA factor space, the scores are the coordinates of the samples, and samples from the same class group together in a cluster. How, then, can we measure the distance between a point (an unknown sample) and a cluster (a substance), and how can we decide whether the point belongs to the cluster? Simply measuring the distance between the point and the center of the cluster does not work well, because cluster shapes can vary a lot: the same distance can mean "in" along one direction and "out" along another. The solution is the Mahalanobis distance, which transforms the factor space so that the cluster shape (distribution) becomes round with size (dispersion) equal to 1. In this new space the direction of the distance no longer matters, and the distance measures how many cluster sigmas away the point is.
The calculation of the Mahalanobis matrix is then done on the S matrix of scores:
M = (S'.S)/(n-1), where n is the number of spectra
and the squared Mahalanobis distance of a sample with score vector s is defined as:
D2 = s.M-1.s'
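The two formulas above can be sketched in Python as follows; the score matrix here is synthetic and, as the formula for M implies, the scores are assumed to be mean-centered:

```python
import numpy as np

# Illustrative score matrix S: n = 30 training spectra, 3 retained factors.
rng = np.random.default_rng(1)
S = rng.normal(size=(30, 3))

n = S.shape[0]
M = (S.T @ S) / (n - 1)        # Mahalanobis matrix: M = (S'.S)/(n-1)

def mahalanobis_sq(s, M):
    """Squared Mahalanobis distance D2 = s.M-1.s' for a score vector s."""
    return float(s @ np.linalg.inv(M) @ s)

d2 = mahalanobis_sq(S[0], M)   # distance of the first training sample to the cluster
```

A point at the cluster center has distance 0, and the distance grows in units of the cluster's own dispersion along each direction, which is exactly the "cluster sigmas" interpretation above.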
The traditional approach is to do PCA on all training samples together. A scores plot (usually PC1 vs. PC2) is then used to find groupings that correspond to the sample classes. An unknown sample is later decomposed with the same loadings and classified to the closest group on the same plot (i.e. checked whether its point falls within any of the clusters of the training set).
Such an approach is relatively simple, visually controlled, and may work in some cases (with few classes). Increasing the number of classes, however, increases the optimum number of loadings required, so visual control no longer applies (e.g. it is really hard to follow 10 or 15 dimensions simultaneously). There are, of course, other non-visual methods, but there is also another approach: do PCA on each class separately (the base class) and later decompose the unknowns on the base-class loadings, so that every class has its own factor space. In this manner an unknown is tested against each class separately, always in the base class's factor space.
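The per-class approach can be sketched as follows: fit PCA on each class separately, then decompose the unknown on each base class's loadings and compare Mahalanobis distances in each class's own factor space. All names, data, and the choice of 3 factors are illustrative assumptions, not part of the original method description:

```python
import numpy as np

def fit_class(X, k):
    """Fit a base-class model: mean, k loadings F, and Mahalanobis matrix M."""
    mean = X.mean(axis=0)
    Xc = X - mean
    U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
    F = Vt[:k]                        # base-class loadings (k by p)
    S = Xc @ F.T                      # scores of the training spectra (n by k)
    M = (S.T @ S) / (len(X) - 1)      # Mahalanobis matrix for this class
    return mean, F, M

def distance_to_class(x, model):
    """Squared Mahalanobis distance of spectrum x in the class's factor space."""
    mean, F, M = model
    s = (x - mean) @ F.T              # decompose the unknown on base-class loadings
    return float(s @ np.linalg.inv(M) @ s)

# Two hypothetical classes of synthetic spectra with different means.
rng = np.random.default_rng(2)
classes = {
    "A": rng.normal(0.0, 1.0, size=(25, 40)),
    "B": rng.normal(3.0, 1.0, size=(25, 40)),
}
models = {name: fit_class(X, k=3) for name, X in classes.items()}

# Test an unknown against each class separately and pick the nearest one.
unknown = rng.normal(3.0, 1.0, size=40)
best = min(models, key=lambda name: distance_to_class(unknown, models[name]))
```

Because every class has its own loadings, adding a class never changes the factor spaces of the others, which is the practical advantage over the single shared decomposition.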
An optimization of the PCA decomposition consists of two phases repeated consecutively:
For more information, download Classifion and see its help.