Principal Component Analysis (PCA) is a multivariate procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components (also called loadings or factors), ordered by decreasing variability. The uncorrelated variables are linear combinations of the original variables, and the last of them can be removed with minimal loss of real information. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.

The main use of PCA is to reduce the dimensionality of a data set while retaining as much information as possible. It computes a compact and optimal description of the data set.

PCA can be viewed as a mathematical procedure that rotates the data so that maximum variability is projected onto the axes. The existing axes, defined by the original variables, are rotated to new positions in the same space. After this rotation there is no correlation between the new variables it defines.
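
As an illustration of this view (a sketch, not Classifion code), the following Python snippet eigendecomposes the covariance matrix of mean-centered data, rotates the data onto the eigenvectors ordered by decreasing variance, and shows that the rotated variables are uncorrelated. The data and variable names are made up for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3)) @ np.array([[1.0, 0.8, 0.2],
                                              [0.0, 1.0, 0.5],
                                              [0.0, 0.0, 1.0]])  # correlated variables

    Xc = X - X.mean(axis=0)                # mean-center each variable
    cov = (Xc.T @ Xc) / (Xc.shape[0] - 1)  # covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)   # eigenvectors define the rotation
    order = np.argsort(eigval)[::-1]       # sort axes by decreasing variance
    scores = Xc @ eigvec[:, order]         # data expressed in the rotated axes

    # Covariance of the rotated variables is (nearly) diagonal: no correlation.
    print(np.round(np.cov(scores, rowvar=False), 6))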

The basic model for PCA of the spectral data matrix is:

T = S.F + E

where T is the n by p matrix of training spectra (n spectra of p points each), S is the n by f matrix of scores (f is the number of factors kept), F is the f by p matrix of the PCA factors, and E is the n by p matrix of the residual error not modeled by PCA.
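
As a sketch (assumed names, not Classifion code), the decomposition can be computed with a truncated SVD: the left singular vectors scaled by the singular values give the scores S, the right singular vectors give the factors F, and whatever remains is the residual E. The sketch mean-centers the spectra first, which is the usual convention for PCA.

    import numpy as np

    def pca_decompose(T, n_factors):
        """Return mean, scores S, factors F and residual E so that T = mean + S.F + E."""
        mean = T.mean(axis=0)
        U, sv, Vt = np.linalg.svd(T - mean, full_matrices=False)
        S = U[:, :n_factors] * sv[:n_factors]   # scores: n spectra x n_factors
        F = Vt[:n_factors]                      # factors (loadings): n_factors x p
        E = (T - mean) - S @ F                  # residual not modeled by PCA
        return mean, S, F, E

    # Example with synthetic "spectra": 20 spectra of 200 points each.
    rng = np.random.default_rng(1)
    T = rng.normal(size=(20, 200))
    mean, S, F, E = pca_decompose(T, n_factors=3)
    print(S.shape, F.shape, np.linalg.norm(E))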

In the PCA factor space the scores are the coordinates of the samples, and samples from the same class group into a cluster. So how can we measure the distance between a point (an unknown sample) and a cluster (a substance), and how can we decide whether the point belongs to the cluster? Simply measuring the distance from the point to the center of the cluster does not work well, because the shape of the cluster can vary a lot: the same distance can mean "in" in one direction and "out" in another. The solution is the Mahalanobis distance, which transforms the factor space so that the cluster shape (distribution) becomes round, with size (dispersion) equal to 1. In that new space the direction of the distance does not matter, and the distance measures how many cluster sigmas away the point is.

The calculation of the Mahalanobis matrix is then done on the S matrix of scores:

M = (S'.S)/(n-1), where n is the number of training spectra

and the Mahalanobis distance is defined as:

D2 = s.M-1.s'

where s is the score vector of a sample and M-1 is the inverse of the Mahalanobis matrix.
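
A minimal Python sketch (assumed names, not Classifion code) of both formulas: M is built from the training scores, and the squared Mahalanobis distance of a sample's score vector is measured from the cluster mean.

    import numpy as np

    def mahalanobis_matrix(S):
        """M = (S'.S)/(n-1), computed from the training scores S."""
        n = S.shape[0]                  # n = number of training spectra
        Sc = S - S.mean(axis=0)         # center the cluster of scores
        return (Sc.T @ Sc) / (n - 1)

    def mahalanobis_distance2(s, S):
        """D2 = s.M-1.s' for a score vector s, measured from the cluster mean."""
        M = mahalanobis_matrix(S)
        d = s - S.mean(axis=0)
        return float(d @ np.linalg.inv(M) @ d)  # squared distance in cluster sigmas

    rng = np.random.default_rng(2)
    S_train = rng.normal(size=(30, 3))  # scores of 30 training spectra, 3 factors
    print(mahalanobis_distance2(S_train[0], S_train))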

The traditional approach is to do PCA for all training samples together. The PC1 vs. PC2 scores plot is then typically used to find groupings that correspond to the sample classes. The unknown sample is later decomposed with the same loadings and classified to the closest group on the same plot (i.e. by checking whether its point falls within any of the clusters of the training set). This approach is relatively simple, visually controlled and may work in some cases (with few classes). Increasing the number of classes, however, increases the optimum number of loadings required, so visual control no longer applies (it is really hard to follow 10 or 15 dimensions simultaneously). There are of course other, non-visual methods, but there is also another approach: to do PCA on each class separately (the base class) and later decompose the unknowns on the base class loadings, so that every class has its own factor space. In this manner an unknown is tested against each class separately, always in the base class factor space.
Choosing to calculate the PCs for each class (spec-group) separately has two advantages:

  • It improves the precision of the classification, because the principal components are more group-specific instead of being averaged over different groups. That includes the optimization of the training as well.

  • Expanding the list of trainings is trivial: you just add another training to the list, instead of recalculating the whole thing every time you add a group.
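
A minimal sketch (assumed names, not Classifion code) of this per-class scheme: a PCA model is trained for each class separately, the unknown is projected onto each class's own factors, and its Mahalanobis distance is evaluated in that class's factor space; the class with the smallest distance (or any class within a chosen sigma limit) is the candidate match.

    import numpy as np

    def train_class(T, n_factors):
        mean = T.mean(axis=0)
        U, sv, Vt = np.linalg.svd(T - mean, full_matrices=False)
        S = U[:, :n_factors] * sv[:n_factors]       # training scores
        F = Vt[:n_factors]                          # class-specific factors
        M = (S.T @ S) / (T.shape[0] - 1)            # Mahalanobis matrix
        return {"mean": mean, "F": F, "Minv": np.linalg.inv(M)}

    def distance2_to_class(x, model):
        s = (x - model["mean"]) @ model["F"].T      # scores of the unknown
        return float(s @ model["Minv"] @ s)         # squared Mahalanobis distance

    rng = np.random.default_rng(3)
    classes = {name: train_class(rng.normal(loc=i, size=(25, 100)), n_factors=3)
               for i, name in enumerate(["A", "B"])}
    unknown = rng.normal(loc=1.0, size=100)
    distances = {name: distance2_to_class(unknown, m) for name, m in classes.items()}
    print(min(distances, key=distances.get), distances)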

The optimization of the PCA decomposition consists of two phases repeated consecutively:

  • optimizing the threshold between significant spectral information and noise, i.e. choosing the number of factors kept in the training;

  • detecting and discarding the outliers (spectra with characteristics "far away" from the rest of the group) using the Mahalanobis distance (a sketch of this loop is given below).
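
A minimal sketch (assumed names and thresholds, not Classifion code) of the two phases repeated in a loop: keep the factors that explain a chosen fraction of the variance, then drop training spectra whose Mahalanobis distance is beyond a sigma limit and re-train. The variance fraction and sigma limit are illustrative assumptions.

    import numpy as np

    def optimize_training(T, variance_kept=0.99, outlier_sigma=3.0, max_passes=5):
        keep = np.ones(T.shape[0], dtype=bool)
        for _ in range(max_passes):
            Tk = T[keep]
            mean = Tk.mean(axis=0)
            U, sv, Vt = np.linalg.svd(Tk - mean, full_matrices=False)
            # Phase 1: number of factors explaining `variance_kept` of the variance.
            explained = np.cumsum(sv**2) / np.sum(sv**2)
            n_factors = int(np.searchsorted(explained, variance_kept) + 1)
            S = U[:, :n_factors] * sv[:n_factors]
            M = (S.T @ S) / (Tk.shape[0] - 1)
            # Phase 2: squared Mahalanobis distance of each training spectrum.
            d2 = np.einsum("ij,jk,ik->i", S, np.linalg.inv(M), S)
            outliers = d2 > outlier_sigma**2
            if not outliers.any():
                break
            keep[np.flatnonzero(keep)[outliers]] = False  # discard and repeat
        return keep, n_factors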

For more information, download Classifion and see the help.