Principal Component Analysis (PCA) is a
multivariate procedure that transforms a number of (possibly) correlated
variables into a (smaller) number of uncorrelated variables called principal components. The main use of PCA is to reduce the dimensionality of a data set while retaining as much information as possible; it computes a compact and optimal description of the data set. PCA can be viewed as a mathematical procedure that rotates the data so that maximum variability is projected onto the axes. The rotation of the existing axes to new positions in the space is defined by the original variables, and there is no correlation between the new variables defined by the rotation. The data matrix T is decomposed as

T = S.F + E

where S is a matrix of the scores, F is a matrix of the PCA factors (loadings), and E is an n by p matrix of the residual error not modeled by PCA.

In the PCA factor space, the scores are the coordinates of the samples, and the samples from the same class are grouped in a cluster. So how can we measure the distance from a point (an unknown sample) to a cluster (a substance), and how can we decide whether the point belongs to the cluster? Simply measuring the distance between the point and the center of the cluster does not work well, because the shape of the cluster can vary a lot: the same distance could mean "in" in one direction and "out" in another. The solution is the Mahalanobis distance, which transforms the factor space so that the cluster shape (distribution) becomes round, with size (dispersion) equal to 1. In that new space the direction of the distance no longer matters, and the distance measures how many cluster sigmas away the point is. The Mahalanobis matrix is calculated from the score matrix S:

M = (S'.S)/(n-1)

where n is the number of spectra, and the Mahalanobis distance of a score vector s is defined as:

D^2 = s.M^-1.s'

The traditional approach is to do PCA for all training samples
together. Then the PC1 vs. PC2 scores plot is usually used to find
groupings that correspond to the sample classes. The unknown sample is
then decomposed with the same loadings and classified to the closest
group on the same plot (i.e., by checking whether the unknown sample
point falls within any of the clusters of the training set).
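The approach described so far (the decomposition T = S.F + E, projection of an unknown sample with the same loadings, and the Mahalanobis distance to each class cluster) can be sketched in Python with NumPy. The function names here are illustrative, not Classifion's API:

```python
import numpy as np

def train_pca(X, n_factors):
    """Decompose the mean-centered spectra X (n x p) as T = S.F + E."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    F = Vt[:n_factors]        # factor (loading) matrix, k x p
    S = Xc @ F.T              # score matrix, n x k
    return mean, F, S

def mahalanobis_sq(s, S_class):
    """Squared Mahalanobis distance from score vector s to one class cluster.

    Uses M = (S'.S)/(n-1) from the text, with the class scores centered
    on the cluster mean so the distance is measured from the cluster.
    """
    n = S_class.shape[0]
    center = S_class.mean(axis=0)
    Sc = S_class - center
    M = (Sc.T @ Sc) / (n - 1)
    d = s - center
    return float(d @ np.linalg.solve(M, d))

def classify(x, mean, F, S, labels):
    """Project an unknown spectrum with the same loadings, pick the nearest cluster."""
    s = (x - mean) @ F.T
    labels = np.asarray(labels)
    d2 = {c: mahalanobis_sq(s, S[labels == c]) for c in set(labels)}
    return min(d2, key=d2.get)
```

Because the distance is measured in cluster sigmas, the comparison between clusters of very different shapes stays meaningful, which is exactly the point made above.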
This approach is relatively simple, visually controlled, and may work in
some cases (with few classes). Increasing the number of classes, however,
increases the optimum number of loadings required, so visual control
no longer applies (it is really hard to follow 10 or 15
dimensions simultaneously). There are, of course, other non-visual
methods, but there is also another approach: to do PCA on each class
separately. This has two advantages:
- **precision** - It improves the precision of classification, because the principal components are more group-specific instead of averaged over different groups. That includes optimization of the training as well.
- **expansion** - Expanding the list of trainings is trivial: you just add another training to the list, instead of recalculating the whole thing every time you add a group.
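The per-class scheme can be sketched as a simple list of independently trained models; the substance names, random "spectra", and helper name below are made up for the example. The point is that adding a class appends one model and leaves the existing trainings untouched:

```python
import numpy as np

def train_class_model(X, n_factors):
    """Train a separate PCA model for a single class (illustrative helper)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    F = Vt[:n_factors]                      # this class's own factors
    return {"mean": mean, "F": F, "S": Xc @ F.T}

# A "list of trainings": one independent PCA model per class.
rng = np.random.default_rng(0)
trainings = {
    "substance_A": train_class_model(rng.normal(size=(8, 5)), 2),
    "substance_B": train_class_model(rng.normal(size=(8, 5)), 2),
}

# Expansion: a new group only appends a model; nothing is recalculated.
trainings["substance_C"] = train_class_model(rng.normal(size=(8, 5)), 2)
```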
The optimization of a PCA decomposition consists of two phases, repeated consecutively:

- optimizing the threshold between significant spectral information and noise, i.e. choosing the number of factors kept in the training;
- detecting and discarding the outliers (spectra with characteristics "far away" from the rest of the group) using the Mahalanobis distance.
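The two phases might look like this in NumPy. The 99% variance threshold and the 3-sigma outlier cutoff are illustrative assumptions, not Classifion's defaults:

```python
import numpy as np

def choose_n_factors(X, var_threshold=0.99):
    """Phase 1: keep the factors holding var_threshold of the variance;
    the remainder is treated as noise (the 0.99 value is illustrative)."""
    Xc = X - X.mean(axis=0)
    sigma = np.linalg.svd(Xc, compute_uv=False)
    explained = np.cumsum(sigma**2) / np.sum(sigma**2)
    return int(np.searchsorted(explained, var_threshold) + 1)

def inlier_mask(S, max_sigmas=3.0):
    """Phase 2: flag spectra whose Mahalanobis distance to the group
    exceeds max_sigmas cluster sigmas (the 3.0 cutoff is illustrative)."""
    n = S.shape[0]
    Sc = S - S.mean(axis=0)
    M = (Sc.T @ Sc) / (n - 1)                 # M = (S'.S)/(n-1)
    d2 = np.einsum("ij,ij->i", Sc @ np.linalg.inv(M), Sc)
    return np.sqrt(d2) <= max_sigmas
```

In practice the two functions would be applied alternately: after discarding outliers, the factor count is re-optimized on the cleaned group, and so on until the training stabilizes.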
For more information, download Classifion and see its help.