In the previous post I explained how PCA can be used to create a convenient basis for representing real-world reflectivities. The procedure uses a set of measured spectra to compute an orthonormal basis, sorted by variance or “statistical significance.” All but the three most significant vectors are then discarded. I called the difference between the actual spectrum and its reconstruction the error term $\pmb\epsilon$.
Naturally, the smaller the error, the better. Because $\pmb\epsilon$ is a vector, the authors chose the Euclidean norm (2-norm or L2) to represent the (scalar) error: $$\epsilon = \sqrt{\sum_i \pmb\epsilon_i^2}$$ To get an idea of how well the method works in the general case, the authors sum up the reconstruction errors of all spectra in the dataset. I call this the total error $E$: $$E = \sum_i \epsilon|_{\mathbf s_i}$$ where $\epsilon|_{\mathbf s_i}$ denotes the reconstruction error of $\mathbf s_i$, the $i$-th measured spectrum.
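As a rough sketch in NumPy (the article's actual reconstruction also enforces a tristimulus match, so the plain projection below is only a simplified stand-in; `basis` is assumed to hold the three retained vectors as columns and `mean` the mean spectrum subtracted before PCA):

```python
import numpy as np

def reconstruction_error(spectrum, basis, mean):
    """L2 norm of the residual after reconstructing from the truncated basis."""
    coeffs = basis.T @ (spectrum - mean)      # project onto the three basis vectors
    reconstructed = mean + basis @ coeffs     # back to spectral space
    return np.linalg.norm(spectrum - reconstructed)

def total_error(spectra, basis, mean):
    """Total error E: sum of reconstruction errors over the whole dataset."""
    return sum(reconstruction_error(s, basis, mean) for s in spectra)
```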
Using just one basis for everything isn't ideal. The picture below shows reconstructions of a few ColorChecker spectra. While the XYZ tristimulus values (and thus the perceived colors) match perfectly by construction, the spectra themselves aren't that close.
The authors decided to subdivide the CIE 1931 xy space using a binary space partitioning (BSP) tree. Every node that isn't a leaf divides its subspace in two with a horizontal or vertical line. Leaves, called clusters in the article, contain the spectra whose chromaticities lie in their subspace. PCA is done for every cluster separately, creating a set of bases.
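A minimal sketch of such a tree (the names and layout are mine, not the article's; each spectrum is assumed to carry its precomputed xy chromaticity):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Interior node: splits its subspace with a vertical (axis 0, on x)
    # or horizontal (axis 1, on y) line at `position`.
    axis: Optional[int] = None
    position: Optional[float] = None
    lesser: Optional["Node"] = None
    greater: Optional["Node"] = None
    # Leaf ("cluster" in the article): the spectra whose chromaticity lies in
    # this subspace, plus the PCA basis computed from them.
    spectra: Optional[list] = None
    basis: Optional[object] = None

def find_cluster(node, xy):
    """Descend the tree to the leaf whose subspace contains the chromaticity xy."""
    while node.spectra is None:
        node = node.lesser if xy[node.axis] < node.position else node.greater
    return node
```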
A quick note about CIE 1931 xy chromaticity coordinates. They're derived from CIE 1931 XYZ tristimulus values: $$\begin{cases}
x = \frac{X}{X + Y + Z} \\ y = \frac{Y}{X + Y + Z}
\end{cases}$$ They're often used where only the chromaticity matters and not the luminance $Y$. When plotted, they give rise to the familiar “color triangle” or “horseshoe” diagram, pictured on the right.
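In code this is just a trivial helper (assuming the tristimulus values don't all vanish):

```python
def xyz_to_xy(X, Y, Z):
    """Project CIE 1931 XYZ tristimulus values to xy chromaticity coordinates."""
    s = X + Y + Z
    return X / s, Y / s
```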
The algorithm for creating an optimized clustering is described as follows.
The algorithm terminates when no more partitions are possible or there are enough clusters. The authors chose 8 clusters because of diminishing returns and dataset size concerns. You can look at the original implementation or my code. The article also has a lot more plots and comparisons than I show here, so it's definitely worth a look too.
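For illustration, here is my reading of the greedy loop as a hedged sketch. The split-point choice (the cluster's mean chromaticity) and the acceptance criterion are my assumptions, not necessarily the article's exact procedure; `cluster_error` and `mean_xy` are hypothetical callables supplied by the caller (the former redoes PCA on a set of spectra and sums their reconstruction errors), and each spectrum is again assumed to carry an `.xy` attribute.

```python
def split(cluster, axis, position, min_size):
    """Partition a cluster's spectra by chromaticity along one axis.
    Returns None if either half would fall below the minimum cluster size."""
    lesser  = [s for s in cluster if s.xy[axis] <  position]
    greater = [s for s in cluster if s.xy[axis] >= position]
    if len(lesser) < min_size or len(greater) < min_size:
        return None
    return lesser, greater

def optimize_clustering(spectra, num_clusters, min_size, cluster_error, mean_xy):
    """Greedily split clusters until `num_clusters` is reached or no admissible
    split remains, always keeping the split that lowers the total error E most."""
    clusters = [list(spectra)]                 # start with everything in one cluster
    while len(clusters) < num_clusters:
        best = None
        for i, cluster in enumerate(clusters):
            for axis in (0, 1):                # try a vertical and a horizontal split
                halves = split(cluster, axis, mean_xy(cluster)[axis], min_size)
                if halves is None:
                    continue
                # total error E if this split were accepted (PCA redone per half)
                error = (sum(cluster_error(c) for j, c in enumerate(clusters) if j != i)
                         + cluster_error(halves[0]) + cluster_error(halves[1]))
                if best is None or error < best[0]:
                    best = (error, i, halves)
        if best is None:                       # no split possible: terminate early
            break
        _, i, (lo, hi) = best
        clusters[i:i + 1] = [lo, hi]           # replace the cluster with its two halves
    return clusters
```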
Below are the results of running my code. I used the entire dataset of 1,279 spectra and optimized for 8 clusters. This takes about 3 hours on my dated i7-3770.
I mentioned that nodes cannot be allowed to be split too finely. First of all, there have to be at least three spectra in a node for PCA to even work; fewer will result in a singular $M$ and the algorithm simply won't be able to continue. In practice, having only a few spectra is no good either. Below is the result of a PCA decomposition done with just three spectra.
Only two eigenvectors are meaningful and the rest are basically noise. Trying to use this basis for reconstruction will create completely random-looking spectra, usually going way outside the [0, 1] range. Clearly a lot more data is needed to get useful results. In my program I (arbitrarily) set the minimum cluster size to half the average cluster size, and it seems to work pretty well: $$\text{min. cluster size} = \left\lfloor\frac{1}{2}\,\frac{\text{num. of spectra}}{\text{num. of clusters}}\right\rfloor$$
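You can see the rank deficiency for yourself with a tiny demo on random stand-in data (not the measured dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
spectra = rng.uniform(0.0, 1.0, size=(3, 36))    # three fake 36-sample "spectra"
centered = spectra - spectra.mean(axis=0)
# Three mean-centered samples span at most a 2-dimensional subspace, so the
# covariance matrix has rank <= 2: every eigenvalue beyond the second is just
# floating-point noise.
eigenvalues = np.linalg.eigvalsh(centered.T @ centered)[::-1]   # descending order
print(eigenvalues[:4])
```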
But can reconstructed spectra exceed the allowed [0, 1] range even if the basis is good? Unfortunately yes, and this will be the topic of the next (and the last) post in this series.