====== Otsu et al. (2018), part 3 ======

In the [[gsoc:otsu2018_2|previous post]] I explained how PCA can be used to create a convenient basis for representing real-world reflectivities. The procedure uses a set of measured spectra to compute an orthonormal basis, sorted by variance or "statistical significance." All but the three most significant vectors are then truncated. I called the difference between the actual spectrum and the reconstructed the error term $\pmb\epsilon$.

Naturally the smaller the error, the better. Because $\pmb\epsilon$ is a vector, the authors chose the Euclidean norm (2-norm or L2) to represent the (scalar) error:
$$\epsilon =  \sqrt{\sum_i \pmb\epsilon_i^2}$$
To get an idea of how well the method works in the general case, the authors would sum up reconstruction errors of //all// spectra in the dataset. I call this the total error $E$:
$$E = \sum_i \epsilon|_{\mathbf s_i}$$
where $\epsilon|_{\mathbf s_i}$ denotes the reconstruction error of $\mathbf s_i$, the $i$-th measured spectrum.

Using just one basis for everything isn't ideal. The picture below shows reconstructions of a few colorchecker spectra. While by definition XYZ tristimulus values (and perceived colors) match perfectly, the spectra aren't that close.

[{{ gsoc:otsu_clustering1.png?600 |Reconstructions of a few colorchecker spectra.}}]

===== Clustering =====

The authors decided to subdivide the CIE 1931 xy space using a binary space partitioning tree. Every node that isn't a leaf divides its subspace in two with a horizontal or vertical line. Leaves, called clusters in the article, contain spectra that lie in their subspace. PCA is done for every cluster separately, creating a set of bases.

[{{ gsoc:chromaticity_diagram.png?150|CIE 1931 xy chromaticity diagram}}]

A quick note about CIE 1931 xy chromaticity coordinates. They're derived from CIE 1931 XYZ tristimulus values:
$$\begin{cases}
  x = \frac{X}{X + Y + Z} \\
  y = \frac{Y}{X + Y + Z}
\end{cases}$$
They're often used where only the chromaticity matters and not the lightness $Y$. When plotted, they give rise to the familiar "color triangle" or "horseshoe" diagram, pictured on the right.

The algorithm for creating an optimized clustering is described as follows.
  * Search every possible new partitioning in the tree. For every color of every leaf, try partitions with a horizontal and a vertical line going through the color, assigning the color itself to the "lesser" part (to the left or below the line).
  * Discard partitions that would create very small nodes. More on this later.
  * Compute how each partition would affect the total error of the entire tree (the sum of $E$ of every leaf). Discard those that would increase it.
  * Realize the partition that decreases the error the most.
  * Repeat the entire process.
The algorithm terminates when no more partitions are possible or there are enough clusters. Authors chose 8 clusters, because of diminishing returns and dataset size concerns. You can look at [[http://lightmetrica.org/h-otsu/download/rgb2spec.zip|the original implementation]] or [[https://github.com/enneract/otsu2018/tree/d218ef0097174897634477bcc180b6268ba6c337|my code]]. The article also has a lot more plots and comparison than I post here, so it's definitely worth looking at too.

Below are the results of running my code. I used the entire dataset of 1,279 spectra and optimized for 8 clusters. This takes about 3 hours on my dated i7 3770.

[{{ gsoc:otsu_clustering2b.png?600 |A visualization of the clustering. Each cluster has its own symbol for the colors that belong to it. (I actually used the cross symbol twice, my bad.)}}]
[{{ gsoc:otsu_clustering2a.png?600 |Reconstructions of the same colorchecker spectra as before, this time using clustering. The discrepancies are notably smaller and some reconstructed spectra look exactly like the measurements.}}]

===== Minimum node size =====

I mentioned that nodes cannot be allowed to be split too finely. First of all, there have to be at least three spectra in a node for PCA to even work. Fewer will result in a singular $M$ and the algorithm simply won't be able to continue. In practice, having few spectra is no good either. Below is a result of PCA decomposition done with just three spectra.

[{{ gsoc:otsu2018_eigenstates_3_colors.png?600 |PCA decomposition of the first three spectra in Otsu et al.'s dataset. An explanation of what is plotted here is in the [[gsoc:otsu2018_2|previous post]].}}]

Only //two// eigenvectors are meaningful and the rest are basically noise. Trying to use this for reconstruction will create completely random-looking spectra, usually going way outside the [0, 1] range. Clearly a lot more data needs to be used to get useful results. In my program I (arbitrarily) set the minimum cluster size to the average cluster size divided by two and it seems to work pretty well.
$$\text{min. cluster size} = \left\lfloor\frac{1}{2}\,\frac{\text{num.\,of spectra}}{\text{num.\,of clusters}}\right\rfloor$$

But can reconstructed spectra exceed the allowed [0, 1] range even if the basis is good? Unfortunately yes, and this will be the topic of the next (and the last) post in this series.