# Otsu et al. (2018), part 2

The previous post mentions three basis vectors and a “dataset mean.” Here I explain how to find a good basis and what “good” means in this context.

The idea is to find a basis so that any spectral reflectivity can be expressed in the form

$$r(\lambda) = c(\lambda) + \sum_{i=1}^{n} w_i b_i(\lambda),$$

where $i$ runs over all $n$ dimensions of the space ($n = 36$ for Otsu et al.'s spectral data) and $c(\lambda)$ is a constant term you can ignore for now. This is a very common problem and there are many general solutions. By far the most common one is the discrete Fourier transform, which turns arbitrary signals into weighted sums of sines and cosines.

Unfortunately, recalling that the goal is to upsample a color triplet (e.g. XYZ tristimulus values or RGB data), there are only three degrees of freedom to work with, so at most three weights can be determined. Reformulating the problem,

$$r(\lambda) = c(\lambda) + w_1 b_1(\lambda) + w_2 b_2(\lambda) + w_3 b_3(\lambda) + \epsilon(\lambda),$$

where all but three terms are lumped together into an error term $\epsilon(\lambda)$. I think it's clear that three sines or cosines (and a constant term) added together won't accurately reconstruct most spectra. There are far better purely mathematical models, such as Jakob and Hanika (2019).
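A quick numerical sketch makes the point, with random data standing in for a measured spectrum: the full DFT reconstructs a 36-sample signal exactly, but keeping only the three lowest-frequency coefficients loses most of the detail.

```python
import numpy as np

rng = np.random.default_rng(0)
spectrum = rng.random(36)  # stand-in for a measured reflectance

# The full DFT reconstructs the signal exactly...
coefficients = np.fft.rfft(spectrum)
assert np.allclose(np.fft.irfft(coefficients, n=36), spectrum)

# ...but zeroing all but the first three coefficients loses detail.
truncated = coefficients.copy()
truncated[3:] = 0
approximation = np.fft.irfft(truncated, n=36)
error = np.linalg.norm(spectrum - approximation) / np.linalg.norm(spectrum)
```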

## Principal component analysis

Otsu et al. came up with a data-driven solution. Given enough measured data, you can use statistics and linear algebra to create a small basis that carries as much information from the original dataset as possible. The authors used principal component analysis (PCA) and a set of measured reflectances.

Spectral reflectances are treated as random vectors, with each component a random variable. The method requires their sample means to be zero, so the measured data has to be adjusted first. Let

$$\mu(\lambda) = \frac{1}{N} \sum_{k=1}^{N} r_k(\lambda),$$

which I called the “dataset mean” in the previous article, then simply subtract it from the measured spectra:

$$r'_k(\lambda) = r_k(\lambda) - \mu(\lambda).$$

From these the so-called data matrix, in this case an N x 36 matrix, can be constructed:

$$X = \begin{bmatrix} r'_1(\lambda_1) & \cdots & r'_1(\lambda_{36}) \\ \vdots & \ddots & \vdots \\ r'_N(\lambda_1) & \cdots & r'_N(\lambda_{36}) \end{bmatrix}$$

Now the idea is to use this matrix to derive a basis that maximizes variance, or “significance.” I'll skip the proofs, as I'm not particularly well-versed in statistics and it would take me too much effort. The result is given by the eigensystem of the covariance matrix of $X$. The eigenvector with the largest eigenvalue is the vector maximizing variance; the eigenvector with the second largest eigenvalue is perpendicular to it and has the second largest variance, and so on.

Let's take a look at the code. Assume that the measured data, `sds`, is an N x 36 array.

```python
basis_mean = np.mean(sds, axis=0)
data_matrix = sds - basis_mean
```

A covariance matrix could easily be computed with `np.cov`, but there's a shortcut. Because the column-wise means are zero, $X^T X$ is equal to the covariance matrix up to a constant factor. The constant and any normalization can be ignored, as only the eigenvectors are sought after.

```python
covariance_matrix = np.dot(data_matrix.T, data_matrix)
```
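As a sanity check, the product agrees with `np.cov` up to the factor N − 1. A self-contained sketch, with random data standing in for the measured reflectances:

```python
import numpy as np

rng = np.random.default_rng(0)
sds = rng.random((100, 36))  # stand-in for N = 100 measured reflectances

data_matrix = sds - np.mean(sds, axis=0)
covariance_matrix = np.dot(data_matrix.T, data_matrix)

# np.cov with rowvar=False treats columns as variables and divides
# by N - 1, so the two differ only by that constant factor.
n = sds.shape[0]
assert np.allclose(covariance_matrix, np.cov(sds, rowvar=False) * (n - 1))
```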

The last step is the eigendecomposition. NumPy comes with a few algorithms, so it's not a problem. Because `covariance_matrix` is Hermitian (it's a real, symmetric matrix), one can use:

```python
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
```

The bonus feature of `eigh` (aside from being more efficient than the general `np.linalg.eig` for symmetric matrices) is that the eigenvalues are returned sorted in ascending order. The result is simply the last three columns of `eigenvectors`: the three basis vectors with the greatest variance.
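Picking out the basis is then a single slice. Putting the whole pipeline together as a self-contained sketch, again with random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
sds = rng.random((100, 36))  # stand-in for the measured reflectances

basis_mean = np.mean(sds, axis=0)
data_matrix = sds - basis_mean
covariance_matrix = np.dot(data_matrix.T, data_matrix)
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

# eigh sorts eigenvalues in ascending order, so the three most
# significant basis vectors are the last three columns.
basis = eigenvectors[:, -3:]
print(basis.shape)  # (36, 3)
```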

Because zero-mean spectra were used to create the basis, it best describes primed spectra:

$$r'(\lambda) \approx w_1 b_1(\lambda) + w_2 b_2(\lambda) + w_3 b_3(\lambda)$$
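Concretely, since the columns of an eigenvector matrix of a real symmetric matrix are orthonormal, the weights for a given spectrum are plain dot products against its zero-mean (primed) version, and the dataset mean is added back to approximate the original. A sketch with the same random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
sds = rng.random((100, 36))  # stand-in for the measured reflectances

basis_mean = np.mean(sds, axis=0)
data_matrix = sds - basis_mean
_, eigenvectors = np.linalg.eigh(np.dot(data_matrix.T, data_matrix))
basis = eigenvectors[:, -3:]  # shape (36, 3)

# Project the primed spectrum onto the basis; orthonormal columns
# make the least-squares weights simple dot products.
spectrum = sds[0]
weights = basis.T @ (spectrum - basis_mean)

# Reconstruct: weighted basis vectors plus the dataset mean.
reconstruction = basis_mean + basis @ weights
```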