FORTÉ Machine Learning

Kagie, Matthew J.; Hays, Park E.

Of the non-corrupted data collected by the Orbiting Experiment (FORTÉ) satellite’s Photo-Diode Detector during the year 2001, I estimate that 7.9% of 914 894 signals are noise. My result differs dramatically from Guillen’s estimate of 96%. To arrive at this estimate, I used Gaussian mixture model (GMM) clustering, a form of unsupervised machine learning, to aggregate the waveforms into groups based on the absolute value of the lowest 25 positive-frequency discrete Fourier transform coefficients. Then, I marked several of the groups as noise by inspecting a random sampling of waveforms from each group. Marking groups as either noise or non-noise is a supervised binary classification operation. After removing the signals in noise groups from further consideration, I clustered the remaining signals into families. Again, I used a GMM, but for the familial clustering I used a Non-Negative Matrix Factorization feature vector transform. The result was nine distinct families of lightning signals, as well as a second stage of noise filtering. To efficiently represent the entirety of the signal space, I broke each family into deciles based on each signal's distance from the family mean. In this case, distance is measured by the log-likelihood under the GMM. Signals in lower deciles are more similar in shape and amplitude to their family average. I took the top 200 samples from each decile of each family, resulting in 18 000 signals. These signals approximately represent the entirety of the FORTÉ observations. To represent outliers, I also kept a zoo of the 1000 signals furthest from any family’s average. All told, the resulting data set represents the FORTÉ data with a reduction of about 51:1. To allow synthesis of an arbitrarily large number of test signals, I also captured each family’s average signal and the time-sample covariance matrix over the signals in each family.
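As a rough illustration of the feature extraction and decile sampling described above, the following sketch uses NumPy only. It is not the report's code. The function names, the choice to skip the zero-frequency (DC) coefficient, and the reading of "top 200" as the first 200 signals in likelihood order within each decile are all assumptions. The GMM fit itself is omitted; the per-signal log-likelihoods are taken as given.

```python
import numpy as np

def dft_features(signals, n_coeffs=25):
    """Magnitudes of the lowest n_coeffs positive-frequency DFT coefficients.

    signals: array of shape (n_signals, n_samples), one waveform per row.
    Assumption: the zero-frequency (DC) term at index 0 is skipped.
    """
    spectrum = np.fft.rfft(signals, axis=1)
    return np.abs(spectrum[:, 1:n_coeffs + 1])

def sample_by_decile(log_likelihood, per_decile=200):
    """Split one family into deciles and keep up to per_decile signals from each.

    log_likelihood: per-signal log-likelihood under the family's model.
    Decile 0 holds the signals with the highest log-likelihood, i.e. those
    closest to the family average. Returns a list of 10 index arrays.
    """
    order = np.argsort(-log_likelihood)   # most likely signals first
    deciles = np.array_split(order, 10)   # ten nearly equal groups
    return [d[:per_decile] for d in deciles]
```

With nine families, ten deciles per family, and 200 signals per decile, this selection yields 9 × 10 × 200 = 18 000 signals, matching the count stated above.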
Using each family's average signal and covariance matrix, I can synthesize new waveforms by drawing a Gaussian random realization from them. I wrote a program to test the synthesis quality. The program shows two signals on the screen, one synthesized and one randomly drawn from the data, and I attempt to identify the synthesized signal. Although the synthesis is imperfect, in this A/B comparison I correctly chose the synthesized signal only 36% of the time, below the 50% expected from guessing.
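The synthesis step can be sketched as a draw from a multivariate Gaussian defined by a family's mean waveform and time-sample covariance. This is a minimal sketch under stated assumptions, not the report's implementation: `fit_family` and `synthesize` are hypothetical names, and the family is assumed to be stored as a NumPy array with one waveform per row.

```python
import numpy as np

def fit_family(signals):
    """Return the mean waveform and time-sample covariance of one family.

    signals: array of shape (n_signals, n_samples), one waveform per row.
    """
    mean = signals.mean(axis=0)
    cov = np.cov(signals, rowvar=False)   # columns are time samples
    return mean, cov

def synthesize(mean, cov, n, rng):
    """Draw n synthetic waveforms as Gaussian realizations of (mean, cov)."""
    return rng.multivariate_normal(mean, cov, size=n)

# Example with a made-up family: a sine pulse plus white noise.
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0.0, np.pi, 16))
family = base + 0.1 * rng.standard_normal((2000, 16))
mean, cov = fit_family(family)
fake = synthesize(mean, cov, 5000, rng)
```

Because only the mean and covariance are stored per family, any number of test signals can be generated without keeping the original waveforms. The drawn signals reproduce the family's first- and second-order statistics but not any non-Gaussian structure.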