The original goal of this
research project was to run IBM's Intelligent Miner (IM) for Data on a NASA database
containing measurements of gases known to be involved in ozone layer depletion, along with
ancillary measurements of pressure, temperature, time etc ... This database has been
analyzed and pored over for years by arguably some of the best scientists in the field of
atmospheric chemistry and the idea of the research was to compare IM's automated findings
with those of the scientists doing manual, query driven research. After DB2 and IM
had been installed and configured, the software was initially run on the database but the
returned results were, in my opinion, unsatisfactory. Although powerful and full
featured, IM seemed more oriented toward deeper investigation after knowledge has been
discovered; quite the opposite (at least in my mind) of what a data mining tool should
do. The output was a firehose of plots and tables, none of which seemed relevant,
and I knew what knowledge was likely to be found. On successive runs, things were
improved by tweaking the input parameters, but that of course required knowledge a
priori, the very thing we're asking IM to discover for us. This motivated a
change of direction in the research, namely to experiment with new algorithms that could
discover relationships, especially correlations, among the measurements.
In our work with this database, correlation is one of the most common and useful tools for
analysis. Correlations are used to determine the stoichiometry of the many
reactions, discover the lifetimes of important tracer gases in the atmosphere and even
help detect calibration problems between separate instruments when a linear relationship
is known to exist between their measurements. In addition, deviations from well
known correlations can indicate interesting phenomena such as dehydration or
denitrification in the atmosphere. Correlation has a quantitative definition in
statistics but for our purposes we give it a more qualitative, common English definition.
Two variables are said to be positively correlated if when one increases in value,
the other increases as well. Conversely, if while one variable increases the other
decreases, then those variables are said to be negatively correlated.
As part of NASA's ongoing research on ozone layer depletion, our group (the Atmospheric Laser Spectroscopy Group at the Jet Propulsion Laboratory) participated in an Arctic polar campaign that was conducted during the 1999-2000 winter in Kiruna, Sweden. The flights were made by a NASA owned U2, the famous reconnaissance airplane used during the Cold War which NASA subsequently renamed ER2 (for Earth Research). The ER2 has been modified to house instruments from several leading universities and NASA Centers. With this payload, the pilot flies into an area of interest for about 8 hours at an altitude of 70,000 feet with data being recorded throughout the flight. After landing, the raw data is processed by the individual groups and the final measurements are archived in accordance with a custom file format to an ftp site at AMES Research Center. The instruments all operate independently of each other and consequently, their acquisition methods, time resolution, and noise levels vary. These individual files, which represent about 60 measurements of molecules, compounds and ancillary measurements found in the troposphere and stratosphere along the flight track of the ER2, were then binned into ten-second averages and finally combined into a large tab-delimited file.

Figure 1

Figure 2
Now we transform the data by taking the N2O column and normalizing it to both its range of values as well as its position in the table (i.e. percentile). We then sort the data, keeping an array of its indices in this new sorted order. We next take the CH4 column, normalize to its range of values but put its values in the order of N2O's sorted indices. Figure 3 shows N2O and CH4 both plotted in the sorted order of N2O.

Figure 3
Sorting, of course, makes N2O values increase monotonically. Since CH4 is also monotonically increasing, it must be positively correlated to N2O, if we accept our earlier simple definition of correlation. If it were instead monotonically decreasing, we would then say it was negatively correlated, as shown in Figure 4 below; a graph of the mean age of air parcels plotted in the sorted order of N2O.

Figure 4
A close look at Figure 3 shows that while CH4 is generally monotonically increasing, it is not perfectly sorted as is N2O. Even if two measurements share a perfect theoretical correlation, noise in one or both of the measurements will cause their sorted orders to differ slightly. Given this fact of real world measurements we allow the user to select a number of partitions into which to bin the data and gain some control over granularity. We take the average and median of each partition and examine those results for monotonicity (i.e. correlation). We use the median along with the average to help discriminate against badly "spiked" data. Note that by using averaged partitions, there are no requirements that the two data sets have the same number of points.
To look for correlation, we first sort and store the normalized values and percentiles in a table offline, along with information about the number of points in each column to aid with the partitioning. Since we are comparing each column to every other, we must have an outer and inner loop over the number of columns. The outer loop iterates over the sorted columns. In the inner loop, we partition and average the next unsorted column. We then go through and compare each partition average n to partition average n-1. If n > n-1, we store a 1 into a vector, signifying that there was an upward growth between the two partitions. Similarly, if n < n-1, we place a 0 in the vector representing downward growth. So, if we have m partitions, we will build a vector of 1's and 0's of length m-1. To qualify for correlation we set a threshold for the percentage of partitions that must have the same direction of growth (1's or 0's). For example if the user sets the partition count to 10 and we have an 80% threshold ( the default), we must have at least 8 1's or no more than 2 0's in the growth pattern. To provide more information, we set a second threshold indicating a strong correlation, say at 90%. The growth pattern for Figure 3 is 111111111 which indicates a strong positive correlation. Now consider Figure 5 below.

Figure 5
The growth pattern for this correlation at 10 partitions is 001001000. At an eighty percent threshold it just qualifies for negative correlation. But there's more information in the data that isn't revealed in its growth pattern. Visually, we can see that the graph shows a tighter correlation above the 50th percentile. If we change our partition size from 10 to 20 the bit pattern becomes 110110111100000010000000. This pattern makes a distinction between the somewhat scattered first half of the data with the almost uniform downward trend in the second half of the graph. Although not implemented as of yet, we can search with bitwise operations for similar behavior in other measurements sorted by the same column. Note that a large enough increase in resolution will weight too heavily the point-to-point noise in the data set.and might disqualify two measurements that are in fact correlated. So, control over granularity allows us to use a coarse resolution for examining general trends and a higher resolution to search for interesting growth patterns on a selectable number of partitions.
Figure 2 shows that N2O and CH4 have not only a strong positive correlation, but a linear one as well. We can, for a little extra work, determine if our correlations are linear or curved. Consider the graphs in Table 1 below.
|
|
|
|
|
|
|
|
Table 1
It is easily seen in the progression from the top of the table to the bottom that if the correlation is linear, the unsorted data lies within very close proximity of the sorted data. As the correlation has a more pronounced curvature, the two data sets begin to separate. While I have not confirmed this relationship in any mathematical or formal sense, let me appeal to some intuition from the graphs themselves. Let S be the sorted data set and U be the unsorted data set. The fact that we have normalized the data, in a graphical sense, forces both data sets to "touch" all four boundaries. If S and U are correlated, then to some degree they must start and end at the same corners and grow monotonically between them. In the remaining middle percentiles U can only do one of four things; lay on, be under, be over or cross S. If on top, U is growing at the same rate (slope) as S and therefore must be linear. If U is above or below S, it is growing more quickly or more slowly than S respectively. That is a reasonable definition of curvature in a graph. The last case, crossing S, is a combination of cases 2 and 3 where the rate is faster than S for some partitions and slower for others. To detect curvature, we take sum of the (absolute value) differences between S and U. Because of their being normalized to 100, these values are in units of percent difference. If a "small" value is returned, S and U are in close proximity and therefore have a linear correlation. If a "large" value is returned, S and U are separated and therefore exhibit curvature. Of course, a weak correlation will return large numbers also, so this test is only valid if the data sets have already been found (by examining the partition averages) to be correlated. Using the examples above, I came up with the simple heuristic that if the percent difference is less than 10, the relationship is linear; if it is greater then 10, then the relationship is curved. In all this, I have considered only the positive correlations. The same holds true for negative correlations with the added step that after a correlation is determined to be negative, we subtract its normalized values from 100 percent.
The program is written in Perl and currently performs the table sorting on startup. This table link is the actual output from the program and is a matrix of the correlations between all measurements. The names of the measurements are their chemical names followed by the instituion or instrument name, if there is more than one measurement of that specie. The entries in the table show the relationships found by the program and a legend is given at the top of the page for quick reference as to their meaning. The entries themselves are links to another script that will plot the standard correlation along with the correlation in sorted order for visual inspection.
The algorithm has an outer and inner loop over the k columns. In the inner loop, there are no nested loops for the n rows, giving it a complexity of O(k^2n). Technically, it could be called simply O(n), since k^2 is a constant for any given table and could thus be removed. To check the complexity, I timed the program on a single flight with 2800 rows and then on full file with 37000 rows. Subtracting the time the program spent on sorting operations gave elapsed times of 104 seconds and 1623 seconds respectively. The ratio of the rows is about 13:1 and the ratio of the run times is about 15:1. This would seem to confirm O(n), the difference probably being due to the number of page faults involved with accessing a large table.
At first, I had hoped that I could change k^2 to (k^2)/2 by considering only the combinations, not the permutations of measurements. In other words, examine only the lower diagonal of the matrix. But I found that often, due to noise in the measurements, some X might be found to be correlated to Y, but Y might not be correlated to X in reverse. As an example, consider the case of CO and CFC11. CFC11 increases somewhat linearly with altitude, giving its sorted order enough of a slope over its wide range of values to nicely order CO (see the graph). CO, on the other hand, can be as high as 140 parts-per-billion (ppb) in the lower atmosphere but in the stratosphere converges to about 10 ppb, regardless of altitude (see the graph). This coupled with the fact that it is a noisy measurement causes CFC11 to be reordered wildly when it is plotted in the sorted order of CO. This makes the argument that the tests for sorted correlation are not commutative at all. We can fight against the case of flat noisy data in two ways. We could use log percentiles instead of linear ones, but this would require twice the storage and twice the computation work. A preferable idea would be to measure the slope of the partitions averages and use the one with the higher value to be the sorted column. This method would also allow us to get to k^2/2.
I think the algorithm performs well overall. Except for the cases like that mentioned above, it found all of the major correlations and even found a few that had not occurred to some of the scientists in our group (although they were not scientific breakthroughs). I think the algorithm provides a good first order look at interesting relationships in a large data set and through control of granularity can allow the user to dig deeper if desired.
Compared to this method, regular statistical correlation is more quantitative, but requires the data sets to have the same number of points and also some a priori knowledge about the general shape of the correlation. A statistical correlation is defined by how well the X,Y pairs lie exactly on a straight line. A correlation with a value of 1 is a perfect positive correlation, and a value of -1 is a perfect negative correlation. But what if there is curvature? We would have to individually test whether the relationship is logarithmic, exponential, quadratic, cubic etc ...
The program as its stands does not make provision for searching for growth patterns as mentioned in Section 3 and could definitely benefit from optimization. It would be helpful, especially in the case of trend searching, to median filter the data over a small number of points, just enough to guard against spikes. There is no reason why this method could not work on data types other than real numbers. It only requires that the data be capable of sorting in a meaningful order. Ordinal values and even categorical values could be checked for correlations, but location names (or a hash of them) or zip codes, might only be capable of searching for growth patterns, since their sorted order has no real meaning.
I plan to continue developing the idea in a more formal way and would like, as per our discussion, to try and publish a paper, pending a more thorough literature search for related work.
Thank you,
Gregory Flesch
August 15, 2000