Recently I saw an article in The Register that made me think about the basics of statistics again. Although I learned a lot about statistics in the classroom, that is not enough. It is through the frustrations of real-life applications that you learn the true nature of statistics. As Benjamin Disraeli is supposed to have said,
There are three kinds of lies: lies, damned lies, and statistics.
Many years ago I supported a quality control laboratory at a chemical plant. One day a senior engineer came to me with a request for laboratory data. He was working on a chronic problem in one of the units and was hoping that the laboratory data would help him. He already had an idea of what was wrong, so he hoped the data would not only confirm his diagnosis but also justify the capital investment for the solution he was already working on. Unfortunately, there were several problems with the data.
The first problem was that the procedure used to collect the sample varied from shift to shift. This was an unintended effect of the sample being a low-priority data point. Important data points came from analyzers directly connected to the process control system, and most of the samples processed in the laboratory were used to verify that those online analyzers were working properly. This was one of the data points that did not have an online analyzer. Since the plant would normally run for three years without shutting down, it was manned by three different shifts, and the training on how to collect the sample varied from shift to shift. One shift was very good at collecting this sample while another shift was very bad. To further complicate the problem, some laboratory technicians were not good at running the test for this sample, and the technicians running the plant did not care if the test was run properly.
When the senior engineer looked at the data he found good data, obviously bad data, and questionable data. He tried dropping the bad data, but that still did not give him the result he desired. He tried to selectively drop or adjust some of the questionable data, but the whole process quickly got too subjective. The people he needed to convince of the viability of his project were engineers, and engineers are a pretty straightforward group about quantitative data. Quantitative data should be quantitative, and there are only three possible answers: the data confirms your hypothesis, rejects your hypothesis, or tells you nothing. After a considerable amount of angst the senior engineer accepted that the data told him nothing about his problem, and he went searching for another way to prove his point.
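To see why collecting more of the same samples would not have rescued this data set, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the "true" value, the three hypothetical shifts, and their offsets and scatter are invented numbers, not the plant's data. It simply shows that random scatter shrinks as samples are pooled, while a consistent error in how one shift takes the sample does not.

```python
import random
import statistics

TRUE_VALUE = 50.0  # hypothetical true reading, arbitrary units

# Hypothetical shifts: (systematic offset, random scatter).
# These numbers are invented for illustration only.
SHIFTS = {
    "A": (0.0, 0.5),  # careful sampling: no offset, little scatter
    "B": (0.0, 3.0),  # inconsistent sampling: no offset, lots of scatter
    "C": (4.0, 0.5),  # consistent but wrong procedure: fixed offset
}


def simulate(n_per_shift, seed=0):
    """Draw n_per_shift simulated lab results for each shift."""
    rng = random.Random(seed)
    return {
        name: [rng.gauss(TRUE_VALUE + offset, scatter) for _ in range(n_per_shift)]
        for name, (offset, scatter) in SHIFTS.items()
    }


def pooled_mean(samples):
    """Average every result from every shift together."""
    return statistics.fmean(v for values in samples.values() for v in values)


for n in (10, 1000, 100000):
    samples = simulate(n)
    error = pooled_mean(samples) - TRUE_VALUE
    print(f"{n:>6} samples per shift: pooled mean is off by {error:+.2f}")
```

In this toy setup the scatter from shift B averages away as the sample count grows, but the pooled mean does not drift toward the true value. It settles near one third of shift C's offset, because one shift in three is consistently wrong. Unless you know which shift that is and by how much, no amount of averaging tells you.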
The article in The Register tries to convince the reader that the process of averaging can overcome measurement problems. The author sees a “haystack” of data that overwhelms the “needle” of measurement errors. I see it differently. If you look closer at the “haystack” you see that a large percentage of the data is derived from a much smaller set of actual measurements, which carries both the errors and a variety of adjustments for missing and erroneous readings. In this case the problem is that there are only a hundred years of temperature measurements, primarily from urban sites, and they are being used to derive two thousand years of global temperatures.
Climate scientists are facing a problem similar to the one the senior engineer in my story faced. They wish they could go back in time and do a better job of measuring temperature in more places and more consistently. Things would be so much easier if scientists had placed a higher priority on temperature measurements and the threat of global warming a hundred years ago. Unfortunately, temperature measurement has always been a low-priority task, and the data that has been collected shows it. Adjustments and temperature proxies are the norm. The resulting data is uncomfortably subjective. It may be saying something important or nothing at all. To carry the Register author's haystack analogy to its logical conclusion, is this a stack of hay or of something else?
Once again I am reminded that the most important lesson I have learned about statistics concerns the quality of the raw data. If you have good data, statistics is your friend and will help you solve many problems. If you have bad data, statistics is your enemy and will probably create more problems than it solves.