By Daniel Hubbard | July 7, 2010
Independence is an important topic this time of year. Though inspired by that 4th of July kind of independence, that is not what I’m thinking about. Instead, I’m thinking about data independence. As we gather our evidence, one question that always needs to be asked is, “Is this data independent of the data already collected?”
A scientist needs to understand his apparatus, otherwise he may introduce all sorts of errors into his study. Trying to use a device faster than it was intended to be used might change the results if a measurement is influenced by the measurement made immediately before. That thought occurred to me as I worked on a new presentation about the census. In 1900 the census asked for a person’s age as of June 1, just as had been done for decades. Then it asked for birth month and birth year. Age and birth year are usually redundant as soon as birth month enters the picture. How did the enumerator ask the questions? If, given an answer for age and birth month, the enumerator simply wrote down the birth year without asking, then the age and birth year are not independent. If the informant rounded the age, the birth year will automatically agree unless the question is asked and the informant gets a chance to be more specific. If one is wrong, so is the other. One doesn’t add weight to the other. Even if the enumerator asked for both age and birth year, the informant may have stated both based on the same faulty memory or even the same lie. The information is not independent. Given an age and a birth year that clearly don’t agree, would the enumerator simply write them down or would there be some attempt to correct one or the other?
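The arithmetic behind that redundancy can be sketched in a few lines of Python. This is only an illustration: the function name and the example ages are hypothetical, and it assumes the enumerator did nothing more than subtract. It treats anyone born in June as not yet having had a birthday on June 1, which is wrong for a person born on June 1 itself.

```python
CENSUS_YEAR = 1900
CENSUS_MONTH = 6  # ages were asked as of June 1, 1900


def implied_birth_year(age, birth_month):
    """Birth year implied by age on census day and birth month (1-12).

    Hypothetical helper: it assumes the year is derived by subtraction alone.
    """
    # January through May birthdays have already happened by June 1.
    # June is treated as "not yet", which is an approximation.
    had_birthday = birth_month < CENSUS_MONTH
    return CENSUS_YEAR - age - (0 if had_birthday else 1)


# Hypothetical informant: really 38, born in September, so born in 1861.
true_year = implied_birth_year(38, 9)

# The age gets rounded to 40 and the enumerator derives the year from it.
recorded_age = 40
recorded_year = implied_birth_year(recorded_age, 9)

print(true_year, recorded_year)
```

In this made-up case the recorded age and the recorded birth year agree with each other perfectly, yet both are off by two years. The agreement tells us nothing, because one number was computed from the other.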
As soon as one bit of information is derived from another they are not independent. If they also convey the same thing, they are no different from one piece of data. One is totally dependent on the other. Mistakes often gain the power to convince by being repeated. It looks like massive amounts of evidence all pointing in the same direction but it is only the same single morsel of data copied over and over. The data are not independent.
Secondary sources often contain the same information, but it would be wrong to conclude that, by agreeing, they all give weight to the correctness of the information. Instead they may all be relying on the same primary source or they may even be drawing on each other. That doesn’t necessarily make them all wrong but it does mean that they are not independent. In a sense, they are all the same source repeated over and over. Observe the sky once at sunset and one can then repeat the statement 1000 times that the sky is red, but those thousand bits of data are not independent. Make a few observations throughout a cloudless day or two and you can conclude the sky is blue because now your data are independent.
At a simplistic level, I like to think of data as getting a vote. Information does not get a second vote by putting on Groucho glasses, or a third by returning to the polling place in a Marge Simpson wig. Different evidence might get different weight when you are pondering what conclusion to draw, but each piece of evidence should still only get to vote once, no matter how well its repeat appearances are disguised.
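The one-vote rule can be sketched in Python as well. Everything here is hypothetical: the claims, the source names, and the assumption that each appearance has already been traced back to the record it ultimately came from, which is the hard part in practice. The sketch also ignores weighting and simply counts.

```python
from collections import defaultdict


def tally_votes(evidence):
    """Count one vote per ultimate origin, not one per appearance.

    `evidence` is a list of (claim, origin) pairs, where origin names the
    record the information was ultimately derived from.
    """
    origins_by_claim = defaultdict(set)
    for claim, origin in evidence:
        # A set ignores repeats, however well they are disguised.
        origins_by_claim[claim].add(origin)
    return {claim: len(origins) for claim, origins in origins_by_claim.items()}


# Hypothetical evidence for one person's birth year.
evidence = [
    ("born 1859", "1900 census"),       # the census entry itself
    ("born 1859", "1900 census"),       # a county history that copied it
    ("born 1859", "1900 census"),       # an online tree citing the county history
    ("born 1861", "baptismal record"),
    ("born 1861", "family Bible"),
]

votes = tally_votes(evidence)
print(votes)
```

Counted by appearance, 1859 would win three to two. Counted by origin, it gets a single vote, and the two separate records for 1861 get one each.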
So this July, whether you celebrate Canada Day, the signing of the American Declaration of Independence, the storming of the Bastille or none of the above, think about the independence of your data.