By Daniel Hubbard | November 27, 2010
People who study statistics often talk about distributions. They are simply the shapes that appear when you turn data into a picture. Each bit of information might be marked down where it belongs on a number line. With enough data, a pattern may emerge. That pattern may have many things to say about underlying causes and laws that govern the data. In genealogy we don’t often plot our data in a statistical way, even though we could look at things like age at first marriage if we wanted. Even if we never sit down and plot anything from our data, looking at what distributions are and what they mean can have something to tell us about our research.
The most famous distribution is probably the one known as the “Bell Curve.” Mathematicians call it a “Gaussian Distribution” or a “Normal Distribution.” Carl Friedrich Gauss has his name attached to it because he was the mathematician who did much of the early work on the distribution. That it is also called “normal” gives some idea of just how often this distribution comes up in the world around us. It is called the bell curve because it is somewhat bell shaped: tall at its center (the peak), low farther away (the tails). If you gather a large number of men or women and measure their heights, most of them will be very close to the average value. If you want people who are 6 inches above or below average, you won’t find so many. Every inch farther from the average will mean that you have fewer people who have the height that you are after. Heights tend to follow a bell curve. At school, test scores are often found to be distributed according to a bell curve, which is the source of the phrase “grading on the curve” that you may remember from your school days.
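For anyone who likes to see the numbers, here is a minimal sketch of the height example in Python. The average of 70 inches and the spread (standard deviation) of 3 inches are made-up values chosen only for illustration, not measurements from any real population, and the helper name `share_between` is my own.

```python
from statistics import NormalDist

# Assumed values for illustration only: an average height of 70 inches
# and a standard deviation of 3 inches. These are not real measurements.
heights = NormalDist(mu=70, sigma=3)


def share_between(low, high):
    """Fraction of people whose height falls between low and high (inches)."""
    return heights.cdf(high) - heights.cdf(low)


# A 2-inch-wide band centered on the average (the peak of the bell).
near_average = share_between(69, 71)

# A band of the same width centered 6 inches above the average (the tail).
six_above = share_between(75, 77)

print(f"Within 1 inch of average:        {near_average:.1%}")
print(f"Within 1 inch of 6 inches above: {six_above:.1%}")
```

Both bands are the same width, so the difference between the two percentages comes only from how far each band sits from the center of the bell.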
Book Sales, Genealogical Sources and the Long Tail
There are different kinds of distributions and they fit different sorts of data. If you draw a picture of book sales according to each book’s place in a best seller list, you won’t see a bell curve. The distribution starts high (with the top seller) and falls off to ever lower values on one side. This is the general shape of the kinds of distributions that may have what is known as “the long tail.”
The “long tail” is a description of part of a type of distribution. There is a lot going on at the beginning in this kind of distribution, so people tend to concentrate their efforts on those few things at the very beginning where a few things do a lot. The classic example is a few items with very high sales volume in a store. After the initial peak, there is the tail. No one thing in the tail amounts to much: a little here, a little there. Nothing is a big deal on its own. Nevertheless, there is another word involved here: long. A tail can stretch on and on. Even if there is almost nothing in any one part of the tail, if you add up everything in the tail as it stretches on and on, it can be a lot. It can rival the contents of the peak. New ways of selling and handling inventory have shifted the importance of the tail. The now-classic example is Amazon.com. Because people order over the internet, there is no need to have physical stores or their high overhead. If holding inventory is inexpensive, then the inventory can include lots of items from the tail, things that sell very few copies. Add up all the sales of niche items and the niche sales rival the sales of best sellers.
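The arithmetic behind that claim can be sketched in a few lines of Python. Everything here is an assumption made for illustration: I pretend that the book at rank r on the list sells in proportion to 1/r (a common textbook stand-in for a long-tailed distribution, not real sales data), I call the top 100 titles the “peak,” and the function names are hypothetical.

```python
def sales_at_rank(rank):
    """Assumed toy model: sales fall off in proportion to 1 / rank."""
    return 1.0 / rank


def head_and_tail(total_titles, head_size):
    """Total sales in the peak (top head_size titles) and in the rest."""
    head = sum(sales_at_rank(r) for r in range(1, head_size + 1))
    tail = sum(sales_at_rank(r) for r in range(head_size + 1, total_titles + 1))
    return head, tail


# A small store that can only stock 1,000 titles: a short tail.
small_head, small_tail = head_and_tail(total_titles=1_000, head_size=100)

# A seller with cheap inventory that can list 100,000 titles: a long tail.
head, tail = head_and_tail(total_titles=100_000, head_size=100)

print(f"1,000 titles:   peak {small_head:.2f}, tail {small_tail:.2f}")
print(f"100,000 titles: peak {head:.2f}, tail {tail:.2f}")
```

Under these assumptions the peak is the same in both cases, but the tail only overtakes it when the catalog is long enough. No single title in the tail matters much; the length is what does the work.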
In genealogical research we have long tails as well. A simple example is the number of people with different surnames. People with common surnames make up a very large number of people but there is also an enormous number of people with unusual surnames—names that are found out in the tail.
There is, I think, a far more interesting example of the long tail in our research. So much of what we do often relies on a few very useful sources. As long as those sources are reliable*, there is nothing wrong with that, but there is something incomplete about it. Out in the long tail of genealogical sources there is a lot of information. It just doesn’t all come nicely concentrated, and many of the possible sources won’t turn out to have anything you can use. Still, you can find a lot way out in the long tail of sources.
What source types are in the long tail varies from place to place, changes with time and can even depend on who your ancestors were. There are two reasons why sources end up in the long tail category. They may be rich but underutilized. That is, they may have a low chance of being used despite having a good deal of information. Another set of items in the long tail is made up of the ones that do not contain that much or that require other knowledge before they pay off. That is, they have a low chance of being useful. Some source types probably fall into the long tail for a bit of both reasons.
I might categorize state censuses in the first kind of long tail. In some states and at some times they are as information-rich as the Federal Census but they aren’t so commonly found in people’s source lists. If funeral home records fall into your long tail it is probably because they may or may not exist and when they do, they are not always so heavily utilized: a bit of both reasons to be in the long tail. Sometimes along the frontier, church records might fall into the long tail. In some places and at some times they may be the dominant source of information, in the peak not the tail. When settlement races ahead of organization, religious or otherwise, church records may be a rare treasure that resides in the tail. I just finished looking at the records of a congregation in Upstate New York in the 1790s. It was one of the few sets of church records that I located that might have had some bearing on my problem. The people I hoped to locate were not to be found. They could have been there but they simply were not. For that place and time, I was in the long tail and the odds of finding them were small. There was little to examine. I also recently looked at death records extracted from the proceedings of four fraternal orders in Kansas. There I found a bit to help me even if, once again, I was in the long tail. I have a great-grandmother who can be found in the records of a special boarding school for wards of the state, another long tail source.
Sometimes it is more useful to think of a source or source type as being part of the general genealogical long tail, not likely to contain useful information but just maybe providing a good clue. Other sources and source types might fall into a personal long tail, a source we rarely consult even if we should.
Remember not to be discouraged if you look at a long tail source and find nothing. Whenever you come up empty, that “nothing found” note you write down counts as a piece of information. Does it tell you something that great-grandma wasn’t in those funeral home records even though the owner of the funeral home was a family friend? It might not solve a mystery but it might spark your curiosity. What if the company that great-grandpa boasted about working for in a county history turns out to have records that ought to include him but they don’t?
Even if the few source types in the peak are the most likely place to find information, there is possibly as much if not more information out there in that long tail. For any given problem, sources in the long tail may turn out to be vital.
*Being heavily used doesn’t make a source more reliable, but that is another matter.