Beware the SemiDB

So what, you might be asking, is a “SemiDB” supposed to be? Semi, of course, means half, or partial. DB is database, that friend of researchers of all shapes, sizes and persuasions. SemiDB is the name I use for databases that, for one reason or another, don’t contain as much as you might think.

Some database publishers give access to their databases as they are being filled with data. It is wonderful not to need to wait years for the whole state to be finalized when the only county you need is also the only one that is finished. Other databases may not have all the information available for historical reasons. One town destroyed one type of record and so that information will never be in the database. There will always be that hole in the data. Some databases are built up in a nonsystematic way according to the interests of individuals. One very active person may enter huge amounts of data about a particular set of surnames or a specific area, while other surnames and areas go unentered. Finally, some database titles are simply misleading. Have you ever looked at a book or database with a title something like Early Maryland Marriages, only to discover that it contains marriages from just three of the state’s counties, only where either the bride or the groom had a surname that interested the author, and that none of the data goes back earlier than 1800? If you haven’t, it is only a matter of time.

When using a database it is often more difficult to notice this kind of problem compared to doing research in books or original records. When you pick up a book, you instantly get a feel for how much data it contains. You don’t even really need to think about it. It is a certain thickness, it has a certain page size and you quickly notice how much is on each page. You might also notice that there are only chapters for Washington, Jefferson and Madison counties. That is a pretty clear indication that the book does not contain the whole state. Often a book will contain a section, right up front, describing the information it contains.

As anyone who has ever used a book that is not indexed will tell you, searchable databases border on the miraculous. Nevertheless, you can’t tell at first glance whether or not the database is too small to contain everything that you expect. If few people read the introductions to books, I guess that even fewer read the descriptions of database contents. Sometimes one cannot even find a description.

So, what is it that can go wrong if the contents of a database are misunderstood? After all, if I find great-uncle Thadeus, I’ve found him, right?

Biased Samples

Whether you realize it or not, many of the conclusions that you draw during your research will have some relationship to statistics. You don’t need to be a mathematician to get it right, but you do need to keep your eyes open for problems, the kinds of problems you can have by not bewaring the semiDB.

Here is one example. The results of early telephone surveys are now notorious for their inaccuracy because telephone ownership was far from universal. Some areas lacked phone service. Some people could not afford a telephone. It seems that the infamous 1948 headline “Dewey Defeats Truman” was a mistake due, in part, to the bias of the phone surveys of the time.

If you think of each person who answered one of these long-ago surveys as a bit of data in a semiDB, you see the same problem occurring. The information you want might exist, but if your area lacked “database service” you won’t be able to find it in the database. If the database you are looking at isn’t clear about being compiled from tax lists, you might wonder why your dirt-poor ancestors aren’t there. You might come to the conclusion that they were living somewhere else.

Today, who would expect the same nutrition survey, posted on both www.teenage–midnight-hackers.org and gourmetDiets.com, to give even vaguely similar results? If either result were extended to the whole population, without regard for its bias, we’d be treated to headlines like “Weekly consumption of Cheese-Doodles hits 15 pounds per person” and “Americans fooling their appetites with light and airy souffles.” We want to keep the equivalent of those statements out of our family histories.

False Positives

Mistakes often occur when we fail to realize that the data set we are dealing with doesn’t cover as much territory as we might expect. A while back I was looking for a few relatively unusual surnames in a database. These were families that seemed to go together somehow and I wanted to see if I could pin them down to some specific locale. Sure enough, I found a nice cluster of people with just those surnames. There was a problem though. I recognized too few of the given names involved, and the area I found them in just didn’t feel right. If you can’t find a sufficiently precise statement of the database’s contents, it is time to put on the white lab coat and perform some experiments. So I searched for some very common surnames and found that this same area contained the state’s only concentrations of the names Smith, Brown, Jones, Evans and Schmidt. So, either I had just pinpointed the only population center in the entire state or the statewide title obscured the fact that only a tiny area was covered so far.
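If you can copy your search results into a list, the white-lab-coat experiment can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the records, the county names and the function name apparent_coverage are made up, and a real database would have to be searched by hand or exported first.

```python
# Hypothetical search results: (surname, county) pairs copied from a database.
# None of these records are real; they only illustrate the experiment.
records = [
    ("Zinglemann", "Washington"),
    ("Smith", "Washington"),
    ("Smith", "Washington"),
    ("Brown", "Washington"),
    ("Jones", "Jefferson"),
    ("Evans", "Washington"),
]

# Surnames common enough that they should turn up almost everywhere.
COMMON_SURNAMES = {"Smith", "Brown", "Jones", "Evans", "Schmidt"}


def apparent_coverage(records, common_surnames):
    """Return the set of counties in which any common surname appears.

    If a "statewide" database shows common surnames in only a handful
    of counties, it probably covers only those counties so far.
    """
    covered = set()
    for surname, county in records:
        if surname in common_surnames:
            covered.add(county)
    return covered


covered = apparent_coverage(records, COMMON_SURNAMES)
print("Counties that seem to be covered:", sorted(covered))
```

If the list of counties that comes back is much shorter than the list of counties in the state, a cluster of rare surnames in one of them says more about the database than about your family.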

Now back to great-uncle Thadeus. The less complete the data you examine, the less likely you are to have seen all the possibilities. If my supposed find of great-uncle Thadeus is based on there being no other Thadeus Jones in a database, I’m probably in big trouble if I was using a semiDB. I’ve certainly jumped to conclusions anyway, but when the database is only partially complete, the problem is greatly amplified. So, what happens next week when I’m no longer looking at the database and more data is released with another Thadeus Jones appearing two towns to the east? I would, I hope, eventually discover my error when facts failed to match, but if I had been aware of how incomplete the database was, I would have been less sure, more open to alternatives and wasted less time. If I had realized it was a work in progress, I might have checked back.

False Negatives

When a database only covers part of what you think that it covers, it is also very easy to get what can be thought of as a false negative. If you hypothesize that your ancestors moved to an area about 1822, you might test that by looking at the 1820 census and the 1830 census. You find them in 1830 but not in 1820. Does that mean they didn’t migrate in 1818? No, not really. Especially if the county they lived in turns out to be missing from the surviving census rolls for 1820. It is easy to be fooled by holes in the data.
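The reasoning in the census example can be written down as a tiny check: before treating a failed search as evidence, ask whether the database could have contained the record at all. The sketch below is hypothetical throughout. The coverage table, the county names and the function interpret_miss are my own inventions for illustration, and in practice you would have to build the coverage table from the database description or from your own experiments.

```python
# Hypothetical coverage table: the (county, census year) pairs that the
# database actually contains. Here Madison County's 1820 rolls are missing.
COVERAGE = {
    ("Madison", 1830),
    ("Jefferson", 1820),
    ("Jefferson", 1830),
}


def interpret_miss(county, year, coverage):
    """Say what a failed search for a family in (county, year) can tell us."""
    if (county, year) in coverage:
        # The records exist and were searched, so the absence is real evidence
        # (though spelling variants and indexing errors are still possible).
        return "not found in covered records"
    # The records are not in the database, so the miss is a hole in the data.
    return "not covered: absence proves nothing"


print(interpret_miss("Madison", 1820, COVERAGE))
print(interpret_miss("Jefferson", 1820, COVERAGE))
```

Only the first kind of answer says anything about where your ancestors were. The second just sends you off to look for other records.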

Taming the Wild SemiDB

Here are a few steps you can take to avoid falling victim to the semiDB:

Read any description of the database content that you can find. A database of marriage records might be claiming more in its title than the Presbyterians-only nature of the data put into it would support.
If the database is from a book and the introduction to the book is part of the database, read it.
Experiment with searching the database. It’s exciting to see that the only matches for the name Zinglemann in the state come from The Dandelion Forks Gazette. It’s far less exciting once you discover that every mention of Smith in the state is also in The Dandelion Forks Gazette.
Research the nature of the records used. The database may sound general but the records that go into it may be subject to bias.
If you don’t feel like you’re getting sensible results from a search of scanned printed material, check whether the index was prepared by humans or by OCR (“Optical Character Recognition,” that is, prepared by a computer from the scanned text). Humans may not be totally accurate, but rarely are they as wildly wrong as OCR can be when it doesn’t work. Sometimes in a database of scanned images of printed pages you can find a link to “view text.” That might lead you to what the computer thought it saw. You might see “Somerset” in the image and the computer might have “recognized” something like “5urnonco[.” If that’s the case, stop trusting the search function and grab your reading glasses.
Remember that databases, no matter how complete, have their limitations. Far, far from everything has been scanned, indexed and put online. Ancient paper and not so ancient microfilms are where the bulk of the data is.

Daniel Hubbard

January 31, 2010 at 10:35 pm

Thanks for the kind words. Glad to inspire you to investigate those infants! I hope you can figure them out and that it is some help.

Nancy

January 31, 2010 at 9:52 pm

Hi. I just wanted to let you know that I really enjoy and appreciate your blog. Thanks for sharing your thoughts, insights, and knowledge. After reading your post about graveyards I’m going to look into the 3 infants buried on my g-grandfather’s grave lot — the ggf who emigrated from Germany and for whom I can find no siblings or parents. I mostly assumed that he was just generous or possibly that one of the infants might have been an illegitimate child of his oldest daughter. Thanks for the encouragement to look further and deeper.
Nancy from My Ancestors and Me

