Why a DNA database is a very bad idea

Imagine this scenario: A murder case that went cold 20 years ago is reopened thanks to newly available DNA-based forensics. The state, let’s say Arizona, has a large database of DNA. This isn’t the DNA deposited from newborns that David discussed a few weeks ago, but DNA from convicted criminals and really anyone who’s been brought up on charges in the last few years, whether or not they were eventually found innocent. DNA from a piece of evidence in that 20-year-old case matches DNA in the database, and the suspect is brought in and questioned. He has no alibi for what he was doing on day X 20 years ago. Charges are filed, a jury is called, and the suspect is convicted on the strength of a DNA match. Justice is served, right? Maybe, but maybe not.

So here’s the funny bit. The problem with this scenario has nothing to do with genetics, forensic science, or data storage and access. It’s really not even about crime and justice. It’s about your birthday.

Introduction to Statistics teachers love to begin the first day of class with a demonstration of the birthday paradox. Take any given class of 23 or more students and have them begin calling out their birthdays (month and day, although the day month format is also appropriate). Odds are better than 50% that two students will share the same birthday, and everyone will be amazed. What an incredible coincidence! Not really.


Here’s what’s happening: For any given student in a class of 23, the odds of sharing a birthday with one specific classmate are 1/365, but the odds of sharing a birthday with any of the other 22 classmates are roughly 22/365. Now call on the next student. Their odds of sharing a birthday with someone who hasn’t been called yet are roughly 21/365 (we already established that student number 1 shares no birthday). For the student after that, the odds go down to 20/365, and so on down the line. Add those up and you are really counting pairs of students: 22 + 21 + ... + 1 = 253 pairs in a class of 23, each with a 1/365 chance of matching. The actual equation is a little more complicated than that, because simply adding the chances overcounts (work out the odds that nobody matches, 365/365 × 364/365 × ... × 343/365, and subtract from 1), but conceptually it’s the same. At 23 students the odds are just better than 50%, at 50 students they’re about 97%, and by 57 students they pass 99%.
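For anyone who wants to check the arithmetic, here is a minimal Python sketch of the standard birthday calculation. It assumes 365 equally likely birthdays and ignores leap years and twins; the function name is just one picked for illustration.

```python
def p_shared_birthday(n, days=365):
    """Chance that at least two of n people share a birthday.

    Assumes every day is equally likely and birthdays are independent.
    """
    p_all_distinct = 1.0
    for i in range(n):
        # Person i must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (10, 22, 23, 50, 57):
        print(f"{n:3d} students: {p_shared_birthday(n):.1%}")
```

The crossover from under 50% to over 50% happens between 22 and 23 students, which is why 23 is the magic classroom size.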

This is a problem inherent in searching a large database for matches. The odds of any two particular students sharing the same birthday are low, but by drawing from a large enough pool of students, you all but guarantee a match.

Let’s say you have twins in the class. You know you have twins, but you don’t know who they are. Assume they’re fraternal and look nothing alike, but share the same birthday. Let’s also assume there are 100 students total, enough to make a random birthday match all but certain. If you were to decide to find these twins by calling out birthdays, there’d be no guarantee that the match would be the twins you were looking for. Even if you already know one of the twins, the odds that at least one of the other 98 classmates shares that birthday by chance are a little under 25%. A test that’s only right about 3/4 of the time is not very good.

But assume you can do a little investigation beforehand. You look at your students’ last names, the states they’re from, how they interact with each other. You take all that evidence and use it to decide who your likely twins will be. Then you test their birthdays. Because you’re not drawing from a random pool, the odds that non-twins who happen to have the same last name, state of origin, and treat each other like siblings will also share the same birthday are astronomically low. So low that you can confidently claim that these two students are definitely twins.
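The two ways of hunting for the twins can be put side by side in a few lines of Python. This is a rough sketch under the same simplifying assumption as before (365 equally likely, independent birthdays), and the function name is made up for the example.

```python
def p_coincidental_match(others, days=365):
    """Chance that at least one of `others` unrelated people happens to
    share one specific, already known birthday."""
    return 1.0 - ((days - 1) / days) ** others


if __name__ == "__main__":
    # Known twin's birthday checked against the other 98 classmates.
    print(f"whole class:    {p_coincidental_match(98):.1%}")
    # Known twin's birthday checked against one pre-selected candidate.
    print(f"single suspect: {p_coincidental_match(1):.2%}")
```

Checking against the whole class turns up a chance match close to one time in four, while checking a single candidate picked on other evidence does so about one time in 365, and that is before counting how unlikely it is for non-twins to fit all the other clues.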

What if instead of secret twins in a class, we’re talking about DNA at a crime scene? And instead of a few hundred students, we’re talking about millions of people’s archived DNA. The chance of a coincidental match is much lower than 1/365, but the pool being searched is much larger, and the consequences of such a false match are dire. Querying DNA from a crime scene against a database ensures that some small fraction of the time there will be a false positive, and an innocent person will be arrested, possibly tried, and possibly convicted, especially with cold cases where DNA is the only remaining evidence.
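To get a feel for how database size changes things, here is a small Python sketch. The one-in-a-million match probability is a made-up illustrative number, not a real forensic statistic, and the calculation assumes every profile in the database is unrelated to the true source and independent of the others.

```python
import math


def p_at_least_one_false_match(db_size, match_prob):
    """Chance that a search of db_size unrelated profiles returns at
    least one coincidental match: 1 - (1 - p)^N.

    Uses log1p/expm1 so very small probabilities are not rounded away.
    """
    return -math.expm1(db_size * math.log1p(-match_prob))


if __name__ == "__main__":
    MATCH_PROB = 1e-6  # hypothetical value chosen only for illustration
    for db_size in (1, 10_000, 1_000_000, 3_000_000):
        p = p_at_least_one_false_match(db_size, MATCH_PROB)
        print(f"{db_size:>9,} profiles searched: {p:.4%}")
```

With these assumed numbers, testing one person almost never gives a chance match, but searching a million profiles gives one more often than not. Real match probabilities depend on the quality of the sample and the markers used, so treat the output as an illustration of the scaling, not an estimate for any actual database.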

So how should the justice system use DNA evidence? The same way it uses regular evidence. If you have DNA at the crime scene, establish a likely suspect first, then test their DNA for a match. This will reduce the possibility of false positives to near zero. But a suspect generated by querying a DNA database first may just be unlucky.

Several sources have tackled this issue in the past few years. The state of Texas was recently caught storing DNA data from newborns illegally. Who has the right to your genetic information is going to be a huge debate over the next decade. Ready access to genetic information is going to be a cornerstone of both our social and legal systems, and the dialogue needs to continue. But it’s important to remember that this problem is not the result of emerging technologies, but one of simple statistics.

~Southern Fried Scientist

February 24, 2010 • 2:48 pm