Tuesday, July 11, 2006

Bad statistics

Recent days have brought two important examples of how misunderstanding statistics can lead you astray.

First, the always bottom-scraping arguments against abortion lead us to Good Math, Bad Math slapping around bad stats on abortion rates. Comparing raw numbers of abortions without considering the ratio of births to pregnancies ignores population growth and gives you a meaningless argument. My only objection is that GMBM doesn't point out that correlation does not establish causation. The original article he's referring to doesn't just bungle the stats, it claims that "The ever increasing amount of sex education, the ever easier provision of contraception is clearly driving down the number of unwanted pregnancies." Even if the alleged correlation existed (which it doesn't), it would not mean that one thing drives the other.

A more interesting instance of statistics in the news comes from Floyd Rudmin, who applies Bayes' Theorem to show that the NSA's illegal domestic surveillance is a really bad idea. We discussed the principle of Bayes' Theorem and conditional probability with regard to the Monty Hall problem, as well as to creationism, and to the process of science in general.

What Rudmin is really getting into is a consequence of Bayes' theorem known as the base rate fallacy, or the false positive paradox.

Let's assume that the NSA program is perfect at catching terrorists. That is, if you are a terrorist, let's assume that the NSA program will always catch you. This is a more generous assumption than Rudmin or I would actually accept, but let's work with it. Let's also assume that the odds of a non-terrorist being flagged are minuscule, only 0.01% (1 in 10,000). It seems like everything is exactly as it should be. But with roughly 1,000 terrorists in a population of 300 million, you'll find that your pool of suspects will have 31,000 people (using various of Rudmin's numbers for convenience). Of those, only 3.2% are actually terrorists! How did this happen?

The answer, as anyone who took a basic stats class knows, is that the pool of non-terrorists is so huge that anyone who gets flagged by even the most focused terrorist screening is probably a false positive.
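For anyone who wants to check the arithmetic, here is a minimal sketch of the Bayes' theorem calculation. The inputs are assumptions chosen to reproduce the 31,000-person pool: 1,000 terrorists in a population of 300 million, a program that never misses a real terrorist, and a false positive rate of 1 in 10,000. The variable names are mine, not Rudmin's.

```python
# Assumed inputs, picked to reproduce the 31,000-person suspect pool.
population = 300_000_000
terrorists = 1_000
sensitivity = 1.0                  # assumption: every real terrorist is flagged
false_positive_rate = 1 / 10_000   # assumption: 0.01% of innocents are flagged

# Prior probability that a randomly chosen person is a terrorist.
prior = terrorists / population

# Total probability of being flagged, terrorist or not.
p_flagged = sensitivity * prior + false_positive_rate * (1 - prior)

# Bayes' theorem: P(terrorist | flagged).
posterior = sensitivity * prior / p_flagged

pool = population * p_flagged
print(f"Size of suspect pool:   {pool:,.0f}")
print(f"P(terrorist | flagged): {posterior:.1%}")
```

Even with a perfect detector, the tiny prior swamps everything else: nearly all of the flagged people come from the enormous innocent group.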

The same problem arises with many diagnostic tests for rare diseases. If you pick a random person off the street and test for HIV, a positive result is most likely not an undiagnosed case of HIV, but a false positive. For instance, the standard ELISA test used for HIV gives a false positive only about 1 time in 67. There are 300 million people in the US, and about a million are HIV+. If we tested everyone, we'd get about 4.5 million false positives, meaning that the chances of someone with a positive test result actually being infected would be less than 1 in 5.
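The same numbers can be worked through as head counts rather than probabilities. This is a rough sketch using the figures in this post; the 99.7% detection rate is the ELISA figure quoted in this post, and treating every person as tested exactly once is a simplifying assumption.

```python
# Figures from the post; one test per person is a simplifying assumption.
population = 300_000_000
hiv_positive = 1_000_000
false_positive_rate = 1 / 67
sensitivity = 0.997   # ELISA detection rate quoted in this post

# Head counts if the entire population were screened.
true_positives = hiv_positive * sensitivity
false_positives = (population - hiv_positive) * false_positive_rate

# Fraction of positive results that belong to people who are really HIV+.
ppv = true_positives / (true_positives + false_positives)

print(f"True positives:  {true_positives:,.0f}")
print(f"False positives: {false_positives:,.0f}")
print(f"Chance a positive result is real: {ppv:.1%}")
```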

There are two things that make the testing that actually happens more accurate. First, every positive ELISA is verified with a second test, one that is more expensive and complex. ELISA catches 99.7% of true HIV+ cases, and sweeps up some extra. The second test weeds out that extra part. Average cost is held down by running the second test only on likely cases, those flagged by ELISA.

The second thing that makes ELISA accurate is that it's not mandatory. We don't apply it at random; it's applied to people who think they are at risk. The fraction of HIV+ people among those who engage in risky activity is greater than the proportion of HIV+ people in the population at large, so the share of positive results that are false decreases.
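To see how much the tested population matters, here is a small sketch that holds the test fixed and varies only the prevalence. The 1-in-300 figure is the general-population rate from above; the higher prevalences are hypothetical values picked for illustration, not measured rates for any real group.

```python
def positive_predictive_value(prevalence, sensitivity=0.997,
                              false_positive_rate=1 / 67):
    """Probability that a positive result is a true positive (Bayes' theorem)."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# 1/300 is the general population; the rest are hypothetical at-risk groups.
for prevalence in (1 / 300, 0.01, 0.05, 0.20):
    ppv = positive_predictive_value(prevalence)
    print(f"prevalence {prevalence:6.2%} -> "
          f"chance a positive is real {ppv:.0%}")
```

The test itself never changes here. Only the pool it is applied to does, and that alone moves a positive result from probably wrong to probably right.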

Applying NSA surveillance techniques to the population at large is silliness for the same reason. Requiring some sort of probable cause is part of what minimizes false positives in the justice system while ensuring that bad guys get caught. I'm not prepared to let dozens of innocents be sent away to Guantanamo Bay or Leavenworth in hopes that one or two really are bad people. And I'm not prepared to sacrifice my liberties for programs that don't seem to have detected any terrorists at all.