Volume 7: Bayesian Analysis
Why You Can Never Be Sure of Anything
“Bet you $50 it hasn't.” – xkcd #1132
People don’t like thinking probabilistically. They crave certainty. That rock on the side of the road is definitely a rock. That apple is definitely red. It rained today, so the chances of rain were definitely 100%. The discussion of “what we know” is a philosophical topic, not a mathematical one. Math is useless if we want to know if our entire universe is a single atom in a larger dimension.
Bayesian Inference helps us to describe “what we learned.” In other words, we begin by knowing some things and then we gain information; now we know more. The difference from what we knew before to what we know now is our learning. Bayesian thinking can provide insight as to how much a piece of information should change your knowledge.
But Bayes Theorem is also helpful for traders – or gamblers! Maybe you think the Lions have a 10% chance to win the next Super Bowl; I’ve incorporated more information and think it’s closer to 1%. If my information is better, I can expect to profit from a wager.
Who was Thomas Bayes? What is Bayes Theorem?
How is Frequentist probability different from Bayesian probability? Why do Frequentists and Bayesians fight so much?
How can I use Bayesian Statistics in my everyday life?
Who was Thomas Bayes? What is Bayes Theorem?
The story of the 18th century is the story of rising British world domination. After nearly 100 years of religious conflict and battles between King and Parliament, the Glorious Revolution of 1688 installed a new dynasty, settling the major points of dispute. After this, Britain would face no more major revolutions. The government would change over the next three centuries, but it would happen slowly and peacefully. Without the internal strife that plagued France, Germany, Spain and other European nations, Britain could focus on building the world’s first industrial economy and its finest navy. This would lead to untold wealth due partially to the benefits of colonization.
A small side benefit of this golden age was the ability of the British Empire(1) to afford a system of country pastors. Coming especially from the later sons of the lesser nobility, these parish priests were supposed to spread across the country, advancing the Church of England. But, for many of them, the life of a country preacher was not so arduous. Fortunately for us, many of their side pursuits were of great cultural and scientific value. Jonathan Swift and Joseph Priestley are two examples of great contributors to art and science whose day job was working for the Church of England(2).
Why do I go on at length about English country parish life in the 18th century(3)? Well, we just don’t know that much about Thomas Bayes, so describing his world is the best we can do. We think he might have been born in 1701 but we don’t really know(4). His father was a well-known preacher in the world described above. Thomas spent his early years working with his father before moving to his own parish in Kent (southeast England) around 1735. Today, we only know of two of his publications, and only one of those is about mathematics. I’ll quote from Wikipedia(5) – not for the information, but just to demonstrate how little we know:
“It is speculated that Bayes was elected as a Fellow of the Royal Society in 1742 on the strength of the Introduction to the Doctrine of Fluxions, as he is not known to have published any other mathematical works during his lifetime.
In his later years he took a deep interest in probability. Professor Stephen Stigler, historian of statistical science, thinks that Bayes became interested in the subject while reviewing a work written in 1755 by Thomas Simpson, but George Alfred Barnard thinks he learned mathematics and probability from a book by Abraham de Moivre. Others speculate he was motivated to rebut David Hume's anti-Christian An Enquiry Concerning Human Understanding.”
I count two times that we speculate, two where we “think that,” and once where we don’t know at all. What we do know is that his most lasting work, the eponymous Theorem, only came to light after his death(6).
After that little digression, we can get to the point; here is Bayes Theorem(7) in all its glory:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
In this equation, $P(X)$ means the probability of X and $P(X \mid Y)$ is the conditional probability of X, assuming Y(8). The letters “A” and “B” can represent anything, from a statement like “that man is six feet tall” to an event like “this coin flip will come up heads.” Remember that probability is the likelihood of something happening or being true; it is expressed as a number between 0 and 1 (or 0% and 100%). Let’s go through a simple example.
Let “B” be the event of me going to the grocery store tomorrow and “A” be the event of me buying eggs. Maybe the probability of going to the store is 50%. We can also say that, without knowing whether I went to the store, the probability of buying eggs is 40%. Let’s also say that if I bought eggs, there is a 90% probability I went to the grocery store(9). Then:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{0.90 \times 0.40}{0.50} = 0.72$$
What does this mean? In English, “if I go to the grocery store, the probability that I buy eggs is 72%(10).”
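The eggs arithmetic is easy to check with a few lines of Python. This is only a sketch: the function name `bayes` and the variable names are my own labels for illustration, and the numbers are the ones assumed in the example above.

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A|B) from P(B|A), P(A) and P(B) using Bayes Theorem."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


# Numbers from the eggs example (all of them are assumptions, not data).
p_store = 0.50             # P(B): I go to the grocery store
p_eggs = 0.40              # P(A): I buy eggs
p_store_given_eggs = 0.90  # P(B|A): I went to the store, given I bought eggs

p_eggs_given_store = bayes(p_store_given_eggs, p_eggs, p_store)
print(f"P(eggs | store) = {p_eggs_given_store:.2f}")
```

Changing any one of the three inputs shows how sensitive the answer is to each assumption.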
Bayes Theorem itself is a trivial piece of mathematics; it can literally be proven on one side of an index card(11). But as mentioned above, we can use this simple formula to gain a deeper idea of the concept of learning. To demonstrate this, let’s use the example of a drug test, a common application of Bayesian Inference(12).
Let’s say that 1% of the population uses cocaine. You see a person on the street, but know nothing about them; the probability that he or she uses cocaine is therefore 1%. The person walks up and says to you “I took a drug test yesterday and it showed positive for cocaine use.” What is the updated probability that the person uses cocaine?
You are tempted to say “100%” or very close – but this is wrong. Let’s guess that the “false positive” rate on cocaine tests is around 5% and the “false negative” around 2%(13). Let’s go back to Bayes – we’ll abbreviate the event “Tested positive” as “+” and Uses/Doesn’t Use as “U” and “DU” respectively.
$$P(U \mid +) = \frac{P(+ \mid U)\,P(U)}{P(+)}$$
We also know that the probability of a positive test, $P(+)$, is equal to the probability of a non-user testing positive plus the probability of a user testing positive, so:
$$P(+) = P(+ \mid DU)\,P(DU) + P(+ \mid U)\,P(U) = 0.05 \times 0.99 + 0.98 \times 0.01 = 0.0593$$
$$P(U \mid +) = \frac{0.98 \times 0.01}{0.0593} \approx 0.165$$
We’ve learned that, given a person had a positive test for cocaine, the probability they are a user is only around 16%. This is only as likely as getting a “6” on a single roll of a die. If the person failed a second drug test, the probability would go to 79.5%, and a third would take it to 98.7%(14). But no matter how many tests were failed, the probability would never go to exactly 100%. This is an important point about Bayesian theory: unless you are already certain of something, no amount of additional information can make you completely certain(15).
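The repeated-test numbers can be reproduced by feeding each result back in as the next prior. The sketch below assumes each test is independent of the others given whether the person is a user, which real tests may not satisfy. The function name `update` is just a label I made up, and the rates are the guesses from the example.

```python
def update(prior: float, p_pos_given_user: float, p_pos_given_nonuser: float) -> float:
    """Return P(user | positive test), starting from a prior P(user)."""
    # Total probability of a positive test: users plus non-users.
    p_pos = p_pos_given_user * prior + p_pos_given_nonuser * (1.0 - prior)
    return p_pos_given_user * prior / p_pos


base_rate = 0.01       # 1% of the population uses (assumed)
false_positive = 0.05  # non-user tests positive (assumed)
false_negative = 0.02  # user tests negative (assumed)
sensitivity = 1.0 - false_negative  # user tests positive

posteriors = []
p = base_rate
for _ in range(3):
    # Yesterday's posterior becomes today's prior.
    p = update(p, sensitivity, false_positive)
    posteriors.append(p)

for n, value in enumerate(posteriors, start=1):
    print(f"after {n} positive test(s): {value:.1%}")
```

Run the loop as many times as you like: the probability creeps toward 1 but never reaches it.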
How is Frequentist probability different from Bayesian probability? Why do Frequentists and Bayesians fight so much?
So far, this has been fairly simple – we found a formula and “proved” that everybody you meet is possibly using cocaine. But now it is going to get tricky.
Remember we said that probability is not “wired” correctly into human minds? Well, as a case in point, after 2000 years of study going back to Aristotle, top mathematicians still don’t agree on exactly what probability means or how it should be interpreted. There are three major interpretations of probability:
In Classical Probability, there are a certain number of outcomes, each of which is equally likely. Then the probability of an event is the number of favorable outcomes divided by the number of possible outcomes. A fair die has six total sides of which three are even. So, the probability of rolling an even number is 3/6 = 50%. While most of mathematics was developed with the Classical view holding sway, it has some significant limitations. First, in our example we specified a “fair die.” But this is a circular definition; how do you know what a fair die is if you haven’t defined probability? Second, classical probability has no way to handle an infinite number of possible outcomes, which is a big limitation.
Frequentist theory states that the probability of an event is its relative frequency over a large number of trials. If I roll a die many times and half of the rolls are even, then the probability of rolling an even number is 50%. It solves the two major problems of Classical theory that we discussed: we don’t have to assume a “fair die” and if we could imagine a theoretical infinite-sided die, Frequentist theory still works(16). However, because Frequentists interpret probability as the frequency of events, we can create some extreme misuse of the method, as in this comic(17):
Now, this example is without question a bastardization of how Frequentism works. But it does portray a situation that is not well handled.
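Going back to the Frequentist definition itself, a small simulation can illustrate what “relative frequency over a large number of trials” looks like in practice. This is a sketch using Python's pseudo-random generator as a stand-in for a physical die. The function name `even_frequency` and the fixed seed are my own choices for illustration.

```python
import random


def even_frequency(num_rolls: int, seed: int = 0) -> float:
    """Roll a simulated six-sided die num_rolls times; return the share of even rolls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    evens = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) % 2 == 0)
    return evens / num_rolls


# The observed frequency should settle near one half as the trial count grows.
for n in (10, 1_000, 100_000):
    print(f"{n:>7} rolls: {even_frequency(n):.4f}")
```

Note that nothing here assumed the die was fair in advance. The estimate comes only from counting outcomes.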
Bayes Theorem, as we discussed, is just a formula. But there is an interpretation of probability that is consistent with the Theorem. Bayesian Theory says that the probability of an event is the degree of certainty that it happened(18). To go back to our 6-sided die, let’s consider the statement “the die will roll any number 1 time out of 6”. What is the probability this is true, given we just rolled a 4? Well, the probability of rolling a 4 on a fair die is 1/6. Let’s say that the probability the die is fair is 99%. What is the probability of rolling a 4 on a 6-sided die, without making the assumption that the die is fair? I would say that is 1/6 also. Plugging this into the Theorem means that the probability of the die being fair is now 99%, the same as before our roll. In other words, we didn’t learn anything from a single roll(19).
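Here is a sketch of that die calculation using exact fractions. The first case uses the numbers above, where an unfair die is assumed to be just as likely to show a 4 as a fair one. The second case is purely hypothetical and not from the example: it imagines a loaded die that shows a 4 half the time, to show what it looks like when a roll does teach us something. The function name `posterior_fair` is my own label.

```python
from fractions import Fraction


def posterior_fair(prior_fair: Fraction, p_roll_if_fair: Fraction,
                   p_roll_if_loaded: Fraction) -> Fraction:
    """Return P(die is fair | observed roll)."""
    # Overall probability of the observed roll, fair or not.
    p_roll = p_roll_if_fair * prior_fair + p_roll_if_loaded * (1 - prior_fair)
    return p_roll_if_fair * prior_fair / p_roll


prior = Fraction(99, 100)  # assumed prior that the die is fair

# Case from the text: a 4 is equally likely whether or not the die is fair.
same = posterior_fair(prior, Fraction(1, 6), Fraction(1, 6))

# Hypothetical case: a loaded die that shows a 4 half the time.
loaded = posterior_fair(prior, Fraction(1, 6), Fraction(1, 2))

print(f"equal likelihoods: {same} (nothing learned)")
print(f"loaded alternative: {loaded} (about {float(loaded):.1%})")
```

When both hypotheses predict the observation equally well, the posterior equals the prior. Learning only happens when the hypotheses disagree about what we should see.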
But you can see here a major drawback of Bayesian Probability. Where did we come up with 99% probability that the die was fair? We also made a guess at the probability of rolling a 4 on a die that isn’t necessarily fair. Where did we get that number from? These assumptions are called “Bayesian Priors,” and this is the main criticism of Bayesian Probability: you need to make assumptions as to probability prior to receiving any information.
The cartoon gives us a much better usage for Bayesian logic: what is the probability the sun has exploded, given that the machine says it has? Well, the probability the machine says “yes” if a nova has happened is 35/36 = 97.2%; the probability that it says yes in general is 1/36 = 2.8%(20). Now we need to choose our Prior: what is the probability the sun exploded, before using the machine? I’m going to go with 0.00001%.
$$P(\text{nova} \mid \text{yes}) = \frac{P(\text{yes} \mid \text{nova})\,P(\text{nova})}{P(\text{yes})} \approx \frac{(35/36) \times 0.0000001}{1/36} = 0.0000035 = 0.00035\%$$
This makes sense; it is far more likely that the machine lied than that the sun exploded(21).
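Since the Prior is a guess, it is worth seeing how much the answer depends on it. The sketch below computes the probability of a “yes” from the two cases (nova and no nova) rather than rounding it to 1/36, and tries a few priors. Only the first prior is the one chosen above; the other two are arbitrary values for comparison. The function name `p_nova_given_yes` is my own.

```python
def p_nova_given_yes(prior_nova: float) -> float:
    """Return P(sun exploded | machine says yes) for a given prior."""
    p_yes_if_nova = 35 / 36     # machine tells the truth
    p_yes_if_no_nova = 1 / 36   # machine lies (both dice come up six)
    p_yes = p_yes_if_nova * prior_nova + p_yes_if_no_nova * (1 - prior_nova)
    return p_yes_if_nova * prior_nova / p_yes


# 1e-7 is the 0.00001% prior from the text; the others are for comparison only.
for prior in (1e-7, 1e-4, 1e-2):
    print(f"prior {prior:.5%} -> posterior {p_nova_given_yes(prior):.5%}")
```

Even with a prior a thousand times more pessimistic than mine, a lying machine remains the better bet.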
For two groups of people who disagree only about the details of a subtle mathematical philosophy, Frequentists and Bayesians have an ongoing, contentious debate over “who is right.” I think that some of this is really a proxy battle between mathematicians and “hard scientists.” Hard scientists tend to prefer the Frequentist approach. They design an experiment and run it a bunch of times. If it comes out as expected, they write a paper; otherwise, it is back to the drawing board. Hard scientists hate the Priors; how can you decide if you are right before you’ve done any experiments?
Mathematicians, surprisingly, tend to dislike the certainty that comes from the Frequentist approach. Remember our drug test example? It was based on pure mathematics, not Bayesian Theory, and left us with an uncertain result. There are many examples of science seeing a false positive and treating it as a Great Breakthrough. Mathematicians agree that Priors are annoying, but also find them useful. Before starting, you need to think about how the system works and consider what answer makes sense(22).
It won’t surprise you when I hedge my bets and say that both Frequentist and Bayesian Theories have uses and misuses. But I find that Bayes is far deeper and more interesting. It is really attempting to answer the question of what we know – more specifically, what we learn. And you can use it to gamble.
How can I use Bayesian Statistics in my everyday life?
As I discussed in Volume 3, I try to use Bayesian principles(23) when I incorporate new information into my worldview. Which is to say – almost always. In our Commentary, we try to follow a methodology: obtain pieces of information and use them to form a view of the world. There are several ways that Bayesian logic is useful in this process.