Volume 7: Bayesian Analysis

Why You Can Never Be Sure of Anything

“Bet you $50 it hasn't.” – xkcd #1132

People don’t like thinking probabilistically. They crave certainty. That rock on the side of the road is definitely a rock. That apple is definitely red. It rained today, so the chances of rain were definitely 100%. The discussion of “what we know” is a philosophical topic, not a mathematical one. Math is useless if we want to know if our entire universe is a single atom in a larger dimension.

Bayesian Inference helps us to describe “what we learned.” In other words, we begin by knowing some things and then we gain information; now we know more. The difference between what we knew before and what we know now is our learning. Bayesian thinking can provide insight into how much a piece of information should change your knowledge.

But Bayes Theorem is also helpful for traders – or gamblers! Maybe you think the Lions have a 10% chance to win the next Super Bowl; I’ve incorporated more information and think it’s closer to 1%. If my information is better, I can expect to profit from a wager.

Who was Thomas Bayes? What is Bayes Theorem?
How is Frequentist probability different from Bayesian probability? Why do Frequentists and Bayesians fight so much?
How can I use Bayesian Statistics in my everyday life?

Who was Thomas Bayes? What is Bayes Theorem?

The story of the 18th century is the story of rising British world domination. After nearly 100 years of religious conflict and battles between King and Parliament, the Glorious Revolution of 1688 installed a new dynasty, settling the major points of dispute. After this, Britain would face no more major revolutions. The government would change over the next three centuries, but it would happen slowly and peacefully. Without the internal strife that plagued France, Germany, Spain and other European nations, Britain could focus on building the world’s first industrial economy and its finest navy. This would lead to untold wealth due partially to the benefits of colonization.

A small side benefit of this golden age was the ability of the British Empire(1) to afford a system of country pastors. Coming especially from the later sons of the lesser nobility, these parish priests were supposed to spread across the country, advancing the Church of England. But, for many of them, the life of a country preacher was not so arduous. Fortunately for us, many of their side pursuits were of great cultural and scientific value. Jonathan Swift and Joseph Priestley are two examples of great contributors to art and science whose day job was in the clergy(2).

Why do I go on at length about English country parish life in the 18th century(3)? Well, we just don’t know that much about Thomas Bayes, so describing his world is the best we can do. We think he might have been born in 1701 but we don’t really know(4). His father was a well-known preacher in the world described above. Thomas spent his early years working with his father before moving to his own parish in Kent (southeast England) around 1735. Today, we only know of two of his publications, and only one of those is about mathematics. I’ll quote from Wikipedia(5) – not for the information, but just to demonstrate how little we know:

“It is speculated that Bayes was elected as a Fellow of the Royal Society in 1742 on the strength of the Introduction to the Doctrine of Fluxions, as he is not known to have published any other mathematical works during his lifetime.

In his later years he took a deep interest in probability. Professor Stephen Stigler, historian of statistical science, thinks that Bayes became interested in the subject while reviewing a work written in 1755 by Thomas Simpson, but George Alfred Barnard thinks he learned mathematics and probability from a book by Abraham de Moivre. Others speculate he was motivated to rebut David Hume's anti-Christian An Enquiry Concerning Human Understanding.”

I count two times that we speculate, two where we “think that,” and once where we don’t know at all. What we do know is that his most lasting work, the eponymous Theorem, only came to light after his death(6).

After that little digression, we can get to the point; here is Bayes Theorem(7) in all its glory:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In this equation, $P(X)$ means the probability of X and $P(X \mid Y)$ is the conditional probability of X, assuming Y(8). The letters “A” and “B” can represent anything, from a statement like “that man is six feet tall” to an event like “this coin flip will come up heads.” Remember that probability is the likelihood of something happening or being true; it is expressed as a number between 0 and 1 (or 0% and 100%). Let’s go through a simple example.

Let “B” be the event of me going to the grocery store tomorrow and “A” be the event of me buying eggs. Maybe the probability of going to the store is 50%. We can also say that, without knowing whether I went to the store, the probability of buying eggs is 40%. Let’s also say that if I bought eggs, there is a 90% probability I went to the grocery store(9). Then:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{0.90 \times 0.40}{0.50} = 0.72$$

What does this mean? In English, “if I go to the grocery store, the probability that I buy eggs is 72%(10).”
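The egg calculation can be checked in a few lines of Python. This is only a minimal sketch; the function name `bayes_posterior` and its parameter names are made up for illustration and do not come from any library.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(A|B) given P(B|A), P(A) and P(B), using Bayes Theorem."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive to condition on B")
    return likelihood * prior / evidence


# Numbers from the egg example:
# P(store | eggs) = 0.90, P(eggs) = 0.40, P(store) = 0.50.
p_eggs_given_store = bayes_posterior(likelihood=0.90, prior=0.40, evidence=0.50)
print(f"P(eggs | store) = {p_eggs_given_store:.0%}")
```

Run as written, this should report the same 72% as the text.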

Bayes Theorem itself is a trivial piece of mathematics; it can literally be proven on one side of an index card(11). But as mentioned above, we can use this simple formula to gain a deeper idea of the concept of learning. To demonstrate this, let’s use the example of a drug test, a common application of Bayesian Inference(12).

Let’s say that 1% of the population uses cocaine. You see a person on the street, but know nothing about them; the probability that he or she uses cocaine is therefore 1%. The person walks up and says to you “I took a drug test yesterday and it showed positive for cocaine use.” What is the updated probability that the person uses cocaine?

You are tempted to say “100%” or very close – but this is wrong. Let’s guess that the “false positive” rate on cocaine tests is around 5% and the “false negative” around 2%(13). Let’s go back to Bayes – we’ll abbreviate the event “Tested positive” as “+” and Uses/Doesn’t Use as “U” and “DU” respectively.

$$P(U \mid +) = \frac{P(+ \mid U)\,P(U)}{P(+)}$$

We also know that the probability of a positive test, $P(+)$, is equal to the probability of a non-user testing positive plus the probability of a user testing positive, so:

$$P(+) = P(+ \mid DU)\,P(DU) + P(+ \mid U)\,P(U) = 0.05 \times 0.99 + 0.98 \times 0.01 = 0.0593$$

So,

$$P(U \mid +) = \frac{0.98 \times 0.01}{0.0593} \approx 0.165$$

We’ve learned that, given a person had a positive test for cocaine, the probability they are a user is only around 16%. This is only as likely as getting a “6” on a single roll of a die. If the person failed a second drug test, the probability would go to 79.5%, and a third would take it to 98.7%(14). But no matter how many tests were failed, the probability would never go to exactly 100%. This is an important point about Bayesian theory: unless you are already certain of something, no amount of additional information can make you completely certain(15).
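For readers who would rather not do the repeated-test arithmetic by hand, here is a hedged Python sketch. It assumes each test result is independent of the others given whether the person is a user, which the text does not state explicitly. The function name `update` is illustrative only.

```python
def update(prior, p_pos_given_user=0.98, p_pos_given_nonuser=0.05):
    """One Bayesian update of P(user) after a single positive test.

    Defaults come from the text: a 2% false negative rate means users
    test positive 98% of the time; the false positive rate is 5%.
    """
    p_pos = p_pos_given_user * prior + p_pos_given_nonuser * (1 - prior)
    return p_pos_given_user * prior / p_pos


belief = 0.01  # prior: 1% of the population uses cocaine
history = []
for test_number in range(1, 4):
    # Yesterday's posterior becomes today's prior.
    belief = update(belief)
    history.append(belief)
    print(f"after positive test {test_number}: {belief:.1%}")
```

The three printed values should line up with the 16.5%, 79.5% and 98.7% figures quoted above, and none of them reaches 100%.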

How is Frequentist probability different? Why do Frequentists and Bayesians fight so much?

So far, this has been fairly simple – we found a formula and “proved” that everybody you meet is possibly using cocaine. But now it is going to get tricky.

Remember we said that probability is not “wired” correctly into human minds? As a case in point, after more than 2000 years of study going back to Aristotle, top mathematicians still don’t agree on exactly what probability means or how it should be interpreted. There are three major interpretations of probability:

Classical Probability

In Classical Probability, there are a certain number of outcomes, each of which is equally likely. Then the probability of an event is the number of favorable outcomes divided by the number of possible outcomes. A fair die has six total sides of which three are even. So, the probability of rolling an even number is 3/6 = 50%. While much of early probability theory was developed with the Classical view holding sway, it has some significant limitations. First, in our example we specified a “fair die.” But this is a circular definition; how do you know what a fair die is if you haven’t defined probability? Second, classical probability has no way to handle an infinite number of possible outcomes, which is a big limitation.
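The counting definition is simple enough to write down directly. The sketch below is illustrative only (the helper name `classical_probability` is made up here), and it bakes in the very assumption criticized above: every outcome is treated as equally likely.

```python
from fractions import Fraction


def classical_probability(outcomes, is_favorable):
    """Favorable outcomes divided by possible outcomes (all assumed equally likely)."""
    outcomes = list(outcomes)
    favorable = [o for o in outcomes if is_favorable(o)]
    return Fraction(len(favorable), len(outcomes))


die_faces = range(1, 7)
p_even = classical_probability(die_faces, lambda face: face % 2 == 0)
print(p_even)  # 1/2
```

Note that the function needs a finite list of outcomes to count, which is the second limitation in a nutshell.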

Frequentist Probability

Frequentist theory states that the probability of an event is its relative frequency over a large number of trials. If I roll a die many times and half of the rolls are even, then the probability of rolling an even number is 50%. It solves the two major problems of Classical theory that we discussed: we don’t have to assume a “fair die” and if we could imagine a theoretical infinite-sided die, Frequentist theory still works(16). However, because Frequentists interpret probability as the frequency of events, we can create some extreme misuse of the method, as in this comic(17):

Now, this example is without question a bastardization of how Frequentism works. But it does portray a situation that the method does not handle well.
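Returning to the die for a moment: the Frequentist definition can be sketched as a simulation. This is a rough illustration, not a rigorous experiment. The simulated die is fair by construction (it uses Python's `random` module), the seed is fixed only so the run is repeatable, and the function name `relative_frequency` is made up here.

```python
import random


def relative_frequency(trials, seed=0):
    """Fraction of simulated die rolls that come up even."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    evens = sum(1 for _ in range(trials) if rng.randint(1, 6) % 2 == 0)
    return evens / trials


# The estimate should settle near 50% as the number of trials grows.
for trials in (10, 1_000, 100_000):
    print(trials, relative_frequency(trials))
```

With only ten rolls the estimate can be well off; the Frequentist probability is what the number converges to over many trials.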

Bayesian Probability

Bayes Theorem, as we discussed, is just a formula. But there is an interpretation of probability that is consistent with the Theorem. Bayesian Theory says that the probability of an event is the degree of certainty that it happened(18). To go back to our 6-sided die, let’s consider the statement “the die will roll any number 1 time out of 6”. What is the probability this is true, given we just rolled a 4? Well, the probability of rolling a 4 on a fair die is 1/6. Let’s say that the probability the die is fair is 99%. What is the probability of rolling a 4 on a 6-sided die, without making the assumption that the die is fair? I would say that is 1/6 also. Plugging this into the Theorem means that the probability of the die being fair is now 99%, the same as before our roll. In other words, we didn’t learn anything from a single roll(19).
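Written out with the numbers above, and using F as shorthand for “the die is fair” (a label introduced here, not in the original), the update looks like this:

```latex
P(F \mid 4) = \frac{P(4 \mid F)\,P(F)}{P(4)}
            = \frac{\tfrac{1}{6} \times 0.99}{\tfrac{1}{6}}
            = 0.99
```

The likelihood and the overall probability of a 4 cancel, so the posterior equals the Prior.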

But you can see here a major drawback of Bayesian Probability. Where did we come up with 99% probability that the die was fair? We also made a guess at the probability of rolling a 4 on a die that isn’t necessarily fair. Where did we get that number from? These assumptions are called “Bayesian Priors,” and this is the main criticism of Bayesian Probability: you need to make assumptions as to probability prior to receiving any information.

The cartoon gives us a much better usage for Bayesian logic: what is the probability the sun has exploded, given that the machine says it has? Well, the probability the machine says “yes” if a nova has happened is 35/36 = 97.2%; the probability that it says yes in general is very nearly 1/36 = 2.8%(20), because almost every “yes” is a lie. Now we need to choose our Prior: what is the probability the sun exploded, before using the machine? I’m going to go with 0.00001%.

$$P(\text{nova} \mid \text{yes}) = \frac{P(\text{yes} \mid \text{nova})\,P(\text{nova})}{P(\text{yes})} \approx \frac{\frac{35}{36} \times 0.0000001}{\frac{1}{36}} = 0.0000035 = 0.00035\%$$

This makes sense; it is far more likely that the machine lied than that the sun exploded(21).
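The same calculation can be sketched in Python, this time computing the overall probability of a “yes” from the law of total probability instead of rounding it to 1/36. The function name `posterior_nova` is hypothetical; the 0.00001% Prior is the one chosen in the text.

```python
def posterior_nova(prior, p_yes_given_nova=35 / 36, p_yes_given_no_nova=1 / 36):
    """P(nova | machine says yes).

    The machine lies only when both dice come up six (1 in 36), so it
    says "yes" truthfully 35/36 of the time and falsely 1/36 of the time.
    """
    p_yes = p_yes_given_nova * prior + p_yes_given_no_nova * (1 - prior)
    return p_yes_given_nova * prior / p_yes


prior = 0.00001 / 100  # 0.00001% expressed as a fraction
posterior = posterior_nova(prior)
print(f"P(nova | yes) = {posterior:.7%}")
```

The result is roughly 0.00035%: about 35 times the Prior, but still vanishingly small.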

For two groups of people who disagree only about the details of a subtle mathematical philosophy, Frequentists and Bayesians have an ongoing, contentious debate over “who is right.” I think that some of this is really a proxy battle between mathematicians and “hard scientists.” Hard scientists tend to prefer the Frequentist approach. They design an experiment and run it a bunch of times. If it comes out as expected, they write a paper, otherwise back to the drawing board. Hard scientists hate the Priors; how can you decide if you are right before you’ve done any experiments?

Mathematicians, surprisingly, tend to dislike the certainty that comes from the Frequentist approach. Remember our drug test example? It was based on pure mathematics, not Bayesian Theory, and left us with an uncertain result. There are many examples of science seeing a false positive and treating it as a Great Breakthrough. Mathematicians agree that Priors are annoying, but also find them useful. Before starting, you need to think about how the system works and consider what answer makes sense(22).

It won’t surprise you when I hedge my bets and say that both Frequentist and Bayesian Theories have uses and misuses. But I find that Bayes is far deeper and more interesting. It is really attempting to answer the question of what we know – more specifically, what we learn. And you can use it to gamble.

How can I use Bayesian Statistics in my everyday life?

As I discussed in Volume 3, I try to use Bayesian principles(23) when I incorporate new information into my worldview. Which is to say – almost always. In our Commentary, we try to follow a methodology: obtain pieces of information and use them to form a view of the world. There are several ways that Bayesian logic is useful in this process.

Understanding degrees of certainty

A basic precept of Bayesian thinking is that it is not possible to achieve certainty. New information should always affect your thinking about what is and is not true. With Bayes Theorem, a probability can’t reach 100% unless one of the inputs itself is 100%. This is another reason “belief” can be dangerous. Certainty on a topic, in the absence of evidence, can incorrectly lead to certainty on other topics.

Considering new pieces of information

How “important” is a piece of information? According to Bayes, information is important to the extent that we learn from it. If you are fairly certain (say 90% sure) that it is raining, and then see that everybody outside is carrying an umbrella, your probability will not change significantly. You didn’t learn much, so the information wasn’t very important. On the other hand, if you thought it was sunny, seeing umbrellas will drastically change your thinking. With a different Prior, the same information is now far more important.
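To put rough numbers on the umbrella example, here is a hedged sketch. Every likelihood in it is invented purely for illustration: it assumes that 90% of the time it rains you see umbrellas everywhere, that this happens only 5% of the time when it is dry, and that “thought it was sunny” means a 10% Prior on rain. None of these figures comes from the text.

```python
def posterior_rain(prior, p_umbrellas_if_rain=0.90, p_umbrellas_if_dry=0.05):
    """P(rain | umbrellas); both likelihoods are made-up illustrative values."""
    p_umbrellas = p_umbrellas_if_rain * prior + p_umbrellas_if_dry * (1 - prior)
    return p_umbrellas_if_rain * prior / p_umbrellas


cases = (("fairly sure it is raining", 0.90), ("thought it was sunny", 0.10))
for label, prior in cases:
    posterior = posterior_rain(prior)
    shift = posterior - prior
    print(f"{label}: {prior:.0%} -> {posterior:.1%} (shift {shift:+.1%})")
```

Under these assumptions the same observation moves the first belief by under ten percentage points and the second by more than fifty.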

Acting without complete information

Because we can never be truly certain, a corollary to Bayesian Theory is that we must act without certainty. It is consistent with Bayes that juries are asked to decide “beyond a reasonable doubt,” rather than beyond any doubt. Many measurements are showing the Earth getting warmer, and we are pretty sure that this is happening because of the actions of our species. So we should do something, even though we are not “sure.” Similarly, there is a 0.0001% chance that it is an optical illusion, but I’m still going to stop my car at a red light. We can’t state with 100% probability that the Affordable Care Act increased the total insured population. But the evidence is strong, so when considering policy it is appropriate to behave as if it is very likely the case(24).

Perhaps the best known application of Bayesian inference is to gambling(25). Let’s consider a single basketball game and a single wager. Say that the odds on the game imply that Team A has a 75% chance to win, Team B a 25% chance. To bet on Team A, you must risk losing $75 to win $25 and on Team B, $25 to win $75(26). If your information agrees with these odds, you have no expectation to win with a bet:

P(A Win) * Win(A) + P(A Lose) * Lose(A) =

75% * $25 + 25% * (-$75) = $0

However, if you have better information which tells you that Team A has an 85% chance to win, you can expect to win:

P(A Win) * Win(A) + P(A Lose) * Lose(A) =

85% * $25 + 15% * (-$75) = $10
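The two expected-value calculations above can be wrapped in one small function. This is a minimal sketch that ignores any house edge, as the text does; the name `expected_profit` is illustrative.

```python
def expected_profit(p_win, win_amount, risk_amount):
    """Expected profit of one bet: P(win) * win + P(lose) * (-risk)."""
    return p_win * win_amount + (1 - p_win) * (-risk_amount)


# Betting on Team A: risk $75 to win $25.
market_view = expected_profit(p_win=0.75, win_amount=25, risk_amount=75)
better_info = expected_profit(p_win=0.85, win_amount=25, risk_amount=75)
print(f"agreeing with the odds: ${market_view:.2f}")
print(f"with better information: ${better_info:.2f}")
```

This reproduces the $0 and $10 figures, and makes it easy to try other probabilities.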

Perhaps this is obvious, but when gambling (or trading), you can expect to profit only if you have better information than other people. Only if your probabilities are different from “the market” should you place a bet. Which brings us back to Bayes: to win at sports gambling, or in the markets, we need to know the difference between important and unimportant information.

I thought this was supposed to be about Government, maybe with a little history sprinkled in? Why did we go on a long digression about an obscure 18th-century Englishman?

Some of the reason is “meta” – to properly read what we write about Obamacare, it’s important to understand the mindset in which it was written. That mindset is many things, but it is certainly Bayesian.

But I encourage you to apply the Theory to any topic. Think in terms of probabilities rather than certainties. Incorporate new information even when it doesn’t comport with your Prior view. Realize when the information has become clear enough to lead to a conclusion or action. Then challenge that conclusion when new information points the other way!

Perhaps the most important thing I learn from Bayes is the importance of information. We learn this from the application to the gambler: in order to have the best performance, you need to use the most important information. And you need to seek out important information, because it isn’t always readily available. That is my purpose in writing here in the first place; whether or not you think these pieces are useful – whether or not they help you learn – hopefully they cause you to continue to search out more information in order to keep updating your Priors.

(1) I’m playing loose with what I call the entity which is today’s United Kingdom. The proper name changed a few times in this period, but I’m always talking about that country where they drink warm beer.
(2) For more on this, see “At Home: A Short History of Private Life” by Bill Bryson.
(3) Yeah – I mean isn’t this supposed to be technical government stuff?
(4) The image shown here is always used as Bayes’ portrait. But there is no good reason to think that it is really him.
(5) Wikipedia - Thomas Bayes
(6) We know that he died in 1761. A famous, awful math joke is that we know more about Bayes’ Posterior than his Prior.
(7) Most people write “Bayes’ Theorem,” but I prefer it without the possessive as in “The Bayes Theorem.”
(8) Conditional probability can be a tricky concept. It is often best expressed with the phrase “given that.” For example, what is the probability that it will be cloudy tomorrow? Maybe 50%. What is the probability that it will be cloudy outside, given that it snows? Closer to 100%.
(9) I usually buy eggs from the grocery, but sometimes from the drug store or farmers’ market.
(10) The most important question to ask when doing math is “Does this make sense?” Yes. I usually buy eggs at the store, but not always. If it had come back as 35% or 99%, it would have been a sign something is likely wrong.
(11) For example, here.
(12) Many places on the internet, including here.
(13) “False positive” = returns a positive test on a non-user; “false negative” = returns a negative test on a user. I’m not able to find definite values for cocaine testing stats, but these are in the ballpark.
(14) I’ll leave these calculations as an exercise for the reader.
(15) This statement is conceptually accurate, but should not be treated as rigorous mathematics.
(16) Maybe I wouldn’t throw as many 7’s at the craps table with this one.
(17) https://xkcd.com/1132/
(18) Analogously, the probability of a statement is the degree of certainty that it is true.
(19) Which makes sense – rolling a single 4 tells you nothing about whether it is a fair die.
(20) By the way - did Randall forget how to round a number? 1/36 = .0277777…
(21) Citation: you are reading this.
(22) This is a vast oversimplification and overgeneralization of the debate.
(23) It may be important to note: Bayes came up with the Theorem in the sense of a mathematical proof. But he didn’t develop modern Bayesian Probability or Inference, which were developed later. However, his writings do show that he was considering these concepts, at least 100 years ahead of their time, so we use his name for the full theory.
(24) For fun: perhaps somebody once said that there was a crowd of 1,500,000 people on the National Mall. However, various types of evidence (photographic, personal accounts, Metro ridership) indicate that it was a fraction of this size. Incorporating this information via Bayes Theorem, we can see that the probability that the crowd was large is indeed very low. We don’t need said probability to literally reach zero to be able to state factually “not many people were there.”
(25) Or trading, which isn’t so different.
(26) We ignore any house edge.