Analytics without intelligence: Simpson’s paradox

There are many reasons never to use analytics without intelligence. One of them is Simpson’s paradox: the odd effect whereby combining data sets can reverse the result seen in each individual set. That means you can often decide whether you want to prove or disprove something simply by choosing how to analyze your data.

The paradox is named after Edward Simpson, who described the effect in a paper in 1951, although earlier mathematicians had already noted and mentioned it. Suppose you have two possible outcomes, A and B, and you run two tests to see which is more likely. A can come out the clear winner in both the first and the second test. Yet when you combine the data from the two tests, B can be the winner. How can this be true?

Let’s take a very simple case. I want to find out who is the better marathon runner: Bob or Sue. I tell each of them to run 5 races over the next two weeks. The first week, Sue is feeling a little run down, so she runs only 1 race and doesn’t win. Bob runs 4 races and wins 1 of them. Clearly, for week one Bob is better than Sue: Sue wins 0% of her races, and Bob wins 25% of his. In week two, Sue runs 4 races and wins 3 of them. Wahoo! Sue won 75%. Bob runs his remaining race, which he also wins. Also wahoo! Bob won 100% of his races that week. In both weeks, Bob had a higher winning percentage than Sue, so he can honestly say that each week he wins a higher percentage of the races he runs than Sue does. Poor Sue.

But wait… what happens if we look at the total over the whole trial period, and not just at each week? Sue won nothing the first week and 3 races the second week, for a total of 3 wins out of 5, or 60%. Bob won 1 race the first week and 1 the second week, for a total of 2 out of 5, or 40%. At 60% success, Sue is clearly the better runner of the two and can claim that over the trial she had a higher winning percentage than Bob. Poor Bob.
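The arithmetic above can be checked with a short Python sketch. The win and race counts come straight from the story; the data layout and the helper names (`win_rate`, `weekly_rates`, `overall_rate`) are illustrative assumptions.

```python
from fractions import Fraction

# (wins, races) per runner per week, taken from the story above.
results = {
    "Sue": {"week 1": (0, 1), "week 2": (3, 4)},
    "Bob": {"week 1": (1, 4), "week 2": (1, 1)},
}

def win_rate(wins, races):
    """Return the exact fraction of races won (hypothetical helper)."""
    return Fraction(wins, races)

def weekly_rates(runner):
    """Win rate for each week, computed separately."""
    return {week: win_rate(w, r) for week, (w, r) in results[runner].items()}

def overall_rate(runner):
    """Win rate over the whole trial: total wins divided by total races."""
    total_wins = sum(w for w, _ in results[runner].values())
    total_races = sum(r for _, r in results[runner].values())
    return win_rate(total_wins, total_races)

for runner in results:
    weekly = ", ".join(
        f"{week}: {float(rate):.0%}" for week, rate in weekly_rates(runner).items()
    )
    print(f"{runner}: {weekly}, overall: {float(overall_rate(runner)):.0%}")
```

Bob comes out ahead in each weekly comparison, while Sue comes out ahead in the overall one, even though both views are computed from the same ten races.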

We can claim victory for either Sue or Bob, using the same test results, depending on whether we focus on the total results or on who had the higher winning percentage each week. If we have all the data to examine, we can see what’s going on fairly easily. But if we are simply reading a promotional claim, and all we know is either that Bob did better each week or that Sue did better over the whole test, we have no way of knowing that something is fishy.

What happened? The effect seen here is caused by a “lurking variable”, also called a “confounding variable”. In this case, the confounding variable is the number of races each runner ran each week. Sue ran 1 race in week one and 4 in week two, while Bob did the opposite, so the weekly percentages are not a fair comparison. We should be looking at total races run, not at who had the higher rate of success each week. And yet, asking who had the higher rate of success each week is a perfectly reasonable thing to ask, even if in this case it’s the wrong thing to ask.
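One way to see the confounding variable at work: each runner’s overall rate is the average of the weekly rates, weighted by the number of races run that week. The sketch below is illustrative only; the function name `weighted_rate` and the list layout are assumptions, while the rates and race counts are the ones from the story.

```python
# Each entry is (weekly win rate, races run that week), from the story above.
sue = [(0.00, 1), (0.75, 4)]
bob = [(0.25, 4), (1.00, 1)]

def weighted_rate(weeks):
    """Average the weekly win rates, weighting each week by races run."""
    total_races = sum(races for _, races in weeks)
    return sum(rate * races for rate, races in weeks) / total_races

sue_overall = weighted_rate(sue)  # her strong week accounts for 4 of her 5 races
bob_overall = weighted_rate(bob)  # his weak week accounts for 4 of his 5 races
print(f"Sue: {sue_overall:.0%}, Bob: {bob_overall:.0%}")
```

Because the weights differ between the two runners, Bob’s 100% week counts for only one race while Sue’s 75% week counts for four, which is what flips the overall ranking.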

When you are analyzing data, it is important to understand what your data represents and what you are trying to decide. The good news for Bob’s and Sue’s promoters is that both can lay claim to success in this trial, depending on how you analyze the results. However, if you are trying to decide whom to send to represent you in the Olympics, you need to understand the data and what the results mean. If you ask the wrong question (“who had the higher winning percentage each week?”) you will get an answer that, while technically correct, is misleading because you were not aware of the confounding variable (races run per week by each runner). Analytics cannot be done in a vacuum: you need to understand the data and understand what you are trying to prove. Only then will you ask the right questions, and only then will you correctly interpret the results as they apply to your specific business challenge.
