How do I know if my results are right?
Oftentimes in data work we run reports or analysis and just take for granted that what the data show us is right. Usually this rests on an assumption that whoever created the reporting for us did their due diligence to ensure the results were right. The hard truth of the matter is, they almost never do. Running analysis-on-analysis is not only abstract, it is cumbersome and requires a deeper knowledge of what data can show us, or rather, how well data can form a picture.
My favorite example of shoddy analysis with little to no verification is sports reporting. Sports reporters and bloggers rarely have any sort of mathematical or statistical background, so it makes sense that they would just trust the numbers. All too often a Twitter talking head quotes a Basketball Reference advanced analytic in favor of their favorite team or player, regardless of what the stat actually shows, what it is used for, or how reliable it is. There are two things at play here: one, people love data, and two, people love when it shows they are right. But does it really show them they are right?
Statistics is really the practice of testing data and making inferences from it, or knowing how right your data is. However, the testing piece is often forgotten, or to those doing analysis it feels daunting if not impossible. In this post, I hope to make this process feel less daunting by providing a few methods of testing your results. These will range from the quick-and-dirty to a more formal technique.
One of the easiest things you can do to test a data set or a result from a data set is to run fundamental statistics: mean (average), median, mode and range. None of these pieces alone is particularly insightful. In fact, the most common mistake I see in analysis is taking an average and assuming that it is representative of the data set. However, with all of these pieces together, you can begin to form an opinion of your data. Consistency is key here: the more consistent your data is, the more confident you can be about what it shows you.
- Are the average and median pretty close? Good, that indicates your data is not heavily skewed.
- Is your range small? Okay, these data points are grouped together.
- Are your median or average pretty close to the middle of your range? Great, another indicator your data is not skewed.
- Modes are generally more helpful with integers, but your mode being close to your average and median is another good indicator of data consistency.
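The checks above are quick to run in code. Here is a minimal sketch using Python's built-in `statistics` module; the `quick_summary` helper and the `scores` data are made up for illustration and are not from any real report.

```python
import statistics

def quick_summary(values):
    """Return the basic statistics discussed above for a list of numbers."""
    low, high = min(values), max(values)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),  # the most common value
        "range": high - low,
        "midrange": (low + high) / 2,     # the middle of the range
    }

# Made-up example data, for illustration only. Note the single large value.
scores = [12, 14, 14, 15, 16, 17, 45]
summary = quick_summary(scores)

for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up set the mean (19) sits well above the median (15) and the mode (14), and the middle of the range (28.5) is far from both. That is the kind of disagreement that should make you suspicious of quoting the average on its own.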
The next level
If you are working with larger data sets or good sized samples, it is really helpful to understand variance and standard deviation. These are really two sides of the same coin. In fact, standard deviation is the square root of the variance. Plenty of online resources do a great job showing how to calculate these two measures, so I won’t show that here. I will warn that there are two calcs for each of these, one for a sample set (which divides by n - 1) and one for a population (which divides by n). TL;DR, they show how much a data set varies or deviates. Again, consistency is key.
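To see the sample versus population difference in practice, here is a small sketch with Python's `statistics` module. The `data` list is made up for illustration.

```python
import statistics

# Made-up example data, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population versions: use these when the data is the entire population.
pop_var = statistics.pvariance(data)
pop_std = statistics.pstdev(data)

# Sample versions: use these when the data is a sample drawn from
# a larger population. They come out slightly larger.
samp_var = statistics.variance(data)
samp_std = statistics.stdev(data)

print(f"population: variance={pop_var}, std={pop_std}")
print(f"sample:     variance={samp_var:.3f}, std={samp_std:.3f}")
```

The gap between the two versions shrinks as the data set grows, so the choice matters most for small samples.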
In a normal distribution, the range within one standard deviation of the mean contains about 68% (roughly two thirds) of all data points. Two standard deviations contain about 95%. For our quick-and-dirty tests above, a low standard deviation indicates consistent data. For next level analysis, you can start to use standard deviation to recognize trends. For instance, say we have data that comes in every month. We can plot that data, the mean/average for the set and the standard deviation, like so:
Using this plot, we can see that starting in September there was an upward trend. We can confidently say that, because there were two consecutive months outside the mean + standard deviation and three outside in four months. Either this is an anomaly cluster or, more likely, this is a trend. There are some formal rules for recognizing trends designed by some of the statistics gurus. I personally like the Nelson rules. These were designed for statistical analysis of manufacturing processes but they have a lot of applications.
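The same check can be done without a plot. The sketch below flags points outside the mean ± one standard deviation band and counts streaks above it. The helper names and the `monthly` values are made up for illustration (they are not the data behind the plot), and this is a simplified band check rather than an implementation of the Nelson rules.

```python
import statistics

def flag_outside_band(values, width=1.0):
    """Return 1, -1 or 0 per point: above, below or inside mean ± width * std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes the values are the whole population
    upper = mean + width * std
    lower = mean - width * std
    return [1 if v > upper else -1 if v < lower else 0 for v in values]

def longest_run_above(flags):
    """Length of the longest streak of consecutive points above the band."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag == 1 else 0
        best = max(best, current)
    return best

# Made-up monthly values (January through December), for illustration only.
monthly = [100, 102, 98, 101, 99, 100, 103, 101, 110, 104, 112, 115]

flags = flag_outside_band(monthly)
above_in_last_four = sum(1 for flag in flags[-4:] if flag == 1)

print(flags)
print("longest run above band:", longest_run_above(flags))
print("months above band in the last four:", above_in_last_four)
```

With these made-up numbers, the last four months contain three points above the band, two of them consecutive, which is the same pattern described above.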
Normal and more formal
Now that you have your mean and standard deviation, you can plot a probability/normal distribution. Again, this is something that works best with larger data sets and entire populations. The more data you have, the easier it is to see how the distribution fills in. Normalizing a distribution is basically taking data points and imposing them on a curve whose area = 1. By doing this we can see how our data set falls. I generally use Excel for this process. Take the mean and the standard deviation (`STDEV.P` or `STDEV.S`), order the data, then use the `NORM.DIST` function and plot.
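If you would rather not use a spreadsheet, the same steps can be sketched in Python. `NormalDist.pdf` from the standard library gives the height of the normal curve at a point, which I am treating here as the counterpart of the non-cumulative `NORM.DIST`. The `data` list is made up for illustration.

```python
from statistics import NormalDist, mean, pstdev

# Made-up example data, ordered as in the spreadsheet steps.
data = sorted([2, 4, 4, 4, 5, 5, 7, 9])

# Population standard deviation here; use statistics.stdev for a sample.
mu = mean(data)
sigma = pstdev(data)
dist = NormalDist(mu, sigma)

# One (x, height) pair per data point, ready to hand to a plotting tool.
curve = [(x, dist.pdf(x)) for x in data]

for x, height in curve:
    print(f"{x}: {height:.4f}")
```

The heights are largest near the mean and fall away on either side, which is what gives the plotted curve its bell shape.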
Once we have a plotted curve, we can start to visually form opinions about our data. Because a normal distribution has mean=median=mode, the plot should be centered and symmetrical.
- If it is not centered and symmetrical, that indicates skew. Curves should have a smooth bell shape.
- If it is flat, the data has a large relative standard deviation.
- If it is tall, the data has a small standard deviation.
There are tons of more formal pieces of analysis and statistics that can be done from this point. I won’t touch on many here, but I recommend Investopedia as a resource for learning more.
Now we are close to the finish line. Using all the pieces above, we can now compare samples to our population results using hypothesis testing (this can be done with multiple sample sets too but that may be another post). Using this method, we will end up with a probability that tells us how likely a result like ours would be if our hypothesis were wrong. Here is how the method works:
- Make a quantifiable hypothesis about your data set, e.g., my sample average is greater than the population average (x̄ > µ).
- Make a null hypothesis, which is essentially the opposite of your hypothesis, e.g., my sample average is less than or equal to the population average (x̄ ≤ µ).
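- Calculate your Z-score. For a single data point x, that is (x - µ) / σ, where µ is the population mean and σ is the population standard deviation. For the average of a sample set, divide σ by sqrt(number of samples) in the denominator: (x̄ - µ) / (σ / sqrt(n)).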
- Look up your Z-score on a table to get your P-score (more commonly called a p-value). This is the probability of seeing a result at least as extreme as yours if the null hypothesis were true, so a small value supports your hypothesis. Depending on your hypothesis (greater than or less than) you will need to use the left or the right table.
With a P-score, you have a quantifiable measure of how well the data back up your analysis.
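The steps above can be sketched in a few lines of Python, with `NormalDist().cdf` standing in for the lookup table. The `z_test_greater` helper and all of the numbers are made up for illustration, and the sketch assumes you know the population standard deviation and are testing the "greater than" case.

```python
from math import sqrt
from statistics import NormalDist

def z_test_greater(sample_mean, pop_mean, pop_std, n):
    """One-sided z-test for 'the sample mean is greater than the population mean'.

    Returns the z-score and the upper-tail area beyond it.
    """
    z = (sample_mean - pop_mean) / (pop_std / sqrt(n))
    p = 1.0 - NormalDist().cdf(z)  # area to the right of z on a standard normal
    return z, p

# Made-up numbers, for illustration only.
z, p = z_test_greater(sample_mean=104.0, pop_mean=100.0, pop_std=10.0, n=25)
print(f"z = {z:.2f}, p = {p:.4f}")
```

For the "less than" case you would use the left tail instead, which is `NormalDist().cdf(z)` on its own.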
Whether you use a back-of-the-napkin technique or a formal statistical test, analyzing your results is going to help set your work apart from other analysts’. Data can be tricky and deceiving. Contrary to the saying, numbers can and do lie. By testing your results and analysis, you can put numbers through a figurative lie detector, making you and your stakeholders more confident.