Introduction to Hypothesis tests
An explanation of the principles of hypothesis testing - a key idea in statistics
Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is most often used by researchers to test predictions, called hypotheses.
Null and Alternative Hypotheses
The first step in the hypothesis testing process is to frame your research question in terms of the data that you will collect. You want to think about what statement you are trying to test.
You then want to think about how the data will look different if that statement is true/false. To do this, we state null and alternative hypotheses. These are two competing statements.
The null hypothesis (called H0) usually takes the form “there is no difference between these numbers” or “there is no relationship between these variables”.
The alternative hypothesis (called H1) is a statement that is the opposite of the null hypothesis and usually takes the form “there is a difference between these two numbers” or “there is a relationship between these variables”.
A hypothesis test is where we examine the data and decide which of the two competing hypotheses is more believable given the evidence we have.
We begin by assuming the null hypothesis is true.
Example
Does attending MASH statistics workshops have an impact on attainment in quantitative modules?
- Null Hypothesis: The mean module mark for students who attend MASH workshops is the same as for students who do not attend MASH workshops.
- Alternative hypothesis: The mean module mark for students who attend MASH workshops is different from the mean mark for students who do not attend MASH workshops.
If the results for some students who attended MASH statistics workshops and some students who did not attend MASH workshops are as follows:
|                      | MASH | No MASH | Difference |
|----------------------|------|---------|------------|
| Mean module mark (%) | 68   | 64      | 4          |
From this information alone we can see that the mean mark for students who attended MASH workshops is higher than for students who did not. However, is the difference large enough to be significant? A hypothesis test is a way of approaching questions like this formally and consistently, rather than just looking at the difference between the numbers and deciding whether we think it counts as a “big” difference.
Carrying out a hypothesis test
The main steps for carrying out a significance/hypothesis test are:
- Calculate the test statistic (a single value which represents the important feature of the data we’re testing, for example, a mean) using the data collected
- Use the test statistic to compare what we have observed to what we would expect under the null hypothesis
- Use the test statistic to obtain a p-value and use this to decide whether or not to reject the null hypothesis
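To make these steps concrete, here is a minimal sketch in Python using SciPy. The module marks are invented purely for illustration, and an independent-samples t-test is just one example of a hypothesis test:

```python
from scipy import stats

# Hypothetical module marks, invented purely for illustration
mash_marks = [72, 65, 70, 68, 74, 63, 69, 71, 66, 70]      # attended MASH workshops
no_mash_marks = [60, 66, 62, 65, 59, 68, 64, 61, 67, 63]   # did not attend

# Steps 1-3: ttest_ind calculates a test statistic from the data, compares the
# observed difference in means to what we would expect under the null hypothesis,
# and returns the corresponding p-value
t_statistic, p_value = stats.ttest_ind(mash_marks, no_mash_marks)

print(f"t statistic: {t_statistic:.2f}, p-value: {p_value:.3f}")
```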
What is a P-value?
Definition: The probability of observing a result (test statistic) at least as extreme as the one calculated, if the null hypothesis is true.
To understand the p-value it’s helpful to think about it as the probability of seeing a difference/relationship/results as ‘big’ as the one calculated if the null hypothesis is true. If we have a p-value that is small, this means that the probability of seeing that difference when the null hypothesis is true is also small. Therefore the null hypothesis is unlikely to be true. The smaller the p-value the less likely the null hypothesis is to be true, so we have more evidence to reject the null hypothesis.
In order to have enough evidence to reject the null hypothesis, we want our p-value to be as small as possible.
The p-value is almost always calculated using a computer. It is possible to learn to use the formulas which calculate p-values by hand but we won’t discuss the mathematics here. The p-value depends on the sample size, the test statistic and the spread (usually standard deviation) of the samples.
Let’s return to the example above.
Our null hypothesis is:
“The mean module mark for students who attend MASH workshops is the same as for students who do not attend MASH workshops.”
If our results were:
|                      | MASH | No MASH | Difference | p-value |
|----------------------|------|---------|------------|---------|
| Mean module mark (%) | 68   | 68      | 0          | large   |
We can see that we don’t have evidence that the null hypothesis is false. In this case the 0 is our test statistic and the associated p-value would be large.
If our results were:
|                      | MASH | No MASH | Difference | p-value |
|----------------------|------|---------|------------|---------|
| Mean module mark (%) | 68   | 64      | 4          | smaller |
The test statistic is 4 - the two samples were not exactly the same. The p-value would tell us how likely it is to get this result (or a more extreme result) if the reality is that MASH workshops have no effect on marks. The p-value would be smaller than above.
If our results were:
|                      | MASH | No MASH | Difference | p-value      |
|----------------------|------|---------|------------|--------------|
| Mean module mark (%) | 68   | 58      | 10         | even smaller |
We can see a bigger difference between the means. If MASH workshops don’t make a difference, this result is more unlikely than either of the two above examples. Therefore, the p-value would now be smaller again. This would mean we’re more likely to doubt our null hypothesis.
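To see this pattern numerically, here is a small sketch using SciPy’s `ttest_ind_from_stats`. The tables above only give the means, so the standard deviation and group sizes below are assumptions invented purely for illustration:

```python
from scipy import stats

# Only the means appear in the tables above, so the standard deviation (12 marks)
# and group size (30 students per group) are assumptions made for illustration
assumed_sd, assumed_n = 12, 30

for no_mash_mean in (68, 64, 58):   # differences of 0, 4 and 10 marks
    _, p = stats.ttest_ind_from_stats(68, assumed_sd, assumed_n,
                                      no_mash_mean, assumed_sd, assumed_n)
    print(f"difference = {68 - no_mash_mean:2d}: p-value = {p:.3f}")
```

With these assumed numbers, the p-value shrinks as the difference between the means grows, matching the pattern described above.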
The Significance Level
From the discussion above, we begin to see that the smaller the p-value is, the less likely we are to believe the null hypothesis. We need a threshold for how small our p-value should be before we decide not to believe the null hypothesis.
We call this threshold the significance level and give it the Greek letter α (alpha).
We say that our result is statistically significant if the p-value is less than the significance level (α). Strictly speaking, you can decide for yourself how big you would like the significance level to be. Very often, the significance level is set at 0.05.
If the p-value is less than the significance level, we say we have statistical significance and we say we have “evidence to reject the null hypothesis”.
By convention, we never say that we “accept” a hypothesis. This is because it’s always technically possible for either hypothesis to be true, even if the p-value is very small. However, if we have very small p-values, we can talk about “strong evidence to reject the null” or “very strong evidence to reject the null”.
The definition of a significance level is often given as:
The significance level (α) is the probability of rejecting the null hypothesis when it is actually true.
This ties up with the idea of a significance level as a threshold for a “small” p-value: the p-value is the probability of getting a result like the one we got if the null hypothesis is true, and we reject the null hypothesis whenever the p-value is smaller than the significance level.
Type 1 and Type 2 Errors
When we carry out a hypothesis test there are four different outcomes:
- The null hypothesis is true and we correctly decide to not reject the null hypothesis
- The null hypothesis is false and we correctly decide to reject the null hypothesis
- The null hypothesis is true and we incorrectly decide to reject the null hypothesis. This is known as a Type 1 error.
- The null hypothesis is false and we incorrectly decide to not reject the null hypothesis. This is known as a Type 2 error.
If we have carried out our data collection and other elements of the research properly, we will usually find ourselves in scenario 1 or 2 - i.e. the data we have collected will reflect reality and we will get a correct result. However, scenarios 3 and 4 can still occur and we have to understand why.
A Type 1 error is where we declare that there is a difference between groups when actually there is no true difference. The Type 1 error rate is equal to the significance level (α) - i.e. if the significance level is 0.05 (5%) then we will get a Type 1 error 5% of the time. A Type 1 error is sometimes called a “false positive” result.
The probability of (correctly) rejecting the null hypothesis when it is actually false is called the Power of the study. It is the probability of concluding that there is a difference, when a difference truly exists.
A Type 2 error is where the data look as though there is no difference between groups (i.e. it seems as though the null hypothesis is true) but in reality, there is a difference. The probability of this happening is labelled β (beta). If we know that there is a difference and we know how big the difference is, we can calculate β. However, the whole point of the hypothesis test is to find out if a difference actually exists so we don’t usually get to calculate β outside of examples made up for textbook exercises.
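One way to get a feel for these error rates is to simulate many studies and count how often the test rejects the null hypothesis. The sketch below is illustrative only: the group means, standard deviation and sample size are invented, and the test used is an independent-samples t-test at α = 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, sd, n_simulations = 0.05, 30, 12, 5_000

def rejection_rate(true_difference):
    """Fraction of simulated studies in which a t-test rejects the null at level alpha."""
    rejections = 0
    for _ in range(n_simulations):
        group_a = rng.normal(68, sd, n)                     # e.g. attended workshops
        group_b = rng.normal(68 - true_difference, sd, n)   # e.g. did not attend
        _, p = stats.ttest_ind(group_a, group_b)
        rejections += p < alpha
    return rejections / n_simulations

# Null hypothesis true: the rejection rate is the Type 1 error rate (close to alpha)
print("Type 1 error rate:", rejection_rate(true_difference=0))

# Null hypothesis false: the rejection rate is the power, and 1 - power is beta
print("Power for a 10-mark difference:", rejection_rate(true_difference=10))
```

Under a true null hypothesis the rejection rate settles near α (the Type 1 error rate); when a real difference exists, the rejection rate is the power of the study, and 1 minus the power is β.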
Let’s look at this in the context of our example from above. The possibilities are described in the table below:
| What is true in reality | The test shows no difference (large p-value) | The test shows a significant difference (small p-value) |
|---|---|---|
| MASH workshops do make a difference to marks in reality | We wrongly conclude there is no significant difference - this is a Type 2 error | We have detected the difference in the data - we correctly reject the null hypothesis |
| MASH workshops don’t make a difference to marks in reality | We correctly conclude that there is no significant difference | We wrongly conclude there is a significant difference - this is a Type 1 error |
Remember that even though there are four different possibilities, they are not all equally likely.
As discussed in the previous section, if the difference between the means is larger, we would expect the p-value to be smaller, but the p-value also depends on factors such as spread and sample size.
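As a quick illustration of this, the sketch below keeps the 4-mark difference fixed and varies the (invented) standard deviation and group size to show how they change the p-value:

```python
from scipy import stats

# The 4-mark difference in means is fixed; the standard deviations and group
# sizes are invented to show how spread and sample size change the p-value
for sd, n in [(12, 30), (12, 300), (6, 30)]:
    _, p = stats.ttest_ind_from_stats(68, sd, n, 64, sd, n)
    print(f"sd = {sd:2d}, students per group = {n:3d}: p-value = {p:.4f}")
```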
One and Two Tailed Tests
There are two ways of conducting a hypothesis test: one-tailed and two-tailed. In a one-tailed test, we only look for a difference in one direction. In a two-tailed test, we look for a difference in either direction.
For example, if we want to know if people on a diet have lost weight we might do a one-tailed test to see if (weight before)-(weight after) is positive.
If our diet actually made people gain weight, this would not show up as a significant result on a one tailed test. However, if we conducted a two tailed test, a difference in either direction would show up as significant.
Practically: software will usually perform two-tailed tests by default. Unless you have a good reason, just stick with the default two-tailed test.
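For example, recent versions of SciPy let you choose the direction of the test through the `alternative` argument. The before/after weights below are invented purely for illustration:

```python
from scipy import stats

# Hypothetical before/after weights (kg) for the diet example, invented for illustration
before = [82.1, 90.4, 77.3, 95.0, 88.2, 73.6, 80.5, 86.9]
after  = [80.0, 88.9, 77.8, 92.1, 86.0, 72.9, 79.2, 85.5]

# Two-tailed (the default): is there a difference in either direction?
_, p_two_tailed = stats.ttest_rel(before, after)

# One-tailed: is (weight before) - (weight after) positive, i.e. did people lose weight?
_, p_one_tailed = stats.ttest_rel(before, after, alternative='greater')

print(f"two-tailed p-value: {p_two_tailed:.3f}, one-tailed p-value: {p_one_tailed:.3f}")
```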
Statistical Significance and Meaningful difference
Statistical significance is concerned with whether the null hypothesis is true or not.
In the example we have used, our null hypothesis is the statement “the means of two different groups of people are the same”. A statistically significant result is evidence that there is a difference between the two means, but it says nothing about how big that difference is.
Large sample sizes and small standard deviations can lead to significant results even when the difference is small. Small sample sizes may fail to detect real differences in the population. We could call a difference that is big enough to be worthwhile a “meaningful” difference. In medical contexts this is sometimes referred to as clinical significance. A meaningful difference is related to the idea of effect size. Statistical tools other than p-values (such as effect sizes and confidence intervals) will help you decide whether the difference is meaningful.
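One common effect-size measure is Cohen’s d: the difference in means divided by a pooled standard deviation. Here is a minimal sketch with invented marks; it is just one of several ways of quantifying the size of a difference:

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    a, b = np.asarray(group_a, dtype=float), np.asarray(group_b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical module marks, invented purely for illustration
mash_marks = [72, 65, 70, 68, 74, 63, 69, 71, 66, 70]
no_mash_marks = [60, 66, 62, 65, 59, 68, 64, 61, 67, 63]

print(f"Cohen's d: {cohens_d(mash_marks, no_mash_marks):.2f}")
```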
Useful Resources
There are lots of ways of describing hypothesis testing and many people find it useful to read or listen to a few explanations to begin with. When you find that you can see why all the different descriptions are actually all saying the same thing, you’ve probably got the idea!
Here’s a webpage with an explanation we liked to get you started.
Book a 1:1 appointment or workshop
Would you like to explore a maths or stats topic in greater depth? Why not book a 1:1 with an advisor or a workshop (current students only).