Sunday, 3 January 2016

On Null Hypothesis Significance Testing, P values and the Scientific Method

Hypothesis Testing 

Hypothesis testing is central to science as a way of detecting the presence of an effect. A commonly used technique is NHST (Null Hypothesis Significance Testing), which tests whether your sample data is consistent with the distribution you would expect under the null hypothesis, i.e. whether your data differs from what would be considered 'normal' if no effect were present, and assigns a p value to your data. If the mean of your sample differs from the mean expected under the null, this might tell you that your data is not simply a random sample from the null distribution but that an effect (of your variable) is present.

We conduct Hypothesis Testing by comparing our alternative hypothesis against a null hypothesis. You either reject or fail to reject the null hypothesis (double negatives can be used in statistics - not implausible, failed to reject, etc.).

Failing to reject the null hypothesis - This does not mean that the null hypothesis is true, only that this sample does not show that the alternative is true. Not rejecting a position like the null hypothesis does not mean that we're saying it is correct.

Rejecting the null hypothesis - This does not mean that the null hypothesis is false/not true. Neither does it mean that the alternative is true. It just means that this sample shows that the data is different from the null. Another sample might not. 

These two points above are important to understand because when we look for effects in data, particularly noisy data which might be influenced by a lot of factors, you cannot simply reduce the act of spotting an effect to rejecting or failing to reject a null hypothesis. This is because the null hypothesis is almost always false. 

When you sample from a population, it will be a coincidence indeed if you get the exact same means in both your experimental and control samples. I think the more important question to ask is how much of an effect is present and under what conditions will it vary. Statistics is no substitute for thinking. You need to decide what an important effect is. 
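To make the point about effect size concrete, here is a minimal sketch, assuming a two-group z test with a known standard deviation. The true difference (0.05 standard deviations) and the sample sizes are made-up numbers for illustration, and `two_sided_p` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(diff, sd, n_per_group):
    """Two-sided p value for a difference in means between two
    equal-sized groups, using a z test with a known sd (an assumption)."""
    se = sd * sqrt(2 / n_per_group)   # standard error of the difference
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same tiny difference, tested at three (hypothetical) sample sizes.
for n in (50, 5_000, 500_000):
    print(n, two_sided_p(diff=0.05, sd=1.0, n_per_group=n))
```

The difference never changes, yet it moves from clearly non-significant to overwhelmingly significant as the sample grows. The p value is tracking the sample size here; whether a difference of 0.05 standard deviations matters is still a judgement you have to make.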

Other points to remember - 

- If you have a research question, circle around the problem, address it in different ways. Don't frame it in one specific manner and pin your conclusions on a null hypothesis to be tested.

- Hypothesis testing does not have to be applied to all questions. You can have one-off events worth studying that do not need falsification.

- It's OK to conceive your hypothesis after you have conducted research but it should be before you have analysed data statistically (more on this later).

- Hypothesis tests are always about population parameters, never about sample statistics. We always use the sample data to hypothesise about the population mean, not the sample mean.

- Hypothesis testing and significance testing are different things. Hypothesis testing, or null hypothesis testing, is about rejecting or failing to reject a null hypothesis; significance testing is about assigning a p value. We commonly use the two together in a hybrid called NHST, which is controversial.

Null Hypothesis Significance Testing (NHST) and P values

In order to conduct a hypothesis test, we usually set a significance level, a threshold on which we decide whether to reject or fail to reject the null hypothesis. This is how the NHST methodology works, but it has drawbacks, like a dependence on the p value. A p value is supposed to quantify the strength of evidence against the null value. It tells you how unusual the occurrence would be if it were due to chance alone.

The p value is the probability of observing a sample statistic, like the mean, at least as extreme as the one in this sample, given our assumptions about the population mean.

p value = P(sample mean being at least as extreme | assumption about population mean)

It is simply a tail probability read off a standardised distribution, as in a Z score table (you can find it using the pnorm function in R). For example, if you test two groups of people and group A scores 5 and group B scores 7, and you want to see if their scores are significantly different from each other, you take the difference, 2, and then decide if this is significantly different from your null value, whatever it is (probably 0), given a certain standard error. (Remember that most test statistics are essentially an estimate divided by the error in that estimate.)

One way to do this is to be so immersed in your subject matter, such a complete expert with full contextual knowledge, that you know subjectively whether a difference of 2 really matters, i.e. whether it translates to real-world significance. Remember that real-world significance and statistical significance are two different things.

For statistical significance, you would compare your test statistic against a standardised distribution, assuming it follows one, and your data might be deemed significant if you get a low p value. The low p value is supposed to tell you that the probability of getting a difference of 2 or more is low, i.e. that it falls in the tail of the distribution, given the null default.
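As a rough sketch of that calculation, using the difference of 2 from the example and a null value of 0: the standard error of 0.8 is a number I made up, since the example does not give one, and the test statistic is assumed to follow a standard normal distribution.

```python
from statistics import NormalDist

observed_diff = 7 - 5    # group B minus group A, from the example
null_diff = 0            # the null value
standard_error = 0.8     # hypothetical: not given in the example

# Test statistic: the estimate divided by the error in that estimate.
z = (observed_diff - null_diff) / standard_error

# Two-sided p value: probability of a z at least this extreme under the null.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p = {p_two_sided:.4f}")   # z = 2.50, p = 0.0124
```

In R the same tail probability would be `2 * (1 - pnorm(2.5))`. With a larger standard error, the same difference of 2 would not come out as significant, which is why the p value alone cannot tell you whether 2 is a difference that matters.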

There are a few drawbacks to using p values as indications of significance. This paper shows us the harmful effects of using NHST and confusing statistical significance with real life significance but I've included my own notes below.

Significance testing tells you more about the quality of your study (variation and sample size) than about your effect size which is more important. Andy Field has written a very easy-to-follow chapter on this topic.

- As I said before, p values are the probability of observing what you observed given a null default, but the default is never null. The null hypothesis might always be false since two groups rarely have the same mean. How then do you make sense of how probable your data is?

- The p value is conditional on the null hypothesis. It is not a statement about underlying reality. Even if it is accurate, the p value is a statement about data when the null is true, it cannot be a statement about data when the null is false.

- A p value is not the probability of the null hypothesis being true or false. The p value is the probability of extreme data conditional on a null hypothesis. 

- It is not the probability of a hypothesis conditional on the data. P values tell us about our data based on assumptions of no effect, but we want a statement about hypotheses based on our data. To infer the latter from a p value is to commit the logical fallacy of inverting conditionals.

- P values do not tell you if the result you obtained was due to chance, they tell you if the result was consistent with being due to chance.

- P values do not tell you the probability of false positives. The significance level (not the p value) is the Type I error rate, i.e. P(Type I error) or P(reject H0 | H0 is true).

- This paper does a good job of expanding on my points above, listing a lot of the common misconceptions about p values and NHST. Highly recommended.

If you're studying a non-stable process that spits out random values, p values are not meaningful because they are path dependent: the p value is a summary of data that has not happened, under the assumption that further data will follow a certain distribution.

- People use 0.05 as a significance level, but need to remember that hypothesis tests at that level are designed to call a set of data significant 5% of the time when the null is true.

- Many studies show that, with a significance level of 0.05, you have a very good chance of getting a significant result when there is no real effect (about 30% of the time). This paper in particular does a good job of explaining the high false discovery rate using a significance level of 0.05 and compares it to the screening problem, and this article summarises the points well. You can use a lower level like 0.001, but it really is up to you to decide what is statistically significant.
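The screening analogy in the last point can be worked through with a few lines of arithmetic. This is only a sketch: the power (80%) and the share of tested hypotheses that are actually real (10%) are assumptions I've picked for illustration, not figures from the papers mentioned, and `false_discovery_rate` is a hypothetical helper name.

```python
def false_discovery_rate(alpha, power, prior_real):
    """Share of 'significant' results that are false positives.

    alpha      - significance level, P(reject | H0 is true)
    power      - P(reject | effect is real)           (assumed)
    prior_real - share of tested effects that are real (assumed)
    """
    true_positives = power * prior_real
    false_positives = alpha * (1 - prior_real)
    return false_positives / (true_positives + false_positives)

print(false_discovery_rate(alpha=0.05, power=0.8, prior_real=0.1))
print(false_discovery_rate(alpha=0.001, power=0.8, prior_real=0.1))
```

Under these assumptions, roughly a third of significant results at 0.05 are false discoveries, even though the Type I error rate is only 5%. That gap is the difference between P(reject | H0 is true) and P(H0 is true | reject), the inverted conditional from the notes above. Dropping the level to 0.001 shrinks it to about 1%.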

The Scientific Method

All of this tells me that it is best, when tackling a problem, to go back to the philosophical foundations of why we do things.

Note that you only create a theory or hypothesis after you have evidence. Theories have to be based on evidence, preferably good data-driven evidence. You can't first make up a theory and then look for evidence to confirm or falsify your theory. This is how superstitions and pseudoscience are created. A deliberately vague theory will never be confirmed or falsified, only made to look unlikely. Quantifying how likely or unlikely an effect is to exist is the point of science, but doing so is a waste of everyone's time if the effect was made up to begin with, so don't do this.

If you see something weird you can't explain, you don't automatically give it a name. That's merely classifying a phenomenon, putting it in a box that represents what you already know of the universe, which is incomplete. And your classification system or model or framework could be wrong. You need to do more. It is best to sit on the fence, admit your ignorance, and keep exploring, digging and asking questions of your phenomenon, all the while building better and better models to explain it and make predictions. This is preferable to classifying your phenomenon in terms of some pre-existing narrative that fits your own socio-cultural context, which would be a failure of critical reasoning.

I see this all the time. Once people identify with a narrative, everything they see will serve to strengthen that narrative. Supporters of a political party do not support that party because the evidence led them to support that party, they do so because of other reasons, like values that they identify with. But once the decision is made, evidence doesn't matter. We are slaves to narratives. Everything that follows is confirmation bias.

We use models because of their usefulness, not because they are correct. It seems to me that the best way to tackle a scientific question or puzzle is to first do exploratory research, just lots of multiple comparisons, or A-B testing, and obviously we wouldn't use p values here. We look at our exploratory data, at possible trends we see and that might or might not be true, that might reflect some underlying connections, and then create hypotheses based on what we've found in the data. 

Here is where we switch from exploratory to confirmatory research. To confirm or falsify our hypotheses, we need to run experiments, which can involve hypothesis testing. And we have to gather new data for this. We cannot use the same data set for both exploratory and confirmatory research as that would be cheating ourselves and would not be scientific. 

We pre-register our experiments so we can't change our minds later and claim we were always looking for what we ended up finding. Pre-registration guards against what is called the garden of forking paths, researcher degrees of freedom, or p hacking: you test 20 hypotheses and report only the 1 that worked, or drop one condition so that you get a significant p value of < .05. You can only test the 1 hypothesis you committed to.
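To see why testing 20 hypotheses and reporting 1 is a problem, here is a small sketch of the arithmetic. It assumes the tests are independent and that every null is true, which is a simplification; `chance_of_any_false_positive` is a hypothetical helper name.

```python
def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent tests comes out
    'significant' when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20):
    print(n_tests, round(chance_of_any_false_positive(n_tests), 3))
```

With 20 tests at the 0.05 level, the chance of at least one spurious significant result is about 64%, so finding 1 out of 20 is roughly what chance alone would give you.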

There are really millions of variables that can correlate significantly with each other. Which is why we get significant correlations when we generate hundreds of 10-number strings of random numbers and then compare two strings. When you compare enough variables, you will find significant results. This is noise. This is just how large data works, or data without theory, or data with a theory that is ad hoc or made up and not evidence based. This is how superstition works. You need to look beyond this, to see if any of these correlations or effects are consistent and not merely noise.
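The random-strings experiment is easy to try. The sketch below is one way to set it up, under my own assumptions: 200 strings of 10 normally distributed random numbers, every pair compared with a Pearson correlation, and |r| > 0.632 as the two-tailed 0.05 cut-off for a sample of 10. The function names and the seed are arbitrary.

```python
import random
from itertools import combinations
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def count_significant_pairs(n_strings=200, length=10, critical_r=0.632, seed=1):
    """Generate pure-noise strings and count pairs whose correlation
    passes the two-tailed 0.05 cut-off for n = 10 (|r| > 0.632)."""
    rng = random.Random(seed)
    strings = [[rng.gauss(0, 1) for _ in range(length)]
               for _ in range(n_strings)]
    pairs = list(combinations(strings, 2))
    hits = sum(abs(pearson_r(a, b)) > critical_r for a, b in pairs)
    return hits, len(pairs)

hits, total = count_significant_pairs()
print(f"{hits} of {total} pairs 'significant' ({hits / total:.1%})")
```

There is no effect anywhere in this data, yet somewhere around 5% of the 19,900 pairs should pass the threshold, which is hundreds of "significant" correlations to choose from if you go looking.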

So we conduct our confirmatory research, get our results, and then replicate to see if the results hold. Replication helps confirm that the effect is real and wasn't just a coincidence. Also, keep in mind that if your hypothesis was based on a solid, non-noisy phenomenon or theory that you had good reason to believe existed or was true, then replication should merely help ascertain this one way or another and not be a threat to you. It should all be part of the process of good science. If your effect was made up to begin with, or was noisy, then no amount of replication is going to help falsify something that never should have been investigated in the first place. In this sense, the original experiment bears no special status over and above the replication. They both need to be treated the same.


This then is 3 different experiments that we have conducted to find one effect. And where do p values come in? I think you can use them for confirmatory research, but only to tell you about your sample data distribution, about the probability that the data is consistent with chance, under repeated attempts. But you cannot use p values to tell you about your hypotheses. From what we've seen, p values cannot do that. They were not set up for that purpose and they don't work that way. You should be able to tell what a truly significant result is in your study without p values, or by looking at other statistics. Or maybe using Bayesian statistics.

