Hypothesis, Significance level and other basics

If you are delving into data analytics and statistics, it is essential to get a firm grasp of hypothesis testing and its related terms.

In this post, I list the key concepts with simple explanations.

What is a hypothesis?

A hypothesis is a testable prediction about what your research will find.

It proposes a relationship between two or more variables, typically an independent variable and a dependent variable.

The independent variable is the one you deliberately change or control in an experiment. The dependent variable is the one you measure to see how it responds.

Null hypothesis H0 – assumes no relationship between the variables in the experiment. For example:

Eating an apple every day doesn’t lead to fewer doctor visits

More screen time doesn’t lead to a higher chance of myopia in children

Alternative hypothesis Ha – assumes a relationship does exist between the variables in the experiment. For example:

Eating an apple every day leads to fewer doctor visits

More screen time leads to a higher chance of myopia in children

What is a significance level in hypothesis testing?

The significance level, denoted as alpha or α, is the probability of rejecting the null hypothesis when it is actually true (a false positive, also called a Type I error).

For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.

The higher the significance level, the more lenient the test: it becomes easier to reject the null hypothesis, but the risk of a false positive rises with it.
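To see what "risk of a false positive" means in practice, here is a minimal simulation sketch. It assumes a two-sided z-test on normally distributed data with a known standard deviation, which is my choice for illustration and not the only possible test. The function names, sample size and number of trials are hypothetical. Every simulated experiment is run with the null hypothesis true, so any rejection is a false positive.

```python
import random
from statistics import NormalDist, mean

def two_sided_p_value(sample, mu0=0.0, sigma=1.0):
    """p-value of a two-sided z-test of H0: population mean == mu0 (sigma assumed known)."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(alpha, trials=2000, n=30, seed=42):
    """Fraction of simulated experiments that reject H0 even though H0 is true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        # H0 holds by construction: the true mean really is 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sided_p_value(sample) <= alpha:
            rejections += 1
    return rejections / trials

for alpha in (0.01, 0.05, 0.10):
    rate = false_positive_rate(alpha)
    print(f"alpha={alpha:.2f}  observed false-positive rate={rate:.3f}")
```

The observed rate should land close to each alpha, which is the sense in which a larger alpha makes the test more lenient.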

What is statistical significance / p-value?

A result is statistically significant when it would be unlikely to occur by chance alone if the null hypothesis were true.

The p-value, or probability value, is a number between 0 and 1. It is the probability of observing results at least as extreme as yours, assuming H0 is actually true. It is not the probability that H0 is true, and it is not the same thing as the significance level, which you choose before the experiment.

The lower the p-value, the stronger the evidence against the null hypothesis.

For a significance level of 5%, p <= 0.05 means there is enough evidence to reject the null hypothesis. p > 0.05 means there is not, and you fail to reject the null hypothesis (which is not the same as proving it true).
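The decision rule above can be sketched with a permutation test, one simple way (among many) to get a p-value without extra libraries. The doctor-visit numbers below are made up purely for illustration, and the function names are hypothetical. The idea: if H0 is true, the group labels are interchangeable, so we shuffle them many times and count how often the shuffled difference is at least as large as the observed one.

```python
import random

def average(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, permutations=10_000, seed=0):
    """Two-sided p-value for a difference in means, estimated by shuffling group labels."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(average(group_a) - average(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(permutations):
        rng.shuffle(pooled)
        diff = abs(average(pooled[:n_a]) - average(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # The +1 counts the observed arrangement itself, so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (permutations + 1)

def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p <= alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

# Hypothetical yearly doctor visits (invented numbers, not real data).
visits_apple = [2, 3, 1, 2, 4, 2, 3, 1, 2, 3]
visits_no_apple = [4, 5, 3, 6, 4, 5, 7, 4, 5, 6]

p = permutation_p_value(visits_apple, visits_no_apple)
print(f"p-value = {p:.4f} -> {decide(p)}")
```

With groups this far apart the p-value comes out well below 0.05, so H0 is rejected. Try making the two lists more alike and watch the decision flip.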

Confidence interval and confidence limits

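A confidence interval (CI) is a range of values, calculated from your sample, that is likely to contain the true population value (for example, the population mean).

For example, a 95% confidence interval, whose boundaries are the confidence limits, means we are 95% confident that the interval contains the true population value. More precisely, if the study were repeated many times, about 95% of the intervals calculated this way would contain it.

The wider the interval, the higher the chance that it contains the true value.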
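What "95% confident" means is easiest to see by repeating an experiment many times. The sketch below is a simulation under assumptions I picked for illustration: normally distributed data, a known standard deviation, and hypothetical values for the true mean, sample size and number of trials. It builds a 95% interval from each simulated sample and counts how often the interval contains the true mean.

```python
import random
from statistics import NormalDist, mean

def coverage(confidence=0.95, trials=2000, n=40, true_mean=10.0, sigma=2.0, seed=1):
    """Fraction of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # z value that leaves (1 - confidence) / 2 in each tail, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / n ** 0.5  # sigma is assumed known here
    hits = 0
    for _ in range(trials):
        sample_mean = mean(rng.gauss(true_mean, sigma) for _ in range(n))
        lower, upper = sample_mean - margin, sample_mean + margin  # confidence limits
        if lower <= true_mean <= upper:
            hits += 1
    return hits / trials

print(f"Share of 95% intervals containing the true mean: {coverage():.3f}")
```

The share should come out close to 0.95. Any single interval either contains the true value or it doesn't; the 95% describes how often the method succeeds over many repetitions.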
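Here is a small sketch of how the interval width changes with the confidence level. It uses the normal approximation (sample mean plus or minus z times the standard error), which is an assumption on my part and works best for reasonably large samples; for small samples a t-based interval is usually preferred. The sample is simulated and the function name is hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, confidence=0.95):
    """Approximate CI for the mean: sample mean +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(sample) / len(sample) ** 0.5
    centre = mean(sample)
    return centre - margin, centre + margin  # lower and upper confidence limits

# Simulated measurements (not real data): 40 draws around 50 with spread 8.
rng = random.Random(7)
sample = [rng.gauss(50, 8) for _ in range(40)]

for level in (0.90, 0.95, 0.99):
    lower, upper = confidence_interval(sample, level)
    print(f"{level:.0%} CI: [{lower:.2f}, {upper:.2f}]  width={upper - lower:.2f}")
```

On the same data, the 90% interval is the narrowest and the 99% interval is the widest: more confidence costs you precision.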

The confidence level determines where the confidence limits fall. It is conventionally set at 95%, which corresponds to the 5% convention for statistical significance in hypothesis testing.

In some studies narrower (e.g. 90%) or wider (e.g. 99%) confidence intervals will be required. This depends on the nature of your study. You should consult a statistician before using CIs other than 95%.

Be sure to remember these basics when you delve deeper into data analytics.