## Glossary

**ANOVA**

Short for ANalysis Of VAriance. This is a general term for tests that compare the mean values of several groups of observations by splitting the total variance into its component parts.
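A minimal sketch of a one-way ANOVA using `scipy.stats.f_oneway`; the group values below are made up purely for illustration:

```python
# One-way ANOVA comparing the means of three groups of observations.
# The data values are illustrative, not from any real study.
from scipy import stats

group_a = [23, 25, 21, 24, 26]
group_b = [30, 28, 31, 29, 32]
group_c = [22, 24, 23, 25, 21]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Here group_b has a clearly higher mean than the other two, so the between-group variance dominates and the p-value comes out small.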

**Binomial Distribution**

A discrete probability distribution of a random binary variable. It is useful for inferring information about the proportion of the variable in the population.
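A small sketch of working with the binomial distribution via `scipy.stats.binom`; the coin-flip parameters are illustrative:

```python
# Binomial probabilities for 10 flips of a fair coin (p = 0.5).
from scipy import stats

n, p = 10, 0.5
# Probability of exactly 5 heads
prob_5 = stats.binom.pmf(5, n, p)
# Probability of 8 or more heads
prob_8_plus = 1 - stats.binom.cdf(7, n, p)

print(f"P(X = 5)  = {prob_5:.4f}")    # about 0.2461
print(f"P(X >= 8) = {prob_8_plus:.4f}")
```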

**Box plot**

Also known as a box and whisker plot. This is a form of plotting data which shows the median, maximum, minimum, and upper and lower quartiles of a dataset.

**Categorical data**

Comes in two different types, nominal data or ordinal data. The defining characteristic of either sort of categorical data is that the values cannot be manipulated mathematically.

**Chi Squared Test**

This is a statistical test used on frequency data to compare observed counts with the counts expected under a hypothesis. Where the observed data deviates from the expected data, the test tells you the probability that the deviation is down to chance.
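A minimal sketch of a chi-squared goodness-of-fit test using `scipy.stats.chisquare`; the die-roll counts are made up:

```python
# Chi-squared test: do 120 observed die rolls match a fair die?
from scipy import stats

observed = [18, 22, 16, 25, 19, 20]   # counts of each face over 120 rolls
expected = [20, 20, 20, 20, 20, 20]   # a fair die would give 20 of each

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

With these counts the deviations are small, so the p-value is large and there is no evidence the die is unfair.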

**Confidence Interval**

This is the range of values around a calculated statistic (usually the mean) within which the true value of that statistic for the population is likely to lie. A confidence interval is always quoted with a confidence level, usually 95%.
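A sketch of computing a 95% confidence interval for a sample mean using the t-distribution from `scipy.stats`; the sample values are illustrative:

```python
# 95% confidence interval for a sample mean, using the t-distribution.
import numpy as np
from scipy import stats

sample = np.array([4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 5.1])
mean = sample.mean()
sem = stats.sem(sample)   # standard error of the mean

low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```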

**Continuous data**

This is data that can take a continuum of values rather than only discrete values. In general there is no absolute zero for this data; zero can be placed anywhere along the continuum.

**Correlation**

This is a way of quantifying the relationship between two paired sets of data. If two paired sets of data are strongly related or linked then they are strongly positively correlated. If there is little or no relationship between the two sets then they have no correlation. If the two sets are related but one decreases as the other increases then they are strongly negatively correlated.

Examples of these would be: there is a strong positive correlation between a car's engine size and its top speed, there is no correlation between a car's engine size and its colour, and there is a strong negative correlation between a car's engine size and its fuel efficiency.

More technically, the correlation usually quoted is Pearson's correlation coefficient.

It is worth remembering that you need sufficient paired data points for a correlation to be statistically significant. A strong correlation on a very small sample may not be significant, while a medium correlation on a very large sample would be far more significant.
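A sketch of computing Pearson's correlation coefficient together with its p-value using `scipy.stats.pearsonr`, echoing the engine-size example above; the figures are made up:

```python
# Pearson correlation between engine size and top speed,
# with a p-value to gauge significance. Data is illustrative.
from scipy import stats

engine_size = [1.0, 1.4, 1.6, 2.0, 2.5, 3.0, 3.5, 4.0]   # litres
top_speed = [150, 165, 175, 190, 205, 215, 230, 240]     # km/h

r, p_value = stats.pearsonr(engine_size, top_speed)
print(f"r = {r:.3f}, p = {p_value:.5f}")
```

The p-value here is what tells you whether the correlation is significant given the sample size, which is the point made above about small samples.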

**Covariance**

This statistic is a measure of how two variables vary together on average.
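A minimal sketch of computing a sample covariance with `numpy.cov`; the values are illustrative:

```python
# Sample covariance of two paired variables.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 7.0, 9.0])

cov_matrix = np.cov(x, y)      # 2x2 matrix; the off-diagonal entry is cov(x, y)
cov_xy = cov_matrix[0, 1]
print(f"cov(x, y) = {cov_xy:.3f}")
```

A positive covariance, as here, means the two variables tend to increase together.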

**Deciles**

The values of a variable that divide the observations into tenths. For example, the top decile is the figure that separates the top 10% of observations from the rest.
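Deciles can be computed with `numpy.percentile`; a small sketch using the numbers 1 to 100 as illustrative data:

```python
# Deciles of the numbers 1..100 via numpy.percentile.
import numpy as np

data = np.arange(1, 101)
deciles = np.percentile(data, [10, 20, 30, 40, 50, 60, 70, 80, 90])
print(deciles)
# The 90th percentile separates the top 10% of observations from the rest.
```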

**Discrete Probability Distribution**

A probability distribution where the variable can only take discrete, not continuous, values. For example, the number of people in a car.

**Hypothesis and Hypothesis Testing**

A hypothesis is a clear statement of what a piece of research or statistical test is trying to test. The usual setup is that the statement you want to test (for example, that car engine size is negatively correlated with a car's top speed, i.e. as engines get bigger top speed gets smaller) is the alternative hypothesis, and it is set against what is known as the null hypothesis. The null hypothesis is the default position of no effect: in this case, that there is no relationship between car engine size and top speed.

Hypothesis testing is the process of deciding between the alternative hypothesis and the null hypothesis. This is done by finding the P-value, which is the probability that the results you have could have been obtained if the null hypothesis were true. The P-value threshold is usually pre-agreed, and if the calculated P-value falls below that threshold then the null hypothesis is rejected in favour of the alternative hypothesis. For example, if a threshold of 0.001 is agreed, then the alternative hypothesis is accepted if the probability of obtaining the results under the null hypothesis is 1 in 1,000 or lower.

It is worth noting that it is quite usual to test multiple hypotheses against a single set of data.
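The decision procedure above can be sketched in Python: compute a p-value (here from a correlation test via `scipy.stats.pearsonr`, matching the engine-size example) and compare it against a pre-agreed threshold. The data values are made up:

```python
# Hypothesis test: is engine size correlated with top speed?
# Null hypothesis: no relationship. Data is illustrative.
from scipy import stats

engine_size = [1.0, 1.6, 2.0, 2.5, 3.0, 3.5]
top_speed = [150, 175, 190, 205, 215, 230]

alpha = 0.05   # pre-agreed significance threshold
r, p_value = stats.pearsonr(engine_size, top_speed)

if p_value < alpha:
    print(f"p = {p_value:.5f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.5f} >= {alpha}: fail to reject the null hypothesis")
```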

**Kurtosis**

Kurtosis is a way of quantifying the peakedness or sharpness of a distribution. A kurtosis test is often run in conjunction with a skew test to get a good understanding and description of the dataset.

Negative kurtosis indicates a flat distribution with thin tails, and so relatively few extreme cases, whereas positive kurtosis indicates a peaky distribution with fat tails, and so relatively more cases far from the centre.

It is usual to have a criterion that says that kurtosis of -2 to +2 is acceptable.

Obviously datasets can come in all shapes and sizes: a distribution with a sharp peak and fat tails is described as leptokurtic, while platykurtic describes a distribution with a low peak and thin tails.

Images showing this can be found at the bottom.
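A sketch of measuring kurtosis and skew with `scipy.stats`; note that scipy reports excess kurtosis, so a normal distribution scores close to zero. The data is randomly generated for illustration:

```python
# Kurtosis and skew of a large normally distributed sample.
# scipy.stats.kurtosis returns excess kurtosis (normal ~ 0).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_data = rng.normal(loc=0, scale=1, size=10_000)

kurt = stats.kurtosis(normal_data)
sk = stats.skew(normal_data)
print(f"kurtosis = {kurt:.3f}")   # near 0 for normal data
print(f"skew     = {sk:.3f}")     # near 0 for symmetric data
```

Both values land close to zero here, comfortably inside the -2 to +2 criterion mentioned above.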

**MANOVA**

MANOVA is short for Multivariate ANalysis Of VAriance. This is an expanded use of an ANOVA test for situations where you are working with multiple dependent variables. It assumes that the dependent variables under consideration are jointly normally distributed.

**Nominal data**

Nominal data is data where the numbers serve as data labels rather than values. An example of this would be a list where 'yes' is represented by 1 and 'no' is represented by 0.

**Non-parametric Tests**

These are tests that do not assume a certain underlying distribution for the dataset.

Examples of non-parametric tests are Chi-square, Mann-Whitney U and Spearman's rank correlation.
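One of the listed tests, the Mann-Whitney U test, sketched with `scipy.stats.mannwhitneyu`; the group values are made up:

```python
# Mann-Whitney U test: a non-parametric comparison of two samples
# that makes no assumption about the underlying distribution.
from scipy import stats

group_a = [12, 15, 11, 18, 14, 13]
group_b = [22, 25, 19, 24, 27, 21]

u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```

The two groups here do not overlap at all, so the test reports a small p-value without needing any normality assumption.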

**One Tailed Test**

This is a test where the hypothesis being tested does imply a direction for the relationship: for example, an increase in variable X is expected to lead to an increase in variable Y.

**Ordinal data**

Ordinal data is data that cannot be measured but can be ranked or put in order. An example of this would be surveys asking people to express preferences.

**P-value**

This is the probability of getting the sample of results being tested, or a more extreme result, if the null hypothesis is true.

**Parametric Test**

Parametric tests assume that the underlying data follows a certain distribution, often the normal distribution, that the data is measured on an interval scale, and, when multiple datasets are being compared, that their variances are consistent.

Examples of parametric tests include the Z-test, F-test and t-tests.

**Quartiles**

The values of a variable that divide the observations into equal quarters. For example, the upper quartile of a distribution is the figure that separates the top 25% of observations from the lower 75%.

**Regression Analysis**

Regression is used to explore and define the relationship between two variables. At its simplest it can be seen as trying to find a 'line of best fit' for data points plotted on a graph, although it can be used to find more complex relationships between data points as well.

Linear regression is the simplest version of regression analysis and assumes that there is a linear relationship between the two variables. Where more complex relationships are being looked at, a version of regression analysis called multivariate regression is used.
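A sketch of fitting a line of best fit with `scipy.stats.linregress`; the data points are made up and deliberately near-linear:

```python
# Simple linear regression: find the line of best fit y = slope * x + intercept.
from scipy import stats

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0]

result = stats.linregress(x, y)
print(f"y = {result.slope:.3f} * x + {result.intercept:.3f}")
print(f"r^2 = {result.rvalue ** 2:.4f}")
```

The r-squared value indicates how much of the variation in y the fitted line explains; close to 1 here because the points lie almost exactly on a line.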

**Skew**

This is a way of describing a distribution which is similar to a normal distribution but is not symmetrical, being more weighted to one end than the other. The most usual skew is a 'right skew', where the bulk of the distribution sits to the left and a long tail stretches out to the right. A picture of a left-skewed distribution can be found at the bottom.

**T-test**

There are a variety of t-tests depending on the datasets being compared. A two-sample t-test, for example, tests whether the means of two sets of normally distributed data are equal.
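A sketch of the two-sample case using `scipy.stats.ttest_ind`; the sample values are made up and assumed to be roughly normally distributed within each group:

```python
# Independent two-sample t-test: are the two group means equal?
# Data is illustrative and assumed roughly normal in each group.
from scipy import stats

sample_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
sample_b = [5.9, 6.1, 5.8, 6.0, 6.2, 5.7]

t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(f"t = {t_stat:.2f}, p = {p_value:.5f}")
```

The small p-value here says the difference in means is very unlikely to be down to chance.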

**Two Tailed Test**

This is a test where the hypothesis being tested doesn't imply a direction for the relationship, if there is one. So an increase in variable X could lead to either an increase or a decrease in variable Y.