Last time we talked about combinations and permutations with groups of different items, or more formally put, sets of n items each of which is unique. This time I'm going to explore what happens if the groups we're working with aren't made up entirely of different items. A classic example of this is phone numbers, where the same digit can occur several times.

If you want to count the arrangements of a set where not all of the items are different then the formula to use is

n! / (n1! * n2! * ... * nk!)

where n is the total number of items and n1, n2 and so on up to nk are the counts of each repeated item.


Although this may sound a little off topic for an SPSS blog, these concepts come up in sampling for survey research, significance testing and other statistical methods, so I thought it would be useful to take a slight detour and write a post or two about combinations, permutations and other basic statistical concepts.

Combinations and permutations are related concepts, both concerned with how many different ways things can be sequenced or selected.

Permutations count the number of ways that a given number of items from a given set can be put into different sequences. So, for example, if you have A, B, C and D (a set of 4 items) but only take 3 of them at any one time, how many different orderings could come up?

Working through the list of possibilities gives us ABC, ACB, ABD, ADB, ACD, ADC, BAC, BCA, BDA, BAD, BCD, BDC, CAB, CBA, CBD, CDB, CAD, CDA, DAC, DCA, DAB, DBA, DBC, DCB. So that is 24 possible permutations in total.

When you are working with a set of n items and taking r items from that set, the standard notation for, and way of calculating, the number of permutations is:

nPr = n! / (n-r)!

This obviously simplifies to nPn = n! when you are taking the same number of items from the set as there are items in the set (since 0! = 1). In this notation n! is n factorial, which is calculated as 1 * 2 * 3 * ... * (n-1) * n, so when taking 6 distinct items from a set of 6 the number of permutations is 1 * 2 * 3 * 4 * 5 * 6 = 720.

If you are taking 4 items from a set of 6 items then the formula gives 6P4 = 720 / (1*2) = 360. And using our earlier example, taking 3 items from a set of 4 items gives 4P3 = (1*2*3*4) / 1 = 24 permutations, matching the list above.
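These numbers are easy to check with a few lines of Python (the helper name `permutations` is mine; since Python 3.8 the built-in `math.perm` does the same job):

```python
from math import factorial, perm

# nPr = n! / (n - r)!
def permutations(n, r):
    return factorial(n) // factorial(n - r)

print(permutations(4, 3))  # 3 items from a set of 4 -> 24
print(permutations(6, 4))  # 4 items from a set of 6 -> 360
print(perm(6, 4))          # the built-in agrees    -> 360
```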

Combinations count the number of subsets that can be taken from a given set of items. Combinations are independent of sequence which, as you will have seen above, is not the case with permutations, where ABC and ACB, for example, are separate items and counted as such. In our example above there are 6 orderings of the 3 letters A, B and D, so ABD accounts for 6 permutations but only one combination.

The relevant notation for combinations is shown below. As you can see it's very similar to the notation for permutations.

nCr = n! / ((n-r)! * r!), where again n is the number of items in the set and r is the number of items being taken from it. So for our previous example with A, B, C and D the formula gives 4C3 = (1*2*3*4) / (1 * (1*2*3)) = 4.
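Again this is simple to verify in Python (the helper name `combinations` is mine; the built-in `math.comb` has been available since Python 3.8):

```python
from math import comb, factorial

# nCr = n! / ((n - r)! * r!)
def combinations(n, r):
    return factorial(n) // (factorial(n - r) * factorial(r))

print(combinations(4, 3))  # -> 4, matching the worked example
print(comb(4, 3))          # -> 4, the built-in agrees

# each combination of r items can be ordered in r! ways,
# which is exactly why nPr = nCr * r!
print(combinations(4, 3) * factorial(3))  # -> 24, the permutation count from earlier
```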

Over the next few blog posts I will explain a few more of the basics of statistics, as users of SPSS need to understand them. Although SPSS does a lot of the work for you, the more of a grounding you have in basic statistics the simpler and quicker the software will be to use.


Put simply, regression analysis looks for a line of best fit through a series of data points. This shows the relationship between the two variables very clearly and also allows you to predict new values.

Given that this method uses all data points it is worth remembering that it is not robust to outliers, so some care needs to be taken with the dataset being used. It is also worth remembering that it assumes the residuals (the distances of the points from the fitted line) are normally distributed.

If you have a dataset then linear regression seeks to find the equation of the form Y = A + BX that most closely fits the data, where Y is the value on the vertical axis, A is the point at which the line crosses the Y axis and B defines the slope of the line. So if B is 2 then the line will be steep (more than 45 degrees), because for every one unit that X increases, Y increases by 2.

Firstly you need to calculate the 5 key pieces of data for your dataset:

1. The mean of your x values (x̄)

2. The mean of your y values (ȳ)

3. The standard deviation of the x values (sx)

4. The standard deviation of the y values (sy)

5. The correlation between your x values and your y values (r)

Once you have this data then getting your line of best fit is very simple.

The slope of your line (B) is calculated as B = r * (sy / sx), using the correlation and the two standard deviations from the list above.

You now need to figure out where the line crosses the Y axis. This you can calculate by using the formula A = ȳ - B * x̄, which works because the line of best fit always passes through the point (x̄, ȳ).

So once you have completed these steps you now have your equation in the form of Y = A + BX.
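As a sanity check of the arithmetic, the steps above can be sketched in a few lines of Python (the data values are invented purely for illustration and the variable names are mine, not SPSS's):

```python
from statistics import mean, pstdev

xs = [1, 2, 3, 4, 5]               # illustrative data only
ys = [2.1, 4.2, 5.9, 8.1, 10.0]

x_bar, y_bar = mean(xs), mean(ys)  # steps 1 and 2: the means
sx, sy = pstdev(xs), pstdev(ys)    # steps 3 and 4: the standard deviations

# step 5: the correlation, calculated from its definition
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

B = r * sy / sx        # the slope
A = y_bar - B * x_bar  # the intercept: the line passes through (x_bar, y_bar)

print(f"Y = {A:.3f} + {B:.3f}X")  # -> Y = 0.150 + 1.970X
```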

Obviously SPSS will allow you to do all of these calculations very quickly. You simply choose ANALYSE from the menu bar and then click REGRESSION > LINEAR. This brings up a dialogue box asking you to define which variable is dependent and which independent, and SPSS will then calculate the values in your formula. These can be found in the 'Unstandardised Coefficients' table, under the first column, which is labelled 'B'.

The constant figure in this column is A, so where the line crosses the Y axis. Beneath that you will find the name of the independent variable which gives you B, the slope of the line.

So there you have it, an introduction to linear regression and how to calculate one in SPSS.


Normal distributions are enormously useful, and used very frequently. The normal is a continuous probability distribution and is used a lot in scientific research. The key idea that makes it so useful is something called the Central Limit Theorem. This states that the mean of a sample of variables drawn at random from the same distribution will be approximately normally distributed, regardless of the distribution of the underlying variable. The expected value of the sample mean is the same as the mean of the underlying population, while the variance of the sample mean is equal to the population variance divided by the sample size. This approximation improves as the sample size gets larger.
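A quick simulation makes this concrete. The sketch below (using only Python's standard library; the uniform distribution and the sample sizes are arbitrary choices of mine) draws samples from a distinctly non-normal distribution and shows that the sample means still behave as the theorem predicts:

```python
import random
from statistics import mean, pvariance

random.seed(42)

# underlying variable: uniform on [0, 1], which is not remotely normal
# its population mean is 0.5 and its population variance is 1/12
n = 30           # size of each sample
trials = 10_000  # number of sample means to collect

sample_means = [mean(random.random() for _ in range(n)) for _ in range(trials)]

print(round(mean(sample_means), 3))       # close to 0.5, the population mean
print(round(pvariance(sample_means), 5))  # close to (1/12) / 30, about 0.00278
```

A histogram of `sample_means` would also look clearly bell-shaped, even though the underlying uniform distribution is flat.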

So the next question is - why is this so useful? The most useful thing about it is that it allows us to test hypotheses about data without knowing the underlying distribution of that data.

The second reason it's so useful is that normal distributions are everywhere, mainly because so many quantities in nature are not driven by just one factor but are themselves the sum of many independent influences. Individual heights, for example, are the sum of multiple different, and sometimes opposing, factors such as genes and diet.

While normal distributions are very useful, the key 'gotcha' to look out for is how likely you are to get an outlier. Under a normal distribution, results far from the mean (many multiples of the standard deviation away) are exceedingly unlikely, so if you expect distant outliers with any regularity at all it is probably the wrong distribution to use.

Hopefully that adds a little more detail to your understanding of Normal or Gaussian distributions and how they are used.

Standard Deviation is a very useful concept if used with care and so I'm going to write a couple more blogs to help you understand more about how it is used in practice. The first one is how to interpret a standard deviation figure.

Probably the most useful way of interpreting a standard deviation is this: assuming that your underlying data is sampled from a Gaussian (for these purposes, normal) distribution, you expect approximately 68% of the values to lie within one standard deviation of the mean, and approximately 95% to lie within 2 standard deviations of the mean. By extension you can assume that roughly 27% of your population will lie between one and 2 standard deviations from the mean, and 5% will lie more than 2 standard deviations from it.

So for example, imagine a distribution where the mean is 20 and the standard deviation is 5.

That means that the range within one standard deviation of the mean is 20 plus or minus 5, so between 15 and 25, and the range within 2 standard deviations of the mean is 10 to 30.

Therefore if you then take another reading the odds are 19 in 20 (95%) that the reading will be between 10 and 30.

Similarly, this also helps you understand why a lower standard deviation implies a 'tighter', more concentrated distribution. If the standard deviation of the above distribution were 2.5 rather than 5, then 95% of the values of the distribution would lie between 20 plus or minus (2 * 2.5), so 15 to 25. This is the same range within which only 68% of the values of our original distribution would lie.
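You can check these percentages with a short simulation (a sketch using Python's standard library, reusing the mean of 20 and standard deviation of 5 from the example above):

```python
import random

random.seed(1)
mu, sigma = 20, 5
data = [random.gauss(mu, sigma) for _ in range(100_000)]

# proportion of readings within 1 and within 2 standard deviations of the mean
within_1sd = sum(mu - sigma <= x <= mu + sigma for x in data) / len(data)
within_2sd = sum(mu - 2 * sigma <= x <= mu + 2 * sigma for x in data) / len(data)

print(round(within_1sd, 2))  # approximately 0.68 (the 15 to 25 range)
print(round(within_2sd, 2))  # approximately 0.95 (the 10 to 30 range)
```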

This can also be shown pictorially; the illustrations at the bottom of this post demonstrate the point.

Don't worry, I will write a follow-up blog post to give you some more detail on Gaussian distributions, but for these purposes we are assuming that your data is sampled from a normal distribution.

That's it for this week's SPSS blog.


Today's post is another beginner's tip for people new to using SPSS: what is the standard deviation of a dataset, and how do I use SPSS to calculate it?

Standard deviation is a measure of how widely dispersed a dataset is. It is a fairer and more comprehensive way of describing a dataset than just using a simple mean, median or mode, because it describes how widely the data is dispersed around its mean. This of course means that in order to be really useful, you also need to know the units that your standard deviation is in and the mean of the dataset that it refers to. On its own a standard deviation figure is unlikely to be very useful. A low standard deviation implies a tightly clustered dataset, and conversely a large standard deviation implies a widely dispersed one.

It is useful to know how standard deviation is calculated as well so here goes.

It is the square root of the mean of the squared differences of each value in the dataset from the dataset's mean. So to calculate it you take each piece of data's difference from the mean, square it, and sum those squares. You then divide that sum by the number of pieces of data to get the mean squared difference, and take the square root of the result.

It is probably most easily illustrated by example. Imagine a dataset of 5 items: 9, 8, 7, 6, 5.

The mean of this data is 7, so the squared differences from the mean are 4 (9-7)^2, 1 (8-7)^2, 0, 1 and 4.

So the sum of the square of the differences is 10. There are 5 items in the data set and so the mean of this figure is 2 (10/5), and the square root of it is 1.414.

So for our very simple dataset the mean is 7 and the standard deviation is 1.414. Obviously it can be far more laborious to calculate for larger and more complex datasets.
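One thing worth knowing: the worked example above divides by the number of items, n, which gives the population standard deviation. SPSS's Std. Deviation output divides by n - 1 instead (the sample standard deviation), so it will report a slightly larger figure for the same data. Python's `statistics` module offers both, which makes the difference easy to see:

```python
from statistics import mean, pstdev, stdev

data = [9, 8, 7, 6, 5]

print(mean(data))              # -> 7
print(round(pstdev(data), 3))  # population SD, divide by n      -> 1.414
print(round(stdev(data), 3))   # sample SD, divide by n - 1      -> 1.581
```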

Calculating the standard deviation of a dataset in SPSS is a very simple process. Select your variables, click STATISTICS, tick Standard Deviation as well as Mean and click CONTINUE. SPSS will then very quickly and simply calculate the mean and standard deviation of your data.

I will post more about standard deviation in my next post, as it is an important concept in statistics and so for anyone using SPSS.

You simply choose ANALYSE > DESCRIPTIVE STATISTICS > EXPLORE. You will then need to ensure that the dependent list contains the quantity that you are measuring and describing, and that the factor list contains the factor or quantity that you are exploring it by.

So for example the quantity might be people's ages and in the factor listing we would put where they come from.

SPSS will then calculate the mean and median of your data set for you.

Both the mean and the median are types of average. The mean is based on all of the values within the data and so will include outliers; in small datasets significant outliers can have a significant impact. It is calculated as the sum of all of the values in the dataset divided by the number of items in the dataset.

So for example imagine our dataset is 1, 11, 12, 13, 14. The sum of these is 51, which divided by 5 gives a mean value of 10.2. As this is lower than 80% of the values in our dataset (chosen to illustrate the point!), the outlier has made the mean a less representative statistic.

The median, however, is the middle number in an ordered dataset. So imagine our dataset was actually 11, 12, 1, 14, 13: we would first order the data into 1, 11, 12, 13, 14 and then choose the middle number, so 12 is our median. As you can see, where you have an outlier in the data the median is a far more representative figure, as it removes the influence of the outlier.

The median is easy to calculate for datasets with an odd number of pieces of data. For datasets with an even number, we take the mean of the middle 2 pieces of data. So assuming our dataset had an additional piece of data in it, 15, the middle 2 pieces of data would be 12 and 13, and taking their mean gives a median of 12.5.
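The examples above can be reproduced with Python's `statistics` module, which makes the outlier's pull on the mean easy to see:

```python
from statistics import mean, median

data = [1, 11, 12, 13, 14]

print(mean(data))    # -> 10.2, dragged down by the outlier 1
print(median(data))  # -> 12, unaffected by the outlier

# with an even number of items the median is the mean of the middle two
print(median([1, 11, 12, 13, 14, 15]))  # -> 12.5
```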

When importing or entering data the rule to remember is as follows: all the information about one thing goes in one row. To expand that a little, information about different things goes in different rows (and the same column), whereas information about the same thing goes in different columns (and so the same row).

So if you had a list of people's names and their heights, weights and ages, then in the first column you would put the first person's name, in the second their height, in the third their weight and in the fourth their age. So all information about the same person goes in the same row but in different columns.
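The same rule applies however you hold the data before it reaches SPSS. This Python sketch (the names and measurements are invented for illustration) builds exactly that layout, one row per person, as a CSV that SPSS could import:

```python
import csv
import io

# one row (record) per person, one column per piece of information about them
people = [
    {"name": "Alice", "height_cm": 165, "weight_kg": 60, "age": 34},
    {"name": "Bob",   "height_cm": 180, "weight_kg": 82, "age": 41},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "height_cm", "weight_kg", "age"])
writer.writeheader()
writer.writerows(people)

print(buf.getvalue())
```

Everything about Alice sits in one row, while heights for different people share one column.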

Obviously there are exceptions to this rule when you get to using SPSS and you have a variable that defines a group of things but if you're not absolutely sure you should always stick with the rule above.


In this post I will go through each of the column titles to explain a little more about what each column is for. Remember that each row in the Variable View window describes one variable, in other words one column of data in the Data View, and each column in Variable View holds a property of that variable.

If you click on the Type box at the top of the column this will open a dialogue box that will allow you to quickly define the data type.

Hopefully you'll find that helpful next time you're working through a Variable View in SPSS.
