Correlation: Introduction

Correlation addresses the relationship between two different factors (variables). The statistic is called a correlation coefficient. A correlation coefficient can be calculated when there are two (or more) sets of scores for the same individuals or matched groups.

A correlation coefficient describes direction (positive or negative) and degree (strength) of relationship between two variables. The farther the coefficient is from zero, in either direction, the stronger the relationship. The coefficient is also used to obtain a p value indicating whether the degree of relationship is greater than expected by chance. For correlation, the null hypothesis is that the correlation coefficient = 0.
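As a minimal sketch of how such a coefficient can be computed from paired scores, the Python below uses the definitional form of the Pearson formula. The function name pearson_r and the study-time data are made up for illustration and do not come from this module.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired scores x[i], y[i]."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations, and sums of squared deviations
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical data: hours studied (X) and exam grade (Y) for 6 students
hours = [1, 2, 3, 4, 5, 6]
grade = [55, 60, 70, 68, 80, 85]

r = pearson_r(hours, grade)
print(round(r, 2))  # 0.97
```

Each position in the two lists must belong to the same individual; the coefficient is meaningless if the pairs are not matched.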

Examples: Is there a relationship between family income and scores on the SAT? Does amount of time spent studying predict exam grade? How does alcohol intake affect reaction time?

Raw data sheet

The notation X is used for scores on the independent (predictor) variable. Y is used for the scores on the outcome (dependent) variable.

Subject    Variable 1    Variable 2
1          X1            Y1
2          X2            Y2
3          X3            Y3
4          X4            Y4
5          X5            Y5
6          X6            Y6
7          X7            Y7
8          X8            Y8

X = score on the 1st variable (predictor)
Y = score on the 2nd variable (outcome)

Contrast/comparison versus Correlation

The modules on descriptive and inferential statistics describe contrasting groups -- Do samples differ on some outcome? ANOVA analyzes central tendency and variability along an outcome variable. Chi-square compares observed with expected outcomes. ANOVA and Chi-square compare different subjects (or the same subjects over time) on the same outcome variable.
Correlation looks at the relative position of the same subjects on different variables.

Interpreting correlation coefficients

Correlation can be positive or negative, depending upon the direction of the relationship. If both factors increase and decrease together, the relationship is positive. If one factor increases as the other decreases, then the relationship is negative. It is still a predictable relationship, but an inverse one: the factors change in opposite directions rather than in the same direction. Plotting a relationship on a graph (called a scatterplot) provides a picture of the relationship between two factors (variables).

A correlation coefficient can vary from -1.00 to +1.00. The closer the coefficient is to zero (from either the + or the - side), the weaker the relationship. The sign indicates the direction of the relationship: plus (+) = positive, minus (-) = negative. Take a look at the correlation coefficients (on the graph itself) for the 3 examples from the scatterplot tutorial.
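The sign and range of the coefficient can be illustrated with three tiny data sets. This is only a sketch: the helper pearson_r and all of the scores are invented for the illustration, chosen so that the three cases land at +1, -1, and 0.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

x = [1, 2, 3, 4, 5]
rises_with_x = [2, 4, 6, 8, 10]   # increases as x increases
falls_with_x = [10, 8, 6, 4, 2]   # decreases as x increases
unrelated = [3, 1, 4, 1, 3]       # no linear trend with x

r_pos = pearson_r(x, rises_with_x)    # perfect positive relationship, +1.00
r_neg = pearson_r(x, falls_with_x)    # perfect negative relationship, -1.00
r_zero = pearson_r(x, unrelated)      # no linear relationship, 0.00

for label, r in [("positive", r_pos), ("negative", r_neg), ("none", r_zero)]:
    print(label, round(r, 2))
```

Real data almost never reach the extremes; coefficients usually fall somewhere between these three reference points.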

Correlations as low as .14 are statistically significant (at the .05 level, two-tailed) in large samples (e.g., 200 cases or more).
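One way to see why sample size matters is to convert r to a t statistic with n - 2 degrees of freedom and compare it with a critical value. The sketch below assumes that standard test; the function name t_for_r is hypothetical, and the two critical values are approximate figures of the kind found in a t table, not values given in this module.

```python
import math

def t_for_r(r, n):
    """t statistic for testing the null hypothesis that the correlation is 0."""
    df = n - 2  # degrees of freedom for a bivariate correlation
    return r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Approximate two-tailed .05 critical values, keyed by degrees of freedom
# (assumed here; look them up in a t table for other sample sizes)
CRITICAL_T = {48: 2.011, 198: 1.972}

for n in (50, 200):
    t = t_for_r(0.14, n)
    verdict = "significant" if abs(t) > CRITICAL_T[n - 2] else "not significant"
    print(n, round(t, 2), verdict)
```

With these assumptions, the same r = .14 falls short of significance with 50 cases but passes with 200.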

The important point to remember in correlation is that we cannot make any assumption about cause. The fact that 2 variables co-vary in either a positive or negative direction does not mean that one is causing the other. Remember the 3 criteria for cause-and-effect and the third variable problem.
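The third variable problem can be demonstrated with simulated scores. In the sketch below, a hidden factor z feeds into both x and y, while x and y never influence each other. All names, the sample size, and the random seed are arbitrary choices made for this illustration.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient for paired scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

rng = random.Random(0)  # fixed seed so the simulation is repeatable
n = 2000

z = [rng.gauss(0, 1) for _ in range(n)]         # the hidden third variable
noise_x = [rng.gauss(0, 1) for _ in range(n)]   # part of x unrelated to z
noise_y = [rng.gauss(0, 1) for _ in range(n)]   # part of y unrelated to z

# x and y each depend on z, but neither depends on the other
x = [zi + e for zi, e in zip(z, noise_x)]
y = [zi + e for zi, e in zip(z, noise_y)]

r_xy = pearson_r(x, y)                # clearly positive (about .5 by construction)
r_rest = pearson_r(noise_x, noise_y)  # near zero once z's contribution is removed

print(round(r_xy, 2), round(r_rest, 2))
```

The two variables co-vary only because both follow z. In real data the third variable is usually unmeasured, which is why a correlation alone cannot establish cause.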

Two formulas

There are two different formulas to use in calculating correlation. For normal distributions, use the Pearson Product-moment Coefficient (r). When the data are ranks (1st, 2nd, 3rd, etc.), use the Spearman Rank-order Coefficient (rs). More details are provided in the next two sections.
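The difference between the two coefficients can be sketched in Python. Here the Spearman coefficient is computed as the Pearson coefficient applied to ranks, which is one standard way of defining it; the function names and the data are illustrative assumptions, and the hand-calculation formulas in the following sections may look different.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

def average_ranks(scores):
    """Rank scores from 1 (lowest) upward; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every score tied with the one at position i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman rank-order coefficient: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical scores: y always rises with x, but not in a straight line
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

print(round(pearson_r(x, y), 2))    # 0.98
print(round(spearman_rs(x, y), 2))  # 1.0
```

Because the rank order of x and y is identical, rs is exactly 1, while r is slightly lower because the points do not fall on a straight line.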

Bivariate and multiple regression

The correlation procedure discussed thus far is called bivariate correlation. That is because there are two factors (variables) involved -- bi = 2. The term regression refers to the best-fitting straight line drawn through the data points on the scatterplot. You saw that in the tutorial. The formula for correlation calculates how close the data points are to the regression line.
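The link between the regression line and the correlation coefficient can be sketched as follows. The code fits a least-squares line and then checks that the squared correlation equals the proportion of variation in Y that the line accounts for. The function name fit_line and the data are hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares regression line y = intercept + slope * x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    slope = sp / ss_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied (X) and exam grade (Y)
x = [1, 2, 3, 4, 5, 6]
y = [55, 60, 70, 68, 80, 85]

slope, intercept = fit_line(x, y)
predicted = [intercept + slope * a for a in x]

mean_y = sum(y) / len(y)
ss_total = sum((b - mean_y) ** 2 for b in y)                  # spread of Y around its mean
ss_residual = sum((b - p) ** 2 for b, p in zip(y, predicted))  # spread of Y around the line

# Pearson r from the same sums of squares
mean_x = sum(x) / len(x)
sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
ss_x = sum((a - mean_x) ** 2 for a in x)
r = sp / math.sqrt(ss_x * ss_total)

# The closer the points sit to the line, the smaller ss_residual and the larger r squared
print(round(r ** 2, 3), round(1 - ss_residual / ss_total, 3))
```

The two printed values agree: r squared is the share of the variation in Y that the regression line accounts for.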

Multiple regression extends correlation to more than 2 factors: several predictor variables are used together to predict one outcome. The concept is fairly simple, but the calculation is not, and requires use of a computer program (or many hours of hand calculation).
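For the simplest case of exactly two predictors, the squared multiple correlation can be obtained from the three pairwise correlation coefficients. The sketch below uses that two-predictor formula only; with more predictors a statistics package is the practical route. The function names and the data are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

def multiple_r_squared(x1, x2, y):
    """Squared multiple correlation of y with two predictors x1 and x2."""
    r_y1 = pearson_r(x1, y)
    r_y2 = pearson_r(x2, y)
    r_12 = pearson_r(x1, x2)  # must not be +1 or -1, or the formula divides by zero
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

# Hypothetical scores for 5 subjects: two predictors and one outcome
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [8.1, 6.9, 18.2, 16.8, 28.0]

big_r2 = multiple_r_squared(x1, x2, y)
print(round(pearson_r(x1, y) ** 2, 3))  # outcome explained by x1 alone
print(round(pearson_r(x2, y) ** 2, 3))  # outcome explained by x2 alone
print(round(big_r2, 3))                 # outcome explained by both together
```

Using both predictors together can never explain less of the outcome than the better single predictor does on its own.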