CADDIS Volume 4: Data Analysis
Exploratory Data Analysis
Authors: G.W. Suter II, P. Shaw-Allen, S.M. Cormier
Correlation analysis is a method for measuring the degree to which two random variables in a matched data set vary together. This covariation is usually expressed as a correlation coefficient for two variables X and Y: a unitless number that ranges from -1 to +1. The magnitude of the coefficient is the standardized degree of association between X and Y, and the sign is the direction of the association, which can be positive or negative.
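As an illustration of what "standardized" means here, the most familiar coefficient (Pearson's r, described next) can be written as the sample covariance divided by the product of the two sample standard deviations. The symbols x_i, y_i, their means, s_X, s_Y, and n are notation assumed for this sketch; they do not appear in the original text.

```latex
r \;=\; \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}
  \;=\; \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
             {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
              \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Because the numerator and denominator carry the same units, the ratio is unitless, and it cannot fall outside the interval from -1 to +1.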
Pearson's product-moment correlation coefficient (r) measures the degree of linear association between two variables. Spearman's rank-order correlation coefficient (ρ) uses the ranks of the data, and can provide a more robust estimate of the degree to which two variables are monotonically associated. Kendall's tau (τ) has the same underlying assumptions as Spearman's ρ, but is based on the difference between the probability that a pair of observations is ordered the same way on both variables and the probability that it is ordered in opposite ways.
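The three coefficients can be sketched in plain Python as follows. This is a minimal illustration, not the method used by CADStat or any particular package: the function names and the sample values (`conductivity`, `taxa_richness`) are hypothetical, and the Kendall function is the simple tau-a form, which applies no correction for ties.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of the data."""
    return pearson_r(average_ranks(x), average_ranks(y))

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return score / (n * (n - 1) / 2)

# Hypothetical matched observations from eight sites (illustrative only).
conductivity = [110, 150, 210, 260, 340, 420, 510, 900]
taxa_richness = [24, 22, 23, 18, 15, 16, 11, 9]

print("Pearson r   :", round(pearson_r(conductivity, taxa_richness), 3))
print("Spearman rho:", round(spearman_rho(conductivity, taxa_richness), 3))
print("Kendall tau :", round(kendall_tau_a(conductivity, taxa_richness), 3))
```

All three values are negative for these data, because richness tends to fall as conductivity rises. The rank-based coefficients are unaffected by how far the largest conductivity value lies from the rest, which is one sense in which they are more robust.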
A value of r, ρ, or τ is interpreted as follows:
- A coefficient of 0 indicates no linear (r) or monotonic (ρ, τ) association between the variables (Figure 1, left). It does not by itself show that the variables are unrelated.
- A negative coefficient indicates that as one variable increases, the other decreases (Figure 1, center).
- A positive coefficient indicates that as one variable increases the other also increases (Figure 1, right).
- Larger absolute values of coefficients indicate stronger associations (e.g., Figure 1, right and center). However, small Pearson coefficients may be due to a nonlinear relationship (Figure 2).
Examples of different behaviors of Pearson's and Spearman's correlations are shown in Figure 2. Pearson's r does not accurately represent the strength of the nonlinear association in Figure 2 (left plot). Pearson's r and Spearman's ρ provide different estimates of correlation depending upon the distribution of the data (Figure 2, right plot).
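The contrast between the two coefficients can be reproduced with synthetic data. The sketch below is not the data behind Figure 2; the variables `y_monotonic` and `y_hump` are invented for illustration, and the simple `ranks` helper assumes there are no tied values.

```python
from math import exp, sqrt

def pearson_r(x, y):
    """Pearson's r: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n; assumes no tied values, which holds where it is used below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

x = [0.5 * i for i in range(1, 13)]       # 0.5, 1.0, ..., 6.0

# Strictly increasing but strongly curved: a perfect monotonic relationship.
y_monotonic = [exp(v) for v in x]

# Rises then falls, symmetric about the mean of x: clearly related, not monotonic.
y_hump = [-(v - 3.25) ** 2 for v in x]

r_mono = pearson_r(x, y_monotonic)
rho_mono = pearson_r(ranks(x), ranks(y_monotonic))  # Spearman's rho
r_hump = pearson_r(x, y_hump)

print("curved, monotonic: Pearson r =", round(r_mono, 3),
      " Spearman rho =", round(rho_mono, 3))
print("hump-shaped:       Pearson r =", round(r_hump, 3))
```

For the curved series, Spearman's ρ is at its maximum because the ordering is perfectly preserved, while Pearson's r is noticeably lower. For the hump-shaped series, Pearson's r is essentially zero even though y is an exact function of x, which is why a scatterplot should accompany any coefficient.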
How do I calculate correlations?
A tool for calculating correlations is available in CADStat, and almost any spreadsheet or statistical software package will compute them as well.
Correlation analysis is used primarily as a data exploration technique to reveal the degree of association in a set of matched data. This information can inform subsequent analyses of relationships between variables. In particular, correlation can indicate possible factors that confound a relationship of interest. For many data sets, pairwise correlations alone do not provide enough insight, and multivariate exploratory analyses are recommended.
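One way to use correlation for this kind of screening is to compute the coefficient for every pair of variables and list the pairs from strongest to weakest. The sketch below assumes a small, invented data set (the variable names and values in `data` are hypothetical) and uses Pearson's r; a rank-based coefficient could be substituted in the same loop.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical matched measurements from eight sites (illustrative only).
data = {
    "conductivity":  [110, 150, 210, 260, 340, 420, 510, 900],
    "percent_urban": [2, 5, 9, 8, 20, 25, 31, 60],
    "taxa_richness": [24, 22, 23, 18, 15, 16, 11, 9],
}

# One coefficient per unordered pair of variables.
pairs = {
    (a, b): pearson_r(data[a], data[b])
    for a, b in combinations(data, 2)
}

# List the pairs from strongest to weakest absolute association.
for (a, b), r in sorted(pairs.items(), key=lambda item: -abs(item[1])):
    print(f"{a:>15} vs {b:<15} r = {r:+.2f}")
```

If two candidate causes (here, conductivity and percent urban land cover) are strongly correlated with each other as well as with the response, their effects cannot be separated by pairwise correlation alone, which is a signal to consider possible confounding in later analyses.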