CADDIS Volume 4: Data Analysis
Basic Principles & Issues
Topics in Autocorrelation: Details
- Dependence and correlation concepts
- Methods of correlated data analysis
- Spatial regression
- Pseudoreplication and analysis of clustered data
- More information
- Authors: M. McManus, D. Farrar,
Autocorrelation and Data Independence: Details
Analysis of environmental data supporting causal assessments can be improved by taking dependencies among measurements into account. Indeed, failure to do so may exaggerate the information content in the data.
We are concerned with a problem of dependence, as the term is defined in probability theory. In essence, random variables X and Y are independent if knowing the value of one would not help us to improve our guessing of the other. More formally, the conditional distribution of Y (say), obtained by fixing the value of X, is equal to the marginal distribution of Y (the distribution ignoring X).
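The formal definition above can be written compactly. The density notation $f$ is an assumption introduced here for illustration and is not used elsewhere on this page:

```latex
% X and Y are independent when conditioning on X leaves the distribution of Y unchanged
f_{Y \mid X}(y \mid x) = f_Y(y) \quad \text{for all } x,
\qquad \text{equivalently} \qquad
f_{X,Y}(x, y) = f_X(x)\, f_Y(y).
```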
Here the distinction between correlation and dependence is relaxed, and the term "autocorrelation" is used to refer to lack of independence among the values of a variable at different points in space or time. Independence implies zero correlation, but zero correlation does not imply independence. However, for variables that jointly have a multivariate normal distribution, independence is equivalent to zero (Pearson) correlation.
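A small numerical illustration of the point that zero correlation does not imply independence: below, Y is completely determined by X, yet the Pearson correlation is zero. The data values and the function name `pearson` are hypothetical, chosen only for this sketch.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# X is symmetric about zero and Y = X^2, so Y depends entirely on X.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

r = pearson(xs, ys)  # the linear association cancels out by symmetry
print(f"Pearson correlation between X and X^2: {r:.3f}")
```

Knowing X here tells us Y exactly, so the variables are strongly dependent even though the correlation coefficient detects no linear relationship.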
Many spatial methods for correlated data analysis assume multivariate normality, so that dependence is accounted for by Pearson correlation. An assumption often made, in order to incorporate autocorrelation into a statistical model for a continuous variable, is that correlation between results for two measurement locations diminishes as some smooth function of distance separating the locations. An exponentially decreasing function is a common default assumption. For clustered data, observations may be assumed correlated within clusters but uncorrelated between clusters.
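As a minimal sketch of the exponential default mentioned above, the correlation between two locations can be taken as exp(-d / range), where d is the separating distance. The site coordinates, the range value, and the function names are assumptions made for illustration.

```python
import math

def exponential_correlation(distance, range_param):
    """Correlation that decays smoothly from 1 toward 0 as distance grows."""
    return math.exp(-distance / range_param)

def correlation_matrix(coords, range_param):
    """Pairwise correlation matrix for a list of (x, y) site coordinates."""
    return [
        [exponential_correlation(math.dist(a, b), range_param) for b in coords]
        for a in coords
    ]

# Three hypothetical sampling sites; the third is 5 distance units from the first.
sites = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
corr = correlation_matrix(sites, range_param=2.0)

for row in corr:
    print(["%.3f" % value for value in row])
```

Nearby sites receive a correlation close to 1 and distant sites a correlation close to 0; the range parameter controls how quickly the decay occurs and would normally be estimated from data.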
Correlated data methods effectively replace the independence assumption with an assumption that data are dependent in specific ways. The topic of autocorrelation is usually considered in basic texts on environmental statistics (e.g., Gilbert 1987, Ramsey and Schafer 2002). For example, for clustered data we may assume that there are distinct correlations for units within the same cluster versus units in different clusters. A common assumption in modeling of spatial and temporal autocorrelation is that correlation decreases smoothly as a function of the distance separating measurement points in space or time. For the most part, correlated data methods are parametric.
A typology that accommodates a substantial part (but not all) of available resources is displayed below. An advanced application to environmental data analysis may have elements of multiple methodologies.
|Type of Dependence|Methodology|Example Tools|Example References|
|---|---|---|---|
|Clustering| |linear & generalized linear mixed models, generalized estimating equations (GEE)|Schabenberger and Pierce 2002, Qian 2010|
|Temporal autocorrelation|Classical time-series analysis|autocorrelation function (acf), partial acf (pacf), ARIMA models|Chatfield 1975|
| |Longitudinal data analysis|(as for clustered data)| |
|Spatial autocorrelation|Classical geostatistics|variogram, kriging, conditional simulation|Schabenberger and Pierce 2002, Waller and Gotway 2004|
| |Likelihood and Bayesian methods|(extensions of classical methods)|(Generally advanced)|
|Spatial & temporal correlation| |(Generally advanced)| |
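The autocorrelation function (acf) listed among the time-series tools in the table can be sketched in a few lines. This uses one common estimator (lagged cross-products of deviations divided by the lag-zero sum of squares); the series and the function name `sample_acf` are hypothetical.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag for an evenly spaced series."""
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    c0 = sum(d * d for d in dev)  # lag-zero sum of squares
    return [
        sum(dev[t] * dev[t + lag] for t in range(n - lag)) / c0
        for lag in range(max_lag + 1)
    ]

# A short, steadily increasing series: neighboring values are similar.
series = [1.0, 2.0, 3.0, 4.0, 5.0]
acf = sample_acf(series, max_lag=2)
print([round(value, 3) for value in acf])
```

Values near zero at all positive lags are consistent with independence over time, while slowly decaying positive values suggest temporal autocorrelation that the analysis should accommodate.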
In the causal assessment context, interest is in relating a response variable (e.g., measure of ecological quality) to some stressor variables. Autocorrelation of the response variable may be due, in some measure, to effects of stressor variables that are themselves autocorrelated. Indeed, some reflection may suggest that autocorrelation is generally due to mis-specification of the model: A response variable may be affected by environmental variables that are not included in the model and that vary over space or time.
A possible implication is that, for purposes of developing an empirical model, the autocorrelation of interest may be autocorrelation not accounted for by stressor variables. In a regression context, we may be more interested in applying methods such as variograms to residuals from a regression model, than in applying such a method to the raw response variable. The application to regression residuals (from a non-spatial regression approach) may help us to decide whether a model is needed with specific autocorrelation features, in addition to stressor effects.
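The idea of applying a variogram to regression residuals can be sketched as follows. The empirical semivariogram for a distance bin is half the average squared difference between residuals for all pairs of sites whose separation falls in that bin. The coordinates, residual values, and bin width are hypothetical, and the residuals are assumed to come from a previously fitted non-spatial regression.

```python
import math
from collections import defaultdict

def empirical_variogram(coords, values, bin_width):
    """Return {bin index: semivariance}; bin b covers [b*w, (b+1)*w)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])
            b = int(h // bin_width)
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    return {b: sums[b] / (2 * counts[b]) for b in sorted(counts)}

# Four hypothetical sites along a transect, with residuals from a
# non-spatial regression (assumed already computed).
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
residuals = [0.5, -0.5, 0.5, -0.5]

gamma = empirical_variogram(coords, residuals, bin_width=1.0)
print(gamma)
```

A semivariance that rises with distance before leveling off suggests residual spatial autocorrelation, whereas a flat pattern suggests that the stressor variables already account for the spatial structure.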
A second possible implication is that stressor variables that have not been measured (but which are manifested via residual autocorrelation) might be accounted for by using a model that incorporates autocorrelation. This role would be analogous to the role of blocking in a randomized blocks design. (The blocks are not generally designed with specific confounding variables in mind. They can serve to control variables that vary between blocks.)
If it is decided that autocorrelation needs to be incorporated into the data analysis, special methods of "spatial regression" may be applied. In place of the independence assumption of basic regression methods, we approach autocorrelation via specific assumptions about the nature of the correlation. Additional parameters that quantify autocorrelation may be estimated along with the familiar regression parameters (regression coefficients and residual variance). According to one approach, the estimates of the additional parameters are calculated by fitting a curve to the variogram. These parameter estimates are then substituted into weighted regression computations. Alternatively, the variogram may be treated only as an exploratory graphical tool, and estimates of all parameters may be optimized in a single step, e.g., by optimizing a restricted likelihood function (as in SAS Proc Mixed).
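The weighted regression step can be sketched as generalized least squares (GLS), in which the covariance matrix V built from the autocorrelation parameters replaces the identity matrix of ordinary regression. This is a minimal illustration, not the procedure of any particular package: the sill and range values stand in for parameters assumed to have been estimated already (for example, from a fitted variogram curve), and all names and data are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def gls(design, y, cov):
    """GLS coefficients: (X' V^-1 X)^-1 X' V^-1 y."""
    p = len(design[0])
    cols = [[row[j] for row in design] for j in range(p)]
    vinv_cols = [solve(cov, col) for col in cols]  # V^-1 applied to each column of X
    vinv_y = solve(cov, y)
    xtvx = [[sum(a * b for a, b in zip(cols[i], vinv_cols[j])) for j in range(p)]
            for i in range(p)]
    xtvy = [sum(a * b for a, b in zip(cols[i], vinv_y)) for i in range(p)]
    return solve(xtvx, xtvy)

# Hypothetical transect: stressor x at five evenly spaced sites.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 + 3.0 * xi for xi in x]      # noise-free response, for checking only
design = [[1.0, xi] for xi in x]      # intercept and slope columns

# Assumed covariance parameters (in practice, estimated from the data).
sill, range_param = 1.0, 2.0
cov = [[sill * math.exp(-abs(i - j) / range_param) for j in range(5)]
       for i in range(5)]

beta = gls(design, y, cov)
print([round(b, 6) for b in beta])
```

With noise-free data the coefficients are recovered whatever V is; with real data, the choice of V changes both the estimates and, more importantly, their standard errors.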
The term pseudoreplication is generally familiar to ecologists, and seems useful for conveying important issues. The definition given by Hurlbert (1984) is "use of inferential statistics to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or replicates are not statistically independent." While this alludes specifically to experiments, we think the term is used more widely, particularly in the general context of spatially clustered data. In a randomized comparative experiment, one may distinguish between "observational units" and "experimental units."
For example, in a developmental toxicity study with multiple pups per litter, the pups cannot be randomized to different dose levels when the exposure is prenatal, via exposure of the mother. The litter is taken to be the experimental unit, and treatment effects are evaluated relative to variation among litters. Individual pups, which cannot be randomized to different dose levels, are considered to be observational units. Analogous distinctions may apply with probability surveys and observational studies, for example in a study of lakes with each lake sampled at multiple locations.
A simple approach for evaluating clustered data is to ignore cluster sizes, reduce each cluster to a single summary statistic, and analyze the set of cluster summaries as independent data. Alternatively, a random-effects model (multilevel or mixed-effects) with distinct variance terms for within- and among-cluster variation (within- and among-litter variation in the example above) can be used, effectively weighting the cluster means using the cluster sizes. For normal variables, this is equivalent to assuming distinct correlations for within- and among-cluster differences.
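The cluster-summary approach, and the reason pseudoreplication overstates the information in the data, can be sketched as follows. For equal-sized clusters with a common within-cluster correlation, the variance of the overall mean is inflated by the factor 1 + (m - 1) * rho relative to independent data, where m is the cluster size. The lake names, measurements, and correlation value are hypothetical.

```python
from statistics import mean

def cluster_means(clusters):
    """Reduce each cluster to one summary value (its mean)."""
    return {name: mean(values) for name, values in clusters.items()}

def design_effect(cluster_size, within_corr):
    """Variance inflation for the mean under equal within-cluster correlation."""
    return 1.0 + (cluster_size - 1) * within_corr

def effective_sample_size(n_clusters, cluster_size, within_corr):
    """Number of independent observations carrying equivalent information."""
    return n_clusters * cluster_size / design_effect(cluster_size, within_corr)

# Three hypothetical lakes, each sampled at three locations.
lakes = {
    "lake_a": [4.1, 4.3, 3.9],
    "lake_b": [6.0, 6.2, 5.8],
    "lake_c": [5.0, 5.1, 4.9],
}

summaries = cluster_means(lakes)            # analyze these as 3 independent values
n_eff = effective_sample_size(3, 3, 0.8)    # assumed within-lake correlation of 0.8
print(summaries)
print(f"9 measurements behave like about {n_eff:.1f} independent ones")
```

When the within-cluster correlation is high, the effective sample size approaches the number of clusters, which is why treating all nine measurements as independent would exaggerate the evidence.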
For repeated-measurement designs (where an experimental unit was measured at multiple times) a practice that was more common in the past was to treat the measurements as a cluster. More recently, a tendency has been to allow for autocorrelation that may diminish as a function of time separation between occasions of measurement.
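The two treatments of repeated measurements can be contrasted through the correlation matrices they imply. Treating the measurements as a cluster gives every pair of occasions the same correlation, whereas a time-decaying structure (here rho raised to the time separation, one common choice) gives weaker correlation to occasions further apart. The measurement times and rho value are hypothetical.

```python
def compound_symmetry(times, rho):
    """Cluster-style structure: the same correlation for every pair of occasions."""
    n = len(times)
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]

def time_decay(times, rho):
    """Correlation rho**|t_i - t_j|, diminishing with time separation."""
    return [[rho ** abs(a - b) for b in times] for a in times]

# Hypothetical measurement occasions (unevenly spaced).
times = [0, 1, 3]
cs = compound_symmetry(times, rho=0.5)
ar = time_decay(times, rho=0.5)

print("cluster-style:", cs)
print("time-decaying:", ar)
```

Under the cluster-style structure the first and last occasions are as strongly correlated as adjacent ones; under the time-decaying structure their correlation falls to 0.125.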
Useful treatments of time series analysis and spatial data analysis, with an emphasis on R implementation, are found in specific chapters of Crawley (2007) and Venables and Ripley (2004). Bivand et al. (2008) also emphasize use of R.
Autocorrelation is one important aspect of spatial statistical data analysis. General introductions to spatial data methods include Waller and Gotway (2004).
The statistical and numerical technology for parametric modeling with autocorrelation is similar to that for multilevel models, a topic large enough to justify a separate web page. See Zuur et al. (2009) and Qian (2010) for sources emphasizing multilevel modeling of ecological data in R. Clustered and spatially or temporally autocorrelated data for a continuous response are often analyzed using SAS Proc Mixed, which supports a variety of correlation structures (Littell et al. 2006), or analogous programs in R (Pinheiro and Bates 2000).
Qian (2010) treats the example of lakes in detail, with R examples. The example is used here for relative simplicity. Autocorrelation in data from streams may take a more complicated form, and is a topic of ongoing statistical research.
Spatial and temporal autocorrelations are subjects of study in specialized statistical disciplines. A large body of literature, much of it relatively advanced, can be found by searching with keywords including spatial statistics, geostatistics, variogram, kriging, time-series analysis, multilevel models, hierarchical models, random effects models, mixed models, clustered data analysis, and longitudinal data analysis. For practical purposes, good starting points for the analyst who wants an up-to-date treatment with an ecological or environmental focus may include special journal issues such as Ecological Applications Volume 19, Issue 3 (2009), dealing with hierarchical models, or recent review articles such as Dormann et al. (2007), dealing with spatial correlation. Crawley (2007), a text generally recommended for data analysis in R, contains chapters on time series analysis and spatial data analysis.