CADDIS Volume 4: Data Analysis
Basic Principles & Issues
- Authors: M. McManus, D. Farrar,
Autocorrelation and Data Independence
Analysts should consider the possibility that some samples in a dataset may not actually provide independent evidence, because samples taken relatively close together in space or time are to some degree redundant. Measurements also may occur as clusters, with a tendency for less variation within than among clusters. For example, a study involving chlorophyll concentration in lakes may involve multiple lakes, with samples at multiple locations in each lake. In this case, the samples from each lake form a cluster.
If individual samples depend strongly on one another, use of standard statistical methods that assume independence (e.g., confidence intervals) may result in greater confidence in the strength of conclusions than is actually justified by the evidence in the data.
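The overconfidence described above can be illustrated with a rough calculation. The sketch below is not part of the original guidance: it assumes, purely for illustration, that the data behave like a first-order autoregressive (AR(1)) series with lag-1 correlation rho, and it uses the common large-sample approximation n_eff = n(1 - rho)/(1 + rho) for the number of effectively independent samples when estimating a mean. The function names are hypothetical.

```python
import math


def effective_sample_size(n, rho):
    """Approximate number of independent samples in an AR(1) series.

    Assumption (not from the source text): lag-1 correlation `rho`
    fully describes the dependence, and n is reasonably large.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return n * (1.0 - rho) / (1.0 + rho)


def ci_inflation_factor(rho):
    """Factor by which a naive confidence interval for the mean is too narrow.

    The naive half-width scales with 1/sqrt(n); the adjusted one scales
    with 1/sqrt(n_eff), so the ratio is sqrt(n / n_eff).
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return math.sqrt((1.0 + rho) / (1.0 - rho))


# 100 samples with a lag-1 correlation of 0.5 carry roughly the
# information of 33 independent samples under this approximation.
n_eff = effective_sample_size(100, 0.5)
widening = ci_inflation_factor(0.5)
```

Under these assumptions, a confidence interval computed as if the 100 samples were independent would need to be widened by a factor of about 1.7. Real biomonitoring data may follow a different dependence structure, so the numbers are indicative only.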
These concerns are all associated with the technical topic of autocorrelation, or the correlation of repeated measurements on the same variable. In ecology, similar concerns may be discussed under the topic of pseudoreplication. When analyzing biomonitoring data we are most often concerned with positive spatial autocorrelation, in which measurements tend to be similar when taken from locations close together. A simple example of autocorrelation in a stream is the likelihood that conditions downstream are not independent of conditions upstream.
The first line of defense against autocorrelation problems is familiarity with the study design, along with an understanding of variation occurring on different spatial and temporal scales at the sample sites. These insights may be sufficient to identify data that are adequately independent. For example, if the morphology of a stream can be described as an alternation of riffles and pools, the analyst might reduce the data to a single value for each riffle or pool, perhaps by averaging sample measurements from the same riffle or pool.
Since autocorrelation is generally present in environmental data on some scale, we emphasize graphical evaluation. Any mapping approach that can reveal a tendency for clustering of relatively high or relatively low values may be effective in revealing spatial autocorrelation. For example, a map of total phosphorus concentrations in Florida suggests a cluster of high phosphorus measurements in the central western portion of the peninsula (Figure 3). Specialized graphical techniques like the variogram, discussed in the supplemental information below, can be helpful for identifying a minimum spatial or temporal separation beyond which measurements will be practically uncorrelated. Statistical tests for spatial and temporal autocorrelation also are available, but the usual cautions about statistical tests apply: for any given level of autocorrelation, the chance of a statistically significant result depends on sample size. Graphical techniques parallel to those used to diagnose spatial autocorrelation also are available for temporal autocorrelation (e.g., for evaluating the time series for an individual monitoring station).
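As a rough illustration of the variogram idea, the following sketch computes an empirical semivariogram: for each distance class, half the mean squared difference between pairs of measurements separated by roughly that distance. The function name, the input layout of (x, y, value) tuples, and the fixed-width distance bins are assumptions made for this example; they are not taken from the CADDIS supplemental material.

```python
import math
from collections import defaultdict


def empirical_semivariogram(points, bin_width):
    """Return {bin_index: (mean_distance, semivariance, n_pairs)}.

    points    -- list of (x, y, value) tuples in consistent units
    bin_width -- width of each distance class, same units as x and y
    """
    # For each distance bin: [sum of distances, sum of half squared diffs, pair count]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for i in range(len(points)):
        xi, yi, zi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj, zj = points[j]
            distance = math.hypot(xi - xj, yi - yj)
            acc = sums[int(distance // bin_width)]
            acc[0] += distance
            acc[1] += 0.5 * (zi - zj) ** 2
            acc[2] += 1
    return {
        b: (s[0] / s[2], s[1] / s[2], s[2])
        for b, s in sorted(sums.items())
    }


# Hypothetical transect: four sites 1 unit apart with a steady gradient.
transect = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (2.0, 0.0, 2.0), (3.0, 0.0, 3.0)]
semivariogram = empirical_semivariogram(transect, bin_width=1.0)
```

When semivariance is plotted against mean distance, a curve that rises and then levels off suggests the separation beyond which measurements are practically uncorrelated. In this made-up transect the semivariance keeps rising, which points to a trend rather than a finite correlation range. Bins with few pairs are unreliable and are usually interpreted with caution.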
If autocorrelation is judged to be important for a given data set, based on statistical tests or graphical evaluation, relatively simple approaches may be adequate for taking it into account in the analysis. Temporal correlation can be eliminated by separate analysis of "time-slices", such as individual years of biomonitoring data collection. Results for individual years may then be compared using confidence intervals for effect indices. For some purposes, clustered data may be reduced to a single summary statistic per cluster, for example (in the lake example) a mean value for each lake. (Note, though, that we may then lose the ability to evaluate stressors that vary within lakes!)
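Reducing clustered data to one value per cluster can be sketched in a few lines. The lake names and chlorophyll values below are invented for illustration, and `cluster_means` is a hypothetical helper, not a function from any particular package.

```python
from collections import defaultdict
from statistics import fmean


def cluster_means(samples):
    """Collapse (cluster_id, value) pairs to one mean per cluster."""
    grouped = defaultdict(list)
    for cluster_id, value in samples:
        grouped[cluster_id].append(value)
    return {cluster_id: fmean(values) for cluster_id, values in grouped.items()}


# Hypothetical chlorophyll measurements, several locations per lake.
samples = [
    ("Lake A", 4.0), ("Lake A", 6.0),
    ("Lake B", 10.0), ("Lake B", 12.0), ("Lake B", 14.0),
]
lake_means = cluster_means(samples)
```

Here five samples become two analysis units, one per lake. The same grouping could be keyed on year to produce the "time-slices" mentioned above. As the text notes, any within-lake variation in stressors is no longer visible after this step.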
A substantial statistical literature addresses how autocorrelation can be incorporated into regression models. For modeling a time series for a single site, some relatively simple approaches involve use of lagged X variables. The analyst may, however, opt for relatively advanced multilevel models, which may allow for various patterns of autocorrelation, variation on multiple spatial scales, simultaneous modeling of data for multiple years, and stressors that vary within as well as between sites. For such advanced analyses, the participation of a statistician may be helpful.
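One simple reading of "lagged X variables" is a regression of the response at time t on the predictor at time t and at time t - 1. The sketch below fits that model by ordinary least squares using only the standard library. The model form, the function names, and the toy data are assumptions for illustration; in practice a statistics package would normally be used, and residuals should still be checked for remaining autocorrelation.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_lagged_regression(x, y, lag=1):
    """OLS fit of y[t] = b0 + b1 * x[t] + b2 * x[t - lag].

    Returns [b0, b1, b2]. The first `lag` observations are dropped
    because they have no lagged predictor value.
    """
    rows = [[1.0, x[t], x[t - lag]] for t in range(lag, len(x))]
    targets = [y[t] for t in range(lag, len(y))]
    k = 3
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy series constructed so that y[t] = 1 + 2 * x[t] + 3 * x[t - 1] exactly.
x_series = [1.0, 4.0, 2.0, 7.0, 3.0, 9.0, 5.0, 8.0]
y_series = [0.0] + [1.0 + 2.0 * x_series[t] + 3.0 * x_series[t - 1]
                    for t in range(1, len(x_series))]
coefficients = fit_lagged_regression(x_series, y_series)
```

Because the toy data contain no noise, the fit recovers the intercept and both slopes. With real monitoring data the lagged coefficient indicates how much of the current response is associated with the previous period's conditions.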
More information regarding the analysis of autocorrelated data is available here.