CADDIS Volume 4: Data Analysis
Topics in Assembling Data
- Authors: S.B. Norton, L. Alexander, S.M. Cormier, G.W. Suter II
A causal analysis may include data collected at the impaired site(s) and at local reference sites ("data from the case") as well as data from other locations ("data from elsewhere"). Data from the case are essential, but the sample size may be too small for robust statistical analysis, and relationships found at the site should be confirmed by data from elsewhere, if possible. Data from elsewhere are used to develop background information on causal relationships and observations from similar situations that can be related to the case. Generally, both types of data are needed to complete a causal analysis.
The primary data from the case are the data used to determine whether the site(s) can be classified as impaired. Data from the case include samples of biota, water chemistry, sediments, stream habitat condition, and other attributes of the impaired and local reference sites. They also may include land use/land cover data, geologic maps, or historical records needed to classify the sites.
After assembling all available data from the case, you may find that additional data are needed to support or refute relationships between observed stressors and biological impairments. Advice on designing a sampling effort to meet your specific objective may be found in U.S. EPA's Guidance on Choosing a Sampling Design for Collecting Environmental Data. Suggestions for storing and managing data can be found at Resources for Planning New Data Collections.
Data from elsewhere usually form the bulk of the data used for causal analysis. These data may be obtained from other sites within the region, similar sites in other regions, laboratory studies, journal articles, industry and government publications, and other sources. After the data are assembled, they must be interpreted and related to observations from the case.
Monitoring programs run by city, county, state, and federal agencies provide information on local and regional field conditions that can be helpful to a causal analysis. Field data also can be used to evaluate stressors and responses under realistic environmental conditions. When using biomonitoring data, the measured variables, the taxonomic resolution, and the sampling approach used to select sites determine how the data may be analyzed and how the results may be used. For example, in stratified sampling designs, some subsets may deliberately be over- or under-represented in samples, so that their relative frequencies in the sample do not equal their frequencies in the population of interest. Also, some probabilistic designs call for special methods of statistical analysis, such as standard errors calculated to account for the design.
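The effect of unequal representation in a stratified design can be illustrated with a minimal Python sketch. It assumes simple random sampling within each stratum and uses the textbook stratified estimators of the mean and its standard error. The stratum sizes and measurements are invented for illustration and do not come from any monitoring program, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def stratified_mean(strata):
    """Weight each stratum's sample mean by its share of the population.

    strata: list of (population_size, sample_values) pairs.
    """
    total = sum(n_pop for n_pop, _ in strata)
    return sum((n_pop / total) * mean(values) for n_pop, values in strata)


def stratified_se(strata):
    """Standard error of the stratified mean, assuming simple random
    sampling within strata (includes the finite population correction)."""
    total = sum(n_pop for n_pop, _ in strata)
    var = 0.0
    for n_pop, values in strata:
        n = len(values)
        weight = n_pop / total
        fpc = 1 - n / n_pop  # finite population correction
        var += weight ** 2 * fpc * variance(values) / n
    return sqrt(var)


# Hypothetical example: large streams are rare in the population
# but were deliberately oversampled.
strata = [
    (900, [10.0, 12.0]),              # small streams: 900 in population, 2 sampled
    (100, [30.0, 32.0, 34.0, 36.0]),  # large streams: 100 in population, 4 sampled
]

naive = mean(v for _, values in strata for v in values)  # ignores the design
weighted = stratified_mean(strata)                       # respects the design
se = stratified_se(strata)

print(f"naive mean:    {naive:.2f}")
print(f"weighted mean: {weighted:.2f}")
print(f"standard error of weighted mean: {se:.2f}")
```

In this invented example the unweighted mean is pulled toward the oversampled stratum, which is why the design information in a dataset's documentation matters before any analysis. Real survey data may require the specific weights and variance estimators documented by the monitoring program rather than this simplified calculation.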
Links to U.S. EPA monitoring data available online:
Environmental Monitoring and Assessment Program (EMAP)
The Environmental Monitoring and Assessment Program (EMAP) is a research program to develop the tools necessary to monitor and assess the status and trends of national ecological resources. Data sets generated in the course of EMAP's research are available to be searched and downloaded.
Regional Environmental Monitoring and Assessment Program (REMAP)
The objectives of REMAP are to evaluate and improve EMAP concepts for state and local use, assess the applicability of EMAP indicators at differing spatial scales, and demonstrate the utility of EMAP for resolving issues of importance to EPA Regions and states. Like EMAP data, REMAP data are available online, but they cover smaller spatial and temporal scales.
EPA STOrage and RETrieval database (STORET)
The STORET Data Warehouse is U.S. EPA's repository of the water quality monitoring data collected by water resource management groups across the country. These organizations, including states, tribes, watershed groups, other federal agencies, volunteer groups, and universities, submit datasets to the STORET Warehouse in order to make them publicly accessible.
Wadeable Streams Assessment (WSA)
The Wadeable Streams Assessment (WSA) is a survey of the biological condition of small streams throughout the U.S. conducted by the U.S. EPA in collaboration with states and tribes. The first WSA in 2004-2005 sampled 1,392 sites selected at random to represent the condition of all streams in regions that share similar ecological characteristics. Participants used the same standard methods at all sites, to ensure results that are comparable across the nation.
Peer-reviewed journal articles and other publications may be obtained from libraries or online citation search engines. Although most journals and many scientific citation services charge fees for downloads or searches, many college and university libraries offer free access to the search engines and journals to which they subscribe, and also may provide assistance with literature searches. Other sources include open-access database compilations such as ECOTOX and HERO. In addition, CADDIS provides two open-access databases of citation information for selected stressor-response relationships in freshwater ecosystems. CADLit uses a keyword search to find stressor-response information for multiple stressor exposures reported in peer-reviewed scientific literature. The Interactive Conceptual Diagram application uses a graphical user interface to find citations that support specific causal links in the conceptual model diagrams.
Note that databases are useful for identifying which data you will need and for conducting initial exploratory comparisons, but the original source must be consulted if the information becomes critical to the causal assessment. In addition, many journals publish data online as supplements to the papers.
Observations from similar cases can yield insights into the current investigation, particularly if other investigators have implemented management actions or identified diagnostic symptoms. Unfortunately, these studies often are not published, and at present no central repository for such information exists. However, a small number of examples and case studies are available in Volume 3 of CADDIS.
Modeling data can be used as surrogates when insufficient field or laboratory data are available for a case. For example, a watershed model such as Better Assessment Science Integrating point & Non-point Sources (BASINS) can generate estimates of water quality parameters at specified locations.
- Make the initial search broad. It is easy to overlook potentially relevant data for the stressors and biological impairments in your case by restricting the search to familiar sources.
- Use the metadata for data obtained from other sources. Metadata (data about the data) have the details needed to analyze a dataset and evaluate its quality. If metadata are not provided, locate website contacts or other users who may be able to supply them.
- Document the sources of original data and any alterations made (e.g., file merges, new fields, deleted observations). A file folder system and naming convention that allow users to easily track and identify data sources and versions will facilitate the assembly process.
- Set up read-only folders for original versions of data files. Always work on copies, not originals!
- Budget time to clean up and review final datasets prior to starting the analysis.
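The tips on documenting sources and protecting originals can be sketched in Python using only the standard library. This is one possible arrangement, not a prescribed CADDIS workflow: the folder names (`raw`, `working`), the file names, the log format, and the three helper functions are all hypothetical, and the demonstration runs in a temporary directory so that it touches no real data.

```python
import hashlib
import shutil
import stat
import tempfile
from datetime import date
from pathlib import Path


def protect_original(path: Path) -> None:
    """Remove write permission so the original is not edited by accident."""
    path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def working_copy(original: Path, work_dir: Path) -> Path:
    """Copy the original into a working folder under a dated file name."""
    work_dir.mkdir(parents=True, exist_ok=True)
    copy = work_dir / f"{original.stem}_work_{date.today():%Y%m%d}{original.suffix}"
    shutil.copyfile(original, copy)  # copies contents only, so the copy stays writable
    return copy


def log_change(log_file: Path, data_file: Path, source: str, note: str) -> None:
    """Append one tab-separated provenance record, including a checksum
    that identifies exactly which version of the file was described."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    record = [date.today().isoformat(), data_file.name, source, note, digest]
    with log_file.open("a", encoding="utf-8") as fh:
        fh.write("\t".join(record) + "\n")


# Demonstration with made-up data in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw_dir = root / "raw"
    raw_dir.mkdir()
    original = raw_dir / "site_chemistry.csv"
    original.write_text("site,conductivity\nA,410\n", encoding="utf-8")

    log = root / "data_log.tsv"
    protect_original(original)
    log_change(log, original, "hypothetical agency download", "original, unmodified")

    copy = working_copy(original, root / "working")
    with copy.open("a", encoding="utf-8") as fh:
        fh.write("B,395\n")  # an alteration made on the copy only
    log_change(log, copy, original.name, "added record for site B")

    print(log.read_text(encoding="utf-8"))

    # Restore write permission so the temporary directory can be removed
    # on platforms that refuse to delete read-only files.
    original.chmod(stat.S_IRUSR | stat.S_IWUSR)
```

Recording a checksum alongside each note is a design choice made for this sketch: it lets a later reader confirm that a file on disk is the same version the log entry describes.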