Analysis and Universal Process Modeling for the Interpretation of Environmental Data

In recent years, numerous applications of computer-based methods to environmental chemistry have been developed. These include the use of principal component analysis (PCA), soft independent modeling of class analogy (SIMCA), geographical information systems (GIS), neural networks and expert systems (Natusch et al., 1983; Breen and Robinson, 1985; James, 1993). The use of these techniques has been driven by the need to convert complex environmental analytical data into useful information. Regulatory efforts, clean-up strategies, monitoring programs and other environmental efforts all depend on the successful conversion of analytical data into a form that contains the relevant information necessary to make decisions. Among other uses, analytical measurements are used to evaluate loadings of toxic chemicals into ecosystems, to evaluate the effectiveness of remediation efforts and to assess drinking water treatment standards. Unfortunately, differing analytical methodologies, varying degrees of control in the analytical process, and the complexity of environmental data have all challenged the environmental scientist's ability to adequately translate data into environmentally useful information. This is illustrated by the fact that there can be greater than 65% relative standard deviation in the amount of specific contaminants reported by laboratories when the contaminants are at the parts per billion (ng/l) level (Garfield, 1991). These types of problems have led to situations where entire data sets, covering years of analysis, have been declared useless (Bennoit, 1994). The first step in interpreting environmental data, therefore, is to ensure that the analytical variability is much less than the environmental variability being measured. This can only be done if laboratories adhere to strict quality control principles. Computational tools can then successfully be used to detect trends associated with changing environmental conditions.
The Niagara River Toxics Management Plan is a program established by Environment Canada, the U.S. Environmental Protection Agency Region II, the Ontario Ministry of the Environment and the New York State Department of Environmental Conservation. The plan has, as one of its stated goals, to achieve a significant reduction of toxic contaminants in the Niagara River and to reduce the inputs of specific toxic chemicals from point and non-point sources by 50% by 1996 (Williams et al., 1994). Associated with this plan is an upstream/downstream monitoring program designed to specifically measure target organic compounds. The analytical procedures used to support this monitoring program are prescribed by the Niagara River Analytical Protocol and contain specific guidelines controlling the analytical methodologies and associated quality control procedures used to generate analytical data (Analytical Protocol Group of River Monitoring Committee, 1992). Because this program has been in place since 1987, and because its associated monitoring program has a rigorous analytical component, the data generated from this program is of suitable quality for analysis by specific chemometric methods. This chapter describes the use of neural networks (NN), PCA and universal process modeling (UPM) for the evaluation of analytical data generated from three locations along the Niagara River (Figure 9.1).
This project had four specific goals:
1. To use NN, PCA and UPM techniques to detect variations in the levels of target organic compounds over time between specific locations along the Niagara River.
2. To use NN, PCA and UPM techniques to identify the source of water samples collected from locations along the Niagara River.
3. To use UPM techniques to detect variations in the levels of target organic compounds over time within specific locations along the Niagara River.
4. To evaluate the use of NN, PCA and UPM techniques as tools for identifying non-target contaminants using a broad spectrum analytical approach.

Experimental Design
The Niagara River is a major interconnecting waterway between Lake Erie and Lake Ontario. Flowing northerly from the former to the latter, the Niagara River drops some 100 meters in elevation over a distance of 58 kilometers. The river drains an urban region which is heavily industrialized and contains numerous chemical dump sites. The river includes Niagara Falls, which physically divides the river into an upper and a lower section. Two permanent sampling stations were established in 1987 and are located to collect representative samples of water entering the Niagara River from Lake Erie and exiting the Niagara River into Lake Ontario. These stations are Fort Erie (FE) and Niagara-on-the-Lake (NOTL). A third station, the Buffalo Water Intake (BWI), was established in the early 1990s and is located above the head of the Niagara River (Figure 9.1).
Weekly water samples are collected, extracted and analyzed by gas chromatography-mass spectrometry (GC-MS) following the procedures described by the Niagara River Analytical Protocol (Analytical Protocol Group, 1992). In general, 24 hour composite samples are collected and extracted using Goulden Large Volume Extractors. The extracts are analyzed for specific target organic compounds such as chlorinated pesticides (OCs) and polynuclear aromatic hydrocarbons (PAHs) using GC-MS. The GC-MS data from these samples can be described in terms of a multivariate problem (Lavine, 1992; Lavine et al., 1993). That is, a large number of data points or variables (chromatographic and spectral data representing different compounds) are used to describe an object (water or environmental quality of a site). The analytical data, usually reported in ng/l concentrations, is transferred into Microsoft Excel for analysis by NN, PCA and UPM methodologies (Figure 9.2).
The entire data set consisted of samples collected from FE, NOTL and BWI between 1987 and 1994 (Table 9.1). The data set contained 359 samples each measuring 23 target compounds from Fort Erie, 338 samples each measuring 21 target organic compounds from NOTL, and 42 samples each containing 32 variables for the Buffalo Water Intake. A subset of this data, consisting of samples collected from BWI, FE, and NOTL between 1993 and 1994, was used to determine between-site variability using PCA, NN and UPM techniques. This subset consisted of 149 samples each measuring 32 target compounds. The entire data set (samples collected from 1987 through 1994) was used to determine within-site variability at the FE and NOTL locations.

Figure 9.2 Illustration of data analysis.

Neural Networks
A neural network is a computer program designed to link a variety of inputs through a series of interconnected associations into a specific output. The output produced from these associations can be used for problems related to prediction, classification, transformation and modeling (Zupan et al., 1993; Lawrence, 1993). Neural networks have been used in a variety of chemical applications, including chromatographic and spectral pattern recognition (Long et al., 1991; Wienke et al., 1994; Zupan et al., 1993; Lawrence, 1993). For neural networks to be successfully applied to specific problems, the problems must be appropriately defined, a data set must be established and the network must be trained.
Brainmaker Professional, a commercially available neural network, was used in this project (California Scientific Software, 1993).
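Brainmaker's internals are proprietary, but the general idea of training a small feedforward network to assign samples to one of two sites can be sketched as follows. This is a minimal numpy illustration under stated assumptions: the synthetic two-cluster data, the layer sizes and the learning rate are all inventions for the sketch, not values from the study.

```python
import numpy as np

# Illustrative stand-in data: two clusters of "concentration" vectors,
# one per site. Shapes and parameters are assumptions for this sketch.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.2, 0.05, (40, 4)),
               rng.normal(0.8, 0.05, (40, 4))])
y = np.array([0] * 40 + [1] * 40)

# One hidden layer with sigmoid activations, trained by gradient descent.
W1 = rng.normal(0, 0.5, (4, 6)); b1 = np.zeros(6)
W2 = rng.normal(0, 0.5, (6, 1)); b2 = np.zeros(1)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):
    h = sig(X @ W1 + b1)            # hidden-layer activations
    p = sig(h @ W2 + b2).ravel()    # predicted probability of site 1
    d2 = (p - y)[:, None]           # output error (cross-entropy gradient)
    d1 = (d2 @ W2.T) * h * (1 - h)  # error backpropagated to hidden layer
    W2 -= 0.1 * h.T @ d2 / len(X); b2 -= 0.1 * d2.mean(0)
    W1 -= 0.1 * X.T @ d1 / len(X); b1 -= 0.1 * d1.mean(0)

pred = (sig(sig(X @ W1 + b1) @ W2 + b2).ravel() > 0.5).astype(int)
accuracy = (pred == y).mean()
```

On cleanly separated clusters like these, such a network converges to a high classification rate; the chapter's results below explore how that rate behaves on real monitoring data.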

Principal Component Analysis (PCA)
PCA is a display method for mapping multivariate data into a two-dimensional plane. The method first calculates the correlation matrix, then diagonalizes it to obtain the eigenvalues and eigenvectors. Finally, it transforms the original data into new variables by using the matrix of eigenvectors as a transformation matrix. The map is obtained by plotting the transformed data against whichever two of the new components carry the largest portion of the information in the correlation matrix. Similar samples lie close together in pattern space, forming clusters (Massart, 1988). The variables modeled in this project include concentration and compound, generating a two-dimensional space. Inspect 0.73 was used in this project (Lohninger, 1994).
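The three steps just described (correlation matrix, diagonalization, projection onto the two largest components) can be sketched directly in numpy. The synthetic data below, with one strongly correlated pair of variables, is an assumption standing in for the concentration measurements.

```python
import numpy as np

# Synthetic stand-in for a concentration matrix: 100 samples x 5 variables,
# with variables 0 and 1 strongly correlated so PCA has structure to find.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=100)

Z = (X - X.mean(0)) / X.std(0)       # standardize: covariance of Z = correlation of X
R = (Z.T @ Z) / len(Z)               # correlation matrix
vals, vecs = np.linalg.eigh(R)       # diagonalize (eigenvalues ascending)
order = np.argsort(vals)[::-1]       # largest-variance components first
scores = Z @ vecs[:, order[:2]]      # project samples into the 2-D score plane

explained = vals[order[:2]].sum() / vals.sum()
```

Plotting the two columns of `scores` against each other gives exactly the kind of score plot shown later in Figures 9.3 and 9.4, with similar samples falling close together.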

Universal Process Modeling (UPM)
UPM is an M-in, M-out algorithm. The algorithm receives an input data vector of size M and responds with an output vector of the same dimensionality. To construct the response vector, UPM makes use of a reference library, which is a database of exemplar patterns. Each time it is presented with a new test signal, UPM creates a localized model based on a subset of patterns selected from among the patterns stored in the reference library. The selection of exemplars for the localized model is based on the similarity of the test vector to each pattern in the reference library and the relative position of the exemplars. The similarity, which is calculated by the advanced metric, is used in two places: in selecting nearest neighbour images from the reference library, and in constructing the coefficients used to linearly combine those images into the predicted image. Once the exemplars are selected, the model is evaluated to determine the response vector and output diagnostic or classification information.
Universal Process Modeling is a proprietary empirical modeling technique which requires a historical data set that adequately describes the system. UPM provides a predicted output based upon the comparison of contemporary data sets with the historical data set. Modelware Professional was used for this project (Teranet IA Inc., 1992).
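Because the "advanced metric" is proprietary, the localized-modeling idea can only be sketched in spirit: select the most similar exemplars from the reference library and combine them linearly into a predicted vector. The Euclidean distance, the similarity weighting, and the function name `predict` below are all assumptions, not the actual UPM algorithm.

```python
import numpy as np

def predict(library, test_vec, k=5):
    """Localized model: linearly combine the k most similar library exemplars.

    `library` is an (n_patterns, M) array of exemplar vectors; the returned
    prediction has the same dimensionality M as the input (M-in, M-out).
    """
    dists = np.linalg.norm(library - test_vec, axis=1)
    nearest = np.argsort(dists)[:k]        # nearest-neighbour exemplars
    sims = 1.0 / (1.0 + dists[nearest])    # assumed similarity measure
    weights = sims / sims.sum()            # coefficients for the combination
    return weights @ library[nearest]

# Illustrative use: an 8-variable reference library and one test signal.
rng = np.random.default_rng(2)
library = rng.normal(0.5, 0.1, (200, 8))
test = rng.normal(0.5, 0.1, 8)
pred = predict(library, test)
deviation = np.abs(pred - test)            # residual used for diagnostics
```

The residual between the predicted and observed vectors is what drives the diagnostic output: a test signal well described by the library produces small deviations, while an anomalous signal does not.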

Between-Site Variability
The initial question to be addressed was whether the analytical data from three locations along the Niagara River could be used to observe between-site variability. Once between-site variability was established, a second question to be addressed was whether the location of a sample could be determined based solely upon the analytical data.
Figures 9.3 and 9.4 show the principal component (PC) score plots that characterize the samples. Figure 9.3 is the score plot of the first PC (X-axis) against the second PC (Y-axis). Labeled points on the graph indicate individual samples which are outliers from the expected cluster and show the date of collection. For example, N940414 indicates a sample collected at Niagara-on-the-Lake on April 14, 1994. Two samples from FE and three from NOTL were observed to contain data outside the normal range. In each case, higher concentrations of specific target compounds were observed. These higher values could be associated either with true increases of these compounds as a result of a spill or release, or with analytical outliers. Figure 9.4 is a magnified plot around the clusters in Figure 9.3. In general, two distinct clusters were observed, associated with the FE and NOTL locations. Data from the BWI was distributed between the FE and NOTL clusters, with more BWI data associated with the FE cluster. As the BWI and FE locations are in relatively close proximity, it is not surprising to observe this effect.

NN and UPM analysis was undertaken on the same data set of 149 samples. Twenty-five percent of the data, with data coming from each of the three locations, was randomly selected for training input. The prediction rate for each of the methods was determined based upon the number of times each of the systems could correctly identify the source of the analytical data when presented with the remaining 75% of the samples. Table 9.2 shows the results of this study. In general, the NN could correctly identify the source of the data 94.4% of the time, while UPM analysis had a prediction rate of 91.7%.

In order to observe the effect of the size of the training set on prediction rate, the training set size was varied from 2% (3 samples) to 75% (112 samples) of the total number of samples. This is an important consideration, as many data sets will have a limited number of samples which can be used for training. This is also of importance due to the cost associated with collecting and analyzing large numbers of samples. Obviously, the ideal situation for determining variability would be to use the smallest training set possible. Figure 9.5 shows the result of this study. In general, a 65% prediction rate was achieved with only 2% of the samples for both NN and UPM methods. The prediction rate increased steadily, reaching the 90% level using only 20% of the samples. An unusual observation was the drop in prediction rate using UPM from 90% to 85% when between 20% and 50% of the samples were used for training. This may be a result of using samples which were collected at different times rather than samples which were collected sequentially. It is unclear why this did not affect the NN analysis. A smaller dip in prediction rate (73-70%) was observed using the NN system when between 15% and 20% of the samples were used as training sets. Both the UPM and NN analyses were undertaken using default classification features. It is believed that higher prediction rates can be achieved when parameters within both the UPM and NN systems are optimized. These results demonstrate that it is possible to use PCA, NN and UPM methods to identify data from separate locations. This is an important first step in being able to monitor real changes in chemical contamination over time. That is, if a normal set of conditions cannot be defined, then it will be impossible to determine changes within a system.
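The training-set-size experiment above has a simple general shape that can be sketched with any classifier. The version below uses a plain 1-nearest-neighbour rule on synthetic two-site data as a stand-in for the NN and UPM systems; the cluster parameters and the `prediction_rate` helper are assumptions for illustration only.

```python
import numpy as np

# Synthetic two-site data set: 150 samples, 6 "compound" variables each.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.3, 0.1, (75, 6)),
               rng.normal(0.7, 0.1, (75, 6))])
y = np.array([0] * 75 + [1] * 75)

def prediction_rate(train_frac):
    """Hold out (1 - train_frac) of the samples and score a 1-NN classifier."""
    idx = rng.permutation(len(X))
    n = max(1, int(train_frac * len(X)))
    tr, te = idx[:n], idx[n:]
    # classify each held-out sample by the label of its nearest training sample
    d = np.linalg.norm(X[te][:, None] - X[tr][None], axis=2)
    return (y[tr][d.argmin(1)] == y[te]).mean()

rates = {f: prediction_rate(f) for f in (0.02, 0.25, 0.75)}
```

As in Figure 9.5, even very small training fractions give a usable prediction rate when the sites are well separated, and the rate climbs quickly as the training fraction grows.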

Within-Site Variability
UPM was used to evaluate within-site variability at both FE and NOTL. Baseline data from 1987 was used as the learning set and consisted of 47 samples for each of the locations. The prediction data for FE consisted of 312 samples targeting 11 specific compounds, collected from January 1988 through June 1994. The prediction data for NOTL consisted of 288 samples targeting 18 specific compounds, collected from January 1988 through June 1994.
The UPM output consists of two graphs, a Trend plot and a Deviation plot. Trend plots, as shown in the graph in Figure 9.6A, contain a variety of information about the behavior of specific target compounds relative to the other target compounds within a chromatographic run. Each division on the X-axis of the plot represents a sampling event (one chromatographic run), while the Y-axis represents the concentration. The appearance of negative values on the Y-axis is due to software limitations which autoscale the axis. Two lines are observed: the dark line is the actual measured concentration and the light line is the predicted concentration based on data from the training set. The bottom edge of the graph represents the behavior of the specific target compound, with the light color indicating that the compound is within normal limits. The dark color indicates that the behavior of the target compound is different from that predicted. The upper edge of the graph shows the behavior of all of the target compounds relative to predicted values developed from the training set. Dark sections indicate that the data set is outside the expected range, while the light areas indicate that the observed data set is within the range predicted by the model. Note that a specific compound may be out of range (lower edge) while the entire data set is within normal limits (upper edge); conversely, the data set may be out of range while the particular target compound is within its normal limits.
Deviation plots, as shown in the graph in Figure 9.6B, were used to observe the behavior of all compounds within a chromatographic run relative to each other based on data from the training set. The Deviation plot provides a variety of information about the health of the system. The level at which the system is determined to be unhealthy, meaning significantly different from expected, can be set by the user and for the purpose of this study was set at 0.85. The X-axis of the Deviation plot shows each individual compound, with each bar representing one of the compounds within a chromatographic run (1,4-dichlorobenzene, 1,2,4-trichlorobenzene, 1,2,3-trichlorobenzene, 1,2,3,4-tetrachlorobenzene, pentachlorobenzene, hexachlorobenzene, BHD, lindane, heptachlor epoxide, dieldrin, hexachlorobutadiene, polychlorinated biphenyls, fluoranthene, pyrene, benzo(a)anthracene, chrysene, bis(2-ethylhexyl)phthalate, dioctylphthalate, from left to right). The Y-axis represents deviation units. The Deviation plot also provides a view of the deviation of each compound making up the data set. Individual compounds were defined as within normal limits if they fell within two deviation units; the warning level for each compound was set at three deviation units, and a compound was defined as out of control at four deviation units. The deviation of all compounds relative to the training set ultimately determines the system health. While one compound may show an out-of-control condition, for example, the system health could still be normal (above 0.85) if the combination of the deviations of the other compounds was still within normal or warning levels.

Figure 9.7 shows a Trend plot generated for FE for the compound BHD. The numbers on the X-axis indicate the sampling event (date) covering the specified time period. In general, the observed levels of BHD can be seen to be significantly lower compared to the prediction line. This would indicate that BHD levels at the FE location are indeed dropping from those initially found in 1987. This trend was also observed at the NOTL location (Figure 9.6). In contrast, values obtained for pyrene (Figures 9.8, 9.9) for both FE and NOTL locations show a different trend. In general, there is little change between the predicted and observed levels over time. Specific increased concentration events, appearing as spikes relative to the prediction line, were observed at both locations. These events appear to occur between October and March of each year, with the absolute magnitude of each event appearing to drop off with time.
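The per-compound threshold scheme described for the Deviation plot (normal within two deviation units, warning at three, out of control at four) can be written as a small helper. The function name and the exact handling of values that fall on a boundary are assumptions for this sketch, not specified by the software.

```python
def classify(dev_units):
    """Map a compound's deviation units to its Deviation-plot status.

    Assumed boundary handling: <= 2 units is normal, above 2 but below 4
    is the warning band (with 3 as the nominal warning level), and 4 or
    more deviation units is out of control.
    """
    if dev_units <= 2.0:
        return "normal"
    if dev_units < 4.0:
        return "warning"
    return "out of control"

statuses = [classify(d) for d in (0.5, 1.9, 3.0, 4.5)]
```

System health is then a separate, combined judgment over all compounds, so a single "out of control" compound need not by itself push the health index below the 0.85 threshold.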

Conclusions
PCA, NN and UPM methods were shown to be useful tools for the prediction and detection of variation in the concentration of target compounds both within- and between-sampling sites. NN and UPM methods correctly identified the source of analytical data based upon minor differences observed in the data from each location.

Figure 9.5 The effect of training set size on the prediction rate.

Figure 9.7 Seven year trend plot of BHD from Fort Erie. (Dark bottom line: measured concentrations; light upper line: predicted concentrations)

Table 9.1 Data set used in the study.

Table 9.2 The classification rate (%) of samples from three locations.