Directing Sampling based on Uncertainty Analysis

Determining where and whatto sample for environmental modeling of receiving waters is becoming increasingly important because the need for improved accuracy in model results conflicts with limited site sampling budgets. A quantitative approach to sampling, entitled Quantitatively Directed Exploration (QDE), provides a mathematical framework for determining the best location to sample, and what parameter should be sampled. QDE employs a first-order Taylor series expansion to estimate the uncertainty or variance in the model results. Uncertainty in input parameters is determined through data extrapolation teclmiques, specifically multivariate conditional probability, while model sensitivity is calculated by directly coding sensitivity derivatives into a model using ADIFOR 2.0. Combining these two matrices produces the variance in model results, which in tum is employed to direct sampling. The next sampling location is defined as the point where the variance in model results is the largest. \Vhich input parameter to sample is detennined by evaluating the conttibution to the total variance produced by each input parameter. The QDE approach is demonstrated on a water quality model where non-point source loading, stream characteristics, and contaminant behavior are uncertain input parameters and concentration is the uncertain model result.


Introduction
Data collection is an important aspect of modeling because all results produced by a model are based on input parameters.It is, however, difficult to determine the optimal location(s) to collect samples.A quantitative approach, entitled Quantitatively Directed Exploration (QDE), has been developed as an aid to guide sampling by using the relative uncertainty associated with the model result.Uncertainty in model result, or the variance in output, is estimated with a first-order Taylor series expansion.The Taylor series approach combines input data uncertainty and model sensitivity.The next sampling location is defmed as the point where the variance in model results is the largest.Which input parameter to sample is detennined by evaluating the contribution to the total variance produced by each input parameter.
Methods of directing sampling or improving sampling strategies by using uncertainty analysis have been addressed by several researchers.Uncertainty in input data is one of the matrices needed in the QDE approach.Ben-Jemaa et al. (1995) applied two optimization approaches: 1. minimize the variance of a sampled parameter, and 2. minimize sampling cost, to monitor the contaminant distribution in lake sediment.In addition, the robustness of the sampling network design to changes in data extrapolation methods was also analyzed.Different data extrapolation methods were employed to evaluate their effects on results.However, the sampling network design presented by Ben-Jemaa et al. ( 1995) was based solely on data extrapolation, while the QDE approach presented in this chapter is based on both data extrapolation and model sensitivity.Warwick (1997) developed a first-order uncertainty technique for computing the probability of successfully calibrating and verifying a modeL Several sampling scenarios were evaluated to determine the best combination of samples that minimized cost while achieving the desired level of success and reliability.His approach is somewhat subjective because it depends on setting an acceptable level of risk, reliability, and accuracy tolerance.
Several approaches can be used to compute uncertainty in modelcomputed results.These approaches are Monte Carlo (MC) simulation, parameter perturbation, and first-order Taylor-series expansion.However, MC and parameter perturbation require large computational effort to produce an estimate of uncertainty.Unlike MC and parameter perturbation, the approach presented in this chapter, a first-order Taylor series method together with direct derivative coding (DDC), generates uncertainty in model-computed results with a single model nm.DDC evaluates sensitivity by programming derivatives of model output with respect to model input into the original model.This approach has been shown to produce results similar to MC and parameter perturbation in much less time when the model is linear (Graettinger et al. 2001 ).
The QDE approach is a general approach that can be applied to many models.The QDE approach has been applied to several engineering applications, such as geotechnical engineering, grom1dwater modeling, transportation engineering, and surface water-quality modeling (Graettinger and Dowding, 1999;Graettinger and Dowding, 2001;Reeves et al., 2000;Graettinger et al., 2001;Supriyasilp et al., 2001;and Supriyasilp et al., 2002).In this chapter, the approach is demonstrated on a water-quality model where non-point source loading, stream characteristics, and contaminant behavior are uncertain input parameters and contaminant concentration is an unce1iain model result.

Data Extrapolation and Uncertainty
An overview of the QDE approach is presented in Figure 7.1, which describes the organization of computer programs and data matrices (after Graettinger and Dowding, 1999).TI1e first step for a QDE analysis is data extrapolation, where available input data are distributed to each computational cell (column 1, Figure 7.1 ).Since input parameters are not sampled at all locations, extrapolation techniques arc applied to calculate a best estimate and an uncertainty in that estimate for each parameter in each cell.The uncertainty, therefore, arises only from extrapolating input parameters.Before extrapolating uncertain input data, spatial continuity of the data must be determined and described by a variogram.A variogram is a mathematical function that describes the correlation versus distance relationship of two values, for example, non-point somce loading at two locations.The variogram is used to define the distance over which values are interdependent (Davis, 1973;Knudsen et al., 1977).
The variogram is applied to both cxh•apolate input and quantifY uncertainty in that input.The variogram model itself is not perfectly known, but as additional data are collected the variogram model can be updated.For this work, the variogram model was kept constant and input data were assumed to be perfectly known at the sample points.Uncertainty, therefore, is equal to zero at the sample locations.111e uncertainty associated with the extrapolated data increases as the distance from a san1ple location increases.However, it should be noted that assmning zero uncertainty at sampling locations is not a model limitation.Isaaks and Srivastava (1989) present several variogram functions.Equation 7.1 is a Gaussian model, Equation 7.2 is an exponential model, and Equation 7.3 is a spherical modeL For this work, a Gaussian variogram is used.
where: the maximum uncertainty in the data (the sill).
These variogram models can be applied when the distance, h, is less than the range, a.It should be noted that neither a change in the variogram model nor a change in variogram parameters, i.e. range and sill, has a large effect on the new sampling location (Supriyasilp et al. 2002).However, for this chapter an experimental variogram was constructed based on the available data to approximate the range and sill.
A multivariate conditional probability function is used to provide estimates of the input parameters at computational cells on the basis of limited sample data.The conditioned estimates and associated covariances describing a vector of input parameters are calculated as (Gehnan, 1995): where: Cov(U} a vector of input data for every cell along the stream reach, a similar vector containing data only for sampled cells, is generated as the prior estimate of input data for each cell, and the covariance matrix calculated from the variogram. From these initial matrices, the unconditional estimates, E(U) and Cov(U), are updated to conditioned estimates, E(U] V) and Cov(UI V) using sampling information, V.The conditioned means, E(UI V), represent the extrapolated input data that are used in the model (column 1, Figure 7.1), while the conditioned covarianccs, Cov(UI V), represent the uncertainty matrix (column 2, Figure 7.1).

Model
The QDE approach can be applied to any modeL However, to demonstrate the QDE approach, a simple water-quality model consisting of a cascade of Continuous Stirred Tank Reactors (CSTR) was applied to an example stream reach.The authors recognize that there are more robust models, such as the Enhanced Stream Water-Quality Models (QUAL2E) (Brown and Barnwell 1987) and the One-dimensional Transport with Inflow and Storage (OTIS) model (Runkel 1998).However, to illustrate the ideas and concepts of QDE, the simplified model was chosen.The model accepts the spatially extrapolated input data and calculates the contaminant concentrations in the stream (colunm 1, Figure 7.1).
Assuming that local loading, W, adds only pollutant mass; then the mass balance equation for each cell i (i.e. for each CSTR in the cascade) is given by (Chapra 1997 Cfu~,~~,,~, Figure 7.2 shows a cascade of CSTRs, in which longitudinal mixing is negligible.The mass balance equation for the first, second, and third cells in Figure 7.2 can be \\-Titten as Equations 7.7, 7.8, and 7.9, respectively.A similar set of equations can be written for any cascade of cells.Under steady-state conditions, the time derivatives are equal to zero.(7.9) It is seen from Equations 7.8 and 7.9 that the concentrations in upstream cells also appear in the equations for downstream cells.This means that the input parameters for upstream cells can affect the computed concentrations in dnvvnstream cells.Although the cascade of CSTRs is a simple water-quality model chosen to demonstrate the QDE approach, it should be noted that a real stream can be modeled as a CSTR system when the Pee let number, which is a ratio of the rate of advective transport to the rate of diffusive or dispersive transport, is less than 0.1 (Chapra 1997).

.2.3 Model Sensitivity
Model sensitivity is the derivative of model results with respect to input pammeters, d(out)ld(in).In other words, changes in the system output are related to changes in the system inputs.If the input parameters are not perfectly known, then it is expected that as sampling occurs, input parameters will change and so will model results.The derivatives describing the changes are stored in the sensitivity matrix (column 2, Figure 7.1 ).The sensitivity matrix can be generated by using parameter perturbation, by applying an inverse matrix approach, or by directly programming sensitivities into the model.In the parameter perturbation method, each input parameter is changed slightly, the model is rerun, and the change in output is determined.This allows for calculation ofthe sensitivity derivative.The inverse matrix approach, which can be applied with the CSTR model but is not a general approach, uses the inverse of the assimilation factor ( Chapra 1997) as the sensitivity matrix.For the direct programming approach, the derivative is written as program code.For the QDE approach presented here, model sensitivity is calculated by directly programming sensitivity derivatives into the CSTR model using a program called Automatic Differentiation In FORtran (ADIFOR).
ADIFOR version 2.0 (Bischof et al. 1994), is applied to create the model sensitivity derivatives.To apply ADIFOR to a FORTRAN code, the user must specifY the "independent" and "dependent" variables with respect to differentiation.ADIFOR then determines any associated variables, formulates the derivative expressions, and generates new FORTRAN code for the computation of both the original function and the associated derivatives (Hwang et aL 1998).It should be noted that the new code is not site or data specific.Therefore, the reprogramming effort is a one time commitment for a particular model.This is unlike parameter perturbation or MC simulation where multiple model runs are required every time site data are changed.ADIFOR has been applied to several engineering applications such as transportation engineering, air quality modeling, and computational fluid dynamics (Graettinger et al. 1999;Hwang et al. 1998;Hong 1999).

Taylor Series Calculation of Variance
After generating the uncertainty in input parameters and calculating the sensitivity of the results due to a change in each of the input parameters, the next step is to combine these two components using the first-order Taylor series expansion (column 3, Figure 7.1).The result of this matrix multiplication is the variance or uncertainty in the output parameter.The first-order Taylor series expansion for correlated input data is given by (Harr 1996) as: where: Var[Ck] Cov [;'i:mi ,xmj/ (7.10) the variance of the model result in cell k, the covariance (or uncertainty) between input data for cells i and j for parameter m, and the sensitivity of the predicted concentration at location k to changes in input data at location i for parameter m.
The sensitivity derivative is evaluated at the mean of the input parameter.The first-order Taylor series expansion gives a good estimation of output variance when the model is linear.For a nonlinear model, a higher-order Taylor series expansion should be applied to compute the variance.
The Taylor series expansion with DDC can generate the variance of the model-computed results much faster than MC simulation and Taylor series with parameter perturbation (Graettinger et al., 2001).For this case study, in order to produce similar results, 10,000 MC simulations, 600 perturbation runs, and one DDC run were required.Taylor series with DDC is 480 times faster than MC and 30 times faster than Taylor series with parameter perturbation.

Case Study
To demonstrate an application ofthe QDE approach, a one-dimensional surface water-quality model is applied to a hypothetical stream reach.Each computational cell is 50 m long and is represented as a CSTR.Each cell requires six input parameters, these being: discharge, surface area, reaction rate, apparent settling velocity, volume, and local contaminant loading.All parameters are uncertain and for this example are assumed to be statistically independent of each other.
The first step in the QDE approach is to spatially extrapolate input data to all cells in the model.The reaction rate, apparent settling velocity, and discharge are assumed to be constant along the reach, but are still uncertain.Therefore, there are spatially constant variances associated with these parameters.For this example, surface area, volume, and contaminant loading were initially sampled at locations of250 and 1500 m.In addition, another sample of surface area was taken at 2450 m because this parameter is not difficult to obtain relative to the other parameters.Surface area is simply the length of the computational cell, which is constant along the reach, multiplied by the width of the stream, which can be determined from aerial photography.It should be noted that the water surface area of a cell is assumed to be equal to the settling surface area.The data from these initial observations are used to extrapolate information to each of the 50 cells, which are then used as inputs to the model.
Uncertainty in input data is one of the two terms needed to calculate the variance in model results.Figure 7.3 shows the extrapolated loading data along with its associated standard deviation.The standard deviation is represented by vertical error bars in each of the 50 cells.The standard deviation in extrapolated data is at a minimum at the sample locations, and reaches a maximum value at a distance away from the sample locations (Figme 7.3).
The second tenn needed for the Taylor series calculation of variance in concentration is sensitivity, which is the change in concentration due to a change in input parameters.Figure 7.4 shows an example of the sensitivity matrix of concentration due to loading.The sensitivity matrix is presented as a surface, which shows the sensitivity of concentration in every cell to a change in the loading in every cell.One half ofthe sensitivity matrix has zero sensitivity, which means a change in the loading at a downstream cell has no effect on the concentration in upstream cells.'fl1is is because ofthe feed-forward structure of the model.The main diagonal of the matrix has the largest magnitude sensitivities, as the concentration in a cell is most sensitive to the loading in that cell.Up>mml.
Distan:.:e (m) D~ml.After uncertainty in input parameter and sensitivity are calculated, they are combined using the Taylor series expansion (column 3, Figure 7.1 ).QDE then determines the next sampling location, where the variance in model result is largest.Figure 7.5 shows the contaminant concentration and its associated standard deviation as calculated by the Taylor series expansion with DDC.The standard deviation reaches a maximum value of 0.565 g/m 3 at 1050 m and then decreases as it approaches the downstream sampling location.Therefore, the next sampling location should be 1 050 m dmvnstream, which is the location of the largest standard deviation.In practice, it is sometimes impossible to sample all input parameters at the next sampling location due to budget limitations.Since each input parameter in each cell is uncertain and has its own sensitivity, every input parameter contributes to the overall standard deviation in the output.Therefore, the input parameter that has the greatest contribution to the total variance should be sampled at the next sampling location.Figure 7.6 shows the standard deviation in contaminant concentration caused by each input parameter calculated from the first-order Taylor series expansion.It is clearly seen from Figure 7.6 that the standard deviation of concentration due to the contaminant loading, W, is larger than that for other input parameters for most locations along the stream.Therefore, loading is the most important parameter in the CSTR model.
Apart from the contribution to the total standard deviation in model result, the cost of sampling a parameter should also be considered.Inexpensive parameters to sample, such as smface area and volume, which can be obtained by measuring the width and the depth of the stream, may prove to be an economical method to reduce the tmcertainty in calculated concentrations.For the example presented, loading, surface area, and volume all are sampled at 1050 m.One would expect that the standard deviation in concentration would be lowest at the sample locations, but the lowest standard deviations are shifted slightly downstream (Figure 7.6).This is because of the effect of concentration and flow from the upstream cells.Moreover, it is seen in Figure 7.6 that the lowest standard deviations of concentration due to reaction rate, apparent settling velocity, and discharge do not occur atthe sampled location.Since these parameters are not sampled at any specific location, and their uncertainties are assumed to be constant along the stream, changes in their standard deviation arc from model sensitivity.
After new samples are taken, all the procedures in Figure 7.1 are repeated.Figure 7.7 shows the total uncertainty in the model result, before and after taking the samples of surface area, volume, and loading, at 1050 m.The standard deviation of concentration after taking the new samples clearly drops not only at the sampling location, but also between the distance 750 to 1750 m because of the spatial continuity described with the variogram function.The QDE approach then identifies the next sampling location seen in Figure 7.7, which is at 2250 m downstream with a standard deviation of 0.541 g/m 3 .
The QDE approach can be very useful for any sampling program when there is a limited budget for data collection.In response to this resource limitation, the tendency is to use only a single indicator of ambient conditions and at a limited number of observation sites.The QDE approach can help select the most important parameter to sample and identify the most appropriate sampling location to reduce the cost of monitoring.In the QDE approach, uncertainty associated with sampled input parameters can be reduced by additional sampling.However, some input paramters are not obtained from field sampling (i.e.calibrated parameters) and their uncertainties at this point are assumed to be constant along the stream.The authors recognize this problem and are doing research to develop a method of reducing the uncertainty associated with unsampled input parameters.

Conclusion
Through this example, it was shown that the QDE approach quantitatively calculates the uncertainty in model result, which is the contaminant concentration in this case study.By applying QDE to a CSTR model, the variance in concentration was computed from input data uncertainty and model sensitivity.Model sensitivity was calculated by the automatic differentiation approach using ADIFOR 2.0, and uncertainties in input data are calculated by applying the multivariate conditional probability approach.By combining these two matrices in a Taylor series calculation of variance, uncertainties in concentration due to each parameter were quantitatively detennined.As expected, the parameter that was found to provide the largest contribution to the standard deviation of concentration was contaminant loading.Therefore, the contaminant loading along with easily collected parameters were sampled atthe location oflargest standard deviation.After sampling, the QDE approach was rerun and a new location for sampling was determined.

Notation
The following symbols are used in this chapter.
As area where settling can take place (m 2 ); a 0 " maximum distance for which spatial continuity exists (range); b maximum uncertainty in the data (sill); Cov [.:rmi ,xmj] covariance between input data for cells i and} for uncertainty associated with the distance h Figure 7.1 QDE flow chart of programs (boxed and capitalized) and data matrices (unboxed and lower case) required to produce the variance in result from a model.
3) h the distance which separates sampled and estimated data y(h) the uncertainty associated with a distance h a the maximum distance for which spatial continuity exists (the range), and b

Figure 7 . 3
Figure 7.3 Extrapolated loading data along with its associated standard deviation (cr).

Figure 7 .
Figure 7.4 3D surface representing sensitivity due to loading.
Figure 7.5 Concentration and its associated standard deviation (a) with anows indicating initial sample locations.

AgFigure 7 . 6
Figure 7.6 Standard deviation of concentration due to each parameter in CSTRmodel.

Figure 7 . 7
Figure 7. 7 Standard deviation of concentration before and after sampling at 1050 m.
order reaction rate constant (d-1 ); expectation operator; base of natural logarithms = 2. 718281828; distance which separates sampled and estimated data; discharge (m 3 /d); vector that stores data for every cell along the creek; vector that stores data only from sampled cells; volume of a cell (m 3 ); apparent settling velocity (mid); variance of the model result in cell k; local contaminant loading (g/d); sensitivity of model results at locationk to changes in input data at location i for parameter m;