The URL can be used to link to this page

Home My WebLink AboutDRAFT for Discussion - GW Statistical Methodology - October 20 2016 Technical Memorandum STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES October 2016 – DRAFT FOR DISCUSSION PURPOSES ONLY DR A F T This page intentionally left blank. DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES Contents Background ............................................................................................................................ 1 PART I – PRELIMINARY DATA ANALYSIS ........................................................................... 4 1. Descriptive Statistics ...................................................................................................... 4 2. Graphical Analysis ......................................................................................................... 6 3. Identify Outliers .............................................................................................................. 7 4. Identify Distributions ....................................................................................................... 8 5. Test for Serial Correlation .............................................................................................. 8 6. Test for Seasonality ....................................................................................................... 9 7. Test for Trend ...............................................................................................................10 8. Confirm Trend Using Piecewise Polynomial Regression ...............................................12 9. Determination of Baseline Period for Background Wells ...............................................13 PART II – TEST FOR SUB-GROUPS....................................................................................15 1. Graphical Analysis ........................................................................................................17 2. Tests for Sub-Groups ....................................................................................................17 3. Tests to Identify which Sub-Groups are Different ..........................................................20 PART III – DEVELOP BACKGROUND THRESHOLD VALUES (UPPER PREDICTION LIMITS) .................................................................................................................................21 FIGURES Figure 1. Decision Flow Chart for Part I – Preliminary Data Analysis ......................................... 5 Figure 2. Plot of Iron (Total) as a Function of Time .................................................................... 6 Figure 3. Scatter Plot of Boron (Dissolved) as Function of Time ................................................ 7 Figure 4. Box Plot of Monthly Observations of Dissolved Oxygen over Years 1998 through 2008 ..................................................................................................................................10 Figure 5. Maximum Likelihood Estimation Regression using De-seasonalized Beryllium (Dissolved) Concentrations (ug/L) .............................................................................11 Figure 6. Polynomial Piecewise Regression using Beryllium (Dissolved) Concentrations (ug/L) ..................................................................................................................................13 Figure 7. Decision Flow Chart for Part II – Determining Differences across Sub-Groups ..........16 Figure 8. Empirical Distribution Function Plot of De-seasonalized Observations (Specific Conductance (µS/cm)) for Season=0 and Season=1 ................................................17 Figure 9. Decision Flow Chart for Part III - Developing Reference Background Concentration Values .......................................................................................................................24 DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES TABLES Table 1. Monitored Constituents For Groundwater Assessments Required by the Coal Ash Management Act of 2014 ........................................................................................................... 1 Table 2. Test for Normality Assumptions Required for ANOVA Tests .......................................18 Table 3. Kaplan-Meier Class of Test Results for Differences across Monthly Means ................20 DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES Background Regulations providing North Carolina groundwater quality standards are provided in T15A NCAC 02L .0202. Section (b)(3) of the regulation provides that: Where naturally occurring substances exceed the established standard, the standard shall be the naturally occurring concentration as determined by the Director. This document provides the approach developed to establish the naturally occurring concentrations of constituents included in the sampling required for the Duke Energy ash basins. The naturally occurring concentration for a constituent is that value which represents the upper threshold values from the upper tail of the data distribution. HDR will refer to this upper threshold value as the reference background concentration value, that is, the concentration of the constituent that represents ambient (i.e. naturally occurring) background conditions 1. The monitored constituents for the groundwater assessments required by the Coal Ash Management Act of 2014 are listed in Table 1 below. Note that HDR will only use the non-filtered results. HDR will review and evaluate the corresponding filtered results; however, they will not be used for compliance purposes. In addition, samples will not be used in the development of the reference background concentrations whenever turbidity conditions during the sampling events are greater or equal to 10 nephelometric turbidity units (NTUs)2. Table 1. Monitored Constituents For Groundwater Assessments Required by the Coal Ash Management Act of 2014 Aluminum Cobalt pH Uranium, Natural Antimony Copper Selenium Uranium, 233 Arsenic Iron Strontium Uranium, 234 Barium Lead Sulfate Uranium, 236 Beryllium Manganese Total Dissolved Solids Radium 226 Boron Mercury Total Suspended Solids Radium 228 Cadmium Molybdenum Thallium Chloride Nickel Vanadium Chromium (Total) Nitrite Zinc Chromium (VI) Nitrate The steps described in this document serve as a guideline to develop upper prediction limits (UPLs) for the constituents for the groundwater detection and assessment and monitoring programs required by the Coal Ash Management Act at the Duke Energy ash basins in North Carolina. They closely follow HDR’s proposed method to establish reference background concentrations for constituents according to the Environmental Protection Agency’s Hazardous and Solid Waste Management System; Disposal of Coal Combustion Residuals from Electric 1 Evaluating Metals in Groundwater at DWQ Permitted Facilities (NC DENR DWQ 2012). 2 The units of turbidity from a calibrated nephelometer are called Nephelometric Turbidity Units (NTU). 1 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES Utilities; Final Rule (EPA CCR)3. The reference background concentrations determined by the process presented in this document will be submitted to the North Carolina Department of Environmental Quality (DEQ) Division of Water Resources for determination as to the naturally occurring site reference background concentration for the specific constituent. A site-specific report documenting the procedures, evaluations, and calculation will be prepared and submitted to DEQ. The steps in this document are based on the US Environmental Protection Agency (USEPA) “Unified Guidance” (USEPA 2009), and the ProUCL (USEPA 2013). In addition, the North Carolina Division of Water Quality (NCDWQ) technical assistance document for Evaluating Metals in Groundwater at DWQ Permitted Facilities (NCDWQ 2012) was also referenced. As recommended by the EPA Unified Guidance, HDR has selected the 95 percent upper prediction limits (UPL95) to establish the reference background concentration value for each of the constituents at each site 4. The UPL95 for each constituent represents a concentration value with a low probability (i.e., 5 percent chance) of being exceeded when a future sample(s) for the constituent is taken. If the concentration in the future sample is higher than the estimated UPL95, then steps are taken to validate the observation by means of quality assurance checks and/or by re-sampling and/or to investigate for possible natural or anthropogenic reasons for observing a higher concentration than expected. HDR will produce these limits for each of the constituents using their respective concentrations observed in the samples taken from the set of background wells located at each Duke Energy ash basin site. The data across these background wells per constituent and per site will be pooled prior to estimating the reference background concentration using the UPL95. Concentrations of the constituents measured in samples taken downgradient to the background wells can then be compared to the background concentrations as part of any planned inter-well testing regimen. 3 HDR modified its earlier methods to establish reference background concentration so that both state and federal regulations are comparable. Having similar processes to address the two sets of regulations will minimize confusion. 4 There are different methods which can be used to estimate the reference background concentrations such as the UPL and the upper tolerance limit (UTL). HDR chose the UPL as it is the statistic recommended by the Unified Guidance (page 2-15). The Unified Guidance recommends the UPL over the UTL for the following reasons: 1. the ability to estimate an UTL which can control for Type I error rates when simultaneously testing an exact number of multiple future or independent observations is not precise as it is when estimating the appropriate UPL; 2. the mathematical underpinnings of the UPLs under re-testing strategies are well established while those for re-testing with tolerance limits are not. Re-testing strategies are now encouraged and sometimes required under assessment monitoring situations; and finally, 3. statistically, the two limits are similar, especially under normal assumptions; and to avoid confusion as to which statistic to use, the UPL is chosen. If it is not known how many future samples will be tested for compliance at the next sampling event, one could use an upper simultaneous limit (USL); however, the quality of this statistic rests heavily on a well-specified distribution and it is not robust to outliers. ProUCL regularly cautions about the use of the USL in the presence of outliers and if there is any question as to possible contamination in the assumed pristine background test sites. ProUCL has indicated that the USL is preferred over the UTL (page 92) since the UTL does not address the increased chance of an exceedance when multiple samples are being simultaneously tested. Then in a separate section, ProUCL states that the UTL is preferred over the UPL since it is not dependent on the number of future observations being simultaneously tested (page 100). The research team feels these recommendations are not based on sufficient evidence to discount the use of the UPL to set reference background concentrations. 2 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES Each site has between one and ten background wells; the wells have been drilled at different points in time beginning as early as the late 1990s or early 2000s. In addition to the difference in time when each well started being sampled, the wells were also sampled at different intervals, and for different purposes, and were drilled into bedrock or in shallow or deep unconsolidated materials. HDR reviewed and vetted these well locations from the perspective of the location relative to possible sources and in adherence to the Comprehensive Site Assessment (CSA) and/or Corrective Action Plan (CAP) specifications. While there are differences in the locations and the sampling periods of the wells, the fundamental assumption is that the constituent concentrations sampled at these background wells, when pooled, serve as an estimate of overall well field conditions for a given constituent. HDR will test this assumption using statistical tests. If distinct sub-groups exist as which may be the case across different hydrostratigraphic units, HDR will develop separate background concentrations for each distinct sub-group of wells. The process described by this technical memo consists of three parts. • Part I – Preliminary Data Analysis Part I includes the analyses used to assess and transform the data where necessary such that the data can be used to produce appropriate UPLs. HDR refers to this stage as the preliminary data analysis (PDA). Consideration is given to issues related to outliers, serial correlation, seasonality, spatial variability, trends, and appropriateness of the period of record (sampling period). • Part II – Test for Sub-Groups Part II of this report describes the approach to test for sub-group differences. During the statistical analyses in Part 1, questions may arise as to the appropriateness of assuming all data from all wells at a given site for a given constituent truly represents the overall well field condition at any point in time. There may be statistically significant differences in mean concentrations among sub-groups of observations for a constituent. Types of sub-groups to test include seasonal sub-groups (winter, spring, summer, and fall) and well class sub-groups (bedrock, shallow, or deep). If after testing the groups are statistically different, the same steps described in Part I can be applied to the partitioned data to better understand the distribution of the samples within a sub-group for each constituent. The reference background concentration values using the UPL95 for a constituent will be produced for each sub-group of samples, provided the sub-groups represent distinct populations. • Part III – Develop Background Threshold Values (Upper Prediction Limits) Part III describes the statistical analyses used to produce UPLs for each constituent. 3 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES PART I – PRELIMINARY DATA ANALYSIS The PDA phase contains nine distinct steps and is summarized in the decision flow chart in Figure 1. The numbers alongside the shapes in Figure 1 represent where each of the PDA steps occurs in the flow. 1. Descriptive Statistics Descriptive statistics will be developed per background well, per constituent, and repeated for data pooled across all background wells within a site. Purpose of descriptive analysis is to characterize data and assess quality of information. The following statistics will be produced per constituent, per well, and then pooled over all background wells. • Sample size • Number of detects • Number of non-detects 5 • Percentage of non-detects • Number of distinct observations • Number of distinct method detection limits6 (MDLs) • Range of collection period in months: Difference between last sampling date and first sampling date • Mean • Median • Minimum • Maximum • Standard deviation • Coefficient of variation • Median Absolute Deviation/0.6757 • Skewness • Kurtosis 5 Trace concentrations of constituents may be present in groundwater samples but cannot be detected and are reported as less than the detection limit of the analytical instrument or laboratory method used. Such an observation is deemed to be a ‘non-detect’. Values which can be measured are called detects. 6 A method detection limit (MDL) is the single value deemed to be the minimum value that a specific test can detect at the laboratory conducting the tests. Over time, a laboratory may have different MDLs for a constituent as the laboratory’s equipment and protocols change. 7 The Median Absolute Deviation /0.675 is a robust estimate of variability of the population standard deviation. A robust estimate of variability, which is less likely impacted by outliers in the sample, is derived by dividing the median absolute deviation (MAD) with the constant of 0.675. The MAD is the sum of the differences between each of the observations and the median. Any differences that are less then zero are converted to positive values before a total sum is computed. 4 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES Run Descriptive Statistics Graphical Analysis Identify Distribution (detects) Deseasonalize Seasonality Yes No Identify Outliers Period of Record represents background? NoYes Update period of record (repeat steps 1 – 7) Run piecewise regressions Trend Analysis 1 2 3 4 6 7 8 9 Test for Serial Correlation 5 Develop background (Go to Part III) Sub-groups are the same Repeat above sequence for each sub-group Yes No Test for sub- groups (Go to Part II) 6 7 Figure 1. Decision Flow Chart for Part I – Preliminary Data Analysis 5 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES 2. Graphical Analysis Scatter plots of observations as a function of time will be developed. Different colors can be used to differentiate detects from non-detects. Figure 2. Plot of Iron (Total) as a Function of Time The graphs visually provide clues as to whether the period of record is reflective of a steady state baseline period. Multiple method detection limits over time can affect quality of the data. One needs so decide at this point if all data can be incorporated into analysis or if older historical data may need to be dropped. For example, Figure 3 below demonstrates a time series of boron (dissolved). Although samples are observed starting in 1976 for boron (dissolved), only the data from 2006 to 2012 were used to study trends and set UPLs, because there was a change in the data reporting protocols for samples commencing in 2006 for boron. Iron Total (ug/l) Date Iron Total (ug/l) - Raw Data 31/01/1993 05/02/1996 09/02/1999 13/02/2002 17/02/2005 22/02/2008 0 20000 40000 60000 80000 100000 120000 140000 Data Type Detected Not Detected 6 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES Figure 3. Scatter Plot of Boron (Dissolved) as Function of Time Outliers and seasonality can also be visually detected. Further statistical tests described in sections 3 to 8 below will need to be conducted to confirm assumptions from visual inspections. 3. Identify Outliers A statistical outlier is defined as a value originating from a different statistical population than the rest of the sample. Outliers or observations not derived from the same population as the rest of the sample violate the basic statistical assumption of identically distributed measurements. If an outlier is suspected, an initial helpful step is to construct a probability plot of the ordered sample data versus the standardized normal distribution. Two tests are available to test for possible outliers. Dixon’s Outlier Test is appropriate for data series with sample sizes less than 25, and Rosner’s Outlier Test is applicable to those with a sample size larger than 25. These outlier tests assume that the rest of the data except for the suspect observation(s) are normally distributed. In conjunction with Dixon’s and Rosner’s Tests, observations which are greater than three standard deviations from the mean will be flagged and evaluated. Values greater than 3 standard deviations from the mean may be outliers. Boron Dissolved vs Date Slave River Date Boron Dissolved (ug/L) - Raw Data Aug-1976 Nov-1980 Mar-1985 Jun-1989 Sep-1993 Dec-1997 Mar-2002 Jun-2006 Sep-2010 Dec-20140 6 13 19 25 32 38 44 51 57 63 69 76 82 88 95 101 107 114 120 Data Type Detected Not Detected 7 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES If outliers are found from the tests, HDR recommends that the anomalous numbers be investigated. If they are correct values and collected under standard, consistent protocols, they will remain in the data series. Otherwise, they can be dropped before proceeding. Some distributions naturally have anomalously low or high values. The subsequent tests for distribution types should find the best fitting distribution that can explain the anomalous values. Other literature suggests repeating the statistical procedures with and without the outliers. After a comparison of the estimates is completed, a decision needs to be made as to which data set is representative. HDR believes that a decision needs to be made, but at the data collection and assessment stage. The risk of this method is that the estimated distributions and statistics tend to be chosen to suit a goal. The decision is not necessarily objective. An example would be where a sample was qualified as “J+” (biased high), due to equipment blank contamination. If a sample such as this was seen as an outlier, it may be possible to eliminate it from further analysis for this reason. If there is a doubt as to the authenticity and reliability of the measured value, it should not be used. Otherwise, it is a value that was generated by the system regulating the water quality conditions of the tested groundwater well. 4. Identify Distributions Since many tests make an explicit assumption concerning the distribution represented by the sample data, the form and exact type of distribution often has to be checked using a GOF test. A GOF test assesses how closely the observed sample data resemble a proposed distributional model. The best GOF tests attempt to assess whether the sample data closely resemble the tails of the candidate distributional model. The models under consideration for water quality samples are normal, lognormal, or gamma distributions. The Shapiro-Wilk (sample sizes <= 50) and Lilliefors (sample sizes >50) tests can be used to test for normal distribution. Note that these two tests can be used to test for lognormal distributions after the data is transformed using the natural log function. The two types of empirical distribution function (EDF) based methods, the Kolmogorov-Smirnov (K-S) and Anderson-Darling (A-D) test, are used to test for a gamma distribution. The software package from the EPA, ProUCL8 has incorporated these methods to automatically test for either normal, lognormal, or gamma distribution types. If all GOF tests fail, non-parametric estimation methods will be used. 5. Test for Serial Correlation Sources for serial correlation in water samples can be due to seasonal effects or temporal effects related to the timing of the sample collections. Trend analysis using regression techniques of a water quality constituent sampled over time is confounded if the time series exhibits serial correlation. The regression errors from adjacent observations may be 8 Singh, Anita and Ashok K. Singh. 2013. ProUCL Version 5.0.00 User Guide: Statistical Software for Environmental Applications for Data Sets with and without Non-detect Observations. U.S. Environmental Protection Agency, EPA/600/R-07/041 8 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES correlated. For example, if the residual from a given month’s observation is high, then it is likely that the residual from the next month’s observation will also be high. The same logic follows for low residuals giving rise to other low residuals. This type of correlation is referred to as serial correlation or autocorrelation. Run the autocorrelation function test for successive lags (1-n). (HDR uses the statistical software application NCSS 9 to run this test.) 6. Test for Seasonality As explained in step 5, there are different reasons why a series of water quality constituent samples exhibit serial correlation. A common reason arises from changes in season as evidenced from varying temperatures and precipitation. These changes impact water quality constituents in a predictable and cyclical manner over the months. The study of water quality changes over time is focused on the ability to discern true trend through regression analysis amidst the cyclical nature of the data or its “seasonality”. The correct use of these regression analyses rests on the crucial assumption that regression errors or residuals arising from the model fitting are independent of each other. This is often not the case with data that is seasonal in nature. If seasonality exists, then the autocorrelation function test described in step 5 will pick up the behavior. To better understand the type of seasonality (monthly, quarterly, bi-annually) which factors into the observed variability of a time series, a visual inspection of the data as a function of time is recommended. Box plots of observations on a monthly or quarterly basis (provided one has at least 8 to 10 observations per month or per quarter) will be plotted. Note that the side-by-side box plots can also be used to visually inspect the level of variability per sub-group. See Figure 4 for an example of box plots. 9 Hintze, J. (2013). NCSS 9. NCSS, LLC. Kaysville, Utah, USA. www.ncss.com. 9 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES Figure 4. Box Plot of Monthly Observations of Dissolved Oxygen over Years 1998 through 2008 7. Test for Trend The samples from background wells represent water quality conditions exhibiting natural variability and unaffected by anthropogenic activities. As such, one expects that the measurements taken at regular intervals over time (three or more years) demonstrate a steady or stationary time series. The data from the background wells will be tested to determine whether a trend exists (values steadily increasing or steadily decreasing). PAR_NAME=Oxygen Dissolved (mg/l) Before Deseasonalization Month Raw Data 1 2 3 4 5 6 7 8 9 10 11 12 6 8 10 12 14 16 18 10 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES Depending on the presence of non-detects and seasonality, one of the following regression tests will be selected: • Kendall Seasonal Regression if using full data series with seasonality present • Maximum Likelihood Estimation (MLE) Regression 10 with best fitting distribution on de-seasonalized data • Mann-Kendall (non-parametric, no seasonality, only 1 MDL) Figure 5 below demonstrates the estimated trend line for the beryllium (dissolved) over time. In this particular regression, the slope is positive and statistically significant at the 0.05 level of significance. The question arises is if this entire series due to the upwards trend should be used to estimate the UPLs for the background wells. Figure 5. Maximum Likelihood Estimation Regression using De-seasonalized Beryllium (Dissolved) Concentrations (ug/L) 10 The MLE regression approach solves the “likelihood equation” to find values for mean and standard deviation that are most likely to have produced both the non-detect and detect data. Annual Trend Analysis - Deseasonalized Data vs. Date Lognormal Distribution Date Deseasonalized Data Mar-2006 Dec-2006 Sep-2007 Jun-2008 Feb-2009 Nov-2009 Aug-2010 May-2011 Jan-2012 Oct-20121.0000 1.0011 1.0021 1.0032 1.0042 1.0053 1.0063 1.0074 1.0084 1.0095 1.0105 1.0116 1.0126 1.0137 1.0147 1.0158 1.0168 1.0179 1.0189 1.0200 Data Type Detected Not Detected 11 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES 8. Confirm Trend Using Piecewise Polynomial Regression In the circumstances when changes in trend may occur within a parameter’s time series, the piece-wise polynomial model has proven useful. This approach attempts to find an appropriate mathematical model that expresses the relationship between the parameter values and the sampling dates by using piece-wise regressions. Two types of piece-wise models are used to study the trends: linear-linear model and linear-linear-linear model. The linear-linear regression model assumes and identifies one structural break in a parameter series, in which the two portions of the data separated by the break point follow two different trends as modeled by two different linear equations. Similarly, the linear-linear- linear model attempts to identify two structural breaks to separate three different linear trends. This approach is informative, but it has the disadvantage of not being able to account for non-detects in a sample. Hence, HDR recommends implementing the piece-wise polynomial models and MLE non-detects regression approach complementary to one another. HDR has applied the piece-wise models mainly as a visual guide when selecting the baseline periods for the parameters. For example, while the MLE Regression approach suggested that the overall trend for beryllium (dissolved) over time is steadily increasing (Figure 5), the polynomial piece-wise regression with two structural breaks in Figure 6 suggests that recently the concentrations are moving downwards to a level exhibited in the early part of the sampling period. HDR concluded that the entire sample from the period of record can be used to produce UPLs for this constituent. 12 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES Figure 6. Polynomial Piecewise Regression using Beryllium (Dissolved) Concentrations (ug/L) 9. Determination of Baseline Period for Background Wells At this point, a decision is made if the entire period of record from which the background samples were collected demonstrates a steady state and represents a baseline against which future values can be tested. If both the linear regression models and the polynomial piece-wise regression models suggest that over time the observations are steadily increasing or decreasing, then a review of the data will be done to determine if a sub- segment of the time series better represents the background period. This of course means the final sample numbers are reduced. Annual Trend Analysis: Deseasonalized Data vs Date Piece-Wise (Linear-Linear-Linear) Date Deseasonalized Data Mar-2006 Dec-2006 Sep-2007 Jun-2008 Feb-2009 Nov-2009 Aug-2010 May-2011 Jan-2012 Oct-2-0.0100 -0.0082 -0.0063 -0.0045 -0.0026 -0.0008 0.0011 0.0029 0.0047 0.0066 0.0084 0.0103 0.0121 0.0139 0.0158 0.0176 0.0195 0.0213 0.0232 0.0250 13 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES A consideration in the decision process is that if the number of samples in the period of record is low (<10 samples) and over a short duration (< 2 years), then an observed trend in the time series may not necessarily indicate changes to the natural variability in the water quality of the background wells. In fact, the trend could be part of the natural variation; and if the time series were longer, a steady state particular to the concentrations of the constituents of interest could be observed. As more sampling events occur, the background well data can be combined and tested either for significant trends or for differences in means (see Part II below) between two time periods. HDR recommends that at least two additional sampling events be completed before amalgamating the new background data with the previous background data and testing whether the new data from the background wells is from the same population as the older data and hence can be combined to produce revised reference background concentration values. As more background data is collected to produce the reference background concentration values, the test statistics which are designed to detect statistically significant exceedances become more powerful. 14 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES PART II – TEST FOR SUB-GROUPS After completion of the PDA and graphical analysis, the data findings may suggest that the range of concentrations is not consistent over time. As discussed earlier, seasonal differences often explain changes in concentrations over time. Other differences could be related to the class of wells to which a background well belong. Examples of well classes observed across the seven Duke Energy monitoring sites are bedrock wells and deep and shallow wells. This part describes the steps required to validate if differences in mean concentrations across the sub-groups are statistically significant or not. In other words, it is used to test if two or more sample means are drawn from the same source population, i.e., same distribution function. HDR uses the significance level of 0.05 to decide whether to accept or reject the null hypothesis that there are no differences across the sub-population means. Before proceeding to test for differences across the sub-group means, one needs a sufficient sample size of at least 8 to 10 samples per sub-group according to the EPA’s 2009 Unified Guidance Report 11 and ProUCL’s Technical Guide 12. (The 8 to 10 minimum is with reference to producing background threshold values. If a sub-group is distinct enough to form its own time series and background threshold values need to be estimated, then such a minimum must be obtained. Tests of sub-group differences can be done on groups as low as 10; however, their outcomes may not be reliable. HDR recommends that sample sizes per sub-group should be at least 30.) Testing for sub-groups can be done in three steps: 1. Graphical analysis, 2. Tests for sub-group differences, and 3. Tests to identify which sub-groups are different. The testing steps are outlined in Figure 7. 11 2009. Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities: Unified Guidance. Office of Resource Conservation and Recovery, Program Implementation and Information Division, U.S. Environmental Protection Agency. EPA 530/R-09-007, page 5-3. 12 Singh, Anita, et al. 2010. ProUCL Version 5.00.0 User Guide. U.S. Environmental Protection Agency, EPA/600/R- 07/038, page 17 15 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES Censored Observations No Yes Distribution Test for ROS Imputation Test ANOVA assumptions Kruskal-Wallis Test ANOVA/Log ANOVA Test Test ANOVA assumptions KM Test of Sub- Group Differences ANOVA/Log ANOVA BDL >=50% KM Test of Sub- Group Differences Dunn’s Test to Identify Which Sub-Groups are Different Sub-Groups are Different Dunn’s Test to Identify Which Sub-Groups are Different Sub-Groups are Different Tukey-Kramer Test to Identify Which Sub-Groups are Different Sub-Groups are Different Dunn’s Test to Identify Which Sub-Groups are Different Sub-Groups are Different No Yes Yes YesNo Yes Yes Tukey-Kramer’s Test to Identify Which Sub-Groups are Different Sub-Groups are Different Graphical Analysis by Sub-Groups Test for Sub- Groups 1 2 3 YesNo Yes Yes Figure 7. Decision Flow Chart for Part II – Determining Differences across Sub-Groups 16 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES 1. Graphical Analysis A useful visual test for sub-group differences is the EDF (see NCSS - Non-detects-Data Group Comparison). This procedure computes summary statistics, generates EDF plots, and computes hypothesis tests appropriate for two or more groups for data with non-detects (left-censored) values (provided fewer than 50% of the values are non-detects (left- censored)). Figure 8 demonstrates that the two sub-groups representing samples taken during two different seasons show similar distributions, i.e., no differences in concentrations between the two seasons. Figure 8. Empirical Distribution Function Plot of De-seasonalized Observations (Specific Conductance (µS/cm)) for Season=0 and Season=1 2. Tests for Sub-Groups The following methods can be used to detect for population differences across the sub- groups: • Analysis of variance (ANOVA) (under normal distribution assumptions) • Log-ANOVA (under log-normal distribution assumption) 17 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES • Kruskal-Wallis One-Way Analysis on Ranks (distribution free assumptions/non- parametric, corrected for ties) • Kaplan-Meir (Non-parametric, useful with heavy censoring. See NCSS for this test.) HDR runs all four types of tests and compares the test outcomes across all distribution assumptions as further lines of evidence; however, the test results for the appropriate distribution assumption and level of censoring takes precedence. The decision as to which test is to be used is predicated on the presence of censorship and whether the distribution follows a parametric distribution of either normal, log-normal, or gamma or does not have a discernable distribution and hence is non-parametric. Note that the Log-ANOVA is simply the ANOVA approach applied to the natural-logarithm of the time series. The ANOVA tests do require that normality assumptions are valid for each sub-group. In addition, the variances across the groups should be approximately equal. NCSS automatically outputs tests which confirm if the normality assumptions hold. Below is an example of such output. In this example, normality assumptions hold and an ANOVA can be used to test for differences across sub-groups. Table 2. Test for Normality Assumptions Required for ANOVA Tests Assumption Test Value Prob Level Decision (5%) Skewness Normality of Residuals -0.0424 0.966172 Accept Kurtosis Normality of Residuals 1.1625 0.245037 Accept Omnibus Normality of Residuals 1.3532 0.508348 Accept Modified-Levene Equal-Variance Test 1.7228 0.085088 Accept If the level of censoring is less than 50 percent, then the Regression on Order Statistics 13 (ROS) imputation technique can be used to impute values for the non-detects. Imputation methods offer the advantage of a substitution method for non-detects that does not give all samples the same value, hence reduces the impact of bias. ROS is a simple imputation method that fills in non-detect data on the basis of a probability plot of detects. ProUCL can test which distribution assumption can be used to impute using the ROS approach. Once the non-detects have been imputed, the ANOVA assumptions can be tested once more. If they are valid, then either ANOVA can be used to test for differences across sub-groups. If not, HDR recommends using Kruskal-Wallis and Kaplan-Meier Tests to test for sub-group differences. Table 2 demonstrates the results of testing if monthly means are statistically different from each other using the Kaplan-Meier approach (NCSS has modules which implement this 13 Regression on Order Statistics (ROS): A regression line is fit to the normal scores of the order statistics for the uncensored observations and then to fill in values imputed from the straight line for the observations below the detection limit. 18 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES test). In this demonstration, the means across the months are statistically different at the significance level of 0.05 or 5 percent. 19 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES Table 3. Kaplan-Meier Class of Test Results for Differences across Monthly Means Hypotheses H0: Distribution Functions are Equal Among Groups HA: At Least One Group Distribution Functions Differs Test Name Chi-Square DF Prob Level Reject H0 (Alpha = 0.05) Logrank 42.777 11 0 Yes Gehan-Wilcoxon 42.772 11 0 Yes Tarone-Ware 42.814 11 0 Yes Peto-Peto 43.202 11 0 Yes Mod. Peto-Peto 43.194 11 0 Yes 3. Tests to Identify which Sub-Groups are Different Provided any of the four tests described above (ANOVA, Log-ANOVA, Kruskil-Wallis, and Kaplan-Meier) show sub-group differences, HDR recommends the following tests to identify which sub-group(s) is/are different from the others. • Post-Hoc Test for Multiple Comparisons, o Tukey-Kramer Test (parametric), and o Dunn’s Test (non-parametric). 20 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES PART III – DEVELOP BACKGROUND THRESHOLD VALUES (UPPER PREDICTION LIMITS) The primary goal of water quality sampling is to determine whether or not a given sample or collection of samples have exceeded some baseline set of samples or a particular previously defined standard. In general, this is done by computing the UPL. The upper limit is computed because the concern is generally for exceedances greater than the baseline or standard. A few parameters, such as pH or dissolved oxygen may require both upper and lower prediction limits. The formulation of the prediction limit may vary slightly with the particulars of the test to be made and the characteristics of the data involved (see chapters 3 and 5 of ProUCL’s Version 5.0.00 Technical Guide 14), but in general, the formula for the prediction limit for k future or independent observations is: 𝑈𝑈𝑈𝑈𝑈𝑈=𝑥𝑥̅+𝑡𝑡1−∝/𝑘𝑘,𝑛𝑛−1𝑆𝑆�1 + 1𝑛𝑛 Where 𝑥𝑥̅ = baseline (historical data) sample mean S = baseline (historical data) standard deviation t = Student’s t with 1−∝ degrees of freedom ∝ = Type I (false positive) error rate n = number of observations in baseline dataset k = number of future or independent obtained samples to be predicted 15. HDR uses ProUCL to estimate the appropriate UPL given the type of parametric or non- parametric distribution a constituent’s concentration follows, the level of censorship and the number of future samples expected to be tested at a given point in time. Figure 9 describes the steps and decision points required to select the appropriate UPL for each constituent at a given site. The six key steps described below are flagged alongside their appropriate stage within the decision flow chart in Figure 9. 1. The first step in applying ProUCL to produce the UPLs is to separate constituents into those with at least one non-detect and those with no non-detects. ProUCL calculates UPLs differently depending on whether there are non-detects in the data set. The algorithms in ProUCL use imputation and modelling techniques to address the issue of non-detects. It does not use the common practice of substituting a non-detect for a given constituent with one-half of its MDL as this method introduces bias into the estimates for the UPL95, among other estimates. Some constituents may have 50 percent or more of their observations as 14 Singh, Anita and Ashok K. Singh. 2013. ProUCL Version 5.0.00 Technical Guide: Statistical Software for Environmental Applications for Data Sets with and without Non-detect Observations. U.S. Environmental Protection Agency, EPA/600/R-07/041 15 Definition of Upper Prediction Limit (UPL): The upper boundary of a prediction interval for an independently obtained observation (or an independent future observation), ProUCL Version 5.0.00, Technical Guide, page xx. 21 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES non-detects. With so many observations below detection limits 16 (BDLs), it is difficult to discern what distribution best fits the data. Provided the non-detects have been validated through the data review process, if such a high percentage of non-detect measurements are observed, HDR recommends applying non-parametric methods to estimate the background concentration levels. If a sub-set of non-detects are suspect, they will be removed from the analysis prior to estimating the percentage of non-detects (see graphical analysis section of Part I). Once this step is completed, follow the steps below. 2. Produce 95 percent UPLs (UPL95) for one future (or independently obtained) observation per constituent. Produce the UPL95 for m future observations where m represents the number of different down gradient wells at each site where the constituent is sampled at each sampling event. (Select ‘all’ option in ProUCL under Upper Limits/BTVs17 module. ProUCL outputs UPLs, UTLs, and USLs). Save all output. 3. Record all UPL95s under all parametric distribution models and non-parametric distribution models. 4. The following logic provides guidance in the situations where the parametric GOF tests indicate that multiple distributions fit the observed distribution from the sample. Ideally, only one or no distribution tests passes. If a distribution is strongly indicated by the empirical distribution generated from the sample (i.e., follows the textbook shape for the distribution), it is more likely that only that distribution will pass the GOF tests. However, when one is presented with smaller datasets (n<30) or datasets with gaps between observations, or shapes that almost look like the distribution being tested, more than one distribution GOF test passes. In fact all three or any two of the GOF tests for the Gamma, Lognormal, or Normal distribution can pass. In these circumstances, preference is given to the Gamma distribution provided it has passed its GOF test over the remaining two distributions (see point a below). If the Gamma GOF test fails, then preference is given to the Lognormal over the Normal provided the Lognormal has passed and is not heavily skewed (i.e., its standard deviation is <= 1) (see point b). An interesting point is that in the situation where the Lognormal has passed and the Gamma has failed, if the Lognormal is highly skewed (i.e., its standard deviation is > 1), then the team uses the Gamma distribution to model the data and does not automatically default to the Normal distribution even if the data sample also passed the Normal GOF test. The Gamma and Lognormal distributions are closely related in shape; however, the Gamma has the added advantage of not having extremely long tails to the right. This mitigates the chance that the UPL95 will be extremely high relative to the median of the distribution as could be the case if one used either the Lognormal or Normal distribution assumptions to estimate the UPL95. At this point, the team is encouraged to study the empirical 16 Note that MDL and BDL mean different things. MDL is the “Method Detection Limit” and is the single value specified as the minimum value that a specific test can detect. BDL is “Below Detection Limit” and indicates any result less than or equal to the MDL. 17 ProUCL uses the term Background Threshold Value (BTV) to represent the reference background concentration value. 22 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES distributions produced during the PDA in light of the GOF test results and use professional judgment as to the appropriateness of the GOF test results. If at least one parametric GOF passes, then do the following for k=1 and k=m; a. If Gamma passes, use the Gamma UPL95(k) as the background concentration level. b. If Lognormal passes, then do the following; i. If the standard deviation of the fitted lognormal distribution is less than or equal to 1, then use lognormal UPL95(k) for the background concentration level; ii. If the standard deviation is > 1, use Gamma UPL95(k)18 for the background concentration level. c. Otherwise, use Normal UPL95(k) for the background concentration level. If no parametric test passes, use the non-parametric UPL95 for the background concentration level. Please note that the non-parametric UPL95 is constant for any number of future or independent observations. 5. Given that measurements follow no discernable parametric distribution, if the sample size <30 and distribution is skewed (i.e., skewness ne 0), use Chebyshev’s non-parametric estimate (at a confidence coefficient=85 percent or 90 percent to be conservative, i.e., have some chance an exceedance will occur); otherwise, use the non-parametric UPL95. 6. To develop UPLs for samples with insufficient data (i.e. < 8 observations). (Note that ProUCL will provide a warning if you have insufficient samples for parametric estimates). a. Rank order the observations from smallest to largest including a 120 percent of the maximum detect observation to better understand the distribution. b. Until sufficient samples are collected, assign the maximum detected value as the background prediction limit. The maximum observation is used by some practitioners to estimate an upper limit, especially for small samples (see U.S. Environmental Protection Agency (EPA). 1992a. Supplemental Guidance to RAGS: Calculating the Concentration Term. Publication EPA 9285.7-081, May 1992.). 18 The authors of ProUCL (2013) recommend for skewed data sets, parametric (e.g., gamma distribution on KM estimates) or nonparametric methods (e.g., Chebyshev inequality) which account for data skewness should be used to compute reference background concentration estimates. (See page 136). “The use of a parametric lognormal distribution on a lognormally distributed data set tends to yield unstable impractically large upper confidence limit values, especially when the standard deviation (sd) of the log-transformed data is greater than 1.0 and the data set is of small size such as less than 30 inches to 50 inches (see page 257). 23 | Page DR A F T Duke Energy Carolinas, LLC | TECHNICAL MEMORANDUM DRAFT FOR DISCUSSION PURPOSES ONLY –October 2016 - STATISTICAL METHODS FOR DEVELOPING REFERENCE BACKGROUND CONCENTRATIONS FOR GROUNDWATER AT COAL ASH FACILITIES Censored Observations At least one parametric GOF passes No Yes Warning: Large Loss of Power Use Non-Parametric Tests Run UPLs(k) in ProUCL for constituents with no non-detects Record UPLs(k) under different distribution assumptions and distribution free assumptions No Yes Run UPLs(k) in ProUCL for constituents with non- detects Record UPLs(k) under different distribution assumptions and distribution free assumptions At least one parametric GOF passes Warning: Large Loss of Power Use Non-Parametric Tests No Yes If sample size < 30, provide warning that more samples are required If sample size < 8, estimate is unreliable and recommend the maximum value as the upper background limit until more samples can be collected sample size <= 30 sample size <= 30Use Non-Parametric Use Chebyshev No Yes No Yes Use Non-Parametric (ignores NDs) Use Chebyshev (imputes using KM) Develop BTVs If sample size < 30, provide warning that more samples are required If sample size < 8 estimate is unreliable and recommend the maximum value as the upper background limit until more samples can be collected 1 2 3 5 6 Gamma passes Use Gamma UPL95(k) Lognormal passes Use Normal UPL95(k) Lognormal sigma <=1 Use Lognormal UPL95(k)Use Gamma UPL95(k) Gamma passes Use Gamma UPL95(k) Lognormal passes Lognormal sigma <=1 Use Lognormal UPL95(k)Use Gamma UPL95(k) Use Normal UPL95(k) 4 >=50% BDLs Use Non- parametric methods No Yes No Yes No Yes No Yes No Yes No Yes No Yes No. Detects <4 YesNo If 100% BDLs, team may need to decide other site specific values or sources Estimate is unreliable regardless of the level of censorship and recommend the maximum detect value as the upper background limit until more samples can be collected Figure 9. Decision Flow Chart for Part III - Developing Reference Background Concentration Values 24 | Page DR A F T