Forecasts will be quantitatively evaluated for each target using two metrics. Point forecasts will be evaluated using relative mean absolute error to assess performance relative to a seasonal autoregressive model and to other forecasts. The probability distributions will be evaluated using the logarithmic score. For each target, relative MAE and the logarithmic score will be calculated across all seasons and forecast times (week of the season) as well as for specific seasons and forecast times to identify potential differences in model strengths. The primary comparisons will be made for the testing period (2010-2013); however, forecasts will also be compared between the training and testing periods to assess how forecast accuracy changes when predicting on data that was excluded from the model development process.

IMPORTANT NOTES: A different model may be employed for each target and location. No metrics will be compared across targets.

**Relative Mean Absolute Error**

Mean absolute error (MAE) is the mean absolute difference between predictions \(\hat{\mathbf{y}}\) and observations \(\mathbf{y}\) over \(n\) data points:

\(MAE(\hat{\mathbf{y}}, \mathbf{y})=\frac{1}{n}\sum\limits_{i=1}^{n} \left|\hat{y}_{i}-y_{i}\right|\).

Relative MAE for models A and B is:

\(relMAE_{A,B}=\frac{MAE_{A}}{MAE_{B}}\)

An important feature of this metric is that it can be interpreted directly in terms of accuracy of predictions. For example, \(relMAE_{A,B}\) = 0.8 indicates that, on average, predictions from model A were 20% closer to the observed values than those from model B. Additionally, comparing multiple candidate models to a common baseline model with relative MAE allows for the assessment of the relative accuracy of the candidate models. For example, the relative MAE for model A versus the baseline model can be divided by the relative MAE for model B versus the baseline, resulting in the relative MAE for model A versus model B.
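The MAE and relative MAE calculations above, including the baseline-ratio property described in this paragraph, can be sketched as follows. This is a minimal illustration using NumPy with hypothetical observations and predictions; the function names and values are not part of the evaluation protocol.

```python
import numpy as np

def mae(y_hat, y):
    """Mean absolute error between predictions and observations."""
    return np.mean(np.abs(np.asarray(y_hat, float) - np.asarray(y, float)))

def rel_mae(mae_a, mae_b):
    """Relative MAE of model A versus model B."""
    return mae_a / mae_b

# Hypothetical observations and predictions from two candidate models
# plus a common baseline model
y        = np.array([10.0, 12.0, 15.0, 11.0])  # observed values
y_hat_a  = np.array([11.0, 11.0, 14.0, 12.0])  # model A predictions
y_hat_b  = np.array([12.0, 14.0, 13.0,  9.0])  # model B predictions
y_hat_bl = np.array([13.0, 15.0, 11.0,  8.0])  # baseline predictions

mae_a  = mae(y_hat_a, y)   # 1.0
mae_b  = mae(y_hat_b, y)   # 2.0
mae_bl = mae(y_hat_bl, y)  # 3.25

# Candidate models compared against the common baseline
rel_a = rel_mae(mae_a, mae_bl)
rel_b = rel_mae(mae_b, mae_bl)

# Dividing the two baseline-relative scores recovers relMAE of A vs. B
assert np.isclose(rel_a / rel_b, rel_mae(mae_a, mae_b))
```

Here \(relMAE_{A,B}\) = 0.5, so model A's predictions are on average half as far from the observations as model B's.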

**References**

- Hyndman RJ and AB Koehler. (2006) Another look at measures of forecast accuracy. International Journal of Forecasting. 22(4):679-688. Available at: http://www.buseco.monash.edu.au/ebs/pubs/wpapers/2005/wp13-05.pdf.

- Reich NG, J Lessler, K Sakrejda, SA Lauer, S Iamsirithaworn, and DAT Cummings. (2015) Case studies in evaluating time series prediction models using the relative mean absolute error. Available at: http://works.bepress.com/nicholas_reich/11.

**Logarithmic Score**

The logarithmic scoring rule is a proper scoring rule based on a binned probability distribution of the prediction, \(\mathbf{p}\). The score is the log of the probability assigned to the observed outcome, \(i\):

\(S(\mathbf{p},i) = \ln(p_{i})\)

For example, a single prediction for the peak week includes probabilities for each week (1-52) in the season. Weeks when the peak is unlikely are assigned low probabilities, near zero, while the forecast peak week receives the highest probability. The total of these probabilities across all weeks must equal 1, i.e.:

\(\sum\limits_{i=1}^{52} p_{i}=1\).

If the observed peak week is week 30 and \(p_{30}\) = 0.15, the score for this prediction is \(\ln(0.15)\), or approximately -1.9.
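The worked example above can be reproduced as a short sketch. The forecast distribution below is hypothetical, constructed so that the probabilities over all 52 weeks sum to 1 and the observed week (week 30) is assigned a probability of 0.15 while the forecast peak falls elsewhere.

```python
import numpy as np

def log_score(p, observed_week):
    """Logarithmic score: log of the probability assigned to the
    observed outcome (weeks are 1-indexed, array is 0-indexed)."""
    return np.log(p[observed_week - 1])

# Hypothetical binned forecast over peak weeks 1-52, concentrated
# around week 31, with small probability on all other weeks
p = np.full(52, 0.005)
p[27:32] = [0.08, 0.085, 0.15, 0.30, 0.15]  # weeks 28-32

# The probabilities across all weeks must sum to 1
assert np.isclose(p.sum(), 1.0)

# Observed peak week is 30; the forecast assigned it p_30 = 0.15
score = log_score(p, observed_week=30)  # ln(0.15), approximately -1.9
```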

Note that the logarithmic score is based on a probabilistic estimate: a complete probability distribution over all possible outcomes. This is desirable because it requires the forecaster to consider the entire range of possible outcomes, and to estimate the likelihood of each one of them. Two forecasters may agree on which outcome is the most likely, and therefore submit the same point estimate, which would be scored identically by MAE. However, their predictions may differ substantially on how likely this outcome is, and how likely other outcomes are. By considering this probability, the logarithmic score enables scoring of the confidence in the prediction, not just the value of the point prediction.

Another advantage of logarithmic scores is that they can be summed across different time periods, targets, and locations to provide both specific and generalized measures of model accuracy. The bins that will be used for forecasts of the peak week, maximum weekly incidence, and total cases in the season will be specified in the forecast templates.
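The summation property described above follows from the fact that a sum of log scores equals the log of the product of the assigned probabilities, i.e. the joint log-likelihood of the observed outcomes. A minimal sketch with hypothetical values:

```python
import numpy as np

# Hypothetical probabilities a model assigned to the observed outcomes
# of four separate forecasts (e.g., different targets or locations)
assigned = np.array([0.15, 0.40, 0.25, 0.10])

# Per-forecast logarithmic scores and their sum
scores = np.log(assigned)
total = scores.sum()

# Summing log scores is equivalent to the log of the product of the
# assigned probabilities, so the total is a joint log-likelihood
assert np.isclose(total, np.log(assigned.prod()))
```

Sums over any subset of forecasts (one season, one location, one target) give the corresponding specific measure, while the grand total gives a generalized measure of model accuracy.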

**References**

- Gneiting T and AE Raftery. (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association. 102(477):359-378. Available at: https://www.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf.

- Rosenfeld R, J Grefenstette, and D Burke. (2012) A Proposal for Standardized Evaluation of Epidemiological Models. Available at: http://delphi.midas.cs.cmu.edu/files/StandardizedEvaluation_Revised_12-11-09.pdf.