continuous forecasts


For reference, here is a link to the best web page I know of for forecast verification methodology:

I have used some of the equations from that page (since the Google docs equation editor doesn't translate to Google sites)

# basically, the stuff in this courier font is code 
# you should be able to copy and paste it into the python notebook
# things that begin with # are comment lines and are not executed

continuous variables


specific values, such as temperature, wind speed, probability of precipitation

following the Murphy & Winkler general framework - joint distribution approach

set up python


import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


we’re going to generate some fake observations and forecasts and do some quick calculations

# generate 1000 fake observations from a Normal distribution (mean 50, std dev 20)
mu=50
sigma=20
obs=np.random.normal(mu,sigma,1000)

np.max(obs)
np.min(obs)
np.mean(obs)
np.median(obs)

# build fake forecasts: obs plus a random error (mean -1, std 10)
#   plus a systematic error that depends on the obs value
randerr=np.random.normal(-1,10,1000)
syserr=10.0-0.2*obs
fcst=obs+randerr+syserr
np.max(fcst)
np.min(fcst)
np.mean(fcst)
np.median(fcst)

mean error (bias)

error=fcst-obs
np.mean(fcst-obs)
np.mean(error)

positive: forecast is on average above the obs (too warm)
negative: forecast is on average below the obs (too cold)

mean absolute error


np.mean(np.abs(error))

typical magnitude of the error

mean square error

np.mean(error*error)
np.mean(error**2)

more sensitive to large errors (outliers)
a single bad forecast (or bad ob) will make this large
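
as a quick illustration of that sensitivity (a minimal sketch using the error array defined above), corrupt a single error value and compare how MAE and MSE respond:

# copy the errors and corrupt one value to mimic a single bad forecast (or bad ob)
bad_error = error.copy()
bad_error[0] = 200.0
print('MAE original, corrupted:', np.mean(np.abs(error)), np.mean(np.abs(bad_error)))
print('MSE original, corrupted:', np.mean(error**2), np.mean(bad_error**2))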

root mean squared error



same units as forecast, obs
another typical magnitude of error (sensitive to large errors)

mse=np.mean(error**2)
rmse=np.sqrt(mse)
print(rmse)

let’s use the joint distribution approach and make a scatter plot of forecast values vs. observed values (f,x)
to be consistent with the contingency table convention, obs = x coordinate, fcst = y coordinate


plt.plot(obs,fcst,'.')
plt.plot([-20, 120],[-20,120],'r')
plt.xlabel('obs')
plt.ylabel('fcst')


we’re looking at p(f,x) from the sample
could do some kernel density estimation to estimate p(f,x) in between points
or turn this into a categorical forecast verification and use contingency tables
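
a minimal sketch of the kernel density idea (this uses scipy's gaussian_kde, which is my own choice of estimator; the code on this page otherwise only uses numpy and matplotlib):

from scipy.stats import gaussian_kde

# estimate the joint density p(f,x) from the sample and evaluate it on a grid
kde = gaussian_kde(np.vstack([obs, fcst]))
xgrid, ygrid = np.meshgrid(np.linspace(-20., 120., 50), np.linspace(-20., 120., 50))
density = kde(np.vstack([xgrid.ravel(), ygrid.ravel()])).reshape(xgrid.shape)
plt.contourf(xgrid, ygrid, density)
plt.xlabel('obs')
plt.ylabel('fcst')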

can also plot:
errors vs forecast values

plt.plot(fcst,error,'.')


errors vs observed values

plt.plot(obs,error,'.')
plt.xlabel('obs')
plt.ylabel('error')
plt.axhline(0)


this can reveal relationships between the errors and particular predicted or observed values
sort of a conditional distribution
do we underforecast observed winds for high wind speed events?
do we overforecast cold observed temperatures?
do errors increase in magnitude for rare events?
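
for example (a quick sketch with the arrays above), conditioning the mean error on the observed value reveals the conditional bias we built into the fake forecasts:

# mean error for large vs. small observed values
print('mean error for obs > 80:', np.mean(error[obs > 80.]))
print('mean error for obs < 20:', np.mean(error[obs < 20.]))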


also useful to examine the distribution of forecast and observed values

let's look at histograms, which are empirical estimates of the PDF

plt.hist(obs)

plt.hist(fcst)

plt.hist([obs,fcst])


another useful way to look at these distributions is by plotting the empirical cumulative distribution function (CDF)
CDF: p(x)=Pr{X<=x} = integral of PDF(x) from -inf to x

to do this:

1. sort the data (x) from low to high

fcstsort=np.sort(fcst)
obssort=np.sort(obs)


2. estimate CDF=p(xi)
p(xi)=(i-a)/(n+1-2a)

Tukey a=1/3
Gumbel a=1

a=1./3.
n=1000.
i=np.arange(1.0,1001.0,1.0)
px=(i-a)/(n+1.-2.*a)


3. plot sorted data (x) vs p(x)


plt.plot(fcstsort,px,'r',obssort,px,'b')
plt.ylabel('p(Tmax<=x)')
plt.xlabel('Tmax')
plt.title('CDFs fcst = red obs = blue')


for a Normal distribution, this will look a bit like the side of a hill
starts at zero, ramps up slowly at first as we move along the left tail of the distribution, then ramps up quickly as we move across the main part of the distribution, then ramps up slowly again towards 1.0 as we move along the right tail of the distribution

for a uniform distribution, the CDF will be a straight line
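
to see the straight-line case (a quick side check, reusing the px plotting positions from above), draw a uniform sample and plot its empirical CDF the same way:

# the empirical CDF of a uniform sample plots as (nearly) a straight line
unifsort = np.sort(np.random.uniform(0., 100., 1000))
plt.plot(unifsort, px, 'g')
plt.xlabel('x')
plt.ylabel('p(X<=x)')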

Linear Error in Probability Space (LEPS)
LEPS = (1/n) * sum over k of | CDFobs(fcst_k) - CDFobs(obs_k) |

basically look up the CDF of the obs value and compare that to the CDF of the forecast value (using the same observed distribution to define those CDFs)...take the absolute value of the difference, and average all of those

when the underlying observed distribution is uniform, the CDF is linear, so this score behaves essentially like the MAE (rescaled into probability space)

HOWEVER, when the underlying distribution has tails or skewness, like precipitation for example, large differences in actual values will result in small differences in LEPS when you are comparing values out in the "tail" of a particular distribution.
On the other hand, when you compare values near the "bulge" of the distribution, small differences in actual values will result in large differences in LEPS.

this is similar to saying that a forecast for 50mm when 25mm is observed is still relatively "good", while a forecast for 1mm when 10mm is observed is relatively bad, even though the actual difference in the first case is larger than in the second...a forecast of a rare event should be "graded easier" than a forecast for a common event


cdfobs=px
cdffcst=np.interp(fcstsort,obssort,px)

here we are using numpy's interpolation function to linearly interpolate the CDF values of the obs distribution to the forecast values.  this interpolation function does not extrapolate by default, so any forecast values greater than the observed maximum are assigned the largest CDF value in px (about 0.999 in this example with sample size = 1000) and any forecast values less than the observed minimum are assigned the smallest CDF value (about 0.001).

now we can calculate the LEPS

leps=np.mean(np.abs(cdffcst-cdfobs))
print(leps)

doing a scatter plot of these two sets of CDF values is what is known as a Q-Q plot (quantile-quantile plot), an exploratory data analysis procedure for determining whether two samples come from the same distribution; if they do, the points will fall along the main diagonal.  this is similar to what we would expect the forecast performance to be if we verified after recalibrating the forecasts to remove bias

plt.plot(cdfobs,cdffcst,'.')
plt.plot([0,1],[0,1],'r')


skill score


for most of these error measures, the score for a perfect forecast = 0, for example, for MSE

SS_mse = 1 - MSE_forecast/MSE_reference

for a reference forecast = climatology:

MSE_reference = mean( (O_bar - obs)^2 )

where O_bar is the mean observed value, often called "climatology"

questions:

should this be the mean observation from the verification sample?
should this be a long-term climatology?
should this be the overall/global mean observation, or should this vary spatially/temporally?

error=fcst-obs
mse=np.mean(error**2)
anomaly=np.mean(obs)-obs
mseclimo=np.mean(anomaly**2)
ssmse=1-mse/mseclimo
print(ssmse)
print(mse)
print(mseclimo)

correlation coefficient can also be used as a measure of "association" (aspect of forecast quality as defined by Murphy 1993)

correlation is not sensitive to bias, a biased forecast could have perfect correlation, so correlation by itself should not be used as a single measure of quality

A=np.corrcoef(fcst,obs)
r=A[0,1]
print(A)
print(r)

this returns a correlation matrix; the values on the diagonal should be = 1, indicating that the forecasts are perfectly correlated with themselves and the observations are perfectly correlated with themselves.  A is a 2x2 matrix, and we can refer to the i,j element of A either with

print(A[1,0])
print(A[1][0])

in general, A=np.corrcoef(x0,x1) returns a matrix whose A[1,0] element = A[0,1] element = the correlation between x0 (the forecasts) and x1 (the observed values)

as a reminder, the definition of the correlation coefficient is the covariance between two variables (X and Y), divided by the product of their standard deviations:

r_XY = cov(X,Y) / (s_X * s_Y)

Murphy (1988) showed how the MSE can be rewritten in terms of the means, standard deviations, and correlation between forecasts and observations:

MSE = (fbar - xbar)^2 + sf^2 + sx^2 - 2*sf*sx*rfx

where sf, sx are the sample standard deviations of the forecast and observations, respectively
and rfx is the correlation between the forecasts and observations

  1. first term is the bias (squared), therefore the larger the bias, the larger MSE will be
  2. second term is the variance of the forecasts, one might assume that a lower MSE could be obtained by simply issuing constant forecasts (all equal to the mean, giving zero variance sf2=0), however we also have the standard deviation in the fourth term, which is negatively oriented, therefore low variance (standard deviation) would also result in a low fourth term.  ideally we want our forecast system to have a similar variance as the observations.
  3. third term is the variance of the observations, obviously outside the control of the forecaster
  4. fourth term consists of the product of the standard deviations of both forecasts and observations with the correlation between the two...forecasts that are more highly correlated with observations will result in a lower MSE
note that for a perfect forecast, sf = sx and rfx = 1, therefore the fourth term would cancel out the second and third terms, first term would be zero (unbiased)

this is telling us: to reduce your MSE...reduce your bias and increase your correlation with the observations
this may seem quite obvious, but here we have the math to explain how this works
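
we can verify the decomposition numerically with the arrays we already have (a quick sketch; the identity holds exactly when we use numpy's default, population-form standard deviation):

sf = np.std(fcst)
sx = np.std(obs)
rfx = np.corrcoef(fcst, obs)[0, 1]
fbar = np.mean(fcst)
xbar = np.mean(obs)
mse_terms = (fbar - xbar)**2 + sf**2 + sx**2 - 2.*sf*sx*rfx
print('MSE directly  :', np.mean((fcst - obs)**2))
print('MSE from terms:', mse_terms)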

in a similar vein, we can decompose the skill score based on MSE (with climatology as the reference):

SSmse = rfx^2 - (rfx - sf/sx)^2 - ((fbar - xbar)/sx)^2

here, the first term is the square of the linear association between the forecasts and observations, a good forecast will have a higher number for this term
the second term is related to reliability/conditional bias...it is also negatively oriented, so a good forecast will have a small magnitude for this term
  • ideally, sf = sx, so the ratio would = 1...also recall that for simple linear regression, the slope of the regression line is related to the correlation between the two variables, so this compares the slope of a regression line to the ideal slope
  • if there is conditional bias, the forecast system may overforecast small values and underforecast large values, the slope of the linear regression between forecasts and observations would be less than 1 even if the overall bias = zero
  • this term will increase in magnitude if the forecast system is unreliable
the third term is related to the overall bias of the forecast system, a good forecast will have a small number for this term

sx = np.std(obs)
sf = np.std(fcst)
fbar=np.mean(fcst)
xbar=np.mean(obs)
A=np.corrcoef(fcst,obs)
r=A[0,1]
print('correlation term',r**2)
print('reliability term',(r-sf/sx)**2)
print('bias term',((fbar-xbar)/sx)**2)
ssmse2=r**2-(r-sf/sx)**2-((fbar-xbar)/sx)**2
print('skill score based on MSE',ssmse2)

let's try to see how these numbers will change if we examine a biased set of forecasts and a set of forecasts that have larger conditional bias
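
one way to try that (a sketch; the constant offset and the 0.5 damping factor are arbitrary choices, just to make the effects visible):

# an unconditionally biased forecast: add a constant offset to the obs + random error
fcst_biased = obs + randerr + 5.0
# a conditionally biased forecast: damp the departures from the mean (slope < 1)
fcst_condbias = np.mean(obs) + 0.5*(obs - np.mean(obs)) + 0.5*randerr

for name, f in [('unconditionally biased', fcst_biased), ('conditionally biased', fcst_condbias)]:
    sfj = np.std(f)
    rj = np.corrcoef(f, obs)[0, 1]
    print(name)
    print('  correlation term', rj**2)
    print('  reliability term', (rj - sfj/sx)**2)
    print('  bias term       ', ((np.mean(f) - xbar)/sx)**2)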

Probability forecasts

Probability forecasts can be considered a subset of continuous forecasts since the forecast can have any value between 0 and 1 (or 0-100%).

Typically the probability of a specific event, such as precipitation greater than or equal to 0.01"

(POP)

therefore, the observations can only have two values: 0 or 1

obspop = np.random.binomial(1,0.3,1000)

this will return 1000 numbers, either 0 or 1, drawn from a binomial distribution with a 0.3 probability of returning 1 for each

let's generate totally random forecasts, from a uniform distribution
say that our forecasts are not very sharp, they only vary between 0.1 and 0.5

fcstpop=np.random.uniform(0.1,0.5,1000)

check the mean values to see if they are both close to 0.3

np.mean(obspop)
np.mean(fcstpop)

also, as we have already noted, probability forecasts are often categorized by placing the continuous numbers into discrete bins, and we have used categorical methods to verify these kinds of forecasts
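
as an aside, numpy can do this binning directly with np.digitize (a quick sketch; the Brier score decomposition code later on this page does its own binning with logical tests):

# assign each forecast to a 0.1-wide bin; digitize returns the bin index for each value
bins = np.arange(0., 1.1, 0.1)
cat = np.digitize(fcstpop, bins)
print(cat[:5])
print(np.bincount(cat))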

MSE of a probability forecast is known as the Brier Score (BS)

BS = (1/n) * sum over k of (fk - xk)^2

for n pairs of forecast/observed events
where xk = 1 when the observed event occurs, and xk = 0 when the observed event does not occur

pretty easy to compute in python

error=fcstpop-obspop
BS=np.mean(error**2)
print(BS)

this is an error, so for a perfect forecast: BS = 0 (no "B.S.")


Brier Skill Score (BSS)

BSS = 1 - BS/BSref

BSref = the Brier score of a reference forecast, usually climatology
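
a minimal sketch of the BSS using the sample climatology as the reference forecast (whether to use the sample climatology or a long-term climatology is the same question raised for the MSE skill score above):

# reference forecast: always issue the sample climatological frequency
xbar = np.mean(obspop)
BSref = np.mean((xbar - obspop)**2)
BSS = 1. - BS/BSref
print('BSref =', BSref)
print('BSS =', BSS)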


BS can be decomposed into terms related to: 
  • reliability
  • resolution
  • uncertainty

calibration-refinement factorization from Murphy and Winkler (1987)

let's look at the first 5 numbers in each forecast and observation array
fcstpop[:5]
obspop[:5]

let's say the forecast values are indicated by i=1,M categories

for example, let's say the i=1 category is for  0 <= f < 0.2
total number of events = n
n=len(obspop)

let Ni be the number of times each categorical fi was issued

let's look at some logical tests on our first 5 forecast values
fcstpop[:5]<0.2
this will return a "boolean" array of True and False for each value in the fcstpop array after testing if they are <0.2

we can take advantage of this array of True and False to do conditional calculations, a simple sum treats True =1 and False = 0
np.sum(fcstpop[:5]<0.2)


since most categories are greater than or equal to a lower limit and less than an upper limit, we use the numpy logical_and function to do this:
np.logical_and(fcstpop[:5]>=0, fcstpop[:5]<0.2)
np.sum(np.logical_and(fcstpop[:5]>=0, fcstpop[:5]<0.2))

(should give the same answer as before since fcstpop are all non-negative)

take away the array index that was limiting things to the first 5 numbers to see the number of forecasts that were issued in this category:
Ni=np.sum(np.logical_and(fcstpop>=0, fcstpop<0.2))
print(Ni)

the marginal distribution of forecast values is given by the relative frequency of each forecast category:

p(fi) = Ni/n
Ni=np.sum(np.logical_and(fcstpop>=0, fcstpop<0.2))
pfi = Ni*1.0/n
print(pfi)

here I multiplied Ni by 1.0 to force floating-point division (dividing two integers in Python 2 does integer division; in Python 3 it does not, but the 1.0 does no harm)

we can also list the forecast values only where the forecasts fall into our category:

fcstpop[:5][np.logical_and(fcstpop[:5]>=0., fcstpop[:5]<0.2)]

the nice thing about this is that we can also list the observed values that correspond to those specific forecast values, only where the forecasts fall into our category, simply by indexing the observations by the logical test on the forecasts:

obspop[:5][np.logical_and(fcstpop[:5]>=0., fcstpop[:5]<0.2)]

now let's look at the conditional probability that x=1 given f=fi, which is:

xbari = p(x=1 | fi) = (1/Ni) * sum of the observed xk over the cases where the forecast falls in category i
sum/average the observed values for only those occasions where fi is issued
conditional average observation, expected value of x given fi
using numpy, this is simply:

np.mean(obspop[:5][np.logical_and(fcstpop[:5]>=0., fcstpop[:5]<0.2)])

#now do the entire array
np.mean(obspop[np.logical_and(fcstpop>=0., fcstpop<0.2)])

recall that this is getting at reliability: we want the frequency of observation = yes to be equal to the forecast category for a reliable forecast system
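
a quick check of that (a sketch using the arrays defined above): compare the conditional mean observation to the mean forecast issued in the same category; for our purely random forecasts they will not match:

incat = np.logical_and(fcstpop >= 0., fcstpop < 0.2)
print('mean forecast in category   :', np.mean(fcstpop[incat]))
print('mean observation in category:', np.mean(obspop[incat]))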
 
unconditional sample climatology:

xbar = (1/n) * sum over all k of xk   (the overall relative frequency of observed "yes")

decomposed Brier score:

BS = (1/n) * sum over i of Ni*(fi - xbari)^2  -  (1/n) * sum over i of Ni*(xbari - xbar)^2  +  xbar*(1 - xbar)

BS   =   "reliability"   -   "resolution"   +   "uncertainty"


let's set up some code to calculate the components of the Brier score using our POPs and forecast categories every 10%


recall that python can do looping with the "for <index> in <list>" command
examples:


for k in [7, -3.3, np.pi, 'hello']:
    print(k)

a = [7, -3.3, np.pi, 'hello']
for i in np.arange(4):
    print(i, a[i])

print(np.arange(4))




also recall the python syntax: when you use a "for" loop (or an "if" test), the end of that statement needs a colon ":", and the block of code that follows must be indented to indicate that it is the block executed in the loop.  the indents must be consistent within that block of code, and the block is terminated by the next line of python that is NOT indented

in the first loop, you see how the for loop works: it simply steps through the provided list sequentially, executing the indented block of code that follows.  the list can contain any mix of data types; here we've provided an integer, a float, a numpy object, and a string

the second loop uses the same list, but steps through a numpy array (np.arange) of sequential numbers, in this case 0 to 3 (four numbers in total).  the block of code prints the loop index i, then refers to element i of list "a"

a numpy "array" is a special kind of list that only contains one type of variable, say "short" integers or double-precision floating point numbers.  using arrays is faster than using regular lists: since all of the members are of the same type, there is less overhead in python when operating on arrays than on regular lists

you can define a list to be a numpy array using:

mylist = np.array([2, 3, 5], dtype = 'i')

the first argument is the list, the second argument is the data type, use:
  • 'd' for double-precision floating point
  • 'f' for single-precision floating point
  • 'i' for short integer
  • 'l' for long integer
double-precision floats are 64-bit; 'long' integers are 64-bit on some platforms and 32-bit on others, so the exact sizes depend on the machine you're using
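
you can check the sizes on your own machine (a quick check; itemsize is in bytes, so multiply by 8 for bits):

print(np.dtype('d').itemsize * 8, 'bits for dtype d')
print(np.dtype('l').itemsize * 8, 'bits for dtype l')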

it is usually a good idea to "initialize" an array that you will be working with to set the values to all zeros, using the numpy.zeros function

a = np.zeros(1000, dtype='d')

where the first argument is the shape of the array

try:

b = np.zeros((3,2), dtype='d')
np.shape(b)
np.ndim(b)
np.size(b)
b.dtype.char

let's try some other common array manipulation

a = np.arange(6)
print(a)
b = np.reshape(a,(3,2))
print(b)
c = np.transpose(b)
print(c)
a = np.append(a,[7,8])
print(a)
aplus = np.concatenate((a,[10,11]))
print(aplus)
aaa = np.repeat(a,3)
print(aaa)

let's try to use the looping and arrays to calculate the components of the BS decomposition, try the following python code, make sure that you understand how each line works


import numpy as np

#set up obs as random draw from binomial distribution, 0.3 frequency of yes
obspop = np.random.binomial(1,0.3,1000)
#set up fcst as random draw from uniform distribution, half between 0.1 and 0.5
#   the rest between 0 and 1
fcstpop=np.random.uniform(0.1,0.5,500)
fcstpop = np.append(fcstpop,np.random.uniform(0.,1.,500))

#set up arrays for lower and upper bounds on categories
lower = np.arange(0.,1.,.1)
upper = np.arange(.1, 1.1, .1)

#set the upper bound high enough to include forecasts equal to 1.0
upper[9]=1.01

n = len(obspop)*1.0
print('lower',lower)
print('upper',upper)

#initialize variables with zeroes
# N is the total number of forecast in each category
N = np.zeros(len(lower))
# pf is the marginal distribution of forecasts
pf = np.zeros(len(N),dtype='d')
# xbari is the average observation given the corresponding forecast was in a category
xbari = np.zeros(len(N),dtype='d')
# fi is the average forecast given that forecast was in a category
fi = np.zeros(len(N),dtype='d')

# loop over all of the categories, find N, pf, xbari, fi
for i in np.arange(len(lower)):
    N[i]=np.sum(np.logical_and(fcstpop>=lower[i], fcstpop<upper[i]))
    pf[i]=N[i]/n
    xbari[i]=np.mean(obspop[np.logical_and(fcstpop>=lower[i], fcstpop<upper[i])])
    fi[i]=np.mean(fcstpop[np.logical_and(fcstpop>=lower[i], fcstpop<upper[i])])

#overall unconditional mean obs
xbar = np.mean(obspop)
#components of BS
reliability = np.sum(N*(fi-xbari)**2)/n
resolution = np.sum(N*(xbari-xbar)**2)/n
uncertainty = xbar*(1-xbar)
# print stuff
print('N',N)
print('pf',pf)
print('xbari',xbari)
print('fi',fi)
print('reli',reliability)
print('reso',resolution)
print('uncer',uncertainty)
print('BSdecomp =',reliability - resolution + uncertainty)
print('BS =',np.mean((fcstpop-obspop)**2))
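
with the conditional means in hand, it can also be useful to plot xbari against fi along with the 1:1 line (a sketch; this kind of plot is commonly called a reliability diagram, and points on the diagonal indicate a reliable forecast system):

import matplotlib.pyplot as plt
plt.plot(fi, xbari, 'o-')
plt.plot([0, 1], [0, 1], 'r')
plt.xlabel('mean forecast in category (fi)')
plt.ylabel('conditional mean observation (xbari)')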


ROC curves

graphical way of looking at the conditional distributions using the likelihood-base rate factorization, POD vs POFD for a fixed observed event, varying the forecast threshold across all possible values


let's plot ROC curves for the Mason (1989) idea of the two Gaussian distributions that we can control

one distribution of forecast values for "yes" observations
another distribution of forecast values for "no" observations

more separation between these two distributions means better discrimination
if the two distributions are identical, there is no discrimination


from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline



# dprime = separation between the two conditional forecast distributions (discrimination)
dprime=0.2
# alpha = climatological frequency of the observed "yes" event
alpha=0.25
# f0 = distribution of forecasts given "no" obs, f1 = distribution given "yes" obs
f0=norm(loc=0.5-dprime/2.,scale=0.15)
f1=norm(loc=0.5+dprime/2.,scale=0.15)
fcst_yes=f1.rvs(size=int(10000*alpha))
fcst_no=f0.rvs(size=int(10000*(1-alpha)))
fcst=np.append(fcst_no,fcst_yes)
# fcst is probability, make sure it is between 0 and 1
fcst[fcst<0.]=0.
fcst[fcst>1.]=1.
obs_yes=np.ones(int(10000*alpha))
obs_no=np.zeros(int(10000*(1-alpha)))
obs=np.append(obs_no,obs_yes)
fcstsort=fcst[np.argsort(fcst)]
fcst2d=fcstsort.reshape((100,100))
obssort=obs[np.argsort(fcst)]
obs2d=obssort.reshape((100,100)) 

# might be interesting to plt.contourf obs2d and fcst2d
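# for example (an optional quick look, not needed for the ROC calculation):
#   each 2-d array is just the sorted sample laid out in rows, so contourf shows
#   how the yes/no obs are distributed across the range of forecast values
plt.figure()
plt.contourf(fcst2d)
plt.colorbar()
plt.title('sorted forecast values')
plt.figure()
plt.contourf(obs2d)
plt.colorbar()
plt.title('obs sorted by forecast value')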

# next, set up an array of forecast thresholds that vary from 0 to 1, with 100 pts in between
# Tf = thresholds for converting the continuous fcst to yes/no
Tf=np.linspace(0.,1.,100)
# To = threshold defining the observed event (obs are 0 or 1, so 0.5 separates no from yes)
To=0.5
# total number of forecast/obs pairs (as a float)
n=len(obs)*1.0
# set up arrays of POD and POFD with zeroes
POD=np.zeros(len(Tf))
POFD=np.zeros(len(Tf))

# loop across the forecast thresholds, find regions in Venn diagram to determine 
#  2x2 contingency table components for each forecast threshold, then find POD, POFD
for i in np.arange(len(Tf)):
    a=np.logical_and(fcst>Tf[i],obs>To)
    b=np.logical_and(fcst>Tf[i],obs<=To)
    c=np.logical_and(fcst<=Tf[i],obs>To)
    d=np.logical_and(fcst<=Tf[i],obs<=To)
    aa=np.sum(a)/n
    bb=np.sum(b)/n
    cc=np.sum(c)/n
    dd=np.sum(d)/n
    POD[i]=aa/(aa+cc)
    POFD[i]=bb/(bb+dd)

# plot ROC curve
plt.plot(POFD,POD,'+-')



