I need to provide a link to the best web page I know of that contains forecast verification methodology information; I have used some of the equations from that page (since the Google Docs equation editor doesn't translate to Google Sites).

# basically, the stuff in this courier font is code
# you should be able to copy and paste it into the python notebook

Forecasts of specific values, such as temperature, wind speed, or probability of precipitation, are verified here following the Murphy & Winkler general framework: the joint distribution approach.

Set up python:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

We're going to generate some fake observations and forecasts and do some quick calculations.

mu = 50
sigma = 20
obs = np.random.normal(mu, sigma, 1000)
np.max(obs)
np.min(obs)
np.mean(obs)
np.median(obs)

randerr = np.random.normal(-1, 10, 1000)
syserr = 10.0 - 0.2*obs
fcst = obs + randerr + syserr
np.max(fcst)
np.min(fcst)
np.mean(fcst)
np.median(fcst)

mean error (bias)

error = fcst - obs
np.mean(fcst - obs)
np.mean(error)

Positive: the forecast is on average above the obs (too warm). Negative: the forecast is on average below the obs (too cold).

mean absolute error

np.mean(np.abs(error))

This is a typical magnitude of the error.

mean square error

np.mean(error*error)
np.mean(error**2)

This is more sensitive to large errors (outliers); a single bad forecast (or bad ob) will make it large.

root mean squared error

This has the same units as the forecasts and obs, and is another typical magnitude of the error (sensitive to large errors).

mse = np.mean(error**2)
rmse = np.sqrt(mse)
print(rmse)

Let's use the joint distribution approach and make a scatter plot of forecast values vs. observed values (f, x). To be consistent with the contingency table, obs = x coordinate and fcst = y coordinate.

plt.plot(obs, fcst, '.')
plt.plot([-20, 120], [-20, 120], 'r')
plt.xlabel('obs')
plt.ylabel('fcst')

We're looking at p(f,x) from the sample. We could do some kernel density estimation to estimate p(f,x) in between points (a short KDE sketch follows the Q-Q plot discussion below), or turn this into a categorical forecast verification and use contingency tables.

We can also plot errors vs. forecast values:

plt.plot(fcst, error, '.')

and errors vs. observed values:

plt.plot(obs, error, '.')
plt.xlabel('obs')
plt.ylabel('error')
plt.axhline(0)

This can reveal relationships between the errors and particular predicted or observed values, sort of a conditional distribution. Do we underforecast observed winds for high wind speed events? Do we overforecast cold observed temperatures? Do errors increase in magnitude for rare events?

It is also useful to examine the distributions of the forecast and observed values. Let's look at histograms, which are empirical estimates of the PDF:

plt.hist(obs)
plt.hist(fcst)
plt.hist([obs, fcst])

Another useful way to look at these distributions is by plotting the empirical cumulative distribution function (CDF):

CDF: p(x) = Pr{X <= x} = integral of the PDF from -inf to x

To do this:

1. sort the data (x) from low to high

fcstsort = np.sort(fcst)
obssort = np.sort(obs)

2. estimate CDF = p(xi) with the plotting position p(xi) = (i - a)/(n + 1 - 2a), where a = 1/3 (Tukey) or a = 1 (Gumbel)

a = 1./3.
n = 1000.
i = np.arange(1.0, 1001.0, 1.0)
px = (i - a)/(n + 1. - 2.*a)
3. plot the sorted data (x) vs. p(x)

plt.plot(fcstsort, px, 'r', obssort, px, 'b')
plt.ylabel('p(Tmax<=x)')
plt.xlabel('Tmax')
plt.title('CDFs fcst = red obs = blue')

For a Normal distribution this will look a bit like the side of a hill: it starts at zero, ramps up slowly at first as we move along the left tail of the distribution, ramps up quickly as we move across the main part of the distribution, then ramps up slowly again toward 1.0 as we move along the right tail. For a uniform distribution, the CDF will be a straight line.

Linear Error in Probability Space (LEPS)

LEPS = mean over all pairs of |CDFobs(fcst) - CDFobs(obs)|

Basically, look up the CDF value of the observation and compare it to the CDF value of the forecast (using the same observed distribution to define both CDFs), take the absolute value of the difference, and average all of those. When the underlying observed distribution is uniform, the CDF is linear and this score is basically the same as the MAE. HOWEVER, when the underlying distribution has tails or skewness, like precipitation for example, large differences in actual values will result in small differences in LEPS when you are comparing values out in the "tail" of the distribution. On the other hand, when you compare values near the "bulge" of the distribution, small differences in actual values will result in large differences in LEPS. This is similar to saying that a forecast of 50mm when 25mm is observed is still relatively "good", while a forecast of 1mm when 10mm is observed is relatively bad, even though the actual difference in the first case is larger than in the second: a forecast of a rare event should be "graded easier" than a forecast of a common event.

cdfobs = px
cdffcst = np.interp(fcstsort, obssort, px)

Here we are using numpy's interpolation function to linearly interpolate the CDF values of the obs distribution to the forecast values. This interpolation function does not extrapolate by default, so any forecast values greater than the observed maximum are assigned the highest plotting position (about 0.999 in this example with sample size 1000) and any forecast values less than the observed minimum are assigned the lowest plotting position (about 0.001).

Now we can calculate the LEPS:

leps = np.mean(np.abs(cdffcst - cdfobs))
print(leps)

Doing a scatter plot of these two sets of CDF values is essentially a Q-Q plot (quantile-quantile plot) drawn in probability space, an exploratory data analysis procedure for determining whether two samples come from the same distribution; if they do, the points will fall along the main diagonal. This is similar to what we would expect the forecast performance to be if we were to verify after recalibrating the forecasts to remove bias.

plt.plot(cdfobs, cdffcst, '.')
plt.plot([0, 1], [0, 1], 'r')
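As mentioned above, kernel density estimation is one way to estimate p(f,x) in between the sample points. Here is a minimal sketch, assuming scipy is available; the grid limits just match the scatter plot above and the bandwidth is whatever gaussian_kde chooses by default, so treat the details as illustrative rather than prescribed.

# a rough sketch of estimating the joint density p(f,x) with a Gaussian kernel
from scipy.stats import gaussian_kde

values = np.vstack([obs, fcst])               # 2 x n array of (obs, fcst) pairs
kde = gaussian_kde(values)                    # bandwidth chosen automatically

# evaluate the density on a regular grid covering the data
xgrid, ygrid = np.meshgrid(np.linspace(-20., 120., 100), np.linspace(-20., 120., 100))
density = kde(np.vstack([xgrid.ravel(), ygrid.ravel()])).reshape(xgrid.shape)

plt.contourf(xgrid, ygrid, density)
plt.plot([-20, 120], [-20, 120], 'r')
plt.xlabel('obs')
plt.ylabel('fcst')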
skill score

SS_mse = 1 - MSE_forecast/MSE_reference

For a reference forecast equal to climatology, MSE_reference = mean of (O_bar - obs)^2 over the sample, where O_bar is the mean observed value, often called "climatology".

Questions: should this be the mean observation from the verification sample? Should it be a long-term climatology? Should it be the overall/global mean observation, or should it vary spatially/temporally? (A short sketch comparing two of these choices follows the code below.)

error = fcst - obs
mse = np.mean(error**2)
anomaly = np.mean(obs) - obs
mseclimo = np.mean(anomaly**2)
ssmse = 1 - mse/mseclimo
print(ssmse)
print(mse)
print(mseclimo)
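To see why the choice of reference matters, here is a minimal sketch comparing the skill score computed against the verification-sample mean with one computed against a hypothetical long-term climatological mean. The value 45.0 below is purely an illustrative assumption, not a real climatology.

# skill relative to the verification-sample mean (same as above)
mseclimo_sample = np.mean((np.mean(obs) - obs)**2)
ss_sample = 1. - mse/mseclimo_sample

# skill relative to a hypothetical long-term climatological mean (assumed value)
longterm_mean = 45.0
mseclimo_longterm = np.mean((longterm_mean - obs)**2)
ss_longterm = 1. - mse/mseclimo_longterm

print('skill vs sample climatology   ', ss_sample)
print('skill vs long-term climatology', ss_longterm)

If the long-term mean differs from the sample mean, the reference MSE is larger and the skill score looks better, which is why the choice of reference needs to be stated.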
correlation coefficient

The correlation coefficient can also be used as a measure of "association" (an aspect of forecast quality as defined by Murphy 1993). Correlation is not sensitive to bias: a biased forecast could have perfect correlation, so correlation by itself should not be used as a single measure of quality.

A = np.corrcoef(fcst, obs)
r = A[0,1]
print(A)
print(r)

This returns a correlation matrix. The values on the diagonal should be 1; these indicate that the forecasts are perfectly correlated with themselves and the observations are perfectly correlated with themselves. A is a 2x2 matrix, and we can refer to the i,j element of A with either

print(A[1,0])
print(A[1][0])

In general, A = np.corrcoef(x0, x1) returns a matrix whose A[1,0] element = A[0,1] element = the correlation between x0 (here the forecasts) and x1 (here the observed values).

As a reminder, the definition of the correlation coefficient between two variables X and Y is their covariance divided by the product of their standard deviations:

r_XY = cov(X,Y)/(s_X * s_Y)

Murphy (1988) showed how the MSE can be rewritten in terms of the means, standard deviations, and correlation of the forecasts and observations:

MSE = (f_bar - x_bar)^2 + s_f^2 + s_x^2 - 2*s_f*s_x*r_fx

where s_f and s_x are the sample standard deviations of the forecasts and observations, respectively, and r_fx is the correlation between the forecasts and observations.
Note that for a perfect forecast, s_f = s_x and r_fx = 1, so the fourth term cancels the second and third terms, and the first term is zero (unbiased). This is telling us: to reduce your MSE, reduce your bias and increase your correlation with the observations. That may seem quite obvious, but here we have the math to explain how it works.

In a similar vein, we can decompose the skill score based on MSE (with the sample climatology as the reference):

SS_mse = r_fx^2 - (r_fx - s_f/s_x)^2 - ((f_bar - x_bar)/s_x)^2

Here the first term is the square of the linear association between the forecasts and observations; a good forecast will have a larger value for this term. The second term is related to reliability/conditional bias; it is negatively oriented, so a good forecast will have a small magnitude for this term.
The third term is related to the overall bias of the forecast system; a good forecast will have a small value for this term.

sx = np.std(obs)
sf = np.std(fcst)
fbar = np.mean(fcst)
xbar = np.mean(obs)
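As a quick numerical check of the Murphy (1988) decomposition written out above, we can rebuild the MSE from the bias, the standard deviations, and the correlation and compare it to the direct calculation. This is just a sketch using the quantities already computed; with numpy's default normalizations the identity holds exactly.

# rebuild the MSE from bias, spread, and correlation (Murphy 1988)
mse_direct = np.mean((fcst - obs)**2)
mse_decomp = (fbar - xbar)**2 + sf**2 + sx**2 - 2.*sf*sx*r
print('MSE direct    ', mse_direct)
print('MSE decomposed', mse_decomp)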
print('correlation term', r**2)
print('reliability term', (r - sf/sx)**2)
print('bias term', ((fbar - xbar)/sx)**2)
ssmse2 = r**2 - (r - sf/sx)**2 - ((fbar - xbar)/sx)**2
print('skill score based on MSE', ssmse2)

Let's try to see how these numbers change if we examine a biased set of forecasts, and a set of forecasts with a larger conditional bias.

Probability forecasts

Probability forecasts can be considered a subset of continuous forecasts, since the forecast can take any value between 0 and 1 (or 0-100%). Typically it is the probability of a specific event, such as precipitation greater than or equal to 0.01 inches (POP); therefore the observations can only have two values: 0 or 1.

obspop = np.random.binomial(1, 0.3, 1000)

This will return 1000 numbers, either 0 or 1, drawn from a binomial distribution with a 0.3 probability of returning 1 for each. Let's generate totally random forecasts from a uniform distribution, and say that our forecasts are not very sharp: they only vary between 0.1 and 0.5.

fcstpop = np.random.uniform(0.1, 0.5, 1000)

Check the mean values to see if they are both close to 0.3:

np.mean(obspop)
np.mean(fcstpop)

Also, as we have already noted, probability forecasts are often categorized by placing the continuous numbers into discrete bins; we have already used categorical methods to verify those kinds of forecasts.

The MSE of a probability forecast is known as the Brier Score (BS). For n pairs of forecast/observed events,

BS = (1/n) * sum over k of (f_k - x_k)^2

where x_k = 1 when the observed event occurs and x_k = 0 when the observed event does not occur. It is pretty easy to compute in python:

error = fcstpop - obspop
BS = np.mean(error**2)
print(BS)
This is an error measure, so for a perfect forecast: BS = 0 (no "B.S.").

Brier Skill Score (BSS)

BSS = 1 - BS/BS_ref

where BS_ref is the Brier score of a reference forecast, usually climatology.
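Here is a minimal sketch of computing the BSS, assuming the reference is a constant forecast equal to the sample climatology (the relative frequency of the event in the verification data itself); for 0/1 observations that reference Brier score works out to xbar*(1 - xbar).

# Brier skill score with a climatological (constant) reference forecast
xbar = np.mean(obspop)                  # sample climatology
BSref = np.mean((xbar - obspop)**2)     # = xbar*(1 - xbar) for 0/1 obs
BSS = 1. - BS/BSref
print('BSref', BSref)
print('BSS  ', BSS)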
The BS can be decomposed into terms related to reliability, resolution, and uncertainty, using the calibration-refinement factorization from Murphy and Winkler (1987).

Let's look at the first 5 numbers in each forecast and observation array:

fcstpop[:5]
obspop[:5]

Let's say the forecast values are grouped into i = 1, ..., M categories; for example, the i = 1 category is for 0 <= f < 0.2. The total number of events is n:

n = len(obspop)

Let Ni be the number of times each categorical forecast fi was issued. Let's look at some logical tests on our first 5 forecast values:

fcstpop[:5] < 0.2

This will return a "boolean" array of True and False for each value in the fcstpop array after testing whether it is < 0.2. We can take advantage of this array of True and False to do conditional calculations; a simple sum treats True = 1 and False = 0:

np.sum(fcstpop[:5] < 0.2)

Since most categories are greater than or equal to a lower limit and less than an upper limit, we use numpy's logical_and function to do this:

np.logical_and(fcstpop[:5]>=0, fcstpop[:5]<0.2)
np.sum(np.logical_and(fcstpop[:5]>=0, fcstpop[:5]<0.2))

(This should give the same answer as before, since the fcstpop values are all non-negative.) Take away the array index that was limiting things to the first 5 numbers to see the number of forecasts that were issued in this category:

Ni = np.sum(np.logical_and(fcstpop>=0, fcstpop<0.2))
print(Ni)

The marginal distribution of forecast values is given by the relative frequency of each forecast category:

p(fi) = Ni/n

pfi = Ni*1.0/n
print(pfi)

Here Ni is multiplied by 1.0 to force floating-point arithmetic; under python 2, Ni/n would otherwise do integer division.

We can also list the forecast values only where the forecasts fall into our category:

fcstpop[:5][np.logical_and(fcstpop[:5]>=0., fcstpop[:5]<0.2)]

The nice thing about this is that we can also list the observed values that correspond to those specific forecast values, only where the forecasts fall into our category, simply by indexing the observations with the logical test on the forecasts:

obspop[:5][np.logical_and(fcstpop[:5]>=0., fcstpop[:5]<0.2)]

Now let's look at the conditional probability that x = 1 given f = fi, which is the sum/average of the observed values for only those occasions when fi was issued, i.e. the conditional average observation, the expected value of x given fi:

xbar_i = p(x=1 | fi) = (1/Ni) * sum of the x values on the occasions when fi was issued

Using numpy, this is simply:

np.mean(obspop[:5][np.logical_and(fcstpop[:5]>=0., fcstpop[:5]<0.2)])
# now do the entire array
np.mean(obspop[np.logical_and(fcstpop>=0., fcstpop<0.2)])

Recall that this is getting at reliability: for a reliable forecast system we want the relative frequency of observation = yes to equal the forecast probability in each category.

The unconditional sample climatology is:

xbar = (1/n) * sum of all the x values = np.mean(obspop)

The decomposed Brier score is then BS = "reliability" - "resolution" + "uncertainty", where

reliability = (1/n) * sum over i of Ni*(fi - xbar_i)^2
resolution = (1/n) * sum over i of Ni*(xbar_i - xbar)^2
uncertainty = xbar*(1 - xbar)

Let's set up some code to calculate the components of the Brier score using our POPs and a set of fixed-width forecast categories (every 10% in the code below). Recall that python can do looping with the "for <index> in <list>" construct. Examples:

for k in [7, -3.3, np.pi, 'hello']:
    print(k)

a = [7, -3.3, np.pi, 'hello']
for i in np.arange(4):
    print(i, a[i])

print(np.arange(4))

Also recall the python syntax: when you use a "for" loop (or an "if" test), the statement ends with a colon ":", and the block of code that follows must be indented to indicate that it is what gets executed in the loop. The indents must be consistent within that block, and the block is terminated by the next line of python that is NOT indented.

In the first loop you can see how a for loop works: it simply steps through the provided list sequentially, executing the indented block of code that follows.
The list can contain any type of data; here we've provided an integer, a float, a numpy object, and a string.

The second loop uses the same list, but loops over a numpy array (np.arange) that holds a simple sequence of numbers; the for loop steps through those numbers, in this case 0 to 3 (four numbers in total), and the block of code prints the loop index and then refers to the list "a" by that index i.

A numpy "array" is a special kind of list that contains only one type of variable, say short integers or double-precision floating point numbers. Using arrays is faster than using regular lists: since all of the members are of the same type, there is less overhead in python when operating on arrays than on regular lists.

You can define a list to be a numpy array using:

mylist = np.array([2, 3, 5], dtype='i')

The first argument is the list and the second argument is the data type code; the examples here use 'i' for integers and 'd' for double-precision floats.
Double-precision floats ('d') are 64-bit; the integer sizes ('i' is a C int, typically 32-bit, and 'l' is a C long) depend on the machine and operating system you're using.

It is usually a good idea to "initialize" an array that you will be working with, setting the values to all zeros with the numpy.zeros function:

a = np.zeros(1000, dtype='d')

where the first argument is the shape of the array. Try:

b = np.zeros((3,2), dtype='d')
np.shape(b)
np.ndim(b)
np.size(b)
b.dtype.char

Let's try some other common array manipulations:

a = np.arange(6)
print(a)
b = np.reshape(a, (3,2))
print(b)
c = np.transpose(b)
print(c)
a = np.append(a, [7,8])
print(a)
aplus = np.concatenate((a, [10,11]))
print(aplus)
aaa = np.repeat(a, 3)
print(aaa)

Let's use the looping and arrays to calculate the components of the BS decomposition. Try the following python code, and make sure that you understand how each line works:

import numpy as np
# set up obs as a random draw from a binomial distribution, 0.3 frequency of yes
obspop = np.random.binomial(1, 0.3, 1000)

# set up fcst as a random draw from a uniform distribution: half between 0.1 and 0.5,
# the rest between 0 and 1
fcstpop = np.random.uniform(0.1, 0.5, 500)
fcstpop = np.append(fcstpop, np.random.uniform(0., 1., 500))
# set up arrays for the lower and upper bounds on the forecast categories
lower = np.arange(0., 1., .1)
upper = np.arange(.1, 1.1, .1)
# set the upper bound of the last category high enough to include forecasts equal to 1.0
upper[9] = 1.01

n = len(obspop)*1.0
print('lower', lower)
print('upper', upper)
# initialize arrays with zeros
# N is the total number of forecasts in each category
N = np.zeros(len(lower))
# pf is the marginal distribution of forecasts
pf = np.zeros(len(N), dtype='d')
# xbari is the average observation, given that the corresponding forecast was in a category
xbari = np.zeros(len(N), dtype='d')
# fi is the average forecast, given that the forecast was in a category
fi = np.zeros(len(N), dtype='d')

# loop over all of the categories, find N, pf, xbari, fi
for i in np.arange(len(lower)):
    N[i] = np.sum(np.logical_and(fcstpop>=lower[i], fcstpop<upper[i]))
    pf[i] = N[i]/n
    xbari[i] = np.mean(obspop[np.logical_and(fcstpop>=lower[i], fcstpop<upper[i])])
    fi[i] = np.mean(fcstpop[np.logical_and(fcstpop>=lower[i], fcstpop<upper[i])])

# overall unconditional mean obs
xbar = np.mean(obspop)

# components of the BS
reliability = np.sum(N*(fi-xbari)**2)/n
resolution = np.sum(N*(xbari-xbar)**2)/n
uncertainty = xbar*(1-xbar)

# print stuff
print('N', N)
print('pf', pf)
print('xbari', xbari)
print('fi', fi)
print('reli', reliability)
print('reso', resolution)
print('uncer', uncertainty)
print('BSdecomp =', reliability - resolution + uncertainty)
print('BS =', np.mean((fcstpop-obspop)**2))

ROC curves

A graphical way of looking at the conditional distributions using the likelihood-base rate factorization: POD vs. POFD for a fixed observed event, varying the forecast threshold across all possible values. Let's plot ROC curves for the Mason (1989) idea of two Gaussian distributions that we can control: one distribution of forecast values for the "yes" observations, and another distribution of forecast values for the "no" observations. More separation between these two distributions means better discrimination; if the two distributions are identical, there is no discrimination.

from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

dprime = 0.2
alpha = 0.25
f0 = norm(loc=0.5-dprime/2., scale=0.15)
f1 = norm(loc=0.5+dprime/2., scale=0.15)
fcst_yes = f1.rvs(size=int(10000*alpha))
fcst_no = f0.rvs(size=int(10000*(1-alpha)))
fcst = np.append(fcst_no, fcst_yes)
# fcst is a probability, make sure it stays between 0 and 1
fcst[fcst<0.] = 0.
fcst[fcst>1.] = 1.
obs_yes = np.ones(int(10000*alpha))
obs_no = np.zeros(int(10000*(1-alpha)))
obs = np.append(obs_no, obs_yes)
fcstsort = fcst[np.argsort(fcst)]
fcst2d = fcstsort.reshape((100,100))
obssort = obs[np.argsort(fcst)]
obs2d = obssort.reshape((100,100))
# might be interesting to plt.contourf obs2d and fcst2d

# next, set up an array of forecast thresholds that vary from the min to the max,
# with 100 points in between
# Tf = thresholds for converting the continuous fcst to yes/no
Tf = np.linspace(0., 1., 100)

# To = threshold on the obs (the obs are 0/1, so any value between 0 and 1 works)
To = 0.5
# total number of forecast/obs pairs
n = len(obs)*1.0

# set up arrays of POD and POFD with zeros
POD = np.zeros(len(Tf))
POFD = np.zeros(len(Tf))

# loop across the forecast thresholds, find the regions in the Venn diagram to determine
# the 2x2 contingency table components for each forecast threshold, then find POD, POFD
for i in np.arange(len(Tf)):
    a = np.logical_and(fcst>Tf[i], obs>To)
    b = np.logical_and(fcst>Tf[i], obs<=To)
    c = np.logical_and(fcst<=Tf[i], obs>To)
    d = np.logical_and(fcst<=Tf[i], obs<=To)
    aa = np.sum(a)/n
    bb = np.sum(b)/n
    cc = np.sum(c)/n
    dd = np.sum(d)/n
    POD[i] = aa/(aa+cc)
    POFD[i] = bb/(bb+dd)

# plot the ROC curve
plt.plot(POFD, POD, '+-')
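To see the discrimination that the ROC curve is summarizing, we can plot the two conditional forecast distributions generated above, and we can also summarize the curve with the area under it. This is a minimal sketch; the hand-rolled trapezoid sum for the area is my own choice of method, and the absolute value is there only because POFD decreases as the threshold increases.

# conditional distributions of the forecasts for "no" and "yes" observations:
# more separation between these two histograms means better discrimination
plt.figure()
plt.hist([fcst_no, fcst_yes], bins=20, label=['obs = no', 'obs = yes'])
plt.xlabel('forecast value')
plt.legend()

# area under the ROC curve (AUC): about 0.5 = no discrimination, 1.0 = perfect
# trapezoid rule over the POFD axis
auc = np.sum(0.5*(POD[:-1] + POD[1:])*np.abs(np.diff(POFD)))
print('area under the ROC curve', auc)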