Forecast Error Distributions

By Kim Hoogewind

I used Python to compute the joint, conditional, and marginal distributions of the given forecasts and observations. I will now attempt a short crash course in Python and how it may be used to solve the problems presented in the assignment.


Python Makes Programming... FUN!

I never thought I would have enjoyed programming, but Python has made this possible. It is easy to learn and gaining popularity in our field of atmospheric science. I hope that I will succeed in enticing each one of you into becoming a user (Python may be interfaced with many of the programs/programming languages you already know, e.g. R (rpy or rpy2), GrADS (PyGrADS), ArcGIS, NCL (PyNGL), Fortran (F2Py), C (Cython), and Java (Jython), to name a few). Anyhow, if you would like to whet your appetite and start learning a little more about Python, there is a pretty good tutorial here that goes through some of the basics. Also, I highly recommend taking a look at Johnny Lin's new intro book to Python in the Atmospheric and Oceanic Sciences (you can get a free e-copy here).

I currently use Python 2.7.2.

Python Script

I always begin my script with the "shebang" line to ensure that the correct interpreter will be used. To be fair, this is only necessary if the script is made executable (e.g. you chmod +x it in order to run it directly: ./mypythonscript.py) rather than run with "python mypythonscript.py".

With Python there are many different libraries or modules that you may import, many of which come standard. The major package that I use in this example (and in nearly all of my work) is NumPy. NumPy is an excellent package for scientific computing, and in particular for working with arrays. To import a module, you simply type "import" and then the module name. You also have the option to import the module under a different name, which is quite helpful if the module name is very long, so that you do not have to retype the full name every time you call a function or an instance from it later on.

Example:
This is what I have at the very top of my script. While it is not necessary to import all of your required modules at the top of your code (you can do it anywhere really), I find it to be good practice.

#!/usr/bin/env python
import numpy as np

Reading in the data
  • Reading in data in Python is very simple. To open a file and return a file object, you simply use the built-in open() function. To read the file, I used NumPy's loadtxt function, which will load data from a text or csv file.

Example:

This is the section of my code where I read in the observed and forecast temperatures from the provided csv file. Here, I am going to read in the csv file (it's comma delimited, so delimiter=','), and since the first row simply contains the column headers, I would like to skip it (skiprows=1). Since the observed temperatures are in column 3 and the forecast temperatures in column 6, those are the two columns I would like to read in. With the usecols argument, I can do this. Unlike the skiprows specification, though, you must remember that usecols is zero-indexed, so 0 (zero) represents the first column. Thus, specifying 2 and 5 refers to columns 3 and 6, respectively. Lastly, I set unpack to True so that I can assign column 3 to the obs array and column 6 to the fcst array. As a final consideration, it is always good practice to close the file that you opened once you are done with it.

f1 = open('/path/to/data/HW_2.csv','r')   # open the csv file for reading
# skip the header row; pull columns 3 and 6 (indices 2 and 5) into separate arrays
obs,fcst = np.loadtxt(f1,delimiter=',',skiprows=1,usecols=(2,5),unpack=True)
f1.close()   # close the file once we are done with it
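
As an aside (a variation on my script, not what I actually ran), loadtxt will also accept the filename directly, in which case it opens and closes the file for you:

obs,fcst = np.loadtxt('/path/to/data/HW_2.csv',delimiter=',',skiprows=1,usecols=(2,5),unpack=True)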

Calculate Joint Probability Distribution
  • What is really great about Python, NumPy in particular, is that array operations do not require you to loop through all of the array indices when performing calculations. Operations will automatically be applied to each location within the array.
  • To calculate the Joint Probability Distribution, I made use of NumPy's 2d histogram function
    • This function requires you to specify the 2 datasets you wish to use as well as the number of bins in each dimension
    • For this example, I set up the bins to be a 1d array (or list) of numbers from 10 to 100 in intervals of 10
      • This was accomplished using the NumPy arange function, which returns evenly spaced values over the given interval (see reference #2 and the short snippet after this list)
  • Finally, the joint probability distribution is simply the resulting 2d histogram array divided by the sum of the elements of that array
    • To accomplish this, I simply made use of the sum function in NumPy, which will sum up all of the elements of the array
    • This function also has the ability to sum rows and/or columns (axis specification), which will come in handy when calculating marginal probabilities
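
For instance, here is what arange returns for the bin edges used below (a quick illustration, not part of the homework script):

print np.arange(10,101,10)   # -> [ 10  20  30  40  50  60  70  80  90 100]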

Example:

Below are the two lines (!) of code that created the joint probability distribution for the temperature forecasts using 10ºF bins. The output array of the 2d histogram is stored as "hist10", while the other two variables simply hold arrays (lists) of the bin edges in each dimension. Finally, I simply divided hist10 by the sum of the elements of hist10 to produce the final joint probability distribution, which here is named p10.

Note: the order of the two data sets determines which one is represented by the columns and which by the rows. To be consistent with Murphy and Winkler's (1987) example (e.g. their Table 1), the forecasts will be the columns and the observations will be the rows. Because the first argument to histogram2d is binned along the first (row) dimension of the output array, the observations are specified first within the histogram function.

# 2d histogram of the paired data: observations along the rows, forecasts along the columns
hist10, o10_bin_edges, f10_bin_edges = np.histogram2d(obs,fcst,bins=np.arange(10,101,10))
p10 = hist10/(np.sum(hist10))   # normalize the counts to obtain the joint probability distribution


As an aside, the sum of all of the elements in hist10 may also be written as:
hist10.sum()

For the case of POPs, though, you must specify different bins for each dataset. This can be accomplished by giving each dimension its own bin specification within the bins argument (here, between the square brackets): the number of bins for the observations and the array of bin edges for the forecasts.

hist10, o10_bin_edges, f10_bin_edges = np.histogram2d(obs,fcst,bins=[2,np.arange(10,101,10)])

Calculate Marginal Probabilities
  • To calculate the marginal probabilities, we simply want to sum the rows/columns of the joint probability distribution
  • As mentioned above, this may be accomplished using the NumPy sum function, where we can specify which dimension, or axis, of the array we would like to sum
    • To sum across each row, axis=1
    • To sum down each column, axis=0 (see the toy example below)
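
If the axis convention is hard to remember, a toy example (my own illustration, not from the assignment) makes it concrete:

a = np.array([[1,2],[3,4]])
print np.sum(a,axis=0)   # sums down each column -> [4 6]
print np.sum(a,axis=1)   # sums across each row  -> [3 7]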

Example:
Here, I also make use of np.newaxis, which simply adds a dimension/axis. For example, to calculate p(x), I sum across each row (over all of the forecast columns) and store the result in the array p_x. When I include [:,np.newaxis], it restructures p_x so that it has the same number of rows as the joint probability distribution array (p10) but only one column. Essentially, this step is equivalent to reshaping the array to the desired dimensionality, which may also be accomplished using NumPy's reshape.

p_x = np.sum(p10,axis=1)[:,np.newaxis]   # marginal distribution of the observations (column vector)
p_f = np.sum(p10,axis=0)[np.newaxis,:]   # marginal distribution of the forecasts (row vector)

which would be equivalent to:

p_x = np.sum(p10,axis=1)
p_x = np.reshape(p_x,(len(p_x),1))

p_f = np.sum(p10,axis=0)
p_f = np.reshape(p_f,(1,len(p_f)))

This alternate way makes use of the built-in len function which will return the length, or number of items, in a list/array/tuple.

Calculate Conditional Probabilities
  • The likelihood base rate and calibration refinement may be calculated simply by dividing the joint probability distribution by the appropriate marginal probability. 
  • This process will divide each row/column of p10 element-wise by the respective marginal probability array. This relies on some behind-the-scenes broadcasting, which allows the computation to be made between arrays of different shapes.

LBR = p10/p_x   # likelihood-base rate: p(f|x)
CR = p10/p_f    # calibration-refinement: p(x|f)
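
As a quick sanity check (my own addition, not part of the assignment), each conditional distribution should sum to one along the appropriate axis, provided no bin has a zero marginal probability (which would produce NaNs from the division):

print np.sum(LBR,axis=1)   # each row of p(f|x) should be ~1
print np.sum(CR,axis=0)    # each column of p(x|f) should be ~1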

Writing out the arrays to a text file
Though there are a few clunky workarounds in Matplotlib, there is no really good way to create a table in Python (one of only a few negative things I can say about Python). As such, I decided to simply write out the resulting arrays to a delimited text file. Before writing out the joint probability distribution, I wanted to append the marginal probabilities to it, so I could write out one big array for easier table creation.
  • Use NumPy vstack to stack the joint probability distribution array and the marginal probability p(f) vertically (row-wise) to create one array
p10 = np.vstack((p10,p_f))   # attach p(f) as a new bottom row
  • We would like to do the same thing but horizontally (column-wise) this time to join the new p10 array and the marginal probability p(x)
  • However, since the joint probability distribution array now has an extra row, the dimensions will not line up
  • So, I appended a 1.0 to the end of p(x) (using NumPy's append)
    • Append will flatten the array, so it needs to be reshaped to its original dimensionality before it can be stacked with p10
  • From there, NumPy's hstack can be used to add the column
p_x = np.append(p_x,1.0)             # append the total probability (1.0); note this flattens p_x
p_x = np.reshape(p_x,(len(p_x),1))   # restore the column-vector shape

p10 = np.hstack((p10,p_x))           # attach p(x) as the final column
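
At this point, the stacked array should be one row and one column larger than the original joint distribution. A quick check (my own addition, not in the original script):

print p10.shape   # e.g. (10, 10) here, since the original 10ºF-bin joint distribution was (9, 9)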
  • To write out data, I simply used the savetxt function from NumPy.
  • Instead of writing out the data separated by commas (you would set delimiter=','), I decided to write the data out to a tab-delimited file (delimiter='\t') for easier readability when I open it in a text editor. I also formatted the output so that values are rounded to four decimal places to enhance readability.
np.savetxt("contingency_10bin.csv",p10,fmt="%.4f",delimiter='\t')
np.savetxt("LBR_10bin.csv",LBR,fmt="%.4f",delimiter='\t')

np.savetxt("CR_10bin.csv",CR,fmt="%.4f",delimiter='\t')

Calculating Forecast Bias
Calculating the forecast bias makes use of the NumPy mean function and basic Python subtraction.
fcst_mean = np.mean(fcst)
obs_mean = np.mean(obs)

bias = fcst_mean - obs_mean
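
Since obs and fcst are the same length, this is equivalent to averaging the paired errors directly, e.g.:

bias = np.mean(fcst - obs)   # mean forecast error; identical to fcst_mean - obs_mean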