For homework #2, I used Python to compute the joint, conditional, and marginal distributions of the given forecast and observations. I will now attempt a short crash course in Python and how it may be used to solve the problems presented in the assignment.
Python Makes Programming... FUN!
Python Script
The first line of the script (the "shebang" line, #!/usr/bin/env python, shown below) lets you make the script executable and run it directly (e.g., ./mypythonscript.py) rather than running the script by "python mypythonscript.py".
With Python there are many different libraries, or modules, that you may import, many of which come standard. The major package that I use in this example (and in nearly all of my work) is NumPy. NumPy is an excellent package for scientific computing, particularly for working with arrays. To import a module, you simply type "import" followed by the module name. You also have the option to import the module under a different name, which is quite helpful if the module name is long, so that you do not have to retype the full name each time you call a function or an instance from it.
Example:
This is what I have at the very top of my script. While it is not necessary to import all of your required modules at the top of your code (you can do it anywhere really), I find it to be good practice.
#!/usr/bin/env python
import numpy as np
Example:
This is the section of my code where I read in the observed and forecasted temperatures from the provided csv file. Here, I am going to read in the csv file (it's comma delimited, so delimiter=','), and since the first row is simply the column headers, I would like to skip it (skiprows=1). The observed temperatures are in column 3 and the forecasted temperatures are in column 6, so those are the two columns I would like to read in, which is done with the usecols argument. Unlike the skiprows specification though, usecols is zero-indexed, so 0 (zero) represents the first column; thus, specifying 2 and 5 refers to columns 3 and 6, respectively. Lastly, I set unpack to True so that I can assign column 3 to the obs array and column 6 to the fcst array. As a final consideration, it is always good practice to close the file that you opened once you are done with it.
f1 = open('/path/to/data/HW_2.csv','r')
obs,fcst = np.loadtxt(f1,delimiter=',',skiprows=1,usecols=(2,5),unpack=True)
f1.close()
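As a quick optional sanity check (not part of the original calculation), you can print the shape and the first few values of each array to confirm the data were read in as expected:
print(obs.shape)    # one value per data row of the csv file
print(obs[:5])      # first five observed temperatures
print(fcst[:5])     # first five forecasted temperatures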
Calculate Joint Probability Distribution
- What is really great about Python, and NumPy in particular, is that array operations do not require you to loop through all of the array indices when performing calculations. Operations are automatically applied to each element of the array (see the short sketch after this list).
- To calculate the Joint Probability Distribution, I made use of NumPy's 2d histogram function
- This function requires you to specify the 2 datasets you wish to use as well as the number of bins in each dimension
- For this example, I set up the bins to be a 1d array (or list) of numbers from 10 to 100 in intervals of 10
- This was accomplished using the numpy arange function, which returns evenly spaced values for the given interval (see reference #2)
- Thus, the joint probability distribution is simply the resulting 2d histogram array divided by the sum of the elements of that array
- To accomplish this, I simply made use of the sum function in NumPy which will sum up all of the elements of the array
- This function also has the ability to sum rows and/or columns (axis specification), which will come in handy when calculating marginal probabilities
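To illustrate the point about element-wise array operations, here is a minimal sketch (my own hypothetical example, assuming obs has already been loaded as above); the conversions themselves are not part of the assignment, they just show that an expression applied to a NumPy array acts on every element without an explicit loop:
obs_celsius = (obs - 32.0) * 5.0/9.0   # converts every observed temperature at once
obs_anomaly = obs - np.mean(obs)       # element-wise subtraction of the mean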
Example:
Below are the two lines (!) of code that created the joint probability distribution for the temperature forecasts using 10ºF bins. The output array of the 2d histogram is stored as "hist10", while the other two variables simply hold arrays (lists) of the bin edges in each dimension. Finally, I simply divided hist10 by the sum of the elements of hist10 to produce the final joint probability distribution, which here is named p10.
Note: the order of the two data sets determines which one is represented by the columns and which by the rows. To be consistent with Murphy and Winkler's (1987) example (e.g. their Table 1), the forecasts will be the columns and the observations will be the rows. Because the first argument to histogram2d is binned along the rows of the output array, the observations are specified first within the histogram function.
hist10, o10_bin_edges, f10_bin_edges = np.histogram2d(obs,fcst,bins=np.arange(10,101,10))
p10 = hist10/(np.sum(hist10))
# np.sum(hist10) could equivalently be written as a method call on the array:
hist10.sum()
# The bins argument can also be a list giving the bins for each dimension separately,
# e.g. 2 bins for the observations and the 10-100 degF edges for the forecasts:
hist10, o10_bin_edges, f10_bin_edges = np.histogram2d(obs,fcst,bins=[2,np.arange(10,101,10)])
- To calculate the marginal probabilities, we simply want to sum the rows/columns of the joint probability distribution
- As mentioned above, this may be accomplished using the NumPy sum function, where we can specify which dimension, or axis, of the array we would like to sum over
- To sum along each row (across the columns), use axis=1
- To sum along each column (down the rows), use axis=0
Example:
# Keep the marginals as 2-d (column/row vector) arrays using np.newaxis:
p_x = np.sum(p10,axis=1)[:,np.newaxis]
p_f = np.sum(p10,axis=0)[np.newaxis,:]
# Equivalently, sum first and then reshape:
p_x = np.sum(p10,axis=1)
p_x = np.reshape(p_x,(len(p_x),1))
p_f = np.sum(p10,axis=0)
p_f = np.reshape(p_f,(1,len(p_f)))
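Since p10 sums to one by construction, each marginal distribution should as well. A quick check (my own addition, not required for the assignment):
print(np.sum(p_x))   # should be ~1.0
print(np.sum(p_f))   # should be ~1.0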
- The likelihood-base rate and calibration-refinement distributions may be calculated simply by dividing the joint probability distribution by the appropriate marginal probability.
- This division is applied element-wise to each row/column of p10 using the respective marginal probability array; behind the scenes, NumPy broadcasting allows the computation to be made between arrays of different shapes.
LBR = p10/p_x
CR = p10/p_f
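As another quick sanity check (again my own addition), each row of the likelihood p(f|x) should sum to one and each column of the calibration-refinement distribution p(x|f) should sum to one, provided the corresponding marginal probability is nonzero (an empty bin would give NaNs from the 0/0 division):
print(np.sum(LBR,axis=1))   # each entry ~1.0 (or NaN for an empty observation bin)
print(np.sum(CR,axis=0))    # each entry ~1.0 (or NaN for an empty forecast bin)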
Writing out the arrays to a text file
Though there are a few clunky workarounds in Matplotlib, there is no really good way to create a table in Python (one of only a few negative things I can say about Python). As such, I decided to simply write out the resulting arrays into a delimited text file. Before writing out the joint probability distribution, I wanted to append the marginal probabilities to it, so I could write out one big array for easier table creation.
- Use NumPy vstack to stack the joint probability distribution array and the marginal probability p(f) vertically (row-wise) to create one array
p10 = np.vstack((p10,p_f))
- We would like to do the same thing but horizontally (column-wise) this time to join the new p10 array and the marginal probability p(x)
- However, since the joint probability distribution array now has an extra row, the dimensions will not line up
- So, I appended a 1.0 (the total probability, which fills the bottom-right corner of the table) to the end of p(x) using NumPy append
- Append will flatten the array, so it needs to be reshaped to its original dimensionality before it can be stacked with p10
- From there, NumPy hstack can be used to add the column
p_x = np.append(p_x,1.0)
p_x = np.reshape(p_x,(len(p_x),1))
p10 = np.hstack((p10,p_x))
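With the 10ºF bins there are nine bins in each dimension, so after appending the marginals the stacked array should be 10 x 10. A quick check:
print(p10.shape)   # should be (10, 10) for the 10-degree bins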
- To write out data, I simply used the savetxt function from NumPy.
- Instead of writing out the data separated by commas (you would set delimiter=','), I decided to write the data out to a tab-delimited file (delimiter='\t') for easier readability when I open it in a text editor. I also formatted the output such that it was rounded to four decimal places to enhance readability.
np.savetxt("contingency_10bin.csv",p10,fmt="%.4f",delimiter='\t')
np.savetxt("LBR_10bin.csv",LBR,fmt="%.4f",delimiter='\t')
np.savetxt("CR_10bin.csv",CR,fmt="%.4f",delimiter='\t')
# Mean forecast and observed temperatures; their difference gives the overall forecast bias
fcst_mean = np.mean(fcst)
obs_mean = np.mean(obs)
bias = fcst_mean - obs_mean