using python for data analysis

"python" is a programming language that is widely used, interactive, object-oriented, and easy to learn. One of the best features of python is the ability to include "modules" or "packages" that handle a lot of powerful tasks, such as statistics, plotting, and reading/writing a wide variety of data formats.

# basically, the stuff in this green Courier font is code 
# you should be able to copy and paste it into the python shell
# things that begin with # are comment lines and are not executed in python


python


We will be using a special version of python called "Enthought Python Distribution" (or EPD).  This contains a large number of modules that will be very useful for us in our data analysis.

open a terminal window and change the directory to the class disk where we have saved our data in "csv" (comma separated values) files

cd /class/hypocenter/a/EAS434/EAPS_226_data

start python interactively, type:

/opt/EPD/bin/python


let's try the classic first programming task that everyone does when they learn how to use a new language

print 'hello world'


it is always good to know how to get out of a program...try:

quit()



ok, now we need to start python up again, use the up arrow on your keyboard to bring back the previous linux command

try these to see how python handles mathematical expressions and integer data types

print 5*4
print 6/2
print 13+5+7
print 6**2

a=5/2
print a

a=5./2.
print a


note the difference between these two results: this is the difference between integer and floating point math. 5/2 uses integer math and discards the remainder (giving 2), while 5./2. uses floating point math (giving 2.5)
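to see the same distinction in a version-proof way, here is a small sketch using the "//" floor-division operator (note: in the python 2 we are running, 5/2 already behaves like 5//2; in python 3, "/" always gives a floating point answer):

```python
# integer (floor) division discards the remainder; the '//' operator
# makes this explicit and behaves the same in newer python versions
a = 5 // 2      # integer math
b = 5.0 / 2.0   # floating point math
print(a)
print(b)
```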



lists: here is how python assigns a series of values to a variable, putting them into a list within a pair of square brackets:
a=[2, 3.2, 'hello', 'eeee']
len(a)
a[0]
a[1:3]
a[1]='goodbye'
print a


tuples are like lists but can't be changed (they are "immutable"), and are assigned using parentheses instead of brackets
b=(2, 3.2, 'goodbye', 'aaaa')
b[0]


numpy

we are going to use a very useful module called "numpy" which is short for Numerical Python

python includes a module using the "import" command, in general:
import module_name as shorthand

import numpy as np


everything in python is an "object".  you can find out more information about an object by using the dir() command

dir(np)


this lists all of the sub-processes that numpy contains (these are called "definitions"). to access one of those sub-processes, use a "." after the name of the main object, followed by the name of the sub-process

p=np.pi
print p
np.sin(p/2)



let's use a handy numpy process that can read in data from a text file, such as our "csv" files, change filename to one of the files for your location:

filename = 'atyahr.csv'
adata = np.genfromtxt(filename, delimiter=',')

let's find out some information about the data that are in variable "adata"

print adata.shape
print adata.size

"shape" tells us the number of (rows, columns) that we have read from the csv file
"size" tells us the total number of data values, which should be rows*columns
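a small synthetic array (not our csv data, just an illustration) shows how shape and size relate:

```python
import numpy as np

demo = np.zeros((3, 4))   # a made-up array with 3 rows and 4 columns
print(demo.shape)         # (3, 4)
print(demo.size)          # 12, which is rows*columns = 3*4
```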

we can refer to an individual value in the big adata array by referencing its row and column value. these are numbered starting with "0" increasing from left to right, top to bottom in our spreadsheet
for example, the value of the top row, left column is

print adata[0,0]

note that "nan" means "not a number". this is because the first column contains our date/time information in a non-numerical format. let's try to print the entire first row

print adata[0,:]

you'll see that the numbers match the top row in your spreadsheet (without the header text)

the ":" symbol is a "wildcard" and allows us to refer to the entire length of the vector

let's just look at the 5 columns that contain numbers, starting with column #1 (second column from the left)

print adata[0,1:6]
this might be a bit confusing, since column #1 is the second column, column #2 is the third column, etc., and the right-most column is actually column #5. when you reference values this way, the indices go from start:end without including the index at the end!  for example, try:

print adata[0,1:5]

notice that this does not print the value in column #5, only columns 1 through 4.  python slices always exclude the end index; this "half-open" convention means the number of elements in a slice is simply end minus start, and consecutive slices line up without overlapping
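a quick illustration with a plain list (made-up values, not our data) shows the pattern:

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']

print(letters[1:5])        # ['b', 'c', 'd', 'e'] -- index 5 is NOT included
print(len(letters[1:5]))   # 4, which is simply end - start (5 - 1)

# a handy side effect: consecutive slices join up with no overlap or gap
print(letters[0:3] + letters[3:6])   # the whole list again
```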

let's do some calculations on column #1, which contains the surface temperature values

print np.max(adata[:,1])
print np.min(adata[:,1])
print np.mean(adata[:,1])

here we are taking advantage of numpy sub processes to find the max, min, and mean of the entire column of numbers

ok, so far we haven't really done anything that we couldn't easily do in Excel or the Google spreadsheet. let's take advantage of some of the more powerful aspects of python to mask out values that are greater than 120 minutes from the nearest actual observation.  recall that this variable is in the right-most column, column #5.

first let's see how many of the values in column 5 exceed 120 minutes

print adata[:,5].size
print adata[adata[:,5]>120,5].size

note the difference between these two commands.  instead of referencing all of the rows in column 5 (using ":"), we replace ":" in the second command with a logical expression "adata[:,5]>120".  this expression returns a list of "true" and "false" values depending upon whether each value in column 5 is greater than 120.  only the "true" positions will be used in the subsequent process, here just checking the size of that vector.  this is called indirect referencing (also known as boolean or logical indexing).
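here is the same idea on a tiny made-up array, so you can see the true/false list itself:

```python
import numpy as np

vals = np.array([30, 150, 90, 200, 60])   # made-up minutes-to-nearest-ob values
mask = vals > 120                         # one True/False per element
print(mask)                               # [False  True False  True False]
print(vals[mask])                         # [150 200] -- only the True positions
print(vals[vals > 120].size)              # 2
```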

calculating errors


let's find the mean of the surface temperatures (column #1) for only those rows which are less than or equal to 120 minutes away from the nearest actual observation (determined by column #5)
 
print np.mean(adata[adata[:,5]<=120,1])

ok, now let's read in the WRF data for your location

filename = 'atywrf.csv'
wrf = np.genfromtxt(filename, delimiter=',')


let's set up new variables that contain the forecast values (fcst) and observed values (obs) for surface temperature, only for "valid" observations that are less than or equal to 120 minutes from the nearest observation. let's set up a variable called "numv" to hold the cutoff that we want to use in order to determine valid observations (here 120 minutes), then subtract fcst - obs to determine the hourly error values

numv=120
fcst = wrf[adata[:,5]<=numv,1]
obs = adata[adata[:,5]<=numv,1]
error = fcst - obs

with this variable, it is now really easy to compute the bias (mean error), mean absolute error, and mean squared error for the surface temps

bias = np.mean(error)
mae = np.mean(np.abs(error))
mse = np.mean(error**2)
rmse = np.sqrt(mse)

print 'bias = ',bias
print 'mae = ',mae
print 'mse = ',mse
print 'rmse = ',rmse


mean error (bias)


positive: forecast is on average above the obs (too warm)
negative: forecast is on average below the obs (too cold)

mean absolute error


typical magnitude of the error

mean square error


more sensitive to large errors (outliers)
a single bad forecast (or bad ob) will make this large

root mean squared error



same units as forecast, obs
another typical magnitude of error (sensitive to large errors)
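to make these definitions concrete, here is a tiny worked example with five hypothetical hourly errors (fcst - obs, deg C), not real data:

```python
import numpy as np

error = np.array([1.0, -2.0, 3.0, 0.0, -1.0])   # hypothetical errors, deg C

bias = np.mean(error)          # (1 - 2 + 3 + 0 - 1)/5 = 0.2 -> slightly too warm
mae = np.mean(np.abs(error))   # (1 + 2 + 3 + 0 + 1)/5 = 1.4
mse = np.mean(error**2)        # (1 + 4 + 9 + 0 + 1)/5 = 3.0
rmse = np.sqrt(mse)            # about 1.73 -- pulled above the MAE by the 3.0 outlier
```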

scatter plots

a nice way to compare the forecast and observed values is with a "scatter plot"
where you plot symbols in x,y space of forecast values vs. observed values
obs=x coord, fcst=y coord

we need to import another package that has nice plotting modules

import matplotlib.pyplot as plt


matplotlib is a Matlab-style plotting package; if you're familiar with Matlab, it's easy to learn

plt.figure()
plt.plot(obs,fcst,'.')
plt.hold()
plt.xlabel('obs')
plt.ylabel('fcst')
plt.show()


it is helpful to have a "y=x" line for comparison purposes, since a perfect forecast will fall along that line. we can add this to the plot command by plotting the obs vs. obs values as a red line

plt.figure()
plt.plot(obs,fcst,'.',obs,obs,'r')
plt.hold()
plt.xlabel('RWIS obs')
plt.ylabel('WRF fcst')
plt.title('surface temps deg C for site A')
plt.show()

here, the bias can easily be seen: if the dots tend to fall below the y=x line, the forecasts are too cold; if the dots fall above the y=x line, the forecasts are too warm
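you can also quantify that visual impression with the boolean trick from earlier; this sketch uses made-up obs/fcst pairs, not our data:

```python
import numpy as np

obs = np.array([0.0, 2.0, 5.0, 8.0])     # hypothetical observed temps, deg C
fcst = np.array([-1.0, 1.5, 5.5, 7.0])   # hypothetical forecasts, deg C

frac_below = np.mean(fcst < obs)   # fraction of dots below the y=x line
print(frac_below)                  # 0.75 -> these forecasts are mostly too cold
```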

from here, it should be fairly easy to compute the error statistics for other forecast variables, such as air temperature (column #2) and dewpoint (column #3)

airfcst = wrf[adata[:,5]<=numv,2]
airobs = adata[adata[:,5]<=numv,2]
error = airfcst - airobs

bias = np.mean(error)
mae = np.mean(np.abs(error))
mse = np.mean(error**2)
rmse = np.sqrt(mse)

print 'air temp bias = ',bias
print 'air temp mae = ',mae
print 'air temp mse = ',mse
print 'air temp rmse = ',rmse

plt.figure()
plt.plot(airobs,airfcst,'.',airobs,airobs,'r')
plt.hold()
plt.xlabel('RWIS obs')
plt.ylabel('WRF fcst')
plt.title('air temps deg C for Amity')
plt.show()

dewfcst = wrf[adata[:,5]<=numv,3]
dewobs = adata[adata[:,5]<=numv,3]
error = dewfcst - dewobs

bias = np.mean(error)
mae = np.mean(np.abs(error))
mse = np.mean(error**2)
rmse = np.sqrt(mse)

print 'dewpoint bias = ',bias
print 'dewpoint mae = ',mae
print 'dewpoint mse = ',mse
print 'dewpoint rmse = ',rmse

plt.figure()
plt.plot(dewobs,dewfcst,'.',dewobs,dewobs,'r')
plt.hold()
plt.xlabel('RWIS obs')
plt.ylabel('WRF fcst')
plt.title('dewpoint deg C for Amity')
plt.show()


hourly errors


at this point the error statistics that we've computed are for the overall forecast. it would be interesting to see how these vary during the day, similar to what was done in the BAMS article on the NWS/Metro forecasts.

our data are hourly, so if we skip through the column and only do a calculation on every 24th row, we would be focusing that calculation on the same time of day (and same forecast hour) throughout. in python, you can skip entries in a row using the "start:end:skip" referencing syntax. let's start with the first hour in our file, which is probably 07 UTC (7h fcst). the length of the column can be found using the "len" function. we will assign a subset of the data to separate variables, then compute our stats as before:
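the start:end:skip pattern is easiest to see on a small made-up array:

```python
import numpy as np

rows = np.arange(12)      # 0..11, standing in for 12 hourly rows

print(rows[0:12:4])       # [0 4 8]  -- every 4th row, starting at row 0
print(rows[1:12:4])       # [1 5 9]  -- shift the start by 1 to pick a different hour
```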

length = len(adata)
asubset=adata[0:length:24,:]
wrfsubset=wrf[0:length:24,:]
fcst = wrfsubset[asubset[:,5]<=numv,1]
obs = asubset[asubset[:,5]<=numv,1]
error = fcst - obs

bias = np.mean(error)
mae = np.mean(np.abs(error))
mse = np.mean(error**2)
rmse = np.sqrt(mse)
print 'hour 7 sfc temp bias = ',bias
print 'hour 7 sfc temp mae = ',mae
print 'hour 7 sfc temp mse = ',mse
print 'hour 7 sfc temp rmse = ',rmse
print 'hour 7 number of valid obs =',len(obs)

this is great for getting the 07 UTC values, but we would like to do this for all 24 hours in the day. it would be a pain to type in 24 different sets of commands and print statements; you can imagine how tedious this would be in a Google Spreadsheet. here is where we'll take advantage of "looping" in python, using a "for" loop:

let's play around with looping, python can do looping with the "for <index> in <list>" command
try this:


for k in [7, -3.3, np.pi, 'hello']:
    print k


a = [7, -3.3, np.pi, 'hello']
for i in np.arange(4):
    print i, a[i]


print np.arange(4)


here is where some python syntax comes into play, when you use a "for" loop (or an "if" test), the end of that statement will need a colon ":", the block of code that follows must be indented to indicate that this block of code will be what is executed in the loop.  the indents must be consistent within that block of code. that block of code is terminated by the next line of python that is NOT indented

in the first loop, you see how this for loop works: it simply steps through the provided list sequentially, executing the indented block of code that follows.  the list can contain any type of data; here we've provided an integer, a float, a numpy object, and a string

the second loop uses the same list, but loops over a numpy array (np.arange) that sets up a simple sequence of numbers, in this case 0 to 3 (four numbers in total).  the block of code prints the loop index i, then refers to array "a" by that index

let's set up a loop that goes through all 24 hours in a day, pull the subset of the data for each hour, compute our stats, and store the answers in a new set of variables

hr=np.zeros(24)
msehr=np.zeros(24)
maehr=np.zeros(24)
biashr=np.zeros(24)
sizehr=np.zeros(24)
for i in np.arange(24):
    hr[i]=i+7
    asubset=adata[i:length:24,:]
    wrfsubset=wrf[i:length:24,:]
    fcst = wrfsubset[asubset[:,5]<=numv,1]
    obs = asubset[asubset[:,5]<=numv,1]
    error = fcst - obs
    sizehr[i]=obs.size
    msehr[i]=np.mean(error**2)
    biashr[i]=np.mean(error)
    maehr[i]=np.mean(np.abs(error))
print ' mse = ', msehr
print ' bias = ',biashr
print ' mae = ',maehr


python scripting


at this point, we should put together a script or program that will do all of the calculations and plots that we would like to see in one place.  running python interactively is nice for exploring the data and testing new calculations, but once we get to a point where we have a lot of typing to do, putting the commands into a file (what I call a "script") and executing those is much easier.

open a new terminal, change directory into our class data directory, and open gedit:
cd /class/hypocenter/a/EAS434/EAPS_226_data
gedit


now copy and paste the following python code into your gedit window

save it as "rwisNAME.py" (change NAME to your name or something else unique to you)

exit gedit

[if this doesn't work we might need to do: cp rwiscalc.py rwisNAME.py ]

now do

chmod 777 rwisNAME.py

./rwisNAME.py


python script for RWIS error calculation

#!/opt/EPD/bin/python

import numpy as np
import matplotlib.pyplot as plt
from optparse import OptionParser

# we are using the "option parser" to bring in optional command line arguments into our script
# for example, we can now easily change the location of the analysis to Frankfort by doing
# ./rwis.py -p fft
#
# or change the number of minutes to the nearest observation cutoff for determining valid observations to 60 min by doing
# ./rwis.py -c 60
# or both
# ./rwis.py -p gas -c 30

# define options and set up default values

parser = OptionParser()

parser.add_option("-p", "--prefix", dest="prefix", default="aty",
                  help="start of csv filenames", metavar="file name prefix")

parser.add_option("-c", "--cutoff", dest="cutoff", type="int",default=120,
                  help="limit of minutes for interp", metavar="minutes")
parser.add_option("-r", "--range", dest="mrange", type="int", default=30,
                  help="number of days", metavar="DAYS")
parser.add_option("-y", "--year", dest="myear", type="int", default=2012,
                  help="year", metavar="YEAR")
parser.add_option("-m", "--month", dest="mmonth", type="int", default=11,
                  help="month", metavar="MONTH")

# obtain information from the command line options

(options, args) = parser.parse_args()

year=options.myear
month=options.mmonth
maxday=options.mrange
pref=options.prefix
numv=options.cutoff

# set up csv file names, read data
afile = pref+'ahr.csv'
bfile = pref+'bhr.csv'
wrffile = pref+'wrf.csv'

print 'reading ',afile
print 'reading ',bfile
print 'reading ',wrffile

adata=np.genfromtxt(afile, delimiter=',')
bdata=np.genfromtxt(bfile, delimiter=',')
wrf=np.genfromtxt(wrffile, delimiter=',')

print ' total number of valid hours A site'
print adata[adata[:,5]<=numv,5].size
print ' total number of valid hours B site'
print bdata[bdata[:,5]<=numv,5].size

# compute summary error statistics

fcst=wrf[adata[:,5]<=numv,1]
obs=adata[adata[:,5]<=numv,1]
error=fcst-obs
mse=np.mean(error**2)
bias=np.mean(error)
mae=np.mean(np.abs(error))
print ' sftemp A site '
print ' mse = ',mse
print ' mae = ',mae
print ' bias = ', bias

fcst=wrf[bdata[:,5]<=numv,1]
obs=bdata[bdata[:,5]<=numv,1]
error=fcst-obs
mse=np.mean(error**2)
bias=np.mean(error)
mae=np.mean(np.abs(error))
print ' sftemp B site '
print ' mse = ',mse
print ' mae = ',mae
print ' bias = ', bias

fcst=wrf[adata[:,5]<=numv,2]
obs=adata[adata[:,5]<=numv,2]
error=fcst-obs
mse=np.mean(error**2)
bias=np.mean(error)
mae=np.mean(np.abs(error))
print ' airtemp  '
print ' mse = ',mse
print ' mae = ',mae
print ' bias = ', bias

fcst=wrf[adata[:,5]<=numv,3]
obs=adata[adata[:,5]<=numv,3]
error=fcst-obs
mse=np.mean(error**2)
bias=np.mean(error)
mae=np.mean(np.abs(error))
print ' dewpoint  '
print ' mse = ',mse
print ' mae = ',mae
print ' bias = ', bias

# determine the hourly errors

length = len(adata)
hr=np.zeros(24)
msehr=np.zeros(24)
maehr=np.zeros(24)
biashr=np.zeros(24)
sizehr=np.zeros(24)
mseair=np.zeros(24)
maeair=np.zeros(24)
biasair=np.zeros(24)
msedew=np.zeros(24)
maedew=np.zeros(24)
biasdew=np.zeros(24)
mseb=np.zeros(24)
maeb=np.zeros(24)
biasb=np.zeros(24)
for i in np.arange(24):
    hr[i]=i+7
    asubset=adata[i:length:24,:]
    bsubset=bdata[i:length:24,:]
    wrfsubset=wrf[i:length:24,:]
    fcst=wrfsubset[asubset[:,5]<=numv,1]
    obs=asubset[asubset[:,5]<=numv,1]
    bobs=bsubset[asubset[:,5]<=numv,1]
    airfcst=wrfsubset[asubset[:,5]<=numv,2]
    airobs=asubset[asubset[:,5]<=numv,2]
    dewfcst=wrfsubset[asubset[:,5]<=numv,3]
    dewobs=asubset[asubset[:,5]<=numv,3]
    error=fcst-obs
    berror=fcst-bobs
    airerror=airfcst-airobs
    dewerror=dewfcst-dewobs
    sizehr[i]=obs.size
    msehr[i]=np.mean(error**2)
    biashr[i]=np.mean(error)
    maehr[i]=np.mean(np.abs(error))
    mseair[i]=np.mean(airerror**2)
    biasair[i]=np.mean(airerror)
    maeair[i]=np.mean(np.abs(airerror))
    msedew[i]=np.mean(dewerror**2)
    biasdew[i]=np.mean(dewerror)
    maedew[i]=np.mean(np.abs(dewerror))
    mseb[i]=np.mean(berror**2)
    biasb[i]=np.mean(berror)
    maeb[i]=np.mean(np.abs(berror))

# plot the hourly errors
plt.figure()
plt.plot(hr,np.sqrt(msehr),'r',hr,biashr,'b',hr,maehr,'g')
plt.hold()
plt.xlabel('forecast hour')
plt.ylabel('deg C')
plt.title('A site surface temp Bias (blue) RMSE (red) MAE (green)')
plt.show()

plt.figure()
plt.plot(hr,np.sqrt(mseair),'r',hr,biasair,'b',hr,maeair,'g')
plt.hold()
plt.xlabel('forecast hour')
plt.ylabel('deg C')
plt.title('air temp Bias (blue) RMSE (red) MAE (green)')
plt.show()

plt.figure()
plt.plot(hr,np.sqrt(msedew),'r',hr,biasdew,'b',hr,maedew,'g')
plt.hold()
plt.xlabel('forecast hour')
plt.ylabel('deg C')
plt.title('dewpoint Bias (blue) RMSE (red) MAE (green)')
plt.show()

plt.figure()
plt.plot(hr,np.sqrt(mseb),'r',hr,biasb,'b',hr,maeb,'g')
plt.hold()
plt.xlabel('forecast hour')
plt.ylabel('deg C')
plt.title('B site surface temp Bias (blue) RMSE (red) MAE (green)')
plt.show()



# apply the bias found previously to the hourly forecasts, see how that improves those forecasts

mseadj=np.zeros(24)
maeadj=np.zeros(24)
biasadj=np.zeros(24)
mseairadj=np.zeros(24)
maeairadj=np.zeros(24)
biasairadj=np.zeros(24)
msedewadj=np.zeros(24)
maedewadj=np.zeros(24)
biasdewadj=np.zeros(24)
msebadj=np.zeros(24)
maebadj=np.zeros(24)
biasbadj=np.zeros(24)
for i in np.arange(24):
    asubset=adata[i:length:24,:]
    bsubset=bdata[i:length:24,:]
    wrfsubset=wrf[i:length:24,:]
    fcst=wrfsubset[asubset[:,5]<=numv,1]-biashr[i]
    obs=asubset[asubset[:,5]<=numv,1]
    bobs=bsubset[asubset[:,5]<=numv,1]
    airfcst=wrfsubset[asubset[:,5]<=numv,2]-biasair[i]
    airobs=asubset[asubset[:,5]<=numv,2]
    dewfcst=wrfsubset[asubset[:,5]<=numv,3]-biasdew[i]
    dewobs=asubset[asubset[:,5]<=numv,3]
    error=fcst-obs
    berror=fcst-bobs
    airerror=airfcst-airobs
    dewerror=dewfcst-dewobs
    mseadj[i]=np.mean(error**2)
    biasadj[i]=np.mean(error)
    maeadj[i]=np.mean(np.abs(error))
    mseairadj[i]=np.mean(airerror**2)
    biasairadj[i]=np.mean(airerror)
    maeairadj[i]=np.mean(np.abs(airerror))
    msedewadj[i]=np.mean(dewerror**2)
    biasdewadj[i]=np.mean(dewerror)
    maedewadj[i]=np.mean(np.abs(dewerror))
    msebadj[i]=np.mean(berror**2)
    biasbadj[i]=np.mean(berror)
    maebadj[i]=np.mean(np.abs(berror))

# plot MAE of un-adjusted (raw) forecast against MAE of bias-adjusted forecast

plt.figure()
plt.plot(hr,maehr,'b',hr,maeadj,'g')
plt.hold()
plt.xlabel('forecast hour')
plt.ylabel('deg C')
plt.title('A site raw MAE (blue) MAE of bias-adjusted fcst (green)')
plt.show()

plt.figure()
plt.plot(hr,maeair,'b',hr,maeairadj,'g')
plt.hold()
plt.xlabel('forecast hour')
plt.ylabel('deg C')
plt.title('air temp raw MAE (blue) MAE of bias-adjusted fcst (green)')
plt.show()

plt.figure()
plt.plot(hr,maedew,'b',hr,maedewadj,'g')
plt.hold()
plt.xlabel('forecast hour')
plt.ylabel('deg C')
plt.title('dewpoint raw MAE (blue) MAE of bias-adjusted fcst (green)')
plt.show()

plt.figure()
plt.plot(hr,maeb,'b',hr,maebadj,'g')
plt.hold()
plt.xlabel('forecast hour')
plt.ylabel('deg C')
plt.title('B site raw MAE (blue) MAE of bias-adjusted fcst (green)')
plt.show()
