python demo


link to jupyterhub on scholar.rcac.purdue.edu



"python" is a programming language that is widely used, interactive, object-oriented, and easy to learn. One of the best features of python is the ability to include "modules" or "packages" that handle a lot of powerful tasks, such as statistics, plotting, and reading/writing a wide variety of data formats.

# basically, the stuff in this green Courier font is code 
# you should be able to copy and paste it into the jupyter command lines
# hit shift-enter to execute the python code that you have entered
# things that begin with # are comment lines and are not executed in python


python


We will be using a special version of python called "Anaconda" (or conda).  This contains a large number of modules that will be very useful for us in our data analysis.


let's try the classic first programming task that everyone does when they learn how to use a new language

print('hello world')

hit shift-enter to execute this code

try these to see how python handles mathematical expressions and integer data types

print(5*4)
print(6/2)
print(13+5+7)
print(6**2)

a=5./2.
print(a)




lists, here is how python assigns a series of values to a variable, putting them into a list within a pair of brackets:
a=[2, 3.2, 'hello', 'eeee']
print(a)
print(len(a))
print(a[0])
print(a[1:3])
a[1]='goodbye'
print(a)


numpy

we are going to use a very useful module called "numpy" which is short for Numerical Python

python includes a module using the "import" command, in general:
import module_name as shorthand

import numpy as np



p=np.pi
print(p)
print(np.sin(p/2))

let's generate some fake observations and forecasts and do some quick calculations
mu=50
sigma=20
obs=np.random.normal(mu,sigma,100)
help(np.random.normal)


print(np.max(obs))
print(np.min(obs))
print(np.mean(obs))
print(np.median(obs))

mu=-1
sigma=10
randerr=np.random.normal(mu,sigma,100)
syserr=2.5
fcst=obs+randerr+syserr
print(np.max(fcst))
print(np.min(fcst))
print(np.mean(fcst))
print(np.median(fcst))

plots


we need to import another package that has nice plotting modules

import matplotlib.pyplot as plt
%matplotlib inline

matplotlib is a Matlab-style plotting package, if your're familiar with Matlab, it’s easy to learn

plt.plot(obs,'b',fcst,'r')
plt.title('obs=blue and fcst=red')
plt.ylabel('values')

sounding data


let's use a handy numpy process that can read in data from a text file, such as our Springfest sounding file:

filename = '/scratch/scholar/m/mebaldwi/EAPS431/20160416_1642Z_Purdue_Oval.txt'
data = np.genfromtxt(filename, delimiter=',',skip_header=3)

the data in the text file is "comma delimited" meaning separated by commas
skip_header lets the program know how many lines of header information to skip before trying to decode the numbers

let's find out some information about the data that are in variable "data"

print(data.shape)
print(data.size)

"shape" tells us the number of (rows, columns) that we have read from the csv file
"size" tells us the total number of data values, which should be rows*columns

we can refer to an individual value in the big data array by referencing its row and column value. these are numbered starting with "0" increasing from left to right, top to bottom in our spreadsheet
for example, the value of the top row, left column is

print(data[0,0])

note that "nan" means "not a number".  

print(data[0,:])

you'll see that the numbers match the top row in your spreadsheet (without the header text)

the ":" symbol is a 'wildcard" and allows us to refer to the entire length of the vector

let's just look at the 5 columns that contain numbers, starting with column #1 (second column from the left)

print(data[0,1:6])
this might be a bit confusing, since column #1 is the second column, column #2 is the third column, etc., the right-most column is actually column #5. when you reference values this way, the indexes go from start:end without including the number at the end!  for example, try:

print(data[0,1:5])

notice that this does not print the value in column #5, only columns 1 through 4.  I'm not sure why it works this way, just something we have to get used to.

let's do some calculations on column #1, which contains the longitude values

print(np.max(data[:,1]))
print(np.min(data[:,1]))
print(np.mean(data[:,1]))

here we are taking advantage of numpy sub processes to find the max, min, and mean of the entire column of numbers

ok, so far we haven't really done anything that we couldn't easily do in Excel or the Google spreadsheet. let's take advantage of some of the more powerful aspects of python to do a calculation with values that between a range of numbers. 

first let's see how many of the values in column 4 exceed 500mb

print(data[:,4].size)
print(data[data[:,4]>500,4].size)

note the difference between these two commands.  instead of referencing all of the rows in column 4 (using ":"), we replace ":" in the second command with a logical command "data[:,4]>500"  this logical command returns a list of "true" and "false" values depending upon whether each value in column 4 is greater than 500.  only the "true" values will be used in the subsequent process, here just checking the size of that vector.  this is called indirect referencing.

it might be easier to deal with each column of numbers separately, we can assign them to variables with names that have some meaning to us

lat=data[:,0]
lon=data[:,1]
time=data[:,2]
height_m=data[:,3]
press_mb=data[:,4]
temp_C=data[:,5]
mix_ratio_gkg=data[:,6]
wind_spd_ms=data[:,7]
wind_dir_deg=data[:,8]

let's use a logical_and to find the average temperature between two pressure values (500 and 1000mb)

t_mean=np.mean(temp_C[np.logical_and(press_mb>500.,press_mb<=1000.)])
print(t_mean)

we might want to change the units of variables to standard MKS values, such as Kelvin for temperature and Pascal for pressure

press_Pa=press_mb*100.
temp_K=temp_C+273.15
mix_ratio_kgkg=mix_ratio_gkg/1000.

let's plot the path of the sounding

plt.plot(lon,lat)
plt.xlabel('longitude deg')
plt.ylabel('latitude deg')
plt.title('sonde location')

try saving the image and inserting it into your doc

let's look at pressure vs. height, with pressure on the x-axis and height on the y-axis

plt.plot(press_mb,height_m)
plt.xlabel('pressure mb')
plt.ylabel('height m')

now set up another variable that contains the exponential function for our model of pressure with height

press_model_mb=1000.*np.exp(-height_m/7000.)

let's plot both the modeled pressure and the actual pressure on the same plot, using different colors for each line

plt.plot(press_mb,height_m,'b',press_model_mb,height_m,'r')
plt.xlabel('pressure mb')
plt.ylabel('height m')
plt.title('modeled=red and actual=blue pressure')


what is the largest difference between these?

print(np.max(press_mb-press_model_mb))




Comments