link to jupyterhub on scholar.rcac.purdue.edu # basically, the stuff in this green Courier font is code # you should be able to copy and paste it into the jupyter command lines # hit shift-enter to execute the python code that you have entered # things that begin with # are comment lines and are not executed in python pythonWe will be using a special version of python called "Anaconda" (or conda). This contains a large number of modules that will be very useful for us in our data analysis. let's try the classic first programming task that everyone does when they learn how to use a new language print('hello world') hit shift-enter to execute this code try these to see how python handles mathematical expressions and integer data types print(5*4) print(6/2) print(13+5+7) print(6**2) a=5./2. print(a) lists, here is how python assigns a series of values to a variable, putting them into a list within a pair of brackets: a=[2, 3.2, 'hello', 'eeee'] print(a) print(len(a)) print(a[0]) print(a[1:3]) a[1]='goodbye' print(a) numpywe are going to use a very useful module called "numpy" which is short for Numerical Pythonpython includes a module using the "import" command, in general: import module_name as shorthand import numpy as np p=np.pi print(p) print(np.sin(p/2)) let's generate some fake observations and forecasts and do some quick calculations mu=50 sigma=20 obs=np.random.normal(mu,sigma,100) help(np.random.normal) print(np.max(obs)) print(np.min(obs)) print(np.mean(obs)) print(np.median(obs))
mu=-1 sigma=10 randerr=np.random.normal(mu,sigma,100) syserr=2.5 fcst=obs+randerr+syserr print(np.max(fcst)) print(np.min(fcst)) print(np.mean(fcst)) print(np.median(fcst)) plotswe need to import another package that has nice plotting modules import matplotlib.pyplot as plt %matplotlib inline matplotlib is a Matlab-style plotting package, if your're familiar with Matlab, it’s easy to learn plt.plot(obs,'b',fcst,'r') plt.title('obs=blue and fcst=red') plt.ylabel('values') sounding datalet's use a handy numpy process that can read in data from a text file, such as our Springfest sounding file: filename = '/scratch/scholar/m/mebaldwi/EAPS431/20160416_1642Z_Purdue_Oval.txt' data = np.genfromtxt(filename, delimiter=',',skip_header=3) the data in the text file is "comma delimited" meaning separated by commas skip_header lets the program know how many lines of header information to skip before trying to decode the numbers let's find out some information about the data that are in variable "data" print(data.shape) print(data.size) "shape" tells us the number of (rows, columns) that we have read from the csv file "size" tells us the total number of data values, which should be rows*columns we can refer to an individual value in the big data array by referencing its row and column value. these are numbered starting with "0" increasing from left to right, top to bottom in our spreadsheet for example, the value of the top row, left column is print(data[0,0]) note that "nan" means "not a number". print(data[0,:]) you'll see that the numbers match the top row in your spreadsheet (without the header text) the ":" symbol is a 'wildcard" and allows us to refer to the entire length of the vector let's just look at the 5 columns that contain numbers, starting with column #1 (second column from the left) print(data[0,1:6]) this might be a bit confusing, since column #1 is the second column, column #2 is the third column, etc., the right-most column is actually column #5. when you reference values this way, the indexes go from start:end without including the number at the end! for example, try: print(data[0,1:5]) notice that this does not print the value in column #5, only columns 1 through 4. I'm not sure why it works this way, just something we have to get used to. let's do some calculations on column #1, which contains the longitude values print(np.max(data[:,1])) print(np.min(data[:,1])) print(np.mean(data[:,1])) here we are taking advantage of numpy sub processes to find the max, min, and mean of the entire column of numbers ok, so far we haven't really done anything that we couldn't easily do in Excel or the Google spreadsheet. let's take advantage of some of the more powerful aspects of python to do a calculation with values that between a range of numbers. first let's see how many of the values in column 4 exceed 500mb print(data[:,4].size) print(data[data[:,4]>500,4].size) note the difference between these two commands. instead of referencing all of the rows in column 4 (using ":"), we replace ":" in the second command with a logical command "data[:,4]>500" this logical command returns a list of "true" and "false" values depending upon whether each value in column 4 is greater than 500. only the "true" values will be used in the subsequent process, here just checking the size of that vector. this is called indirect referencing. it might be easier to deal with each column of numbers separately, we can assign them to variables with names that have some meaning to us lat=data[:,0] lon=data[:,1] time=data[:,2] height_m=data[:,3] press_mb=data[:,4] temp_C=data[:,5] mix_ratio_gkg=data[:,6] wind_spd_ms=data[:,7] wind_dir_deg=data[:,8] let's use a logical_and to find the average temperature between two pressure values (500 and 1000mb) t_mean=np.mean(temp_C[np.logical_and(press_mb>500.,press_mb<=1000.)]) print(t_mean) we might want to change the units of variables to standard MKS values, such as Kelvin for temperature and Pascal for pressure press_Pa=press_mb*100. temp_K=temp_C+273.15 mix_ratio_kgkg=mix_ratio_gkg/1000. let's plot the path of the sounding plt.plot(lon,lat) plt.xlabel('longitude deg') plt.ylabel('latitude deg') plt.title('sonde location') try saving the image and inserting it into your doc let's look at pressure vs. height, with pressure on the x-axis and height on the y-axis plt.plot(press_mb,height_m) plt.xlabel('pressure mb') plt.ylabel('height m') now set up another variable that contains the exponential function for our model of pressure with height press_model_mb=1000.*np.exp(-height_m/7000.) let's plot both the modeled pressure and the actual pressure on the same plot, using different colors for each line plt.plot(press_mb,height_m,'b',press_model_mb,height_m,'r') plt.xlabel('pressure mb') plt.ylabel('height m') plt.title('modeled=red and actual=blue pressure') what is the largest difference between these? print(np.max(press_mb-press_model_mb)) |