Stepanek - Homework #2 Using R and Excel

Given the open nature of our second homework assignment for EAS 59100 – Forecast Verification, I utilized multiple methods in order to perform the following activities:

-          Creating scatterplots of forecast and observed values of temperature and precipitation

-          Creating  categorical contingency tables – estimating joint probability distributions

-          Computing marginal probabilities from the contingency tables

-          Factoring the joint distributions

-          Computing conditional probabilities

-          Calculating forecast bias

Aside from checking a lot of my calculations via a standard scientific calculator, and sketching thoughts down with pencil/paper, I utilized the following software programs, and will detail the steps I performed using:

-          Microsoft Excel

-          R (Project for Statistical Computing)

We were provided a dataset with a variety of information within a MS Excel Spreadsheet.  After reading through the assignment, I felt this would make for a good opportunity to start learning more about the R software package (note that I am NOT an R expert, and this was my first 'real' experience using the software). 

Please note that in general, this wiki works through the data in homework problem #1, but the procedures to produce results for homework problem #2 are very similar.


Major Step #1 – Telling R how to read in the data

1)       After carefully reading the assignment and developing a feel for what data would be needed to perform the necessary calculations (and subsequently what data was extraneous and could be eliminated), I deleted all columns except ones that were necessary.  For this case, the following data columns were retained

o   Actual Temperatures

o   Actual Precipitation

o   MOS Forecast Temperatures

o   MOS Forecast Precipitation

a.       Important:  File must be saved in a format that can be understood by R – I saved the file as a .csv (comma delimited) file

b.      Important:  Column headers must be a string of characters in order to have things match up properly (example: KIND_temp_(F) is acceptable, while KIND temp (F) would likely cause errors to develop)


2)      Read the data into R as a table

a.       Command:  > a <- read.table (“J:/HW_2.csv”)

                                                               i.      Note, you could assign a variety of names to the table – I simply chose ‘a’ since I was starting off and familiarizing myself with the                                                                 software interface


3)      Clean up the table

a.       The table will look very unorganized upon displaying it in R - I cleaned it up into organized columns by using the following command:

                                                               i.      > a <- read.table (“J:/HW_2.csv”, header=T, sep=”,”)

                            1.       This eliminates the commas in between the data values and tells R that the first row contains the names of the columns in the table


Additional comments:

-          Simply type the name you assigned to the table (in my case ‘a’) at the R prompt to display the data contained in the table.

o   Example:  > a


-          If you want to see the names of the column data that are in the table, use the following command:

o   > names (a)          (this should display something like the following)

 [1] “KIND_Temp_Obs”     “KIND_POP_Obs”     “KIND_temp_MOSFcst”     “KIND_POP_MOSFcst”


Major Step #2 – Creating a Scatterplot in R

1)       Plotting two variables against one another is a fairly easy task in R – it just takes a bit of extra work to clean up the plot and make it ‘look good’

a.       Here is the general command to plot two variables as a scatterplot in R:

                                        plot (x,y)

            However, you can then add in a ‘main’ title, as well as ‘xlab’ x-axis labels and ‘ylab’ y-axis labels. 

b.      For example, here is the command I used to plot forecast (MOS) temperatures against the observed temperatures (some insignificant details eliminated for aesthetics):

                                                               i.      > plot(a$KIND_Temp_MOSFcst, a$KIND_Temp_Obs,

                                    + main=”Relationship Between Observed and Forecasted Temperatures”,

                                    + xlab=”MOS Forecast Precipitation”,

                                    + ylab=”Observed Precipitation”)

                                                             ii.      Noted how I had to utilized the ‘a$’ before my column headers to let R know that the columns/data I wished to plot were a subset of the table ‘a’.

                                                            iii.      Pay close attention to commas and quotation marks or you will likely receive an error message!

c.       The scatterplot should open in a separate window (note, I have utilized the Windows version of R and do not know how closely the Linux version would parallel). 

                                                               i.      There are a multitude of ways to copy or save the data to paste into another document (I copied mine into an Excel sheet for simplicity).  Below is an                                     example of the output produced by the R software


Major Step #3 -  Creating a Categorical Contingency Table within R

1)      This was by far the most difficult step (at least for me) to achieve, but I learned a lot about programming in the R environment while working on this portion of the homework.  My goal was to create an 8x8 contingency table with categories consisting of increments of 10 (°F).

a.       Please note that I could have easily created a 9x9 contingency table, but chose to eliminate the 90-99 category due to one observation of 92°F (and no forecasts)  In retrospect, I should have retained this information, though it would not cause any significant alterations to my values due to the sample size.


2)      The general way to create a table based upon the data is using the following command:

a.       > table (x, y)     (or, based upon my columns…)

b.      > table (a$KIND_Temp_Obs, a$KIND_Temp_MOSFcst)

The problem with this method?  It parses out the data into single degree (°F) increments, while we were instructed to use categories from 10-19, 20-29, etc. (and this makes the results significantly more useful!)

3)      Here is a general outline of what was done to create the initial ‘categories’ for both observed and forecasted temperatures

 a.       Determine the number of values for a specific column

                                                               i.      length (column header) … > length(a$KIND_Temp_Obs)

1.       This gave me an output of [1] 196

2.       This informed me that there were 196 values in the column for observed temperatures (and therefore 196 data points for each column header)


b.      Next, I created a null object with the mindset that I would create an additional column to categorize the observed temperature values (and eventually categories for the other values as well)

                                                               i.      > cat <- NULL

1.       This created a new category for which one can assign values (in this case, categories for temperature values

c.       The following is the critical step (which I must note took extensive amounts of trial and error – emphasis on the error) – an extended ‘for / if’ statement assigning new values to each observed temperature value under the header ‘cat’


                                                               i.      Ultimately, the iterations on this step took some advanced knowledge that there were no temperature values less than 10°F and subsequently started                                      by defining ALL observed temperatures as being between 10°F and 19°F (10-19) and then ‘asking’ if a temperature was greater than another value,                                         assigning a new category if it met the argument, while leaving other (cooler) values alone.  Note that you could also take the opposite approach and                                     start with the highest value and create a series of 'if less than' arguments to parse the data out in a similar fashion.


                                                             ii.      Started with the following statement, which basically created (while leaving the original data intact) a new column of 196 data values all with a value                                     of “10-19”

                        >  for (i in 1:196)

                        + cat [i] <- “10-19”


Note:  I could have assigned anything for the ‘i’ in the above lines.  This new column of data (length = 196) is created out of ‘nowhere’.


                                                            iii.      At this point, you need to ‘teach’ the data to reference what you are interested in categorizing, and what intervals to condense the original data into.                                       The extended example below is for the observed temperatures at Indianapolis, IN


> for (i in 1:196) {

+ if (a$KIND_Temp_Obs>=20)

+ cat[i] <- “20-29”

+ if (a$KIND_Temp_Obs>=30)

+ cat[i] <- “30-39”

+ if (a$KIND_Temp_Obs>=40)

+ cat[i] <- “40-49”

+ if (a$KIND_Temp_Obs>=50)

+ cat[i] <- “50-59”

+ if (a$KIND_Temp_Obs>=60)

+ cat[i] <- “60-69”

+ if (a$KIND_Temp_Obs>=70)

+ cat[i] <- “70-79”

+ if (a$KIND_Temp_Obs>=80)

+ cat[i] <- “80-89”

+ }


How this works is through a series of iterations, so there are times where a value of, say, 88 would be categorized as “50-59”, but through the iterative process, it eventually falls into the proper category of 80-89, assuming you do have some knowledge of the data 'boundaries'.


                                                           iv.      Next, I ensured the data looked good by simply displaying ‘cat’ and then looked to integrate this into my original data table.  This can be done by                                         ‘binding’ the category to the table with the following command:


                        > cbind  (a, cat)       (specific to my naming convention, but this binds ‘cat’ to table ‘a’)


            Furthering this idea, to permanently attach this new category to the original table, I just assigned the ‘bound’ table to the original table name with the following:


                                    > cbind (a, cat) -> a


            To confirm that it was bound to the original table, you can look at the specific category and get a printout by typing the following:   > a$cat


                                                             v.       Next, since your table is looking to determine the joint probability distribution (and ultimately marginal probabilities and further down the line                                             conditional probabilities), a similar technique should be utilized to create like categories from the forecast data.  Here is the writeup, noting the                                             different categories, different naming conventions, but overall similar structure


> for (j in 1:196) {

+ if (a$KIND_Temp_MOSFcst>=20)

+ cat2[j] <- “20-29”

+ if (a$KIND_Temp_ MOSFcst >=30)

+ cat2[j] <- “30-39”

+ if (a$KIND_Temp_ MOSFcst >=40)

+ cat2[j] <- “40-49”

+ if (a$KIND_Temp_ MOSFcst >=50)

+ cat2[j] <- “50-59”

+ if (a$KIND_Temp_ MOSFcst >=60)

+ cat2[j] <- “60-69”

+ if (a$KIND_Temp_ MOSFcst >=70)

+ cat2[j] <- “70-79”

+ if (a$KIND_Temp_ MOSFcst >=80)

+ cat2[j] <- “80-89”

+ }


Note that similar methods were used to combine ‘cat2’ to the original table ‘a’ as well, giving us the two columns of data needed to make a joint probability table


It would be recommended to at least make a cursory glance to ensure that the newly created columns (cat, cat2) do correspond to the actual data!


                                                           vi.      At this point, you can construct a new table consisting of the ‘categorized’ values of temperature (or precipitation) using the following command:


                        > table (cat, cat2)           (input the names of your categorized columns of data)


Note that at this time, I gave my cat and cat2 columns different names to keep things a bit more organized.  Sometimes it is easier to keep the names simple for the computing, then give them more ‘complex’ names to give them significance.


> cat -> Actual_Temps

> cat2 -> Forecast_Temps


                                                          vii.      Finally, within R, I recalculated the table to give the values as ‘relative’ frequencies, completing the work for my joint probability distribution table.                                      Use the following command, combined with the previous knowledge of how many data points are in the column, to make the frequencies ‘relative’:


                        > (table (Actual_Temps, Forecast_Temps))/196


Note that you can ‘clean things up’ from having too many decimal places by assigning a name to the table (for example, ‘b’) and then using the following command:


                        >  round (b, digits=4) – this gives each cell 4 digits after the decimal place


However, I actually found it much easier to do this work once I exported the data into a file readable by Excel and performing the data manipulations within Excel.


                                                        viii.      Exporting the data to Microsoft Excel (or, exporting what you want in comma delimited format)

1.       Name the table, if not already done so

2.       Use the following command, indicating specific folders, if needed (otherwise, will save to your designated R folder)


> write.csv (file = “b.csv”, b)




Major Step #4 -  Finalizing Categorical Contingency Table within Excel (and computing the Marginal Probabilities)

1)       Open the .csv file created at the end of Major Step #3 within MS Excel (it is assumed the user has knowledge of how to do this).  Note that most, if not all formatting done within R has been lost due to the .csv format

2)      Clean things up a bit – add titles to the axes, standardize decimal places if desired, etc.

3)      To calculate the marginal probabilities p(x) and p(f), simply add up the values within a row or a column.  Your vertical column should be observations (x) and your horizontal row should be forecasts (f).  Therefore, your marginal probabilities p(x) are calculated by adding up each row for a specific category of observation value (i.e. x = 10-19).  Similarly, each marginal probability p(f) is calculated by adding up the columns for a specific forecast value (i.e. f = 50-59).


a.       This is best done using the ‘summation’ tab (Σ) and highlighting the desired values to be added together

b.      Ensure you have calculated the rows/columns correctly by adding up all values found for p(f) & p(x) for each category – if correct, they should each add up to 1, showing the probability of your forecast or observation falling into at least one category is 100%


Major Step #5 -  Factoring the Joint Distribution and computer Conditional Probabilities

1)       All of these steps were performed in MS Excel, as they are simply modifications of the Categorical Contingency Table produced in Major Step #4.


2)       Earlier, in Major Step #3, I took the data table and performed the following command to get the relative probabilities


> (table (Actual_Temps, Forecast_Temps))/196


                I began by simply using the table of Actual_Temps and Forecast_Temps, but in an 'absolute' sense (no division by the total number of data values).  In a manner similar to Major                     Step #4 (part 3), I added up all the rows and columns, which signify how many forecast (observations) fell into a certain category regardless of the relationship to the                                 observation (forecast).


3)      For this step, our goal is to create two conditional probability tables, one for p (f | x)  and another for p (x | f).  It is very helpful to put into words what these two ‘conditional probabilities’ mean.  For p (x | f), we are looking to find the probability that an observation (x) will fall into a specific category given the forecast has ALREADY been determined and assigned.  For example:


What is the probability that the actual observation was between 20-29 GIVEN a forecast that was between 30-39?


Therefore, taking the absolute values of each joint distribution and dividing them by the number of forecasts for a specific category, we are able to calculate the conditional probabilities.  This was done in Excel by simply adding up rows and columns using the aforementioned summation tool, and then dividing specific rows or columns by the representative number of forecasts or observations.   Another example of p (x | f):


There were 17 temperature forecasts that called for a value between 20-29.  Of these 17 forecasts, 10 saw an actual observation of 30-39.  Therefore, the probability of an observation (x) falling in the 30-39 category GIVEN a forecast of 20-29 is 10/17, or 0.5882 (58.82%).


Nothing specific was used within Excel other than the summation tool and simple division.  You can check your work by ensuring that the columns add up to a probability of 1 for p (x | f) and that the rows add up to 1 for p (f | x).


Major Step #6 -  Computing the Forecast Bias

1)       This is simple to do, and could be done easily both in Excel or in R.  I will paint the picture in R since the reader may be less familiar with that method.


2)       The measure of forecast bias is found by simply subtracting the average observed value from the average forecast value.  Therefore, if you know the average values, it is easy arithmetic.


3)      If you have a table within R with the necessary variables, you can simply do a summary of the table with the following command:


> summary (a)


This should display data including mean, median, quartiles, etc. for each column of data in the table.  By doing this, I found out the various means for observed and forecast temps and precip, and by doing the simple math, you can determine the forecast bias.