Qing Zhu's Final exam

Spatial verification - Cluster Analysis (CA) method
1. Introduction
    Spatial forecasts over a domain cannot be verified simply by a grid-to-grid comparison of the forecast field and the observation field. For space-to-space comparison, three techniques are typically employed: (1) cluster analysis, (2) variogram/correlation comparison, and (3) optical flow. All of them rely to some extent on the overall statistical characteristics of the forecast and observation fields rather than on the value of the variable at each specific grid point. Here I focus on the first method, Cluster Analysis (CA) [Marzban and Sandgathe 2006, 2008; Marzban et al., 2008], and apply it to a data set of forecast and observed cumulative precipitation over the US.
    The CA method aims to identify featured objects/events within a given field. Each identified object corresponds to a cluster, and the method is designed so that the distances within each cluster are minimized while the distances between different clusters are maximized. With the CA method, objects or events are identified separately in the forecast field and in the observation field. The two fields are then compared by measuring the dissimilarity between the featured objects identified in each.

2. Methodology
  • Description of CA algorithm
    The essential part of the CA algorithm is partitioning a spatial domain into clusters. It begins by treating each individual grid point as a cluster and then iteratively merges similar grid points into the same cluster. In each iteration, the algorithm checks the distances between clusters and merges the two closest clusters into a single cluster. Eventually, the algorithm ends with one single cluster containing all the grid points. Because every merge is recorded, the parent and child nodes are explicitly known at each iteration and can easily be retrieved for different analysis purposes.
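The merging loop described above can be sketched as follows. This is a minimal illustration, not the contents of ca.py: the function names (`average_distance`, `agglomerate`), the stopping parameter `n_clusters` and the sample points are all assumptions made for the example, and the brute-force search over cluster pairs is only practical for small inputs.

```python
import math
from itertools import combinations

def average_distance(c1, c2):
    """Mean Euclidean distance over all point pairs drawn from two clusters."""
    total = sum(math.dist(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Returns the remaining clusters and the merge history, where each entry
    is (parent id, first child id, second child id).
    """
    clusters = {i: [p] for i, p in enumerate(points)}  # each point starts alone
    history = []
    next_id = len(points)
    while len(clusters) > n_clusters:
        # Pair of cluster ids with the smallest inter-cluster distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda ab: average_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        history.append((next_id, a, b))
        next_id += 1
    return list(clusters.values()), history

# Hypothetical points; in the report each point would be a grid point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, history = agglomerate(points, n_clusters=3)
print(clusters)
print(history)
```

Setting `n_clusters=1` runs the procedure to the end, leaving a single cluster, and `history` then holds the full parent/child tree.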
  • Calculating distance between objects
    As described above, the distances between clusters are calculated at each iteration, and these distances determine which two clusters are merged. Since each cluster may contain several grid points, the "Euclidean distance" between two clusters is not uniquely defined. There are three common ways to calculate it:

(1) Averaged distance

D = (1/(n1 n2)) Σi Σj Di,j,  with i = 1..n1 and j = 1..n2

n1 is the number of grid points within one object; n2 is the number of grid points within the other object. Di,j is the Euclidean distance between point i in one object and point j in the other object. By averaging all possible distances Di,j, we obtain a reasonable representation of the distance between the two objects.

(2) Minimal distance

D = min Di,j over all i, j

The notation is the same as in (1) Averaged distance. The smallest point-to-point distance is taken to represent the distance between the objects.

(3) Maximal distance

D = max Di,j over all i, j

The notation is the same as in (1) Averaged distance. The largest point-to-point distance is taken to represent the distance between the objects.
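The three definitions can be compared side by side in a short sketch. The function names and the two sample objects below are hypothetical and chosen only for illustration; each object is a list of point coordinates.

```python
import math

def pairwise(obj1, obj2):
    """All Euclidean distances Di,j between points of two objects."""
    return [math.dist(p, q) for p in obj1 for q in obj2]

def averaged_distance(obj1, obj2):
    d = pairwise(obj1, obj2)
    return sum(d) / len(d)  # len(d) equals n1 * n2

def minimal_distance(obj1, obj2):
    return min(pairwise(obj1, obj2))

def maximal_distance(obj1, obj2):
    return max(pairwise(obj1, obj2))

# Two hypothetical objects, each with two grid points.
obj1 = [(0.0, 0.0), (0.0, 2.0)]
obj2 = [(3.0, 0.0), (3.0, 2.0)]

d_avg = averaged_distance(obj1, obj2)
d_min = minimal_distance(obj1, obj2)
d_max = maximal_distance(obj1, obj2)
print(d_min, d_avg, d_max)
```

For any pair of objects the averaged distance lies between the minimal and maximal distances, so the choice mainly affects how sensitive the merging is to outlying points.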
  • Variable normalization
    The distance between clusters takes into account not only the 2D spatial coordinates (x-y) but also the value of the variable (say p). As a result, the distance is calculated in 3D space (x-y-p). However, the ranges of x, y and p differ greatly. For the distance to be meaningful, the x-y-p coordinates must be normalized before the distance is calculated.
    The normalization is done by subtracting the mean value and dividing by the standard deviation for each coordinate:

x' = (x - mean(x)) / std(x),  y' = (y - mean(y)) / std(y),  p' = (p - mean(p)) / std(p)

x, y and p are vectors. For a real data set, x could be longitude, y could be latitude and p could be precipitation amount. After the normalization, each of x, y and p has zero mean and unit standard deviation (they are not necessarily normally distributed).
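A minimal sketch of this normalization, using only the standard library. The function name `standardize` and the sample values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the standard deviation.

    Assumes the values are not all identical (non-zero standard deviation).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(v - mean) / std for v in values]

# Hypothetical longitude, latitude and precipitation vectors.
x = [-120.0, -100.0, -80.0, -90.0]
y = [30.0, 35.0, 40.0, 45.0]
p = [0.0, 12.5, 3.0, 40.0]

xn, yn, pn = standardize(x), standardize(y), standardize(p)

# Each grid point becomes a point in normalized x-y-p space.
points = list(zip(xn, yn, pn))
print(points[0])
```

After this step the three coordinates contribute on comparable scales to the Euclidean distance between points.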
  • Identifying clusters and Matching clusters from forecast field and observation field
    Clusters are identified for the forecast field and the observation field separately. As a result, we have two groups of clusters, one from the forecast field and one from the observation field. An object identified in the forecast field may be missing entirely from the observation field, or it may appear there displaced. Matching clusters between the two fields comes down to finding a reasonable threshold: a pair of clusters from different fields is marked as "matched" only when the distance between them is within that threshold.
    The threshold is obtained by (1) calculating all possible pairwise distances between clusters from the two fields; (2) drawing a histogram of the pairwise distances, which probably follows a bell-shaped distribution; (3) picking a threshold value from the left tail of the histogram.
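Once a threshold has been chosen, the matching step could look like the sketch below. The function name `match_clusters` and the 3 by 3 distance matrix are hypothetical; the matrix holds made-up values, not results from this study.

```python
def match_clusters(dist, threshold):
    """Classify clusters given dist[i][j], the distance between observed
    cluster i and forecast cluster j.

    Returns (matches, missed, false_alarms): matched (i, j) pairs, observed
    clusters with no forecast match, and forecast clusters with no observed
    match. One cluster may match several others if several distances fall
    within the threshold.
    """
    matches = [
        (i, j)
        for i, row in enumerate(dist)
        for j, d in enumerate(row)
        if d <= threshold
    ]
    matched_obs = {i for i, _ in matches}
    matched_fcst = {j for _, j in matches}
    missed = [i for i in range(len(dist)) if i not in matched_obs]
    false_alarms = [j for j in range(len(dist[0])) if j not in matched_fcst]
    return matches, missed, false_alarms

# Hypothetical distances: rows are observed clusters, columns forecast clusters.
dist = [
    [0.4, 2.1, 2.6],
    [2.2, 1.9, 2.4],
    [2.0, 2.3, 0.9],
]
matches, missed, false_alarms = match_clusters(dist, threshold=1.5)
print(matches, missed, false_alarms)
```

In this made-up case observed clusters 0 and 2 find a forecast counterpart, observed cluster 1 is missed and forecast cluster 1 is a false alarm.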

3. Data
    The data set consists of NCEP observed cumulative precipitation and WRF forecast cumulative precipitation over the US. The following two plots are for 2007-05-01. The original data set has 881 by 1121 grid points. Because the CA method needs to calculate the pairwise distances between all grid points, I re-sample the data to 50 by 50 grid points to ease the computational burden; the featured objects are not changed by the re-sampling. The NCEP cumulative precipitation map has large, strong precipitation cells in the southern mid-US, along the West Coast and in the northeastern US, while the WRF forecast map has no precipitation along the West Coast and has a precipitation cell in southeastern Florida that is not observed in the NCEP data.
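The report does not state how the re-sampling was done. One simple possibility, shown here purely as an assumption, is to keep evenly spaced rows and columns of the original grid. The helper names and the small demo field are hypothetical; `n_pairs` illustrates why the reduction matters for a method built on pairwise distances.

```python
def subsample_indices(n_source, n_target):
    """Evenly spaced indices from 0 to n_source - 1, inclusive."""
    step = (n_source - 1) / (n_target - 1)
    return [round(k * step) for k in range(n_target)]

def subsample(field, n_rows, n_cols):
    """Keep n_rows by n_cols evenly spaced values of a 2D list-of-lists."""
    rows = subsample_indices(len(field), n_rows)
    cols = subsample_indices(len(field[0]), n_cols)
    return [[field[r][c] for c in cols] for r in rows]

def n_pairs(n_points):
    """Number of distinct point pairs among n_points points."""
    return n_points * (n_points - 1) // 2

# Small hypothetical field (9 by 13) reduced to 3 by 4.
field = [[100 * r + c for c in range(13)] for r in range(9)]
small = subsample(field, 3, 4)
print(small)

# Pair counts for the full grid and the re-sampled grid.
print(n_pairs(881 * 1121), n_pairs(50 * 50))
```

Point sampling like this discards the values in between; block averaging would be an alternative if precipitation totals within each coarse cell need to be preserved.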
  • NCEP cumulative precipitation

  • WRF forecasts of cumulative precipitation

4. Results
  • Scatter plot of forecasts vs observations
    The scatter plot (forecast value versus observation value) gives an overall idea of how well the forecast matches the observation. Most of the points cluster around the lower left corner of the plot, while a tail extends into the lower right part, indicating that the forecasts over-estimate the precipitation in some areas.

  • Clusters on observation field
   Using the Cluster Analysis (CA) method, six featured objects are identified on the NCEP observational precipitation map. The distances between clusters are calculated using the "averaged distance".

  • Clusters on forecast field
   Six featured objects are also identified on the WRF forecast precipitation map.

  • Pairwise distance between clusters from forecast field and observation field
   The pairwise distances between clusters from the two fields are used to generate the following histogram. Most of the distance values are around 2, so I select 1.5 as the threshold for matching the clusters. If the distance is larger than the threshold, the object is marked as missed (for an object in the observation field) or as a false alarm (for an object in the forecast field).

    The following table summarizes the distances from each of the six objects (A~F) in the observation field to each of the six objects (A~F) in the forecast field. The matched objects are highlighted in red.

    Object C in the observation field matches object A in the forecast field.

    Object B in the observation field matches object C in the forecast field.

    Object D in the observation field matches object E in the forecast field.

    Object A in the observation field is missed in the forecast field.

    Object E in the observation field is missed in the forecast field.

    Object F in the observation field is missed in the forecast field.

    Object B in the forecast field is a false alarm.

    Object D in the forecast field is a false alarm.

    Object F in the forecast field is a false alarm.

  • False Alarm Ratio (FAR), Missed, Threat Score (TS)
        In summary, I put the results into a 2 by 2 contingency table. Counting the objects listed above gives a = 3 matched objects, b = 3 false alarms and c = 3 missed objects.

        We can then calculate the False Alarm Ratio (FAR), the fraction of missed events (Missed), and the Threat Score (TS) as follows:

      FAR = b/(a+b)=0.5

      Missed = c/(a+c)=0.5

      TS = a/(a+b+c)=0.33
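The counts and scores can be reproduced from the matching outcome listed above. This is a small sketch with hypothetical names (`verification_scores`, `matches`); the threat score uses the standard definition a/(a+b+c), where a is the number of matched objects, b the number of false alarms and c the number of missed objects.

```python
def verification_scores(a, b, c):
    """Return (FAR, Missed, TS) from hits a, false alarms b and misses c."""
    far = b / (a + b)
    missed = c / (a + c)
    ts = a / (a + b + c)
    return far, missed, ts

# Matching outcome reported in this section: observed object -> forecast object.
matches = {"C": "A", "B": "C", "D": "E"}
obs_objects = "ABCDEF"
fcst_objects = "ABCDEF"

a = len(matches)                                                 # matched objects
b = sum(1 for f in fcst_objects if f not in matches.values())    # false alarms
c = sum(1 for o in obs_objects if o not in matches)              # missed objects

far, missed, ts = verification_scores(a, b, c)
print(a, b, c)
print(far, missed, round(ts, 2))
```

Because each of the three counts equals 3 here, FAR and Missed both come out at 0.5 and TS at one third.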


5. Future work
  • Explore the relationship between the Threat Score (TS) and the total number of clusters.
  • Conduct cluster analysis on the joint field of forecasts and observations.
6. References
  • Marzban, C. (1998) Scalar measures of performance in rare-event situations. Wea. Forecasting, 13, 753-763.
  • Marzban, C., and S. Sandgathe (2006) Cluster analysis for verification of precipitation fields. Wea. Forecasting, 21, 824-838.
  • Marzban, C., and S. Sandgathe (2008) Cluster analysis for object-oriented verification of fields: A variation. Mon. Wea. Rev., 136, 1013-1025.
  • Marzban, C., S. Sandgathe, and H. Lyons (2008) An object-oriented verification of three NWP model formulations via cluster analysis: An objective and a subjective analysis. Mon. Wea. Rev., 136, 3392-3407.
7. Python code
  • Uploaded file named: ca.py
Michael Baldwin,
Dec 13, 2012, 10:24 PM