MET Tool: Ensemble-Stat

Ensemble-Stat Tool: General

Ensemble-Stat Functionality

The Ensemble-Stat tool may be used to verify an ensemble of deterministic forecasts against gridded and/or point observations. Using those observations, it derives statistics such as rank histograms, probability integral transform histograms, spread/skill variance, relative position, and the continuous ranked probability score.

Ensemble-Stat Usage

View the usage statement for Ensemble-Stat by simply typing the following:

ensemble_stat

At a minimum, the input gridded ensemble files and the configuration file (config_file) must be passed in on the command line. You can specify the list of ensemble files either as a count of the number of ensemble members followed by the file name for each (n_ens ens_file_1 ... ens_file_n) or as an ASCII file containing the names of the ensemble files (ens_file_list). Choose whichever is most convenient for you. The optional -grid_obs and -point_obs command line options may be used to specify gridded and/or point observations to be used for computing rank histograms and other ensemble statistics.
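As a rough illustration of the second option, the sketch below builds an ASCII file list from a wildcard pattern. The function name write_ens_file_list is hypothetical and not part of MET, and the one-name-per-line layout is an assumption; check the MET User's Guide for the exact file list format your version expects.

```python
import glob
from pathlib import Path


def write_ens_file_list(pattern: str, list_path: str) -> int:
    """Write the files matching `pattern` to `list_path`, one name per line.

    Hypothetical helper (not part of MET). Returns the number of names written.
    """
    # Sort so the member order is reproducible from run to run.
    members = sorted(glob.glob(pattern))
    if not members:
        raise FileNotFoundError(f"no ensemble files match {pattern!r}")
    # Assumption: one file name per line is an acceptable list layout.
    Path(list_path).write_text("\n".join(members) + "\n")
    return len(members)
```

The resulting file could then be passed to ensemble_stat as ens_file_list in place of the n_ens ens_file_1 ... ens_file_n arguments.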

As with the other MET statistics tools, all ensemble data and gridded verifying observations must be interpolated to a common grid prior to processing. This may be done using the automated regrid feature in the Ensemble-Stat configuration file or by running copygb and/or wgrib2 first.

cindyhg Tue, 06/25/2019 - 08:31

Configure

Start by making an output directory for Ensemble-Stat and changing directories:

mkdir -p ${METPLUS_TUTORIAL_DIR}/output/met_output/ensemble_stat

cd ${METPLUS_TUTORIAL_DIR}/output/met_output/ensemble_stat

The behavior of Ensemble-Stat is controlled by the contents of the configuration file passed to it on the command line. The default Ensemble-Stat configuration file may be found in the data/config/EnsembleStatConfig_default file. The configurations used by the test script may be found in the scripts/config/EnsembleStatConfig* files.

Prior to modifying the configuration file, users are advised to make a copy of the default:

cp ${MET_BUILD_BASE}/share/met/config/EnsembleStatConfig_default EnsembleStatConfig_tutorial

Open up the EnsembleStatConfig_tutorial file for editing with your preferred text editor.

vi EnsembleStatConfig_tutorial

The configurable items for Ensemble-Stat are similar to those found in Grid-Stat and Point-Stat: the forecast and observation dictionaries read in the ensemble and observation fields, respectively, while the remaining dictionaries and options control the computed statistics and the area over which those statistics are calculated.

You may find a complete description of the configurable items in the ensemble_stat configuration file section of the MET User's Guide. Please take some time to review them.

For this tutorial, we'll configure Ensemble-Stat to verify 24-hour accumulated precipitation. While we'll run Ensemble-Stat on a single field, please note that it may be configured to operate on multiple fields. The ensemble we're verifying consists of 6 members defined over the west coast of the United States.

Edit the EnsembleStatConfig_tutorial file as follows:

In the fcst dictionary, set:

   field = [
     {
       name  = "APCP";
       level = [ "A24" ];
     }
   ];

To verify the 24-hour accumulated precipitation fields.
In the observation filtering options, set:

message_type = [ "ADPSFC" ];

To verify against surface observations.
In the field entry section, set:

prob_cat_thresh = [ >=0, >=5.0, >=10.0 ];

To specify thresholds to use for computation of the Ranked Probability Score (RPS).
In the mask dictionary, set:

poly = [ "MET_BASE/poly/NWC.poly",
         "MET_BASE/poly/SWC.poly" ];

To also verify over the northwest coast (NWC) and southwest coast (SWC) subregions.
Set:

output_flag = {
   ecnt = BOTH;
   rhist = BOTH;
   phist = BOTH;
   orank = BOTH;
   ssvar = BOTH;
   relp = BOTH;
}

To compute continuous ensemble statistics (ECNT), rank histogram (RHIST), probability integral transform histogram (PHIST), observation ranks (ORANK), spread-skill variance (SSVAR), and relative position (RELP).

Save and close this file.

johnhg Thu, 07/25/2019 - 16:07

Run

Next, run Ensemble-Stat on the command line with the following command, which uses wildcards to list the 6 input ensemble member files:

ensemble_stat \
6 ${METPLUS_DATA}/met_test/data/sample_fcst/2009123112/*gep*/d01_2009123112_02400.grib \
EnsembleStatConfig_tutorial \
-grid_obs ${METPLUS_DATA}/met_test/data/sample_obs/ST4/ST4.2010010112.24h \
-point_obs ${METPLUS_DATA}/met_test/out/ascii2nc/precip24_2010010112.nc \
-outdir . \
-v 2

Ensemble-Stat is now performing the tasks we requested in the configuration file. Note that we've passed the input ensemble data directly on the command line by specifying the number of ensemble members (6) followed by their names using wildcards. We've also specified one gridded StageIV analysis field (-grid_obs) and one file containing point rain gauge observations (-point_obs) to be used in computing rank histograms. This tool should run pretty quickly.

When Ensemble-Stat is finished, it will have created 8 output files in the current directory: 7 ASCII statistics files (.stat, _ecnt.txt, _rhist.txt, _phist.txt, _orank.txt, _ssvar.txt, and _relp.txt) and a NetCDF matched pairs file (_orank.nc).
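One way to confirm that the run produced everything is to check for the 8 files by name. The sketch below is only an illustration: the prefix ensemble_stat_20100101_120000V is taken from the file names used in this tutorial, applying it to the .stat file is an assumption, and the helper names are hypothetical.

```python
from pathlib import Path

# Prefix taken from the tutorial's output file names; assumed to apply to
# every output file, including the .stat file.
PREFIX = "ensemble_stat_20100101_120000V"

# The 7 ASCII suffixes plus the NetCDF matched pairs file.
SUFFIXES = [
    ".stat", "_ecnt.txt", "_rhist.txt", "_phist.txt",
    "_orank.txt", "_ssvar.txt", "_relp.txt", "_orank.nc",
]


def expected_outputs(outdir="."):
    """Return the paths of the 8 expected output files (hypothetical helper)."""
    return [Path(outdir) / f"{PREFIX}{suffix}" for suffix in SUFFIXES]


def missing_outputs(outdir="."):
    """Return the expected output files that do not exist in `outdir`."""
    return [path for path in expected_outputs(outdir) if not path.exists()]
```

Calling missing_outputs(".") from the output directory should return an empty list after a successful run, if the naming assumption holds.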

johnhg Thu, 07/25/2019 - 16:09

Output

The output from Ensemble-Stat is one or more ASCII files containing statistics summarizing the verification performed, and a NetCDF file containing the gridded matched pairs.

All of the line types are written to the file ending in .stat. The Ensemble-Stat tool currently writes 12 output line types: ECNT, RPS, RHIST, PHIST, RELP, SSVAR, PCT, PSTD, PJC, PRC, ECLV, and ORANK.

The ECNT line type contains continuous ensemble statistics such as spread and skill. Ensemble-Stat uses assumed observation errors to compute both perturbed and unperturbed versions of these statistics. Statistics to which observation error has been applied can be found in columns which include the _OERR (for observation error) suffix.
The RPS line type contains the Ranked Probability Score, as well as the number of ensembles that were used to calculate the score, the Ranked Probability Skill Score, and the Ranked Probability Score decomposed into its terms of reliability, resolution, and uncertainty.
The RHIST line type contains counts for a rank histogram. This ranks each observation value relative to the ensemble member values. Ideally, observation values would fall equally across all available ranks, yielding a flat rank histogram. In practice, ensembles are often under-dispersive (U shape) or over-dispersive (inverted U shape). In the event of ties, ranks are randomly assigned.
The PHIST line type contains counts for a probability integral transform histogram. This scales the observation ranks to a range of values between 0 and 1 and allows ensembles of different size to be compared. Similarly, when ensemble members drop out, RHIST lines cannot be aggregated together but PHIST lines can.
The RELP line is the relative position, which indicates how often each ensemble member's value was closest to the observation's value. In the event of ties, credit is divided equally among the tied members.
The PCT line type contains the contingency table counts for probabilistic forecasts.
The PSTD line type is the probabilistic statistics for dichotomous outcomes for derived ensemble relative frequencies.
The PJC line type contains joint and conditional factorization for derived ensemble relative frequencies.
The PRC line type has the receiver operating characteristic for derived ensemble relative frequencies.
The ECLV line type is the economic cost/loss relative value for derived ensemble relative frequencies.
The ORANK line type is similar to the matched pair (MPR) output of Point-Stat. For each point observation value, one ORANK line is written out containing the observation value, its rank, and the corresponding ensemble values for that point. When verifying against a gridded analysis, the ranks can be written to the NetCDF output file.
The SSVAR line contains binned spread/skill information. For each observation location, the ensemble variance is computed at that point. Those variance values are binned based on the ens_ssvar_bin_size configuration setting. The skill is determined by comparing the ensemble mean value to the observation value. One SSVAR line is written for each bin, summarizing all the observation/ensemble mean pairs that it contains.
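To make the relative position (RELP) description above more concrete, here is a minimal sketch of how credit might be assigned for a single observation, with ties sharing the credit equally. This illustrates the idea only and is not MET's implementation; the function name is hypothetical.

```python
def relative_position(ens_values, obs):
    """Return the credit each ensemble member earns for one observation.

    The member closest to the observation gets a credit of 1. If several
    members are equally close, the credit is divided equally among them,
    so the credits always sum to 1.
    """
    distances = [abs(value - obs) for value in ens_values]
    closest = min(distances)
    tied = [i for i, dist in enumerate(distances) if dist == closest]
    credit = [0.0] * len(ens_values)
    for i in tied:
        credit[i] = 1.0 / len(tied)
    return credit
```

Averaging these per-observation credits over all observations would give, for each member, the fraction of the time it was closest to the observed value.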

The STAT file contains all the ASCII output while the _ecnt.txt, _rhist.txt, _phist.txt, _orank.txt, _ssvar.txt, and _relp.txt files contain the same data but sorted by line type. Since so much data can be written for the ORANK line type, we recommend disabling the output of the optional text file using the output_flag parameter in the configuration file.

Since the lines of data in these ASCII files are so long, we strongly recommend configuring your text editor to NOT use dynamic word wrapping. The files will be much easier to read that way.

Open up the ensemble_stat_20100101_120000V_rhist.txt RHIST file using the text editor of your choice and note the following:

vi ensemble_stat_20100101_120000V_rhist.txt

There are 6 lines in this output file, resulting from using 3 verification regions in the VX_MASK column (FULL, NWC, and SWC) and two observation datasets in the OBTYPE column (ADPSFC point observations and gridded observations).
Each line contains columns for the observation ranks (RANK_#) and a handful of ensemble statistics (CRPS, CRPSS, IGN, and SPREAD).
There is output for 7 ranks - since we verified a 6-member ensemble, there are 7 possible ranks the observation values could attain.
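If you would rather check the line count programmatically than in an editor, the sketch below tallies lines by VX_MASK and OBTYPE. It assumes the file has a single header row and whitespace-separated columns with no embedded spaces; the function name is hypothetical.

```python
from collections import Counter


def count_lines_by(path, columns=("VX_MASK", "OBTYPE")):
    """Count data lines in a column-formatted text file, grouped by `columns`.

    Assumes one header row and whitespace-separated fields (an assumption
    about the file layout, not something this helper verifies).
    """
    counts = Counter()
    with open(path) as stat_file:
        header = stat_file.readline().split()
        # Look the columns up by name so the column order does not matter.
        indices = [header.index(name) for name in columns]
        for line in stat_file:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            counts[tuple(fields[i] for i in indices)] += 1
    return counts
```

For the RHIST file above, each (VX_MASK, OBTYPE) combination would be expected to appear once, for 6 lines in total.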

Close this file, and open up the ensemble_stat_20100101_120000V_phist.txt PHIST file, and note the following:

vi ensemble_stat_20100101_120000V_phist.txt

There are 5 lines in this output file rather than 6. We used 3 verification regions (FULL, NWC, and SWC) and two observation datasets (ADPSFC point observations and gridded observations), but the ADPSFC point observations for the SWC region were all zeros, for which the probability integral transform is not defined.
Each line contains columns for the BIN_SIZE and counts for each bin. The bin size is set in the configuration file using the ens_phist_bin_size field. In this case, it was set to 0.05, therefore creating 20 bins (1/ens_phist_bin_size).
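The bin arithmetic can be sketched as follows. This only illustrates how a bin size maps to a number of bins and how a transform value between 0 and 1 would fall into one of them; it does not reproduce how MET computes the probability integral transform itself, and the function names are hypothetical.

```python
def n_phist_bins(bin_size):
    """Number of histogram bins for a given bin size (1 / bin_size)."""
    # Round to guard against floating point error in the division.
    return int(round(1.0 / bin_size))


def phist_bin_index(pit, bin_size):
    """Return the zero-based bin index for a transform value in [0, 1]."""
    if not 0.0 <= pit <= 1.0:
        raise ValueError("pit must lie between 0 and 1")
    n_bins = n_phist_bins(bin_size)
    # A value of exactly 1.0 is placed in the last bin.
    return min(int(pit * n_bins), n_bins - 1)


def phist_counts(pits, bin_size):
    """Count how many transform values fall into each bin."""
    counts = [0] * n_phist_bins(bin_size)
    for pit in pits:
        counts[phist_bin_index(pit, bin_size)] += 1
    return counts
```

With a bin size of 0.05, n_phist_bins returns 20, matching the number of bin count columns described above.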

Close this file, and open up the ensemble_stat_20100101_120000V_orank.txt ORANK file, and note the following:

vi ensemble_stat_20100101_120000V_orank.txt

This file contains 1866 lines, 1 line for each observation value falling inside each verification region (VX_MASK).
Each line contains 44 columns, including header information, the observation location and value, its rank, and the 6 values for the ensemble members at that point.
When there are ties, Ensemble-Stat randomly assigns a rank from all the possible choices. This can be seen in the SWC masking region where all of the observed values are 0 and the ensemble forecasts are 0 as well. Ensemble-Stat randomly assigns a rank between 1 and 7.
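The ranking and tie-breaking behavior described above can be sketched in a few lines. This is a simplified illustration under the assumption that a tied observation is equally likely to receive any of its possible ranks; it is not MET's code, and the function names are hypothetical.

```python
import random


def observation_rank(ens_values, obs, rng=random):
    """Return the rank (1 to n_ens + 1) of `obs` among the ensemble values.

    Members tied with the observation make several ranks possible; one of
    them is picked at random.
    """
    n_below = sum(1 for value in ens_values if value < obs)
    n_tied = sum(1 for value in ens_values if value == obs)
    # Possible ranks run from n_below + 1 to n_below + n_tied + 1.
    return n_below + 1 + rng.randint(0, n_tied)


def rank_histogram(pairs, n_ens, rng=random):
    """Count the ranks over (ensemble values, observation) pairs."""
    counts = [0] * (n_ens + 1)
    for ens_values, obs in pairs:
        counts[observation_rank(ens_values, obs, rng) - 1] += 1
    return counts
```

For a 6-member ensemble in which every member and the observation are 0, n_below is 0 and n_tied is 6, so the rank is drawn from 1 to 7, as seen in the SWC region. Passing a seeded random.Random instance as rng makes the tie-breaking reproducible.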

Use the ncview utility to view the NetCDF gridded observation rank file:

ncview ensemble_stat_20100101_120000V_orank.nc &

This file is only created when you've verified using gridded observations and have requested its output using the output_flag parameter in the configuration file. Click through the variables in this file. Note that for each of the three verification areas (FULL, NWC, and SWC) this file contains 4 variables:

The gridded observation value
The observation rank
The probability integral transform
The ensemble valid data count

In ncview, the random assignment of tied ranks is evident in areas of zero precipitation.

Close this file.

Feel free to explore using this dataset. Some options to try are:

Try setting skip_const = TRUE; in the config file to discard points where all ensemble members and the observation are tied (i.e. zero precip). If you want to save it to a different file, make sure you set output_prefix to something meaningful, such as "run2", or "skip-constant".
Try setting obs_thresh = [ >0.01 ]; in the config file to only consider points where the observation meets this threshold. How does this differ from using skip_const?
Use wgrib to inventory the input files and add additional entries to the ens.field list. Can you process 10-meter U and V wind?

johnhg Thu, 07/25/2019 - 16:11