Stat-Analysis

Stat-Analysis Functionality

The Stat-Analysis tool ties together results from the MET statistics tools by providing a way to filter their STAT output files and aggregate the results through time and/or space. The Stat-Analysis tool processes the STAT output of the Point-Stat, Grid-Stat, Wavelet-Stat, and Ensemble-Stat tools and performs one or more analysis jobs on the data. The Stat-Analysis tool may be run by specifying a single analysis job on the command line or multiple analysis jobs using a configuration file. The analysis job types are summarized below:

  • The filter job simply filters the STAT lines, retaining only those which meet the specified filtering options.
  • The summary job operates on one column of data from a single STAT line type. It produces summary information for that column of data: mean, standard deviation, min, max, and the 10th, 25th, 50th, 75th, and 90th percentiles. For example, it can be used to look at the summary of a statistic like RMSE across many cases in time (see the sketch after this list).
  • The aggregate job operates on a single line type and aggregates the STAT data in the lines which meet the filtering criteria. It dumps out a line containing the aggregated data. For example, it can be used to sum contingency table counts (CTC) across many cases and dump out the aggregated counts (CTC). The input line type is the same as the output line type.
  • The aggregate_stat job performs almost the same function as the aggregate job, but the output line type differs from the input line type. For example, it can be used to aggregate contingency table counts (CTC) across many cases and dump out statistics generated from the aggregated contingency table (CTS).
  • The go_index job computes the GO Index, a performance metric used primarily by the United States Air Force. The GO Index is a specific application of the more general Skill Score Index (ss_index) job type, which is the weighted mean of skill scores computed for a user-defined set of variables, levels, lead times, and statistics.
  • The ramp job operates on a time series of forecast and observed values and is analogous to the RIRW (Rapid Intensification and Weakening) job supported by the tc_stat tool. The amount of change from one time to the next is computed for forecast and observed values. Those changes are thresholded to define events which are used to populate a 2x2 contingency table.
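
As a minimal sketch of the summary job type, the following command summarizes the RMSE column of the continuous statistics (CNT) lines. It assumes the tutorial's Point-Stat output exists and contains CNT lines:

stat_analysis \
-lookin $MET_TUTORIAL_DATA/output/point_stat \
-job summary -line_type CNT -column RMSE \
-v 2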

The Stat-Analysis tool performs two main steps:

  1. Filter the input STAT lines using the filtering parameters set in the configuration file and/or on the job command line and write the results to a temporary file.
  2. For each analysis job, read filtered data from the temporary file and perform the job.

When processing a large amount of data with Stat-Analysis, grouping similar jobs into a configuration file is more efficient than running them separately on the command line.

Stat-Analysis Usage

View the usage statement for Stat-Analysis by simply typing the following:

stat_analysis

At a minimum, you must specify one or more directories or files in which to find STAT data (using the -lookin path command line option) and either a configuration file (using the -config config_file command line option) or a job command on the command line.

When -lookin is set to one or more explicit file names, Stat-Analysis reads them regardless of their suffix. When -lookin is set to a directory, Stat-Analysis searches it recursively for files with the .stat suffix.
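
The -lookin option may also be used multiple times in a single call, mixing both forms. For example, this sketch filters the CTC lines found under the tutorial's Point-Stat and Grid-Stat output directories (the dump file name here is hypothetical):

stat_analysis \
-lookin $MET_TUTORIAL_DATA/output/point_stat \
-lookin $MET_TUTORIAL_DATA/output/grid_stat \
-job filter -line_type CTC \
-dump_row $MET_TUTORIAL_DATA/output/stat_analysis/all_ctc.stat \
-v 2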

The more data you pass to Stat-Analysis, the longer it will take to run. When possible, users should limit the input data to what is required to perform the desired analysis.

Configure

The behavior of Stat-Analysis is controlled by the contents of the configuration file or the job command passed to it on the command line. The default Stat-Analysis configuration may be found in the $MET_BASE/config/STATAnalysisConfig_default file. The configuration used by the test script may be found in the met-8.0/scripts/config/STATAnalysisConfig file. Prior to modifying the configuration file, users are advised to make a copy of the default:

cp $MET_BASE/config/STATAnalysisConfig_default $MET_TUTORIAL_DATA/config/STATAnalysisConfig_tutorial

Open up the $MET_TUTORIAL_DATA/config/STATAnalysisConfig_tutorial file for editing with your preferred text editor.

For this tutorial, we'll set up a configuration file to run a few jobs. Then, we'll show an example of running a single analysis job on the command line.

The Stat-Analysis configuration file has two main sections. The items in the first section are used to filter the STAT data being processed. Only those lines which meet the filtering requirements specified are retained and passed down to the second section. The second section defines the analysis jobs to be performed on the filtered data. When defining analysis jobs, additional filtering parameters may be defined to further refine the STAT data with which to perform that particular job.
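
As a rough sketch, those two sections of the configuration file look like the following. The entries shown are a small, illustrative subset of what the default file contains; empty lists mean no filtering is applied for that item, and the filter job's dump file name is hypothetical:

fcst_var  = [];
fcst_lev  = [];
obtype    = [];
vx_mask   = [];
line_type = [];

jobs = [
   "-job filter -dump_row ./filter.stat"
];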

As a word of caution, the Stat-Analysis tool is designed to be extremely flexible. However, with that flexibility comes potential for improperly specifying your job requests, leading to unintended results. It is the user's responsibility to ensure that each analysis job is performed over the intended subset of STAT data. The -dump_row job command option is useful for verifying that the analysis was performed over the intended subset of STAT data.

We'll configure the Stat-Analysis tool to analyze the results of the Point-Stat tool output and aggregate scores for the EAST and WEST verification regions. Edit the $MET_TUTORIAL_DATA/config/STATAnalysisConfig_tutorial file as follows:

  • Set fcst_var = [ "TMP" ]; to use only STAT lines for temperature (TMP).
  • Set fcst_lev = [ "P850-750" ]; to use only STAT lines for the forecast level specified.
  • Set obtype = [ "ADPUPA" ]; to use only STAT lines verified with the ADPUPA message type.
  • Set vx_mask = [ "EAST", "WEST" ]; to use only STAT lines computed over the EAST or WEST verification polyline regions.
  • Set line_type = [ "CTC" ]; to use only the CTC lines.
  • Set jobs as follows:
jobs = [
"-job filter -dump_row ${MET_TUTORIAL_DATA}/output/stat_analysis/job1_filter.stat",
"-job aggregate -interp_pnts 1 -dump_row ${MET_TUTORIAL_DATA}/output/stat_analysis/job2_aggr_ctc_1.stat",
"-job aggregate -interp_pnts 25 -dump_row ${MET_TUTORIAL_DATA}/output/stat_analysis/job3_aggr_ctc_25.stat",
"-job aggregate_stat -out_line_type CTS -interp_pnts 25 -dump_row ${MET_TUTORIAL_DATA}/output/stat_analysis/job4_aggr_stat_cts.stat"
];

Save and close this file. The four jobs listed above achieve the following:

  1. Filter the STAT lines, retaining those which meet the filtering criteria, and write them to an output STAT file.
  2. Aggregate the CTC lines which have the INTERP_PNTS column set to 1 and write the lines which meet the filtering criteria to an output STAT file.
  3. Aggregate the CTC lines which have the INTERP_PNTS column set to 25 and write the lines which meet the filtering criteria to an output STAT file.
  4. Do the same as the third job, but write out the aggregated contingency table stats (CTS) rather than the aggregated contingency table counts (CTC).

Note that all four jobs use the -dump_row job command option which dumps the lines of STAT data used for this job to the specified file name. We'll look at these files to ensure that the jobs ran over the intended subsets of STAT data.

Run

Next, run Stat-Analysis on the command line using the following command:

stat_analysis \
-lookin $MET_TUTORIAL_DATA/output/point_stat \
-out $MET_TUTORIAL_DATA/output/stat_analysis/stat_analysis.out \
-config $MET_TUTORIAL_DATA/config/STATAnalysisConfig_tutorial \
-v 2

Stat-Analysis is now performing the analysis jobs we requested in the configuration file. It is writing the output of our four jobs to the file we specified using the -out command line argument. It should run in only a couple of seconds since we're analyzing such a small sample of STAT data. In general, though, Stat-Analysis can process very large amounts of data (a whole season's worth) in a relatively short amount of time.

By Case and STAT Output

Notice that jobs 2 and 3 do basically the same thing but for different -interp_pnts values. We can actually perform those jobs in a much simpler way. The -by job command option specifies one or more columns which define case information. Those case columns are concatenated, and the job is performed for each unique case found in the data.

Also notice that while the output file generated above contains aggregated statistics, it is missing the 22 STAT header columns. The -out_stat job command option specifies the name of an output STAT file which does contain those header columns.

Let's rerun jobs 2 and 3 but on the command line using both the -by and -out_stat options:

stat_analysis \
-lookin $MET_TUTORIAL_DATA/output/point_stat \
-out_stat $MET_TUTORIAL_DATA/output/stat_analysis/ctc_by_interp_pnts.stat \
-job aggregate -line_type CTC -fcst_var TMP -fcst_lev P850-750 -obtype ADPUPA -vx_mask EAST,WEST \
-by INTERP_PNTS \
-v 2

Note that this single job was run on the command line with no configuration file. The multiple values for -vx_mask are specified as a comma-separated list (specifying -vx_mask multiple times works too). Open up the output STAT file ($MET_TUTORIAL_DATA/output/stat_analysis/ctc_by_interp_pnts.stat) and notice the content of the VX_MASK column (EAST,WEST). For string header columns (e.g. VX_MASK, OBTYPE, FCST_VAR), multiple values are simply concatenated, while for date/time columns (e.g. FCST_VALID_BEG, FCST_VALID_END, FCST_LEAD), the min/max values are reported.

Next, rerun that same job but use the -set_hdr option to explicitly specify the contents of the output VX_MASK column:

stat_analysis \
-lookin $MET_TUTORIAL_DATA/output/point_stat \
-out_stat $MET_TUTORIAL_DATA/output/stat_analysis/ctc_by_interp_pnts_set_hdr.stat \
-job aggregate -line_type CTC -fcst_var TMP -fcst_lev P850-750 -obtype ADPUPA -vx_mask EAST,WEST \
-by INTERP_PNTS -set_hdr VX_MASK CONUS \
-v 2

Open up the output file and check the VX_MASK column.

You can use the -by option an arbitrary number of times to define case information. Try rerunning with -by INTERP_PNTS,VX_MASK.
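
For instance, that rerun might look like the following, writing to a new (hypothetical) output file name. It should produce one aggregated CTC line per unique combination of INTERP_PNTS and VX_MASK:

stat_analysis \
-lookin $MET_TUTORIAL_DATA/output/point_stat \
-out_stat $MET_TUTORIAL_DATA/output/stat_analysis/ctc_by_interp_pnts_vx_mask.stat \
-job aggregate -line_type CTC -fcst_var TMP -fcst_lev P850-750 -obtype ADPUPA -vx_mask EAST,WEST \
-by INTERP_PNTS,VX_MASK \
-v 2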

Process Probabilistic Output

Next, we'll run an analysis job on the probabilistic output from the Grid-Stat tool on the command line:

stat_analysis \
-lookin $MET_TUTORIAL_DATA/output/grid_stat \
-job aggregate_stat \
-dump_row $MET_TUTORIAL_DATA/output/stat_analysis/aggr_stat_pstd.stat \
-vx_mask EAST,WEST \
-line_type PCT \
-out_line_type PSTD \
-v 2

The output of this Stat-Analysis job is printed to the screen since we didn't redirect the output to a file using the -out command line option. This job has aggregated two probability contingency table count (PCT) STAT lines, one for the EAST and one for the WEST, and it has written out the corresponding statistics (PSTD) STAT line. The output for this job includes the following 3 lines:

  • The JOB_LIST line lists the job command options that were used to perform this job.
  • The COL_NAME line consists of the column names for the statistics listed in the next line.
  • The PSTD line consists of the output for the probabilistic statistics (PSTD) STAT line. However, only the columns that appear after the LINE_TYPE column are shown.

As we did above, let's switch to using the -by vx_mask option:

stat_analysis \
-lookin $MET_TUTORIAL_DATA/output/grid_stat \
-job aggregate_stat \
-dump_row $MET_TUTORIAL_DATA/output/stat_analysis/job5_aggr_stat_pstd.stat \
-by vx_mask \
-line_type PCT \
-out_line_type PSTD \
-v 2

This job has aggregated the probability contingency table count (PCT) STAT lines by vx_mask and has written out the corresponding statistics (PSTD) STAT line. There is one output line per vx_mask (CONUS, EAST, G212, WEST) for a total of four lines in this case.

Next, we'll run a job to look at verification of wind direction in the output of the Point-Stat tool. Just as we did above, we'll use the EAST and WEST verification regions to aggregate the vector partial sums (VL1L2) for winds and look at errors in the wind direction. Run the following job on the command line:

stat_analysis \
-lookin $MET_TUTORIAL_DATA/output/point_stat \
-job aggregate_stat \
-dump_row $MET_TUTORIAL_DATA/output/stat_analysis/job6_aggr_stat_wdir.stat \
-vx_mask EAST -vx_mask WEST \
-interp_pnts 25 \
-line_type VL1L2 \
-fcst_thresh ge1.0 \
-out_line_type WDIR \
-v 2

The output of this Stat-Analysis job includes the following 4 lines: the JOB_LIST, COL_NAME, ROW_MEAN_WDIR, and AGGR_WDIR. See the MET Users Guide for details on these output lines.

Process Ensemble Output

Lastly, we'll aggregate together two ranked histograms from the Ensemble-Stat output, the ones for the NWC and SWC verification areas. Execute the following command:

stat_analysis \
-lookin $MET_TUTORIAL_DATA/output/ensemble_stat \
-job aggregate \
-line_type RHIST \
-vx_mask SWC -vx_mask NWC \
-obtype ADPSFC \
-v 2

This job aggregates together the 358 ranks from the NWC region with the 382 ranks from the SWC region and writes out the aggregated counts. This aggregation is only possible because the number of ranks in each input line (7) remains constant.
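
If you'd like to keep those aggregated counts rather than just print them to the screen, a variation of this job using the -out_stat option might look like this (the output file name is hypothetical):

stat_analysis \
-lookin $MET_TUTORIAL_DATA/output/ensemble_stat \
-job aggregate -line_type RHIST \
-vx_mask SWC -vx_mask NWC -obtype ADPSFC \
-out_stat $MET_TUTORIAL_DATA/output/stat_analysis/aggr_rhist.stat \
-v 2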

Output

The output of Stat-Analysis is printed to the screen by default, as we saw, unless you redirect it to an output file using the -out command line option. View the output of the first Stat-Analysis command we ran by opening the $MET_TUTORIAL_DATA/output/stat_analysis/stat_analysis.out file using the text editor of your choice. Note the following:

  • The output of the first job to simply filter the STAT data consists of a FILTER line listing the filtering parameters applied.
  • The output of the second and third jobs consists of 3 lines each: the JOB_LIST, COL_NAME, and CTC lines. This is the same type of output that was printed to the screen for the Stat-Analysis job run on the command line.
  • The output of the fourth job consists of 3 lines: the JOB_LIST, COL_NAME, and CTS lines.

Close this file, and open up the following four files to examine the STAT data over which these jobs were run:

  • $MET_TUTORIAL_DATA/output/stat_analysis/job1_filter.stat
  • $MET_TUTORIAL_DATA/output/stat_analysis/job2_aggr_ctc_1.stat
  • $MET_TUTORIAL_DATA/output/stat_analysis/job3_aggr_ctc_25.stat
  • $MET_TUTORIAL_DATA/output/stat_analysis/job4_aggr_stat_cts.stat

In this example, the second, third, and fourth jobs read two CTC STAT lines each, using the masking regions EAST and WEST. So these three jobs aggregate contingency table counts across two verification regions. Users are strongly encouraged to use the -dump_row job command option to verify that the analysis job was run over the intended subset of STAT data. Close these files when you have finished reviewing the STAT data.