4. Run the Workflow

schramm Thu, 08/12/2021 - 11:48

Estimated time: 25 min

Set the environment variable $EXPTDIR to your experiment directory to make navigation easier:

EXPT_SUBDIR=test_CONUS_25km_GFSv15p2
EXPTDIR=${SR_WX_APP_TOP_DIR}/../expt_dirs/${EXPT_SUBDIR}

The workflow can be run using the ./launch_FV3LAM_wflow.sh script, which contains the rocotorun and rocotostat commands needed to launch the tasks and monitor their progress. There are two ways to run the script: 1) manually from the command line, or 2) by adding a command to the user’s crontab. Both options are described below.

In either case, note that the rocotorun process is iterative: the command must be executed repeatedly, usually every 2-10 minutes, until the entire workflow is completed.

To run the workflow manually:

cd $EXPTDIR
./launch_FV3LAM_wflow.sh

Once the workflow is launched with the launch_FV3LAM_wflow.sh script, a log file named log.launch_FV3LAM_wflow will be created in $EXPTDIR.

To see what jobs are running for a given user at any given time, use the following command:

qstat -u $USER

This will show any of the workflow jobs/tasks submitted by rocoto (in addition to any unrelated jobs the user may have running).  Error messages for each task can be found in the task log files located in the $EXPTDIR/log directory. In order to launch more tasks in the workflow, you just need to call the launch script again in the $EXPTDIR directory as follows:

./launch_FV3LAM_wflow.sh

until all tasks have completed successfully. You can also look at the end of the $EXPTDIR/log.launch_FV3LAM_wflow file to see the status of the workflow. When the workflow is complete, you no longer need to issue the ./launch_FV3LAM_wflow.sh command.

To run the workflow via crontab:

For automatic resubmission of the workflow (e.g., every 3 minutes), open the user’s crontab for editing:

crontab -e

and insert the following line:

*/3 * * * * cd /glade/scratch/$USER/expt_dirs/test_CONUS_25km_GFSv15p2 && ./launch_FV3LAM_wflow.sh

NOTE: If you don’t have access to crontab, you can run the workflow manually as described above.
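The crontab line can also be generated rather than typed by hand, which helps when the resubmission interval or the experiment directory changes. The following is a minimal sketch; cron_entry is a hypothetical helper name, not part of the workflow scripts, and the commented usage assumes a cron implementation that accepts a crontab on standard input via "crontab -".

```shell
#!/bin/sh
# cron_entry (hypothetical helper): print a crontab line that re-runs the
# launch script every N minutes from a given experiment directory.
#   $1 = interval in minutes
#   $2 = absolute path to the experiment directory
cron_entry() {
  minutes=$1
  expt_dir=$2
  printf '*/%s * * * * cd %s && ./launch_FV3LAM_wflow.sh\n' "$minutes" "$expt_dir"
}

# Possible usage (assumption: this cron accepts "crontab -"):
# append the generated line to the existing crontab.
#   ( crontab -l 2>/dev/null; cron_entry 3 "$EXPTDIR" ) | crontab -
```

Passing the expanded value of $EXPTDIR writes an absolute path into the crontab, so the entry does not depend on variables that cron may not define.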

The state of the workflow can be monitored using the rocotostat command, from the $EXPTDIR directory:

rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10

The workflow run is complete when all tasks have reached the SUCCEEDED state, at which point the rocotostat command will produce output similar to the following:

CYCLE          TASK             JOBID     STATE       EXIT STATUS   TRIES   DURATION
====================================================================================
201906150000   make_grid        4953154   SUCCEEDED   0             1       5.0
201906150000   make_orog        4953176   SUCCEEDED   0             1       26.0
201906150000   make_sfc_climo   4953179   SUCCEEDED   0             1       33.0
201906150000   get_extrn_ics    4953155   SUCCEEDED   0             1       2.0
201906150000   get_extrn_lbcs   4953156   SUCCEEDED   0             1       2.0
201906150000   make_ics         4953184   SUCCEEDED   0             1       16.0
201906150000   make_lbcs        4953185   SUCCEEDED   0             1       71.0
201906150000   run_fcst         4953196   SUCCEEDED   0             1       1035.0
201906150000   run_post_f000    4953244   SUCCEEDED   0             1       5.0
201906150000   run_post_f001    4953245   SUCCEEDED   0             1       4.0
...
201906150000   run_post_f048    4953381   SUCCEEDED   0             1       5.0
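Checking a long task list by eye is error-prone, so the completion test can be scripted. The sketch below reads rocotostat output on standard input and reports one overall state. It assumes the column layout shown above (cycle in column 1, state in column 4); wflow_state and the three state labels it prints are hypothetical names chosen for illustration, not part of Rocoto.

```shell
#!/bin/sh
# wflow_state (hypothetical helper): summarize rocotostat output from stdin.
# Prints one of: COMPLETE, DEAD, IN_PROGRESS.
wflow_state() {
  awk '
    # Data rows begin with a numeric cycle; this skips the header,
    # the separator rule, and any "..." lines.
    $1 ~ /^[0-9]+$/ && NF >= 4 {
      total++
      if ($4 == "SUCCEEDED") ok++
      if ($4 == "DEAD") dead++
    }
    END {
      if (dead > 0) print "DEAD"
      else if (total > 0 && ok == total) print "COMPLETE"
      else print "IN_PROGRESS"
    }
  '
}

# Possible usage, from the experiment directory:
#   rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10 | wflow_state
```

A DEAD task takes priority over everything else in this sketch, because the workflow cannot finish until that task has been fixed and rewound.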

If something goes wrong with a workflow task, it may end up in the DEAD state:

CYCLE          TASK             JOBID     STATE       EXIT STATUS   TRIES   DURATION
====================================================================================
201906150000   make_grid        20754069  DEAD        256           1       11.0

This means that the DEAD task has not completed successfully, so the workflow has stopped. In this case, go to the $EXPTDIR/log directory and look at the make_grid.log file to identify the issue. Once the issue has been fixed, the failed task can be re-run using the rocotorewind command:

cd $EXPTDIR
rocotorewind -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10 -c 201906150000 -t make_grid

where -c specifies the cycle date (first column of rocotostat output) and -t specifies the task name (second column of rocotostat output). After rocotorewind has been run, the job will be resubmitted the next time rocotorun is used to advance the workflow.
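When several tasks are DEAD, the matching rocotorewind commands can be derived from the rocotostat output instead of being typed one by one. The sketch below only prints the commands so they can be reviewed before anything is rewound. It assumes the column layout shown above and the workflow file names used in this section; rewind_cmds is a hypothetical helper name, not part of Rocoto.

```shell
#!/bin/sh
# rewind_cmds (hypothetical helper): read rocotostat output from stdin and
# print one rocotorewind command per task whose state (column 4) is DEAD.
# Nothing is executed; the commands are only written to stdout.
rewind_cmds() {
  awk '
    $1 ~ /^[0-9]+$/ && $4 == "DEAD" {
      printf "rocotorewind -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10 -c %s -t %s\n", $1, $2
    }
  '
}

# Possible usage, from the experiment directory, after fixing the cause:
#   rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10 | rewind_cmds
```

Printing rather than executing is deliberate: a task should be rewound only after the problem reported in its log file has been fixed.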