Slow file I/O caused false cold start of HWRF-B

Submitted by lew.gramer on Mon, 10/19/2020 - 13:09

We encountered the following issue while running our Real-time HWRF-B experiment in "fallback mode" on Hera:

The 06Z cycle warm-started at 09:50 UTC; the launch log is here: `/scratch2/AOML/aoml-hafs1/Lew.Gramer/pytmp/HB20_Fallback/2020101906/00L/hwrf_launch.log`

The TCV file `/scratch1/NCEPDEV/hwrf/noscrub/input/SYNDAT-PLUS/syndat_tcvitals.2020` was apparently updated just before this cycle started with only "94L" records, and then updated again some time afterward with additional "27L" records. This is only a guess. However, the symptom we saw was that the 06Z cycle was initialized with storm "94L" rather than "27L", even though "27L" was present in the TCV when we examined it later. As a result, when the 12Z cycle started, it looked for a "27L" storm in the 06Z $COM files from which to self-cycle. Failing to find one, it incorrectly cold-started 27L.

The workaround was to start another experiment "HB20_Debug" on Hera, which reran the 06Z cycle (using the 00Z $COM output from the HB20_Fallback 00Z cycle to warm start) with the appropriate storm ID of "27L".
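A check along the following lines, run just before launching a cycle, might help catch this kind of mismatch early. It is only a rough sketch: it assumes each TCV record is whitespace-separated with the storm ID in the second field, the date (YYYYMMDD) in the fourth, and the time (HHMM) in the fifth. The function name `storm_ids_for_cycle` and the sample records are hypothetical and abbreviated, not taken from the real file.

```python
from typing import Iterable, Set


def storm_ids_for_cycle(lines: Iterable[str], cycle: str) -> Set[str]:
    """Return the storm IDs that have a record for the given cycle (YYYYMMDDHH).

    Assumes whitespace-separated records laid out as:
    center, storm ID, storm name, YYYYMMDD, HHMM, ...
    """
    ids = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated records
        storm_id, ymd, hhmm = fields[1], fields[3], fields[4]
        if ymd + hhmm[:2] == cycle:
            ids.add(storm_id)
    return ids


# Hypothetical, abbreviated records for illustration only.
sample = [
    "NHC 94L NAMEA 20201019 0000",
    "NHC 94L NAMEA 20201019 0600",
    "NHC 27L NAMEB 20201019 0600",
]

expected = "27L"
found = storm_ids_for_cycle(sample, "2020101906")
if expected not in found:
    print(f"Warning: {expected} not in TCV for this cycle; found {sorted(found)}")
```

In practice the lines would come from reading the `syndat_tcvitals` file itself, and the launch could be delayed or flagged whenever the expected storm ID is missing.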

PS: I have made a copy of the original (errant) launch log file from the first run of 06Z here: `~Lew.Gramer/log/HB20_Fallback/2020101912/00L/hwrf_launch.log`

 

Lew

 

PPS: Is there a way to "CC" others on our forum posts? I would like Gus to be notified of the creation of this ticket and of future updates, but I do not see a way to do that on the current Web form.

 

Lew, thanks for reporting this issue. Occasionally, operational centers will make late-breaking changes to the TCV. EMC controls how frequently the TCV file is updated on Hera and Jet. I am not sure how much latency there is, but it could have been a factor here.

I do not think that the workflow, by default, is set up to handle the situation that you described, but the relevant routine is `./ush/hwrf/revital.py`. You could possibly set the default `search_dt` to a larger number (12 hours instead of 6 hours?) to try to account for this problem, although we haven't tested this. However, it may not be worth it, as this situation is likely to be quite rare; I never saw it happen while I was running my realtime experiment last year.
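To illustrate the idea of widening the search window, here is a minimal sketch. It does not reproduce the logic in `revital.py`; the function `latest_prior_cycle` and its backward-looking window are assumptions made purely for illustration, with `search_dt_hours` standing in for the `search_dt` setting.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def latest_prior_cycle(
    available: Iterable[datetime],
    cycle: datetime,
    search_dt_hours: float = 6.0,
) -> Optional[datetime]:
    """Return the most recent earlier cycle that has the storm, or None.

    Only cycles within search_dt_hours before `cycle` are considered
    (an assumed, backward-looking window).
    """
    window_start = cycle - timedelta(hours=search_dt_hours)
    candidates = [t for t in available if window_start <= t < cycle]
    return max(candidates) if candidates else None


# Hypothetical case: the storm exists in the 00Z output but not in 06Z.
cycles_with_storm = [datetime(2020, 10, 19, 0)]
cycle_12z = datetime(2020, 10, 19, 12)

narrow = latest_prior_cycle(cycles_with_storm, cycle_12z, 6)   # None
wide = latest_prior_cycle(cycles_with_storm, cycle_12z, 12)    # the 00Z cycle
```

With a 6-hour window the 12Z cycle finds nothing to cycle from, while a 12-hour window reaches back to 00Z. Whether cycling from a 12-hour-old forecast is acceptable is a separate question that this sketch does not address.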

In terms of being able to copy others on forum threads, I don't see a way to do it. I will ask around to see if there is a way.