Ensemble DA tasks ('enshx') Failing on Hera for HWRF-B

Submitted by ghassan.alaka on Thu, 11/05/2020 - 10:06

Dear HWRF Developer's Forum,

I am running experiments with the Basin-scale HWRF (HWRF-B) on Hera. Unfortunately, the ensemble DA is failing. Specifically, the "enshx_DA_0[01-40]" tasks all fail when running the hwrf_gsi executable. The failure happens within 30 seconds of the start time and may be associated with the message "inconsistent ndat,npe" (see below). The most confusing aspect of this issue is that my colleague, Sarah Ditchek, can run the enshx tasks for the same forecast cycle successfully. I have tried rerunning the enshx tasks after copying her executables (no luck) and after copying her ProdGSI source code (no luck). Apart from user-specific paths, I do not see any notable differences between my configuration (parm) and Sarah's. Also, I was able to run these tasks successfully a few months ago. That leads me to believe there is some issue in my environment. I confirmed that Sarah and I are running with the same modules, but I can't guarantee that our environments are identical. I am stumped... any ideas?

The rocoto log file: /scratch2/AOML/aoml-hafs1/Ghassan.Alaka/pytmp/HB20_dorian_TEST/2019082418/00L/hwrf_init_hx001_storm2.log

The stdout file: /scratch2/AOML/aoml-hafs1/Ghassan.Alaka/pytmp/HB20_dorian_TEST/2019082418/05L/ensda/001/gsihx/stdout

The error:
 ****STOP2****  ABORTING EXECUTION w/code=         330
  observer_set: inconsistent ndat,npe           92         160  /=           92
         200
 ****STOP2****  ABORTING EXECUTION w/code=         330
 

Thank you very much,

Gus Alaka

ghassan.alaka@noaa.gov

Quick update here. Thankfully, this issue was straightforward to resolve once it was better understood. The error was caused by a mismatch between the number of processors used for the meanhx task (200) and for the enshx tasks (160); these are the two npe values reported in the "inconsistent ndat,npe" message above. This quote from Henry Winterbottom sums it up nicely:
 

"The number of processors used to compute the forward operator H(x) for the respective ensemble members must match the number of cores used to compute the forward operator for the ensemble mean. This is for several reasons, but these are likely the two most important:
  • The ensemble mean makes the decision regarding which observations are to be assimilated via QC and thinning; the number of observations for each ensemble member must be identical
  • I am not certain whether HWRF is yet using the netCDF4 GSI diag files, but if you recall, when GSI runs it dumps a bunch of pe* files that are concatenated together for each observation type; there is meta data in these files regarding observation counts (among other things); the ensemble members require these to match."
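
For anyone who hits the same abort, here is a minimal Python sketch for reading the two processor counts out of the stdout text. It is not part of HWRF or GSI; the function name parse_ndat_npe_error and the SAMPLE string are hypothetical, and the parsing assumes the message always has the same shape as the one quoted in this thread (two ndat,npe pairs separated by "/=", possibly wrapped across lines). In this case the first pair carried the enshx count (160) and the second the meanhx count (200), but I have not checked whether GSI always prints them in that order.

```python
import re

# Error text copied from the gsihx stdout quoted in this thread.
SAMPLE = """ ****STOP2****  ABORTING EXECUTION w/code=         330
  observer_set: inconsistent ndat,npe           92         160  /=           92
         200
 ****STOP2****  ABORTING EXECUTION w/code=         330
"""


def parse_ndat_npe_error(stdout_text):
    """Return (ndat_1, npe_1, ndat_2, npe_2) from the abort message, or None.

    Assumes the message layout shown in SAMPLE; other GSI versions may differ.
    """
    marker = "inconsistent ndat,npe"
    start = stdout_text.find(marker)
    if start == -1:
        return None
    tail = stdout_text[start + len(marker):]
    # Ignore anything from the next abort banner onward (e.g. the exit code).
    end = tail.find("****STOP2****")
    if end != -1:
        tail = tail[:end]
    # The values may wrap onto a second line, so collect integers from the whole tail.
    numbers = [int(n) for n in re.findall(r"\d+", tail)]
    if len(numbers) < 4:
        return None
    return tuple(numbers[:4])


parsed = parse_ndat_npe_error(SAMPLE)
if parsed is not None:
    ndat_1, npe_1, ndat_2, npe_2 = parsed
    if npe_1 != npe_2:
        print(f"npe mismatch: {npe_1} vs {npe_2}")
```

If the two npe values differ, the task resource settings for enshx and meanhx are the first thing to compare.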

So, make sure the number of processors for the enshx tasks matches the number used for the meanhx task; otherwise, the enshx tasks will fail. Let me know if there are more questions.

 

Best,

Gus