HWRF_for_V16 Branch -- Running with research mode FV3GFS v16 resolution

Submitted by peter.marinescu on Tue, 03/30/2021 - 10:44
Forum: Users | General

Hello, 

I am working with the HWRF_for_V16 branch of HWRF. I am trying to force HWRF with data from FV3GFS (v16) that is being run in research mode by someone else in our group. This research-mode configuration has half the number of grid points in both the x and y directions compared to the operational model (1536x768 versus 3072x1536). I am running this for Hurricane Dorian in 2019.

My question is: does something need to be changed in the HWRF_for_V16 branch code to account for the different grid spacing in this GFS/GDAS model output?

Ultimately, in my test runs for Hurricane Dorian in 2019, the init_gfs_0_E99 and init_GDAS_[3,6,9]_E99 tasks fail when I try to use the initial and boundary condition files (gfs.*.sf00, gdas.*.sf03, gdas.*.sf06, gdas.*.sf09) from the research-mode FV3GFS-v16. I have run this same code successfully with operational data from HPSS for a 2020 case. I have tried to debug this myself and have narrowed the point of failure to the file sorc/hwrf-utilities/tools/hwrf_prep_hybrid/COF2GRD.f.
 
In the prep.log file produced by the prep_hybrid task, when the surface pressure values across the domain are printed, a bunch of zeros appear after the first several rows (see attached file). These zeros are not present when running this version of HWRF with operational-resolution FV3GFS-v16 data for a 2020 case.
 
Furthermore, the code seems to fail within the loop that initializes condensate mass (CWMGRID) from the various hydrometeor species. After further investigation, there is a jb variable that is used to calculate the jj index, but this jb value ends up being a negative number (-394), which may be related to the difference in grid spacing in the initialization dataset. I traced this "-394" value for jb back to the following if statement in COF2GRD.f:
 
        if(jc+int(jmax/2) .gt. jnemsio) then
          jb=jnemsio-jmax
        endif
 
where I believe jc is 299, jmax is 1152, and jnemsio (nyg) is 756, so jb = jnemsio - jmax comes out negative whenever jmax exceeds jnemsio.
 
Therefore, maybe jmax needs to be changed? It seems that jmax is defined in hwrf_swath/create_sw2.f90, and the reslatlon variable can be changed in that file.
 
This is as far as I have gotten in my own debugging, and I would like some assistance in fixing this problem. Would you be able to help me with this?
 
Thank you for your time, Peter
prep_log_example.pdf

Hi Peter,

I would be happy to help you troubleshoot this.

My first question is whether you happen to be running this code on Jet, Hera, or Orion. If so, would you mind sharing the paths to your code and output directories with me? It will help me troubleshoot.

Without looking at anything other than what you included in your message (which provided a lot of very helpful information – thank you!), my inclination is to say that the imax and jmax values in the [prep_hybrid] section of ./parm/hwrf.conf should be approximately halved, since your input data are half the resolution of the operational GFS. imax and jmax describe the resolution of an intermediate grid that your GFS data get interpolated to, and if that intermediate grid is too close to (or exceeds) the resolution of the GFS data, I believe it could trigger the behavior that you reported.

So instead of imax=1440 and jmax=721 (those are the defaults), I would try imax=720 and jmax=360. If you want to repeat the entire run from the beginning, you can make this change in ./parm/hwrf.conf, but if you only want to repeat the init jobs, the change should also be made in your output directory in com/YYYYMMDDHH/SID/storm1.conf, since that file is what all of the jobs will read from.

Please let me know if you have any questions, and let me know how this test goes (assuming you haven't already tried this).

Thanks,

evan


Hi again Peter,

 

Instead of the change that I described above, please make the following change in ./ush/hwrf/prep.py and repeat the init jobs:

 

            if nemsio or netcdf:
                if self.taskname[0:5] == 'ensda':
                    imax=960
                    jmax=480
                else:
                    imax=1024  # was 2048
                    jmax=576   # was 1152

 

The first block for 'ensda' can remain the same. The prep.py script is where prep_hybrid gets the values for imax and jmax, rather than from ./parm/hwrf.conf as my previous message suggested.
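If it helps to locate that block in your checkout, a quick search like this should show it (line numbers may differ in your copy):

grep -n -A 6 "if nemsio or netcdf" ./ush/hwrf/prep.py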

Thanks,

evan


Hello Evan, 

Thank you for the response. 

1) I am running this code on Hera. The path to the code is saved here: /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/H_V16/

Do you need me to change the read/write access or do you have some sort of administrator access?

2) I have included the change in ./ush/hwrf/prep.py that you mentioned above, and the code successfully ran past the input tasks all the way to the output task. However, the code failed at the output_E99 task. It seems to be related to the following file: /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/scrub/05L_v16testdtc/2019082406/05L/intercom/fgat.t201908240600/wrfanl/wrfanl_d02_2019-08-24_06_00_00.

Output.log file: /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/scrub/05L_v16testdtc/2019082406/05L/hwrf_output.log

rocotostat command: rocotostat -w /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/H_V16/rocoto/hwrf-05L_v16testdtc-05L-2019082406.xml -d /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/H_V16/rocoto/hwrf-05L_v16testdtc-05L-2019082406.db

3) The code is also failing for the ensDA_DAXXX tasks in the following cycle (2019082412); based on the log files, this appears to be related to a NETCDF/HDF5 error.

hwrf_init_DAXXX.log file: /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/scrub/05L_v16testdtc/2019082412/05/hwrf_init_DA001.log

The V16 files I am using to initialize HWRF are also saved here for reference: /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/stageddata/hwrfdata_PROD2020/*201908*

4) When looking through the ush/hwrf/prep.py code, I noticed the following block of code:

            if self.taskname[0:5] == 'ensda':
                nemsio=False
                netcdf=False
                if int(self.confstr('GFSVER').strip('PROD')) >= 2017:
                    nemsio=True
                    netcdf=False
                if int(self.confstr('GFSVER').strip('PROD')) >= 2020:
                    nemsio=False
                    netcdf=True

After seeing this, I realized I should tell you that I am running storms from the 2019 season, so I am not sure whether this or other areas of the code also need to be changed.

Thanks for your help thus far Evan and hope you are doing well,

Peter

 

 

 

Hi Peter,

1) I'm able to see your files. Thanks for providing the paths!

2) Based on the errors in the log, it looks like the output task is running out of memory. In /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/H_V16/rocoto/sites/hera.ent, please try changing the value of OUTPUT_MEMORY from 4G to 8G and rewinding/rerunning the output task.
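For reference, rewinding and rerunning just that task from the rocoto directory would look something like this (a sketch based on the workflow and database names you shared; double-check the exact task name and cycle with rocotostat first):

cd /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/H_V16/rocoto
rocotorewind -w hwrf-05L_v16testdtc-05L-2019082406.xml -d hwrf-05L_v16testdtc-05L-2019082406.db -c 201908240600 -t output_E99
rocotorun -w hwrf-05L_v16testdtc-05L-2019082406.xml -d hwrf-05L_v16testdtc-05L-2019082406.db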

3) Hmm, I am not sure about this. Were the files in the enkf.*/ directories in /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/stageddata/hwrfdata_PROD2020/ also produced by someone in your group? I.e., they aren't from the GFSv16 retros made by EMC?

4) You shouldn't need to change this. Those if statements refer to the version of GFS that you're using for the run. V16=PROD2020, with GFS data in netcdf format. So you are OK here.

Hope you're doing well!

Cheers,

evan

Regarding #3,

Maybe this is the same problem that you reported for the deterministic forecast. I didn't realize that you also were going to run the HWRF ensemble.

Your enkf files have dimension im=768, jm=384, whereas the enkf files from the official retros have im=1536, jm=768. So a 50% reduction, just like for the GFS deterministic forecast.

Therefore, I think you should try (in prep.py):

            if nemsio or netcdf:
                if self.taskname[0:5] == 'ensda':
                    imax=480   # was 960
                    jmax=240   # was 480
                else:
                    imax=1024  # was 2048
                    jmax=576   # was 1152

After making that change, would you please try repeating one of the ensda jobs and seeing if it works (or at least gets further along)?


Hello Evan, 

Re #2) Increasing the memory from 4G to 8G did seem to work, and the output tasks did successfully complete, so thank you very much for the suggestion!

More on #3 in the following comment.


Hello Evan, 

Re #3) Regarding the enkf.*/ directories in /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/stageddata/hwrfdata_PROD2020/: the 2019 data files I am working with for these tests are from a run produced by someone in our research group. The 2020 data files in this directory were pulled from HPSS (I believe these are the EMC retros you mentioned); I had run a 2020 test case with this code and these data, which seemed to work fine, which is why they are there. However, our project is for 2019 hurricane cases, for which I do not believe EMC v16 retros are available... are there?

I have also tried the change that you mentioned above (reducing imax/jmax by half in the prep.py code). However, the code continued to fail. Looking at the log files for various members (/scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/scrub/05L_v16testdtc/2019082412/05L/hwrf_init_DAXXX.log), it always seems to fail with a NETCDF error (-101 NetCDF: HDF error, 99). Digging into the code more, it seems to fail in the hwrf_calc_analysis executable, which I traced back to sorc/ProdGSI/util/netcdf_io/calc_analysis.fd/inc2anl.f90.

In this file, the code loops through the different variables, and interestingly, it fails at a different variable in the loop for different members (i.e., sometimes failing after copying dzdt to the anl.06 file, sometimes after delz, etc.), which makes me think it is unlikely that a specific variable is causing the issue. Furthermore, some log files (e.g., /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/scrub/05L_v16testdtc/2019082412/05L/hwrf_init_DA014.log) contain an additional error message, namely "*** Error in `/scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/H_V16/exec/hwrf_calc_analysis': corrupted size vs. prev_size: 0x000000008ab63580 ***". When I googled this "corrupted size vs. prev_size" message, many forums suggested that it indicates a memory allocation problem. Also, in the DA014 log file, the error message is followed by a backtrace and memory map, which is perhaps consistent with the hypothesis that this is a memory allocation problem.

I was also peeking around at the different files in each ensemble member's prep folder (ensda/XXX/prep/2019082412), and I noticed that the geogrid.out file has east/west and north/south dimensions of 414 and 779, respectively, which is different from what I am inputting. Could this be related to the issue? Or do the dimensions in hwrf_na12_mkbnd.parm need to be changed?

What do you think? 

Thank you so much for your help thus far!

- Peter

 


Hi Peter,

On line 423 of ush/hwrf/prep.py, there is this line:

                cmd = bigexe(self.getexe('calc_analysis'))

Can you replace that line with these two lines:

                prog = self.getexe('calc_analysis')
                cmd = produtil.run.mpirun(produtil.run.mpi(prog), allranks=False)

This hasn't been required in the past on Hera, but it has been needed on other machines, so I am wondering if it will help in this situation. calc_analysis is an MPI program, so the workflow should submit it using the mpirun method rather than with bigexe.

And at the top of the script, on this line:

from produtil.run import alias, bigexe, checkrun, openmp

Add these two methods: mpirun, mpi

So the new line should look like:

from produtil.run import alias, bigexe, checkrun, openmp, mpirun, mpi


Hello Evan, 

Thank you again for the response! I tried making these changes in /ush/hwrf/prep.py, but it did not change the outcome, and the jobs continued to fail. 

After looking at the log files for all 40 members, all of them produce a "-101 NetCDF: HDF error 99" error, but 4 of the members produce additional error statements. These additional error messages are:

1) *** Error in `/scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/H_V16/exec/hwrf_calc_analysis': corrupted size vs. prev_size: 0x0000000089e7f130 *** for DA004 and DA014

2) forrtl: severe (174): SIGSEGV, segmentation fault occurred for DA012

3) *** Error in `/scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/H_V16/exec/hwrf_calc_analysis': double free or corruption (out): 0x000000008bce4d70 *** for DA029

 

Based on my limited knowledge of the topic, I believe these all suggest a memory issue, as I mentioned before. I wonder whether, since the file sizes and domains are smaller, fewer processors are needed to complete the task if it is being run in parallel?

Do you have any other suggestions for code that can be changed and tested, or files that I can look at, to help me diagnose what is causing these errors? For example, would it be useful to add some print statements in sorc/ProdGSI/util/netcdf_io/calc_analysis.fd/inc2anl.f90, where I believe the code is failing?

 

Also, for the following cycle (2019082418), the bdy_E99 task fails as well; it seems to happen within the metgridfcst process. Only the last few processes seem to abort:

application called MPI_Abort(MPI_COMM_WORLD, 43532272) - process 21
application called MPI_Abort(MPI_COMM_WORLD, 51985792) - process 22
application called MPI_Abort(MPI_COMM_WORLD, 54672768) - process 23
application called MPI_Abort(MPI_COMM_WORLD, 52710784) - process 20

 

I am not sure this is related to the above errors with the initialization of the DA members in the prior cycle (2019082412), but I thought I should mention it, in case it is useful.

 

Thanks again for your help and time,

Peter

Hi Peter,

I agree that it is a memory problem, but I am beginning to suspect that it could be getting triggered by an environment issue. Is your stack size set to unlimited ("limit stacksize unlimited" in c shell)?

You could try reducing the number of cores allocated to the ensemble DA jobs by changing ENSDA_CORES from 72 to 48 in ./rocoto/sites/defaults.ent, but I am not confident that this is the problem.
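For reference, the corresponding checks in bash would be something like this (a quick sketch; "limit stacksize unlimited" is the csh form):

ulimit -s              # show the current stack size limit
ulimit -s unlimited    # try raising it for the current session
grep -n ENSDA_CORES ./rocoto/sites/defaults.ent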

I will keep looking at your logs, and we can certainly dig through the code more, but the fact that the ensemble member jobs do not all fail in the same place makes me think there is something more basic going wrong here.

Hi again Peter,

I assume that this problem remains unresolved. I was able to reproduce it on my side and found a solution that I hope you are willing to try.

Your input enkf files are in HDF5, rather than NETCDF format, and the hwrf_calc_analysis program is not working well with the HDF5 format.

Would you please try converting these two files to netcdf format, and then rerunning member 1? Please do not overwrite the original files, just in case they are needed later.

/scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/stageddata/hwrfdata_PROD2020/enkf.2019082406/sfg_2019082406_fhr06s_mem001

/scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/stageddata/hwrfdata_PROD2020/enkf.2019082412/siganl_2019082412_mem001

You can convert them in the following way:

module load nco/4.9.1

ncks -6 -O some_hdf5_file new_file_in_netcdf_format

If, when you rerun member 1 with the netcdf files, the task completes successfully, I think we have this figured out.

In my test, this did work for me.
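If member 1 does complete, a rough, untested sketch for converting the rest of the members could look like the loop below. The OLD/ subdirectory for the originals and the sfg_*/siganl_* file patterns are assumptions based on the two example files above, so adjust them to match your layout:

# Move the HDF5 originals into OLD/ and write classic (64-bit offset) netCDF copies
# back under the original names, so the workflow picks them up unchanged.
module load nco/4.9.1
base=/scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/stageddata/hwrfdata_PROD2020
for dir in "$base"/enkf.2019082406 "$base"/enkf.2019082412; do
  mkdir -p "$dir/OLD"
  for f in "$dir"/sfg_* "$dir"/siganl_*; do
    [ -e "$f" ] || continue
    mv "$f" "$dir/OLD/"
    ncks -6 -O "$dir/OLD/$(basename "$f")" "$f"
  done
done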

Thanks for your patience on this.

Cheers,

evan


Hello Evan, 

You are correct in assuming the problem has not been resolved.

I had tried changing the stack size limit from its initial setting of 300000 to 10000000, which seemed to be the maximum I was allowed to set on Hera; ulimit -s unlimited was not permitted. I also tried halving the number of processors assigned to those tasks. Both of these actions resulted in the same errors.

I will try converting the files from HDF5 to netCDF, rerun member 001 as you suggest, and report back.

Thanks for your continued work on this!

Best,

Peter

Hi Peter,

I hope you don't mind that I checked to see if your run worked ;-)

I see it got further, but failed later on when the next executable (hwrf_prep) ran.

I think this is a more obvious fix, and when I tested it in my sandbox it worked.

The cause of the problem is that your enkf files have a dimension list of (time, pfull, grid_yt, grid_xt, phalf). But the order of the dimension list in the files produced by the operational GDAS is (grid_xt, grid_yt, pfull, phalf, time).

I am sure there is a way to use the NCO tools to orient the dimension list correctly in your enkf files, but alas, I could not figure it out. Instead, I modified ./sorc/hwrf-utilities/tools/hwrf_prep_hybrid/read_gm_v16_nc.f90 to remove the assumption that the dimension list is given in a certain order. You can compare your version of the file to my edited one, which is here on Hera: /scratch1/BMC/dtc-hwrf/Evan.Kalina/hwrf-trunk-20210126/sorc/hwrf-utilities/tools/hwrf_prep_hybrid/read_gm_v16_nc.f90

I see you have some custom changes to this file, so I am sure you will want to port my changes into your version, which should not be too tough. For the benefit of this forum, my changes are:

@@ -18,7 +18,7 @@
     INTEGER, INTENT(OUT) :: nx, ny, nzf, nzh
     CHARACTER(LEN = 50) :: dname
     INTEGER :: ncid,dimlen
-    INTEGER :: ndims, nvars, ngatts, unlimdimid
+    INTEGER :: ndims, nvars, ngatts, unlimdimid, x_dimid, y_dimid, zf_dimid, zh_dimid
     LOGICAL :: file_exists

     INQUIRE(FILE=TRIM(filename), EXIST=file_exists)
@@ -33,10 +33,14 @@
       PRINT*, 'ndims < 4', ' ndims = ',ndims
       STOP
     ENDIF
-    CALL check(NF90_INQUIRE_DIMENSION(ncid,1,dname,nx),'getdim:inquire_dimension 1')
-    CALL check(NF90_INQUIRE_DIMENSION(ncid,2,dname,ny),'getdim:inquire_dimension 2')
-    CALL check(NF90_INQUIRE_DIMENSION(ncid,3,dname,nzf),'getdim:inquire_dimension 3')
-    CALL check(NF90_INQUIRE_DIMENSION(ncid,4,dname,nzh),'getdim:inquire_dimension 4')
+    CALL check(NF90_INQ_DIMID(ncid,'grid_xt',x_dimid),'getdimid: x')
+    CALL check(NF90_INQ_DIMID(ncid,'grid_yt',y_dimid),'getdimid: y')
+    CALL check(NF90_INQ_DIMID(ncid,'pfull',zf_dimid),'getdimid: zf')
+    CALL check(NF90_INQ_DIMID(ncid,'phalf',zh_dimid),'getdimid: zh')
+    CALL check(NF90_INQUIRE_DIMENSION(ncid,x_dimid,dname,nx),'getdim:inquire_dimension x')
+    CALL check(NF90_INQUIRE_DIMENSION(ncid,y_dimid,dname,ny),'getdim:inquire_dimension y')
+    CALL check(NF90_INQUIRE_DIMENSION(ncid,zf_dimid,dname,nzf),'getdim:inquire_dimension zf')
+    CALL check(NF90_INQUIRE_DIMENSION(ncid,zh_dimid,dname,nzh),'getdim:inquire_dimension zh')
     CALL check(NF90_CLOSE(ncid),'getdim:close')
   END SUBROUTINE get_dims

Please let me know how this goes. I hope we are close to getting the whole system working for you.


Hello Evan, 

So I recompiled HWRF with the updated Fortran change and ran a new test experiment (subexpt name: v16testdtc_recomp).

rocotostat -w /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/H_V16/rocoto/hwrf-05L_v16testdtc_recomp-05L-2019082406.xml -d /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/H_V16/rocoto/hwrf-05L_v16testdtc_recomp-05L-2019082406.db

As you suggested, and as you can see with the above command, ensDA_001 now succeeds with the change in file format.

I will begin to convert the other files in this test run, and confirm that it is all going smoothly.

Thanks again for all of your help!

- Peter

 

 

Peter, that is great news. Glad to hear it.

By the way, I noticed that your bdy task for the next cycle (2019082418) failed again, as you reported previously. I searched around, and found this error in ./2019082418/05L/gfsinit/wps/metgrid.log: "ERROR: The mandatory field SKINTEMP was not found in any input data."

The last status message printed to the log is "Processing 2019-08-26_03," which is forecast hour 33. The next forecast hour (36) is where the problem lies.

The files that metgrid is processing are the GFS grib2 files, which are stored in ./stageddata/hwrfdata_PROD2020/gfs.2019082418. If you run wgrib2 on the file from forecast hour 36, it indicates the file is corrupt:

wgrib2 gfs.t18z.pgrb2.0p25.f036

*** FATAL ERROR: rd_grib2_msg_seq_file, missing end section ('7777') ***

The other files do not trigger this error (at least not the ones for forecast hour 33 or 39).
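If you want to check the rest of the files in that directory at once, a quick, untested scan like this should flag any other bad ones (assuming wgrib2 exits with a nonzero status when it hits a fatal error):

cd /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/stageddata/hwrfdata_PROD2020/gfs.2019082418
for f in gfs.t18z.pgrb2.0p25.f*; do
  # wgrib2 just inventories the file; a failure here suggests corruption
  wgrib2 "$f" > /dev/null 2>&1 || echo "possible corruption: $f"
done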

The cause of this file corruption is very likely an I/O error when the file was originally being written to disk. Since this was a GFS run made by someone in your group, I realize it will probably be extra painful to recreate the file. Unfortunately, I am not sure that there are any other options, unless the original version of this file is stored somewhere else and it is OK.

 


Hello Evan,

Apologies for the delay here, but I had a few presentation deadlines and then slow queue times. I tried this on a different storm (without the corrupt GFS grib files) and everything seemed to work smoothly! Thank you for all of your help.

Out of curiosity, how were you able to determine that my input enkf files were in HDF5 rather than netCDF format? For example, when I run ncdump -h on one of the old files (ncdump -h /scratch1/BMC/qosap/Peter.Marinescu/HWRF_AeoTest/stageddata/hwrfdata_PROD2020/enkf.2019082406/OLD/siganl_2019082406_mem001), it produces the output below, and the first word, "netcdf", seems to imply that it is a netcdf file.

So I am just curious what command you used to figure it out, in case I run into this again at a later date.

 

netcdf siganl_2019082406_mem001 {
  dimensions:
    ilev = 128 ;
    lat = 384 ;
    lev = 127 ;
    lon = 768 ;
  variables:
    float T_inc(lev,lat,lon) ;
    float delp_inc(lev,lat,lon) ;
    float delz_inc(lev,lat,lon) ;
    float hyai(ilev) ;
    float hybi(ilev) ;
    float icmr_inc(lev,lat,lon) ;
    float ilev(ilev) ;
    float lat(lat) ;
    float lev(lev) ;
    float liq_wat_inc(lev,lat,lon) ;
    float lon(lon) ;
    float o3mr_inc(lev,lat,lon) ;
    float pfull(lev) ;
    float sphum_inc(lev,lat,lon) ;
    float u_inc(lev,lat,lon) ;
    float v_inc(lev,lat,lon) ;

  // global attributes:
    :source = "GSI EnKF" ;
    :analysis_time = 2019082406 ;
    :IAU_hour_from_guess = 6 ;
    :comment = "recentered analysis increment using recentersigp" ;
} // group /

 

Thank you again for all of your help! These messages back and forth were extremely helpful, and we can now move our research project further along.

Best,

Peter

 

Hi Peter,

You're welcome! I'm glad to hear that it's working now.

The magic command is: more name_of_your_file

The beginning of the output will either say HDF or CDF. This is the only way I know of to check. Otherwise, the files seem identical, as you pointed out.
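If you ever want to see it more explicitly: classic netCDF files begin with the bytes "CDF", while netCDF-4/HDF5 files begin with the byte 0x89 followed by "HDF", so dumping the first few bytes works too:

head -c 4 siganl_2019082406_mem001 | od -c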

Cheers,

evan
