Reproducibility of HWRF-B on Orion

Submitted by sarah.d.ditchek on Mon, 06/07/2021 - 11:33
Forum: Developers | HPC

I'm running into reproducibility issues on Orion when using HWRF-B. I've tested this issue by running 1 cycle twice (run "a" and run "b") under multiple configurations. The cycle I'm using is 2018083006.

  1. Only allowing 1 storm in the parent domain produces identical atcf files between runs "a" and "b". I tested this with just 90L (Invest) and then just 15E (Miriam). 
  2. Only allowing 2 storms in the parent domain produces identical fhr=0 files between runs "a" and "b", but different fhr>0. I tested this with just 90L (Invest) and 15E (Miriam).
  3. Allowing all triggered storms in the parent domain produces identical fhr=0 files between runs "a" and "b" for 2 of the 3 triggered storms, and different fhr>0. The three storms that were triggered were 90L (Invest), 15E (Miriam), and 16E (Norman).

Based on feedback from the HAFS/HWRF Developers Meeting on 6/7/21, I am running some additional tests and will report the results on this thread.

  1. Only allowing 1 storm in the parent domain - running with 16E (Norman). 
  2. Only allowing 2 storms in the parent domain - running with 90L (Invest) and 16E (Norman), as well as with 15E (Miriam) and 16E (Norman).
  3. I'm also rerunning all of the above tests to create a run "c".

In May 2020, I ran into a reproducibility issue using HWRF-B on Hera; reducing the number of processors for the coupler to 4 achieved bitwise reproducibility there. Can you point me to where I can check the number of processors used by the coupler on Orion? If it's not 4, I'll change it to 4 and run a test with that change. The path to the model directory is: /work/noaa/aoml-osse/sditchek/HB20_ALL/

Thanks for the help!

Best,

Sarah

Hi Sarah,

In ./parm/hwrf_multistorm.conf, I see you are setting wm3c_ranks=4 in the [runwrf] section. This is the number of cores used for the coupler. In addition, in ./rocoto/ms_forecast_procs.ent, the various CPL_FCST_*40PPN entities that are used for running the multistorm forecast jobs on Orion all begin with 1:ppn=4, thereby reserving 4 cores for the coupler. This all looks correct, so I do not think the solution will be the same on Orion as it was on Hera.
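
For reference, a quick way to confirm both of those settings from the top of the model directory is a pair of greps like the following (illustrative; the exact entity names can vary between checkouts):

  grep -n wm3c_ranks parm/hwrf_multistorm.conf      # expect wm3c_ranks=4 under [runwrf]
  grep -n "1:ppn=4" rocoto/ms_forecast_procs.ent    # expect each CPL_FCST_*40PPN entity to begin with 1:ppn=4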

If you have any scrub directories that contain the logs and the runwrf/ directory for a particular experiment, this would help me check a few more things. I do not see any obvious problems with the way you have set up your experiment, at least not yet.

Thanks,

evan

Hi Evan,

Thanks for taking a look at that!

I have a few experiments running currently (the HB20_B2_TWONEST experiments in scrub/), but scrubbing was turned on for those, so those directories will eventually vanish. I do save the .tar files, and while they don't include the runwrf/ directory, I untarred a ONENEST experiment for you in case it's helpful. It can be found here: /work/noaa/aoml-osse/sditchek/scrub/forevan/. I can also rerun a ONENEST experiment and turn scrubbing off. Let me know if you'd like me to do that!

I have a new update. As mentioned in my original ticket, I reran the ONENEST experiments to create a run "c".

  1. Only allowing 90L (Invest) in the parent domain produced identical atcf files between runs "a", "b", and "c".
  2. Only allowing 16E (Norman) in the parent domain produced identical atcf files between runs "a", "b", and "c".
  3. BUT, only allowing 15E (Miriam) in the parent domain produced identical atcf files between runs "a" and "b"; run "c" matched only at fhr=0, with different values at fhr>0.

Thanks,

Sarah


Thanks Sarah! Would you be able to open permissions on that directory? I am most interested in the WRF namelist, because I would like to make sure the processor layout is consistent between the rocoto job card and the namelist.

As for why the same experiment (e.g., #3) sometimes reproduces output and sometimes doesn't, I am quite puzzled. You are submitting the job to the same queue in 3a–c, right? It would be really helpful to be able to see the forecast log files and the namelists from these three runs; otherwise, I'm afraid there's not much to go on. Turning scrubbing off would probably be the easiest way to make sure we have as much information as possible.

You're welcome! The directory permissions should now be open for you to take a look.
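
(For reference, opening up read access on a shared directory like that typically amounts to something along the lines of the command below; this is just a sketch, not necessarily the exact command that was used.)

  chmod -R a+rX /work/noaa/aoml-osse/sditchek/scrub/forevan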

I'm puzzled too. I am submitting the job to the same queue. The only difference between the experiments is the SUBEXPT name in my wrapper script (a, b, or c). I'll rerun those experiments with scrubbing off. Once they're done I'll post the directory path here!

As far as I can tell, all of the processor layout related variables appear to be consistent with the settings in rocoto/ms_forecast_procs.ent, so there is nothing obviously wrong. It would be good to verify that the point of origin of the different results is indeed in the forecast step (and not, for example, in the GSI or in the other preprocessing jobs). This should be easy to do with scrubbing turned off. In the meantime, I'll keep thinking.
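
A rough sketch of one way to confirm that, once scrubbing is off, is to compare the forecast inputs between two runs before comparing the outputs (runA/runB below are placeholders for the two experiment scrub directories, and wrfinput_d01/wrfbdy_d01 are the usual WRF input file names, which I'm assuming apply here):

  # if the forecast inputs are bitwise identical but the wrfout/atcf output differs,
  # the divergence originates in the forecast (runwrf) step itself
  cmp runA/2018083006/00L/runwrf/wrfinput_d01 runB/2018083006/00L/runwrf/wrfinput_d01
  cmp runA/2018083006/00L/runwrf/wrfbdy_d01   runB/2018083006/00L/runwrf/wrfbdy_d01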

Huh. Well that's good to know at least?

I finished running the experiment where I only allow one storm in the parent domain with scrubbing off. I chose Miriam since that was the one where I got different fhr>0 values previously. I ran 1 cycle 4 times and... all atcf files were identical. So unfortunately that won't help with figuring out the issue! Since they were identical, to save space I only kept 2 of the experiments, and they're located here: /work/noaa/aoml-osse/sditchek/scrub/ under the HB20_B2_TWONEST* folders.

I'm running the experiments where I only allow two storms in the domain again with scrubbing off. Hopefully that reproduces the different fhr>0 values so you can take a look. If it doesn't, I'll rerun with allowing all triggered storms in the parent domain and see if that reproduces the issue.

Is it possible that the compute nodes or processors use a different random number generator seed depending on the day, or that the seed changes after maintenance is performed?

Hi, Sarah,

Evan is on vacation, so I'll try to answer your questions.

In my understanding, the random number generator should not produce significant differences in the results.

BTW, how large is the difference between the two runs?

Thanks,

Linlin 

Hi Linlin,

I'm looking at my HB20_B2_ONENEST example where I got different atcf outputs. 

  • vimdiff HB20_B2_ONENESTc_Miriam/miriam15e.2018083006.trak.hwrf.atcfunix HB20_B2_ONENESTb_Miriam/miriam15e.2018083006.trak.hwrf.atcfunix
  • FHR=0 is identical
  • FHR>0 is not identical
  • The lat/lon begins to differ 21 h out; the intensity and pressure begin to differ 6 h out.
  • Looking out to the 126 h forecast, run "c" is at 22.3N, 153.6W while run "b" is at 22.0N, 154.3W, a difference of 0.3 degrees latitude and 0.7 degrees longitude.

Looking at my HB20_B2_ALLNEST example where I got different atcf outputs...

  • vimdiff HB20_B2_ALLNESTa_Triggered/miriam15e.2018083006.trak.hwrf.atcfunix HB20_B2_ALLNESTb_Triggered/miriam15e.2018083006.trak.hwrf.atcfunix
  • FHR=0 is identical
  • FHR>0 is different
  • The lat/lon begins to differ 24 h out; the intensity and pressure begin to differ 9 h out.
  • Looking out to the 126 h forecast, run "a" is at 26.6N, 151.5W while run "b" is at 23.7N, 152.9W, a difference of nearly 3 degrees latitude and 1.4 degrees longitude.

So the difference for the ONENEST case is minor at 126 h, but the difference when all storms in the basin are triggered is much larger at 126 h.
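
For what it's worth, a quick way to line the two tracks up side by side is something like the following (bash process substitution; the standard atcfunix column order is assumed, with the forecast hour in field 6, lat/lon in fields 7-8, and vmax/mslp in fields 9-10):

  paste <(awk -F, '{print $6,$7,$8,$9,$10}' HB20_B2_ONENESTb_Miriam/miriam15e.2018083006.trak.hwrf.atcfunix) \
        <(awk -F, '{print $6,$7,$8,$9,$10}' HB20_B2_ONENESTc_Miriam/miriam15e.2018083006.trak.hwrf.atcfunix) | column -t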

Unfortunately, running the TWONEST experiments simultaneously with scrub=no didn't reproduce this issue. I'm rerunning a ONENEST experiment to see if I can reproduce it. If not, I'll try rerunning the ALLNEST experiment with scrub=no so you can take a look.

Thanks,

Sarah

Hi again,

Some "good" news! I reran ONENEST for Miriam over the weekend with scrub=no (run "b") and it produced different results from the previous ONENEST for Miriam with scrub=no (runs "a" and "d") that were identical!! Note that originally I had run "a-d" simultaneously and they produced identical results. I only kept runs "a" and "d" to save space. The run "b" described here was run 6 days after the original "a-d" simultaneous runs.

ATCF Files: /work/noaa/aoml-osse/sditchek/noscrub/atcf/HB20_B2_ONENESTx_Miriam, where x=a,b, or d

  • Runs "a" and "d" were identical (diff HB20_B2_ONENESTa_Miriam/miriam15e.2018083006.trak.hwrf.atcfunix  HB20_B2_ONENESTd_Miriam/miriam15e.2018083006.trak.hwrf.atcfunix)
  • Runs "a" and "b" were different at multiple fhr>0. The differences aren't large, but it's not bitwise reproducible (diff HB20_B2_ONENESTa_Miriam/miriam15e.2018083006.trak.hwrf.atcfunix  HB20_B2_ONENESTb_Miriam/miriam15e.2018083006.trak.hwrf.atcfunix)

GRB Files: /work/noaa/aoml-osse/sditchek/scrub/HB20_B2_ONENESTx_Miriam, where x=a,b,or d

  • I ran diff -rq on the com folders of runs "a" and "d". All grb files were identical. Some wrf* files and other files differ, which might be due to timestamps in the files. (/work/noaa/aoml-osse/sditchek/scrub/differences_avd.txt)
  • I ran diff -rq on the com folders of runs "a" and "b". Several grb files were different. Some wrf* files and other files also differ, which might be due to timestamps and other factors in the files; a grb-only check like the sketch below can filter out that noise. (/work/noaa/aoml-osse/sditchek/scrub/differences_avb.txt)
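
A rough sketch of such a grb-only check (the com paths are illustrative and should be pointed at wherever each run's com directory actually lives):

  # compare only the GRIB output between runs "a" and "b"; report just the files that differ
  cd /work/noaa/aoml-osse/sditchek/scrub
  for f in HB20_B2_ONENESTa_Miriam/com/*grb*; do
    cmp -s "$f" "${f/ONENESTa/ONENESTb}" || echo "differs: $f"
  done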

Hopefully this helps to get to the bottom of the issue!

Thanks,

Sarah

Hi, Sarah,

I did some tests on Orion with your case; the results can be found at:

/work/noaa/noaa-det/lpan/b4b/06252021

With lower compiler optimization for the WRF part, the differences in wrfout* disappear; however, the track files can still have some differences.

Orion-login-1[143] lpan$ diff scrub/HB20_B2_ONENESTa_Miriam/2018083006/00L/runwrf/wrfout_d03_2018-09-04_12_00_00 scrub/HB20_B2_ONENESTb_Miriam/2018083006/00L/runwrf/wrfout_d03_2018-09-04_12_00_00

Orion-login-1[144] lpan$ diff scrub/HB20_B2_ONENESTa_Miriam/2018083006/00L/runwrf/wrfout_d03_2018-09-04_12_00_00 scrub/HB20_B2_ONENESTc_Miriam/2018083006/00L/runwrf/wrfout_d03_2018-09-04_12_00_00

Orion-login-1[145] lpan$ diff scrub/HB20_B2_ONENESTa_Miriam/2018083006/00L/runwrf/wrfout_d03_2018-09-04_12_00_00 scrub/HB20_B2_ONENESTd_Miriam/2018083006/00L/runwrf/wrfout_d03_2018-09-04_12_00_00

Orion-login-1[146] lpan$ diff noscrub/HB20_B2_ONENESTa_Miriam/miriam15e.2018083006.trak.hwrf.atcfunix noscrub/HB20_B2_ONENESTb_Miriam/miriam15e.2018083006.trak.hwrf.atcfunix

56,67c56,67

< EP, 15, 2018083006, 03, HWRF, 126, 220N, 1543W,  20, 1012, XX,  34, NEQ, 0000, 0000, 0000, 0000,  -99,  -99, 134,   0,   0,    ,   0,    ,   0,   0,           ,  ,   ,    ,   0,   0,   0,   0,       THERMO PARAMS,    -112,    1107,    -392, N, 10, DT, -999

---

> EP, 15, 2018083006, 03, HWRF, 126, 221N, 1537W,  20, 1012, XX,  34, NEQ, 0000, 0000, 0000, 0000,  -99,  -99, 161,   0,   0,    ,   0,    ,   0,   0,           ,  ,   ,    ,   0,   0,   0,   0,       THERMO PARAMS,    -114,    1099,    -373, N, 10, DT, -999

I'll change the compile options for the HWRF utilities and do more tests.
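
For reference, "lower compiling optimization" for the WRF part typically means editing the optimization flags in the generated configure.wrf before rebuilding WRF; a rough sketch, assuming the usual Intel build variable (the exact variable name and flags in this checkout may differ):

  # in WRF's configure.wrf, lower the optimization level, e.g. change
  #   FCOPTIM = -O3
  # to something like
  #   FCOPTIM = -O1 -fp-model precise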

Thanks,

Linlin


Hi Linlin,

Thanks for looking into this more! I was on annual leave for the past two weeks and returned late last week.

While away, my colleague Peter Marinescu, who is having reproducibility issues with H220_V16 on Hera, had an email thread going with Zhan Zhang, Jason Sippel, and me. Zhan is also aware of the reproducibility issues I'm having on Orion with HB20. He provided a potential fix for both Peter and me, which I'm testing now.

The fix entails copying over the newest version of ./sorc/WPSV3/arch/configure.defaults and ./sorc/WPSV3/ungrib/Makefile to my directory and rebuilding the ungrib executable. This change adds in the -fp-model consistent flag. Zhan explained that "the forecast job reproducibility issue may be related to the ungrib ifort compilation option that DTC has fixed, but not included in H220_forV16." 
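
For reference, after copying those two files over and before rebuilding ungrib, a quick sanity check that the flag actually landed is something like the following (paths relative to the model directory; illustrative only):

  grep -n "fp-model consistent" sorc/WPSV3/arch/configure.defaults sorc/WPSV3/ungrib/Makefile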

I'll keep you updated!

Thanks,

Sarah


Hi, Sarah,

Thanks for the information! I tried the -fp-model consistent flag for the WRF, WPS, and GSI/ENKF parts of Peter's case on Hera.

The differences still exist for the third cycle.

I'll do more tests.

Thanks,

Linlin