Environment Issue on Orion: HWRF Jobs Fail To Recognize Correct Shell

Submitted by ghassan.alaka on Mon, 11/16/2020 - 12:39
Forum: Developers | HPC

The stack size on Orion must be set to unlimited in order for HWRF (and HWRF-B) components to run with the memory they need. The default stack size is too small to accommodate large files and complex jobs. I mentioned the recurrence of the "stack size" problem on Orion in the HWRF Developer's call today (11/16). Even though the stack size is explicitly set with "ulimit -s unlimited" in the ~/.bashrc file, it seems as though the HWRF system might not be recognizing that the default shell should be Bash. This had worked until recently, so the break is likely due to some change on the Orion side. While investigating the log files, I noticed this potential error in the header of the log file, during the environment/module loads:

++ __ms_ksh_test=
/work/noaa/aoml-hafs1/galaka/HB20_orion/ush/hwrf_pre_job.sh.inc: line 54: __ms_function_name: unbound variable
+++ cat
++ __ms_bash_test=
++ [[ ! -z '' ]]
++ [[ ! -z '' ]]
++ __ms_shell=sh

This unbound variable prevents the job from realizing the shell should be Bash, so it defaults to sh. The simple fix was to also set the stack size to be unlimited in ~/.profile, so that both "sh" and "bash" shells will have unlimited stack sizes. However, it seems as though hwrf_pre_job.sh.inc should be updated so the correct shell can be used on Orion. I checked the HWRF trunk version of hwrf_pre_job.sh.inc in case I am missing an update, but my version is identical.
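
The ~/.profile workaround described above can be sketched roughly as follows. This is a minimal sketch, not the exact lines in use: the function name raise_stack_limit is hypothetical, and it assumes the sh on the system accepts "ulimit -s" (bash and dash both do).

```shell
# Sketch of a ~/.profile fragment, written in plain sh syntax so that both
# sh and bash can read it. raise_stack_limit is a hypothetical name.
raise_stack_limit() {
    # Try to lift the soft stack limit. If the hard limit forbids it,
    # print a warning instead of aborting the login or job.
    ulimit -s unlimited 2>/dev/null ||
        echo "warning: could not set stack size to unlimited" >&2
}

raise_stack_limit

# Report the resulting soft limit so it is visible in the job log.
echo "stack limit: $(ulimit -s)"
```

Printing the limit from inside the job is a cheap way to confirm which startup file actually took effect.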

I am open to ideas on how to proceed. The fix I mentioned allows the job to succeed, but I am guessing many more users will run into this issue on Orion.
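
One possible direction, offered as a generic illustration rather than a patch to hwrf_pre_job.sh.inc: "unbound variable" is the message bash prints when an unset variable is expanded while nounset (set -u) is active. The sketch below reproduces that failure and shows two common guards. The variable demo_var and the function names are hypothetical, and the actual contents of line 54 are not shown in this thread.

```shell
#!/bin/bash
# Generic illustration only; demo_var and the function names are hypothetical
# and are not taken from hwrf_pre_job.sh.inc.

# Expanding an unset variable under nounset fails. The subshell keeps the
# failure from terminating the caller.
expand_unguarded() {
    ( set -u; unset demo_var; : "$demo_var" ) 2>/dev/null
}

# Guard 1: supply a default so the expansion is always defined.
expand_with_default() {
    ( set -u; unset demo_var; printf '%s\n' "${demo_var:-unset}" )
}

# Guard 2: relax nounset around code that was not written for it,
# then switch it back on.
expand_with_nounset_relaxed() {
    ( set -u; unset demo_var
      set +u
      : "$demo_var"      # tolerated while nounset is off
      set -u
      echo "nounset restored" )
}

if expand_unguarded; then
    echo "unguarded expansion succeeded"
else
    echo "unguarded expansion failed"
fi
expand_with_default
expand_with_nounset_relaxed
```

Whether either guard belongs in hwrf_pre_job.sh.inc or around the module initialization it sources would need to be checked against the actual script.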

Best,

Gus

Hi Gus,

I use .bashrc on Orion, and adding

ulimit -s 19000000

before the following line

if [ -z "$PS1" ]; then return; fi

works for the operational HWRF system. That said, you have a fix now, and we will look into it.
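
Placement presumably matters because batch jobs run non-interactively, where PS1 is normally unset, so .bashrc returns at that guard and never reaches anything below it. A small sketch of the effect, in which simulate_bashrc and the variable reached are hypothetical stand-ins for the body of ~/.bashrc:

```shell
#!/bin/bash
# simulate_bashrc is a hypothetical stand-in for the body of ~/.bashrc.
simulate_bashrc() {
    reached="before guard"     # where "ulimit -s ..." would need to go
    if [ -z "${PS1:-}" ]; then return; fi
    reached="after guard"      # only shells with PS1 set get this far
}

# Non-interactive case: PS1 is unset, so the function returns at the guard.
( unset PS1; simulate_bashrc; echo "non-interactive: $reached" )

# Interactive case: PS1 is set, so execution continues past the guard.
( PS1='$ '; simulate_bashrc; echo "interactive: $reached" )
```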

Thanks

In reply to by biswas

Hi Biswas,

Thank you for that information. This issue occurred after defining "ulimit -s unlimited" in .bashrc.

Do you see the "unbound variable" error at the top of your job logs? What value is $__ms_shell set to there? Can you retest one of these jobs to see if it still works for you?
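
To make the comparison easier, both pieces of evidence can be pulled from a job log with grep. A small sketch, in which scan_job_log is a hypothetical helper and the log path is whatever file your job wrote:

```shell
#!/bin/bash
# scan_job_log is a hypothetical helper; pass it the path to a job log.
scan_job_log() {
    # -n prints line numbers; -E enables the alternation in the pattern.
    grep -n -E '__ms_shell=|unbound variable' "$1"
}

# When run as a script with a log path as the first argument, scan that log.
if [ "$#" -ge 1 ]; then
    scan_job_log "$1"
fi
```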