output task failed on JET

Submitted by alim on Thu, 11/18/2021 - 10:56
Forum: Developers | HPC

Hi,

The HWRF output task has been failing for a storm that I started two days ago.

It looks like a memory issue.

Error message:

ExitStatusException: batchexe('/apps/nco/4.9.1/intel/18.0.5.274/bin/ncks')['-4','-L','6','-O',u'/lfs4/HFIP/hwrfv3/Agnes.Lim/pytmp/H220_ctrl_intel18/2020070612/05E/intercom/fgat.t202007061200/wrfanl/wrfanl_d02_2020-07-06_12_00_00',u'/lfs4/HFIP/hwrfv3/Agnes.Lim/pytmp/H220_ctrl_intel18/com/2020070612/05E/tmp.wrfanl_d02_2020-07-06_12_00_00.part.B0OkrO'].in('/dev/null',string=False): non-zero exit status (returncode=-9)
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=62374467.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

tjet is being used for this task. How can I switch this task to one of the other jets?

Thanks

Agnes

Hi Agnes,

The easiest way to fix this would be to increase OUTPUT_MEMORY from 2G to 6G in rocoto/sites/defaults.ent. With this change, the task should be able to run on any jet.
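
If you would rather script the change than edit the file by hand, here is a rough sketch in Python. It assumes OUTPUT_MEMORY is defined on a single line as an XML entity (a line starting with `<!ENTITY OUTPUT_MEMORY`) and that the value appears on that line as the literal text 2G. I have not checked the exact layout of the line in your copy of defaults.ent, so look at it first and adjust if it differs. The function name bump_memory and the sample line in the comment are made up for illustration.

```python
import re


def bump_memory(text, name="OUTPUT_MEMORY", old="2G", new="6G"):
    """Return `text` with `old` replaced by `new` on the entity line for `name`.

    Assumption: the entity is declared on one line, for example
    <!ENTITY OUTPUT_MEMORY "2G">  (hypothetical layout, check your file).
    Lines for other entities are left untouched.
    """
    pattern = re.compile(r"<!ENTITY\s+" + re.escape(name) + r"\b")
    out = []
    for line in text.splitlines(keepends=True):
        if pattern.search(line):
            # Replace only the first occurrence of the old value on this line.
            line = line.replace(old, new, 1)
        out.append(line)
    return "".join(out)


if __name__ == "__main__":
    path = "rocoto/sites/defaults.ent"  # path as given above, relative to the HWRF checkout
    with open(path) as f:
        original = f.read()
    updated = bump_memory(original)
    if updated == original:
        print("No change made; check how OUTPUT_MEMORY is defined in", path)
    else:
        with open(path, "w") as f:
            f.write(updated)
        print("OUTPUT_MEMORY updated in", path)
```

The script only touches the OUTPUT_MEMORY line, so other tasks that also use 2G keep their current setting.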

Thanks,

evan