Running WRF on multiple nodes with Singularity
One of the main advantages of Singularity is its broad support for HPC applications, specifically its lack of root privilege requirements and its support for scalable MPI on multi-node machines. This page gives an example of the procedure for running this tutorial's WPS/WRF Singularity container on multiple nodes on the NCAR Cheyenne supercomputer. The specifics of running on your particular machine may differ, but you should be able to apply the lessons from this example to any HPC platform where Singularity is installed.
Step-by-step instructions
Load the singularity, gnu, and openmpi modules
module load singularity
module load gnu
module load openmpi
Set up experiment per usual (using snow case in this example)
git clone git@github.com:NCAR/container-dtc-nwp -b v${PROJ_VERSION}
mkdir data/ && cd data/
tcsh:
foreach f (/glade/p/ral/jntp/NWP_containers/*.tar.gz)
  tar -xf "$f"
end

bash:
for f in /glade/p/ral/jntp/NWP_containers/*.tar.gz; do tar -xf "$f"; done
mkdir -p ${CASE_DIR} && cd ${CASE_DIR}
mkdir -p wpsprd wrfprd gsiprd postprd pythonprd metprd metviewer/mysql
export TMPDIR=${CASE_DIR}/tmp
mkdir -p ${TMPDIR}
Pull singularity image for wps_wrf from DockerHub
The Singularity containers used in this tutorial take advantage of Singularity's ability to create containers from existing Docker images hosted on DockerHub. This allows the DTC team to support both technologies without the additional effort of maintaining a separate set of Singularity recipe files. However, as mentioned on the WRF NWP Container page, the Docker containers in this tutorial contain some features (a so-called entrypoint script) to mitigate permissions issues seen with Docker on some platforms. Singularity on multi-node platforms does not work well with this entrypoint script, and because Singularity does not suffer from the same permissions issues as Docker, we have provided an alternate Docker container for use with Singularity to avoid these issues across multiple nodes:
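As a sketch only (the image name and tag below are assumptions; use the exact image name given in the tutorial), the pull step looks like:

```shell
# Pull the Docker image from DockerHub and convert it to a Singularity
# image file (.sif). The "dtcenter/wps_wrf" name and ${PROJ_VERSION} tag
# are assumptions -- substitute the image named in the tutorial.
singularity pull wps_wrf.sif docker://dtcenter/wps_wrf:${PROJ_VERSION}
```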
Create a sandbox so the container is stored on disk rather than memory/temporary disk space
In the main tutorial, we create Singularity containers directly from the Singularity Image File (.sif). For multi-node Singularity, we will take advantage of an option known as "Sandbox" mode:
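A sketch of the sandbox build step (the .sif filename here is an assumption; use whatever name your pull step produced):

```shell
# Build a writable sandbox directory named "wps_wrf" from the image file,
# so the container contents live on disk rather than in memory/temporary
# disk space. The wps_wrf.sif filename is an assumption.
singularity build --sandbox wps_wrf wps_wrf.sif
```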
This creates a directory named "wps_wrf" that contains the entire directory structure of the Singularity image. This is a way to interact with the Singularity container space from outside the container, rather than having it locked away in the .sif file. If you use the ls command to view the contents of this directory, you will see that it looks identical to the top-level directory structure of a typical Linux install:
total 75
drwxr-xr-x 18 kavulich ral 4096 Feb 8 13:49 .
drwxrwxr-x 11 kavulich ral 4096 Feb 8 13:49 ..
-rw-r--r-- 1 kavulich ral 12114 Nov 12 2020 anaconda-post.log
lrwxrwxrwx 1 kavulich ral 7 Nov 12 2020 bin -> usr/bin
drwxr-xr-x 4 kavulich ral 4096 Feb 8 12:33 comsoftware
drwxr-xr-x 2 kavulich ral 4096 Feb 8 13:49 dev
lrwxrwxrwx 1 kavulich ral 36 Feb 8 13:42 environment -> .singularity.d/env/90-environment.sh
drwxr-xr-x 57 kavulich ral 16384 Feb 8 13:42 etc
lrwxrwxrwx 1 kavulich ral 27 Feb 8 13:42 .exec -> .singularity.d/actions/exec
drwxr-xr-x 4 kavulich ral 4096 Feb 8 12:52 home
lrwxrwxrwx 1 kavulich ral 7 Nov 12 2020 lib -> usr/lib
lrwxrwxrwx 1 kavulich ral 9 Nov 12 2020 lib64 -> usr/lib64
drwxr-xr-x 2 kavulich ral 4096 Apr 10 2018 media
drwxr-xr-x 2 kavulich ral 4096 Apr 10 2018 mnt
drwxr-xr-x 3 kavulich ral 4096 Dec 27 15:32 opt
drwxr-xr-x 2 kavulich ral 4096 Nov 12 2020 proc
dr-xr-x--- 5 kavulich ral 4096 Dec 27 16:00 root
drwxr-xr-x 13 kavulich ral 4096 Dec 27 16:20 run
lrwxrwxrwx 1 kavulich ral 26 Feb 8 13:42 .run -> .singularity.d/actions/run
lrwxrwxrwx 1 kavulich ral 8 Nov 12 2020 sbin -> usr/sbin
lrwxrwxrwx 1 kavulich ral 28 Feb 8 13:42 .shell -> .singularity.d/actions/shell
lrwxrwxrwx 1 kavulich ral 24 Feb 8 13:42 singularity -> .singularity.d/runscript
drwxr-xr-x 5 kavulich ral 4096 Feb 8 13:42 .singularity.d
drwxr-xr-x 2 kavulich ral 4096 Apr 10 2018 srv
drwxr-xr-x 2 kavulich ral 4096 Nov 12 2020 sys
lrwxrwxrwx 1 kavulich ral 27 Feb 8 13:42 .test -> .singularity.d/actions/test
drwxrwxrwt 7 kavulich ral 4096 Feb 8 12:53 tmp
drwxr-xr-x 13 kavulich ral 4096 Nov 12 2020 usr
drwxr-xr-x 18 kavulich ral 4096 Nov 12 2020 var
You can explore this directory to examine the contents of the container, but be careful not to make any modifications that could cause problems later on!
Run WPS as usual
The command for running WPS is similar to the one used in the main tutorial. However, because we are using a sandbox rather than creating a container directly from the Singularity image file, the run command must change: the sandbox directory is given in place of the .sif file. Note the bold part that is different from the original tutorial:
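As a hedged sketch of the shape of that command (the `./run_wps.ksh` script name is a placeholder for whatever the main tutorial invokes, not a confirmed name): the container invocation follows the same pattern used for wrf.exe later on this page, with the sandbox directory `${CASE_DIR}/wps_wrf` appearing where the .sif file would otherwise go.

```shell
# Sketch only: replace ./run_wps.ksh with the WPS command from the main
# tutorial. The key difference from the main tutorial is that the
# sandbox directory ${CASE_DIR}/wps_wrf replaces the .sif file.
singularity exec -B/glade:/glade ${CASE_DIR}/wps_wrf ./run_wps.ksh
```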
Prepare the wrfprd directory
Now this part is still a little hacky, but it will be cleaned up in future versions. Enter the wrfprd directory and manually link the met_em output files from WPS, renaming them to the proper "nocolons" convention. Then link in the contents of the WRF run directory (the static input files and compiled executables from the sandbox container we created), and replace the default namelist with our case's custom namelist:
cd ${CASE_DIR}/wrfprd
ln -sf ${CASE_DIR}/wps_wrf/comsoftware/wrf/WRF-4.3/run/* .
rm namelist.input
cp $PROJ_DIR/container-dtc-nwp/components/scripts/snow_20160123/namelist.input .
Finally, request as many cores/nodes as you want, reload the environment on compute nodes, and run!
qsub -V -I -l select=2:ncpus=36:mpiprocs=36 -q regular -l walltime=02:00:00 -A P48503002
ln -sf ${CASE_DIR}/wpsprd/met_em.* .
tcsh:
foreach f ( met_em.* )
  setenv j `echo $f | sed s/\:/\_/g`
  mv $f $j
end

bash:
for f in met_em.*; do mv "$f" "$(echo "$f" | sed s/\:/\_/g)"; done
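To illustrate what the renaming loop above does (the filename below is made up for this sketch), the sed substitution simply replaces every colon in a met_em filename with an underscore, which is the "nocolons" convention WRF expects:

```shell
# Example of the "nocolons" renaming applied to a sample met_em filename.
# The sample filename is illustrative only.
f="met_em.d01.2016-01-23_00:00:00.nc"
j=$(echo "$f" | sed s/:/_/g)
echo "$j"   # -> met_em.d01.2016-01-23_00_00_00.nc
```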
mpiexec -np 72 singularity run -u -B/glade:/glade ${CASE_DIR}/wps_wrf ./wrf.exe
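If you prefer a batch job to the interactive session above, the same mpiexec line can go in a PBS script. A hedged sketch (the job name is made up; the account, queue, and resource selection are copied from the interactive qsub example on this page, so adjust them for your own allocation):

```shell
# Generate a PBS batch script mirroring the interactive run on this page.
# The #PBS directives copy the qsub example above; the job name and any
# environment assumptions (CASE_DIR exported via -V) are illustrative.
cat > run_wrf.pbs << 'EOF'
#!/bin/bash
#PBS -N wrf_snow
#PBS -A P48503002
#PBS -q regular
#PBS -l select=2:ncpus=36:mpiprocs=36
#PBS -l walltime=02:00:00
#PBS -V

module load singularity gnu openmpi
cd ${CASE_DIR}/wrfprd
mpiexec -np 72 singularity run -u -B/glade:/glade ${CASE_DIR}/wps_wrf ./wrf.exe
EOF
```

The script can then be submitted with `qsub run_wrf.pbs`; `#PBS -V` carries your login environment (including CASE_DIR) into the job, matching the `-V` flag used in the interactive example.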
The rest of the tutorial can be completed as normal.