Beehive Computing Cluster at CAL

Beehive is a 10-node Beowulf computing cluster.

User Guide

Upon logging into Beehive you will be directed to your user directory on the main node. From this node you will submit jobs to the other nodes via the PBS batch system.

Running Jobs

To run a job, submit it to the queue with the qsub command, where bongo.sh is a script you have written containing information about your job and the run commands for your executables:

qsub bongo.sh

This is an example of a script, bongo.sh:

#!/bin/sh
### Project or Job name
#PBS -N MrBongo
### Declare job non-rerunable
#PBS -r n
### Output files (the extra # disables these directives; PBS then uses its default file names)
###PBS -e output.err
###PBS -o output.log
### Expected wall time of run in HH:MM:SS
#PBS -l walltime=01:00:00
### Number of nodes and number of processors per node (max 2)
#PBS -l nodes=1:ppn=1

# This job's working directory
echo Working Directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR

echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`

# Run initial executable
./bongo_dist

# Run simulation executable
./bongo8

About this script: it runs the bongo executables on one node. The walltime you request determines how long your job may run; when the walltime is reached, your job is stopped, finished or not. Shorter jobs may start sooner than longer ones, but make sure your job can finish in the time you request. There are four queues: short (up to 20 min), medium (up to 2 hours), long (up to 24 hours), and verylong (up to 168 hours). You do not need to name the queue in your script: if you specify the walltime, PBS will choose the correct queue.

After you submit a script, a job number will be returned. In the queue, your job is referred to by its name (MrBongo in the example above) and this job number; you can modify, track, or kill your job using the number. When the run has completed, two files will appear: MrBongo.o##### with any terminal output from your code and MrBongo.e##### with any error messages.
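
For example, submitting the script returns a job identifier (the number shown here is hypothetical; yours will differ):

qsub bongo.sh
1234.master

The number before the dot is the job ID to use with qstat and qdel.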

Please do not run large jobs on the master node (even niced jobs), as this will make Beehive unusable for all other users! To run interactive jobs on Beehive, first request an interactive session on a compute node with the "qsub -I" command.
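
For example, to request a one-hour interactive session on a single processor (the resource values are only an illustration; adjust them to your needs):

qsub -I -l nodes=1:ppn=1,walltime=01:00:00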

To examine the jobs in the queue, use:

qstat -a

To cancel a job, use qstat to find its ID number and then use this in the following command:

qdel job_id
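
For example, if qstat -a lists your job under the (hypothetical) ID 1234.master, cancel it with:

qdel 1234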

Transferring Output and Maintaining Data

To transfer data back to your CW, you will need to set up an SSH key pair. If you are transferring multiple files, you will want to cache your key passphrase (for example with ssh-agent) so that you do not have to re-enter it for every transfer.
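
A minimal sketch of the setup, assuming an RSA key and using placeholder names for your CW username and hostname:

# generate a key pair on Beehive (run once; choose a passphrase when prompted)
ssh-keygen -t rsa
# copy the public key to your CW machine (username and hostname are placeholders);
# if ssh-copy-id is not available, append ~/.ssh/id_rsa.pub to
# ~/.ssh/authorized_keys on the CW machine by hand
ssh-copy-id username@your_cw_machine
# cache the passphrase for this session so repeated transfers do not prompt for it
eval `ssh-agent`
ssh-add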

You can also make your life easier by tarring up your files, or by tarring and gzipping larger files, so they do not take so long to transfer:

tar -cf mongo.out.tar ./output/*
gzip mongo.out.tar
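
The compressed archive (mongo.out.tar.gz after the gzip step) can then be copied back in one transfer; the username, hostname, and destination path below are placeholders:

scp mongo.out.tar.gz username@your_cw_machine:/path/to/destination/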

Using MPI on the cluster

If you have an MPI code, then it can be compiled with "mpif77" or "mpicxx", and submitted with a script like this:

#!/bin/sh
### Job name
#PBS -N mpi_test
### Declare job non-rerunable
#PBS -r n
### Output files
#PBS -e output_enzo.err
#PBS -o output_enzo.log
### Mail to user (remove one # from the next line to enable mail notifications)
##PBS -m bae
### Give email address
#PBS -M myemail@astro.columbia.edu
### Number of nodes and other properties
#PBS -l nodes=2:ppn=2
#PBS -l walltime=1:00:00

# This job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR  

# Count number of processors
NP=`wc -l < $PBS_NODEFILE `
echo Number of processors = $NP

# Run your mpi executable
mpiexec -machinefile $PBS_NODEFILE -n $NP my_mpi_program
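
Before submitting, compile your code on the master node with one of the MPI wrappers mentioned above. The source file names and the my_mpi_program output name below are placeholders; use whichever wrapper matches your language:

# Fortran 77 source
mpif77 -o my_mpi_program my_mpi_program.f
# C++ source
mpicxx -o my_mpi_program my_mpi_program.cpp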

Large memory nodes

The master and most nodes have 2 GB of memory and two processors each. Nodes 8 and 9 (node8 and node9) each have 8 GB of memory. To request these nodes in a PBS script, include the following line in your script file:

#PBS -l nodes=1:ppn=1:bigmem
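
To check which nodes advertise the bigmem property (and whether they are currently free), the node list can be inspected from the master; the exact output format depends on the installed PBS version:

pbsnodes -a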

Using IDL on the cluster

The Beehive cluster nodes are behind a NAT, so to run IDL you need to create an SSH tunnel to the license server through the master node. Here is how it works:

First, set up passwordless SSH key access on the cluster. See this link for an example of how to do it: http://homepage.mac.com/wyuen/hpc/Passwordless_SSH.html

Second, create a file called .flexlmrc in your home directory that reads:

IDL_LMGRD_LICENSE_FILE=1700@localhost:/usr/local/rsi/idl_6.2/../license/license.dat

Finally, you are set to submit a PBS job that runs IDL. The one thing you must include is a line which generates the ssh tunnel (and then puts itself into the background so the tunnel stays around). Here is an example script:

#!/bin/sh
### Job name
#PBS -N idl_test
### Declare job non-rerunable
#PBS -r n
### Output files
#PBS -e output_simple.err
#PBS -o output_simple.log
### Mail to user
#PBS -m bae
### Give email address
#PBS -M your_email_address@astro.columbia.edu
### Queue name (short, medium, long, verylong)
#PBS -q medium
### Number of nodes and other properties (big memory)
#PBS -l nodes=1:bigmem

cd $PBS_O_WORKDIR

echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`

# start the secure tunnel to the master (passwordless ssh must be set up)

ssh -N -L 1700:neptune.astro.columbia.edu:1700 -L 1705:neptune.astro.columbia.edu:1705 master &
TUNNEL_PID=$!

# now we can run the idl job, replace my_idljob with your own idl script

/usr/local/bin/idl my_idljob

# clean up the secure tunnel

kill $TUNNEL_PID

Administrator Guide

Please add and modify.

Shutdown Procedures

When connected to Beehive remotely as root, rsh into each individual node and use the "poweroff" command to shut each node down. Then power off the fileserver and the master.
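
A sketch of the node loop, assuming the compute nodes are named node1 through node9 (adjust the list to the nodes that are actually up, e.g. skip node4 if it is still down):

for n in node1 node2 node3 node5 node6 node7 node8 node9; do
    rsh $n poweroff
done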

Restarting mpd

Sometimes mpd, the process that handles MPI communication, will fail on the nodes. To test whether this has happened, use "mpdtrace" to get a list of which nodes are responding. For those that are not responding (note: node4 is currently down), run "/etc/init.d/mpd restart" on those nodes. If this fails, you may need to manually kill the mpd process, which sometimes hangs.
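
A sketch of the manual cleanup on an unresponsive node (run as root on that node; the forced kill is a last resort if a plain restart hangs):

# try a normal restart first
/etc/init.d/mpd restart
# if the restart hangs, kill the stuck mpd process and try again
pkill -9 mpd
/etc/init.d/mpd restart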

Clearing the undelivered pbs spool

If jobs start failing with zero-length output and error files, check whether the /var filesystem is full on any of the nodes. If so, this is most likely due to very large undelivered output and error files in the directory /var/spool/pbs/undelivered. This directory will need to be purged occasionally.
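
A sketch of the check and cleanup on an affected node (run as root, and double-check the path before deleting anything):

# see how full /var is
df -h /var
# inspect and then remove the undelivered files
ls -lh /var/spool/pbs/undelivered
rm -f /var/spool/pbs/undelivered/*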