Beehive Computing Cluster at CAL
Beehive is a 10-node Beowulf computer cluster.
User Guide
Upon logging into Beehive you will be placed in your user directory on the master node. From this node you submit jobs to the compute nodes via the PBS batch system.
Running Jobs
To run a job, you submit it to the queue with the qsub command, where bongo.sh is a script you have written containing information about your job and run commands for your executables.
qsub bongo.sh
This is an example of a script, bongo.sh:
#!/bin/sh
### Project or Job name
#PBS -N MrBongo
### Declare job non-rerunable
#PBS -r n
### Output files
###PBS -e output.err
###PBS -o output.log
### Expected wall time of run in HH:MM:SS
#PBS -l walltime=01:00:00
### Number of nodes and number of processors per node (max 2)
#PBS -l nodes=1:ppn=1
# This job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run executables (initials)
./bongo_dist
# Run executables sim
./bongo8
About this script: it runs the bongo executables on one node. The walltime you request determines how long your job may run; when the job hits the walltime it is stopped, finished or not. Shorter jobs may start sooner than longer ones, but make sure your job can finish in the time given. There are four queues: short (up to 20 min), medium (up to 2 hours), long (up to 24 hours), and verylong (up to 168 hours). You do not need to specify the queue name in the job -- if you specify the walltime, PBS will choose the correct queue.
After you submit a script, a job number will be returned. In the queue, your job is referred to by its name (MrBongo above) and its job number. You can modify, track, or kill your job using this number. When the run has completed, two files, MrBongo.e##### and MrBongo.o#####, will contain any error messages and any terminal output from your code.
Please do not run large jobs on the master node (even niced jobs), as this will make Beehive unusable for all other users! To run interactive jobs on Beehive, first log in to a compute node. This can be done using the "qsub -I" command.
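A minimal interactive request might look like this (the walltime and node counts are illustrative; adjust them to your needs):

```shell
# Request an interactive shell on one compute node for 30 minutes
qsub -I -l walltime=00:30:00 -l nodes=1:ppn=1
```

Once the session starts you are placed in a shell on a compute node, where it is safe to run interactive work.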
To examine the jobs in the queue, use:
qstat -a
To cancel a job, use qstat to find its ID number, then pass it to the following command:
qdel job_id
Transferring Output and Maintaining Data
To transfer data back to your CW, you will need to set up an SSH key pair. If transferring multiple files, you will want to cache your key-pair passphrase so that you do not have to re-enter it for every transfer.
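One possible setup, assuming a standard OpenSSH client on both machines (the key file name and remote host below are hypothetical):

```shell
# Generate a key pair with a passphrase (stored under ~/.ssh)
ssh-keygen -t rsa -f ~/.ssh/id_rsa_beehive

# Install the public key on your destination machine
ssh-copy-id -i ~/.ssh/id_rsa_beehive.pub you@your-workstation

# Start an agent and add the key once per session, so the
# passphrase is not asked for on every scp/ssh
eval `ssh-agent`
ssh-add ~/.ssh/id_rsa_beehive
```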
You can also make your life easier by tarring up your files, or by tarring and gzipping larger files so they do not take as long to transfer.
tar -cf mongo.out.tar ./output/*
gzip mongo.out.tar
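Put together, a self-contained round trip might look like this (the directory and file names are only illustrative):

```shell
# Create some sample output to stand in for real job results
mkdir -p output
echo "sample data" > output/run1.dat

# Bundle and compress before transferring
tar -cf bongo.out.tar ./output/*
gzip bongo.out.tar          # produces bongo.out.tar.gz

# ...transfer bongo.out.tar.gz with scp...

# On the destination machine, unpack it again
gunzip bongo.out.tar.gz
tar -xf bongo.out.tar
```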
Using MPI on the cluster
If you have an MPI code, then it can be compiled with "mpif77" or "mpicxx", and submitted with a script like this:
#!/bin/sh
### Job name
#PBS -N mpi_test
### Declare job non-rerunable
#PBS -r n
### Output files
#PBS -e output_enzo.err
#PBS -o output_enzo.log
### Mail to user
##PBS -m bae
### Give email address
#PBS -M myemail@astro.columbia.edu
### Number of nodes and other properties
#PBS -l nodes=2:ppn=2
#PBS -l walltime=1:00:00
# This job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
# Count number of processors
NP=`wc -l < $PBS_NODEFILE`
echo Number of processors = $NP
# Run your mpi executable
mpiexec -machinefile $PBS_NODEFILE -n $NP my_mpi_program
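For reference, the compile step might look like this (the source file names are hypothetical):

```shell
# Fortran 77 source
mpif77 -o my_mpi_program my_mpi_program.f
# C++ source
mpicxx -o my_mpi_program my_mpi_program.cpp
```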
Large memory nodes
The master and most nodes have 2 GB of memory and two processors each. Nodes 8 and 9 (node8 and node9) each have 8 GB of memory. To request these nodes from a PBS script, specify the following line in your script file:
#PBS -l nodes=1:ppn=1:bigmem
Using IDL on the cluster
[somebody add this please...]
Administrator Guide
Please add and modify.
Shutdown Procedures
When connected to Beehive remotely as root, rsh into each individual node and use the "poweroff" command to shut it down. Then power off the fileserver and the master.
Restarting mpd
Sometimes mpd, the process that handles MPI communication, will fail on the nodes. To check whether this has happened, use "mpdtrace" to get a list of which nodes are responding. On each node that is not responding (note: node4 is currently down), run "/etc/init.d/mpd restart". If this fails, you may need to manually kill the mpd process, which sometimes hangs.
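One possible sequence for clearing a hung mpd on a node (the PID reported by ps will differ each time, so the kill line is left commented out as a reminder to substitute the real PID):

```shell
# List any mpd processes on this node ('[m]pd' keeps grep
# from matching itself)
ps aux | grep '[m]pd'
# kill -9 <pid>          # substitute the PID reported above
/etc/init.d/mpd restart
mpdtrace                 # confirm the node is listed again
```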
