IIGB Linux Cluster

Introduction

This manual provides a brief introduction to the use of IIGB's Linux cluster, Biocluster.
All servers and compute resources of the IIGB bioinformatics facility are available to researchers from all departments and colleges at UC Riverside for a minimal recharge fee (see rates).
To request an account, please contact Thomas Girke (thomas.girke@ucr.edu).
The latest hardware/facility description for grant applications is available here: Facility Description [pdf].

Biocluster Overview

Storage

  • Four enterprise class HPC storage systems
  • Approximately 200 TB of network storage
  • ZFS over NFS
  • Automatic backups

Network

  • Ethernet:
    • 1 Gb/s switch x 4
    • redundant, load balanced, robust mesh topology
  • Interconnect
    • 20 Gb/s InfiniBand

Head Nodes

  • Biocluster
    • Resources: 8 cores, 64 GB memory
    • Primary function: submitting jobs to the queuing system (Torque/Maui)
    • Secondary function: development; code editing and running small (under 50 % CPU and under 30 % RAM) sample jobs
  • Owl
    • Resources: 16 cores, 64 GB memory
    • Primary function: testing; running test sets of jobs
    • Secondary function: submitting jobs to the queuing system (Torque/Maui)

Worker Nodes

  • High-Memory nodes
    • m01-m03: each 32-64 cores and 252-512 GB memory
  • Compute nodes
    • n01-n32: each 8 cores, 16 GB memory
    • n33-n34: each 48 cores, 64 GB memory

Current status of Biocluster nodes

qstatMonitor_Report

Getting Started

The initial log-in brings users to the Biocluster head node. From there, users can submit jobs via qsub to the compute nodes or log into Owl to perform memory-intensive tasks.
Since all machines mount a centralized file system, users will always see the same home directory on all systems. Therefore, there is no need to copy files from one machine to another.

Login from Mac or Linux

Open the terminal and type:

ssh -X username@biocluster.ucr.edu

Login from Windows

Please refer to the login instructions of our Linux Basics manual.

Change Password

  1. Log-in via SSH using the Terminal on Mac/Linux or Putty on Windows
  2. Once you have logged in type the following command:
    passwd
  3. Enter the old password (the random characters that you were given as your initial password)
  4. Enter your new password
The password minimum requirements are:
  • Total length at least 8 characters long
  • Must have at least 3 of the following:
    • Lowercase character
    • Uppercase character
    • Number
    • Punctuation character

Modules

All software used on Biocluster is managed through a simple module system.
You must explicitly load and unload each package as needed.
More advanced users may want to load modules from within their .bashrc, .bash_profile, or .profile files.
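For example, a minimal sketch of such an addition to ~/.bashrc might look like this (the module names are just examples taken from the listing below):

# Load frequently used modules automatically at login
module load samtools
module load bowtie2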

Available Modules

To list all available software modules, execute the following:

module avail

This should output something like:

------------------------- /usr/local/Modules/versions --------------------------
3.2.9
--------------------- /usr/local/Modules/3.2.9/modulefiles ---------------------
BEDTools/2.15.0(default) modules
PeakSeq/1.1(default) python/3.2.2
SOAP2/2.21(default) samtools/0.1.18(default)
bowtie2/2.0.0-beta5(default) stajichlab
cufflinks/1.3.0(default) subread/1.1.3(default)
matrix2png/1.2.1(default) tophat/1.4.1(default)
maui/3.3.1(default) trans-ABySS/1.2.0(default)
module-info

Using Modules

To load a module, run:

module load <software name>[/<version>]

To load the default version of the tophat module, run:

module load tophat

If a specific version of tophat is needed, 1.4.1 for example, run:

module load tophat/1.4.1

Show Loaded Modules

To show what modules you have loaded at any time, you can run:

module list

Depending on what modules you have loaded, it will produce something like this:

Currently Loaded Modulefiles:
1) maui/3.3.1 2) tophat/1.4.1 3) PeakSeq/1.1

Unloading Software

Sometimes you no longer want a piece of software in your PATH. To remove it, unload the module by running:

module unload <software name>

Additional Features

There are additional features and operations that can be done with the module command. Please run the following to get more information:

module help

Quotas

CPU

Currently, the maximum number of CPU cores a user can use simultaneously on Biocluster is 128 CPU cores when the load on the cluster is <30% and 64 CPU cores when the load is above 30%. If a user submits jobs for more than 128/64 CPU cores, the additional requests will be queued until resources within the user's CPU quota become available. Upon request, a user's upper CPU quota can be extended temporarily, but only if sufficient CPU resources are available. To avoid monopolization of the cluster by a small number of users, the high-load CPU quota of 64 cores is dynamically readjusted by an algorithm that considers the number of CPU hours accumulated by each user over a period of 3 months, along with the current overall CPU usage on the cluster. If the CPU hour average over the 3-month window exceeds an allowable amount, the default CPU quota for such a heavy user is reduced to 32 CPU cores; if it exceeds the allowable amount two-fold, it is reduced to 16 CPU cores. Once the average usage of a heavy user drops below those limits again, the upper CPU limit will be raised accordingly. Note: when the overall CPU load on the cluster is below 70%, the dynamically readjusted CPU quotas are not applied. At those low-load times every user has the same CPU quota: 128 CPU cores at <30% load and 64 CPU cores at 30-70% load.

Data Storage

A standard user account has a storage quota of 20GB. Much more storage space, in the range of many TBs, can be made available in a user account's bigdata directory. The amount of storage space available in bigdata depends on a user group's annual subscription. The pricing for extending the storage space in the bigdata directory is available here.

Memory

From the Biocluster head node users can submit jobs to the batch queue or the highmem queue. The nodes (n01-n34) associated with the batch queue are mainly for CPU intensive tasks, while the nodes (m01-m03)  of the highmem queue are dedicated to memory intensive tasks. The batch nodes have 16-64GB RAM each and the highmem nodes have 256-512GB RAM.

What's Next?

You should now know the following:

  1. Basic organization of Biocluster
  2. How to login to Biocluster
  3. How to use the Module system to gain access to Biocluster software
  4. CPU, storage, and memory limitations (quotas and hardware limits)
Now you can start using Biocluster.
The recommended way to run your jobs (scripts, pipelines, experiments, etc...) is to submit them to the queuing system by using qsub.
Biocluster uses Torque/Maui software as a PBS, Portable Batch System, queuing system.
Please do not run ANY computationally intensive tasks on the Biocluster head node. If this is done, we will have to kill your jobs, because they will slow down all other users.

However, you may run memory-intensive jobs on Owl.
Login to Owl like so:
ssh -X owl.ucr.edu

Managing Jobs

Submitting and managing jobs is at the heart of using the cluster. A 'job' refers to the script, pipeline, or experiment that you run on the nodes of the cluster.

Queues

There are two different queues available for cluster users to send jobs to:

  • batch 
    • Default queue
    • Nodes: n01-n34
    • Cores: 64 per user
    • RAM: 1GB default and 16GB max
    • Walltime (Run time): 168 hours (7 days) default
  • highmem
    • Nodes: m01 and m03
    • Cores: 32 per user
    • RAM: 16GB min and 500GB max
    • Walltime (Run time): 48 hours (2 days) default

Basic Usage

Submitting Jobs

The command used to submit jobs is qsub. There are two basic ways to submit jobs:

Using STDIN

The first way this can be done is by using a technique where you pipe in the command via STDIN.

echo <command> | qsub

For example, let's say you have a set of files in your home directory and want to run BLAST against them. You could run a command similar to the one below to have that run on a node.

echo blastall -p blastp -i myseq.fasta -d AE004437.faa -o blastp.out -e 1e-6 -v 10 -b 10 | qsub

Using a Script

When using the cluster it quickly becomes useful to be able to run multiple commands as part of a single job. To solve this we write scripts. In this case, the way it works is that we invoke the script as the last argument to qsub.

qsub <script name>

A script is just a set of commands that we want to run once the job starts. Below is an example script that does the same thing as Exercise 5 in the Linux Basics Manual.

#!/bin/bash

#PBS -M email@address.com
# Define email address for job notifications

#PBS -m abe
# Send email notification if job is (a) aborted, (b) begins, or (e) ends

# Create a directory for us to do work in.
# We are using a special variable that is set by the cluster when a job runs.
mkdir $PBS_JOBID

# Change to that new directory
cd $PBS_JOBID

# Copy the proteome of Halobacterium spec.
cp /srv/projects/db/ex/AE004437.faa .

# Do some basic analysis
# The echo command prints info to our output file
echo "How many predicted proteins are there?"
grep '^>' AE004437.faa --count
echo "How many proteins contain the pattern \"WxHxxH\" or \"WxHxxHH\"?"
egrep 'W.H..H{1,2}' AE004437.faa

# Start preparing to do a blast run
# Use awk to grab a number of proteins and then put them in a file.
echo "Generating a set of IDs"
awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' AE004437.faa | grep '^>' | awk --posix -v FS='|' '{print $4;}' > my_IDs

# Make the proteome blastable
echo "Making a blastable database"
formatdb -i AE004437.faa -p T -o

# Make blastable IDs
echo "Making a set of blastable IDs"
fastacmd -d AE004437.faa -i my_IDs > myseq.fasta

# Run blast
echo "Running blast"
blastall -p blastp -i myseq.fasta -d AE004437.faa -o blastp.out -e 1e-6 -v 10 -b 10

So if this script were called blast_AE004437.sh, we could run the following to make all of those steps happen.

qsub blast_AE004437.sh

Tracking Jobs

Now that we have a job in the queue, how do we know whether it is running? For that, there is a command called qstat, which reports the current state of all the jobs running or queued to run on the cluster. The following is an example of that output:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
467655.torque-server      ...MTcauLSS_5.sh xzhang          2562047: R batch
467660.torque-server      ...TnormLSS_5.sh xzhang          5124113: R batch
467663.torque-server      ARSUMTt3LSS_5.sh xzhang          5124113: R batch
474003.torque-server      Aedes2           bradc           1095:30: R batch
7475989.torque-server     Culex2           bradc           928:40:0 R batch
478663.torque-server      STDIN            snohzadeh       00:36:28 R batch
626327.torque-server      STDIN            wangya          11:16:38 R batch
645318.torque-server      STDIN            yfu             477:49:3 R batch
645353.torque-server      STDIN            yfu             464:31:4 R batch
655060.torque-server      newphyml.sh      nponts          364:57:5 R batch
655077.torque-server      newphyml.sh      nponts          401:32:2 R batch
655182.torque-server      newphyml.sh      nponts          396:35:2 R batch
655385.torque-server      newphyml.sh      nponts          337:29:4 R batch
655469.torque-server      newphyml.sh      nponts          146:23:5 R batch
655493.torque-server      newphyml.sh      nponts          335:05:0 R batch
655571.torque-server      newphyml.sh      nponts          358:33:5 R batch
655742.torque-server      newphyml.sh      nponts          314:08:5 R batch
655754.torque-server      newphyml.sh      nponts          299:45:2 R batch
655814.torque-server      newphyml.sh      nponts          109:59:4 R batch
655951.torque-server      newphyml.sh      nponts          268:58:3 R batch
655962.torque-server      newphyml.sh      nponts          325:04:2 R batch
656054.torque-server      newphyml.sh      nponts          277:43:3 R batch
656055.torque-server      newphyml.sh      nponts          327:37:2 R batch
656195.torque-server      newphyml.sh      nponts          270:07:2 R batch
656309.torque-server      newphyml.sh      nponts          261:18:4 R batch
656339.torque-server      newphyml.sh      nponts          306:47:0 R batch
656340.torque-server      newphyml.sh      nponts          275:05:4 R batch
656486.torque-server      newphyml.sh      nponts          259:59:3 R batch
659489.torque-server      STDIN            zwu             00:25:01 R batch
672645.torque-server      STDIN            snohzadeh       00:00:00 R batch
674351.torque-server      STDIN            yfu             165:40:5 R batch
674819.torque-server      Seqrank_CL_08.sh xzhang          115:43:3 R batch
674940.torque-server      submit_script.sh nsausman        683:14:1 R batch
675260.torque-server      ...eatModeler.sh robb            233:01:5 R batch
675266[].torque-server    sa.sh            jban            0        R batch
675275[].torque-server    LeucoMakerMrctr  hishak          0        R batch
675853[].torque-server    sa.sh            jban            0        R batch
677089.torque-server      LFPcorun.sh      jychen          57:31:33 R batch
679437.torque-server      Chr8.mergeBam.sh robb            0        Q batch
679438.torque-server      Chr9.mergeBam.sh robb            0        Q batch
679439.torque-server      Chr1.cat_fq.sh   robb            0        Q batch
679440.torque-server      Chr10.cat_fq.sh  robb            0        Q batch
679441.torque-server      Chr11.cat_fq.sh  robb            0        Q batch
679442.torque-server      Chr12.cat_fq.sh  robb            0        Q batch
... CONTINUED ...

The R in the S column means a job is running and a Q means that the job is queued waiting to run. Jobs get queued for a number of reasons, the most common are:

  1. A job scheduling run has not yet completed. Scheduling runs take place approximately every 15 seconds.
  2. The queue is at ~75% capacity and the job is requesting a significant amount of walltime.
  3. The queue is at 100% capacity and the job has no place it can be started.
  4. The job is requesting specific resources, such as 8 processors, and there is no place the system is able to fit it.
  5. The user submitting the job has reached a resource maximum for that queue and cannot start any more jobs running until other jobs have finished.

There are additional flags that can be passed to qstat to get more information about the state of the cluster, including the -u flag that will only display the status of jobs for a particular user.
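For example, to list only your own jobs you could run:

qstat -u $USER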
Once a job has finished, it will no longer show up in this listing. 

Job Results

By default, results from jobs come out in two different ways.

  1. The system sends STDOUT and STDERR to files called <job_name>.o<job_number> and <job_name>.e<job_number>.
  2. Any output created by your script, like the blastp.out in the example above.

For example if you ran the example from above and got a job number of 679746, you would end up with a file called blast_AE004437.sh.o679746 and a file called blast_AE004437.sh.e679746 in the directory where you ran qsub. Additionally, because our script creates a directory using the PBS_JOBID variable, you would have a directory in your home directory called 679746.torque01.
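As a quick sketch of how you might inspect these results (the job number 679746 is just the example used above):

# View the captured STDOUT of the job
cat blast_AE004437.sh.o679746
# View any errors captured in the STDERR file
cat blast_AE004437.sh.e679746
# Look at the output files the script itself created
ls 679746.torque01/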

Deleting Jobs

Sometimes you need to delete a job. You may need to do this if you accidentally submitted something that will run longer than you want, or perhaps you submitted the wrong script. To delete a job, you use the qdel command. If you wanted to delete job number 679440, you would run:

qdel 679440

Please be aware that you can only delete jobs that you own.

Delete all jobs of one user:

qselect -u $USER | xargs qdel

Delete all jobs running by one user:

qselect -u $USER -s R | xargs qdel

Delete all queued jobs of one user:

qselect -u $USER -s Q | xargs qdel

Advanced Usage

There are a number of additional things you can do with qsub to take better advantage of the cluster.

To view qsub options, please visit the online manual or run the following:

man qsub

Requesting Additional Resources

Frequently, there is a need to use more than one processor or to specify some amount of memory. The qsub command has a -l flag that allows you to do just that.

Example: Requesting A Single Node with 8 Processors

Let's assume that the script we used above was multi-threaded and spins up 8 different processes to do work. If you wanted to ask for the processors required to do that, you would run the following:

qsub -l nodes=1:ppn=8 blast_AE004437.sh

This tells the system that your job needs 8 processors and it allocates them to you.

Example: Requesting 16GB of RAM for a Job

Using the same script as above, let's instead assume that this is just a monolithic process, but we know that it will need about 16GB of RAM. Below is an example of how that is done:

qsub -l mem=16gb blast_AE004437.sh 

Example: Requesting 2 Weeks of Walltime for a Job

Using the same script as above, let's instead assume that it is going to run for close to 2 weeks. There are 7 days in a week and 24 hours in a day, so 2 weeks in hours is 2 * 7 * 24 = 336 hours. Below is an example of requesting that a job be allowed to run for 336 hours.

qsub -l walltime=336:00:00 blast_AE004437.sh

Example: Requesting Specific Node(s)

The following requests 8 CPU cores and 16GB of RAM on high memory node m01 for 220 hours:

qsub -q highmem -l nodes=1:ppn=8+m01,mem=16gb,walltime=220:00:00 assembly.sh

Interactive Jobs

Sometimes, when testing, it is useful to run commands interactively instead of with a script. To do this you would run:

qsub -I

Just like scripts though, you may need additional resources. To solve this, specify resources, just like you would above:

qsub -l mem=16gb -I

Array Jobs

Many tasks in Bioinformatics need to be parallelized to be efficient. One of the ways we address this is with array jobs. An array job executes the same script a number of times depending on what arguments are passed. To specify that an array should be used, you use the -t flag. For example, if you wanted a ten-element array, you would pass -t 1-10 to qsub. You can also specify arbitrary numbers in the array: assume that array elements 3 and 5-7 failed for some unknown reason in your last run; you can then specify -t 3,5-7 and run just those array elements.
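As a sketch, those two cases would be submitted like this (myscript.sh is just a placeholder name):

# Run the script as a ten-element array (elements 1 through 10)
qsub -t 1-10 myscript.sh
# Re-run only array elements 3 and 5-7
qsub -t 3,5-7 myscript.sh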

Below is an example that does the same thing as the basic example above, except that it spreads the workload across seven different processes. This technique is particularly useful when dealing with much larger datasets.

Prepare Dataset

The script below creates a working directory and builds out the usable dataset. The job is passed to qsub with no arguments.

#!/bin/bash

# Create a directory for us to do work in.
mkdir blast_AE004437

# Change to that new directory
cd blast_AE004437

# Copy the proteome of Halobacterium spec.
cp /srv/projects/db/ex/AE004437.faa .

# Do some basic analysis
# The echo command prints info to our output file
echo "How many predicted proteins are there?"
grep '^>' AE004437.faa --count
echo "How many proteins contain the pattern \"WxHxxH\" or \"WxHxxHH\"?"
egrep 'W.H..H{1,2}' AE004437.faa

# Start preparing to do a blast run
# Use awk to grab a number of proteins and then put them in a file.
echo "Generating a set of IDs"
awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' AE004437.faa | grep '^>' | awk --posix -v FS='|' '{print $4;}' > my_IDs

# Make the proteome blastable
echo "Making a blastable database"
formatdb -i AE004437.faa -p T -o

Analyze the Dataset

The script below will do the actual analysis. Assuming the name is blast_AE004437-multi.sh, the command to submit it would be qsub -t 1-7 blast_AE004437-multi.sh.

#!/bin/bash

# Specify the number of array runs. This means we are going to specify -t 1-7
# when calling qsub.
NUM=7

# Change to that new directory
cd blast_AE004437

# Do some math based on the number of runs we are going to do to figure out how
# many lines, and which lines, should be in this run.
LINES=`cat my_IDs | wc -l`
MULTIPLIER=$(( $LINES / $NUM ))
SUB=$(( $MULTIPLIER - 1 ))
END=$(( $PBS_ARRAYID * $MULTIPLIER ))
START=$(( $END - $SUB ))

# Grab the IDs that are going to be part of each blast run
awk "NR==$START,NR==$END" my_IDs > $PBS_ARRAYID.IDs

# Make blastable IDs
echo "Making a set of blastable IDs"
fastacmd -d AE004437.faa -i $PBS_ARRAYID.IDs > $PBS_ARRAYID.fasta

# Run blast
echo "Running blast"
blastall -p blastp -i $PBS_ARRAYID.fasta -d AE004437.faa -o $PBS_ARRAYID.blastp.out -e 1e-6 -v 10 -b 10

Specifying Queues

Queues provide access to additional resources or allow use of resources in different ways. To take advantage of the queues, you will need to specify the -q option with the queue name on the command line.
For example, if you would like to run a job that consumes 16GB of memory, you should submit this job to the highmem queue:
qsub -q highmem myJob.sh

Troubleshooting

If a job has not started, or is in a queued state for a long period of time, users should try the following.
Check which nodes have available processors:
pbsnodes -l free

Check how many processors are immediately available per walltime window on the batch queue:
showbf -f batch

Check earliest start and completion times (should not be infinity):
showstart JOBID

Check if a job is held:
showhold | grep JOBID

Check status of job and display reason for failure (if applicable):
checkjob JOBID

Data Storage

Biocluster users are able to check on their home and bigdata storage usage from the Biocluster Dashboard.

Storage Locations

Home Directories

Home directories are where you place the scripts and other files you are working on within Biocluster. This space is very limited; please see the Quotas section above for the space allocated per user.

Path: /rhome/<username> (ex: /rhome/tgirke)
User Availability: All Users
Node Availability: All Nodes
Quota Responsibility: User

Big Data

Big data is an area where large amounts of storage can be made available to users. A lab purchases big data space separately from access to the cluster. This space is then made available to the lab via a shared directory and individual directories for each user.

Lab Shared Space

This directory can be accessed by the lab as a whole.

Path: /shared/<labname> (ex: /shared/girkelab)
User Availability: Labs that have purchased space.
Node Availability: All Nodes
Quota Responsibility: Lab

Individual User Space

This directory can be accessed by specific lab members.

Path: /bigdata/<username> (ex: /bigdata/tgirke)
User Availability: Labs that have purchased space.
Node Availability: All Nodes
Quota Responsibility: Lab

Non-Persistent Space

Frequently, there is a need to do things like output a significant amount of intermediate data during a job, access a dataset from a faster medium than bigdata or the home directories, or write out lock files. These tasks are well suited to non-persistent spaces. Below are the filesystems available on Biocluster.

Memory Backed Space

This type of space takes away from physical memory but allows extremely fast access to the files located on it. You will need to factor in the space you are using as RAM consumption: for example, if you place a 1 GB dataset in this space, it will consume 1 GB of RAM.

Path: /dev/shm
User Availability: All Users
Node Availability: All Nodes
Quota Responsibility: N/A
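A minimal sketch of using this space inside a job script (the file names and the analysis command are placeholders):

# Copy a dataset into memory-backed space for fast access
cp ~/mydata.fasta /dev/shm/
# Run the analysis against the in-memory copy (my_analysis is a placeholder)
my_analysis /dev/shm/mydata.fasta > ~/results.out
# Free the RAM by removing the copy when done
rm /dev/shm/mydata.fasta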

Temporary Space

This is the standard space available on all Linux systems. Please be aware that it is limited to the amount of free disk space on the node you are running on.

Path: /tmp
User Availability: All Users
Node Availability: All Nodes
Quota Responsibility: N/A

SSD Backed Space

This space is much faster than the standard temporary space, but slower than memory-backed space.

Path: /scratch
User Availability: All Users
Node Availability: High Mem Nodes
Quota Responsibility: N/A

Sharing data with other users

It is useful to share data and results with other users on the cluster, and we encourage collaboration. The easiest way to share a file is to place it in a location that both users can access; the second user can then simply copy it to a location of their choice. However, this requires that the file permissions permit the second user to read the file.
Basic file permissions on Linux and other Unix-like systems are composed of three classes: owner, group, and other. These represent the permissions for different groups of people: the user who owns the file, all members of the owning group, and everyone else, respectively. Each class has 3 permissions: read, write, and execute, represented as r, w, and x. For example, the following file is owned by the user 'bragr' (with read, write, and execute), owned by the group 'operations' (with read and execute), and cannot be accessed by anyone else.

bragr@biocluster:~$ ls -l randomFileName
-rwxr-x---   1 bragr operations 1.6K Nov 19 12:32 randomFileName

If you wanted to share this file with someone outside the 'operations' group, read permission would need to be added to the file for 'other'.
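For example, a minimal way to do this with chmod:

# Grant read permission to 'other' on the file shown above
chmod o+r randomFileName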

Set Default Permissions

In Linux, it is possible to set the default file permissions for new files. This is useful if you are collaborating on a project or frequently share files and do not want to be constantly adjusting permissions. The command responsible for this is called 'umask'. You should first check what your default permissions currently are by running 'umask -S'.

bragr@biocluster:~$ umask -S
u=rwx,g=rx,o=rx

To set your default permissions, simply run umask with the correct options. Please note that this does not change permissions on any existing files, only on new files created after you update the default permissions. For instance, if you wanted to set your default permissions so that you have full control, your group can read and execute your files, and no one else has access, you would run:

bragr@biocluster:~$ umask u=rwx,g=rx,o=

It is also important to note that these settings only affect your current session. If you log out and log back in, these settings will be reset. To make your changes permanent, you need to add them to your '.bashrc' file, which is a hidden file in your home directory (if you do not have a '.bashrc' file, you will need to create an empty file called '.bashrc' in your home directory). Adding umask to your '.bashrc' file is as simple as adding your umask command (such as 'umask u=rwx,g=rx,o=r') to the end of the file. Then simply log out and back in for the changes to take effect. You can double-check that the settings have taken effect by running 'umask -S'.
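As a sketch, appending the umask command from above to your '.bashrc' could be done like this:

# Add the umask setting to the end of ~/.bashrc
echo 'umask u=rwx,g=rx,o=r' >> ~/.bashrc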

Further Reading


Copying large folders to and from Biocluster

Rsync can:
  • Copy (transfer) folders between different storage hardware
  • Perform transfers over the network via SSH
  • Compare large data sets (-n, --dry-run option)
  • Resume interrupted transfers

To perform over-the-network transfers, it is always recommended that you run the rsync command from your local machine (laptop or workstation). 

On your computer open the Terminal and run:

rsync -ai    FOLDER_A/    biocluster.ucr.edu:FOLDER_A/
Or:
rsync -ai    biocluster.ucr.edu:FOLDER_B/    FOLDER_B/

Rsync will use SSH and will ask you for your biocluster password as SSH or SCP does.
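If you want to preview what would be transferred before running the real command, you can add the dry-run option mentioned above, e.g.:

# -n (--dry-run) lists the changes rsync would make without copying anything
rsync -ain    FOLDER_A/    biocluster.ucr.edu:FOLDER_A/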

If your connection breaks, rsync can pick up where it left off - simply run the same command again.
  • Rsync does not exist on Windows. Only Mac and Linux support rsync natively.
  • Always put the / after both folder names, e.g.: FOLDER_B/. Failing to do so will result in nested folders every time you try to resume: if you don't put the /, you will get a second FOLDER_B inside FOLDER_B (FOLDER_B/FOLDER_B/).
  • Rsync does not move but only copies.
  • man rsync

Copying large folders on Biocluster between Directories

Rsync does not move but only copies. You will need to delete the original location once you confirm that everything has been transferred.

This is the rare case where you would run rsync on Biocluster rather than on your computer (laptop or workstation). The format in this case is:
rsync -ai   FOLDER_A/    X/FOLDER_A/

where X is a different folder (e.g. a Bigdata folder)
  • Once the rsync command is done, run it again. The second run will be short and is just a check: if there is no output, nothing changed, and it is safe to delete the original location.
    Specifically, running rsync a second time ensures that everything has been transferred correctly. The -i (--itemize-changes) option asks rsync to report (output) all the changes it makes to the filesystem during the sync. No output = no changes = the folder has been transferred safely.
  • All the bullets in the above section (Copying large folders to and from Biocluster) apply to this section.


Copying large folders between Biocluster and other servers

This is a very rare case where you would run rsync on Biocluster and not on your computer (laptop or workstation). The format in this case is:
rsync -ai FOLDER_A/    server2.xyz.edu:FOLDER_A/

where server2.xyz.edu is a different server that accepts SSH connections.
  • All the bullets in the above sections (Copying large folders to and from Biocluster) apply to this section.

Home Directories

Home directories are where you start each session on biocluster and where your jobs start when running on the cluster. They are automatically mounted when you log in and can be found at /rhome/<your username>.

Please remember: the default storage space quota per user account is 20 GB. A buffer of 10 GB is there to help with temporary overages but should not be used for permanent storage.

To get the current usage, run the following command in your home directory: 
du -sh .
To calculate the sizes of each separate folder in your home, run: 
du -sch ~/*
This will take some time to complete, please be patient.
For more information on your home directory, please see the Orientation section in the  Linux Basics manual.

Compression

On biocluster all data is automatically compressed with the lzjb algorithm.

The following will report the compressed file sizes:

du -shc *
For actual sizes, use the following:
du --apparent-size -shc *


Backups

Biocluster has backups but you may want to periodically make copies of your critical data to your own storage device.
Please remember, Biocluster is a production system for research computations with a very expensive high-performance SAN storage infrastructure. It is not a data archiving system. 

Databases 

Loading Databases

The NCBI, PFAM, and Uniprot databases do not need to be downloaded by users; they are installed as modules on Biocluster.
module load db-ncbi
module load db-pfam
module load db-uniprot

Specific database release numbers can be identified by the version label on the module:
module avail db-ncbi

----------------- /usr/local/Modules/3.2.9/modulefiles -----------------
db-ncbi/20140623(default)

Using Databases

In order to use a loaded database, users can simply reference the corresponding environment variable (NCBI_DB, UNIPROT_DB, PFAM_DB, etc.) to supply the proper path in their commands.
Examples:
You should avoid using this old, deprecated method; it may stop working in the near future (old BLAST):
blastall -p blastp -i proteins.fasta -d $NCBI_DB/nr -o blastp.out

You can use this method if you require the old version of BLAST (old BLAST with legacy support):
BLASTBIN=`which legacy_blast.pl | xargs dirname`
legacy_blast.pl blastall -p blastp -i proteins.fasta -d $NCBI_DB/nr -o blast.out --path $BLASTBIN

This is the preferred/recommended method (BLAST+):
blastp -query proteins.fasta -db $NCBI_DB/nr -out proteins_blastp.txt

Usually, we store the most recent release and 2-3 previous releases of each database. This way, time-consuming projects can use the same database version throughout their lifetime without constantly updating to the latest release.
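For example, if a project should keep using a particular NCBI release, you could load it by its version label (the label below is the one shown in the module avail output above):

module load db-ncbi/20140623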
Requests for additional databases should be sent to support@biocluster.ucr.edu

Parallelization Software

Introduction

The low-latency interconnect provides message latencies that average 30 µs during high loads. This interconnect delivers breakthrough performance for computational jobs that can run in parallel on multiple compute nodes but require frequent node-to-node communication.
 
Advanced parallel computing technologies for a cluster are MPI (Message Passing Interface) and PVM (Parallel Virtual Machine).

NAMD

Here is how to run a NAMD2 job on Biocluster:
  1. Log-in to Biocluster
  2. Create PBS script
    #!/bin/bash

    #PBS -N c3d_cr2_md
    #PBS -S /bin/bash
    #PBS -q batch
    #PBS -l nodes=32:ppn=1
    #PBS -l mem=16gb
    #PBS -l walltime=01:00:00

    # Load Module System
    #    You could also source this from within your .bashrc
    source /usr/local/Modules/3.2.9/init/bash

    # Load needed modules
    #    You could also load these from within your .bashrc
    module load torque
    module load openmpi
    module load namd

    # Switch to the working directory
    cd $PBS_O_WORKDIR

    # Run job utilizing all requested processors
    #    Please visit the namd site for usage details: 
    #    http://www.ks.uiuc.edu/Research/namd/
    mpirun --mca btl ^tcp namd2 run.conf &> run_bio.log
  3. Submit PBS script to PBS queuing system
    qsub run_bio.sh

MPI


Biocluster provides: OpenMPI (Launcher: mpirun), Mvapich1 (Launcher: mpirun_rsh), and Mpich1.

Many implementations of MPI exist: Open MPI, FT-MPI, LA-MPI, LAM/MPI, PACX-MPI, Adaptive MPI, MPICH, MVAPICH


If you need to compile MPI applications, use one of the MPI compilers. These compilers are wrappers around the gcc compiler.

To launch MPI applications use qsub to reserve several nodes and then use the MPI launcher program.
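A minimal sketch of that workflow, assuming a simple MPI source file hello_mpi.c (a placeholder) and the OpenMPI module listed below:

# Load an MPI implementation and compile with its wrapper compiler
module load openmpi
mpicc -o hello_mpi hello_mpi.c

# Submit a job reserving several nodes, e.g.: qsub -l nodes=2:ppn=8 run_mpi.sh
# Inside run_mpi.sh, the launch line would be something like:
#   mpirun -np 16 ./hello_mpi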

OpenMPI, Mvapich, or Mpich

Open MPI
Link:
http://www.open-mpi.org/
Compilers:
mpiCC, mpiCC-vt, mpic++, mpic++-vt , mpicc, mpicc-vt, mpicxx, mpicxx-vt , mpiexec, mpif77, mpif77-vt, mpif90, mpif90-vt
Launcher:
mpirun

MPICh1
Link: http://www.mcs.anl.gov/research/projects/mpich1/
Include directory: /opt/mpich1-1.2.1p1/include
Compilers: /opt/mpich1-1.2.1p1/bin/{mpicc,mpicxx,mpif77}
Launcher: /opt/mpich1-1.2.1p1/bin/{mpirun,mpiexec}

MVAPICh1
Link: http://mvapich.cse.ohio-state.edu
Note: MVAPICh1 runs over the InfiniBand RDMA hardware, so it is potentially faster than OpenMPI.
Include directory:
/opt/mvapich1-1.4.1/include
Compilers: /opt/mvapich1-1.4.1/bin/{mpicc,mpicxx,mpif77}
Launcher:
/opt/mvapich1-1.4.1/bin/mpirun_rsh


Example: MrBayes (Compiled against OpenMPI)
Official MrBayes Manual, Online Help, and Bug Reports: http://mrbayes.csit.fsu.edu

mkdir mrbayes-on-3nodes # Create a folder,
cd !$ # cd into it,
cp -v /srv/projects/biocluster-examples/001-mrbayes/* . # and copy the example files


./run.sh -l nodes=3:ppn=8 # Start the MPI job on 3 nodes


qstat # See your job in Torque queue (R - running, Q - queued)


# ----------- Wait for 2 minutes -----------

qstat # Check that your job is still running
cat out # See Partial Results


# ----------- Wait for 20 minutes -----------
# (If you don't want to wait run "qdel 12345" where 12345 is your job ID for qstat)

qstat # Your job is now completed
cat out # View final results

Example: MAKER (Compiled against MPICh1)
Official MAKER mailing list: http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org

mkdir maker-on-3nodes # Create a folder,

cd !$ # cd into it,
cp -v /srv/projects/biocluster-examples/002-maker/* . # and copy the example files


# Tool number 1 (clean-up)
on-all-nodes-run `pwd`/kill-all-mpd # WARNING: This command will destroy
# all Mpich mpi programs if you are currently
# running them. Use this command only when
# you need to clean-up after any unsuccessful runs.


# Tool number 2 (starting runs)
./lev2-run.sh -l nodes=3:ppn=8


# Tool number 3 (monitoring CPU)
jmonGrep $USER # This will show real-time CPU usage on all nodes that you are reserving
# You can also see nodes for a specific job from qstat, for example jmonGrep 12345
#
# NOTICE: If nothing shows up then it means that you have nothing reserved.
# To quit press 'q'


# Tool number 4 (monitoring progress)
./progress-check 


# Tool number 5 (graphing progress)
# Reference performance: afrC_sca.fa on 3 nodes in 3 hours 
# Reference performance: afrC_sca.fa on 1 node in 8 hours (see result graphs)
...

Unavailable and untested examples: HMMER, ABySS Parallel Assembler

Monitoring the Load History

Several utilities are available for obtaining information about the history of the cluster load by users, labs and individual nodes.
  1. The R Function plotCPUhours
    When this function is executed from the R console on the biocluster head node, it will return an overview of the CPU hour history by users and PIs, as well as the average load of the entire cluster. 

    source("/home/tgirke/Applications/Cluster/cpu_hour_counts/cpuhourPlot.R")
    cpuhourlist <- plotCPUhours(plotdata=T, printdata=T, mymonths="all")

  2. Cluster Load History

    CPU hours used by each Lab (Locked*)
    CPU hours used by each Compute Node (Locked*)
    CPU hours used by each User (Locked*)

    * For password please email Thomas


Communicating with Other Users

Biocluster is a shared resource. Communicating with other users can help to schedule large computations. 

Looking-Up Specific Users

A convenient overview of all users and their lab affiliations can be retrieved with the following command:
qstatMonitor

 You can generate an email list of particular users by running:
all-users | awk -F\\t '/((alevc)|(tgirk))[a-z0-9]*\t/ {print " " $1 " <" $3 ">"}'

Listing Users with Active Jobs on the Cluster

To get a list of UNIX user names:
qstat | awk '// {print $3}' | sort | uniq | grep "^[^-N]"

To get the list of real names:
grep <(all-users) -f <(qstat | awk '// {print $3}' | \
  sort | uniq | grep "^[^-N]") | awk -F\\t '// {print $1}'

To get the list of emails:
grep <(all-users) -f <(qstat | awk '// {print $3}' | \
  sort | uniq | grep "^[^-N]") | awk -F\\t '// {print " " $1 " <" $3 "> "}'

Sharing Files on the Web 

Simply create a symbolic link or move the files into your html directory when you want to share them.
For example, log into Biocluster and run the following:
# Make new web project directory
mkdir www-project

# Create a default test file
echo '<h1>Hello!</h1>' > ./www-project/index.html

# Create shortcut/link for new web project in html directory 
ln -s `pwd`/www-project ~/.html/

Now, test it out by pointing your web-browser to http://biocluster.ucr.edu/~username/www-project/
Be sure to replace 'username' with your actual user name.
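You could also check it directly from the command line on Biocluster (a quick sketch):

curl http://biocluster.ucr.edu/~$USER/www-project/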

Password Protect Web Pages

Files in web directories can be password protected.

First create a password file and then create a new user:
touch ~/.html/.htpasswd
htpasswd ~/.html/.htpasswd newwebuser
This will prompt you to enter a password for the new user 'newwebuser'.

Create a new directory, or go to an existing directory, that you want to password protect:
mkdir ~/.html/locked_dir
cd ~/.html/locked_dir
You can choose a different directory name.

Then do the following:
echo "AuthName 'Please login'
AuthType Basic
AuthUserFile /home/$USER/.html/.htpasswd
require user newwebuser" > .htaccess

Now, test it out by pointing your web-browser to http://biocluster.ucr.edu/~username/locked_dir
Be sure to replace 'username' with your actual user name.

List of Installed Software

Systems Software

Software from Debian package repository

Version lookup:
dpkg -S `which any-command` | awk -F: '{print $1}' | xargs aptitude show
For example:
dpkg -S `which blastall` | awk -F: '{print $1}' | xargs aptitude show



Software from source code

Version lookup:
git --git-dir=/usr/local/.git log -- bin/any-command
For example:
git --git-dir=/usr/local/.git log -- bin/soap

R libraries 

Version Lookup (in R):
packageVersion("any-package")
For example:
packageVersion("GOstats")
The output should be similar to the following:
[1] '2.26.0'

Here is a list of all currently installed R libraries:
