IIGB Linux Cluster

Contents

  1 Introduction
    1.1 Biocluster Infrastructure
  2 Quotas
    2.1 CPU
    2.2 Data Storage
    2.3 Memory
  3 Login
    3.1 Login from Mac or Linux
    3.2 Login from Windows
    3.3 What's Next?
    3.4 Modules
    3.5 Listing Available Software
    3.6 Using The Software
    3.7 Showing What Software is Loaded
    3.8 Unloading Software
    3.9 Additional Features
  4 Managing Jobs
    4.1 Usage Guidelines
    4.2 Basic Usage
      4.2.1 Submitting Jobs
        4.2.1.1 Using STDIN
        4.2.1.2 Using a Script
      4.2.2 Tracking Jobs
      4.2.3 Job Results
      4.2.4 Deleting Jobs
    4.3 Advanced Usage
      4.3.1 Requesting Additional Resources
        4.3.1.1 Example: Requesting A Single Node with 8 Processors
        4.3.1.2 Example: Requesting 16GB of RAM for a Job
        4.3.1.3 Example: Requesting 2 Weeks of Walltime for a Job
        4.3.1.4 Example: Requesting Specific Node(s)
      4.3.2 Interactive Jobs
      4.3.3 Array Jobs
      4.3.4 Using the highmem and lowprio Queues
  5 Data Storage
    5.1 Storage Locations
      5.1.1 Home Directories
      5.1.2 Big Data
        5.1.2.1 Lab Shared Space
        5.1.2.2 Individual User Space
      5.1.3 Non-Persistent Space
        5.1.3.1 Memory Backed Space
        5.1.3.2 Temporary Space
        5.1.3.3 SSD Backed Space
    5.2 Sharing data with other users
      5.2.1 Set Default Permissions
      5.2.2 Further Reading
    5.3 Copying large folders to and from Biocluster
    5.4 Copying large folders on Biocluster between Directories
    5.5 Copying large folders between Biocluster and other servers
    5.6 Home Directories
    5.7 Compression
    5.8 Backups
  6 Databases
    6.1 Introduction
    6.2 Locations of Database Files
  7 Parallelization Software
    7.1 Introduction
    7.2 Charmrun
      7.2.1 Example: NAMD
    7.3 MPI
      7.3.1 OpenMPI, Mvapich, or Mpich
        7.3.1.1 Example: MrBayes
        7.3.1.2 Example: MAKER
        7.3.1.3 Unavailable and untested examples: HMMER, ABySS Parallel Assembler
  8 Monitoring the Load History
  9 Communicating with Other Users
  10 Sharing Files on the Web
    10.1 Password Protect Web Pages
  11 List of Installed Software
    11.1 Systems Software
    11.2 Software from Debian package repository
    11.3 Software from source code
    11.4 R libraries
    11.5 Password Change

Introduction

This manual provides a brief introduction to using IIGB's Linux clusters. All servers and compute clusters of the IIGB bioinformatics facility are available to researchers from all departments and colleges at UC Riverside for a minimal recharge fee (see rates). The latest hardware/facility description for grant applications is available here: Facility Description [pdf]. To request an account on one of our systems, please contact Thomas Girke (thomas.girke@ucr.edu). The configuration of our most popular Linux cluster, called "biocluster", is outlined below.





Biocluster Infrastructure

Hardware
  • Head node biocluster: 64 GB memory
  • Secondary head node owl: 16 cores, 64 GB memory
  • High-memory nodes m01-m03: each 32-64 cores and 252-512 GB memory
  • Compute nodes n01-n32: each 16 cores, each 16GB memory
  • Compute nodes n33, n34: each 48 cores, each 64GB memory
  • Storage: ~200 TB of network attached storage
  • Cluster network: 20 Gb/s InfiniBand
File System
  • File system: ZFS over NFS
  • Online compression
Operating System
  • Debian Squeeze
Queuing System
  • Torque and Maui
The following image shows a real-time snapshot of the user activity on biocluster:

[Image: qstatMonitor_Report]

Quotas

CPU

Currently, a user can use at most 128 CPU cores simultaneously on biocluster when the load on the cluster is below 30%, and at most 64 CPU cores when the load is above 30%. If a user submits jobs for more than 128/64 CPU cores, the additional requests are queued until resources within the user's CPU quota become available. Upon request, a user's upper CPU quota can be extended temporarily, but only if sufficient CPU resources are available.

To avoid monopolisation of the cluster by a small number of users, the high-load CPU quota of 64 cores is dynamically readjusted by an algorithm that considers the number of CPU hours accumulated by each user over a period of 3 months along with the current overall CPU usage on the cluster. If a user's CPU hour average over the 3-month window exceeds an allowable amount, that user's default CPU quota is reduced to 32 CPU cores; if it exceeds the allowable amount two-fold, the quota is reduced to 16 CPU cores. Once the average usage of a heavy user drops below those limits again, the upper CPU limit is raised accordingly.

Note: when the overall CPU load on the cluster is below 70%, the dynamically readjusted CPU quotas are not applied. At those low-load times every user has the same CPU quota: 128 CPU cores at <30% load and 64 CPU cores at 30-70% load.
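The quota tiers described above can be summarized in a small shell function. This is only an illustration of the rules in this section, not the scheduler's actual algorithm. The function name cpu_quota and its second argument (a user's 3-month CPU hour average expressed as a percentage of the allowable amount) are assumptions made for this example, as is the treatment of a load of exactly 70% as high load.

```shell
#!/bin/bash
# cpu_quota LOAD_PCT USAGE_PCT  (hypothetical helper, for illustration only)
#   LOAD_PCT   overall cluster CPU load in percent (integer)
#   USAGE_PCT  the user's 3-month CPU hour average as a percentage of the
#              allowable amount (assumed representation)
cpu_quota() {
    local load_pct=$1 usage_pct=$2
    if [ "$load_pct" -lt 30 ]; then
        echo 128                      # low load: same quota for everyone
    elif [ "$load_pct" -lt 70 ]; then
        echo 64                       # medium load: same quota for everyone
    elif [ "$usage_pct" -gt 200 ]; then
        echo 16                       # high load, more than two-fold over
    elif [ "$usage_pct" -gt 100 ]; then
        echo 32                       # high load, over the allowable amount
    else
        echo 64                       # high load, normal user
    fi
}

cpu_quota 20 250   # light load: heavy users are not restricted
cpu_quota 85 150   # high load, usage above the allowable amount
```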

Data Storage

A standard user account has a storage quota of 20GB. Much more storage space, in the range of many TBs, can be made available in a user account's bigdata directory. The amount of storage space available in bigdata depends on a user group's annual subscription. The pricing for extending the storage space in the bigdata directory is available here.

Memory

From the biocluster head node users can submit jobs to the batch queue or the highmem queue. The nodes (n01-n34) associated with the batch queue are mainly for CPU intensive tasks, while the nodes (m01-m03) of the highmem queue are dedicated to memory intensive tasks. The batch nodes have 16-64GB RAM each and the highmem nodes have 256-512GB RAM.

Login

The initial login brings users to the biocluster head node. From there, users can submit jobs via qsub to the compute nodes or log into owl to perform memory intensive tasks. Since all machines mount a centralized file system, users will always see the same home directory on all systems. Therefore, there is no need to copy files from one machine to another.

Login from Mac or Linux

Open the terminal and type:

ssh -X username@biocluster.ucr.edu

Login from Windows

Please refer to the login instructions of our Linux Basics manual.

What's Next?

After logging in, users have the following options for submitting compute tasks:

  • Submit computationally intensive tasks via qsub to the cluster nodes.
  • Run memory intensive jobs on owl. The command 'ssh owl' logs users into this dedicated high-memory node.
  • Please do not run ANY computationally intensive tasks on the head node. If this is done, we will have to kill your jobs, because they will slow down all other users.

Modules

The modules system is a way to easily load software into your path. This approach has a number of advantages, including allowing multiple versions of the same software to be installed at any given time.

Listing Available Software

To list the available software run:

module avail

This should output something like:

------------------------- /usr/local/Modules/versions --------------------------
3.2.9

--------------------- /usr/local/Modules/3.2.9/modulefiles ---------------------
BEDTools/2.15.0(default)     modules
PeakSeq/1.1(default)         python/3.2.2
SOAP2/2.21(default)          samtools/0.1.18(default)
bowtie2/2.0.0-beta5(default) stajichlab
cufflinks/1.3.0(default)     subread/1.1.3(default)
matrix2png/1.2.1(default)    tophat/1.4.1(default)
maui/3.3.1(default)          trans-ABySS/1.2.0(default)
module-info

Using The Software

To load a module into your path, run:

module load <software name>[/<version>]

You only need to add the version if you want to select a version other than the default. So, if I wanted to load tophat, I would run:

module load tophat

If I wanted to specifically load version 1.4.1, I would run:

module load tophat/1.4.1

Showing What Software is Loaded

To show what modules you have loaded at any time, you can run:

module list

Depending on what modules you have loaded, it will produce something like this:

Currently Loaded Modulefiles:
  1) maui/3.3.1     2) tophat/1.4.1   3) PeakSeq/1.1

Unloading Software

Sometimes you no longer want a piece of software in your path. To remove it, unload the module by running:

module unload <software name>

Additional Features

There are additional features and operations that can be done with the module command. Please run the following to get more information:

module help

Managing Jobs

Submitting and managing jobs is at the heart of using the cluster. Jobs are the way that you run software on the nodes in the cluster.

Usage Guidelines

There are a number of different queues available to cluster users. Below is a list of the resource limitations associated with each:

  • batch - This is the default queue. Users may occupy no more than 64 processors at a time. This queue uses nodes n01-n34. A default walltime of 168 hours (7 days) is set on all jobs started in this queue. To use more than this, please use the -l flag in qsub.
  • loprio - This queue is for use with lots of short lived jobs. It has a wall clock limit of 10 minutes and the priority is set so that jobs in the batch queue will run before jobs in this queue. There is no limit on the processors that can be used. This queue uses nodes n01-n34.
  • highmem - This queue is for jobs that require 16GB or more of RAM. By default, all jobs will be set to a limit of 16GB of RAM and a wall clock limit of 6 hours. To use more than this, please use the -l flag in qsub.

Basic Usage

Submitting Jobs

The command used to submit jobs is qsub. There are two basic ways to submit jobs:

Using STDIN

The first way is to pipe the command to qsub via STDIN.

echo <command> | qsub

For example, let's say you have a set of files in your home directory and want to run blast against them. You could run a command similar to the one below to have that run on a node.

echo blastall -p blastp -i myseq.fasta -d AE004437.faa -o blastp.out -e 1e-6 -v 10 -b 10 | qsub
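When the command contains shell metacharacters such as > or |, it needs to be quoted so that they are passed on to the job instead of being interpreted by your login shell. The sketch below illustrates this with a hypothetical helper called submit, which pipes to qsub when qsub is available and otherwise only prints what would be submitted. The output file name protein_count.txt is a placeholder.

```shell
#!/bin/bash
# submit: hypothetical helper. Pipes a command line to qsub when qsub is
# available; otherwise prints what would be submitted (useful for testing
# on a machine without the queuing system).
submit() {
    if command -v qsub >/dev/null 2>&1; then
        echo "$1" | qsub
    else
        echo "would submit: $1"
    fi
}

# Quote the whole command so that the redirect is part of the job and is
# not applied to the echo in your login shell.
submit "grep -c '^>' AE004437.faa > protein_count.txt"
```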

Using a Script

When using the cluster it quickly becomes useful to be able to run multiple commands as part of a single job. To solve this we write scripts. In this case, the way it works is that we invoke the script as the last argument to qsub.

qsub <script name>

A script is just a set of commands that we want to run once the job starts. Below is an example script that does the same thing that we do in Exercise 5 of the Linux Basics manual.

#!/bin/bash

# Create a directory for us to do work in.
# We are using a special variable that is set by the cluster when a job runs.
mkdir $PBS_JOBID

# Change to that new directory
cd $PBS_JOBID

# Copy the proteome of Halobacterium spec.
cp /srv/projects/db/ex/AE004437.faa .

# Do some basic analysis

# The echo command prints info to our output file
echo "How many predicted proteins are there?"
grep '^>' AE004437.faa --count

echo "How many proteins contain the pattern \"WxHxxH\" or \"WxHxxHH\"?"
egrep 'W.H..H{1,2}' AE004437.faa

# Start preparing to do a blast run

# Use awk to grab a number of proteins and then put them in a file.
echo "Generating a set of IDs"
awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' AE004437.faa | grep '^>' | awk --posix -v FS='|' '{print $4;}' > my_IDs

# Make the proteome blastable
echo "Making a blastable database"
formatdb -i AE004437.faa -p T -o

# Make blastable IDs
echo "Making a set of blastable IDs"
fastacmd -d AE004437.faa -i my_IDs > myseq.fasta

# Run blast
echo "Running blast"
blastall -p blastp -i myseq.fasta -d AE004437.faa -o blastp.out -e 1e-6 -v 10 -b 10

So if this script was called blast_AE004437.sh we could run the following to make all of those steps happen.

qsub blast_AE004437.sh

Tracking Jobs

Now that we have a job in the queue, how do I know if it is running? For that, there is a command called qstat. The command qstat will provide you with the current state of all the jobs running or queued to run on the cluster. The following is an example of that output:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
467655.torque-server       ...MTcauLSS_5.sh xzhang          2562047: R batch          
467660.torque-server       ...TnormLSS_5.sh xzhang          5124113: R batch          
467663.torque-server       ARSUMTt3LSS_5.sh xzhang          5124113: R batch          
474003.torque-server       Aedes2           bradc           1095:30: R batch          
7475989.torque-server       Culex2           bradc           928:40:0 R batch          
478663.torque-server       STDIN            snohzadeh       00:36:28 R batch          
626327.torque-server       STDIN            wangya          11:16:38 R batch          
645318.torque-server       STDIN            yfu             477:49:3 R batch          
645353.torque-server       STDIN            yfu             464:31:4 R batch          
655060.torque-server       newphyml.sh      nponts          364:57:5 R batch          
655077.torque-server       newphyml.sh      nponts          401:32:2 R batch          
655182.torque-server       newphyml.sh      nponts          396:35:2 R batch          
655385.torque-server       newphyml.sh      nponts          337:29:4 R batch          
655469.torque-server       newphyml.sh      nponts          146:23:5 R batch          
655493.torque-server       newphyml.sh      nponts          335:05:0 R batch          
655571.torque-server       newphyml.sh      nponts          358:33:5 R batch          
655742.torque-server       newphyml.sh      nponts          314:08:5 R batch          
655754.torque-server       newphyml.sh      nponts          299:45:2 R batch          
655814.torque-server       newphyml.sh      nponts          109:59:4 R batch          
655951.torque-server       newphyml.sh      nponts          268:58:3 R batch          
655962.torque-server       newphyml.sh      nponts          325:04:2 R batch          
656054.torque-server       newphyml.sh      nponts          277:43:3 R batch          
656055.torque-server       newphyml.sh      nponts          327:37:2 R batch          
656195.torque-server       newphyml.sh      nponts          270:07:2 R batch          
656309.torque-server       newphyml.sh      nponts          261:18:4 R batch          
656339.torque-server       newphyml.sh      nponts          306:47:0 R batch          
656340.torque-server       newphyml.sh      nponts          275:05:4 R batch          
656486.torque-server       newphyml.sh      nponts          259:59:3 R batch          
659489.torque-server       STDIN            zwu             00:25:01 R batch          
672645.torque-server       STDIN            snohzadeh       00:00:00 R batch          
674351.torque-server       STDIN            yfu             165:40:5 R batch          
674819.torque-server       Seqrank_CL_08.sh xzhang          115:43:3 R batch          
674940.torque-server       submit_script.sh nsausman        683:14:1 R batch          
675260.torque-server       ...eatModeler.sh robb            233:01:5 R batch          
675266[].torque-server     sa.sh            jban                   0 R batch          
675275[].torque-server     LeucoMakerMrctr  hishak                 0 R batch          
675853[].torque-server     sa.sh            jban                   0 R batch          
677089.torque-server       LFPcorun.sh      jychen          57:31:33 R batch          
679437.torque-server       Chr8.mergeBam.sh robb                   0 Q batch          
679438.torque-server       Chr9.mergeBam.sh robb                   0 Q batch          
679439.torque-server       Chr1.cat_fq.sh   robb                   0 Q batch          
679440.torque-server       Chr10.cat_fq.sh  robb                   0 Q batch          
679441.torque-server       Chr11.cat_fq.sh  robb                   0 Q batch          
679442.torque-server       Chr12.cat_fq.sh  robb                   0 Q batch          
... CONTINUED ...

The R in the S column means a job is running and a Q means that the job is queued waiting to run. Jobs get queued for a number of reasons, the most common are:

  1. A scheduling run has not yet occurred to start the job. Scheduling runs take place approximately every 15 seconds.
  2. The queue is at 100% capacity and the job has no place it can be started.
  3. The user submitting the job has reached a resource maximum for that queue and cannot start any more jobs running until other jobs have finished.
  4. The job is requesting specific resources, such as 8 processors, and there is no place the system is able to fit it.

Once a job has finished, it will no longer show up in this listing. There are additional flags that can be passed to qstat to get more information about the state of the cluster, including the -u flag that will only display the status of jobs for a particular user. Please see the qstat manual by running man qstat or by visiting http://www.clusterresources.com/torquedocs21/commands/qstat.shtml.
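For a quick overview of a long listing, the qstat output can be summarized with standard tools. The following is a minimal sketch, assuming the six-column layout shown above and job names without spaces; the helper name count_job_states is made up for this example.

```shell
#!/bin/bash
# count_job_states: hypothetical helper that reads qstat output on STDIN and
# prints the number of jobs per user and state (R = running, Q = queued).
count_job_states() {
    # Skip the two header lines; column 3 is the user, column 5 the state.
    awk 'NR > 2 && NF >= 6 { n[$3 " " $5]++ }
         END { for (k in n) print k, n[k] }' | sort
}

# Usage on the cluster: qstat | count_job_states
# Demonstration with a few lines copied from the listing above:
count_job_states <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
478663.torque-server       STDIN            snohzadeh       00:36:28 R batch
677089.torque-server       LFPcorun.sh      jychen          57:31:33 R batch
679437.torque-server       Chr8.mergeBam.sh robb                   0 Q batch
679438.torque-server       Chr9.mergeBam.sh robb                   0 Q batch
EOF
```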

Job Results

By default, results from the jobs come out two different ways.

  1. The system sends STDOUT and STDERR to files called <job_name>.o<job_number> and <job_name>.e<job_number>.
  2. Any output created by your script, like the blastp.out in the example above.

For example, if you ran the script from above and got a job number of 679746, you would end up with a file called blast_AE004437.sh.o679746 and a file called blast_AE004437.sh.e679746 in the directory where you ran qsub. Additionally, because our script creates a directory using the PBS_JOBID variable, you would have a directory in your home directory called 679746.torque-server.

Deleting Jobs

Sometimes you need to delete a job. You may need to do this if you accidentally submitted something that will run longer than you want, or perhaps you accidentally submitted the wrong script. To delete a job, use the qdel command. If you wanted to delete job number 679440, you would run:

qdel 679440

Please be aware that you can only delete jobs that you own.

Delete all jobs of one user:

qselect -u $USER | xargs qdel

Delete all running jobs of one user:

qselect -u $USER -s R | xargs qdel

Delete all queued jobs of one user:

qselect -u $USER -s Q | xargs qdel

Advanced Usage

There are a number of additional things you can do with qsub to take better advantage of the cluster. Below are a few specific examples, but it is worth looking at the manual by running man qsub or by visiting http://www.clusterresources.com/torquedocs21/commands/qsub.shtml.

Requesting Additional Resources

Frequently, there is a need to use more than one processor or to specify some amount of memory. The qsub command has a -l flag that allows you to do just that.

Example: Requesting A Single Node with 8 Processors

Let's assume that the script we used above was multi-threaded and spins up 8 different processes to do work. If you wanted to ask for the processors required to do that, you would run the following:

qsub -l nodes=1:ppn=8 blast_AE004437.sh

This tells the system that your job needs 8 processors and it allocates them to you.

Example: Requesting 16GB of RAM for a Job

Using the same script as above, let's instead assume that this is just a monolithic process, but we know that it will need about 16GB of RAM. Below is an example of how that is done:

qsub -l mem=16gb blast_AE004437.sh 

Example: Requesting 2 Weeks of Walltime for a Job

Using the same script as above, let's instead assume that it is going to run for close to 2 weeks. There are 7 days in a week and 24 hours in a day, so 2 weeks is 2 * 7 * 24 = 336 hours. Below is an example of requesting that a job can run for 336 hours.

qsub -l walltime=336:00:00 blast_AE004437.sh

Example: Requesting Specific Node(s)

The following requests 8 CPU cores and 16GB of RAM on high memory node m01 for 220 hours:

qsub -q highmem -l nodes=1:ppn=8+m01,mem=16gb,walltime=220:00:00 assembly.sh
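Resource requests can also be stored in the job script itself: Torque reads lines starting with #PBS at the top of a script as if they were options given to qsub. The sketch below combines the requests from the first three examples. The job name blast_example and the function job_banner are placeholders for this illustration.

```shell
#!/bin/bash
#PBS -N blast_example
#PBS -l nodes=1:ppn=8
#PBS -l mem=16gb
#PBS -l walltime=336:00:00

# To bash the #PBS lines are ordinary comments; only qsub interprets them.
# PBS_JOBID and PBS_O_WORKDIR are set by Torque inside a job. The fallbacks
# below merely allow the script to be tried outside the queue.
job_banner() {
    echo "Job ${PBS_JOBID:-not-in-queue} started in ${PBS_O_WORKDIR:-$PWD}"
}

job_banner
```

Submitted with a plain qsub <script name>, such a script should request the same resources as passing the corresponding -l options on the command line.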

Interactive Jobs

Sometimes, when testing, it is useful to run commands interactively instead of with a script. To do this you would run:

qsub -I

Just like scripts though, you may need additional resources. To solve this, specify resources, just like you would above:

qsub -l mem=16gb -I

Array Jobs

Many tasks in bioinformatics need to be parallelized to be efficient. One of the ways we address this is using array jobs. An array job executes the same script a number of times, depending on the arguments passed. To specify that an array should be used, use the -t flag. For example, if you wanted a ten-element array, you would pass -t 1-10 to qsub. You can also specify arbitrary numbers in the array. For instance, if elements 3 and 5-7 failed for some unknown reason in your last run, you can specify -t 3,5-7 to run just those array elements.

Below is an example that does the same thing as the basic example above, except that it spreads the workload across seven different processes. This technique is particularly useful when dealing with much larger datasets.

Prepare Dataset

The script below creates a working directory and builds out the usable dataset. The job is passed to qsub with no arguments.

#!/bin/bash

# Create a directory for us to do work in.
# A fixed name is used here (instead of $PBS_JOBID) so that the analysis
# script below can change into the same directory.

mkdir blast_AE004437

# Change to that new directory
cd blast_AE004437

# Copy the proteome of Halobacterium spec.
cp /srv/projects/db/ex/AE004437.faa .

# Do some basic analysis

# The echo command prints info to our output file
echo "How many predicted proteins are there?"
grep '^>' AE004437.faa --count

echo "How many proteins contain the pattern \"WxHxxH\" or \"WxHxxHH\"?"
egrep 'W.H..H{1,2}' AE004437.faa

# Start preparing to do a blast run

# Use awk to grab a number of proteins and then put them in a file.
echo "Generating a set of IDs"
awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' AE004437.faa | grep '^>' | awk --posix -v FS='|' '{print $4;}' > my_IDs

# Make the proteome blastable
echo "Making a blastable database"
formatdb -i AE004437.faa -p T -o

Analyze the Dataset

The script below will do the actual analysis. Assuming the name is blast_AE004437-multi.sh, the command to submit it would be qsub -t 1-7 blast_AE004437-multi.sh.

#!/bin/bash

# Specify the number of array runs. This means we are going to specify -t 1-7
# when calling qsub.
NUM=7

# Change to that new directory
cd blast_AE004437

# Do some math based on the number of runs we are going to do to figure out how
# many lines, and which lines should be in this run.
LINES=`cat my_IDs | wc -l`
MULTIPLIER=$(( $LINES / $NUM ))
SUB=$(( $MULTIPLIER - 1 ))

END=$(( $PBS_ARRAYID * $MULTIPLIER ))
START=$(( $END - $SUB ))

# Grab the IDs that are going to be part of each blast run
awk "NR==$START,NR==$END" my_IDs > $PBS_ARRAYID.IDs

# Make blastable IDs
echo "Making a set of blastable IDs"
fastacmd -d AE004437.faa -i $PBS_ARRAYID.IDs > $PBS_ARRAYID.fasta

# Run blast
echo "Running blast"
blastall -p blastp -i $PBS_ARRAYID.fasta -d AE004437.faa -o $PBS_ARRAYID.blastp.out -e 1e-6 -v 10 -b 10
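Note that the integer division $LINES / $NUM in the script above rounds down, so when the number of IDs is not a multiple of 7, the last few IDs are not assigned to any array element. One possible way around this is to round the chunk size up and cap the last chunk. The function chunk_bounds below is a hypothetical helper written for this illustration, not part of the original scripts.

```shell
#!/bin/bash
# chunk_bounds TOTAL CHUNKS INDEX
# Prints "START END" (1-based line numbers) for array element INDEX.
# Rounding the chunk size up makes the chunks cover every line; the last
# elements may receive fewer lines (START > END means an empty chunk).
chunk_bounds() {
    local total=$1 chunks=$2 index=$3
    local size=$(( (total + chunks - 1) / chunks ))
    local start=$(( (index - 1) * size + 1 ))
    local end=$(( index * size ))
    if [ "$end" -gt "$total" ]; then
        end=$total
    fi
    echo "$start $end"
}

# 100 IDs split over 7 array elements
for i in 1 2 3 4 5 6 7; do
    echo "element $i: $(chunk_bounds 100 7 "$i")"
done
```

Inside an array job, the result could be read into the existing variables with read START END <<< "$(chunk_bounds "$LINES" "$NUM" "$PBS_ARRAYID")".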

Using the highmem and lowprio Queues

These queues provide access to additional resources or allow use of resources in different ways. To take advantage of the queues, you will need to specify the -q flag with the queue name on the command line.

qsub -q highmem

or

qsub -q lowprio

Data Storage

Storage Locations

Home Directories

Home directories are where you keep your scripts and the other files you are working on while using biocluster. This space is very limited. Please see the Quotas section above for the space that is allocated per user.

Path: /rhome/<username> (e.g. /rhome/tgirke)
User Availability: All Users
Node Availability: All Nodes
Quota Responsibility: User

Big Data

Big data is an area where large amounts of storage can be made available to users. A lab purchases big data space separately from access to the cluster. This space is then made available to the lab via a shared directory and individual directories for each user.

Lab Shared Space

This directory can be accessed by the lab as a whole.

Path: /shared/<labname> (e.g. /shared/girkelab)
User Availability: Labs that have purchased space
Node Availability: All Nodes
Quota Responsibility: Lab

Individual User Space

This directory can be accessed by specific lab members.

Path: /bigdata/<username> (e.g. /bigdata/tgirke)
User Availability: Labs that have purchased space
Node Availability: All Nodes
Quota Responsibility: Lab

Non-Persistent Space

Frequently, there is a need to output a significant amount of intermediate data during a job, to access a dataset from a faster medium than bigdata or the home directories, or to write out lock files. These tasks are well suited to non-persistent spaces. Below are the non-persistent filesystems available on biocluster.

Memory Backed Space

This type of space takes away from physical memory but allows extremely fast access to the files located on it. You will need to factor in the space you are using in RAM as well. For example, if you have a dataset that is 1G in size and use this space, it will take 1G of RAM.

Path: /dev/shm
User Availability: All Users
Node Availability: All Nodes
Quota Responsibility: N/A

Temporary Space

This is the standard space available on all Linux systems. Please be aware that it is limited to the amount of free disk space on the node you are running on.

Path: /tmp
User Availability: All Users
Node Availability: All Nodes
Quota Responsibility: N/A

SSD Backed Space

This space is much faster than the standard temporary space, but slower than memory backed space.

Path: /scratch
User Availability: All Users
Node Availability: High Mem Nodes
Quota Responsibility: N/A
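Because these spaces are shared by all jobs on a node, it is good practice for a job to work in its own subdirectory and to clean up after itself. The following is a minimal sketch of that pattern. The helper name pick_scratch_base, the order of preference (/scratch, then /tmp) and the file name part1.tmp are assumptions made for this example.

```shell
#!/bin/bash
# pick_scratch_base: hypothetical helper that prints the first writable
# non-persistent location. The order of preference is an assumption.
pick_scratch_base() {
    local dir
    for dir in /scratch /tmp; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

base=$(pick_scratch_base) || { echo "no scratch space found" >&2; exit 1; }

# Create a private working directory and remove it when the script exits,
# so that intermediate files do not pile up on the node.
workdir=$(mktemp -d "$base/${USER:-user}.XXXXXX")
trap 'rm -rf "$workdir"' EXIT

echo "intermediate data" > "$workdir/part1.tmp"
ls "$workdir"
```

Results that need to be kept should be copied to the home or bigdata directory before the script ends, since the working directory is deleted on exit.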

Sharing data with other users

It is useful to share data and results with other users on the cluster, and we encourage collaboration. The easiest way to share a file is to place it in a location that both users can access. Then the second user can simply copy it to a location of their choice. However, this requires that the file permissions permit the second user to read the file.
Basic file permissions on Linux and other Unix-like systems are composed of three groups: owner, group, and other. Each one of these represents the permissions for a different set of people: the user who owns the file, all the members of the group that owns the file, and everyone else, respectively. Each group has 3 permissions: read, write, and execute, represented as r, w, and x. For example, the following file is owned by the user 'bragr' (with read, write, and execute), owned by the group 'operations' (with read and execute), and everyone else cannot access it.

bragr@biocluster:~$ ls -l randomFileName
-rwxr-x---   1 bragr operations 1.6K Nov 19 12:32 randomFileName

If you wanted to share this file with someone outside the 'operations' group, read permissions must be added to the file for 'other'.
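As an illustration, read permission for 'other' can be added with chmod. The sketch below recreates the permissions from the listing above on a throwaway file in a temporary directory (the directory is an assumption for the demonstration, so that no real file is touched) and then adds the read bit.

```shell
#!/bin/bash
# Work on a throwaway file so that no real data is touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/randomFileName"

# Start from the permissions in the listing above: rwxr-x---
chmod 750 "$demo_dir/randomFileName"

# Add read permission for everyone else ("other").
chmod o+r "$demo_dir/randomFileName"

# The first ten characters of ls -l show the resulting mode: -rwxr-xr--
ls -l "$demo_dir/randomFileName" | cut -c1-10

# Remove the demonstration directory when done: rm -rf "$demo_dir"
```

Keep in mind that the other user also needs permission to enter the directories that contain the file.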

Set Default Permissions

In Linux, it is possible to set the default file permissions for new files. This is useful if you are collaborating on a project, or frequently share files and do not want to be constantly adjusting permissions. The command responsible for this is called 'umask'. You should first check what your default permissions currently are by running 'umask -S'.

bragr@biocluster:~$ umask -S
u=rwx,g=rx,o=rx

To set your default permissions, simply run umask with the correct options. Please note that this does not change permissions on any existing files, only on new files created after you update the default permissions. For instance, if you wanted to set your default permissions so that you have full control, your group can read and execute your files, and no one else has access, you would run:

bragr@biocluster:~$ umask u=rwx,g=rx,o=

It is also important to note that these settings only affect your current session. If you log out and log back in, these settings will be reset. To make your changes permanent you need to add them to your '.bashrc' file, which is a hidden file in your home directory (if you do not have a '.bashrc' file, you will need to create an empty file called '.bashrc' in your home directory). Adding umask to your .bashrc file is as simple as adding your umask command (such as 'umask u=rwx,g=rx,o=r') to the end of the file. Then simply log out and back in for the changes to take effect. You can double check that the settings have taken effect by running 'umask -S'.
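The effect of a umask on newly created files and directories can be tried out safely in a subshell, as in the following sketch. The temporary directory and the names new_file and new_dir are placeholders for the demonstration.

```shell
#!/bin/bash
demo_dir=$(mktemp -d)

# Run in a subshell so that the umask change does not leak into this session.
(
    umask u=rwx,g=rx,o=
    touch "$demo_dir/new_file"
    mkdir "$demo_dir/new_dir"
)

# Under this umask, new files get rw-r----- and new directories rwxr-x---.
# (Plain files are not created with the execute bit, whatever the umask.)
ls -ld "$demo_dir/new_file" "$demo_dir/new_dir" | cut -c1-10

# Remove the demonstration directory when done: rm -rf "$demo_dir"
```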

Further Reading


Copying large folders to and from Biocluster

Rsync can:
  • Copy (transfer) folders between different storage hardware
  • Perform transfers over the network via SSH
  • Compare large data sets (-n, --dry-run option)
  • Resume interrupted transfers

To perform over-the-network transfers, it is always recommended that you run the rsync command from your local machine (laptop or workstation). 

On your computer open the Terminal and run:

rsync -ai    FOLDER_A/    biocluster.ucr.edu:FOLDER_A/
or
rsync -ai    biocluster.ucr.edu:FOLDER_B/    FOLDER_B/

Rsync will use SSH and will ask you for your biocluster password as SSH or SCP does.

If your connection breaks, rsync can pick up where it left off; simply run the same command again.
  • Rsync does not exist on Windows. Only Mac and Linux support rsync natively.
  • Always put the / after both folder names, e.g. FOLDER_B/. Failing to do so will nest the folders every time you try to resume: without the / you will get a second FOLDER_B inside FOLDER_B (FOLDER_B/FOLDER_B/).
  • Rsync does not move but only copies.
  • man rsync
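Since a missing trailing / is an easy mistake to make, the folder names can be normalized before the command is built. The sketch below only prints an rsync dry-run command (-n) instead of running it; the helper names with_slash and rsync_preview are made up for this example.

```shell
#!/bin/bash
# with_slash: append a trailing / unless the path already ends with one.
with_slash() {
    case "$1" in
        */) echo "$1" ;;
        *)  echo "$1/" ;;
    esac
}

# rsync_preview: hypothetical helper that prints (does not run) an rsync
# dry-run command with both folder names normalized.
rsync_preview() {
    echo "rsync -ain $(with_slash "$1") $(with_slash "$2")"
}

rsync_preview FOLDER_A biocluster.ucr.edu:FOLDER_A
```

Running the printed command shows what rsync would transfer without copying anything; dropping the n from the options performs the actual transfer.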

Copying large folders on Biocluster between Directories

Rsync does not move but only copies. You need to delete the original yourself once you have confirmed that everything has been transferred.

This is the rare case where you would run rsync on Biocluster and not on your computer (laptop or workstation). The format in this case is:
rsync -ai   FOLDER_A/    X/FOLDER_A/

where X is a different folder (e.g. a Bigdata folder)
  • Once the rsync command is done, run it again. The second run will be short and is just a check. If there is no output, nothing changed and it is safe to delete the original location.
    Specifically, running rsync the second time ensures that everything has been transferred correctly. The -i (--itemize-changes) option asks rsync to report (output) all the changes that occur on the filesystem during the sync. No output = No changes = The folder has been transferred safely.
  • All the bullets in the above section (Copying large folders to and from Biocluster) apply to this section
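The "run it again and look for output" check can be wrapped in a small function. This is a sketch only: the name copy_is_complete is hypothetical, and the function simply relies on rsync -ai printing nothing when there is nothing left to transfer.

```shell
#!/bin/bash
# copy_is_complete SRC DST: hypothetical helper. Runs rsync -ai once more and
# succeeds only if rsync finishes without reporting any change.
# Both paths need a trailing /.
copy_is_complete() {
    local changes
    changes=$(rsync -ai "$1" "$2") || return 1
    [ -z "$changes" ]
}

# Example (paths are placeholders):
#   rsync -ai FOLDER_A/ X/FOLDER_A/
#   copy_is_complete FOLDER_A/ X/FOLDER_A/ && echo "safe to delete FOLDER_A"
```

If the function fails, the check itself has transferred the remaining changes, so simply run it again until it succeeds.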


Copying large folders between Biocluster and other servers

This is a very rare case where you would run rsync on Biocluster and not on your computer (laptop or workstation). The format in this case is:
rsync -ai FOLDER_A/    sever2.xyz.edu:FOLDER_A/

where sever2.xyz.edu is a different server that accepts SSH connections.
  • All the bullets in the above sections (Copying large folders to and from Biocluster) apply to this section

Home Directories

Home directories are where you start each session on biocluster and where your jobs start when running on the cluster. They are automatically mounted when you log in and can be found at /rhome/<your username>.

Please remember: the default storage space quota per user account is 20 GB. A buffer of 10 GB is there to help with temporary overages but should not be used for permanent storage.

To get the current usage, run the following command in your home directory:
  • du -sh .

To calculate the sizes of each separate folder in your home, run:
  • du -sch ~/*
This will take some time....

For more information on your home directory, please see the Orientation section in the  Linux Basics manual.
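
As a rough illustration of checking usage against the quota, the sketch below compares the output of du with a limit given in GB. The helper name check_quota is hypothetical, and the demo runs on a scratch directory so that it finishes quickly; how the cluster itself accounts for the quota may differ from what du reports.

```shell
# Sketch: compare a directory's disk usage with a quota given in GB.
# check_quota is a hypothetical helper; 20 GB is the default quota stated above.
check_quota() {
    local dir="$1" quota_gb="$2" used_kb quota_kb
    used_kb="$(du -sk "$dir" 2>/dev/null | awk '{print $1}')"
    quota_kb=$(( quota_gb * 1024 * 1024 ))
    if [ "$used_kb" -gt "$quota_kb" ]; then
        echo "OVER: ${used_kb} KB used, quota is ${quota_kb} KB"
        return 1
    fi
    echo "OK: ${used_kb} KB used, quota is ${quota_kb} KB"
}

# Demo on a small scratch directory.
# Typical use on the cluster would be: check_quota "$HOME" 20
scratch="$(mktemp -d)"
echo "small file" > "$scratch/notes.txt"
check_quota "$scratch" 20
```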

Compression

On biocluster all data is automatically compressed with the lzjb algorithm.

The following will report the compressed file sizes

du -shc *
For the actual (uncompressed) sizes, use the following:
du --apparent-size -shc *
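
To see both numbers side by side, a small wrapper around the two du commands above can help. This is a sketch only: size_report is a hypothetical helper name, the sample file is created just for the demo, and --apparent-size assumes GNU du.

```shell
# Sketch: print the on-disk size and the apparent size of a directory.
# size_report is a hypothetical helper; --apparent-size requires GNU du.
size_report() {
    local dir="$1" disk_kb apparent_kb
    disk_kb="$(du -sk "$dir" | awk '{print $1}')"
    apparent_kb="$(du -sk --apparent-size "$dir" | awk '{print $1}')"
    echo "on-disk: ${disk_kb} KB, apparent: ${apparent_kb} KB"
}

# Demo: 1 MB of zeros, which compresses very well on a compressing filesystem
sample="$(mktemp -d)"
dd if=/dev/zero of="$sample/zeros.bin" bs=1024 count=1024 2>/dev/null
size_report "$sample"
```

On a filesystem without compression the two values will be close to each other; on a compressing filesystem the on-disk value is expected to be smaller.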


Backups

Biocluster has backups but you may want to periodically make copies of your critical data to your own storage device.

Please remember, Biocluster is a production system for research computations with a very expensive high-performance SAN storage infrastructure. It is not a data archiving system. 

Databases 

Introduction

The NCBI, Pfam, and UniProt databases do not need to be downloaded by users. They are in a central location on Biocluster:

ls /srv/projects/db/

Since these databases are accessible to everyone, users can simply provide the proper path in their executables. Specific database release numbers can be identified by the directory name under which a database is stored. For instance:

ls /srv/projects/db/pfam/2009-10-Pfam24.0/

Usually, we store the most recent release and 2-3 previous releases of each database. This way, time-consuming projects can use the same database version throughout their lifetime without always updating to the latest release. Suggestions for additional shared databases can be emailed to Alex Levchuk.
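
One way to keep a project on a single database release is to record the release path once in a variable and reuse it everywhere. The sketch below assumes that release directories start with a sortable date, as in 2009-10-Pfam24.0; the helper name latest_release, the variable PFAM_DIR, and the mock directory names are hypothetical.

```shell
# Sketch: pick the newest release directory under a database root.
# Assumes release directories start with a sortable date (YYYY-MM...).
# latest_release is a hypothetical helper name.
latest_release() {
    ls -1 "$1" | sort | tail -n 1
}

# Demo against a mock directory tree.
# On Biocluster you would pass a real root such as /srv/projects/db/pfam
mock="$(mktemp -d)"
mkdir -p "$mock/2009-03-mock-older-release" "$mock/2009-10-Pfam24.0"
PFAM_DIR="$mock/$(latest_release "$mock")"
echo "Using Pfam release in: $PFAM_DIR"
```

For a long-running project, it may be safer to look up the release once and then hard-code that path in your scripts, so that a newer release does not change the results midway.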

Locations of Database Files

As of 2010-03-08, the available databases are:
  1. Uniprot (   /srv/projects/db/uniprot/2010-02-UniProt-15.14   )
    1. uniprot_sprot - curated and well annotated genes
    2. uniprot_sprot_plus_trembl - combination of the above (sprot) and the unsupervised (trembl) UniProt genes
  2.  Pfam (   /srv/projects/db/pfam/2009-10-Pfam24.0/   )
    1. Pfam-A - curated Pfam HMMs
    2. Pfam-B - automatically constructed unsupervised Pfam HMMs
    3. Pfam-C
  3.   NCBI (   /srv/projects/db/ncbi/2010-02-24   )
    1. nr  - non-redundant protein sequences
    2. nt - nucleotide DNA sequences

Parallelization Software

Introduction

The low-latency interconnect averages 30 µs per message under high load. It gives a substantial performance boost to computational jobs that run in parallel on multiple compute nodes but require frequent node-to-node communication.
 
Advanced parallel computing technologies for a cluster are MPI (Message Passing Interface) and PVM (Parallel Virtual Machine).

Charmrun

Example: NAMD

Here is how to use Biocluster on a simple official NAMD example:
  1. Log-in to Biocluster
  2. Run
cp -r /srv/projects/biocluster-examples/004-namd2 namd2-example
cd  namd2-example

# Make a try
cp -r 1-2-sphere 1-2-sphere-try1
cd 1-2-sphere-try1

# Create symbolic links
ln -s ../qsub-8core
ln -s ../qsub-8core-namd2
ln -s ../qsub-48core
ln -s ../qsub-48core-namd2

# Start the run
./qsub-8core 5  # where 5 is the number of nodes (8 cores each)

  3. You can monitor progress with:
see q

  4. Once NAMD2 is finished look at the out and qsub2-namd2.* files
tail qsub2-namd2.*
tail out

For other NAMD runs, the only scripts that you will need from the folder namd2-example are:
  • qsub-8core
  • qsub-8core-namd2
  • qsub-48core
  • qsub-48core-namd2

MPI


Biocluster provides: OpenMPI (Launcher: mpirun), Mvapich1 (Launcher: mpirun_rsh), and Mpich1.

Many implementations of MPI exist, including Open MPI, FT-MPI, LA-MPI, LAM/MPI, PACX-MPI, Adaptive MPI, MPICH, and MVAPICH.


If you need to compile MPI applications, use one of the MPI compilers listed below. These compilers are wrappers around the gcc compiler.

To launch MPI applications, use qsub to reserve several nodes and then use the MPI launcher program.
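
The compile, reserve, and launch steps above might look like the following for Open MPI. This is a sketch under assumptions: hello_mpi.c, hello_mpi, and mpi_job.sh are hypothetical file names, the node and core counts are examples, and the script is only written to a scratch directory here, not submitted.

```shell
# Sketch: write a Torque job script that compiles and launches an Open MPI program.
# hello_mpi.c, hello_mpi, and mpi_job.sh are hypothetical file names.
workdir="$(mktemp -d)"
cat > "$workdir/mpi_job.sh" <<'EOF'
#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -N hello_mpi
cd "$PBS_O_WORKDIR"                      # directory the job was submitted from
mpicc -o hello_mpi hello_mpi.c           # compile with the Open MPI wrapper
mpirun -np 16 -machinefile "$PBS_NODEFILE" ./hello_mpi   # 2 nodes x 8 cores
EOF
chmod +x "$workdir/mpi_job.sh"
echo "Job script written to $workdir/mpi_job.sh"

# Submit from the head node with:
#   qsub mpi_job.sh
```

The heredoc delimiter is quoted so that $PBS_O_WORKDIR and $PBS_NODEFILE are left for Torque to set when the job runs.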

OpenMPI, Mvapich, or Mpich

Open MPI
Link: http://www.open-mpi.org/
Compilers: mpiCC, mpiCC-vt, mpic++, mpic++-vt, mpicc, mpicc-vt, mpicxx, mpicxx-vt, mpif77, mpif77-vt, mpif90, mpif90-vt
Launcher: mpirun, mpiexec

MPICH1
Link: http://www.mcs.anl.gov/research/projects/mpich1/
Include directory: /opt/mpich1-1.2.1p1/include
Compilers: /opt/mpich1-1.2.1p1/bin/{mpicc,mpicxx,mpif77}
Launcher: /opt/mpich1-1.2.1p1/bin/{mpirun,mpiexec}

MVAPICH1
Link: http://mvapich.cse.ohio-state.edu
Note: MVAPICH1 runs over the InfiniBand RDMA hardware, so it is potentially faster than OpenMPI.
Include directory: /opt/mvapich1-1.4.1/include
Compilers: /opt/mvapich1-1.4.1/bin/{mpicc,mpicxx,mpif77}
Launcher: /opt/mvapich1-1.4.1/bin/mpirun_rsh


Example: MrBayes

(Compiled against OpenMPI)
Official MrBayes Manual, Online Help, and Bug Reports: http://mrbayes.csit.fsu.edu

mkdir mrbayes-on-3nodes              # Create a folder,
cd !$                                # cd into it,
cp -v /srv/projects/biocluster-examples/001-mrbayes/* .  # and copy the example files



./run.sh -l nodes=3:ppn=8            # Start the MPI job on 3 nodes


qstat                 # See your job in Torque queue (R - running, Q - queued)

# ----------- Wait for 2 minutes -----------

qstat                 # Check that your job is still running
cat out               # See Partial Results


# ----------- Wait for 20 minutes -----------
# (If you don't want to wait, run "qdel 12345", where 12345 is your job ID from qstat)

qstat                
# Your job is now completed
cat out               # View final results
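
Instead of re-running qstat by hand, the waiting steps above could be wrapped in a polling loop. This is a sketch, not a tested Biocluster tool: wait_for_job is a hypothetical helper name, and it assumes that "qstat -f JOBID" prints a "job_state = X" line while the job is known to Torque and fails once the job has left the queue.

```shell
# Sketch: poll Torque until a job is no longer queued or running.
# wait_for_job is a hypothetical helper. It assumes "qstat -f JOBID" prints a
# "job_state = X" line and fails (prints nothing) once the job is gone.
wait_for_job() {
    local jobid="$1" interval="${2:-60}" state
    while state="$(qstat -f "$jobid" 2>/dev/null | awk -F' = ' '/job_state/ {print $2}')" \
          && [ -n "$state" ] && [ "$state" != "C" ]; do
        sleep "$interval"            # still queued (Q) or running (R): check again later
    done
}

# Usage, where 12345 is your job ID from qstat:
#   wait_for_job 12345 120 && cat out
```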


Example: MAKER

(Compiled against MPICH1)

mkdir maker-on-3nodes             # Create a folder,
cd !$                             # cd into it,
cp -v /srv/projects/biocluster-examples/002-maker/* . # and copy the example files



# Tool number 1 (clean-up)
on-all-nodes-run `pwd`/kill-all-mpd   # WARNING: This command will destroy
                                      #          all Mpich mpi programs if you are currently
                                      #          running them. Use this command only when
                                      #          you need to clean-up after any unsuccessful runs.


# Tool number 2 (starting runs)
./lev2-run.sh -l nodes=3:ppn=8


# Tool number 3 (monitoring CPU)
jmonGrep $USER     # This will show real-time CPU usage on all nodes that you are reserving
                   # You can also see nodes for a specific job from qstat, for example jmonGrep 12345
                   #
                   # NOTICE: If nothing shows up then it means that you have nothing reserved.
                   # To quit press 'q'


# Tool number 4 (monitoring progress)
./progress-check


# Tool number 5 (graphing progress)
# Reference performance: afrC_sca.fa on 3 nodes in 3 hours
# Reference performance: afrC_sca.fa on 1 node in 8 hours (see result graphs)
...



Unavailable and untested examples: HMMER, ABySS Parallel Assembler


Monitoring the Load History

Several utilities are available for obtaining information about the history of the cluster load by users, labs and individual nodes.
  1. The R Function plotCPUhours
    When this function is executed from the R console on the biocluster head node, it will return an overview of the CPU hour history by users and PIs, as well as the average load of the entire cluster. 

    source("/home/tgirke/Applications/Cluster/cpu_hour_counts/cpuhourPlot.R")
    cpuhourlist <- plotCPUhours(plotdata=T, printdata=T, mymonths="all")

  2. Cluster Load History

    CPU hours used by each Lab (Locked*)
    CPU hours used by each Compute Node (Locked*)
    CPU hours used by each User (Locked*)

    * For password please email Thomas


Communicating with Other Users

Biocluster is a shared resource. Communicating with other users can help to schedule large computations. 

Looking-Up Specific Users

A convenient overview of all users and their lab affiliations can be retrieved with the following command:
qstatMonitor

You can generate an email list of particular users by running:
all-users | awk -F\\t '/((alevc)|(tgirk))[a-z0-9]*\t/ {print " " $1 " <" $3 ">"}'

Listing Users with Active Jobs on the Cluster

To get a list of UNIX user names run:
qstat | awk '// {print $3}' | sort | uniq | grep "^[^-N]"

To get the list of real names run:
grep <(all-users) -f <(qstat | awk '// {print $3}' | \
  sort | uniq | grep "^[^-N]") | awk -F\\t '// {print $1}'

To get the list of emails run:
grep <(all-users) -f <(qstat | awk '// {print $3}' | \
  sort | uniq | grep "^[^-N]") | awk -F\\t '// {print " " $1 " <" $3 "> "}'
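
A variation on the pipelines above counts how many jobs each user has in the queue. This is a sketch: jobs_per_user is a hypothetical helper name, and the sample text below is made-up qstat output used only so the example can run without a scheduler. It assumes the same column layout as the commands above, with the user name in the third column after two header lines.

```shell
# Sketch: count jobs per user from qstat output read on standard input.
# jobs_per_user is a hypothetical helper; it skips the two header lines.
jobs_per_user() {
    awk 'NR > 2 { n[$3]++ } END { for (u in n) print n[u], u }' | sort -rn
}

# Made-up sample in the default qstat layout (job IDs and names are hypothetical)
sample='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
101.biocluster            run.sh           alevchuk        00:10:05 R batch
102.biocluster            run.sh           tgirke          00:02:11 R batch
103.biocluster            maker            alevchuk               0 Q batch'
printf '%s\n' "$sample" | jobs_per_user

# On Biocluster:
#   qstat | jobs_per_user
```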

Sharing Files on the Web


To share files, place them (or a symbolic link to them) in ~/.html. On Biocluster, run:
mkdir hello-www # Make a new directory in your current PWD
echo '<h1>Hello WWW!</h1>' > ./hello-www/hello.html

ln -s `pwd`/hello-www ~/.html/ # Make a symbolic link

Now, test it out by pointing your web-browser to http://biocluster.ucr.edu/~alevchuk/hello-www/
(Instead of alevchuk put your username)

Password Protect Web Pages

Files in web directories can be password protected as follows:

First run these commands:
touch ~/.html/.htpasswd
htpasswd ~/.html/.htpasswd webuser
This will prompt you to create a new password.

Go to the directory that you want to lock
mkdir ~/.html/locked_dir
cd ~/.html/locked_dir
You can choose a different directory name.

Now run this command:
echo 'AuthName "Please login. The username is webuser."
AuthType Basic
AuthUserFile /home/alevchuk/.html/.htpasswd
require user webuser' > .htaccess

But instead of /home/alevchuk put your own home directory.
To find it run:
echo ~

Now, test it out by pointing your web-browser to http://biocluster.ucr.edu/~alevchuk/locked_dir

But instead of alevchuk and locked_dir put your username and your directory name.
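
To avoid editing the home directory path by hand, the .htaccess file above could be generated with $HOME filled in automatically. This is a sketch: make_htaccess is a hypothetical helper name, and the demo writes into a temporary directory instead of ~/.html.

```shell
# Sketch: write the .htaccess shown above, with $HOME substituted automatically.
# make_htaccess is a hypothetical helper; the second argument defaults to webuser.
make_htaccess() {
    local dir="$1" user="${2:-webuser}"
    cat > "$dir/.htaccess" <<EOF
AuthName "Please login. The username is $user."
AuthType Basic
AuthUserFile $HOME/.html/.htpasswd
require user $user
EOF
}

# Demo in a temporary directory.
# Real use would be: make_htaccess ~/.html/locked_dir
locked="$(mktemp -d)"
make_htaccess "$locked"
cat "$locked/.htaccess"
```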



List of Installed Software

  • Installation requests can be emailed to Alex Levchuk

Systems Software

Software from Debian package repository

Version lookup:
dpkg -S `which any-command` | awk -F: '{print $1}' | xargs aptitude show

For example:
dpkg -S `which blastall` | awk -F: '{print $1}' | xargs aptitude show



Software from source code

Version lookup:
git --git-dir=/usr/local/.git log -- bin/any-command

For example:
git --git-dir=/usr/local/.git log -- bin/soap


R libraries 

Version Lookup (in R):
packageVersion("any-package")

For example:
packageVersion("GOstats")





Password Change

Changing your password is currently disabled for security reasons

You can keep using the original Biocluster random password or change it:

  1. Log-in via SSH using the Terminal on Mac/Linux or Putty on Windows
  2. Type passwd
  3. Enter the old password (the random characters that you were given as your initial password)
  4. Enter your new password

WARNING: Please do not reuse passwords for Biocluster!

This is to protect you and other Biocluster users from getting any of your electronic accounts compromised. The Biocluster password should not be similar to any of your other passwords. This includes passwords for:

  • laptops
  • email
  • university services
  • other labs and departments

The minimum password length depends on which kinds of characters the password contains:

  • 10 lower-case letters
  • 8 letters AND 1 digit (9 characters total)
  • 8 letters AND 1 upper-case (9 characters total)
  • 8 letters AND 1 non-alphanumeric* (9 characters total)
  • 6 letters AND 1 digit AND 1 upper-case (8 characters total)
  • 6 letters AND 1 digit AND 1 non-alphanumeric* (8 characters total)
  • 6 letters AND 1 upper-case AND 1 non-alphanumeric* (8 characters total)
  • 4 letters AND 1 digit AND 1 upper-case AND 1 non-alphanumeric* (7 characters total)

* Non-alphanumeric characters are !@#$%^&*-+=...
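
Read as a rule, the list above suggests that each extra kind of character (digit, upper-case, non-alphanumeric) lowers the minimum total length by one, starting from 10. The sketch below encodes that reading. It is an approximation, not the actual password checker on Biocluster: password_long_enough is a hypothetical helper name, and it only checks total length and which kinds of characters are present.

```shell
# Sketch: approximate the length rule above as "10 minus the number of extra
# character kinds present". password_long_enough is a hypothetical helper and
# does not reproduce the real checker used by passwd.
password_long_enough() {
    local pw="$1" classes=0
    case "$pw" in *[[:digit:]]*)   classes=$((classes + 1)) ;; esac
    case "$pw" in *[[:upper:]]*)   classes=$((classes + 1)) ;; esac
    case "$pw" in *[![:alnum:]]*)  classes=$((classes + 1)) ;; esac
    [ "${#pw}" -ge $((10 - classes)) ]
}

# Try a few made-up candidates
for candidate in abcdefghij abcdefgh1 'abcD1!x' abcdefghi; do
    if password_long_enough "$candidate"; then
        echo "$candidate: long enough"
    else
        echo "$candidate: too short"
    fi
done
```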

