IIGB Linux Cluster Introduction
This manual provides a brief introduction to the usage of IIGB's Linux clusters. All servers and compute clusters of the IIGB bioinformatics facility are available to researchers from all departments and colleges at UC Riverside for a minimal recharge fee (see rates). The latest hardware/facility description for grant applications is available here: Facility Description [pdf]. To request an account on one of our systems, please contact Thomas Girke (thomas.girke@ucr.edu). The configuration of our most popular Linux cluster, called "biocluster", is outlined below.
Biocluster Infrastructure
Hardware
| Path | /rhome/<username> (ex: /rhome/tgirke) |
|---|---|
| User Availability | All Users |
| Node Availability | All Nodes |
| Quota Responsibility | User |
Big Data
Big data is an area where large amounts of storage can be made available to users. A lab purchases big data space separately from access to the cluster. This space is then made available to the lab via a shared directory and individual directories for each user.
Lab Shared Space
This directory can be accessed by the lab as a whole.
| Path | /shared/<labname> (ex: /shared/girkelab) |
|---|---|
| User Availability | Labs that have purchased space. |
| Node Availability | All Nodes |
| Quota Responsibility | Lab |
Individual User Space
This directory can be accessed by specific lab members.
| Path | /bigdata/<username> (ex: /bigdata/tgirke) |
|---|---|
| User Availability | Labs that have purchased space. |
| Node Availability | All Nodes |
| Quota Responsibility | Lab |
Non-Persistent Space
Frequently, there is a need to write a significant amount of intermediate data during a job, to access a dataset from a faster medium than bigdata or the home directories, or to write out lock files. Non-persistent spaces are well suited to these uses. Below are the non-persistent filesystems available on biocluster.
Memory Backed Space
This type of space takes away from physical memory but allows extremely fast access to the files located on it. You will need to factor in the space you are using in RAM as well. For example, if you have a dataset that is 1G in size and use this space, it will take 1G of RAM.
| Path | /dev/shm |
|---|---|
| User Availability | All Users |
| Node Availability | All Nodes |
| Quota Responsibility | N/A |
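As a rough sketch of how memory-backed space might be used inside a job, the function below copies a dataset into a private directory, runs a command on the copy, and then removes the copy so that the RAM is released. The function name run_with_staged_copy and the STAGE_ROOT variable are illustrative assumptions and are not provided by biocluster.

```shell
#!/bin/sh
# Sketch: stage a file into fast non-persistent space, use it, then clean up.
# STAGE_ROOT is an assumed variable; /dev/shm is the path from the table above.
STAGE_ROOT="${STAGE_ROOT:-/dev/shm}"

run_with_staged_copy() {
    # $1 = dataset file; remaining arguments = command to run on the staged copy
    src=$1
    shift
    # Private per-job directory so that concurrent jobs do not collide
    stage_dir=$(mktemp -d "$STAGE_ROOT/stage.XXXXXX") || return 1
    cp "$src" "$stage_dir/" || { rm -rf "$stage_dir"; return 1; }
    # The staged path is appended as the last argument of the command
    "$@" "$stage_dir/$(basename "$src")"
    status=$?
    # Always free the space: files left in /dev/shm keep consuming RAM
    rm -rf "$stage_dir"
    return "$status"
}
```

For example, `run_with_staged_copy reads.fa wc -l` would count the lines of a (hypothetical) file reads.fa from the staged copy.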
Temporary Space
This is the standard space available on all Linux systems. Please be aware that it is limited to the amount of free disk space on the node you are running on.
| Path | /tmp |
|---|---|
| User Availability | All Users |
| Node Availability | All Nodes |
| Quota Responsibility | N/A |
SSD Backed Space
This space is much faster than the standard temporary space, but slower than memory-backed space.
| Path | /scratch |
|---|---|
| User Availability | All Users |
| Node Availability | High Mem Nodes |
| Quota Responsibility | N/A |
Sharing data with other users
It is useful to share data and results with other users on the cluster, and we encourage collaboration. The easiest way to share a file is to place it in a location that both users can access. Then the second user can simply copy it to a location of their choice. However, this requires that the file permissions permit the second user to read the file.
Basic file permissions on Linux and other Unix-like systems are composed of three groups: owner, group, and other. Each one represents the permissions for a different set of people: the user who owns the file, all members of the group that owns the file, and everyone else, respectively. Each group has three permissions: read, write, and execute, represented as r, w, and x. For example, the following file is owned by the user 'bragr' (with read, write, and execute), owned by the group 'operations' (with read and execute), and everyone else cannot access it.
-rwxr-x--- 1 bragr operations 1.6K Nov 19 12:32 randomFileName
If you wanted to share this file with someone outside the 'operations' group, read permissions must be added to the file for 'other'.
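A minimal sketch of adding read permission for 'other' with chmod follows. The function name share_read_only is illustrative only; the chmod and ls calls are standard.

```shell
#!/bin/sh
# Sketch: give everyone read access to a file.
share_read_only() {
    # o+r adds read permission for 'other' and leaves all other bits untouched
    chmod o+r "$1"
    # Show the resulting permissions
    ls -l "$1"
}
```

Running `share_read_only randomFileName` on the file from the listing above would change its mode from -rwxr-x--- to -rwxr-xr--. Note that the other user also needs execute (search) permission on every directory leading to the file.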
Set Default Permissions
In Linux, it is possible to set the default file permissions for new files. This is useful if you are collaborating on a project, or frequently share files and do not want to be constantly adjusting permissions. The command responsible for this is called 'umask'. You should first check what your default permissions currently are by running 'umask -S'.
bragr@biocluster:~$ umask -S
u=rwx,g=rx,o=rx
To set your default permissions, simply run umask with the correct options. Please note that this does not change permissions on any existing files, only on new files created after you update the default permissions. For instance, if you wanted to set your default permissions so that you have full control, your group can read and execute your files, and no one else has access, you would run:
bragr@biocluster:~$ umask u=rwx,g=rx,o=
It is also important to note that these settings only affect your current session. If you log out and log back in, these settings will be reset. To make your changes permanent you need to add them to your '.bashrc' file, which is a hidden file in your home directory (if you do not have a '.bashrc' file, you will need to create an empty file called '.bashrc' in your home directory). Adding umask to your .bashrc file is as simple as adding your umask command (such as 'umask u=rwx,g=rx,o=r') to the end of the file. Then simply log out and back in for the changes to take effect. You can double check that the settings have taken effect by running 'umask -S'.
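One possible way to make the setting permanent without editing the file by hand is sketched below. The function name persist_umask is an assumption made for illustration; it creates '.bashrc' if needed and appends the umask line only when no umask line is present yet.

```shell
#!/bin/sh
# Sketch: append a umask setting to ~/.bashrc only if none is present yet.
persist_umask() {
    # $1 = umask specification, e.g. u=rwx,g=rx,o=
    rc="$HOME/.bashrc"
    # Create the file if it does not exist
    touch "$rc"
    # Do nothing if a line starting with "umask" is already there
    if grep -q '^umask ' "$rc"; then
        echo "umask already set in $rc; edit it by hand" >&2
        return 0
    fi
    echo "umask $1" >> "$rc"
}
```

For example, `persist_umask u=rwx,g=rx,o=` adds the setting once; calling it again leaves the file unchanged.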
Copying large folders to and from Biocluster
Rsync can:
- Copy (transfer) folders between different storage hardware
- Perform transfers over the network via SSH
- Compare large data sets (-n, --dry-run option)
- Resume interrupted transfers
To perform over-the-network transfers, it is always recommended that you run the rsync command from your local machine (laptop or workstation).
On your computer open the Terminal and run:
rsync -ai FOLDER_A/ biocluster.ucr.edu:FOLDER_A/
rsync -ai biocluster.ucr.edu:FOLDER_B/ FOLDER_B/
If your connection breaks, rsync can pick up where it left off: simply run the same command again.
- Rsync does not exist on Windows. Only Mac and Linux support rsync natively.
- Always put the / after both folder names, e.g.:
FOLDER_B/
Failing to do so will result in nested folders every time you try to resume. If you don't put the /, you will get a second FOLDER_B inside FOLDER_B:
FOLDER_B/FOLDER_B/
- Rsync does not move but only copies.
man rsync
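Because a missing trailing / is easy to overlook, a small helper can add it before the command is built. The sketch below only prints the rsync command instead of running it; the function names with_trailing_slash and upload_command are illustrative assumptions, and the host name is the one used above.

```shell
#!/bin/sh
# Sketch: make sure a folder argument ends in "/".
with_trailing_slash() {
    # Strip one trailing slash if present, then add one back
    printf '%s/\n' "${1%/}"
}

# Print (rather than run) the rsync command for uploading a folder.
upload_command() {
    folder=$(with_trailing_slash "$1")
    printf 'rsync -ai %s biocluster.ucr.edu:%s\n' "$folder" "$folder"
}
```

Calling `upload_command FOLDER_A` prints the same upload command as shown above, whether or not the argument was typed with a trailing /.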
Copying large folders on Biocluster between Directories
Rsync does not move but only copies. You need to delete the original once you confirm that everything has been transferred.
This is the rare case where you would run rsync on Biocluster and not on your computer (laptop or workstation). The format in this case is:
rsync -ai FOLDER_A/ X/FOLDER_A/
where X is a different folder (e.g. a Bigdata folder).
- Once the rsync command is done, run it again. The second run will be short, as it is just a check. If there is no output, nothing changed and it is safe to delete the original location.
Specifically, running rsync a second time will ensure that everything has been transferred correctly. The -i (--itemize-changes) option asks rsync to report (output) all the changes that occur on the filesystem during the sync. No output = No changes = The folder has been transferred safely.
- All the bullets in the above section (Copying large folders to and from Biocluster) apply to this section
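The copy, check, and delete sequence described in this section could be scripted roughly as follows. This is a sketch under assumptions: the function name move_folder_with_rsync is made up for illustration, and the parent of the destination folder is assumed to exist already.

```shell
#!/bin/sh
# Sketch: copy a folder with rsync, re-run as a check, delete the source only if clean.
move_folder_with_rsync() {
    # $1 = source folder, $2 = destination folder (trailing slashes are added here)
    # Refuse empty arguments or "/" so that rm never sees a dangerous path
    [ -n "${1%/}" ] && [ -n "${2%/}" ] || return 1
    src="${1%/}/"
    dst="${2%/}/"
    # First pass does the actual copy
    rsync -ai "$src" "$dst" || return 1
    # Second pass: -i lists every change; no output means nothing was left to copy
    changes=$(rsync -ai "$src" "$dst") || return 1
    if [ -n "$changes" ]; then
        echo "Second pass still reported changes; keeping $src" >&2
        return 1
    fi
    # Safe to remove the original location
    rm -rf -- "${1%/}"
}
```

A call such as `move_folder_with_rsync FOLDER_A X/FOLDER_A` mirrors the format given above. Deleting automatically is a design choice of this sketch; running the second pass by hand and inspecting the output is the more cautious route.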
Copying large folders between Biocluster and other servers
This is a very rare case where you would run rsync on Biocluster and not on your computer (laptop or workstation). The format in this case is:
rsync -ai FOLDER_A/ sever2.xyz.edu:FOLDER_A/
where sever2.xyz.edu is a different server that accepts SSH connections.
- All the bullets in the above sections (Copying large folders to and from Biocluster) apply to this section
Home Directories
Home directories are where you start each session on biocluster and where your jobs start when running on the cluster. They are automatically mounted when you log in and can be found at /rhome/<your username>.
Please remember: the default storage space quota per user account is 20 GB. A buffer of 10 GB is there to help with temporary overages but should not be used for permanent storage.
To get the current usage, run the following command in your home directory:
du -sh .
To calculate the sizes of each separate folder in your home, run:
du -sch ~/*
For more information on your home directory, please see the Orientation section in the Linux Basics manual.
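To compare the usage against the quota automatically, something like the following sketch could be used. The function name check_quota is an assumption for illustration; the 20 GB default is the quota stated above, and the number comes from du rather than from the storage system's own quota accounting, so the two may differ slightly.

```shell
#!/bin/sh
# Sketch: compare a directory's disk usage with a quota given in GB.
check_quota() {
    # $1 = directory (default: home directory), $2 = quota in GB (default: 20)
    dir="${1:-$HOME}"
    quota_gb="${2:-20}"
    # du -sk prints usage in 1024-byte blocks; keep only the number
    used_kb=$(du -sk "$dir" | awk '{print $1}')
    quota_kb=$((quota_gb * 1024 * 1024))
    if [ "$used_kb" -gt "$quota_kb" ]; then
        echo "OVER: ${used_kb} KB used, quota is ${quota_kb} KB"
        return 1
    fi
    echo "OK: ${used_kb} KB used, quota is ${quota_kb} KB"
}
```

Running `check_quota` with no arguments checks the home directory against 20 GB and returns a non-zero exit status when the quota is exceeded.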
Compression
On biocluster all data is automatically compressed with the lzjb algorithm.
The following will report the compressed file sizes:
du -shc *
The following will report the apparent (uncompressed) file sizes:
du --apparent-size -shc *
Backups
Biocluster has backups, but you may want to periodically make copies of your critical data to your own storage device.
Please remember, Biocluster is a production system for research computations with a very expensive high-performance SAN storage infrastructure. It is not a data archiving system.
Databases
Introduction
The NCBI, Pfam, and Uniprot databases do not need to be downloaded by users. They are in a central location on Biocluster:
ls /srv/projects/db/
Since these databases are accessible to everyone, users can simply provide the proper path in their executables. Specific database release numbers can be identified by the directory name under which a database is stored. For instance:
ls /srv/projects/db/pfam/2009-10-Pfam24.0/
Usually, we store the most recent release and 2-3 previous releases of each database. This way, time-consuming projects can use the same database version throughout their lifetime without always updating to the latest releases. Suggestions for additional shared databases can be emailed to Alex Levchuk.
Locations of Database Files
As of 2010-03-08, the available databases are:
- Uniprot ( /srv/projects/db/uniprot/2010-02-UniProt-15.14 )
  - uniprot_sprot - curated and well annotated genes
  - uniprot_sprot_plus_trembl - combination of the above (sprot) and the unsupervised (trembl) Uniprot genes
- Pfam ( /srv/projects/db/pfam/2009-10-Pfam24.0/ )
  - Pfam-A - curated Pfam HMMs
  - Pfam-B - automatically constructed unsupervised Pfam HMMs
  - Pfam-C
- NCBI ( /srv/projects/db/ncbi/2010-02-24 )
  - nr - non-redundant protein sequences
  - nt - nucleotide DNA sequences
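Since release directories are named with a leading date, the newest release of a database can be picked by sorting the directory names. The following is a sketch only: the function name latest_release is hypothetical, and it assumes every release directory name starts with a sortable date as in the listing above.

```shell
#!/bin/sh
# Sketch: pick the newest release directory of a shared database.
# Assumes release directory names start with a sortable date (YYYY-MM...).
latest_release() {
    # $1 = database directory, e.g. /srv/projects/db/pfam
    # The trailing "/" in the glob restricts matches to directories
    for d in "$1"/*/; do
        [ -d "$d" ] && printf '%s\n' "${d%/}"
    done | sort | tail -n 1
}
```

For a long-running project it may be better to call `latest_release /srv/projects/db/pfam` once at the start and record the resulting path, so that the same database version is used throughout.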
Parallelization Software
Introduction
The low-latency interconnect provides message latencies that average 30 µs (microseconds) per message under high load. This interconnect substantially improves the performance of computational jobs that run in parallel on multiple compute nodes but require frequent node-to-node communication.
Advanced parallel computing technologies for a cluster are MPI (Message Passing Interface) and PVM (Parallel Virtual Machine).
Charmrun
Example: NAMD
Here is how to use Biocluster on a simple official NAMD example:
1. Log-in to Biocluster
2. Run
cp -r /srv/projects/biocluster-examples/004-namd2 namd2-example
cd namd2-example
# Make a try
cp -r 1-2-sphere 1-2-sphere-try1
cd 1-2-sphere-try1
# Create symbolic links
ln -s ../qsub-8core
ln -s ../qsub-8core-namd2
ln -s ../qsub-48core
ln -s ../qsub-48core-namd2
# Start the run
./qsub-8core 5 # where 5 is the number of nodes (8 cores each)
3. You can monitor progress with:
see q
tail qsub2-namd2.*
tail out
For other NAMD runs, the only scripts that you will need from the folder namd2-example are:
- qsub-8core
- qsub-8core-namd2
- qsub-48core
- qsub-48core-namd2
MPI
Biocluster provides: OpenMPI (Launcher: mpirun), Mvapich1 (Launcher: mpirun_rsh), and Mpich1.
Many implementations of MPI exist: Open MPI, FT-MPI, LA-MPI, LAM/MPI, PACX-MPI, Adaptive MPI, MPICH, MVAPICH
If you need to compile MPI applications, use one of the MPI compilers listed below. These compilers are wrappers around the gcc compiler.
To launch MPI applications use qsub to reserve several nodes and then use the MPI launcher program.
OpenMPI, Mvapich, or Mpich
Open MPI
Link: http://www.open-mpi.org/
Compilers: mpiCC, mpiCC-vt, mpic++, mpic++-vt , mpicc, mpicc-vt, mpicxx, mpicxx-vt , mpiexec, mpif77, mpif77-vt, mpif90, mpif90-vt
Launcher: mpirun
MPICH1
Link: http://www.mcs.anl.gov/research/projects/mpich1/
Include directory: /opt/mpich1-1.2.1p1/include
Compilers: /opt/mpich1-1.2.1p1/bin/{mpicc,mpicxx,mpif77}
Launcher: /opt/mpich1-1.2.1p1/bin/{mpirun,mpiexec}
MVAPICH1
Link: http://mvapich.cse.ohio-state.edu
Note: MVAPICH1 runs over the InfiniBand RDMA hardware, so it is potentially faster than OpenMPI.
Include directory: /opt/mvapich1-1.4.1/include
Compilers: /opt/mvapich1-1.4.1/bin/{mpicc,mpicxx,mpif77}
Launcher: /opt/mvapich1-1.4.1/bin/mpirun_rsh
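To illustrate how reserving nodes with qsub and launching with an MPI launcher fit together, here is a generic sketch for Open MPI under Torque. It is not one of the biocluster example scripts: the function name write_mpi_job, the job file name, and the program ./my_mpi_program are placeholders, while PBS_O_WORKDIR and PBS_NODEFILE are variables that Torque sets inside a job.

```shell
#!/bin/sh
# Sketch: write a Torque job script that launches an MPI program with Open MPI.
# ./my_mpi_program is a placeholder for your own compiled MPI application.
write_mpi_job() {
    # $1 = number of nodes, $2 = cores per node, $3 = job script to create
    cat > "$3" <<EOF
#!/bin/sh
#PBS -l nodes=$1:ppn=$2
# Torque starts jobs in the home directory; return to the submission directory
cd "\$PBS_O_WORKDIR"
# PBS_NODEFILE contains one line per reserved core
mpirun -np $(($1 * $2)) -hostfile "\$PBS_NODEFILE" ./my_mpi_program
EOF
}
```

For example, `write_mpi_job 3 8 mpi-job.sh` followed by `qsub mpi-job.sh` would request 3 nodes with 8 cores each and start 24 MPI processes.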
Example: MrBayes
mkdir mrbayes-on-3nodes # Create a folder,
cd !$ # cd into it,
cp -v /srv/projects/biocluster-examples/001-mrbayes/* . # and copy the example files
./run.sh -l nodes=3:ppn=8 # Start the MPI job on 3 nodes
qstat # See your job in Torque queue (R - running, Q - queued)
# ----------- Wait for 2 minutes -----------
qstat # Check that your job is still running
cat out # See partial results
# ----------- Wait for 20 minutes -----------
# (If you don't want to wait run "qdel 12345" where 12345 is your job ID from qstat)
qstat # Your job is now completed
cat out # View final results
Example: MAKER
mkdir maker-on-3nodes # Create a folder,
cd !$ # cd into it,
cp -v /srv/projects/biocluster-examples/002-maker/* . # and copy the example files
# Tool number 1 (clean-up)
on-all-nodes-run `pwd`/kill-all-mpd # WARNING: This command will destroy
# all Mpich mpi programs if you are currently
# running them. Use this command only when
# you need to clean-up after any unsuccessful runs.
# Tool number 2 (starting runs)
./lev2-run.sh -l nodes=3:ppn=8
# Tool number 3 (monitoring CPU)
jmonGrep $USER # This will show real-time CPU usage on all nodes that you are reserving
# You can also see nodes for a specific job from qstat, for example: jmonGrep 12345
# NOTICE: If nothing shows up then it means that you have nothing reserved.
# To quit press 'q'
# Tool number 4 (monitoring progress)
./progress-check
# Tool number 5 (graphing progress)
# Reference performance: afrC_sca.fa on 3 nodes in 3 hours
# Reference performance: afrC_sca.fa on 1 node in 8 hours (see result graphs)
...
Unavailable and untested examples: HMMER, ABySS Parallel Assembler
Monitoring the Load History
Several utilities are available for obtaining information about the history of the cluster load by users, labs and individual nodes.
- The R Function plotCPUhours
When this function is executed from the R console on the biocluster head node, it will return an overview of the CPU hour history by users and PIs, as well as the average load of the entire cluster.
source("/home/tgirke/Applications/Cluster/cpu_hour_counts/cpuhourPlot.R")
cpuhourlist <- plotCPUhours(plotdata=T, printdata=T, mymonths="all")
- Cluster Load History
CPU hours used by each Lab (Locked*)
CPU hours used by each Compute Node (Locked*)
CPU hours used by each User (Locked*)
* For password please email Thomas
Communicating with Other Users
Biocluster is a shared resource. Communicating with other users can help to schedule large computations.
Looking-Up Specific Users
qstatMonitor
You can generate an email list of particular users by running:
all-users | awk -F\\t '/((alevc)|(tgirk))[a-z0-9]*\t/ {print " " $1 " <" $3 ">"}'
Listing Users with Active Jobs on the Cluster
qstat | awk '// {print $3}' | sort | uniq | grep "^[^-N]"
To get the list of real names run:
grep <(all-users) -f <(qstat | awk '// {print $3}' | \
sort | uniq | grep "^[^-N]") | awk -F\\t '// {print $1}'
To get the list of real names with email addresses run:
grep <(all-users) -f <(qstat | awk '// {print $3}' | \
sort | uniq | grep "^[^-N]") | awk -F\\t '// {print " " $1 " <" $3 "> "}'
Sharing Files on the Web
mkdir hello-www # Make a new directory in your current PWD
echo '<h1>Hello WWW!</h1>' > ./hello-www/hello.html
ln -s `pwd`/hello-www ~/.html/ # Make a symbolic link
(Instead of alevchuk put your username)
Password Protect Web Pages
Files in web directories can be password protected as follows:
First run these commands:
touch ~/.html/.htpasswd
htpasswd ~/.html/.htpasswd webuser
Go to the directory that you want to lock:
mkdir ~/.html/locked_dir
cd ~/.html/locked_dir
Now run this command:
echo 'AuthName "Please login. The username is webuser."
AuthType Basic
AuthUserFile /home/alevchuk/.html/.htpasswd
require user webuser' > .htaccess
But instead of alevchuk and locked_dir put your username and your directory name. The AuthUserFile path must point to your own home directory. To find it run:
echo ~
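To avoid typing the home directory path by hand, the .htaccess file could also be generated from $HOME. The sketch below writes the same four directives shown above; the function name write_htaccess is an illustrative assumption, and 'webuser' and 'locked_dir' are the example names used in this section.

```shell
#!/bin/sh
# Sketch: write the .htaccess shown above, using $HOME instead of a typed path.
write_htaccess() {
    # $1 = directory under ~/.html to protect, $2 = web user name
    dir="$HOME/.html/$1"
    mkdir -p "$dir"
    cat > "$dir/.htaccess" <<EOF
AuthName "Please login. The username is $2."
AuthType Basic
AuthUserFile $HOME/.html/.htpasswd
require user $2
EOF
}
```

For example, `write_htaccess locked_dir webuser` creates ~/.html/locked_dir/.htaccess. The password file itself still has to be created with the htpasswd command as described above.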
List of Installed Software
- Installation requests can be emailed to Alex Levchuk
Systems Software
- Debian GNU/Linux
- ZFS and NFS served by OpenIndiana
- Monitoring by Munin
- Containerization by OpenVZ
Software from Debian package repository
Version lookup:
dpkg -S `which any-command` | awk -F: '{print $1}' | xargs aptitude show
For example:
dpkg -S `which blastall` | awk -F: '{print $1}' | xargs aptitude show
Software from source code
Version lookup:
git --git-dir=/usr/local/.git log -- bin/any-command
For example:
git --git-dir=/usr/local/.git log -- bin/soap
R libraries
Version Lookup (in R):
packageVersion("any-package")
For example:
packageVersion("GOstats")
Password Change
Changing your password is currently disabled for security reasons
You can keep using the original Biocluster random password or change it:
- Log-in via SSH using the Terminal on Mac/Linux or Putty on Windows
- Type passwd
- Enter the old password (the random characters that you were given as your initial password)
- Enter your new password
WARNING: Please do not reuse passwords for Biocluster!
This is to protect you and other Biocluster users from getting any of
your electronic accounts compromised. The Biocluster password should
not be similar to any of your other passwords. This includes passwords
for:
* laptops
* email
* university services
* other labs and departments
The password minimum length is:
- 10 lower-case letters
- 8 letters AND 1 digit (9 characters total)
- 8 letters AND 1 upper-case (9 characters total)
- 8 letters AND 1 non-alphanumeric (9 characters total)
- 6 letters AND 1 digit AND 1 upper-case (8 characters total)
- 6 letters AND 1 digit AND 1 non-alphanumeric (8 characters total)
- 6 letters AND 1 upper-case AND 1 non-alphanumeric (8 characters total)
- 4 letters AND 1 digit AND 1 upper-case AND 1 non-alphanumeric (7 characters total)
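One way to read the list above is that each additional character class (lower-case, upper-case, digit, non-alphanumeric) lowers the minimum length by one, starting from 10 for a single class. The sketch below checks a candidate against that reading. It is an interpretation of the list, not the check that Biocluster actually applies, and the function name password_long_enough is made up for illustration.

```shell
#!/bin/sh
# Sketch: check a candidate password against one reading of the list above:
# minimum length = 11 - (number of character classes used).
# The body runs in a subshell so that LC_ALL=C does not leak to the caller.
password_long_enough() (
    LC_ALL=C
    pw=$1
    classes=0
    # Count which of the four character classes occur at least once
    case $pw in *[a-z]*) classes=$((classes + 1)) ;; esac
    case $pw in *[A-Z]*) classes=$((classes + 1)) ;; esac
    case $pw in *[0-9]*) classes=$((classes + 1)) ;; esac
    case $pw in *[!a-zA-Z0-9]*) classes=$((classes + 1)) ;; esac
    # An empty password has no classes and always fails
    [ "$classes" -gt 0 ] && [ "${#pw}" -ge $((11 - classes)) ]
)
```

The function returns success when the candidate is long enough for the classes it uses. It says nothing about how guessable the password is, and the warning above about not reusing passwords still applies.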

