
Introduction to Thunder Cluster

Overview

Acquisition of the Thunder cluster was supported by the NSF Major Research Instrumentation (MRI) program in order to provide state-of-the-art computing resources for the entire NDSU research community. The cluster is designed to scale both horizontally and vertically to meet the rapid growth of computational needs; additional resources can be added to the cluster on demand.

Currently, the cluster consists of 53 compute nodes, two login nodes, an NFS server, management servers, and the GPFS storage system. All nodes are interconnected through an FDR InfiniBand (56 Gb/s) fabric. The theoretical peak performance of the cluster is about 40 TFlops. The fifty-three compute nodes are categorized into six groups based on hardware specification, purpose, and ownership, as shown in Figure 1: twenty-four prod nodes, ten condo nodes, fourteen mic nodes, two mic-devel nodes, one sandy node, and two lm nodes. Each group is assigned to a specific queue for the most efficient use of the computing resources.

All of these compute nodes are connected to the tiered GPFS filesystem and the tape archive system with policy-driven hierarchical storage management (HSM). The data on this storage system is automatically migrated to another tier and to the tape archive system according to the policy.

Accessing Thunder Cluster

Figure 2 shows a schematic diagram of the Thunder cluster. The hostname of the Thunder cluster is “thunder.ccast.ndsu.edu”, and connections to the Thunder cluster are automatically routed to one of the two login nodes, either login-0001 or login-0002. Users must use a secure shell (SSH) connection to access the Thunder cluster. SSH comes with Apple OS X and typical Linux distributions, so there is no need to install additional software on those systems. Windows users, however, need to download and install PuTTY to create an SSH connection.
PuTTY can be downloaded from the following URL: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.

Batch Queue System

The batch queue system is a resource manager that divides the entire computing resource into smaller groups and allocates computing resources based on the resource request, according to policy, for efficient and fair use of the system among all users. Using the computing resources in a way that bypasses the batch system is prohibited.

Queues

On the Thunder cluster, there are five queues based on hardware, walltime, memory, and other needs, as shown in Table 1. All users have access to all the queues listed – default, bigmem, def-devel, bm-devel, and preempt – and can choose the appropriate queue for their needs. Users may want to use the preempt queue to access the condo resource pool when those resources are available. However, please note that a preempt job may be terminated if the condo resources are requested by condo users.

Queue     | Route Order | Execution Queue | Available Nodes per Queue | Processor Cores | Max Walltime | Max Nodes per Job | Max Processors per Job | Notes
default   | 1           | def-short       | 38 | 760  | 8 hours  | 38 | 760  | Only non-interactive jobs will be accepted; the route queue routes jobs in the order listed.
default   | 2           | def-medium      | 18 | 360  | 24 hours | 18 | 360  |
default   | 3           | def-long        | 4  | 80   | 72 hours | 4  | 80   |
def-devel |             |                 | 2  | 40   | 4 hours  | 2  | 40   | Interactive jobs are allowed on this queue.
bigmem    | 1           | bm-short        | 2  | 64   | 8 hours  | 2  | 64   | Only non-interactive jobs will be accepted; the route queue routes jobs in the order listed.
bigmem    | 2           | bm-medium       | 2  | 64   | 24 hours | 2  | 64   |
bm-devel  |             |                 | 1  | 16   | 4 hours  | 1  | 16   | Interactive jobs are allowed on this queue.
preempt   |             |                 | 60 | 1200 | 24 hours | 60 | 1200 | Non-condo user jobs may be dropped upon condo user job submission.

For more information, please see CCAST User Policies under Queue Policies.

Table 1: Types of queues
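
As a minimal sketch (submit_script is a placeholder name), a job is directed to a particular queue either on the qsub command line or with a #PBS directive inside the submit script:

qsub -q bigmem submit_script

or, inside the submit script:

#PBS -q default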

Job Submission

In order to get computing resources allocated, the user must submit their job to the batch system in one of two ways: as an interactive job or as a non-interactive batch job.

  • Interactive job – the user can execute commands interactively through the terminal window. This is useful for program debugging. An example of submitting an interactive job is shown below.
  • Batch job – this is the typical way to submit a job. The job runs in the background according to the submit script the user created. An example of submitting a batch job is shown below.
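
For example, an interactive job can be requested with qsub -I on the def-devel queue (the resource limits shown are only illustrative):

qsub -I -l nodes=1:ppn=8 -l walltime=4:00:00 -q def-devel

A non-interactive batch job is submitted by passing the submit script (here, sample.pbs) to qsub:

qsub sample.pbs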

Example of a Submit Script

An example of the submit script is shown in Figure 9; a complete, annotated PBS script is also given in the FAQ section below.
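
As a minimal sketch (the job name, resource requests, and executable path are placeholders to adapt to your own job):

#!/bin/bash
# nickname of the job
#PBS -N my_job
# one compute node, eight processors on that node
#PBS -l nodes=1:ppn=8
# maximum wall clock time
#PBS -l walltime=3:20:00
# queue name
#PBS -q default
# run from the directory the job was submitted from
cd $PBS_O_WORKDIR
/path/to/the/executable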

Useful commands in the batch queue system

Command | Example                  | Description
qstat   | qstat                    | Display the status of all jobs in the batch system.
qstat   | qstat -f JOBID           | Display detailed information about a specific job.
qsub    | qsub submit_script       | Submit a job to the batch system.
qdel    | qdel JOBID               | Terminate a specific job.
qalter  | qalter -N nickname JOBID | Change the nickname of a specific job.
qmove   | qmove queue JOBID        | Move a specific job to another queue.

Table 2: Batch system commands

Environment Modules

Many applications are available on the Thunder cluster. To run these applications correctly, the user must set up the Linux shell environment correctly. On the Thunder cluster, the user can change the shell environment for the application of interest by using environment modules, as shown in Figure 10. Applications are named as “application_name/version-compiler” in environment modules. If the user does not specify the “version-compiler” part, the default version is selected.
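
For example, using the “application_name/version-compiler” naming convention described above (replace the placeholders with a real application listed by module avail):

module avail
module load application_name                    # loads the default version
module load application_name/version-compiler   # loads a specific version-compiler build
module list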

Useful Commands in Environment Modules

Command | Description
module avail | List available applications.
module load application | Load shell environment variables for a specific application.
module unload application | Unload shell environment variables for a specific application.
module list | List currently loaded application environments.
module swap application1 application2 | Swap shell environment variables from application1 to application2.
module purge | Unload all shell environment variables currently loaded by environment modules.

Storage System

The Thunder cluster employs a state-of-the-art GPFS HSM filesystem on which data migrates between the disk tiers and the tape archive in order to achieve performance and capacity cost-effectively. Several directories with different purposes, performance characteristics, and limitations are available to users on the Thunder cluster, as shown in Figure 11.

Directory Structure

Each directory is summarized below:

  • User’s Home Directory (/gpfs1/home/username) – the user’s private and permanent storage. It consists of disk and the tape archive system. Files under this directory are protected by the tape back-up system. Users have a quota of 500 GB on the tape archive system and a disk quota of 200 GB for this directory. The disk quota can be exceeded by 50 GB for 24 hours. Upon expiration of this 24-hour grace period, files larger than 10 MB are automatically migrated to the tape archive system based on when each file was last accessed; files with the oldest access date are migrated first in order to free up disk space. Throughput of the disk directory exceeds 2.6 GB/s.

  • Global Scratch Directory (/gpfs1/scratch/username) – a temporary directory for the input files used to run applications and for the temporary files generated by applications. All users must run their applications from this directory; running applications from any other directory is a violation of CCAST policy, and such jobs will be terminated. Files under this directory are kept for 60 days; files not accessed in the last 60 days are purged automatically. Files are not protected by the tape backup system, and there is no disk quota for this directory. Throughput of the disk directory exceeds 4.1 GB/s.

  • Project Directory (/gpfs1/projects/project_name) – the directory for sharing data among research group members. Occasionally, a project directory can be created for multiple research groups upon approval of a request. The project directory consists of disk space and the tape archive system. Files under this directory are protected by the tape back-up system. Users have a quota of 2 TB on the tape archive system and a disk quota of 1 TB for this directory. The disk quota can be exceeded by 250 GB for a 24-hour period. Upon expiration of this 24-hour grace period, files larger than 10 MB are automatically migrated to the tape archive system based on when each file was last accessed; files with the oldest access date are migrated first in order to free up disk space. Throughput of the disk directory exceeds 2.6 GB/s.

There is also a directory called "ARCHIVE" under each user's home and project directory. Users may want to place files in this directory so that they are moved to the tape archive system before the disk quota is reached, thereby saving disk space. This directory is intended for files that are not accessed frequently.
NOTE: only files larger than 10 MB will be stored on the tape archive system.
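
For example (old_results.tar is a placeholder file name), an infrequently used file can be placed under ARCHIVE so that it is stored on tape rather than counting against the disk quota:

mv /gpfs1/home/username/old_results.tar /gpfs1/home/username/ARCHIVE/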

Thunder FAQs

How do I log into a CCAST Cluster?

There are currently four clusters at CCAST. The following host names can be used:

cluster2.ccast.ndsu.edu
cluster3.ccast.ndsu.edu
thunder.ccast.ndsu.edu
cyrus.ccast.ndsu.edu

How to log in from a Windows computer

The PuTTY SSH client should be used to access any CCAST cluster from a Windows computer. Once PuTTY is installed, open the application and enter the hostname of the cluster you want to access (for Thunder, thunder.ccast.ndsu.edu).

How to log in from an Apple/Linux computer

Open a terminal window and then execute the following line from the terminal to access any CCAST cluster.

NOTE: “username” is the user’s login name, and the host name can be replaced with any of the cluster names listed above. For example, to log into the Thunder cluster:
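
ssh username@thunder.ccast.ndsu.edu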

Unlike on a typical desktop, users cannot run their applications by logging into a compute node directly, nor may they run applications on the login nodes. From the login nodes, users can edit files and submit their applications to the batch queue system.

How can I transfer data to a CCAST Cluster?

There are currently four clusters at CCAST. The following host names can be used:

cluster2.ccast.ndsu.edu
cluster3.ccast.ndsu.edu
thunder.ccast.ndsu.edu
cyrus.ccast.ndsu.edu

Transferring between a Windows computer and a CCAST cluster

The WinSCP client should be used to transfer data between your Windows computer and any CCAST cluster. WinSCP can be downloaded for free from the WinSCP website. Once the download is complete, open the application and fill in the fields as seen in the screenshot below.

Any of the host names from the above list can be substituted for the host name field.

Enter your username and password in the appropriate fields, select 'SCP' as the file protocol, and click Login.
Once you are logged in, you should see the following screen:

Now you can drag and drop your files between your computer and any CCAST cluster easily.

Transferring between an Apple/Linux computer and a CCAST cluster

You can use the scp command to transfer files between your computer and any CCAST cluster. Enter the following commands in a terminal on your own computer.

To transfer files from a CCAST cluster to your computer:

$> scp [username@hostname]:[source-file] [destination]

For example, you can transfer files from Cluster3 to your computer as shown below:

$> scp your_username@cluster3.ccast.ndsu.edu:/home/your_username/myfile.txt /home/mycomputer/myfile.txt

To transfer files from your computer to a CCAST cluster:

$> scp [source-file] [username@hostname]:[destination]

For example, you can transfer files from your computer to the Thunder cluster as shown below:

$> scp myfile.txt your_username@thunder.ccast.ndsu.edu:/home/your_username

Any of the host names from the above list can be used as the host name in these commands.

How do I work with Batch Queue Systems (PBS/Torque)?

Some useful commands in the batch queue system:

Here at CCAST, we use TORQUE, an open-source implementation of PBS. The batch system, or resource manager, divides up larger computer systems so that multiple users and their jobs can run on them. Another part of the batch system is the scheduler, which decides where and when a job will run. A user requests resources from the resource manager in terms of time, the number of compute nodes, and the resources on those nodes, which may be RAM, network, disk, software licenses, etc. Once a job is submitted to the batch system, the scheduler decides where and when that job will run.

Below you will find information on:

  • A Sample PBS batch script
  • Environment variables available to your script
  • Batch system commands

Sample batch script

The best way to demonstrate using the batch system is with an example input script. Below is a sample script that can be modified to run a job; more application-specific scripts can be found on the interactive nodes under /usr/local/PBS_EXAMPLES. More information can be found in the manual page for qsub by typing man qsub. The '#PBS' directives are significant: they are not ordinary comments but special directives to the batch system, although they are ignored by the scripting language itself. To submit a job, simply type qsub $SCRIPTNAME.

PBS script example

#!/bin/bash
#
# file name: sample.pbs
# usage: qsub sample.pbs
#
# nick name of your job
#PBS -N my_first_job
#
# resource limits: number of nodes and number of processors per node to be used
# In this case, requesting a single node and eight processors on that node.
# nodes: number of compute nodes
# ppn: number of processors per node
#PBS -l nodes=1:ppn=8
#
# resource limits: amount of memory to be used
#PBS -l mem=1024mb
#
# resource limits: maximum wall clock time that can be allocated
#PBS -l walltime=3:20:00
#
# path and filename of standard output
#PBS -o path/filename.o
#
# path and filename of standard error
#PBS -e path/filename.e
#
# queue name (one of the queues available on the cluster)
# The default queue, "default", need not be specified
#PBS -q default
#
# user’s email address
#PBS -M my-email-address
#
# send an email when the job begins (b), ends (e), or aborts with an error (a)
#PBS -m abe
#
# export all current shell environment variables to the job
#PBS -V
#
/path/to/the/executable

Batch environment variables available in your script

Some environment variables are provided by the batch system to your script. Some of the more meaningful ones are:

Batch Variables

Variable | Description
PBS_JOBID | The unique job identifier of your job. This can be used to create scratch directories or other unique identifiers that are specific to your job.
PBS_O_WORKDIR | The working directory from which you originally submitted your job. It is common to use 'cd $PBS_O_WORKDIR' or to access input/output files via $PBS_O_WORKDIR.
PBS_JOBNAME | The name specified with the -N option when submitting your script.
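
As a minimal sketch of how these variables are typically used inside a submit script (the scratch path follows the directory layout described in the Storage System section; username and input.dat are placeholders):

cd $PBS_O_WORKDIR
SCRATCH=/gpfs1/scratch/username/$PBS_JOBID
mkdir -p $SCRATCH
cp input.dat $SCRATCH/
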
Batch system commands:

Command | Example                  | Description
qstat   | qstat                    | Display the status of all jobs in the batch system.
qstat   | qstat -f JOBID           | Display detailed information about a specific job.
qsub    | qsub submit_script       | Submit a job to the batch system.
qdel    | qdel JOBID               | Terminate a specific job.
qalter  | qalter -N nickname JOBID | Change the nickname of a specific job.
qmove   | qmove queue JOBID        | Move a specific job to another queue.

How to start using Materials Studio?

You have to contact the CCAST support staff if you need to start using Materials Studio. We would be glad to come down to your workplace to install it for you.

How to run interactive jobs in the cluster?

Please use the command below to run interactive jobs in Cluster 2 and Cluster 3:
qsub -I -l nodes=1:ppn=8 -l walltime=4:00:00 -q devel

Please use the command below to run interactive jobs in Thunder:
qsub -I -l nodes=1:ppn=8 -l walltime=4:00:00 -q def-devel 

Note: Please limit your interactive jobs to a maximum of 2 hours.

Can I access individual compute nodes?

Yes, you can access individual nodes if you have a job running on that particular node. For example, if you have a job running on cluster3-24 and cluster3-25, you can log in to these nodes by typing

ssh cluster3-xx (replace xx with the node number)

NOTE: Access is not allowed unless you have a job already running on the node through the batch system.

My rsync request times out before the process is complete?

When you issue an rsync command, it runs on the cluster head node. The head node has a 30-minute time limit for running commands, so if you are trying to rsync a large amount of data, it will time out after 30 minutes. The best option is to create multiple directories and rsync them separately, one by one. Another option is to remove the 'z' option from the command, which saves the time spent compressing:

rsync -av [SOURCE] [DESTINATION]

When using rsync, be extremely careful about the trailing slash. For example, look at the following two rsync commands:

#1: rsync -av /some/path/a/ /some/otherpath/
#2: rsync -av /some/path/a /some/otherpath/

The first command will make /some/otherpath/ mirror the content of /some/path/a/, whereas the second command will create a directory a inside /some/otherpath/ to mirror the content of /some/path/a. Therefore, be extremely mindful about what you want.

How to find the nodes my job is running on?

While your job is running, you can use 

qstat -n JOB_ID

to get the node information.

NOTE: The first node is MPI node 0 of your job.

Example: your output will look something like this:

cluster3-60/7+cluster3-60/6+cluster3-60/5+cluster3-60/4+cluster3-60/3

This means that your job runs on processors 3, 4, 5, 6, and 7 of cluster3 node 60.

How do I know how much storage I’m using in GPFS Scratch?

Navigate to your GPFS scratch directory and enter:

du -sch

This will show you the amount of storage you are using in GPFS scratch. If you want more information on your sub-directories, do:

du -sch *

 

How do I run Materials Studio on my computer?

Materials Studio is available for CCAST users. If you need to use Materials Studio on your computer, please send an email to support@ccast.ndsu.edu and we will come and install it for you.

What if I want to run a job for longer than the queue permits?

Currently, the longest queue we have (the long queue) allows for 2 weeks of computation. We believe this is sufficient for most users. However, if your job needs more than 2 weeks of computation time, please email us at support@ccast.ndsu.edu.

How do I run a sequence of similar jobs on cluster?

If you want to run an array of jobs on the cluster, use:

qsub -t 1-5 array.pbs

Instead of 1-5, you can use whatever range of job indices you want to run. If you want to run specific IDs, you can type -t 1,10,20 instead of the range -t 1-5.

NOTE: The only difference between the jobs is that the PBS_ARRAYID environment variable will be set, and it will hold each job's unique array index.

You can delete jobs from the queue with the same -t syntax; for example, qdel -t 1-5 YOUR_JOB_ID[] will delete all 5 of the array jobs.
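
As a minimal sketch of an array.pbs script (the input file naming scheme input_1.dat ... input_5.dat is hypothetical), each array member uses PBS_ARRAYID to select its own input:

#!/bin/bash
#PBS -N array_example
#PBS -l nodes=1:ppn=1
#PBS -l walltime=1:00:00
cd $PBS_O_WORKDIR
# each array member processes its own input file, e.g. input_1.dat through input_5.dat
/path/to/the/executable input_${PBS_ARRAYID}.dat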

How do I forward an X11 connection from my job?

Users may want to use X11-forwarded applications for pre-processing or post-processing. However, since our users are spread across many platforms, there are many reasons for this to fail; therefore, we strongly discourage users from forwarding X11. In unavoidable circumstances, users can connect to a cluster over SSH with X11 enabled using the following command:

ssh -X USERNAME@clusterX.ccast.ndsu.edu (replace X with the cluster number and USERNAME with your username)

Once you are logged in, run the following command to start an interactive job with X11 forwarding.

qsub -I -l nodes=1:ppn=8 -q short -l walltime=04:00:00 -X 

 

How do I make a job depend on another?

By adding the -W depend=CONDITIONS syntax to your qsub command.

CONDITIONS can be after:jobid[:jobid...] or afterok:jobid[:jobid...], where OK is defined as the job exiting with a 0 exit status. There are a number of dependency options; it is best to run man qsub on the cluster and determine which job dependency options are best for you.
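
For example (first_job.pbs and second_job.pbs are placeholder script names; qsub prints the ID of the submitted job, which is captured here in a shell variable), the second job will not start until the first one completes successfully:

FIRST_JOBID=$(qsub first_job.pbs)
qsub -W depend=afterok:$FIRST_JOBID second_job.pbs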

Why isn’t my job running?

There can be multiple reasons for this. First, check whether your job is listed in the queue. If yes, then your job will run eventually. Remember that this is a shared resource and that eventually all of your jobs will run. Try to provide more accurate resource requirements (walltime, processors); the system prefers shorter/smaller jobs, as those are the easiest to schedule. You can use the checkjob JOBID command to get extra information about your job.

If your job is not listed in the queue (running or waiting), this could be due to a problem with either your PBS script or your input file. Please refer to the appropriate software sections on the website for more information about running particular jobs. If you still can't find the problem, contact support@ccast.ndsu.edu for help.

My question is not here, who can I ask for help?

Please do not hesitate to contact us.

How can I get additional software installed on the clusters?

Just send a support request to support@ccast.ndsu.edu; if the software is appropriate for use on the systems, it will be installed centrally.

Can I install 3rd party Python modules for software available on the clusters?

You can install 3rd party modules under your home directory for any software that supports them. The installation process differs from software to software. Below is an example of installing a 3rd party module for Python 3.

First, download the tarball for the module you want to install and untar it. Then navigate into the extracted directory and type:

python setup.py install --home=/home/YOURUSERNAME

This installs the 3rd party module under your home directory. To import the module in a Python program, add the following at the top of your script:

import sys

sys.path.append('/home/YOURUSERNAME/lib/python/')

import YOUR_3RD_PARTY_MODULE

If you need any help installing a 3rd party module for your software, please email us. We would be happy to help.

How do I get rid of the special characters that get generated when I edit a file in Windows?

When you edit a file in Windows, it adds special characters at the end of each line to indicate a new line (a carriage return). You need to remove these characters for the file to work correctly on Linux. The easiest way to do this is to use the "dos2unix" command.

Just type:

$>dos2unix /path/to/filename

This removes the unnecessary end-of-line characters from your file. For more options, please look at the man page or email us.

What are the differences between /gpfs1/home, /gpfs1/scratch, and /gpfs1/projects?

All of these areas are served by IBM's GPFS and appear as one namespace, but the underlying physical storage and usage policies differ between the areas.

/gpfs1/home and /gpfs1/projects are on the slower tier-2 storage pool. Under each of these directories, each user's home directory and project area is known as a GPFS fileset. Because of the different filesets, you will see different output values from the UNIX command df. df /gpfs1 will give you the total free space on all of the storage pools within GPFS, while df /gpfs1/projects/PROJECT_NAME or df /gpfs1/projects/USERNAME will give the disk usage according to the GPFS filesets and the quotas set on those filesets.
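
For example (PROJECT_NAME is a placeholder; the -h flag just prints human-readable sizes):

df -h /gpfs1
df -h /gpfs1/projects/PROJECT_NAME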

/gpfs1/scratch is on the tier-1 storage pool. There are no individual user limits on this filesystem; it is one open storage area. Data on this filesystem that has not been accessed within 30 days will be purged. It is against CCAST policy for users to use this area for longer-term storage.

How to work with modules

Environment Modules

Many applications are available on any CCAST cluster. To run these applications correctly, the user must set up the Linux shell environment correctly. The user can change the shell environment for the application of interest by using environment modules, as shown below. Applications are named as “application_name/version-compiler” in environment modules. If the user does not specify the “version-compiler” part, the default version is selected.

Useful Commands in Environment Modules

Command | Description
module avail | List available applications.
module load application | Load shell environment variables for a specific application.
module unload application | Unload shell environment variables for a specific application.
module list | List currently loaded application environments.
module swap application1 application2 | Swap shell environment variables from application1 to application2.
module purge | Unload all shell environment variables currently loaded by environment modules.

 

