Changes between Version 3 and Version 4 of GCCluster


Timestamp: 2012-10-02T17:10:25+02:00 (12 years ago)
Author: Pieter Neerincx
Comment:

--

  • GCCluster

= GCC cluster =

The GCC has its own 480 core cluster. The main workhorses are 10 servers each with
 * 48 cores
 * 256 GB RAM
 * 1 GBit management NIC
 * 10 GBit NIC for a dedicated fast IO connection to a
 * 2 PB shared GPFS for storage

= For users =

== Login to the User Interface server ==

To submit jobs, check their status, test scripts, etc. you need to log in to the user interface server a.k.a. cluster.gcc.rug.nl using SSH.
Please note that cluster.gcc.rug.nl is only reachable from within certain RUG/UMCG subnets. From outside you need a double hop. First log in to the proxy:
{{{
$> ssh [your_account]@proxy.gcc.rug.nl
}}}
followed by:
{{{
$> ssh [your_account]@cluster.gcc.rug.nl
}}}
If you are within certain subnets of the RUG/UMCG network, you can skip the proxy step and log in to cluster.gcc.rug.nl directly.
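If you need the double hop frequently, you can let SSH chain both logins automatically. Below is a minimal sketch for your local ~/.ssh/config, assuming a reasonably recent OpenSSH client with ProxyCommand support; the host names and the [your_account] placeholder are the same as above:
{{{
# Hop via proxy.gcc.rug.nl transparently whenever you ssh to cluster.gcc.rug.nl
Host cluster.gcc.rug.nl
    User [your_account]
    ProxyCommand ssh [your_account]@proxy.gcc.rug.nl -W %h:%p
}}}
With this in place a plain ''ssh cluster.gcc.rug.nl'' from outside the RUG/UMCG network should tunnel through the proxy in one go.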

== Available queues ==

In order to quickly test jobs you are allowed to run them directly on cluster.gcc.rug.nl outside the scheduler. Please think twice though before you hit enter: if you crash cluster.gcc.rug.nl others can no longer submit or monitor their jobs, which is pretty annoying. On the other hand it's not a disaster, as the scheduler and execution daemons run on physically different servers and hence are not affected by a crash of cluster.gcc.rug.nl.

To test how your jobs perform on an execution node and to get an idea of the typical resource requirements for your analysis, you should submit a few jobs to the test queues first. The test queues run on a dedicated execution node, so if your jobs accidentally make that server run out of disk space, run out of memory or do other nasty things, this will not affect the production queues and nodes.
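To get a feel for those requirements, it helps to request explicit limits when submitting to a test queue. A minimal sketch, assuming standard Torque resource syntax; the walltime, memory and core numbers are placeholders you should adapt to your own analysis:
{{{
$> qsub -q test-short -l walltime=00:10:00,mem=4gb,nodes=1:ppn=4 myScript.sh
}}}
Once the job has finished you can compare the resources it actually used with what you requested (see ''qstat -f'' below).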

Once you've tested your job scripts and are sure they behave nicely & perform well, you can submit jobs to the production queue named ''gcc''. In case you happen to be part of the gaf group and need to process high priority sequenced samples for the Genome Analysis Facility, you can also use the ''gaf'' queue.

||**Queue**||**Job type**||**Limits**||
||test-short||debugging||10 minutes max. walltime per job; limited to a single test node / 48 cores||
||test-long||debugging||max 4 jobs running simultaneously per user; limited to half the test node / 24 cores||
||gcc||production - default prio||none||
||gaf||production - high prio||only available to users from the gaf group||
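
The table above is a static summary; the live queue configuration and the number of queued/running jobs per queue can be checked on cluster.gcc.rug.nl itself, assuming the standard Torque queue overview is available to normal users:
{{{
$> qstat -q
}}}
To submit to a specific queue, pass its name with ''-q'', e.g. ''qsub -q gcc myScript.sh''.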

== Useful commands ==

Please refer to the Torque manuals for a complete overview. Some examples:

=== Submitting jobs: ===
{{{
$> qsub -N [nameOfYourJob] -W depend=afterok:[ID of a previously submitted job] myScript.sh
}}}
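Torque's qsub prints the ID of the submitted job on stdout, so you can capture it in a shell variable and build such dependency chains without copy-pasting IDs. A minimal sketch, assuming two hypothetical scripts step1.sh and step2.sh:
{{{
$> JOB1=$(qsub -N step1 step1.sh)    # qsub prints the job ID on stdout
$> qsub -N step2 -W depend=afterok:${JOB1} step2.sh
}}}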

=== Checking for the status of your jobs: ===
Default output for all users:
{{{
$> qstat
}}}
Long job names:
{{{
$> wqstat
}}}
Limit output to your own jobs:
{{{
$> wqstat -u [your account]
}}}
Get "full" a.k.a. detailed output for a specific job (you probably don't want that for all jobs...):
{{{
$> qstat -f [jobID]
}}}
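If you are mainly interested in how many resources a job requested and actually consumed, for instance to pick sensible limits for the next submission, you can filter the full output. A minimal sketch, assuming the usual Torque qstat -f fields (Resource_List and resources_used):
{{{
$> qstat -f [jobID] | grep -i -E 'resource_list|resources_used'
}}}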
Get other detailed status info for a specific job:
{{{
$> checkjob [jobID]
}}}

=== List jobs based on priority, i.e. who is next in the queue: ===
{{{
$> diagnose -p
}}}

=== List available nodes: ===
{{{
$> pbsnodes
}}}
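pbsnodes lists every node with all of its attributes, which is rather verbose. To quickly spot nodes that are currently free you can filter the output; a minimal sketch, assuming the default Torque output format in which each node name is followed by an indented ''state = ...'' line:
{{{
$> pbsnodes | grep -B 1 'state = free'
}}}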

= For admins =

== Servers ==

||**Function**||**DNS**||**IP**||**Daemons**||**Comments**||
||User interface node||cluster.gcc.rug.nl||195.169.22.156||- (clients only)||Login node to submit and inspect jobs. [[BR]] Relatively powerful machine. [[BR]] Users can run code outside the scheduler for debugging purposes.||
||scheduler VM||scheduler01||195.169.22.214||''pbs_server''[[BR]]''maui''||Dedicated scheduler [[BR]] No user logins if this one is currently the production scheduler||
     

Torque clients are available on all servers.[[BR]]
Torque's pbs_server daemon runs only on the schedulers.[[BR]]
Torque's pbs_mom daemon runs only on the execution nodes where the real work is done.[[BR]]
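A quick, generic way to verify which Torque daemons are actually running on a given host is to list the pbs processes; a minimal sketch using standard Linux tools, with nothing cluster-specific assumed:
{{{
$> pgrep -l pbs
}}}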
Torque config files are installed in