Changes between Version 5 and Version 6 of GCCluster


Timestamp: 2012-10-02T22:48:11+02:00
Author: Pieter Neerincx

== Login to the User Interface server ==

To submit jobs, check their status, test scripts, etc. you need to log in to the user interface server a.k.a. cluster.gcc.rug.nl using SSH.
Please note that cluster.gcc.rug.nl is only available from within certain RUG/UMCG subnets. From outside you need a double hop: first log in to the proxy and from there hop on to cluster.gcc.rug.nl:
{{{
$> ssh [your_account]@proxy.gcc.rug.nl

$> ssh [your_account]@cluster.gcc.rug.nl
}}}
If you are inside certain subnets of the RUG/UMCG network, you can skip the proxy and log in to cluster.gcc.rug.nl directly.

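If you use the double hop frequently, you can let SSH chain the two connections for you. Below is a minimal sketch of a client-side ''~/.ssh/config'' entry, assuming an OpenSSH client that supports ''ssh -W''; the account placeholder follows the convention used above:
{{{
# Example ~/.ssh/config entry: tunnel via proxy.gcc.rug.nl automatically.
Host cluster.gcc.rug.nl
    User [your_account]
    ProxyCommand ssh -W %h:%p [your_account]@proxy.gcc.rug.nl
}}}
With this in place a plain ''ssh cluster.gcc.rug.nl'' from outside performs both hops in one go.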

== Available queues ==
     
To test how your jobs perform on an execution node and get an idea of the typical resource requirements for your analysis, you should submit a few jobs to the test queues first. The test queues run on a dedicated execution node, so if your jobs accidentally make that server run out of disk space or memory, or do other nasty things, this will not affect the production queues and the corresponding nodes.

Once you've tested your job scripts and are sure they behave nicely & perform well, you can submit jobs to the production queue named ''gcc''. In case you happen to be part of the gaf group and need to process high priority sequence data for the Genome Analysis Facility, you can also use the ''gaf'' queue.

||**Queue**||**Job type**||**Limits**||
     
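The same overview can be requested on the cluster itself: Torque's standard ''qstat -q'' command lists the queues together with their limits and current job counts:
{{{
$> qstat -q
}}}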

=== Submitting jobs ===
A simple submit of a job script to the default queue, which routes your job to the ''gcc'' production queue:
{{{
$> qsub myScript.sh
}}}
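To target a specific queue instead of the default route, for example one of the test queues recommended above, pass the queue name explicitly with ''-q'' (the queue name here is taken from the ''test-short'' example further down):
{{{
$> qsub -q test-short myScript.sh
}}}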
Submitting a job with a job name different from the file name of the submitted script (the default) and with a dependency on a previously submitted job.
This job will not start before the dependency has completed successfully:
{{{
$> qsub -N [nameOfYourJob] -W depend=afterok:[ID of a previously submitted job] myScript.sh
}}}
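Since qsub prints the ID of the job it just submitted, you can capture that ID in a (bash) wrapper script and feed it to the dependent submission. A minimal sketch; the script names are only examples:
{{{
#!/bin/bash
# Submit the first job; qsub prints its job ID on stdout.
FIRST_JOB_ID=$(qsub firstStep.sh)
# Submit the second job; it starts only after the first one completed successfully.
qsub -N secondStep -W depend=afterok:${FIRST_JOB_ID} secondStep.sh
}}}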
Instead of providing arguments to qsub on the command line, you can also add them using the ''#PBS'' syntax as a special type of comment in your (bash) job script, like this:
{{{
#!/bin/bash
#PBS -N jobName
#PBS -q test-short
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:06:00
#PBS -l mem=10mb
#PBS -e /some/path/to/your/testScript1.err
#PBS -o /some/path/to/your/testScript1.out

[Your actual work...]
}}}

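After submitting such a script with a plain ''qsub'', you can follow your jobs with Torque's standard ''qstat'' command, for example restricted to your own account:
{{{
$> qsub myScript.sh
$> qstat -u [your_account]
}}}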
     
||Execution node||targetgcc10||192.168.211.200||''pbs_mom''||Redundant production node: only the default ''gcc'' and priority ''gaf'' queues run on this node.||

== PBS software / flavour ==

     
=== Maui ===

Maui runs only on the schedulers, with its config files in $MAUI_HOME:
{{{
/usr/local/maui/
     
Torque's pbs_server daemon runs only on the schedulers.[[BR]]
Torque's pbs_mom daemon runs only on the execution nodes, where the real work is done.[[BR]]
Torque config files are installed in $TORQUE_HOME:
{{{
/var/spool/torque/
}}}

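To quickly check that the right daemon runs on the right type of host, you can query the init scripts described under ''Installation details'' below, assuming they support the usual ''status'' action; for example, on a scheduler and on an execution node respectively:
{{{
$> service pbs_server status
$> service pbs_mom status
}}}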

== Dual scheduler setup for seamless cluster upgrades ==

We use two schedulers: scheduler01 and scheduler02. These alternate as production and test scheduler. The production scheduler is hooked up to cluster.gcc.rug.nl and does not allow direct user logins; hence you cannot submit jobs from the production scheduler itself, but only from cluster.gcc.rug.nl. The other one is the test scheduler, which does not have a dedicated user interface machine and does allow direct user logins; to submit jobs there you need to log in to the test scheduler directly. When it is time to upgrade software or tweak the !Torque/Maui configs:

 * We drain a few nodes: running jobs are allowed to finish, but no new ones will start.[[BR]]
   On the production scheduler as root:
{{{
$> qmgr -c 'set node targetgcc[0-9][0-9] state = offline'
}}}
 * Once ''idle'', move the drained nodes from the production to the test scheduler.[[BR]]
   Change the name of the scheduler in both of these files on each node to be moved:
{{{
$TORQUE_HOME/server_name
$TORQUE_HOME/mom_priv/config
}}}
   On each execution node where the config changed, run as root:
{{{
$> service pbs_mom restart
}}}
   On the test scheduler as root:
{{{
$> qmgr -c 'set node targetgcc[0-9][0-9] state = online'
}}}
 * Check the change in available execution nodes using:
{{{
$> pbsnodes
}}}
 * Test the new setup
 * Disable direct logins to the test scheduler
 * Enable direct logins to the production scheduler
 * Disable job submission from cluster.gcc.rug.nl on the production scheduler (one way to do this with qmgr is sketched below)
 * Take cluster.gcc.rug.nl offline
 * Make cluster.gcc.rug.nl the user interface and submit host for the test scheduler
 * Take cluster.gcc.rug.nl back online: the test scheduler is now the new production scheduler and vice versa
 * Drain additional nodes and move them to the new production scheduler
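For the disable/enable job submission steps, one possible approach in Torque is to manage the server's ''submit_hosts'' attribute with qmgr; the commands below are a sketch under that assumption, not necessarily the exact procedure used on this cluster:
{{{
# On the old production scheduler, as root: stop accepting submissions from the user interface host.
$> qmgr -c 'set server submit_hosts -= cluster.gcc.rug.nl'
# On the new production scheduler, as root: allow submissions from it.
$> qmgr -c 'set server submit_hosts += cluster.gcc.rug.nl'
}}}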

== Installation details ==

Our current config files:
 * $TORQUE_HOME/mom_priv/config
 * $MAUI_HOME/maui.cfg
 * Other Torque settings can be loaded from a file using qmgr.[[BR]]
   To export/inspect the settings use:
{{{
$> qmgr -c 'p s'
}}}
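The reverse direction also works: qmgr reads its commands from standard input, so an exported settings file (the file name below is just an example) can be replayed on a freshly installed server:
{{{
$> qmgr -c 'p s' > torque_server_settings.qmgr
$> qmgr < torque_server_settings.qmgr
}}}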

Both the Torque and Maui source downloads contain a contrib folder with /etc/init.d/ scripts to start/stop the daemons. We use versions patched for:
 * The location where the daemons are installed.
 * The run levels at which the daemons should be started or stopped.
 * Dependencies: GPFS is explicitly defined as a service required for starting/stopping the Torque and Maui daemons (illustrated below).
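As an illustration of the dependency patch: on a SUSE-style init system such a requirement is typically expressed in the script's LSB header. The excerpt below is a hypothetical sketch, not a copy of our patched scripts:
{{{
### BEGIN INIT INFO
# Provides:       pbs_mom
# Required-Start: $network gpfs
# Required-Stop:  $network gpfs
# Default-Start:  3 5
# Default-Stop:   0 1 2 6
# Description:    Torque execution daemon (pbs_mom), started only after GPFS is up.
### END INIT INFO
}}}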

To install:

On scheduler[01|02]:
     
}}}

On targetgcc![01-10]-mgmt:
{{{
$> cp suse.pbs_mom    /etc/init.d/pbs_mom;    chkconfig -a pbs_mom;    service pbs_mom status