INFINITY
https://lcc.ncbr.muni.cz/whitezone/development/infinity/
Petr Kulhánek 1,2 (kulhanek@chemi.muni.cz)
1 CEITEC – Central European Institute of Technology, Masaryk University, Kamenice 5, 62500 Brno, Czech Republic
2 National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Kotlářská 2, 611 37 Brno, Czech Republic
Infinity overview, 18th October 2013

Contents
• Docs & Contacts – how to report a problem
• Terminology – cluster topologies
• Clusters – NCBR/CEITEC clusters, MetaCentrum, CERIT-SC, IT4I
• AMS – Advanced Module System – command overview, sites, modules, personal sites, big brother
• ABS – Advanced Batch System – command overview, resources, job submission, job monitoring

Docs & Contacts
Infinity web pages: https://lcc.ncbr.muni.cz/whitezone/development/infinity/
Support contact: infinity-support@lcc.ncbr.muni.cz (support@lcc.ncbr.muni.cz)
• bug reports, requests for software compilation, etc.
• only administrators are notified
Mailing list: infinity@lcc.ncbr.muni.cz
https://lcc.ncbr.muni.cz/bluezone/mailman/cgi-bin/listinfo/infinity
• general announcements, software news, general discussion, etc.
• all members of the list are notified

How to report a problem
Do not assume that we already know about the problem; when reporting, always:
• send the report to the correct e-mail address (infinity-support@lcc.ncbr.muni.cz)
• state where and when the problem occurred (name of the computer)
• provide the path to the job directory
• make all files readable to everyone (chmod -R a+r,a-X *)
• briefly summarize the problem
Problem reports sent directly to kulhanek@chemi.muni.cz will be silently ignored.

Terminology
[Diagram: a cluster consists of a User Interface (UI, frontend) and a number of Computational Nodes, also called Worker Nodes (WN).]

Terminology, independent clusters
[Diagram: two independent clusters, each with its own UI (frontend), its own batch server and its own worker nodes (WN).]
Examples: wolf.ncbr.muni.cz, sokar.ncbr.muni.cz
Independent clusters: each cluster has its own batch system for job management.
Terminology, grids
[Diagram: several clusters share a common infrastructure with one batch server; each cluster has its own UI (frontend) and worker nodes (WN), and shared storage nodes (SN) are attached to the common infrastructure.]
Examples: skirit.ics.muni.cz, perian.ncbr.muni.cz
Clusters in a grid use a common infrastructure with one batch system for job management.
Terminology, large grids
[Diagram: a large grid composed of several grids. One part, operated by CESNET, contains the clusters skirit, perian, gram, hildor, ... and is accessed as the site metacentrum; another part, operated by CERIT-SC, contains the clusters zegox, zewura, ... and is accessed as the site cerit-sc. Each part has its own batch server, storage nodes (SN), frontends (UI) and worker nodes (WN).]

Clusters

NCBR/CEITEC Clusters
Two main clusters:
WOLF (1.18, 2.34, 24+1 PC, frontend: wolf.ncbr.muni.cz)
SOKAR [0.18, 3x(24 CPU, 72 GB), 4x(64 CPU, 256 GB)]
• Infinity activation is not necessary.
• Job submission requires a passwordless ssh connection among the cluster nodes.

Passwordless connection within cluster
1. Create the public/private ssh keys (use an empty passphrase!):
[kulhanek@wolf01 ~]$ cd .ssh
[kulhanek@wolf01 .ssh]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/kulhanek/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/kulhanek/.ssh/id_rsa.
Your public key has been saved in /home/kulhanek/.ssh/id_rsa.pub.
The key fingerprint is:
e9:07:0b:fc:17:23:b3:c5:1a:8a:0c:1a:98:8f:fe:28 kulhanek@wolf01.wolf.inet
2. Put the public key into the list of authorized keys:
[kulhanek@wolf01 .ssh]$ cat id_rsa.pub >> authorized_keys
3. Test the passwordless connection to another node:
[kulhanek@wolf01 .ssh]$ ssh wolf02
If there are problems with the ssh-agent, try
$ ssh-add -D
or log in to the GUI session again.
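Once the key is in place it is worth checking that every node really accepts it before submitting jobs, because a single node that still asks for a password will break job submission. The following loop is only a minimal sketch in bash; the node names wolf01 to wolf24 and their count are assumptions based on the cluster description above and must be adapted to the actual cluster:
#!/bin/bash
# check that passwordless ssh works on all worker nodes
# BatchMode=yes makes ssh fail immediately instead of asking for a password
for i in $(seq -w 1 24); do                  # assumed numbering wolf01 .. wolf24
    node="wolf$i"
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true 2>/dev/null; then
        echo "$node: OK"
    else
        echo "$node: passwordless login FAILED"
    fi
done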
MetaCentrum and CERIT-SC
• national grid environment
• OS Debian
• approx. 8500 CPU cores
• CEITEC/NCBR own resources: approx. 850 CPU cores
• total storage capacity of 1000 TB, approx. 10 TB per user
http://www.metacentrum.cz/
http://www.cerit-sc.cz/
Free access can be provided to members of any Czech university.
Infinity activation is required. The procedure is described here:
https://lcc.ncbr.muni.cz/whitezone/development/infinity/wiki/index.php/How_to_activate_Infinity

Kerberos & Support
Access to the main services and passwordless authentication among cluster nodes is maintained via Kerberos tickets. Main commands:
• kinit
• klist
• kdestroy
A ticket is valid for 10 hours. It is created automatically by ssh when logging in to a node with a password (it is not created when ssh keys are used for authentication). It can be created or recreated by the kinit command. Expired tickets can lead to very confusing behaviour.
Support (via the Best Practical RT system): meta@cesnet.cz, support@cerit-sc.cz; these addresses are for system-related issues only. All problems related to Infinity should be sent to infinity-support@lcc.ncbr.muni.cz

IT4I
http://www.it4i.cz/
Access is granted on the basis of successful proposals (project duration 6 months). Calls are opened twice a year.
Small cluster: approx. 3000 CPU cores, InfiniBand, Xeon Phi and GPU accelerators.
Big cluster: full operation in 2015.

AMS
Advanced Module System (software management)
https://lcc.ncbr.muni.cz/whitezone/development/infinity/wiki/index.php/How_to_activate_Infinity

Command Overview
Software management:
• site – switching between computational resources
• module – activation/deactivation of software
• ams-config – configuration of software modules
• ams-host – information about the computational node/frontend
• ams-user – information about the logged-in user
• ams-setenv – prepare a fake environment for given computational resources
• ams-autosite – name of the default site for a given computational node/frontend
• ams-root – where the AMS is installed
Use command -h to list all command options.
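To give a feel for how these commands fit together, the following is an illustrative shell session (commands only, output omitted; the module name amber is taken from the examples later in this overview):
# which sites are available and which one is active?
site
# what does this node offer (architecture tokens, number of CPUs, memory)?
ams-host
# which AMS user groups do I belong to?
ams-user
# activate a software package and list what is currently active
module add amber
module active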
AMS Sites

Sites
A site is an encapsulation of computational resources and software packages. On independent clusters there is usually only one site available; on larger grids several sites may be available on each frontend and/or worker node.
Available sites are listed by the site command (without arguments); the active site is shown in square brackets:
[kulhanek@perian ~]$ site
>>> AVAILABLE SITES >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    [metacentrum] cerit-sc
[kulhanek@sokar ~]$ site
>>> AVAILABLE SITES >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    [sokar]

Login to UI/WN
[kulhanek@pes ~]$ ssh sokar.ncbr.muni.cz
kulhanek@sokar.ncbr.muni.cz's password:
Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-37-generic x86_64)
.....
#                      *** Welcome to sokar site ***
# ==============================================================================
# Site name          : sokar (-active-)
# Site ID            : {SOKAR:9848596a-17d1-47e2-9fce-b666fc0e5a36}
#
# ~~~ User identification ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# User name          : kulhanek
# User groups        : compchem,lcc,pmflib
#
# ~~~ Host info ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Full host name     : sokar.ncbr.muni.cz
# Host arch tokens   : i686,noarch,x86_64
# Num of host CPUs   : 16
# Host SMP CPU model : Intel(R) Xeon(R) CPU E5530 @ 2.40GHz [Total memory: 24104 MB]
#
# ~~~ Site documentation and support ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Documentation      : https://lcc.ncbr.muni.cz/whitezone/development/infinity/
# Support e-mail     : infinity-support@lcc.ncbr.muni.cz [issue tracking system]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[kulhanek@sokar ~]$

What site is activated?
If multiple sites are allowed on a UI/WN, the default site is determined by the ams-autosite command when you log in from outside of the site. For example, logging in from sokar (no site) to perian activates the metacentrum site; logging in from sokar to zuphux activates the cerit-sc site.
The active site is preserved if it is allowed on the remote computer: logging in from perian to zuphux preserves the metacentrum site, and logging in from zuphux to perian preserves the cerit-sc site.

Remote login within the site
[kulhanek@perian ~]$ ssh skirit
>>> INFINITY environment will be propagated to the remote computer ...
...
*** Welcome to metacentrum site *** # ============================================================================== # Site name : metacentrum (-active-) # Site ID : {METACETRUM:276b1c6d-4aca-4b8c-b517-be1f66a85ebe} # # ~~~ User identification ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # User name : kulhanek # User groups : compchem,pmflib # # ~~~ Host info ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Full host name : skirit.ics.muni.cz # Host arch tokens : i686,noarch,x86_64 # Num of host CPUs : 4 # Host SMP CPU model : Intel(R) Xeon(R) CPU 5160 @ 3.00GHz [Total memory: 3650 MB] # # ~~~ Site documentation and support ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Documentation : https://lcc.ncbr.muni.cz/whitezone/development/infinity/ # Support e-mail : infinity-support@lcc.ncbr.muni.cz [issue tracking system] # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>> Active modules were restored ... >>> Current directory was restored ... [kulhanek@skirit ~]$ Infinity overview, 18th October 2013 -30- Remote login within the site [kulhanek@perian ~]$ ssh skirit >>> INFINITY environment will be propagated to the remote computer ... ... ... >>> Active modules were restored ... >>> Current directory was restored ... [kulhanek@skirit ~]$ Remote login (interactive) within the site preserves: • the active site • active modules • current working directory (if it is possible) The feature can be disabled by any option given to the ssh command: [kulhanek@perian ~]$ ssh –x skirit ... [kulhanek@skirit ~]$ Infinity overview, 18th October 2013 -31- Changing the active site The active site can be changed by the site command: [kulhanek@perian ~]$ site activate cerit-sc *** Welcome to cerit-sc site *** # ============================================================================== # Site name : cerit-sc (-active-) # Site ID : {CERIT-SC:5d1cc70a-efdf-4017-b446-9a050a016f61} # # ~~~ User identification ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # User name : kulhanek # User groups : compchem,pmflib # # ~~~ Host info ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Full host name : perian.ncbr.muni.cz # Host arch tokens : i686,noarch,x86_64 # Num of host CPUs : 4 # Host SMP CPU model : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz [Total memory: 5500 MB] # # ~~~ Site documentation and support ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Documentation : https://lcc.ncbr.muni.cz/whitezone/development/infinity/ # Support e-mail : infinity-support@lcc.ncbr.muni.cz [issue tracking system] # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [kulhanek@perian ~]$ Infinity overview, 18th October 2013 -32- Info about the active site / a site The information about the active or any other site can be shown by the site command: [kulhanek@perian ~]$ site info metacentrum *** Welcome to metacentrum site *** # ============================================================================== # Site name : metacentrum (-not active-) # # ~~~ User identification ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # User name : kulhanek # User groups : compchem,pmflib # # ~~~ Site documentation and support ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Documentation : https://lcc.ncbr.muni.cz/whitezone/development/infinity/ # Support e-mail : infinity-support@lcc.ncbr.muni.cz [issue tracking system] # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [kulhanek@perian ~]$ if the name of site is not 
provided, the active site is shown. Note: the host information is not available if the site is not active. More detailed information can be shown by the disp action of the site command: [kulhanek@perian ~]$ site disp metacentrum Infinity overview, 18th October 2013 -33- AMS Modules Infinity overview, 18th October 2013 -34- Modules Available software on the active site is listed by the module command: [kulhanek@skirit ~]$ module software packages are categorized AVAILABLE MODULES --- GPU Enabled Software --------------------------------------------------------------------pmemd-cuda --- Molecular Mechanics and Dynamics --------------------------------------------------------amber ambertools cicada espresso pmemd-cuda sander-pmf ... --- Quantum Mechanics and Dynamics ----------------------------------------------------------adf cpmd dalton gaussview multiwfn qmutil ... --- Docking and Virtual Screening -----------------------------------------------------------autodock autodock-vina cheminfo dock mgltools xscore --- Bioinformatics --------------------------------------------------------------------------blast copasi clustalw modeller rate4site blast+ cd-hit fasta muscle --- Conversion and Analysis -----------------------------------------------------------------3dna cats inchi openbabel symmol ... ... Note: the list can contain module versions, this features is configurable by the ams-config command. Infinity overview, 18th October 2013 -35- Modules via iSoftRepo https://lcc.ncbr.muni.cz/whitezone/development/infinity/isoftrepo/fcgi-bin/isoftrepo.fcgi List information about sites and modules. Description, typical usage, ACL (access control list), versions and default version are listed if available. Infinity overview, 18th October 2013 -36- Module build module name (name of software package) determine CPU/GPU architecture for which the build is compiled name:version:architecture:mode module version determine for which parallel execution mode the build is compiled All module versions are listed by: $ module versions <module_name> All module builds are listed by: $ module builds <module_name> Infinity overview, 18th October 2013 -37- sander/pmemd The sander/pmemd programs are applications from the AMBER package. They do molecular dynamics. Detailed information can be found on: http://ambermd.org #!/bin/bash # activate module with sander/pmemd # application module add amber:12.0 # execute the sander program sander –O –i prod.in –p topology.parm7 -c input.rst7 Job script: • only essential logic is present • in most cases, the script is the same for the sequential and parallel runs of the same applications • data are referenced relative to the job directory Infinity overview, 18th October 2013 -38- sander – single/parallel execution The only difference between sequential and parallel execution is in the resource specification during psubmit. The input data and the job script are the same! $ psubmit short test_sander ncpus=1 $ psubmit short test_sander ncpus=2 it can be omitted *.stdout *.stdout ..... Module build: amber:12.0:x86_64:single ..... ..... Module build: amber:12.0:x86_64:para ..... 
Module build, architectures
Architecture – Target:
• noarch – the application requires only a shell environment; it should run everywhere
• i686 – the application requires a 32-bit environment
• x86_64 – the application requires a 64-bit environment
• gpu – the application requires a GPU
• cuda – the application requires the NVIDIA CUDA environment
• ib – InfiniBand is supported
Host architecture tokens are listed via the site command or explicitly via the ams-host command. The optimal module architecture must match all host architecture tokens. In ambiguous cases, the build with the highest architecture score is used (for example, x86_64 has a higher score than i686).

Module build, modes
Mode – Target:
• single – the application can utilize only one CPU
• node – some parts of the application can run in parallel on a single computational node
• smp – some parts of the application can run in parallel on a single computational node
• para – some parts of the application can run in parallel on several computational nodes
• noarch – this build cannot be activated
The optimal mode is determined by the available resources (CPUs, GPUs). The resources can be specified via the psubmit or ams-setenv command. The ams-setenv command should be used only for testing purposes! Its usage is limited to a single computational node!

Default build
In most cases, a module has a default build of the following form:
name:default:auto:auto
• default – use the default version (in most cases, the latest one)
• first auto – determine the best build (architecture) for the given host
• second auto – determine the best mode according to the requested computational resources
Note: not all applications have a default build set. In that case you must specify the module version or even the whole build.

Activate module
An application is activated by the module command:
[kulhanek@skirit ~]$ module add amber
# Module specification: amber (add action)
# =============================================================
  Requested CPUs     : 1    Requested GPUs   : 0
  Num of host CPUs   : 4    Num of host GPUs : 0
  Requested nodes    : 1
  Host arch tokens   : i686,noarch,x86_64
  Host SMP CPU model : Intel(R) Xeon(R) CPU 5160 @ 3.00GHz [Total memory: 3650 MB]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Exported module    : amber:12.0
  Module build       : amber:12.0:x86_64:single
The header summarizes the available resources; the last two lines show the determined and activated module build. The build resolution can be shown by the disp action of the module command:
[kulhanek@skirit ~]$ module disp amber

Activate module
The module activation does not run the application! It only changes the shell environment in such a way that the application is in the PATH.
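As a minimal illustration (reusing the amber module from the example above; the printed path is of course installation specific), activation merely makes the binaries visible to the shell:
# before activation the binary is not found
which sander          # reports that sander is not in the PATH
# activate the module and check again
module add amber:12.0
which sander          # now prints the full path of the sander binary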
activate module vmd [kulhanek@wolf ~]$ module add vmd # Module specification: vmd (add action) # ============================================================= INFO: additional module povray is required, loading ... Loaded module : povray:3.6:i686:single Requested CPUs : 1 Requested GPUs : 0 Num of host CPUs : 4 Num of host GPUs : 0 Requested nodes : 1 Host arch tokens : i686,noarch,x86_64 Host SMP CPU model : Intel(R) Xeon(R) CPU 5160 @ 3.00GHz [Total memory: 3650 MB] # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Exported module : vmd:1.9.1 Module build : vmd:1.9.1:x86_64:single [kulhanek@wolf ~]$ vmd run the application vmd Infinity overview, 18th October 2013 -45- Activate module, recommendations • provide only module name or module name and its version • it is recommended to provide explicit module version in computational scripts (the default version might change time to time) [kulhanek@skirit ~]$ module add amber:11.1 # Module specification: amber:11.1 (add action) # ============================================================= INFO: Module is active, reactivating .. Unload module : amber:11.1:x86_64:single Requested CPUs : 1 Requested GPUs : 0 Num of host CPUs : 4 Num of host GPUs : 0 Requested nodes : 1 Host arch tokens : i686,noarch,x86_64 Host SMP CPU model : Intel(R) Xeon(R) CPU 5160 @ 3.00GHz [Total memory: 3650 MB] # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Exported module : amber:11.1 Module build : amber:11.1:x86_64:single Infinity overview, 18th October 2013 -46- Other operations List active module: [kulhanek@skirit ~]$ module active ACTIVE MODULES compat-ia32:4.0:i686:single dynutil-new:4.0.4241:noarch:single compat-amd64:5.0:x86_64:single povray:3.6:i686:single heimdal:meta:i686:single torque:meta:i686:single amber:11.1:x86_64:single mc:4.8.7:x86_64:single List exported module: [kulhanek@skirit ~]$ module exported [kulhanek@perian ~]$ site EXPORTED MODULES >>> AVAILABLE SITES >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> abs:2.0.4216 amber:11.1 [metacentrum] cerit-sc dynutil-new:4.0.4241 povray:3.6 Note: Exported modules contains only user activated modules without architecture and mode parts. They are passed to jobs by the psubmit command. Infinity overview, 18th October 2013 -47- Deactivate modules To deactivate module, use the remove action of the module command: [kulhanek@skirit ~]$ module remove vmd # Module name: vmd (remove action) # ============================================================= Module build : vmd:1.9.1:x86_64:single Note: The remove action does not remove modules that are activated together with the module due to dependencies (in the case of the vmd module, the povray module is not removed). All modules can be removed by the purge action: [kulhanek@skirit ~]$ module purge Note: Use this only for testing purposes! The action removes all modules including the system ones. Infinity overview, 18th October 2013 -48- How to activate modules automatically [kulhanek@skirit ~]$ ams-config *** AMS Configuration Centre *** ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ------------------------------------------------------------ Main menu -----------------------------------------------------------1 - configure visualization (colors, delimiters, etc.) 
2 - configure auto-restored modules
------------------------------------------------------------
i - site info
s - save changes
p - print this menu once again
q/r - quit program
Type menu item and press enter:

Modules by MetaCentrum VO
You can access modules provided by the MetaCentrum VO using the metamodule command. The command is available only on the metacentrum and cerit-sc sites. To list the modules, the avail action must be provided:
[kulhanek@skirit ~]$ metamodule avail
--------- /packages/run/modules-2.0/modulefiles_torque ---------
lam-7.1.4 mpich-p4 mpich-shmem mpich2 mpiexec-0.84 openmpi-1.6-intel openmpi-pgi
lam-7.1.4-intel mpich-p4-intel mpich-shmem-intel mpich2-intel
--------- /packages/run/modules-2.0/modulefiles_opensuse ---------
adf2007 demon-2.3 intelcdk-9 mvs pgicdk-6.0 turbomole-5.10-huger
amber-12 demon-2.3-shmem intelcomp-11 namd-2.7b1 pgicdk-8.0 turbomole-5.6
amber-12-pgi g03 jdk-1.4.2 nfs4acl pgiwks-6.1 turbomole-6.0
mopac2009 ofed-1.3-mvapich2 python-2.5 turbomole-6.4 lucida molden molpro
....
Note: Extra care must be taken when using metamodules, especially when parallel execution is required.

ACL, Access Control List
Access to almost all modules is granted to everyone. In certain cases, the access might be limited by an ACL. The list of ACL rules is available only via iSoftService. For example, the usage of the sander-pmf module is limited to users belonging to the pmflib group.
Note: ACL rules can be defined on the level of modules or module builds.

User groups for AMS subsystem
[kulhanek@wolf ~]$ ams-user
User name           : kulhanek (uid: 18773)
Primary group name  : lcc (gid: 2001)
Site ID             : {WOLF:669663ca-cb1c-4d0a-8393-13bb8f7a90da}
Configuration realm : default
===================================================================
>>> default ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Priority    : 1
    Groups      : compchem
>>> posix ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Priority    : 2
    All groups  : kulhanek,rmarek,compchem,lcc
    User groups : lcc
>>> groups ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Priority    : 3
    All groups  : pmflib,kulhanek
    User groups : pmflib
===================================================================
>>> final
    Groups      : compchem,lcc,pmflib
The final line lists the resulting AMS user groups. They are not necessarily related to the unix system groups!

Personal sites & Big brother
It is possible to install the AMS on your personal computer and thus benefit from the uniform environment provided by the Infinity system. The installation procedure and prerequisites are available in the Infinity wiki:
https://lcc.ncbr.muni.cz/whitezone/development/infinity/wiki/index.php/Documentation
Note: Activation of modules is monitored (user name, build name, host name, date, resources and scope).

ABS
Advanced Batch System

Command Overview
• pqueues – list available queues
• pnodes – list available computational nodes
• pqstat – list jobs in the batch system or in a given queue
• pjobs – list jobs of the logged-in or another user
• infinity-env – job shell
• psubmit – submit a job to the batch system
• pinfo – print information about the job
• pgo – change the current directory to the job input directory or log in to the computational node
• psync – synchronize the computational node with the job input directory
• paliases – define aliases
• pkill – terminate a job
• pkillall – terminate all jobs (they can be filtered)
• pcollections – manage job collections
• presubmit – resubmit the job to the batch system
• pstatus – print short job status
• abs-config – configure the ABS subsystem
• infinity-ijobs-prepare, infinity-ijobs-copy-into, infinity-ijobs-launch, infinity-ijobs-finalize – support for internal jobs
Use command -h to list all command options.

Job
A job must fulfil the following conditions:
• each job must be executed in a separate directory (the job input directory)
• all job data must be present in the job input directory
• job directories must not be nested
• job execution is controlled by a script or by an input file (for autodetected jobs)
• the job script must be written in the bash shell language
• absolute paths should not be used; all paths should be relative to the job input directory
• the directory cannot contain pjob??? directories or files
[Diagram: the job directories job1 and job2 are placed side by side under /home/kulhanek.]

Job, cont.
The situation with two jobs in a single directory is detected by Infinity:
[kulhanek@skirit 01.get_hostname]$ psubmit short get_hostname
>>> List of jobs from info files ...
# ST Job ID   Job Title      Queue   NCPUs NGPUs NNods Last change/Duration
# -- -------- -------------- ------- ----- ----- ----- --------------------
   F 72998    get_hostname   long        1     0     1 2013-10-11 09:52:28
ERROR: Infinity runtime files were detected in the job input directory!
The presence of runtime files indicates that another job has been started in this directory. Multiple job submission within the same directory is not permitted by the Infinity system. If you really want to submit the job, you have to remove the runtime files of the previous one. Please be very sure that the previous job has already terminated, otherwise undefined behaviour can occur! Type the following to remove the runtime files:
rm -f *.info *.infex *.infout *.stdout *.nodes *.gpus *.infkey ___JOB_IS_RUNNING___

Job Script
The job script can only be in the bash language. The interpreter can be specified directly as the bash interpreter or as the special infinity-env interpreter. In the latter case, the script is protected from undesired execution that might lead to job data corruption or loss.
#!/bin/bash
# job script
or
#!/usr/bin/env infinity-env
# job script
[kulhanek@skirit 01.get_hostname]$ ./testme
ERROR: This script can be run as an Infinity job only!
The script is protected by the infinity-env command, which permits the script execution only via the psubmit command.
[kulhanek@skirit 01.get_hostname]$

Job Script, other shells
A job script that acts as a wrapper:
#!/usr/bin/env infinity-env
# activate all required modules here
module add amber
module add cats
# execute the script in a different shell
./my_tcsh_script
my_tcsh_script:
#!/usr/bin/tcsh
....
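When an old job's runtime files block a new submission (see the error message above), they have to be removed by hand. A small helper such as the following can make this less error-prone; it is only a sketch, not part of Infinity, and it assumes that pstatus (described later in this overview) prints just the status abbreviation, with F meaning finished and K killed:
#!/bin/bash
# remove Infinity runtime files of a terminated job
# run this inside the job input directory
state=$(pstatus)
if [ "$state" != "F" ] && [ "$state" != "K" ]; then
    echo "Job state is '$state'; refusing to clean a job that is not finished/killed." >&2
    exit 1
fi
rm -f *.info *.infex *.infout *.stdout *.nodes *.gpus *.infkey
echo "Runtime files removed; the directory can be reused for a new job."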
Infinity overview, 18th October 2013 -60- Job submission, name restrictions The job path can contain following characters: a-z A-Z 0-9 _+-.#/ The job script name can contain following characters: a-z A-Z 0-9 _+-.# Infinity overview, 18th October 2013 -61- Job submission A Job is submitted to the batch system by the psubmit command: psubmit destination job [resources] [syncmode] destination: • queue_name • node_name@queue_name • alias_name • node_name@alias_name job: • job script name • name of job input file for autodected jobs resources defines required resources, if not specified default resources are used (in most cases, ncpus=1 is used) syncmode determines the data transfer mode between the job input directory and working directory on the computational node, by default "sync" mode is used Infinity overview, 18th October 2013 -62- Job submission, cont. [kulhanek@skirit 01.get_hostname]$ psubmit short testme Job name : testme Job title : testme (Job type: generic) Job directory : skirit.ics.muni.cz:/auto/home/kulhanek/Tests/01.get_hostname Job project : -none- (Collection: -none-) Site name : metacentrum (Torque server: arien.ics.muni.cz) Job key : d70c5f4b-8cd8-42b7-bd90-47a2efad4fe3 ======================================================== Req destination : short Req resources : -noneReq sync mode : -noneread carefully if job submission ---------------------------------------Alias : -nonespecification is correct Queue : short Default resources: maxcpuspernode=8 ---------------------------------------Number of CPUs : 1 Number of GPUs : 0 then Max CPUs / node : 8 Number of nodes : 1 Resources : nodes=1:ppn=1 Sync mode : sync ---------------------------------------Start after : -not definedconfirmation is required by default Exported modules : abs:2.0.4216 Excluded files : -none======================================================== Do you want to submit the job to the Torque server (YES/NO)? > Infinity overview, 18th October 2013 -63- Job submission, cont. Confirmation of job submission can be disabled: 1) using -y option of psubmit command $ psubmit –y short get_hostname 2) temporarily in the terminal/script by the pconfirmsubmit command $ pconfirmsubmit NO Confirmation of job submission is temporarily changed ! Confirm submit setup value: NO Infinity overview, 18th October 2013 -64- Job submission, cont. Confirmation of job submission can be disabled: 3) permanently via the abs-config command [kulhanek@skirit 01.get_hostname]$ abs-config *** ABS Configuration Centre *** ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Main menu -----------------------------------------------------------1 - configure aliases 2 - configure confirmation of job submission, e-mail alerts, etc. -----------------------------------------------------------i - site info s - save changes .... 
Infinity overview, 18th October 2013 -65- Available queues [kulhanek@skirit 01.get_hostname]$ pqueues # # Site name : metacentrum # Torque server : arien.ics.muni.cz # # Name Pri T Q R O Max UMax CMax MaxWall Mod Required property # --------------- --- ----- ----- ----- ----- ----- ---- ---- ------------- --- ----------------MetaSeminar 0 0 0 0 0 0 0 0 0d 00:00:00 SE q_metaseminar backfill 20 2 0 0 2 2000 1000 32 1d 00:00:00 SE q_backfill debian6 50 0 0 0 0 1000 50 0 1d 00:00:00 SE q_debian6 debian6_long 51 1 0 0 1 1000 50 0 30d 00:00:00 SE q_debian6_long default --> normal,short (routing queue) gpu 65 100 0 16 84 20 0 0 1d 00:00:00 SE q_gpu gpu_long 55 16 0 9 7 20 0 0 7d 00:00:00 SE q_gpu_long long 62 931 477 285 169 1000 70 0 30d 00:00:00 SE q_long ncbr_long 70 16 3 8 5 20 5 0 30d 00:00:00 SE q_ncbr_long ncbr_medium 65 72 0 18 54 1000 32 0 5d 00:00:00 SE q_ncbr_medium ncbr_single 64 228 0 112 116 1000 200 1 2d 00:00:00 SE q_ncbr_single normal 50 1220 411 194 615 1000 100 0 1d 00:00:00 SE q_normal orca 70 0 0 0 0 120 80 64 30d 00:00:00 SE orca orca16g 71 0 0 0 0 120 80 64 30d 00:00:00 SE orca16g preemptible 61 7 0 7 0 0 400 32 30d 00:00:00 SE q_preemptible privileged 65 25 0 19 6 1000 50 32 30d 00:00:00 SE q_privileged short 60 169 0 1 168 1000 250 0 0d 02:00:00 SE q_short # # # # # # Legend: Pri - Priority, T - Total, Q - Queued, R - Running jobs O - Other (completed, exiting, hold) jobs Max - Max running jobs, UMax - Max user running jobs CMax - Max CPUs per job, MaxWall - Max wall time per job Mod - Started/(-)Stopped : Enabled/(-)Disabled Infinity overview, 18th October 2013 By default, only queues available to the logged user are shown. -66- Available nodes [kulhanek@skirit 01.get_hostname]$ pnodes -g perian-2 # # Site name : metacentrum # Torque server : arien.ics.muni.cz # # Group : perian-2 # ----------------------------------------------------------------------------------# brno cl_perian debian debian50 em64t home_perian linux ncbr nfs4 per q_ncbr_long # q_ncbr_medium q_ncbr_single x86 x86_64 xeon # ----------------------------------------------------------------------------------# # Node Name CPUs Free Status Extra properties # ---------------------------- ---- ---- -------------------- ----------------------perian1-2.ncbr.muni.cz 8 8 free nodecpus8,per1,quadcore perian2-2.ncbr.muni.cz 8 0 job-exclusive nodecpus8,per1,quadcore perian5-2.ncbr.muni.cz 8 0 job-exclusive nodecpus8,per1,quadcore perian6-2.ncbr.muni.cz 8 8 free nodecpus8,per1,quadcore perian7-2.ncbr.muni.cz 8 8 free nodecpus8,per1,quadcore perian8-2.ncbr.muni.cz 8 7 free nodecpus8,per1,quadcore perian12-2.ncbr.muni.cz 8 1 free nodecpus8,per2,quadcore perian13-2.ncbr.muni.cz 8 0 job-exclusive nodecpus8,per2,quadcore perian14-2.ncbr.muni.cz 8 0 job-exclusive nodecpus8,per2,quadcore # ----------------------------------------------------------------------------------# Total number of CPUs : 484 # Free CPUs : 120 common properties for the cluster group # # # # # # # # extra properties for individual nodes all node properties All properties ----------------------------------------------------------------------------------brno cl_perian debian debian50 em64t home_perian hyperthreading linux ncbr nfs4 nodecpus12 nodecpus8 per per1 per2 per3 per4 q_ncbr_long q_ncbr_medium q_ncbr_single quadcore sixcore x86 x86_64 xeon ----------------------------------------------------------------------------------Total number of CPUs : 484 Free CPUs : 120 Infinity overview, 18th October 2013 -67- Resources Resource token Avail Meaning ncpus 
NMI number of requested CPUs (exception: it can be specified as a number only) ngpus NM number of requested GPUs props NM required properties of computational nodes mem NMI required memory vmem M required virtual memory scratch M required scratch size scratch_type NMI determine scratch type maxcpuspernode NMI maximum number of CPUs per computational node the number of nodes is determined from ncpus and maxcpuspernode walltime MI required walltime for job execution umask NMI umask used for new files and directories create on computational nodes NCBR clusters, MetaCentrum&CERIT-SC, IT4I Infinity overview, 18th October 2013 -68- Resources, cont. Resource token Avail Meaning account I account name (related to project) place I determine placing of job chunks cpu_freq I requested processor frequency Infinity overview, 18th October 2013 -69- Resource specification Resources are specified as comma separated list (for example): 8,props=cl_gram (ncpus=8,props=cl_perian) Note: Not all resources tokens are available on all sites! Infinity overview, 18th October 2013 -70- Resources, properties Single property specification: props=cl_perian select nodes that have cl_perian property Property combination: props=brno#infiniband select nodes that have brno AND infiniband properties Property exclusion: props=^cl_gram select any node except nodes with cl_gram property Node exclusion: props=^full.node.name only on the metacentrum and cerit-sc sites props=cl_gram:^gram2.zcu.cz select any node with cl_gram property except gram2.zcu.cz node Infinity overview, 18th October 2013 -71- Synchronization modes The synchronization mode determines how job data are transferred between the job input directory and the working directory on the computational node: Supported modes sync nosync jobdir Infinity overview, 18th October 2013 -72- Synchronization modes, sync Mode Meaning sync Data are copied from the job input directory to the working directory on the computational node. The working directory is created on the scratch of the computational node. After the job is finished, all data from the working directory are copied back to the job input directory. Finally, the working directory is removed if the data transfer was successful. User Interface (UI) (Frontend) /job/input/dir rsync Computational Node #1 Worker Node (WN) /scratch/job_id/ rsync Note: default synchronization mode Infinity overview, 18th October 2013 determined by scratch_type resource token -73- Synchronization modes, nosync Mode Meaning nosync Data are copied from the job input directory to the working directory on the computational node. The working directory is created on the scratch of the computational node. After the job is finished, data are kept on the computational node in the working directory. User Interface (UI) (Frontend) /job/input/dir rsync Computational Node #1 Worker Node (WN) /scratch/job_id/ Note: this mode should be used only in very special cases! Infinity overview, 18th October 2013 determined by scratch_type resource token -74- Synchronization modes, jobdir Mode Meaning jobdir Job data must be on shared volume, which is accessible on both UI and WN. No data are transferred. shared volume User Interface (UI) (Frontend) /job/input/dir /job/input/dir Computational Node #1 Worker Node (WN) Note: required by turbomole if it is executed in parallel among more than one computational node. 
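Putting the pieces of this section together: the resource list and the synchronization mode are passed to psubmit as the third and fourth arguments. The following invocations are illustrative sketches only; the queue name long and the script name md_run are placeholders, and the set of available resource tokens differs between sites:
# 8 CPUs on nodes with the cl_perian property, 2 days of walltime
psubmit long md_run ncpus=8,props=cl_perian,walltime=2d
# 16 CPUs spread over at least two nodes with 30 GB of memory, default sync mode
psubmit long md_run ncpus=16,maxcpuspernode=8,mem=30gb
# data stay on a shared volume instead of being copied to the scratch
psubmit long md_run ncpus=1 jobdir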
Infinity overview, 18th October 2013 -75- Resources, scratch_type Value Avail Meaning local NM /scratch/$USER/$INF_JOB_ID/main local I /lscratch/$INF_JOB_ID/main shared I /scratch/$USER/$INF_JOB_ID/main shmem NMI /dev/shm/$USER/$INF_JOB_ID Infinity overview, 18th October 2013 -76- Resources, walltime The walltime token is used to specify the maximum execution time of job. Time can be specified in two ways: walltime=hhh:mm:ss walltime=Nu where u is Infinity overview, 18th October 2013 s m h d w seconds minutes hours days weeks -77- Job monitoring, pinfo The job progress can be monitored using the pinfo command invoked in the input job directory or in the working directory on the computational node. [kulhanek@skirit 01.get_hostname]$ pinfo Job name : testme Job ID : 2309124.arien.ics.muni.cz Job title : testme (Job type: generic) Job directory : skirit.ics.muni.cz:/auto/home/kulhanek/Tests/01.get_hostname Job project : -none- (Collection: -none-) Site name : metacentrum (Torque server: arien.ics.muni.cz) Job key : d647eadd-1c4c-4864-8018-e820dc51666d ======================================================== Req destination : short Req resources : -noneReq sync mode : -none---------------------------------------Alias : -noneQueue : short Default resources: maxcpuspernode=8 ---------------------------------------Number of CPUs : 1 Number of GPUs : 0 Max CPUs / node : 8 Number of nodes : 1 Resources : nodes=1:ppn=1 Sync mode : sync ---------------------------------------Start after : -not definedExported modules : abs:2.0.4216 Excluded files : -none======================================================== Main node : tarkil10-1.cesnet.cz Working directory: /scratch/kulhanek/2309124.arien.ics.muni.cz ---------------------------------------CPU 001 : tarkil10-1.cesnet.cz ======================================================== Job was submitted on 2013-04-03 18:38:54 and was queued for 0d 00:00:32 Job was started on 2013-04-03 18:39:26 and is running for 0d 00:00:11 job submission summary Infinity overview, 18th October 2013 current job status -78- Job monitoring, pgo The pgo command can be used to change directory among the current directory, the job input directory and/or the job working directory on the computational node. User Interface (UI) (Frontend) no argument /any/directory pgo /job/input/dir Computational Node #1 Worker Node (WN) /working/directory/ pgo job_id Infinity overview, 18th October 2013 -79- Job monitoring, pgo Use JobID to move from any directory to the job input directory: [kulhanek@skirit ~]$ pgo 2308394 # ST Job ID User Job Title Queue NCPUs NGPUs NNods Last change/Duration # -- ------------ ------------ --------------- --------------- ----- ----- ----- -------------------R 2308394 kulhanek testme short 1 0 1 0d 00:00:15 > /auto/home/kulhanek/Tests/01.get_hostname tarkil10-1.cesnet.cz INFO: The current directory was set to: /auto/home/kulhanek/Tests/01.get_hostname [kulhanek@skirit 01.get_hostname]$ Infinity overview, 18th October 2013 -80- Job monitoring, pgo II Use pgo in the job input directory to move to the computational node [kulhanek@skirit 01.get_hostname]$ pgo Job name : testme Job ID : 2308394.arien.ics.muni.cz Job title : testme (Job type: generic) Job directory : skirit.ics.muni.cz:/auto/home/kulhanek/Tests/01.get_hostname Site name : metacentrum (Torque server: arien.ics.muni.cz) Job key : 12ea4a7b-a6b7-431f-861c-1b26eaf27350 ======================================================== Req destination : short .... 
Sync mode : sync ---------------------------------------Start after : -not definedExported modules : abs:2.0.4216 Excluded files : -none======================================================== Main node : tarkil10-1.cesnet.cz Working directory: /scratch/kulhanek/2308394.arien.ics.muni.cz ---------------------------------------CPU 001 : tarkil10-1.cesnet.cz ======================================================== Job was submitted on 2013-04-03 15:22:58 and was queued for 0d 00:00:08 Job was started on 2013-04-03 15:23:06 and is running for 0d 00:03:11 job input directory on UI working directory on WN >>> Satisfying job for pgo action ... # ST Job ID Job Title Queue NCPUs NGPUs NNods Last change/Duration # -- -------------------- --------------- --------------- ----- ----- ----- -------------------R 2308394 testme short 1 0 1 0d 00:03:11 tarkil10-1.cesnet.cz > Site and job exported modules were recovered. [kulhanek@tarkil10-1 2308394.arien.ics.muni.cz]$ Infinity overview, 18th October 2013 -81- Control files In the job input directory several control files are created during job submission, in the course of job execution and its termination. • *.info job status file (XML file) • *.infex script executed by the batch system (wrapper) • *.infout standard output from execution of *.infex script, it is necessary to analyze it if the job was not terminated successfully • *.nodes list of computational node allocated for the job • *.gpus list of GPU cards allocated for the job • *.key unique ID of the job • *.stdout standard output from the job script Note: It is not wise to delete these files if the job is still running. Infinity overview, 18th October 2013 -82- pinfo command, other function [kulhanek@perian 02.prod-a]$ pinfo -c -r # ST Job ID Job Title Queue NCPUs NGPUs NNods Last change/Duration # -- -------------------- --------------- --------------- ----- ----- ----- -------------------ER 2007744 precycleJob#101 gpu 1 1 1 F 2009982 precycleJob#104 gpu 1 1 1 2013-01-26 07:52:33 F 2010788 precycleJob#106 gpu 1 1 1 2013-01-26 18:44:23 .... 
Final statistics >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<< Number of all jobs = 100 Number of prepared jobs = 0 0.00 % Number of submitted jobs = 0 0.00 % Number of running jobs = 0 0.00 % Number of finished jobs = 98 98.00 % Number of other jobs = 2 2.00 % state min max total number averaged ------- ----------------- ----------------- ----------------- ------- ----------------queued 0d 00:00:02 0d 04:04:44 0d 10:04:56 98 0d 00:06:10 running 0d 02:22:41 0d 10:55:56 11d 07:23:39 98 0d 02:46:09 Total CPU time = 11d 07:23:39 New features: • -r recursive mode (gather information from all info files in the current directory and all subdirectories) • -c compact mode (job info on a single line) •-l print job comment in compact mode • -p print job path in compact mode Infinity overview, 18th October 2013 -83- pqstat/pjobs commands Job/batch server monitoring: • pqstat list jobs in the batch system or given queue • pjobs list jobs of the logged or other user Interesting options: • -c print completed jobs (by default they are not shown) • -l print job comment • -p print job path • -f print completed jobs ordered by time of termination (pjobs) • -s filter jobs [kulhanek@perian 02.prod-a]$ pjobs -p -l # # Site name : metacentrum # Torque server : arien.ics.muni.cz # # ST Job ID User Job Title Queue NCPUs NGPUs NNods Last change/Duration # -- ------------ ------------ --------------- --------------- ----- ----- ----- -------------------R 2230607 kulhanek run_abf ncbr_long 8 0 1 17d 13:45:42 > /auto/smaug1.nfs4/home/kulhanek/01.Projects/55.Zora/alpha/03.dimer/04.water/03.abf perian31-2.ncbr.muni.cz [kulhanek@perian 02.prod-a]$ Infinity overview, 18th October 2013 job comment for queued jobs the status provided by the batch system for running jobs the name of the computational node -84- Terminate running/queued jobs The queued or running job can be prematurely terminated by the pkill command: [kulhanek@perian normal]$ pkill Job name : test_normal Job ID : 2311306.arien.ics.muni.cz Job title : test_normal (Job type: generic) Job directory : perian.ncbr.muni.cz:/home/kulhanek/Tests/normal Job project : -none- (Collection: -none-) Site name : metacentrum (Torque server: arien.ics.muni.cz) Job key : 4bc17745-75e7-4b55-a97d-ef40fe83ba93 ======================================================== Req destination : short Req resources : -noneReq sync mode : -none---------------------------------------Alias : -noneQueue : short Default resources: maxcpuspernode=8 ---------------------------------------Number of CPUs : 1 Number of GPUs : 0 Max CPUs / node : 8 Number of nodes : 1 Resources : nodes=1:ppn=1 Sync mode : sync ---------------------------------------Start after : -not definedExported modules : abs:2.0.4216|gaussian:09.C1 Excluded files : -none=============================================== Main node : tarkil10-1.cesnet.cz Working directory: /scratch/kulhanek/2311306.arien.ics.muni.cz ---------------------------------------CPU 001 : tarkil10-1.cesnet.cz ======================================================== Job was submitted on 2013-04-04 14:26:53 and was queued for 0d 00:02:55 Job was started on 2013-04-04 14:29:48 and is running for 0d 00:01:16 >>> Satisfying job(s) for pkill action ... # ST Job ID Job Title Queue NCPUs NGPUs NNods Last change/Duration # -- -------------------- --------------- --------------R 2311306 test_normal short > /home/kulhanek/Tests/normal tarkil10-1.cesnet.cz Do you want to kill listed jobs (YES/NO)? > YES Listed jobs were killed! 
The job is terminated and all data are kept in the working directory on the computational node! You have to clean up the data manually yourself!

Terminate running/queued jobs II
The job can be killed softly with the -s option:
[kulhanek@perian normal]$ pkill -s
...
Do you want to softly kill listed job (YES/NO)?
> YES
Sending TERM signal to tarkil10-1.cesnet.cz:/scratch/kulhanek/2311321.arien.ics.muni.cz ...
>>> Process ID: 32720
[kulhanek@perian normal]$ pinfo
....
========================================================
Main node         : tarkil10-1.cesnet.cz
Working directory : /scratch/kulhanek/2311321.arien.ics.muni.cz
Job exit code     : 143
----------------------------------------
CPU 001           : tarkil10-1.cesnet.cz
========================================================
Job was submitted on 2013-04-04 14:35:18 and was queued for 0d 00:00:08
Job was started on 2013-04-04 14:35:26 and was running for 0d 00:01:59
Job was finished on 2013-04-04 14:37:25
The job script is terminated, but the job itself is finished as usual. This means that in the sync synchronization mode the data are copied back to the UI and the working directory on the WN is removed.

Terminate running/queued jobs III
All running or queued jobs can be killed by the pkillall command. The job list can be filtered by the -s option.

Synchronize jobs
Intermediate data of running jobs can be copied between the job working and input directories by the psync command. It can be called either from the job input directory or from the working directory.
psync <file1> [file2] ...
psync --all

pstatus, short job status
The pstatus command prints short status information about the job. It can be executed without any argument in the job input or working directory. If there is more than one info file, the status of the job with the largest JobID is printed.
Printed abbreviations:
• P – job is prepared for submission (used in conjunction with collections)
• Q – job is queued or held in the batch system
• R – job is running
• F – job is finished
• K – job was killed by the pkill command
• IN – job is in an inconsistent state, e.g. the info file shows the job in the running state but the batch system shows it as finished
• UN – no info file in the current directory
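Because pstatus can be run in any job input directory, it lends itself to a quick overview of many jobs at once. A minimal sketch (assuming that every subdirectory of the current directory holds one job and that pstatus prints just the abbreviation listed above):
#!/bin/bash
# print the short status of every job directory below the current one
for dir in */ ; do
    [ -d "$dir" ] || continue
    ( cd "$dir" && printf '%-40s %s\n' "$dir" "$(pstatus)" )
done
For large production trees the built-in pinfo -r -c described earlier gives richer information; the loop is merely a lightweight alternative.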
Aliases, define new alias

1) using abs-config (recommended)

[kulhanek@skirit 01.get_hostname]$ abs-config
            *** ABS Configuration Centre ***
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Main menu
 ------------------------------------------------------------
 1 - configure aliases
 2 - configure confirmation of job submission, e-mail alerts, etc.
 ------------------------------------------------------------
 i - site info
 s - save changes
 ....

2) using paliases (for experienced users)

[kulhanek@perian test]$ paliases add al1 normal sync props=brno
[kulhanek@perian test]$

Aliases, resource resolution

Resources are resolved with increasing priority; each level overwrites conflicting tokens of the previous one and extends it with new tokens:

  site default resources : maxcpuspernode=8
  alias resources        : maxcpuspernode=64,props=linux,ncpus=32   (overwrites the site defaults)
  psubmit resources      : ncpus=16,mem=30gb                        (overwrites/extends the above)

  Final resources: ncpus=16,maxcpuspernode=64,props=linux,mem=30gb

Execution of applications

sander/pmemd

The sander/pmemd programs are applications from the AMBER package for molecular dynamics. Detailed information can be found at http://ambermd.org

#!/bin/bash
# activate the module with the sander/pmemd applications
module add amber:12.0
# execute the sander program
sander -O -i prod.in -p topology.parm7 -c input.rst7

Job script:
• only the essential logic is present
• in most cases, the script is the same for sequential and parallel runs of the same application
• data are referenced relative to the job directory

sander – single/parallel execution

The only difference between sequential and parallel execution is in the resource specification given to psubmit. The input data and the job script are the same!

Sequential run (ncpus=1 can be omitted):
$ psubmit short test_sander ncpus=1
On the computational node, the *.stdout file reports: Module build: amber:12.0:x86_64:single

Parallel run:
$ psubmit short test_sander ncpus=2
On the computational node, the *.stdout file reports: Module build: amber:12.0:x86_64:para

gaussian, manual script preparation

The gaussian package contains tools for quantum chemical calculations. A detailed description can be found at http://www.gaussian.com

#!/bin/bash
# activate the gaussian module
module add gaussian:09.C1
# execute g09
g09 input

The input file input.com must contain the number of CPUs requested for parallel execution (this number MUST be consistent with the resource specification given to the psubmit command):

%NProcShared=4

$ psubmit short test_gaussian ncpus=4
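When the script and the input file are prepared manually, these two numbers are easy to get out of sync. A minimal sketch of one way to keep them consistent is shown below; NCPUS is just a shell variable, GNU sed is assumed, and the %NProcShared line must already exist in input.com. The autodetection described next makes this step unnecessary.

#!/bin/bash
# keep %NProcShared in the Gaussian input consistent with the CPUs
# requested from psubmit (illustrative sketch only)
NCPUS=4
sed -i "s/^%NProcShared=.*/%NProcShared=${NCPUS}/" input.com
psubmit short test_gaussian ncpus=${NCPUS}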
gaussian, autodetection

The ABS subsystem is able to recognize the gaussian job type. The job script is created automatically and the input file is updated automatically according to the requested resources.

$ module add gaussian
$ psubmit short input.com ncpus=4

input.com is the gaussian input file (it must have the .com extension); it is NOT a job script!

Autodetection:
• the job script is created automatically with the correct gaussian binary name (g98, g03, g09)
• %NProcShared is added to or updated in the input file
• it is checked that only a single node is requested (parallel execution is limited to a single node)

[kulhanek@perian test]$ psubmit short input.com
Job name      : input
Job title     : input (Job type: gaussian)
Job directory : perian.ncbr.muni.cz:/home/kulhanek/Tests/test
Job project   : -none- (Collection: -none-)
Site name     : metacentrum (Torque server: arien.ics.muni.cz)
Job key       : 384e3be5-9dac-405e-b235-74609ae4c486
========================================================

gaussian – single/parallel execution

The only difference between sequential and parallel execution is in the resource specification given to psubmit. The input data are the same!

Sequential run (ncpus=1 can be omitted):
$ psubmit short input.com ncpus=1

Parallel run:
$ psubmit short input.com ncpus=4

precycle – restartable MD

The aim of precycle is to split a long molecular dynamics job into smaller chunks that can be run more efficiently in queues with shorter execution walltimes, such as normal or backfill.

1) activate the dynutil-new module (mind the exact module name)

[kulhanek@perian normal]$ module add dynutil-new:
# Module specification: dynutil-new: (add action)
# =============================================================
  Requested CPUs     : 1      Requested GPUs   : 0
  Num of host CPUs   : 4      Num of host GPUs : 0
  Requested nodes    : 1
  Host arch tokens   : i686,noarch,x86_64
  Host SMP CPU model : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz [Total memory: 5500 MB]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Exported module    : dynutil-new:4.0.4241
  Module build       : dynutil-new:4.0.4241:noarch:single

2) get the precycle input file template

[kulhanek@perian ~]$ precycle-prep
All neccessary files for precycle were copied to working directory.
[kulhanek@perian ~]$

precycle – restartable MD, II

3) update the precycle script

....
# input topology ---------------------------------------------------------------
# file name without path, this file has to be present in the working directory
export PRECYCLE_TOP=""

# input coordinates ------------------------------------------------------------
# file name without path, this file has to be present in the working directory
# this file is used only for the first production run
export PRECYCLE_CRD=""

# control file for MD, it has to be compatible with the used MD program ---------
# file name without path, this file has to be present in the working directory
export PRECYCLE_CONTROL="prod.in"

# transform control file (YES/NO) -----------------------------------------------
# if YES then the RANDOM key is substituted by a random key
export PRECYCLE_TRANSFORM_CONTROL="YES"

# index of the first production stage -------------------------------------------
export PRECYCLE_START="1"

# index of the final production stage -------------------------------------------
export PRECYCLE_STOP=""
....
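For illustration, a filled-in fragment for an AMBER system might look as follows; the topology and coordinate file names are placeholders borrowed from the sander example above and must match the files actually present in your working directory:

# illustrative values only – adjust the file names to your system
export PRECYCLE_TOP="topology.parm7"        # system topology
export PRECYCLE_CRD="input.rst7"            # initial coordinates (first stage only)
export PRECYCLE_CONTROL="prod.in"           # MD control file
export PRECYCLE_TRANSFORM_CONTROL="YES"
export PRECYCLE_START="1"
export PRECYCLE_STOP="200"                  # index of the last production stage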
precycle – restartable MD, III

4) submit the job to the batch system via the psubmit command
• the presence of the input files is checked
• the job is started from the last available restart file

[kulhanek@perian precycle]$ psubmit normal precycleJob 8
# ------------------------------------------------
# precycle job summary
# ------------------------------------------------
# Job script name     : precycleJob
# System topology     : 1DC1-DNA_fix_sol_joined.parm7
# Initial coordinates : relax10.rst7
# Control file        : prod.in
# Compress trajectory : -no-
# Name format         : prod%03d
# Storage directory   : storage
# Internal cycles     : 1
# Starting stage      : 1
# Final stage         : 200
# Current stage       : 198 (found restart: storage/prod198.crd)
# MD engine module    : amber:12.0
# MD engine program   : pmemd

Job name      : precycleJob
Job title     : precycleJob#198 (Job type: precycle)
Job directory : perian.ncbr.muni.cz:/home/kulhanek/Tests/precycle
Job project   : -none- (Collection: -none-)
Site name     : metacentrum (Torque server: arien.ics.muni.cz)
Job key       : 2a57df2f-9119-4f59-b113-b3e37415502

precycle – restartable MD, IV

5) in the case of a recoverable failure, simply resubmit the job to the batch system

# WARNING: this type of job supports an automatic restart of a crashed job!!!
# in the case of failure, please just resubmit the job into the queue system
# without any modification of this script

[kulhanek@perian precycle]$ psubmit normal precycleJob 8

Employing GPUs (available in the metacentrum site)

1) change the MD module and MD core used in precycleJob

# program to perform MD -----------------------------------------------
export MD_CORE="pmemd.cuda"
export MD_MODULE="pmemd-cuda:12.1"

2) request GPU resources during job submission

[kulhanek@perian precycle]$ psubmit gpu precycleJob ngpus=1,props=cl_gram
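Regardless of the MD engine used, you can check which restart file precycle will continue from before resubmitting. The sketch below is only illustrative and assumes the default storage directory and the prod%03d name format shown in the job summary above:

# show the last available restart file in the storage directory;
# precycle continues from this file on the next submission
# (for the job shown above it would be storage/prod198.crd)
$ ls storage/prod???.crd | tail -n 1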