CLE and How to Start Your Application
Transcription
CLE and How to Start Your Application
Jan Thorbecke, Scalable Software Architecture

Scalable Software Architecture: Cray Linux Environment (CLE)
• Specialized Linux nodes: a microkernel on the compute nodes, full-featured Linux on the service nodes.
• Service PEs specialize by function.
• The software architecture eliminates OS "jitter" and enables reproducible run times.
• The machine is split into a compute partition and a service partition.

Service nodes
• Overview: run full Linux (SuSE SLES), 2 nodes per service blade.
• Boot node: the first XE6 node to be booted; it boots all other components.
• System Data Base (SDB) node: hosts a MySQL database with processor, allocation, accounting and PBS information.
• Login nodes: user login and code preparation activities (compile, launch). Partition allocation is handled by ALPS (Application Level Placement Scheduler).

Trimming the OS – Standard Linux Server
• A stock server runs many daemons on top of the Linux kernel: nscd, cron, mingetty(s), klogd, portmap, cupsd, qmgr, master, sshd, powersaved, pickup, init, slpd, resmgrd, kdm, ndbd, ...

[Figure: FTQ plot of stock SuSE (most daemons removed); count vs. time in seconds.]

Noise amplification in a parallel program
[Diagram: timeline of ranks pe 0 to pe 4 between synchronization points.]
• In each synchronization interval, one rank experiences a noise delay.
• Because the ranks synchronize, all ranks experience all of the delays.
• In this worst-case situation, the performance degradation due to the noise is multiplied by the number of ranks.
• Noise events that occur infrequently on any one rank (core) occur frequently in the parallel job.
• Synchronization overhead amplifies the noise.

Linux on a Diet – CNL
• Compute Node Linux runs only a minimal set of services on top of the Linux kernel: the ALPS client, syslogd, klogd, init and the Lustre client.

[Figure: FTQ plot of CNL; count vs. time in seconds.]

Cray XE I/O architecture
• All I/O is offloaded to service nodes.
• Lustre: high-performance parallel I/O file system; data is transferred directly between compute nodes and files.
• DVS (Data Virtualization Service): allows compute nodes to access NFS file systems mounted on a service node.
• Applications must execute on file systems mounted on the compute nodes; there are no local disks. /tmp is a memory file system on each login node.

Scaling shared libraries with DVS
• Requests for shared libraries (.so files) from the diskless compute nodes (each mounting /dvs) are routed over the Cray interconnect through DVS servers to an NFS server holding the libraries.
• DVS provides similar functionality as NFS, but scales to thousands of compute nodes.
• It gives a central point of administration for shared libraries.
• DVS servers can be "re-purposed" compute nodes.

Running an Application on the Cray XE6

Running an application on the Cray XE: ALPS + aprun
• ALPS: Application Level Placement Scheduler.
• aprun is the ALPS application launcher. It must be used to run applications on the XE compute nodes. If aprun is not used, the application is launched on the login node (and will most likely fail).
• The aprun man page contains several useful examples.
• There are at least 3 important parameters to control:
  The total number of PEs: -n
  The number of PEs per node: -N
  The number of OpenMP threads: -d (more precisely, the 'stride' between 2 PEs in a node)
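As a quick sanity check of this distinction, you can launch a trivial command through aprun and see it execute on compute nodes rather than on the login node. This is only a sketch: it assumes hostname is available on the compute nodes, that you are inside a batch or interactive allocation, and the output shown is illustrative (node names follow the nidXXXXX convention used later in these slides).

  $ hostname                   # runs on the login node
  login1
  $ aprun -n 2 -N 1 hostname   # runs one instance on each of two compute nodes
  nid00028
  nid00572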
Job launch
The job-launch diagrams on the original slides show the following sequence:
• The user logs in to a login node of the XE6 and submits a job with qsub; the batch system starts a login shell for the job script.
• The aprun command in the job script, together with the apbasil client interface, talks to the apsched daemon on the SDB node to obtain a placement on the compute nodes.
• On the allocated compute nodes the apinit daemon spawns an apshepherd per node, which starts the application processes.
• The application runs on the compute nodes; its I/O requests are forwarded to the I/O nodes, where the I/O daemons implement them.
• When the application finishes, the job is cleaned up and the nodes are returned. Job launch: done.

Some definitions
ALPS is always used for scheduling a job on the compute nodes. It does not care about the programming model you used, so we need a few general definitions:
• PE: processing element. Basically a Unix process; it can be an MPI task, a CAF image, a UPC thread, ...
• NUMA node: the cores and memory on a node with 'flat' memory access; basically one of the 4 dies of the Interlagos processor together with its directly attached memory.
• Thread: a thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different PEs do not share these resources. Most likely you will use OpenMP threads.
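To tie these definitions to the aprun options introduced above, the annotated launch line below is an illustration only; it assumes the 32-core, 4-die XE6 nodes used throughout these slides, and a.out is a placeholder executable.

  # 16 PEs in total (-n), 8 PEs per node (-N): 2 nodes are used
  # each PE runs 4 OpenMP threads (-d 4), so 8 PEs x 4 cores fill the 32 cores of a node
  # 2 PEs are placed per NUMA node (-S 2), i.e. per 8-core die
  $ export OMP_NUM_THREADS=4
  $ aprun -n 16 -N 8 -S 2 -d 4 ./a.out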
Running an application on the Cray XE6 – some basic examples
• Assuming an XE6 IL16 system (32 cores per node).
• Pure MPI application, using all the available cores in a node:
  $ aprun -n 32 ./a.out
• Pure MPI application, using only 1 core per node: 32 MPI tasks on 32 nodes, with 32*32 cores allocated. This can be done to increase the available memory per MPI task:
  $ aprun -N 1 -n 32 -d 32 ./a.out
  (we'll talk about the need for the -d 32 later)
• Hybrid MPI/OpenMP application, 4 MPI ranks per node: 32 MPI tasks with 8 OpenMP threads each; OMP_NUM_THREADS needs to be set:
  $ export OMP_NUM_THREADS=8
  $ aprun -n 32 -N 4 -d $OMP_NUM_THREADS ./a.out

aprun CPU affinity control
• CNL can dynamically distribute work by allowing PEs and threads to migrate from one CPU to another within a node.
• In some cases, moving PEs or threads from CPU to CPU increases cache and Translation Lookaside Buffer (TLB) misses and therefore reduces performance.
• CPU affinity options make it possible to bind a PE or thread to a particular CPU or a subset of CPUs on a node.
• aprun CPU affinity options (see man aprun):
  Default setting: -cc cpu. PEs are bound to a specific core, depending on the -d setting.
  Binding PEs to a specific NUMA node: -cc numa_node. PEs are not bound to a specific core but cannot 'leave' their NUMA node.
  No binding: -cc none
  Your own binding: -cc 0,4,3,2,1,16,18,31,9,...

Memory affinity control
• Cray XE6 systems use dual-socket compute nodes with 4 dies; each die (8 cores) is considered a NUMA node.
• Remote-NUMA-node memory references can adversely affect performance. Even if your PEs and threads are bound to a specific NUMA node, the memory used does not have to be 'local'.
• aprun memory affinity options (see man aprun): the suggested setting is -ss, with which a PE can only allocate memory local to its assigned NUMA node. If this is not possible, your application will crash.
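Putting the placement and affinity options together, the following sketch (same 32-core, 4-die node assumption as above; a.out is a placeholder) runs one wide multithreaded PE per die and keeps its memory local:

  $ export OMP_NUM_THREADS=8
  # 8 PEs in total, 4 per node (-N 4) and one per NUMA node (-S 1), 8 threads per PE (-d 8);
  # threads are bound to cores (-cc cpu) and memory allocation stays on the local die (-ss)
  $ aprun -n 8 -N 4 -S 1 -d 8 -cc cpu -ss ./a.out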
Running an application on the Cray XE – MPMD
• aprun supports MPMD: Multiple Program Multiple Data.
• Launching several executables in the same MPI_COMM_WORLD:
  $ aprun -n 128 exe1 : -n 64 exe2 : -n 64 exe3
• Notice: each executable needs a dedicated node; exe1 and exe2 cannot share a node. For example, the following command needs 3 nodes:
  $ aprun -n 1 exe1 : -n 1 exe2 : -n 1 exe3
• Use a script to start several serial jobs on a node:
  $ aprun -a xt -n 3 script.sh
  $ cat script.sh
  ./exe1 &
  ./exe2 &
  ./exe3 &
  wait

How to use the Interlagos 1/3: one MPI rank on each integer core
[Diagram: an Interlagos module; each integer core has its own integer scheduler, integer pipelines and L1 D-cache, while fetch/decode, the FP scheduler with two 128-bit FMAC pipelines and the L2 cache are shared.]
• In this mode, an MPI task is pinned to each integer core.
• Implications:
  Each core has exclusive access to an integer scheduler, integer pipelines and its L1 D-cache.
  The 256-bit FP unit and the L2 cache are shared between the two cores.
  256-bit AVX instructions are dynamically executed as two 128-bit instructions if the 2nd FP unit is busy.
• When to use:
  The code is highly scalable to a large number of MPI ranks.
  The code can run with a 1 GB per core memory footprint (or 2 GB on a 64 GB node).
  The code is not well vectorized.

How to use the Interlagos 2/3: wide AVX mode
• In this mode, only one integer core is used per core pair.
• Implications:
  This core has exclusive access to the 256-bit FP unit and is capable of 8 FP results per clock cycle.
  The core has twice the memory capacity and memory bandwidth in this mode.
  The L2 cache is effectively twice as large.
  The peak performance of the chip is not reduced.
• When to use:
  The code is highly vectorized and makes use of AVX instructions.
  The code needs more memory per MPI rank.

How to use the Interlagos 3/3: 2-way OpenMP mode
• In this mode, an MPI task is pinned to a core pair and OpenMP is used to run a thread on each integer core.
• Implications:
  Each OpenMP thread has exclusive access to an integer scheduler, integer pipelines and its L1 D-cache.
  The 256-bit FP unit and the L2 cache are shared between the two threads.
  256-bit AVX instructions are dynamically executed as two 128-bit instructions if the 2nd FP unit is busy.
• When to use:
  The code needs a large amount of memory per MPI rank.
  The code has OpenMP parallelism exposed in each MPI rank.
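One possible translation of these three modes into aprun options is sketched below. This is not taken verbatim from the slides; it assumes 32-core nodes, and a.out is a placeholder (complete batch scripts for the second and third mode appear later in this section).

  # 1/3: one MPI rank per integer core, here on 2 nodes
  $ aprun -n 64 -N 32 -cc cpu -ss ./a.out

  # 2/3: wide AVX mode, only every second integer core is used
  $ aprun -n 32 -N 16 -d 2 -cc cpu -ss ./a.out

  # 3/3: 2-way OpenMP, one MPI rank per core pair with 2 threads
  $ export OMP_NUM_THREADS=2
  $ aprun -n 32 -N 16 -d 2 -cc cpu -ss ./a.out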
aprun: cpu_lists for each PE
• CLE was updated to give threads and processing elements more flexibility in placement. This is ideal for processor architectures whose cores share resources that they may have to wait for. Separating cpu_lists by colons (:) allows the user to specify the cores used by each processing element and its child processes or threads. Essentially, this gives the user more granularity in specifying a cpu_list per processing element. An example with 3 threads per PE:
  $ aprun -n 4 -N 4 -cc 1,3,5:7,9,11:13,15,17:19,21,23
• Note: this feature will be modified in CLE 4.0.UP03, but the option will still be valid.

-cc detailed example
  $ export OMP_NUM_THREADS=2
  $ aprun -n 4 -N 2 -d 2 -cc 0,2:4,6:8,10:12,14 ./acheck_mpi
  nid00028[ 0] on cpu 00 affinity for thread 0 is: cpu 0, mask 1
  nid00028[ 0] on cpu 02 affinity for thread 1 is: cpu 2, mask 001
  nid00028[ 1] on cpu 04 affinity for thread 0 is: cpu 4, mask 00001
  nid00028[ 1] on cpu 06 affinity for thread 1 is: cpu 6, mask 0000001
  nid00572[ 2] on cpu 00 affinity for thread 0 is: cpu 0, mask 1
  nid00572[ 2] on cpu 02 affinity for thread 1 is: cpu 2, mask 001
  nid00572[ 3] on cpu 04 affinity for thread 0 is: cpu 4, mask 00001
  nid00572[ 3] on cpu 06 affinity for thread 1 is: cpu 6, mask 0000001
  Application 3024546 resources: utime ~2s, stime ~0s
With -N 2 the four PEs are spread over two nodes (nid00028 and nid00572), so each node only uses the first two cpu_lists. With -N 4 all four PEs land on one node and each PE is bound to its own cpu_list:
  $ aprun -n 4 -N 4 -d 2 -cc 0,2:4,6:8,10:12,14 ./acheck_mpi
  nid00028[ 0] on cpu 00 affinity for thread 0 is: cpu 0, mask 1
  nid00028[ 2] on cpu 08 affinity for thread 0 is: cpu 8, mask 000000001
  nid00028[ 0] on cpu 02 affinity for thread 1 is: cpu 2, mask 001
  nid00028[ 1] on cpu 06 affinity for thread 1 is: cpu 6, mask 0000001
  nid00028[ 3] on cpu 12 affinity for thread 0 is: cpu 12, mask 0000000000001
  nid00028[ 2] on cpu 10 affinity for thread 1 is: cpu 10, mask 00000000001
  nid00028[ 1] on cpu 04 affinity for thread 0 is: cpu 4, mask 00001
  nid00028[ 3] on cpu 14 affinity for thread 1 is: cpu 14, mask 000000000000001
  Application 3024549 resources: utime ~0s, stime ~0s

Running a batch application with Torque
• The number of required nodes and cores is determined by the parameters specified in the job header:
  #PBS -l mppwidth=256   (MPP width: number of PEs)
  #PBS -l mppnppn=4      (MPP number of PEs per node)
  This example uses 256/4 = 64 nodes.
• The job is submitted with the qsub command.
• At the end of the execution, output and error files are returned to the submission directory.
• PBS environment variable: $PBS_O_WORKDIR is set to the directory from which the job has been submitted; the default working directory is $HOME. See man qsub for the environment variables.

Other Torque options
• #PBS -N job_name: the job name is used to determine the name of the job output and error files.
• #PBS -l walltime=hh:mm:ss: maximum job elapsed time. It should be indicated whenever possible, since this allows Torque to determine the best scheduling strategy.
• #PBS -j oe: job error and output files are merged into a single file.
• #PBS -q queue: request execution on a specific queue.

Torque and aprun
  Torque                  aprun
  -lmppwidth=$PE          -n $PE            number of PEs to start
  -lmppdepth=$threads     -d $threads       threads per PE
  -lmppnppn=$N            -N $N             PEs per node
  <none>                  -S $S             PEs per NUMA node
  -lmem=$size             -m $size[h|hs]    per-PE required memory
• -B provides aprun with the Torque settings for -n, -N, -d and -m:
  $ aprun -B ./a.out
• Using -S can produce problems if you are not asking for a full node. If possible, ALPS will only give you access to part of a node if the Torque settings allow this. The following will fail:
  #PBS -lmppwidth=4      ! not asking for a full node
  $ aprun -n4 -S1 ...    ! trying to run on every die
  The solution is to ask for a full node, even if aprun doesn't use it.
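To make the mapping above concrete, the fragment below (a sketch; the executable is a placeholder) shows a job header and the equivalent launches with and without -B:

  #PBS -l mppwidth=64
  #PBS -l mppnppn=16
  #PBS -l mppdepth=2
  cd $PBS_O_WORKDIR
  # -B picks up -n, -N, -d and -m from the Torque settings above,
  aprun -B ./a.out
  # so here it is equivalent to:
  # aprun -n 64 -N 16 -d 2 ./a.out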
Core specialization
• System 'noise' on compute nodes may significantly degrade scalability for some applications.
• Core specialization can mitigate this problem:
  1 core per node is dedicated to system work (the service core).
  As many system interrupts as possible are forced to execute on the service core.
  The application does not run on the service core.
• Use aprun -r to get core specialization:
  $ aprun -r 1 -n 100 a.out
• apcount is provided to compute the total number of cores required:
  $ qsub -l mppwidth=$(apcount -r 1 1024 16) job
  aprun -n 1024 -r 1 a.out

Running a batch application with Torque: hybrid MPI + OpenMP
• The number of required nodes can be specified in the job header.
• The job is submitted with the qsub command.
• At the end of the execution, output and error files are returned to the submission directory.
• Environment variables are inherited by the job with #PBS -V.
• The job starts in the home directory; $PBS_O_WORKDIR contains the directory from which the job has been submitted.
  #!/bin/bash
  #PBS -N hybrid
  #PBS -lwalltime=00:10:00
  #PBS -lmppwidth=128
  #PBS -lmppnppn=8
  #PBS -lmppdepth=4
  cd $PBS_O_WORKDIR
  export OMP_NUM_THREADS=4
  aprun -n128 -d4 -N8 a.out

Starting an interactive session with Torque
• An interactive job can be started with the -I argument (that is a capital i).
• Example: allocate 64 cores and export the environment variables to the job (-V):
  $ qsub -I -V -lmppwidth=64
• This will give you a new prompt in your shell from which you can use aprun directly. Note that you are running on a MOM node (a shared resource) when not using aprun.

Watching a launched job on the Cray XE
• xtnodestat: shows the XE node allocation and aprun processes, both interactive and PBS.
• apstat: shows the status of aprun processes.
  apstat            overview
  apstat -a [apid]  info about all the applications or a specific one
  apstat -n         info about the status of the nodes
• The batch command qstat shows the batch jobs.

Starting 512 MPI tasks (PEs)
  #PBS -N MPIjob
  #PBS -l mppwidth=512
  #PBS -l mppnppn=32
  #PBS -l walltime=01:00:00
  #PBS -j oe
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  export MALLOC_MMAP_MAX_=0
  export MALLOC_TRIM_THRESHOLD_=536870912
  aprun -n 512 -cc cpu -ss ./a.out
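The MPICH_ENV_DISPLAY and MALLOC_* exports recur in this and the following scripts. As a rough explanation (my reading, not stated on the original slides): MPICH_ENV_DISPLAY=1 makes the MPI library print its environment settings at startup, and the two malloc variables are standard glibc tunables that keep freed memory inside the process instead of returning it to the OS, avoiding repeated page faults on reallocation.

  export MPICH_ENV_DISPLAY=1               # print the MPICH environment settings at job start
  export MALLOC_MMAP_MAX_=0                # do not use mmap for large allocations
  export MALLOC_TRIM_THRESHOLD_=536870912  # keep up to 512 MB of freed memory in the heap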
Starting an OpenMP program, using a single node
  #PBS -N OpenMP
  #PBS -l mppwidth=1
  #PBS -l mppdepth=32
  #PBS -l walltime=01:00:00
  #PBS -j oe
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  export MALLOC_MMAP_MAX_=0
  export MALLOC_TRIM_THRESHOLD_=536870912
  export OMP_NUM_THREADS=32
  aprun -n1 -d $OMP_NUM_THREADS -cc cpu -ss ./a.out

Starting a hybrid job on a single node, 4 MPI tasks, each with 8 threads
  #PBS -N hybrid
  #PBS -l mppwidth=4
  #PBS -l mppnppn=4
  #PBS -l mppdepth=8
  #PBS -l walltime=01:00:00
  #PBS -j oe
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  export MALLOC_MMAP_MAX_=0
  export MALLOC_TRIM_THRESHOLD_=536870912
  export OMP_NUM_THREADS=8
  aprun -n4 -N4 -d $OMP_NUM_THREADS -cc cpu -ss ./a.out

Starting a hybrid job on a single node, 8 MPI tasks, each with 4 threads
  #PBS -N hybrid
  #PBS -l mppwidth=8
  #PBS -l mppnppn=8
  #PBS -l mppdepth=4
  #PBS -l walltime=01:00:00
  #PBS -j oe
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  export MALLOC_MMAP_MAX_=0
  export MALLOC_TRIM_THRESHOLD_=536870912
  export OMP_NUM_THREADS=4
  aprun -n8 -N8 -d $OMP_NUM_THREADS -cc cpu -ss ./a.out

Starting an MPMD job on a non-default project id, using 1 master and 16 slaves, each with 8 threads
  #PBS -N hybrid
  #PBS -l mppwidth=160   ! Note: 5 nodes * 32 cores = 160 cores
  #PBS -l mppnppn=32
  #PBS -l walltime=01:00:00
  #PBS -j oe
  #PBS -W group_list=My_Project
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  export MALLOC_MMAP_MAX_=0
  export MALLOC_TRIM_THRESHOLD_=536870912
  export OMP_NUM_THREADS=8
  id   # Unix command 'id', to check the group id
  aprun -n1 -d32 -N1 ./master.exe : -n 16 -N4 -d $OMP_NUM_THREADS -cc cpu -ss ./slave.exe

Starting an MPI job on two nodes, using only every second integer core
  #PBS -N hybrid
  #PBS -l mppwidth=32
  #PBS -l mppnppn=16
  #PBS -l mppdepth=2
  #PBS -l walltime=01:00:00
  #PBS -j oe
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  aprun -n32 -N16 -d 2 -cc cpu -ss ./a.out

Starting a hybrid job on two nodes, using only every second integer core
  #PBS -N hybrid
  #PBS -l mppwidth=32
  #PBS -l mppnppn=16
  #PBS -l mppdepth=2
  #PBS -l walltime=01:00:00
  #PBS -j oe
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  export OMP_NUM_THREADS=2
  aprun -n32 -N16 -d $OMP_NUM_THREADS -cc 0,2:4,6:8,10:12,14:16,18:20,22:24,26:28,30 -ss ./a.out

Running a batch application with SLURM: hybrid MPI + OpenMP
• The number of required nodes can be specified in the job header.
• The job is submitted with the qsub command.
• At the end of the execution, output and error files are returned to the submission directory.
• Environment variables are inherited.
• The job starts in the directory from which it has been submitted.
  #!/bin/bash
  #SBATCH --job-name="hybrid"
  #SBATCH --time=00:10:00
  #SBATCH --nodes=8
  export OMP_NUM_THREADS=6
  aprun -n32 -d6 a.out

Starting an interactive session with SLURM
• An interactive job can be started with the SLURM salloc command.
• Example: allocate 8 nodes:
  $ salloc -N 8
• Further SLURM info is available from the CSCS web page www.cscs.ch under "User Entry Point / How to Run a Batch Job / Palu - Cray XE6".

Documentation
• Cray docs site: http://docs.cray.com
• Starting point for Cray XE info: http://docs.cray.com/cgi-bin/craydoc.cgi?mode=SiteMap;f=xe_sitemap
• Twitter?!? http://twitter.com/craydocs

End