CLE and How to Start Your Application
Transcription
CLE and How to Start Your Application
Jan Thorbecke, Scalable Software Architecture

Scalable Software Architecture: Cray Linux Environment (CLE)
• Specialized Linux nodes: a microkernel on the compute nodes, full-featured Linux on the service nodes.
• Service PEs specialize by function.
• The software architecture eliminates OS "jitter" and enables reproducible run times.
• The machine is split into a compute partition and a service partition.

Service nodes
• Overview: run full Linux (SuSE SLES), 2 nodes per service blade.
• Boot node: the first XE6 node to be booted; it boots all other components.
• System Data Base (SDB) node: hosts a MySQL database with processor, allocation, accounting and PBS information.
• Login nodes: user login and code preparation activities (compile, launch). Partition allocation is handled by ALPS (Application Level Placement Scheduler).

Trimming the OS – Standard Linux Server
• A stock server runs many daemons on top of the Linux kernel: nscd, cron, mingetty(s), klogd, portmap, cupsd, qmgr, master, sshd, powersaved, pickup, init, slpd, resmgrd, kdm, ndbd, ...

[Figure: FTQ plot of stock SuSE (most daemons removed); count vs. time in seconds.]

Noise amplification in a parallel program
[Diagram: timeline of ranks pe 0 to pe 4 between synchronization points.]
• In each synchronization interval, one rank experiences a noise delay.
• Because the ranks synchronize, all ranks experience all of the delays.
• In this worst-case situation, the performance degradation due to the noise is multiplied by the number of ranks.
• Noise events that occur infrequently on any one rank (core) occur frequently in the parallel job.
• Synchronization overhead amplifies the noise.

Linux on a Diet – CNL
• Compute Node Linux runs only a minimal set of services on top of the Linux kernel: the ALPS client, syslogd, klogd, init and the Lustre client.

[Figure: FTQ plot of CNL; count vs. time in seconds.]

Cray XE I/O architecture
• All I/O is offloaded to service nodes.
• Lustre: high-performance parallel I/O file system; data is transferred directly between compute nodes and files.
• DVS (Data Virtualization Service): allows compute nodes to access NFS file systems mounted on a service node.
• Applications must execute on file systems mounted on the compute nodes; there are no local disks. /tmp is a memory file system on each login node.

Scaling shared libraries with DVS
• Requests for shared libraries (.so files) from the diskless compute nodes (each mounting /dvs) are routed over the Cray interconnect through DVS servers to an NFS server holding the libraries.
• DVS provides similar functionality as NFS, but scales to thousands of compute nodes.
• It gives a central point of administration for shared libraries.
• DVS servers can be "re-purposed" compute nodes.

Running an Application on the Cray XE6

Running an application on the Cray XE: ALPS + aprun
• ALPS: Application Level Placement Scheduler.
• aprun is the ALPS application launcher. It must be used to run applications on the XE compute nodes. If aprun is not used, the application is launched on the login node (and will most likely fail).
• The aprun man page contains several useful examples.
• There are at least 3 important parameters to control:
  The total number of PEs: -n
  The number of PEs per node: -N
  The number of OpenMP threads: -d (more precisely, the 'stride' between 2 PEs in a node)
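As a quick sanity check of this distinction, you can launch a trivial command through aprun and see it execute on compute nodes rather than on the login node. This is only a sketch: it assumes hostname is available on the compute nodes, that you are inside a batch or interactive allocation, and the output shown is illustrative (node names follow the nidXXXXX convention used later in these slides).

  $ hostname                   # runs on the login node
  login1
  $ aprun -n 2 -N 1 hostname   # runs one instance on each of two compute nodes
  nid00028
  nid00572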
Job launch
The job-launch diagrams on the original slides show the following sequence:
• The user logs in to a login node of the XE6 and submits a job with qsub; the batch system starts a login shell for the job script.
• The aprun command in the job script, together with the apbasil client interface, talks to the apsched daemon on the SDB node to obtain a placement on the compute nodes.
• On the allocated compute nodes the apinit daemon spawns an apshepherd per node, which starts the application processes.
• The application runs on the compute nodes; its I/O requests are forwarded to the I/O nodes, where the I/O daemons implement them.
• When the application finishes, the job is cleaned up and the nodes are returned. Job launch: done.

Some definitions
ALPS is always used for scheduling a job on the compute nodes. It does not care about the programming model you used, so we need a few general definitions:
• PE: processing element. Basically a Unix process; it can be an MPI task, a CAF image, a UPC thread, ...
• NUMA node: the cores and memory on a node with 'flat' memory access; basically one of the 4 dies of the Interlagos processor together with its directly attached memory.
• Thread: a thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different PEs do not share these resources. Most likely you will use OpenMP threads.
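To tie these definitions to the aprun options introduced above, the annotated launch line below is an illustration only; it assumes the 32-core, 4-die XE6 nodes used throughout these slides, and a.out is a placeholder executable.

  # 16 PEs in total (-n), 8 PEs per node (-N): 2 nodes are used
  # each PE runs 4 OpenMP threads (-d 4), so 8 PEs x 4 cores fill the 32 cores of a node
  # 2 PEs are placed per NUMA node (-S 2), i.e. per 8-core die
  $ export OMP_NUM_THREADS=4
  $ aprun -n 16 -N 8 -S 2 -d 4 ./a.out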
Running an application on the Cray XE6 – some basic examples
• Assuming an XE6 IL16 system (32 cores per node).
• Pure MPI application, using all the available cores in a node:
  $ aprun -n 32 ./a.out
• Pure MPI application, using only 1 core per node: 32 MPI tasks on 32 nodes, with 32*32 cores allocated. This can be done to increase the available memory per MPI task:
  $ aprun -N 1 -n 32 -d 32 ./a.out
  (we'll talk about the need for the -d 32 later)
• Hybrid MPI/OpenMP application, 4 MPI ranks per node: 32 MPI tasks with 8 OpenMP threads each; OMP_NUM_THREADS needs to be set:
  $ export OMP_NUM_THREADS=8
  $ aprun -n 32 -N 4 -d $OMP_NUM_THREADS ./a.out

aprun CPU affinity control
• CNL can dynamically distribute work by allowing PEs and threads to migrate from one CPU to another within a node.
• In some cases, moving PEs or threads from CPU to CPU increases cache and Translation Lookaside Buffer (TLB) misses and therefore reduces performance.
• CPU affinity options make it possible to bind a PE or thread to a particular CPU or a subset of CPUs on a node.
• aprun CPU affinity options (see man aprun):
  Default setting: -cc cpu. PEs are bound to a specific core, depending on the -d setting.
  Binding PEs to a specific NUMA node: -cc numa_node. PEs are not bound to a specific core but cannot 'leave' their NUMA node.
  No binding: -cc none
  Your own binding: -cc 0,4,3,2,1,16,18,31,9,...

Memory affinity control
• Cray XE6 systems use dual-socket compute nodes with 4 dies; each die (8 cores) is considered a NUMA node.
• Remote-NUMA-node memory references can adversely affect performance. Even if your PEs and threads are bound to a specific NUMA node, the memory used does not have to be 'local'.
• aprun memory affinity options (see man aprun): the suggested setting is -ss, with which a PE can only allocate memory local to its assigned NUMA node. If this is not possible, your application will crash.
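Putting the placement and affinity options together, the following sketch (same 32-core, 4-die node assumption as above; a.out is a placeholder) runs one wide multithreaded PE per die and keeps its memory local:

  $ export OMP_NUM_THREADS=8
  # 8 PEs in total, 4 per node (-N 4) and one per NUMA node (-S 1), 8 threads per PE (-d 8);
  # threads are bound to cores (-cc cpu) and memory allocation stays on the local die (-ss)
  $ aprun -n 8 -N 4 -S 1 -d 8 -cc cpu -ss ./a.out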
Running an application on the Cray XE – MPMD
• aprun supports MPMD: Multiple Program Multiple Data.
• Launching several executables in the same MPI_COMM_WORLD:
  $ aprun -n 128 exe1 : -n 64 exe2 : -n 64 exe3
• Notice: each executable needs a dedicated node; exe1 and exe2 cannot share a node. For example, the following command needs 3 nodes:
  $ aprun -n 1 exe1 : -n 1 exe2 : -n 1 exe3
• Use a script to start several serial jobs on a node:
  $ aprun -a xt -n 3 script.sh
  $ cat script.sh
  ./exe1 &
  ./exe2 &
  ./exe3 &
  wait

How to use the Interlagos 1/3: one MPI rank on each integer core
[Diagram: an Interlagos module; each integer core has its own integer scheduler, integer pipelines and L1 D-cache, while fetch/decode, the FP scheduler with two 128-bit FMAC pipelines and the L2 cache are shared.]
• In this mode, an MPI task is pinned to each integer core.
• Implications:
  Each core has exclusive access to an integer scheduler, integer pipelines and its L1 D-cache.
  The 256-bit FP unit and the L2 cache are shared between the two cores.
  256-bit AVX instructions are dynamically executed as two 128-bit instructions if the 2nd FP unit is busy.
• When to use:
  The code is highly scalable to a large number of MPI ranks.
  The code can run with a 1 GB per core memory footprint (or 2 GB on a 64 GB node).
  The code is not well vectorized.

How to use the Interlagos 2/3: wide AVX mode
• In this mode, only one integer core is used per core pair.
• Implications:
  This core has exclusive access to the 256-bit FP unit and is capable of 8 FP results per clock cycle.
  The core has twice the memory capacity and memory bandwidth in this mode.
  The L2 cache is effectively twice as large.
  The peak performance of the chip is not reduced.
• When to use:
  The code is highly vectorized and makes use of AVX instructions.
  The code needs more memory per MPI rank.

How to use the Interlagos 3/3: 2-way OpenMP mode
• In this mode, an MPI task is pinned to a core pair and OpenMP is used to run a thread on each integer core.
• Implications:
  Each OpenMP thread has exclusive access to an integer scheduler, integer pipelines and its L1 D-cache.
  The 256-bit FP unit and the L2 cache are shared between the two threads.
  256-bit AVX instructions are dynamically executed as two 128-bit instructions if the 2nd FP unit is busy.
• When to use:
  The code needs a large amount of memory per MPI rank.
  The code has OpenMP parallelism exposed in each MPI rank.
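One possible translation of these three modes into aprun options is sketched below. This is not taken verbatim from the slides; it assumes 32-core nodes, and a.out is a placeholder (complete batch scripts for the second and third mode appear later in this section).

  # 1/3: one MPI rank per integer core, here on 2 nodes
  $ aprun -n 64 -N 32 -cc cpu -ss ./a.out

  # 2/3: wide AVX mode, only every second integer core is used
  $ aprun -n 32 -N 16 -d 2 -cc cpu -ss ./a.out

  # 3/3: 2-way OpenMP, one MPI rank per core pair with 2 threads
  $ export OMP_NUM_THREADS=2
  $ aprun -n 32 -N 16 -d 2 -cc cpu -ss ./a.out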
aprun: cpu_lists for each PE
• CLE was updated to give threads and processing elements more flexibility in placement. This is ideal for processor architectures whose cores share resources that they may have to wait for. Separating cpu_lists by colons (:) allows the user to specify the cores used by each processing element and its child processes or threads. Essentially, this gives the user more granularity in specifying a cpu_list per processing element. An example with 3 threads per PE:
  $ aprun -n 4 -N 4 -cc 1,3,5:7,9,11:13,15,17:19,21,23
• Note: this feature will be modified in CLE 4.0.UP03, but the option will still be valid.

-cc detailed example
  $ export OMP_NUM_THREADS=2
  $ aprun -n 4 -N 2 -d 2 -cc 0,2:4,6:8,10:12,14 ./acheck_mpi
  nid00028[ 0] on cpu 00 affinity for thread 0 is: cpu 0, mask 1
  nid00028[ 0] on cpu 02 affinity for thread 1 is: cpu 2, mask 001
  nid00028[ 1] on cpu 04 affinity for thread 0 is: cpu 4, mask 00001
  nid00028[ 1] on cpu 06 affinity for thread 1 is: cpu 6, mask 0000001
  nid00572[ 2] on cpu 00 affinity for thread 0 is: cpu 0, mask 1
  nid00572[ 2] on cpu 02 affinity for thread 1 is: cpu 2, mask 001
  nid00572[ 3] on cpu 04 affinity for thread 0 is: cpu 4, mask 00001
  nid00572[ 3] on cpu 06 affinity for thread 1 is: cpu 6, mask 0000001
  Application 3024546 resources: utime ~2s, stime ~0s
With -N 2 the four PEs are spread over two nodes (nid00028 and nid00572), so each node only uses the first two cpu_lists. With -N 4 all four PEs land on one node and each PE is bound to its own cpu_list:
  $ aprun -n 4 -N 4 -d 2 -cc 0,2:4,6:8,10:12,14 ./acheck_mpi
  nid00028[ 0] on cpu 00 affinity for thread 0 is: cpu 0, mask 1
  nid00028[ 2] on cpu 08 affinity for thread 0 is: cpu 8, mask 000000001
  nid00028[ 0] on cpu 02 affinity for thread 1 is: cpu 2, mask 001
  nid00028[ 1] on cpu 06 affinity for thread 1 is: cpu 6, mask 0000001
  nid00028[ 3] on cpu 12 affinity for thread 0 is: cpu 12, mask 0000000000001
  nid00028[ 2] on cpu 10 affinity for thread 1 is: cpu 10, mask 00000000001
  nid00028[ 1] on cpu 04 affinity for thread 0 is: cpu 4, mask 00001
  nid00028[ 3] on cpu 14 affinity for thread 1 is: cpu 14, mask 000000000000001
  Application 3024549 resources: utime ~0s, stime ~0s

Running a batch application with Torque
• The number of required nodes and cores is determined by the parameters specified in the job header:
  #PBS -l mppwidth=256   (MPP width: number of PEs)
  #PBS -l mppnppn=4      (MPP number of PEs per node)
  This example uses 256/4 = 64 nodes.
• The job is submitted with the qsub command.
• At the end of the execution, output and error files are returned to the submission directory.
• PBS environment variable: $PBS_O_WORKDIR is set to the directory from which the job has been submitted; the default working directory is $HOME. See man qsub for the environment variables.

Other Torque options
• #PBS -N job_name: the job name is used to determine the name of the job output and error files.
• #PBS -l walltime=hh:mm:ss: maximum job elapsed time. It should be indicated whenever possible, since this allows Torque to determine the best scheduling strategy.
• #PBS -j oe: job error and output files are merged into a single file.
• #PBS -q queue: request execution on a specific queue.

Torque and aprun
  Torque                  aprun
  -lmppwidth=$PE          -n $PE            number of PEs to start
  -lmppdepth=$threads     -d $threads       threads per PE
  -lmppnppn=$N            -N $N             PEs per node
  <none>                  -S $S             PEs per NUMA node
  -lmem=$size             -m $size[h|hs]    per-PE required memory
• -B provides aprun with the Torque settings for -n, -N, -d and -m:
  $ aprun -B ./a.out
• Using -S can produce problems if you are not asking for a full node. If possible, ALPS will only give you access to part of a node if the Torque settings allow this. The following will fail:
  #PBS -lmppwidth=4      ! not asking for a full node
  $ aprun -n4 -S1 ...    ! trying to run on every die
  The solution is to ask for a full node, even if aprun doesn't use it.
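To make the mapping above concrete, the fragment below (a sketch; the executable is a placeholder) shows a job header and the equivalent launches with and without -B:

  #PBS -l mppwidth=64
  #PBS -l mppnppn=16
  #PBS -l mppdepth=2
  cd $PBS_O_WORKDIR
  # -B picks up -n, -N, -d and -m from the Torque settings above,
  aprun -B ./a.out
  # so here it is equivalent to:
  # aprun -n 64 -N 16 -d 2 ./a.out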
Core specialization
• System 'noise' on compute nodes may significantly degrade scalability for some applications.
• Core specialization can mitigate this problem:
  1 core per node is dedicated to system work (the service core).
  As many system interrupts as possible are forced to execute on the service core.
  The application does not run on the service core.
• Use aprun -r to get core specialization:
  $ aprun -r 1 -n 100 a.out
• apcount is provided to compute the total number of cores required:
  $ qsub -l mppwidth=$(apcount -r 1 1024 16) job
  aprun -n 1024 -r 1 a.out

Running a batch application with Torque: hybrid MPI + OpenMP
• The number of required nodes can be specified in the job header.
• The job is submitted with the qsub command.
• At the end of the execution, output and error files are returned to the submission directory.
• Environment variables are inherited by the job with #PBS -V.
• The job starts in the home directory; $PBS_O_WORKDIR contains the directory from which the job has been submitted.
  #!/bin/bash
  #PBS -N hybrid
  #PBS -lwalltime=00:10:00
  #PBS -lmppwidth=128
  #PBS -lmppnppn=8
  #PBS -lmppdepth=4
  cd $PBS_O_WORKDIR
  export OMP_NUM_THREADS=4
  aprun -n128 -d4 -N8 a.out

Starting an interactive session with Torque
• An interactive job can be started with the -I argument (that is a capital i).
• Example: allocate 64 cores and export the environment variables to the job (-V):
  $ qsub -I -V -lmppwidth=64
• This will give you a new prompt in your shell from which you can use aprun directly. Note that you are running on a MOM node (a shared resource) when not using aprun.

Watching a launched job on the Cray XE
• xtnodestat: shows the XE node allocation and aprun processes, both interactive and PBS.
• apstat: shows the status of aprun processes.
  apstat            overview
  apstat -a [apid]  info about all the applications or a specific one
  apstat -n         info about the status of the nodes
• The batch command qstat shows the batch jobs.

Starting 512 MPI tasks (PEs)
  #PBS -N MPIjob
  #PBS -l mppwidth=512
  #PBS -l mppnppn=32
  #PBS -l walltime=01:00:00
  #PBS -j oe
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  export MALLOC_MMAP_MAX_=0
  export MALLOC_TRIM_THRESHOLD_=536870912
  aprun -n 512 -cc cpu -ss ./a.out
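The MPICH_ENV_DISPLAY and MALLOC_* exports recur in this and the following scripts. As a rough explanation (my reading, not stated on the original slides): MPICH_ENV_DISPLAY=1 makes the MPI library print its environment settings at startup, and the two malloc variables are standard glibc tunables that keep freed memory inside the process instead of returning it to the OS, avoiding repeated page faults on reallocation.

  export MPICH_ENV_DISPLAY=1               # print the MPICH environment settings at job start
  export MALLOC_MMAP_MAX_=0                # do not use mmap for large allocations
  export MALLOC_TRIM_THRESHOLD_=536870912  # keep up to 512 MB of freed memory in the heap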
Starting an OpenMP program, using a single node
  #PBS -N OpenMP
  #PBS -l mppwidth=1
  #PBS -l mppdepth=32
  #PBS -l walltime=01:00:00
  #PBS -j oe
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  export MALLOC_MMAP_MAX_=0
  export MALLOC_TRIM_THRESHOLD_=536870912
  export OMP_NUM_THREADS=32
  aprun -n1 -d $OMP_NUM_THREADS -cc cpu -ss ./a.out

Starting a hybrid job on a single node, 4 MPI tasks, each with 8 threads
  #PBS -N hybrid
  #PBS -l mppwidth=4
  #PBS -l mppnppn=4
  #PBS -l mppdepth=8
  #PBS -l walltime=01:00:00
  #PBS -j oe
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  export MALLOC_MMAP_MAX_=0
  export MALLOC_TRIM_THRESHOLD_=536870912
  export OMP_NUM_THREADS=8
  aprun -n4 -N4 -d $OMP_NUM_THREADS -cc cpu -ss ./a.out

Starting a hybrid job on a single node, 8 MPI tasks, each with 4 threads
  #PBS -N hybrid
  #PBS -l mppwidth=8
  #PBS -l mppnppn=8
  #PBS -l mppdepth=4
  #PBS -l walltime=01:00:00
  #PBS -j oe
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  export MALLOC_MMAP_MAX_=0
  export MALLOC_TRIM_THRESHOLD_=536870912
  export OMP_NUM_THREADS=4
  aprun -n8 -N8 -d $OMP_NUM_THREADS -cc cpu -ss ./a.out

Starting an MPMD job on a non-default project id, using 1 master and 16 slaves, each with 8 threads
  #PBS -N hybrid
  #PBS -l mppwidth=160   ! Note: 5 nodes * 32 cores = 160 cores
  #PBS -l mppnppn=32
  #PBS -l walltime=01:00:00
  #PBS -j oe
  #PBS -W group_list=My_Project
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  export MALLOC_MMAP_MAX_=0
  export MALLOC_TRIM_THRESHOLD_=536870912
  export OMP_NUM_THREADS=8
  id   # Unix command 'id', to check the group id
  aprun -n1 -d32 -N1 ./master.exe : -n 16 -N4 -d $OMP_NUM_THREADS -cc cpu -ss ./slave.exe

Starting an MPI job on two nodes, using only every second integer core
  #PBS -N hybrid
  #PBS -l mppwidth=32
  #PBS -l mppnppn=16
  #PBS -l mppdepth=2
  #PBS -l walltime=01:00:00
  #PBS -j oe
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  aprun -n32 -N16 -d 2 -cc cpu -ss ./a.out

Starting a hybrid job on two nodes, using only every second integer core
  #PBS -N hybrid
  #PBS -l mppwidth=32
  #PBS -l mppnppn=16
  #PBS -l mppdepth=2
  #PBS -l walltime=01:00:00
  #PBS -j oe
  cd $PBS_O_WORKDIR
  export MPICH_ENV_DISPLAY=1
  export OMP_NUM_THREADS=2
  aprun -n32 -N16 -d $OMP_NUM_THREADS -cc 0,2:4,6:8,10:12,14:16,18:20,22:24,26:28,30 -ss ./a.out

Running a batch application with SLURM: hybrid MPI + OpenMP
• The number of required nodes can be specified in the job header.
• The job is submitted with the qsub command.
• At the end of the execution, output and error files are returned to the submission directory.
• Environment variables are inherited.
• The job starts in the directory from which it has been submitted.
  #!/bin/bash
  #SBATCH --job-name="hybrid"
  #SBATCH --time=00:10:00
  #SBATCH --nodes=8
  export OMP_NUM_THREADS=6
  aprun -n32 -d6 a.out

Starting an interactive session with SLURM
• An interactive job can be started with the SLURM salloc command.
• Example: allocate 8 nodes:
  $ salloc -N 8
• Further SLURM info is available from the CSCS web page www.cscs.ch under "User Entry Point / How to Run a Batch Job / Palu - Cray XE6".

Documentation
• Cray docs site: http://docs.cray.com
• Starting point for Cray XE info: http://docs.cray.com/cgi-bin/craydoc.cgi?mode=SiteMap;f=xe_sitemap
• Twitter?!? http://twitter.com/craydocs

End