Numascale Best Practice Guide

Atle Vesterkjær
Senior Software Engineer
Numascale
E-mail: av@numascale.com
Website: www.numascale.com
Revision 1.4, May 22, 2015
Revision History

Rev   Date         Author            Changes
0.1   2014-03-03   Atle Vesterkjær   First version
1.0   2014-03-05   Atle Vesterkjær   Added link to wiki. Updated with better fonts.
1.1   2014-03-16   Atle Vesterkjær   Installation hints
1.2   2014-08-28   Atle Vesterkjær   Added information about how to run hybrid applications (MPI+OpenMP)
1.3   2014-08-28   Atle Vesterkjær   Added information about how to run linear algebra code using thread-safe ScaLAPACK
1.4   2015-05-22   Atle Vesterkjær   Fixed the MPI and hybrid rankfiles to fit the interpretation of socket in newer kernels, tested on 3.19.5-numascale45+
Contents

1 INTRODUCTION
  1.1 Abstract
  1.2 Abbreviations
  1.3 Background
2 MULTITHREADED APPLICATION THAT CAN BENEFIT FROM A LARGER NUMBER OF CORES THAN A SINGLE SERVER CAN PROVIDE
  2.1 Tuning your application for a Numascale system
    2.1.1 OpenMP
    2.1.2 MPI
    2.1.3 Hybrid (MPI+OpenMP)
    2.1.4 Pthreads, or other multithreaded applications
  2.2 Running linear algebra libraries
3 MEMORY LAYOUT OF APPLICATIONS IN A NUMASCALE SHARED MEMORY SYSTEM
  3.1 All threads are available, e.g. through htop
  3.2 Keep Thread Local Storage (TLS) local
  3.3 Use a NUMA-aware memory allocator
    3.3.1 The HOARD memory allocator
    3.3.2 The NCALLOC memory allocator
  3.4 Arrange your data in a NUMA-aware way
4 COMPILERS
5 INSTALLING NUMASCALE BTL
6 IPMI
  6.1 Power control
  6.2 Console access
7 INSTALLING APPLICATIONS
  7.1 Gromacs
1 INTRODUCTION
1.1 Abstract
This is a summary of best practices for achieving optimal results using Numascale
Shared Memory Systems with emphasis on running and optimizing applications.
Keywords are:
- OpenMP, affinity, different compilers
- MPI, Linpack, NPB, rankfiles
- vmstat, htop, top, numastat, "cat /proc/cpuinfo", etc.
- NUMA-aware optimizations of code
1.2 Abbreviations
Throughout this document, we use the following terms:
Numascale AS - The name of the company that develops and owns NumaChip and NumaConnect.
NumaServer  - A server equipped with a NumaConnect card.
NumaMaster  - A NumaServer with an instance of the operating system installed.
1.3 Background
This document is an addendum to the documents at
http://www.numascale.com/numa_support.html that cover all aspects of setup,
cabling and installation of a Numascale Shared Memory system:
- Numascale hardware installation guides
  o https://resources.numascale.com/Numascale-Hardware-Installation-IBM-System-x3755-M3.pdf
  o https://resources.numascale.com/Numascale-Hardware-Installation-H8QGL-iF+.pdf
- NumaManager Quick Start Guide, https://resources.numascale.com/nma-quickstart.pdf
  o a one-page Quick Start Guide for deployment of NumaConnect
- NumaConnect Deployment with NumaManager, https://resources.numascale.com/nc-deployment-nma.pdf
  o a complete guide to getting a NumaConnect system up and running.
    Using the NumaManager in this process is strongly recommended,
    and the NumaManager is included in the NumaConnect delivery.
- NumaConnect Deployment, https://resources.numascale.com/nc-deployment.pdf
  o a complete guide to getting a NumaConnect system up and running
    when no NumaManager is available. It is recommended to use the
    NumaManager in this process.
Any multithreaded application can run on a Numascale Shared Memory System.
The typical criteria for considering a Numascale Shared Memory System are:
- You have a multithreaded application that can benefit from a larger number of cores than a single server can provide.
- You have an application that requires more memory than a single server can provide.
- You have an application that requires both more cores and more memory than a single server can provide.
2 MULTITHREADED APPLICATION THAT CAN BENEFIT FROM A LARGER NUMBER OF CORES THAN A SINGLE SERVER CAN PROVIDE
If you have a multithreaded application that can benefit from a larger number of cores than a single server can provide, it is a good candidate for a Numascale Shared Memory System. Compared to other high-end ccNUMA solutions and software-only ccNUMA solutions, cores come at a very low price in a NumaConnect Shared Memory System.
If your application scales well when you use more of the cores in your server, then there is a good chance that it will also run well on a Numascale Shared Memory System with a larger number of cores than you have in your single server.
Figure 1: Streams scaling (16 nodes, 32 sockets, 192 cores). The chart plots Triad bandwidth (MByte/sec) against the number of CPU sockets for NumaConnect, 1041A-T2F, linear socket and linear node scaling. On a Numascale Shared Memory System the Streams benchmark can run in parallel on a larger number of cores than you have in your single server.
The Stream benchmark is used as an example of such an application. Looking at the green bars in Figure 1, you can see that the speedup increases when more sockets are used inside one server. This makes the application a good candidate for ccNUMA systems.
In chapter 3 we will look into how a multithreaded application accesses memory in a ccNUMA system. This is very relevant to how well it will scale in a large shared memory system, where all CPUs and memory are no longer on the same physical server.
2.1 Tuning your application for a Numascale system
2.1.1 OpenMP
OpenMP is the de facto standard way of writing multithreaded programs. OpenMP pragmas let you parallelize code and give the compiler hints about where to place the parallel threads. Use environment variables to control the runtime behaviour of OpenMP applications, for example (a small program to verify the resulting thread placement is sketched after this list):
- GNU: OMP_NUM_THREADS=64 GOMP_CPU_AFFINITY="0-255:4" bin/ep.C.x
- GNU: OMP_NUM_THREADS=128 GOMP_CPU_AFFINITY="0-127" bin/ep.C.x
- GNU: OMP_NUM_THREADS=128 GOMP_CPU_AFFINITY="0-255:2" bin/ep.C.x
- PGI: OMP_THREAD_LIMIT=128 OMP_PROCS_BIND=true OMP_NUM_THREADS=128 MP_BIND=yes MP_BLIST=$(seq -s, 0 2 255) bin_pgfortran/ep.C.x
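As a quick sanity check of the affinity settings, a small test program can report the CPU each OpenMP thread lands on. This is a minimal sketch and not part of the guide's benchmarks; it assumes gcc with -fopenmp and the GNU sched_getcpu() extension:

/* affinity_check.c - minimal sketch: print the CPU each OpenMP thread runs on,
 * so you can verify that GOMP_CPU_AFFINITY (or MP_BLIST) placed the threads
 * as intended.
 * Build: gcc -fopenmp -O2 affinity_check.c -o affinity_check
 * Run:   OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="0-255:2" ./affinity_check
 */
#define _GNU_SOURCE
#include <sched.h>              /* sched_getcpu() */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Each thread prints its OpenMP id and the CPU it currently runs on. */
        printf("thread %3d runs on cpu %3d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}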
The AMD Opteron™ 6300 series processor is organized as a Multi-Chip Module
(two CPU dies in the package), interconnected using a HyperTransport link. Each
die has its own memory controller interface to allow external connection of up to
two memory channels per die. This memory interface is shared among all the CPU
core pairs on each die. Each core pair appears to the operating system as two
completely independent CPU cores. Thus, to the OS, the device shown in Figure 2
appears as 16 CPUs.
Figure 2: The AMD Opteron™ 6300 series processor is organized as a Multi-Chip Module
In a Numascale Shared Memory System with AMD 6380 CPUs, the most efficient way to run an FPU-intensive OpenMP application with 128 threads on a 256-core system is:
[atle@numademo NPB3.3-OMP]$ OMP_NUM_THREADS=128 GOMP_CPU_AFFINITY="0-255:2" bin/sp.D.x
Figure 3: Speedup of Numascale OpenMP application when using more FPUs
If an OpenMP application uses 64 threads, the most efficient way to distribute the threads is to use every fourth core in the system; this lowers the load on the memory interface and the L3 cache shared among the CPU core pairs on each die:
[atle@numademo NPB3.3-OMP]$ OMP_NUM_THREADS=64 GOMP_CPU_AFFINITY="0-255:4" bin/sp.D.x
Figure 4: Speedup of Numascale OpenMP application when balancing the load to get a FAT link to
memory and L3 cache
2.1.2 MPI
Numascale has developed a byte transfer layer (BTL) for OpenMPI that fully utilizes NumaConnect. The Numascale OpenMPI implementation is very efficient compared to a standard shared memory BTL; standard tests show a speedup of 4.5-19.9 on typical systems.
Figure 5: Numascale NumaConnect-BTL boosts your MPI performance
In order to use the NumaConnect-BTL make sure that it is included in the path:
[atle@numademo NPB3.3-MPI]$ export LD_LIBRARY_PATH=/opt/openmpi-1.7.2-numaconnect/lib:$LD_LIBRARY_PATH
[atle@numademo NPB3.3-MPI]$ export PATH=/opt/openmpi-1.7.2-numaconnect/bin:$PATH
[atle@numademo NPB3.3-MPI]$ which mpiexec
/opt/openmpi-1.7.2-numaconnect/bin/mpiexec
[atle@numademo NPB3.3-MPI]$
There are different ways of optimizing an MPI application. To avoid slow disk I/O we use tmpfs (/dev/shm) as the location for temporary OpenMPI files, set by the OMPI_PREFIX_ENV variable.
Use the --bind-to-core option for jobs running on all the cores in the system:
[atle@numademo ~]$ OMPI_PREFIX_ENV=/dev/shm mpiexec -n 256 --bind-to-core
-mca btl self,nc bin/lu.D.256
Use rankfiles when not running on all cores in the system.
An MPI application that uses fewer than all the cores in a Numascale Shared Memory System is optimized using rankfiles. In a 256-core system with AMD 6380 CPUs you can give each MPI process a dedicated FPU by selecting every second core in a rankfile:
[atle@numademo NPB3.3-MPI]$ cat rank128_lu
rank 0=numademo slot=0:0
rank 1=numademo slot=0:2
rank 2=numademo slot=0:4
rank 3=numademo slot=0:6
….. until
rank 120=numademo slot=15:0
rank 121=numademo slot=15:2
rank 122=numademo slot=15:4
rank 123=numademo slot=15:6
rank 124=numademo slot=15:8
rank 125=numademo slot=15:10
rank 126=numademo slot=15:12
rank 127=numademo slot=15:14
[atle@numademo NPB3.3-MPI]$
By using every fourth core in the system, it is possible to distribute the processes even more and utilize all memory controllers and L3 caches in the system:
[atle@numademo NPB3.3-MPI]$ cat rank64
rank 0=numademo slot=0:0
rank 1=numademo slot=0:4
rank 2=numademo slot=0:8
rank 3=numademo slot=0:12
…..etc until
rank 60=numademo slot=15:0
rank 61=numademo slot=15:4
rank 62=numademo slot=15:8
rank 63=numademo slot=15:12
Start the application with these options set:
[atle@numademo NPB3.3-MPI]$ OMPI_PREFIX_ENV=/dev/shm mpiexec
-n 64 -rf rank64 -mca btl self,nc bin/lu.D.64
2.1.3 Hybrid (MPI+OpenMP)
Application suites that come with a hybrid version are able to run "out of the box" on a Numascale Shared Memory System.
Here is one example using the NAS Parallel Benchmarks (NPB-MZ-MPI), an application suite that comes with a hybrid version. In this example each of the 16 MPI processes spawns two OpenMP threads, utilizing 32 cores.
[atle@numademo NPB3.3-MZ-MPI]$ export LD_PRELOAD=/opt/HOARD/libhoard.so
[atle@numademo NPB3.3-MZ-MPI]$ export OMP_NUM_THREADS=2; /opt/openmpi-1.6.0-numaconnect/bin/mpirun -n 16 --mca btl nc,self -rf rank16_colon2 bin_gnu/lu-mz.C.16
Here is the hybrid rank file used for the test:
[atle@numademo NPB3.3-MZ-MPI]$ cat rank16_2
rank 0=localhost slot=0:0,0:2
rank 1=localhost slot=0:4,0:6
rank 2=localhost slot=0:8,0:10
rank 3=localhost slot=0:12,0:14
…
rank 15=localhost slot=3:8,3:10
If you want to run a test with 16 MPI processes each spawning 8 OpenMP threads, 128 cores in total, you can use this rankfile:
[atle@numademo NPB3.3-MZ-MPI]$ cat rank16_8
rank 0=h8qgl slot=0:0,0:2,0:4,0:6,0:8,0:10,0:12,0:14
rank 1=h8qgl slot=1:0,1:2,1:4,1:6,1:8,1:10,1:12,1:14
rank 2=h8qgl slot=2:0,2:2,2:4,2:6,2:8,2:10,2:12,2:14
rank 3=h8qgl slot=3:0,3:2,3:4,3:6,3:8,3:10,3:12,3:14
rank 4=h8qgl slot=4:0,4:2,4:4,4:6,4:8,4:10,4:12,4:14
rank 5=h8qgl slot=5:0,5:2,5:4,5:6,5:8,5:10,5:12,5:14
rank 6=h8qgl slot=6:0,6:2,6:4,6:6,6:8,6:10,6:12,6:14
rank 7=h8qgl slot=7:0,7:2,7:4,7:6,7:8,7:10,7:12,7:14
rank 8=h8qgl slot=8:0,8:2,8:4,8:6,8:8,8:10,8:12,8:14
rank 9=h8qgl slot=9:0,9:2,9:4,9:6,9:8,9:10,9:12,9:14
rank 10=h8qgl slot=10:0,10:2,10:4,10:6,10:8,10:10,10:12,10:14
rank 11=h8qgl slot=11:0,11:2,11:4,11:6,11:8,11:10,11:12,11:14
rank 12=h8qgl slot=12:0,12:2,12:4,12:6,12:8,12:10,12:12,12:14
rank 13=h8qgl slot=13:0,13:2,13:4,13:6,13:8,13:10,13:12,13:14
rank 14=h8qgl slot=14:0,14:2,14:4,14:6,14:8,14:10,14:12,14:14
rank 15=h8qgl slot=15:0,15:2,15:4,15:6,15:8,15:10,15:12,15:14
and run:
[atle@numademo NPB3.3-MZ-MPI]$ export OMP_NUM_THREADS=8; /opt/openmpi-1.6.0-numaconnect/bin/mpirun -n 16 --mca btl nc,self -rf rank16_8 bin_gnu/lu-mz.C.16
2.1.4 Pthreads, or other multithreaded applications
taskset and numactl apply a CPU (and, with numactl, memory) mask that is inherited by the threads created in a multithreaded program. Applications written with pthreads can use this to choose the cores and memory nodes they execute on; a programmatic alternative is sketched at the end of this section. In tests at Numascale, taskset and numactl perform equally well, but other online resources recommend numactl when guaranteed memory binding is needed.
[root@x3755 02_BWA_MEM]# taskset
taskset (util-linux-ng 2.17.2)
usage: taskset [options] [mask | cpu-list] [pid | cmd [args...]]
set or get the affinity of a process

  -p, --pid                  operate on existing given pid
  -c, --cpu-list             display and specify cpus in list format
  -h, --help                 display this help
  -V, --version              output version information

The default behavior is to run a new command:
    taskset 03 sshd -b 1024
You can retrieve the mask of an existing task:
    taskset -p 700
Or set it:
    taskset -p 03 700
List format uses a comma-separated list instead of a mask:
    taskset -pc 0,3,7-11 700
Ranges in list format can take a stride argument:
    e.g. 0-31:2 is equivalent to mask 0x55555555
[root@x3755 02_BWA_MEM]# numactl
usage: numactl [--interleave=nodes] [--preferred=node]
               [--physcpubind=cpus] [--cpunodebind=nodes]
               [--membind=nodes] [--localalloc] command args ...
       numactl [--show]
       numactl [--hardware]
       numactl [--length length] [--offset offset] [--shmmode shmmode]
               [--strict]
               [--shmid id] --shm shmkeyfile | --file tmpfsfile
               [--huge] [--touch]
               memory policy | --dump | --dump-nodes

memory policy is --interleave, --preferred, --membind, --localalloc
nodes is a comma delimited list of node numbers or A-B ranges or all.
cpus is a comma delimited list of cpu numbers or A-B ranges or all
all ranges can be inverted with !
all numbers and ranges can be made cpuset-relative with +
the old --cpubind argument is deprecated.
use --cpunodebind or --physcpubind instead
length can have g (GB), m (MB) or k (KB) suffixes
[root@x3755 02_BWA_MEM]#
"numactl --hardware" gives a good view of the hardware setup. Here a Numascale Shared Memory System with two NumaServers, 8 sockets and 64 cores is used:
[root@x3755 02_BWA_MEM]# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 65491 MB
node 0 free: 24474 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 65536 MB
node 1 free: 56116 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 65536 MB
node 2 free: 64338 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 47104 MB
node 3 free: 46263 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 65536 MB
node 4 free: 64358 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 65536 MB
node 5 free: 62861 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 65536 MB
node 6 free: 63288 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 49152 MB
node 7 free: 46946 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22 100 100 100 100
  1:  16  10  16  22 100 100 100 100
  2:  16  16  10  16 100 100 100 100
  3:  22  22  16  10 100 100 100 100
  4: 100 100 100 100  10  16  16  22
  5: 100 100 100 100  16  10  16  22
  6: 100 100 100 100  16  16  10  16
  7: 100 100 100 100  22  22  16  10
Here is some advice on how to tune a multithreaded application by selecting a runtime mask:
- Running only on the first socket
taskset -c 0-7 /scratch/demouser04/04_Software/bwa-0.7.4/bwa mem -t $procs -t ${REF} ${INP} ${INP2}
numactl --cpunodebind 0,1,2,3 --membind 0,1,2,3 /scratch/demouser04/04_Software/bwa-0.7.4/bwa mem -t $procs -t ${REF} ${INP} ${INP2}
- Running on all cores in the first NumaServer
taskset -c 0-31 /scratch/demouser04/04_Software/bwa-0.7.4/bwa mem -t $procs -M ${REF} ${INP} ${INP2}
numactl --cpunodebind 0,1,2,3,4,5,6,7 --membind 0,1,2,3,4,5,6,7 /scratch/demouser04/04_Software/bwa-0.7.4/bwa mem -t $procs -t ${REF} ${INP} ${INP2}
numactl --physcpubind 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 --membind 0,1,2,3,4,5,6,7 /scratch/demouser04/04_Software/bwa-0.7.4/bwa mem -t $procs -t ${REF} ${INP} ${INP2}
- Running on all FPUs in the first NumaServer
taskset -c 0-31:2 /scratch/demouser04/04_Software/bwa-0.7.4/bwa mem -t $procs -M ${REF} ${INP} ${INP2}
numactl --physcpubind 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 --membind 0,1,2,3,4,5,6,7 /scratch/demouser04/04_Software/bwa-0.7.4/bwa mem -t $procs -t ${REF} ${INP} ${INP2}
- Running on all FPUs in the system
taskset -c 0-63:2 /scratch/demouser04/04_Software/bwa-0.7.4/bwa mem -t $procs -M ${REF} ${INP} ${INP2}
numactl --physcpubind 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 --membind 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 /scratch/demouser04/04_Software/bwa-0.7.4/bwa mem -t $procs -t ${REF} ${INP} ${INP2}
Monitor the system usage with vmstat, htop, top and numastat to see whether the application behaves as intended with these runtime parameters. Please note that htop and top require some system resources and might influence the performance. We recommend rerunning the benchmark with the more lightweight vmstat for the final benchmark run once the runtime parameters have been qualified.
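If you prefer to set the affinity from inside the application instead of launching it under taskset or numactl, the pthreads API offers GNU extensions for this. The following is a minimal sketch, not taken from the guide; the core numbering (every second core, as in the rankfile examples above) and the thread count are illustrative assumptions.

/* pin_threads.c - minimal sketch of pinning pthreads to explicit cores from
 * inside the program, as an alternative to taskset/numactl (assumes the GNU
 * extension pthread_attr_setaffinity_np, i.e. glibc on Linux).
 * Build: gcc -O2 -pthread pin_threads.c -o pin_threads
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 4              /* illustrative; one thread per chosen core */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* sched_getcpu() lets each thread report where it actually runs. */
    printf("thread %ld runs on cpu %d\n", id, sched_getcpu());
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++) {
        pthread_attr_t attr;
        cpu_set_t cpus;

        pthread_attr_init(&attr);
        CPU_ZERO(&cpus);
        CPU_SET(2 * i, &cpus);  /* every second core: one FPU per thread */
        pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

        pthread_create(&tid[i], NULL, worker, (void *)i);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}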
2.2 Running linear algebra libraries
ScaLAPACK is a scalable subset of the LAPACK (Linear Algebra Package) routines. It is open source and can be found at www.netlib.org. All commonly used linear algebra libraries, such as IBM ESSL, Intel MKL, AMD ACML, IMSL and NAG, use LAPACK and BLAS, and the vendors supply highly optimized versions of the Netlib code. Numerical applications like Matlab or Fluent use these optimized libraries.
ScaLAPACK is based on MPI communication in BLACS and is not thread-safe.
Our hardware supports solvers that run their threads distributed across the system. We can provide a thread-safe ScaLAPACK and an OpenMP, shared-memory-based BLACS library. This way the solvers run in a single process using OpenMP, distributed over several machines.
Please check out https://wiki.numascale.com/tips/building-openmp-based-scalapack for more information.
3 MEMORY LAYOUT OF APPLICATIONS IN A
NUMASCALE SHARED MEMORY SYSTEM
To illustrate the basic difference between a Numascale Shared Memory System and a cluster, consider a large job that can be solved in parallel by a large number of resources. In the clustering world, a large data set is split into smaller fragments that are distributed to individual nodes, loosely coupled using the MPI protocol.
Imagine that such a job represents a large image like in Figure 6.
Figure 6: Distributed Memory (Clusters)
Consider the same job in the Numascale context. A shared view of the whole
dataset is just as simple as on a standalone server.
Figure 7: Shared Memory
A Numascale Shared Memory System is programmed just like a standalone server. The full memory range is available to all applications. You can run "top" on a 1.5 TB system.
Figure 8: top on 1.5 TB system
3.1 All threads are available, e.g. through htop
Figure 9: htop on 144 cores
The major difference between a single-server system and a Numascale Shared Memory System is the memory latency. In a ccNUMA system, the memory latency differs depending on where the actual memory is located.
Figure 10: Memory latencies in a Numascale system. The chart shows thread latency (remote non-polluting write latency, in microseconds) for an arbitrary numanode (socket no 2) when accessing all the other numanodes (sockets 1-32) using 8 NumaServers; most remote accesses measure roughly 0.8-1.1 microseconds.
Although a ccNUMA system is logically similar to a traditional SMP, it is physically different. The runtime of an application depends heavily on where the data is located relative to the running thread.
Figure 11: Memory hierarchy in a Numascale system
"numastat" monitors the memory access pattern in the system and can be used to verify the runtime parameters.
[atle@x3755 ~]$ numastat
                           node0           node1           node2           node3
numa_hit               116198855        22674225         1601598          326121
numa_miss                      0          390637               0               0
numa_foreign              390637               0               0               0
interleave_hit             26017           26221           26298           26107
local_node             116197682        22647868         1576067          299901
other_node                  1173          416994           25531           26220

                           node4           node5           node6           node7
numa_hit                 2484895         4957523         4425252        25191604
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit             26026           26209           26289           26088
local_node               2458717         4931194         4398832        25165424
other_node                 26178           26329           26420           26180
[atle@x3755 ~]$
"numastat" can also be used to monitor the memory access pattern of a running process.

[atle@x3755 ~]$ ps -elf | grep bwa
4 S root 25459 25458 99 80 0 - 1274150 futex_ 10:29 pts/3 00:36:30 /home/demouser04/04_Software/bwa-0.7.4/bwa mem -t 128 -M ../../../00_RAW_DATA/EC_K12 ../../../00_RAW_DATA/QBICECKH100001W4_GCCAAT_L001_R2_001.fastq

[root@x3755 atle]# numastat -p 25459

Per-node process memory usage (in MBs) for PID 25459 (bwa)
                          Node 0          Node 1          Node 2
                 --------------- --------------- ---------------
Huge                        0.00            0.00            0.00
Heap                        0.00            0.00            0.00
Stack                       0.18            0.09            0.19
Private                    63.66           38.99           43.42
---------------- --------------- --------------- ---------------
Total                      63.85           39.07           43.61

                          Node 3          Node 4          Node 5
                 --------------- --------------- ---------------
Huge                        0.00            0.00            0.00
Heap                        0.00            0.00            0.00
Stack                       0.18            2.52            0.38
Private                    31.44           82.00           66.58
---------------- --------------- --------------- ---------------
Total                      31.62           84.51           66.96

                          Node 6          Node 7           Total
                 --------------- --------------- ---------------
Huge                        0.00            0.00            0.00
Heap                        0.00            0.00            0.00
Stack                      11.33            0.46           15.32
Private                  3434.12           64.36         3824.57
---------------- --------------- --------------- ---------------
Total                    3445.45           64.82         3839.89
[root@x3755 atle]#
Applications that take the different memory latencies of a ccNUMA system into account scale well on a Numascale Shared Memory System. The RWTH TrajSearch code achieves its best speedup on NumaConnect, even though it was originally adapted to ScaleMP.
Figure 12: RWTH TrajSearch code
3.2 Keep Thread Local Storage (TLS) local
Numascale has patched the gcc and open64 OpenMP implementations to make sure that the stack and Thread Local Storage (TLS) are local to the running threads. This favors these compilers for OpenMP code. For more information about this feature, see https://wiki.numascale.com/tips/placing-stack-and-tls-memory.
Figure 13: https://wiki.numascale.com/tips shows best practice hints
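As an illustration (not taken from the guide), thread-local scratch data in an OpenMP code can be declared with __thread; with the patched compilers the pages backing such data are placed on the NUMA node of the thread that owns them:

/* tls_scratch.c - minimal sketch of using Thread Local Storage in an OpenMP
 * code. Each thread gets its own copy of `scratch`; with the Numascale-patched
 * gcc/open64 runtimes the TLS pages end up local to the owning thread.
 * Build: gcc -fopenmp -O2 tls_scratch.c -o tls_scratch
 */
#include <omp.h>
#include <stdio.h>

#define SCRATCH 4096

static __thread double scratch[SCRATCH];   /* one private copy per thread */

int main(void)
{
    double sum = 0.0;

    #pragma omp parallel reduction(+:sum)
    {
        /* Work entirely in the thread's own TLS buffer ... */
        for (int i = 0; i < SCRATCH; i++)
            scratch[i] = omp_get_thread_num() + i * 1e-6;

        /* ... and only combine small per-thread results at the end. */
        for (int i = 0; i < SCRATCH; i++)
            sum += scratch[i];
    }
    printf("sum = %f\n", sum);
    return 0;
}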
3.3 Use a NUMA-aware memory allocator
3.3.1 The HOARD memory allocator
Numascale is experimenting with memory allocators other than the libc one in order to speed up allocation by not shifting blocks around between threads. The Hoard allocator (http://www.hoard.org/) reduces lock contention by using multiple heaps and avoids false sharing in its allocations.
export LD_PRELOAD=/opt/HOARD/libhoard.so
3.3.2 The NCALLOC memory allocator
Numascale has written its own memory allocator to ensure that the heap of a running thread is local to the core it is running on. You can experiment with both the NCALLOC and HOARD memory allocators. You can find more information about the NCALLOC memory allocator at https://wiki.numascale.com/tips/the-ncalloc-memory-allocator.
export LD_PRELOAD=/opt/HOARD/libncalloc.so
3.4 Arrange your data in a NUMA-aware way
The least efficient memory model for an application that you would like to scale well on a ccNUMA system is one big array of data on a single node that all threads have to access. Try to distribute the data as much as possible so that threads can use memory local to the core they are running on; a first-touch sketch follows below.
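One common way to distribute data is first-touch initialization. The sketch below is an illustration and not taken from the guide; it relies on Linux's default first-touch page placement and assumes the OpenMP threads are pinned (e.g. with GOMP_CPU_AFFINITY) so that the initialization and compute loops see the same thread-to-core mapping:

/* first_touch.c - minimal sketch of NUMA-aware data placement via first-touch:
 * on Linux a page is normally allocated on the NUMA node of the thread that
 * first writes it, so initializing the arrays with the same loop schedule that
 * later processes them spreads the data across the nodes instead of
 * concentrating it on one node.
 * Build: gcc -fopenmp -O2 first_touch.c -o first_touch
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;

    /* Parallel first touch: each thread writes its own chunk, so the pages
     * of that chunk land on the thread's local NUMA node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = (double)i;
    }

    /* The compute loop uses the same static schedule, so each thread mostly
     * touches memory on its own node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("a[N-1] = %f\n", a[N - 1]);
    free(a);
    free(b);
    return 0;
}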
4 COMPILERS
Check out https://wiki.numascale.com/tips for public tips from Numascale.
AMD has a good guide to general compiler optimization flags for the AMD Opteron 6300 series at http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/CompilerOptQuickRef-63004300.pdf.
AMD has done some work on optimizing compiler flags for well-known applications:
- http://www.amd.com/us/products/server/benchmarks/Pages/gromacs-cluster-perf-d-dppc-two-socket-servers.aspx
- http://www.amd.com/us/products/server/benchmarks/Pages/high-performance-linpack-cluster-perf.aspx
- http://www.amd.com/us/products/server/benchmarks/Pages/molecular-dynamics-namd-cluster-performance-two-socket-servers.aspx
More can be found at http://www.amd.com/us/products/server/benchmarks/
The most common compilers used with the Numascale architecture are:
- GCC 4.8.1
- PGI compiler 13.3
- Open64 compiler (x86_open64-4.5.2.1)
- Intel compiler (yes, this works for AMD systems as well)
As noted in section 3.2, Numascale has patched the gcc and open64 OpenMP implementations to make sure that the stack and Thread Local Storage (TLS) are local to the running threads, which favors these compilers for OpenMP code.
Please also check out:
http://developer.amd.com/wordpress/media/2012/10/51803A_OpteronLinuxTuningGuide_SCREEN.pdf
5 INSTALLING NUMASCALE BTL
Download the source at:
https://resources.numascale.com/software/openmpi-1.7.2.tar.gz
It might be helpful to download:
https://resources.numascale.com/software/openmpi-tools.tar.gz
Remember to install numactl and numactl-devel using yum first.
Here are the installation steps on my system:
tar -zxvf ../openmpi-1.7.2.tar.gz
cd openmpi-tools/
cd automake-1.12.2/
./configure ; make; sudo make install
cd ..
cd autoconf-2.69/
./configure ; make; sudo make install
cd ../
cd flex-2.5.35/
sudo yum install help2man
./configure ; make; sudo make install
cd ..
cd libtool-2.4.2/
./configure ; make; sudo make install
cd ../m4-1.4.16/
./configure ; make; sudo make install
cd ../../openmpi-1.7.2/
./autogen.sh
export CC=gcc F77=gfortran FC=gfortran LDFLAGS=-lnuma CFLAGS="-g -O3 -march=amdfam15"
./configure --disable-vt --prefix=/opt/openmpi-1.7.2-numaconnect
make
sudo make install
Start the application like this:
mpirun -n 32 -rf rank32 -mca btl self,nc bin/lu.C.32
6 IPMI
6.1 Power control
To control the power of an IPMI-based node, use the following commands:
% ipmitool -I lanplus -H <hostname>-ipmi -U <username> -P <password> power on
% ipmitool -I lanplus -H <hostname>-ipmi -U <username> -P <password> power off
% ipmitool -I lanplus -H <hostname>-ipmi -U <username> -P <password> power reset
6.2 Console access
To open a serial console session to an IPMI-based node, use the following command:
% ipmitool -I lanplus -H <hostname>-ipmi -U <username> -P <password> sol activate
<username> and <password> may be different from device to device.
You can also pipe the console log to a file this way:
% ipmitool -I lanplus -H <hostname>-ipmi -U <username> -P <password> sol activate 2>&1 | tee <filename> &
7 INSTALLING APPLICATIONS
7.1 Gromacs
Gromacs scales well when using our optimized NumaConnect BTL with OpenMPI:
Cores   t (s)      Wall t (s)   %           ns/day   Hour/ns
32      2894.03    90.38        3201.90     19.12    1.26
64      4431.17    69.19        6404.20     24.98    0.96
128     5569.21    43.50        12803.70    39.73    0.60
Make sure you have a suitable FFTW version installed (i.e. single-precision and AMD-optimised). Some information can be found at:
http://www.gromacs.org/Documentation/Installation_Instructions_for_5.0
http://www.gromacs.org/Documentation/Installation_Instructions_for_5.0#3.4._Fast_Fourier_Transform_library
http://www.gromacs.org/Documentation/Installation_Instructions_for_5.0#3.2.1._Running_in_parallel
Verify the installation by running “make check”.
You can easily check your (acceleration) setup with "mdrun -version".
Relevant mdrun options include "-pin on", "-dlb yes", "-tunepme" and "-nt [n]", where n is the number of threads and should equal the number of cores. If you specify 0, mdrun will guess, but it might not guess correctly.
Here are the steps used on the Numascale system:
Compile FFTW:
CC=/opt/gcc-4.8.2/bin/gcc CFLAGS="-lnuma" CXX=g++ FC=/opt/gcc-4.8.2/bin/gfortran F77=/opt/gcc-4.8.2/bin/gfortran FFLAGS="-lnuma" ./configure --enable-float --enable-threads --enable-sse2 --prefix=/home/atle/GROMACS/fftw_lnuma
Gromacs:
CMAKE_PREFIX_PATH=/home/atle/GROMACS/fftw_lnuma cmake .. -DGMX_MPI=ON -DCMAKE_CXX_COMPILER=/opt/openmpi-1.7.2-numaconnect/bin/mpicxx -DCMAKE_C_COMPILER=/opt/openmpi-1.7.2-numaconnect/bin/mpicc -DCMAKE_C_FLAGS="-lnuma" -DCMAKE_CXX_FLAGS="-lnuma" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/atle/GROMACS/MPI -DLIBXML2_LIBRARIES=/usr/lib64 -DGMX_PREFER_STATIC_LIBS=ON -DCMAKE_BUILD_TYPE=Release -DLIBXML2_INCLUDE_DIR=/usr/include/libxml2
Currently the MPI version works best on NumaConnect:
[atle@numademo mpi_numademo_128]$ mpirun -n 128 -rf rank128_lu -mca btl self,nc /home/atle/GROMACS/MPI_NUMADEMO/bin/mdrun_mpi -s Test-performance_protein-watermembrane.tpr