Performance Report PRIMERGY RX4770 M3
Transcription
Performance Report PRIMERGY RX4770 M3
White Paper Performance Report PRIMERGY RX4770 M3 White Paper FUJITSU Server PRIMERGY Performance Report PRIMERGY RX4770 M3 This document contains a summary of the benchmarks executed for the FUJITSU Server PRIMERGY RX4770 M3. The PRIMERGY RX4770 M3 performance data are compared with the data of other PRIMERGY models and discussed. In addition to the benchmark results, an explanation has been included for each benchmark and for the benchmark environment. Version 1.1 2016-09-08 http://ts.fujitsu.com/primergy Page 1 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Contents Document history ................................................................................................................................................ 2 Technical data .................................................................................................................................................... 3 SPECcpu2006 .................................................................................................................................................... 5 Disk I/O: Performance of RAID controllers ......................................................................................................... 9 SAP SD ............................................................................................................................................................. 16 OLTP-2 ............................................................................................................................................................. 19 TPC-E ............................................................................................................................................................... 23 vServCon .......................................................................................................................................................... 26 VMmark V2 ....................................................................................................................................................... 33 STREAM ........................................................................................................................................................... 38 LINPACK .......................................................................................................................................................... 40 Literature ........................................................................................................................................................... 43 Contact ............................................................................................................................................................. 44 Document history Version 1.0 (2016-07-21) New: Technical data SPECcpu2006 ® ® Measurements with Intel Xeon Processor E7 v4 Family Disk I/O: Performance of RAID controllers Measurements with “PRAID CP400i”, “PRAID EP400i” and “PRAID EP420i” controllers OLTP-2 ® ® Results for Intel Xeon Processor E7 v4 Family TPC-E Measurement with Xeon E7-8890 v4 VMmark V2 “Performance Only“ Measurements with Xeon E7-8890 v4 „Performance with Server Power“ Measurement with Xeon E7-8890 v4 STREAM ® ® Measurements with Intel Xeon Processor E7 v4 Family LINPACK ® ® Measurements with Intel Xeon Processor E7 v4 Family Version 1.1 (2016-09-08) New: SAP SD Certification number 2016036 vServCon ® ® Results for Intel Xeon Processor E7 v4 Family Page 2 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Technical data PRIMERGY RX4770 M3 Decimal prefixes according to the SI standard are used for measurement units in this white paper (e.g. 1 GB 9 30 = 10 bytes). In contrast, these prefixes should be interpreted as binary prefixes (e.g. 1 GB = 2 bytes) for the capacities of caches and memory modules. Separate reference will be made to any further exceptions where applicable. Model PRIMERGY RX4770 M3 Form factor Rack server Chipset Intel C602 Chipset Number of sockets 4 Number of processors orderable 2 or 4 Processor type Intel Xeon Processor E7 v4 Family Number of memory slots 96 Maximum memory configuration 6 TB Onboard LAN controller 2 × 10 Gbit/s 9 × PCI-Express 3.0 x8 2 × PCI-Express 3.0 x16 12 ® ® PCI slots Max. number of internal hard disks ® Cores Processor Threads Processors (since system release) Cache QPI Speed Rated Frequency Max. Turbo Frequency Max. Memory Frequency TDP [MB] [GT/s] [Ghz] [Ghz] [MHz] [Watt] Xeon E7-8893 v4 4 8 60 9.60 3.20 3.50 1866 140 Xeon E7-4809 v4 8 16 20 6.40 2.10 entf. 1866 115 Xeon E7-4820 v4 10 20 25 6.40 2.00 entf. 1866 115 Xeon E7-8891 v4 10 20 60 9.60 2.80 3.50 1866 165 Xeon E7-4830 v4 14 28 35 8.00 2.00 2.80 1866 115 Xeon E7-4850 v4 16 32 40 8.00 2.10 2.80 1866 115 Xeon E7-8860 v4 18 36 45 9.60 2.20 3.20 1866 140 Xeon E7-8867 v4 18 36 45 9.60 2.40 3.30 1866 165 Xeon E7-8870 v4 20 40 50 9.60 2.10 3.00 1866 140 Xeon E7-8880 v4 22 44 55 9.60 2.20 3.30 1866 150 Xeon E7-8890 v4 24 48 60 9.60 2.20 3.40 1866 165 All the processors that can be ordered with the PRIMERGY RX4770 M3, apart from Xeon E7-4809 v4 and ® Xeon E7-4820 v4, support Intel Turbo Boost Technology 2.0. This technology allows you to operate the processor with higher frequencies than the nominal frequency. Listed in the processor table is "Max. Turbo Frequency" for the theoretical frequency maximum with only one active core per processor. The maximum http://ts.fujitsu.com/primergy Page 3 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 frequency that can actually be achieved depends on the number of active cores, the current consumption, electrical power consumption and the temperature of the processor. As a matter of principle Intel does not guarantee that the maximum turbo frequency will be reached. This is related to manufacturing tolerances, which result in a variance regarding the performance of various examples of a processor model. The range of the variance covers the entire scope between the nominal frequency and the maximum turbo frequency. The turbo functionality can be set via BIOS option. Fujitsu generally recommends leaving the "Turbo Mode" option set at the standard setting "Enabled", as performance is substantially increased by the higher frequencies. However, since the higher frequencies depend on general conditions and are not always guaranteed, it can be advantageous to disable the "Turbo Mode" option for application scenarios with intensive use of AVX instructions and a high number of instructions per clock unit, as well as for those that require constant performance or lower electrical power consumption. Registered ECC 16 1 4 2400 32GB (2x16GB) 1Rx4 DDR4-2400 R ECC 32 1 4 2400 64GB (2x32GB) 2Rx4 DDR4-2400 R ECC 64 2 4 2400 128GB (2x64GB) 4Rx4 DDR4-2400 LR ECC 128 4 4 2400 Power supplies (since system release) Load reduced 16GB (2x8GB) 1Rx4 DDR4-2400 R ECC Memory module Low voltage Ranks Frequency [MHz] Capacity [GB] Bit width of the memory chips Memory modules (since system release) Max. number Power Supply Module 1200W w/o power cord 4 Power Supply Module 1600W w/o power cord 4 Some components may not be available in all countries or sales regions. Detailed technical information is available in the data sheet PRIMERGY RX4770 M3. Page 4 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 SPECcpu2006 Benchmark description SPECcpu2006 is a benchmark which measures the system efficiency with integer and floating-point operations. It consists of an integer test suite (SPECint2006) containing 12 applications and a floating-point test suite (SPECfp2006) containing 17 applications. Both test suites are extremely computing-intensive and concentrate on the CPU and the memory. Other components, such as Disk I/O and network, are not measured by this benchmark. SPECcpu2006 is not tied to a special operating system. The benchmark is available as source code and is compiled before the actual measurement. The used compiler version and their optimization settings also affect the measurement result. SPECcpu2006 contains two different performance measurement methods: the first method (SPECint2006 or SPECfp2006) determines the time which is required to process single task. The second method (SPECint_rate2006 or SPECfp_rate2006) determines the throughput, i.e. the number of tasks that can be handled in parallel. Both methods are also divided into two measurement runs, “base” and “peak” which differ in the use of compiler optimization. When publishing the results the base values are always used; the peak values are optional. Benchmark Arithmetics Type Compiler optimization SPECint2006 integer peak aggressive SPECint_base2006 integer base conservative SPECint_rate2006 integer peak aggressive SPECint_rate_base2006 integer base conservative SPECfp2006 floating point peak aggressive SPECfp_base2006 floating point base conservative SPECfp_rate2006 floating point peak aggressive SPECfp_rate_base2006 floating point base conservative Measurement result Application Speed single-threaded Throughput multi-threaded Speed single-threaded Throughput multi-threaded The measurement results are the geometric average from normalized ratio values which have been determined for individual benchmarks. The geometric average - in contrast to the arithmetic average - means that there is a weighting in favour of the lower individual results. Normalized means that the measurement is how fast is the test system compared to a reference system. Value “1” was defined for the SPECint_base2006-, SPECint_rate_base2006, SPECfp_base2006 and SPECfp_rate_base2006 results of the reference system. For example, a SPECint_base2006 value of 2 means that the measuring system has handled this benchmark twice as fast as the reference system. A SPECfp_rate_base2006 value of 4 means that the measuring system has handled this benchmark some 4/[# base copies] times faster than the reference system. “# base copies” specify how many parallel instances of the benchmark have been executed. Not every SPECcpu2006 measurement is submitted by us for publication at SPEC. This is why the SPEC web pages do not have every result. As we archive the log files for all measurements, we can prove the correct implementation of the measurements at any time. http://ts.fujitsu.com/primergy Page 5 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Benchmark environment System Under Test (SUT) Hardware Model PRIMERGY RX4770 M3 Processor 4 processors of Intel Xeon Processor E7 v4 Family Memory 16 × 32GB (2x16GB) 2Rx4 DDR4-2400 R ECC ® ® Software BIOS settings Energy Performance = Performance Xeon E7-4809 v4, Xeon E7-4820 v4, Xeon E7-4830 v4, Xeon E7-8891 v4, Xeon E7-8893 v4: COD Enable = Disabled Home Dir Snoop with IVT- Style OSB Enable = Enabled Alle others: COD Enable = Enabled Home Dir Snoop with IVT- Style OSB Enable = Disabled Operating system SUSE Linux Enterprise Server 12 SP1 (x86_64) Operating system settings echo always > /sys/kernel/mm/transparent_hugepage/enabled Compiler C/C++: Version 16.0.0.101 of Intel C++ Studio XE for Linux Fortran: Version 16.0.0.101 of Intel Fortran Studio XE for Linux Some components may not be available in all countries or sales regions. Page 6 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Benchmark results SPECfp_rate2006 827 844 1160 982 1000 1290 1150 1170 2140 2250 1680 1710 4 1950 2040 1550 1580 Xeon E7-4850 v4 4 2330 2440 1740 1780 Xeon E7-8860 v4 4 2870 2990 2080 2130 Xeon E7-8867 v4 4 2950 3080 2130 2180 Xeon E7-8870 v4 4 3070 3200 2170 2230 Xeon E7-8880 v4 4 3320 3450 2260 2320 Xeon E7-8890 v4 4 3570 3710 2360 2430 Xeon E7-8893 v4 4 944 1000 Xeon E7-4809 v4 4 1100 Xeon E7-4820 v4 4 1240 Xeon E7-8891 v4 4 Xeon E7-4830 v4 SPECint_rate2006 Processor Number of processors SPECfp_rate_base2006 SPECint_rate_base2006 In terms of processors the benchmark result depends primarily on the size of the processor cache, the support for Hyper-Threading, the number of processor cores and on the processor frequency. The number of cores, which are loaded by the benchmark, determines the maximum processor frequency that can be achieved. th On 6 June 2016 the PRIMERGY RX4770 M3 with four Xeon E7-8890 v4 processors was ranked first in the 4-socket systems category for the benchmark SPECint_rate_base2006. The current results can be found at http://www.spec.org/cpu2006/results. http://ts.fujitsu.com/primergy Page 7 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 The two diagrams below illustrate the throughput of the PRIMERGY RX4770 M3 in comparison to its predecessor PRIMERGY RX4770 M2, in their respective most performant configuration. SPECcpu2006: integer performance PRIMERGY RX4770 M3 vs. PRIMERGY RX4770 M2 3710 4000 3570 2780 3500 3000 2700 2500 2000 1500 1000 SPECint_rate2006 500 SPECint_rate_base2006 0 PRIMERGY RX4770 M2 PRIMERGY RX4770 M3 4 × Xeon E7-8890 v3 4 × Xeon E7-8890 v4 SPECcpu2006: floating-point performance PRIMERGY RX4770 M3 vs. PRIMERGY RX4770 M2 2430 2500 1980 2360 1930 2000 1500 1000 SPECfp_rate2006 500 SPECfp_rate_base2006 0 PRIMERGY RX4770 M2 PRIMERGY RX4770 M3 4 × Xeon E7-8890 v3 4 × Xeon E7-8890 v4 Page 8 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Disk I/O: Performance of RAID controllers Benchmark description Performance measurements of disk subsystems for PRIMERGY and PRIMEQUEST servers are used to assess their performance and enable a comparison of the different storage connections for these servers. As standard, these performance measurements are carried out with a defined measurement method, which models the accesses of real application scenarios on the basis of specifications. The essential specifications are: Share of random accesses / sequential accesses Share of read / write access types Block size (kB) Number of parallel accesses (# of outstanding I/Os) A given value combination of these specifications is known as “load profile”. The following five standard load profiles can be allocated to typical application scenarios: Standard load profile Access Type of access read write Block size [kB] Application File copy random 50% 50% 64 Copying of files File server random 67% 33% 64 File server Database random 67% 33% 8 Database (data transfer) Mail server Streaming sequential 100% 0% 64 Database (log file), Data backup; Video streaming (partial) Restore sequential 0% 100% 64 Restoring of files In order to model applications that access in parallel with a different load intensity the "# of Outstanding I/Os" is increased from 1 to 512 (in steps to the power of two). The measurements of this document are based on these standard load profiles. The main results of a measurement are: Throughput [MB/s] Transactions [IO/s] Latency [ms] Throughput in megabytes per second Transaction rate in I/O operations per second Average response time in ms The data throughput has established itself as the normal measurement variable for sequential load profiles, whereas the measurement variable “transaction rate” is mostly used for random load profiles with their small block sizes. Data throughput and transaction rate are directly proportional to each other and can be transferred to each other according to the formula Data throughput [MB/s] = Transaction rate [IO/s] × Block size [MB] Transaction rate [IO/s] = Data throughput [MB/s] / Block size [MB] 12 This section specifies capacities of storage media on a basis of 10 (1 TB = 10 bytes) while all other 20 capacities, file sizes, block sizes and throughputs are specified on a basis of 2 (1 MB/s = 2 bytes/s). All the details of the measurement method and the basics of disk I/O performance are described in the white paper “Basics of Disk I/O Performance”. http://ts.fujitsu.com/primergy Page 9 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Benchmark environment All the measurement results discussed in this chapter were determined using the hardware and software components listed below: System Under Test (SUT) Hardware Processor 4 × Xeon E7-8867 v4 @ 2.40GHz Controller 1 × “PRAID CP400i”, “PRAID EP400i”, “PRAID EP420i”: Driver name: megasas2.sys, Driver version: 6.706.06 Firmware package: 24.7.0-0061 Storage media SSDs HDDs Toshiba PX02SMF040 HGST HUC156045CSS204 Software BIOS settings Intel Virtualization Technology = Disabled VT-d = Disabled Energy Performance = Performance Utilization Profile = Unbalanced CPU C6 Report = Disabled Operating system Microsoft Windows Server 2012 R2 Standard Operating system settings Choose or customize a power plan: High performance For the processes that create disk I/Os: set the AFFINITY to the CPU node to which the PCIe slot of the RAID controller is connected Administration software ServerView RAID Manager 6.2.1 Benchmark version 3.0 Stripe size Controller default Measuring tool Iometer 1.1.0 Measurement area The first 10% of the usable LBA area is used for sequential accesses; the next 25% for random accesses. File system raw Total number of Iometer workers 1 Alignment of Iometer accesses Aligned to whole multiples of 4096 bytes Some components may not be available in all countries / sales regions. Page 10 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Benchmark results The results presented here are designed to help you choose the right solution from the various configuration options of the PRIMERGY RX4770 M3 in the light of disk-I/O performance. Various combinations of RAID controllers and storage media will be analyzed below. Information on the selection of storage media themselves is to be found in the section “Disk I/O: Performance of storage media”. Hard disks The hard disks are the first essential component. If there is a reference below to “hard disks”, this is meant as the generic term for HDDs (“hard disk drives”, in other words conventional hard disks) and SSDs (“solid state drives”, i.e. non-volatile electronic storage media). Mixed drive configurations of SAS and SATA hard disks in one system are permitted, unless they are excluded in the configurator for special hard disk types. More detailed performance statements about hard disk types are available in the section “Disk I/O: Performance of storage media” in this performance report. Model versions The maximum number of hard disks in the system depends on the system configuration. The following table lists the essential cases. Only the highest supported version is named for all the interfaces we have dealt with in this section. Form factor 2.5" Interface SATA 6G, SAS 12G http://ts.fujitsu.com/primergy Connection type Number of PCIe controllers Maximum number of hard disks direct 1 8 Page 11 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 RAID controller In addition to the hard disks the RAID controller is the second performance-determining key component. In the case of these controllers the “modular RAID” concept of the PRIMERGY servers offers a plethora of options to meet the various requirements of a wide range of different application scenarios. The following table summarizes the most important features of the available RAID controllers of the PRIMERGY RX4770 M3. A short alias is specified here for each controller, which is used in the subsequent list of the performance values. Controller name Alias Cache Supported interfaces In the system Max. # disks per controller FBU RAID levels PRAID CP400i PRAID CP400i - SATA 6G SAS 12G PCIe 3.0 x8 8 × 2.5" 0, 1, 1E, 5, 10, 50 - PRAID EP400i PRAID EP400i 1 GB SATA 6G SAS 12G PCIe 3.0 x8 8 × 2.5" 0, 1, 1E, 5, 6, 10, 50, 60 PRAID EP420i PRAID EP420i 2 GB SATA 6G SAS 12G PCIe 3.0 x8 8 × 2.5" 0, 1, 1E, 5, 6, 10, 50, 60 System-specific interfaces The interfaces of a controller in CPU direction (DMI or PCIe) and in the direction of hard disks (SAS or SATA) have in each case specific limits for data throughput. These limits are listed in the following table. The minimum of these two values is a definite limit, which cannot be exceeded. This value is highlighted in bold in the following table. Controller alias Effective in the configuration PRAID CP400i 8 × SAS 12G 8240 MB/s 8 × PCIe 3.0 6761 MB/s - PRAID EP400i 8 × SAS 12G 8240 MB/s 8 × PCIe 3.0 6761 MB/s - PRAID EP420i 8 × SAS 12G 8240 MB/s 8 × PCIe 3.0 6761 MB/s - # Disk-side data channels Limit for throughput of disk interface # CPU-side data channels Connection via expander Limit for throughput of CPU-side interface More details about the RAID controllers of the PRIMERGY systems are available in the white paper “RAID Controller Performance”. Page 12 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Settings In most cases, the cache of HDDs has a great influence on disk-I/O performance. It is frequently regarded as a security problem in case of power failure and is thus switched off. On the other hand, it was integrated by hard disk manufacturers for the good reason of increasing the write performance. For performance reasons it is therefore advisable to enable the hard disk cache. To prevent data loss in case of power failure you are recommended to equip the system with a UPS. In the case of controllers with a cache there are several parameters that can be set. The optimal settings can depend on the RAID level, the application scenario and the type of data medium. In the case of RAID levels 5 and 6 in particular (and the more complex RAID level combinations 50 and 60) it is obligatory to enable the controller cache for application scenarios with write share. If the controller cache is enabled, the data temporarily stored in the cache should be safeguarded against loss in case of power failure. Suitable accessories are available for this purpose (e.g. a FBU). For the purpose of easy and reliable handling of the settings for RAID controllers and hard disks it is advisable to use the software “ServerView RAID Manager” that is supplied for the server. All the cache settings for controllers and hard disks can usually be made en bloc – specifically for the application – by using the pre-defined modi “Performance”, “Data Protection” or “Fast Path optimum”. The “Performance” mode ensures the best possible performance settings for the majority of the application scenarios with HDDs. In connection with the “FastPath” RAID controller option, the “Fast Path optimum” mode should be selected if maximum transaction rates are to be achieved with SSDs for random accesses with small blocks (≤ 8 kB, e. g. OLTP operation of databases). More information about the setting options of the controller cache is available in the white paper “RAID Controller Performance”. Performance values In general, disk-I/O performance of a logical drive depends on the type and number of hard disks, on the RAID level and on the RAID controller. If the limits of the system-specific interfaces are not exceeded, the statements on disk-I/O performance are therefore valid for all PRIMERGY systems. This is why all the performance statements of the document “RAID Controller Performance” also apply for the PRIMERGY RX4770 M3 if the configurations measured there are also supported by this system. The performance values of the PRIMERGY RX4770 M3 are listed in table form below, specifically for different RAID levels, access types and block sizes. Substantially different configuration versions are dealt with separately. The established measurement variables, as already mentioned in the subsection Benchmark description, are used here. Thus, transaction rate is specified for random accesses and data throughput for sequential accesses. To avoid any confusion among the measurement units the tables have been separated for the two access types. The table cells contain the maximum achievable values. This has three implications: On the one hand hard disks with optimal performance were used (the components used are described in more detail in the subsection Benchmark environment). Furthermore, cache settings of controllers and hard disks, which are optimal for the respective access scenario and the RAID level, are used as a basis. And ultimately each value is the maximum value for the entire load intensity range (# of outstanding I/Os). In order to also visualize the numerical values each table cell is highlighted with a horizontal bar, the length of which is proportional to the numerical value in the table cell. All bars shown in the same scale of length have the same color. In other words, a visual comparison only makes sense for table cells with the same colored bars. Since the horizontal bars in the table cells depict the maximum achievable performance values, they are shown by the color getting lighter as you move from left to right. The light shade of color at the right end of the bar tells you that the value is a maximum value and can only be achieved under optimal prerequisites. The darker the shade becomes as you move to the left, the more frequently it will be possible to achieve the corresponding value in practice. http://ts.fujitsu.com/primergy Page 13 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 2.5" - Random accesses (maximum performance values in IO/s): PRAID CP400i HUC156045CSS204 SAS HDD PX02SMF040 SAS SSD PRAID EP400i HUC156045CSS204 SAS HDD PX02SMF040 SAS SSD PRAID EP420i HUC156045CSS204 SAS HDD PX02SMF040 SAS SSD Page 14 (44) SSDs random 64 kB blocks 67% read [IO/s] SSDs random 8 kB blocks 67% read [IO/s] HDDs random 64 kB blocks 67% read [IO/s] RAID level #Disks Hard disk type RAID Controller Configuration version HDDs random 8 kB blocks 67% read [IO/s] PRIMERGY RX4770 M3 2 1 1153 971 65174 9478 8 8 8 10 0 5 4294 4784 2435 2229 2531 1382 117403 164298 28732 26336 38806 17952 2 8 1 10 1590 4623 888 3265 62721 183026 9386 28011 8 8 0 5 5291 3008 3784 2097 237818 132953 45574 16056 2 8 8 8 1 10 0 5 1544 4616 5230 2970 994 3213 3729 2039 63189 194669 248073 133538 9993 25960 44424 16243 http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 2.5" - Sequential accesses (maximum performance values in MB/s): PRAID CP400i HUC156045CSS204 SAS HDD PX02SMF040 SAS SSD PRAID EP400i HUC156045CSS204 SAS HDD PX02SMF040 SAS SSD PRAID EP420i HUC156045CSS204 SAS HDD PX02SMF040 SAS SSD SSDs sequential 64 kB blocks 100% write [MB/s] SSDs sequential 64 kB blocks 100% read [MB/s] HDDs sequential 64 kB blocks 100% write [MB/s] RAID level #Disks Hard disk type RAID Controller Configuration version HDDs sequential 64 kB blocks 100% read [MB/s] PRIMERGY RX4770 M3 2 1 327 230 1910 422 8 8 8 10 0 5 1051 1795 1579 892 1779 1558 5873 5851 5840 1640 3297 1760 2 8 1 10 350 1148 232 945 1877 5825 406 1486 8 8 0 5 1874 1648 1892 1658 5817 5871 3081 2661 2 8 8 8 1 10 0 5 375 1174 1871 1648 232 942 1892 1650 1887 5795 5819 5869 400 1665 3077 2569 Conclusion At full configuration with powerful hard disks the PRIMERGY RX4770 M3 achieves a throughput of up to 5873 MB/s for sequential load profiles and a transaction rate of up to 248073 IO/s for typical, random application scenarios. To operate SSDs within the maximum performance range the PRAID CP400i is already suited for the simpler RAID levels 0, 1 and 10, and a PRAID controller with cache is to be preferred for RAID 5. In the event of HDDs the controller cache for random load profiles with a significant write share has performance advantages for all RAID levels. http://ts.fujitsu.com/primergy Page 15 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 SAP SD Benchmark description The SAP application software consists of modules used to manage all standard business processes. These include modules for ERP (Enterprise Resource Planning), such as Assemble-to-Order (ATO), Financial Accounting (FI), Human Resources (HR), Materials Management (MM), Production Planning (PP) plus Sales and Distribution (SD), as well as modules for SCM (Supply Chain Management), Retail, Banking, Utilities, BI (Business Intelligence), CRM (Customer Relation Management) or PLM (Product Lifecycle Management). The application software is always based on a database so that a SAP configuration consists of the hardware, the software components operating system, zhe database and the SAP software itself. SAP AG has developed SAP Standard Application Benchmarks in order to verify the performance, stability and scaling of a SAP application system. The benchmarks, of which SD Benchmark is the most commonly used and most important, analyze the performance of the entire system and thus measure the quality of the integrated individual components. The benchmark differentiates between a 2-tier and a 3-tier configuration. The 2-tier configuration has the SAP application and database installed on one server. With a 3-tier configuration the individual components of the SAP application can be distributed via several servers and an additional server handles the database. The entire specification of the benchmark developed by SAP AG, Walldorf, Germany can be found at: http://www.sap.com/benchmark. Benchmark environment The measurement set-up is symbolically illustrated below: 2-tier environment Server Disk subsystem Network Benchmark driver Page 16 (44) System Under Test (SUT) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 System Under Test (SUT) Hardware Model PRIMERGY RX4770 M3 Processor 4 × Xeon E7-8890 v4 Memory 32 × 32GB (2x16GB) 2Rx4 DDR4-2400 R ECC Network interface 1Gbit/s LAN Disk subsystem PRIMERGY RX4770 M3: 4 × HD SAS 12G 300GB 15K HOT PL 2.5'' EP 1 × PRAID EP420i 1 × PRAID CP400e FH 2 × Eternus JX40 Software BIOS settings COD Enable = Enabled Operating system Microsoft Windows Server 2012 R2 Standard Edition Database Microsoft SQL Server 2012 (64-bit) SAP Business Suite Software SAP enhancement package 5 for SAP ERP 6.0 Benchmark driver Hardware Model PRIMERGY RX300 S4 Processor 2 × Xeon X5460 Memory 32 GB Network interface 1Gbit/s LAN Software Operating system SUSE Linux Enterprise Server 11 SP1 Some components may not be available in all countries or sales regions. Benchmark results Certification number 2016036 Number of SAP SD benchmark users 39,500 Average dialog response time 0.98 seconds Throughput Fully processed order line items/hour Dialog steps/hour SAPS 4,315,330 12,946,000 215,770 Average database request time (dialog/update) 0.010 sec / 0.030 sec CPU utilization of central server 98% Operating system, central server Windows Server 2012 R2 Standard Edition RDBMS SQL Server 2012 SAP Business Suite software SAP enhancement package 5 for SAP ERP 6.0 Configuration Central Server Fujitsu PRIMERGY RX4770 M3 4 processors / 96 cores / 192 threads Intel Xeon Processor E7-8890 v4, 2.20 GHz, 64 KB L1 cache and 256 KB L2 cache per core, 60 MB L3 cache per processor 1024 GB main memory http://ts.fujitsu.com/primergy Page 17 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 The following chart shows a comparison of two-tier SAP SD Standard Application Benchmark results for 4-way Xeon E7 v4 based servers with Windows OS and SQL Server database (as of July 19, 2016). The PRIMERGY RX4770 M3 outperforms the comparably configured servers from HP. The latest SAP SD 2-tier results can be found at http://www.sap.com/solutions/benchmark/sd2tier.epx. 4-way Xeon E7 v4 based Two-Tier SAP SD results with Windows OS and SQL Server RDBMS Fujitsu PRIMERGY RX4770 M3 4 x Xeon E7-8890 v4 4 processors/96 cores/192 threads Windows Server 2012 R2/ SQL Server 2012 SAP enhancement package 5 for SAP ERP 6.0 Certification number: 2016036 39500 HPE ProLiant DL580 Gen9 4 x Xeon E7-8890 v4 4 processors/96 cores/192 threads Windows Server 2012 R2/ SQL Server 2012 SAP enhancement package 5 for SAP ERP 6.0 Certification number: 2016029 38800 0 5000 10000 15000 20000 25000 30000 35000 40000 Number of Benchmark Users The following diagram illustrates the throughput of the PRIMERGY RX4770 M3 in comparison to its predecessor, the PRIMERGY RX4770 M2, in the respective most performant configuration. Two-Tier SAP SD results: PRIMERGY RX4770 M3 vs. predecessor Fujitsu PRIMERGY RX4770 M3 4 x Xeon E7-8890 v4 4 processors/96 cores/192 threads Windows Server 2012 R2/ SQL Server 2012 SAP enhancement package 5 for SAP ERP 6.0 Certification number: 2016036 39500 Fujitsu PRIMERGY RX4770 M2 4 x Xeon E7-8890 v3 4 processors/72 cores/144 threads Windows Server 2012 R2/ SQL Server 2012 SAP enhancement package 5 for SAP ERP 6.0 Certification number: 2015031 29750 0 5000 10000 15000 20000 25000 30000 35000 40000 Number of Benchmark Users Page 18 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 OLTP-2 Benchmark description OLTP stands for Online Transaction Processing. The OLTP-2 benchmark is based on the typical application scenario of a database solution. In OLTP-2 database access is simulated and the number of transactions achieved per second (tps) determined as the unit of measurement for the system. In contrast to benchmarks such as SPECint and TPC-E, which were standardized by independent bodies and for which adherence to the respective rules and regulations are monitored, OLTP-2 is an internal benchmark of Fujitsu. OLTP-2 is based on the well-known database benchmark TPC-E. OLTP-2 was designed in such a way that a wide range of configurations can be measured to present the scaling of a system with regard to the CPU and memory configuration. Even if the two benchmarks OLTP-2 and TPC-E simulate similar application scenarios using the same load profiles, the results cannot be compared or even treated as equal, as the two benchmarks use different methods to simulate user load. OLTP-2 values are typically similar to TPC-E values. A direct comparison, or even referring to the OLTP-2 result as TPC-E, is not permitted, especially because there is no priceperformance calculation. Further information can be found in the document Benchmark Overview OLTP-2. Benchmark environment The measurement set-up is symbolically illustrated below: Tier A Driver Network Tier B Network Application Server Database Server Disk subsystem System Under Test (SUT) Clients Database Server (Tier B) Hardware Model PRIMERGY RX4770 M3 Processor Intel Xeon Processor E7 v4 Family Memory 2048 GB: 32 × 64GB (2x32GB) 2Rx4 DDR4-2400 R ECC 1024 GB: 16 × 64GB (2x32GB) 2Rx4 DDR4-2400 R ECC Network interface 2 × onboard LAN 10 Gb/s Disk subsystem RX4770 M3: ® ® 7 × JX40 S2: http://ts.fujitsu.com/primergy PRAID EP400i 2 × 300 GB 15k rpm SAS Drives, RAID1 (OS), 6 × 600 GB 15k rpm SAS Drives, RAID10 (LOG) 7 × PRAID EP420e Je 24 × 400 GB SSD Drive, RAID5 (data) Page 19 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Software BIOS Version R1.0.0 Operating system Microsoft Windows Server 2012 R2 Standard Database Microsoft SQL Server 2016 Enterprise Application Server (Tier A) Hardware Model 1 × PRIMERGY RX2530 M1 Processor 2 × Xeon E5-2697 v3 Memory 64 GB, 2133 MHz registered ECC DDR4 Network interface 2 × onboard LAN 10 Gb/s 1 × Dual Port LAN 1Gb/s Disk subsystem 2 × 300 GB 15k rpm SAS Drive Software Operating system Microsoft Windows Server 2012 R2 Standard Client Hardware Model 1 × PRIMERGY RX300 S7 Processor 2 × Xeon E5-2667 v2 Memory 64 GB, 1600 MHz registered ECC DDR3 Network interface 2 × onboard LAN 1 Gb/s 1 × Dual Port LAN 1Gb/s Disk subsystem 1 × 300 GB 15k rpm SAS Drive Software Operating system Microsoft Windows Server 2012 R2 Standard Benchmark OLTP-2 Software EGen version 1.14.0 Some components may not be available in all countries / sales regions. Page 20 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Benchmark results Database performance greatly depends on the configuration options with CPU, memory and on the connectivity of an adequate disk subsystem for the database. In the following scaling considerations for the processors we assume that both the memory and the disk subsystem has been adequately chosen and is not a bottleneck. A guideline in the database environment for selecting main memory is that sufficient quantity is more important than the speed of the memory accesses. This why a configuration with a total memory of 2048 GB was considered for the measurements with four processors and a configuration with a total memory of 1024 GB for the measurements with two processors. The OLTP-2 scaling measurements presented here were all performed with a memory access speed – depending on the processor type – of at most 1600 MHz. Further information about memory performance can be found in the White Paper Memory performance of Xeon E7 v4 (Broadwell-EX)-based systems. The following diagram shows the OLTP-2 transaction rates that can be achieved with two and four ® ® processors of the Intel Xeon Processor E7 v4 Product Family. OLTP-2 tps 8761.05 Xeon E7-8890 v4 24 Core, HT 4867.25 8095.29 Xeon E7-8880 v4 22 Core, HT 4497.38 7396.84 Xeon E7-8870 v4 20 Core, HT 4109.36 7172.49 Xeon E7-8867 v4 18 Core, HT 3984.72 6977.98 Xeon E7-8860 v4 18 Core, HT 3876.66 5724.39 Xeon E7-4850 v4 16 Core, HT 3180.22 4965.52 Xeon E7-4830 v4 14 Core, HT 2758.62 4989.37 Xeon E7-8891 v4 10 Core, HT 2771.87 3447.78 Xeon E7-4820 v4 10 Core, HT 1915.43 2864.43 Xeon E7-4809 v4 8 Core, HT 1591.35 4CPUs 2048GB 2223.35 Xeon E7-8893 v4 4 Core, HT 2CPUs 1024GB 1235.19 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 tps HT: Hyper-Threading http://ts.fujitsu.com/primergy bold: measured cursive: calculated Page 21 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 It is evident that a wide performance range is covered by the variety of released processors. If you compare the OLTP-2 value of the processor with the lowest performance (Xeon E7-8893 v4) with the value of the processor with the highest performance (Xeon E7-8890 v4), the result is a 3.9-fold increase in performance. Based on the results achieved the processors can be divided into different performance groups: The start is made with Xeon E7-8893 v4 and E7-4809 v4 as processors with four or eight cores. The Xeon E7-4820 v4 processor with its ten cores has the next best performance to offer. Due to its high clock frequency and the high QPI speed of 9.60 GT/s a higher throughput rate is achieved with the performanceoptimized 10-core processor Xeon E7-8891 v4. The groups of 14-, 16-, 18 and 20-core processors offer in this processor series a medium-range OLTP-2 performance. Due to the various technical features of the processors in these groups (see “Technical data”) it is possible to choose the right CPU depending on the usage scenario. The processors with 22 or 24 cores are to be found at the upper end of the performance scale. An OLTP performance of between 8170.13 tps (4 × Xeon E7-8880 v4) and 8761.05 tps (4 × Xeon E7-8890 v4) is achieved. If you compare the maximum achievable OLTP-2 values of the current system generation with the values that were achieved on the predecessor systems, the result is an increase of about 27%. tps Maximum OLTP-2 tps Comparison of system generations 10000 9000 + ~27% 8000 7000 6000 5000 4000 3000 4 × E7-8890 v3 2048 GB 4 × E7-8890 v4 2048 GB 2000 SQL 2014 SQL 2016 1000 0 PRIMERGY RX4770 M2 Page 22 (44) PRIMERGY RX4770 M3 http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 TPC-E Benchmark description The TPC-E benchmark measures the performance of online transaction processing systems (OLTP) and is based on a complex database and a number of different transaction types that are carried out on it. TPC-E is not only a hardware-independent but also a software-independent benchmark and can thus be run on every test platform, i.e. proprietary or open. In addition to the results of the measurement, all the details of the systems measured and the measuring method must also be explained in a measurement report (Full Disclosure Report or FDR). Consequently, this ensures that the measurement meets all benchmark requirements and is reproducible. TPC-E does not just measure an individual server, but a rather extensive system configuration. Keys to performance in this respect are the database server, disk I/O and network communication. The performance metric is tpsE, where tps means transactions per second. tpsE is the average number of Trade-Result-Transactions that are performed within a second. The TPC-E standard defines a result as the tpsE rate, the price per performance value (e.g. $/tpsE) and the availability date of the measured configuration. Further information about TPC-E can be found in the overview document Benchmark Overview TPC-E. Benchmark results In July 2016 Fujitsu submitted a TPC-E benchmark result for the PRIMERGY RX4770 M3 with the 24-core processor Intel Xeon E7-8890 v4 and 2048 GB memory. The results show an enormous increase in performance compared with the PRIMERGY RX4770 M2 with a simultaneous reduction in costs. http://ts.fujitsu.com/primergy Page 23 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 FUJITSU Server PRIMERGY RX4770 M3 TPC-E Throughput 8,796.47 tpsE Price/Performance $ 116.62 USD per tpsE Availability Date July 31, 2016 TPC-E 1.14.0 TPC Pricing 1.7.0 Report Date July 12, 2016 Total System Cost $ 1,025,815 USD Database Server Configuration Operating System Microsoft Windows Server 2012 R2 Standard Edition Database Manager Microsoft SQL Server 2016 Enterprise Edition SUT Processors/Cores/Threads 4/96/192 Memory 2048 GB Tier A PRIMERGY RX2530 M1 2x Intel Xeon E5-2697 v3 2.60 GHz 64 GB Memory 2x 300 GB 15k rpm SAS Drive 2x onboard LAN 10 Gb/s 1x Dual Port LAN 1 Gb/s 1x SAS RAID controller Tier B PRIMERGY RX4770 M3 4x Intel Xeon E7-8890 v4 2.20 GHz 2048 GB Memory 2x 300 GB 15k rpm SAS Drives 6x 600 GB 15k rpm SAS Drives 2x onboard LAN 10 Gb/s 8x SAS RAID Controller Storage 1x PRIMECENTER Rack 7x ETERNUS JX40 168x 400 GB SSD Drives Initial Database Size 36,951 GB Redundancy Level 1 RAID-5 data and RAID-10 log Storage 168 x 400 GB SSD 6 x 600 GB 15k rpm HDD Some components may not be available in all countries / sales regions. More details about this TPC-E result, in particular the Full Disclosure Report, can be found via the TPC web page http://www.tpc.org/tpce/results/tpce_result_detail.asp?id=116071201. Page 24 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 In July 2016, Fujitsu is represented with seven results in the TPC-E list (without historical results). System and Processors Price / Performance Throughput Availability Date PRIMERGY RX300 S8 with 2 × Xeon E5-2697 v2 2472.58 tpsE $135.14 per tpsE September 10, 2013 PRIMEQUEST 2800E with 2 × Xeon E7-8890 v2 8582.52 tpsE $205.43 per tpsE Mai 1, 2014 PRIMERGY RX2540 M1 with 2 × Xeon E5-2699 v3 3772.08 tpsE $130.44 per tpsE December 1, 2014 PRIMERGY RX4770 M2 with 4 × Xeon E7-8890 v3 6904.53 tpsE $126.49 per tpsE June 1, 2015 PRIMEQUEST 2800E2 with 8 × Xeon E7-8890 v3 10058.28 tpsE PRIMERGY RX2540 M2 with 2 × Xeon E5-2699 v4 4734.87 tpsE $111.65 per tpsE July 31, 2016 PRIMERGY RX4770 M3 with 4 × Xeon E7-8890 v4 8796.47 tpsE $116.62 per tpsE July 31, 2016 $187.53 per tpsE November 11, 2015 See the TPC web site for more information and all the TPC-E results (including historical results) (http://www.tpc.org/tpce). The following overview, sorted according to price/performance, shows the best TPC-E price per th performance ratios (as of July 12 , 2016, without historical results) and the corresponding TPCE throughputs. PRIMERGY RX4770 M3 with a price per performance ratio of $116.62/tpsE has the best price/performance value of all 4-socket systems. System Fujitsu PRIMERGY RX2540 M2 Fujitsu PRIMERGY RX4770 M3 Lenovo System x3650 M5 Fujitsu PRIMERGY RX4770 M2 Fujitsu PRIMERGY RX2540 M1 Fujitsu PRIMERGY RX300 S8 Lenovo System x3850 X6 Lenovo System x3950 X6 IBM System x3650 M4 Fujitsu PRIMEQUEST 2800E2 Processor type processors/ cores/threads 2 × Intel Xeon E5-2699 v4 4 × Intel Xeon E7-8890 v4 2 × Intel Xeon E5-2699 v4 4 × Intel Xeon E7-8890 v3 2 × Intel Xeon E5-2699 v3 2 × Intel Xeon E5-2697 v2 4 × Intel Xeon E7-8890 v4 8 × Intel Xeon E7-8890 v3 2 × Intel Xeon E5-2697 v2 8 × Intel Xeon E7-8890 v3 tpsE (higher is better) $/tpsE (low er is better) availability date 4,734.87 111.65 2016-07-31 8,796.47 116.62 2016-07-31 4,938.14 117.91 2016-07-31 6,904.53 126.94 2015-01-06 3,772.08 130.44 2014-12-01 2,472.58 135.14 2013-09-10 9,068.00 139.85 2016-07-31 11,058.99 143.91 2015-12-17 2,591.00 150.00 2013-09-10 10,058.28 187.53 2015-11-11 See the TPC web site for more information and all the TPC-E results (including historical results) (http://www.tpc.org/tpce). http://ts.fujitsu.com/primergy Page 25 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 vServCon Benchmark description vServCon is a benchmark used by Fujitsu to compare server configurations with hypervisor with regard to their suitability for server consolidation. This allows both the comparison of systems, processors and I/O technologies as well as the comparison of hypervisors, virtualization forms and additional drivers for virtual machines. vServCon is not a new benchmark in the true sense of the word. It is more a framework that combines already established benchmarks (or in modified form) as workloads in order to reproduce the load of a consolidated and virtualized server environment. Three proven benchmarks are used which cover the application scenarios database, application server and web server. Application scenario Benchmark No. of logical CPU cores Memory Database Sysbench (adapted) 2 1.5 GB Java application server SPECjbb (adapted, with 50% - 60% load) 2 2 GB Web server WebBench 1 1.5 GB Each of the three application scenarios is allocated to a dedicated virtual machine (VM). Add to these a fourth machine, the so-called idle VM. These four VMs make up a “tile”. Depending on the performance capability of the underlying server hardware, you may as part of a measurement also have to start several identical tiles in parallel in order to achieve a maximum performance score. System Under Test Database VM Java VM Web VM … Database Java VM VM Database Java VM VM Database Java VM VM Web VM Web VM Web VM Idle VM Tile n … Idle VM Idle VM Idle VM Tile 3 Tile 2 Tile 1 Each of the three vServCon application scenarios provides a specific benchmark result in the form of application-specific transaction rates for the respective VM. In order to derive a normalized score, the individual benchmark results for one tile are put in relation to the respective results of a reference system. The resulting relative performance values are then suitably weighted and finally added up for all VMs and tiles. The outcome is a score for this tile number. Starting as a rule with one tile, this procedure is performed for an increasing number of tiles until no further significant increase in this vServCon score occurs. The final vServCon score is then the maximum of the vServCon scores for all tile numbers. This score thus reflects the maximum total throughput that can be achieved by running the mix defined in vServCon that consists of numerous VMs up to the possible full utilization of CPU resources. This is why the measurement environment for vServCon measurements is designed in such a way that only the CPU is the limiting factor and that no limitations occur as a result of other resources. The progression of the vServCon scores for the tile numbers provides useful information about the scaling behavior of the “System under Test”. Moreover, vServCon also documents the total CPU load of the host (VMs and all other CPU activities) and, if possible, electrical power consumption. A detailed description of vServCon is in the document: Benchmark Overview vServCon. Page 26 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Benchmark environment The measurement set-up is symbolically illustrated below: Framework controller Server Disk subsystem Multiple 1Gb or 10Gb networks System Under Test (SUT) Load generators System Under Test (SUT) Hardware Model PRIMERGY RX4770 M3 Processor Intel Xeon Processor E7 v4 Family Memory 1 TB: 32 × 32GB (2x16GB) 2Rx4 DDR4-2400 R ECC Network interface 1 × dual port 1GbE adapter 1 × dual port 10GbE server adapter Disk subsystem 1 × dual-channel FC-Controller Emulex LPe16002 LINUX/LIO based flash storage system ® ® Software Operating system VMware ESXi 6.0.0 U2 Build 3620759 Load generator (incl. Framework controller) Hardware (Shared) Enclosure PRIMERGY BX900 Hardware Model 18 × PRIMERGY BX920 S1 server blades Processor 2 × Xeon X5570 Memory 12 GB Network interface 3 × 1 Gbit/s LAN Software Operating system http://ts.fujitsu.com/primergy Microsoft Windows Server 2003 R2 Enterprise with Hyper-V Page 27 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Load generator VM (per tile 3 load generator VMs on various server blades) Hardware Processor 1 × logical CPU Memory 512 MB Network interface 2 × 1 Gbit/s LAN Software Operating system Microsoft Windows Server 2003 R2 Enterprise Edition Some components may not be available in all countries or sales regions. Benchmark results ® The quad-socket system PRIMERGY RX4770 M3 dealt with here are based on processors of the Intel ® Xeon Processor E7 v4 Family. The features of the processors are summarized in the section “Technical data”. The available processors of these systems with their results can be seen in the following table. Score #Tiles 4 Cores Hyper-Threading, Turbo-Mode E7-8893 v4 17.3 9 8 Cores Hyper-Threading E7-4809 v4 21.5 15 10 Cores Hyper-Threading E7-4820 v4 26.1 19 10 Cores Hyper-Threading, Turbo-Mode E7-8891 v4 40.9 22 14 Cores Hyper-Threading, Turbo-Mode E7-4830 v4 39.3 26 16 Cores Hyper-Threading, Turbo-Mode E7-4850 v4 47.4 30 E7-8860 v4 58.5 35 E7-8867 v4 59.9 35 20 Cores Hyper-Threading, Turbo-Mode E7-8870 v4 63.5 39 22 Cores Hyper-Threading, Turbo-Mode E7-8880 v4 69.5 42 24 Cores Hyper-Threading, Turbo-Mode E7-8890 v4 75.6 44 ® ® Intel Xeon Processor E7 v4 Family Processor 18 Cores Hyper-Threading, Turbo-Mode These PRIMERGY quad-socket systems are very suitable for application virtualization thanks to the progress made in processor technology. Compared with a system based on the previous processor generation an approximate 30% higher virtualization performance can be achieved (measured in vServCon score in their maximum configuration). The relatively large performance differences between the processors can be explained by their features. The values scale on the basis of the number of cores, the size of the L3 cache and the CPU clock frequency and as a result of the features of Hyper-Threading and turbo mode, which are available in most processor types. Furthermore, the data transfer rate between processors (“QPI Speed”) also determines performance. As a matter of principle, the memory access speed also influences performance. A guideline in the virtualization environment for selecting main memory is that sufficient quantity is more important than the speed of the memory accesses. More information about the topic "Memory Performance" can be found in the White Paper Memory performance of Xeon E7 v4 (Broadwell-EX)-based systems. Page 28 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 The first diagram compares the virtualization performance values that can be achieved with the processors reviewed here. ® ® Intel Xeon Processor E7 v4 Family E7-4809 v4 E7-4820 v4 E7-8891 v4 E7-4830 v4 30 35 35 39 42 E7-8880 v4 26 E7-8870 v4 22 E7-8867 v4 19 E7-8860 v4 15 E7-4850 v4 9 E7-8893 v4 #Tiles 44 80 60 50 40 30 20 10 E7-8890 v4 Final vServCon Score 70 0 70 ×2 60 50 30 20 10 75.6@44 tiles 40 37.9@22 tiles Final vServCon Score 80 2 × E7-8890 v4 4 × E7-8890 v4 Until now we have looked at the virtualization performance of a fully configured system. However, with a server with four sockets the question also arises as to how good performance scaling is from two to four processors. The better the scaling, the lower the overhead usually caused by the shared use of resources within a server. The scaling factor also depends on the application. If the server is used as a virtualization platform for server consolidation, the system scales with a factor of 2. When operated with four processors, the system thus achieves twice the performance as with two processors, as is illustrated in the diagram opposite using the processor version Xeon E7-8890 v4 as an example. 0 http://ts.fujitsu.com/primergy Page 29 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 The next diagram illustrates the virtualization performance for increasing numbers of VMs based on the Xeon E7-4830 v3 (14-Core) processor. E7-4830 v4 40 vServCon Score 30 20 2.41 5.00 7.33 9.92 12.3 14.6 17.0 19.0 21.5 23.7 25.4 27.5 29.0 30.5 31.1 32.3 33.2 34.2 35.0 36.2 37.2 37.3 38.3 38.5 39.2 39.3 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 #Tiles In addition to the increased number of physical cores, Hyper-Threading, which is supported by all Xeon E7 processors, is an additional reason for the high number of VMs that can be operated. As is known, a physical processor core is consequently divided into two logical cores so that the number of cores available for the hypervisor is doubled. This standard feature thus generally increases the virtualization performance of a system. The scaling curves for the number of tiles as seen in the previous diagram are specifically for systems with Hyper-Threading. 56 physical and thus 112 logical cores are available with the Xeon E7-4830 v4 processors; approximately four of them are used per tile (see Benchmark description). This means that a parallel use of the same physical cores by several VMs is avoided up to a maximum of about fourteen tiles. That is why the performance curve in this range scales almost ideal. For the quantities above the growth is flatter up to CPU full utilization. The previous diagram examined the total performance of all application VMs of a host. However, studying the performance from an individual application VM viewpoint is also interesting. This information is in the previous diagram. For example, the total optimum is reached in the above Xeon E7-4830 v4 situation with 78 application VMs (26 tiles, not including the idle VMs); the low load case is represented by three application VMs (one tile, not including the idle VM). Remember: the vServCon score for one tile is an average value across the three application scenarios in vServCon. This average performance of one tile drops when changing from the low load case to the total optimum of the vServCon score – from 2.41 to 39.3/26=1.50, i.e. to 59%. The individual types of application VMs can react very differently in the high load situation. It is thus clear that in a specific situation the performance requirements of an individual application must be balanced against the overall requirements regarding the numbers of VMs on a virtualization host. Page 30 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 The virtualization-relevant progress in processor technology since 2009 has an effect on the one hand on an individual VM and, on the other hand, on the possible maximum number of VMs up to CPU full utilization. The following comparison shows the proportions for both types of improvements. Five systems are compared: a system from 2009, a System Best vServCon Best vServCon system from 2010, a system from Performance Score Maximum Score 2011, a system from 2014, a Few VMs 1 Tile Performance max. system from 2015 and a current 2009 RX600 S4 X7460 1.69 X7460 6.66@ 6 tiles system with the best processors 2010 RX600 S5 X7542 2.07 X7560 17.8@18 tiles each (see table opposite) for few 2011 RX600 S6 E7-8837 2.24 E7-4870 23.9@24 tiles VMs and for highest maximum 2014 RX4770 M1 E7-4890 v2 2.76 E7-4890 v2 44.9@28 tiles performance. 2015 2016 RX4770 M2 RX4770 M3 E7-8893 v3 E7-8893 v4 3.04 3.18 E7-8890 v3 E7-8890 v4 58.2@34 tiles 75.6@44 tiles Virtualization relevant improvements 10 Few VMs (1 Tile) 9 8 vServCon Score 7 6 5 3 × 1.05 × 1.10 4 × 1.22 × 1.23 × 1.08 2 1 1.69 2.07 2.24 2008 X7460 2667 MHz 6C 2010 X7542 2667 MHz 6C 2011 E7-8837 2667 MHz 8C 2.76 3.04 3.18 2014 E7-4890 v2 2800 MHz 15C 2015 E7-8893 v3 3200 MHz 4C 2016 E7-8893 v4 3200 MHz 4C 0 Year Y CPU Freq. #Cores The performance for an individual VM in low-load situations has only slightly increased for the processors compared here with the highest clock frequency per core. We must explicitly point out that the increased virtualization performance as seen in the score cannot be completely deemed as an improvement for one individual VM. http://ts.fujitsu.com/primergy Page 31 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Performance increases in the virtualization environment since 2010 are mainly achieved by increases in the maximum number of VMs that can be operated. Virtualization relevant improvements 80 × 1.30 Score at optimum Tile count 70 × 1.30 vServCon Score 60 × 1.88 50 40 75.60 × 1.34 30 58.20 × 2.67 44.9 20 10 17.80 23.9 6.66 0 2008 X7460 2667 MHz 6C Page 32 (44) 2010 X7560 2267 MHz 8C 2011 E7-4870 2400 MHz 10C 2014 E7-4890 v2 2800 MHz 15C 2015 E7-8890 v3 2500 MHz 18C 2016 E7-8890 v4 2200 MHz 24C Year Y CPU Freq. #Cores http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 VMmark V2 Benchmark description VMmark V2 is a benchmark developed by VMware to compare server configurations with hypervisor solutions from VMware regarding their suitability for server consolidation. In addition to the software for load generation, the benchmark consists of a defined load profile and binding regulations. The benchmark results can be submitted to VMware and are published on their Internet site after a successful review process. After the discontinuation of the proven benchmark “VMmark V1” in October 2010, it has been succeeded by “VMmark V2”, which requires a cluster of at least two servers and covers data center functions, like Cloning and Deployment of virtual machines (VMs), Load Balancing, as well as the moving of VMs with vMotion and also Storage vMotion. In addition to the “Performance Only” result, it is also possible from version 2.5 of VMmark to alternatively measure the electrical power consumption and publish it as a “Performance with Server Power” result (power consumption of server systems only) and/or “Performance with Server and Storage Power” result (power consumption of server systems and all storage components). VMmark V2 is not a new benchmark in the actual sense. Application scenario Load tool # VMs It is in fact a framework that consolidates already LoadGen 1 established benchmarks, as workloads in order to Mail server Web 2.0 Olio client 2 simulate the load of a virtualized consolidated server environment. Three proven benchmarks, which cover E-commerce DVD Store 2 client 4 the application scenarios mail server, Web 2.0, and Standby server (IdleVMTest) 1 e-commerce were integrated in VMmark V2. Each of the three application scenarios is assigned to a total of seven dedicated virtual machines. Then add to these an eighth VM called the “standby server”. These eight VMs form a “tile”. Because of the performance capability of the underlying server hardware, it is usually necessary to have started several identical tiles in parallel as part of a measurement in order to achieve a maximum overall performance. A new feature of VMmark V2 is an infrastructure component, which is present once for every two hosts. It measures the efficiency levels of data center consolidation through VM Cloning and Deployment, vMotion and Storage vMotion. The Load Balancing capacity of the data center is also used (DRS, Distributed Resource Scheduler). The result of VMmark V2 for test type „Performance Only“ is a number, known as a “score”, which provides information about the performance of the measured virtualization solution. The score reflects the maximum total consolidation benefit of all VMs for a server configuration with hypervisor and is used as a comparison criterion of various hardware platforms. This score is determined from the individual results of the VMs and an infrastructure result. Each of the five VMmark V2 application or front-end VMs provides a specific benchmark result in the form of applicationspecific transaction rates for each VM. In order to derive a normalized score the individual benchmark results for one tile are put in relation to the respective results of a reference system. The resulting dimensionless performance values are then averaged geometrically and finally added up for all VMs. This value is included in the overall score with a weighting of 80%. The infrastructure workload is only present in the benchmark once for every two hosts; it determines 20% of the result. The number of transactions per hour and the average duration in seconds respectively are determined for the score of the infrastructure workload components. In addition to the actual score, the number of VMmark V2 tiles is always specified with each VMmark V2 score. The result is thus as follows: “Score@Number of Tiles”, for example “4.20@5 tiles”. In the case of the two test types “Performance with Server Power” and “Performance with Server and Storage Power” a so-called “Server PPKW Score” and “Server and Storage PPKW Score” is determined, which is the performance score divided by the average power consumption in kilowatts (PPKW = performance per kilowatt (KW)). The results of the three test types should not be compared with each other. A detailed description of VMmark V2 is available in the document Benchmark Overview VMmark V2. http://ts.fujitsu.com/primergy Page 33 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Benchmark environment The measurement set-up is symbolically illustrated below: Clients & Management Server(s) Storage System Multiple 1Gb or 10Gb networks vMotion network Load Generators incl. Prime Client and Datacenter Management Server System under Test (SUT) System Under Test (SUT) Hardware Number of servers 2 Model PRIMERGY RX4770 M3 Processor 4 × Xeon E7-8890 v4 Memory 1024 GB: Network interface Intel X540-AT2 Dual Port 10GbE Onboard Adapter 1 × Emulex OneConnect OCe14000 Dual 10GbE Port Adapter Disk subsystem Dual port PFC EP LPe16002 3 × PRIMERGY RX300 S8 configured as Fibre Channel target Details see disclosures 16 × 64GB (2x32GB) 2Rx4 DDR4-2400 R ECC Software BIOS Version V5.0.0.11 R1.0.0 BIOS settings See details Operating system VMware ESXi 6.0.0 U2 Build 3620759 Operating system settings ESXi settings: see details Datacenter Management Server (DMS) Hardware (Shared) Enclosure PRIMERGY BX600 Network Switch 1 × PRIMERGY BX600 GbE Switch Blade 30/12 Hardware Model 1 × server blade PRIMERGY BX620 S5 Processor 2 × Xeon X5570 Memory 24 GB Network interface 6 × 1 Gbit/s LAN Software Operating system Page 34 (44) VMware ESXi 5.1.0 Build 799733 http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Datacenter Management Server (DMS)VM Hardware Processor 4 × logical CPU Memory 10 GB Network interface 2 × 1 Gbit/s LAN Software Operating system Microsoft Windows Server 2008 R2 Enterprise x64 Edition Prime Client Hardware (Shared) Enclosure PRIMERGY BX600 Network Switch 1 × PRIMERGY BX600 GbE Switch Blade 30/12 Hardware Model 1 × server blade PRIMERGY BX620 S5 Processor 2 × Xeon X5570 Memory 12 GB Network interface 6 × 1 Gbit/s LAN Software Operating system Microsoft Windows Server 2008 Enterprise x64 Edition SP2 Load generator Hardware Model 3 × PRIMERGY RX600 S6 Processor 4 × Xeon E7-4870 Memory 512 GB Network interface 6 × 1 Gbit/s LAN Software Operating system VMware ESX 4.1.0 U2 Build 502767 Load generator VM (per tile 1 load generator VM) Hardware Processor 4 × logical CPU Memory 4 GB Network interface 1 × 1 Gbit/s LAN Software Operating system Microsoft Windows Server 2008 Enterprise x64 Edition SP2 Details See disclosure http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/vmmark/2016-06-21Fujitsu-RX4770M3.pdf http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/vmmark/2016-06-21Fujitsu-RX4770M3-serverPPKW.pdf Some components may not be available in all countries or sales regions. http://ts.fujitsu.com/primergy Page 35 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Benchmark results “Performance Only” measurement result On June 21, 2016 Fujitsu achieved with two PRIMERGY RX4770 M3 systems with Xeon E7-8890 v4 processors and VMware ESXi 6.0.0 U2 a VMmark V2 score of “61.32@52 tiles” in a system configuration with a total of 2 × 96 processor cores and when using two identical servers/partitions in the “System under Test” (SUT). With this result the PRIMERGY RX4770 M3 is in the official VMmark V2 “Performance Only” ranking the most powerful 4-socket server in a “matched pair” configuration consisting of two identical hosts (valid as of benchmark results publication date). st All comparisons for the competitor products reflect the status of 21 June 2016. The current VMmark V2 results as well as the detailed results and configuration data are available at http://www.vmware.com/products/vmmark/results.html. The diagram shows the result of the PRIMERGY RX4770 M3 in comparison with the best 4-socket systems in a “matched pair” configuration. 4-socket systems, “matched pair” 70 50 46.62@40 tiles 46.64@40 tiles 47.16@40 tiles 47.46@40 tiles 10 47.66@40 tiles 20 57.87@50 tiles 30 60.63@50 tiles 40 61.32@52 tiles VMmark V2 Score 60 0 2 × Fujitsu 2 × Huawei 2 × HPE 2 × Huawei 2 × Fujitsu 2 × Huawei 2 × HP 2 × Fujitsu PRIMERGY FusionServer ProLiant FusionServer PRIMERGY FusionServer ProLiant PRIMEQUEST RX4770 M3 RH5885H V3 DL580 Gen9 CH242 V3 RX4770 M2 RH5885H V3 DL580 Gen9 2800E2 2 × 4 × Xeon 2 × 4 × Xeon 2 × 4 × Xeon DDR4 2 × 4 × Xeon 2 × 4 × Xeon 2 × 4 × Xeon 2 × 4 × Xeon E7-8890 v4 E7-8890 v4 E7-8890 v4 2 × 4 × Xeon E7-8890 v3 E7-8890 v3 E7-8890 v3 E7-8890 v3 E7-8890 v3 The table opposite shows the difference 4-socket systems, “matched pair” VMmark V2 score Difference in the score (in %) between the Fujitsu Fujitsu PRIMERGY RX4770 M3 61.32@52 tiles system and the other 4-socket systems Huawei FusionServer RH5885H V3 60.63@50 tiles 1.14% (“matched pair”). 57.87@50 tiles 5.96% The processors used, which with a good HPE ProLiant DL580 Gen9 hypervisor setting could make optimal Huawei FusionServer CH242 V3 DDR4 47.66@40 tiles 28.66% use of their processor features, were the Fujitsu PRIMERGY RX4770 M2 47.46@40 tiles 29.20% essential prerequisites for achieving all 47.16@40 tiles 30.03% PRIMERGY RX4770 M3 result. These Huawei FusionServer RH5885H V3 46.64@40 tiles 31.48% features include Hyper-Threading. All this HP ProLiant DL580 Gen9 has a particularly positive effect during Fujitsu PRIMEQUEST 2800E2 46.62@40 tiles 31.53% virtualization. All VMs, their application data, the host operating system as well as additionally required data were on a powerful Fibre Channel disk subsystem. As far as possible, the configuration of the disk subsystem takes the specific requirements of the benchmark into account. The use of flash technology in the form of SAS SSDs and PCIe-SSDs in the powerful Fibre Channel disk subsystem resulted in further advantages in response times of the storage medium used. The network connection to the load generators was implemented via 10Gb LAN ports. The infrastructureworkload connection between the hosts was by means of 1Gb LAN ports. All the components used were optimally attuned to each other. Page 36 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 „Performance with Server Power“ measurement result On June 21, 2016 Fujitsu achieved with two PRIMERGY RX4770 M3 systems with Xeon E7-8890 v4 processors and VMware ESXi 6.0.0 U2 a VMmark V2 “Server PPKW Score” of “30.4559@52 tiles” in a system configuration with a total of 2 × 96 processor cores and when using two identical servers/partitions in the “System under Test” (SUT). With this result the PRIMERGY RX4770 M3 is in the official VMmark V2 “Performance with Server Power” ranking the most energy-efficient 4-socket server worldwide (valid as of benchmark results publication date). st All comparisons for the competitor products reflect the status of 21 June 2016. The current VMmark V2 “Performance with Server Power” results as well as the detailed results and configuration data are available at http://www.vmware.com/products/vmmark/results.1.html. The diagram shows all VMmark V2 “Performance with Server Power“ results. Performance with Server Power 40 35 30 2 × Fujitsu PRIMERGY RX4770 M3 2 × 4 × Xeon E7-8890 v4 2 × Fujitsu PRIMERGY RX2540 M1 2 × 2 × Xeon E5-2699 v3 2 × Fujitsu PRIMERGY RX2530 M1 2 × 2 × Xeon E5-2699 v3 17.6899@20 tiles 2 × Fujitsu PRIMERGY RX2540 M2 2 × 2 × Xeon E5-2699 v4 20.0410@40 tiles 5 22.8998@40 tiles 10 23.6493@22 tiles 15 25.2305@22 tiles 20 30.4559@52 tiles 25 38.3065@28 tiles VMmark V2 Server PPKW Score 45 0 http://ts.fujitsu.com/primergy 2 × Fujitsu 2 × Fujitsu 2 × HP PRIMERGY PRIMEQUEST ProLiant RX4770 M2 2800E2 DL380 Gen9 2 × 4 × Xeon 2 × 4 × Xeon 2 × 2 × Xeon E7-8890 v3 E7-8890 v3 E5-2699 v3 Page 37 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 STREAM Benchmark description STREAM is a synthetic benchmark that has been used for many years to determine memory throughput and which was developed by John McCalpin during his professorship at the University of Delaware. Today STREAM is supported at the University of Virginia, where the source code can be downloaded in either Fortran or C. STREAM continues to play an important role in the HPC environment in particular. It is for example an integral part of the HPC Challenge benchmark suite. The benchmark is designed in such a way that it can be used both on PCs and on server systems. The unit of measurement of the benchmark is GB/s, i.e. the number of gigabytes that can be read and written per second. STREAM measures the memory throughput for sequential accesses. These can generally be performed more efficiently than accesses that are randomly distributed on the memory, because the processor caches are used for sequential access. Before execution the source code is adapted to the environment to be measured. Therefore, the size of the data area must be at least 12 times larger than the total of all last-level processor caches so that these have as little influence as possible on the result. The OpenMP program library is used to enable selected parts of the program to be executed in parallel during the runtime of the benchmark, consequently achieving optimal load distribution to the available processor cores. During implementation the defined data area, consisting of 8-byte elements, is successively copied to four types, and arithmetic calculations are also performed to some extent. Type Execution Bytes per step Floating-point calculation per step COPY a(i) = b(i) 16 0 SCALE a(i) = q × b(i) 16 1 SUM a(i) = b(i) + c(i) 24 1 TRIAD a(i) = b(i) + q × c(i) 24 2 The throughput is output in GB/s for each type of calculation. The differences between the various values are usually only minor on modern systems. In general, only the determined TRIAD value is used as a comparison. The measured results primarily depend on the clock frequency of the memory modules; the processors influence the arithmetic calculations. 9 This chapter specifies throughputs on a basis of 10 (1 GB/s = 10 Byte/s). Page 38 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Benchmark environment System Under Test (SUT) Hardware Model PRIMERGY RX4770 M3 Processor 4 processors of Intel Xeon Processor E7 v4 Family Memory 16 × 32GB (2x16GB) 2Rx4 DDR4-2400 R ECC ® ® Software BIOS settings Energy Performance = Performance COD Enable = Disabled Home Dir Snoop with IVT- Style OSB Enable = Enabled HyperThreading = Disabled Operating system SUSE Linux Enterprise Server 12 SP1 (x86_64) Operating system settings Transparent Huge Pages inactivated Compiler Intel C++ Composer XE 2016 for Linux Benchmark STREAM Version 5.10 Some components may not be available in all countries or sales regions. Benchmark results Processor Memory Frequency [MHz] Max. Memory Bandwidth [GB/s] Cores Processor Frequency [GHz] Number of Processors TRIAD Xeon E7-8893 v4 1600 102 4 3.20 4 Xeon E7-4809 v4 1600 102 8 2.10 4 175 Xeon E7-4820 v4 1600 102 10 2.00 4 198 Xeon E7-8891 v4 1600 102 10 2.80 4 260 Xeon E7-4830 v4 1600 102 14 2.00 4 234 Xeon E7-4850 v4 1600 102 16 2.10 4 239 Xeon E7-8860 v4 1600 102 18 2.20 4 265 Xeon E7-8867 v4 1600 102 18 2.40 4 265 Xeon E7-8870 v4 1600 102 20 2.10 4 269 Xeon E7-8880 v4 1600 102 22 2.20 4 264 Xeon E7-8890 v4 1600 102 24 2.20 4 266 [GB/s] Further information about memory performance can be found in the White Paper Memory performance of Xeon E7 v4 (Broadwell-EX) based systems. http://ts.fujitsu.com/primergy Page 39 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 LINPACK Benchmark description LINPACK was developed in the 1970s by Jack Dongarra and some other people to show the performance of supercomputers. The benchmark consists of a collection of library functions for the analysis and solution of linear system of equations. A description can be found in the document http://www.netlib.org/utk/people/JackDongarra/PAPERS/hplpaper.pdf. LINPACK can be used to measure the speed of computers when solving a linear equation system. For this purpose, an n × n matrix is set up and filled with random numbers between -2 and +2. The calculation is then performed via LU decomposition with partial pivoting. A memory of 8n² bytes is required for the matrix. In case of an n × n matrix the number of arithmetic 2 3 2 operations required for the solution is /3n + 2n . Thus, the choice of n determines the duration of the measurement: a doubling of n results in an approximately eight-fold increase in the duration of the measurement. The size of n also has an influence on the measurement result itself: as n increases, the measured value asymptotically approaches a limit. The size of the matrix is therefore usually adapted to the amount of memory available. Furthermore, the memory bandwidth of the system only plays a minor role for the measurement result, but a role that cannot be fully ignored. The processor performance is the decisive factor for the measurement result. Since the algorithm used permits parallel processing, in particular the number of processors used and their processor cores are - in addition to the clock rate - of outstanding significance. LINPACK is used to measure how many floating point operations were carried out per second. The result is referred to as Rmax and specified in GFlops (Giga Floating Point Operations per Second). An upper limit, referred to as Rpeak, for the speed of a computer can be calculated from the maximum number of floating point operations that its processor cores could theoretically carry out in one clock cycle: Rpeak = Maximum number of floating point operations per clock cycle × Number of processor cores of the computer × Rated processor frequency[GHz] LINPACK is classed as one of the leading benchmarks in the field of high performance computing (HPC). LINPACK is one of the seven benchmarks currently included in the HPC Challenge benchmark suite, which takes other performance aspects in the HPC environment into account. Manufacturer-independent publication of LINPACK results is possible at http://www.top500.org/. The use of a LINPACK version based on HPL is prerequisite for this (see: http://www.netlib.org/benchmark/hpl/). Intel offers a highly optimized LINPACK version (shared memory version) for individual systems with Intel processors. Parallel processes communicate here via "shared memory", i.e. jointly used memory. Another version provided by Intel is based on HPL (High Performance Linpack). Intercommunication of the LINPACK processes here takes place via OpenMP and MPI (Message Passing Interface). This enables communication between the parallel processes - also from one computer to another. Both versions can be downloaded from http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download/. Manufacturer-specific LINPACK versions also come into play when graphics cards for General Purpose Computation on Graphics Processing Unit (GPGPU) are used. These are based on HPL and include extensions which are needed for communication with the graphics cards. Page 40 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Benchmark environment System Under Test (SUT) Hardware Model PRIMERGY RX4770 M3 Processor 4 processors of Intel Xeon Processor E7 v4 Family Memory 16 × 32GB (2x16GB) 2Rx4 DDR4-2400 R ECC ® ® Software BIOS settings Energy Performance = Performance COD Enable = Disabled Home Dir Snoop with IVT- Style OSB Enable = Enabled HyperThreading = Disabled All processors apart from Xeon E7-4809 v4: Turbo Mode = Enabled (default) = Disabled Operating system SUSE Linux Enterprise Server 12 SP1 (x86_64) Benchmark HPL version: Intel Optimized MP LINPACK Benchmark for Clusters 11.3 Some components may not be available in all countries or sales regions. http://ts.fujitsu.com/primergy Page 41 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 95% 1938 108% 1701 95% 1863 104% 4 2150 2027 94% 2225 103% 4 2534 2384 94% 2504 99% 94% 4 819 2.10 4 1075 Xeon E7-4820 v4 10 2.00 4 1280 Xeon E7-8891 v4 10 2.80 4 Xeon E7-4830 v4 14 2.00 4 Xeon E7-4850 v4 16 2.10 Xeon E7-8860 v4 18 2.20 Rpeak [GFlops] 3.20 8 Processor 4 Xeon E7-4809 v4 Efficiency 1704 1792 Xeon E7-8893 v4 Rmax (with Turbo Mode) [GFlops] Efficiency 1792 Number of processors 95% Rated Frequency [Ghz] 1020 Cores Rmax (without Turbo Mode) [GFlops] Benchmark results Xeon E7-8867 v4 18 2.40 4 2764 2600 2722 98% Xeon E7-8870 v4 20 2.10 4 2688 2533 94% 2677 100% Xeon E7-8880 v4 22 2.20 4 3097 2915 94% 3004 97% Xeon E7-8890 v4 24 2.20 4 3379 3176 94% 3338 99% Rmax = Measurement result Rpeak = Maximum number of floating point operations per clock cycle × Number of processor cores of the computer × Rated frequency [GHz] As explained in the section "Technical Data", Intel does not as a matter of principle guarantee that the maximum turbo frequency can be reached in the processor models due to manufacturing tolerances. A further restriction applies for workloads, such as those generated by LINPACK: with intensive use of AVX instructions and a high number of instructions per clock unit. Here the frequency of a core can also be limited if the upper limits of the processor for power consumption and temperature are reached before the upper limit for the current consumption. This can result in the achievement of a lower performance with turbo mode than without turbo mode. In such cases, you should disable the turbo functionality via BIOS option. Page 42 (44) http://ts.fujitsu.com/primergy White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 Literature PRIMERGY Servers http://primergy.com/ PRIMERGY RX4770 M3 This White Paper: http://docs.ts.fujitsu.com/dl.aspx?id=8b5a6911-0329-477b-a46b-8891da2b627a http://docs.ts.fujitsu.com/dl.aspx?id=1370ebd1-e19a-43c7-91e2-a9c142b3b0b7 http://docs.ts.fujitsu.com/dl.aspx?id=7984ec7d-3891-4a43-8769-71820068ec18 Data sheet http://docs.ts.fujitsu.com/dl.aspx?id=0769aca2-f7b3-4afe-8ad1-3e6a9a4d0ce6 PRIMERGY Performance http://www.fujitsu.com/fts/x86-server-benchmarks Performance of Server Components http://www.fujitsu.com/fts/products/computing/servers/mission-critical/benchmarks/x86components.html BIOS optimizations for Xeon E5 v4 & E7 v4 based systems http://docs.ts.fujitsu.com/dl.aspx?id=eb90c352-8d98-4f5a-9eed-b5aade5ccae1 Memory performance of Xeon E7 v4 (Broadwell-EX)-based systems http://docs.ts.fujitsu.com/dl.aspx?id=7bd26a0c-a46c-4717-be6d-78abebba56b2 RAID Controller Performance http://docs.ts.fujitsu.com/dl.aspx?id=9845be50-7d4f-4ef7-ac61-bbde399c1014 Disk I/O: Performance of storage media and RAID controllers Basics of Disk I/O Performance http://docs.ts.fujitsu.com/dl.aspx?id=65781a00-556f-4a98-90a7-7022feacc602 Information about Iometer http://www.iometer.org LINPACK The LINPACK Benchmark: Past, Present, and Future http://www.netlib.org/utk/people/JackDongarra/PAPERS/hplpaper.pdf TOP500 http://www.top500.org/ HPL - A Portable Implementation of the High-Performance Linpack Benchmark for DistributedMemory Computers http://www.netlib.org/benchmark/hpl/ Intel Math Kernel Library – LINPACK Download http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download/ OLTP-2 Benchmark Overview OLTP-2 http://docs.ts.fujitsu.com/dl.aspx?id=e6f7a4c9-aff6-4598-b199-836053214d3f SAP SD http://www.sap.com/benchmark Benchmark overview SAP SD http://docs.ts.fujitsu.com/dl.aspx?id=0a1e69a6-e366-4fd1-a1a6-0dd93148ea10 SPECcpu2006 http://www.spec.org/osg/cpu2006 Benchmark overview SPECcpu2006 http://docs.ts.fujitsu.com/dl.aspx?id=1a427c16-12bf-41b0-9ca3-4cc360ef14ce http://ts.fujitsu.com/primergy Page 43 (44) White Paper Performance Report PRIMERGY RX4770 M3 Version: 1.1 2016-09-08 STREAM http://www.cs.virginia.edu/stream/ TPC-E http://www.tpc.org/tpce Benchmark Overview TPC-E http://docs.ts.fujitsu.com/dl.aspx?id=da0ce7b7-3d80-48cd-9b3a-d12e0b40ed6d VMmark V2 Benchmark Overview VMmark V2 http://docs.ts.fujitsu.com/dl.aspx?id=2b61a08f-52f4-4067-bbbf-dc0b58bee1bd VMmark V2 http://www.vmmark.com vServCon Benchmark Overview vServCon http://docs.ts.fujitsu.com/dl.aspx?id=b953d1f3-6f98-4b93-95f5-8c8ba3db4e59 Contact FUJITSU Website: http://www.fujitsu.com/ PRIMERGY Product Marketing mailto:Primergy-PM@ts.fujitsu.com PRIMERGY Performance and Benchmarks mailto:primergy.benchmark@ts.fujitsu.com © Copyright 2016 Fujitsu Technology Solutions. Fujitsu and the Fujitsu logo are trademarks or registered trademarks of Fujitsu Limited in Japan and other countries. Other company, product and service names may be trademarks or registered trademarks of their respective owners. Technical data subject to modification and delivery subject to availability. Any liability that the data and illustrations are complete, actual or correct is excluded. Designations may be trademarks and/or copyrights of the respective manufacturer, the use of which by third parties for their own purposes may infringe the rights of such owner. For further information see http://www.fujitsu.com/fts/resources/navigation/terms-of-use.html 2016-09-08 WW EN Page 44 (44) http://ts.fujitsu.com/primergy