Post-Dennard Scaling and the Final Years of Moore's Law
Consequences for the Evolution of Multicore Architectures

Prof. Dr.-Ing. Christian Märtin
Christian.Maertin@hs-augsburg.de

Technical Report
Hochschule Augsburg University of Applied Sciences
Faculty of Computer Science
September 2014

Forschungsbericht 2014 (Research Report 2014)
Informatik und Interaktive Systeme (Computer Science and Interactive Systems)
Author contact
Prof. Dr.-Ing. Christian Märtin
Hochschule Augsburg, Faculty of Computer Science
Phone +49 (0) 821 5586-3454
christian.maertin@hs-augsburg.de
Areas of expertise:
■ Computer architecture
■ Intelligent systems
■ Human-machine interaction
■ Software engineering

This paper undertakes a critical review of the current challenges in multicore processor evolution, the underlying trends, and the design decisions for future multicore processor implementations. It is a short version of a paper presented at the Embedded World 2014 conference [1]. To keep up with Moore's law during the last decade, the VLSI scaling rules for processor design had to be changed dramatically. In future multicore designs large quantities of dark silicon will be unavoidable, and chip architects will have to find new ways of balancing further performance gains, energy efficiency and software complexity. The paper compares the various architectural alternatives on the basis of specific analytical models for multicore systems.

1 Introduction

More than 12 years after IBM entered the age of multicore processors with the IBM Power4, the first commercial dual-core processor chip, Moore's law still appears to be valid, as demonstrated by Intel's fast track from 32 to 22nm mass production and towards its new 14nm CMOS process, with even smaller and at the same time more energy-efficient structures every two years [2]. At the extreme end of the performance spectrum, Moore's law is also expressed by the industry's multi-billion transistor multicore and many-core server chips and GPUs. Obviously, the transistor raw material needed for integrating even more processor cores and larger caches onto future chips for all application areas and performance levels is still available.

However, just as the necessary transition from complex single-core architectures with high operating frequencies to multicore processors with moderate frequencies was caused by the exponentially growing thermal design power (TDP) that complex single-core processors needed to reach linear performance improvements [3], the ongoing multicore evolution has again hit the power wall and will undergo dramatic changes during the next several years [4]. As new analytical models and studies show [5], [6], power problems and the limited degree of inherent application parallelism will lead to rising percentages of dark or dim silicon in future multicore processors. This means that large parts of the chip have to be switched off or operated at low frequencies all the time. It remains to be studied whether such pessimistic forecasts will affect embedded applications and system environments more mildly than the software in more conservative standard and high-performance computing environments.

In the following we discuss the reasons for these developments together with other future challenges for multicore processors. We also examine possible solution approaches to some of the topics. When discussing the performance of multicore systems, we must look at adequate multicore performance models that consider both the effects of Amdahl's law on different multicore architectures and workloads and the consequences of these models with regard to multicore power and energy requirements. We also use the models to introduce the different architectural classes for multicore processors. As will be shown, the trend towards more heterogeneous and/or dynamic architectures and innovative design directions can mitigate several of the expected problems.

2 Moore's law and dark silicon

The major reason for the current situation and the upcoming trend towards large areas of dark silicon lies in the new scaling rules for VLSI design. Dennard's scaling rules [7] were formulated in 1974 and held for more than 30 years until around 2005. As is well known, power in CMOS chips can be modeled as:

$$P = Q f C V^2 + V I_{\mathrm{leakage}} \qquad (1)$$
Here $Q$ is the number of transistor devices, $f$ the operating frequency of the chip, $C$ the capacitance and $V$ the operating voltage. The leakage current $I_{\mathrm{leakage}}$ could be neglected until 2005, with device structures larger than 65nm.

With Dennard's scaling rules the total chip power for a given area size stayed the same from process generation to process generation. At the same time, with a scaling factor $S = \sqrt{2}$, feature size shrank at a rate of $1/S$ (the scaling ratio), transistor count doubled (Moore's law) and the frequency increased by 40 % [5] every two years. With feature sizes below 65nm, these rules could no longer be sustained because of the exponential growth of the leakage current. To lessen the leakage current, Intel, when moving to 45nm, introduced extremely efficient new hafnium-based gate insulators for the Penryn processor. When moving to 22nm, Intel optimized the switching process by using new 3D FinFET transistors, which are currently used in the Haswell processors and will also be scaled down to 14nm.

However, even these remarkable breakthroughs could not revive the scaling of the operating voltage, because no further scaling of the threshold voltage is possible as long as the operating frequency is kept at the current already very low level. Therefore the operating voltage has remained at a constant value of around 1 V for several processor chip generations.

With Post-Dennard scaling, as with Dennard scaling, the number of transistors grows with $S^2$ and the frequency with $S$ from generation to generation, i.e. the potential computing performance increases by $S^3$ or 2.8 between two process generations. Transistor capacitance also scales down to $1/S$ under both scaling regimes. However, as the threshold voltage and thus the operating voltage cannot be scaled any longer, it is no longer possible to keep the power envelope constant from generation to generation and simultaneously reach the potential performance improvements. Whereas with Dennard scaling power remains constant between generations, Post-Dennard scaling leads to a power increase of $S^2 = 2$ per generation for the same die area [8]. At the same time, utilization of a chip's computing resources decreases at a rate of $1/S^2$ per generation.
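The scaling argument above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration under stated assumptions: it uses the dynamic term $Q f C V^2$ of equation (1) and ignores leakage, it applies the per-generation factors named in the text ($Q$ grows with $S^2$, $f$ with $S$, $C$ shrinks to $1/S$), and it assumes the standard Dennard voltage factor of $1/S$, which the text does not state explicitly. The function and variable names are hypothetical and not taken from the report.

```python
import math

S = math.sqrt(2)  # scaling factor per process generation, as in the text


def relative_dynamic_power(voltage_factor: float, s: float = S) -> float:
    """Per-generation change of Q*f*C*V^2 for a fixed die area.

    Q grows with s^2, f with s, and C shrinks to 1/s in both regimes;
    voltage_factor is the assumed per-generation change of V.
    """
    q, f, c = s ** 2, s, 1.0 / s
    return q * f * c * voltage_factor ** 2


# Assumption: under Dennard scaling V shrinks with 1/S per generation.
dennard = relative_dynamic_power(voltage_factor=1.0 / S)
# Post-Dennard scaling: V can no longer be reduced.
post_dennard = relative_dynamic_power(voltage_factor=1.0)

# Share of the chip that can run at full frequency if the power budget is fixed.
utilization = 1.0 / post_dennard

print(f"Dennard power factor:       {dennard:.2f}")
print(f"Post-Dennard power factor:  {post_dennard:.2f}")
print(f"Utilization per generation: {utilization:.2f}")
print(f"Potential performance S^3:  {S ** 3:.2f}")
```

Under these assumptions the sketch gives a power factor of 1 for Dennard scaling and 2 for Post-Dennard scaling, in line with the $S^2 = 2$ increase and the $1/S^2$ utilization described in the text.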
This means that at runtime large quantities of the transistors on the chip have to be switched off completely, operated at lower frequencies or organized in completely different and more energy-efficient ways. For a given chip area, energy efficiency can only be improved by 40 % per generation. This dramatic effect, called dark silicon, can already be seen in current multicore processor generations and will heavily affect future multicore and many-core processors. Appendix A shows the amount of dark and dim (i.e. lower-frequency) silicon for the next two technology generations.

These considerations are mainly based on physical laws applied to MOSFET transistor scaling and CMOS technology. However, the scaling effects on performance evolution are confirmed by researchers from industry. With the advances in transistor and process technology, new architectural approaches, larger chip sizes, the introduction of multicore processors and the partial use of the die area for huge caches, microprocessor performance increased about 1000 times in the last two decades, or about 30 times every ten years [4]. But with the new voltage scaling constraints this trend cannot be maintained. Without significant architectural breakthroughs, Shekhar Borkar, based on Intel's internal data, predicts only a six-fold performance increase for standard multicore-based microprocessor chips in the decade between 2008 and 2018. This projection corresponds to the theoretical 40 % increase per generation over five generations in Post-Dennard scaling, which amounts to $S^5 \approx 5.38$ when $S$ is rounded to 1.4.

Therefore, computer architects and system designers have to find effective strategic solutions for handling these major technological challenges. Formulated as a question, we have to ask: Are there ways to increase performance by substantially more than 40 % per generation when novel architectures or heterogeneous systems are applied that are extremely energy-efficient and/or use knowledge about the software structure of the application load to make productive use of the dark silicon?

3 Modeling Performance and Power of Multicore Architectures

When Gene Amdahl wrote his famous article on how the amount of serial software code to be executed influences the overall performance of parallel computing systems, he could not have envisioned that a very fast evolution in the fields of computer architecture and microelectronics would lead to single-chip multiprocessors with dozens and possibly hundreds of processor cores on one chip. However, in his paper he made a visionary statement, still valid today, that pinpoints the current situation: “the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude” [10], [11].
Fig. 1. Symmetric multicore processor: each block is a basic core equivalent (BCE) including L1 and L2 caches. L3 cache and on-chip network are not modeled. The computing performance of a BCE is normalized to 1.

Today we know that, following Pollack's rule [4], single-core performance can only be further improved with great effort. When using an amount of $r$ transistor resources, scalar performance rises to $\mathit{Perf}(r) = \sqrt{r}$. At the same time single-core power $P$ (or the TDP) rises with $P = \mathit{Perf}^{\alpha}$ [3] with $\alpha = 1.75$, or nearly the square of the targeted scalar performance. For the rest of this paper we assume that $P$ includes dynamic as well as leakage power. Thus, there is a direct relation between the TDP of a sequential core, the transistor resources and the performance. Power can be expressed, as shown in [12], by:

$$P = \mathit{Perf}^{\alpha} = (\sqrt{r})^{\alpha} = r^{\alpha/2} \qquad (2)$$
When Grochowski derived his equation from comparing Intel processor generations from the i486 to the Pentium 4 Cedar Mill (all of them normalized to 65nm technology), multicore evolution was slowly gathering momentum and the Post-Dennard scaling effects were only in their initial phase. With current technology, leakage currents and supply voltages that cannot be scaled down further would forbid the use of the entire die area of mainstream microprocessor chips for a single-core processor. Therefore, analytical multicore models for existing or envisioned architectural alternatives have to consider the implications of different workload characteristics for both the performance and the power requirements of these architectures.

3.1 Standard Models

In 2008 Hill and Marty started this discussion with performance models for three types of multicore architectures: symmetric (fig. 1), asymmetric (fig. 2), and dynamic (fig. 3) multicore chips [13]. They evaluated their speedup models for different workloads, characterized by the respective degree $f$ of code that can be parallelized as in Amdahl's law, where the speedup $S$ achieved by parallel processing is defined as:

$$S_{\mathrm{Amdahl}}(f) = \frac{1}{(1 - f) + \frac{f}{n}} \qquad (3)$$

In this equation $f$ is the fraction of possible parallel work and $n$ is the number of available cores. As Amdahl's law uses fixed application sizes, the achievable maximum speedup, whether multiprocessors or multicores are used, is always limited to $S_{\max} = \frac{1}{1 - f}$. With a fixed application size, the models presented by Hill and Marty also reach their performance limits relatively soon, and for workloads with $f < 0.99$ a large number of cores does not seem adequate.
currents
and not further down-scalable supply voltages
Fig. 2. Asymmetric multicore processor consisting of a complex core with 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 = 4 and 12 BCE
would
the use of the entire die area of main Fig. 2. forbid
Asymmetric multicore processor consisting of a complex core with 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 = 4 and 12 BCEs.
In the engineering and embedded system domains there are many s
stream microprocessor chips for a single core processor.
with moderate data size, irregular and data flow structure
4 and
12 BCEs.
Fig. 2. Asymmetric
multicore
processor consisting
of a complex
core control with 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 =
In the engineering and embedded system domains there are many standard and real time workloads Therefore analytical multicore models for existing or
Fig. 2. Asymmetric
multi- the easily exploitable parallelism. For these workload characteristics, with moderate data size, irregular control and data flow structures and with a limited degree of core processor consisting
envisioned
architectural
have to characteristics, consider
the the Marty with some extensions can lead to valuable insights of how diffe
In the engineering and embedded system domains there are many standard and exploitable parallelism. alternatives
For these workload easily understandable study of Hill and of
a
complex
core with
4 andmoderate 12 BCEs.influence Fig. 2. Asymmetric multicore processor consisting of a complex core with 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 =with the airregular chievable control performance. data size, and data flow structures and with a
implications
of different workload characteristics, both,
Marty with some extensions can lead to valuable insights of how different multicore architectures may Perƒ=√4 and
12easily BCEs.understandab
exploitable parallelism. For these workload characteristics, the influence t
he a
chievable p
erformance. to the performance, and the power requirementsMarty with some extensions can lead to valuable insights of how different multicor
of these
In the engineering and embedded system domains there are many standard and real time workloads with moderate structures a limited and
degree of data size, irregular control and data flow architectures.
the
engineering
embedded
system domains there
with influence the In
aand chievable performance. exploitable parallelism. For these workload characteristics, the easily understandable study of Hill and are
many
standard
and
real
time
workloads
with mode Marty with some extensions can lead to valuable insights of how different multicore architectures may rate
data
size,
irregular
control
and
data
fl
ow
structures
influence the a chievable performance. and with a limited degree of exploitable parallelism. For
these workload characteristics, the easily understandable
Fig. 1. Symmetric mul ticore processor: each
study of Hill and Marty with some extensions can lead
block is a basic core
equi- to
valuable insights of how different multicore architecvalent (BCE) including L1
tures may influence the achievable performance.
and L2 caches. L3 cache
and onchip-network
are
not modeled. The computing performance of a
BCE is normalized to 1.
Fig. 3. Dynamic multicore processor: 16 BCEs or one large core with Perf = √16.

Standard Models
In 2008 Hill and Marty started this discussion with performance models for three types of multicore architectures: symmetric (fig. 1), asymmetric (fig. 2), and dynamic (fig. 3) multicore chips [13]. They evaluated their speedup models for different workloads, characterized by the respective degree f of code that can be parallelized as in Amdahl’s law, where the speedup S achieved by parallel processing is defined as:

(3)  $S_{\mathrm{Amdahl}}(f) = \dfrac{1}{(1-f) + \dfrac{f}{n}}$

In this equation f is the fraction of possible parallel work and n is the number of available cores. As Amdahl’s law uses fixed application sizes, the achievable maximum speedup, whether multiprocessors or multicores are used, is always limited to $S_{\max} = \frac{1}{1-f}$. With a fixed application size, the models presented by Hill and Marty also reach their performance limits relatively soon, and for workloads with f < 0.99 a large number of cores does not seem adequate.
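To make the saturation effect of equation (3) tangible, the following minimal Python sketch evaluates Amdahl’s speedup and its upper bound 1/(1 − f) for a few core counts. The function names amdahl_speedup and amdahl_limit and the sample values are assumptions chosen here for illustration; they are not part of the original study.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Equation (3): speedup for parallel fraction f on n cores."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


def amdahl_limit(f: float) -> float:
    """Upper bound 1 / (1 - f) that is approached as n grows without limit."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if f == 1.0:
        return float("inf")  # fully parallel work has no finite bound
    return 1.0 / (1.0 - f)


if __name__ == "__main__":
    f = 0.99  # hypothetical workload: 99% of the work can be parallelized
    for n in (16, 64, 256, 1024):
        print(f"n={n:5d}  speedup={amdahl_speedup(f, n):7.2f}")
    print(f"limit for f={f}: {amdahl_limit(f):.2f}")
```

Running the sketch shows how the speedup flattens well below the bound once the sequential fraction dominates, which is the reason why very large core counts pay off only for workloads with f close to 1.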
The idea behind the different models is that the given chip area allows for the implementation of n simple cores, or so-called basic core equivalents (BCEs), each with a given performance of 1. To construct the architectural alternatives, more complex cores are modeled by consuming r of the n BCEs, leading to a performance perf(r) = √r per core, if we go along with Pollack’s performance rule for complex cores.

These considerations lead to the following speedup equations for the three alternatives:

(4)  $S_{\mathrm{sym}}(f, n, r) = \dfrac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f \cdot r}{\mathrm{perf}(r) \cdot n}}$

(5)  $S_{\mathrm{asym}}(f, n, r) = \dfrac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r) + n - r}}$

(6)  $S_{\mathrm{dyn}}(f, n, r) = \dfrac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{n}}$

Note that for the asymmetric case in equation (5), the parallel work is done together by the large core and the n − r small cores. If only either the large core or the small cores were running at a time, to keep the available power budget in balance, the equation would change to:

(7)  $S_{\mathrm{asym}}(f, n, r) = \dfrac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{n - r}}$
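The speedup equations (4) to (7) can be compared numerically with a short Python sketch. It assumes Pollack’s rule perf(r) = √r as stated in the text; the function names and the sample configuration (f = 0.9, n = 16) are hypothetical choices made here for illustration only.

```python
import math


def perf(r: float) -> float:
    """Pollack's rule: a core built from r BCEs performs like sqrt(r)."""
    return math.sqrt(r)


def _check(f: float, n: int, r: int) -> None:
    """Reject parameters outside the range the models are defined for."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")


def speedup_symmetric(f: float, n: int, r: int) -> float:
    """Equation (4): n / r identical cores, each built from r BCEs."""
    _check(f, n, r)
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))


def speedup_asymmetric(f: float, n: int, r: int) -> float:
    """Equation (5): one large core plus n - r BCEs share the parallel work."""
    _check(f, n, r)
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r))


def speedup_dynamic(f: float, n: int, r: int) -> float:
    """Equation (6): r BCEs fuse for sequential code, all n run in parallel."""
    _check(f, n, r)
    return 1.0 / ((1.0 - f) / perf(r) + f / n)


def speedup_asymmetric_power_balanced(f: float, n: int, r: int) -> float:
    """Equation (7): only the n - r small cores execute the parallel part."""
    _check(f, n, r)
    if r >= n:
        raise ValueError("equation (7) needs at least one small core (r < n)")
    return 1.0 / ((1.0 - f) / perf(r) + f / (n - r))


if __name__ == "__main__":
    f, n = 0.9, 16  # hypothetical workload and chip size
    for r in (1, 4, 8):
        print(
            f"r={r:2d}  sym={speedup_symmetric(f, n, r):6.2f}  "
            f"asym={speedup_asymmetric(f, n, r):6.2f}  "
            f"dyn={speedup_dynamic(f, n, r):6.2f}  "
            f"asym_pb={speedup_asymmetric_power_balanced(f, n, r):6.2f}"
        )
```

With r = 1 the symmetric model reduces to Amdahl’s law for n cores, which is a convenient sanity check when experimenting with other parameter values.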
In addition to achievable performance, power (TDP) and energy consumption are important indicators for the feasibility and appropriateness of multi- and many-core architectures with symmetric and asymmetric structures. Woo and Lee have extended the work of Hill and Marty towards modeling the average power envelope when workloads that can be modeled with Amdahl’s law are executed on multicore processors [14].

If we have a look at the evolution of commercial multicore processors, in addition to the already discussed architectural alternatives, we meanwhile can see new architectural variants for asymmetric systems (fig. 4), dynamic systems (fig. 5) and heterogeneous multicore systems (fig. 6).

Fig. 4. Asymmetric multicore with 2 complex cores and 8 BCEs.

Fig. 5. Dynamic multicore with 4 cores and frequency scaling using the power budget of 8 BCEs. Currently one core is at full core TDP, two cores are at ½ core TDP. One core is switched off.

Fig. 6. Heterogeneous multicore with a large core, four BCEs, and two accelerators or co-processors each of type A, B, D, and E. Each co-processor/accelerator uses the same transistor budget as a BCE.

Models for Heterogeneous Multicores
Heterogeneous architectures at first glance look similar to asymmetric multicores. However, in addition to a conventional complex core, they introduce unconventional or U-cores [12] that represent custom logic, GPU resources, or FPGAs. Such cores can be interesting for specific application types with SIMD parallelism, GPU-like multithreaded parallelism, or specific parallel algorithms that have been mapped to custom or FPGA logic. To model such an unconventional core, Chung et al. suggest taking the same transistor budget for a U-core as for a simple BCE core. The U-core then executes a specific parallel application section with a relative performance of μ, while consuming a relative power of φ compared to a BCE with μ = φ = 1. If μ > 1 for a specific workload, the U-core works as an accelerator. With φ < 1 the U-core consumes less power and can help to make better use of the dark silicon. The resulting speedup equation by Chung et al. can directly be derived from equation (7):

(8)  $S_{\mathrm{heterogeneous}}(f, n, r, \mu) = \dfrac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{\mu (n - r)}}$

This equation is very similar to equation (7) that gives the speedup model for asymmetric multicore systems. In their evaluation, Woo and Lee [14] show that such asymmetric multicores are much more appropriate for power scaling than symmetric architectures, either with complex or simple cores. Going a step further, it follows that heterogeneous multicores will scale even better, if μ > 1 and φ < 1 or at least φ = 1. As prototypical results in current research towards a better utilization of the dark silicon in logic chips demonstrate, future architectures might profit mostly in performance and energy consumption, if not only one, but many additional coprocessors or accelerator cores are added to one or a small number of large cores [15].

If, in addition to a conservative complex core, there are different accelerator/co-processor cores on a heterogeneous chip, they will typically be active on different parts of the entire application workload. Let us assume that the entire die area can hold resources for $n = r + r_1 + r_2 + \dots + r_l$ BCEs with r denoting the resources for the complex core and $r_i$ giving the resources for each specific coprocessor core. We can now derive a more flexible speedup equation that mirrors the real application structure and that can easily be used to model application-specific power and energy consumption:

(9)  $S_{\mathrm{flexible}} = \dfrac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \sum_{i=1}^{l} \dfrac{f_i}{\mu_i \cdot r_i}}$

The $f_i$ are exploitable fractions of the application workload, $f = f_1 + f_2 + \dots + f_l$ with f ≤ 1, and $\mu_i$ is the relative performance of a coprocessor when using $r_i$ BCE resources. If $\mu_i = 1$, the coprocessor can be interpreted as a cluster of $r_i$ simple cores working in parallel.

Other Studies
In [16] it is shown that there will be no limits to multicore performance scaling, if the application size grows with the number of available cores, as was stated first for classic multiprocessors by Gustafson [17]. Sun and Chen show that Gustafson’s law also holds for multicore architectures and that the multicore speedup scales as a linear function of n, if the workload is also growing with the system size. They show that, even if continuously growing parts of the workload reside in slow DRAM memory, unlimited speedup scaling is still possible, as long as the DRAM access time is a constant. Note that without scaling up the workload of the LINPACK benchmark, the multi-million-core supercomputers presented in the Top 500 list twice every year would not make sense at all.

A recent and much more pessimistic study by Esmaeilzadeh et al. [5] constructs a very fine-grained performance model for future multicore processors. The model includes separate partial models for Post-Dennard device scaling, microarchitecture core scaling, and the complete multicore. The model’s starting point is the 45nm Intel Core i7-960 Nehalem quadcore processor that is mapped to all future process generations until a fictitious 8nm process. Following optimistic ITRS scaling, the downscaling of a core will result in a 3.9 x core performance improvement and an 88% reduction in core power consumption. However, if the more conservative projections by Shekhar Borkar are applied, the performance will only increase by 34% with a power reduction of 74%. The model is evaluated using the 12 real benchmarks of the PARSEC suite for parallel performance evaluation.

The results of the study show that the entire future 8nm multicore chip will only reach a 3.7 x (conservative scaling) to 7.9 x speedup (ITRS scaling). These numbers are the geometric mean of the 12 PARSEC benchmarks executed on the model. The study also shows that 32 to 35 cores at 8nm will be enough to reach 90% of the ideal speedup. Even though the benchmarks contain a relatively large amount of exploitable parallelism, there is not enough parallelism to utilize more cores, even with an unlimited power budget. For more realistic standard applications, the speedup would be limited by the available power envelope, rather than by exploitable parallelism. The power gap would lead to large fractions of dark silicon on the multicore chip. The executable model by Esmaeilzadeh et al. can be executed for different load specifications and hardware settings. The authors have made it available at a URL at the University of Wisconsin [6].

Although this model was devised very carefully and on the basis of available facts and plausible future trends, it cannot predict future breakthroughs that might influence hardware parameters that are currently treated like immovable objects. An example is the slow DRAM access time, which is responsible for the memory wall and has existed for decades. In Esmaeilzadeh’s model it is kept constant through to the 8nm process, whereas off-chip memory bandwidth is believed to scale linearly. Currently memory makers like Samsung and HMC are working on disruptive 3D-DRAM technologies that will organize DRAM chips vertically and speed up access time and bandwidth per chip by a factor of ten, without needing new DRAM cells [18]. These and other technology revolutions might affect all of the discussed models and could lead to more optimistic multicore performance scenarios. Read the full paper for a discussion of the ongoing evolution of commercial multicore processors, current research trends, and the consequences for software developers [1].

Conclusion
As VLSI scaling puts new challenges on the agenda of chip makers, computer architects and processor designers have to find innovative solutions for future multicore processors that make the best and most efficient use of the abundant transistor budget offered by Moore’s law that will hopefully be valid for another decade. In this paper, we have clarified the reasons for the Post-Dennard scaling regime and discussed the consequences for future multicore designs.

For the four typical architectural classes, symmetric, asymmetric, dynamic, and heterogeneous multicore processors, we have compared and extended some straightforward performance and power models. Although various results from research and industrial studies predict that huge parts of future multicore chips will be dark or dim all the time, contemporary commercial designs already show how these areas can be dealt with to make intelligent use of the limited power budget with heterogeneous coprocessors, accelerators, larger caches, faster memory buffers, and improved communication networks.

In the future, the success of many-core architectures will first and foremost depend on the exact knowledge of specific workload properties as well as on easy-to-use software environments and software development tools that support programmers in exploiting the explicit and implicit regular and heterogeneous parallelism of the application workloads.

References
[1] Märtin, C., “Multicore Processors: Challenges, Opportunities, Emerging Trends,” Proceedings Embedded World Conference 2014, 25-27 February, 2014, Nuremberg, Germany, Design & Elektronik, 2014
[2] Bohr, M., Mistry, K., “Intel’s revolutionary 22 nm Transistor Technology,” Intel Corporation, May 2011
[3] Grochowski, E., “Energy per Instruction Trends in Intel Microprocessors,” Technology@Intel Magazine, March 2006
[4] Borkar, S., Chien, A.A., “The Future of Microprocessors,” Communications of the ACM, May 2011, Vol. 54, No. 5, pp. 67-77
[5] Esmaeilzadeh, H. et al., “Power Challenges May End the Multicore Era,” Communications of the ACM, February 2013, Vol. 56, No. 2, pp. 93-102
[6] Esmaeilzadeh et al., “Executable Dark Silicon Performance Model”: http://research.cs.wisc.edu/vertical/DarkSilicon, last access Jan. 20, 2014
[7] Dennard, R.H. et al., “Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions,” IEEE J. Solid-State Circuits, vol. SC-9, 1974, pp. 256-268
[8] Taylor, M.B., “A Landscape of the New Dark Silicon Design Regime,” IEEE Micro, September/October 2013, pp. 8-19
[9] Borkar, S., “3D Integration for Energy Efficient System Design,” DAC’11, June 5-10, San Diego, California, USA
[10] Amdahl, G.M., “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities,” Proc. AFIPS Conf., AFIPS Press, 1967, pp. 483-485
[11] Getov, V., “A Few Notes on Amdahl’s Law,” IEEE Computer, December 2013, p. 45
[12] Chung, E.S., et al., “Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?” Proc. 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010, pp. 225-236
[13] Hill, M.D., Marty, M., “Amdahl’s Law in the Multicore Era,” IEEE Computer, July 2008, pp. 33-38
[14] Woo, D.H., Lee, H.H., “Extending Amdahl’s Law for Energy-Efficient Computing in the Many-Core Era,” IEEE Computer, December 2008, pp. 24-31
[15] Venkatesh, G. et al., “Conservation Cores: Reducing the Energy of Mature Computations,” Proc. 15th Architectural Support for Programming Languages and Operating Systems Conf., ACM, 2010, pp. 205-218
[16] Sun, X.-H., Chen, Y., “Reevaluating Amdahl’s law in the multicore era,” J. Parallel Distrib. Comput. (2009), doi: 10.1016/j.jpdc.2009.05.002
[17] Gustafson, J.L., “Reevaluating Amdahl’s Law,” Communications of the ACM, Vol. 31, No. 5, 1988, pp. 532-533
[18] Courtland, R., “Chipmakers Push Memory Into the Third Dimension,” http://spectrum.ieee.org, last access: January 20, 2014

Appendix A: Dark and dim silicon in symmetric multicore processor generations from 22nm to 11nm. Each BE represents a RISC core with 8 MB last-level cache. Communication infrastructure is included in the transistor count.