Slides - Undergraduate Courses

Transcription

Prof. Wu FENG, feng@cs.vt.edu, 540-231-1192
Depts. of Computer Science and Electrical & Computer Engineering
© W. Feng, January 2009
Overview: Efficient Parallel Computing
•  Interdisciplinary
–  Approaches: Experimental, Simulative, and Theoretical
–  Abstractions: Systems Software, Middleware, Applications
•  Synergistic Research Sampling
–  Emerging Chip Multiprocessors: Multicore, GPUs, Cell / PS3
•  Performance Analysis & Optimization: Systems & Applications
–  Automated Process-to-Core Mapping: SyMMer
–  Mixin Layers & Concurrent Mixin Layers (Tilevich)
–  Green Supercomputing: EcoDaemon & Green500
–  Pairwise Sequence Search: BLAST and Smith-Waterman (IP2M Award)
–  Episodic Data Mining (Ramakrishnan, Cao)
–  Electrostatic Potential (Onufriev)
Overview: Efficient Parallel Computing
•  Interdisciplinary
–  Approaches: Experimental, Simulative, and Theoretical
–  Abstractions: Systems Software, Middleware, Applications
•  Synergistic Research Sampling
–  Hybrid Circuit- and Packet-Switched Networking
•  Application-Awareness in Network Infrastructure: FAATE
•  Composable Protocols for High-Performance Networking
Overview: Efficient Parallel Computing
•  Interdisciplinary
–  Approaches: Experimental, Simulative, and Theoretical
–  Abstractions: Systems Software, Middleware, Applications
•  Synergistic Research Sampling
–  Large-Scale Bioinformatics
•  mpiBLAST: Parallel Sequence Search (www.mpiblast.org)
–  “Finding Missing Genes” in Genomes (Setubal)
•  ParaMEDIC: Parallel Metadata Environment for Distributed I/O & Computing (Storage Challenge Award)
Overview: Efficient Parallel Computing
•  Interdisciplinary
–  Approaches: Experimental, Simulative, and Theoretical
–  Abstractions: Systems Software, Middleware, Applications
•  Synergistic Research Sampling
–  Virtual Computing for K-12 Pedagogy & Research (Gardner)
•  Characterization & Optimization of Virtual Machines
•  Cybersecurity & Scalability: Private vs. Public Addresses
(Best Undergraduate Student Poster Award)
The Synergy Lab (http://synergy.cs.vt.edu/)
•  Ph.D. Students
–  Jeremy Archuleta
–  Marwa Elteir
–  Song Huang
–  Thomas Scogland
–  Sushant Sharma
–  Kumaresh Singh
–  Shucai Xiao
•  M.S. Students
–  Ashwin Aji
–  Rajesh Gangam
–  Ajeet Singh
•  B.S. Students
–  Jacqueline Addesa
–  William Gomez
–  Gabriel Martinez
•  Faculty Collaborators
–  Kirk Cameron
–  Yong Cao
–  Mark Gardner
–  Alexey Onufriev
–  Dimitris Nikolopoulos
–  Naren Ramakrishnan
–  Adrian Sandu
–  João Setubal
–  Eli Tilevich
Prof. Wu FENG, feng@cs.vt.edu, 540-231-1192
Depts. of Computer Science and Electrical & Computer Engineering
© W. Feng, January 2009
Why Green Supercomputing?
•  Premise
–  We need horsepower to compute, but horsepower also translates to electrical power consumption.
•  Consequences
–  Electrical power costs $$$$.
–  “Too much” power affects efficiency, reliability, availability.
•  Idea
–  Autonomic energy and power savings while maintaining performance in the supercomputer or datacenter.
•  Low-Power Approach: Power Awareness at Integration Time
•  Power-Aware Approach: Power Awareness at Run Time
Figure: Green Destiny: Low-Power Supercomputer vs. Green Destiny “Replica”: Traditional Supercomputer. Only difference? The processors. (Wall Street Journal, January 29, 2007; © W. Feng, Mar. 2007)
Outline
•  Motivation & Background
–  Where is High-End Computing (HEC)?
–  The Need for Efficiency, Reliability, and Availability
•  Supercomputing in Small Spaces (http://sss.cs.vt.edu/)
–  Distant Past: Green Destiny (2001-2002)
–  Recent Past to Present: Evolution of Green Destiny (2003-2007)
•  Architectural
–  MegaScale, Orion Multisystems, IBM Blue Gene/L
•  Software-Based
–  EnergyFit™: Power-Aware Run-Time System (β adaptation)
•  Conclusion and Future Work
Where is High-End Computing?
We have spent decades focusing on
performance, performance, performance
(and price/performance).
Where is High-End Computing?
TOP500 Supercomputer List
•  Benchmark
–  LINPACK: Solves a (random) dense system of linear
equations in double-precision (64 bits) arithmetic.
•  Introduced by Prof. Jack Dongarra, U. Tennessee & ORNL
•  Evaluation Metric
–  Performance (i.e., Speed)
•  Floating-Point Operations Per Second (FLOPS)
•  Web Site
–  http://www.top500.org/
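As a rough illustration of what the LINPACK benchmark measures, the sketch below times the solution of a random dense double-precision system and reports FLOPS. This is only an approximation of the idea (it uses NumPy's solver rather than the official HPL code), and the problem size n is an arbitrary choice of mine.

    # Illustrative sketch of the LINPACK idea: solve a random dense system in
    # double precision and estimate FLOPS. Not the official HPL benchmark.
    import time
    import numpy as np

    def linpack_like(n=2000):
        rng = np.random.default_rng(0)
        A = rng.random((n, n))          # random dense matrix (float64)
        b = rng.random(n)
        start = time.perf_counter()
        x = np.linalg.solve(A, b)       # LU factorization + triangular solves
        elapsed = time.perf_counter() - start
        flops = (2.0 / 3.0) * n**3 + 2.0 * n**2   # standard LINPACK flop count
        print(f"n={n}: {elapsed:.3f} s, {flops / elapsed / 1e9:.2f} Gflops")
        return x

    if __name__ == "__main__":
        linpack_like()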
Where is High-End Computing?
Gordon Bell Awards at SC
•  Metrics for Evaluating Supercomputers (or HEC)
–  Performance (i.e., Speed)
•  Metric: Floating-Point Operations Per Second (FLOPS)
–  Price/Performance → Cost Efficiency
•  Metric: Acquisition Cost / FLOPS
•  Example: VT System X cluster
•  Performance & price/performance are important
metrics, but …
Where is High-End Computing?
… more “performance” → more horsepower →
more power consumption
→ less efficiency, reliability, and availability
Moore’s Law for Power Consumption?
Figure: Chip maximum power density (watts/cm²) vs. year, 1985-2004, by process generation (1.5µ down to 0.07µ). Source: Intel. Power density rises steadily from the i386 and i486 through the Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4 (Willamette, Prescott), surpassing that of a heating plate and approaching that of a nuclear reactor and a rocket nozzle.
Where is High-End Computing?
… more “performance” → more horsepower →
more power consumption
→ less efficiency, reliability, and availability
Efficiency of HEC Systems
•  “Performance” and “Price/Performance” Metrics …
–  Lower efficiency, reliability, and availability.
–  Higher operational costs, e.g., admin, maintenance, etc.
•  Examples
–  Computational Efficiency
•  Relative to Peak: Actual Performance/Peak Performance
•  Relative to Space: Performance/Sq. Ft.
•  Relative to Power: Performance/Watt
–  Performance: 10,000-fold increase (since the Cray C90).
•  Performance/Sq. Ft.: Only 65-fold increase.
•  Performance/Watt: Only 300-fold increase.
–  Massive construction and operational costs associated with
powering and cooling.
Reliability & Availability of HEC Systems
System: ASCI Q (8,192 CPUs)
–  MTBF: 6.5 hrs. 114 unplanned outages/month.
–  HW outage sources: storage, CPU, memory.
System: ASCI White (8,192 CPUs)
–  MTBF: 5 hrs. (2001) and 40 hrs. (2003).
–  HW outage sources: storage, CPU, 3rd-party HW.
System: PSC Lemieux (3,016 CPUs)
–  MTBF: 9.7 hrs. Availability: 98.33%.
System: Google, projected from 2003 (~450,000 CPUs)
–  ~550 reboots/day; 2-3% machines replaced/yr. Availability: ~100%.
–  HW outage sources: storage, memory.
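To connect MTBF and availability, a standard relation (not stated on the slide) is availability = MTBF / (MTBF + MTTR). Applied to the PSC Lemieux row above as a back-of-the-envelope check of my own:

    \[
    \mathrm{Availability} = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
    \;\Rightarrow\;
    0.9833 \approx \frac{9.7}{9.7 + \mathrm{MTTR}}
    \;\Rightarrow\;
    \mathrm{MTTR} \approx 9.7\left(\tfrac{1}{0.9833} - 1\right) \approx 0.16\ \text{hrs} \approx 10\ \text{minutes per failure.}
    \]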
Ubiquitous Need for Efficiency, Reliability, and Availability
•  Requirement: Near-100% availability with efficient
and reliable resource usage.
–  E-commerce, enterprise apps, online services, ISPs, data
and HPC centers supporting R&D.
•  Problems (adapted from David Patterson, IEEE IPDPS 2001)
–  Frequency of Service Outages
•  65% of IT managers report that their web sites were unavailable to customers over a 6-month period.
–  Cost of Service Outages
•  NYC stockbroker: $6,500,000 / hour
•  Amazon.com: $1,860,000 / hour
•  Ebay (22 hours): $225,000 / hour
•  Social Effects: negative press, loss of customers who “click over” to competitor (e.g., Google vs. Ask Jeeves)
Outline
•  Motivation & Background
–  Where is High-End Computing (HEC)?
–  The Need for Efficiency, Reliability, and Availability
•  Supercomputing in Small Spaces (http://sss.cs.vt.edu/)
–  Distant Past: Green Destiny (2001-2002)
–  Recent Past to Present: Evolution of Green Destiny (2003-2007)
•  Architectural
–  MegaScale, Orion Multisystems, IBM Blue Gene/L
•  Software-Based
–  EnergyFit™: Power-Aware Run-Time System (β adaptation)
•  Conclusion and Future Work
Supercomputing in Small Spaces
Established 2001 at LANL. Now at Virginia Tech.
http://sss.cs.vt.edu/
•  Goals
–  Improve efficiency, reliability, and availability in large-scale
supercomputing systems (HEC, server farms, distributed systems)
•  Autonomic energy and power savings while maintaining performance in
a large-scale computing system.
–  Reduce the total cost of ownership (TCO).
•  TCO = Acquisition Cost (“one time”) + Operation Cost (recurring)
•  Crude Analogy
–  Today’s Supercomputer or Datacenter
•  Formula One Race Car: Wins on raw performance, but reliability is so poor that it requires frequent maintenance. Throughput is low.
–  “Supercomputing in Small Spaces” Supercomputer or Datacenter
•  Toyota Camry: Loses on raw performance, but high reliability and efficiency result in high throughput (i.e., miles driven/month → answers/month).
Green Destiny Supercomputer
(circa December 2001 – February 2002)
•  A 240-Node Cluster in Five Sq. Ft.
–  Equivalent Linpack to a 256-CPU SGI Origin 2000 (on the TOP500 List at the time)
•  Each Node
–  1-GHz Transmeta TM5800 CPU w/ High-Performance CodeMorphing Software running Linux 2.4.x
–  640-MB RAM, 20-GB hard disk, 100-Mb/s Ethernet
•  Total
–  240 Gflops peak (Linpack: 101 Gflops in March 2002)
–  150 GB of RAM (expandable to 276 GB)
–  4.8 TB of storage (expandable to 38.4 TB)
–  Power Consumption: Only 3.2 kW (diskless)
2003 WINNER
•  Reliability & Availability
–  No unscheduled downtime in 24-month lifetime.
•  Environment: A dusty 85°-90° F warehouse!
Featured in The New York Times, BBC News, and CNN.
Now in the Computer History Museum.
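For a rough sense of the performance-per-watt metric discussed earlier, a back-of-the-envelope calculation of mine using this slide's numbers (the 3.2 kW figure is for the diskless configuration):

    \[
    \frac{101\ \text{Gflops (Linpack)}}{3.2\ \text{kW}} \approx 31.6\ \text{Mflops/W},
    \qquad
    \frac{240\ \text{Gflops (peak)}}{3.2\ \text{kW}} = 75\ \text{Mflops/W}.
    \]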
Michael S. Warren, Los Alamos National Laboratory
© W. Feng, May 2002
Green Destiny: Low-Power Supercomputer
Only Difference? The Processors
© W. Feng, Mar. 2007
Green Destiny “Replica”: Traditional Supercomputer
Parallel Computing Platforms Running an
N-body Gravitational Code
Machine                     Avalon    ASCI     ASCI     Green
                            Beowulf   Red      White    Destiny
Year                        1996      1996     2000     2002
Performance (Gflops)        18        600      2500     58
Area (ft²)                  120       1600     9920     5
Power (kW)                  18        1200     2000     5
DRAM (GB)                   36        585      6200     150
Disk (TB)                   0.4       2.0      160.0    4.8
DRAM density (MB/ft²)       300       366      625      3000
Disk density (GB/ft²)       3.3       1.3      16.1     960.0
Perf/Space (Mflops/ft²)     150       375      252      11600
Perf/Power (Mflops/watt)    1.0       0.5      1.3      11.6
Yet in 2002 …
•  “Green Destiny is so low power that it runs just as fast
when it is unplugged.”
•  “The slew of expletives and exclamations that followed
Feng’s description of the system …”
•  “In HPC, no one cares about power & cooling, and no
one ever will …”
•  “Moore’s Law for Power will stimulate the economy by creating a new market in cooling technologies.”
Outline
•  Motivation & Background
–  Where is High-End Computing (HEC)?
–  The Need for Efficiency, Reliability, and Availability
•  Supercomputing in Small Spaces (http://sss.cs.vt.edu/)
–  Distant Past: Green Destiny (2001-2002)
–  Recent Past to Present: Evolution of Green Destiny (2003-2007)
•  Architectural
–  MegaScale, Orion Multisystems, IBM Blue Gene/L
•  Software-Based
–  EnergyFit™: Power-Aware Run-Time System (β adaptation)
–  EcoDaemon™: Power-Aware Operating System Daemon
•  Conclusion and Future Work
Cluster Technology + Low-Power Systems Design + Linux … but in the form factor of a workstation: a cluster workstation. (© W. Feng, Aug. 2007)
•  LINPACK Performance
–  14 Gflops
•  Footprint
–  3 sq. ft. (24” x 18”)
–  1 cu. ft. (24” x 4” x 18”)
•  Power Consumption
–  170 watts at load
•  How does this compare with a traditional desktop?
Inter-University Project: MegaScale
http://www.para.tutics.tut.ac.jp/megascale/
Univ. of Tsukuba Booth @ SC2004, Nov. 2004.
IBM Blue Gene/L
Then the fastest supercomputer in the world. (© 2004 IBM Corporation)
•  Chip (2 processors): 2.8/5.6 GF/s, 4 MB
•  Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
•  Node Card (32 chips, 4x4x2; 16 Compute Cards): 90/180 GF/s, 8 GB DDR
•  Cabinet (32 Node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
•  System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
•  October 2003: BG/L half-rack prototype
–  500 MHz, 512 nodes/1024 proc., 2 TFlop/s peak, 1.4 TFlop/s sustained
Outline
•  Motivation & Background
–  Where is High-End Computing (HEC)?
–  The Need for Efficiency, Reliability, and Availability
•  Supercomputing in Small Spaces (http://sss.cs.vt.edu/)
–  Distant Past: Green Destiny (2001-2002)
–  Recent Past to Present: Evolution of Green Destiny (2003-2007)
•  Architectural
–  MegaScale, Orion Multisystems, IBM Blue Gene/L
•  Software-Based
–  EnergyFit™: Power-Aware Run-Time System (β adaptation)
–  EcoDaemon™: Power-Aware Operating System Daemon
•  Conclusion and Future Work
Software-Based Approach: Power-Aware Computing
•  Self-Adapting Software for Energy Efficiency
–  Conserve power & energy WHILE maintaining performance.
•  Observations
–  Many commodity technologies support dynamic voltage &
frequency scaling (DVFS), which allows changes to the
processor voltage and frequency at run time.
–  A computing system can trade off processor performance for
power reduction.
•  Power ∝ V²f, where V is the supply voltage of the processor and f is its frequency.
•  Processor performance ∝ f (see the worked sketch below).
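A worked sketch of the trade-off, using voltage/frequency pairs that I assume for illustration (they are not from the slides): halving the frequency from 2 GHz at 1.5 V to 1 GHz at 1.1 V gives

    \[
    \frac{P_{\mathrm{low}}}{P_{\mathrm{high}}}
    \approx \frac{V_{\mathrm{low}}^{2}\, f_{\mathrm{low}}}{V_{\mathrm{high}}^{2}\, f_{\mathrm{high}}}
    = \frac{(1.1)^{2}(1)}{(1.5)^{2}(2)} \approx 0.27 .
    \]

For a fully CPU-bound task the runtime roughly doubles, so energy E = P × T still falls to roughly 0.54 of the original; for memory- or I/O-bound codes (as the next slides show for NAS IS), the slowdown is far smaller, so the energy savings come with little performance loss.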
Power-Aware HPC via DVFS: Key Observation
•  Execution time of many programs is insensitive to CPU speed change.
Figure: Execution time (%) vs. CPU speed (GHz), 0.6-2.2 GHz, for the NAS IS benchmark. With the clock speed slowed down by 50%, performance degraded by only 4%.
Power-Aware HPC via DVFS: Key Idea
•  Applying DVFS to these programs will result in significant power and energy savings at a minimal performance impact.
Figure: Execution time (%) vs. energy usage (%) for the NAS IS benchmark, showing operating points from 0.8 GHz to 2 GHz, the performance constraint (slowdown δ), and the energy-optimal DVS schedule.
Why is Power Awareness via DVFS Hard?
•  What is cycle time of a processor?
–  Frequency ≈ 2 GHz → Cycle Time ≈ 1 / (2 × 10⁹) s = 0.5 ns
•  How long does the system take to scale voltage and
frequency?
O (10,000,000 cycles)
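Combining the two numbers above (simple arithmetic of mine):

    \[
    10^{7}\ \text{cycles} \times 0.5\ \text{ns/cycle} = 5\ \text{ms per voltage/frequency transition,}
    \]

i.e., millions of instructions' worth of time, so a DVFS scheduler must change settings sparingly and amortize each transition over a long interval.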
Problem Formulation:
LP-Based Energy-Optimal DVFS Schedule
•  Definitions
–  A DVFS system exports n settings { (fi, Pi) }.
–  Ti : total execution time of a program running entirely at setting i.
•  Given a program with deadline D, find a DVS schedule (t1*, …, tn*) such that, if the program is executed for ti seconds at setting i, the total energy usage E is minimized, the deadline D is met, and the required work is completed (a sketch of the resulting linear program follows).
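A minimal sketch of that linear program, written under the assumption (consistent with the definitions above) that the fraction of the program's work completed at setting i is ti / Ti:

    \[
    \min_{t_1,\dots,t_n} \; E = \sum_{i=1}^{n} P_i\, t_i
    \quad \text{s.t.} \quad
    \sum_{i=1}^{n} t_i \le D,
    \qquad
    \sum_{i=1}^{n} \frac{t_i}{T_i} = 1,
    \qquad
    t_i \ge 0 .
    \]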
Performance Modeling
•  Traditional Performance Model
–  T(f) = (1 / f) * W
where T(f) (in seconds) is the execution time of a task running
at f and W (in cycles) is the amount of CPU work to be done.
•  Problems?
–  W needs to be known a priori. Difficult to predict.
–  W is not always constant across frequencies.
–  It predicts that the execution time will double if the CPU speed
is cut in half. (Not so for memory & I/O-bound.)
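To make that last point concrete using the NAS IS observation from the earlier slide (the side-by-side comparison is mine):

    \[
    \text{Model: } \frac{T(f_{\max}/2)}{T(f_{\max})} = 2
    \qquad\text{vs.}\qquad
    \text{Observed (NAS IS): } \frac{T(f_{\max}/2)}{T(f_{\max})} \approx 1.04 .
    \]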
Single-Coefficient β Performance Model
•  Our Formulation
–  Define the relative performance slowdown δ as δ(f) = T(f) / T(fMAX) – 1.
–  Re-formulate the two-coefficient performance model as a single-coefficient model (sketched below).
–  The coefficient β is computed at run time using a regression method on the past MIPS rates reported by the built-in PMU.
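A sketch of that single-coefficient model, consistent with the cited SC|05 paper (the equation itself did not survive transcription, so this is my rendering): β ∈ [0, 1] captures how CPU-bound the code is, with β ≈ 1 for fully CPU-bound and β ≈ 0 for memory- or I/O-bound codes.

    \[
    \frac{T(f)}{T(f_{\max})} = 1 + \beta\left(\frac{f_{\max}}{f} - 1\right)
    \qquad\Longrightarrow\qquad
    \delta(f) = \beta\left(\frac{f_{\max}}{f} - 1\right).
    \]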
C. Hsu and W. Feng, “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005.
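A minimal sketch of how such a run-time regression might look, assuming hypothetical (frequency, observed MIPS) samples; the function name and sample data below are illustrative only, not the EnergyFit implementation.

    # Illustrative sketch: estimate beta by least squares from observed MIPS rates.
    # The samples below are made-up; a real system would read them from the PMU.
    import numpy as np

    def estimate_beta(freqs_ghz, mips, f_max_ghz):
        """Fit T(f)/T(f_max) = 1 + beta * (f_max/f - 1) using relative rates."""
        freqs = np.asarray(freqs_ghz, dtype=float)
        rates = np.asarray(mips, dtype=float)
        mips_at_fmax = rates[np.argmax(freqs)]
        rel_time = mips_at_fmax / rates           # T(f)/T(f_max) ~ rate(f_max)/rate(f)
        x = f_max_ghz / freqs - 1.0               # regressor: (f_max/f - 1)
        y = rel_time - 1.0                        # response: relative slowdown
        beta = float(np.dot(x, y) / np.dot(x, x)) # least-squares slope through origin
        return beta

    # Hypothetical samples: a memory-bound code barely slows down at low frequency.
    print(estimate_beta([0.8, 1.4, 2.0], mips=[930, 975, 1000], f_max_ghz=2.0))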
Current DVFS Scheduling Algorithms
•  2step (i.e., CPUSPEED via SpeedStep; see the sketch after this list)
–  Using a dual-speed CPU, monitor CPU utilization periodically.
–  If utilization > pre-defined upper threshold, set CPU to fastest; if utilization < pre-defined lower threshold, set CPU to slowest.
•  nqPID: A refinement of the 2step algorithm.
–  Recognize the similarity of DVFS scheduling to a classical control-systems problem → modify a PID (Proportional-Integral-Derivative) controller to suit the DVFS scheduling problem.
•  freq: Reclaims the slack time between the actual processing time and the worst-case execution time.
–  Track the amount of remaining CPU work Wleft and the amount of remaining time before the deadline Tleft.
–  Set the desired CPU frequency at each interval to fnew = Wleft / Tleft.
–  The algorithm assumes that the total amount of work in CPU cycles is known a priori, which, in practice, is often unpredictable and not always constant across frequencies.
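A minimal sketch of the 2step and freq interval schedulers described above, under assumptions of my own (a fixed set of available frequencies and caller-supplied utilization/work estimates); this is illustrative Python, not the actual CPUSPEED or EnergyFit code.

    # Illustrative interval-based DVFS schedulers (not the real CPUSPEED/EnergyFit code).

    def two_step(utilization, freqs, upper=0.90, lower=0.30):
        """2step: dual-speed policy driven by CPU utilization in the last interval."""
        if utilization > upper:
            return max(freqs)          # CPU looks busy -> fastest setting
        if utilization < lower:
            return min(freqs)          # CPU looks idle -> slowest setting
        return None                    # in between: leave the frequency unchanged

    def freq_schedule(work_left_cycles, time_left_s, freqs):
        """freq: reclaim slack by targeting f_new = W_left / T_left (assumes W known)."""
        target_hz = work_left_cycles / time_left_s
        # Pick the slowest available frequency that still meets the deadline.
        feasible = [f for f in freqs if f >= target_hz]
        return min(feasible) if feasible else max(freqs)

    # Hypothetical 3-setting processor (Hz) and one scheduling interval.
    available = [800e6, 1.4e9, 2.0e9]
    print(two_step(utilization=0.20, freqs=available))                             # -> 800000000.0
    print(freq_schedule(work_left_cycles=6e9, time_left_s=5.0, freqs=available))   # -> 1400000000.0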
β-Adaptation on Sequential Codes (SPEC)
Figure: Relative time / relative energy with respect to total execution time and system energy usage. SMALLER numbers are BETTER.
NAS Parallel on an Athlon-64 Cluster
NAS Parallel on an Opteron Cluster
Outline
•  Motivation & Background
–  Where is High-End Computing (HEC)?
–  The Need for Efficiency, Reliability, and Availability
•  Supercomputing in Small Spaces (http://sss.cs.vt.edu/)
–  Distant Past: Green Destiny (2001-2002)
–  Recent Past to Present: Evolution of Green Destiny (2003-2007)
•  Architectural
–  MegaScale, Orion Multisystems, IBM Blue Gene/L
•  Software-Based
–  EnergyFit™: Power-Aware Run-Time System (β adaptation)
•  Conclusion and Future Work
Opportunities Galore!
•  Component Level
–  CPU, e.g., β
–  Memory
–  Disk
–  System
–  Across Systems, e.g., Job Scheduling
•  Abstraction Level (Library vs. OS)
–  Application Transparent
•  Compiler: β compile-time
•  Library: β run-time
•  OS: EnergyFit, EcoDaemon
–  Application Modified (too intrusive)
–  Parallel Library Modified (e.g., P. Raghavan & B. Norris, IPDPS’05)
•  Resource Co-scheduling
•  Directed Monitoring
–  Hardware performance counters
Selected References: Available at http://sss.cs.vt.edu/
Green Supercomputing
•  “Green Supercomputing Comes of Age,” IT Professional, Jan.-Feb. 2008.
•  “The Green500 List: Encouraging Sustainable Supercomputing,” IEEE Computer, Dec. 2007.
•  “Green Supercomputing in a Desktop Box,” IEEE Workshop on High-Performance, Power-Aware Computing at IEEE IPDPS, Mar. 2007.
•  “Making a Case for a Green500 List,” IEEE Workshop on High-Performance, Power-Aware Computing at IEEE IPDPS, Apr. 2006.
•  “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005.
•  “The Importance of Being Low Power in High-Performance Computing,” Cyberinfrastructure Technology Watch (NSF), Aug. 2005.
•  “Green Destiny & Its Evolving Parts,” International Supercomputer Conference (Innovative Supercomputer Architecture Award), Apr. 2004.
•  “Making a Case for Efficient Supercomputing,” ACM Queue, Oct. 2003.
•  “Green Destiny + mpiBLAST = Bioinfomagic,” 10th Int’l Conf. on Parallel Computing (ParCo’03), Sept. 2003.
•  “Honey, I Shrunk the Beowulf!,” Int’l Conf. on Parallel Processing, Aug. 2002.
Conclusion
•  Recall that …
–  We need horsepower to compute but horsepower also
translates to electrical power consumption.
•  Consequences
–  Electrical power costs $$$$.
–  “Too much” power affects efficiency, reliability, availability.
•  Solution
–  Autonomic energy and power savings while maintaining
performance in the supercomputer and datacenter.
•  Low-Power Architectural Approach
•  Power-Aware Commodity Approach
Acknowledgments
•  Contributions
–  Jeremy S. Archuleta (LANL & VT): Former LANL Intern, now Ph.D. student
–  Chung-hsing Hsu (LANL): Former LANL Postdoc
–  Michael S. Warren (LANL): Former LANL Colleague
–  Eric H. Weigle (UCSD): Former LANL Colleague
•  Sponsorship & Funding (Past)
–  AMD
–  DOE Los Alamos Computer Science Institute
–  LANL Information Architecture Project
–  Virginia Tech
Synergy Laboratory
Wu FENG
feng@cs.vt.edu
http://synergy.cs.vt.edu/
http://www.chrec.org/
http://www.mpiblast.org/
http://www.green500.org/
http://sss.cs.vt.edu/