Slides - Undergraduate Courses
Transcription
Prof. Wu FENG, feng@cs.vt.edu, 540-231-1192
Depts. of Computer Science and Electrical & Computer Engineering
© W. Feng, January 2009

Overview: Efficient Parallel Computing
• Interdisciplinary
  – Approaches: Experimental, Simulative, and Theoretical
  – Abstractions: Systems Software, Middleware, Applications
• Synergistic Research Sampling
  – Emerging Chip Multiprocessors: Multicore, GPUs, Cell / PS3
    • Performance Analysis & Optimization: Systems & Applications
  – Automated Process-to-Core Mapping: SyMMer
  – Mixin Layers & Concurrent Mixin Layers (Tilevich)
  – Green Supercomputing: EcoDaemon & Green500
  – Pairwise Sequence Search: BLAST and Smith-Waterman (IP2M Award)
  – Episodic Data Mining (Ramakrishnan, Cao)
  – Electrostatic Potential (Onufriev)

Overview: Efficient Parallel Computing
• Synergistic Research Sampling (continued)
  – Hybrid Circuit- and Packet-Switched Networking
    • Application-Awareness in Network Infrastructure: FAATE
    • Composable Protocols for High-Performance Networking

Overview: Efficient Parallel Computing
• Synergistic Research Sampling (continued)
  – Large-Scale Bioinformatics
    • mpiBLAST: Parallel Sequence Search (www.mpiblast.org)
      – “Finding Missing Genes” in Genomes (Setubal)
    • ParaMEDIC: Parallel Metadata Environment for Distributed I/O & Computing (Storage Challenge Award)

Overview: Efficient Parallel Computing
• Synergistic Research Sampling (continued)
  – Virtual Computing for K-12 Pedagogy & Research (Gardner) (Best Undergraduate Student Poster Award)
    • Characterization & Optimization of Virtual Machines
    • Cybersecurity & Scalability: Private vs. Public Addresses

The Synergy Lab (http://synergy.cs.vt.edu/)
• Ph.D. Students: Jeremy Archuleta, Marwa Elteir, Song Huang, Thomas Scogland, Sushant Sharma, Kumaresh Singh, Shucai Xiao
• M.S. Students: Ashwin Aji, Rajesh Gangam, Ajeet Singh
• B.S. Students: Jacqueline Addesa, William Gomez, Gabriel Martinez
• Faculty Collaborators: Kirk Cameron, Yong Cao, Mark Gardner, Alexey Onufriev, Dimitris Nikolopoulos, Naren Ramakrishnan, Adrian Sandu, João Setubal, Eli Tilevich

Green Destiny: Low-Power Supercomputer
[Photo: Green Destiny (low-power supercomputer) alongside a Green Destiny “replica” built as a traditional supercomputer. Only difference? The processors. Wall Street Journal, January 29, 2007. © W. Feng, Mar. 2007]

Why Green Supercomputing?
• Premise
  – We need horsepower to compute, but horsepower also translates into electrical power consumption.
• Consequences
  – Electrical power costs $$$$.
  – “Too much” power affects efficiency, reliability, and availability.
• Idea
  – Autonomic energy and power savings while maintaining performance in the supercomputer or datacenter.
    • Low-Power Approach: Power Awareness at Integration Time
    • Power-Aware Approach: Power Awareness at Run Time

Outline
• Motivation & Background
  – Where is High-End Computing (HEC)?
  – The Need for Efficiency, Reliability, and Availability
• Supercomputing in Small Spaces (http://sss.cs.vt.edu/)
  – Distant Past: Green Destiny (2001-2002)
  – Recent Past to Present: Evolution of Green Destiny (2003-2007)
    • Architectural: MegaScale, Orion Multisystems, IBM Blue Gene/L
    • Software-Based: EnergyFit™, a Power-Aware Run-Time System (β adaptation), and EcoDaemon™, a Power-Aware Operating System Daemon
• Conclusion and Future Work

Where is High-End Computing?
• We have spent decades focusing on performance, performance, performance (and price/performance).

Where is High-End Computing? The TOP500 Supercomputer List
• Benchmark
  – LINPACK: solves a (random) dense system of linear equations in double-precision (64-bit) arithmetic.
  – Introduced by Prof. Jack Dongarra, U. Tennessee & ORNL.
• Evaluation Metric
  – Performance (i.e., speed), measured in floating-point operations per second (FLOPS).
• Web Site
  – http://www.top500.org/

Where is High-End Computing? Gordon Bell Awards at SC
• Metrics for Evaluating Supercomputers (or HEC)
  – Performance (i.e., speed). Metric: floating-point operations per second (FLOPS).
  – Price/Performance (cost efficiency). Metric: acquisition cost / FLOPS. Example: VT System X cluster.
• Performance & price/performance are important metrics, but …

Where is High-End Computing?
… more “performance” → more horsepower → more power consumption → less efficiency, reliability, and availability.

Moore’s Law for Power Consumption?
[Figure: Chip maximum power density (watts/cm²) for Intel processors, from the i386 and i486 through the Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4 (Willamette and Prescott), plotted against process size shrinking from 1.5µ (1985) toward 0.07µ (2004). Annotations mark where chip power density surpassed that of a heating plate and then a nuclear reactor, with a rocket nozzle as the next milestone. Source: Intel]

Efficiency of HEC Systems
• Focusing only on the “performance” and “price/performance” metrics yields …
  – Lower efficiency, reliability, and availability.
  – Higher operational costs, e.g., administration, maintenance, etc.
• Examples
  – Computational Efficiency
    • Relative to peak: actual performance / peak performance
    • Relative to space: performance / sq. ft.
    • Relative to power: performance / watt
  – Performance: 10,000-fold increase (since the Cray C90).
    • Performance/Sq. Ft.: only a 65-fold increase.
    • Performance/Watt: only a 300-fold increase.
  – Massive construction and operational costs associated with powering and cooling.

Reliability & Availability of HEC Systems
• ASCI Q (8,192 CPUs): MTBF 6.5 hrs.; 114 unplanned outages/month; HW outage sources: storage, CPU, memory.
• ASCI White (8,192 CPUs): MTBF 5 hrs. (2001) and 40 hrs. (2003); HW outage sources: storage, CPU, 3rd-party HW.
• PSC Lemieux (3,016 CPUs): MTBF 9.7 hrs.; availability 98.33%.
• Google (~450,000 CPUs, projected from 2003): ~550 reboots/day; 2-3% of machines replaced/yr.; availability ~100%; HW outage sources: storage, memory.
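To connect the MTBF and availability figures above, the sketch below assumes the standard steady-state relation availability = MTBF / (MTBF + MTTR), which the slide does not state, and back-solves the repair time implied by the PSC Lemieux row. The function names are illustrative.

```python
# Hedged sketch: relating MTBF, MTTR, and availability.
# Assumes the standard steady-state definition A = MTBF / (MTBF + MTTR),
# which the slide itself does not spell out.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def implied_mttr(mtbf_hours: float, avail: float) -> float:
    """Back-solve the mean time to repair from an observed availability."""
    return mtbf_hours * (1.0 - avail) / avail

# PSC Lemieux row from the slide: MTBF 9.7 hrs., availability 98.33%.
mttr = implied_mttr(9.7, 0.9833)
print(f"Implied MTTR: {mttr:.2f} hrs (~{mttr * 60:.0f} minutes)")
# -> roughly 0.16 hrs, i.e., about 10 minutes of downtime per failure.
```

With an MTBF of only several hours, even ten-minute repairs leave availability short of the near-100% requirement discussed on the next slide, which is why the emphasis falls on improving reliability rather than merely speeding up repairs.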
Ubiquitous Need for Efficiency, Reliability, and Availability
• Requirement: near-100% availability with efficient and reliable resource usage.
  – E-commerce, enterprise apps, online services, ISPs, and data and HPC centers supporting R&D.
• Problems (adapted from David Patterson, IEEE IPDPS 2001)
  – Frequency of service outages
    • 65% of IT managers report that their web sites were unavailable to customers over a 6-month period.
  – Cost of service outages
    • NYC stockbroker: $6,500,000 / hour
    • Amazon.com: $1,860,000 / hour
    • Ebay (22 hours): $225,000 / hour
    • Social effects: negative press, loss of customers who “click over” to a competitor (e.g., Google vs. Ask Jeeves)

Supercomputing in Small Spaces (http://sss.cs.vt.edu/)
Established 2001 at LANL. Now at Virginia Tech.
• Goals
  – Improve efficiency, reliability, and availability in large-scale supercomputing systems (HEC, server farms, distributed systems).
    • Autonomic energy and power savings while maintaining performance in a large-scale computing system.
  – Reduce the total cost of ownership (TCO).
    • TCO = Acquisition Cost (“one time”) + Operation Cost (recurring)
• Crude Analogy
  – Today’s supercomputer or datacenter
    • Formula One race car: wins on raw performance, but reliability is so poor that it requires frequent maintenance. Throughput is low.
  – “Supercomputing in Small Spaces” supercomputer or datacenter
    • Toyota Camry: loses on raw performance, but high reliability and efficiency result in high throughput (i.e., miles driven/month → answers/month).

Green Destiny Supercomputer (circa December 2001 - February 2002)
• A 240-node cluster in five sq. ft.
  – Linpack performance equivalent to a 256-CPU SGI Origin 2000 (on the TOP500 List at the time).
• Each Node
  – 1-GHz Transmeta TM5800 CPU w/ high-performance Code Morphing Software, running Linux 2.4.x
  – 640-MB RAM, 20-GB hard disk, 100-Mb/s Ethernet
• Total
  – 240 Gflops peak (Linpack: 101 Gflops in March 2002)
  – 150 GB of RAM (expandable to 276 GB)
  – 4.8 TB of storage (expandable to 38.4 TB)
  – Power consumption: only 3.2 kW (diskless)
• Reliability & Availability
  – No unscheduled downtime in its 24-month lifetime.
  – Environment: a dusty 85°-90° F warehouse!
• 2003 WINNER
• Featured in The New York Times, BBC News, and CNN. Now in the Computer History Museum.

[Photo credit: Michael S. Warren, Los Alamos National Laboratory. © W. Feng, May 2002]

Green Destiny: Low-Power Supercomputer
[Photo: Green Destiny alongside a Green Destiny “replica” built as a traditional supercomputer. Only difference? The processors. © W. Feng, Mar. 2007]
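As a quick worked example of the computational-efficiency metrics introduced earlier (performance relative to peak, to space, and to power), the sketch below plugs in the Green Destiny numbers quoted above: 101 Gflops Linpack, 240 Gflops peak, five square feet, and 3.2 kW diskless. The helper function and unit choices are illustrative, not from the talk.

```python
# Hedged sketch: the computational-efficiency metrics from the
# "Efficiency of HEC Systems" slide, applied to Green Destiny's
# published numbers. Function name and units are illustrative.

def efficiency_metrics(linpack_gflops, peak_gflops, area_sqft, power_kw):
    return {
        "relative_to_peak": linpack_gflops / peak_gflops,         # fraction of peak
        "gflops_per_sqft": linpack_gflops / area_sqft,            # Gflops / sq. ft.
        "mflops_per_watt": (linpack_gflops * 1000) / (power_kw * 1000),  # Mflops / W
    }

# Green Destiny, March 2002: 101 Gflops Linpack, 240 Gflops peak,
# 5 sq. ft., 3.2 kW (diskless).
m = efficiency_metrics(101, 240, 5, 3.2)
print(f"Relative to peak:  {m['relative_to_peak']:.0%}")             # ~42%
print(f"Perf per sq. ft.:  {m['gflops_per_sqft']:.1f} Gflops/ft^2")  # ~20.2
print(f"Perf per watt:     {m['mflops_per_watt']:.1f} Mflops/W")     # ~31.6
```

Note that the N-body comparison table that follows reports different measurements for Green Destiny (58 Gflops on the gravitational code and 5 kW with disks), which is why it lists 11.6 Mflops/watt rather than the Linpack-based figure computed here.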
Parallel Computing Platforms Running an N-body Gravitational Code

Metric                    | Avalon Beowulf | ASCI Red | ASCI White | Green Destiny
Year                      | 1996           | 1996     | 2000       | 2002
Performance (Gflops)      | 18             | 600      | 2500       | 58
Area (ft²)                | 120            | 1600     | 9920       | 5
Power (kW)                | 18             | 1200     | 2000       | 5
DRAM (GB)                 | 36             | 585      | 6200       | 150
Disk (TB)                 | 0.4            | 2.0      | 160.0      | 4.8
DRAM density (MB/ft²)     | 300            | 366      | 625        | 3000
Disk density (GB/ft²)     | 3.3            | 1.3      | 16.1       | 960.0
Perf/Space (Mflops/ft²)   | 150            | 375      | 252        | 11600
Perf/Power (Mflops/watt)  | 1.0            | 0.5      | 1.3        | 11.6

Yet in 2002 …
• “Green Destiny is so low power that it runs just as fast when it is unplugged.”
• “The slew of expletives and exclamations that followed Feng’s description of the system …”
• “In HPC, no one cares about power & cooling, and no one ever will …”
• “Moore’s Law for Power will stimulate the economy by creating a new market in cooling technologies.”

Cluster technology + low-power systems design + Linux … but in the form factor of a workstation: a cluster workstation. (© W. Feng, Aug. 2007)
• LINPACK performance: 14 Gflops
• Footprint: 3 sq. ft. (24" × 18"); 1 cu. ft. (24" × 4" × 18")
• Power consumption: 170 watts at load
• How does this compare with a traditional desktop?

Inter-University Project: MegaScale (http://www.para.tutics.tut.ac.jp/megascale/)
[Photo: Univ. of Tsukuba booth @ SC2004, Nov. 2004]

IBM Blue Gene/L
Now the fastest supercomputer in the world.
[Figure: BG/L packaging hierarchy. Chip (2 processors): 2.8/5.6 GF/s, 4 MB. Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR. Node card (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR. Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR. System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR. © 2004 IBM Corporation]
• October 2003: BG/L half-rack prototype, 500 MHz, 512 nodes / 1024 processors, 2 TFlop/s peak, 1.4 TFlop/s sustained.

Software-Based Approach: Power-Aware Computing
• Self-Adapting Software for Energy Efficiency
  – Conserve power & energy WHILE maintaining performance.
• Observations
  – Many commodity technologies support dynamic voltage & frequency scaling (DVFS), which allows changes to the processor voltage and frequency at run time.
  – A computing system can trade off processor performance for power reduction.
    • Power ∝ V²f, where V is the supply voltage of the processor and f is its frequency.
    • Processor performance ∝ f.
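To see why these two proportionalities matter, here is a minimal sketch of the idealized tradeoff: if the supply voltage can be lowered roughly in step with frequency, dynamic power falls roughly as f³ while a purely CPU-bound task slows only as 1/f, so energy per task drops as f². Both scaling assumptions (V proportional to f, and fully CPU-bound work) are idealizations, not measurements from the talk.

```python
# Hedged sketch: idealized DVFS power/energy tradeoff.
# Assumes dynamic power P ∝ V^2 * f, voltage scaled with frequency (V ∝ f),
# and a fully CPU-bound task (runtime ∝ 1/f). Real systems are less ideal.

def relative_power(f_rel: float) -> float:
    """Power at frequency f relative to f_max (f_rel = f / f_max)."""
    v_rel = f_rel                   # idealization: voltage tracks frequency
    return (v_rel ** 2) * f_rel     # P ∝ V^2 * f, hence ∝ f^3 here

def relative_energy_cpu_bound(f_rel: float) -> float:
    """Energy per task relative to running at f_max, for CPU-bound work."""
    runtime_rel = 1.0 / f_rel       # performance ∝ f, so runtime ∝ 1/f
    return relative_power(f_rel) * runtime_rel   # ∝ f^2 under these assumptions

for f_rel in (1.0, 0.8, 0.5):
    print(f"f = {f_rel:.1f} * f_max: power ~{relative_power(f_rel):.2f}x, "
          f"runtime ~{1.0 / f_rel:.2f}x, energy ~{relative_energy_cpu_bound(f_rel):.2f}x")
# At half frequency: ~0.13x power, 2x runtime, ~0.25x energy for CPU-bound code.
```

The next slides make the stronger point: for memory- and I/O-bound codes the runtime penalty is far smaller than 1/f, so much of the energy saving comes with almost no performance loss.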
Power-Aware HPC via DVFS: Key Observation
• The execution time of many programs is insensitive to CPU speed changes.
[Plot: execution time (%) vs. CPU speed (GHz, 0.6-2.2) for the NAS IS benchmark. With the clock speed slowed down by 50%, performance is degraded by only 4%.]

Power-Aware HPC via DVFS: Key Idea
• Applying DVFS to these programs will result in significant power and energy savings at a minimal performance impact.
[Plot: execution time (%) vs. energy usage (%) for the NAS IS benchmark, showing the 2-GHz and 0.8-GHz operating points, a performance constraint, and the energy-optimal DVFS schedule.]

Why is Power Awareness via DVFS Hard?
• What is the cycle time of a processor?
  – Frequency ≈ 2 GHz → cycle time ≈ 1 / (2 × 10^9) s = 0.5 ns
• How long does the system take to scale voltage and frequency?
  – O(10,000,000 cycles)

Problem Formulation: LP-Based Energy-Optimal DVFS Schedule
• Definitions
  – A DVFS system exports n settings { (f_i, P_i) }.
  – T_i: total execution time of the program running at setting i.
• Given a program with deadline D, find a DVFS schedule (t_1*, …, t_n*) such that, if the program is executed for t_i seconds at setting i, the total energy usage E is minimized, the deadline D is met, and the required work is completed.

Performance Modeling
• Traditional Performance Model
  – T(f) = W / f, where T(f) (in seconds) is the execution time of a task running at frequency f and W (in cycles) is the amount of CPU work to be done.
• Problems?
  – W needs to be known a priori and is difficult to predict.
  – W is not always constant across frequencies.
  – It predicts that the execution time will double if the CPU speed is cut in half, which does not hold for memory- and I/O-bound codes.

Single-Coefficient β Performance Model
• Our Formulation
  – Define the relative performance slowdown δ(f) as T(f) / T(f_MAX) - 1.
  – Re-formulate the two-coefficient model as a single-coefficient model: T(f) / T(f_MAX) = 1 + β (f_MAX / f - 1), i.e., δ(f) = β (f_MAX / f - 1).
  – The coefficient β is computed at run time using a regression method on the past MIPS rates reported by the built-in PMU.
• C. Hsu and W. Feng, “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005.

Current DVFS Scheduling Algorithms
• 2step (i.e., CPUSPEED via SpeedStep)
  – Using a dual-speed CPU, monitor CPU utilization periodically.
  – If utilization > a pre-defined upper threshold, set the CPU to its fastest speed; if utilization < a pre-defined lower threshold, set the CPU to its slowest.
• nqPID: a refinement of the 2step algorithm.
  – Recognize the similarity between DVFS scheduling and a classical control-systems problem, and modify a PID (Proportional-Integral-Derivative) controller to suit the DVFS scheduling problem.
• freq: reclaims the slack time between the actual processing time and the worst-case execution time.
  – Track the amount of remaining CPU work W_left and the amount of remaining time before the deadline T_left.
  – Set the desired CPU frequency at each interval to f_new = W_left / T_left.
  – The algorithm assumes that the total amount of work in CPU cycles is known a priori, which, in practice, is often unpredictable and not always constant across frequencies.

β-Adaptation on Sequential Codes (SPEC)
[Chart: relative time and relative energy with respect to total execution time and system energy usage; SMALLER numbers are BETTER.]
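The following is a minimal sketch of an interval-based scheduler in the spirit of the β-adaptation runtime evaluated above. It assumes the single-coefficient model T(f)/T(f_MAX) = 1 + β(f_MAX/f - 1) from the cited SC|05 paper; the frequency table, the MIPS samples, and the least-squares fit are illustrative stand-ins for what the actual EnergyFit runtime gathers from the PMU.

```python
# Hedged sketch of beta-adaptation DVFS scheduling (not the EnergyFit code).
# Model (from the cited SC|05 paper): T(f)/T(f_max) = 1 + beta * (f_max/f - 1).
# beta ~ 1: CPU-bound (slows with f); beta ~ 0: memory/I/O-bound (insensitive).

FREQS_GHZ = [2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8]   # illustrative DVFS settings
F_MAX = max(FREQS_GHZ)

def estimate_beta(samples):
    """Least-squares fit of beta from (frequency, observed MIPS) samples.

    Since rate(f) = rate(f_max) / (1 + beta*(f_max/f - 1)), each sample gives
    y = rate(f_max)/rate(f) - 1 = beta * x with x = f_max/f - 1.
    Assumes one of the samples was taken at f_max.
    """
    rate_fmax = dict(samples)[F_MAX]
    xs, ys = [], []
    for f, rate in samples:
        if f == F_MAX:
            continue
        xs.append(F_MAX / f - 1.0)
        ys.append(rate_fmax / rate - 1.0)
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def pick_frequency(beta, max_slowdown):
    """Lowest frequency whose predicted slowdown stays within the constraint."""
    for f in sorted(FREQS_GHZ):                  # try the slowest settings first
        slowdown = beta * (F_MAX / f - 1.0)      # delta(f) from the model
        if slowdown <= max_slowdown:
            return f
    return F_MAX

# Example interval for a memory-bound phase: MIPS barely changes with frequency.
observed = [(2.0, 980.0), (1.4, 955.0), (0.8, 900.0)]   # (GHz, MIPS), illustrative
beta = estimate_beta(observed)
print(f"beta ~ {beta:.2f}; run the next interval at {pick_frequency(beta, 0.05)} GHz")
```

In the actual runtime, β is re-estimated every interval from the PMU-reported MIPS rates, so the chosen frequency tracks the program as it moves between compute-bound and memory-bound phases.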
NAS Parallel on an Athlon-64 Cluster
[Chart of results]

NAS Parallel on an Opteron Cluster
[Chart of results]

Opportunities Galore!
• Component Level
  – CPU, e.g., β (library vs. OS)
  – Memory
  – Disk
  – System
• Resource Co-Scheduling
  – Across systems, e.g., job scheduling
• Directed Monitoring
  – Hardware performance counters
• Abstraction Level
  – Application Transparent
    • Compiler: β (compile time)
    • Library: β (run time)
    • OS: EnergyFit, EcoDaemon
  – Application Modified (too intrusive)
  – Parallel Library Modified (e.g., P. Raghavan & B. Norris, IPDPS’05)

Selected References on Green Supercomputing: Available at http://sss.cs.vt.edu/
• “Green Supercomputing Comes of Age,” IT Professional, Jan.-Feb. 2008.
• “The Green500 List: Encouraging Sustainable Supercomputing,” IEEE Computer, Dec. 2007.
• “Green Supercomputing in a Desktop Box,” IEEE Workshop on High-Performance, Power-Aware Computing at IEEE IPDPS, Mar. 2007.
• “Making a Case for a Green500 List,” IEEE Workshop on High-Performance, Power-Aware Computing at IEEE IPDPS, Apr. 2006.
• “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005.
• “The Importance of Being Low Power in High-Performance Computing,” Cyberinfrastructure Technology Watch (NSF), Aug. 2005.
• “Green Destiny & Its Evolving Parts,” International Supercomputer Conference (Innovative Supercomputer Architecture Award), Apr. 2004.
• “Making a Case for Efficient Supercomputing,” ACM Queue, Oct. 2003.
• “Green Destiny + mpiBLAST = Bioinfomagic,” 10th Int’l Conf. on Parallel Computing (ParCo’03), Sept. 2003.
• “Honey, I Shrunk the Beowulf!,” Int’l Conf. on Parallel Processing, Aug. 2002.

Conclusion
• Recall that …
  – We need horsepower to compute, but horsepower also translates into electrical power consumption.
• Consequences
  – Electrical power costs $$$$.
  – “Too much” power affects efficiency, reliability, and availability.
• Solution
  – Autonomic energy and power savings while maintaining performance in the supercomputer and datacenter.
    • Low-Power Architectural Approach
    • Power-Aware Commodity Approach

Acknowledgments
• Contributions
  – Jeremy S. Archuleta (LANL & VT), former LANL intern, now a Ph.D. student
  – Chung-hsing Hsu (LANL), former LANL postdoc
  – Michael S. Warren (LANL), former LANL colleague
  – Eric H. Weigle (UCSD), former LANL colleague
• Sponsorship & Funding (Past)
  – AMD
  – DOE Los Alamos Computer Science Institute
  – LANL Information Architecture Project
  – Virginia Tech

Wu FENG, feng@cs.vt.edu
Synergy Laboratory
http://synergy.cs.vt.edu/
http://www.chrec.org/
http://www.mpiblast.org/
http://www.green500.org/
http://sss.cs.vt.edu/