Energy-efficient GPUs from mobile to supercomputers
Parallelism and Energy-efficient GPUs from Mobile to Supercomputers
Jem Davies, ARM Fellow, VP of Technology, Media Processing Division
24-Feb-13

First, some background…

The Eras of Computing
[Chart: Units (1M, 10M, 100M, 1 Billion, 10 Billion, 100 Billion) against year (1960 to 2020). Labels: 1st Era, 2nd Era; Mainframe, Mini, PC, Desktop Internet, Mobile Internet, The Internet of Things]

Merging Our Digital and Physical Worlds
- ARM® technology and the ARM ecosystem scale from sensors to servers, connecting our world, from the Internet of Things to supercomputers
- ARM’s partners make SoCs containing heterogeneous compute engines
- 8.7 billion ARM-based chips in 2012, from <50 cents to >$200
- We’ve always been in the multicore business

From 1mm³ to 1km³
- 1mm³ platform
- 8.75mm³ platform: solar cell, 0.18µm Cortex™-M3, 12µAh Li-ion battery
- University of Michigan
- Diagram labels: Battery; Solar Cells; Processor, SRAM and PMU
- 1km³ platform: 4200 ARM powered Neutrino Detectors; 70 bore holes 2.5km deep; 60 detectors per bore hole
- Work supported by the National Science Foundation and University of Wisconsin-Madison

Mobility Is Reshaping Behaviour
- 79% of iPad users use exclusively for Internet
- 36% 60% Facebook users access from mobile
- Tablets outsell laptop PCs
- 71% of enterprises plan to establish own store
- Chinese Android™ users spend 2+ hrs a day on mobile apps
- 1 billion smartphones will ship in 2013
- 50% of mobile computing devices in 2013 will be ARM-based
Sources: IDG; Pew Research; Symantec, “2012 State of Mobility Survey”; BI Intelligence, 2012; Guohe, 2012

Pushing the Boundaries of Mobile Computing
Delivering tomorrow’s features and capabilities
- Facial recognition and unlock
- Multi-tasking and multi-window productivity
- 2.5k/4k displays, multiple simultaneous displays
- Video & image editing, manipulation, and rendering
- Console-quality gaming experience
- Live image and scene recognition for augmented reality
- Intuitive gesture and natural language input
- 1080p HD video and beyond, multiple streams, HEVC

Interactive Mobile Devices Drive Graphics
- Graphics capability is a key factor in consumer purchasing decisions
- Rich graphics is a priority for anything with a screen
  - Smartphones, DTVs, STBs, tablets, hand-held games consoles, netbooks
  - In-car infotainment
- In the post-PC era, data from the cloud and graphically rich, interactive, connected devices mean a growing market for GPUs
- 4 billion internet-connected screens in 2016, most with embedded graphics

Solving the Graphics Market Challenges

Performance/Power is the Big Challenge
[Chart: Relative Power. Annotation: ARM GPU and System savings of 35% annually]
- Requirements on the GPU continue to grow exponentially but still have to fit within constant power boundaries
- Mali GPU power already in mobile power budget; 35% additional energy efficiency improvements required every year to fit new performance requirements within SoC thermal limits

ARM Mali GPU Momentum
Delivering Tomorrow’s Features and Capabilities
- Over 150M Mali™-based GPUs reported in 2012
- #1 graphics in Android tablets (>50%)
- >20% Android smartphones
- >230 OEM products shipping with Mali GPUs
- 75 licences across 54 partners, making Mali-based SoCs
- #1 in graphics-enabled DTV (>70%)

ARM is not just about Mobile…
…but Mobile has been driving a lot of innovation and enabling diversity

Mobile Revolution Built On Diversity
[Diagram labels, grouped by kind]
- Process: 40LP, 32 HKMG, 20nm, 14/16 FinFET, SOI
- CPUs: Cortex-A5, Cortex-A7, Cortex-A9, Cortex-A15, Cortex-A53, Cortex-A57
- GPUs: Mali-400, Mali-450, Mali-T604, Mali-T628, Mali-T678
- Other blocks: Connectivity, Video, Memory, Modem
- Operating systems: Android, Windows Phone, Firefox Mobile, Chrome OS, Windows RT, Aliyun OS
- Devices: Reference Design, Superphone, Camera, Mass-market smartphone, Clamshell, Tablet

Post-PC World Built on Diversity
- Rapid pace of development has embraced numerous changes, built on a diverse, nimble, fast-responding ecosystem of partners
- One size does NOT fit all (nor should it!)
  - Not just desktops and laptops any more
  - New form factors and new usage models: phones, tablets, phablets, netbooks, hybrid tablets, TVs, consoles, Chromebooks, watches!
  - Always-on, always-connected, long battery life
  - Different form factors in different regions
  - Diverse silicon foundry ecosystem: bulk CMOS, FDSOI, FinFET
- ARM’s partners are driven by innovation and user requirements, not by ARM
- ARM shows technical leadership but embraces diversity – we do not pick winners!

big Performance, LITTLE Power
[Chart axis: Performance]
- big: Highest Performance
- Smartphone needs to be very responsive
- Mobile performance needs are highly elastic
- Constant connectivity and usage
- Voice, SMS, trickle feeds are low-intensity tasks
- LITTLE: Highest Efficiency

GPU Compute Making the Difference
[Diagram centred on GPU Computing]
- Trends: Heterogeneous computing; Portability; Parallel computation; Hardware acceleration
- Benefits: More energy-efficient processing; BOM reduction; Improved accuracy/quality; Improved existing use cases; Unlock new use cases
- Multi-Perspective Vision; 2D to 3D; Multi-User Interaction; Computer Vision; Real Time Still and Moving Image Perfection; Computational Photography; Information Extraction; Light-Field Photography; Upscaling

Unifying CPU and GPU Compute
- High efficiency approaches for:
  - Computational photography
  - Immersive visual computing
  - Augmented Reality
  - Improved user interfaces
- Right-sized computing: GPU for highly-threaded tasks and CPU for low-threaded tasks
- Efficient ARM IP synergy makes it worthwhile to move even small tasks to the GPU
- Full Profile GPU Compute is key to enabling portability

Continued Performance & Efficiency Gains

Energy Efficiency Underlies It All
- Efficiency is a requirement for a connected world
- ARM has always understood this – others are now catching on

Energy-efficiency Has Always Mattered
- Segment differentiation matters less than you think
- High-end Mobile driven by thermal constraints
  - And “You can never be too thin”
- Sensors need 10 years from a button cell battery
- Everybody hates fans
- Desktops struggle to:
  - Dissipate more than ~150W out of a single chip package
  - Supply more than ~300W from a PSU onto a PCI card
- Servers struggle:
  - To get more than ~10kW into a rack
  - To get more than ~500kW of heat out of a shipping container
  - Spending more on power and cooling than on hardware
  - To buy more power from the grid

GPUs from Mobile to Supercomputer
Mont-Blanc project selects Samsung Exynos 5 Processor - 13 Nov 2012
The project continues its research effort towards an energy efficient HPC prototype using low-power embedded technology
Salt Lake City, 13th November 2012. The Mont-Blanc European project has selected the Samsung Exynos platform as the building block for powering its first integrated low-power High Performance Computing (HPC) prototype. The aim of the Mont-Blanc project is to design a new type of computer architecture capable of setting future global HPC standards, built from today’s energy efficient solutions used in embedded and mobile devices. The Samsung Exynos 5 Dual is built on 32nm low-power HKMG (High-K Metal Gate), and features a dual-core 1.7GHz mobile CPU built on ARM® Cortex™-A15 architecture plus an integrated ARM Mali™-T604 GPU for increased performance density and energy efficiency.
It has been featured and market-proven in consumer and mobile devices such as Samsung Chromebook and Google’s Nexus 10.

The Search for Parallelism
- ARM (like the industry) utilizes parallelism across all segments, across multiple devices to drive down energy use
- CPUs use instruction-level parallelism: modern ARM CPUs have IPC well over 3
- CPUs and OSs exploit task-level parallelism: fast context swap, MMUs, virtual memory, hypervisors
- Graphics is one of the few problem spaces that have very high levels of parallelism:
  - Do <foo> on every vertex in the frame (10s - 100s of kvertices/frame)
  - Do <bar> on every pixel in the frame (millions of pixels/frame)
  - Data-level parallelism through thread-level parallelism
- GPU compute (throughput architectures/implementations) has great advantages: performance and energy efficiency
- CPUs remain “easier” to program
- Do GPUs merge with CPUs? Maybe not yet, but connections between them become ever more important

Amdahl’s Law is Alive and Well
- Speedup on parallel processors is limited by the sequential portion of the program
- Sequential portion need not be large to constrain speedup significantly!
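The two statements on the Amdahl's Law slide can be checked numerically. The sketch below is a minimal illustration in Python, assuming the standard formulation of Amdahl's Law, speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction of the program and n is the number of cores. The function name `amdahl_speedup` is hypothetical and not from the slides; the parallel fractions mirror the curve labels on the slide's chart.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup under Amdahl's Law (standard formulation).

    parallel_fraction: share of the run time that can be parallelised (0..1).
    cores: number of processors working on the parallel part (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    sequential_fraction = 1.0 - parallel_fraction
    # The sequential part is not sped up; the parallel part is divided across cores.
    return 1.0 / (sequential_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Fractions taken from the chart's curve labels; 32 cores is the chart's upper limit.
    for p in (1.00, 0.95, 0.90, 0.75, 0.50):
        print(f"{p:.0%} parallel: {amdahl_speedup(p, 32):.2f}x on 32 cores")
```

On 32 cores, a program that is 95% parallel tops out at roughly 12.5x, and one that is 50% parallel stays below 2x however many cores are added, which is the sense in which a small sequential portion constrains speedup.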
[Chart: Maximum speedup against Number of cores (0 to 32), with curves for 100%, 95%, 90%, 75% and 50% parallel]

ARM’s Heterogeneous Computing Advantages
- ARM has always been involved in heterogeneous systems
  - For 23 years, mobile SoCs have contained CPUs, DSPs and other compute elements
- big.LITTLE™ and GPU Computing
  - ARM’s first multicore CPU was produced 10 years ago
  - ARM has produced several generations of controllers, baseband, WiFi, coherent interconnect
  - ARM introduced the world’s first embedded multicore GPU
  - Now have TFLOP GPUs
- Right task, right processor
  - Today’s modern mobile SoC contains ISPs, DSPs, GPUs, baseband accelerators, crypto accelerators and VPUs
  - “Helper” ARM CPUs, security, power
  - Heterogeneous, multicore applications CPUs
  - ARM is introducing multicore VPUs
- All within today’s energy budget…

Improving the Granule of Offload
Hardware: sharing data
- Traditional CPU -> GPU was unidirectional offload
  - GPU is I/O-coherent with CPU
    ◦ GPU reads snoop CPU data
  - Frame buffer is not cached
    ◦ Too big, especially with multi-buffering
  - Typically large, pre-arranged buffers
    ◦ Set up by a driver
- GPU Compute is much more bidirectional
  - GPU needs to be fully coherent with the CPU
  - Smaller-scale sharing of data
    ◦ Exchanging pointers to data structures
    ◦ Directly referred to by the application
    ◦ Avoid driver interaction
[Diagram labels: CPU, Display processor, GPU, Object Database, Texture Database, Frame Buffer; CPU, GPU]

CPU + GPU Heterogeneous benefits
- ARM leadership in system coherency is key to enable heterogeneous computing and the future of graphics
- GPU participating in system coherency enables a broad set of benefits and values
- Avoids unnecessary cache flush/invalidate operations
  - Maintain warm caches: lower power, higher performance
- Avoids DRAM access to data present in CPU caches
  - Lower power by elimination of off-chip memory accesses
- Eliminates software-managed cache coherency
  - Greater efficiency, reduced complexity
  - Faster TTM by removing source of complex synchronization bugs
- Reduced cost of sharing data between CPU and GPU
  - Widens the set of algorithms ripe for GPU acceleration
[Diagram labels: Up to Quad big CPUs, L2; Up to Quad LITTLE CPUs, L2; Up to 8 Mali™-T678, L2; CoreLink™ CCN-504 Cache Coherent Network with AMBA® 4 ACE Interfaces; NIC Network interconnect; CoreLink DMC 520; System and I/O]

What is ARM doing to help?

Building Smarter Systems
- Visual computing is about designing the best compute systems to match the competing demands of power and performance
- ARM provides the four fundamental IP building blocks needed to enable high-performance, energy efficient, coherent System-on-Chips
- ARM’s unique experience and portfolio of processors, memory systems, physical IP and interconnect designed together
- ARM’s leading system IP extends coherency for optimized multicore solutions
- Heterogeneous computing for optimum energy efficiency

ARMv8: 64-bit Processing Now Available In Mobile Power
- 64-bit processing and full backward compatibility with 32-bit
- Mobile devices will need more than 4GBytes
- Scalable power and performance with big.LITTLE
- Sub 1mm² processors
- Co-designed with Mali-T600 GPUs
- Designed to deliver all your computing needs
[Diagram labels: ARMv8; 64-bit; 32-bit; ARM v7A Software; CRYPTO; Scalar FP; Advanced SIMD]

Complexity
- Out-of-order superscalar CPUs introduce some complexity
- Multicore CPUs/GPUs – more complexity
  - Memory consistency model
  - Threading models
- Heterogeneous computing – more complexity
- Some GPU compute engines are:
  - Complex
  - Badly described
  - Difficult to reason about
- Most graphics developers want to create visually stunning content, not argue about the finer points of computer science

Complexity drives the need for standards
- If we do not abstract away from this complexity and present a simpler world to the developer…
  - Either they won’t use these new systems
  - Or, they’ll use them and get it wrong
  - Either way, it will be our fault, and they will hate us for it
- If we all provide different abstractions…
  - They will hate us for it
- We need standard(s)
  - And a very small number of them

ARM and HSA
- Heterogeneous System Architecture (HSA) Foundation
  - Not-for-profit consortium founded by industry leaders to create hardware and software standards for heterogeneous computing to:
    ◦ Simplify the programming environment
    ◦ Make compute at low power pervasive
    ◦ Introduce new capabilities in modern computing devices
  - ARM is a founding member and board member
- Heterogeneous computing is the essence of efficient computation: the right processor for the right task
- More research required – not a solved problem
  - Better models, compilation technologies, debug, programming methodologies
  - We need to make it easier to use – still seen as a “black art”

Diverse Partners Driving Future of Heterogeneous Computing
[Member logos grouped by tier: Founders, Promoters, Supporters, Contributors, Academic]
© Copyright 2012 HSA Foundation. All Rights Reserved.

A Call to the ARM Partnership for Action

Towards the Future
Don’t miss opportunities. Push the boundaries!
- Innovation in the post-PC era requires:
  - Focus on energy-efficiency across diverse market segments with diverse devices
  - Continue to utilize parallelism (of all forms)
- Compute on the right device
  - Highest efficiency CPU for required performance
  - Use the right programmable device for the right task, such as GPU computing
  - Domain-specific processors for specific tasks
- Not just a hardware problem, need more software investment:
  - Operating systems, hypervisors, managed environments, APIs, tools
  - Application developers across the ecosystem
- We need new standards, but there will be different paths to follow
  - One size does NOT fit all
- ARM committed to heterogeneous, parallel computing

Thank you 谢谢

Questions?