Introduction to Parallel Computing

Transcription

CPE 779 Parallel Computing
http://www1.ju.edu.jo/ecourse/abusufah/cpe779_Spr12/index.html
Lecture 1: Introduction
Walid Abu-Sufah
University of Jordan
CPE 779 Parallel Computing - Spring 2012
Acknowledgment: Collaboration
This course is being offered in collaboration with
• The IMPACT research group at the University of Illinois
http://impact.crhc.illinois.edu/
• The Computation based Science and Technology Research
Center (CSTRC) of the Cyprus Institute http://cstrc.cyi.ac.cy/
Acknowledgment: Slides
Some of the slides used in this course are based on slides
by
• Jim Demmel, University of California at Berkeley & Horst
Simon, Lawrence Berkeley National Lab (LBNL)
http://www.cs.berkeley.edu/~demmel/cs267_Spr12/
• Wen-mei Hwu of the University of Illinois and David Kirk,
Nvidia Corporation
http://courses.engr.illinois.edu/ece408/ece408_syll.html
• Kathy Yelick, University of California at Berkeley
http://www.cs.berkeley.edu/~yelick/cs194f07
Course Motivation
In the last few years:
• Conventional sequential processors cannot get
faster
• Previously clock speed doubled every 18 months
• All computers will be parallel
• >>> All programs will have to become parallel
programs
• Especially programs that need to run faster.
Course Motivation (continued)
There will be a huge change in the entire computing
industry
• Previously the industry depended on selling new
computers by running their users' programs
faster without the users having to reprogram
them.
• Multi/many-core chips have started a revolution
in the software industry
Course Motivation (continued)
Large research activities to address this issue
are underway
• Computer companies: Intel, Microsoft, Nvidia,
IBM, etc.
• Parallel programming is a concern for the entire
computing industry.
• Universities
• Berkeley's ParLab (2008: $20 million grant)
Course Goals
Part 1 (~4 weeks)
• focus on the techniques that are most appropriate
for multicore programming and the use of parallelism
to improve program performance. Topics include
• performance analysis and tuning
• data techniques
• shared data structures
• load balancing and task parallelism
• synchronization
Course Goals (continued - I)
Part 2 (~ 12 weeks)
• Learn how to program massively parallel processors
and achieve
• high performance
• functionality and maintainability
• scalability across future generations
• Acquire technical knowledge required to achieve the
above goals
• principles and patterns of parallel algorithms
• processor architecture features and constraints
• programming API, tools and techniques
Outline of rest of lecture
• Why powerful computers must use parallel processors
Including your laptops and handhelds
• Examples of Computational Science and Engineering (CSE)
problems which require powerful computers
Commercial problems too
• Why writing (fast) parallel programs is hard
But things are improving
• Principles of parallel computing performance
• Structure of the course
What is Parallel Computing?
• Parallel computing: using multiple processors in parallel to
solve problems (execute applications) more quickly than with
a single processor
• Examples of parallel machines:
• A cluster computer that contains multiple PCs combined together with
a high speed network
• A shared memory multiprocessor (SMP*) built by connecting multiple
processors to a single memory system
• A Chip Multi-Processor (CMP) contains multiple processors (called
cores) on a single chip
• Concurrent execution comes from the desire for performance
• * Technically, SMP stands for “Symmetric Multi-Processor”
Units of Measure
• High Performance Computing (HPC) units are:
• Flop: floating point operation
• Flops/s: floating point operations per second
• Bytes: size of data (a double precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions…
Prefix   Rate                         Data size
Mega     Mflop/s = 10^6 flop/sec      Mbyte = 2^20 = 1048576 ~ 10^6 bytes
Giga     Gflop/s = 10^9 flop/sec      Gbyte = 2^30 ~ 10^9 bytes
Tera     Tflop/s = 10^12 flop/sec     Tbyte = 2^40 ~ 10^12 bytes
Peta     Pflop/s = 10^15 flop/sec     Pbyte = 2^50 ~ 10^15 bytes
Exa      Eflop/s = 10^18 flop/sec     Ebyte = 2^60 ~ 10^18 bytes
Zetta    Zflop/s = 10^21 flop/sec     Zbyte = 2^70 ~ 10^21 bytes
Yotta    Yflop/s = 10^24 flop/sec     Ybyte = 2^80 ~ 10^24 bytes
• Current fastest (public) machine ~ 11 Pflop/s
• Up-to-date list at www.top500.org
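The binary byte units above are only approximately equal to their decimal counterparts, and the mismatch grows with each prefix. A minimal Python sketch tabulates the ratio; the names `PREFIXES` and `binary_vs_decimal` are illustrative and not part of the course material.

```python
# Illustrative only: compare binary units (2^(10k)) with decimal units (10^(3k)).
PREFIXES = ["Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]

def binary_vs_decimal():
    """Return (prefix, binary size, decimal size, ratio) for each prefix."""
    rows = []
    # Mega is the 2nd power of 1000 (and of 1024), so counting starts at k = 2.
    for k, name in enumerate(PREFIXES, start=2):
        binary = 2 ** (10 * k)
        decimal = 10 ** (3 * k)
        rows.append((name, binary, decimal, binary / decimal))
    return rows

for name, binary, decimal, ratio in binary_vs_decimal():
    exponent = binary.bit_length() - 1  # recovers the power of two
    print(f"{name:5s} 2^{exponent} bytes is {ratio:.3f} x the decimal unit")
```

The ratio starts just under 5% above 1 for Mega and exceeds 20% for Yotta, which is why the "~" in the list should not be read as equality.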
Why powerful computers are parallel
Technology Trends: Microprocessor Capacity
2X transistors/Chip Every 1.5 years
Called “Moore’s Law”
Microprocessors have
become smaller, denser,
and more powerful.
Gordon Moore (co-founder of
Intel) predicted in 1965 that the
transistor density of
semiconductor chips would
double roughly every 18 months.
Slide source: Jack Dongarra
Microprocessor Transistors / Clock (1970-2000)
[Figure: transistors (thousands) and frequency (MHz) versus year, 1970-2000, log scale]
Impact of Device Shrinkage
• What happens when the feature size (transistor size) shrinks by a
factor of x ?
• Clock rate goes up by x because wires are shorter
• actually less than x, because of power consumption
• Transistors per unit area goes up by x^2
• Die size also tends to increase
• typically another factor of ~x
• Raw computing power of the chip goes up by ~ x^4 !
• typically x^3 is devoted to either on-chip
• parallelism: hidden parallelism such as ILP
• locality: caches
• So most programs x^3 times faster, without changing them
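The arithmetic on this slide can be checked with a short Python sketch. It uses the idealized factors stated above (clock up by x, density up by x squared, die size up by x); the function name `shrink_effects` and the dictionary keys are made up for illustration.

```python
def shrink_effects(x):
    """Idealized scaling factors when the feature size shrinks by a factor x.

    These follow the slide's rules of thumb; real chips gain less than x in
    clock rate because of power consumption.
    """
    clock = x                      # shorter wires: clock rate up by ~x
    density = x ** 2               # transistors per unit area
    die_growth = x                 # die size tends to grow by another ~x
    transistors = density * die_growth
    raw_compute = clock * transistors
    return {
        "clock": clock,
        "transistors": transistors,   # ~x^3, spent on ILP and caches
        "raw_compute": raw_compute,   # ~x^4
    }

print(shrink_effects(2))
```

For x = 2 the sketch gives 8 times the transistors and 16 times the raw computing power, matching the x^3 and x^4 figures in the bullets.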
Power Density Limits Serial Performance
• Concurrent systems are more power efficient
– Dynamic power is proportional to V^2 f C
– Increasing frequency (f) also increases supply voltage (V), so the effect is cubic
– Increasing cores increases capacitance (C) but only linearly
– Save power by lowering clock speed
• High performance serial processors waste power
- Speculation, dynamic dependence checking, etc. burn power
- Implicit parallelism discovery
• More transistors, but not faster serial processors
[Figure: "Scaling clock speed (business as usual) will not work". Power density (W/cm2) versus year, 1970-2010, log scale, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, and P6, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun’s surface. Source: Patrick Gelsinger, Shekhar Borkar, Intel]
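A rough Python sketch of the power argument on this slide. It assumes dynamic power equals C * V^2 * f and, as a simplifying assumption, that supply voltage scales linearly with frequency; the function `dynamic_power` and its parameters are hypothetical and use arbitrary units.

```python
def dynamic_power(cores, freq, base_freq=1.0, base_voltage=1.0, cap_per_core=1.0):
    """Dynamic power in arbitrary units, P = C * V^2 * f.

    Assumption (not a measured law): V scales linearly with f, and total
    capacitance C grows linearly with the number of cores.
    """
    voltage = base_voltage * (freq / base_freq)
    capacitance = cap_per_core * cores
    return capacitance * voltage ** 2 * freq

# Two ways to double nominal throughput relative to one core at frequency 1.0:
faster_clock = dynamic_power(cores=1, freq=2.0)   # double the clock
more_cores = dynamic_power(cores=2, freq=1.0)     # double the cores
print(faster_clock, more_cores, faster_clock / more_cores)
```

Under these assumptions doubling the clock costs 8 units of power while doubling the cores costs 2, provided the work can actually be split across both cores.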
Revolution in Processors
[Figure: transistors (thousands), frequency (MHz), power (W), and cores versus year, 1970-2010, log scale]
• Chip density is continuing to increase ~2x every 2 years
• Clock speed is not
• Number of processor cores may double instead
• Power is under control, no longer growing
Parallelism in 2012?
• These arguments are no longer theoretical
• All major processor vendors are producing multicore chips
• Every machine will soon be a parallel machine
• To keep doubling performance, parallelism must double
• Which (commercial) applications can use this parallelism?
• Do they have to be rewritten from scratch?
• Will all programmers have to be parallel programmers?
• New software model needed
• Try to hide complexity from most programmers – eventually
• In the meantime, need to understand it
• Computer industry betting on this big change, but does not have
all the answers
• Berkeley ParLab established to work on this
Memory is Not Keeping Pace
Technology trends against a constant or increasing memory per core
• Memory density is doubling every three years; processor logic is doubling every two
• Storage costs (dollars/Mbyte) are dropping gradually compared to logic costs
[Figure: Cost of Computation vs. Memory. Source: David Turek, IBM]
Question: Can you double concurrency without doubling memory?
• Strong scaling: fixed problem size, increase number of processors
• Weak scaling: grow problem size proportionally to number of
processors
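The two scaling regimes answer the memory question differently. A small Python sketch, with made-up function names and arbitrary memory units, shows how memory per processor and total memory change as processors are added.

```python
def strong_scaling(problem_size, processor_counts):
    """Fixed problem size: return (P, memory needed per processor)."""
    return [(p, problem_size / p) for p in processor_counts]

def weak_scaling(size_per_processor, processor_counts):
    """Problem grows with P: return (P, total memory needed)."""
    return [(p, size_per_processor * p) for p in processor_counts]

counts = [1, 2, 4, 8]
print("strong:", strong_scaling(8, counts))  # per-processor share shrinks
print("weak:  ", weak_scaling(2, counts))    # total memory grows with P
```

In this simple model, strong scaling doubles concurrency without doubling memory, since each processor holds a smaller share, while weak scaling needs total memory to grow in step with the processor count.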
The TOP500 Project
• Listing the 500 most powerful computers in
the world
• Yardstick: Rmax of Linpack
• Solve Ax=b, dense problem, matrix is random
• Dominated by dense matrix-matrix multiply
• Updated twice a year:
• ISC’xy in June in Germany
• SCxy in November in the U.S.
• All information available from the TOP500
web site at: www.top500.org
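To make the yardstick concrete, here is a toy Python sketch of a Linpack-style measurement: solve a random dense system Ax=b and divide an operation count by the elapsed time. It is not the benchmark itself. It uses plain unblocked Gaussian elimination rather than the matrix-multiply-dominated blocked algorithm, counts only the leading-order 2n^3/3 flops, and the names `solve_dense` and `linpack_like_rate` are hypothetical.

```python
import random
import time

def solve_dense(a, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        pivot = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[pivot] = m[pivot], m[k]
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def linpack_like_rate(n, seed=0):
    """Return (flop/s estimate, max residual) for a random n x n system."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = time.perf_counter() - start
    flops = 2.0 * n ** 3 / 3.0  # leading-order operation count only
    residual = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i])
                   for i in range(n))
    return flops / elapsed, residual

rate, residual = linpack_like_rate(60)
print(f"{rate / 1e6:.1f} Mflop/s, residual {residual:.2e}")
```

Interpreted Python will report a rate many orders of magnitude below the Pflop/s figures on the list; the point is only the shape of the measurement.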
38th List: The TOP10
Rank | Site | Manufacturer | Computer | Country | Cores | Rmax [Pflops] | Power [MW]
1 | RIKEN Advanced Institute for Computational Science | Fujitsu | K Computer, SPARC64 VIIIfx 2.0GHz, Tofu Interconnect | Japan | 705,024 | 10.51 | 12.66
2 | National SuperComputer Center in Tianjin | NUDT | Tianhe-1A, NUDT TH MPP, Xeon 6C, NVidia, FT-1000 8C | China | 186,368 | 2.566 | 4.04
3 | Oak Ridge National Laboratory | Cray | Jaguar, Cray XT5, HC 2.6 GHz | USA | 224,162 | 1.759 | 6.95
4 | National Supercomputing Centre in Shenzhen | Dawning | Nebulae, TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU | China | 120,640 | 1.271 | 2.58
5 | GSIC, Tokyo Institute of Technology | NEC/HP | TSUBAME-2, HP ProLiant, Xeon 6C, NVidia, Linux/Windows | Japan | 73,278 | 1.192 | 1.40
6 | DOE/NNSA/LANL/SNL | Cray | Cielo, Cray XE6, 8C 2.4 GHz | USA | 142,272 | 1.110 | 3.98
7 | NASA/Ames Research Center/NAS | SGI | Pleiades, SGI Altix ICE 8200EX/8400EX | USA | 111,104 | 1.088 | 4.10
8 | DOE/SC/LBNL/NERSC | Cray | Hopper, Cray XE6, 6C 2.1 GHz | USA | 153,408 | 1.054 | 2.91
9 | Commissariat a l'Energie Atomique (CEA) | Bull | Tera 100, Bull bullx super-node S6010/S6030 | France | 138,368 | 1.050 | 4.59
10 | DOE/NNSA/LANL | IBM | Roadrunner, BladeCenter QS22/LS21 | USA | 122,400 | 1.042 | 2.34
Performance Development
[Figure: TOP500 performance development, log scale from 100 Mflop/s to 100 Pflop/s. SUM: 1.17 TFlop/s rising to 74.2 PFlop/s. N=1: 59.7 GFlop/s rising to 10.51 PFlop/s. N=500: 400 MFlop/s rising to 50.9 TFlop/s.]
Projected Performance Development
[Figure: projected TOP500 performance development for SUM, N=1, and N=500, log scale from 100 Mflop/s to 1 Eflop/s]
Core Count
Moore’s Law reinterpreted
• Number of cores per chip can double every
two years
• Clock speed will not increase (possibly
decrease)
• Need to deal with systems with millions of
concurrent threads
• Need to deal with inter-chip parallelism as well
as intra-chip parallelism
Outline
• Why powerful computers must be parallel processors
• Large CSE problems require powerful computers
Drivers for Change
• Continued exponential increase in computational power
• Can simulate what theory and experiment can’t do
• Continued exponential increase in experimental data
• Moore’s Law applies to sensors too
• Need to analyze all that data
Simulation: The Third Pillar of Science
• Traditional scientific and engineering method:
(1) Do theory or paper design
(2) Perform experiments or build system
• Limitations:
– Too difficult: build large wind tunnels
– Too expensive: build a throw-away passenger jet
– Too slow: wait for climate or galactic evolution
– Too dangerous: weapons, drug design, climate experimentation
• Computational science and engineering paradigm:
(3) Use computers to simulate and analyze the phenomenon
• Based on known physical laws and efficient numerical methods
• Analyze simulation results with computational tools and methods
beyond what is possible manually
[Figure: Theory, Experiment, Simulation]
Data Driven Science
• Scientific data sets are growing exponentially
- Ability to generate data is exceeding our ability to
store and analyze
- Simulation systems and some observational
devices grow in capability with Moore’s Law
• Petabyte (PB) data sets will soon be common:
• Climate modeling: estimates of the next IPCC
data are in the tens of petabytes
• Genome: JGI alone will have 0.5 petabyte of data
this year and double each year
• Particle physics: LHC is projected to produce 16
petabytes of data per year
• Astrophysics: LSST and others will produce 5
petabytes/year (via 3.2 Gigapixel camera)
• Create scientific communities with “Science
Gateways” to data
Some Particularly Challenging Computations
• Science
• Global climate modeling
• Biology: genomics; protein folding; drug design
• Astrophysical modeling
• Computational Chemistry
• Computational Material Sciences and Nanosciences
• Engineering
• Semiconductor design
• Earthquake and structural modeling
• Computational fluid dynamics (airplane design)
• Combustion (engine design)
• Crash simulation
• Business
• Financial and economic modeling
• Transaction processing, web services and search engines
• Defense
• Nuclear weapons -- test by simulations
• Cryptography
Economic Impact of HPC
• Airlines:
• System-wide logistics optimization systems on parallel systems.
• Savings: approx. $100 million per airline per year.
• Automotive design:
• Major automotive companies use large systems (500+ CPUs) for:
• CAD-CAM, crash testing, structural integrity and aerodynamics.
• One company has 500+ CPU parallel system.
• Savings: approx. $1 billion per company per year.
• Semiconductor industry:
• Semiconductor firms use large systems (500+ CPUs) for
• device electronics simulation and logic validation
• Savings: approx. $1 billion per company per year.
• Energy
• Computational modeling improved performance of current nuclear power
plants, equivalent to building two new power plants.
$5B World Market in Technical Computing in 2004
[Figure: market share by segment (0% to 100%), 1998-2003. Segments: Biosciences; Chemical Engineering; Classified Defense; Digital Content Creation and Distribution; Economics/Financial; Electrical Design/Engineering Analysis; Geoscience and Geoengineering; Imaging; Mechanical Design and Drafting; Mechanical Design/Engineering Analysis; Scientific Research and R&D; Simulation; Technical Management and Support; Other]
Source: IDC 2004, from NRC Future of Supercomputing Report
Why writing (fast) parallel programs is hard
Principles of Parallel Computing
• Finding enough parallelism (Amdahl’s Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling
All of these things make parallel programming
harder than sequential programming.
“Automatic” Parallelism in Modern Machines
• Bit level parallelism
• within floating point operations, etc.
• Instruction level parallelism (ILP)
• multiple instructions execute per clock cycle
• Memory system parallelism
• overlap of memory operations with computation
• OS parallelism
• multiple jobs run in parallel on commodity SMPs
Limits to all of these -- for very high performance, need
user to identify, schedule and coordinate parallel tasks
Finding Enough Parallelism: Amdahl’s Law
T1 = execution time using 1 processor (serial execution time)
TP = execution time using P processors
S = serial fraction of computation (i.e. fraction of computation
which can only be executed using 1 processor)
C = fraction of computation which could be executed by P
processors
Then S + C = 1 and
TP = S * T1 + (T1 * C)/P = (S + C/P) * T1
Speedup = Ψ(P) = T1/TP = 1/(S + C/P) <= 1/S
• Maximum speedup (i.e. as P → ∞) is 1/S; example: S = 0.05 gives a
maximum speedup of 20
• Currently the fastest machine has 705K processors; 2nd fastest
has ~186K processors + GPUs
• Even if the parallel part speeds up perfectly, performance is limited
by the sequential part
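The speedup formula is easy to evaluate numerically. A minimal Python sketch follows; the function name `amdahl_speedup` is illustrative, and the processor counts are arbitrary except for 705,000, which echoes the machine size quoted above.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup T1/TP = 1 / (S + C/P), with C = 1 - S."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

# With S = 0.05 the speedup approaches, but never reaches, 1/S = 20.
for p in (1, 16, 1024, 705_000):
    print(f"P = {p:>7d}  speedup = {amdahl_speedup(0.05, p):.3f}")
```

Even a 5% serial fraction caps the benefit of several hundred thousand processors at a factor of 20, which is why finding enough parallelism comes first in the list of principles.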
Speedup Barriers: (a) Overhead of Parallelism
• Given enough parallel work, overhead is a big barrier to getting
desired speedup
• Parallelism overheads include:
• cost of starting a thread or process
• cost of communicating shared data
• cost of synchronizing
• extra (redundant) computation
• Each of these can be in the range of milliseconds (=millions of
flops) on some systems
• Tradeoff: Algorithm needs sufficiently large units of work to run
fast in parallel (i.e. large granularity), but not so large that there
is not enough parallel work
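The granularity tradeoff can be illustrated with a deliberately simple cost model in Python. The model is an assumption made for illustration: the work splits into equal tasks, each task pays a fixed overhead, and tasks run in rounds of P at a time. The name `parallel_time` and the numbers are hypothetical.

```python
import math

def parallel_time(total_work, n_tasks, processors, overhead_per_task):
    """Run time (seconds) under a toy model with a fixed overhead per task."""
    rounds = math.ceil(n_tasks / processors)   # tasks run P at a time
    return rounds * (total_work / n_tasks + overhead_per_task)

# 1 second of work, 8 processors, 1 ms overhead per task.
for n in (1, 8, 64, 8000):
    print(f"{n:>5d} tasks: {parallel_time(1.0, n, 8, 0.001):.4f} s")
```

In this model one huge task leaves seven processors idle, thousands of tiny tasks spend most of their time on overhead, and the best run time sits in between.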
Speedup Barriers: (b) Working on Non Local Data
[Figure: conventional storage hierarchy. Each processor (Proc) has its own cache, L2 cache, L3 cache, and memory, with potential interconnects between them.]
• Large memories are slow, fast memories are small
• Parallel processors, collectively, have large, fast cache
• the slow accesses to “remote” data we call “communication”
• Algorithm should do most work on local data
Processor-DRAM Gap (latency)
Goal: find algorithms that minimize communication, not necessarily arithmetic
[Figure: performance versus time, 1980-2000, log scale. CPU (µProc, “Moore’s Law”) improves 60%/yr.; DRAM improves 7%/yr.; the processor-memory performance gap grows 50% / year.]
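The 50% per year figure follows from the two growth rates: 1.60 / 1.07 is about 1.495. A short Python check, with a made-up function name, compounds that ratio over several years.

```python
def relative_gap(years, cpu_growth=0.60, dram_growth=0.07):
    """Factor by which processor performance outgrows DRAM after `years`."""
    yearly_ratio = (1.0 + cpu_growth) / (1.0 + dram_growth)
    return yearly_ratio ** years

for years in (1, 5, 10, 20):
    print(f"after {years:>2d} years the gap has grown {relative_gap(years):,.1f}x")
```

Compounded over the 20 years shown in the figure, the ratio exceeds a factor of a thousand, which motivates the goal of minimizing communication rather than arithmetic.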
Speedup Barriers: (c) Load Imbalance
• Load imbalance occurs when some processors in the system
are idle due to
• insufficient parallelism (during that phase)
• unequal size tasks
• Algorithm needs to balance load
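A small Python sketch of the effect of unequal task sizes. It compares a naive round-robin assignment with a greedy longest-task-first heuristic; the heuristic is a common textbook choice brought in here for illustration and is not something the slides prescribe. All names and task sizes are hypothetical.

```python
def makespan(assignment):
    """Finish time of the busiest processor."""
    return max(sum(tasks) for tasks in assignment)

def round_robin(tasks, processors):
    """Deal tasks out in order, ignoring their sizes."""
    bins = [[] for _ in range(processors)]
    for i, t in enumerate(tasks):
        bins[i % processors].append(t)
    return bins

def greedy_longest_first(tasks, processors):
    """Give each task, largest first, to the currently least loaded processor."""
    bins = [[] for _ in range(processors)]
    loads = [0] * processors
    for t in sorted(tasks, reverse=True):
        i = loads.index(min(loads))
        bins[i].append(t)
        loads[i] += t
    return bins

tasks = [8, 1, 1, 1, 7, 1, 1, 1]   # unequal task sizes, arbitrary time units
print("round robin:", makespan(round_robin(tasks, 4)))
print("greedy:     ", makespan(greedy_longest_first(tasks, 4)))
```

With round robin both large tasks land on the same processor, so three processors sit idle most of the time; the greedy assignment cannot finish sooner than the largest single task but gets much closer to it.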
Outline
• Why powerful computers must be parallel processors
• Large CSE problems require powerful computers
Instructor (Sections 2 & 3)
• Instructor: Dr. Walid Abu-Sufah
• Office: CPE 10
• Email: abusufah@ju.edu.jo
• Office Hours: Monday 11-12, Tuesday 12-1 and by
appointment.
• Course web site:
http://www1.ju.edu.jo/ecourse/abusufah/cpe779_spr12/index.html
Prerequisite
• CPE 432: Computer Design, and general C
programming skills
Grading Policy
• Programming Assignments: 25%
• Demo/knowledge: 25%
• Functionality and Performance: 40%
• Report: 35%
• Project: 35%
• Design Document: 25%
• Project Presentation: 25%
• Demo/Functionality/Performance/Report: 50%
• Midterm: 15%
• Final: 25%
Bonus Days
• Each of you gets five bonus days
• A bonus day is a no-questions-asked one-day
extension that can be used on most assignments
• You can’t turn in multiple versions of a team
assignment on different days; all of you must combine
individual bonus days into one team bonus day.
• You can use multiple bonus days on the same
assignment
• Weekends/holidays don’t count for the number of days
of extension (Thursday-Sunday is one day extension)
• Intended to cover illnesses, just needing more time, etc.
Using Bonus Days
• Bonus days are automatically applied to late projects
• Penalty for being late beyond bonus days is 10% of
the possible points/day, again not counting weekends/holidays
• Things you can’t use bonus days on:
• Final project design documents, final project
presentations, final project demo, exam
Academic Honesty
• You are allowed and encouraged to discuss
assignments with other students in the class.
Getting verbal advice/help from people who’ve
already taken the course is also fine.
• Any reference to assignments from web postings is
unacceptable
• Any copying of non-trivial code is unacceptable
• Non-trivial = more than a line or so
• Includes reading someone else’s code and then going off
to write your own.
Academic Honesty (cont.)
• Giving/receiving help on an exam is unacceptable
• Penalties for academic dishonesty:
• Zero on the assignment for the first occasion
• Automatic failure of the course for repeat offenses
Team Projects
• Work can be divided up between team members in
any way that works for you
• However, each team member will demo the final
checkpoint of each project individually, and will get a
separate demo grade
• This will include questions on the entire design
• Rationale: if you don’t know enough about the whole
design to answer questions on it, you aren’t involved
enough in the project
Text/Notes
1. D. Kirk and W. Hwu, “Programming Massively
Parallel Processors – A Hands-on Approach,”
Morgan Kaufmann Publishers, 2010, ISBN 9780123814722
2. Cleve B. Moler, Numerical Computing with
MATLAB, Society for Industrial and Applied Mathematics
(January 1, 2004). Available for individual
download at
http://www.mathworks.com/moler/chapters.html
3. NVIDIA, NVidia CUDA C Programming Guide,
version 4.0, NVidia, 2011 (reference book)
4. Lecture notes will be posted at the class web site
Rough List of Topics
• Basics of computer architecture, memory
hierarchies, performance
• Parallel Programming Models and Machines
• Shared Memory and Multithreading (OpenMP)
• Distributed Memory and Message Passing (MPI)
Rough List of Topics (continued)
• Programming NVIDIA processors using CUDA
• Introduction to CUDA C
• CUDA Parallel Execution Model with Fermi Updates
• CUDA Memory Model with Fermi Updates
• Tiled Matrix-Matrix Multiplication
• Debugging and Profiling, Introduction to Convolution
• Convolution, Constant Memory and Constant Caching
• Programming NVIDIA processors using CUDA
(continued)
• Tiled 2D Convolution
• Parallel Computation Patterns - Reduction Trees
• Memory Bandwidth
• Parallel Computation Patterns - Prefix Sum (Scan)
• Floating Point Considerations
• Atomic Operations and Histogramming
• Data Transfers and GMAC
• Multi-GPU Programming in CUDA and GMAC
• MPI and CUDA Programming
• Selected numerical computing topics (with MATLAB)
• Linear Equations
• Eigenvalues