The Hitchhiker's Guide to the Data Center Galaxy

Transcription

The Hitchhiker's Guide to the Data Center Galaxy
Paolo Costa
paolo.costa@microsoft.com
• What is Cloud Computing?
• What is Cloud Computing?
“The interesting thing about cloud computing is that
we’ve redefined cloud computing to include everything
that we already do. Maybe I’m an idiot, but I have no
idea what anyone is talking about. What is it? It’s
complete gibberish. It’s insane. When is this idiocy
going to stop?”
Larry Ellison
Oracle’s co-founder
• What is Cloud Computing?
1. I have no idea!!
2. Yet another buzzword
3. A possible one...
• What is Cloud Computing?
“Apps delivered as services over the
Internet and the Data Center hardware
and software providing them”
Michael Armbrust et al.
"Above the Clouds", UC Berkeley
• Amazon EC2
– user can control the whole sw stack (through Xen)
– hard to scale
• Google App Engine
– targets exclusively web applications
– constrained stateless/stateful tier
– automatic scaling
• Windows Azure
– .NET environment
– auto provisioning of stateless apps
Data Centers are the key enablers for Cloud Computing
• Providers’ perspective
– economies of scale
• cheaper electricity, hardware, network, and operations
• analogous to the semiconductor market (e.g., nVidia)
– massive infrastructure already in place
• Users’ perspective
– illusion of infinite resources available on demand
• no upfront commitment
• pay-as-you-go model
• Ok, but hype removed, isn't a DC just a bigger cluster?
• Yes...
– some ideas can be re-used / extended
• ...and NO
– larger scale (tens/hundreds of thousands of nodes)
– low-end servers
– higher failure rate
– availability more important than consistency (CAP)
– several applications running concurrently
– emphasis on energy
• Introduction
• Hardware
• Energy
• Software
• Network
• Takeaways
• Typical first year for a new cluster
– ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
– ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
– ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
– ~1 network rewiring (rolling ~5% of machines down over 2-day span)
– ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
– ~5 racks go wonky (40-80 machines see 50% packet loss)
– ~8 network maintenances (4 might cause ~30-minute connectivity losses)
– ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
– ~3 router failures (have to immediately pull traffic for an hour)
– ~dozens of minor 30-second blips for DNS
– ~1000 individual machine failures
– ~thousands of hard drive failures
Source: Jeff Dean, Google
• Google server (Source: Google, April 2009)
– 3.5 inches thick (2U, or 2 rack units)
– 12-volt battery as UPS (increased reliability)
– 2 AMD/Intel x86 processors
– 8 memory slots
– 2 hard drives
– removed all USB plugs, graphics card, etc.
• Microsoft DC (Quincy, WA)
– 43,600 m2 (10 football fields)
– 4.8 km of chiller piping
– 965 km of electrical wire
– 1.5 metric tons of backup batteries
– 48 MW (~ 40,000 homes)
• Yahoo (Quincy, WA)
– 13,000 m2
• Google (Columbia, OR)
– 6,500 m2
– 45,000 servers
• Number of servers
– Google: 450,000 (2005), 1M+? (2009)
– Microsoft: 218,000 (mid-2008)
– Amazon: 25,000 (of which 17,000 bought in 2008)
– Yahoo: 50,000 (2009)
– Facebook: 10,000+ (2009), hosting 40 billion photos and serving 200M users
Source: http://www.datacenterknowledge.com
• Adopted by Google and Microsoft
• Containers are assembled in factories and then shipped to the DC
– bulk ordering and repeatable design
• MS Chicago DC (July 2009)
– 700,000 square feet
– 220 shipping containers
– 1,800 to 2,500 servers/container
– 500,000+ servers
• Google DC (2005)
– 75,000 square feet
– 45 containers
– 1,160 servers / container
– 250 kW / container
Full video at http://www.youtube.com/watch?v=zRwPSFpLX8I
[Chart: DC cost breakdown: Servers 45%, Infrastructure 25%, Power 15%, Network 15%]
• Power and infrastructure dominate server costs
• DCs consume 0.5% of all world electricity
– CO2 emission like Netherlands
– In the UK, 2.2-3.3% of the country's electricity
• Power Usage Effectiveness (PUE)
– PUE = (Total facility power) / (IT equipment power)
– Average PUE = 1.7, some between 2 and 3
– Google Container DC has 1.24 (2008)
• Let's assume PUE = 1.7 (worked example below)
[Chart: power breakdown at PUE 1.7: servers 59%, cooling 33%, distribution loss 8%]
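To make the PUE arithmetic concrete, here is a minimal Python sketch (not from the slides; the PUE value of 1.7 is the slide's assumption, the 10 MW IT load is an illustrative placeholder):

# Minimal PUE arithmetic sketch (assumes the slide's PUE = 1.7).
# PUE = total facility power / IT equipment power,
# so overhead = (PUE - 1) * IT power.

def facility_power(it_power_mw: float, pue: float) -> dict:
    """Return the facility power breakdown for a given IT load and PUE."""
    total = it_power_mw * pue
    return {
        "it_mw": it_power_mw,
        "overhead_mw": total - it_power_mw,   # cooling + distribution losses
        "total_mw": total,
        "it_fraction": it_power_mw / total,   # ~0.59 when PUE = 1.7
    }

if __name__ == "__main__":
    print(facility_power(10.0, 1.7))
    # {'it_mw': 10.0, 'overhead_mw': 7.0, 'total_mw': 17.0, 'it_fraction': 0.588...}

With PUE = 1.7, roughly 59% of the facility's power reaches the servers, matching the breakdown above.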
• Build DCs where energy costs are lower
• Raise the temperature of the aisles
– usually 18-20 C, Google at 27
– possibly up to 35 (trade-off failures vs. cooling costs)
• Reduce conversion losses
– e.g., Google motherboards work at 12 V rather than 3.3/5 V
– also distributed UPS is more efficient than a centralized one
• Replace expensive chillers with cooling towers
– e.g., MS in Dublin or Google in Belgium
• Go to extreme locations
– floating boats (Google)
– Siberia
– Iceland
• Build in resilience at the system level
• Servers are idle most of the time
– overprovision for peak loads
• Energy consumption is not proportional to load (see the sketch after this list)
– CPUs are not so bad
• Optimize workloads
– on is better than off
– virtualization to compact resources
Paolo Costa
The
Hitchhiker's
Guide
to the Data
Galaxy
Source:
Luiz
Barroso,
UrsCenter
Hölzle
“The
28
Datacenter as a Computer”
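To illustrate why non-proportional energy use rewards consolidation, here is a minimal Python sketch with a simple linear power model (the idle/peak wattages are made-up placeholders, not figures from the slides):

# Simple linear server power model (illustrative numbers only).
# P(u) = P_idle + (P_peak - P_idle) * u, where u is utilization in [0, 1].

P_IDLE = 175.0   # watts at 0% load (assumed placeholder)
P_PEAK = 350.0   # watts at 100% load (assumed placeholder)

def power(utilization: float) -> float:
    """Power draw of one server at the given utilization."""
    return P_IDLE + (P_PEAK - P_IDLE) * utilization

# Ten servers at 10% load vs. one server at 100% load doing the same work:
spread_out = 10 * power(0.10)   # 10 * 192.5 = 1925 W
packed = 1 * power(1.00)        # 350 W
print(spread_out, packed)       # consolidation (e.g., via virtualization) wins

Because an idle server still draws a large fraction of its peak power, packing work onto fewer, busier machines cuts total energy, which is the point of the virtualization bullet above.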
• Things will crash!!
– take super reliable servers (MTBF of 30 years)
– build a DC with 10 thousand of those
– watch one fail per day (see the arithmetic sketch after this list)
• Recovery Oriented Computing
– assume sw & hw will fail frequently
– emphasis on availability rather than consistency
– heavily instrument apps to detect failures
• Highly customized distributed services
– File system: e.g., GFS (Google), HDFS (Yahoo)
– Distributed key-value storage:
• e.g., BigTable (Google), Dynamo (Amazon), Cassandra (Facebook)
– Parallel Computation:
• e.g., MapReduce (Google), Hadoop (Yahoo), Dryad/DryadLINQ (MS)
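The one-failure-per-day figure is straightforward arithmetic; a minimal Python sketch using the slide's 30-year MTBF and 10,000-server numbers:

# Expected failures per day for a fleet of independent servers.
# With MTBF = 30 years, each server fails on average once every ~10,950 days,
# so 10,000 servers yield roughly one failure per day.

MTBF_YEARS = 30
SERVERS = 10_000

mtbf_days = MTBF_YEARS * 365
failures_per_day = SERVERS / mtbf_days
print(round(failures_per_day, 2))   # ~0.91, i.e. about one failure a day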
A scalable infrastructure does not automatically imply a scalable app
• Files broken into 64 MB chunks (see the chunking sketch after this list)
• Optimized for sequential reads and appends
• 200+ GFS clusters
– 5,000+ machines / cluster
– 5+ PB of disk usage / cluster
– 10,000 clients / cluster
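As a generic illustration of fixed-size chunking (a Python sketch, not GFS's actual API or metadata layout), mapping a byte offset to a chunk is a single integer division:

# Generic fixed-size chunking sketch (64 MB chunks, as in the slide).
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB

def chunk_of(offset: int) -> tuple[int, int]:
    """Return (chunk index, offset within that chunk) for a file byte offset."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A 1 GB file spans 16 chunks; byte 200,000,000 falls in chunk 2.
print(chunk_of(200_000_000))   # (2, 65782272)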
• Key goal: performance and availability
– can’t afford system-wide locking
– Amazon found every 100ms of latency cost them 1% in sales
– Google found an extra 0.5 seconds dropped traffic by 20%
• CAP theorem
– consistency, availability, partition tolerance
– only pick 2
• Eventual consistency
– replicas are updated asynchronously
– conflicts may arise in case of failures or concurrent writes
• From ACID...
– Atomic, Consistent, Isolated, Durable
• ...to BASE
– Basically Available, Soft-state, Eventual Consistency
• Key-value storage system used by the majority of Amazon services
• Reminiscent of Chord
– nodes are responsible for some keys
– replication occurs on next N-1 nodes
– O(1) distance (global knowledge)
• Quorum-based read-write (N,R,W) (see the sketch after this list)
– R+W>N (akin to quorum, high latency)
– usually R+W <= N to decrease latency
– e.g., shopping cart (W=1, always writeable)
– e.g., product catalogue (R=1, always readable)
• Eventual consistency
– vector clocks to reconcile conflicts
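A minimal Python sketch of the (N, R, W) trade-off (the parameter names follow the slide; the specific configurations are illustrative assumptions, not Dynamo's exact settings):

# Quorum parameter sketch: N replicas, W acks per write, R replies per read.
# If R + W > N, every read quorum overlaps every write quorum, so a read
# sees at least one up-to-date replica; otherwise consistency is only eventual.

def quorum_overlap(n: int, r: int, w: int) -> bool:
    """True if read and write quorums are guaranteed to intersect."""
    return r + w > n

configs = {
    "strict quorum":           (3, 2, 2),   # overlap, higher latency
    "shopping cart (W=1)":     (3, 2, 1),   # always writeable, may conflict
    "product catalogue (R=1)": (3, 1, 2),   # always readable, may be stale
}
for name, (n, r, w) in configs.items():
    print(f"{name}: N={n} R={r} W={w} -> overlap={quorum_overlap(n, r, w)}")

When the quorums do not overlap, concurrent writes can diverge, which is where the vector clocks mentioned above come in to reconcile the replicas later.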
• Microsoft parallel execution engine
– supports arbitrary acyclic graphs
• Enables writing Dryad tasks using LINQ declarative syntax
– e.g., counting words (a map/reduce-style sketch follows)
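The original slide shows the query in LINQ, which is not reproduced in this transcription; as a stand-in, here is a minimal Python sketch of the same word-count computation in a map/reduce style (illustrative only, not DryadLINQ syntax):

# Word count expressed as map (split into words) + reduce (count per key).
# DryadLINQ expresses the equivalent pipeline declaratively in LINQ/C#.
from collections import Counter
from itertools import chain

lines = ["the quick brown fox", "the lazy dog", "the fox"]

words = chain.from_iterable(line.split() for line in lines)  # "map" phase
counts = Counter(words)                                      # "reduce" phase
print(counts.most_common(3))   # [('the', 3), ('fox', 2), ...]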
• 48-port gigabit switches are used in each rack
• 8 ports are used to connect to the cluster switch
• 40 Gbps to 8 Gbps: 1:5 oversubscription (arithmetic sketched below)
Source: Luiz Barroso, Urs Hölzle, "The Datacenter as a Computer"
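A quick Python sketch of that arithmetic (the 40 servers per rack are implied by 48 ports minus 8 uplinks; the rest follows the slide):

# Rack-level oversubscription: server-facing bandwidth vs. uplink bandwidth.
PORTS = 48
UPLINKS = 8
LINK_GBPS = 1          # gigabit switch ports

servers = PORTS - UPLINKS              # 40 servers per rack
server_bw = servers * LINK_GBPS        # 40 Gbps toward the servers
uplink_bw = UPLINKS * LINK_GBPS        # 8 Gbps toward the cluster switch
print(server_bw, uplink_bw, server_bw / uplink_bw)   # 40 8 5.0 -> 1:5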
• Limited server-to-server capacity
– up to 1:240 oversubscription
• Fragmentation of resources
– hard to use resources beyond the L2 domain
• High costs
– up to $700,000 / router
• Poor reliability
– failures at higher levels may compromise several users
• Use redundancy to improve bandwidth and fault-tolerance
• Full-bisection bandwidth
– all nodes can communicate at full bandwidth
[Figure: redundant topology with 10 Gbps and 1 Gbps links]
• Internet-scale DCs are still using networking devised for other domains
• Hard to (efficiently) design distributed services
– the network still acts as a black box
– services have to reverse-engineer the network
• e.g., end-to-end signalling
• DCs are not mini-internets
– Single owner / administration domain
• we know (and define) the topology
• low hardware/software diversity
• no malicious or selfish node
• They should be designed as single systems
– not design and optimize components in isolation
• "Closing the gap between networks and services"
– every node is a router
– the network topology is directly exposed to the services
• Nodes arranged in a wrapped 3D cube (see the neighbour sketch after this slide)
– no switches or routers (direct connect)
– inspired by HPC cube topologies
• Very simple one-hop API to
– send/receive packets to/from 1-hop neighbours
– be informed when a direct link fails
• Other functions implemented as services
– multi-hop routing is a service
– GFS, BigTable, etc.
Costa et al., "Why Should We Integrate Services, Servers, and Networking in a Data Center?", WREN'09
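To make the wrapped 3D cube concrete, here is a minimal Python sketch (the 3x3x3 size matches the testbed mentioned later; the coordinate scheme is an illustrative assumption) of a node's six one-hop neighbours:

# One-hop neighbours of node (x, y, z) in a k*k*k wrapped 3D cube (torus).
# Each node has exactly six direct links: +/-1 along each axis, wrapping mod k.

def neighbours(x: int, y: int, z: int, k: int = 3):
    """Return the six one-hop neighbour coordinates of (x, y, z)."""
    deltas = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [((x + dx) % k, (y + dy) % k, (z + dz) % k) for dx, dy, dz in deltas]

print(neighbours(0, 0, 0))
# [(1, 0, 0), (2, 0, 0), (0, 1, 0), (0, 2, 0), (0, 0, 1), (0, 0, 2)]

The one-hop API in the slide corresponds to sending and receiving only along these six links; anything longer, such as multi-hop routing, is built on top as a service.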
• Our topology is reminiscent of the CAN overlay
– we can apply techniques used in structured overlays
• Some popular DC services already resemble overlay networks, e.g.,
– Amazon Dynamo storage (DHT-based)
– Facebook log collection (tree-based)
– Amazon monitoring service (gossip-based)
– MapReduce / Dryad (graph-based)
• Key benefit:
– the logical and physical topology coincide
• no overhead to maintain the overlay
• no need to infer physical proximity and hidden congestion
• Current status
– small testbed with 27 nodes (3 x 3 x 3)
– multi-hop routing protocol
• fault-tolerant
• energy-aware
• multi-path
– tree-based VM distribution service (<500 LOC)
– legacy support
• able to run existing apps (Dryad)
• support for TCP over multiple paths
• Future work
– transport protocol (hop-to-hop vs. end-to-end)
– distributed storage (the network is the disk)
– naming service
• Very large DCs are very expensive
– estate costs
– power costs
• Idea
– massively geo-distribute the DC
– re-use infrastructure / energy costs
• Condo DCs
– 100 servers / condo
• Nano data-centers
– use ISP boxes as computation units
• XtreemOS (?)
– use personal machines to offload computation
• Cloud computing, albeit not radically new, is changing the way we design systems
– availability, scalability, and energy efficiency
• We need a holistic approach to DCs
– not optimize components in isolation but regard DCs as single systems
– "Data center as a computer" (Google)
• Hardware will get cheaper, more reliable, and more energy efficient...
– but ultimately, software is what will let DCs based on inexpensive servers grow