The Hitchhiker's Guide to the Data Center Galaxy
Transcription
The Hitchhiker's Guide to the Data Center Galaxy
Paolo Costa
paolo.costa@microsoft.com

• What is Cloud Computing?

“The interesting thing about cloud computing is that we’ve redefined cloud computing to include everything that we already do. Maybe I’m an idiot, but I have no idea what anyone is talking about. What is it? It’s complete gibberish. It’s insane. When is this idiocy going to stop?”
Larry Ellison, Oracle’s co-founder

1. I have no idea!!
2. Yet another buzzword
3. A possible one...

“Apps delivered as services over the Internet and the Data Center hardware and software providing them”
Michael Armbrust et al., “Above the Clouds”, UC Berkeley

• Amazon EC2
  – user can control the whole sw stack (through Xen)
  – hard to scale
• Google App Engine
  – targets exclusively web applications
  – constrained stateless/stateful tier
  – automatic scaling
• Windows Azure
  – .NET environment
  – auto provisioning of stateless apps

Data Centers are the key enablers for Cloud Computing

• Providers’ perspective
  – economies of scale
    • cheaper electricity, hardware, network, and operations
    • analogous to the semiconductor market (e.g., nVidia)
  – massive infrastructure already in place
• Users’ perspective
  – illusion of infinite resources available on demand
    • no upfront commitment
    • pay-as-you-go model
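To illustrate the pay-as-you-go bullet above: under pure per-hour billing, renting many machines for a short time costs the same as renting one machine for a long time, which is what makes the “illusion of infinite resources” economically sensible. A minimal sketch, assuming a made-up hourly price (not a real EC2/App Engine/Azure rate):

```python
# Pay-as-you-go sketch: the bill depends only on server-hours consumed.
HOURLY_PRICE = 0.10  # hypothetical $/server-hour, for illustration only

def rental_cost(servers: int, hours: float) -> float:
    """No upfront commitment: cost is just server-hours times the hourly price."""
    return servers * hours * HOURLY_PRICE

slow = rental_cost(servers=1, hours=1000)   # one machine, ~6 weeks of wall-clock time
fast = rental_cost(servers=1000, hours=1)   # a thousand machines, one hour
print(slow, fast)  # 100.0 100.0 -> same bill, ~1000x faster turnaround
```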
• Ok, but hype removed, isn’t a DC just a bigger cluster?
• Yes...
  – some ideas can be re-used / extended
• ...and NO
  – larger scale (tens/hundreds of thousands of nodes)
  – low-end servers
  – higher failure rate
  – availability more important than consistency (CAP)
  – several applications running concurrently
  – emphasis on energy

• Introduction
• Hardware
• Energy
• Software
• Network
• Takeaways

• Typical first year for a new cluster
  – ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
  – ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
  – ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
  – ~1 network rewiring (rolling ~5% of machines down over a 2-day span)
  – ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
  – ~5 racks go wonky (40-80 machines see 50% packet loss)
  – ~8 network maintenances (4 might cause ~30-minute connectivity losses)
  – ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
  – ~3 router failures (have to immediately pull traffic for an hour)
  – ~dozens of minor 30-second blips for DNS
  – ~1000 individual machine failures
  – ~thousands of hard drive failures
Source: Jeff Dean, Google

• Source: Google, April 2009
  – 3.5 inches thick (2U, or 2 rack units)
  – 12-volt battery as UPS (increased reliability)
  – 2 AMD/Intel x86 processors
  – 8 memory slots
  – 2 hard drives
  – removed all USB plugs, graphics card, etc.

• Microsoft DC (Quincy, WA)
  – 43,600 m² (10 football fields)
  – 4.8 km of chiller piping
  – 965 km of electrical wire
  – 1.5 metric tons of backup batteries
  – 48 MW (~40,000 homes)
• Yahoo (Quincy, WA)
  – 13,000 m²
• Google (Columbia, OR)
  – 6,500 m²
  – 45,000 servers
• Number of servers
  – Google: 450,000 (2005), 1M+? (2009)
  – Microsoft: 218,000 (mid-2008)
  – Amazon: 25,000 (of which 17,000 bought in 2008)
  – Yahoo: 50,000 (2009)
  – Facebook: 10,000+ (2009), hosting 40 billion photos and serving 200M users
Source: http://www.datacenterknowledge.com
• Adopted by Google and Microsoft
• Containers are assembled in factories and then shipped to the DC
  – bulk ordering and repeatable design
• MS Chicago DC (July 2009)
  – 700,000 square feet
  – 220 shipping containers
  – 1,800 to 2,500 servers / container
  – 500,000+ servers
• Google DC (2005)
  – 75,000 square feet
  – 45 containers
  – 1,160 servers / container
  – 250 kW / container
Full video at http://www.youtube.com/watch?v=zRwPSFpLX8I

• Cost breakdown: Servers 45%, Infrastructure 25%, Power 15%, Network 15%
• Power and infrastructure dominate server costs
• DCs consume 0.5% of all world electricity
  – CO2 emissions like the Netherlands
  – in the UK, 2.2-3.3% of the country’s electricity

• Power Usage Effectiveness (PUE)
  – PUE = (Total facility power) / (IT equipment power)
  – average PUE = 1.7, some between 2 and 3
  – Google container DC has 1.24 (2008)
• Let’s assume PUE = 1.7
  – servers 59%, cooling 33%, distribution loss 8%
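The PUE arithmetic above, worked through as a minimal sketch (the 59/33/8 split is read off the slide’s pie chart; only the calculation is added here):

```python
# PUE = total facility power / IT equipment power
pue = 1.7
it_share = 1 / pue        # ~0.59: share of facility power that reaches the IT equipment
overhead = 1 - it_share   # ~0.41: cooling plus power-distribution losses

# Example with the 48 MW Quincy figure mentioned earlier:
facility_mw = 48.0
it_mw = facility_mw / pue
print(f"IT share {it_share:.0%}, overhead {overhead:.0%}, usable IT power {it_mw:.1f} MW")
# -> IT share 59%, overhead 41%, usable IT power 28.2 MW
```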
• Build DCs where energy costs are lower
• Raise the temperature of the aisles
  – usually 18-20 °C, Google at 27 °C
  – possibly up to 35 °C (trade-off: failures vs. cooling costs)
• Reduce conversion losses
  – e.g., Google motherboards work at 12 V rather than 3.3/5 V
  – also, a distributed UPS is more efficient than a centralized one
• Replace expensive chillers with cooling towers
  – e.g., MS in Dublin or Google in Belgium
• Go to extreme locations
  – floating boats (Google)
  – Siberia
  – Iceland
• Build in resilience at system level

• Servers are idle most of the time
  – overprovisioned for peak loads
• Energy consumption is not proportional to load
  – CPUs are not so bad
• Optimize workloads
  – on is better than off
  – virtualization to consolidate resources
Source: Luiz Barroso, Urs Hölzle, “The Datacenter as a Computer”

• Things will crash!!
  – take super reliable servers (MTBF of 30 years)
  – build a DC with 10 thousand of those
  – watch one fail per day (see the arithmetic check below)
• Recovery Oriented Computing
  – assume sw & hw will fail frequently
  – emphasis on availability rather than consistency
  – heavily instrument apps to detect failures
• Highly customized distributed services
  – File system: e.g., GFS (Google), HDFS (Yahoo)
  – Distributed key-value storage:
    • e.g., BigTable (Google), Dynamo (Amazon), Cassandra (Facebook)
  – Parallel computation:
    • e.g., MapReduce (Google), Hadoop (Yahoo), Dryad/DryadLINQ (MS)

A scalable infrastructure does not automatically imply a scalable app
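The “watch one fail per day” claim above is just MTBF arithmetic; a minimal check (only the calculation is added, the 30-year MTBF and the 10,000-server fleet come from the slide):

```python
# At data-center scale, even very reliable servers fail constantly.
mtbf_years = 30          # per-server mean time between failures
fleet_size = 10_000      # servers in the DC

failures_per_server_per_day = 1 / (mtbf_years * 365)
expected_failures_per_day = fleet_size * failures_per_server_per_day
print(f"{expected_failures_per_day:.2f} expected failures/day")  # ~0.91, i.e. roughly one per day
```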
• Files broken into 64 MB chunks
• Optimized for sequential reads and appends
• 200+ GFS clusters
  – 5,000+ machines / cluster
  – 5+ PB of disk usage / cluster
  – 10,000 clients / cluster

• Key goal: performance and availability
  – can’t afford system-wide locking
  – Amazon found every 100 ms of latency cost them 1% in sales
  – Google found an extra 0.5 seconds dropped traffic by 20%
• CAP theorem
  – consistency, availability, partition tolerance
  – only pick 2
• Eventual consistency
  – replicas are updated asynchronously
  – conflicts may arise in case of failures or concurrent writes
• From ACID...
  – Atomic, Consistent, Isolated, Durable
• ...to BASE
  – Basically Available, Soft-state, Eventual consistency

• Key-value storage system
• Used by the majority of Amazon services
• Reminiscent of Chord
  – nodes are responsible for some keys
  – replication occurs on the next N-1 nodes
  – O(1) distance (global knowledge)
• Quorum-based reads/writes (N, R, W) (see the quorum sketch below)
  – R + W > N (akin to a quorum, high latency)
  – usually R + W <= N to decrease latency
  – e.g., shopping cart (W=1, always writeable)
  – e.g., product catalogue (R=1, always readable)
• Eventual consistency
  – vector clocks to reconcile conflicts

• Microsoft parallel execution engine
  – supports arbitrary acyclic graphs

• Enables writing Dryad tasks using LINQ declarative syntax
  – e.g., counting words (sketched below)
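Referring back to the Dynamo slide above, a minimal sketch of the (N, R, W) quorum rule and the trade-offs the slide mentions. The function is purely illustrative, not Dynamo’s actual API:

```python
# Hypothetical illustration of Dynamo-style (N, R, W) settings.
def overlapping_quorums(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees every read quorum intersects every write quorum."""
    return r + w > n

# Assume N = 3 replicas per key (an illustrative value).
print(overlapping_quorums(3, 2, 2))  # True : strict quorum -> consistent reads, higher latency
print(overlapping_quorums(3, 2, 1))  # False: W=1 "shopping cart" -> always writeable,
                                     #        conflicts reconciled later via vector clocks
print(overlapping_quorums(3, 1, 2))  # False: R=1 "product catalogue" -> always readable, may be stale
```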
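And for the “counting words” example on the DryadLINQ slide: the declarative LINQ version is not reproduced here; instead, a rough plain-Python sketch of the same map / group-by / count shape (names are illustrative only):

```python
from collections import Counter
from itertools import chain

def word_count(lines):
    """Split every line into words (map), then count occurrences per word (reduce by key)."""
    words = chain.from_iterable(line.split() for line in lines)
    return Counter(words)

print(word_count(["so long and thanks", "for all the fish", "so long"]))
# Counter({'so': 2, 'long': 2, 'and': 1, 'thanks': 1, ...})
```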
• 48-port gigabit switches are used in each rack
• 8 ports are used to connect to the cluster switch
• 40 Gbps to 8 Gbps: 1:5 oversubscription
Source: Luiz Barroso, Urs Hölzle, “The Datacenter as a Computer”

• Limited server-to-server capacity
  – up to 1:240 oversubscription
• Fragmentation of resources
  – hard to use resources beyond the L2 domain
• High costs
  – up to $700,000 / router
• Poor reliability
  – failures at higher levels may compromise several users

• Use redundancy to improve bandwidth and fault-tolerance
• Full-bisection bandwidth
  – all nodes can communicate at full bandwidth
[figure: redundant topology with 10 Gbps and 1 Gbps links]

• Internet-scale DCs are still using networking devised for other domains
• Hard to (efficiently) design distributed services
  – the network still acts as a black box
  – services have to reverse-engineer the network
    • e.g., end-to-end signalling
• DCs are not mini-Internets
  – single owner / administration domain
    • we know (and define) the topology
    • low hardware/software diversity
    • no malicious or selfish nodes
• They should be designed as single systems
  – not design and optimize components in isolation

• “Closing the gap between networks and services”
  – every node is a router
  – the network topology is directly exposed to the services
• Nodes arranged in a wrapped 3D cube (neighbour computation sketched at the end)
  – no switches or routers (direct connect)
  – inspired by HPC cube topologies
• Very simple one-hop API to
  – send/receive packets to/from 1-hop neighbours
  – be informed when a direct link fails
• Other functions implemented as services
  – multi-hop routing is a service
  – GFS, BigTable, etc.
Paolo Costa et al., “Why Should We Integrate Services, Servers, and Networking in a Data Center?”, WREN’09

• Our topology is reminiscent of the CAN overlay
  – we can apply techniques used in structured overlays
• Some popular DC services already resemble overlay networks, e.g.,
  – Amazon Dynamo storage (DHT-based)
  – Facebook log collection (tree-based)
  – Amazon monitoring service (gossip-based)
  – MapReduce / Dryad (graph-based)
• Key benefit:
  – the logical and physical topology coincide
    • no overhead to maintain the overlay
    • no need to infer physical proximity and hidden congestion

• Current status
  – small testbed with 27 nodes (3 x 3 x 3)
  – multi-hop routing protocol
    • fault-tolerant
    • energy-aware
    • multi-path
  – tree-based VM distribution service (<500 LOC)
  – legacy support
    • able to run existing apps (Dryad)
    • support for TCP over multiple paths
• Future work
  – transport protocol (hop-to-hop vs. end-to-end)
  – distributed storage (the network is the disk)
  – naming service

• Very large DCs are very expensive
  – real estate costs
  – power costs
• Idea
  – massively geo-distribute the DC
  – re-use infrastructure / energy costs
• Condo DCs
  – 100 servers / condo
• Nano data centers
  – use ISP boxes as computation units
• XtreemOS (?)
  – use personal machines to offload computation

• Cloud computing, albeit not radically new, is changing the way we design systems
  – availability, scalability, and energy efficiency
• We need a holistic approach to DCs
  – do not optimize components in isolation; regard DCs as single systems
  – “Data center as a computer” (Google)
• Hardware will get cheaper, more reliable, and more energy efficient...
  – but ultimately software is what will let DCs built from inexpensive servers grow
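Finally, for the wrapped 3D cube described in the networking slides above, a minimal sketch of the one-hop neighbour computation on a k × k × k torus (k = 3 matches the 27-node testbed; the function is illustrative, not the system’s actual one-hop API):

```python
# Illustrative sketch (not the actual API): one-hop neighbours on a wrapped k x k x k 3D cube.
def neighbours(x: int, y: int, z: int, k: int = 3):
    """Each node has six direct links: +/-1 along each axis, wrapping modulo k."""
    deltas = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [((x + dx) % k, (y + dy) % k, (z + dz) % k) for dx, dy, dz in deltas]

print(neighbours(0, 0, 0, k=3))  # six neighbours; multi-hop routing is layered on top as a service
```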