Cost Effective Computing using Statistics
Transcription
Who are we: SigmaStream

● Fully Apache-licensed open source
● Domain-specific DSL for data analytics: AdTech, IoT
● Libraries for geo, time series, online machine learning, and more
● Extendable framework on top of Spark & Spark Streaming jobs
● Compose custom pipelines using domain-specific packages

Case Study: Data Processing

Load Customer Data → Filter Customers → Count Unique Customers → Update Most Active Customers → Check If Customer Already Exists → Recommend News Based on Similar Customers → Send Out Email

SigmaStream Script

val customers = LOAD "hdfs://HNode:9000/input/" separatedBy "," fields (customer_id: int, location: geo)
val filteredCustomers = FILTER customers by customer_id regex "a%"
val groupedCustomers = GROUP filteredCustomers by customer_id
val uniqueCustomers = GENERATE groupedCustomers AS Unique(groupedCustomers)
val store = uniqueCustomers STORE "hdfs://HNode:9000/output/" AS json

Previous World (1-2 Months)

Ask for machines → machines allocated by IT → IT team takes 2 months to install software → data processing pipeline.

Previous World (4-5 Months)

Ask for machines → machines allocated by IT → IT team takes 2 months to install software → data processing pipeline → capacity overshoot → ask for machines again.

Current World

Capacity overshoot? Just ask for more machines.

Upside

• On-demand machine requests
• Scale up & down as required, based on the data
• Spot instances allow for more cost exploration
• Fault tolerance translates to $$ saved
• The developer has much broader control

Let's Optimize

Load Data → Filter Customers → (Disk IO) → Count Unique Customers → Update Most Active Customers → Check If Customer Already Exists → (Disk IO) → Recommend News Based on Similar Customers → Send Out Email

Disk IO

• Spark provides in-memory caching out of the box
• Highly beneficial for multiple iterations over the data
• Extendable off-heap storage in Tachyon
• 10x improvement over magnetic disk

What about the Rest?
Load Data: O(N) → Filter Customers → (Disk IO) → Count Unique Customers: O(N log N) → Update Most Active Customers: O(N log N) → Check If Customer Already Exists: O(N log N) → Recommend News Based on Similar Customers: O(N^2) → Send Out Email

Computational Complexity

• Complexity translates to time to process.
• O(N): touching 1 TB of data in a distributed environment can cost ~$100.
• O(N^2): translates to as high as $10,000.
• Too much memory/disk required
• Too much time taken for real-life datasets
• Cost escalates beyond the value of the analytics

Let's Optimize

Load Data: O(N) → Filter Customers: in-memory cache → Count Unique Customers: O(N log N) → Update Most Active Customers: O(N log N) → Check If Customer Already Exists: O(N log N) → Recommend News Based on Similar Customers: O(N^2) → Send Out Email

Much Ado about Sets

Load Data: O(N) → Filter Customers: in-memory cache → Set Count: O(N log N) → Rank on Set Cardinality: O(N log N) → Set Membership: O(N log N) → Find Similar Sets: O(N^2) → Send Out Email

Base Data Structure: Monoids

A mathematical concept (a set & a function):
● Contains a "0" (identity element)
● Can be "added"
● Addition is associative

Simplest monoid: the integers under +
● Set: integers; operation: +; identity: 0
● (a + b) + c = a + (b + c)

Not a monoid: the integers under /
● The candidate identity is 1, but (a / b) / c != a / (b / c)

Bloom Filter

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is well suited to cases where we need to quickly filter items that are present in a set. A Bloom filter answers "maybe yes, or surely no": false positives are possible, false negatives are not.

HyperLogLog

● HyperLogLog is an approximate technique for computing the number of distinct entries in a set (its cardinality). It does this while using a small amount of memory; for instance, to achieve 99% accuracy it needs only 16 KB.

Count-Min Sketch

● The Count-Min sketch (or CM sketch) is a probabilistic, sub-linear-space streaming algorithm that can be used to summarize a data stream in many different ways.
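The monoid laws above can be sketched in plain Scala. The trait and instance names here are illustrative (Algebird ships its own `Monoid` type class), but the structure is the same: an identity element plus an associative combine, which is exactly what lets partial results computed on different machines be merged in any grouping.

```scala
// A minimal monoid sketch; names are illustrative, not the Algebird API.
trait Monoid[T] {
  def zero: T               // the identity element ("0")
  def plus(a: T, b: T): T   // an associative "addition"
}

// Integers under addition form a monoid: zero = 0, plus = +.
object IntAddition extends Monoid[Int] {
  val zero = 0
  def plus(a: Int, b: Int): Int = a + b
}

// Because plus is associative and zero is the identity, per-partition
// partial sums can be folded together in any order and still agree.
def combineAll[T](xs: Seq[T], m: Monoid[T]): T =
  xs.foldLeft(m.zero)(m.plus)
```

Division fails this contract: `(a / b) / c != a / (b / c)`, so no such instance can be written for it.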
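The "maybe yes, or surely no" behaviour of the Bloom filter described above can be sketched in a few lines of Scala. This is a toy illustration, not the Algebird implementation: the double-hashing scheme (deriving k bit positions from two murmur hashes) and the parameters are illustrative choices.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Toy Bloom filter; numBits and numHashes are illustrative parameters.
class BloomFilter(numBits: Int, numHashes: Int) {
  private val bits = mutable.BitSet.empty

  // Derive numHashes bit positions from two base hashes (double hashing).
  private def positions(item: String): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(item)
    val h2 = MurmurHash3.stringHash(item, h1)
    (0 until numHashes).map { i =>
      val h = h1 + i * h2
      ((h % numBits) + numBits) % numBits // non-negative bucket index
    }
  }

  def add(item: String): Unit = positions(item).foreach(bits += _)

  // False positives are possible (bit collisions); false negatives are not.
  def mightContain(item: String): Boolean = positions(item).forall(bits.contains)
}
```

An item that was added always reports "maybe yes"; an item whose bits were never set reports "surely no".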
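The register trick behind the HyperLogLog counts described above can be sketched as follows. This is a simplified version of the standard algorithm (p index bits pick a register, each register keeps the maximum leading-zero rank seen, and a harmonic-mean estimate with a linear-counting correction handles small cardinalities); production implementations such as Algebird's add further bias corrections.

```scala
import scala.util.hashing.MurmurHash3

// Simplified HyperLogLog with 2^p registers; p is an illustrative parameter.
class HyperLogLog(p: Int) {
  private val m = 1 << p
  private val registers = new Array[Int](m)
  private val alpha = 0.7213 / (1.0 + 1.079 / m) // standard bias correction

  def add(item: String): Unit = {
    val h = MurmurHash3.stringHash(item)
    val idx = h >>> (32 - p)        // first p bits choose the register
    val rest = h << p               // remaining bits
    val rank = Integer.numberOfLeadingZeros(rest) + 1
    if (rank > registers(idx)) registers(idx) = rank
  }

  // Harmonic mean of register estimates, with linear counting for small sets.
  def estimate: Double = {
    val z = 1.0 / registers.map(r => math.pow(2.0, -r)).sum
    val e = alpha * m * m * z
    val zeros = registers.count(_ == 0)
    if (e <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else e
  }
}
```

Adding the same item twice never changes a register, which is why duplicates do not inflate the count.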
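The Count-Min sketch just described can be sketched minimally as a depth × width table of counters with one hash function per row. The width and depth below are illustrative; real implementations derive them from a target error ε and confidence δ (roughly width ≈ e/ε, depth ≈ ln(1/δ)).

```scala
import scala.util.hashing.MurmurHash3

// Toy Count-Min sketch; width and depth are illustrative parameters.
class CountMinSketch(width: Int, depth: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, obtained by seeding murmur with the row index.
  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row)
    ((h % width) + width) % width // non-negative bucket index
  }

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += count

  // Collisions can only inflate a cell, so the minimum over rows is the
  // tightest (over-)estimate of the item's true frequency.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}
```

The estimate never undercounts, which makes it safe for "most active customer" ranking: a heavy hitter can never be hidden by a collision.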
Solutions

Specific algorithms for specific use cases:
• Set membership: Bloom filter
• Set size (distinct count): HyperLogLog
• Set similarity: MinHash
• Element frequency: Count-Min sketch

100x faster, with bounded error:

Load Data: O(N) → Filter Customers: in-memory → Count Unique Customers: HyperLogLog → Update Most Active Customers: Count-Min sketch → Check If Customer Already Exists: Bloom filter → Recommend News Based on Similar Customers: MinHash → Send Out Email

Algorithm Implementations

Algebird by Twitter
● Embedded in Scalding
● Embedded in the SigmaStream DSL
● Examples in Spark
● https://github.com/twitter/algebird

Additional Features

● Many other algorithms
● Data structures that capture both the result and the probability of error
● Switchable, memory-light hash implementations
● Fully composable functional API
● Configurable memory for lower error

Results

● 1 TB of data in Hadoop, spot instances on EMR: ~$1000 USD/TB for calculating uniques
● 1 TB of data in Spark, spot instances on EMR: ~$200 USD/TB for calculating uniques using HyperLogLog with 0.2% error

What's Next for Sigmoid

SigmaStream
● Fully open source
● Domain-specific DSL for data analytics: AdTech, IoT
● Higher-level functions for domain-specific logic
● Uses linear algebra as base operators to build domain-specific functions
● Libraries for geo, time series, online machine learning, and more
● Extendable framework on top of Spark & Spark Streaming jobs

CloudFlux
● Scalable, cloud-agnostic deployment framework based on Apache Mesos
● Resource- and latency-based cluster scaling
● Cost-based optimization for cluster deployment
● Cloud & cluster management united

Questions