Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark
Transcription
Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark
Gavin Li, Jaebong Kim, Andy Feng — Yahoo

Agenda
• Audience Expansion Spark application
• Spark scalability: problems and our solutions
• Performance tuning

AUDIENCE EXPANSION
How we built audience expansion on Spark

Audience Expansion
• Train a model to find users who perform similarly to sample users
• Find more potential "converters"

System
• Large-scale machine learning system
• Logistic regression
• TBs of input data, up to TBs of intermediate data
• The Hadoop pipeline uses 30,000+ mappers, 2,000 reducers, 16 hrs run time
• All Hadoop Streaming, ~20 jobs
• Use Spark to reduce latency and cost

Pipeline
[Diagram: Labeling → Feature Extraction → Model Training → Score/Analyze Models → Validation/Metrics]
• Label positive/negative samples
• 6-7 hrs, IO intensive, 17 TB of intermediate IO in Hadoop
• Extract features from raw events
• Logistic regression phase, CPU bound
• Validate trained models and parameter combinations, select the new model
• Validate and publish the new model

How to adopt Spark efficiently?
• Very complicated system
• 20+ Hadoop Streaming map-reduce jobs
• 20k+ lines of code
• TBs of data, person-months to do data validation
• 6+ people, 3 quarters to rewrite the system from scratch in Scala

Our migration solution
• Build a transition layer that automatically converts Hadoop Streaming jobs to Spark jobs
• No need to change any Hadoop Streaming code
• 2 person-quarters
• Private Spark

ZIPPO
[Diagram: Audience Expansion Pipeline (20+ Hadoop Streaming jobs) → Hadoop Streaming → ZIPPO: Hadoop Streaming over Spark → Spark → HDFS]

ZIPPO
• A layer (ZIPPO) between Spark and the application
• Implements all Hadoop Streaming interfaces
• Migrate the pipeline without code rewriting
• Can focus on rewriting perf bottlenecks
• Plan to open source

[Diagram: Audience Expansion Pipeline → Hadoop Streaming → ZIPPO: Hadoop Streaming over Spark → Spark → HDFS]

ZIPPO - Supported Features
• Partition related
– Hadoop Partitioner class (-partitioner)
– num.map.key.fields, num.map.partition.fields
• Distributed cache
– -cacheArchive, -file, -cacheFile
• Independent working directory for each task instead of each executor
• Hadoop Streaming aggregation
• Input data combination (to mitigate many small files)
• Customized OutputFormat, InputFormat

Performance Comparison (1 TB data)
• ZIPPO Hadoop Streaming: Spark cluster, 1 hard drive, 40 hosts — perf data: 1 hr 25 min
• Original Hadoop Streaming: Hadoop cluster, 1 hard drive, 40 hosts — perf data: 3 hrs 5 min
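The slides do not show ZIPPO's internals. As a minimal sketch of the general idea behind running unmodified Hadoop Streaming mapper/reducer commands on Spark, the snippet below uses the standard RDD.pipe operator and a HashPartitioner. The script names (mapper.py, reducer.py), the HDFS paths, and the "key is everything before the first tab" convention are illustrative assumptions, not ZIPPO's actual code; modern Spark package names are used even though the original work targeted the older mesos/spark codebase.

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

// Sketch: run Hadoop Streaming-style mapper/reducer commands over Spark.
// Script names and paths are hypothetical.
object StreamingOverSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("streaming-over-spark"))

    val input = sc.textFile("hdfs:///user/audience/input") // hypothetical path

    // "Map" phase: pipe each input line through the existing streaming mapper.
    val mapped = input.pipe(Seq("python", "mapper.py"))

    // Emulate Hadoop Streaming's default key handling: the text before the
    // first tab is the key (what -partitioner / num.map.key.fields control
    // in the real pipeline).
    val keyed = mapped.map { line =>
      val tab = line.indexOf('\t')
      if (tab >= 0) (line.substring(0, tab), line) else (line, line)
    }

    // "Reduce" phase: hash-partition by key (2,000 partitions, matching the
    // Hadoop pipeline's reducer count), sort within each partition, and pipe
    // the sorted records through the existing streaming reducer.
    val reduced = keyed
      .repartitionAndSortWithinPartitions(new HashPartitioner(2000))
      .values
      .pipe(Seq("python", "reducer.py"))

    reduced.saveAsTextFile("hdfs:///user/audience/output") // hypothetical path
    sc.stop()
  }
}
```

A real compatibility layer like ZIPPO additionally has to honor custom Partitioner classes, the distributed cache options (-cacheArchive, -file, -cacheFile), per-task working directories, and custom Input/OutputFormats, as listed on the supported-features slide.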
SPARK SCALABILITY

Spark Shuffle
• The mapper side of the shuffle writes all of its output to disk (shuffle files)
• The data can be large scale, so it cannot all be held in memory
• Reducers transfer all the shuffle files for each partition, then process them
[Diagram: each mapper (Mapper 1 ... Mapper m) writes shuffle files 1..n; reducer partitions 1..n each fetch their shuffle file from every mapper]

On each Reducer
• Every partition needs to hold all the data from all the mappers
• In memory
• In a hash map
• Uncompressed
[Diagram: a reducer with 4 cores holds partitions 1-4, each containing shuffle output from mappers 1..n]

How many partitions?
• Need partitions small enough that each fits entirely in memory
[Diagram: Host 1 (4 cores), Host 2 (4 cores), ... holding partitions 1..n]

Spark needs many partitions
• So a common pattern when using Spark is to have a large number of partitions

On each Reducer
• For a 64 GB memory host with a 16-core CPU
• For a 30:1 compression ratio and 2x in-memory overhead
• To process 3 TB of data, we need 46,080 partitions
• To process 3 PB of data, we need 46 million partitions

Non Scalable
• Not linearly scalable
• No matter how many hosts we have in total, we always need 46k partitions

Issues of a huge number of partitions
• Issue 1: OOM on the mapper side
– Each mapper core needs to write to 46k shuffle files simultaneously
– 1 shuffle file = OutputStream + FastBufferStream + CompressionStream
– Memory overhead:
• FD and related kernel overhead
• FastBufferStream (for turning random IO into sequential IO), default 100 KB buffer per stream
• CompressionStream, default 64 KB buffer per stream
– So by default, the total buffer size is 164 KB * 46k * 16 = 100+ GB

Issues of a huge number of partitions
• Our solution to the mapper OOM
– Set spark.shuffle.file.buffer.kb to 4 KB for FastBufferStream (the kernel block size)
– Based on our contributed patch https://github.com/mesos/spark/pull/685:
• Set spark.storage.compression.codec to spark.storage.SnappyCompressionCodec to enable Snappy and reduce the footprint
• Set spark.snappy.block.size to 8192 to reduce the buffer size (while Snappy still achieves a good compression ratio)
– Total buffer size after this: 12 KB * 46k * 16 = 10 GB
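To make the slides' numbers concrete, here is their arithmetic as a small Scala sketch. It assumes binary units (1 TiB = 1024 GiB) and simply reproduces the figures above; the property names mentioned on the slide (spark.shuffle.file.buffer.kb, spark.storage.compression.codec, spark.snappy.block.size) belong to that era's patched mesos/spark build and are not necessarily valid in current Spark releases.

```scala
// Reproduces the slides' partition-count and mapper-buffer arithmetic.
object ShuffleSizingSketch {
  val GiB: Long = 1024L * 1024 * 1024
  val TiB: Long = 1024L * GiB

  def main(args: Array[String]): Unit = {
    // Per-host resources from the slides: 64 GB memory, 16 cores.
    val memPerCore = 64L * GiB / 16 // 4 GiB per reducer core

    // 30:1 on-disk compression ratio, ~2x in-memory overhead (hash map, objects).
    val compressionRatio = 30
    val memOverhead = 2

    // Each reducer partition must fit uncompressed in one core's memory:
    val inputOnDisk = 3L * TiB
    val partitions = inputOnDisk * compressionRatio * memOverhead / memPerCore
    println(s"partitions needed for 3 TB: $partitions") // 46080

    val petabyteInput = 3L * 1024 * TiB
    val partitionsForPB = petabyteInput * compressionRatio * memOverhead / memPerCore
    println(s"partitions needed for 3 PB: $partitionsForPB") // ~47 million (the slides' "46 million")

    // Mapper-side buffers: each of the 16 mapper cores keeps one open stream
    // per reducer partition (output + buffer + compression streams).
    val defaultPerFile = 164 * 1024L // ~100 KB stream buffer + 64 KB codec buffer
    val tunedPerFile = 12 * 1024L    // ~4 KB stream buffer + 8 KB Snappy block
    println(s"default buffers: ${defaultPerFile * partitions * 16 / GiB} GiB") // ~115 GiB ("100+ Gb")
    println(s"tuned buffers:   ${tunedPerFile * partitions * 16 / GiB} GiB")   // ~8 GiB (the slides' "10 Gb")
  }
}
```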
Issues of a huge number of partitions
• Issue 2: large number of small files
– Each input split in the mapper is broken down into at least 46k partitions
– The large number of small files causes lots of random R/W IO
– When each shuffle file is less than 4 KB (the kernel block size), the overhead becomes significant
– Significant metadata overhead in the FS layer
– Example: just manually deleting the whole tmp directory can take 2 hours because there are so many small files
– Especially bad when splits are not balanced
– 5x slower than Hadoop
[Diagram: Input Split 1, Input Split 2, ..., Input Split n, each producing shuffle files 1..46080]

Reduce side compression
• In the current shuffle, the reducer-side data in memory is not compressed
• It can take 10-100 times more memory
• With our patch https://github.com/mesos/spark/pull/686, we reduced memory consumption by 30x, while the compression overhead is less than 3%
• Without this patch, Spark doesn't work for our case
• 5x-10x performance improvement

Reduce side compression
• Reducer side
– With compression: 1.6k files
– Without compression: 46k shuffle files

Reducer Side Spilling
[Diagram: reduce-side data flows through compression into buckets 1..n, which spill to files 1..n on disk]

Reducer Side Spilling
• Spills the over-size data in the aggregation hash table to disk
• Spilling: more IO, more sequential IO, fewer seeks
• All in memory: less IO, more random IO, more seeks
• Fundamentally resolved Spark's scalability issue

Align with the previous partition function
• Our input data come from another map-reduce job
• We use exactly the same hash function to reduce the number of shuffle files

Align with the previous partition function
• A new hash function gives a more even distribution, but:
[Diagram: the previous job generating the input data partitions keys 0,4,8... / 1,5,9... / 2,6,10... / 3,7,11... with mod 4; if the Spark job partitions with mod 5 instead, every input file fans out into shuffle files 0-4]

Align with the previous partition function
• Use the same hash function
[Diagram: with the Spark job also partitioning with mod 4, each key group maps to a single shuffle file]

Align with the previous hash function
• Our case:
– 16M shuffle files, 62 KB on average (5-10x slower)
– 8k shuffle files, 125 MB on average
• Several different input data sources
• Use the partition function from the major one

PERFORMANCE TUNING

All About Resource Utilization
• Maximize resource utilization
• Use as much CPU, memory, disk, and network as possible
• Monitor vmstat, iostat, sar

Resource Utilization
• (This is an old diagram, to be updated)

Resource Utilization
• Ideally, CPU/IO should be fully utilized
• Mapper phase: IO bound
• Final reducer phase: CPU bound

Shuffle file transfer
• Spark transfers all shuffle files to reducer memory before it starts processing
• Non-streaming (very hard to change to streaming)
• To avoid poor resource utilization:
– Make sure maxBytesInFlight is set large enough
– Consider allocating 2x more threads than the number of physical cores

Thanks.
Gavin Li liyu@yahoo-inc.com
Jaebong Kim pitecus@yahoo-inc.com
Andrew Feng afeng@yahoo-inc.com