MALT: Distributed Data-Parallelism for Existing ML Applications
Transcription
MALT: Distributed Data-Parallelism for Existing ML Applications
Hao Li, Asim Kadav, Erik Kruus, Cristian Ungureanu

ML transforms data into insights
Large amounts of data are being generated by user-software interactions, social networks, and hardware devices. Timely insights depend on providing accurate and up-to-date machine learning (ML) models over this data. Large learning models, trained on large datasets, often improve model accuracy [1].

Properties of ML applications
Machine learning tasks have all of the following properties:
• Fine-grained and incremental: ML tasks perform repeated model updates over new input data.
• Asynchronous: ML tasks may communicate asynchronously, e.g. when exchanging model information or during back-propagation.
• Approximate: ML applications are stochastic, and an approximation of the trained model is often sufficient.
• Need a rich developer environment: Developing ML applications requires a rich set of libraries, tools, and graphing abilities, which is often missing in many highly scalable systems.

Our Solution: MALT
Goal: Provide an efficient library that adds data-parallelism to existing ML applications.
MALT performs peer-to-peer machine learning. It provides abstractions for fine-grained in-memory updates using one-sided RDMA, limiting data movement costs when training models. MALT allows machine learning developers to specify the dataflow and apply communication and representation optimizations.
[Figure: model replicas 1 through 6, each running SGD over its own data shard with its own primary parameter copy (V1, V2, ..., Vn), exchanging model updates with their peers.]

Network-efficient learning
Traditional approach: all-reduce exchange of model information. As the number of nodes (N) increases, the total number of updates transmitted in the network grows as O(N^2).
MALT model propagation: In peer-to-peer learning, instead of sending model information to all replicas, MALT sends model updates to log(N) nodes, chosen such that (i) the graph of all nodes is connected and (ii) the model updates are disseminated uniformly across all nodes. Each machine sends updates to log(N) nodes (to nodes N/2 + i, N/4 + i, and so on, for node i). As N increases, the outbound offsets follow the Halton sequence (N/2, N/4, 3N/4, N/8, 3N/8, ...), and the total number of updates transmitted grows as O(N log N).
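As an illustration of this propagation rule, the C++ sketch below computes the outbound peer set for node i from the Halton-style offsets N/2, N/4, 3N/4, N/8, 3N/8, ...; the function name, the ceil(log2 N) fan-out cut-off, and the wrap-around indexing are assumptions made for this sketch, not MALT's actual implementation.

#include <cmath>
#include <cstdio>
#include <vector>

// Outbound peer set for node i in an N-node cluster.  Offsets follow the
// sequence listed above (N/2, then N/4, 3N/4, then N/8, 3N/8, 5N/8, 7N/8, ...)
// until roughly log2(N) distinct destinations have been chosen.
std::vector<int> halton_peers(int i, int N) {
    std::vector<int> peers;
    const int fanout = static_cast<int>(std::ceil(std::log2(static_cast<double>(N))));
    for (int denom = 2; denom <= N && static_cast<int>(peers.size()) < fanout; denom *= 2) {
        for (int num = 1; num < denom && static_cast<int>(peers.size()) < fanout; num += 2) {
            const int offset = (num * N) / denom;   // N/2, N/4, 3N/4, N/8, ...
            peers.push_back((i + offset) % N);      // wrap around the node ring
        }
    }
    return peers;
}

int main() {
    for (int dest : halton_peers(/*i=*/0, /*N=*/16))
        std::printf("%d ", dest);                   // prints: 8 4 12 2
    std::printf("\n");
    return 0;
}

For N = 16, node 0 sends to nodes 8, 4, 12, and 2, so one round of updates costs N * log2(N) = 64 messages network-wide instead of the N * (N - 1) = 240 messages of an all-to-all exchange.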
Design requirements
• Efficient model updates with local communication
• Asynchronous and approximate training
• Re-use of the existing developer environment

MALT's solution
MALT provides a scatter-gather API: scatter sends model updates to the peer replicas, and gather applies any user-defined function on the received values. Models train and scatter their updates into per-sender receive queues; when used with one-sided RDMA writes, this mechanism ensures no interruption to the receiver CPU. MALT allows different consistency models to trade off consistency and training time. MALT works with existing applications and is currently integrated with SVM-SGD, HogWild-MF, and NEC RAPID.
[Figure: MALT software stack — applications use the MALT vector object library, built over dstorm (distributed one-sided remote memory); training data is loaded from HDFS/NFS.]

Transforming serial SGD (Stochastic Gradient Descent) to data-parallel SGD:

procedure SERIALSGD
    Gradient g;
    Parameter W;
    for epoch = 1 : maxEpochs do
        for i = 1 : maxData do
            g = cal_gradient(data[i]);
            W = W + g;
    return W

Data-parallel SGD with MALT:

procedure PARALLELSGD
    maltGradient g(SPARSE, ALL);
    Parameter W;
    for epoch = 1 : maxEpochs do
        for i = 1 : maxData/totalMachines do
            g = cal_gradient(data[i]);
            g.scatter(ALL);
            g.gather(AVG);
            W = W + g;
    return W

Results
We evaluate MALT over the following applications and datasets:

Application (Dataset)                 Model                  # Parameters   Dataset size (uncompressed)
Document classification (RCV1)        SVM                    47K            480 MB
Image classification (PASCAL alpha)   SVM                    500            1 GB
DNA detection (DNA)                   SVM                    800            10 GB
Genome detection (splice-site)        SVM                    11M            250 GB
Webspam detection (webspam)           SVM                    16.6M          10 GB
Collaborative filtering (netflix)     Matrix factorization   14.9M          1.6 GB
Ad prediction (KDD 2012)              Neural networks        12.8M          3.1 GB

Figure 4. Time to convergence for the RCV1 workload for MALT-all and a single-machine workload. We find that MALT converges more quickly to the desired accuracy.
Figure 9. Comparison of MALT-Halton with a parameter server (PS) for distributed SVM on the webspam workload with asynchronous training, with achieved loss values for 20 ranks.
Figure 10. Convergence under bulk synchronous (BSP), asynchronous (ASP), and bounded-staleness (SSP) processing for the splice-site workload.
Figure 11. Convergence (loss vs time in seconds) for the RCV1 dataset for MALT-all (left) and MALT-Halton (right) for different communication batch sizes. We find that MALT-Halton converges faster than MALT-all.

We compare the speedup of the systems under test, and the benefit of different synchronization methods, by running them until they reach the same loss value, and compare the total time and the number of iterations (passes) over the data per machine. Distributed training requires fewer iterations per machine since examples are processed in parallel. For each of our experiments, we pick the desired final optimization goal as the one achieved by running single-rank SGD [6, 12]. Figure 4 compares the speedup of MALT-all with a single machine for the RCV1 dataset [12], for a communication batch size (cb size) of 5000. By a cb size of 5000, we mean that every model replica communicates its accumulated updates after processing 5000 examples.
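To make the PARALLELSGD transformation and the communication batch size concrete, here is a minimal C++ sketch of one replica's training loop. compute_gradient, scatter_to_peers, and average_with_received are illustrative stand-ins for the application's gradient code and for MALT's scatter()/gather(AVG) primitives (which use one-sided RDMA writes into per-sender receive queues rather than the local stubs shown here); the descent-style update and the learning rate are assumptions of this sketch, not MALT's API.

#include <algorithm>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Stand-ins for MALT's communication primitives: in MALT, scatter() writes the
// update into per-sender receive queues on the log(N) peers, and gather(AVG)
// averages whatever the peers have deposited locally.  Here they are no-ops so
// the sketch stays self-contained.
void scatter_to_peers(const Vec& update) { (void)update; }
void average_with_received(Vec& update)  { (void)update; }

// Application-specific gradient for one example (e.g. an SVM hinge-loss step).
Vec compute_gradient(const Vec& w, const Vec& example) {
    (void)example;
    return Vec(w.size(), 0.0);
}

// One replica's data-parallel SGD loop with a communication batch size (cb):
// train locally on cb examples, exchange the accumulated gradient, apply it.
Vec parallel_sgd(Vec w, const std::vector<Vec>& local_data,
                 int max_epochs, std::size_t cb, double lr) {
    Vec g(w.size(), 0.0);
    for (int epoch = 0; epoch < max_epochs; ++epoch) {
        for (std::size_t i = 0; i < local_data.size(); ++i) {
            Vec gi = compute_gradient(w, local_data[i]);
            for (std::size_t j = 0; j < w.size(); ++j) g[j] += gi[j];
            if ((i + 1) % cb == 0) {
                scatter_to_peers(g);             // MALT: g.scatter(ALL)
                average_with_received(g);        // MALT: g.gather(AVG)
                for (std::size_t j = 0; j < w.size(); ++j) w[j] -= lr * g[j];
                std::fill(g.begin(), g.end(), 0.0);
            }
        }
    }
    return w;
}

With cb = 5000, each replica scatters and averages its accumulated gradient after every 5000 examples, matching the cb size used in the RCV1 experiment above.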
Figure 12. Convergence (loss vs time in seconds) for the splice-site dataset for MALT-all and MALT-Halton. We find that MALT-Halton converges faster than MALT-all.
Figure 13. Data sent by MALT-all, MALT-Halton, and the parameter server over the whole network for the webspam workload. MALT-Halton reduces network communication costs and provides fast convergence. MALT sends and receives gradients, while the parameter server sends gradients but needs to receive whole models.

MALT-Halton converges faster than MALT-all for several reasons. First, each node sends updates to only log(N) machines. Second, each node performs model averaging over fewer (log(N)) incoming models. Hence, even though MALT-Halton may require more iterations than MALT-all, the time required for each iteration is lower, and the overall time to reach the desired accuracy is less. Finally, since MALT-Halton spreads its updates across all nodes, this aids faster convergence.

Figure 12 shows the model convergence for the splice-site dataset and the speedup over BSP-all in reaching the desired goal with 8 nodes. From the figure, we see that MALT-Halton converges faster than MALT-all. Furthermore, we find that until the model converges to the desired goal, each node in MALT-all sends out 370 GB of updates, while each node in MALT-Halton only sends 34 GB of data. As the number of nodes increases, the logarithmic fan-out of MALT-Halton should result in lower amounts of data transferred and faster convergence. To summarize, we find that MALT sends gradients (instead of sending the whole model), which saves network costs, and that MALT-Halton is network efficient and achieves speedup over MALT-all.

Network saturation tests: We perform infiniBand network throughput tests and measure the time to scatter updates in the MALT-all case with the SVM workload. In the synchronous case, we find that all ranks operate in lock-step fashion and, during the scatter phase, all machines send models at the line rate (about 5 GB/s). Specifically, for the webspam workload, we see about 5.1 GB/s (about 40 Gb/s) during scatter. In the asynchronous case, to saturate the network, we run multiple replicas on every machine. When running three ranks on every machine, we find that each machine sends model updates at 4.2 GB/s (about 33 Gb/s) for the webspam dataset. These tests demonstrate that high network bandwidth is beneficial for training models with a large number of parameters, and that using network-efficient techniques such as MALT-Halton can improve performance.
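As a small illustration of the model-averaging step described above, the sketch below folds together the gradients a receiver finds in its per-sender receive queues, in the spirit of gather(AVG); the queue representation (one slot per sender) and the function name are assumptions of this sketch, not MALT's actual data structures.

#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Average the updates deposited by peers.  With MALT-all a node averages up to
// N-1 incoming slots; with MALT-Halton only about log(N) slots are ever
// written, which is one reason each iteration is cheaper.
Vec average_received(const std::vector<Vec>& receive_slots, std::size_t dim) {
    Vec avg(dim, 0.0);
    std::size_t filled = 0;
    for (const Vec& slot : receive_slots) {
        if (slot.size() != dim) continue;            // empty or stale slot
        for (std::size_t j = 0; j < dim; ++j) avg[j] += slot[j];
        ++filled;
    }
    if (filled > 0)
        for (std::size_t j = 0; j < dim; ++j) avg[j] /= static_cast<double>(filled);
    return avg;
}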
Using MALT-Halton trades off the freshness of updates at peer replicas against savings in network communication time. For workloads where the model is dense and network communication costs are small compared to the update costs, the MALT-all configuration may provide similar or better results than MALT-Halton. For example, for the SSI workload, which uses a fully connected neural network, we only see a 1.1× speedup for MALT-Halton over MALT-all. However, as the number of nodes and model sizes increase, the cost of communication begins to dominate, and using MALT-Halton is beneficial.

Figure 13 shows the data sent by MALT-all, MALT-Halton, and the parameter server over the entire network for the webspam workload. We find that MALT-Halton is the most network efficient. Webspam is a high-dimensional workload, and MALT-Halton only sends updates to log(N) nodes, while the parameter server sends gradients but needs to receive the whole model from the central server. We note that other optimizations such as compression and filters can further reduce the network costs, as noted in [36]. Furthermore, when the parameter server is replicated for high availability, there is additional network traffic from the N (asynchronous) messages needed for N-way chain replication of the parameters.

The MR-SVM algorithm is based on common Hadoop implementations and communicates gradients after every partition epoch [56]. We implement MR-SVM using the MALT library and run it over our infiniBand cluster. MR-SVM uses one-shot averaging at the end of every epoch to communicate parameters (cb size = 25K), while MALT is designed for low-latency communication and communicates parameters more often (cb size = 1K). Figure 5 shows the speedup by iterations for the PASCAL alpha workload for MR-SVM (implemented over MALT) and MALT-all. We find that both workloads achieve super-linear speedup over a single machine on the PASCAL alpha dataset; this happens because the averaging effect of gradients provides super-linear speedup for certain datasets [52]. In addition, we find that MALT provides 3× speedup by iterations (about 1.5× by time) over MR-SVM. MALT converges faster since it is designed over low-latency communication and sends gradients more frequently. This result shows that existing algorithms designed for map-reduce may not provide the most optimal speedup on low-latency frameworks such as MALT.
Figure 5. Speedup by iterations on the PASCAL alpha workload for MALT-SVM and MR-SVM; the MR-SVM algorithm is implemented using the MALT library over infiniBand. We achieve super-linear speedup for some workloads because of the averaging effect from parallel replicas [52].

Figure 6 shows the speedup with time to convergence for ad-click prediction implemented using a fully connected, three-layer neural network. This three-layer network needs to synchronize parameters at each layer. Furthermore, these networks have dense parameters, and there is computation in both the forward and the reverse direction. Hence, fully connected neural networks are harder to scale than convolution networks [23]. We show the speedup from using MALT-all to train over the KDD 2012 data on 8 processes compared with a single machine, and obtain up to 1.5× speedup with 8 ranks.
Figure 6. AUC (Area Under Curve) vs time (in seconds) for a three-layer neural network for text learning (click prediction) using KDD 2012 data.

6.3 Developer Effort
We evaluate the ease of implementing parallel learning in MALT by adding support to the four applications listed in Table 3. For each application, we show the amount of code we modified as well as the number of new lines added. In Section 4, we described the specific changes required. The new code adds support for creating MALT objects and for scatter-gathering the model updates. In comparison, implementing a whole new algorithm takes hundreds of lines of new code, assuming underlying data parsing and arithmetic libraries are provided by the processing framework. On average, we moved 87 lines of code and added 106 lines, representing about 15% of the overall code.

6.4 Fault Tolerance
We evaluate the time required for convergence when a node fails and the failure is detected by the MALT fault monitor in a peer node. For a large DNA dataset, comparing the fault-free case with a single-rank failure, MALT is able to recover from the failure and train the model correctly.
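One plausible recovery step, sketched purely as an illustration (MALT's actual fault monitor and recovery protocol may differ): when a peer is declared failed, drop it from the set of live ranks and recompute this node's log(N) send list over the survivors, so that training continues on the remaining replicas. halton_peers is assumed to be a fan-out rule like the one sketched earlier.

#include <algorithm>
#include <vector>

// Illustrative only: state a replica keeps about its peers.
struct PeerState {
    std::vector<int> live_ranks;   // ranks currently believed to be alive
    std::vector<int> send_list;    // where this node scatters its updates
};

// On a failure notification, remove the dead rank and rebuild the send list
// by re-applying the Halton-style fan-out rule to the surviving ranks.
void on_peer_failure(PeerState& st, int my_rank, int failed_rank,
                     std::vector<int> (*halton_peers)(int index, int n)) {
    st.live_ranks.erase(
        std::remove(st.live_ranks.begin(), st.live_ranks.end(), failed_rank),
        st.live_ranks.end());
    const int n = static_cast<int>(st.live_ranks.size());
    const int my_index = static_cast<int>(
        std::find(st.live_ranks.begin(), st.live_ranks.end(), my_rank) -
        st.live_ranks.begin());
    st.send_list.clear();
    for (int idx : halton_peers(my_index, n))      // indices over survivors
        st.send_list.push_back(st.live_ranks[idx]);
}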
We integrate MALT with three applications: SVM [3], matrix factorization [4], and a neural network [5]. MALT requires reasonable developer effort and provides speedup over existing methods. We demonstrate that MALT outperforms single-machine performance for small workloads and can efficiently train models over large datasets that span multiple machines (see our paper in EuroSys 2015 for more results).

References and Related Work
[1] A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.
[2] J. Dean et al. Large scale distributed deep networks. NIPS 2012.
[3] L. Bottou. Large-scale machine learning with stochastic gradient descent. COMPSTAT 2010.
[4] B. Recht et al. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. NIPS 2011.
[5] B. Bai et al. SSI: Supervised Semantic Indexing. ACM CIKM 2009.