Vowpal Wabbit: A Machine Learning System
Transcription
Vowpal Wabbit: A Machine Learning System
John Langford, Microsoft Research
http://hunch.net/~vw/
git clone git://github.com/JohnLangford/vowpal_wabbit.git

Why does Vowpal Wabbit exist?
1. Prove research.
2. Curiosity.
3. Perfectionism.
4. Solve the problem better.

A user base becomes addictive
1. Mailing list of >400.
2. The official strawman for large scale logistic regression @ NIPS :-)
3. An example:
   wget http://hunch.net/~jl/VW_raw.tar.gz
   vw -c rcv1.train.raw.txt -b 22 --ngram 2 --skips 4 -l 0.25 --binary
   provides stellar performance in 12 seconds.

Surface details
1. BSD license, automated test suite, github repository.
2. VW supports all I/O modes: executable, library, port, daemon, service (see next).
3. VW has a reasonable++ input format: sparse, dense, namespaces, etc.
4. Mostly C++, but bindings in other languages of varying maturity (python good).
5. A substantial user base + developer base. Thanks to many who have helped.

VW service
Problem: how do you deploy a model for large scale use?
Solution: a hosted VW service (see http://tinyurl.com/vw-azureml).

This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems?
4. solve interactive problems?

Using all the data: Step 1
Data too big for RAM ⇒ online learning.
An active research area: 4-5 papers related to online learning algorithms in VW.

Using all the data: Step 2
1. 3.2 × 10^6 labeled emails.
2. 433,167 users.
3. ~40 × 10^6 unique tokens.
How do we construct a spam filter which is personalized, yet uses global information?
Bad answer: construct a global filter + 433,167 personalized filters using a conventional hashmap to specify features. This might require 433167 × 40×10^6 × 4 bytes ≈ 70 terabytes of RAM.

Using Hashing
Use hashing to predict according to ⟨w, φ(x)⟩ + ⟨w, φ_u(x)⟩. The text document (email) x is tokenized into a bag of words x_l, each token is duplicated with the user id prefixed (NEU, Votre, Apotheke, en, ligne, Euro, ... plus USER123_NEU, USER123_Votre, USER123_Apotheke, USER123_en, USER123_ligne, USER123_Euro, ...), and all tokens are hashed into one sparse vector x_h that the shared weight vector w classifies. (In VW: specify the userid as a feature and use -q.)
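As a concrete illustration of the hashing trick just described, here is a minimal Python sketch. The helper names are hypothetical and Python's built-in hash() stands in for VW's murmurhash: each token is used twice, once globally and once prefixed with the user id, and both copies land in the same fixed-size weight vector.

def hashed_features(tokens, user_id, num_bits=26):
    """Hash global + user-duplicated tokens into a 2^num_bits sparse vector."""
    mask = (1 << num_bits) - 1
    x = {}
    for tok in tokens:
        for feat in (tok, user_id + "_" + tok):  # global copy + personalized copy
            idx = hash(feat) & mask              # collisions are simply tolerated
            x[idx] = x.get(idx, 0.0) + 1.0
    return x

def score(w, x):
    """Linear prediction <w, phi(x)> + <w, phi_u(x)> over the shared weights."""
    return sum(w.get(i, 0.0) * v for i, v in x.items())

x = hashed_features(["NEU", "Votre", "Apotheke", "en", "ligne", "Euro"], "USER123")

With num_bits = 26 this is exactly the 2^26 = 64M parameter table whose results follow.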
Results
[Figure: spam filtering performance of global, personalized, and baseline filters as the hash table size grows from 18 to 26 bits; the hashed combination approaches the baseline.]
2^26 parameters = 64M parameters = 256MB of RAM. A ~270,000× savings in RAM requirements.

Applying for a fellowship in 1997
Interviewer: So, what do you want to do?
John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational windtunnels!
The worst part: he had a point.

Using all the data: Step 3
Given 2.1 terafeatures of data, how can you learn a good linear predictor f_w(x) = Σ_i w_i x_i?
17B examples, 16M parameters, 1K nodes: how long does it take?
70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine ⇒ faster than all possible single machine linear learning algorithms.

MPI-style AllReduce
Allreduce initial state: one value per node: 5 1 7 2 6 3 4.
Create binary tree: 7 5 1 6 2 3 4.
Reducing, step 1: children add their values into their parents: 7 8 1 13 2 3 4.
Reducing, step 2: the root accumulates the global sum: 28 8 1 13 2 3 4.
Broadcast, step 1: the sum flows back down: 28 28 1 28 2 3 4.
Allreduce final state: every node holds 28.
AllReduce = Reduce + Broadcast.
Properties:
1. How long does it take? O(1) time(*)
2. How much bandwidth? O(1) bits(*)
3. How hard to program? Very easy.
(*) When done right.

An Example Algorithm: Weight averaging
n = AllReduce(1)
While (pass number < max):
1. While (examples left):
   1.1 Do online update.
2. AllReduce(weights)
3. For each weight: w ← w/n
Code tour.

What is Hadoop AllReduce?
1. Map job moves program to data.
2. Delayed initialization: most failures are disk failures. First read (and cache) all data, before initializing allreduce.
3. Speculative execution: in a busy cluster, one node is often slow. Use speculative execution to start additional mappers.
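A toy, single-process simulation of the tree AllReduce above may help fix the idea. Real VW nodes exchange these sums over sockets; the heap-style array layout here is just for brevity and is an assumption of this sketch, not VW's wire protocol.

def allreduce(values):
    """Return a list in which every 'node' holds the global sum."""
    n = len(values)
    sums = list(values)
    # Reduce: each node adds its subtree sum into its parent,
    # iterating leaves-to-root in a heap-style array layout.
    for i in range(n - 1, 0, -1):
        sums[(i - 1) // 2] += sums[i]
    # Broadcast: the root's total flows back to every node.
    return [sums[0]] * n

print(allreduce([5, 1, 7, 2, 6, 3, 4]))  # -> [28, 28, 28, 28, 28, 28, 28]

Weight averaging is then two calls: n = allreduce(ones) counts the nodes, allreduce(weights) sums the models, and each node divides by n.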
Robustness & Speedup
[Figure: speedup per method (average, min, and max over 10 runs) vs number of nodes, 10 to 100, against the ideal linear-speedup line.]

Splice Site Recognition
[Figure: auPRC (0.2 to 0.55) over 50 iterations for online learning, L-BFGS with 5 online passes, L-BFGS with 1 online pass, and plain L-BFGS.]
[Figure: auPRC vs effective number of passes over the data (0 to 20) for L-BFGS with one online pass, Zinkevich et al., and Dekel et al.]

This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?

Applying Machine Learning in Practice
1. Ignore the mismatch. Often faster.
2. Understand the problem and find a more suitable tool. Often better.

Importance-Weighted Classification
Given training data {(x_1, y_1, c_1), ..., (x_n, y_n, c_n)}, produce a classifier h: X → {0, 1}.
Unknown underlying distribution D over X × {0, 1} × [0, ∞).
Find h with small expected cost:
  ℓ(h, D) = E_{(x,y,c)∼D}[c · 1(h(x) ≠ y)]

Where does this come up?
1. Spam prediction (ham predicted as spam is much worse than spam predicted as ham).
2. Distribution shifts (optimize search engine results for monetizing queries).
3. Boosting (reweight problem examples for residual learning).
4. Large scale learning (downsample the common class and importance weight to compensate).

Multiclass Classification
Distribution D over X × Y, where Y = {1, ..., k}. Find a classifier h: X → Y minimizing the multi-class loss on D:
  ℓ_k(h, D) = Pr_{(x,y)∼D}[h(x) ≠ y]
1. Categorization: which of k things is it?
2. Actions: which of k choices should be made?

Use in VW
Multiclass label format: Label [Importance] ['Tag]
Methods:
  oaa k        one-against-all prediction; the baseline.  O(k) time.
  ect k        error correcting tournament.               O(log k) time.
  log_multi n  adaptive log time.                         O(log n) time.

The One-Against-All (OAA) reduction
Create k binary problems, one per class. For class i predict "Is the label i or not?"
  (x, y) ↦ (x, 1(y = 1)), (x, 1(y = 2)), ..., (x, 1(y = k))

The inconsistency problem
Given an optimal binary classifier, one-against-all doesn't produce an optimal multiclass classifier. With three classes of conditional probability 1/2 − δ, 1/4 + δ/2, and 1/4 + δ/2, every positive class has probability below 1/2:

  Binary problem   Positive class prob.   Optimal prediction
  1v23             1/2 − δ                0
  2v13             1/4 + δ/2              0
  3v12             1/4 + δ/2              0

Each optimal binary classifier predicts "not my class", so the joint prediction is undefined.
Solution: always use one-against-all regression.
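A sketch of the reduction in Python. The learner interface here is hypothetical (VW's oaa/csoaa internals differ): each multiclass example expands into k binary ones, and at test time every class is scored by a regressor and the argmax is taken, per the regression fix just noted.

def oaa_examples(x, y, k):
    """Expand (x, y) into the k binary problems 'is the label i or not?'."""
    return [(i, x, 1.0 if y == i else 0.0) for i in range(1, k + 1)]

def oaa_predict(regressors, x):
    """Consistent OAA: argmax over per-class regression scores, rather than
    trusting any single hard binary classifier."""
    return max(regressors, key=lambda i: regressors[i](x))

# e.g. regressors = {1: r1, 2: r2, 3: r3}, each ri(x) estimating P(y = i | x)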
Cost-sensitive Multiclass Classification
Distribution D over X × [0, 1]^k, where a vector in [0, 1]^k specifies the cost of each of the k choices. Find a classifier h: X → {1, ..., k} minimizing the expected cost:
  cost(h, D) = E_{(x,c)∼D}[c_{h(x)}]
Where does this come up?
1. Is this packet {normal, error, attack}?
2. A subroutine used later...

Use in VW
Label information via a sparse vector.
A test example:
  |Namespace Feature
A test example with only classes 1, 2, 4 valid:
  1: 2: 4: |Namespace Feature
A training example with only classes 1, 2, 4 valid:
  1:0.4 2:3.1 4:2.2 |Namespace Feature
Methods:
  csoaa k     cost-sensitive OAA prediction. O(k) time.
  csoaa_ldf   label-dependent-features OAA.
  wap_ldf     label-dependent-features weighted-all-pairs.
Code tour.

Let's take a break.

This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?

Structured Prediction
Structured prediction = joint prediction with joint loss.
Example: machine translation.
Simple example: part of speech tagging.
  Pierre     Vinken     ,      61      years   old
  Proper N.  Proper N.  Comma  Number  Noun    Adj.

How can you best do structured prediction?
We care about:
1. Programming complexity. Most structured prediction problems are not addressed with structured learning algorithms, because it is too complex to do so.
2. Prediction accuracy. It had better work well.
3. Train speed. Debug/development productivity + maximum data input.
4. Test speed. Application efficiency.

A program complexity comparison
[Figure: lines of code for POS tagging, log scale from 1 to 1000, for CRF SGD, CRF++, S-SVM, and Search.]

[Figure: part-of-speech tagging accuracy (tuned hyperparameters) vs training time for VW Search, VW Search (own features), VW Classification, CRF SGD, CRF++, Structured Perceptron, Structured SVM, and Str. SVM (DEMI-DCD); structured methods cluster at 95.0-96.6 per-tag accuracy, plain VW classification at 90.7.]

Prediction (test-time) speed
[Figure: thousands of tokens per second on POS and NER for VW Search, VW Search (own features), CRF SGD, CRF++, Structured Perceptron, Structured SVM, and Str. SVM (DEMI-DCD).]

How do you program?
Sequential_RUN(examples)
1: for i = 1 to len(examples) do
2:   prediction ← predict(examples[i], examples[i].label)
3:   if output.good then
4:     output ' ' prediction
5:   end if
6: end for
In essence, write the decoder, providing a little bit of side information for training.
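To make the pattern concrete, here is a hedged Python transliteration of Sequential_RUN. The Example type and the predict callback are stand-ins of this sketch, not VW's actual learning-to-search API.

from collections import namedtuple

Example = namedtuple("Example", ["features", "label"])

def sequential_run(examples, predict):
    """The test-time decoder, with the true label passed alongside each
    example as training-time side information."""
    tags = []
    for ex in examples:
        # predict() may consult ex.label as the oracle action at train time;
        # at test time the label is ignored.
        tags.append(predict(ex, ex.label))
    return tags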
A more involved decoder reports its own loss:
Seq_Detection(examples, false_negative_loss)
Let max_label = 1, max_prediction = 1
for i = 1 to len(examples) do
  max_label ← max(max_label, examples[i].label)
end for
for i = 1 to len(examples) do
  max_prediction ← max(max_prediction, predict(examples[i], examples[i].label))
end for
if max_label > max_prediction then
  loss(false_negative_loss)
else if max_label < max_prediction then
  loss(1)
else
  loss(0)
end if

How does it work?
An application of Learning to Search algorithms (e.g. Searn/DAgger). The decoder is run many times at train time to optimize predict(...) for loss(...).

A Search Space
[Figure: a search space unrolled from a start state through predictions to end states, each end state carrying a loss.]

Learning to Search
[Figure: roll in to a state, roll out each possible prediction to an end state, then collapse the rollouts into per-action losses.]

TDOLR program equivalence
Theorem: Every algorithm which:
1. always terminates,
2. takes as input relevant feature information X,
3. makes 0+ calls to predict, and
4. reports loss on termination
defines a search space, and such an algorithm exists for every search space.

An outline
1. How?
   1.1 Programming
   1.2 Learning to Search
   1.3 Equivalence
2. Other Results

Named Entity Recognition
Is this word part of an organization, person, or not?
[Figure: NER F-score (per entity, tuned hyperparameters) vs training time; scores range from 73.3 to 80.0 across the same systems as the POS comparison.]

Entity Relation
Goal: find the entities and then find their relations.
  Method          Entity F1   Relation F1   Train time
  Structured SVM  88.00       50.04         300 seconds
  L2S             92.51       52.03         13 seconds
Requires about 100 lines of code.

Dependency Parsing
Goal: find the dependency structure of words in a sentence.
  Method                  Dependency accuracy
  Redshift (single beam)  89.25
  L2S                     90.27
  Redshift (beam search)  91.32
Requires about 450 lines of code.

A demonstration
wget http://bilbo.cs.uiuc.edu/~kchang10/tmp/wsj.vw.zip
vw -b 24 -d wsj.train.vw -c --search_task sequence --search 45 --search_alpha 1e-8 --search_neighbor_features -1:w,1:w --affix -1w,+1w -f foo.reg
vw -t -i foo.reg wsj.test.vw

This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?

Examples of Interactive Learning
Repeatedly:
1. A user comes to Microsoft (with a history of previous visits, IP address, data related to an account).
2. Microsoft chooses information to present (urls, ads, news stories).
3. The user reacts to the presented information (clicks on something, clicks, comes back and clicks again, ...).
Microsoft wants to interactively choose content and use the feedback to improve future choices.

Another Example: Clinical Decision Making
Repeatedly:
1. A patient comes to a doctor with symptoms, medical history, test results.
2. The doctor chooses a treatment.
3. The patient responds to it.
The doctor wants a policy for choosing targeted treatments for individual patients.

The Contextual Bandit Setting
For t = 1, ..., T:
1. The world produces some context x ∈ X.
2. The learner chooses an action a ∈ A.
3. The world reacts with reward r_a ∈ [0, 1].
Goal: learn a good policy for choosing actions given context.
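A toy run of this loop in Python may help. The uniform-random learner here is a placeholder of this sketch, not VW's exploration algorithm; the point is that only the chosen action's reward is ever seen, and that logging (x, a, r_a, p_a) is what makes the offline evaluation below possible.

import random

def bandit_loop(T, contexts, actions, true_reward):
    samples = []
    for t in range(T):
        x = random.choice(contexts)   # 1. world produces a context
        a = random.choice(actions)    # 2. learner chooses an action...
        p = 1.0 / len(actions)        #    ...recording its probability
        r = true_reward(x, a)         # 3. world reacts with reward r_a
        samples.append((x, a, r, p))  # exploration sample (x, a, r_a, p_a)
    return samples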
The Direct Method
Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a).
Example: the deployed policy always takes a1 on x1 and a2 on x2; occasional errors reveal a1 on x2 (reward .3), but cell (x1, a2) stays unobserved and the regressor fills it in by generalization:

  Observed/Estimated/True   a1           a2
  x1                        .8/.8/.8     ?/.514/1
  x2                        .3/.3/.3     .2/.014/.2

The estimate for the unexplored (x1, a2) cell is .514, so argmax keeps choosing a1 on x1 even though the true reward of a2 there is 1.
Basic observation 1: generalization is insufficient.
Basic observation 2: exploration is required.
Basic observation 3: errors ≠ exploration.

The Evaluation Problem
Let π: X → A be a policy mapping features to actions. How do we evaluate it?
Method 1: deploy the algorithm in the world. Very expensive!

The Importance Weighting Trick
One answer: collect T exploration samples (x, a, r_a, p_a), where
  x   = context
  a   = action
  r_a = reward for the action
  p_a = probability of the action a
then evaluate:
  Value(π) = Average( r_a · 1(π(x) = a) / p_a )
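The estimator above is a one-liner over exploration samples such as those returned by the bandit_loop sketch earlier (the policy argument is any function from context to action):

def value(policy, samples):
    """Importance-weighted (unbiased) estimate of policy's expected reward."""
    return sum(r * (policy(x) == a) / p for x, a, r, p in samples) / len(samples)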
Theorem: For all policies π, for all IID data distributions D, Value(π) is an unbiased estimate of the expected reward of π:
  E_{(x, r⃗)∼D}[r_{π(x)}] = E[Value(π)]
Proof: E_{a∼p}[r_a · 1(π(x) = a) / p_a] = Σ_a p_a · r_a · 1(π(x) = a) / p_a = r_{π(x)}.

Example:
  Action       1      2
  Reward       0.5    1
  Probability  1/4    3/4
  Estimate     2 | 0  0 | 4/3
(For each action, the estimate is r_a/p_a if the policy picks it and 0 otherwise; either way the expectation matches the true reward.)

How do you test things?
Use the format: action:cost:probability | features
Example:
1:1:0.5 | tuesday year million short compan vehicl line stat financ commit exchang plan corp subsid credit issu debt pay gold bureau prelimin refin billion telephon time draw basic relat file spokesm reut secur acquir form prospect period interview regist toront resourc barrick ontario qualif bln prospectus convertibl vinc borg arequip ...

How do you train?
Reduce to cost-sensitive classification.
vw --cb 2 --cb_type dr rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.25
  Progressive 0/1 loss: 0.04582
vw --cb 2 --cb_type ips rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125
  Progressive 0/1 loss: 0.05065
vw --cb 2 --cb_type dm rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125
  Progressive 0/1 loss: 0.04679

Reminder: the Contextual Bandit Setting
For t = 1, ..., T:
1. The world produces some context x ∈ X.
2. The learner chooses an action a ∈ A.
3. The world reacts with reward r_a ∈ [0, 1].
Goal: learn a good policy for choosing actions given context.

What does learning mean?
Efficiently competing with some large reference class of policies Π = {π: X → A}:
  Regret = max_{π∈Π} average_t (r_{π(x_t)} − r_{a_t})

Building an Algorithm
Let Q_1 = the uniform distribution over policies.
For t = 1, ..., T:
1. The world produces some context x ∈ X.
2. Draw π ∼ Q_t.
3. The learner chooses an action a ∈ A using π(x).
4. The world reacts with reward r_a ∈ [0, 1].
5. Update Q_{t+1}.

What is a good Q_t?
  Exploration: Q_t allows discovery of good policies.
  Exploitation: Q_t is small on bad policies.

How do you find Q_t? By reduction ... [details complex, but coded]
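The actual construction of Q_t is by reduction and is complex (see the Explore reference below); as a loose illustration of the exploration/exploitation shape only, an ε-greedy distribution mixes a little uniform exploration into the current best policy. Note this stand-in is not the algorithm VW implements.

import random

def epsilon_greedy(best_policy, actions, epsilon=0.05):
    """Act mostly by the current best policy, exploring uniformly with
    probability epsilon, and report the action's true probability under the
    mixture (needed for the importance-weighted estimates above)."""
    def act(x):
        greedy = best_policy(x)
        a = random.choice(actions) if random.random() < epsilon else greedy
        p = epsilon / len(actions) + (1 - epsilon) * (a == greedy)
        return a, p
    return act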
How well does this work?
[Figure: losses on the CCAT RCV1 problem, roughly 0 to 0.12, for ε-greedy, τ-first, LinUCB, Cover, and CB*.]

How long does this take?
[Figure: running times in seconds (log scale, 10 to 10^6) on the CCAT RCV1 problem for the same methods.]

Further reading
VW wiki: https://github.com/JohnLangford/vowpal_wabbit/wiki
Search for: NYU large scale learning class
NIPS tutorial: http://hunch.net/~jl/interact.pdf

Bibliography
Release: Vowpal Wabbit open source project, http://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.
L2Search: H. Daumé III, J. Langford, and S. Ross, Efficient programmable learning to search, http://arxiv.org/abs/1406.1837
Explore: A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire, Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits, http://arxiv.org/abs/1402.0555
Terascale: A. Agarwal, O. Chapelle, M. Dudik, and J. Langford, A Reliable Effective Terascale Linear Learning System, http://arxiv.org/abs/1110.4198

Bibliography: Parallel (other)
Average: G. Mann et al., Efficient large-scale distributed training of conditional maximum entropy models, NIPS 2009.
Ov. Av.: M. Zinkevich, M. Weimer, A. Smola, and L. Li, Parallelized Stochastic Gradient Descent, NIPS 2010.
D. Mini: O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, Optimal Distributed Online Prediction Using Mini-Batches, http://arxiv.org/abs/1012.1367