Abstracts Booklet - The New York Academy of Sciences
Sixth Annual Machine Learning Symposium
October 21, 2011

ABSTRACTS FOR POSTER SESSIONS & SPOTLIGHT TALKS

SCIENTIFIC ORGANIZING COMMITTEE

Naoki Abe, PhD, IBM Research
Corinna Cortes, PhD, Google
Patrick Haffner, PhD, AT&T Research
Tony Jebara, PhD, Columbia University
John Langford, PhD, Yahoo! Research
Mehryar Mohri, PhD, Courant Institute of Mathematical Sciences, NYU
Robert Schapire, PhD, Princeton University

ACKNOWLEDGMENT OF SUPPORT

Gold Sponsor: Google
Academy Friends: Yahoo! Labs
The "HackNY presentations for students" event is co-organized by Science Alliance with the support of hackNY.

AGENDA

9:30 AM   Breakfast & Poster Set-up
10:00 AM  Opening Remarks
10:10 AM  Stochastic Algorithms for One-Pass Learning (Léon Bottou, PhD, Microsoft adCenter)
10:55 AM  Spotlight Talks:
          Opportunistic Approachability (Andrey Bernstein, Columbia University)
          Large-Scale, Sparse Kernel Logistic Regression — With a Comparative Study on Optimization Algorithms (Shyam S. Chandramouli, Columbia University)
          Online Clustering with Experts (Anna Choromanska, PhD, Columbia University)
          Efficient Learning of Word Embeddings via Canonical Correlation Analysis (Paramveer Dhillon, University of Pennsylvania)
          A Reliable, Effective, Terascale Linear Learning System (Miroslav Dudik, PhD, Yahoo! Research)
11:20 AM  Networking and Poster Session
12:05 PM  Online Learning without a Learning Rate Parameter (Yoav Freund, PhD, University of California, San Diego)
1:00 PM   Networking Lunch
2:30 PM   Spotlight Talks:
          Large-Scale Collection Threading Using Structured k-DPPs (Jennifer Gillenwater, University of Pennsylvania)
          Online Learning for Mixed Membership Network Models (Prem Gopalan, Princeton University)
          Planning in Reward Rich Domains via PAC Bandits (Sergiu Goschin, Rutgers University)
          The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo (Matthew D. Hoffman, PhD, Columbia University)
          Place Recommendation with Implicit Spatial Feedback (Berk Kapicioglu, Princeton University, Sense Networks)
3:00 PM   Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers (Stephen Boyd, PhD, Stanford University)
3:45 PM   Spotlight Talks:
          Hierarchically Supervised Latent Dirichlet Allocation (Adler Perotte, MD, Columbia University)
          Image Super-Resolution via Dictionary Learning (Gungor Polatkan, Princeton University)
          MirroRank: Convex Aggregation and Online Ranking with the Mirror Descent (Benoit Rostykus, Ecole Centrale Paris)
          Preserving Proximity Relations and Minimizing Edge-crossings in Graph Embeddings (Amina Shabbeer, Rensselaer Polytechnic Institute)
          A Reinforcement Learning Approach to Variational Inference (David Wingate, PhD, Massachusetts Institute of Technology)
4:10 PM   Networking and Poster Session
5:00 PM   Student Award Winner Announcement & Closing Remarks
5:15 PM   End of Program
5:30 PM   HackNY Presentations for Students (Foursquare, Hunch, Intent Media, Etsy, Media6Degrees, Flurry)

SPEAKERS' BIOGRAPHIES

Léon Bottou, PhD
Microsoft adCenter

Léon Bottou received the Diplôme d'Ingénieur de l'Ecole Polytechnique (X84) in 1987, the Magistère de Mathématiques Fondamentales et Appliquées et d'Informatique from Ecole Normale Supérieure in 1988, the Diplôme d'Etudes Approfondies in Computer Science in 1988, and a PhD in Computer Science from LRI, Université de Paris-Sud in 1991. After his PhD, Bottou worked at AT&T Bell Laboratories from 1991 to 1992. He then became chairman of Neuristique, a small company pioneering machine learning for data mining applications.
He returned to AT&T Labs from 1995 to 2002, and was at NEC Labs America in Princeton from 2002 to March 2010. He joined the Science Team of the Microsoft Online Services Division in April 2010. Bottou's primary research interest is machine learning. His contributions to the field cover both theory and applications, with a particular interest in large-scale learning. Bottou's secondary research interest is data compression and coding. His best known contribution in this field is the DjVu document compression technology. Bottou has published over 80 papers and won the 2007 New York Academy of Sciences Blavatnik Award for Young Scientists. He is serving or has served on the boards of the Journal of Machine Learning Research and IEEE Transactions on Pattern Analysis and Machine Intelligence.

Yoav Freund, PhD
University of California, San Diego

Yoav Freund is a professor of Computer Science and Engineering at the University of California, San Diego. His work is in the area of machine learning, computational statistics, information theory, and their applications. He is best known for his joint work with Dr. Robert Schapire on the AdaBoost algorithm. For this work they were awarded the 2003 Gödel Prize in theoretical computer science, as well as the Kanellakis Prize in 2004.

Stephen P. Boyd, PhD
Stanford University

Stephen P. Boyd is the Samsung Professor of Engineering and Professor of Electrical Engineering in the Information Systems Laboratory at Stanford University. He received the A.B. degree in Mathematics from Harvard University in 1980, and the Ph.D. in Electrical Engineering and Computer Science from the University of California, Berkeley, in 1985, after which he joined the faculty at Stanford. His current research focus is on convex optimization applications in control, signal processing, and circuit design.

SPEAKERS' ABSTRACTS

Stochastic Algorithms for One-Pass Learning
Léon Bottou, Microsoft adCenter

The goal of the presentation is to describe practical stochastic gradient algorithms that process each training example only once, yet asymptotically match the performance of the true optimum. This statement needs, of course, to be made more precise. To achieve this, we'll review the works of Nevel'son and Has'minskij (1972), Fabian (1973, 1978), Murata & Amari (1998), Bottou & LeCun (2004), Polyak & Juditsky (1992), Wei Xu (2010), and Bach & Moulines (2011). We will then show how these ideas lead to practical algorithms that not only represent a new state of the art but are also arguably optimal.
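The iterate-averaging idea of Polyak & Juditsky (1992) cited in this abstract can be illustrated with a minimal sketch: run plain stochastic gradient descent in a single pass over the data while maintaining a running average of the iterates. The least-squares objective, step-size schedule, and toy data below are illustrative assumptions, not the algorithms of the talk.

```python
import numpy as np

def averaged_sgd(X, y, eta0=0.1):
    """Single-pass SGD on least squares with Polyak-Ruppert iterate averaging."""
    n, d = X.shape
    w = np.zeros(d)       # current SGD iterate
    w_bar = np.zeros(d)   # running average of all iterates
    for t in range(n):
        grad = (X[t] @ w - y[t]) * X[t]    # gradient of 0.5 * (x'w - y)^2
        w -= eta0 / np.sqrt(t + 1) * grad  # slowly decaying step size
        w_bar += (w - w_bar) / (t + 1)     # incremental average of iterates
    return w, w_bar

# toy usage: the averaged iterate is typically much closer to the optimum
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=10000)
w_last, w_avg = averaged_sgd(X, y)
print(np.linalg.norm(w_last - w_true), np.linalg.norm(w_avg - w_true))
```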
Online Learning without a Learning Rate Parameter
Yoav Freund, PhD, University of California, San Diego

Online learning is an approach to statistical inference based on the idea of playing a repeated game. A "master" algorithm receives the predictions of N experts before making its own prediction. Then the outcome is revealed, and experts and master suffer a loss. Algorithms have been developed for which the regret, the difference between the cumulative loss of the master and the cumulative loss of the best expert, is bounded uniformly over all sequences of expert predictions and outcomes. The most successful algorithms of this type are the exponential weights algorithms discovered by Littlestone and Warmuth and refined by many others. The exponential weights algorithm has a parameter, the learning rate, which has to be tuned appropriately to achieve the best bounds. This tuning typically depends on the number of experts and on the cumulative loss of the best expert.

We describe a new algorithm, NormalHedge, which has no parameter and achieves bounds comparable to tuned exponential weights algorithms. As the algorithm does not depend on the number of experts, it can be used effectively when the set of experts grows as a function of time and when the set of experts is uncountably infinite. In addition, the algorithm has a natural extension to continuous time and has a very tight analysis when the cumulative loss is described by an Itô process. This is joint work with Kamalika Chaudhuri and Daniel Hsu.

Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
Stephen P. Boyd, Stanford University

Problems in areas such as machine learning and dynamic optimization on a large network lead to extremely large convex optimization problems, with problem data stored in a decentralized way and processing elements distributed across a network. We argue that the alternating direction method of multipliers (ADMM) is well suited to such problems. The method was developed in the 1970s, with roots in the 1950s, and is equivalent to or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas-Rachford splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative algorithms for ℓ1 problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to statistical and machine learning problems such as the lasso and support vector machines, and to dynamic energy management problems arising in the smart grid.
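As a concrete instance of the lasso application mentioned in this abstract, here is a minimal ADMM sketch following the generic pattern of an x-update, a z-update, and a dual update on the split x = z. The penalty parameter rho and the fixed iteration count are illustrative assumptions, not the talk's implementation.

```python
import numpy as np

def soft_threshold(v, k):
    """Proximal operator of k * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 via ADMM on the split x = z."""
    n = A.shape[1]
    z = np.zeros(n)
    u = np.zeros(n)                      # scaled dual variable
    AtA = A.T @ A + rho * np.eye(n)      # fixed ridge system for the x-update
    Atb = A.T @ b
    for _ in range(n_iter):
        x = np.linalg.solve(AtA, Atb + rho * (z - u))  # x-update (ridge solve)
        z = soft_threshold(x + u, lam / rho)           # z-update (L1 prox)
        u += x - z                                     # dual update
    return z
```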
INDEX OF POSTER ABSTRACTS

Multiple Instance Regression without the Ambiguity (Charles Bergeron, Rensselaer Polytechnic Institute)
Opportunistic Approachability (Andrey Bernstein*, Columbia University)
Large-Scale, Sparse Kernel Logistic Regression — With a Comparative Study on Optimization Algorithms (Shyam S. Chandramouli*, Columbia University)
Visualizing Topic Models (Allison Chaney, Princeton University)
Weakly Supervised Neural Nets for POS Tagging (Sumit Chopra, AT&T Labs Research)
Online Clustering with Experts (Anna Choromanska*, Columbia University)
Efficient Learning of Word Embeddings via Canonical Correlation Analysis (Paramveer Dhillon*, University of Pennsylvania)
Improved Semi-Supervised Learning Using Constraints and Relaxed Entropy Regularization via Deterministic Annealing (Paramveer Dhillon, University of Pennsylvania)
A Reliable Effective Terascale Linear Learning System (Miroslav Dudik*, Yahoo! Research)
Density Estimation Based Ranking from Decision Trees (Haimonti Dutta, Columbia University)
Optimization Techniques for Large-Scale Learning (Clement Farabet, New York University)
Real-time, Multi-class Segmentation using Depth Cues (Clement Farabet, New York University)
A Text-based HMM Model of Foreign Affairs Sentiment: A Mechanical Turker's History of Recent Geopolitical Events (Sean Gerrish, Princeton University)
A Probabilistic Foundation for Policy Priors (Sam Gershman, Princeton University)
Large-Scale Collection Threading Using Structured k-DPPs (Jennifer Gillenwater*, University of Pennsylvania)
Online Learning for Mixed Membership Network Models (Prem Gopalan*, Princeton University)
Planning in Reward Rich Domains via PAC Bandits (Sergiu Goschin*, Rutgers University)
Nonparametric, Multivariate Convex Regression with Applications to Value Function Approximation (Lauren A. Hannah, Duke University)
The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo (Matt Hoffman*, Columbia University)
Distributed Collaborative Filtering Over Social Networks (Sibren Isaacman, Princeton University)
Can Public Data Help With Differentially-Private Machine Learning? (Geetha Jaganathan, Columbia University)
Place Recommendation with Implicit Spatial Feedback (Berk Kapicioglu*, Princeton University)
Recovering Euclidean Distance Matrices via Landmark MDS (Akshay Krishnamurthy, Carnegie Mellon University)
Efficient Evaluation of Large Sequence Kernels (Pavel Kuksa, NEC Laboratories America, Inc.)
Unsupervised Hashing with Graphs (Wei Liu, Columbia University)
Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction (Siwei Lyu, State University of New York, Albany)
Active Prediction in Graphical Models (Satyaki Mahalanbis, University of Rochester)
An Ensemble of Linearly Combined Reinforcement-Learning Agents (Vukosi Marivate, Rutgers University)
Autonomous RF Surveying Robot for Indoor Localization and Tracking (Piotr Mirowski, Bell Labs, Alcatel-Lucent, Statistics and Learning Department)
A Comparison of Text Analysis and Social Network Analysis Using Twitter Data (John Myles White, Princeton University)
Time-dependent Dirichlet Process Mixture Models for Multiple Target Tracking (Willie Neiswanger, Columbia University)
An Efficient and Optimal Scrabble Playing Algorithm Based on Modified DAWG and Temporal Difference Lambda Learning Algorithm (Manjot Pahwa, University of Delhi)
Hierarchically Supervised Latent Dirichlet Allocation (Adler Perotte*, Columbia University)
Image Super-Resolution via Dictionary Learning (Gungor Polatkan*, Princeton University)
Structured Sparsity via Alternating Directions Methods (Zhiwei Tony Qin, Columbia University)
Ranking Annotators for Crowdsourced Labeling Tasks (Vikas Raykar, Siemens Healthcare)
Extracting Latent Economic Signal from Online Activity Streams (Joseph Reisinger, Metamarkets Group)
Intrinsic Gradient Networks (Jason Rolfe, New York University)
MirroRank: Convex Aggregation and Online Ranking with the Mirror Descent (Benoit Rostykus*, Ecole Centrale Paris, ENS Cachan)
Preserving Proximity Relations and Minimizing Edge-crossings in Graph Embeddings (Amina Shabbeer*, Rensselaer Polytechnic Institute)
A Learning-based Approach to Enhancing Checkout Non-Compliance Detection (Hoang Trinh, IBM T. J. Watson Center)
Ensemble Inference on Single-Molecule Time Series Measurements (Jan-Willem Van de Meent, Columbia University)
Bisection Search in the Presence of Noise (Rolf Waeber, Cornell University)
PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework (Shen Wang, Columbia University)
A Reinforcement Learning Approach to Variational Inference (David Wingate*, MIT)
Using Support Vector Machine to Forecast Energy Usage of a Manhattan Skyscraper (Rebecca Winter, Columbia University)
Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries (Zhen James Xiang, Princeton University)
Using Iterated Reasoning to Predict Opponent Strategies (Michael Wunder, Rutgers University)

Multiple instance regression without the ambiguity

Charles Bergeron (chbergeron@gmail.com)
Department of Electrical, Systems and Computer Engineering
Rensselaer Polytechnic Institute, 110 8th Street, Troy, New York 12180

Multiple instance learning

Multiple instance learning (MIL) is a
variation of supervised learning where the response is only known for bags of items (or instances), instead of for each item. Many MIL formulations have been proposed, including classification [1], regression [2], and partial ranking [3]. Typically in MIL, there exists an ambiguity as to which item in a bag determines the response (a classification label, ranking information, or regression value, as the case may be) for the entire bag. This ambiguity stems from either a lack of knowledge (e.g. chemists don't know which conformations cause a compound to smell musky) or a desire for automation (e.g. elephants are easy to identify in an image, but annotating a large number of images is a boring task). Fundamentally, the ambiguity exists because it is assumed that a single item determines the response for the whole bag, or because there is a lack of knowledge as to how multiple items contribute to the bag's response. Selecting a single item to stand for the entire bag is common in MIL and is easily formulated using the max operator

f_i(w) = max_j { x_ij^T w }    (1)

where f_i is the prediction for bag i, x_ij contains the features corresponding to item j of bag i, and w is a vector containing the model parameters.

Multi-instance aggregation

What if the combination of several indicators is a better predictor of hard drive failure, or several passages in a document better determine its relevance? Item aggregation to form a bag prediction is a complicated task. For example, we could take a convex combination of items weighted by coefficients λ_j:

f_i(w) = sum_{j=1}^n λ_j x_ij^T w.    (2)

However, that makes for a lot of extra parameters to estimate, increasing the risk of overfitting. Simply averaging items (setting λ_j = 1/n) to form an aggregate prediction does not work well in practice, as the contribution of critical items is diluted. Recently, [4] proposed aggregating items using a log-mean-exp function controlled by η > 0:

f_i(w) = (1/η) ln sum_j exp(η x_ij^T w).    (3)

Under this formulation, the relative weighting is such that terms with larger x_ij^T w are more heavily weighted as η → ∞, and it has the advantage of adding a single parameter η, as opposed to n extra parameters as with Eq. (2). They obtained their best partial ranking models using this form, thereby showing that multi-instance aggregation can be successful. However, we do not know whether their log-sum-exp formulation is optimal. Put otherwise, the ambiguity remains, in that they are required to attempt various strategies to handle the ambiguity and select one of them (in an unbiased manner).
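A small numerical sketch of the three aggregation rules in Eqs. (1)-(3): the max operator, the averaged convex combination (λ_j = 1/n), and the η-controlled log-sum-exp. The random bag below is an illustrative assumption; the point is that the Eq. (3) prediction interpolates between the average and the max as η grows.

```python
import numpy as np

def bag_predictions(X_bag, w, eta=5.0):
    """Aggregate the item scores x_ij' w of one bag into a bag prediction f_i(w)."""
    scores = X_bag @ w                 # one score per item in the bag
    f_max = scores.max()               # Eq. (1): a single item stands for the bag
    f_avg = scores.mean()              # Eq. (2) with lambda_j = 1/n
    m = eta * scores                   # Eq. (3), computed in a stable way
    f_lse = (m.max() + np.log(np.exp(m - m.max()).sum())) / eta
    return f_max, f_avg, f_lse

rng = np.random.default_rng(1)
X_bag, w = rng.normal(size=(6, 4)), rng.normal(size=4)
for eta in (0.1, 1.0, 100.0):
    print(eta, bag_predictions(X_bag, w, eta))  # f_lse -> f_max as eta grows
```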
Multispecies multimode binding model

Physiological compounds, drugs, and other man-made bioactive chemicals containing ionizable substructures are analyzed in a multiple-species context that accounts for the various binding species. It is becoming increasingly clear that this analysis must be further broken down to consider the effect of multiple binding modes [5]. Incorporating these molecular representations can improve our understanding of binding for improved drug design. The relationship between these representations and the binding affinity is derived from chemical thermodynamics principles (and the assumption that the overall interaction energy is equal to the sum of pairwise energies between atoms of the interacting ligands and those of the binding site) [5]. Our model has the form

BA_i(w, γ) = ln sum_{j=1}^n φ_ij exp(x_ij^T w)    (4)

where BA_i is the binding affinity of compound i and 0 ≤ φ_ij ≤ 1 is the species fraction corresponding to molecular representation j, such that sum_j φ_ij = 1. Comparing with Eq. (3), we set η = 1 and weigh each exponential term by φ_ij. Subject to the influence of the species fractions, this formulation behaves as Eq. (3). This is a MIL model in that the binding affinity is known for the compounds (bags) and the features are known for the molecular representations (items) [6]. However, unlike most MIL models, we know how these molecular representations aggregate to determine each compound's response [6]. This is a positive result for two reasons:

- There exist MIL problems for which the form of item aggregation is known, and these forms may translate well to other applications.
- In the absence of the usual MIL ambiguity, greater effort can be put into estimating the parameters, which can be challenging as MIL problems are generally nonconvex.

Linking the empirical results from [4] with the analytical observations from [6] strengthens the case for aggregation formulations involving multiple instances.

References

[1] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems, volume 15, 2003.
[2] Soumya Ray and David Page. Multiple instance regression. In Proceedings of the International Conference on Machine Learning, volume 18, pages 425–432, 2001.
[3] Charles Bergeron, Gregory Moore, Jed Zaretzki, Curt M. Breneman, and Kristin P. Bennett. Fast bundle algorithm for multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, in press, 2011.
[4] Yang Hu, Mingjing Li, and Nenghai Yu. Multiple instance ranking: Learning to rank images for image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[5] Senthil Natesan, Tiansheng Wang, Viera Lukacova, Vladimir Bartus, Akash Khandelwal, and Stefan Balaz. Rigorous treatment of multispecies multimode ligand-receptor interactions in 3D-QSAR: CoMFA analysis of thyroxine analogs binding to transthyretin. Journal of Chemical Information and Modeling, 51:1132–1150, 2011.
[6] Charles Bergeron, Senthil Natesan, and Stefan Balaz. Controlling model capacity: An exploration of multispecies multimode binding affinity modeling. In Proceedings of the International Conference on Pattern Recognition Applications and Methods, under review.

Opportunistic Approachability

Andrey Bernstein*†, Shie Mannor†, and Nahum Shimkin†
* Department of Electrical Engineering, Columbia University
† Department of Electrical Engineering, Technion
September 9, 2011

Blackwell's approachability has played a key role in the analysis of regret minimization within the on-line learning framework and in the theory of learning in games. The notion of approachability, as introduced in Blackwell (1956), concerns a repeated matrix game with vector-valued payoffs that is played by two players, the agent and the opponent. Thus, for each pair of simultaneous actions a and z, a payoff vector r(a, z) ∈ R^ℓ is obtained. Given a closed set S in R^ℓ, the agent's goal is to have the long-term average reward vector approach S, namely converge to S almost surely in the point-to-set distance. If that convergence can be ensured irrespective of the opponent's strategy, the set S is said to be approachable, and a strategy of the agent that satisfies this property is an approaching strategy (or algorithm) of S. Approachability has close connections with on-line algorithms, and in particular with the concept of no-regret.
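A minimal sketch of the approachability criterion just defined: track the running average of the vector-valued payoffs and monitor its point-to-set distance to the target set S. The two-dimensional payoffs and the choice of S as a Euclidean ball are illustrative assumptions.

```python
import numpy as np

def dist_to_ball(v, radius=1.0):
    """Point-to-set distance from v to the ball S = {s : ||s|| <= radius}."""
    return max(0.0, np.linalg.norm(v) - radius)

rng = np.random.default_rng(2)
avg = np.zeros(2)
for t in range(1, 1001):
    r_t = rng.normal(size=2)   # payoff vector r(a_t, z_t) observed in round t
    avg += (r_t - avg) / t     # running average of the payoffs
print(dist_to_ball(avg))       # S is approached if this tends to 0 as t grows
```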
In fact, soon after no-regret play was introduced in Hannan (1957), it was shown by Blackwell (1954) that this problem can be formulated as an approachability problem for a suitably defined set and payoff vector. By its very definition, an approachable set accommodates a worst-case scenario, as it should be approached under any strategy of the opponent. However, as the game unfolds it may turn out that the opponent's actions are limited in some sense. For example, the opponent may choose to use only a subset of his available actions, or even employ a stationary strategy, namely restrict himself to a single mixed action. If these self-imposed restrictions were known in advance, the agent could possibly ensure convergence to some strict subset of the target set S. Our goal here is to formulate opportunistic approachability algorithms, which attain similar goals in an online manner, without knowing the opponent's strategy restrictions in advance. To this end, we need to formulate meaningful opportunistic goals that are to be pursued by such algorithms.

Our starting point is Blackwell's dual approachability condition. Let r(p, q) denote the expected payoff vector given mixed actions p and q of the agent and opponent, respectively. Blackwell's dual condition¹ requires that for any q there exists a p so that r(p, q) ∈ S. This condition is clearly necessary for a closed set S to be approachable. Indeed, if it fails to hold for some q, the opponent may exclude S simply by playing q repeatedly. Remarkably, Blackwell showed that this condition is also sufficient when S is convex. We refer to a set S that satisfies the dual condition as a D-set. Note that a convex D-set is approachable. More generally, the convex hull conv(S) of any D-set S is approachable. By definition, if S is a D-set, there exists a map p*(q) so that r(p*(q), q) ∈ S for all q. Fix one such map, and call it the response function.

A feasible goal for an opportunistic approachability algorithm can now be specified: suppose that the opponent's actions are empirically restricted in some (possibly asymptotic) sense to a subset Q of his mixed actions. The average reward vector should then converge to conv{r(p*(q), q) : q ∈ Q}. Clearly, the latter is included in conv(S), and may be considerably smaller, depending on Q. In particular, if Q is a singleton then so is the latter set.

The central contribution of this paper is an approachability algorithm that is opportunistic in the above-mentioned sense. To the best of our knowledge, this is the first algorithm to offer such guarantees. The proposed algorithm is based on on-line convex programming, here implemented via the gradient algorithm due to Zinkevich (2003), with its scalar rewards obtained by projecting the original vector payoffs onto properly selected directions. The algorithm is computationally simple, in the sense that it does not require explicit identification of the restriction set Q, nor of the convex hull associated with the reduced target set.
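The on-line convex programming building block named above, Zinkevich's projected online gradient method, can be sketched generically as follows. The loss sequence, the feasible set (a unit ball), and the 1/sqrt(t) step sizes are illustrative assumptions; the paper's algorithm additionally derives its scalar rewards by projecting the vector payoffs onto chosen directions.

```python
import numpy as np

def online_gradient_descent(grad_fns, project, x0):
    """Zinkevich-style OGD: play x_t, observe a convex loss via its gradient,
    step with eta_t ~ 1/sqrt(t), and project back onto the feasible set.
    Achieves O(sqrt(T)) regret against the best fixed point in hindsight."""
    x = x0.copy()
    played = []
    for t, grad in enumerate(grad_fns, start=1):
        played.append(x.copy())
        x = project(x - (1.0 / np.sqrt(t)) * grad(x))
    return played

# toy usage: quadratic losses f_t(x) = ||x - c_t||^2 over the unit ball
rng = np.random.default_rng(3)
targets = rng.normal(scale=0.3, size=(100, 3))
grads = [lambda x, c=c: 2.0 * (x - c) for c in targets]
project_ball = lambda v: v / max(1.0, np.linalg.norm(v))
plays = online_gradient_descent(grads, project_ball, np.zeros(3))
```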
As an application of opportunistic approachability, we consider the problem of on-line regret minimization in the presence of average cost constraints. That problem generalizes the standard no-regret framework by adding side constraints, and was recently applied in Bernstein et al. (2010) to the problem of online classification with specificity constraints. As shown in Mannor et al. (2009), the constrained no-regret problem can be naturally formulated as an approachability problem with respect to a target set S that traces the best-reward-in-hindsight given the empirical frequencies of the opponent's actions, subject to the specified constraints. While this set is a D-set by its definition, it turns out to be non-convex in general and therefore need not be approachable. Our opportunistic approachability algorithm applies to this problem, with the response function p*(q) naturally selected as the constrained best response. The algorithm approaches conv(S) without requiring its computation. Furthermore, in the fortuitous case of a stationary opponent, the algorithm converges to the set S itself.

¹ Blackwell's primal condition is a geometric separation condition, which forms the basis for standard approachability policies. The equivalence of the primal and dual conditions for convex sets is a generalization of the standard minimax theorem, as reflected in the title of Blackwell's paper (Blackwell, 1956).

References

A. Bernstein, S. Mannor, and N. Shimkin. Online classification with specificity constraints. In NIPS, 2010.
D. Blackwell. Controlled random walks. In Proceedings of the International Congress of Mathematicians, volume III, pages 335–338, 1954.
D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8, 1956.
J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
S. Mannor, J. N. Tsitsiklis, and J. Y. Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10:569–590, 2009.
M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML '03), pages 928–936, 2003.

Large-Scale Sparse Kernel Logistic Regression — with a Comparative Study on Optimization Algorithms

ABSTRACT

Kernel Logistic Regression (KLR) is a powerful probabilistic classification tool, but its training and testing both suffer from severe computational bottlenecks when used with large-scale data. Traditionally, an L1 penalty is used to induce sparseness in the parameter space for fast testing. However, most of the existing optimization methods for training L1-penalized KLR do not scale well to large-scale settings. In this work, we present highly scalable training of the KLR model via three first-order optimization methods: the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), Coordinate Gradient Descent (CGD), and a variant of the Stochastic Gradient Descent (SGD) method. To further reduce the space and time complexity, we apply a simple kernel linearization technique which achieves similar results at a fraction of the computational cost. While SGD appears the fastest in training on large-scale data, we show that CGD performs considerably better in some cases on various quality measures. Based on this observation, we propose a multi-scale extension of FISTA which improves its computational performance significantly in practice while preserving the theoretical global convergence rate. We further propose a two-stage active set training scheme for CGD and FISTA, which boosts the prediction accuracies by up to 4%. Extensive experiments on several data sets containing up to millions of samples demonstrate the effectiveness of our approach.
1. INTRODUCTION

1.1 Kernel Logistic Regression

Kernel Logistic Regression (KLR) is a powerful probabilistic classification tool. Given N training points, the following minimization is used to train KLR:

min_w F(w) = − sum_{i=1}^N log(σ(y_i w^T k_i))    (1)

where k_i is the i-th column of the kernel matrix K of the training data, and σ(v) = 1/(1 + exp(−v)). The possibility of using different kernels allows one to learn a nonlinear classifier in the feature space. However, KLR's training and testing both suffer from severe computational bottlenecks when used with large-scale data. Moreover, when the number of training data points is small compared to the number of features, straightforward KLR training may lead to over-fitting. KLR with L1-regularization, in which an extra term λ‖w‖₁ is added to penalize large w, has received considerable attention since it enforces sparsity in w. Specifically, L1-regularized KLR often yields a sparse solution w whose nonzero components correspond to the underlying more "meaningful" features. It therefore greatly saves testing time, especially when the feature dimension is huge, since only a few features are used in computation. Indeed, L1-regularized KLR makes it possible to learn and apply KLR at large scale with performance similar to that of L2-regularization. It has been shown that L1-regularization can outperform L2-regularization, especially when the number of observations is smaller than the number of features. Thus, the L1-regularization technique has been widely used in many other problems, such as compressed sensing and the Lasso, despite the fact that L1-regularization is more challenging to solve than L2-regularization due to the non-smoothness of ‖·‖₁.

In this paper, we consider the following Sparse Kernel Logistic Regression (SKLR) model:

min_w F(w) = − sum_{i=1}^N log(σ(y_i w^T k_i)) + λ‖w‖₁.    (2)

We choose the Radial Basis Function (RBF) kernel to form the kernel matrix K, i.e.,

k(i, j) = exp(−‖x_i − x_j‖² / σ²).    (3)

1.2 Algorithms for L1-regularization

Most existing approaches solve the smooth L2-regularized (L2-norm squared) KLR instead of the non-smooth SKLR. An interior point approach has been proposed for sparse logistic regression (SLR), but it makes certain sparsity assumptions about the training data matrix in large-scale computation and is thus not applicable to the nonlinear kernel model. Iteratively Re-weighted Least Squares (IRLS) via LARS calls LARS as a subroutine and also solves KLR only. Various first-order optimization algorithms have been proposed to solve L1-regularized problems. In this paper, we explore three algorithms: the (Fast) Iterative Shrinkage-Thresholding Algorithm (FISTA/ISTA), the Coordinate Gradient Descent algorithm (CGD), and Stochastic Gradient Descent (SGD). FISTA is an accelerated proximal-gradient algorithm which was originally proposed to solve the linear inverse problem arising in image processing. Recently, it has been proven that its complexity bound is optimal among all first-order methods. CGD is a coordinate-descent type of algorithm that extends earlier work to L1-regularized convex minimization problems. SGD uses approximate gradients from subsets of the training data and updates the parameters in an online fashion; in many applications this results in less training time in practice than batch training algorithms. The SGD-C method that we use in our study is based on the stochastic sub-gradient descent algorithm originally proposed to solve L1-regularized log-linear models. It uses a cumulative penalty heuristic to improve the sparsity of w. SGD-C also incorporates the gradient-averaging idea from the dual averaging algorithm, which extends Nesterov's work to L1-regularized kernel logistic regression.
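A minimal sketch of the proximal-gradient (ISTA) iteration that FISTA accelerates, applied to problem (2): a gradient step on the smooth logistic-loss term followed by soft-thresholding, the proximal operator of λ‖·‖₁. The fixed step size and iteration count are illustrative assumptions; FISTA adds a momentum (extrapolation) step on top of this update.

```python
import numpy as np

def sigma(v):
    return 1.0 / (1.0 + np.exp(-v))

def ista_sklr(K, y, lam, step=1e-3, n_iter=500):
    """ISTA for Eq. (2), assuming a symmetric kernel matrix K (row i = k_i')."""
    w = np.zeros(K.shape[0])
    for _ in range(n_iter):
        z = y * (K @ w)                        # margins y_i * w' k_i
        grad = -K.T @ ((1.0 - sigma(z)) * y)   # gradient of the smooth term
        v = w - step * grad                    # gradient step
        w = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)  # L1 prox
    return w
```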
2. IMPLEMENTATION

2.1 Kernel linearization

Nonlinear kernel classifiers are attractive because they can approximate the decision boundary better given enough training data. However, they do not scale well in either training time or storage in large-scale settings, and it may take days to train on data sets with millions of points. On the other hand, linear classifiers run much more quickly, especially when the number of features is small, but behave relatively poorly when the underlying decision boundaries are non-linear. Kernel linearization combines the advantages of the linear and nonlinear classifiers. The key idea is to map the training data to a low-dimensional Euclidean space by a randomized feature map z : R^n → R^D, D ≪ n, so that

k(x, y) = ⟨φ(x), φ(y)⟩ ≈ z(x)^T z(y).    (4)

Therefore, we can directly feed the transformed data z(x) to the linear classifier and speed up the computation. To calculate z(x), a randomized Fourier transform is used:

z_j(x) = sqrt(2/D) cos(ω_j^T x + b_j).    (5)
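A minimal sketch of the feature map of Eqs. (4)-(5) for the RBF kernel of Eq. (3). Sampling the frequencies ω_j from a Gaussian matching the kernel's spectrum, and the phases b_j uniformly from [0, 2π], is the standard choice in the random-features literature; for the parameterization of Eq. (3) this means ω_j ~ N(0, (2/σ²) I), which is an inference from the formula rather than a detail stated in the abstract.

```python
import numpy as np

def rff_map(X, D, sigma_rbf, seed=0):
    """Randomized Fourier map z : R^n -> R^D with z(x)'z(y) ~ k(x, y),
    where k(x, y) = exp(-||x - y||^2 / sigma_rbf^2) as in Eq. (3)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    omega = rng.normal(scale=np.sqrt(2.0) / sigma_rbf, size=(n, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)

# sanity check: feature inner products approximate the exact kernel value
rng = np.random.default_rng(4)
X = rng.normal(size=(5, 10))
Z = rff_map(X, D=5000, sigma_rbf=2.0)
approx = Z[0] @ Z[1]
exact = np.exp(-np.sum((X[0] - X[1]) ** 2) / 2.0 ** 2)
print(approx, exact)
```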
3. EXPERIMENTS

3.0.1 The regularization parameter λ

One of the key parameters to set for the SKLR problem is the penalty λ on the L1-norm. There are two major approaches for choosing λ. One is through computing the regularization path, as done recently in the literature. Specifically, we start with a large λ, for which the optimization algorithms converge faster. We then gradually decrease λ and use the solution from the previous λ as the starting solution for the current λ. This technique is also known as continuation or warm-starting. It uses 10-fold cross-validation to compute the hold-out accuracy and selects the λ that achieves the highest accuracy. With a given kernel matrix K, we can also explicitly compute an upper bound λ_max on the meaningful range of λ, following the formulation given in the literature:

λ_max = ‖K^T b̃‖_∞,    (6)

where

b̃_i = N₋/N  if y_i = 1,    b̃_i = −N₊/N  if y_i = −1,    i = 1, ..., N,

and N₋ and N₊ denote the number of negative and positive training labels, respectively. The optimal solution is the zero vector for any λ larger than λ_max. For medium-scale datasets, we ran the three algorithms with four values of λ: 0.5λ_max, 0.2λ_max, 0.1λ_max, and 0.05λ_max. We show that a good balance between accuracy and sparsity usually occurs at around 0.2λ_max to 0.1λ_max. Table 1 lists the actual factors of λ_max used in our experiments. This approach is simple, and the chosen parameter values are usually appropriate enough to ensure that we test the optimization algorithms in the relevant regime for classification.

Table 1: Various data sets used in the experiments.

Data set   | training  | test   | no. features | factor of λ_max
mnist10k   | 10,000    | 1,000  | 784          | 0.5, 0.2, 0.1, 0.05
gisette    | 6,000     | 1,000  | 5,000        | 0.5, 0.2, 0.1, 0.05
cod-rna    | 483,565   | 5,000  | 8            | 0.001
mnist8M    | 1,560,154 | 19,000 | 784          | 0.1

4. CONCLUSION

In this work, we explore and analyze three algorithms, FISTA, CGD, and SGD, for solving L1-regularized large-scale KLR, a study which, to the best of our knowledge, has not been performed before. We observe that CGD performs surprisingly better than FISTA in the number of iterations to convergence. For large-scale data with up to millions of samples, SGD appears faster in terms of training time, but this comes with the loss of an optimality guarantee upon termination, and SGD has lower prediction accuracy or sparsity compared to the deterministic methods (i.e., FISTA and CGD) in some cases. We also study the effect of various values of the regularization parameter on the training time, prediction accuracy, and sparsity. On the algorithmic side, we propose a two-stage active-set training approach which can boost the prediction accuracy by up to 4% for the deterministic algorithms. Based on the observation that CGD converges faster than FISTA on SKLR problems, we adopt this feature of CGD and propose a multiple-scale FISTA which improves its performance significantly while preserving the theoretical global convergence rate. We have also applied the gradient-averaging technique and the cumulative penalty heuristic to SGD to make it well suited for large-scale SKLR training. Through our experiments, we demonstrated the effectiveness of kernel linearization and the computational advantage that it brings, which makes large-scale kernel learning highly feasible. Our observations show that while SGD-C remains the most efficient algorithm for large-scale data, the deterministic algorithms (especially CGD) are also well capable of handling a wide range of data with optimality guarantees. Finally, the success of the algorithms introduced in this work demonstrates the potential and computational advantage of sparse modeling in large-scale classification.

Visualizing Topic Models

Allison J. B. Chaney and David M. Blei
Princeton University
achaney@cs.princeton.edu

Probabilistic topic modeling is an unsupervised machine learning method for uncovering the latent thematic structure in large archives of otherwise unorganized texts. Topic models learn the underlying themes of the collection and how each document exhibits them. This provides the potential to both summarize the documents and organize them in a new way. However, topic models are high-level statistical tools which require scrutiny of probability distributions to understand and explore their results. The contribution of this work is to present a method for visualizing the results of a topic model. Our method creates a web interface that allows users to explore all of the observed and hidden relationships that a topic model discovers.
These browsing interfaces help in the research and development of topic models and let end-users explore and understand the collection in new ways. We propose a method for visualizing topic models that summarizes the corpus for the user and reveals relationships between and across content and summaries. With our browser design, every possible relationship between topics, terms, and documents is represented and integrated into the navigation. Overview pages allow users to understand the corpus as a whole before delving into more specific exploration via individual variable pages. We have achieved this with a browser design that illuminates a given corpus to users of varying technical ability; understanding our browser does not require an understanding of the deep technical details of topic modeling.

We have built a browser for one of the simplest topic models, LDA [1], and it can be used for both its static and online variants [3]. Recall that LDA assumes the following generative process for assigning topics to documents:

1. For K topics, choose each topic distribution β_k. (Each β_k is a distribution over the vocabulary.)
2. For each document in the collection:
   (a) Choose a distribution over topics θ_d. (The variable θ_d is a distribution over K elements.)
   (b) For each word in the document:
       i. Choose a topic assignment z_n from θ_d. (Each z_n is a number from 1 to K.)
       ii. Choose a word w_n from the topic distribution β_{z_n}. (The notation β_{z_n} selects the z_n-th topic from step 1.)

This generative process defines a joint distribution over the hidden topical structure and the observed documents. This joint, in turn, induces a posterior distribution of the hidden structure given the documents. Topic modeling algorithms, like Gibbs sampling [4] and variational inference [1, 3], seek to approximate this posterior. Here, we use this approximation to visualize a new organization of the collection through posterior estimates of the hidden variables, as shown in Figure 1.

As one demonstration, we analyzed and visualized 100,000 Wikipedia articles with a 50-topic LDA model. The browser can be found at http://www.princeton.edu/~achaney/tmve/wiki100k/browse/topic-list.html. Figure 2 depicts several pages from this browser. An extension that incorporates time series of topics and document impact [2] is in progress, with preliminary results.

References

[1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.
[2] S. Gerrish and D. Blei. A language-based approach to measuring scholarly impact. In International Conference on Machine Learning, 2010.
[3] M. Hoffman, D. Blei, and F. Bach. On-line learning for latent Dirichlet allocation. In Neural Information Processing Systems, 2010.
[4] M. Steyvers and T. Griffiths. Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2006.

[Figure 1 annotations: associated topics, ordered by θ_d; related topics, ordered by a function of β_k × β_{1:K} (Eq. 1); associated documents, ordered by θ_{1:D}; terms w_d present in the document; related documents, ordered by a function of θ_d × θ_{1:D} (Eq. 4); associated terms, ordered by β_k.]

Figure 1: A topic page and a document page from the browser of Wikipedia. We have labeled how we compute each component of these pages from the output of the topic modeling algorithm.

Figure 2: Navigating Wikipedia with a topic model. Beginning in the upper left, we see a set of topics, each of which is a theme discovered by a topic modeling algorithm. We click on a topic about film and television. We choose a document associated with this topic, which is the article about film director Stanley Kubrick. The page about this article includes its content and the topics that it is about. We explore a related topic about philosophy and psychology, and finally view a related article about Existentialism. This browsing structure—the themes and how the documents are organized according to them—is created by running a topic modeling algorithm on the raw texts of Wikipedia and visualizing its output.
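A minimal sketch of the generative process recalled above, drawing the topic distributions β_k and topic proportions θ_d from symmetric Dirichlet priors. The hyperparameters α and η below are illustrative assumptions; the abstract itself does not fix them.

```python
import numpy as np

def sample_lda_corpus(n_docs, doc_len, K, V, alpha=0.1, eta=0.01, seed=0):
    """Draw a toy corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)     # step 1: K topics over V words
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(K, alpha))      # step 2(a): topic proportions
        z = rng.choice(K, size=doc_len, p=theta)      # step 2(b)i: topic assignments
        words = [rng.choice(V, p=beta[k]) for k in z] # step 2(b)ii: observed words
        docs.append(words)
    return beta, docs

beta, docs = sample_lda_corpus(n_docs=3, doc_len=20, K=5, V=100)
```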
Weakly Supervised Neural Nets for POS Tagging

Sumit Chopra and Srinivas Bangalore
AT&T Labs-Research, 180 Park Ave, Florham Park, NJ
{schopra,srini}@research.att.com

We introduce a simple yet novel method for the problem of Part-Of-Speech (POS) tagging with a dictionary. It involves training a neural network which simultaneously learns a distributed latent representation of the words while learning a discriminative function to optimize accuracy on the POS tagging task. The model is trained using a curriculum: a structured sequence of training samples is presented to the model, rather than samples in random order. On a standard data set, we demonstrate that our model is able to outperform the standard Expectation-Maximization (EM) algorithm, and that its performance is comparable to other state-of-the-art models.

In natural language processing (NLP), machine learning techniques such as Hidden Markov Models (HMM), Maximum Entropy models, Support Vector Machines, and Conditional Random Fields have become the norm for solving a range of disambiguation tasks such as part-of-speech (POS) tagging, named-entity tagging, supertagging, and parsing. Training these models relies crucially on the availability of large amounts of text annotated with the appropriate task labels. However, text annotation is a tedious and expensive process. In order to address this limitation, a number of alternate techniques have been proposed recently, both in the unsupervised and weakly supervised settings. One such technique, first introduced in [1], involves training a POS tagger for a language using large amounts of unannotated text aided by a lexicon of words with all possible POS tags for each word in that language (a dictionary). Such a lexicon can easily be extracted from a physical dictionary of a language without the need for annotating texts of that language. This problem has recently witnessed a lot of activity. Various authors have proposed a wide variety of solutions, ranging from variants of HMM-based generative models trained using EM [1], [2], [3], to Bayesian approaches [4], [5], to contrastive training of log-linear models [6]. However, one major drawback of most of these techniques is their explicit modeling of syntactic constraints, typically encoded using n-grams of POS history. This results in an expensive decoding step, rendering them useless for real-time applications.

To this end, we propose a simple yet novel model for the problem which uses only lexical information and does not explicitly model syntactic constraints. Our model is similar to the works of [7] and [8]. It is composed of two layers: a classification layer stacked on top of an embedding layer. The embedding layer is a learnable linear mapping which maps each word onto a low-dimensional latent space. Let the set of N unique words in the vocabulary of the corpus be denoted by W = {w_1, w_2, ..., w_N}, and assume that each word w_i is coded using a 1-of-N coding. The embedding layer maps each word w_i to a continuous vector z_i which lies in a D-dimensional space (D = 25 in the experiments): z_i = C · w_i, i ∈ {1, ..., N}, where C ∈ R^{D×N} is a projection matrix. The continuous representation of any k-gram (w_i w_{i+1} ... w_{i+k}) is z = (z_i z_{i+1} ... z_{i+k}), obtained by concatenating the representations of each of its words. The dictionary is encoded as follows. Let K be the total number of tags in the dictionary. Then the dictionary entry d_i is a binary vector of size K, with the j-th element set to 1 if tag j is associated with word w_i.

The classification layer takes as input the continuous representation of the k-gram generated by the embedding layer and produces as output a vector of tag probabilities which are used to make the decision. In our experiments this layer was composed of a single standard perceptron layer consisting of 400 units, followed by a fully connected linear layer with size equal to the number of classes (47). The output of this linear layer is passed through a soft-max non-linearity to generate conditional probabilities for each class (tag). In particular, let o_j denote the output of the j-th unit of the last linear layer. Then the output of the i-th unit of the soft-max non-linearity is given by p_i = exp(o_i) / sum_j exp(o_j). This classifier can be viewed as a parametric function G_M with parameters M.
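A minimal numpy sketch of the forward pass just described: embedding lookup, concatenation of the k-gram, a perceptron layer with the 400 units from the text, a linear layer over the 47 tags, and the soft-max. The tanh non-linearity, the random (untrained) weights, and the omission of the orthographic features are illustrative assumptions.

```python
import numpy as np

D, H, K_TAGS = 25, 400, 47        # embedding dim, perceptron units, tag classes
N_VOCAB, K_GRAM = 50000, 13       # assumed vocabulary; target word +/- 6 context

rng = np.random.default_rng(5)
C = 0.01 * rng.normal(size=(D, N_VOCAB))      # embedding (projection) matrix
W1 = 0.01 * rng.normal(size=(H, D * K_GRAM))  # perceptron layer
W2 = 0.01 * rng.normal(size=(K_TAGS, H))      # final linear layer

def forward(word_ids):
    """Tag probabilities for one k-gram given as K_GRAM word indices."""
    z = C[:, word_ids].T.reshape(-1)  # look up and concatenate embeddings
    h = np.tanh(W1 @ z)               # perceptron layer (non-linearity assumed)
    o = W2 @ h                        # one output o_j per tag
    p = np.exp(o - o.max())           # soft-max, numerically stabilized
    return p / p.sum()                # p_i = exp(o_i) / sum_j exp(o_j)

probs = forward(rng.integers(N_VOCAB, size=K_GRAM))
```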
The two sets of trainable parameters for the model are the mapping matrix C and the set of parameters M associated with G_M. For the purpose of training, we use the POS tags of each word in the dictionary to construct a training set S = {(x_i, y_i) : i = 1, ..., N} consisting of input-output pairs. Each input x_i is a sequence of k words obtained by sliding a window of size k over the entire corpus. In particular, for a sentence W = w_1 ... w_r, each training example x_i consists of a target word w_j, with six words from its left context (c_l = w_{j−6} ... w_{j−1}) and right context (c_r = w_{j+1} ... w_{j+6}), in addition to orthographic features (o_{w_j}) such as three-character suffix and prefix, digit, and upper-case features. Thus the input x_i is the 4-tuple (c_l, w_j, c_r, o_{w_j}). The corresponding output y_i is set equal to d_j, the binary dictionary vector associated with the target word w_j. Depending on the word's ambiguity, one or more elements of the vector d_j could be set to 1.

Training the system involves adjusting the parameters M and C so as to maximize the likelihood of the training data. This is achieved by minimizing the negative log-likelihood loss using the stochastic gradient descent algorithm. Each epoch of training is composed of two phases. In Phase 1, the matrix C is kept fixed and the loss is minimized with respect to the parameters M of the classifier using stochastic gradient descent. In Phase 2, the parameters M are fixed and the loss is minimized with respect to the matrix C. However, motivated by [9], we follow a fixed curriculum to pick training samples during each step of stochastic gradient descent, as opposed to random sampling. The curriculum involves first using only the "easy" samples to learn simpler aspects of the task, and gradually moving to "harder" samples to learn more complex aspects. In particular, we start model training using samples that consist of words which have only one POS tag in the dictionary (i.e., no ambiguity). After training the model for some number of epochs on these training samples, we grow the learner's training set (and hence increase its entropy) by adding examples of words which have two POS tags in the dictionary (i.e., an ambiguity of one) and train the model until convergence. We follow this training regime with an ever-increasing training set until all the available samples are included.

We evaluated our model on the standard Penn Treebank part-of-speech (POS) corpus, and used the same dictionary of word-to-tag associations as in [3]. We replaced the tag associated with each word in the labeled training set of around 1.1M words with the tags associated with that word in the dictionary. This resulted in an ambiguously tagged sentence corpus, with an average per-word ambiguity of 2.3 tags/word. Due to the high non-linearity of the model, different initializations could potentially lead to different local minima, resulting in vastly different task accuracy.
To alleviate this problem, we trained several models independently, each initialized with a different set of random word embeddings and classifier parameters, and the model which performed best on the tuning set was selected as the final model for evaluation. The tagging accuracy of our model, along with that of others, is summarized in Table I. The accuracy of our model with curriculum learning (CNet+CL) is superior to the bigram EM-HMM, the trigram EM-HMM, and the Bayesian HMM models. Those models which outperform ours, namely CE+spl and IP+EM, make use of a number of global characteristics, such as access to the full sequence of POS tags of a sentence. The simplicity of our model (which is only lexically driven, with no dependencies on the POS tag sequence) makes the results in the table particularly noteworthy. We attribute the good performance to the lexical representations that are learned jointly with the discriminative non-linear architecture trained to optimize POS tagging accuracy. We expect that exploiting such global information within our model would likely improve the tagging accuracy even further. It is interesting to note that the model without curriculum learning (CNet), where training samples are presented in no particular order, is far inferior to all other models. Furthermore, analysis of the dictionary confirmed that a number of words with ambiguity greater than zero had wrong tags associated with them in their respective dictionary entries. This was due to the fact that some instances of these words were wrongly tagged in the corpus. The incorrect tagging of the words, coupled with the greedy nature of training the model, resulted in the model wrongly tagging all instances of these words. After removing such erroneous entries from the dictionary, as suggested in [2], the test accuracy of our model improved to 90.55%. Note that in Table I the performance of 91.4% reported in [2] uses such a pruned dictionary along with other language-specific and linguistic constraints. Lastly, once the parameters of our model are learned, the process of decoding a new sample is extremely fast: our model is capable of tagging 7945 words per second on a Xeon 7550 machine.

TABLE I: POS tagging accuracy of various models. EM-bigram: EM with a bigram tag model; EM-trigram: EM with a trigram tag model; BHMM: Bayesian HMM with sparse priors [4]; CE+spl: contrastive estimation with spelling model [6]; InitEM-HMM: EM with good initialization [2]; IP+EM: EM with integer programming [3]; CNet: connectionist network; CL: curriculum learning; CD: clean dictionary.

Model          | Percentage accuracy using 47 tags
EM - bigram    | 81.7
EM - trigram   | 74.5
BHMM [4]       | 86.8
CE+spl [6]     | 88.6
InitEM-HMM [2] | 91.4
IP+EM [3]      | 91.6
CNet           | 67.72
CNet + CL      | 88.34
CNet + CL + CD | 90.55

We presented a novel method for weakly supervised training of neural networks for part-of-speech tagging which does not make use of explicitly annotated data. Our method uses an organized training regime, called curriculum learning, to train the neural network while at the same time training a distributed embedding of the words. While the performance of the model is based entirely on lexical information, it remains an open question whether the performance can be further improved using global or long-distance constraints. Finally, the technique used in this paper might also be applied to other disambiguation tasks such as supertagging.

REFERENCES
[1] B. Merialdo, "Tagging English text with a probabilistic model," Computational Linguistics, 1994.
[2] Y. Goldberg, M. Adler, and M. Elhadad, "EM can find pretty good HMM POS-taggers (when given a good start)," in Proc. of ACL, pp. 746–754, June 2008.
[3] S. Ravi and K. Knight, "Minimized models for unsupervised part-of-speech tagging," in Proc. of ACL-IJCNLP, pp. 504–512, August 2009.
[4] S. Goldwater and T. Griffiths, "A fully Bayesian approach to unsupervised part-of-speech tagging," in Proc. of ACL, pp. 744–751, June 2007.
[5] K. Toutanova and M. Johnson, "A Bayesian LDA based model for semi-supervised part-of-speech tagging," Advances in Neural Information Processing Systems, no. 20, 2008.
[6] N. Smith and J. Eisner, "Contrastive estimation: Training log-linear models on unlabeled data," in Proc. ACL, pp. 354–362, 2005.
[7] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
[8] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in ICML '08: Proceedings of the 25th International Conference on Machine Learning. New York, NY, USA: ACM, 2008, pp. 160–167.
[9] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning. New York, NY, USA: ACM, 2009, pp. 41–48.

Online Clustering with Experts

Claire Monteleoni*, Department of Computer Science, George Washington University (cmontel@gwu.edu)
Anna Choromanska, Department of Electrical Engineering, Columbia University (aec2163@columbia.edu)
* This work was done while C.M. was at the Center for Computational Learning Systems, Columbia University.

Abstract

Approximating the k-means clustering objective with an online learning algorithm is an open problem. We introduce a family of online clustering algorithms by extending algorithms for online supervised learning, with access to expert predictors, to the unsupervised learning setting. Instead of computing prediction errors in order to re-weight the experts, the algorithms compute an approximation to the current value of the k-means objective obtained by each expert. When the experts are batch clustering algorithms with b-approximation guarantees with respect to the k-means objective (for example, the k-means++ or k-means# algorithms), applied to a sliding window of the data stream, our algorithms obtain approximation guarantees with respect to the k-means objective. The form of these online clustering approximation guarantees is novel, and extends an evaluation framework proposed by Dasgupta as an analog to regret. Our algorithms track the best clustering algorithm on real and simulated data sets.

1 Introduction

As data sources continue to grow at an unprecedented rate, it is increasingly important that algorithms to analyze this data operate in the online learning setting. This setting is applicable to a variety of data stream problems, including forecasting, real-time decision making, and resource-constrained learning. Data streams can take many forms, such as stock prices, weather measurements, and internet transactions, or any data set that is so large compared to computational resources that algorithms must access it in a sequential manner. In the online learning model, only one pass is allowed, and the data stream is infinite.
Most data sources produce raw data (e.g. speech signals, or images on the web) that is not yet labeled for any classification task, which motivates the study of unsupervised learning. Clustering refers to a broad class of unsupervised learning tasks aimed at partitioning the data into clusters that are appropriate to the specific application. Clustering techniques are widely used in practice in order to summarize large quantities of data (e.g. aggregating similar online news stories), but their outputs can be hard to evaluate. For any particular application, a domain expert may be useful in judging the quality of a resulting clustering, but having a human in the loop may be undesirable. Probabilistic assumptions have often been employed to analyze clustering algorithms, for example i.i.d. data, or further, that the data is generated by a well-separated mixture of Gaussians. Without any distributional assumptions on the data, one way to analyze clustering algorithms is to formulate some objective function and then to prove that the clustering algorithm either optimizes it or is an approximation algorithm. Approximation guarantees, with respect to some reasonable objective, are therefore useful. The k-means objective is a simple, intuitive, and widely-cited clustering objective; however, few algorithms provably approximate it, even in the batch setting. In this work, inspired by an open problem posed by Dasgupta [4], our goal is to approximate the k-means objective in the online setting.

1.1 The k-means clustering objective

One of the most widely-cited clustering objectives for data in Euclidean space is the k-means objective. For a finite set S of n points in R^d, and a fixed positive integer k, the k-means objective is to choose a set of k cluster centers C in R^d to minimize

Φ_X(C) = sum_{x∈S} min_{c∈C} ‖x − c‖²

which we refer to as the "k-means cost" of C on X. This objective formalizes an intuitive measure of goodness for a clustering of points in Euclidean space. Optimizing the k-means objective is known to be NP-hard, even for k = 2 [2]. Therefore the goal is to design approximation algorithms.

Definition 1. A b-approximate k-means clustering algorithm, for a fixed constant b, on any input data set X, returns a clustering C such that Φ_X(C) ≤ b · OPT_X, where OPT_X is the optimum of the k-means objective on data set X.

Surprisingly few algorithms have approximation guarantees with respect to k-means, even in the batch setting. Even the algorithm known as "k-means" does not have an approximation guarantee. Our contribution is a family of online clustering algorithms, with regret bounds and approximation guarantees with respect to the k-means objective, of a novel form for the online clustering setting. We extend algorithms from [5] and [6] to the unsupervised learning setting, and introduce a flexible framework in which our algorithms take a set of candidate clustering algorithms as experts and track the performance of the "best" expert, or best sequence of experts, for the data. Our approach lends itself to settings in which the user is unsure of which clustering algorithm to use for a given data stream, and it exploits the performance advantages of any batch clustering algorithms used as experts. Our algorithms vary in their models of the time-varying nature of the data; we demonstrate encouraging performance on a variety of data sets.
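A minimal sketch of the general scheme described in this abstract: re-weight clustering experts by the instantaneous k-means cost of their centers on each arriving point, instead of by a supervised prediction error. Plain exponential weights is shown for concreteness; the paper's algorithms build on the expert-tracking variants of [5] and [6], and its experts are b-approximate batch clustering algorithms rather than the fixed center sets used here.

```python
import numpy as np

def weighted_expert_clustering(stream, experts, eta=0.5):
    """Experts propose k centers per point; weights decay with k-means cost."""
    w = np.ones(len(experts)) / len(experts)
    for x in stream:
        proposals = [expert(x) for expert in experts]   # each: array of k centers
        losses = np.array([min(np.sum((x - c) ** 2) for c in centers)
                           for centers in proposals])
        yield proposals[int(np.argmax(w))]              # current best expert's centers
        w *= np.exp(-eta * losses)                      # exponential-weights update
        w /= w.sum()

# toy usage: two "experts" holding fixed center sets
e1 = lambda x: np.array([[0.0, 0.0], [5.0, 5.0]])
e2 = lambda x: np.array([[1.0, 1.0], [4.0, 4.0]])
stream = np.random.default_rng(6).normal(loc=5.0, size=(50, 2))
for centers in weighted_expert_clustering(stream, [e1, e2]):
    pass
```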
2 Online k-means approximation

Specifying an evaluation framework for online clustering is a challenge. One cannot use a similar analysis setting to [3, 1] in the finite data stream case. With no assumptions on the data, one can always design a sequence of observations that will fool a seeding algorithm (that picks a set of centers and does not update them) into choosing seeds that are arbitrarily bad (with respect to the k-means objective) for some future observations, or else into simply abstaining from choosing seeds. Our analysis is inspired, in part, by an evaluation framework proposed by Dasgupta as an analog to regret [4]. The regret framework, for the analysis of supervised online learning algorithms, evaluates algorithms with respect to their additional prediction loss relative to a hindsight-optimal comparator method. With the goal of analyzing online clustering algorithms, Dasgupta proposed bounding the difference between the cumulative clustering loss since the first observation,
$$L_T(\mathrm{alg}) = \sum_{t \le T} \min_{c \in C_t} \|x_t - c\|^2, \qquad (1)$$
where the algorithm outputs a clustering, $C_t$, before observing the current point, $x_t$, and the optimal k-means cost on the points seen so far. We provide clustering variants of predictors with expert advice from [5] and [6], and analyze them by first bounding this quantity in terms of regret with respect to the cumulative clustering loss of the best (in hindsight) of a finite set of batch clustering algorithms. Then, adding assumptions that the batch clustering algorithms are b-approximate with respect to the k-means objective, and that the clustering setting is non-predictive, i.e. the clustering algorithms observe the current point before trying to cluster it, we extend our regret bounds to obtain bounds on a non-predictive variant of Equation 1 (where the point $x_t$ is now observed before the clusterings are output), with respect to the optimal k-means cost on the points seen so far.¹

¹ We also show that no b-approximate algorithm will trivially optimize our loss function by outputting $x_t$.

References

[1] Nir Ailon, Ragesh Jaiswal, and Claire Monteleoni. Streaming k-means approximation. In NIPS, 2009.
[2] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75:245–248, May 2009.
[3] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, 2007.
[4] Sanjoy Dasgupta. Course notes, CSE 291: Topics in unsupervised learning. Lecture 6: Clustering in an online/streaming setting. Section 6.2.3. http://www-cse.ucsd.edu/~dasgupta/291/lec6.pdf, University of California, San Diego, Spring Quarter, 2008.
[5] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32:151–178, 1998.
[6] Claire Monteleoni and Tommi Jaakkola. Online learning of non-stationary sequences. In NIPS, 2003.

Efficient Learning of Word Embeddings via Canonical Correlation Analysis
(Also appearing in NIPS '11)

Paramveer S.
Dhillon, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 U.S.A, dhillon@cis.upenn.edu
Dean Foster, Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104 U.S.A, foster@wharton.upenn.edu
Lyle Ungar, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 U.S.A, ungar@cis.upenn.edu

1 Abstract

Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations/embeddings which can then be used as features in supervised classifiers for NLP tasks [1]. Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension "on" at a given time. Broadly, these embedding methods fall into two categories:

1. Clustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts, e.g. Brown clustering.
2. Dense representations: These representations are dense, low dimensional and real-valued. Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks, like C&W embeddings [2] and hierarchical log-bilinear (HLBL) embeddings [3], or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI).

Unfortunately, most of these representations are slow to train, are sensitive to the scaling of the embeddings (especially ℓ2-based approaches like LSA/PCA), and learn a single embedding for a given word type; i.e., all the occurrences of the word "bank" will have the same embedding, irrespective of whether the context of the word suggests it means "a financial institution" or "a river bank". We propose a novel context-specific word embedding method called Low Rank Multi-View Learning, LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution. Our LR-MVL embeddings are context-specific, but context-oblivious embeddings can be trivially obtained from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [4, 5], we show that LR-MVL has strong theoretical grounding. In particular, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM. Moreover, LR-MVL, being linear, does not face the danger of getting stuck in local optima, as is the case for an EM-trained HMM.

In LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled corpus to find the common latent structure, i.e., the hidden state associated with each token. These induced representations of the tokens can then be used as features in a supervised classifier. The context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word space down to a k-dimensional state space. Thus, all eigenvector computations are done in a space that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50,000 words, and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the size of the calculations that are needed.
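The two-view CCA step at the heart of this construction can be illustrated with a small sketch (a toy illustration under simplifying assumptions, not the authors' LR-MVL code): given matrices holding two views of each token, compute canonical projections by whitening each view and taking an SVD of the whitened cross-covariance.

```python
import numpy as np

def cca(X, Y, k, reg=1e-5):
    """Top-k canonical projection matrices between views X (n x p) and Y (n x q)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])  # regularized covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    Wx, Wy = np.linalg.inv(Lx).T, np.linalg.inv(Ly).T   # whitening transforms
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)           # whitened cross-covariance
    return Wx @ U[:, :k], Wy @ Vt.T[:, :k]

# toy usage: a "context" view vs. a "word" view for 1000 tokens
rng = np.random.default_rng(0)
H = rng.normal(size=(1000, 5))          # shared latent state
context = H @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(1000, 40))
words = H @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(1000, 30))
A_ctx, A_word = cca(context, words, k=5)
state = words @ A_word                   # k-dimensional state for each token
```

In LR-MVL, the right-singular-vector side of this computation, applied to the (left context, right context) and word views, plays the role of the eigenfeature dictionary A described below.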
The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the "eigenfeature dictionary". Our theorem below shows that the LR-MVL algorithm learns a reduced rank matrix A that allows a significant data reduction while preserving the information in our data, and that the estimated state does the best possible job of capturing any label information that can be inferred by a linear model.

Theorem 1.1. Let L be an n × hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, R be the corresponding n × hv matrix for the right context, and W be an n × v matrix of indicator functions for the words themselves. Define A by the following limit of the right singular vectors: $\mathrm{CCA}_k([L, R], W)_{\mathrm{right}} \approx A$. Under some rank assumptions (omitted for brevity), if $\mathrm{CCA}_k(L, R) \equiv [\Phi^L, \Phi^R]$, then $\mathrm{CCA}_k([L\Phi^L, R\Phi^R], W)_{\mathrm{right}} \approx A$.

In conclusion, LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-of-the-art performance on named entity recognition (NER) and chunking problems.

References

[1] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. ACL '10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010) 384–394
[2] Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. ICML '08, New York, NY, USA, ACM (2008) 160–167
[3] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. ICML '07, New York, NY, USA, ACM (2007) 641–648
[4] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden Markov models. In: COLT. (2009)
[5] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS-2010. (2010)

Improved Semi-Supervised Learning using Constraints and Relaxed Entropy Regularization via Deterministic Annealing

Paramveer S. Dhillon, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 U.S.A, dhillon@cis.upenn.edu
Sathiya Keerthi, Kedar Bellare, Olivier Chapelle, S. Sundararajan, Yahoo! Research, Santa Clara, CA 95054, U.S.A

1 Abstract

Recently there has been substantial interest in learning in the presence of background knowledge, particularly in the form of expectation constraints. To this end several related frameworks have been proposed, e.g. Generalized Expectation (GE) [1], Posterior Regularization (PR) [2, 3, 4], Learning from Measurements [5] and Constraint Driven Learning (CODL) [6]. However, a different thread of research that has shown great empirical success for binary and multiclass semi-supervised classification problems (e.g. TSVM) exploits the cluster (low density separation) assumption, which states that the separating hyperplane for classification should be placed in a low density region. Entropy regularization [7] represents the same assumption for probabilistic models. This assumption works well for binary and multiclass problems, but it is unclear if it helps in sequence labeling and other structured output tasks.
The main premise of this assumption is that it allows us to place the separating hyperplane more finely than a purely supervised setting would, hence leading to better generalization performance and test accuracy. The optimization problem solved by binary/multiclass TSVM [8] is
$$\theta^*, \mathbf{Y}^* = \arg\min_{\theta, \mathbf{Y}} \; R(\theta) + L(\mathbf{Y}_L; \mathbf{X}_L, \theta) + L(\mathbf{Y}; \mathbf{X}, \theta) + \tfrac{1}{2}\|\xi\|^2 \qquad (1)$$
$$\text{s.t. } \psi(\mathbf{Y}; \mathbf{X}) - c \le \xi,$$
where the θ are the parameters of the model features, $(\mathbf{X}_L, \mathbf{Y}_L)$ and $(\mathbf{X}, \mathbf{Y})$ are the token–label pairs for the labeled and unlabeled data respectively, $\psi(\mathbf{Y}; \mathbf{X})$ are the constraint features, c is their user-specified target value (which we would want our model to match), $\tfrac{1}{2}\|\xi\|^2$ and $R(\theta)$ are the regularizers for constraints and model parameters respectively, $L(\mathbf{Y}_L; \mathbf{X}_L, \theta)$ is a loss function, e.g., log-loss, and ξ is a vector of slacks on the constraints.

The above combinatorial optimization can be solved efficiently for binary/multiclass problems by using label switching techniques, where the constraints are generally label fraction constraints. However, this becomes intractable for structured prediction, as we have to optimize over exponentially many labelings Y for a given sequence. So, we relax this by inducing a distribution a(Y) over the unlabeled data, which makes our approach very general (it works for binary/multiclass/structured prediction data):
$$\theta^*, a(\mathbf{Y})^* = \arg\min_{\theta, a(\mathbf{Y}), \xi} \; R(\theta) + L(\mathbf{Y}_L; \mathbf{X}_L, \theta) + \sum_{\mathbf{Y}} a(\mathbf{Y}) L(\mathbf{Y}; \mathbf{X}, \theta) + \tfrac{1}{2}\|\xi\|^2 \qquad (2)$$
$$\text{s.t. } E_a[\psi(\mathbf{Y}; \mathbf{X})] - c \le \xi.$$

The unlabeled term in the equation above can be viewed as a relaxed version of entropy regularization (ER). While ER sets a(Y) to the model's distribution $p_\theta(\mathbf{Y})$ (in the process complicating the primal, as constraints have to be imposed on $p_\theta(\mathbf{Y})$), we relax this by using a separate distribution a(Y). Our formulation aims to find a constraint-satisfying distribution a(Y) which, when used as a label distribution, will be consistent with the model training. If, in (2), we fix θ and ξ, the remaining inner problem is a linear programming problem in a(Y). The resulting (sparse) solution is nothing but a distribution in the sub-polytope of constrained Viterbi solutions. Because of this property, the model resulting from (2) tends to be very good at satisfying constraints in test set inference. Note that methods such as GE and PR enforce constraints only in an expected sense, and so they may not satisfy the constraints well in test inference. This is a key advantage of our approach.

To reach a good minimum we use deterministic annealing and include the homotopy term $T \sum_{\mathbf{Y}} a(\mathbf{Y}) \log a(\mathbf{Y})$; this ensures that the constraints (background knowledge) are satisfied transductively in the deterministic annealing limit (T → 0):
$$\theta^*, a(\mathbf{Y})^* = \arg\min_{\theta, a(\mathbf{Y}), \xi} \; R(\theta) + L(\mathbf{Y}_L; \mathbf{X}_L, \theta) + \sum_{\mathbf{Y}} a(\mathbf{Y}) L(\mathbf{Y}; \mathbf{X}, \theta) + T \sum_{\mathbf{Y}} a(\mathbf{Y}) \log a(\mathbf{Y}) + \tfrac{1}{2}\|\xi\|^2 \qquad (3)$$
$$\text{s.t. } E_a[\psi(\mathbf{Y}; \mathbf{X})] - c \le \xi; \quad \sum_{\mathbf{Y}} a(\mathbf{Y}) = 1.$$

For any given T the overall objective is non-convex, but it becomes convex when one set of variables is fixed. So, we use alternating optimization. Keeping a(Y) fixed, the objective becomes
$$\theta^* = \arg\min_{\theta} \; R(\theta) + L(\mathbf{Y}_L; \mathbf{X}_L, \theta) + \sum_{\mathbf{Y}} a(\mathbf{Y}) L(\mathbf{Y}; \mathbf{X}, \theta), \qquad (4)$$
and with θ fixed the optimization problem becomes
$$a(\mathbf{Y})^* = \arg\min_{a(\mathbf{Y}), \xi} \; \sum_{\mathbf{Y}} a(\mathbf{Y}) L(\mathbf{Y}; \mathbf{X}, \theta) + T \sum_{\mathbf{Y}} a(\mathbf{Y}) \log a(\mathbf{Y}) + \tfrac{1}{2}\|\xi\|^2 \qquad (5)$$
$$\text{s.t. } E_a[\psi(\mathbf{Y}; \mathbf{X})] - c \le \xi; \quad \sum_{\mathbf{Y}} a(\mathbf{Y}) = 1.$$

The above primal objective for the optimization of a(Y) contains one variable per labeling Y, so we work in the dual space, where we have one variable per expectation constraint.
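As a toy illustration of the annealing loop behind (3)–(5), here is a sketch under strong simplifying assumptions (not the authors' implementation): the label set is small enough to enumerate, θ is held fixed so only the a(Y) subproblem is shown, and the slack/dual machinery is replaced by a fixed penalty weight mu on constraint violations, which yields a closed-form Gibbs update for a(Y).

```python
import numpy as np
from itertools import product

# toy setup: sequences of length 4, binary labels; the per-labeling loss is a
# stand-in for L(Y; X, theta), which in the real method comes from the model.
rng = np.random.default_rng(1)
labelings = np.array(list(product([0, 1], repeat=4)))   # all Y (16 here)
loss = rng.normal(size=len(labelings)) ** 2              # fixed for the demo

def constraint_violation(Y, target=2):
    # example constraint feature: "about `target` tokens should be labeled 1"
    return abs(int(Y.sum()) - target)

psi = np.array([constraint_violation(Y) for Y in labelings], dtype=float)

T, mu = 1.0, 5.0   # temperature; mu stands in for the dual/slack machinery
for _ in range(30):
    # Gibbs-form minimizer of sum_Y a(Y)(loss + mu*psi) + T sum_Y a(Y) log a(Y)
    logits = -(loss + mu * psi) / T
    a = np.exp(logits - logits.max())
    a /= a.sum()
    # (in the full method, theta would be re-fit here with a(Y) held fixed)
    T *= 0.8           # deterministic annealing: T -> 0
print(labelings[a.argmax()], a.max())  # mass concentrates on constrained solutions
```

As T shrinks, a(Y) collapses onto low-loss labelings that satisfy the constraints, mirroring the claim that constraints are satisfied transductively in the annealing limit.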
From a computational point of view, the ideas in this solution are similar to the dual ideas used in posterior regularization [2, 3, 4]. Our formulation can handle two types of constraints:

1. Instance level constraints: e.g. "the Journal field (or any field in general) appears at most once in a given sequence".
2. Corpus level constraints: e.g. "the average number of authors in a citation should be 3" or "35% of all the tokens should be labeled Author".

We achieve state-of-the-art accuracy on several information extraction tasks.

References

[1] Mann, G.S., McCallum, A.: Generalized expectation criteria for semi-supervised learning with weakly labeled data. Journal of Machine Learning Research 11 (2010) 955–984
[2] Gärtner, T., Le, Q.V., Burton, S., Smola, A.J., Vishwanathan, S.V.N.: Large-scale multiclass transduction. In: NIPS. (2005)
[3] Ganchev, K., Graca, J., Gillenwater, J., Taskar, B.: Posterior regularization for structured latent variable models. Technical Report MS-CIS-09-16, University of Pennsylvania Department of Computer and Information Science (2009)
[4] Bellare, K., Druck, G., McCallum, A.: Alternating projections for learning with expectation constraints. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. UAI '09, Arlington, Virginia, United States, AUAI Press (2009) 43–50
[5] Liang, P., Jordan, M.I., Klein, D.: Learning from measurements in exponential families. In: Proceedings of the 26th Annual International Conference on Machine Learning. ICML '09, New York, NY, USA, ACM (2009) 641–648
[6] Chang, M.W., Ratinov, L.A., Roth, D.: Guiding semi-supervision with constraint-driven learning. In: ACL. (2007)
[7] Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In: NIPS. (2004)
[8] Joachims, T.: Transductive inference for text classification using support vector machines. In: ICML. (1999) 200–209

A Reliable Effective Terascale Linear Learning System

Alekh Agarwal, UC Berkeley; Olivier Chapelle, Yahoo! Research; Miroslav Dudík, Yahoo! Research; John Langford, Yahoo! Research

Abstract

We present a system and a set of techniques for learning linear predictors with convex losses on terascale datasets, with billions of training examples and millions of parameters, in an hour or less using a cluster of 1000 machines. This result is highly effective, producing around a 500-fold speed-up over an optimized online gradient descent approach. One of the core techniques used is a new communication infrastructure implemented on a Hadoop cluster, which allows rapid aggregation of information across nodes (this operation is often referred to as AllReduce). The communication infrastructure appears broadly reusable for many other tasks.

Improving the computational complexity of existing algorithms is notoriously difficult, because the space of solutions involves the space of algorithms, which is difficult to comprehend. For example, in machine learning, many algorithms can be stated as statistical query algorithms [2], which are easily phrased in the MapReduce framework [1, 3]. The parallelization here might appear quite compelling, as the MapReduce implementation can yield good speed-up curves. However, these algorithms are uncompelling when a basic online gradient descent approach can, on a single node, learn a good linear predictor 100 or 1000 times faster than the slow statistical query algorithm [4, 5]. There are many other crude but extremely effective tricks, such as learning on a subsampled dataset. Given this observation, we should skeptically evaluate a parallel learning system.
When can we be sure that it is doing something nontrivial? We focus on learning systems fitting generalized linear models such as linear regression or logistic regression. To evaluate the efficacy of their parallel implementation, we propose the following two conditions:

1. Is the parallel algorithm's throughput faster than the I/O interface of a single machine? Throughput is measured as the input size divided by the running time.
2. Is subsampling ineffective on our prediction problem? If we have a trillion examples, but only want to learn one parameter with a convex loss, subsampling is an extremely effective strategy.

If the answer is "yes" to the first question, the only possibility in which a single machine could match the performance of the parallel algorithm is by subsampling. If, however, the answer to the second question is also "yes", this possibility is excluded, and hence no single-machine algorithm is a viable alternative to the parallel learning algorithm. For our learning algorithms, the answers are "yes" and "yes", as verified experimentally. Furthermore, ours appear to be the first parallel learning algorithms for which both answers are "yes". There is no single trick which allows us to achieve this; instead it is a combination of techniques, most of which have appeared elsewhere, and a few of which are new.

One of the most important new tricks is a communication infrastructure that implements the most expensive operation in the fitting of generalized linear models, which is the accumulation of the gradient across examples. It is functionally similar to an MPI-style AllReduce operation (hence we use the name), but it is made compatible with Hadoop in important ways. AllReduce is an operation which starts with a scalar or a vector in each node, and ends with every node containing the aggregate (typically the sum or average, but there are other possibilities). This is commonly implemented as a sequence of "reduce" (pairwise aggregation) and broadcast operations on a spanning tree over the individual compute nodes. The implementation can be deeply pipelined, implying that latency is unimportant, and the bandwidth requirements at any node in the spanning tree are within a constant factor of the minimum.

Implementation of AllReduce using a single spanning tree is less desirable than MapReduce in terms of reliability, because if any individual node fails, the entire computation fails. Instead of implementing a general-purpose reliability protocol (such as network error-correcting codes), which would impose a significant slow-down, we implemented two light-weight tricks which make AllReduce reliable enough to use in practice on computations up to 10K node hours.

1. The primary source of node failure comes from disk read failures. In this case, Hadoop automatically restarts on a different machine with identical data. We delay the initialization of the communication infrastructure until every node finishes reading its data, which avoids disk read failures.
2. It is common for large clusters of machines to be busy with many jobs which use the cluster in an uneven way, which commonly results in one node in a thousand being very slow. To get around this, Hadoop can speculatively execute a job on identical data, using the first job to complete and killing the others. By using a slightly more sophisticated spanning tree construction protocol, AllReduce can benefit from the same system.

The net effect is perhaps another order of magnitude of scalability in practice. With the two tricks above, we found the system reliable enough for up to 1000 nodes.
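The reduce-and-broadcast pattern on a spanning tree can be illustrated with a minimal single-process simulation (a sketch of the general idea only; the actual system is a pipelined, fault-tolerant, Hadoop-compatible implementation, and all names here are hypothetical).

```python
import numpy as np

# a binary spanning tree over 7 nodes: node 0 is the root
children = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}

def reduce_sum(node, values):
    """Reduce phase: each node returns its value plus its subtree's sums."""
    total = values[node].copy()
    for c in children[node]:
        total += reduce_sum(c, values)
    return total

def broadcast(node, total, out):
    """Broadcast phase: push the aggregate back down the tree."""
    out[node] = total
    for c in children[node]:
        broadcast(c, total, out)

# each "node" holds a local gradient; after allreduce, all hold the sum
local_grads = {n: np.random.default_rng(n).normal(size=4) for n in children}
aggregate = reduce_sum(0, local_grads)
result = {}
broadcast(0, aggregate, result)   # result[n] is identical for every node n
```

In the real system, each pairwise aggregation and broadcast crosses the network, so the depth of the tree, pipelining, and per-node bandwidth are what matter; this toy version only shows the dataflow.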
In Fig. 1, we plot the speed-up curve for up to 100 nodes. Programming machine learning algorithms with AllReduce is typically easier than programming with MapReduce, because code does not need to be rewritten; it suffices to add a few lines to an existing single-thread program to synchronize the relevant state.

Figure 1: Speed-up relative to the run with 10 nodes, as a function of the number of nodes. The results are for a logistic regression problem (click-through-rate prediction) with 2.3B examples, 16M parameters, and about 100 non-zero covariates per example.

Other new tricks that we use are hybrid online+batch algorithms and more sophisticated weight averaging. Both of these tricks are straightforwardly useful. The net effect is that we can train logistic regression for click-through-rate prediction using 17B examples and 16M parameters (around 100 non-zero covariates per example) in about 70 minutes using 1000 nodes. This means a net throughput of 500M non-zero covariates/s, exceeding the input bandwidth (1Gb/s) of any individual node involved in the computation.

References

[1] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Oper. Syst. Design and Impl., 2004.
[2] M. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 1993.
[3] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. NIPS, 2007.
[4] L. Bottou. Stochastic Gradient Descent Examples on Toy Problems. http://leon.bottou.org/projects/sgd, 2007.
[5] Vowpal Wabbit open source project. http://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.

Density Estimation Based Ranking from Decision Trees

Haimonti Dutta, The Center for Computational Learning Systems, Columbia University, NY 10115.

Decision tree algorithms (such as C4.5, CART, ID3) are used extensively in machine learning applications for classification problems. However, many applications, including recommendation systems for movies, books and music, information retrieval (IR) systems to access documents, software to estimate susceptibility to failure of devices, and personalized email filters giving priority to unread or urgent email, focus on learning to order instances rather than classifying them. Probability Estimation Trees (PETs) are decision trees that can order instances based on the probability of class membership. They estimate the posterior probabilities P(y|x) instead of the predicted class labels. A simple approach to computing a probability from a leaf node of a decision tree is to divide the number of instances belonging to a class by the total number of instances in the leaf node. Other, more sophisticated techniques recommend refraining from pruning the tree and using smoothing techniques (such as Laplace smoothing and M-estimation [1]). The popularity of Probability Estimation Trees for generating ranking models stems from the fact that they are easily understandable by following a path in the tree. PETs, however, suffer from several known problems: (1) all instances falling in the same leaf will share the same probability estimate if the leaf is pure, which leads to ties; (2) the number of instances in a leaf is usually too small to provide a reliable probability estimate. Ties result in ambiguity for the task of ranking.
Kernel density based estimation techniques can be used to resolve them. The modes of the underlying density are used to define cluster centers, and valleys in the density define boundaries separating clusters. The value of the kernel function provides an estimate of whether a point is close to the center of the class. Instances which are closer to the center are ranked higher than others in the same class. For leaves with a small number of instances, an adaptive kernel smoothing technique is used to ensure greater reliability of the probability estimates.

Non-parametric Kernel Density Estimation (KDE) techniques have been studied in statistics since the 1950s [2]. The formulation of KDE presented here is based on these standard techniques. Consider the univariate case of estimating the density f(x) given samples $\{x_i\}$, $1 \le i \le N$, where $p(X < t) = \int_{-\infty}^{t} f(x)\,dx$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$. The estimate $\hat{f}(x)$, called Parzen's estimate, is obtained by summing the contributions of the kernel $K(x - x_i)$ over all the samples and normalizing such that the estimate itself is a density. Thus,
$$\hat{f}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right),$$
where h is the bandwidth of the estimator. Common choices of kernel include the Gaussian, Cauchy, and Epanechnikov kernels. A point x which lies in a dense region and has many data points close to it is expected to have a high density estimate f(x), while a point which is far away will only receive contributions from the tails of the kernels, and the estimate is relatively small. The kernel function obeys smoothness properties, and the choice of the bandwidth h can often be difficult. If h is too small, the estimate is "spiky", while too large a value of h smoothes out details.

The Parzen method of density estimation (described above) is insensitive to variations in the magnitude of f(x). Thus, if there is only one sample point, the estimate will have its peak at $x = x_k$ and is likely to be too low in other regions; in contrast, if f(x) has many sample points clustered together, accounting for a large value of f(x), the Parzen estimate is likely to be spread out. Since the leaf nodes in an unpruned decision tree can have even one example, the insensitivity to peakedness of the kernel poses problems for ranking. In the method of variable kernel estimation, the estimates are obtained as follows:
$$\hat{f}(x) = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{(\alpha_k D_K(x))^d} K\!\left(\frac{x - x_j}{\alpha_k D_K(x)}\right),$$
where $\alpha_k$ is a constant multiplicative factor and $D_K(x)$ is the distance from x to its K-th nearest neighbor. This method of estimation enables the kernel to be spread out in low density areas, while in high density areas it will be more peaked. Furthermore, the variable kernel estimator combines the desirable smoothness properties of the Parzen-type estimators with the data-adaptive nature of the K-nearest neighbor estimators. It also does not incur extensive computational cost, as the distance from a point to its K nearest neighbors can be computed once and stored as a pre-processing step in the algorithm.
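The two estimators can be sketched in a few lines for the univariate case (a toy illustration only: a Gaussian kernel, and the bandwidth at x taken from the distance to the K-th nearest sample, matching the variable kernel formula above).

```python
import numpy as np

def parzen(x, samples, h):
    """Fixed-bandwidth Parzen estimate with a Gaussian kernel."""
    u = (x - samples) / h
    return np.exp(-0.5 * u ** 2).sum() / (len(samples) * h * np.sqrt(2 * np.pi))

def variable_kernel(x, samples, K=5, alpha=1.0):
    """Bandwidth at x scales with the distance D_K(x) to the K-th nearest sample."""
    d_K = np.sort(np.abs(x - samples))[K - 1]
    return parzen(x, samples, alpha * d_K)

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 0.3, 20)])
print(parzen(1.0, samples, h=0.5))
print(variable_kernel(1.0, samples))   # bandwidth widens in sparse regions
```

The fixed-h estimate is spiky around isolated leaf examples, while the variable kernel version spreads out exactly where the data is sparse, which is the property the ranking scheme relies on.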
The density estimation based tree construction and ranking then proceeds as described below.

Algorithm 1. Density Estimation based Ranking from Decision Trees
Input: The training data set.
Output: A ranking of instances in the training data.
1. Construct an unpruned classification tree using a standard decision tree construction algorithm like C4.5 or CART.
2. Obtain the probability estimate for each instance: for a d-dimensional instance, follow the path in the decision tree to the appropriate leaf node.
3. Resolve ties using KDE.
4. Assume there are m classes; for each class, generate a local density estimate.
5. Smooth the kernel density estimates obtained in step 4 by variable kernel density estimation.

Several aspects of the algorithm require further discussion. (1) Scalability: in general, kernel density estimation techniques are known to scale poorly for multi-dimensional data. It can be shown theoretically that to achieve a constant approximation error as the number of dimensions grows, one needs exponentially many more examples. The hybrid technique we propose here requires the estimation of the density at the leaf nodes only, and thus it suffices to use only those attributes that appear in the path to the leaf. (2) Multi-class extension: while we do not present results for multi-class ranking models, the design of the algorithm is robust enough to support more than two classes. (3) Effect of pruning trees: for our algorithm we do not consider pruning the decision trees. Such a design was conceived because prior work [3] discouraged pruning for obtaining better probability estimates.

Future work involves examination of the effect of pruning techniques on kernel smoothing, incorporation of shrinkage into the probability estimation scheme, and comparison with other ranking algorithms that are not based on decision trees.

References

[1] B. Cestnik and I. Bratko. On estimating probabilities in tree pruning. In Proceedings of the European Working Session on Learning, pages 138–150, 1991.
[2] D. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, Inc., New York, NY, 1992.
[3] F. Provost and P. Domingos. Tree induction for probability-based ranking. Machine Learning, 52(3):199–215, 2003.

Optimization Techniques for Large Scale Learning

Clement Farabet¹, Marco Scoffier², and Yann LeCun¹
¹ Courant Institute of Mathematical Sciences
² Net-Scale Technologies Inc.

Sept 9, 2011

Abstract

Optimization methods for unconstrained non-convex problems such as Conjugate Gradient Descent (CG) and limited-memory BFGS (L-BFGS) have long been used in machine learning, but for many real world problems, especially problems where the models are large, function evaluations expensive, and data throughput a bottleneck, simple Stochastic Gradient Descent (SGD) has provided equivalent results. The advent of increased parallelization available in today's computing clusters and multi-core architectures, as well as the availability of general purpose computation on FPGA and GPU architectures, has changed the landscape of computation. We evaluate several batch and stochastic optimization techniques applied to the task of learning weights for large (deep) models on large datasets. We also make available our source code for a parallel machine learning library so that other researchers can build upon our results.

1 Introduction

Recently, Quoc Le et al. [1] have reported good results using L-BFGS and CG to train deep networks. In this work we extend their exploration beyond the MNIST dataset. We have found that using L-BFGS as a black box optimization does not immediately improve results on more complicated datasets such as the scene understanding Stanford Background dataset. While attracted by the ability to parallelize L-BFGS and CG in a map-reduce framework, we find that there are details which must be understood to train large systems using these optimization techniques. A toy illustration of the function-evaluation trade-off examined in Section 2.1 appears below.
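The following is our own illustrative sketch on a small least-squares problem (not the paper's library or datasets): SGD spends one cheap mini-batch gradient evaluation per update, while a batch method with a backtracking line search spends several full-data evaluations per update, which is the accounting question posed in Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))
x_true = rng.normal(size=20)
b = A @ x_true + 0.1 * rng.normal(size=1000)

def loss(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def grad(x, idx=slice(None)):
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(bi)

# SGD: one mini-batch gradient evaluation per update
x, evals = np.zeros(20), 0
for t in range(2000):
    i = rng.integers(0, 1000, size=10)
    x -= 0.1 * grad(x, i)
    evals += 1
print("SGD:   loss %.4f after %d mini-batch evals" % (loss(x), evals))

# batch gradient descent with backtracking: several full-data evals per update
x, evals = np.zeros(20), 0
for t in range(50):
    g = grad(x)
    evals += 1
    step, f0 = 1.0, loss(x)
    while loss(x - step * g) > f0 - 0.5 * step * (g @ g):  # each test costs an eval
        step *= 0.5
        evals += 1
    x -= step * g
print("Batch: loss %.4f after %d full-data evals" % (loss(x), evals))
```

The point of the sketch is only the bookkeeping: per unit of progress, the batch method's line search multiplies the number of (expensive) function evaluations, which parallel hardware may or may not compensate for.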
In this work we explore the space of large scale optimization of large models on large datasets, and seek to establish some rules of thumb for the designer of such systems.

2 Experiments

Using several large scale scene parsing datasets (Stanford Background and MSRC, as well as the baselines of MNIST and NORB), we evaluate the convergence properties of SGD, L-BFGS and CG, covering details of the algorithms which impact convergence under several different metrics.

2.1 Number of Function Evaluations

The overall number of function evaluations required is often the limiting factor when models are large. SGD makes one weight update per function evaluation and is essentially a serial operation, which is difficult to parallelize and hence to scale. CG and L-BFGS require multiple function evaluations per parameter update and operate on mini-batches of input, which further adds to the computational complexity per update. Do the updates from these more expensive techniques truly outperform SGD by the time each algorithm has performed the same number of function evaluations per training sample?

2.2 Wall Clock Time

The overall time spent can include the impact of parallelization. If an algorithm is more easily parallelized, it can process much more data in less time, which makes it interesting even if it requires more function evaluations.

2.3 Batch Size vs. Time Through Full Training Set

L-BFGS and CG re-evaluate the same mini-batch several times to make one update, moving through the full training set more slowly than SGD, which updates for every function evaluation. Smaller batches resemble SGD, whereas larger batches are a better approximation of the full training set, can be easily parallelized, but are harder to optimize. Where is the trade-off? Can we benefit from adaptive batch sizes, emulating SGD at the beginning of training and large batches at the end?

2.4 Final Generalization Performance

In machine learning we are interested in generalization. Here we explore a set of under-the-hood tweaks to attempt the best overall generalization error. Perfect optimization (100% accuracy on the training set) over-fits in most cases. A better optimization algorithm may be more prone to over-fitting and thus require a smaller model, making the direct comparison of the same model with different optimization techniques difficult. The optimization guides the choice of model and vice-versa. Does a perfect model/optimization combination generalize better? Are we better off optimizing less on each batch, with less rigorous stopping criteria, or even allowing updates on partial batch evaluations, as can happen frequently in less stable cloud computing situations?

References

[1] Quoc Le, Jiquan Ngiam, Adam Coates, Ahbik Lahiri, Bobby Prochnow, and Andrew Ng. On optimization methods for deep learning. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML '11, pages 265–272, New York, NY, USA, June 2011. ACM.

Real-time Multi-class Segmentation using Depth Cues

Nathan Silberman, Clément Farabet, Rob Fergus, Yann LeCun
Courant Institute of Mathematical Sciences, New York University

September 9, 2011

This work demonstrates a real-time multi-class segmentation system. While significant progress has been made in multi-class segmentation over the last few years, per-pixel label prediction for a given image typically takes on the order of minutes.
This renders these systems impractical for real-time applications such as robotics, navigation and human-computer interaction. Concurrent with these advances, there has been a renewed interest in the use of depth sensors, following the release of the Microsoft Kinect [4], to aid various tasks in computer vision. In this work, we describe a real-time system that provides dense label predictions for a scene given both intensity and depth images. We evaluate this system against a recently released, densely labeled depth dataset [7]. Our approach is motivated by the desire to provide real-time prediction while not sacrificing accuracy. Using a multi-scale neural network, we are able to provide local predictions using scene-level context. Guided by recent success in detection using high-throughput depth cameras [6], we provide the neural network with both intensity and depth inputs of a scene.

The end-to-end prediction pipeline is structured as follows. Using open source drivers [3], we read RGB and depth images from the Kinect in real time. Before they can be fed to the neural network, each pair of frames is aligned using standard techniques for recovery of homography matrices. Following alignment, the depth maps still contain numerous artifacts. Most notable of these is a depth shadow on the left edges of objects. These regions are visible from the depth camera, but not reached by the infra-red laser projector pattern. Consequently their depth cannot be estimated, leaving a hole in the depth map. A similar issue arises with specular and low albedo surfaces. The internal depth estimation algorithm also produces numerous fleeting noise artifacts, particularly near edges. Before extracting features for recognition, these artifacts must be removed. To do this, we filtered each image using the cross-bilateral filter of Paris [5]. Using the RGB image intensities, it guides the diffusion of the observed depth values into the missing shadow regions, respecting the edges in intensity.

Given D images of size M × N, we train a multi-scale convolutional network [2] to associate each input pixel $I_{ij}$ with one of K classes. Intuitively, the multi-scale representation allows the network to predict the class distribution at each location ij by looking at the entire scene, with high resolution around ij and a logarithmically decreasing resolution as the distance to ij increases. Each pixel $I_{ij}$ is a 4-dimensional vector whose components are the 3 classical color channels (red, green, blue) and the aforementioned filtered depth estimate. Each target (label) is a K-dimensional vector t with a single non-zero element. To achieve the multi-scale representation, a Gaussian pyramid $G = \{X_s \mid s = 1..S\}$ is first constructed on the input image I. A convolutional network $C(X_s, W)$ is then applied to each scale, yielding a pyramid of feature maps $\{Y_s \mid s = 1..S\}$. These maps are then upsampled, producing S maps of size M × N, and concatenated to produce a single map of P-dimensional vectors. A two-layer classifier (perceptron) is then applied at each location ij, producing a single K-dimensional vector, which is normalized using a softmax function to produce a distribution over classes for that location. Let $f_{ij}(I, W)$ be the transform that associates each input pixel $I_{ij}$ with an output class distribution. The error function we minimize is the negative log-likelihood (or multi-class cross-entropy) over the dataset:
$$E(f(I, W), t) = -\sum_{d=1}^{D} \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{K} t_k^d \, \ln\big(f_{ij}(I, W)\big)_k.$$
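The multi-scale machinery can be sketched in NumPy (a stand-in for the real system: one tiny shared filter instead of a trained multi-stage ConvNet, and average pooling instead of a proper Gaussian pyramid; all names here are our own).

```python
import numpy as np

def conv2d(img, w):
    """Zero-padded, same-size 2D convolution with a single small filter."""
    kh, kw = w.shape
    p = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = (p[i:i + kh, j:j + kw] * w).sum()
    return out

def pyramid(img, scales=3):
    """Coarse stand-in for a Gaussian pyramid: repeated 2x2 average pooling."""
    levels = [img]
    for _ in range(scales - 1):
        a = levels[-1]
        levels.append(a.reshape(a.shape[0] // 2, 2, a.shape[1] // 2, 2).mean((1, 3)))
    return levels

img = np.random.default_rng(0).normal(size=(16, 16))
w = np.ones((3, 3)) / 9.0                       # one filter, shared across scales
feats = [conv2d(x, w) for x in pyramid(img)]    # same weights W at every scale
up = [np.kron(f, np.ones((img.shape[0] // f.shape[0],) * 2)) for f in feats]
stacked = np.stack(up, axis=0)                  # concatenated multi-scale features
```

Each pixel's stacked feature vector mixes fine detail with progressively coarser, wider-context views of the scene; in the real system this vector feeds the two-layer per-pixel classifier.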
The convolutional network at each scale is the same, using a single parameter vector W. Intuitively, having a different set of parameters for each scale would result in a much larger model, which is thus more prone to over-fitting. Sharing weights can be seen as a simple form of regularization, as the ConvNet C(G, W) is trying to learn a good representation for all scales. Once trained, the convolutional network can be efficiently computed on commodity hardware, such as GPUs or FPGAs, thanks to its homogeneous structure, which essentially involves 2D convolutions. We use the neuFlow processor [1] to process the convolutional network, reducing its computation from a few seconds in software to about 100 ms.

To evaluate our system, we used the dataset from [7], which contains 2347 densely labeled images. While the original dataset contains over 1000 classes, we selected 12 of the most common classes and a generic background class covering the remaining labels. Since many of the images in the dataset are taken from the same room, we split the images into a training and test set, ensuring that images taken from the same scene were never split across training and test sets. We measured the performance of the system using the mean diagonal of the confusion matrix, computed for per-pixel classification over the 13 classes on the test set. We obtained 51% accuracy, which is on par with [7], yet runs at several frames a second rather than over a minute per image.

References

[1] Clément Farabet, Berin Martini, Polina Akselrod, Selcuk Talay, Yann LeCun, and Eugenio Culurciello. Hardware accelerated convolutional neural networks for synthetic vision systems. In International Symposium on Circuits and Systems (ISCAS'10), Paris, May 2010. IEEE.
[2] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In International Symposium on Circuits and Systems (ISCAS'10), Paris, May 2010. IEEE.
[3] Hector Martin. Openkinect.org. Website, 2010. http://openkinect.org/.
[4] Microsoft. Microsoft Kinect. Website, 2010. http://www.xbox.com/kinect/.
[5] Sylvain Paris and Fredo Durand. A fast approximation of the bilateral filter using a signal processing approach. In Proceedings of the European Conference on Computer Vision, pages 568–580, 2006.
[6] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. Real-time human pose recognition in parts from single depth images. June 2011.
[7] N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In Proceedings of the International Conference on Computer Vision - Workshop on 3D Representation and Recognition, 2011.

A text-based HMM model of foreign affair sentiment: A Mechanical Turkers' history of recent geopolitical events

by Sean Gerrish (with helpful comments from David Blei)

Historians currently have limited resources to review thousands of relevant news sources, leading to research biased by popular knowledge or even politics and culture. The goal of this work is to create a history of the relationships between the world's nations using the text of newspaper articles and political commentary. An assumption of our work is that the tension between two nations, or a warm and robust relationship between them, is reflected by the language we use to discuss them, an idea inspired by relational topic models [1].
An advantage of a text-based approach to history is that we can incorporate information from all articles of a given collection with modest computational cost. This means that historians and political scientists can then search and review past documents to identify forgotten or overlooked "blips" in history.

A Spatial Model of Political Sentiment

To infer the sentiment between nations, we assume that each nation lies in a space of latent "foreign sentiment". A spatial model has two benefits. First, it allows for interpretability: nations with similar positions in this latent space tend to interact more positively, while nations further apart tend to have more tension in their relationship. Second, a spatial model allows us to draw on existing work in multidimensional scaling, including work in both item response theory [2] and latent space models [3].

A temporal model of interaction. We assume that each nation c starts at a mean position $\bar{x}_{c,0} \in \mathbb{R}^p$ and drifts over time with the Markov transition
$$\bar{x}_{c,t} \mid \bar{x}_{c,t-1} \sim N(\bar{x}_{c,t-1}, \sigma_K^2). \qquad (1)$$
At any time t, countries $c_1$ and $c_2$ may appear in an article d in the news, and the news may reflect their relationship with sentiment $s_d \in \mathbb{R}$ with the noisy model
$$x_{c_1,d} \sim N(\bar{x}_{c_1,t}, \sigma_D^2), \qquad x_{c_2,d} \sim N(\bar{x}_{c_2,t}, \sigma_D^2), \qquad s_d := x_{c_1,d}^{T} x_{c_2,d}, \qquad (2)$$
where we interpret $s_d$ as the sentiment between $c_1$ and $c_2$: a higher sentiment suggests warm relations between the nations, while a lower sentiment suggests tension or conflict between the nations. A graphical model representing these assumptions is shown in Figure 1 (a). We assume that the sentiment between these two countries will be reflected by the language used to describe them in each news article. We therefore use a discriminative language model to learn sentiment based on these articles' text with text regression [4].¹ As alluded to above, we fit the text regression parameters with supervision using Amazon Mechanical Turk.

Inference

We fit the MAP objective of this probabilistic model. This has the benefit of both cleaner exposition and simpler implementation.

¹ In one of the simplest language models, the nth word $w_n$ of the article is determined by a mixture of the sentiment, $p(w_n \mid s_d) \propto \exp(s_d a_w + b_w)$, where we interpret the coefficient $a_w$ to be term w's sentiment parameter and the intercept $b_w$ to be its background distribution.

Figure 1: (a) A time-series model of countries' interactions. The large plate shows Markov drift for each country. Articles with words $w_d$ appear at various intervals, expressing sentiment s between countries. Pseudo-observations at position zero are added for regularization for sparsely mentioned countries. (b) Example positions of countries in the latent space of national sentiment. (c) Mutual sentiment $s = \bar{x}_{c_1,\cdot}^{T} \bar{x}_{\text{united states},\cdot}$ with the United States over time.

We optimize the MAP objective in this model using an EM-like algorithm.
In the M step, the mean $\bar{x}_c$ of each country's position is estimated using a modified Kalman filter; the primary difference from a standard Kalman filter is that we may have zero or multiple observations at each time period for each country pair. In the E step, the sentiment for each news article is inferred given its mean and the sentiment suggested by the article's text, combined with a sentiment suggested by Mechanical Turkers.

Experiments and Results

We fit this model over twenty years of the New York Times's Foreign Desk, a collection of 94,589 articles from 1987 to 2007. We performed standard stopword removal and tagged each article with the two most frequently mentioned countries. Figures 1 (b, c) show results based on both Mechanical Turkers' ratings and manually curated scores for individual words. In future work, we hope to extend the language model and to apply this method to both Foreign Affairs, a journal covering much of the past century's international foreign policy (from a U.S. perspective), and the Journal of Modern History.

References

[1] Chang, J., D. M. Blei. "Relational topic models for document networks." Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), 5, 2009.
[2] Martin, A. D., K. M. Quinn. "Dynamic ideal point estimation via Markov chain Monte Carlo for the U.S. Supreme Court, 1953–1999." Political Analysis, 10:134–153, 2002.
[3] Hoff, P., A. E. Raftery, M. S. Handcock. "Latent space approaches to social network analysis." Journal of the American Statistical Association, 97:1090–1098, 2002.
[4] Kogan, S., D. Levin, B. Routledge, J. Sagi, N. Smith. "Predicting risk from financial reports with regression." In "ACL Human Language Technologies," 272–280. Association for Computational Linguistics, 2009.

Title: A Probabilistic Foundation for Policy Priors
Authors: Samuel Gershman (Princeton), Carlos Diuk (Princeton), David Wingate (MIT)

Abstract

We describe a probabilistic framework for incorporating structured inductive biases into reinforcement learning. These inductive biases arise from policy priors, probability distributions over optimal policies. Borrowing recent ideas from computational linguistics and Bayesian nonparametrics, we define several families of policy priors that express compositional, abstract structure in a domain. Compositionality is expressed using probabilistic context-free grammars, enabling a compact representation of hierarchically organized sub-tasks. Useful sequences of sub-tasks can be cached and reused by extending the grammars nonparametrically. We present Monte Carlo methods for performing inference, and show how structured policy priors lead to substantially faster learning in complex domains compared to methods without inductive biases.

Contact: Samuel Gershman, Graduate student, Department of Psychology and Princeton Neuroscience Institute, Princeton University, sjgershm@princeton.edu

Large-Scale Collection Threading using Structured k-DPPs

Jennifer Gillenwater, Alex Kulesza, Ben Taskar
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104
{jengi,kulesza,taskar}@cis.upenn.edu

September 9, 2011

Thanks to the increasing availability of large, interrelated document collections, we now have easy access to vast stores of information that are orders of magnitude too big for manual examination.
Tools like search have made these collections useful for the average person; however, search tools require prior knowledge of likely document contents in order to construct a query, and typically reveal no relationship structure among the returned documents. Thus we can easily find needles in haystacks, but understanding the haystack itself remains a challenge. One approach for addressing this problem is to provide the user with a small, structured set of documents that reflect in some way the content space of the collection (or possibly a sub-collection consisting of documents related to a query). In this work we consider structure expressed in "threads", i.e., singly-connected chains of documents. For example, given a corpus of academic papers, we might want to identify the most significant lines of research, representing each by a citation chain of its most important contributing papers. Or, in response to a search for news articles from a particular time period, we might want to show the user the most significant stories from that period, and for each such story provide a timeline of its major events.

We formalize collection threading as the problem of finding diverse paths in a directed graph, where the nodes correspond to items in the collection (e.g., papers), and the edges indicate relationships (e.g., citations). A path in this graph describes a thread of related items, and by assigning weights to nodes and edges we can place an emphasis on high-quality paths. A diverse set of high-quality paths then forms a cover for the most important threads in the collection. To model sets of paths in a way that allows for repulsion (and hence diversity), we employ the structured determinantal point process (SDPP) framework [1], incorporating k-DPP extensions to control the size of the produced threads [2]. The SDPP framework provides a natural model over sets of structures where diversity is preferred, and offers polynomial-time algorithms for normalizing the model and sampling sets of structures. However, even these polynomial-time algorithms can be too slow when dealing with many real-world datasets, since they scale quadratically in the number of features. (In our experiments, the exact algorithms would require over 200 terabytes of memory.) We address this problem using random feature projections, which reduce the dimensionality to a manageable level. Furthermore, we show that this reduction yields a close approximation to the original SDPP distribution, proving the following theorem based on a result of Magen and Zouzias [3].

Theorem 1. Let $P^k$ be the exact k-SDPP distribution on sets of paths, and let $\tilde{P}^k(Y)$ be the k-SDPP distribution after projecting the similarity features to dimension $d = O(\max\{k/\epsilon, (\log(1/\delta) + \log N)/\epsilon^2\})$, where N is the total number of possible path sets. Then with probability at least 1 − δ,
$$\|P^k - \tilde{P}^k\|_1 \le e^{6k\epsilon} - 1 \approx 6k\epsilon. \qquad (1)$$

Finally, we demonstrate our model using two real-world datasets. The first is the Cora research paper dataset, where we extract research threads from a set of about 30,000 computer science papers.
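The random projection step behind Theorem 1 can be sketched directly (a toy illustration under the assumption that similarity features are stored as columns of a matrix B; the dimensions here are made up and far smaller than the real ones).

```python
import numpy as np

def project_features(B, d, seed=0):
    """Project the D-dimensional feature columns of B (D x n) down to d dims.

    A Gaussian projection approximately preserves inner products, and hence
    the determinantal volumes that the (S)DPP probabilities depend on.
    """
    D = B.shape[0]
    G = np.random.default_rng(seed).normal(size=(d, D)) / np.sqrt(d)
    return G @ B

rng = np.random.default_rng(1)
B = rng.normal(size=(20_000, 300)) * 0.01   # 20k-dim features for 300 items
B_low = project_features(B, d=100)
L_exact = B.T @ B                           # kernel the exact algorithm would use
L_approx = B_low.T @ B_low                  # low-dimensional approximation
```

Since the SDPP algorithms scale with the feature dimension, working with B_low in place of B is what brings the memory footprint down from the hundreds of terabytes quoted above to something manageable.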
Figure 1 shows some sample threads pulled from the collection by our method.

Figure 1: Two example threads (left and right) from running a 5-SDPP on the Cora dataset.

The second dataset is a multi-year corpus of news text from the New York Times, where we produce timelines of the major events over six-month periods. We compare our method against multiple baselines using human-produced news summaries as references. Table 1 contains measurements of similarity to the human summaries for our approach (k-SDPP) versus clustering (CLS) and non-max suppression (NMX) baselines.

         2005a  2005b  2006a  2006b  2007a  2007b  2008a  2008b
CLS      3.53   3.85   3.76   3.62   3.47   3.32   3.70   3.00
NMX      3.87   3.89   4.59   5.12   3.73   3.49   4.58   3.59
k-SDPP   6.91*  5.49*  5.79*  8.52*  6.83*  4.37*  4.77   3.91

Table 1: Quantitative results on news text, measured by similarity to human summaries. a = January–June; b = July–December. Starred (*) entries are significantly higher than others in the same column at 99% confidence.

Future work includes applying the diverse path model to other applications, and studying the empirical tradeoffs between speed, memory, and accuracy inherent in using random projections.

References

[1] Alex Kulesza and Ben Taskar. Structured determinantal point processes. In Proc. Neural Information Processing Systems, 2010.
[2] A. Kulesza and B. Taskar. k-DPPs: fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[3] A. Magen and A. Zouzias. Near optimal dimensionality reductions that preserve volumes. Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 523–534, 2008.

Online Learning for Mixed Membership Network Models

Prem Gopalan, David Mimno, Michael J. Freedman and David M. Blei

9 September 2011

Introduction. In this era of "Big Data", there is intense interest in analyzing large networks using statistical models. Applications range from community detection in online social networks to predicting the functions of a protein. MMSB [1] is a powerful mixed-membership model for learning communities and their interactions.
It assigns nodes to multiple communities rather than simple clusters. Posterior inference for MMSB is intractable, and approximate inference algorithms such as variational inference or MCMC sampling are applied. However, these methods require multiple passes through the data, and do not easily work with streaming data. Inspired by the recent work on online variational Bayes for LDA [2], we develop an online variational Bayes algorithm for MMSB based on stochastic optimization.

A mixed-membership model. MMSB is a Bayesian probabilistic model of relational data that assumes context-dependent membership of nodes in K groups, and that each interaction can be explained by two interacting groups. Given the groups, the generative process draws a K-dimensional mixed-membership vector $\pi_a \sim \mathrm{Dir}(\alpha)$ for each node a, and per-interaction membership indicators $z_{a \to b}, z_{a \leftarrow b}$ for each binary pair $y_{a,b}$. The indicators are used to index into a blockmodel matrix $B_{K \times K}$ of Bernoulli rates, and $y_{a,b}$ is drawn from it. The only observed variables in this model are the $y_{a,b}$. There is a degree of non-identifiability in MMSB due to both π and B competing to explain reciprocated interactions. If communities are known to be densely connected internally with sparse external interactions, or have only reciprocated interactions, as is the case with some online social networks, then a simpler model suffices. We replace B with K intra-group interaction rates $\beta_k \sim \mathrm{Beta}(\eta_k)$ and a small, fixed interaction rate ε between distinct groups.

Posterior inference. In variational Bayes for MMSB, the true posterior is approximated by a simpler distribution $q(\beta, \pi, z_{\to}, z_{\leftarrow} \mid \gamma, \phi_{\to}, \phi_{\leftarrow}, \lambda)$. Following [1], we choose a fully factorized distribution q of the form $q(z_{a \to b} = k) = \phi_{a \to b, k}$, $q(\pi_a) = \mathrm{Dir}(\pi_a; \gamma_a)$ and $q(\beta_k) = \mathrm{Beta}(\beta_k; \lambda_k)$. We then apply stochastic optimization to the variational objective. We subsample the interactions $y_{a,b}$, compute an approximate gradient, and follow the gradient with a decreasing step size. Options for subsampling include selecting a node or a pair of interactions uniformly at random, or sampling by exploration, where we first select a node uniformly at random and then explore its neighbors. We derive a first-order stochastic natural gradient algorithm for MMSB below, assuming random pair sampling.

1: Define $f(y_{a,b}, \beta_k) = \beta_k^{y_{a,b}} (1 - \beta_k)^{(1 - y_{a,b})}$. Initialize γ, λ.
2: for t = 0 to ∞ do
3:   E step, for all (a, b) in mini-batch S: initialize $\phi^t_{a \to b}$, $\phi^t_{a \leftarrow b}$.
4:   repeat
5:     Set $g(\phi, k) = \sum_{i \ne k} \phi^{t-1}_i \log f(y_{a,b}, \epsilon)$
6:     Set $\phi^t_{a \to b, k} \propto \exp\{E_q[\log \pi_{a,k}] + \phi^{t-1}_{a \leftarrow b, k} E_q[\log f(y_{a,b}, \beta_k)] + g(\phi_{a \leftarrow b}, k)\}$ for all k
7:     Set $\phi^t_{a \leftarrow b, k} \propto \exp\{E_q[\log \pi_{b,k}] + \phi^{t-1}_{a \to b, k} E_q[\log f(y_{a,b}, \beta_k)] + g(\phi_{a \to b}, k)\}$ for all k
8:   until convergence
9:   M step: compute $\tilde{\gamma}_{a,k} = \alpha_k + \frac{N(N-1)}{|S|} \sum_S (\phi^t_{a \to \cdot, k} + \phi^t_{\cdot \leftarrow a, k})$ for all k and all a
10:  Compute $\tilde{\lambda}_{k,i} = \eta_{k,i} + \frac{N(N-1)}{|S|} \sum_S (\phi^t_{a \to b, k} \phi^t_{a \leftarrow b, k} y_{a,b,i})$ for all $i \in \{0, 1\}$ and all k
11:  Set $\gamma' = (1 - \rho_t)\gamma + \rho_t \tilde{\gamma}$. Set $\lambda' = (1 - \rho_t)\lambda + \rho_t \tilde{\lambda}$
12: end for

Figure 1: Online variational Bayes for MMSB

Despite conceptual and notational similarities with online LDA [2], online MMSB faces unique challenges. First, the online LDA E step finds locally optimal values of parameters associated with a selected document holding the topics fixed. In MMSB a given node a's $\gamma_a$ cannot be optimized independently of other nodes. Therefore, we derive updates for both γ and λ. Second, the dimension of γ (the number of nodes) can be very large, in contrast to LDA's topics. Finally, MMSB can leverage efficient subsampling strategies, such as forest fire sampling, to exploit network structure.
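The stochastic global step (lines 9–11 of the listing above) can be sketched as follows. This is a simplified sketch, not the authors' code: the local parameters φ are taken as given from the E step, and names like `global_step` are our own.

```python
import numpy as np

N, K = 100, 3
alpha, eta = np.ones(K) * 0.1, np.array([5.0, 1.0])
gamma = np.random.default_rng(0).gamma(1.0, size=(N, K))
lam = np.tile(eta, (K, 1))
tau0, kappa = 1024.0, 0.7

def rho(t):
    return (tau0 + t) ** (-kappa)   # decreasing step size for convergence

def global_step(t, pairs, phi_fwd, phi_bwd, y):
    """One stochastic natural-gradient update of (gamma, lam) from |S| pairs."""
    scale = N * (N - 1) / len(pairs)        # re-weight the mini-batch
    g_hat = np.tile(alpha, (N, 1))
    l_hat = np.tile(eta, (K, 1))
    for (a, b), pf, pb, yab in zip(pairs, phi_fwd, phi_bwd, y):
        g_hat[a] += scale * pf              # forward indicator mass for node a
        g_hat[b] += scale * pb              # backward indicator mass for node b
        l_hat[:, 0] += scale * pf * pb * yab        # Beta counts for y = 1
        l_hat[:, 1] += scale * pf * pb * (1 - yab)  # Beta counts for y = 0
    r = rho(t)
    return (1 - r) * gamma + r * g_hat, (1 - r) * lam + r * l_hat

# toy usage with uniform local parameters standing in for the E step output
pairs = [(0, 1), (2, 3)]
phi = np.full((len(pairs), K), 1.0 / K)
y = np.array([1, 0])
gamma, lam = global_step(0, pairs, phi, phi, y)
```

The key design choice mirrored here is that the mini-batch estimate is scaled up by N(N−1)/|S| so it is an unbiased stand-in for the full-data update, and the step size ρ_t decays so the stochastic updates converge.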
Figure 2: Perplexity (left) and accuracy (right) comparisons of batch and online MMSB on simulated networks for various numbers of nodes and learning parameter κ, run for 4 hours each. Accuracy is measured as the RSS between two sets of Hellinger distances, one based on the true π and the other based on E[π]; the distances are computed only for y_{a,b} = 1. In both plots, lower values indicate better performance. For both K=3 and K=5, online has lower perplexity than batch as N increases, and approximately equal perplexity for N=512. Accuracy becomes significantly poorer for batch than for online as N increases. Batch values for N=2048 are unavailable: they had not finished after 4 hours of computation.

Figure 3: (a) and (b) show the convergence and the resulting groups (shown as histograms of publications by year), respectively, on a 256-node subgraph of the arXiv citation dataset. (c) shows the convergence of the held-out likelihood computed over incoming mini-batches on a 4096-node subgraph of the arXiv citation dataset. We set K = 3 in both cases.

Preliminary results on real datasets. We ran online MMSB on a complete subgraph of 256 nodes from the arXiv high-energy physics citation graph and obtained reasonable groups. The histogram of publications by year in each group is shown in Fig. 3(b). Group 3 consists of many publications from 1995, and the average frequency of certain top words (Calabi-Yau, string, symmetry, etc.) was 150–180% of that in Groups 1 and 2; 1995 marked the beginning of the second superstring revolution, with a flurry of research activity spurred by M-theory. There is a greater frequency of singular memberships in Group 3. We are currently evaluating results on a subgraph of 4096 nodes.

References
[1] E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.
[2] M. Hoffman, D. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In Neural Information Processing Systems, 2010.

Planning in Reward Rich Domains via PAC Bandits
Sergiu Goschin, Ari Weinstein, Michael Littman, Erick Chastain
{sgoschin,aweinst,mlittman,erickc}@cs.rutgers.edu
Rutgers University

1 Introduction
In some decision-making environments, successful solutions are common. If the evaluation of candidate solutions is noisy, however, the challenge is knowing when a "good enough" answer has been found. We formalize this problem as an infinite-armed bandit and provide upper and lower bounds on the number of evaluations, or "pulls", needed to identify a solution whose evaluation exceeds a given threshold r0. We use the algorithms presented to identify reliable strategies for solving several screens from the video games Infinite Mario and Pitfall!.

Consider the following simple problem. A huge jar of marbles contains some fraction ρ of black (success) marbles, the rest being white (failure) marbles. We want to find a black marble as quickly as possible. If the black marbles are sufficiently plentiful in the jar, the problem is simple: repeatedly draw marbles from the jar until a black one is found. The expected sample complexity is Θ(1/ρ). This kind of generate-and-test approach is simple, but can be extremely effective when solutions are common—for example, finding an unsatisfying assignment for a randomly generated CNF formula is well solved with this approach.

The corresponding noisy problem is distinctly more challenging. Imagine that the marbles in our jar will be rolled through some sort of obstacle course, and that (due to weight or balance or size) some marbles are more successful at completing the course than others. If we (quickly) want to find a marble that navigates the obstacle course successfully at least r0 = 25% of the time, how do we best allocate our test runs on the course? When do we run another evaluation of an existing marble, and when do we grab a new one out of the jar? How do we minimize the (expected) total number of runs while still ensuring (with high probability) that we end up with a good enough marble?
2 Setting
We formalize this problem as an infinite-armed bandit. We define an arm as a probability distribution over possible reward values in a bounded range [r_min, r_max]. When an arm is pulled, it returns a reward value. One arm a is preferred to another a′ if it has a higher expected reward value, E[a] > E[a′]. Arms are sampled from an arm space S, possibly infinitely large. The distribution D over the arm space defines an infinite-armed bandit problem IB(D). We seek algorithms that take a reward level r0 as input and attempt to minimize the number of pulls needed to identify an arm with expected value r0 or more. This sample complexity has a dependence on D, as it may be likely or unlikely to encounter an arm with high reward. Specifically, define ρ = P_{a∼D}(E[a] ≥ r0) as the probability of sampling a "good enough" arm. We assume the domain is reward rich—specifically, that ρ is bounded away from zero.

Formally, we define an (ε, δ, r0)-correct algorithm ALG for an IB(D) problem to be an algorithm that, after a number of samples T(ε, δ, r0, D) (finite with probability 1), returns an arm a with expected value E[a] ≥ r0 − ε with probability at least 1 − δ. This setting extends the PAC bandit model [1, 3] to an infinite number of arms. The definition of the performance measure generalizes earlier work [4] by allowing the agent to aspire to any reward level, not just the optimal value.

3 Results
We can prove a lower bound on the expected sample complexity of a correct algorithm for the case of Bernoulli arms of the form T(ε, δ, r0, D) = Ω((1/ε²)(1/ρ + log(1/δ))). On the algorithmic side, the interesting trade-off is between getting more accurate estimates of the expectations of previously sampled arms and sampling new arms in the hope of finding one with a higher value. When the concentration of good rewards (ρ) is known, the problem can be reduced to the PAC bandit setting [1], with upper bounds within a logarithmic factor of the lower bound.
Algorithm 1 Template(ε, δ, r0, RejectionProcedure)
1: i = 1, found = FALSE
2: for i = 1, 2, ... do
3:   Sample a new arm a_i ∼ D
4:   decision = RejectionProcedure(a_i, i, ε, δ, r0)
5:   if decision = ACCEPT then
6:     return a_i
7:   end if
8: end for

In the more general (and interesting) case, where ρ is unknown, we designed algorithms that have the structure of Algorithm 1: they sample an arm, make a bounded number of pulls of that arm, and check whether the arm should be accepted (in which case they stop and return it) or rejected (in which case they sample a new arm from D and repeat). The decision rule for acceptance/rejection, and when it can be applied, is what differentiates the algorithms.

A basic strategy (which we label Iterative Uniform Rejection, IUR) pulls an arm a number of times that allows it to decide with high confidence whether the arm has an expected reward of at least r0 − ε and, if so, stops. Otherwise, the algorithm rejects the arm and samples a new one. With high probability, as soon as it sees that it has a good arm, the algorithm stops and returns that particular arm. The algorithm is simple, correct, and achieves a bound close to the lower bound for the problem—O((1/(ρε²)) log(1/(ρδ))).

One problem with IUR is that it is very conservative, in the sense of taking a large number of samples for each arm (the dominant term is always 1/ε²). The algorithm does not take advantage of the fact that when the difference between r0 and the reward of an arm is larger than ε, the decision to stop pulling the arm could be made sooner. To address this issue, we designed another algorithm using ideas from the Hoeffding Races framework [2] (Iterative Hoeffding Rejection, IHR). For an even more aggressive rejection strategy, one that can sometimes throw away "good arms" but has the advantage of also quickly rejecting "bad arms", we used ideas from random walks (Iterative Greedy Rejection, IGR).
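To make the template concrete, here is a minimal sketch of Algorithm 1 instantiated with a uniform-rejection rule in the spirit of IUR: pull each sampled arm enough times that a Hoeffding bound decides, with a per-arm failure budget δ_i = δ/(2i²), whether its mean clears r0 − ε. The constants are our own illustrative choices, not the exact ones from the paper.

```python
import math
import random

def iur_template(sample_arm, pull, r0, eps, delta, reward_range=1.0):
    """sample_arm() draws a new arm from D; pull(arm) returns one reward."""
    i = 0
    while True:
        i += 1
        arm = sample_arm()
        # Split the failure budget across arms (sum_i delta_i <= delta).
        delta_i = delta / (2 * i * i)
        # Hoeffding: n pulls give an (eps/2)-accurate mean w.p. >= 1 - delta_i.
        n = math.ceil(2 * (reward_range / eps) ** 2 * math.log(2 / delta_i))
        mean = sum(pull(arm) for _ in range(n)) / n
        if mean >= r0 - eps / 2:
            return arm  # ACCEPT: w.h.p. E[arm] >= r0 - eps
        # else REJECT and sample a new arm

# Toy usage with Bernoulli arms whose means are drawn uniformly from [0, 1]:
if __name__ == "__main__":
    def sample_arm():
        return random.random()          # an arm is its (hidden) success rate
    def pull(p):
        return 1.0 if random.random() < p else 0.0
    arm = iur_template(sample_arm, pull, r0=0.9, eps=0.1, delta=0.05)
    print("accepted arm with mean", arm)
```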
4 Experiments
Our first experiment used a version of Infinite Mario (a clone of the Super Mario video game, see Figure 1) that was previously modified to fit the RL-Glue framework for the Reinforcement Learning Competition [5]. The game is deterministic and gives us an opportunity to present a natural problem that illustrates the "reward richness" phenomenon motivating our work. We modeled the starting screen in Mario, for 50 different difficulty levels, as a bandit whose arms are action sequences of length at most 50. In the experiments, the agent's goal was to reach a threshold on the right side of the screen (just the very beginning of the level). We restricted the action set of the agent to remove the (somewhat unhelpful) backward action, resulting in 8 total actions, or an arm space of size 8^50. Action sequences were tested in the actual game, assigning a reward of −1 if the agent was destroyed, 0 if it did not reach the goal in 50 steps, and 100 − t otherwise (where t was the number of steps taken before reaching the goal). As the domain is deterministic, no arm needed to be pulled more than once. Thus, the agent simply sampled new arms until one was found with reward greater than r0 = 1. The average number of pulls needed to find a strategy for completing the first screen of each level in a set of 50 levels ranged from 1 to 1000, with a median of 7.7 pulls and a mean of 55.7 (due to a few very tough levels). Thus, testing just a handful of randomly generated action sequences was sufficient to find a successful policy in this game.

Figure 1: A screenshot of Infinite Mario and plots of the distribution of the sample complexity for four algorithms, including one that pulls each arm once (Det).

Our second experiment involved Pitfall!, a game developed for the Atari 2600. The objective in the game is to guide the protagonist, Pitfall Harry, through the jungle while avoiding items that harm him. For our experiments, the goal was defined simply as arriving at the right side of the screen on the top tier (some levels can be finished on a lower tier). To introduce variability and encourage more robust solutions, noise was added by randomly changing the requested joystick input to that of a centered joystick with no button press 5% of the time. We chose several levels of the game of varying difficulty and defined the arms to be action sequences of up to 500 steps (once again, excluding "backward" actions). Figure 2 illustrates the results of running the three algorithms mentioned above on one Pitfall! level. The random-walk-based IGR outperformed IHR, which in turn outperformed the highly conservative IUR by a very large margin.

Figure 2: Plot of the distribution of the sample complexity (pulls needed) for IGR, IHR, and IUR over a set of 5000 repetitions. The distributions are plotted for one Pitfall! level (shown along with a representation of a successful policy in the left half of the figure). Average sample complexity for each algorithm is marked with a circle. δ = 0.01 for all experiments.

References
[1] E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In Conference on Computational Learning Theory, 2002.
[2] V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In International Conference on Machine Learning, 2009.
[3] S. Mannor, J. N. Tsitsiklis, K. Bennett, and N. Cesa-Bianchi. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 2004.
[4] Y. Wang, J.-Y. Audibert, and R. Munos. Algorithms for infinitely many-armed bandits. In Advances in Neural Information Processing Systems, 2008.
[5] S. Whiteson, B. Tanner, and A. White. The reinforcement learning competitions. AI Magazine, 2010.

Nonparametric Multivariate Convex Regression with Applications to Value Function Approximation
Lauren A. Hannah, joint work with David B. Dunson
September 6, 2011

We propose two new nonparametric methods for multivariate regression subject to convexity or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, reinforcement learning, and financial engineering. Although this problem has been studied since the 1950s, there is currently no multivariate method that is computationally feasible for more than a couple of thousand observations.
We introduce frequentist Convex Adaptive Partitioning (CAP) and Multivariate Bayesian Convex Regression (MBCR), which both create a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. Adaptive partitioning makes computation efficient even on large problems. We leverage the search procedure of CAP to create a computationally efficient reversible-jump MCMC sampler for MBCR. We give strong consistency results for CAP in the univariate case and strong consistency results for MBCR in the general case. We also give convergence rates for MBCR, which scale adaptively to the dimensionality of an underlying linear subspace.

Convexity offers a few properties that can be exploited. First, it acts as a regularizer, making CAP and MBCR resistant to overfitting. Second, convex functions can be quickly minimized with commercial solvers. We use CAP and MBCR to fit value function approximations for real-world sequential decision problems with convex value-to-go functions. By allowing efficient search over the decision space, CAP and MBCR let us solve much larger stochastic optimization problems than can currently be handled with state-of-the-art methods. We show that MBCR produces much more robust policies than CAP through model averaging. The methods are applied to pricing large basket options and to fitted Q-iteration for a complex inventory management problem.

Contact information: Lauren A. Hannah, postdoctoral researcher at Duke University in the Department of Statistical Science. Email: lauren.hannah@duke.edu. Phone: (805) 748-2894. Mailing address: Box 90251, Duke University, Durham, NC 27708.
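For intuition, CAP- and MBCR-style fits can be evaluated as the pointwise maximum of the locally linear estimates, which is convex by construction; the sketch below is our own illustration of that representation, not the authors' code.

```python
import numpy as np

def convex_predict(X, slopes, intercepts):
    """f(x) = max_k (a_k . x + b_k), evaluated at each row of X.
    The pointwise max of affine functions is convex by construction."""
    return (X @ slopes.T + intercepts).max(axis=1)

# Toy usage: tangent hyperplanes of f(x) = x^2 at three knots.
x = np.linspace(-2.0, 2.0, 9).reshape(-1, 1)
knots = np.array([-1.5, 0.0, 1.5])
slopes = (2.0 * knots).reshape(-1, 1)   # gradient 2k at knot k
intercepts = -knots ** 2                # tangent at k: y = 2k*x - k^2
print(convex_predict(x, slopes, intercepts))
```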
The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo
Matthew D. Hoffman (mdhoffma@cs.princeton.edu), Department of Statistics, Columbia University, New York, NY 10027, USA
Andrew Gelman (gelman@stat.columbia.edu), Departments of Statistics and Political Science, Columbia University, New York, NY 10027, USA

Abstract
Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlations that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than popular methods such as random walk Metropolis or Gibbs sampling. However, HMC's performance is highly sensitive to two user-specified parameters: a step size ε and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We present the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. NUTS is able to achieve similar performance to a well-tuned standard HMC method without requiring user intervention or costly tuning runs. NUTS can thus be used in applications such as BUGS-style automatic inference engines that require efficient "turnkey" sampling algorithms.

1. Introduction
Hierarchical Bayesian models are a mainstay of the machine learning and statistics communities. Exact posterior inference in such models is rarely tractable, however, and so researchers and practitioners must usually resort to approximate statistical inference methods. Deterministic approximate inference algorithms (for example, those described in Wainwright and Jordan (2008)) can be efficient, but introduce bias and can be difficult to apply to some models. Rather than computing a deterministic approximation to a target posterior (or other) distribution, Markov chain Monte Carlo (MCMC) methods offer schemes for drawing a series of correlated samples that will converge in distribution to the target distribution (Neal, 1993). MCMC methods tend to be less efficient than their deterministic counterparts, but are more generally applicable and are (asymptotically) unbiased.

Not all MCMC algorithms are created equal. For complicated models with many parameters, simple methods such as random-walk Metropolis (Metropolis et al., 1953) and Gibbs sampling (Geman and Geman, 1984) may require an unacceptably long time to converge to the target distribution. This is in large part due to the tendency of these methods to explore parameter space via inefficient random walks (Neal, 1993). When model parameters are continuous rather than discrete, Hamiltonian Monte Carlo (HMC), also known as hybrid Monte Carlo, is able to suppress such random walk behavior by means of a clever auxiliary variable scheme that transforms the problem of sampling from a target distribution into the problem of simulating Hamiltonian dynamics (Neal, 2011). The cost of HMC per independent sample from a target distribution of dimension D is roughly O(D^{5/4}), which stands in sharp contrast with the O(D²) cost of random-walk Metropolis.

This increased efficiency comes at a price. First, HMC requires the gradient of the log-posterior; computing the gradient for a complex model is at best tedious and at worst impossible. This requirement can be made less onerous by using automatic differentiation. Second, HMC requires that the user specify at least two parameters: a step size ε and a number of steps L for which to run a simulated Hamiltonian system. A poor choice of either of these parameters will result in a dramatic drop in HMC's efficiency. Although good heuristics exist for choosing ε, setting L typically requires one or more costly tuning runs, as well as the expertise to interpret the results of those tuning runs. This hurdle limits the more widespread use of HMC, and makes it challenging to incorporate into a general-purpose inference engine such as BUGS (Gilks and Spiegelhalter, 1992), JAGS (http://mcmc-jags.sourceforge.net), or Infer.NET (Minka et al.).

The main contribution of this work is the No-U-Turn Sampler (NUTS), an MCMC algorithm that closely resembles HMC, but eliminates the need to choose the problematic number-of-steps parameter L. We also provide schemes for automatically tuning the step size parameter ε in both HMC and NUTS, which make it possible to run NUTS with no tuning at all. We will show that the tuning-free version of NUTS samples as efficiently as (and often more efficiently than) HMC, even discounting the cost of finding optimal tuning parameters for HMC. NUTS is thus suitable for use in generic inference systems and by users who are unable or disinclined to spend time tweaking an MCMC algorithm.
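The core of the idea fits in a few lines. Below is a minimal sketch (our paraphrase, omitting the recursive doubling and slice-sampling machinery of the full algorithm) of the two ingredients: the leapfrog integrator that HMC and NUTS share, and the U-turn test applied to a trajectory's endpoints θ⁻, θ⁺ and momenta r⁻, r⁺ (all assumed to be numpy arrays).

```python
import numpy as np

def leapfrog(theta, r, grad_log_p, eps):
    """One leapfrog step of the simulated Hamiltonian dynamics."""
    r = r + 0.5 * eps * grad_log_p(theta)
    theta = theta + eps * r
    r = r + 0.5 * eps * grad_log_p(theta)
    return theta, r

def making_u_turn(theta_minus, theta_plus, r_minus, r_plus):
    """The no-U-turn test: stop extending the trajectory once either end
    starts moving back toward the other, i.e. continuing the simulation
    would shrink the distance between the endpoints."""
    dtheta = theta_plus - theta_minus
    return (dtheta @ r_minus) < 0 or (dtheta @ r_plus) < 0
```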
References
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.
W. Gilks and D. Spiegelhalter. A language and program for complex Bayesian modelling. The Statistician, 3:169–177, 1992.
N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
T. Minka, J. Winn, J. Guiver, and D. Knowles. Infer.NET 2.4, Microsoft Research Cambridge, 2010. http://research.microsoft.com/infernet.
R. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.
R. M. Neal. Handbook of Markov Chain Monte Carlo, chapter 5: MCMC Using Hamiltonian Dynamics. CRC Press, 2011.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

Distributed Collaborative Filtering Over Social Networks
Sibren Isaacman, Margaret Martonosi (Princeton University); Stratis Ioannidis (Technicolor); Augustin Chaintreau (Columbia University)

Content created and exchanged on social networks receives an ever-larger share of online users' attention every day. Its sheer volume and its heterogeneous quality call for collaborative tools that let users collectively divide the burden of identifying truly relevant content. But current collaborative filtering methods are frequently blind to the social properties of content exchange, and they are usually centralized, raising privacy concerns. We prove that these two issues can be addressed through a distributed algorithm that is built on top of any process of content exchange. Our technique operates under strict restrictions on information disclosure (i.e., content ratings are exchanged only between communicating pairs of users), and we characterize its evolution in terms of the minimization of a weighted root mean square prediction error. We also present results establishing that this approach is highly practical: it converges quickly to an accurate prediction that is competitive with centralized approaches on the Netflix dataset, and it is shown to work in practice in a prototype application deployed on Facebook among 43 users. More precisely, our main contributions are as follows (details can be found in [1]):

• We propose a mathematical model of a system for distributed sharing of user-generated content streams. Our model captures a variety of different applications, and incorporates correlations both in how producers deliver content and in how consumers rate it.
• We illustrate that estimating the probability distribution of content ratings can be naturally expressed as a matrix factorization (MF) problem. This is in contrast to standard MF formulations, which focus on estimating ratings directly rather than their distribution [5], or assume that the ratings follow a Gaussian distribution [4]. To the best of our knowledge, our work is the first to apply an MF technique in the context of rating prediction in user-generated content streams.
• Using the above intuition, we propose a decentralized rating prediction algorithm in which information is exchanged only across content producer/consumer pairs (a minimal sketch of this exchange appears after the references below). Producers and consumers maintain their own individual profiles; a producer shares its profile only with consumers to which it delivers content, and consumers share a rating they give to an item, as well as their profile, only with the producer that generated it.
• In spite of the above restriction on how information is exchanged among users, our distributed prediction algorithm optimizes a global performance objective. In particular, we formally characterize the algorithm's convergence properties under our model, showing that it reduces a weighted mean square error of its rating distribution estimates.
• We validate our algorithm empirically. First, we use the Netflix dataset as a benchmark to compare the performance of our distributed approach to offline centralized algorithms. Second, we developed a Facebook application that reproduces the main features of a peer-to-peer content exchange environment. Using a month-long experiment with 43 users, we show that our algorithm predicts ratings accurately with limited user feedback.

The experience of browsing the web is more and more embedded in an explicit process of information being created, rated, and exchanged by users (through Facebook "like" buttons, re-tweets, and digg, reddit, and other aggregators). This behavioral web experience has been met by a sharp rise in privacy concerns, even within the US Congress [3]. Enabling users to view and access relevant, high-quality user-generated content without releasing their private information to untrusted third parties is hence a critical challenge. These results prove that information exchange can be restricted to trusted pairs while still providing strong guarantees on the rating prediction error, thanks to a fully distributed and asynchronous learning algorithm. Our results expand the scope of recommender systems to operate on top of any social content sharing platform while providing privacy protection, which unveils important research questions [2]. The case where trust is not reciprocal (such as the follower relationship on Twitter) remains an interesting open case. Our algorithm could also leverage secure multiparty computation to provide, at smaller cost, the privacy guarantees offered by centralized schemes. Finally, our model can be used to analyze how social proximity, captured through the "rate of delivery" for producer-consumer pairs, impacts the efficiency of learning. Both questions are interesting open problems.

References:
[1] Isaacman, S., Ioannidis, S., Chaintreau, A., and Martonosi, M. "Distributed rating prediction in user generated content streams". ACM RecSys 2011 (full paper).
[2] Isaacman, S., Ioannidis, S., Chaintreau, A., and Martonosi, M. "Distributed collaborative filtering over social networks". IEEE Allerton Conference 2011 (invited paper).
[3] Angwin, J. US seeks web privacy 'bill of rights'. Wall Street Journal (Dec. 17, 2010).
[4] Salakhutdinov, R., and Mnih, A. Probabilistic matrix factorization. Advances in Neural Information Processing Systems 20 (2008).
[5] Takacs, G., Pilaszy, I., Nemeth, B., and Tikk, D. Scalable collaborative filtering approaches for large recommender systems. JMLR 10 (2009), 623–656.
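As a toy illustration of the information-flow constraint described in the contributions above, the sketch below performs plain least-squares matrix factorization updates in which each content delivery triggers an exchange between one producer/consumer pair only. Note that this is our own simplification (the paper factorizes rating distributions and proves convergence of a weighted objective); all names and constants here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
P, C, D = 5, 8, 3                       # producers, consumers, profile dim
U = rng.normal(scale=0.1, size=(P, D))  # producer profiles (held locally)
V = rng.normal(scale=0.1, size=(C, D))  # consumer profiles (held locally)

def on_delivery_and_rating(p, c, rating, lr=0.05):
    """Producer p delivered an item; consumer c returns a rating and its
    profile. Only this pair exchanges information, and each side updates
    only its own profile."""
    err = rating - U[p] @ V[c]
    U[p] += lr * err * V[c]             # producer's local step
    V[c] += lr * err * U[p]             # consumer's local step

# Asynchronous stream of deliveries with synthetic ratings.
for _ in range(1000):
    p, c = rng.integers(P), rng.integers(C)
    on_delivery_and_rating(p, c, rating=1.0 if (p + c) % 2 else 0.0)
```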
Can Public Data Help With Differentially-Private Machine Learning?
Geetha Jagannathan (Columbia University), Claire Monteleoni (George Washington University), Krishnan Pillaipakkamnatt (Hofstra University)

Abstract
In the design of privacy-preserving algorithms, public knowledge is usually treated adversarially, as a way to gain insight into an individual's information stored in a database. Differential privacy is a privacy model that offers protection against such attacks, but it assumes that all data being analysed is private. In this paper, we ask a question in the other direction: can public data be used to "boost" the accuracy of differentially private algorithms? We answer this question in the affirmative. Motivated by the well-known semi-supervised model in the machine learning literature, we present a more realistic form of the differential privacy model in which privacy-preserving analysis is strengthened with non-private data. Our main result is a differentially private classifier that makes use of non-private data to increase the accuracy of a classifier constructed from a small amount of private data. Our new model expands the range of useful applications of differential privacy, whereas most current results in the differentially private model require large private data sets to obtain reasonable utility.

Differential Privacy. This privacy model, introduced by Dwork et al. [1], assures that the removal or addition of a single item in a database does not have a substantial impact on the output of a private database access mechanism. It provides protection against arbitrary amounts of auxiliary data available to an attacker. Since the publication of [1], a large number of results have appeared in differentially private data analysis. One major weakness of most of these results is that they require a large quantity of data to obtain meaningful accuracy. Many real-world datasets are small, and hence many differential privacy results do not work on such data sets. In this paper, we propose an enhanced privacy model that tries to address this weakness.

Privacy Model. Our new privacy model is motivated by the well-known "semi-supervised" model in the machine learning literature. We propose a more realistic privacy model that assumes data analyses are performed not only on private data but also on non-private data. This model is useful in scenarios in which the data collector has both private and non-private data. For example, some respondents in a survey may insist on privacy for their data, while others may be willing to make their data publicly available on the basis of some inducement. The model also applies in situations where the public data (such as voter registration data) misses a confidential attribute (such as salary information). This is similar to the semi-supervised model in machine learning.

Problem. Our goal is to construct a differentially private classifier that makes use of non-private unlabeled data to "boost" the accuracy of the differentially private classifier constructed from a small amount of private labeled data. We make the reasonable assumption that the private and the non-private data are both drawn from the same distribution. In this paper, we present the RDT# classifier, which is a non-trivial extension of the random decision tree classifier (RDT) [2].

Differentially Private Random Decision Trees. An RDT is created by choosing test attributes for the decision tree nodes completely at random. The entire tree structure is created using the list of attributes without looking at the training data. The training instances are then incorporated into the structure to compute the distribution of class labels at all the leaves of the tree. Adding noise to the class distributions according to the Laplace mechanism makes the classifier differentially private. A random decision tree classifier is an ensemble of such trees. To classify a test instance, the classifier averages the predictions from all the trees in the ensemble.
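A minimal sketch of that noise step, assuming (as is standard for counting queries) that adding or removing one training row changes a single leaf's class count by at most 1, so Laplace noise with scale 1/ϵ per count suffices for ϵ-differential privacy of the released counts:

```python
import numpy as np

def privatize_leaf_counts(leaf_counts, epsilon, rng=None):
    """leaf_counts: (num_leaves, num_classes) array of class-label counts.
    Adds i.i.d. Laplace(0, 1/epsilon) noise to every count and clips at
    zero so the noisy counts remain usable as frequencies."""
    rng = rng or np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=leaf_counts.shape)
    return np.maximum(leaf_counts + noise, 0.0)

counts = np.array([[30, 2], [4, 12], [0, 7]], dtype=float)
print(privatize_leaf_counts(counts, epsilon=0.5))
```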
RDT# – A New Approach. There are two issues with using RDT directly in our setting. (1) Random decision trees were not designed to handle unlabeled data. (2) The problem for random decision tree classifiers fundamentally lies in the "scattering" of the training instances into the partitions of the instance space induced by the tree. If the number of partitions induced by a tree is large, each partition is likely to contain only a small number of rows of the training data, and the noise added to the summary values in each partition can overwhelm the true counts, leading to poor utility. On the other hand, if the number of partitions is small, there are more rows of the data set in each partition, but the discriminatory power of each such partition (the "purity" of a leaf node) is likely to be low because it spans a large region of the instance space. This also leads to poor utility. In both situations the problem is particularly acute when instances are distributed unevenly in the instance space; this situation also occurs when a dataset suffers from sparsity.

We extend the random decision tree idea to exploit the availability of (non-private) unlabeled data. We use the unlabeled instances in two ways. First, we use them to control the partitioning of the instance space so that denser regions of the space are partitioned more finely than sparse regions. Second, we use the unlabeled examples to "propagate" the labels from the labeled instances to larger regions of the instance space. Together, these techniques boost the utility of the differentially private classifier without lowering privacy. We experimentally demonstrate that our private classifier produces good prediction accuracies even when the private data is fairly limited; see Figure 1. The X-axis shows increasing values of ϵ, the privacy parameter, and the Y-axis shows the error rate.

Figure 1: Errors on the Nursery (12960 rows) and Mushroom (8124 rows) datasets.

[1] C. Dwork, F. McSherry, K. Nissim, and A. Smith. "Calibrating Noise to Sensitivity in Private Data Analysis", in TCC 2006.
[2] G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright. "A Practical Differentially Private Random Decision Tree Classifier", in ICDMW 09.

Place Recommendation with Implicit Spatial Feedback
Berk Kapicioglu (Princeton, Sense Networks), David Rosenberg (Sense Networks), Robert Schapire (Princeton), Tony Jebara (Columbia)
08/08/2011

1 Introduction
Since the advent of the Netflix Prize [1], there has been an influx of papers on recommender systems in the machine learning literature. A popular framework for building such systems has been collaborative filtering (CF) [6]. On the Netflix dataset, CF algorithms were one of the few stand-alone methods shown to have superior performance. Recently, web services such as Foursquare and Facebook Places started to allow users to share their locations over social networks. This has led to an explosion in the number of virtual places that are available for checking in, inundating users with many irrelevant choices. In turn, there is a pressing need for algorithms that rank nearby places according to the user's interests. In this paper, we tackle this problem by providing a machine learning perspective on personalized place recommendation.
Our contributions are as follows. First, we transform and formalize the publicly available check-in information we scraped from Twitter and Foursquare into a collaborative filtering dataset. Second, we introduce an evaluation framework to compare algorithms. The framework takes into account the limitations of the mobile device interface (i.e., one can only display a few ranked places) and the spatial constraints (i.e., the user is only interested in a ranking of nearby venues). Third, we introduce a novel algorithm that exploits the implicit feedback provided by users and demonstrate that it outperforms state-of-the-art CF algorithms. Finally, we discuss and report preliminary results on extending our CF algorithm with explicitly computed user, place, and time features.

2 Dataset
Our dataset consists of 106127 publicly available Foursquare check-ins that occurred in New York City over the span of two months. Each datapoint represents an interaction between a user i and a place j, and includes additional information such as the local time of the check-in, the user's gender and hometown, and the place's coordinates and categories. One can view the dataset as a bipartite graph between users and places, where the edges correspond to check-ins. We preprocess the dataset by computing the largest bipartite subgraph in which each user and each place interacts with at least a certain number of nodes. In the end, we obtain 2993 users, 2498 places, and 35283 interactions. We assume that this dataset is a partially observed subset of an unknown binary matrix M of size (m, n), where m is the number of users and n is the number of places. Each entry M_{i,j} is 1 if user i likes place j, and −1 otherwise. Furthermore, denoting the set of all observed indices by Ω, we assume that ∀(i, j) ∈ Ω, M_{i,j} = 1. In other words, if a user checked into a place, we assume she likes the place, but if she didn't check in, we assume that we don't know whether she likes the place or not.

3 Evaluation Framework
Here, we explain the evaluation framework we used to compare algorithms. We assigned each (i, j) ∈ Ω to either a train or a test partition, and allowed our algorithms to train on the train partition. Then, for each datapoint (i, j) in the test partition, we show the algorithms only the user i, the place j, and the places N(j), where N(j) is the set of neighboring places of j within a given radius (i.e., 500 meters). The algorithms do not know which of the candidate places is the actual checked-in place. We constrain the candidates to be within a certain radius, since we expect the user to be interested only in nearby venues. Each algorithm provides a personalized ranking over the given places, and we measure the 0–1 error of predicting whether the actual checked-in place is among the top t places, for t ∈ {1, 2, 3}. Due to the limitations of the mobile device interface, we are only interested in the accuracy of the top few rankings. We do randomized 10-fold cross-validation and report the results.
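A minimal sketch of this protocol, with hypothetical `neighbors` and `score` inputs standing in for the 500-meter candidate sets and a trained model:

```python
def topt_accuracy(test_pairs, neighbors, score, t=1):
    """test_pairs: iterable of held-out check-ins (user i, venue j);
    neighbors[j]: venues within the radius of j; score(i, k): the model's
    affinity of user i for venue k. Returns the fraction of check-ins
    whose true venue ranks in the top t among its local candidates."""
    hits = 0
    total = 0
    for i, j in test_pairs:
        candidates = [j] + [k for k in neighbors[j] if k != j]
        ranked = sorted(candidates, key=lambda k: -score(i, k))
        hits += j in ranked[:t]
        total += 1
    return hits / total
```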
4 Algorithms
The algorithms we compare range from basic baselines to state-of-the-art CF approaches. The baselines are a predictor that ranks randomly and a predictor that ranks venues by their popularity in the training data. We also have matrix completion algorithms, where the objective is min ‖A‖_* s.t. A_{i,j} = M_{i,j} ∀(i, j) ∈ Ω, and ‖·‖_* denotes the trace norm [5]. In this case, we interpret the entries of the approximation matrix A as confidence scores and rank venues accordingly. This method is justified by [2], which shows that as long as certain technical assumptions are satisfied, with high probability, M can be recovered exactly. In case M is not exactly low rank but approximately low rank, we use OptSpace [4]. The use of OptSpace is motivated by the upper bound associated with RMSE(M, M̂), where RMSE denotes the root mean square error and M̂ denotes the approximate matrix constructed by the algorithm.

The problem with these matrix completion algorithms is that they try to minimize RMSE(M, M̂). What we are actually interested in is an algorithm that simply scores the actual checked-in venue higher than all the neighboring venues. To construct such an algorithm, we decided to both measure the performance of vanilla Maximum Margin Matrix Factorization (MMMF) [7] and build upon it. Analogous to how support vector machines were extended to deal with structured outputs, we extended MMMF to rank checked-in venues higher than nearby venues. Our approach was partially motivated by [3], where users provided implicit feedback when they clicked on a webpage that was ranked low, and the algorithm exploited that feedback. Similarly, every time a user checks into a venue during training, we assume that she preferred that venue over the nearby venues. The optimization problem associated with our algorithm is

min_{U,V}  (λ / (2√(mn))) (‖U‖² + ‖V‖²) + (1/|K|) Σ_{(i,j)∈Ω} Σ_{k∈N(j)} h((UVᵀ)_{i,j} − (UVᵀ)_{i,k}),

where U ∈ R^{m×p} and V ∈ R^{n×p} are the user and place factors, K = {(i, j, k) | (i, j) ∈ Ω, k ∈ N(j)} is an extended set of indices, and h is the smooth hinge function

h(z) = 1/2 − z  if z ≤ 0;  (1/2)(1 − z)²  if 0 < z < 1;  0  if z ≥ 1.

The objective is convex and smoothly differentiable in U and V separately, and, similar to [9], we use alternating minimization and BMRM [8] to minimize it. We have a fast implementation in Python where the objective and gradient computations are implemented in C using Cython. We will demonstrate the complete results of our comparison during the poster session. We will also show some preliminary results on extending the algorithm to exploit explicit user, venue, and time features. However, here's a sneak peek comparing some of the algorithms:

Algorithm      Average Accuracy (within top 1)   Average Accuracy (within top 3)
Random         11.02%                            29.94%
Popular        30.67%                            52.38%
MMMF           33.06%                            55.27%
Spatial MMMF   35.88%                            59.11%
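For concreteness, here is a minimal numpy rendering of the smooth hinge and the objective above. This is our own sketch; as noted, the actual implementation uses alternating minimization with BMRM.

```python
import numpy as np

def smooth_hinge(z):
    """h(z) = 1/2 - z for z <= 0; (1/2)(1 - z)^2 for 0 < z < 1; 0 for z >= 1."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0, 0.5 - z,
                    np.where(z < 1, 0.5 * (1.0 - z) ** 2, 0.0))

def spatial_mmmf_objective(U, V, observed, neighbors, lam):
    """observed: list of check-ins (i, j); neighbors[j]: venues in N(j).
    Averaging over all (i, j, k) triples equals the (1/|K|) sum in the text."""
    m, n = U.shape[0], V.shape[0]
    S = U @ V.T                               # predicted affinity scores
    margins = np.array([S[i, j] - S[i, k]
                        for i, j in observed for k in neighbors[j]])
    reg = lam / (2.0 * np.sqrt(m * n)) * ((U ** 2).sum() + (V ** 2).sum())
    return reg + smooth_hinge(margins).mean()
```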
References
[1] James Bennett and Stan Lanning. The Netflix Prize. In KDD Cup and Workshop in conjunction with KDD, August 2007.
[2] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, December 2009.
[3] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '02, pages 133–142, New York, NY, USA, 2002. ACM.
[4] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. J. Mach. Learn. Res., 11:2057–2078, August 2010.
[5] Zhouchen Lin, Minming Chen, and Yi Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv, March 2011.
[6] B. Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.
[7] Jasson D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, ICML '05, pages 713–719, New York, NY, USA, 2005. ACM.
[8] Choon H. Teo, Alex Smola, S. V. N. Vishwanathan, and Quoc V. Le. A scalable modular convex solver for regularized risk minimization. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '07, pages 727–736, New York, NY, USA, 2007. ACM.
[9] Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, and Alex J. Smola. COFI RANK: maximum margin matrix factorization for collaborative ranking. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, NIPS. MIT Press, 2007.

Recovering Euclidean Distance Matrices via Landmark MDS
Akshay Krishnamurthy (akshaykr@cs.cmu.edu)

Knowledge of a network's topology is essential to facilitating network design, improving performance, and maintaining reliability, among several other applications. However, typical networks of interest, such as the internet, are highly decentralized and grow organically, so no single entity can maintain global structural information. Thus, a fundamental problem in networking is topology discovery, which involves recovering the structure of a network from measurements taken between end hosts or along paths.

Early solutions to the topology discovery problem were tools such as traceroute, which provide detailed path-level measurements but also rely on cooperation from intermediate routers to collect this information. For security and privacy reasons, routers are increasingly blocking these types of requests, rendering these tools obsolete. More recent algorithms use measurements taken between cooperating end hosts and infer the network structure, including uncooperative intermediate nodes. In this direction there are two main lines of research. The first focuses on finding a network structure consistent with the measurements, often assumed to be observed or directly measurable. Proposed algorithms for this task typically probe actively for measurements, injecting large amounts of traffic into the network and disturbing regular activity. Moreover, they often require cooperation from the end hosts to collect measurements. These drawbacks motivate the second line of research, which involves using few measurements to accurately infer additional ones. Recent work in this direction focuses on using passively gathered measurements to identify the network's structural characteristics [1]. A recent algorithm [2] is particularly remarkable in that it can accurately estimate all of the measurements (specifically, hop counts) between pairs of end hosts without requiring their cooperation and without injecting large amounts of traffic into the network. This algorithm, known as landmark MDS, involves instrumenting a few routers (landmark nodes) in the network that monitor traffic and passively collect distance information about the end hosts. Using just the collected distances and the measurements between the landmark nodes, landmark MDS finds a distance-preserving embedding of the end hosts and landmarks in a Euclidean space and uses this embedding to infer the unobserved distances.

[1] Brian Eriksson, Paul Barford, and Robert Nowak. Network Discovery from Passive Measurements. ACM SIGCOMM Computer Communication Review, 38(4):291–302, 2008.
[2] Brian Eriksson, Paul Barford, and Robert Nowak. Estimating Hop Distance Between Arbitrary Host Pairs. In IEEE INFOCOM 2009 – The 28th Conference on Computer Communications, pages 801–809. IEEE, April 2009.
In this paper, we study the theoretical properties of landmark MDS. Specifically, we focus on recovering a distance matrix D ∈ R^{(m+n)×(m+n)} on m landmarks and n end hosts, using the observations between the landmarks and only a few observations between end hosts and landmarks. Our analysis assumes that there exists a set of points X_1, ..., X_{n+m} ∈ R^p such that D_{ij} = ‖X_i − X_j‖²₂. We obtain the following results:

1. In the absence of noise, we give sufficient conditions under which landmark MDS finds an embedding that perfectly recovers the distance matrix D. We show that if the observed landmarks for each of the n end hosts span R^p, then we can exactly recover that end host's coordinates. Applying this to all end hosts results in exact recovery of the distance matrix.

2. Network measurements are invariably corrupted by noise, and we model this with a symmetric perturbation matrix R. In this context, we derive bounds on the average entrywise error between the recovered distance matrix D̂ and D in terms of the noise variance σ² of the perturbation. We use this bound to understand conditions under which landmark MDS is statistically consistent. Simply stated, our result shows that if the number of end hosts n = o(m), where m is the number of landmarks, then we can tolerate noise with variance σ² = O(m) and still obtain consistency as n, m → ∞.

The noisy setting we consider has strong connections to the well-studied noisy low-rank matrix completion problem. Our work differs from the matrix completion literature in that typical matrix completion results assume that observations are sampled uniformly from the matrix, whereas we place structure on the observations. The former model is not suitable for network tomography settings, where we can only collect measurements from a few hosts and are interested in limiting the amount of traffic we inject into the network. Unfortunately, the added structure of our model results in worse rates of convergence in comparison with matrix completion, and we believe this is caused by the inherent randomness in the observations of the matrix completion model.

In addition to deriving our results on recovery both with and without noise, we present several experiments validating our theoretical results. We also perform an empirical evaluation of landmark MDS in comparison with state-of-the-art matrix completion algorithms. Finally, we demonstrate the performance of landmark MDS on both Erdős–Rényi and power-law networks, the latter being a suitable model for communication networks. These experiments encourage the use of landmark MDS in practice.
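A minimal sketch (our condensed rendering, not the code analyzed in the paper) of the two steps: classical MDS on the landmark-landmark squared distances, followed by triangulation of an end host from its observed landmark distances, which has an exact solution whenever those landmarks span R^p, matching result 1 above.

```python
import numpy as np

def classical_mds(D2, p):
    """D2: (m, m) squared distances between landmarks; returns (m, p) coords."""
    m = D2.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * J @ D2 @ J                       # double centering
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:p]               # top-p eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

def triangulate(L, d2_obs, obs):
    """Solve ||x - L_j||^2 = d2_obs[j] for the observed landmark indices obs.
    Subtracting the first observed landmark's equation linearizes the system:
    2 (L_j - L_0) . x = ||L_j||^2 - ||L_0||^2 - d_j^2 + d_0^2."""
    L0, d0 = L[obs[0]], d2_obs[0]
    A = 2.0 * (L[obs[1:]] - L0)
    b = d0 - d2_obs[1:] + (L[obs[1:]] ** 2).sum(axis=1) - (L0 ** 2).sum()
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x
```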
Efficient evaluation of large sequence kernels
Pavel P. Kuksa (NEC Laboratories America, Inc.), Vladimir Pavlovic (Department of Computer Science, Rutgers University)

Classification of sequences drawn from a finite alphabet using a family of string kernels with inexact matching (e.g., spectrum or mismatch) has shown great success in machine learning [6, 3, 9, 4]. However, selection of optimal mismatch kernels for a particular task is severely limited by the inability to compute such kernels for long substrings with potentially many mismatches. We extend prior work on algorithms for computing (k, m) mismatch string kernels and introduce a new method that allows us to evaluate kernels for large k, m. This makes it possible to explore a larger set of kernels with a wide range of kernel parameters, opening the possibility of better model selection and improved performance of the string kernels. To investigate the utility of large (k, m) string kernels, we consider several sequence classification problems, including protein remote homology detection and music classification. Our results show that increased k-mer lengths with larger numbers of substitutions can improve classification performance.

Background. A number of state-of-the-art approaches to classification of sequences over a finite alphabet Σ rely on fixed-length representations Φ(X) of sequences as the spectra (|Σ|^k-dimensional histograms) of counts of short substrings (k-mers) contained, possibly with up to m mismatches, in a sequence; c.f. spectrum/mismatch methods [6, 7, 3]. However, computing similarity scores, or kernels, K(X, Y) = Φ(X)ᵀΦ(Y) using these representations can be challenging; e.g., the efficient O(k^{m+1} |Σ|^m (|X| + |Y|)) trie-based mismatch kernel algorithm [7] depends strongly on the alphabet size and the number of mismatches m. More recently, [4] introduced linear-time algorithms with alphabet-independent complexity O(c_{k,m} (|X| + |Y|)) applicable to the computation of a large class of existing string kernels. The authors show that it is possible to compute an inexact (k, m) kernel as

K(X, Y | m, k) = Σ_{a∈X} Σ_{b∈Y} I(a, b) = Σ_{i=0}^{min(2m,k)} M_i I_i,   (1)

where I(a, b) is the number of common substrings in the intersection of the mutation neighborhoods of a and b, I_i is the size of the intersection of k-mer mutational neighborhoods at Hamming distance i, and M_i is the number of observed k-mer pairs in X and Y having Hamming distance i. This result, however, requires that the number of identical substrings in the (k, m)-mutational neighborhoods of k-mers a and b (the intersection size) be known in advance, for every possible pair of m and the Hamming distance d between the k-mers (k and |Σ| are free variables). Obtaining a closed-form expression for the intersection size for arbitrary k, m is challenging, with no clear systematic way of enumerating the intersection of two mutational neighborhoods. The closed-form solutions obtained in [4] were only provided for cases when m is small (m ≤ 3); no systematic way of obtaining these intersection sizes has been proposed in [4].
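Given the intersection sizes I_d, Eq. (1) turns kernel evaluation into counting k-mer pairs by Hamming distance. The sketch below uses naive quadratic counting in place of the linear-time statistics of [4], and treats the I_d vector as given (e.g., by the reduction described next):

```python
import numpy as np

def hamming_pair_counts(x, y, k, max_d):
    """M_d = number of (k-mer of x, k-mer of y) pairs at Hamming distance d.
    Naive O(|x||y|k) counting; the point here is Eq. (1), not speed."""
    M = np.zeros(max_d + 1)
    xs = [x[i:i + k] for i in range(len(x) - k + 1)]
    ys = [y[i:i + k] for i in range(len(y) - k + 1)]
    for a in xs:
        for b in ys:
            d = sum(ca != cb for ca, cb in zip(a, b))
            if d <= max_d:
                M[d] += 1
    return M

def mismatch_kernel(x, y, k, m, I):
    """Eq. (1): K(x, y | m, k) = sum_{d=0}^{min(2m,k)} M_d * I_d."""
    max_d = min(2 * m, k)
    return float(hamming_pair_counts(x, y, k, max_d) @ np.asarray(I[:max_d + 1]))
```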
In this work we introduce a systematic and efficient procedure for obtaining intersection sizes that can be used for large k and m and arbitrary alphabet size |Σ|. This allows us to effectively explore a much larger class of (k, m) kernels in the process of model selection, which can further improve the performance of the string kernel method, as we show experimentally.

Efficient evaluation of large sequence kernels. For large values of k and m, finding the intersection sizes needed for kernel computation can be problematic. While for smaller values of m a combinatorial closed-form solution can be found easily, for larger values of m finding one becomes more difficult, owing to the increase in the number of combinatorial possibilities as the mutational neighborhood grows (exponentially) in size. On the other hand, direct computation of the intersection by a trie traversal algorithm is computationally difficult for large k and m, as the complexity of the traversal is O(k^{m+1} |Σ|^k), i.e., exponential in both k and m. These issues preclude efficient kernel evaluation for large k and m.

Reduction-based computation of intersection size coefficients. We now show that it is possible to efficiently compute the intersection sizes by reducing the (k, m, |Σ|) intersection size problem to a set of less complex intersection size computations and solving linear systems of equations. The number of k-mers at Hamming distance at most m from both k-mers a and b, I(a, b), can be found in the weighted form

I(a, b) = Σ_{i=0}^{m} w_i (|Σ| − 1)^i.   (2)

The coefficients w_i depend only on the Hamming distance d(a, b) between the k-mers a and b for fixed k, m, |Σ|. For every Hamming distance 0 ≤ d(a, b) ≤ 2m, the corresponding set of coefficients w_i, i = 0, 1, ..., m, can be found by solving a linear system Aw = I of m + 1 equations, with each equation corresponding to a particular alphabet size |Σ| ∈ {2, 3, ..., m + 2}. The left-hand-side matrix A is an (m+1) × (m+1) matrix with elements a_{ij} = i^{j−1}, i = 1, ..., m + 1, j = 1, ..., m + 1:

A = [ 1^0       1^1       1^2       ...  1^m
      2^0       2^1       2^2       ...  2^m
      ...
      (m+1)^0   (m+1)^1   (m+1)^2   ...  (m+1)^m ]

The right-hand side I = (I_0, I_1, ..., I_m)ᵀ is a vector of intersection sizes for a particular setting of k, m, d and |Σ| = 2, 3, ..., m + 2; here I_i, i = 0, ..., m, is the intersection size for a pair of k-mers over an alphabet of size i + 2. Note that the I_i need only be computed for small alphabet sizes, up to m + 2. Hence, this vector can feasibly be computed using a trie traversal for a pair of k-mers at Hamming distance d even for moderately large k, as the size of the trie is only (m + 2)^k as opposed to |Σ|^k. This allows kernels to be evaluated for large k and m, since the traversal is performed over much smaller tries; e.g., even for the relatively small protein alphabet with |Σ| = 20, for m = 6 and k = 13 the trie is 20^13/8^13 ≈ 149011 times smaller. The coefficients w obtained by solving Aw = I do not depend on the alphabet size |Σ|. In other words, once found for a particular combination of values (k, m), these coefficients can be used to determine intersection sizes for any given finite alphabet |Σ| using Eq. 2.
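A minimal numpy sketch of the reduction: the `small_intersections` input stands in for the trie-traversal counts over alphabets of size 2, ..., m + 2, and the recovered weights are then reused for any alphabet via Eq. (2).

```python
import numpy as np

def weights_from_small_alphabets(small_intersections, m):
    """small_intersections[i] = intersection size for alphabet size i + 2,
    all for the same fixed (k, m) and Hamming distance d. The row for
    |Sigma| = s reads sum_j w_j (s - 1)^j, so A is the Vandermonde matrix
    with a_ij = i^(j-1) for i = 1..m+1."""
    base = np.arange(1, m + 2)                       # s - 1 for s = 2..m+2
    A = base[:, None] ** np.arange(m + 1)[None, :]
    return np.linalg.solve(A, np.asarray(small_intersections, dtype=float))

def intersection_size(w, sigma):
    """Eq. (2): I = sum_i w_i (|Sigma| - 1)^i for any alphabet size sigma."""
    return float(w @ (sigma - 1) ** np.arange(len(w)))
```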
Experimental evaluation. We evaluate the utility of large (k, m) computations as a proxy for model selection, by allowing a significantly wider range of kernel parameters to be investigated during the selection process. Such a large-range evaluation is the first of its kind, made possible by our efficient kernel evaluation algorithm. In these evaluations we follow the experimental settings considered in [5] and [4]. We use standard benchmark datasets: the SCOP dataset (7329 sequences, 54 experiments) [9] for remote protein homology detection, and music genre data¹ (10 classes, 1000 sequences) [8] for multi-class genre prediction.

Table 1: Remote homology. Classification performance of the mismatch kernel method.

Kernel           Mean ROC   Mean ROC50
mismatch(5,1)    87.75      41.92
mismatch(5,2)    90.67      49.09
mismatch(6,2)    90.74      49.66
mismatch(6,3)    90.98      49.36
mismatch(7,3)    91.31      52.00
mismatch(7,4)    90.84      49.29
mismatch(9,4)    91.45      53.51
mismatch(10,5)   91.60      53.78
mismatch(13,6)   90.98      50.11

Table 2: Multi-class music genre recognition. Classification performance of the mismatch method.

Kernel           Error   Top-2 Error   F1      Top-2 F1
mismatch(5,1)    34.8    18.3          65.36   81.95
mismatch(5,2)    32.6    18.0          67.51   82.21
mismatch(6,3)    31.2    17.2          68.92   83.01
mismatch(7,4)    31.1    18.0          68.96   82.16
mismatch(9,3)    31.4    18.0          68.59   82.33
mismatch(9,4)    32.2    17.8          67.83   82.36
mismatch(10,3)   32.3    18.0          67.65   82.12
mismatch(10,4)   31.7    19.1          68.29   81.04

Results of mismatch kernel classification for the remote homology detection problem are shown in Table 1. We observe that larger values of k and m perform better than the typically used values of k = 5–6, m = 1–2. For instance, the (k=10, m=5) mismatch kernel achieves a significantly higher average ROC50 score of 53.78, compared to ROC50 scores of 41.92 and 49.09 for the (k=5, m=1) and (k=5, m=2) mismatch kernels. The utility of such large mismatch kernels could not be investigated prior to this study. We also note that the results for per-family or per-superfamily parameter selection suggest the need for model selection and the use of multiple kernels; e.g., per-family kernel selection results in a much higher ROC50 of 60.32, compared to 53.78 for the best single kernel. For the music genre classification task (Table 2), parameter combinations with moderately long k and larger values of m tend to perform better than kernels with small m. As the results show, larger values of m are important for achieving good classification accuracy and outperform settings with small values of m.

Conclusions. In this work we proposed a new systematic method that allows evaluation of inexact string family kernels for long substrings k with large numbers of mismatches m. The method finds the intersection set sizes by explicitly computing them for small alphabet sizes |Σ| and then generalizing to arbitrarily large alphabets. We show that this enables one to explore a larger set of kernels, which, as we demonstrate experimentally, can further improve the performance of the string kernels.

References
[1] Chris H. Q. Ding and Inna Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17(4):349–358, 2001.
[2] Eugene Ie, Jason Weston, William Stafford Noble, and Christina Leslie. Multi-class protein fold recognition using adaptive codes. In ICML '05, pages 329–336, New York, NY, USA, 2005. ACM.
[3] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund, and Christina S. Leslie. Profile-based string kernels for remote homology detection and motif extraction. In CSB, pages 152–160, 2004.
[4] Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Scalable algorithms for string kernels with inexact matching. In NIPS, 2008.
[5] Pavel P. Kuksa and Vladimir Pavlovic. Spatial representation for efficient sequence classification. In ICPR, 2010.
[6] Christina Leslie and Rui Kuang. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res., 5:1435–1455, 2004.
[7] Christina S. Leslie, Eleazar Eskin, Jason Weston, and William Stafford Noble. Mismatch string kernels for SVM protein classification. In NIPS, pages 1417–1424, 2002.
[8] Tao Li, Mitsunori Ogihara, and Qi Li. A comparative study on content-based music genre classification. In SIGIR '03, pages 282–289, New York, NY, USA, 2003. ACM.
[9] Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff, and William Stafford Noble. Semi-supervised protein classification using cluster kernels. Bioinformatics, 21(15):3241–3247, 2005.

¹ http://opihi.cs.uvic.ca/sound/genres
In SIGIR ’03, pages 282–289, New York, NY, USA, 2003. ACM. [9] Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff, and William Stafford Noble. Semi-supervised protein classification using cluster kernels. Bioinformatics, 21(15):3241–3247, 2005. 1 http://opihi.cs.uvic.ca/sound/genres 2 58 of 104 Unsupervised Hashing with Graphs † Wei Liu† Jun Wang‡ Sanjiv Kumar§ Shih-Fu Chang† Electrical Engineering Department, Columbia University, New York, NY, USA {wliu,sfchang}@ee.columbia.edu ‡ IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA wangjun@us.ibm.com § Google Research, New York, NY 10011, USA sanjivk@google.com Hashing is becoming increasingly popular for efficient nearest neighbor search in massive databases. However, learning short codes that yield good search performance is still a challenge. Moreover, in many cases real-world data lives on a low-dimensional manifold, which should be taken into account to capture meaningful nearest neighbors. In this work, we propose a novel graph-based hashing method which automatically discovers the neighborhood structure inherent in the data to learn appropriate compact codes in an unsupervised manner. One of the most critical shortcomings of the existing unsupervised hashing methods such as Locality-Sensitive Hashing (LSH) [1] is the need to specify a global distance measure. On the contrary, in many real-world applications data resides on an intrinsic manifold. For these cases, one can only specify local distance measures, while the global distances are automatically determined by the underlying manifold. Our basic idea is motivated by Spectral Hashing (SH) [2] in which the goal is to embed the data in a Hamming space such that the neighbors in the original data space remain to be neighbors in the Hamming space. Solving the above problem requires three main steps: (i) building a neighborhood graph using all n points from the database (O(dn2 )), (ii) computing r eigenvectors of the graph Laplacian (O(rn)), and (iii) extending r eigenvectors to any unseen data point (O(rn)). Unfortunately, step (i) is intractable for offline training while step (iii) is infeasible for online hashing given very large n. To avoid these bottlenecks, [2] made a strong assumption that data is uniformly distributed. This leads to a simple analytical eigenfunction solution of 1-D Laplacians, but the manifold structure of the original data is almost ignored, substantially weakening the basic theme of that work. In this work, we propose a novel unsupervised hashing approach named Anchor Graph Hashing (AGH) to address both of the above bottlenecks. We build an approximate neighborhood graph using Anchor Graphs [3], in which the similarity between a pair of data points is measured with respect to a small number of anchors (typically a few hundred). The resulting graph is built in O(n) time and is sufficiently sparse with performance approaching to the true kNN graph as the number of anchors increases. Because of the low-rank property of an Anchor Graph’s adjacency matrix, our approach can solve the graph Laplacian eigenvectors in linear time. One critical requirement to make graph-based hashing practical is the ability to generate hash codes for unseen points. This is known as outof-sample extension in the literature. Significantly, we show that the eigenvectors of the Anchor Graph Laplacian can be extended to the generalized eigenfunctions in constant time, thus leading to fast code generation. 
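As a rough illustration of the anchor-graph construction this paragraph describes, the sketch below builds the truncated, normalized point-to-anchor affinity matrix Z, whose low-rank product gives the graph adjacency. It follows our reading of Anchor Graphs [3]; the anchor count, bandwidth heuristic, and the number s of retained anchors per point are illustrative choices, not the authors' settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_graph_affinity(X, n_anchors=300, s=3, sigma=None):
    """Truncated, normalized affinities Z between the n points in X and a
    small set of k-means anchors; the implied adjacency is low rank."""
    anchors = KMeans(n_clusters=n_anchors, n_init=4).fit(X).cluster_centers_
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # squared distances
    if sigma is None:
        sigma = np.sqrt(d2.mean())          # crude bandwidth heuristic
    Z = np.zeros_like(d2)
    nearest = np.argsort(d2, axis=1)[:, :s]  # keep the s closest anchors per point
    rows = np.arange(X.shape[0])[:, None]
    Z[rows, nearest] = np.exp(-d2[rows, nearest] / (2 * sigma ** 2))
    Z /= Z.sum(axis=1, keepdims=True)        # each row sums to one
    return Z, anchors
```

Because each row of Z has only s nonzeros against a few hundred anchors, the graph costs O(n) to build rather than O(dn^2) for the exact neighborhood graph.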
Finally, to deal with the poor quality of hash functions associated with the higher eigenfunctions of the graph Laplacian, we propose a hierarchical threshold learning procedure in which each eigenfunction yields multiple bits. One thus avoids picking higher eigenfunctions to generate more bits, and the bottom few eigenfunctions are visited multiple times. We describe a simple method for optimizing the thresholds to obtain multiple bits. One interesting characteristic of the proposed hashing method AGH is that it tends to capture semantic neighborhoods. In other words, data points that are close in the Hamming space produced by AGH tend to share similar semantic labels (see Fig. 1). This is because for many real-world applications close-by points on a manifold tend to share similar labels, and AGH is derived using a neighborhood graph which reveals the underlying manifold, especially at large scale. Fig. 1 indicates that the hash function of AGH can generate hash bits along manifolds. This key characteristic of AGH, semantic hash bits, is validated by extensive experiments carried out on two datasets, where AGH outperforms state-of-the-art hashing methods as well as exhaustive linear scan in the input space with the commonly used l2 distance. The results are shown in Table 1.

Figure 1. Hash functions. The left subfigure shows the hash function of LSH, and the right subfigure shows the hash function of our approach AGH.

Table 1. Hamming ranking performance on MNIST and NUS-WIDE. r denotes the number of hash bits used in the hashing algorithms, and also the number of eigenfunctions used in the SE l2 linear scan. The K-means execution time for training AGH is 20.1 sec on MNIST and 105.5 sec on NUS-WIDE. All training and test times are recorded in seconds.

MNIST (70k samples):
Method          MAP r=24   MAP r=48   Train r=48   Test r=48
l2 Scan         0.4125     -          -            -
SE l2 Scan      0.5269     0.3909     -            -
LSH             0.1613     0.2196     1.8          2.1x10^-5
PCAH            0.2596     0.2242     4.5          2.2x10^-5
USPLH           0.4699     0.4930     163.2        2.3x10^-5
SH              0.2699     0.2453     4.9          4.9x10^-5
KLSH            0.2555     0.3049     2.9          5.3x10^-5
SIKH            0.1947     0.1972     0.4          1.3x10^-5
One-Layer AGH   0.4997     0.3971     22.9         5.3x10^-5
Two-Layer AGH   0.6738     0.6410     23.2         6.5x10^-5

NUS-WIDE (270k samples):
Method          MP r=24    MP r=48    Train r=48   Test r=48
l2 Scan         0.4523     -          -            -
SE l2 Scan      0.4866     0.4775     -            -
LSH             0.3196     0.2844     8.5          1.0x10^-5
PCAH            0.3643     0.3450     18.8         1.3x10^-5
USPLH           0.4269     0.4322     834.7        1.3x10^-5
SH              0.3609     0.3420     25.1         4.1x10^-5
KLSH            0.4232     0.4157     8.7          4.9x10^-5
SIKH            0.3270     0.3094     2.0          1.1x10^-5
One-Layer AGH   0.4762     0.4761     115.2        4.4x10^-5
Two-Layer AGH   0.4699     0.4779     118.1        5.3x10^-5

References
[1] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proceedings of the 20th ACM Symposium on Computational Geometry, pp. 253-262, Brooklyn, New York, USA, 2004.
[2] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. Advances in Neural Information Processing Systems (NIPS) 21, MIT Press, pp. 1753-1760, 2009.
[3] W. Liu, J. He, and S.-F. Chang. Large Graph Construction for Scalable Semi-Supervised Learning. In Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 679-686, Haifa, Israel, 2010.

Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction*
Siwei Lyu
Computer Science Department, University at Albany, State University of New York
lsw@cs.albany.edu
* This work is to appear at NIPS 2011.

Practical applications of machine learning algorithms in data analysis call for efficient estimation of probabilistic data models. However, when used to learn high-dimensional parametric probabilistic models (e.g., Markov random fields [9] and products of experts [6]), classical maximum likelihood (ML) learning often suffers from computational intractability due to the normalizing partition function. This difficulty motivates the active development of non-ML learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods are seemingly unrelated: each is conceived with specific insights on the learning problem. The multitude of different non-ML learning methods causes a significant "burden of choice" for machine learning practitioners.

In this work, we describe a general non-ML learning principle that we term minimum KL contraction (MKC). The MKC objective is based on an information-geometric view of parametric learning for probabilistic models on a statistical manifold. In particular, we first define a KL contraction operator as a mapping between probability distributions on statistical manifolds under which the Kullback-Leibler (KL) divergence of two distributions always decreases unless the two distributions are equal. The MKC objective then seeks optimal parameters that minimize the contraction of the KL divergence between the two distributions after they are transformed with a KL contraction operator. Preliminary results have shown that the MKC principle provides a unifying framework for a wide range of important or recently developed non-ML learning methods, including contrastive divergence [6], noise-contrastive estimation [5], partial likelihood [3], non-local contrastive objectives [11], score matching [7], pseudo-likelihood [2], maximum conditional likelihood [8], maximum mutual information [1], maximum marginal likelihood [4], and conditional and marginal composite likelihood [10]. Each of these learning objective functions can be recast as an instantiation of the general MKC objective with a different KL contraction operator. The MKC principle provides a deepened and unified understanding of existing non-ML learning methods, which can facilitate the design of new, efficient non-ML learning methods focused on the essential aspects of the KL contraction operators. To the best of our knowledge, such a unifying view of the wide range of existing non-ML learning methods has not been explicitly described previously in the literature of machine learning and statistics.
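For operators built from Markov transitions, the KL contraction property is an instance of the data processing inequality: mapping two distributions through the same stochastic kernel can only shrink their KL divergence. A small numerical illustration (ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

n = 5
p = rng.dirichlet(np.ones(n))
q = rng.dirichlet(np.ones(n))
T = rng.dirichlet(np.ones(n), size=n)   # row-stochastic Markov kernel

# The second value is never larger than the first (equal iff p == q).
print(kl(p, q), kl(p @ T, q @ T))
```

MKC turns this one-sided inequality into a learning objective: the model parameters are chosen so that the amount of contraction is as small as possible.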
References
[1] L. Bahl, P. Brown, P. de Souza, and R. Mercer. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In ICASSP, 1986.
[2] J. Besag. Statistical analysis of non-lattice data. The Statistician, 24:179-195, 1975.
[3] D. R. Cox. Partial likelihood. Biometrika, 62(2):269-276, 1975.
[4] I. J. Good. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press, 1965.
[5] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
[6] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771-1800, 2002.
[7] A. Hyvärinen. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695-709, 2005.
[8] T. Jebara and A. Pentland. Maximum conditional likelihood via bound maximization and the CEM algorithm. In NIPS, 1998.
[9] Ross Kindermann and J. Laurie Snell. Markov Random Fields and Their Applications. American Mathematical Society, 1980.
[10] B. G. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80(1):22-39, 1988.
[11] D. Vickrey, C. Lin, and D. Koller. Non-local contrastive objectives. In ICML, 2010.

Active prediction in graphical models
Satyaki Mahalanabis and Daniel Štefankovič
Department of Computer Science, University of Rochester, {smahalan,stefanko}@cs.rochester.edu

1 Introduction
Given an undirected graph with an unknown labeling of its nodes (using a fixed set of labels) and the ability to observe the labels of a subset of the nodes, consider the problem of predicting the labels of the unobserved nodes. This is a widely studied problem in machine learning, where the focus until now has largely been on designing the prediction algorithm. Recently, however, the problem of selecting the set of vertices to observe (i.e., active prediction) has been receiving attention [1], and this is what our work tries to address. We are interested in designing adaptive strategies for the selection problem, i.e., at any point in time, the next node we choose to observe can depend on the labels of nodes that have already been selected and (hence) revealed. While previous research [2, 1] has mostly focused on the worst-case prediction error of the selection strategy (i.e., assuming an adversarial labeling of all nodes), we analyze the expected error under the assumption that the labels are random variables with a known joint distribution. In other words, we consider graphical models (such as the Ising model) with known parameters. We assume that the selection strategy is provided a budget on the number of nodes it can observe, and we study the expected number of mispredictions as a function of this budget. We are particularly interested in the question of whether there exist very simple yet optimal node selection strategies. The example we study is the (ferromagnetic) Ising model (one of the simplest graphical models) on a chain graph. To further simplify the problem, we consider a continuous version of the Ising model, that is, the limit as the number of vertices on the path goes to infinity. We show that some simple adaptive selection strategies have optimal error (up to a constant factor) and have an asymptotically smaller number of mispredictions than non-adaptive strategies. We also prove the hardness of selecting the optimal set of nodes (either adaptively or non-adaptively) for graphical models on general graphs.

1.1 Related work
In [2], the authors consider the problem of non-adaptive selection for minimizing the worst-case prediction error in the case of binary labels. They give an upper bound on the error of the min-cut predictor, for a given set of observed nodes S, in terms of the cut size (that is, the number of edges whose endpoints have different labels) of the unknown labeling and a graph cut function Ψ(S) (which depends on the underlying graph). The problem of choosing the observed nodes then reduces to that of minimizing Ψ(S) over sets S of a given (budget) size, for which they suggest a heuristic strategy (the goal of that paper is not to provide provable guarantees for the heuristic). The authors of [1] give a non-adaptive selection strategy in a similar adversarial setting, for the special case of trees, such that the min-cut predictor's number of mispredictions is at most a constant factor worse than any (non-adaptive) prediction strategy (analyzed in the worst case over labelings that have bounded cut size). We point out that the algorithm of [1] does not need to know the cut-size constraint, while in our case the expected cut size is known and depends on the model parameters. Further, [1] provides a lower bound on the prediction error for general graphs as well. Unlike [2, 1], [3] considers an adaptive version of the selection problem for (certain classes of) general graphs where the cut size induced by the labeling is known. Their goal, unlike ours, is to recover the correct label of every node using as few observations as possible. They claim to give a strategy whose required number of labels matches the information-theoretic lower bound (ignoring constant factors). Finally, note that our goal here, namely the active selection of nodes to observe, is different from that of, e.g., [4], where the nodes to be labeled are chosen by an adversary, and from that of semi-supervised settings, e.g., [5, 6], where the query nodes are chosen randomly. Also, we aim to minimize the expected number of mispredictions, as opposed to inferring the most likely labeling (i.e., the "minimum energy configuration") of the unobserved nodes, which is what most applications in computer vision tend to focus on, e.g., [7] (see also [8]).

2 Summary of Results
We state our main results briefly (see the full version [9] for details).

Proposition 1. Computing the non-adaptive (or adaptive) strategy with budget B that minimizes the expected average error is NP-hard.

Next, we study the continuous limit of a simple chain (i.e., 1-D) Ising model, which we define as follows. Consider a 1-D Ising model on n vertices with inverse temperature β > 0. It is much easier to study the behavior of this model in the limit as n → ∞ with β scaled appropriately so that L = −((n − 1)/2) ln λ stays fixed, where λ = tanh(β). We now state results for predicting labels for such a model with a label budget of B. The simplest non-adaptive strategy, querying uniformly spaced points, is, not surprisingly, optimal (among non-adaptive strategies).

Proposition 2. The non-adaptive strategy that queries uniformly spaced points is an optimal non-adaptive strategy and has error (1/2)(1 − (1 − exp(−L/(B+1)))/(L/(B+1))) = L/(4(B+1)) + O(1/B^2).

Next we give a lower bound on the error of any adaptive strategy.

Proposition 3. The expected average error of any adaptive strategy is at least (1/2)(1 − tanh(L/(B+1))/(L/(B+1))). For B ≥ L − 1 this lower bound can be simplified to (L/(B+1))^2/9. (For B ≤ L − 1 the lower bound is a constant.)

Now we show that combining the non-adaptive strategies with binary search yields an adaptive strategy with asymptotically better performance.

Proposition 4. There is an adaptive strategy with error bounded by 21(L/B)^2, assuming L ≥ 1.
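For intuition, here is a small Monte Carlo sketch of the non-adaptive setting in Proposition 2 (our illustration, not the authors' code): it samples a discrete ferromagnetic Ising chain, observes B roughly uniformly spaced nodes, and predicts every hidden node with its nearest observed label, a simple stand-in for the exact optimal predictor.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_chain(n, lam):
    """Ferromagnetic 1-D Ising chain: adjacent spins agree w.p. (1+lambda)/2,
    where lambda = tanh(beta)."""
    s = np.empty(n, dtype=int)
    s[0] = rng.choice([-1, 1])
    flips = rng.random(n - 1) > (1 + lam) / 2
    for i in range(1, n):
        s[i] = -s[i - 1] if flips[i - 1] else s[i - 1]
    return s

def avg_error_uniform(n, lam, B, trials=2000):
    """Expected average error of the non-adaptive strategy that observes B
    uniformly spaced nodes and copies the nearest observed label."""
    queries = np.linspace(0, n - 1, B).round().astype(int)
    nearest = queries[np.abs(np.arange(n)[:, None] - queries[None, :]).argmin(1)]
    errs = 0.0
    for _ in range(trials):
        s = sample_chain(n, lam)
        errs += np.mean(s != s[nearest])
    return errs / trials

n, beta, B = 400, 2.0, 8
print(avg_error_uniform(n, np.tanh(beta), B))
```

Increasing B shrinks the estimate roughly like L/(4(B+1)), in line with the proposition.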
References
[1] N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella, "Active learning on trees and graphs," in COLT, pp. 320-332, 2010.
[2] A. Guillory and J. Bilmes, "Label selection on graphs," in Conference on Neural Information Processing Systems, 2009.
[3] P. Afshani, E. Chiniforooshan, R. Dorrigiv, A. Farzan, M. Mirzazadeh, N. Simjour, and H. Zarrabi-Zadeh, "On the complexity of finding an unknown cut via vertex queries," in COCOON, pp. 459-469, 2007.
[4] N. Cesa-Bianchi, C. Gentile, and F. Vitale, "Fast and optimal prediction on a labeled tree," in COLT, 2009.
[5] A. Blum and S. Chawla, "Learning from labeled and unlabeled data using graph mincuts," in ICML, pp. 19-26, 2001.
[6] X. Zhu, Z. Ghahramani, and J. D. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in ICML, pp. 912-919, 2003.
[7] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222-1239, 2001.
[8] J. M. Kleinberg and É. Tardos, "Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields," in FOCS, pp. 14-23, 1999.
[9] S. Mahalanabis and D. Štefankovič, "Active prediction in graphical models," http://www.cs.rochester.edu/~smahalan/ising.pdf.

An Ensemble of Linearly Combined Reinforcement-Learning Agents
Vukosi Marivate, vukosi@cs.rutgers.edu
Michael Littman, mlittman@cs.rutgers.edu
September 9, 2011

Abstract
Reinforcement-learning (RL) algorithms are often tweaked and tuned to specific environments when applied, calling into question whether learning can truly be considered autonomous in these cases. In this work, we show how more robust learning across environments is possible by adopting an ensemble approach to reinforcement learning. We find that a learned linear combination of Q-values from multiple independent learning algorithms results in an agent that compares favorably to the best tuned algorithm across a range of problems. We compare our proposed approach, which tries to minimize the error between observed returns and the predicted returns from the underlying ensemble learners, to an existing approach called policy fusion and show that our approach is considerably more effective. Our work provides a promising basis for further study into the use of ensemble RL methods.

1 Introduction
The task of creating a single reinforcement-learning (RL) agent that can learn in many possible environments without modification is not a simple one. It is typical for algorithm designers to modify state representations, learning protocols, or parameter values to obtain good performance on novel environments. However, the more problem-specific tuning is needed, the less "autonomous" an RL system is, eroding some of the value of RL systems in practice. Across a wide range of computational domains, ensemble learning methods have proven extremely valuable for tackling complex problems reliably. Ensemble (or sometimes modular or portfolio) methods [Rokach, 2010] harness multiple, perhaps quite disparate, algorithms for a problem class to greatly expand the range of specific instances that can be addressed. They have proven themselves to be state-of-the-art approaches for crossword solving, satisfiability testing, movie recommendation and question answering. We believe the success of ensemble methods on these problems is because they can deal with a range of instances that require different low-level approaches. RL instances share this attribute, suggesting that an ensemble approach could prove valuable there as well.

2 Linearly Combined Ensemble Reinforcement Learning
In this work, we present an approach to ensemble-based RL using a linear Temporal Difference (TD) learning algorithm as a meta learner to combine the value estimates from multiple base RL algorithm agents. Our approach goes beyond earlier efforts in ensemble RL [Wiering and van Hasselt, 2008], which fused policies or actions, in that we develop a fusion method that is learned and adjusted given the base agents in the ensemble instead of combining low-level agents inflexibly. We propose a flexible method to combine action-value functions (Q-values) from different learners, Q(s, a) = Σ_{i=1}^{n} w_i Q_i(s, a), where w_i is the weight for base learner i and Q_i(s, a) is the Q-value of state s and action a as estimated by the i-th learner. By using Q-values, we are able to solve the credit assignment problem of deciding how to weigh the individual contribution of each algorithm. The high-level (meta) learner uses the linearly combined and weighted individual Q-values from each of the n low-level (base) learners to choose actions. Given that both the base learners and the meta learner need to adapt, learning can be run in either one or two stages. In the two-stage approach, the base learners are trained on the environment in question by themselves, they are frozen, and then the meta learner learns the weights to combine the Q-values of the base learners. With this description of the combination of base learners, we can view each base learner's Q-value as a feature for the high-level learner that depends on a state-action pair (s, a). In the single-stage learning approach, both the base and meta learners are trained at the same time. The two-stage meta learner is a least-squares algorithm that minimizes the Bellman residual error and adjusts weights using the TD update rule, with the individual agents' Q-values, Q_i(s, a), as features. The two-stage learner thus converges to the weights that result in the smallest error between the estimates and the real returns. In the single-stage learning approach, the simultaneous adaptation of base and meta learner makes convergence less clear; further analysis of this case is still needed.

3 Experiments
To better understand the strengths and weaknesses of our proposed algorithm, we carried out a number of experiments on learning in multiple environments that were part of the 2009 RL Competition Polyathlon challenge. The Polyathlon required each participant to build a general RL agent that performed well across multiple environments without prior knowledge of those environments. For the base learners, we used 3 different types of TD(0) learning agents: Table-Based Q-Learning, Table-Based SARSA and Linear SARSA (L-SARSA). Within these broad categories, variations of their parameters could have a dramatic effect on their success in a given environment. Promising results showing the performance of the linearly combined ensemble RL are shown in Figure 3.1 (Cat & Mouse results). Here, the two-stage meta learner discovers the best single-algorithm learner, while the single-stage learner has a slightly higher mean than all the other configurations.

Figure 3.1: Cat & Mouse Results

4 Future Work
The two-stage and single-stage learning dynamics need further investigation, as they both can lead to high performance gains in different environments. The impact of individual learner effectiveness on the weighting and performance of the two-stage meta learner could lead to a better understanding of the approach. Nevertheless, these promising initial results indicate that reinforcement-learning algorithms, like regression and classification algorithms, can also benefit substantially from the ensemble approach.
References
L. Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1):1-39, 2010.
M. A. Wiering and H. van Hasselt. Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 38(4):930-936, 2008.

Autonomous RF Surveying Robot for Indoor Localization and Tracking
Ravishankar Palaniappan, Piotr Mirowski, Tin Kam Ho, Harald Steck, Philip Whiting and Michael MacDonald
Alcatel-Lucent Bell Labs, Murray Hill, NJ; ravishankar.palaniappan@alcatel-lucent.com

I. INTRODUCTION
Technologies for tracking people within buildings are important enablers for many public safety applications and commercial services. The need has not been met by conventional navigation aids such as GPS and inertial systems. There have been many approaches to this problem, such as using pseudolites, highly sensitive Inertial Measurement Units (IMUs), RF-based systems using techniques like Received Signal Strength Indicator (RSSI), Time Difference of Arrival (TDOA) or Time of Flight (TOF) on different spectral bands and standards [1], [2], [3], [4], and hybrid methods using sensor fusion. Some solutions require special devices such as ultrasonic or RFID tags and readers, or are constrained by requiring line of sight, as in the use of video cameras. The most desirable are systems that cause minimal intervention to ongoing activities, use existing devices and infrastructure, and are cheap to deploy and maintain. Over the years our team of researchers has worked on several components of a WLAN-based tracking technology that are ready to be refined and integrated into an operational solution. These include (1) a simulator for predicting radio propagation using path-loss models and ray-tracing computation; (2) a set of statistical algorithms for interpolating RSSI maps from manually collected, irregularly spaced raw data; and (3) algorithms for using the refined signal maps to determine positions in real time. But a practical solution still requires the elimination of the painstaking process of creating a signal map manually by walking or driving through the space and collecting signal strength measurements along with precise position information. Furthermore, it is necessary to be able to repeat the signal map construction periodically to adapt to potential changes in the radio environment. Motivated by these needs, we have recently developed a robotic test-bed that can carry multiple sensors and navigate within buildings autonomously while gathering signal strength data at various frequencies of interest. A robotic platform that can perform such tasks with minimal human intervention is a highly promising solution to bridge the gap between laboratory trials and practical deployments at large scale.

Fig. 1. Indoor mapping robot.

Our robotic platform is capable of conducting repeated surveys of an indoor environment to update the signal maps as frequently as required. The robot also serves as a test-bed to collect data in controlled conditions, such as under uniform motion and with accurate positioning. As it moves autonomously in the indoor environment, it uses a new method of multi-sensor integration for Simultaneous Localization and Mapping (SLAM) to construct the spatial model and infer its own location. In this way, the construction of the spatial map of RF signal strength can be fully automated. The robot can carry different receivers and systematically compare the effectiveness of different localization methods. Currently we experiment with WLAN fingerprinting techniques. We expect that the robot can support similar experiments on other RF signals of interest, including GSM, CDMA and LTE.

II. INDOOR MAPPING ROBOT
We now describe the hardware of the Indoor Mapping Robot (IMR) test-bed vehicle, which is used to survey and construct radio signal maps within buildings quickly and as frequently as needed. Since the robotic platform (see Fig. 1) is primarily intended for indoor application, we chose a vehicle that is easily maneuverable through narrow corridors and doorways: a Jazzy Jet 3 wheelchair with two motors, two 12 V rechargeable batteries and a maximum operating time of 4 hours. The chair and chair mount were replaced by a custom-built aluminum base that hosts all our sensors, electronics and hardware. The heart of the electronic system is a low-power 1.66 GHz Mini-ITX computer motherboard that handles all the operations of the robot, including navigation and signal strength data collection. The robot is controlled through a microcontroller that sends serial PWM signals to the motor controllers for navigation. Data from the different on-board sensors, such as the sonar, inertial sensors, Microsoft Kinect and video cameras, are streamed through an on-board 802.11g link to a control station for post-processing and analysis. The robot autonomously navigates the indoor environment using collision detection algorithms. The main sensor used for this feature is the Microsoft Kinect RGB-D sensor developed by PrimeSense Inc., which can simultaneously acquire depth and color images. An open-source software driver (available at http://openkinect.org) makes it possible to grab both depth and RGB images at 30 Hz and 640x480 pixel resolution, as 11-bit depth and 8-bit color definitions, respectively [5]. Autonomous robot navigation is currently implemented as a collision avoidance strategy using the Kinect and other sensors, including sonar and contact sensors. The collision detector primarily relies on the output of the Kinect RGB-D sensor. After sub-sampling the depth image to an 80x30 pixel size and converting depth information to metric distances, a simple threshold-based algorithm makes decisions to turn if there are obstacles within 1 m on the right or left of the field of depth vision. Another threshold is used for stairs detection, by scanning the bottom five lines of the depth image. The robot moves forward if no collisions or falls are perceived. If the Kinect does not detect small obstacles, such as pillars or trash cans that are not in the field of view of the Kinect camera, the sonar sensors serve as additional backup sensors for obstacle avoidance.
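A toy rendering of the threshold rules in the paragraph above. The 80x30 sub-sampling, the 1 m turn threshold, and the bottom-five-row stairs scan are from the text; the drop-off threshold value, NaN handling, and the exact decision order are our guesses.

```python
import numpy as np

def collision_decision(depth_m, turn_threshold_m=1.0, floor_drop_m=1.5):
    """Decide a motion command from one metric depth image (meters)."""
    h, w = depth_m.shape
    small = depth_m[::h // 30, ::w // 80][:30, :80]   # crude 80x30 sub-sampling
    if np.nanmin(small[-5:, :]) > floor_drop_m:       # bottom 5 rows: no floor
        return "stop"                                 # possible stairs / drop-off
    left, right = small[:, :40], small[:, 40:]
    if np.nanmin(left) < turn_threshold_m:            # obstacle within 1 m, left
        return "turn_right"
    if np.nanmin(right) < turn_threshold_m:           # obstacle within 1 m, right
        return "turn_left"
    return "forward"
```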
III. SIMULTANEOUS LOCALIZATION AND MAPPING
A critical need in constructing a spatial map of RF signals is to accurately record the location of the receiver at the time when the signal strength is measured. In addition, the location needs to be registered to the building layout for use in applications and services, even if no detailed blueprints are available for the building of interest. On the robotic platform, these tasks can be accomplished using a SLAM (Simultaneous Localization And Mapping) algorithm. Conventional 3D database modeling of indoor environments by mobile robots has involved the use of laser scanners and cameras: the laser scanner is used to build a 3D point cloud map, and the textures from the camera images are stitched to these point clouds to form the complete database. A recently published method succeeded in building precise 3D maps directly from the Kinect RGB-D images [7], taking advantage of the dense depth information, without relying on other sensors. It estimates 3D point cloud transformations only between two consecutive RGB-D frames and resolves loop closures (going twice through the same location) using maximum likelihood smoothing, but it is too computationally intensive for real-time application. We used a simplified implementation of [7] (available at http://nicolas.burrus.name) to perform post-processing reconstruction of the robot trajectory. Fig. 2 shows a snapshot of the 3D map being constructed from RGB-D images.

Fig. 2. Example of 3D office reconstruction from RGB-D-based odometry.

IV. ONGOING WORK
We are currently integrating additional sensing modalities on the robotic platform, including sonar, to improve the navigation and localization capabilities of the robot. Since most of the robot hardware was built from commercial off-the-shelf components, we plan to replicate this design for multiple robotic platforms, which can be used to quickly cover large buildings in a coordinated manner.

REFERENCES
[1] K. Ozsoy, A. Bozkurt and I. Tekin, "2D indoor positioning system using GPS signals", Indoor Positioning and Indoor Navigation (IPIN), 2010 International Conference on, 15-17 Sept. 2010.
[2] M. Ciurana, D. Giustiniano, A. Neira, F. Barcelo-Arroyo and I. Martin-Escalona, "Performance stability of software ToA-based ranging in WLAN", Indoor Positioning and Indoor Navigation (IPIN), 2010 International Conference on, 15-17 Sept. 2010.
[3] H. Kroll and C. Steiner, "Indoor ultra-wideband location fingerprinting", Indoor Positioning and Indoor Navigation (IPIN), 2010 International Conference on, 15-17 Sept. 2010.
[4] M. Hedley, D. Humphrey and P. Ho, "System and algorithms for accurate indoor tracking using low-cost hardware", Position, Location and Navigation Symposium, 2008 IEEE/ION, pp. 633-640, 5-8 May 2008.
[5] R. M. Geiss, Visual Target Tracking, US Patent Application 20110058709, 2011.
[6] T. Roos, P. Myllymaki, H. Tirri, P. Misikangas and J. Sievanen, "A Probabilistic Approach to WLAN User Location Estimation", International Journal of Wireless Information Networks, vol. 7, no. 3, pp. 155-163, 2002.
[7] P. Henry, M. Krainin, E. Herbst, X. Ren and D. Fox, "RGB-D Mapping: Using Depth Cameras for Dense 3D Modeling of Indoor Environments", Experimental Robotics, 2010 12th International Symposium on, 18-21 Dec. 2010.

Title: A Comparison of Text Analysis and Social Network Analysis using Twitter Data
Authors: John Myles White (Princeton), Drew Conway (NYU)

Abstract
Within the social sciences, interest in statistical methods for the automatic analysis of text has grown considerably in recent years. Specifically, within the field of political science there has been continued interest in using text analysis to extend well-known spatial models of political ideology. Simultaneously, interest in how the structure of social networks affects political attitudes and outcomes has been both growing and controversial. In an attempt to bridge the gap between these two research areas, we have collected a sample of Twitter data related to the U.S. Congress. Twitter provides a rich data set containing both text and social network information about its members. Here we compare the usefulness of text analysis and social network analysis for predicting the political ideology of Twitter users: two methods that are, in principle, applicable both to members of Congress (for whom roll call data and precise spatial estimates of political ideology already exist) and to the surrounding network of Twitter users (for whom precise estimates of political ideology do not exist). To compare text analysis methods with tools from social network analysis, we fit a variety of L1- and L2-regularized regression models that use word count data from individual tweets to predict the ideal points of Members of Congress. We then compare the performance of the resulting text models with the performance of social network models that employ techniques developed for predicting the spread of transmissible diseases to predict the ideal points for the same Members of Congress.

Contact: John Myles White, Graduate Student, Department of Psychology and Princeton Neuroscience Institute, Princeton University, jmw1729@princeton.edu

Time Dependent Dirichlet Process Mixture Models for Multiple Target Tracking
Willie Neiswanger, Frank Wood
Columbia University, New York, NY 10027, USA
wdn2101@columbia.edu, fwood@stat.columbia.edu

Abstract
Multiple Target Tracking (MTT) is a machine vision problem that involves identifying multiple concurrently moving targets in a video and maintaining identification of these targets over time while tracking the frame-by-frame locations of each. Difficulties in producing a single, general-purpose algorithm capable of successfully carrying out MTT over a wide range of videos are mainly due to the potential for videos to be highly dissimilar: the targets' physical characteristics (including size, shape, color, and behavioral patterns), the backgrounds over which the targets move, and the filming conditions may all differ. The technique developed in this project makes use of a known time-dependent Dirichlet process mixture model, comprising a sequence of interdependent infinite Gaussian mixture models constructed to accommodate behavior that changes over time. This work describes how pixel location values can be extracted from a wide variety of videos containing multiple moving targets and clustered using the aforementioned model; this process is intended to reduce the need for explicit target identification and serve as a general MTT method applicable to targets with arbitrary characteristics moving over diverse backgrounds. The technique is demonstrated on video segments showing multiple ant targets, of distinct species and with diverse physical and behavioral characteristics, moving over non-uniform backgrounds.

7 Conclusion
This paper has exhibited successful MTT over short video sequences with the use of a time-dependent GPUDPM model. The technique was illustrated on two multiple target tracking scenarios involving disparate targets and backgrounds. A general data extraction algorithm was outlined as a simple way to collect data from a wide variety of videos containing multiple moving targets. We hope that this research might help in the development of more general MTT algorithms, with the ability to track a large variety of targets with arbitrary characteristics moving over diverse backgrounds.

References
[1] F. Caron, M. Davy, and A. Doucet. Generalized Polya urn for time-varying Dirichlet process mixtures. In 23rd Conference on Uncertainty in Artificial Intelligence (UAI 2007), Vancouver, Canada, July 2007.
[2] J. Gasthaus, F. Wood, D. Görür, and Y. W. Teh. Dependent Dirichlet process spike sorting. In Advances in Neural Information Processing Systems 22, 2008.
[3] Jan Gasthaus. Spike sorting using time-varying Dirichlet process mixture models, 2008.
[4] J. E. Griffin and M. F. J. Steel. Order-based dependent Dirichlet processes. Journal of the American Statistical Association, 101(473):179-194, 2006.
[5] W. Ng, J. Li, S. Godsill, and J. Vermaak. A review of recent results in multiple target tracking. In Image and Signal Processing and Analysis (ISPA 2005), Proceedings of the 4th International Symposium on, pages 40-45, Sept. 2005. doi: 10.1109/ISPA.2005.195381.
[6] Simo Sarkka, Toni Tamminen, Aki Vehtari, and Jouko Lampinen. Probabilistic methods in multiple target tracking, review and bibliography, 2004.
Manjot Pahwa, Undergraduate Student Researcher
Netaji Subhas Institute of Technology, University of Delhi
Recipient of Google Women in Engineering Award, 2011
manjotpahwa@gmail.com, +91-(9953)310-807

An Efficient and Optimal Scrabble Playing Algorithm based on Modified DAWG and Temporal Difference Lambda Learning Algorithm

Abstract: This poster focuses on the development and analysis of an algorithm to play the popular game Scrabble. The poster first describes and analyzes the existing algorithms, then describes the improvements that are possible. The resulting algorithm is not only one of the most efficient algorithms for Scrabble but also chooses the highest-scoring move; in other words, it plays as optimally as possible. The first part is achieved with the help of a modified version of a popular data structure known as the DAWG (directed acyclic word graph). The data structure is modified to enable movement from a node at the top, the bottom, or anywhere in the middle, in any direction. This makes it possible to efficiently generate prefixes, suffixes, or hook-up words between existing tiles on the board, in either direction. Although such a structure inevitably occupies much more space than a DAWG, the processing time is reduced by an appreciable factor. Processing is sped up further by a backtracking algorithm that searches for the highest-scoring word using the tiles already on the board. Although a primitive form of this algorithm was described by Jacobson and Appel in their paper "The World's Fastest Scrabble Program", this poster presents an algorithm that augments their work by finding a solution not only efficiently but also with the highest score. This is made possible by an algorithm that searches for words in descending order of the highest possible score, which involves considering not only the length of the word but also the individual tile scores allotted to each letter. Furthermore, a greater score is expected with the help of an approximate dynamic programming solution based on Temporal Difference (TD) learning. This learning algorithm is based on Richard S. Sutton's formulation known as the TD-Lambda learning algorithm. Prediction of future scores can be done on the basis of the tiles remaining in the bag and possible patterns; moreover, as the game progresses, the predictions can be fine-tuned to become more accurate. Such a methodology can vastly improve the game play of a computer in comparison to a human. A discussion of the time and space efficiency of this solution will also be presented. Lastly, this poster will describe some further changes that can improve the score achieved by the computer.
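For reference, Sutton's TD(lambda) update that the abstract builds on, in its tabular accumulating-traces form. This is a generic sketch; the Scrabble-specific states, rewards, and features are not shown.

```python
import numpy as np

def td_lambda_update(values, trajectory, alpha=0.1, gamma=1.0, lam=0.8):
    """One-episode tabular TD(lambda) with accumulating eligibility traces.
    trajectory: list of (state, reward, next_state) transitions."""
    eligibility = np.zeros_like(values)
    for state, reward, next_state in trajectory:
        td_error = reward + gamma * values[next_state] - values[state]
        eligibility *= gamma * lam       # decay all traces
        eligibility[state] += 1.0        # bump the trace of the current state
        values += alpha * td_error * eligibility
    return values

# Tiny usage example: 3 states, one observed transition sequence.
v = td_lambda_update(np.zeros(3), [(0, 0.0, 1), (1, 1.0, 2)])
print(v)
```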
Hierarchically Supervised Latent Dirichlet Allocation
Adler Perotte, Nicholas Bartlett, Noémie Elhadad, Frank Wood
Columbia University, New York, NY 10027, USA
{ajp9009@dbmi, bartlett@stat, noemie@dbmi, fwood@stat}.columbia.edu

We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bag-of-word data are also of interest. Our work operates within the framework of topic modeling. Our approach learns topic models of the underlying data and labeling strategies in a joint model, while leveraging the hierarchical structure of the labels. For the sake of simplicity, we focus on is-a hierarchies, but the model can be applied to other structured label spaces. Our work extends supervised latent Dirichlet allocation (sLDA) [2] to take advantage of hierarchical supervision and proposes an efficient way to incorporate such information into the model. We hypothesize that the context of labels within the hierarchy provides valuable information about labeling. Other models, such as Labeled LDA [5], incorporate LDA and supervision; however, none of these models leverage dependency structure in the label space. We demonstrate our model on large, real-world datasets in the clinical and web retail domains. We observe that hierarchical information is valuable when incorporated into the learning and improves our primary goal of multi-label classification. Our results show that a joint, hierarchical model outperforms both classification with unstructured labels and a disjoint model, where the topic model and the hierarchical classification are inferred independently of each other.

HSLDA is a model for hierarchically, multiply labeled, bag-of-word data. We will refer to individual groups of bag-of-word data as documents. Let w_{n,d} ∈ Σ be the n-th observation in the d-th document, and let w_d = {w_{1,d}, . . . , w_{N_d,d}} be the set of N_d observations in document d. Let there be D such documents, and let the size of the vocabulary be V = |Σ|. Let the set of labels be L = {l_1, l_2, . . . , l_{|L|}}. Each label l ∈ L, except the root, has a parent pa(l) ∈ L also in the set of labels. For exposition purposes we will assume that this label set has hard "is-a" parent-child constraints, although this assumption can be relaxed at the cost of more computationally complex inference. Such a label hierarchy forms a multiply rooted tree; without loss of generality we will consider a tree with a single root r ∈ L. Each document has a variable y_{l,d} ∈ {−1, 1} for every label, which indicates whether the label is applied to document d or not. In most cases y_{l,d} will be unobserved; in some cases we will be able to fix its value because of constraints on the label hierarchy, and in the relatively minor remainder its value will be observed. In the applications we consider, only positive label applications are observed. In HSLDA, documents are modeled using the LDA mixed-membership mixture model with global topic estimation. Label responses are generated using a conditional hierarchy of probit regressors [3]. The HSLDA graphical model is given in Figure 1.
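A schematic sketch of how labels could be drawn from a conditional hierarchy of probit regressors with hard is-a constraints, as described above. The top-down traversal, the variable names, and the use of empirical topic proportions z_bar as the regression input are our illustrative reading of the model, not the authors' code.

```python
import numpy as np
from scipy.stats import norm

def sample_labels(z_bar, eta, children, root, rng=np.random.default_rng(0)):
    """Draw labels top-down: a label can fire only if its parent fired,
    with probit probability Phi(eta[l] . z_bar); descendants of a
    non-fired label are implicitly off (the hard is-a constraint)."""
    y = {root: 1}                          # the root label is always applied
    stack = list(children.get(root, []))
    while stack:
        l = stack.pop()
        if rng.random() < norm.cdf(eta[l] @ z_bar):
            y[l] = 1
            stack.extend(children.get(l, []))  # descend only below applied labels
        else:
            y[l] = -1                      # unvisited descendants stay off
    return y
```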
Figure 1: HSLDA graphical model

In the model, K is the number of LDA "topics" (distributions over the elements of Σ), φ_k is a distribution over "words," θ_d is a document-specific distribution over topics, and β is a global distribution over topics. Posterior inference in HSLDA was performed using Gibbs sampling and Markov chain Monte Carlo. Note that, as in collapsed Gibbs samplers for LDA [4], we have analytically marginalized out the parameters φ_{1:K} and θ_{1:D}. HSLDA also employs a hierarchical Dirichlet prior over topic assignments (i.e., β is estimated from data rather than assumed to be symmetric); this has been shown to improve the quality and stability of inferred topics [6]. The hyperparameters α, α0, and γ are sampled using Metropolis-Hastings.

We applied HSLDA to data from two domains: predicting medical diagnosis codes from hospital discharge summaries and predicting product categories from Amazon.com product descriptions. The clinical dataset consists of 6,000 clinical notes along with the associated billing codes that are used to document the conditions a particular patient was treated for. These billing codes (7,298 distinct codes in our dataset) are organized in an is-a hierarchy. The retail dataset consists of product descriptions for DVDs from the Amazon.com product catalog; this data was partially obtained from the Stanford Network Analysis Platform (SNAP) dataset [1]. The comparison models included sLDA with independent regressors (hierarchical constraints on labels ignored) and HSLDA fit by first performing LDA and then fitting tree-conditional regressions. The number of topics for all models was set to 50; the prior distributions p(α), p(α0), and p(γ) were gamma distributed with a shape parameter of 1 and a scale parameter of 1000.

Figure 2: ROC curves for out-of-sample ICD-9 code prediction from patient free-text discharge records ((a), (c)), and ROC curve for out-of-sample Amazon product category predictions from product free-text descriptions (b). Figures (a) and (b) are a function of the prior means of the regression parameters; Figure (c) is a function of the auxiliary variable threshold. In all figures, solid is HSLDA, dashed is independent regressors + sLDA (hierarchical constraints on labels ignored), and dotted is HSLDA fit by running LDA first and then running tree-conditional regressions.

The results in Figures 2(a) and 2(b) suggest that in most cases it is better to do full joint estimation of HSLDA. An alternative interpretation of the same results is that, if one is more sensitive to the performance gains that result from exploiting the structure of the labels, then one can, in an engineering sense, get nearly as much gain in label prediction performance by first fitting LDA and then fitting a hierarchical probit regression. There are applied settings in which this could be advantageous.

References
[1] Stanford Network Analysis Platform. http://snap.stanford.edu/, 2004.
[2] D. Blei and J. McAuliffe. Supervised topic models. Advances in Neural Information Processing, 20:121-128, 2008.
[3] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2nd edition, 2004.
[4] T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(suppl. 1):5228-5235, 2004.
[5] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 248-256, 2009.
[6] Hanna Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why priors matter. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1973-1981. 2009.

Image Super-Resolution via Dictionary Learning
Gungor Polatkan, David Blei (Princeton University)
Mingyuan Zhou, Ingrid Daubechies, Lawrence Carin (Duke University)

A popular research problem in image analysis is super-resolution (SR). The main task can be summarized as recovering a high-resolution (HR) image from a low-resolution (LR) input. The need to enhance the resolution of smartphone and surveillance cameras, and the need for HR images in medical and satellite visualization, motivate superior SR algorithms. The SR problem is generally ill-posed, since the LR-to-HR transformation is not unique, and several regularization methods have been proposed to overcome this issue. In this abstract, we consider SR algorithms in two groups: interpolation-based and example-based approaches. Interpolation-based methods, such as bicubic and bilinear interpolation, generally overly smooth the image, and details are lost. Example-based approaches use machine learning to overcome this problem: these algorithms use ground-truth HR and LR image pairs during training and, in the testing phase, use the learned statistical relation to reconstruct an HR image from a held-out low-resolution input.

Multiscale Extension for Landmark-Dependent Dictionary Learning. Recently, a landmark-dependent hierarchical beta process was developed for dictionary learning problems. The incorporation of covariates into dictionary learning provides a tool whereby feature usage is likely to be similar for samples close to each other in covariate space. By extending this framework into multi-stage dictionary learning, we present a new algorithm for the image super-resolution problem. The multiscale extension for super-resolution is based on a fundamental assumption: given the HR and LR patches, the goal is to learn dictionaries for both resolutions such that the sparse coefficients are the same for both resolutions. The original dictionary learning model represents the observations as x_i = D(s_i ⊙ z_i) + ε_i, where x_i is the observation, D the dictionary, z_i the binary factor assignment, s_i the sparse factor score, and ε_i the error; the prior of z_i incorporates the covariates. We write the multistage version of the model in a similar form, with

x_i^(c) = [x_i^(l); x_i^(h)],   d_k^(c) = [d_k^(l); d_k^(h)],   ε_i^(c) = [ε_i^(l); ε_i^(h)],

where the superscript (c) denotes the concatenation of both resolutions. In this setup, training of the dictionaries is the same as in the original model; the only difference is that we couple the patches x_i^(l), x_i^(h) and couple the dictionaries d_k^(l), d_k^(h). The covariates can be the patch positions within the image or the relative position in the observation space. In the latter case, if we define ℓ_i = x_i/||x_i||_2, then ||ℓ_i − ℓ_j||_2 = √(2 − 2 cosdist(x_i, x_j)), since cosdist(x_i, x_j) = x_i^T x_j/(||x_i||_2 ||x_j||_2). Since the training set includes patches from multiple image pairs, the locations of the patches cannot be used as covariates (they come from different images). Instead, we use the normalized observation. Moreover, in order to speed up inference, we first reduce the dimensionality to 3 with PCA and use the dimension-reduced observations as covariates.
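The coupling of LR and HR patches through shared sparse codes can be sketched with any sparse coder. Below we use scikit-learn's l1 dictionary learner as a stand-in for the paper's beta-process model; the data, sizes, and penalties are placeholders.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
N, p = 500, 64                      # 500 patch pairs, 8x8 patches flattened
X_lr = rng.normal(size=(N, p))      # stand-ins for bicubic-upsampled LR patches
X_hr = X_lr + 0.1 * rng.normal(size=(N, p))   # stand-ins for HR patches

# Train one dictionary on concatenated patches so LR and HR share codes:
# each row is x^(c) = [x^(l); x^(h)].
X_c = np.hstack([X_lr, X_hr])
dl = DictionaryLearning(n_components=32, alpha=1.0, max_iter=20).fit(X_c)
D_lr, D_hr = dl.components_[:, :p], dl.components_[:, p:]

# Test: infer codes from the LR half only, decode with the HR half.
codes = sparse_encode(X_lr[:5], D_lr, alpha=1.0)
hr_estimate = codes @ D_hr
```

The key point mirrors the text: at test time the codes come from the LR patch and LR dictionary alone, and the HR dictionary turns those same codes into the HR reconstruction.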
One important problem is that the sizes of the LR and HR patches differ, which would bias the fitting of the shared factor scores in favor of the HR patches. To prevent this, we first employ bicubic interpolation to magnify the LR image so that it has the same size as the HR image. In this way we also obtain a perfect match between LR and HR patches.

Super-Resolution of an LR Test Image. During the testing phase, our goal is first to find the sparse factor scores using the LR image patches x_i^(l) and the LR dictionary d_k^(l) learned during training. Then we reconstruct the corresponding HR patches using the HR dictionary d_k^(h) and the learned factor scores (s_i ⊙ z_i). While learning the sparse factor scores, one important difference from the regular inference of landmark-dependent dictionary learning is that we do not sample the dictionaries d_k^(l), d_k^(h) or the precision γ_s of the factor score s_i. The former is obvious from the training/testing framework; however, the latter is crucial as well, in order to prevent over-fitting. If we also sampled γ_s during testing, the sparse factors s_i would over-fit the LR patches x_i^(l) and the reconstruction quality would degrade. In other words, the main assumption of sparse coefficient sharing is regularized by learning γ_s during training and using that estimate at test time. In this way we control the sparsity of the factor scores during testing.

Experiments. In the experiments we use two data sets: (1) the Berkeley natural image data set for testing, and (2) a set of images collected from the web for training. The diversity of both data sets provides a rich set of HR-LR patch pairs. We use a super-resolution ratio of 3 and a patch size of 8x8. We apply all the algorithms only to the illuminance channel, and use bicubic interpolation in the color layers (Cb, Cr) for all methods. In terms of baselines, we use both interpolation- and example-based super-resolution algorithms. Bicubic interpolation is the gold standard in the super-resolution literature; we also use nearest-neighbor and bilinear interpolation as interpolation-based methods. As an example-based method, we use super-resolution via sparse representation (ScSR). The dictionary learning stage of both our method and ScSR uses the same training set (100K patches sampled from the data set).

Figure 1: Panels (a) High, (b) Low, (c) Bicubic, (d) NNI, (e) Bilinear, (f) ScSR, (g) BPFA, (h) OurLoc, (i) OurVal, (j) HR Dictionary, (k) LR Dictionary. OurLoc: spatial location used as covariate; OurVal: dimension-reduced observation used as covariate; BPFA: Beta Process Factor Analysis; NNI: nearest-neighbor interpolation.

Structured Sparsity via Alternating Directions Methods
Zhiwei (Tony) Qin, Donald Goldfarb

1 Introduction
We consider a class of sparse learning problems in high-dimensional feature space, regularized by a structured sparsity-inducing norm which incorporates prior knowledge of the group structure of the features. Such problems often pose a considerable challenge to optimization algorithms due to the non-smoothness and non-separability of the regularization term. In this paper, we focus on two commonly adopted sparsity-inducing regularization terms: the overlapping Group Lasso penalty (l1/l2-norm) and the l1/l∞-norm. The problem is

min_{x∈R^m} F(x) ≡ l(x) + Ω(x),    (1)

where l(x) = (1/2)||Ax − b||_2^2, A ∈ R^{n×m}, and Ω(x) is either

Ω_{l1/l2}(x) ≡ λ Σ_{g∈G} w_g ||x_g||   or   Ω_{l1/l∞}(x) ≡ λ Σ_{g∈G} w_g ||x_g||_∞.

Here G = {g_1, . . . , g_{|G|}} is the set of group indices with |G| = J, and the elements (features) in the groups possibly overlap [2, 4]. In this model, λ, w_g, and G are all pre-defined; || · || without a subscript denotes the l2-norm. Whenever the l1/l2- and l1/l∞-regularization terms are mentioned, we assume that the groups overlap.

2 Algorithms
We reformulate problem (1) as the constrained problem

min (1/2)||Ax − b||^2 + Ω̃(y)   s.t. Cx = y,    (2)

where Ω̃(y) is the non-overlapping group-structured penalty term corresponding to Ω(x) defined above. The augmented Lagrangian of (2) is L(x, y, v) = (1/2)||Ax − b||^2 − v^T(Cx − y) + (1/(2μ))||Cx − y||^2 + Ω̃(y). We build a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. The core building block of this framework is computing an approximate minimizer (x, y) of L(x, y, v) given v. For this specific task, we propose FISTA-p (Algorithm 2.1), a partial-linearization variant of FISTA [1], and we prove that this algorithm requires O(1/√ε) iterations to obtain an ε-optimal solution.

Algorithm 2.1 FISTA-p (partial linearization)
1: Given x^0, y^0, v. Choose ρ, and set z^1 = y^0.
2: for k = 1, 2, . . . , K do
3:   x^k ← argmin_x (1/2)||Ax − b||^2 − v^T(Cx − z^k) + (1/(2μ))||Cx − z^k||^2
4:   y^k ← argmin_y f(x^k, z^k) + ∇_y f(x^k, z^k)^T (y − z^k) + (1/(2ρ))||y − z^k||^2 + g(y)
5:   t_{k+1} ← (1 + √(1 + 4 t_k^2))/2
6:   z^{k+1} ← y^k + ((t_k − 1)/t_{k+1})(y^k − y^{k−1})
7: end for
8: return (x^{K+1}, y^{K+1})

In addition, we propose a partial-split version of ALM-S [3], as well as the direct application of ALM-S and FISTA, to solve the core subproblem.

Figure 1: Left two: scalability test results of the algorithms on the synthetic overlapping Group Lasso data sets from [2]. The y-axis is in logarithmic scale. ALM-S is not included in the left plot because we did not run it on the last two data sets due to the computational burden (expected CPU time exceeding 10^4 seconds). Right two: scalability test results on the DCT set with l1/l2-regularization (left column) and l1/l∞-regularization (right column). The y-axis is in logarithmic scale.

Figure 2: Separation results for the video sequence background subtraction example [4]. Each training image had 120 x 160 RGB pixels; the training data contained 200 images in sequence. The accuracy indicated for each of the different models is the percentage of pixels that matched the ground truth. The CPU time reported is for an average run on the regularization path.

3 Experimental Results
We compare the algorithms on a collection of data sets and apply them to a video sequence background subtraction task. The results are presented in Figures 1 and 2.
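A minimal sketch of the FISTA-p inner loop (Algorithm 2.1) for the l1/l2 case after variable duplication, with ρ, μ, the iteration count, and unit group weights as illustrative choices rather than the paper's settings. The x-step is an exact quadratic solve and the y-step is a gradient step on the smooth part followed by blockwise soft-thresholding.

```python
import numpy as np

def fista_p(A, b, C, v, groups, lam, mu=1.0, rho=1.0, K=200):
    """FISTA-p sketch: groups is a list of index arrays into y, which are
    non-overlapping after the duplication y = Cx."""
    m, p = A.shape[1], C.shape[0]
    x, y = np.zeros(m), np.zeros(p)
    z, y_prev, t = y.copy(), y.copy(), 1.0
    H = A.T @ A + (1.0 / mu) * (C.T @ C)   # Hessian of the x-subproblem
    for _ in range(K):
        # Step 3: exact minimization over x (solve the normal equations).
        rhs = A.T @ b + C.T @ v + (1.0 / mu) * (C.T @ z)
        x = np.linalg.solve(H, rhs)
        # Step 4: forward step on the smooth part, then the group prox.
        grad_y = v - (1.0 / mu) * (C @ x - z)
        u = z - rho * grad_y
        y = u.copy()
        for g in groups:                    # blockwise soft-thresholding
            ng = np.linalg.norm(u[g])
            y[g] = 0.0 if ng == 0 else max(0.0, 1 - rho * lam / ng) * u[g]
        # Steps 5-6: FISTA momentum on y.
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = y + ((t - 1) / t_next) * (y - y_prev)
        y_prev, t = y.copy(), t_next
    return x, y
```

Pre-factoring H (e.g., with a Cholesky decomposition) would make the x-step cheap across iterations; the sketch keeps a plain solve for clarity.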
Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1558–1566. 2010.

Ranking annotators for crowdsourced labeling tasks
Vikas C. Raykar (vikas.raykar@siemens.com), Shipeng Yu (shipeng.yu@siemens.com), Siemens Healthcare, Malvern, PA 19355 USA

Annotating data is one of the major bottlenecks in using supervised learning to build good predictive models. Getting a dataset labeled by experts can be expensive and time consuming. With the advent of crowdsourcing services (Amazon's Mechanical Turk (AMT) being a prime example) it has become quite easy and inexpensive to acquire labels from a large number of annotators in a short amount of time. However, one drawback of most crowdsourcing services is that we do not have control over the quality of the annotators. The annotators can come from a diverse pool including genuine experts, novices, biased annotators, malicious annotators, and spammers. Hence, in order to get good quality labels, requestors typically get each instance labeled by multiple annotators, and these multiple annotations are then consolidated either using simple majority voting or more sophisticated methods that model and correct for the annotator biases (Raykar et al., 2010). While majority voting assumes all annotators are equally good, the more sophisticated methods model the annotator performance and then appropriately give different weights to the annotators to reach the consensus.

Figure 1. The ranking of annotators obtained using the proposed spammer score for the Irish economic sentiment analysis data (Brew et al., 2010; 1660 instances, 33 annotators). The spammer score ranges from 0 to 1: the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence intervals (CI) are shown, obtained from 100 bootstrap replications. The annotators are ranked based on the lower limit of the 95% CI. The number at the top of the 95% CI bar shows the number of instances annotated by that annotator. Note that the CIs are wider when the annotator labels only a few instances.

Here we are interested in ranking annotators based on how spammer-like each annotator is. In our context a spammer is an annotator who assigns random labels (maybe because the annotator does not understand the labeling criteria, or does not look at the instances when labeling). Spammers can significantly increase the cost of acquiring annotations and at the same time decrease the accuracy of the final consensus labels. A mechanism to detect and eliminate spammers is a desirable feature for any crowdsourcing marketplace. One can give monetary bonuses to good annotators and deny payments to spammers. We formalize the notion of a spammer and specifically we define a scalar metric which can be used to rank the annotators, with the spammers having a score close to 0 and the good annotators having a score close to 1.

Annotator model for categorical labels Suppose there are K ≥ 2 classes. Let y_i^j ∈ {1, . . . , K} be the label assigned to the ith instance by the jth annotator, and let y_i ∈ {1, . . . , K} be the actual (unobserved) label. We model each annotator by the multinomial parameters α_c^j = (α_c1^j, . . .
, α_cK^j), where α_ck^j := Pr[y_i^j = k | y_i = c] and Σ_{k=1}^K α_ck^j = 1. The term α_ck^j denotes the probability that annotator j assigns class k to an instance given the true class is c. We do not dwell too much on the estimation of the annotator model parameters. These parameters can either be estimated directly using a known gold standard, or via iterative EM algorithms that estimate the annotator model parameters without actually knowing the gold standard (Raykar et al., 2010).

Who is a spammer? Intuitively, a spammer assigns labels randomly, maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or is a bot pretending to be a human annotator. More precisely, an annotator is a spammer if the probability of the observed label y_i^j being k given the true label y_i is independent of the true label, i.e., Pr[y_i^j = k | y_i] = Pr[y_i^j = k], ∀k. This is equivalent to Pr[y_i^j = k | y_i = c] = Pr[y_i^j = k | y_i = c′], ∀c, c′, k = 1, . . . , K, which means that knowing the true class label to be c or c′ does not change the probability of annotator j's label. This indicates that annotator j is a spammer if α_ck^j = α_c′k^j, ∀c, c′, k = 1, . . . , K.

Spammer metric Let A^j be the K × K confusion rate matrix with entries [A^j]_ck = α_ck^j; a spammer would have all the rows of A^j equal. Essentially, A^j is then a rank-one matrix of the form A^j = e v_j^⊤, for some vector v_j ∈ R^K that satisfies v_j^⊤ e = 1, where e is a vector of ones. One natural way to summarize this would be in terms of the distance (Frobenius norm) of the confusion matrix to the closest rank-one approximation, i.e., S^j := ‖A^j − e v̂_j^⊤‖_F², where v̂_j = argmin_{v_j} ‖A^j − e v_j^⊤‖_F² such that v_j^⊤ e = 1. This yields v̂_j = (1/K) A^{j⊤} e, which is the mean of the rows of the matrix A^j. Hence we have S^j = ‖(I − (1/K) e e^⊤) A^j‖_F² = (1/K) Σ_{c<c′} Σ_k (α_ck^j − α_c′k^j)². So a spammer is an annotator for whom S^j is close to zero. A perfect annotator has S^j = K − 1. We can normalize this to lie between 0 and 1:

    S^j = (1/(K(K − 1))) Σ_{c<c′} Σ_k (α_ck^j − α_c′k^j)².

While this spammer score was implicit for binary labels in earlier works, the extension to categorical labels is novel and is quite different from the accuracy computed from the confusion rate matrix. A recent attempt to quantify the quality of the workers was made by Ipeirotis et al. (2010), who transformed the observed labels into posterior soft labels.

We use the proposed metric to rank annotators for the data collected for Irish economic sentiment analysis (Brew et al., 2010). A total of 1660 articles from three Irish online news sources were annotated by 33 volunteer users as positive, negative, or irrelevant. Each article was annotated by 6 annotators. Figure 1 plots the ranking of annotators obtained using the spammer score, with the annotator model parameters estimated via the EM algorithm. The mean and the 95% confidence intervals obtained via bootstrapping are also shown. The number at the top of the CI bar shows the number of instances annotated by that annotator. Note that the CIs are wider when the annotator labels only a few instances. Some annotators who label only a few instances may have a high mean spammer score, but the CI will be wide. We need to factor the number of instances labeled by the annotator into the ranking: ideally we would like annotators with a high score who at the same time label a lot of instances. Hence we rank them based on the lower limit of the CI.
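As a minimal illustration of the metric (our sketch, not the authors' code), the normalized spammer score can be computed directly from an estimated K × K confusion rate matrix A^j:

import numpy as np
from itertools import combinations

def spammer_score(A):
    # A: K x K confusion rate matrix; each row sums to 1.
    # Score is 0 for a spammer (identical rows) and 1 for a perfect annotator.
    K = A.shape[0]
    s = sum(np.sum((A[c] - A[cp]) ** 2) for c, cp in combinations(range(K), 2))
    return s / (K * (K - 1))

perfect = np.eye(3)                 # always assigns the true class
spammer = np.full((3, 3), 1.0 / 3)  # labels independent of the true class
print(spammer_score(perfect), spammer_score(spammer))  # -> 1.0, 0.0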
References
Brew, A., Greene, D., & Cunningham, P. (2010). Using crowdsourcing and active learning to track sentiment in online media. Proceedings of the 6th Conference on Prestigious Applications of Intelligent Systems (PAIS '10).
Ipeirotis, P. G., Provost, F., & Wang, J. (2010). Quality management on Amazon Mechanical Turk. Proceedings of the ACM SIGKDD Workshop on Human Computation (pp. 64–67).
Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., & Moy, L. (2010). Learning from crowds. Journal of Machine Learning Research, 11, 1297–1322.

Extracting Latent Economic Signal from Online Activity Streams
Joseph Reisinger, Metamarkets, 670 Broadway, New York, NY, joe@metamarketsgroup.com

Abstract Online activity generates a myriad of weak economic signals, both as a result of direct interaction, e.g. purchases, location-based checkins, etc., and indirect interaction, e.g. realtime display ad auctions, audience targeting and profile generation. In this work we develop an approach to constructing latent economic indicators from large collections of such realtime activity streams. In particular we investigate the relative impact of realtime topical discourse and ad pricing on a basket of financial indicators. In order to effectively overcome the noise inherent in the extracted weak signals, we develop a novel Bayesian vector autoregression (VAR) formulation with sparse underlying factors. Our preliminary results indicate that sparse subsets of online activity do indeed correlate strongly with economic activity, and can be used to improve the robustness of economic forecasts.

1 Background Online textual activity streams have been used to predict a wealth of economic and political indicators. For example, O'Connor et al. use Twitter sentiment to predict public opinion around elections [12] and Joshi et al. use movie reviews to predict revenue performance [11]. Online sentiment has also been shown to potentially inform equity markets, as financial decisions are often based on mood [7]. Google uses flu-related search queries to predict outbreaks, often faster than the CDC [8], and also maintains a domestic trends index,¹ tracking economically relevant search traffic across a diverse set of sectors [5]. Finally, the Billion Prices Project seeks to define a bottom-up measure of CPI based on extracting and classifying prices from thousands of online retailers [4]. Our work attempts to bridge the gap between large-scale extraction of such "nonstandard economic indicators" and traditional econometric analysis, introducing a novel structural vector autoregression (VAR) model based on group sparsity. Our model assumes a low-rank autoregressive factor structure for time-series within the same group (e.g. Twitter popularity or ad prices) linked by sparse between-group parameter vectors. This formulation is efficient and allows us to robustly scale our analyses to upwards of 10k base time-series.

We collect eight months of daily time-series data (2011-01-01 to 2011-09-01) from two sources:

(Ad Exchanges) Ad impressions can be sold in realtime auctions on ad exchanges such as Google's AdX platform and Yahoo's RightMedia exchange. These exchanges match advertisers (buyers) with publishers (supply providers) at the granularity of individual impressions, and take a small flat fee from each successful transaction. We obtained rich pricing data from three such exchanges consisting of 10B impressions across a large number of publishers.
¹ http://www.google.com/finance/domestic_trends

(Twitter) We extract daily time-series of news content and public mood from Twitter, a popular microblogging service.² Raw unigram counts are smoothed using Latent Dirichlet Allocation (LDA) with K = 100 topics, yielding coarse-grained topical content [3].

Using this data, we uncover components of both Twitter topics and ad exchange pricing dynamics that are predictive of sectoral shifts in broader equities markets.

2 Sparse Pooling Vector Autoregression We propose an approach to structural modeling based on the Bayesian vector autoregressive (VAR) process [9] that is capable of scaling to a large number of noisy time-series. In particular, we impose restrictions on the covariance structure so that it is low-rank and sparse, limiting the number of parameters to estimate. Our sparse pooling VAR model (spVAR) consists of a set of K endogenous variables partitioned into G groups: y_t^g = (y_1t^g, . . . , y_kt^g, . . . , y_Kt^g), k ∈ [1, K], g ∈ [1, G]. The spVAR(p) process is defined as

    [y_t^1; . . . ; y_t^G] = Z_1 [A_1^1 y_{t−1}^1; . . . ; A_1^G y_{t−1}^G] + . . . + Z_p [A_p^1 y_{t−p}^1; . . . ; A_p^G y_{t−p}^G] + u_t,

where A_i^g is an m × K coefficient matrix, Z_i is a sparse K × (G · m) matrix of random factor loadings, and u_t is a K-dimensional zero-mean, covariance-stationary innovation process, i.e. E[u_t] = 0 and E[u_t u_t^⊤] = Σ_u. That is, each covariate vector y_{t−i}^g for each signal group g is projected onto a separate linear subspace of rank m, and these projections are combined via the shared sparse factor loading matrix Z_i.³ The constraints introduced by this procedure are similar to those found in structural VAR [9]. Note that unlike factor-augmented VAR models [2], our model does not assume time-varying factor structure, making it less suitable for modeling, e.g., the effects of exogenous shocks.

3 Structural Analysis and Future Work Figure 1 shows an example structural decomposition of the ad exchange-Twitter-Equities three-group system. Lagged Twitter factors 5 and 7 are found to have a statistically significant (although small) impact on a broad range of equities, while ad exchange factor 2 has a narrower impact across commodities and utilities sectors and ad exchange factor 4 is significant for Yahoo stock returns.

(Exogenous Shocks) Both data sources exhibit large exogenous shocks; in the exchange case due to new bidders entering the market or the availability of new targeting information, and in the Twitter case due to spikes in topical content (e.g. "hurricapocalypse"). Switching VAR processes would be better able to account for these regime changes [6]. (Cointegration) Residual correlation indicates that there may be significant cointegration between groups. Extending spVAR to model I(1) series can be achieved by replacing the underlying VAR model with a VECM [9]. (Price Discovery) Microstructure models of such price discovery can be developed leveraging spVAR as a base model in order to understand how multiple ad exchanges interact to form market pricing [10]. (Forecasting) VAR models are commonly used for macroeconomic forecasting. We can use the spVAR framework to identify subsets of underlying activity streams that are most predictive of economic variables of interest.

² http://twitter.com
³ Z_i and A_i^g can be estimated efficiently as multiple joint sparse-PCA problems [13, 1].
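To make the generative structure of Section 2 concrete, here is a small simulation sketch of an spVAR(1) process under our reading of the model; the dimensions, sparsity level, and noise scale are illustrative assumptions, not the authors' settings.

import numpy as np

rng = np.random.default_rng(0)
G, Kg, m = 2, 3, 2                                       # groups, series per group, factor rank
K = G * Kg
A = [0.3 * rng.normal(size=(m, Kg)) for _ in range(G)]   # per-group coefficient matrices A_1^g
Z = rng.normal(size=(K, G * m))                          # shared factor loading matrix
Z[rng.random(Z.shape) > 0.3] = 0.0                       # impose sparsity on Z

y, series = np.zeros(K), []
for t in range(100):
    blocks = [A[g] @ y[g * Kg:(g + 1) * Kg] for g in range(G)]   # A_1^g y_{t-1}^g per group
    y = Z @ np.concatenate(blocks) + 0.1 * rng.normal(size=K)    # plus innovations u_t
    series.append(y.copy())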
Figure 1: spVAR model parameters fit to daily returns data drawn from latent Twitter topical factors (twitter), latent ad exchange factors (ad_ex) and several equity sectors (AAPL, GSP, SPY, VXX, XLE, XLF, XLI, XLK, XLU, XLY, YHOO). All data has been differenced and smoothed using SMA(3).

References
[1] F. Bach, J. Mairal, and J. Ponce. Convex sparse matrix factorizations. CoRR, abs/0812.1869, 2008.
[2] B. Bernanke, J. Boivin, and P. S. Eliasz. Measuring the effects of monetary policy: A factor-augmented vector autoregressive (FAVAR) approach. The Quarterly Journal of Economics, 120(1):387–422, 2005.
[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 2003.
[4] A. Cavallo and R. Rigobon. Billion Prices Project. http://bpp.mit.edu/.
[5] H. Choi and H. Varian. Predicting the present with Google Trends. Technical report, Google, 2009.
[6] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. Sharing features among dynamical systems with beta processes. In Neural Information Processing Systems 22. MIT Press, 2010.
[7] E. Gilbert and K. Karahalios. Widespread worry and the stock market. In ICWSM. The AAAI Press, 2010.
[8] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012–1014, Nov. 2008.
[9] J. D. Hamilton. Time-series analysis. Princeton University Press, 1994.
[10] J. Hasbrouck. Empirical Market Microstructure. Oxford University Press, 2006.
[11] M. Joshi, D. Das, K. Gimpel, and N. A. Smith. Movie reviews and revenues: an experiment in text regression. In Proc. of NAACL-HLT 2010, pages 293–296. Association for Computational Linguistics, 2010.
[12] B. O'Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of the International AAAI Conference on Weblogs and Social Media, 2010.
[13] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

Intrinsic Gradient Networks
Jason Rolfe, Matthew Cook, and Yann LeCun
September 9, 2011

Artificial neural networks are computationally powerful and exhibit brain-like dynamics. However, before an artificial neural network can be used to perform a computational task, its parameters must be trained to minimize an error function which measures the desirability of the network's outputs for each set of inputs.
It is generally believed that the backpropagation algorithm, conventionally used to train neural networks, is not biologically plausible (Crick, 1989). As suggested by its name, the backpropagation algorithm requires feedback messages to be propagated backwards through the network. These feedback messages, used to calculate the gradient of the error function, are directly dependent on the feedforward messages, and so must evolve on the same time scale as the network's inputs. Nevertheless, the feedback messages must not directly affect the original feedforward messages of the network; any influence of the feedback messages on the feedforward messages will disrupt the calculation of the gradient. Direct biological implementations of the backpropagation algorithm are not consistent with experimental evidence from the cortex, since the cortex neither shows signs of a sufficiently fast reverse signaling mechanism (Harris, 2008), nor segregates feedforward and feedback signals (Douglas & Martin, 2004). We address this biological implausibility by constructing a novel class of recurrent neural networks, intrinsic gradient networks, for which the gradient of an error function with respect to the parameters is a simple function of the network state. Intrinsic gradient networks do not generally have segregated feedforward computation and feedback training signals, and so are potentially consistent with the recurrence observed in the cortex.

The highly recurrent connection topology (Felleman & Van Essen, 1991; Douglas & Martin, 2004) and locality of training signals (Malenka & Bear, 2004) observed in the brain imply that the cortex is able to learn efficiently using only local signals within a recurrently interconnected network. That is, the cortex appears to use a single interdependent set of messages for both computation and learning. Since recurrence and local training alone do not uniquely specify a neural network architecture, we make the biologically and computationally motivated assumption that learning requires the approximate calculation of the gradient of an error function. We restrict our attention to networks in which the gradient can be calculated completely at a single network state, rather than stochastic networks in which learning is a function of some statistic of the network activity over time, since it can take a long time to accurately estimate the statistics of large networks. In light of these experimental observations and computational assumptions, we wish to construct highly recurrent neural networks for which the gradient of an error function defined in terms of the network's activity can be calculated via a simple, local function of the intrinsic network activity. To reach this goal, we first generalize slightly and characterize the class of (not necessarily highly recurrent) neural networks for which the gradient of an error function defined in terms of the network's activity is a simple (but not necessarily local) function of the intrinsic network activity; we call this novel class of networks intrinsic gradient networks. In contrast to traditional artificial neural networks, intrinsic gradient networks do not require a separate, implicit set of backpropagation dynamics to train their parameters.

Intrinsic gradient networks consist of a time-varying vector of real-valued units x(t) ∈ R^n. We write x to denote x(t) at some arbitrary point in time t.
The point at which the network has finished computing and produced an output is defined by the output functions F(x, w) according to the fixed-point equation F(x, w) = x, where w is a vector of parameters. At these output states x where F(x, w) = x, the desirability of the outputs is defined by the error function E(x). We find that intrinsic gradient networks are characterized by the equation

    T(x) = S(x, F(x, w)) + ∇E(x) + ∇F^⊤(x) · T(x),    (1)

where S(x, F(x, w)) = 0 when x = F(x, w). When equation 1 is satisfied,

    dE(x*(w))/dw_0 = T^⊤(x*(w), w) · ∂F(x, w)/∂w_0

at output states x*(w) defined implicitly as a function of the parameters w by F(x*(w), w) = x*(w). We show that we can choose F(x, w) so that the gradient-calculating function T(x, w) is a simple, local function of x. Any network dynamics that converge to a fixed point of F(x, w) may be used, such as dx(t)/dt = (1/τ) · (F(x(t), w) − x(t)). So long as F(x, w) is highly recurrent and T_i(x) is simple and local to unit x_i, intrinsic gradient networks satisfy our desiderata.

We find that equation 1 is satisfied when F(x, w) is the sum of a component determined by E(x), corresponding to the inputs to the network, and

    F(x, w) = T^{−1} · (D + I)^{−1} · ∇( c + Σ_k (D_ψk/(D_ψk + 1)) · x_ψk^{(D_ψk+1)/D_ψk} · Π_{j≠ψk} h_j^k(x_j) )    (2)

for any set of indices ψk indexed by k, and any set of differentiable scalar functions h_j^k(x) indexed by both j and k, assuming that T(x) = T · x and that T^{−⊤} · T = D is diagonal. We show that it is easy to choose parameterizations of the matrix T and the functions h_j^k(x) so that the network is highly recurrent, with a local gradient-calculating function T(x). For instance, if we choose T to be a pairwise permutation of an appropriate form, then using h_j^k(x) ∈ {w · x^{1/2}, x^{2/3}, 1}, we can construct networks consisting of linear nodes with output functions F_i(x) = Σ_j (W_ij + W_ji) · x_j for some matrix W, and nonlinear nodes with output functions F_α(x) = x_β^{2/3} · x_γ^{2/3} / x_α^{1/3} for all mappings of {i, j, k} to {α, β, γ}. As a proof-of-concept, we construct some example networks satisfying equation 2, and show that they can be trained via stochastic gradient descent to recognize handwritten digits.

References
Crick, F. (1989). The recent excitement about neural networks. Nature, 337, 129–132.
Douglas, R. J., & Martin, K. A. C. (2004). Neuronal circuits of the neocortex. Annual Review of Neuroscience, 27, 419–451.
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47.
Harris, K. D. (2008). Stability of the fittest: Organizing learning through retroaxonal signals. Trends in Neurosciences, 31(3), 130–136.
Malenka, R. C., & Bear, M. F. (2004). LTP and LTD: An embarrassment of riches. Neuron, 44, 5–21.

MirroRank: Convex aggregation and online ranking with the Mirror Descent
Benoit Rostykus¹, Nicolas Vayatis²

Abstract We study the learning problem of pairwise ranking of objects, from a set of scoring functions as base learners. Restricting to convex aggregation of these learners with a pairwise misranking loss, we show how the Mirror Descent algorithm can be adapted to this framework to derive an efficient and sequential solution to the problem, which we call MirroRank. Along with a theoretical analysis, our empirical results on recommendation datasets show that the proposed method outperforms several other ranking algorithms.
¹ Ecole Centrale Paris, ENS Cachan, benoit.rostykus@centraliens.net
² ENS Cachan, nicolas.vayatis@cmla.ens-cachan.fr

2 Convex surrogates and convex aggregation for ranking Let (X, Y) ∈ X × [0, 1] be a pair of random variables. In a ranking setup, X typically represents the data, and Y is a score for point X. If Y > Y′, then the pair (X, Y) is said to be preferred to the pair (X′, Y′). In the pairwise ranking problem, one only observes X and X′, and wants to predict the relation Y > Y′. We define a scoring function s : X → [0, 1] such that if s(X) > s(X′), then X is preferred to X′. Given a convex function ϕ : R → R+, we are interested in finding a scoring function s that minimizes the convex risk:

    A(s) = E[ϕ((s(X) − s(X′))(Y − Y′))]    (1)

Note that such a choice of risk emphasizes the magnitude-preserving properties of s (see [3]). We will focus on procedures optimizing A (or its empirical counterpart) over a convex set of base learners. The motivation is driven by applications: in a recommendation setting, it seems indeed intuitive to restrict the search space to convex combinations of similar scoring functions (the base learners). Let M ∈ N* and s_1, ..., s_M be a set of base scoring functions. If Θ is the simplex in R^M, the family of convex combinations of the base scoring functions s_k is noted:

    S = { s_θ = Σ_{k=1}^M θ^(k) s_k , θ ∈ Θ }    (2)

3 Ranking with the Mirror Descent Mirror Descent is an online procedure for convex optimization ([6], [5]). It can be interpreted as a stochastic gradient descent in a dual space. We now show how to adapt it to the ranking problem. Let Q(θ, i, j) = ϕ((s_θ(x_i) − s_θ(x_j))(y_i − y_j)) be the stochastic loss associated with the scoring function s_θ and evaluated on the pairs (x_i, y_i) and (x_j, y_j). Noticing that (∇_θ s_θ)^(k)(x) = s_k(x), and writing ϕ′ for the derivative of ϕ, one can easily show that:

    (∇_θ Q(θ, i, j))^(k) = (y_i − y_j)(s_k(x_i) − s_k(x_j)) · ϕ′((s_θ(x_i) − s_θ(x_j))(y_i − y_j))    (3)

Mirror Descent for Ranking is then described by the iterative scheme presented in the following pseudo-code, where V is called a proxy function and (γ_p) and (β_p) are predefined sequences. At each round, a new pair of couples (x_i, y_i), (x_j, y_j) is evaluated.

MirroRank
for p = 1, ..., t s.t. i_p < j_p do
    ζ_p = ζ_{p−1} + γ_p ∇_θ Q(θ_{p−1}, i_p, j_p)
    θ_p = argmin_{θ∈Θ} θ^T ζ_p + β_p V(θ)
end for
return s_θ̂t with θ̂_t = (Σ_{p=1}^t γ_p θ_p) / (Σ_{p=1}^t γ_p)

The first step in the loop is a classical stochastic gradient update ([1]) in a dual space. The second one updates the weights in the primal. This key step minimizes a local approximation of A, penalized by the "information" criterion V ([5]). In fact, choosing an entropic loss for V leads to a closed-form solution for θ̂_t, making the algorithm very efficient. Taken as a sequential version of the MRE strategy described in [2], we show that MirroRank has a generalization bound in O(√(log M / n)), which can be seen as optimal in some sense.
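As a sketch of how the entropic proxy makes the primal step closed-form (our illustration, not the authors' implementation): with V(θ) = Σ_k θ^(k) log θ^(k), the argmin over the simplex is θ_p ∝ exp(−ζ_p / β_p). The hinge surrogate ϕ(u) = max(0, 1 − u), the constant step sizes, and the precomputed base scores below are illustrative assumptions.

import numpy as np

def mirrorank(scores, y, pairs, gamma=0.1, beta=1.0):
    # scores: n x M matrix with entries s_k(x_i); y: relevance scores in [0, 1].
    n, M = scores.shape
    zeta = np.zeros(M)
    theta_bar = np.zeros(M)
    for (i, j) in pairs:
        theta = np.exp(-zeta / beta)
        theta /= theta.sum()                      # closed-form entropic projection
        u = (scores[i] - scores[j]) @ theta * (y[i] - y[j])
        if u < 1.0:                               # hinge loss active: phi'(u) = -1
            zeta += gamma * (-(y[i] - y[j]) * (scores[i] - scores[j]))
        theta_bar += gamma * theta
    return theta_bar / theta_bar.sum()            # averaged iterate theta_hat_t

With a constant γ_p, the weighted average θ̂_t in the pseudo-code reduces to the plain mean of the iterates, as computed here.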
4 Experimental Results Choosing ϕ to be the hinge loss, experiments conducted on different recommendation datasets show that MirroRank outperforms other ranking algorithms on different classical metrics. This is especially true when the range of possible scores is large. As opposed to RankBoost ([4]), MirroRank weak learners are taken to be normalized scoring functions of neighbouring users. The following table is an example of the comparison between MirroRank and RankBoost that we obtained on different datasets. The procedure employed for these experiments is very similar to the one described in [3].

                 Mean Disagreement              NDCG@3
                 RankBoost     MirroRank        RankBoost     MirroRank
MovieLens1       44.7% ± 0.8   40.3% ± 1.1      81.4% ± 1.2   82.7% ± 1.2
MovieLens2       42.8% ± 0.7   39.1% ± 0.7      83.7% ± 0.8   84.7% ± 1.0
HTREC Last.fm    47.2% ± 0.8   41.5% ± 1.0      39.9% ± 2.7   49.1% ± 1.7

Experiments also exhibit the excellent time complexity of our algorithm. Finally, we provide a natural interpretation of the weights θ^(k), and experimentally show that they can be related to a notion of similarity between scoring functions, enabling MirroRank to jointly recommend similar objects and similar users.

References
[1] L. Bottou and Y. LeCun. Large scale online learning. Advances in Neural Information Processing Systems, 16:217–224, 2004.
[2] S. Clemençon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2):844–874, 2008.
[3] C. Cortes, M. Mohri, and A. Rastogi. Magnitude-preserving ranking algorithms. In Proceedings of the 24th International Conference on Machine Learning, pages 169–176. ACM, 2007.
[4] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4:933–969, 2003.
[5] A. B. Juditsky, A. V. Nazin, A. B. Tsybakov, and N. Vayatis. Recursive aggregation of estimators by the mirror descent algorithm with averaging. Problems of Information Transmission, 41(4):368–384, 2005.
[6] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. 1983.

Preserving Proximity Relations and Minimizing Edge-crossings in Graph Embeddings
Amina Shabbeer*, Cagri Ozcaglar†, Bülent Yener‡, Kristin P. Bennett§, Rensselaer Polytechnic Institute

We propose a novel approach to embedding heterogeneous data in high-dimensional space characterized by a graph. Targeted towards data visualization, the objectives of the embedding are twofold: (i) preserve proximity relations as measured by some embedding objective, and (ii) simultaneously optimize an aesthetic criterion, no edge-crossings in the embedding, to create a clear representation of the underlying graph structure. The data points are the nodes of the graph and have distances or similarity measures defined between them. It is often desirable that drawings of such graphs map nodes from high-dimensional feature space to low-dimensional vectors that preserve these pairwise distances. This desired quality is frequently expressed as a function of the embedding and then optimized; e.g., in Multidimensional Scaling (MDS) the goal is to minimize the difference between the actual distances and the Euclidean distances between all nodes in the embedding. However, layouts that preserve proximity relations can have a large number of edge-crossings, obfuscating the relationships between nodes. The quality of a graph visualization can be gauged on the basis of how easily it can be understood and interpreted. Therefore, for graphs it is desirable to minimize edge crossings. This is a challenging problem in itself; determining the minimum number of crossings for a graph is NP-complete [4]. Thus, a natural question for heterogeneous data that comprises data points characterized both by features and by an underlying graph structure is how to optimize the embedding criteria while minimizing the large number of edge crossings in the embedded graph.
The principal contributions of this paper are (i) expressing edge-crossing minimization as a continuous optimization problem so that both the embedding criteria and the aesthetic graph visualization criteria can be simultaneously optimized, and (ii) an iterative penalty algorithm that efficiently optimizes the layout with respect to a large number of non-convex quadratic constraints. This paper provides a new approach for addressing the edge-crossing minimization criterion that exploits Farkas' Lemma to re-express the condition for no edge-crossings as a system of nonlinear inequality constraints. The approach has an intuitive geometric interpretation closely related to support vector machine classification. While edge-crossing minimization can be utilized in conjunction with any optimization-based embedding objective, here we demonstrate the approach on multi-dimensional scaling by modifying the stress majorization algorithm to include penalties for edge crossings. An alternating iterative penalty algorithm is developed that is capable of elegantly minimizing the stress subject to a large number of non-convex quadratic constraints. The algorithm is applied to a problem in tuberculosis molecular epidemiology, creating 'spoligoforests' for visualizing genetic relatedness between strains characterized by fifty-five biomarkers with associated non-Euclidean genetic distances of the Mycobacterium tuberculosis complex. Evolutionary relationships between strains are defined by a phylogenetic forest as in Fig. 1. The method is also demonstrated on a suite of randomly generated graphs with corresponding Euclidean distances that have planar embeddings with high stress, as observed in Fig. 2 (a) and (b). The proposed edge-crossing constraints and iterative penalty algorithm can be readily adapted to other supervised and unsupervised optimization-based embedding or dimensionality reduction methods. The constraints can be generalized to remove intersections of general convex polygons, including node-edge and node-node intersections.

* e-mail: shabba@cs.rpi.edu  † e-mail: ozcagc2@cs.rpi.edu  ‡ e-mail: yener@cs.rpi.edu  § e-mail: bennek@rpi.edu

Figure 1: Embeddings of spoligoforests of LAM (Latin-American-Mediterranean) sublineages. Graph (c) is a planar embedding generated using Graphviz Twopi; the radial layout is visually appealing, but genetic distances between strains are not faithfully reflected. Graph (b), which optimizes the MDS objective (generated using Neato), preserves proximity relations but has edge-crossings. In (d), Laplacian eigenmaps [1, 5, 7] use the eigenvectors of the weighted Laplacian for dimensionality reduction such that locality properties are preserved, making genetically similar groups cluster together. These methods are fast but generate graphs that have edge-crossings, and the genetic relatedness between all pairs of strains is less evident. In graph (a), the proposed approach eliminates all edge crossings with little change in the overall stress. Note how in graph (a) the radial structure emerges naturally when both distances and the graph structure are considered.

Figure 2: Embeddings for a randomly generated graph in R^7 with 50 nodes and 80 edges using (a) stress majorization (stress = 131.8, number of crossings = 369) and (b) EdgeCrossMin (stress = 272.1, number of crossings = 0). The original planar embedding had stress = 352.5.

Proximity relations are preserved by minimizing the stress, a measure of the disagreement between distances in the high-dimensional space and the new reduced space. The stress function is given by stress(X) = Σ_{i<j} w_ij (‖X_i − X_j‖ − d_ij)², where X_i is the position of node i in the embedding and d_ij represents the distance between nodes i and j. Majorization algorithms, such as used in this work, optimize this stress function by minimizing a quadratic function that bounds the stress function from above [3]. We explore how edge-crossing constraints can be added to stress majorization algorithms, and develop an algorithm which simultaneously minimizes stress while eliminating or reducing edge crossings using penalized stress majorization.
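For reference, the stress objective itself is straightforward to evaluate. The sketch below is ours, with uniform weights w_ij = 1 as an illustrative default, and computes the stress of a 2D embedding.

import numpy as np

def stress(X, D, W=None):
    # X: n x 2 embedding coordinates; D: n x n matrix of target distances.
    n = X.shape[0]
    W = np.ones((n, n)) if W is None else W
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += W[i, j] * (np.linalg.norm(X[i] - X[j]) - D[i, j]) ** 2
    return total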
The condition that two edges do not cross is equivalent to the feasibility of a system of nonlinear inequalities.¹ Two edges do not intersect iff the following system of equations has no solution:

    ∄ δ_A, δ_B  s.t.  A′δ_A = B′δ_B,  e′δ_A = 1,  e′δ_B = 1,  δ_A ≥ 0,  δ_B ≥ 0    (1)

where e is a vector of ones of appropriate dimension, A = [a_x a_y; b_x b_y] and B = [c_x c_y; d_x d_y]. We prove this using a theorem of the alternative: Farkas' lemma [6]. Therefore, two edges do not intersect iff ‖(−Au + (1 + γ)e)_+‖ + ‖(Bu + (1 − γ)e)_+‖ = 0, where (z)_+ = max(0, z). Geometrically, the theorem states that two edges (or more generally two polyhedra) do not intersect if and only if there exists a hyperplane that strictly separates the extreme points of A and B. Figure 3 illustrates that when this system is satisfied, any plane that lies between xu − γ = 1 and xu − γ = −1 strictly separates the two edges, and the edges do not intersect.

Figure 3: In (a), edge A from a to c and edge B from b to d do not cross. Any line between xu − γ = 1 and xu − γ = −1 strictly separates the edges. Using a soft margin, the plane xu − γ = 0 in (b) separates the plane into half-spaces that should contain each edge.

This formulation bears resemblance to the parallel hyperplanes used to find maximum-margin hyperplanes in SVMs [8]. The no-edge-crossing constraint corresponds to introducing a hyperplane and requiring each edge to lie in opposite half-spaces. We develop a method to determine the separating plane based on the alternating direction method of multipliers [2] that achieves a 10-fold improvement in computation time as compared to MATLAB's LP solver linprog. This work demonstrates how edge-crossing constraints can be formulated as a system of nonconvex constraints: edges do not cross if and only if they can be strictly separated by a hyperplane, and if the edges do cross, then the hyperplane defines the desired half-spaces that the edges should lie within. The edge-crossing constraints can be transformed into a continuous edge-crossing penalty function in either 1-norm or least-squares form, yielding the penalized problem

    min_{X,u,γ} stress(X) + Σ_{i=1}^m (ρ_i/2) [ ‖(−A_i(X)u_i + (1 + γ_i)e)_+‖_1 + ‖(B_i(X)u_i + (1 − γ_i)e)_+‖_1 ]    (2)
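The penalty inside the sum in (2) can be evaluated for a single pair of edges as follows (our sketch; the 2D endpoints and the fixed (u, γ) are illustrative). The value is zero exactly when every endpoint of edge A satisfies x·u ≥ γ + 1 and every endpoint of edge B satisfies x·u ≤ γ − 1, i.e., when the line x·u = γ strictly separates the edges with unit margin.

import numpy as np

def crossing_penalty(A, B, u, gamma):
    # A, B: 2 x 2 matrices of edge endpoints (one endpoint per row).
    # Zero iff the line {x : x @ u = gamma} separates the edges with margin 1.
    pos = lambda z: np.maximum(z, 0.0)
    e = np.ones(2)
    return pos(-A @ u + (1 + gamma) * e).sum() + pos(B @ u + (1 - gamma) * e).sum()

A = np.array([[3.0, 0.0], [3.0, 2.0]])   # endpoints of edge A
B = np.array([[0.0, 0.0], [0.0, 2.0]])   # endpoints of edge B
print(crossing_penalty(A, B, u=np.array([1.0, 0.0]), gamma=1.5))  # 0.0: separated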
We chose to use a penalty approach because it provides an efficient mechanism for dealing with the large number of potential edge crossings: for ℓ edge objects there are ℓ(ℓ − 1)/2 possible intersections. We investigate the use of the alternating direction method of multipliers to solve this program efficiently. Animations of the algorithm illustrating how the edge-crossing penalty progressively transforms the graphs are provided at http://www.cs.rpi.edu/~shabba/Final pictures/. Computational results demonstrate that this approach is practical and tractable. Results on spoligoforests and high-dimensional random graphs with planar embeddings show that the method can find much more desirable solutions from a visualization point of view, with only relatively small changes in stress as compared with other proximity-preserving dimensionality reduction techniques.

This general approach is applicable to the intersection of any convex polyhedra, including nodes represented as boxes and edges represented as bars. The number of node-node and node-edge overlaps increases as the graph size grows. Greater clarity can be achieved in visualizations of large graphs with the incorporation of constraints minimizing such overlaps.

¹ Details available at http://www.cs.rpi.edu/research/pdf/11-03.pdf

References
[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Technical report, Stanford University, 2010.
[3] E. Gansner, Y. Koren, and S. North. Graph drawing by stress majorization. Graph Drawing, 3383:239–250, 2004.
[4] M. R. Garey and D. S. Johnson. Crossing number is NP-complete. SIAM Journal on Algebraic and Discrete Methods, 4:312, 1983.
[5] Y. Koren. On spectral graph drawing. Computing and Combinatorics, pages 496–508, 2003.
[6] O. L. Mangasarian. Nonlinear Programming. Society for Industrial Mathematics, 1994.
[7] Thomas Puppe. Spectral Graph Drawing: A Survey. VDM Verlag, 2008.
[8] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 2000.

Title: A Learning-based Approach to Enhancing Checkout Non-Compliance Detection
Authors: Hoang Trinh, Sharath Pankanti, Quanfu Fan

Abstract
In retail stores, cashier non-compliant activities at the Point of Sale (POS) are one of the prevalent sources of retail loss, amounting to billions of dollars every year. A checkout non-compliance occurs when an item passes the POS without being registered, e.g., by failing to trigger the barcode scanner. This non-compliance event, whether intentional or unintentional on the part of the cashier, leads to the same outcome: the item is not charged to the customer, causing loss to the store. Surveillance cameras have long been used to monitor grocery stores, but their effectiveness is much limited by the need for constant human monitoring. Recently many automated video surveillance systems [1, 2, 3, 4, 5] have been introduced to detect such activities, showing more advantages than human surveillance in effectiveness, efficiency and scalability. The ultimate goal of these systems is to detect non-compliance activities and create a real-time alert for each of them for human verification. These systems accomplish this task by detecting checkout activities of cashiers during transactions and identifying unmatched evidence through joint analysis of cashier activities and logged transaction data (TLog). The predominant cashier activity at the checkout is a repetitive activity called "visual scan", which is constituted by three distinctive primitive checkout activities: pick-up, scan and drop-off, corresponding to the process of registering one item by the cashier in a transaction. By aligning these detected visual scans with the barcode signals from the POS device, non-compliance activities can be isolated and detected. However, when it comes to the real-world deployment of such systems, there are still vast technical challenges to overcome.
Changing viewpoints, occlusions, and cluttered background are just a few of the challenges from a realistic environment that any automatic video surveillance system has to handle. In addition, in the low-margin retail business (checkout non-compliance occurs with much lower frequency than regular events), it is crucial that such a system be designed with careful control of the alarm rate (AR) while still being scalable. The alarm rate is defined as the total number of detected non-compliance activities divided by the total number of scanned items. A high alarm rate makes it almost impossible for a human verifier to scan through all the alarms, which means that the probability of finding true non-compliance would decrease.

In this work, we propose a novel approach to reliably rank the list of detected non-compliance activities of a given retail surveillance system, thereby providing a means of significantly reducing the false alarms and improving the precision in non-compliance detection. A checkout non-compliance activity is defined as a visual scan activity that cannot be associated with any barcode signal. Therefore, it is a complex human activity that can only be defined in the joint domain of multiple modalities, in this case by both the video data and the TLog data, or more precisely, by the misalignment between these two data streams. This is the key difference between non-compliance activity recognition and other human activity recognition tasks. Our approach represents each detected non-compliance activity using multimodal features coming from the video data, the transaction logs (TLog) data and intermediate results of the video analytics. We then train a binary classifier that successfully separates true positives and false positives in a labeled training set. A confidence score for each detected non-compliance activity can then be computed using the decision value of the trained classifier, and a ranked list of detections can be formed based on this score.

The benefit of having this ranked list is two-fold. First, a large number of false alarms can be avoided by simply keeping the top part of the list and discarding the rest. Second, a trade-off between precision and recall can easily be performed by sliding the discarding threshold along this ranked list. Experimental results on a large-scale dataset captured from real stores demonstrate that our approach achieves better precision than a state-of-the-art system at the same recall. Our approach can also reach an ROC point that exceeds the retailers' expectation in terms of precision, while retaining an acceptable recall of more than 60%. Although in this work we applied our approach to the particular problem of retail surveillance, we believe it can be generalized to non-compliance detection applications in other contexts.

References
[1] Agilence. http://www.agilenceinc.com/
[2] A. Dynamics. http://www.americandynamics.net/
[3] Q. Fan, R. Bobbit, Y. Zhai, A. Yanagawa, S. Pankanti, and A. Hampapur. Recognition of repetitive sequential human activity. In CVPR, 2009.
[4] StopLift. http://www.stoplift.com
[5] H. Trinh, Q. Fan, S. Pankanti, P. Gabbur, J. Pan, and S. Miyazawa. Detecting human activities in retail surveillance using hierarchical finite state machine. In ICASSP, 2011.
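A minimal sketch of the ranking step described above (ours; scikit-learn's logistic regression stands in for the unspecified binary classifier, and the feature values are synthetic):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))          # multimodal features per detection
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # 1 = true alarm

clf = LogisticRegression().fit(X_train, y_train)

X_new = rng.normal(size=(20, 5))             # newly detected non-compliance events
scores = clf.decision_function(X_new)        # confidence score per detection
ranked = np.argsort(-scores)                 # ranked list, most confident first

keep = ranked[: int(0.5 * len(ranked))]      # discard the tail to cut false alarms
print(keep)

Sliding the cutoff fraction up or down along the ranked list trades recall against precision, as described in the abstract.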
Ensemble Inference on Single-Molecule Time Series Measurements

Single-molecule fluorescence resonance energy transfer (smFRET) experiments allow observation of individual molecules in real time, enabling detailed studies of the mechanics of cellular components such as nucleic acids, proteins, and other macromolecular complexes. In FRET experiments, the intensity of a fluorescence signal is a proxy for the spatial separation of two labeled locations in a molecule. The signal thereby acts as a molecular ruler, allowing detection of transitions between conformational states in a molecule. This gives an experimentalist access to two quantities of interest: the number of conformational steps involved in a biochemical process, and the transition rates associated with these steps.

A sequence of conformational transitions is generally well-approximated as a Markov process. From the perspective of a data scientist, these types of experiments are therefore obvious candidates for Hidden Markov Model (HMM) based approaches. Maximum-likelihood inference can be performed with the well-known Baum-Welch algorithm. However, a common limitation of this technique is that it is prone to overfitting, since the log-likelihood tends to increase with the number of degrees of freedom in the model parameters. From an experimental point of view, this means that it is difficult to differentiate between a fitting artifact and a short-lived state or intermediate. Recently, our group has shown that Variational Bayes Expectation Maximization (VBEM) is an effective tool for both model selection and parameter estimation for these types of datasets. Maximum-likelihood techniques optimize log p(x | θ), the log-likelihood of the data x given a set of model parameters θ. In VBEM, the optimization criterion is the evidence

    p(x | u) = ∫ dθ p(x | θ) p(θ | u),

which is approximated by a lower bound

    L = Σ_z ∫ dθ q(z) q(θ | w) log [ p(x, z | θ) p(θ | u) / (q(z) q(θ | w)) ].

Here z is the sequence of conformational states, and q(z) and q(θ | w) are used to approximate the posterior p(z, θ | x). The lower bound L can be optimized by iteratively solving the variational equations δL/δq(z) = 0 and δL/δq(θ | w) = 0. Rather than a point estimate for the parameters θ, this method yields an approximation for the posterior distribution over the latent states and model parameters in light of the data. While this method is approximate, in that it relies on a factorization p(z, θ | x) ≃ q(z) q(θ | w), validation on synthetic data shows that VBEM provides accurate estimates under common experimental parameter ranges, while also providing a robust criterion for the selection of the number of states.

In the majority of experiments, data reporting on the dynamics of several hundred individual molecules are recorded, giving rise to an ensemble of traces which report on the same process but possess slightly different photo-physical parameters (e.g., smFRET state means and variances). Consequently, beyond learning from the individual traces, learning a consensus model from the ensemble of traces presents a key challenge for the experimentalist. Here we present a method to model the entire data set at once. This method performs "ensemble" inference: inference of the parameters of each individual trace and inference of the entire distribution of traces. VBEM is performed on each of N traces, yielding a set of variational parameters w_n that determine approximate posterior distributions q(θ_n | w_n).
Subsequently, the total approximate evidence L = Σ_n L_n is maximized with respect to the prior parameters u. The update scheme for this ensemble method can be summarized as:

Iterate until L = Σ_n L_n converges:
1. For each trace n, iterate over VBEM steps:
   a. Update q(z_n) by solving δL_n/δq(z_n) = 0
   b. Update the variational parameters w_n by solving δL_n/δq(θ_n | w_n) = 0
2. Update the hyperparameters u by solving ∂L/∂u = 0

This method differs from what is often called a "hierarchical" approach, e.g. a method where either the prior parameters u are themselves drawn from a prior, or a mixture model of prior parameters u_m is assumed. Rather, we aim to construct the simplest possible scheme that will present an experimentalist with an ensemble method for model selection and parameter estimation. Using synthetic data, we show the statistical superiority of this ensemble inference over methods which only learn from individual traces. The method is more robust under high levels of emission noise, and gracefully handles inter-trace variation of the FRET levels. Moreover, rate constants extracted directly from an ensemble-inferred transition matrix are more accurate than rate constants learned from dwell-time analysis. Finally, we show that our ensemble method can be used to detect subpopulations of states with degenerate emission levels but different transition rates. We present results using both synthetic data and experimental smFRET data taken from the ribosome. We also apply the method to measurements utilizing carbon nanotube transistors that show a similar switching behavior between transition rates.

Bisection Search in the Presence of Noise
September 8, 2011

Abstract
Stochastic gradient algorithms are popular methods for finding the root of a function that is only observed with noise. The finite-time and asymptotic performance of these methods depends heavily on a chosen tuning sequence. As an alternative to stochastic gradient algorithms, which are conceptually similar to the Newton-Raphson search algorithm, we analyze a stochastic root-finding algorithm that is motivated by the bisection algorithm. In each step the algorithm queries an oracle as to whether the root lies to the left or right of a prescribed point x. The oracle answers this question, but the received answer is incorrect with probability 1 − p(x). A Bayes-motivated algorithm for this problem that assumes knowledge of p(·) repeatedly updates a density giving, in some sense, one's belief about the location of the root. In contrast to stochastic gradient algorithms, the described algorithm does not require the specification of a tuning sequence, but it does require knowledge of p(·). This probabilistic bisection algorithm was introduced in Horstein (1963) for the setting where p(·) is constant. However, very little is known about its theoretical properties and how it can be extended when p(·) varies with x. We demonstrate how the algorithm works, and provide new results that shed light on its performance, both when p(·) is constant and when p(·) varies with x. When p(·) is constant, for example, we show that the probabilistic bisection algorithm is optimal for minimizing expected posterior entropy.
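For the constant-p setting, the Horstein update is short enough to sketch directly (our illustration on a discretized belief density; the grid, p = 0.7, and the simulated oracle are assumptions):

import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 1001)
density = np.ones_like(grid) / grid.size     # belief about the root location
root, p = 0.62, 0.7                          # true root; oracle is correct w.p. p

for _ in range(60):
    # Query at the median of the current belief.
    x = grid[np.searchsorted(np.cumsum(density), 0.5)]
    says_right = (root > x) if rng.random() < p else (root <= x)
    # Horstein update: reweight each side by its likelihood, then renormalize.
    side = grid > x if says_right else grid <= x
    density[side] *= p
    density[~side] *= 1.0 - p
    density /= density.sum()

print(grid[np.argmax(density)])              # belief concentrates near the true root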
PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework
Shen Wang (sw2613@columbia.edu), Haimonti Dutta (haimonti@ccls.columbia.edu)
The Center for Computational Learning Systems, Columbia University, NY 10115

Large datasets, of the order of peta- and terabytes, are becoming prevalent in many scientific domains including astronomy, the physical sciences, bioinformatics and medicine. To effectively store, query and analyze these gigantic repositories, parallel and distributed architectures have become popular. Apache Hadoop is a distributed file system that provides support for data-intensive applications. It provides an open source implementation of the MapReduce programming paradigm, which can be used to build scalable algorithms for pattern analysis and data mining. MapReduce has two computation phases: map and reduce. In the map phase, a dataset is partitioned into disjoint parts and distributed to workers called mappers. The mappers implement local data processing, and the output of the map phase is a set of <key, value> pairs. These are passed to the second phase of MapReduce, the reduce phase. The workers in the reduce phase (called reducers) take all the instances that have the same key, do compute-intensive processing, and produce the final result. For a complex computation task, several MapReduce pairs may be involved.

When processing large datasets, clustering algorithms are frequently used to compress or extract patterns from the original data. Agglomerative hierarchical clustering is a popular clustering algorithm. It proceeds in four steps: (1) at the start of the algorithm, each object is assigned to a separate cluster, and all pair-wise distances between clusters are evaluated using a distance metric of one's choice; (2) the two clusters with the shortest distance are merged into one single cluster; (3) the distances between the newly merged cluster and the other clusters are calculated and the distance matrix is updated accordingly; (4) if more than one cluster still exists, go to step 2. The hierarchical clustering algorithm is known to have several advantages: it does not require a priori knowledge of the number of clusters in the dataset, and the distance metric and cluster cutting criteria can be adjusted easily. However, its time complexity is relatively high. More importantly, to find the two clusters that are closest to each other, the algorithm needs to know the distances between all the cluster pairs. This characteristic makes the hierarchical clustering algorithm very hard to scale in a distributed computing framework.

In this abstract, we present a parallel, random-partition based hierarchical clustering algorithm for the MapReduce framework. The algorithm contains two main components: a divide-and-conquer phase and a global integration phase. In the divide-and-conquer phase, the data is randomly split into several smaller partitions by the mappers, which assign each instance a random number as its key. The instances with the same key are forwarded to the same reducer. On the reducers, the sequential hierarchical clustering algorithm is run and a dendrogram is generated. The dendrogram, a binary tree organized by linkage length, is built on the local subset of data. To obtain a global cluster assignment across all mappers and reducers, dendrogram integration needs to be implemented. Such integration is non-trivial because insertion and deletion of a single instance changes the structure of the dendrogram. Our approach is to align the dendrograms by "stacking" them one on top of another using the recursive algorithm described in Algorithm 1.
Algorithm 1: Recursive dendrogram aligning algorithm
1: function align(Dendrogram D1, Dendrogram D2)
2:   if similarity(D1.leftChild, D2.leftChild) + similarity(D1.rightChild, D2.rightChild) < similarity(D1.rightChild, D2.leftChild) + similarity(D1.leftChild, D2.rightChild) then
3:     switch D2's two children
4:   end if
5:   align(D1.leftChild, D2.leftChild)
6:   align(D1.rightChild, D2.rightChild)
7: end function
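A Python rendering of Algorithm 1 (our sketch; the tuple-based tree representation, the Jaccard similarity between leaf sets, and the base case at leaves are illustrative assumptions):

def leaves(node):
    # A node is either a leaf label or a pair (left, right).
    return {node} if not isinstance(node, tuple) else leaves(node[0]) | leaves(node[1])

def similarity(a, b):
    # Jaccard similarity between the instance sets of two subtrees.
    la, lb = leaves(a), leaves(b)
    return len(la & lb) / max(len(la | lb), 1)

def align(d1, d2):
    # Recursively reorder d2's children to best match the template d1.
    if not isinstance(d1, tuple) or not isinstance(d2, tuple):
        return d2
    l2, r2 = d2
    if (similarity(d1[0], l2) + similarity(d1[1], r2)
            < similarity(d1[1], l2) + similarity(d1[0], r2)):
        l2, r2 = r2, l2                       # switch d2's two children
    return (align(d1[0], l2), align(d1[1], r2))

a = (("x1", "x2"), ("x3", "x4"))
b = (("x3", "x4"), ("x2", "x1"))
print(align(a, b))                            # -> (('x1', 'x2'), ('x3', 'x4'))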
1. Introduction

We adopt a dynamical systems perspective on variational inference in deep generative models. This connects variational inference to a temporal credit assignment problem that can be solved using reinforcement learning: policy search methods (such as policy gradients) become a direct search through variational parameters; state-space estimation becomes structured variational inference; and temporal-difference methods suggest novel inference algorithms.

Let $p(x_1, \cdots, x_T)$ be a directed graphical model with graph structure $G$. We consider the directed edges to be an explicit representation of a temporal process: we begin at the root of $G$, and sequentially sample variables conditioned on the values of their parents. $G$ imposes only a partial order on this "timeseries," so we impose a total order by selecting a fixed sample path through all of the variables that respects $G$. This prior can now be viewed as an uncontrolled dynamical system that stochastically evolves until data is generated. We decompose it as $p(x) = \prod_{t=1}^T p(x_t | x_1, \cdots, x_{t-1})$; to simplify notation, we let $h_t = x_1, \cdots, x_{t-1}$ be the history of the generative process until time $t$. Given data $y$, inference is the process of propagating information backwards through this timeseries, creating the posterior conditional distribution $p(x|y) \propto p(y|x)p(x) = p(y|x) \prod_{t=1}^T p(x_t|h_t)$.

To foreshadow things, we note that variational inference can be viewed as a controlled analogue of this. The goal of variational inference is to adjust the parameters $\theta$ of a distribution $p_\theta(x)$ to minimize a cost function $C(\theta)$, which is typically the KL divergence

$$C(\theta) = \mathrm{KL}(p_\theta(x) \| p(x|y)) = \int_x p_\theta(x) \log \frac{p_\theta(x)}{p(x|y)}. \qquad (1)$$

Solving Eq. 1 is typically done by computing derivatives analytically, setting them equal to zero, solving a coupled set of equations, and deriving an iterative optimization algorithm. However, this general approach fails for highly structured distributions (such as those represented by, for example, probabilistic programs), because there is unlikely to be a tractable set of variational equations. Like $p(x)$, we assume that $p_\theta(x)$ decomposes as $p_\theta(x) = \prod_{t=1}^T p_\theta(x_t | h_t, \theta)$. If we further make a mean-field assumption, this decomposes into a product of marginals as $p_\theta(x) = \prod_{t=1}^T p_\theta(x_t | \theta_t)$. We instead consider generic approaches based on direct optimization of Eq. 1. This allows us to borrow tools from other disciplines, including RL.

2. An RL Perspective

We begin by noting that $C(\theta)$ can be written as:

$$C(\theta) = \int_x p_\theta(x) \log \frac{p_\theta(x)}{p(x|y)} = \mathbb{E}_{p_\theta(x)}\!\left[ \log \frac{\prod_{t=1}^T p_\theta(x_t|h_t,\theta)}{p(y|x)\prod_{t=1}^T p(x_t|h_t)} \right] = \mathbb{E}_{p_\theta(x)}\!\left[ \sum_{t=1}^T R_t(x_t|h_t) - \log p(y|x) \right]$$

where $R_t(x_t|h_t, \theta) = \log p_\theta(x_t|h_t, \theta) - \log p(x_t|h_t)$. Our key insight is that this equation has exactly the same form as the definition of the expected reward of a trajectory in a Markov Decision Process (MDP) when we adopt the explicitly temporal decompositions of both the target and variational distributions. Specifically:

• The generative process $p(x)$ is a dynamical system (as we have discussed).
• The parameters $\theta$ are the policy of an RL agent.
• The divergence $R_t(x_t|h_t, \theta)$ is a history-dependent reward; the agent incurs a cost whenever it sets $\theta$ such that the variational distribution diverges from the prior.
• The term $\log p(y|x)$ is a terminal reward obtained at the end of an episode.
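To make this reward decomposition concrete, here is a small sketch (ours, not the authors') of estimating $C(\theta)$ by Monte Carlo roll-outs of the variational "policy": sample trajectories from $p_\theta$, accumulate the per-step rewards $R_t$, and subtract the terminal reward. The unit-variance Gaussian densities, the random-walk prior and the Gaussian observation model are all hypothetical stand-ins.

import numpy as np

def log_gauss(x, mu):
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

def estimate_C(theta, y, T=5, n_rollouts=1000, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        h, cost, x = 0.0, 0.0, np.zeros(T)
        for t in range(T):
            x[t] = rng.normal(theta[t], 1.0)  # sample from p_theta(x_t | theta_t)
            # per-step reward R_t = log p_theta(x_t) - log p(x_t | h_t)
            cost += log_gauss(x[t], theta[t]) - log_gauss(x[t], h)
            h = x[t]  # the assumed prior is a simple Gaussian random walk
        cost -= log_gauss(y, x.mean())  # terminal reward: an assumed log p(y | x)
        total += cost
    return total / n_rollouts

print(estimate_C(theta=np.zeros(5), y=1.0))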
Thus we see the temporal credit assignment problem in generative models: information must be propagated from the observed data $p(y|x)$ backwards through the generative process $p(x)$, balancing the "costs" incurred by deviating from the prior against the "reward" obtained by generating the data with high likelihood.

Different assumptions about $p_\theta(x_t|\theta_t)$ map to different kinds of RL. If we make the mean-field assumption $p_\theta(x) = \prod_{t=1}^T p_\theta(x_t|\theta_t)$ (i.e., ignoring the generative history $h_t$), the dynamical system becomes partially observable, and optimizing $\theta$ is now the problem of finding a first-order (or reactive) policy in a POMDP. Similarly, state-space estimation is the process of summarizing $h_t$ such that the process becomes Markov again. Finally, note that since $p_\theta(x)$ is unconditional, it is easy to sample from; in terms of our dynamical systems perspective, this means that it is easy to roll out a trajectory.

By considering $\theta$ to be an agent's policy, we can borrow techniques from RL to solve this problem. One possible approach is (stochastic) gradient descent on $C(\theta)$:

$$\nabla_\theta C(\theta) \approx \frac{1}{N} \sum_{x_j} \nabla_\theta \log p_\theta(x_j) \left[ \log \frac{p_\theta(x_j)}{p(x_j|y)} + 1 \right] \qquad (2)$$

with $x_j \sim p_\theta(x)$ being a trajectory roll-out. We note that this has exactly the same form as a policy gradient equation, such as Williams' REINFORCE algorithm. We could also use model-free dynamic programming to help optimize $C(\theta)$, by defining a value function and using temporal difference methods to find $\theta$.
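A hedged sketch of Eq. 2 follows (ours, not the authors' implementation), assuming, hypothetically, a mean-field Gaussian variational family $p_\theta(x_t) = N(\theta_t, 1)$ and a user-supplied unnormalized log posterior; the omitted normalization constants shift the bracketed weight by a constant, which does not bias the score-function estimator in expectation.

import numpy as np

def grad_C(theta, log_joint, N=500):
    rng = np.random.default_rng()  # fresh samples each call, as SGD requires
    grad = np.zeros(len(theta))
    for _ in range(N):
        x = rng.normal(theta, 1.0)              # trajectory roll-out x_j ~ p_theta
        score = x - theta                       # grad_theta log N(x; theta, 1)
        log_q = -0.5 * np.sum((x - theta) ** 2) # log p_theta(x_j), up to a constant
        weight = log_q - log_joint(x) + 1.0     # log(p_theta / p(.|y)) + 1
        grad += score * weight
    return grad / N

# toy usage: unnormalized posterior log p(x|y) that is N(2, 1) per coordinate
log_joint = lambda x: -0.5 * np.sum((x - 2.0) ** 2)
theta = np.zeros(3)
for step in range(200):
    theta -= 0.05 * grad_C(theta, log_joint)    # stochastic gradient descent on C
print(theta)  # should move toward 2 in each coordinate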
3. Experiments and Results

To illustrate the utility of this perspective, we present one experiment on a complex generative model from geophysics. The process creates a 3D volume of rock by layering new sedimentary deposits on top of each other, with each layer distributed in a complex way based on the surface created by previous layers. The position and "shape" of each layer are the variables $x_1, \cdots, x_T$ of interest. At the end of the process, wells are drilled through the rock, and a well-log $y$ is computed by measuring the rock porosity as a function of depth. The task is to reason about the posterior distribution over layers $x$ given the well-logs $y$. Because of the complex dependencies between layers, there are no analytically tractable variational equations. Some dependencies can be broken under a mean-field assumption: each layer can be deposited independently of all others. However, there are still dependencies in how the layers generate the final rock volume that cannot be broken.

To solve this, we tested two algorithms. The first is vanilla stochastic gradient descent on Eq. 1, with the gradients given by Eq. 2. We also tested the Episodic Natural Actor Critic (ENAC) algorithm (Peters et al., 2005), an algorithm for estimating a natural policy gradient that combines ordinary policy gradients with value function estimation to reduce variance in the gradient estimate. Figure 1 shows the results, which demonstrate much faster convergence for ENAC. We believe this is because ENAC does a better job of propagating information from the end of the process backward to the beginning, and because of its use of natural gradients.

Figure 1. Results of vanilla policy gradients vs. episodic natural actor critic on the sedimentary model.

4. Conclusions

We have outlined how variational inference can be viewed as an RL problem, and illustrated how RL algorithms bring new tools to inference problems. This is especially appropriate in the context of deep generative models with complex structure, where RL can help propagate information backwards through the process. Future work will investigate more RL algorithms and their properties when applied to variational inference.

References
J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In European Conference on Machine Learning (ECML), pages 280-291, 2005.

Using Support Vector Machine to Forecast Energy Usage of a Manhattan Skyscraper
Rebecca Winter1, David Solomon2, Albert Boulanger3, Leon Wu3, Roger Anderson3
1 Department of Earth and Environmental Engineering, Columbia University Fu Foundation School of Engineering and Applied Sciences; 2 Department of Environmental Sciences, Columbia College; 3 Columbia University Center for Computational Learning Systems

Introduction
As our society gains a better understanding of our ability to negatively impact the environment, reducing carbon emissions and overall energy consumption have become important areas of research. One of the simplest ways to reduce energy usage is to make current buildings less wasteful. Improving energy efficiency is a particularly worthwhile way of lowering our carbon footprint because it actually reduces a building's energy costs, unlike many environmental initiatives that require large monetary investments. In order to improve the efficiency of the heating and air conditioning (HVAC) system of a Manhattan skyscraper, 345 Park Avenue, a predictive computer model was designed to forecast the amount of energy the building will consume. This model uses a support vector machine (SVM), a method that builds a regression purely from the building's historical data, requiring no knowledge of its size, heating and cooling methods, or any other physical properties. This pure dependence on historical data makes the model easily applicable to different types of buildings with few adjustments. The SVM model was built to predict a week of future energy usage from past energy, temperature, and dew point temperature data.

Modeling the energy usage of 345 Park is an important step toward improving the efficiency of the system. An accurate model of future energy usage can be compared to actual energy usage to look for anomalies in the actual data that may represent wasteful use of energy. The short-term predicted energy usage, if accurate enough, could also be used to determine how much energy should be used now. For example, if the model predicts a large increase in energy usage in two hours, a moderate energy increase could be forced now to help meet the high future demand. Alternatively, if a low energy requirement is predicted for the day, pre-heating or pre-cooling times can be pushed later to decrease total consumption. Fundamentally, in order to control a pattern, one must first be able to model its behavior. Modeling the energy usage of 345 Park will allow the management to better understand their building's energy requirements, which inevitably leads to new and better ways of optimizing the system.

Data
The rate of energy consumption, called demand, depends on a number of factors. The time and day of the week affect energy demand most dramatically, because the building is primarily commercial and uses relatively little energy when it is closed at night and on weekends. By including past energy data in the prediction, this weekly cycle of high usage during regular business hours is learned by the model and used to produce future forecasts of energy demand. Another important factor in predicting the energy usage of the heating and air conditioning is the weather.
On particularly hot days, far more energy is required to cool the building, and more energy goes into heating the building on very cold days. Similarly, humidity and the presence of precipitation can change the perceived temperature of the building, which affects the amount of energy required to regulate temperature. Energy demand, temperature, and dew point temperature were all used in the creation of the computer model.

Methods
There are various kinds of models that can be used to predict and analyze the energy demand of a building. The DOE-2 model, created by the U.S. Department of Energy, takes inputs that characterize the physical aspects of the building in order to predict its energy needs ("Overview of DOE-2.2", Simulation Research Group, Lawrence Berkeley National Laboratory, University of California, 1998). SimaPro is a tool that evaluates the energy embedded in the building's materials and construction history and also predicts operational energy. For this study, a purely operational approach to energy demand forecasting was taken. Support vector regression, based on the support vector machine algorithm developed by V. Vapnik in 1995, was used to predict energy demand for 345 Park Avenue. This method requires no inputs pertaining to the physical characteristics of the building. Rather, it employs past hourly energy data and corresponding hourly weather data to create a model that predicts energy demand into the future. The strength of this model is that it can work for any building that has historical energy demand data.

A support vector machine is a machine learning tool that plots past data in a multi-dimensional space in order to output a regression. Each data point of past energy usage is graphed against 92 hours of past energy data and against 92 hours of corresponding temperature, dew point temperature, and energy consumption one year before. Each of these supporting energy, temperature, and dew point temperature values in a data point, called time delay coordinates, adds a new dimension to the model. When a long enough set of data points is given, each with its corresponding time delay coordinates, the computer can "learn" correlations between energy usage and the rest of the corresponding data. This learned model, along with the most recent data, is used to project into the future. A five-month hourly data set was used with a one-week test set, an RBF kernel, a gamma value of 0.1, and a C value of 200. The parameter values and the data included in the model were chosen by optimizing the R2 and RMSE values of the regressions.
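As an illustration of this setup, here is a minimal sketch (not the authors' code) that builds time-delay coordinates and fits an epsilon-SVR with scikit-learn. The 92-hour lag window, the RBF kernel, gamma = 0.1 and C = 200 come from the text; the synthetic data is an assumption, and for brevity only the current hour's weather is included rather than full 92-hour weather windows.

import numpy as np
from sklearn.svm import SVR

LAG = 92  # hours of history used as time delay coordinates (from the abstract)

def make_features(energy, temp, dewpt, lag=LAG):
    # Each row: `lag` past energy values plus the current temperature and dew point.
    X, y = [], []
    for t in range(lag, len(energy)):
        X.append(np.concatenate([energy[t - lag:t], [temp[t], dewpt[t]]]))
        y.append(energy[t])
    return np.array(X), np.array(y)

# synthetic stand-in for roughly five months of hourly data
rng = np.random.default_rng(0)
hours = np.arange(3600)
energy = 50 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, 3600)
temp = 15 + 10 * np.sin(2 * np.pi * hours / (24 * 365)) + rng.normal(0, 1, 3600)
dewpt = temp - 5 + rng.normal(0, 1, 3600)

X, y = make_features(energy, temp, dewpt)
train, test = slice(0, -168), slice(-168, None)  # hold out one week (168 hours)
model = SVR(kernel='rbf', gamma=0.1, C=200).fit(X[train], y[train])
print("R^2 on the held-out week:", model.score(X[test], y[test]))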
Results and Conclusions
The computer-generated models were reasonable estimates of future energy demand, with R2 values from 0.71 to 0.95. Additionally, by predicting an entire week into the future, these models can be quite useful for buildings looking to predict and streamline their energy usage. Even more accurate results could likely be achieved using multiple years of data, because seasonal cycles could be more easily learned by the computer model.

References
Burges, C. "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, vol. 2 (1998): 121-167.
Farmer, J.D. and Sidorowich, J.J. "Predicting Chaotic Time Series", Physical Review Letters, vol. 59 (1987).
Simulation Research Group, Lawrence Berkeley National Laboratory, University of California. "Overview of DOE-2.2", June 1998.
Xi, X.-C., Poo, A.-N., and Chou, S.-K. "Support Vector Regression Model Predictive Control on a HVAC Plant", Control Engineering Practice, vol. 15 (2007).
Xuemei, L., Jin-hu, L., Lixing, D., Gang, X., and Jibin, L. "Building Cooling Load Forecasting Model Based on LS-SVM", Asia Pacific Conference on Information Processing (2009).

Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
Zhen James Xiang, Hao Xu, Peter J. Ramadge
Department of Electrical Engineering, Princeton University
Abstract for the 6th Annual Machine Learning Symposium

Finding an effective representation of data is an important process in many machine learning applications. Representing a $p$-dimensional data point $x$ as a sparse linear combination of the elements of a dictionary of $m$ (possibly linearly dependent) codewords $B = [b_1, b_2, \ldots, b_m] \in \mathbb{R}^{p \times m}$ is an approach that has recently attracted much attention. As a further refinement, when given a training set $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n}$ of data points, the dictionary $B$ can be optimized to make the representation weights as sparse as possible. This leads to the following problem:

$$\min_{B, W} \ \frac{1}{2} \|X - BW\|_F^2 + \lambda \|W\|_1 \quad \text{s.t.} \quad \|b_i\|_2^2 \le 1, \ \forall i = 1, 2, \ldots, m. \qquad (1)$$

Here $\|\cdot\|_F$ and $\|\cdot\|_1$ denote the Frobenius norm and element-wise $l_1$-norm of a matrix, respectively. There are two major advantages to this adaptive, sparse representation method. First, in the spirit of many modern approaches (e.g. PCA, SMT [1], tree-induced bases [2]), rather than fixing $B$ a priori (e.g. Fourier, wavelet, DCT), problem (1) assumes minimal prior knowledge and uses sparsity as a cue to learn a dictionary adapted to the data. Second, the new representation $w$ is obtained by a nonlinear mapping of $x$. In many other approaches (including [1, 2]), although the codebook $B$ is cleverly chosen, the new representation $w$ is simply a linear mapping of $x$, e.g. $w = B^\dagger x$. As a final point, we note that the human visual cortex uses similar mechanisms to encode visual scenes [3], and sparse representation has exhibited superior performance on many difficult computer vision problems [4].

The challenge, however, is that solving (1) is computationally expensive for large scale dictionaries. Most state-of-the-art algorithms solve this non-convex optimization problem by iteratively optimizing $W$ and $B$, which results in large scale lasso and large scale constrained least squares problems. This has limited the use of sparse representations to applications of moderate scale. In this work, we develop efficient methods for learning sparse representations on large scale dictionaries by controlling the dictionary size $m$ and the data dimension $p$.

In each iterative update of $W$ with fixed $B$, we have to solve $n$ lasso problems, each with $m$ codewords (or regressors, in the standard lasso literature). To speed up this process when $m$ is large, we first investigate "dictionary screening" to identify codewords that are guaranteed to have zero coefficients in the solution. The standard result in this area is the SAFE rule [5].

Figure 1: The average rejection percentage of different screening tests (ST1/SAFE, ST2, ST3, on original and projected data) as a function of λ. Examples are from the COIL rotational image dataset.
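Stepping back to the underlying optimization, here is a compact sketch of the alternating scheme behind problem (1) (our illustration, not the authors' algorithm): fix B and solve one lasso problem per data point, then fix W and update B by projected gradient descent onto the unit-norm ball. scikit-learn's Lasso handles the W-step; the step size and iteration counts are arbitrary, untuned choices.

import numpy as np
from sklearn.linear_model import Lasso

def learn_dictionary(X, m, lam=0.1, outer_iters=10, lr=0.01, seed=0):
    p, n = X.shape
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(p, m))
    B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)  # enforce ||b_i||_2 <= 1
    W = np.zeros((m, n))
    for _ in range(outer_iters):
        # W-step: n independent lasso problems with regressors B.
        # sklearn scales the squared loss by 1/(2p), hence alpha = lam / p.
        lasso = Lasso(alpha=lam / p, fit_intercept=False, max_iter=2000)
        for j in range(n):
            W[:, j] = lasso.fit(B, X[:, j]).coef_
        # B-step: projected gradient descent on 0.5 * ||X - BW||_F^2
        for _ in range(50):
            B -= lr * (B @ W - X) @ W.T
            B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)  # project columns back
        print("objective:", 0.5 * np.linalg.norm(X - B @ W) ** 2 + lam * np.abs(W).sum())
    return B, W

# toy usage: 200 points in 20 dimensions, dictionary of 50 codewords
X = np.random.default_rng(1).normal(size=(20, 200))
B, W = learn_dictionary(X, m=50)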
We provide a new, general and intuitive geometrical framework for deriving screening tests, and using this framework we derive two new screening tests that are significantly better than all existing tests. The new tests perform particularly well when the data points and codewords are highly correlated, a typical scenario in sparse representation applications [4]. See Figure 1 for an illustrative example of the average rejection fractions obtained by our new tests (ST2, ST3) on a real data set.

We also examine projecting the data onto a lower dimensional space so that we can solve the sparse representations for smaller $p$. We prove some new theoretical properties of the simplest projection method: random projection. It is known that under mild assumptions random projection preserves the pairwise distances of data points. Projection thus provides an opportunity to learn informative sparse representations with fewer dimensions.

Finally, we combine the ideas of screening and projection into a hierarchical framework. This deals with large $m$ and $p$ simultaneously and in an integrated way. The framework uses random projections to incrementally extract information from the training data and uses this to refine the sparse representation in stages. At each stage, the prescreening process utilizes our new screening tests together with an imposed zero-tree constraint that leads to a hierarchically structured dictionary.

In Figure 2 we show the results of testing our framework on the MNIST digits data set [70,000 28x28 handwritten digit images; 60,000 training, 10,000 testing] and a face data set [38 subjects in the extended YaleB data set, each with 64 cropped frontal face views under differing lighting conditions, randomly divided into 32 training and 32 testing views]. After learning the dictionary, we use the sparse representation vectors $w_j$ as features to train a linear SVM classifier. For the digits data set, we tested the traditional sparse representation algorithm for different values of $m$ and $\lambda$, and we tested our new framework with several settings of hierarchical levels, dictionary sizes and numbers of random projections.

Figure 2: Left: MNIST: the tradeoff between classification accuracy and average encoding time for various sparse representation methods (each curve is for one value of m as λ varies), with baselines of the same linear classifier using 250 principal components and using the original pixel values. Right: Face recognition: the recognition rate (top) and average encoding time (bottom) as the number of random projections varies, comparing traditional sparse representation, our hierarchical framework, our framework with PCA projections, a linear classifier, and SRC (Wright et al., 2008).
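The distance-preservation property referred to above is the Johnson-Lindenstrauss phenomenon; the following quick numerical check (ours, with arbitrary sizes) shows how well a scaled random Gaussian projection preserves pairwise distances:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
p, k, n = 10000, 256, 100  # ambient dim, projected dim, num points (arbitrary)
X = rng.normal(size=(n, p))
R = rng.normal(size=(p, k)) / np.sqrt(k)  # scaled Gaussian projection matrix
ratios = pdist(X @ R) / pdist(X)          # projected vs. original distances
print("distance ratios: mean %.3f, std %.3f" % (ratios.mean(), ratios.std()))
# typically mean near 1 with small spread: distances are approximately preserved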
Compared to the traditional sparse representation, our hierarchical framework achieves roughly a 1% accuracy improvement at the same encoding time, and roughly a 2x speedup at the same accuracy. For the face data set, we tested the algorithms as we varied the number of random data projections. Our stage-wise random projection framework strikes a good balance between speed and accuracy.

References
[1] G. Cao and C.A. Bouman. Covariance estimation for high dimensional data vectors using the sparse matrix transform. In Advances in Neural Information Processing Systems, 2008.
[2] M. Gavish, B. Nadler, and R.R. Coifman. Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi supervised learning. In International Conference on Machine Learning, 2010.
[3] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3325, 1997.
[4] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 98(6):1031-1044, 2010.
[5] L.E. Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. arXiv preprint arXiv:1009.3515, 2010.

Using Iterated Reasoning to Predict Opponent Strategies
Michael Wunder mwunder@cs.rutgers.edu, John Robert Yaros yaros@cs.rutgers.edu
September 9, 2011

Abstract
The field of multiagent decision making is extending its tools beyond classical game theory by embracing reinforcement learning, statistical analysis, and opponent modeling. For example, behavioral economists conclude from experimental results that people act according to levels of reasoning that form a "cognitive hierarchy" of strategies, rather than merely following the hyper-rational Nash equilibrium solution concept. This paper expands this model of the iterative reasoning process by widening the notion of a level within the hierarchy from one single strategy to a distribution over strategies, leading to a more general framework of multi-agent decision making.
It provides a measure of sophistication for strategies and can serve as a guide for designing good strategies for multiagent games, drawing its main strength from predicting opponent strategies. We apply these lessons to the recently introduced Lemonade-Stand Game, a simple setting that includes both collaborative and competitive elements, where an agent's score is critically dependent on its responsiveness to opponent behavior. The opening moves are significant to the end result, and simple heuristics have outperformed intricate learning schemes. Using results from three open tournaments, we show how the submitted entries fit naturally into our model and explain why the top agents were successful.

1 Introduction
We have adapted level-based models of strategic thinking to operate in a repeated domain with multiple agents. Specifically, we detail a process for discovering the solutions to an Interactive POMDP in the context of a tournament between computer competitors in a multi-round location game called the Lemonade-Stand Game (LSG) [3]. In this recently introduced game, three lemonade-selling players arrange themselves on a circular beach with 12 locations to gain the most business from customers on the circle, where the rule is that customers simply buy from the closest vendor. No communication is allowed, so sellers must use only the move history to establish a strategy. The LSG has a number of properties that benefit players who have a prior model of others, as opposed to those expecting equilibrium or those who attempt to gather large amounts of data first. Iterated reasoning methods can be successful in these situations because they provide a range of feasible strategies that contain various degrees of sophistication. Interactive POMDPs provide sufficient generality for finding optimal solutions in these settings [1], but we have modified the process for repeated games to take advantage of the inherent structure of games like the LSG. Solving the iterated reasoning problem represented as an I-POMDP yields a hierarchy of actions, so that strategy selection becomes a problem of estimating the sophistication of the population. Our updated framework provides a hierarchy of likely strategies and a way to estimate the amount of reasoning in observed strategies [2], which were key components that earned our team victory in the most recent LSG competition.

2 LSG Level Computation
In the simple case used in the first two competitions, customers are equally distributed, so scores are equivalent to the distances between competitors. Our level-based model computes that the optimal action is to find a partner to set up directly across from, to the disadvantage of the third player. In addition, the model implies that stability is a prized quality of a good coalition member. Therefore, it is most beneficial to wait longer for a partner to join us, as this waiting period increases the likelihood that we will eventually have two partners (or suckers) instead of just one. The length of waiting time becomes a proxy for the amount of reasoning that each agent implements, allowing us to analyze the overall levels of sophistication in the population and how they change over time. As we see in Figure 1, the submitted agents in the second year by and large achieved this strategy. Additionally, the average level of reasoning increased between the first and second tournaments.

In the latest iteration of the competition, the demand is unequally distributed across the 12 spots and is known beforehand. Our model breaks the strategy into two parts: initial action selection, and a subsequent action if our initial action was suboptimal given the others' locations. In fact, the solution follows the same logic as in the previous game: choose a location such that the other players end up fighting over one half of the customers. Because our agent chooses the second best initial action, the most likely scenario is that the worst-off player will pick a position nearer the third player, to our benefit. If that initial choice fails, the action is changed once we find a location that creates the most rewarding coalition. This combination of strategies resulting from our model led us to victory in the latest competition (see Table 1).

Figure 1: Estimated levels of competitors (competition score vs. approximate level) in the 2009 and 2010 Lemonade-Stand Game tournaments. Both sets of agents show positive correlation between reasoning and performance; R2 values are 0.77 for 2009 and 0.34 for 2010. The more recent agents show a shift to higher reasoning levels, as well as a compression of scores.
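To make the payoff structure in the uniform-demand case concrete, here is a small sketch (our illustration under assumed rules, not the official tournament scoring): 12 customer spots on a circle, each buying from the nearest of three vendors, with ties split evenly. It reproduces the squeeze described above, where two players sitting directly across from each other disadvantage the third.

def circular_dist(a, b, n=12):
    # shortest distance between two spots on an n-location circle
    d = abs(a - b) % n
    return min(d, n - d)

def scores(positions, n=12):
    # assumed scoring: each of n equally weighted customer spots buys from
    # the nearest vendor; ties are split evenly among the nearest vendors
    totals = [0.0] * len(positions)
    for spot in range(n):
        dists = [circular_dist(spot, p, n) for p in positions]
        nearest = min(dists)
        winners = [i for i, d in enumerate(dists) if d == nearest]
        for i in winners:
            totals[i] += 1.0 / len(winners)
    return totals

# two players directly across from each other squeeze the third:
print(scores([0, 6, 3]))  # [4.5, 4.5, 3.0]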
3 Conclusion
In more general games, we hope that our solving process will lead to insights for incorporating game structure into these powerful models. The ability to find higher order strategies is crucial to limiting computation costs and optimizing against unknown populations.

Table 1: The 2011 Rankings.
Rank  Team     Score   Bound
1.    Rutgers  50.397  ±0.022
2.    Harvard  48.995  ±0.020
3.    Alberta  48.815  ±0.022
4.    Brown    48.760  ±0.023
5.    Pujara   47.883  ±0.020
6.    BMJoe    47.242  ±0.021
7.    Chapman  45.943  ±0.019
8.    GATech   45.271  ±0.021

References
[1] P. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multiagent settings. Journal of AI Research (JAIR), 24:49-79, 2005.
[2] M. Wunder, M. Kaisers, M. Littman, and J. R. Yaros. Using iterated reasoning to predict opponent strategies. International Joint Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2011.
[3] M. Zinkevich. The lemonade game competition. http://tech.groups.yahoo.com/group/lemonadegame/, December 2009.