Abstracts Booklet - The New York Academy of Sciences

Sixth Annual Machine Learning Symposium
October 21, 2011
ABSTRACTS FOR POSTER SESSIONS &
SPOTLIGHT TALKS
SCIENTIFIC ORGANIZING COMMITTEE
Naoki Abe, PhD
IBM Research
Corinna Cortes, PhD
Google
Patrick Haffner, PhD
AT&T Research
Tony Jebara, PhD
Columbia University
John Langford, PhD
Yahoo! Research
Mehryar Mohri, PhD
Courant Institute of Mathematical Sciences, NYU
Robert Schapire, PhD
Princeton University
ACKNOWLEDGMENT OF SUPPORT
Gold Sponsor
Academy Friends
Google
Yahoo! Labs
The “HackNY presentations for students” event is co-organized by Science Alliance with the
support of hackNY.
AGENDA
9:30 AM
Breakfast & Poster Set-up
10:00 AM
Opening Remarks
10:10 AM
Stochastic Algorithms for One-Pass Learning
Léon Bottou, PhD, Microsoft adCenter
10:55 AM
Spotlight Talks
Opportunistic Approachability
Andrey Bernstein, Columbia University
Large-Scale, Sparse Kernel Logistic Regression — With a Comparative Study on
Optimization Algorithms
Shyam S. Chandramouli, Columbia University
Online Clustering with Experts
Anna Choromanska, PhD, Columbia University
Efficient Learning of Word Embeddings via Canonical Correlation Analysis
Paramveer Dhillon, University of Pennsylvania
A Reliable, Effective, Terascale Linear Learning System
Miroslav Dudik, PhD, Yahoo! Research
11:20 AM
Networking and Poster Session
12:05 PM
Online Learning without a Learning Rate Parameter
Yoav Freund, PhD, University of California, San Diego
1:00 PM
Networking Lunch
2:30 PM
Spotlight Talks
Large-Scale Collection Threading Using Structured k-DPPs
Jennifer Gillenwater, University of Pennsylvania
Online Learning for Mixed Membership Network Models
Prem Gopalan, Princeton University
Planning in Reward Rich Domains via PAC Bandits
Sergiu Goschin, Rutgers University
The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte
Carlo
Matthew D. Hoffman, PhD, Columbia University
Place Recommendation with Implicit Spatial Feedback
Berk Kapicioglu, Princeton University, Sense Networks
3:00 PM
Distributed Optimization and Statistical Learning via the Alternating
Direction Method of Multipliers
Stephen Boyd, PhD, Stanford University
3:45 PM
Spotlight Talks
Hierarchically Supervised Latent Dirichlet Allocation
Adler Perotte, MD, Columbia University
Image Super-Resolution via Dictionary Learning
Gungor Polatkan, Princeton University
MirroRank: Convex Aggregation and Online Ranking with the Mirror Descent
Benoit Rostykus, Ecole Centrale Paris
Preserving Proximity Relations and Minimizing Edge-crossings in Graph
Embeddings
Amina Shabbeer, Rensselaer Polytechnic Institute
A Reinforcement Learning Approach to Variational Inference
David Wingate, PhD, Massachusetts Institute of Technology
4:10 PM
Networking and Poster Session
5:00 PM
Student Award Winner Announcement & Closing Remarks
5:15 PM
End of Program
5:30 PM
HackNY Presentations for Students
Foursquare, Hunch, Intent Media, Etsy, Media6Degrees, Flurry
SPEAKERS’ BIOGRAPHIES
Léon Bottou, PhD
Microsoft adCenter
Léon Bottou received the Diplôme d'Ingénieur de l'Ecole Polytechnique (X84) in 1987, the Magistère de
Mathématiques Fondamentales et Appliquées et d'Informatique from Ecole Normale Supérieure in 1988, the
Diplôme d'Etudes Approfondies in Computer Science in 1988, and a PhD in Computer Science from LRI,
Université de Paris-Sud in 1991.
After his PhD, Bottou worked at AT&T Bell Laboratories from 1991 to 1992. He then became chairman of
Neuristique, a small company pioneering machine learning for data mining applications. He returned to
AT&T Labs from 1995 to 2002 and NEC Labs America at Princeton from 2002 to March 2010. He joined the
Science Team of Microsoft Online Service Division in April 2010.
Bottou's primary research interest is machine learning. His contributions to the field cover both theory and
applications, with a particular interest in large-scale learning. Bottou's secondary research interest is data
compression and coding. His best known contribution in this field is the DjVu document compression
technology. Bottou has published over 80 papers and won the 2007 New York Academy of Sciences Blavatnik
Award for Young Scientists. He is serving or has served on the boards of the Journal of Machine Learning
Research and IEEE Transactions on Pattern Analysis and Machine Intelligence.
Yoav Freund, PhD
University of California, San Diego
Yoav Freund is a professor of Computer Science and Engineering at University of California, San Diego. His
work is in the area of machine learning, computational statistics, information theory, and their applications. He
is best known for his joint work with Dr. Robert Schapire on the AdaBoost algorithm. For this work they were
awarded the 2003 Gödel Prize in Theoretical Computer Science, as well as the Kanellakis Prize in 2004.
Stephen P. Boyd, PhD
Stanford University
Stephen P. Boyd is the Samsung Professor of Engineering, and Professor of Electrical Engineering in the
Information Systems Laboratory at Stanford University. He received the A.B. degree in Mathematics from
Harvard University in 1980, and the Ph.D. in Electrical Engineering and Computer Science from the University
of California, Berkeley, in 1985. Then he joined the faculty at Stanford. His current research focus is on convex
optimization applications in control, signal processing, and circuit design.
SPEAKERS’ ABSTRACTS
Stochastic Algorithms for One-Pass Learning
Léon Bottou, Microsoft adCenter
The goal of the presentation is to describe practical stochastic gradient algorithms that process each training
example only once, yet asymptotically match the performance of the true optimum. This statement needs, of
course, to be made more precise. To achieve this, we'll review the works of Nevel'son and Has'minskij (1972),
Fabian (1973, 1978), Murata & Amari (1998), Bottou & LeCun (2004), Polyak & Juditsky (1992), Wei Xu (2010),
and Bach & Moulines (2011). We will then show how these ideas lead to practical algorithms that not only
represent a new state of the art but are also arguably optimal.
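As a rough illustration of the one-pass idea, the sketch below runs plain stochastic gradient descent with iterate averaging (in the spirit of the Polyak & Juditsky reference above) on a synthetic least-squares problem, touching each example exactly once. The data, the step-size schedule, and the function name `averaged_sgd_one_pass` are assumptions made for this example; it is not the specific algorithm presented in the talk.

```python
import numpy as np

def averaged_sgd_one_pass(X, y, gamma0=0.05, power=0.5):
    """Single pass of SGD on squared loss; returns last and averaged iterates.

    gamma0 and power define an assumed step-size schedule
    gamma_t = gamma0 / (1 + t) ** power, chosen only for illustration.
    """
    n, d = X.shape
    w = np.zeros(d)       # current iterate
    w_bar = np.zeros(d)   # running average of the iterates
    for t in range(n):    # each training example is visited exactly once
        x_t, y_t = X[t], y[t]
        grad = (x_t @ w - y_t) * x_t          # gradient of 0.5 * (x.w - y)^2
        w = w - gamma0 / (1.0 + t) ** power * grad
        w_bar += (w - w_bar) / (t + 1)        # incremental mean update
    return w, w_bar

# Synthetic data (hypothetical): linear model with small Gaussian noise.
rng = np.random.default_rng(0)
n, d = 20000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

w_last, w_avg = averaged_sgd_one_pass(X, y)
err_avg = np.linalg.norm(w_avg - w_true)
```

Averaging costs one extra vector of memory and no extra passes over the data, which is why it is a natural companion to single-pass training.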
Online Learning without a Learning Rate Parameter
Yoav Freund, PhD, University of California, San Diego
Online learning is an approach to statistical inference based on the idea of playing a repeated game. A "master"
algorithm receives the predictions of N experts before making its own prediction. Then the outcome is revealed,
and experts and master suffer a loss.
Algorithms have been developed for which the regret, the difference between the cumulative loss of the
master and the cumulative loss of the best expert, is bounded uniformly over all sequences of expert
predictions and outcomes.
The most successful algorithms of this type are the exponential weights algorithms discovered by Littlestone
and Warmuth and refined by many others. The exponential weights algorithm has a parameter, the learning
rate, which has to be tuned appropriately to achieve the best bounds. This tuning typically depends on the
number of experts and on the cumulative loss of the best expert. We describe a new algorithm, NormalHedge,
which has no parameter and achieves comparable bounds to tuned exponential weights algorithms.
As the algorithm does not depend on the number of experts it can be used effectively when the set of experts
grows as a function of time and when the set of experts is uncountably infinite.
In addition, the algorithm has a natural extension for continuous time and has a very tight analysis when the
cumulative loss is described by an Ito process.
This is joint work with Kamalika Chaudhuri and Daniel Hsu.
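To make the role of the learning rate concrete, here is a minimal sketch of the classical exponential weights (Hedge) master algorithm that the abstract contrasts with NormalHedge. The loss matrix is synthetic, and the tuning `eta = sqrt(8 ln N / T)` is one textbook choice that assumes the horizon T is known in advance; NormalHedge itself is not shown.

```python
import numpy as np

def hedge(losses, eta):
    """Exponential weights over N experts.

    losses: (T, N) array with entries in [0, 1].
    eta: the learning rate that must be tuned by hand.
    Returns the master's cumulative (expected) loss and its regret.
    """
    T, N = losses.shape
    log_w = np.zeros(N)                      # log-weights, for numerical stability
    master_loss = 0.0
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        p = w / w.sum()                      # distribution over experts
        master_loss += p @ losses[t]         # master suffers the mixture loss
        log_w -= eta * losses[t]             # multiplicative weight update
    regret = master_loss - losses.sum(axis=0).min()
    return master_loss, regret

# Synthetic losses (hypothetical): expert 3 is better on average.
rng = np.random.default_rng(1)
T, N = 2000, 10
losses = rng.uniform(size=(T, N))
losses[:, 3] *= 0.5

eta = np.sqrt(8.0 * np.log(N) / T)           # tuning depends on N and T
master_loss, regret = hedge(losses, eta)
bound = np.sqrt(T * np.log(N) / 2.0)         # standard regret bound for this tuning
```

The dependence of `eta` on `N` and `T` is exactly the kind of tuning a parameter-free method is meant to remove.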
Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
Stephen P. Boyd, Stanford University
Problems in areas such as machine learning and dynamic optimization on a large network lead to extremely
large convex optimization problems, with problem data stored in a decentralized way, and processing
elements distributed across a network. We argue that the alternating direction method of multipliers is well
suited to such problems. The method was developed in the 1970s, with roots in the 1950s, and is equivalent to
or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas-Rachford
splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative
algorithms for $\ell_1$ problems, proximal methods, and others. After briefly surveying the theory and
history of the algorithm, we discuss applications to statistical and machine learning problems such as the lasso
and support vector machines, and to dynamic energy management problems arising in the smart grid.
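For the lasso application mentioned above, the ADMM iteration can be sketched in a few lines. This is a minimal single-machine version under assumed settings (penalty parameter `rho`, fixed iteration count, synthetic data), using the usual splitting into a least-squares block and an $\ell_1$ block; it does not show the distributed variants discussed in the talk.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def lasso_admm(A, b, lam, rho=10.0, n_iter=500):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||z||_1 subject to x = z."""
    n = A.shape[1]
    AtA_rho = A.T @ A + rho * np.eye(n)   # system matrix, fixed across iterations
    Atb = A.T @ b
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)                       # scaled dual variable
    for _ in range(n_iter):
        x = np.linalg.solve(AtA_rho, Atb + rho * (z - u))  # ridge-like x-update
        z = soft_threshold(x + u, lam / rho)               # shrinkage z-update
        u = u + x - z                                      # dual update
    return z

# Synthetic sparse regression problem (hypothetical data).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true + 0.01 * rng.normal(size=100)

z = lasso_admm(A, b, lam=1.0)
kkt_residual = np.max(np.abs(A.T @ (A @ z - b)))  # should not exceed lam at a solution
```

Each update is either a linear solve or an elementwise operation, which is what makes the method easy to split across machines when the data are partitioned.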
INDEX OF POSTER ABSTRACTS
Multiple Instance Regression without the Ambiguity
Charles Bergeron, Rensselaer Polytechnic Institute
Opportunistic Approachability
Andrey Bernstein*, Columbia University
Large-Scale, Sparse Kernel Logistic Regression — With a Comparative Study on Optimization Algorithms
Shyam S Chandramouli*, Columbia University
Visualizing Topic Models
Allison Chaney, Princeton University
Weakly Supervised Neural Nets for POS Tagging
Sumit Chopra, AT&T Labs Research
Online Clustering with Experts
Anna Choromanska*, Columbia University
Efficient Learning of Word Embeddings via Canonical Correlation Analysis
Paramveer Dhillon*, University of Pennsylvania
Improved Semi-Supervised Learning Using Constraints and Relaxed Entropy Regularization via Deterministic Annealing
Paramveer Dhillon, University of Pennsylvania
A Reliable Effective Terascale Linear Learning System
Miroslav Dudik*, Yahoo! Research
Density Estimation Based Ranking from Decision Trees
Haimonti Dutta, Columbia University
Optimization Techniques for Large-Scale Learning
Clement Farabet, New York University
Real-time, Multi-class Segmentation using Depth Cues
Clement Farabet, New York University
A Text-based HMM Model of Foreign Affairs Sentiment: A Mechanical Turker’s History of Recent Geopolitical Events
Sean Gerrish, Princeton University
A Probabilistic Foundation for Policy Priors
Sam Gershman, Princeton University
Large-Scale Collection Threading Using Structured k-DPPs
Jennifer Gillenwater*, University of Pennsylvania
Online Learning for Mixed Membership Network Models
Prem Gopalan*, Princeton University
Planning in Reward Rich Domains via PAC Bandits
Sergiu Goschin*, Rutgers University
Nonparametric, Multivariate Convex Regression with Applications to Value Function Approximation
Lauren A. Hannah, Duke University
The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo
Matt Hoffman*, Columbia University
Distributed Collaborative Filtering Over Social Networks
Sibren Isaacman, Princeton University
Can Public Data Help With Differentially-Private Machine Learning?
Geetha Jaganathan, Columbia University
Place Recommendation with Implicit Spatial Feedback
Berk Kapicioglu*, Princeton University
Recovering Euclidean Distance Matrices via Landmark MDS
Akshay Krishnamurthy, Carnegie Mellon University
Efficient Evaluation of Large Sequence Kernels
Pavel Kuksa, NEC Laboratories America, Inc
Unsupervised Hashing with Graphs
Wei Liu, Columbia University
Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction
Siwei Lyu, State University of New York, Albany
Active Prediction in Graphical Models
Satyaki Mahalanabis, University of Rochester
An Ensemble of Linearly Combined Reinforcement-Learning Agents
Vukosi Marivate, Rutgers University
Autonomous RF Surveying Robot for Indoor Localization and Tracking
Piotr Mirowski, Bell Labs, Alcatel-Lucent Statistics and Learning Department
A Comparison of Text Analysis and Social Network Analysis Using Twitter Data
John Myles White, Princeton University
Time-dependent Dirichlet Process Mixture Models for Multiple Target Tracking
Willie Neiswanger, Columbia University
An Efficient and Optimal Scrabble Playing Algorithm Based on Modified DAWG and Temporal Difference Lambda Learning Algorithm
Manjot Pahwa, University of Delhi
Hierarchically Supervised Latent Dirichlet Allocation
Adler Perotte*, Columbia University
Image Super-Resolution via Dictionary Learning
Gungor Polatkan*, Princeton University
Structured Sparsity via Alternating Directions Methods
Zhiwei Tony Qin, Columbia University
Ranking Annotators for Crowdsourced Labeling Tasks
Vikas Raykar, Siemens Healthcare
Extracting Latent Economic Signal from Online Activity Streams
Joseph Reisinger, Metamarkets Group
Intrinsic Gradient Networks
Jason Rolfe, New York University
MirroRank: Convex Aggregation and Online Ranking with the Mirror Descent
Benoit Rostykus*, Ecole Centrale Paris, ENS Cachan
Preserving Proximity Relations and Minimizing Edge-crossings in Graph Embeddings
Amina Shabbeer*, Rensselaer Polytechnic Institute
A Learning-based Approach to Enhancing Checkout Non-Compliance Detection
Hoang Trinh, IBM T J Watson Center
Ensemble Inference on Single-Molecule Time Series Measurements
Jan-Willem Van de Meent, Columbia University
Bisection Search in the Presence of Noise
Rolf Waeber, Cornell University
PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework
Shen Wang, Columbia University
A Reinforcement Learning Approach to Variational Inference
David Wingate*, MIT
Using Support Vector Machine to Forecast Energy Usage of a Manhattan Skyscraper
Rebecca Winter, Columbia University
Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
Zhen James Xiang, Princeton University
Using Iterated Reasoning to Predict Opponent Strategies
Michael Wunder, Rutgers University
Multiple instance regression without the ambiguity
Charles Bergeron
chbergeron@gmail.com
Department of Electrical, Systems and Computer Engineering
Rensselaer Polytechnic Institute, 110 8th Street, Troy, New York, 12180
Multiple instance learning. Multiple instance learning (MIL) is a variation of supervised learning where
the response is only known for bags of items (or instances), instead of for each item. Many MIL formulations
have been proposed, including classification [1], regression [2] and partial ranking [3]. Typically in MIL,
there exists an ambiguity as to which item in a bag determines the response (a classification label, ranking
information or regression value, as the case may be) for the entire bag. This ambiguity stems from either
a lack of knowledge (e.g. chemists don’t know what conformations cause a compound to smell musky)
or a desire for automation (e.g. elephants are easy to identify on an image but annotating a large number
of images is a boring task). Fundamentally, the ambiguity exists because it is assumed that a single item
determines the response for the whole bag or there exists a lack of knowledge as to how multiple items
contribute to the bag’s response. Selecting a single item to stand for the entire bag is common in MIL and is
easily formulated using the max operator
$$f_i(w) = \max_j \{x_{ij}^T w\} \qquad (1)$$
where $f_i$ is the prediction for bag i, $x_{ij}$ contains the features corresponding to item j of bag i and w is a
vector containing the model parameters.
Multi-instance aggregation. What if the combination of several indicators is a better predictor of hard
drive failure, or could several passages in a document better determine its relevancy? Item aggregation to
form a bag prediction is a complicated task. For example, we could take a convex combination of items
weighted by coefficients $\lambda_j$:
$$f_i(w) = \sum_{j}^{n} \lambda_j x_{ij}^T w. \qquad (2)$$
However, that makes for a lot of extra parameters to estimate, increasing the risk of overfitting. Simply
averaging items (setting $\lambda_j = \frac{1}{n}$) to form an aggregate prediction does not work well in practice as the
contribution of critical items is diluted. Recently, [4] proposed aggregating items using a log-mean-exp
function controlled by η > 0:
$$f_i(w) = \frac{1}{\eta} \ln\left( \sum_j \exp(\eta x_{ij}^T w) \right). \qquad (3)$$
Under this formulation, terms with larger $x_{ij}^T w$ are weighted more heavily
as η → ∞, and only a single parameter η is added, as opposed to the n extra parameters of
Eq. 2. They obtained best partial ranking models using this form, thereby showing that multi-instance
aggregation can be successful. However, we do not know whether their log-sum-exp formulation is optimal.
Put otherwise, the ambiguity remains, in that they are required to attempt various strategies to handle the
ambiguity and select one of them (in an unbiased manner).
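The three aggregation rules discussed above (the max of Eq. 1, the plain average, and the η-controlled log-sum-exp of Eq. 3) can be compared on a toy bag. The feature matrix and weight vector below are made-up values for illustration only; the functions follow the equations as written, with a standard max-shift for numerical stability.

```python
import numpy as np

def bag_max(X, w):
    """Eq. 1: a single item (the best-scoring one) stands for the bag."""
    return np.max(X @ w)

def bag_mean(X, w):
    """Eq. 2 with lambda_j = 1/n: plain averaging of item scores."""
    return np.mean(X @ w)

def bag_log_sum_exp(X, w, eta):
    """Eq. 3: (1/eta) * ln(sum_j exp(eta * x_j^T w)), computed stably."""
    s = eta * (X @ w)
    m = s.max()                                   # shift to avoid overflow
    return (m + np.log(np.exp(s - m).sum())) / eta

# One bag with three items and two features (hypothetical numbers).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
w = np.array([2.0, -1.0])

f_max = bag_max(X, w)
f_mean = bag_mean(X, w)
f_soft = bag_log_sum_exp(X, w, eta=1.0)
f_sharp = bag_log_sum_exp(X, w, eta=50.0)        # approaches the max as eta grows
```

With a large η the soft aggregate is numerically indistinguishable from the max, while a small η lets several items contribute, which is the trade-off a single parameter controls.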
Multispecies multimode binding model. Physiological compounds, drugs, and other man-made bioactive chemicals containing ionizable substructures are analyzed in a multiple-species context that accounts for
the various binding species. It is becoming increasingly clear that this analysis must be further broken down
to consider the effect of multiple modes [5]. Incorporating these molecular representations can improve our
understanding of binding for improved drug design.
The relationship between these representations and the binding affinity is derived from chemical thermodynamics principles (and the assumption that the overall interaction energy is equal to the sum of pairwise
energies between atoms of the interacting ligands and those of the binding site) [5]. Our model has the form
$$BA_i(w, \gamma) = \ln\left( \sum_{j=1}^{n} \phi_{ij} \exp(x_{ij}^T w) \right) \qquad (4)$$
where $BA_i$ is the binding affinity of compound i and $0 \le \phi_{ij} \le 1$ is the species fraction corresponding to
molecular representation j such that $\sum_j \phi_{ij} = 1$. Comparing with Eq. 3, we set η = 1 and weigh each
exponential term by $\phi_{ij}$. Subject to the influence of the species fractions, this formulation behaves as Eq. 3.
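A small numerical check of Eq. 4 may help: the sketch below evaluates the weighted log-sum-exp for one compound and confirms two limiting cases, namely a single dominant species and equal species fractions. The features, weights, and fractions are invented for illustration and carry no chemical meaning.

```python
import numpy as np

def binding_affinity(X, phi, w):
    """Eq. 4: ln(sum_j phi_j * exp(x_j^T w)), with a max-shift for stability."""
    s = X @ w
    m = s.max()
    return m + np.log(np.sum(phi * np.exp(s - m)))

# One compound with three molecular representations (hypothetical values).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
w = np.array([2.0, -1.0])
scores = X @ w

# Case 1: all mass on the first species, so the bag reduces to that item.
ba_single = binding_affinity(X, np.array([1.0, 0.0, 0.0]), w)

# Case 2: equal fractions, i.e. Eq. 3 with eta = 1 shifted by -ln(n).
n = X.shape[0]
ba_uniform = binding_affinity(X, np.full(n, 1.0 / n), w)
lse = np.log(np.exp(scores).sum())

# Case 3: generic fractions; the result lies between the extreme item scores.
ba_mixed = binding_affinity(X, np.array([0.7, 0.2, 0.1]), w)
```

Because the species fractions sum to one, the prediction always lies between the smallest and largest item score, which is what makes the aggregation known rather than ambiguous.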
This is a MIL model in that the binding affinity is known for the compounds (bags) and the features
are known for the molecular representations (items) [6]. However, unlike most MIL models, we know how
these molecular representations aggregate to determine each compound’s response [6]. This is a positive
result for two reasons:
• There exist MIL problems for which the form of item aggregation is known, and these forms may
translate well to other applications.
• In the absence of the usual MIL ambiguity, greater effort can be put on estimating the parameters,
which can be challenging as MIL problems are generally nonconvex.
Linking empirical results from [4] with analytical observations from [6] strengthens the case for aggregation formulations involving multiple instances.
References
[1] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural
Information Processing Systems, volume 15, 2003.
[2] Soumya Ray and David Page. Multiple instance regression. In Proceedings of the International Conference on Machine Learning, volume 18,
pages 425–432, 2001.
[3] Charles Bergeron, Gregory Moore, Jed Zaretzki, Curt M. Breneman, and Kristin P. Bennett. Fast bundle algorithm for multiple instance
learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, (In press), 2011.
[4] Yang Hu, Mingjing Li, and Nenghai Yu. Multiple instance ranking: Learning to rank images for image retrieval. In IEEE Conference on
Computer Vision and Pattern Recognition, 2008.
[5] Senthil Natesan, Tiansheng Wang, Viera Lukacova, Vladimir Bartus, Akash Khandelwal, and Stefan Balaz. Rigorous treatment of multispecies
multimode ligand-receptor interactions in 3D-QSAR: CoMFA analysis of thyroxine analogs binding to transthyretin. Journal of Chemical
Information and Modeling, 51:1132–1150, 2011.
[6] Charles Bergeron, Senthil Natesan, and Stefan Balaz. Controlling model capacity: An exploration of multispecies multimode binding affinity
modeling. In Proceedings of the International Conference on Pattern Recognition Applications and Methods, Under review.
Opportunistic Approachability
Andrey Bernstein†∗, Shie Mannor†, and Nahum Shimkin†
September 9, 2011
Blackwell’s approachability has played a key role in the analysis of regret minimization within
the on-line learning framework and in the theory of learning in games. The notion of approachability,
as introduced in Blackwell (1956), concerns a repeated matrix game with vector-valued payoffs that
is played by two players, the agent and the opponent. Thus, for each pair of simultaneous actions
a and z, a payoff vector $r(a, z) \in \mathbb{R}^\ell$ is obtained. Given a closed set S in $\mathbb{R}^\ell$, the agent’s goal is to
have the long-term average reward vector approach S, namely converge to S almost surely in the
point-to-set distance. If that convergence can be ensured irrespective of the opponent’s strategy,
the set S is said to be approachable, and a strategy of the agent that satisfies this property is an
approaching strategy (or algorithm) of S.
Approachability has close connections with on-line algorithms, and in particular with the concept
of no-regret. In fact, soon after the no-regret play was introduced in Hannan (1957), it was shown by
Blackwell (1954) that this problem can be formulated as an approachability problem for a suitably
defined set and payoff vector.
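The link between no-regret play and approachability can be illustrated with regret matching (Hart and Mas-Colell), a standard strategy derived from Blackwell's argument: the vector of average regrets is steered toward the nonpositive orthant. The sketch below is an assumption-laden illustration, not the opportunistic algorithm of this abstract; it uses expected rewards of the mixed action and a synthetic reward sequence.

```python
import numpy as np

def regret_matching(rewards):
    """Run regret matching on a reward sequence.

    rewards: (T, N) array, rewards[t, a] in [0, 1] is the reward action a
    would have earned at round t (determined by the opponent's move).
    Returns the average regret vector and its Euclidean distance to the
    nonpositive orthant (the target set in the approachability view).
    """
    T, N = rewards.shape
    cum_regret = np.zeros(N)
    for t in range(T):
        pos = np.maximum(cum_regret, 0.0)
        total = pos.sum()
        # Play proportionally to positive regrets; uniform if there are none.
        p = pos / total if total > 0 else np.full(N, 1.0 / N)
        expected = p @ rewards[t]             # expected reward of the mixed action
        cum_regret += rewards[t] - expected   # vector-valued payoff for this round
    avg_regret = cum_regret / T
    dist = np.linalg.norm(np.maximum(avg_regret, 0.0))
    return avg_regret, dist

# Synthetic reward sequence (hypothetical opponent).
rng = np.random.default_rng(0)
T, N = 10000, 3
rewards = rng.uniform(size=(T, N))

avg_regret, dist = regret_matching(rewards)
rate = np.sqrt(N / T)   # distance guarantee for rewards in [0, 1]
```

The distance to the orthant shrinks at rate $\sqrt{N/T}$ for any reward sequence, which is the worst-case flavor of guarantee that the opportunistic goals described next aim to refine.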
By its very definition, an approachable set accommodates a worst-case scenario, as it should
be approached for any strategy of the opponent. However, as the game unfolds it may turn out
that the opponent’s actions are limited in some sense. For example, the opponent may choose to
use only a subset of his available actions, or even employ a stationary strategy, namely restrict
to a single mixed action. If these self-imposed restrictions were known in advance, the agent
could possibly ensure convergence to some strict subset of the target set S. Our goal here is to
formulate opportunistic approachability algorithms, which attain similar goals in an online manner,
without knowing in advance the opponent’s strategy restrictions. To this end, we need to formulate
meaningful opportunistic goals that are to be pursued by such algorithms.
Our starting point will be Blackwell’s dual approachability condition. Let r(p, q) denote the
expected payoff vector given mixed actions p and q of the agent and opponent, respectively. Blackwell’s dual condition¹ requires that for any q there exists a p so that r(p, q) ∈ S. This condition is
clearly necessary for a closed set S to be approachable. Indeed, if it fails to hold for some q, the
opponent may exclude S simply by playing q repeatedly. Remarkably, Blackwell showed that this
condition is also sufficient when S is convex. We refer to a set S that satisfies the dual condition
as a D-set. Note that a convex D-set is approachable. More generally, the convex hull conv(S) of
any D-set S is approachable.
By definition, if S is a D-set, there exists a map $p^*(q)$ so that $r(p^*(q), q) \in S$ for all q. Fix one
such map, and call it the response function. A feasible goal for an opportunistic approachability
algorithm can now be specified: Suppose that the opponent’s actions are empirically restricted in
some (possibly asymptotic) sense to a subset Q of his mixed actions. The average reward vector
should then converge to $\mathrm{conv}\{r(p^*(q), q),\ q \in Q\}$. Clearly, the latter is included in conv(S), and
may be considerably smaller, depending on Q. In particular, if Q is a singleton then so is the latter
set.
∗ Department of Electrical Engineering, Columbia University
† Department of Electrical Engineering, Technion
¹ Blackwell’s primal condition is a geometric separation condition, which forms the basis for standard approachability policies. The equivalence of the primal and dual conditions for convex sets is a generalization of the standard
minimax theorem, as reflected in the title of Blackwell’s paper (Blackwell, 1956).
The central contribution of this paper is an approachability algorithm that is opportunistic in
the above-mentioned sense. To the best of our knowledge, this is the first algorithm to offer such
guarantees. The proposed algorithm is based on on-line convex programming, here implemented via
the gradient algorithm due to Zinkevich (2003), with its scalar rewards obtained by projecting the
original vector payoffs onto properly selected directions. The algorithm is computationally simple,
in the sense that it does not require explicit identification of the restriction set Q, nor of the convex
hull associated with the reduced target set.
As an application of opportunistic approachability, we consider the problem of on-line regret
minimization in the presence of average cost constraints. That problem generalizes the standard
no-regret framework by adding side constraints, and was recently applied in Bernstein et al. (2010)
to the problem of online classification with specificity constraints. As shown in Mannor et al. (2009),
the constrained no-regret problem can be naturally formulated as an approachability problem with
respect to a target set S that traces the best-reward-in-hindsight given the empirical frequencies
of the opponent’s actions, subject to the specified constraints. While this set is a D-set by its
definition, it turns out to be non-convex in general and therefore need not be approachable. Our
opportunistic approachability algorithm applies to this problem, with the response function p∗ (q)
naturally selected as the constrained best-response. The algorithm approaches conv(S) without requiring its computation. Furthermore, in the fortuitous case of a stationary opponent, the algorithm
converges to the set S itself.
References
A. Bernstein, S. Mannor, and N. Shimkin. Online classification with specificity constraints. In
NIPS, 2010.
D. Blackwell. Controlled random walks. In Proceedings of the International Congress of Mathematicians, volume III, pages 335–338, 1954.
D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics,
6:1–8, 1956.
J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games,
3:97–139, 1957.
S. Mannor, J. N. Tsitsiklis, and J. Y. Yu. Online learning with sample path constraints. Journal
of Machine Learning Research, 10:569–590, 2009.
M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML ’03), pages 928–936,
2003.
Large-Scale Sparse Kernel Logistic Regression
— with a comparative study on optimization algorithms
ABSTRACT
Kernel Logistic Regression (KLR) is a powerful probabilistic classification tool, but its training and testing both suffer from severe computational bottlenecks when used with
large-scale data. Traditionally, L1-penalty is used to induce
sparseness in the parameter space for fast testing. However,
most of the existing optimization methods for training L1-penalized KLR do not scale well in large-scale settings. In
this work, we present highly scalable training of the KLR model
via three first-order optimization methods: the Fast Iterative
Shrinkage-Thresholding Algorithm (FISTA), Coordinate Gradient Descent (CGD), and a variant of Stochastic Gradient Descent
(SGD) method. To further reduce the space and time complexity, we apply a simple kernel linearization technique
which achieves similar results at a fraction of the computational cost. While SGD appears the fastest in training
large-scale data, we show that CGD performs considerably
better in some cases on various quality measures. Based
on this observation, we propose a multi-scale extension of
FISTA which improves its computational performance significantly in practice while preserving the theoretical global
convergence rate. We further propose a two-stage active
set training scheme for CGD and FISTA, which boosts the
prediction accuracies by up to 4%. Extensive experiments
on several data sets containing up to millions of samples
demonstrate the effectiveness of our approach.
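The kernel linearization mentioned in the abstract can be illustrated with random Fourier features for the RBF kernel. The sketch below assumes the kernel form exp(-||x - y||^2 / σ^2) used later in the paper and draws frequencies from the matching Gaussian; the data, the number of features `D`, and the function names are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """Exact kernel k(x, y) = exp(-||x - y||^2 / sigma^2)."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / sigma**2)

def random_fourier_features(X, D, sigma, rng):
    """Map X to D features z(x) with E[z(x) . z(y)] = k(x, y).

    For this kernel the frequencies are Gaussian with standard deviation
    sqrt(2) / sigma; the phases b are uniform on [0, 2*pi].
    """
    n_features = X.shape[1]
    omega = rng.normal(scale=np.sqrt(2.0) / sigma, size=(n_features, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)

# Synthetic inputs (hypothetical): 50 points with 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
sigma = 4.0

K_exact = rbf_kernel(X, X, sigma)
Z = random_fourier_features(X, D=4000, sigma=sigma, rng=rng)
K_approx = Z @ Z.T                      # inner products of the linearized data
max_err = np.abs(K_exact - K_approx).max()
```

Once the data are mapped to `Z`, any linear L1-regularized logistic regression solver can be applied to it, so the N-by-N kernel matrix never has to be formed.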
1. INTRODUCTION
1.1 Kernel Logistic Regression
Kernel Logistic Regression (KLR) is a powerful probabilistic classification tool. Given N training points, the following minimization is used to train KLR:
$$\min_w F(w) = -\sum_{i=1}^{N} \log(\sigma(y_i w^T k_i)) \qquad (1)$$
where $k_i$ is the i-th column of the kernel matrix K of the training data, and $\sigma(v) = 1/(1 + \exp(-v))$. The possibility of using different kernels allows one to learn a nonlinear classifier in the feature space. However, its training and testing both suffer from severe computational bottlenecks when used with large-scale data. Moreover, when the number of training data is small compared to the number of the features, straightforward KLR training may lead to over-fitting. KLR with L1-regularization, in which an extra term $\lambda\|\cdot\|_1$ is added to penalize large w, has received considerable attention since it enforces sparsity in w. Specifically, L1-regularized KLR often yields a sparse solution w whose nonzero components correspond to the underlying more “meaningful” features. Therefore, it greatly reduces the testing time, especially when the feature dimension is huge, because only a few features are used in computation. Indeed, L1-regularized KLR makes it possible to learn and apply KLR at large scale with performance similar to that of L2-regularization. It has been shown that L1-regularization can outperform L2-regularization, especially when the number of observations is smaller than the number of features. Thus, the L1-regularization technique has been widely used in many other problems, such as compressed sensing and the Lasso, despite the fact that it is more challenging to solve L1-regularized problems than L2-regularized ones due to the non-smoothness of $\|\cdot\|_1$. In this paper, we consider the following Sparse Kernel Logistic Regression (SKLR) model.
$$\min_w F(w) = -\sum_{i=1}^{N} \log(\sigma(y_i w^T k_i)) + \lambda \|w\|_1 \qquad (2)$$
We choose the Radial Basis Function (RBF) kernel to form the kernel matrix K, i.e.,
$$k(i, j) = \exp\left( -\frac{\|x_i - x_j\|^2}{\sigma^2} \right) \qquad (3)$$
1.2 Algorithms for L1-regularization
[…] (L2-norm squared) KLR instead of the non-smooth SKLR. An Interior Point method approach has been proposed for sparse logistic regression (SLR). It makes a certain sparsity assumption on the training data matrix in large-scale computation, and thus is not applicable to nonlinear kernel models. Iteratively Re-weighted Least Squares (IRLS) via LARS calls LARS as a subroutine and also solves KLR only.
Various first-order optimization algorithms have been proposed to solve L1-regularized problems. In this paper, we explore three algorithms: the (Fast) Iterative Shrinkage-Thresholding Algorithm (FISTA/ISTA), the Coordinate Gradient Descent algorithm (CGD), and Stochastic Gradient Descent (SGD). FISTA is an accelerated proximal-gradient
15 of 104
Data set
mnist10k
gisette
cod-rna
mnist8M
training
10,000
6,000
483,565
1,560,154
test
1000
1000
5000
19,000
no. features
784
5,000
8
784
factor of λmax
0.5,0.2,0.1,0.05
0.5,0.2,0.1,0.05
0.001
0.1
With a given kernel matrix K, we can also explicitly compute an upper bound λmax on the meaningful range of λ
following the formulation given in literature,
λmax = kK T b̃k∞ ,
Table 1: Various data sets used in the experiments.
where
algorithm which was originally proposed to solve the linear
inverse problem arising in image processing. Recently, it has
been proven that its complexity bound is optimal among
all the first-order methods. CGD is a coordinate-descent
type of algorithm and also extended the former work to L1regularized convex minimization problems which achieves
the aforementioned goals. SGD uses approximate gradients
from subsets of the training data and updates the parameters in an online fashion. In many applications this results in
less training time in practice than batch training algorithms.
The SGD-C method that we use in our study is based on
the Stochastic sub-Gradient Descent algorithm originally
proposed to solve the L1-regularized log-linear models. It
uses the cumulative penalty heuristic to improve the sparsity of w. SGD-C also incorporates the gradient-averaging
idea from the dual averaging algorithm, which extends Nesterov’s work to L1-regularized kernel logistic regression.
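To make the accelerated proximal-gradient idea concrete, the following is a minimal sketch of FISTA applied to an L1-regularized logistic loss. It is an illustration under stated assumptions, not the authors' implementation: the loss is averaged over the samples, the step size is a constant 1/L derived from the spectral norm of the data matrix, and the names `soft_threshold`, `logistic_loss_grad` and `fista_l1_logreg` are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def logistic_loss_grad(w, Z, y):
    """Average logistic loss and its gradient; labels y are in {-1, +1}."""
    margins = y * (Z @ w)
    loss = np.mean(np.logaddexp(0.0, -margins))
    # 1 / (1 + exp(m)) written with tanh to avoid overflow for large |m|.
    coeff = -y * 0.5 * (1.0 - np.tanh(0.5 * margins))
    grad = Z.T @ coeff / len(y)
    return loss, grad

def fista_l1_logreg(Z, y, lam, n_iter=200):
    """FISTA with constant step 1/L for: mean logistic loss + lam * ||w||_1."""
    n, d = Z.shape
    # Assumption: L = ||Z||_2^2 / (4 n) bounds the Lipschitz constant of the gradient.
    L = 0.25 * np.linalg.norm(Z, 2) ** 2 / n
    w = np.zeros(d)
    v = w.copy()          # extrapolated point
    t = 1.0
    for _ in range(n_iter):
        _, grad = logistic_loss_grad(v, Z, y)
        w_next = soft_threshold(v - grad / L, lam / L)   # proximal-gradient step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        v = w_next + ((t - 1.0) / t_next) * (w_next - w)  # momentum step
        w, t = w_next, t_next
    return w
```

Dropping the momentum step (using `v = w_next`) would give plain ISTA. For a sufficiently large `lam` the shrinkage step maps every coordinate to zero, which is the behaviour the λmax bound in the experiments section describes.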
2. IMPLEMENTATION
2.1 Kernel linearization
Nonlinear kernel classifiers are attractive because they can
approximate the decision boundary better given enough training data. However, they do not scale well in either training
time and storage in large-scale settings, and it may take
days to train on data sets with millions of points. On the
other hand, linear classifiers run much more quickly, especially when the number of features is small, but behave relatively poorly when the underlying decision boundaries are
non-linear. Kernel linearization combines the advantages of
the linear and nonlinear classifiers. The key idea is to map
the training data to a low-dimensional Euclidean space by a
randomized feature map z : Rn → RD , D n, so that
$$k(x, y) = \langle \phi(x), \phi(y) \rangle \approx z(x)' z(y). \qquad (4)$$
Therefore, we can directly put the transformed data z(x)
to the linear classifier and speed up the computation. To
calculate z(x), a randomized Fourier transform is used:
$$z_j(x) = \sqrt{\frac{2}{D}} \cos(\omega_j x + b). \qquad (5)$$
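The feature map in Eq. (5) can be sketched as follows. This is only an illustration under assumptions that the text does not state: a Gaussian kernel exp(-gamma * ||x - y||^2), frequencies ω_j drawn from the matching normal distribution, and one uniformly drawn phase offset per feature. The function name and the parameter `gamma` are hypothetical.

```python
import numpy as np

def random_fourier_features(X, D, gamma, rng):
    """Map rows of X to D features whose inner products approximate
    the Gaussian kernel exp(-gamma * ||x - y||^2)."""
    n_features = X.shape[1]
    # Assumption: for this kernel the frequencies follow N(0, 2 * gamma * I).
    omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_features, D))
    # Assumption: one phase offset per feature, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)

# Usage: compare the linearized kernel with the exact one on a few points.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Z = random_fourier_features(X, D=20000, gamma=0.5, rng=rng)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K_exact = np.exp(-0.5 * sq_dists)
max_error = np.max(np.abs(Z @ Z.T - K_exact))
```

The rows of `Z` can then be passed to any linear classifier in place of the kernel matrix, which is the speed-up that kernel linearization is after.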
3. EXPERIMENTS
3.0.1 The regularization parameter λ
One of the key parameters to set for the SKLR problem is the
penalty λ on the L1 -norm. There are two major approaches
for choosing λ. One is through computing the regularization
path, as done recently. Specifically, we start with a large λ,
on which the optimization algorithms converge faster. We
then gradually decrease λ and use the solution from the
previous λ as the starting solution for the current λ. This
technique is also known as continuation or warm-starting.
This approach uses 10-fold cross validation to compute the hold-out accuracy and selects the λ that achieves the highest accuracy.
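The entries of the vector b̃ in the bound $\lambda_{\max} = \|K^T \tilde{b}\|_\infty$ are
$$\tilde{b}_i = \begin{cases} N_-/N, & y_i = 1, \\ -N_+/N, & y_i = -1, \end{cases} \qquad i = 1, \dots, N. \qquad (6)$$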
N− and N+ above denote the number of negative and positive training labels, respectively. The optimal solution is
the zero vector for any λ larger than λmax. For medium-scale datasets, we ran the three algorithms with four values
of λ: 0.5λmax, 0.2λmax, 0.1λmax, and 0.05λmax. We show
that usually a good balance between accuracy and sparsity
occurs at around 0.2λmax to 0.1λmax. In Table 1, we have
specified the actual factors of λmax that we used in our experiments. This approach is simple and the chosen parameter values are usually appropriate enough for ensuring that
we test the optimization algorithms in the relevant regime
for classification.
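The bound and the grid of regularization values described above can be computed in a few lines. The sketch below follows the formula λmax = ||K^T b̃||∞ as written, taking N+ as the number of positive labels and N− as the number of negative labels; the function names are hypothetical, and the scale of λmax would change if the loss were normalized differently.

```python
import numpy as np

def lambda_max(K, y):
    """Upper bound on the useful range of the L1 penalty; y has entries in {-1, +1}."""
    n = len(y)
    n_pos = np.sum(y == 1)
    n_neg = np.sum(y == -1)
    # b_tilde_i = N_-/N for positive labels and -N_+/N for negative labels.
    b_tilde = np.where(y == 1, n_neg / n, -n_pos / n)
    return np.max(np.abs(K.T @ b_tilde))

def lambda_grid(K, y, factors=(0.5, 0.2, 0.1, 0.05)):
    """Penalty values expressed as factors of lambda_max, as in Table 1."""
    lam_max = lambda_max(K, y)
    return [f * lam_max for f in factors]

# Usage on a toy problem with an identity "kernel" matrix.
K = np.eye(4)
y = np.array([1, 1, 1, -1])
lam_max = lambda_max(K, y)
grid = lambda_grid(K, y)
```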
4. CONCLUSION
In this work, we explore and analyze three algorithms FISTA,
CGD and SGD for solving the L1-regularized large-scale
KLR, which has never been performed to the best of our
knowledge. We observe that CGD performs surprisingly
better than FISTA in the number of iterations to convergence. For large-scale data with size up to millions, SGD
appears faster in terms of the training time, but it comes
with the loss of optimality guarantee upon termination, and
SGD has lower prediction accuracy or sparsity compared to
the deterministic methods (i.e. FISTA and CGD) in some
cases. We also study the effect of various values of the regularization parameter on the training time, prediction accuracy, and sparsity. In the algorithmic aspect, we propose a
two-stage active-set training approach which can boost the
prediction accuracy by up to 4% for the deterministic algorithms. Based on the observation that CGD converges faster
than FISTA on SKLR problems, we adopt the feature of
CGD and propose a multiple-scaled FISTA which improves
its performance significantly while preserving the theoretical
global convergence rate. We have also applied the gradient-averaging technique and the cumulative penalty heuristic to
SGD to make it well-suited for large-scale SKLR training.
We have also proposed a multiple-scale extension to FISTA
and a two-stage active set training scheme, which helps improve the prediction accuracies of the deterministic learning
algorithms. Through our experiments, we demonstrated the
effectiveness of kernel linearization and the computational
advantage that it brings, which makes large-scale kernel
learning highly feasible. Our observations show that while
SGD-C remains the most efficient algorithm for large-scale data, the deterministic algorithms (especially CGD)
are also capable of handling a wide range of data with
optimality guarantee. Finally, the success of the algorithms
that we have introduced in this work demonstrates the potential and computational advantage of sparse modeling in
large-scale classification.
Visualizing Topic Models
Allison J. B. Chaney and David M. Blei
Princeton University
achaney@cs.princeton.edu
Probabilistic topic modeling is an unsupervised machine learning method for uncovering the latent thematic structure in large archives of otherwise unorganized texts. Topic models learn the underlying themes of the collection and how each document exhibits them. This provides the potential to both summarize the documents and organize them in a new way. However, topic models are high-level statistical tools which require scrutiny of probability distributions to understand and explore their results. The contribution of this work is to present a method for visualizing the results of the topic model. Our method creates a web interface that allows users to explore all of the observed and hidden relationships that a topic model discovers. These browsing interfaces help in research and development of topic models and let end-users explore and understand the collection in new ways.

We propose a method for visualizing topic models that summarizes the corpus for the user and reveals relationships between and across content and summaries. With our browser design, every possible relationship between topics, terms, and documents is represented and integrated into the navigation. Overview pages allow users to understand the corpus as a whole before delving into more specific exploration via individual variable pages. We have achieved this with a browser design that illuminates a given corpus to users of varying technical ability; understanding our browser does not require an understanding of the deep technical details of topic modeling.

We have built a browser for one of the simplest topic models, LDA [1]; it can be used for both its static and online variants [3]. Recall that LDA assumes the following generative process of assigning topics to documents.

1. For K topics, choose each topic distribution βk. (Each βk is a distribution over the vocabulary.)
2. For each document in the collection:
(a) Choose a distribution over topics θd. (The variable θd is a distribution over K elements.)
(b) For each word in the document:
i. Choose a topic assignment zn from θd. (Each zn is a number from 1 to K.)
ii. Choose a word wn from the topic distribution βzn. (Notation βzn selects the zn-th topic from step 1.)

This generative process defines a joint distribution over the hidden topical structure and observed documents. This joint, in turn, induces a posterior distribution of the hidden structure given the documents. Topic modeling algorithms, like Gibbs sampling [4] and variational inference [1, 3], seek to approximate this posterior. Here, we use this approximation to visualize a new organization of the collection through posterior estimates of the hidden variables, as shown in Figure 1.

As one demonstration, we analyzed and visualized 100,000 Wikipedia articles with a 50-topic LDA model. The browser can be found at http://www.princeton.edu/~achaney/tmve/wiki100k/browse/topic-list.html. Figure 2 depicts several pages from this browser. An extension that incorporates time series of topics and document impact [2] into the browser is in progress, with preliminary results.

References
[1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.
[2] S. Gerrish and D. Blei. A language-based approach to measuring scholarly impact. In International Conference on Machine Learning, 2010.
[3] M. Hoffman, D. Blei, and F. Bach. On-line learning for latent Dirichlet allocation. In Neural Information Processing Systems, 2010.
[4] M. Steyvers and T. Griffiths. Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2006.
Figure 1: A topic page and document page from the browser of Wikipedia. We have labeled how we compute
each component of these pages from the output of the topic modeling algorithm.
Labels on the topic page: associated terms, ordered by βk; associated documents, ordered by θ1:D; related topics, ordered by a function of βk × β1:K (Eq. 1).
Labels on the document page: terms wd present in the document; associated topics, ordered by θd; related documents, ordered by a function of θd × θ1:D (Eq. 4).
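The simple orderings labeled in Figure 1 can be sketched directly from the fitted model. The sketch assumes that the posterior estimates are available as a K × V matrix `beta` (topics over vocabulary) and a D × K matrix `theta` (documents over topics); these array layouts and the function names are assumptions for illustration. The "related topics" and "related documents" orderings are left out because the functions behind Eq. 1 and Eq. 4 are not given here.

```python
import numpy as np

def top_terms(beta, k, vocab, n=5):
    """Associated terms of topic k, ordered by beta_k (most probable first)."""
    order = np.argsort(-beta[k])[:n]
    return [vocab[v] for v in order]

def top_documents(theta, k, n=5):
    """Associated documents of topic k, ordered by theta_{1:D} for that topic."""
    return np.argsort(-theta[:, k])[:n].tolist()

def top_topics(theta, d, n=3):
    """Associated topics of document d, ordered by theta_d."""
    return np.argsort(-theta[d])[:n].tolist()

# Usage with a toy model: 2 topics, 3 vocabulary terms, 3 documents.
vocab = ["film", "actor", "mind"]
beta = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])
theta = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])
terms_topic0 = top_terms(beta, 0, vocab, n=2)
docs_topic1 = top_documents(theta, 1, n=2)
topics_doc0 = top_topics(theta, 0, n=1)
```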
Figure 2: Navigating Wikipedia with a topic model. Beginning in the upper left, we see a set of topics,
each of which is a theme discovered by a topic modeling algorithm. We click on a topic about film and
television. We choose a document associated with this topic, which is the article about film director Stanley
Kubrick. The page about this article includes its content and the topics that it is about. We explore a
related topic about philosophy and psychology, and finally view a related article about Existentialism. This
browsing structure—the themes and how the documents are organized according to them—is created by
running a topic modeling algorithm on the raw texts of Wikipedia and visualizing its output.
Weakly Supervised Neural Nets for POS Tagging
Sumit Chopra and Srinivas Bangalore
AT&T Labs-Research, 180 Park Ave, Florham Park, NJ
{schopra,srini}@research.att.com
We introduce a simple yet novel method for the problem
of Part-Of-Speech (POS) tagging with a dictionary. It involves training a neural network which simultaneously learns
a distributed latent representation of the words, while learning
a discriminative function to optimize accuracy on the POS
tagging task. The model is trained using a Curriculum: a
structured sequence of training samples is presented to the
model, as opposed to in random order. On a standard data set,
we demonstrate that our model is able to outperform the
standard Expectation-Maximization (EM) algorithm and its
performance is comparable to other state-of-the-art models.
In natural language processing (NLP), machine learning
techniques such as Hidden Markov Models (HMM), Maximum Entropy models, Support Vector Machines and Conditional Random Fields have become the norm to solve a range
of disambiguation tasks such as part-of-speech (POS) tagging,
named-entity tagging, supertagging, and parsing. Training of
these models crucially relies on the availability of large amounts
of text annotated with the appropriate task labels. However,
text annotation is a tedious and expensive process. In order to
address this limitation, a number of alternate techniques have
been proposed recently, both in the unsupervised and weakly
supervised settings.
One such technique, first introduced in [1], involves training
a POS tagger for a language using large amounts of unannotated text aided by a lexicon of words with all possible
POS tags for each word in that language (a dictionary). Such
a lexicon can easily be extracted from a physical dictionary
of a language without the need for annotating texts of that
language. This problem has recently witnessed a lot of activity. Various authors have proposed a wide variety of solutions, ranging from variants of HMM-based generative models
trained using EM [1], [2], [3], to Bayesian approaches [4], [5],
to contrastive training of log-linear models [6]. However, one
major drawback of most of these techniques is their explicit
modeling of syntactic constraints, typically encoded using n-grams of POS history. This results in an expensive decoding
step, rendering them useless for real-time applications.
To this end we propose a simple yet novel model for the
problem which uses only lexical information and does not
explicitly model syntactic constraints. Our model is similar
to the works of [7] and [8]. It is composed of two layers:
a classification layer stacked on top of an embedding layer.
The embedding layer is a learnable linear mapping which
maps each word onto a low dimensional latent space. Let the set of N unique words in the vocabulary of the corpus be denoted by $W = \{w_1, w_2, \dots, w_N\}$. Let us assume that each word $w_i$ is coded using a 1-of-n coding. The embedding layer maps each word $w_i$ to a continuous vector $z_i$ which lies in a D dimensional space (25 in the experiments): $z_i = C \cdot w_i$, $i \in \{1, \dots, N\}$, where $C \in \mathbb{R}^{D \times N}$ is a projection matrix. The continuous representation of any k-gram $(w_i\, w_{i+1} \dots w_{i+k})$ is $z = (z_i\, z_{i+1} \dots z_{i+k})$, obtained by concatenating the representation of each of its words. The encoding of the dictionary is as follows. Let K be the total number of tags in the dictionary. Then the dictionary entry $d_i$ is a binary vector of size K with the j-th element set to 1 if the tag j is associated with word $w_i$.
The classification layer takes as input the continuous representation of the k-gram generated by the embedding layer
and produces as output a vector of tag probabilities which are
used to make the decision. In our experiments this layer was
composed of a single standard perceptron layer consisting of
400 units, followed by a fully connected linear layer with size
equal to the number of classes (47). The output of this linear
layer is passed through a soft-max non-linearity to generate
conditional probabilities for each class (tag). In particular, let
$o_j$ denote the output of the j-th unit of the last linear layer. Then the output of the i-th unit of the soft-max non-linearity is given by $p_i = e^{o_i} / \sum_j e^{o_j}$. This classifier can be viewed as a parametric function $G_M$ with parameters M. The two sets of trainable parameters for the model are the mapping matrix C, and the set of parameters M associated with $G_M$.
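A forward pass through the two layers can be sketched as below. The layer sizes follow the text (25-dimensional embeddings, 400 hidden units, 47 tags), but several details are assumptions made only for illustration: the hidden non-linearity is taken to be tanh, the orthographic features are omitted, and the parameter names `W1`, `b1`, `W2`, `b2` stand in for the classifier parameters M.

```python
import numpy as np

def softmax(o):
    """p_i = exp(o_i) / sum_j exp(o_j), shifted by max(o) for numerical stability."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def forward(word_ids, C, W1, b1, W2, b2):
    """Tag probabilities for one window of word indices."""
    # With 1-of-n coding, C · w_i is simply column i of the D x N matrix C.
    z = np.concatenate([C[:, i] for i in word_ids])
    h = np.tanh(W1 @ z + b1)      # assumption: tanh hidden units
    o = W2 @ h + b2               # linear layer, one output per tag
    return softmax(o)

# Usage with random parameters; a window holds 6 left words, the target, 6 right words.
rng = np.random.default_rng(0)
D, N, window, hidden, n_tags = 25, 100, 13, 400, 47
C = rng.normal(size=(D, N))
W1 = rng.normal(scale=0.1, size=(hidden, window * D))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(n_tags, hidden))
b2 = np.zeros(n_tags)
p = forward(list(range(window)), C, W1, b1, W2, b2)
```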
For the purpose of training, we use the POS tags of
each word in the dictionary to construct a training set $S = \{(x^i, y^i) : i = 1, \dots, N\}$, consisting of the input-output pairs. Each input $x^i$ is a sequence of k words obtained by sliding a window of size k over the entire corpus. In particular, for a sentence ($W = w_1 \dots w_r$), each training example $x^i$ consists of a target word $w_j$, with six words from its left ($c_l = w_{j-6} \dots w_{j-1}$) and right context ($c_r = w_{j+1} \dots w_{j+6}$), in addition to orthographic features ($o_{w_j}$) such as three character suffix and prefix, digit and upper case features. Thus the input $x^i$ is the 4-tuple $(c_l, w_j, c_r, o_{w_j})$. The corresponding output $y^i$ is set equal to $d_j$: the binary dictionary vector associated with the target word $w_j$. Depending on the word ambiguity one or more elements of the vector $d_j$ could be set to 1.
Training of the system involves adjusting the parameters M
and C so as to maximize the likelihood of the training data.
This is achieved by minimizing the negative log likelihood
loss using the stochastic gradient descent algorithm. In particular
each epoch of training is composed of two phases. In Phase
1: the matrix C is kept fixed and the loss is minimized with
respect to the parameters M of the classifier using stochastic
gradient descent. In Phase 2: the parameters M are fixed and
the loss is minimized with respect to the matrix C.
However, motivated by [9], we follow a fixed curriculum
to pick a training sample during each step of the stochastic
gradient descent, as opposed to random sampling. The curriculum involves first using only the “easy” samples to learn
simpler aspects of the task, and gradually moving to “harder”
samples to learn more complex aspects. In particular we start
the model training using samples that consist of words which
have only one POS tag in the dictionary (i.e. no ambiguity).
After training the model for some number of epochs on these
training samples, we grow the learner’s training set (and hence
increase its entropy) by adding in examples of words which
have two POS tags in the dictionary (i.e. an ambiguity of one)
and train the model until convergence. We follow this training
regime with ever increasing size of the training set until all
the available samples are included in the learner’s training set.
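The growing training sets of this curriculum can be sketched as follows. The data layout is an assumption made for illustration: each sample is a `(context, target_word)` pair and the dictionary maps a word to its set of possible tags. The function name is hypothetical, and the training call that would run on each stage is omitted.

```python
def curriculum_stages(samples, dictionary):
    """Yield (level, training_set) pairs: first words with one possible tag,
    then words with at most two tags, and so on until all samples are included."""
    if not samples:
        return
    max_tags = max(len(dictionary[target]) for _, target in samples)
    for level in range(1, max_tags + 1):
        stage = [s for s in samples if len(dictionary[s[1]]) <= level]
        yield level, stage

# Usage on a toy dictionary.
dictionary = {
    "the": {"DT"},
    "run": {"NN", "VB"},
    "back": {"RB", "NN", "VB"},
}
samples = [("ctx1", "run"), ("ctx2", "the"), ("ctx3", "back")]
stages = list(curriculum_stages(samples, dictionary))
```

Each stage contains the previous one, so the learner keeps seeing the unambiguous samples while more ambiguous ones are added.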
We evaluated our model on the standard Penn Treebank
part-of-speech (POS) corpus, and used the same dictionary of
word-to-tags association as in [3]. We replaced the tag associated with each word in the labeled training set of around 1.1M
words with the tags associated with that word in the dictionary.
This resulted in an ambiguously tagged sentence corpus, with
average per word ambiguity of 2.3 tags/word. Due to the
high non-linearity of the model, different initializations could
potentially lead to different local minima, resulting in vastly
different task accuracy. To alleviate this problem, we trained
several models independently, each initialized with a different
set of random word embedding and classifier parameters, and
the model which performed the best on the tuning set was
selected as the final model for evaluation.
The tagging accuracy of our model, along with that of others
is summarized in Table I. The accuracy of our model with
curriculum learning (CNet+CL) is superior to the bigram EM-HMM, trigram EM-HMM-2, and the Bayesian HMM models.
Those models which outperform ours, namely CE+spl and
IP+EM, make use of a number of global characteristics, such
as access to the full sequence of POS tags of a sentence.
The simplicity of our model (which is only lexically driven
with no dependencies on the POS tag sequence), makes the
results in the table particularly noteworthy. We attribute the
good performance to the lexical representations that are learnt
jointly with the discriminative non-linear architecture trained
to optimize the POS tagging accuracy. We expect that exploiting such global information within our model would likely
improve the tagging accuracy even further. It is interesting
to note that the model without curriculum learning (CNet),
where training samples are presented in no particular order, is
far inferior to all other models. Furthermore, analysis of the
dictionary confirmed that a number of words, with ambiguity
greater than 0 had wrong tags associated with them in their respective dictionary entries. This was due to the fact that some
instances of these words were wrongly tagged in the corpus.
The incorrect tagging of the words, coupled with the greedy
nature of training the model resulted in the model wrongly
tagging all the instances of these words. After removing such
erroneous entries from the dictionary as suggested in [2], the
test accuracy of our model improved to 90.55%. Note that
in Table I the performance of 91.4% reported in [2] uses
such a pruned dictionary along with other language specific
and linguistic constraints. Lastly, once the parameters of our
model are learnt, the process of decoding a new sample is
extremely fast. Our model is capable of tagging 7945 words
per second on a Xeon7550 machine.
TABLE I
POS tagging accuracy of various models. EM-bigram: EM with a bi-gram tag model; EM-trigram: EM with 3-gram tag model; BHMM: Bayesian HMM with sparse priors [4]; CE+spl: contrastive estimation with spelling model [6]; InitEM-HMM: EM with good initialization [2]; IP+EM: EM with integer programming [3]; CNet: connectionist network; CL: curriculum learning; CD: clean dictionary.

Model             Percentage accuracy using 47 tags
EM-bigram         81.7
EM-trigram        74.5
BHMM [4]          86.8
CE+spl [6]        88.6
InitEM-HMM [2]    91.4
IP+EM [3]         91.6
CNet              67.72
CNet + CL         88.34
CNet + CL + CD    90.55

We presented a novel method for weakly supervised training
of neural networks for part-of-speech tagging which does not
make use of explicitly annotated data. Our method makes use
of an organized training regime, called Curriculum Learning,
to train the neural network while at the same time training a
distributed embedding of the words. While the performance
of the model is based entirely on lexical information, it remains
an open question whether the performance can be further
improved using global or long-distance constraints. Finally,
the technique used in this paper might also be used for other
disambiguation tasks such as Supertagging.
REFERENCES
[1] B. Merialdo, “Tagging english text with a probabilistic model,” Computational Linguistics, 1994.
[2] Y. Goldberg, M. Adler, and M. Elhadad, “Em can find pretty good hmm
pos-taggers (when given a good start),” In Proc. of ACL, pp. 746 – 754,
June 2008.
[3] S. Ravi and K. Knight, “Minimized models for unsupervised part-of-speech tagging,” In Proc. of ACL-IJCNLP, pp. 504 – 512, August 2009.
[4] S. Goldwater and T. Griffiths, “A fully bayesian approach to unsupervised
part-of-speech tagging,” In Proc. of ACL, pp. 744 – 751, June 2007.
[5] K. Toutanova and M. Johnson, “A bayesian lda based model for semi-supervised part-of-speech tagging,” Advances in Neural Information Processing Systems, no. 20, 2008.
[6] N. Smith and J. Eisner, “Contrastive estimation: Training log-linear
models on unlabeled data,” In Proc. ACL, pp. 354 – 362, 2005.
[7] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic
language model,” Journal of Machine Learning Research, vol. 3, pp.
1137–1155, 2003.
[8] R. Collobert and J. Weston, “A unified architecture for natural language
processing: deep neural networks with multitask learning,” in ICML ’08:
Proceedings of the 25th international conference on Machine learning.
New York, NY, USA: ACM, 2008, pp. 160–167.
[9] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum
learning,” in ICML ’09: Proceedings of the 26th Annual International
Conference on Machine Learning. New York, NY, USA: ACM, 2009,
pp. 41–48.
Online Clustering with Experts
Claire Monteleoni∗
Department of Computer Science
George Washington University
cmontel@gwu.edu
Anna Choromanska
Department of Electrical Engineering
Columbia University
aec2163@columbia.edu
Abstract
Approximating the k-means clustering objective with an online learning algorithm
is an open problem. We introduce a family of online clustering algorithms by extending algorithms for online supervised learning, with access to expert predictors, to the unsupervised learning setting. Instead of computing prediction errors
in order to re-weight the experts, the algorithms compute an approximation to the
current value of the k-means objective obtained by each expert. When the experts
are batch clustering algorithms with b-approximation guarantees with respect to
the k-means objective (for example, the k-means++ or k-means# algorithms), applied to a sliding window of the data stream, our algorithms obtain approximation
guarantees with respect to the k-means objective. The form of these online clustering approximation guarantees is novel, and extends an evaluation framework
proposed by Dasgupta as an analog to regret. Our algorithms track the best clustering algorithm on real and simulated data sets.
1 Introduction
As data sources continue to grow at an unprecedented rate, it is increasingly important that algorithms to analyze this data operate in the online learning setting. This setting is applicable to a
variety of data stream problems including forecasting, real-time decision making, and resource-constrained learning. Data streams can take many forms, such as stock prices, weather measurements, and internet transactions, or any data set that is so large compared to computational resources,
that algorithms must access it in a sequential manner. In the online learning model, only one pass is
allowed, and the data stream is infinite.
Most data sources produce raw data (e.g. speech signal, or images on the web), that is not yet labeled
for any classification task, which motivates the study of unsupervised learning. Clustering refers to
a broad class of unsupervised learning tasks aimed at partitioning the data into clusters that are
appropriate to the specific application. Clustering techniques are widely used in practice, in order
to summarize large quantities of data (e.g. aggregating similar online news stories), however their
outputs can be hard to evaluate. For any particular application, a domain expert may be useful in
judging the quality of a resulting clustering, however having a human in the loop may be undesirable.
Probabilistic assumptions have often been employed to analyze clustering algorithms, for example
i.i.d. data, or further, that the data is generated by a well-separated mixture of Gaussians.
Without any distributional assumptions on the data, one way to analyze clustering algorithms is to
formulate some objective function, and then to prove that the clustering algorithm either optimizes
it, or is an approximation algorithm. Approximation guarantees, with respect to some reasonable
objective, are therefore useful. The k-means objective is a simple, intuitive, and widely-cited clustering objective, however few algorithms provably approximate it, even in the batch setting. In this work, inspired by an open problem posed by Dasgupta [4], our goal is to approximate the k-means objective in the online setting.
∗ This work was done while C.M. was at the Center for Computational Learning Systems, Columbia University.
1.1 The k-means clustering objective
One of the most widely-cited clustering objectives for data in Euclidean space is the k-means objective. For a finite set, X, of n points in $\mathbb{R}^d$, and a fixed positive integer, k, the k-means objective is to
choose a set of k cluster centers, C in $\mathbb{R}^d$, to minimize:
$$\Phi_X(C) = \sum_{x \in X} \min_{c \in C} \|x - c\|^2$$
which we refer to as the “k-means cost” of C on X. This objective formalizes an intuitive measure of
goodness for a clustering of points in Euclidean space. Optimizing the k-means objective is known
to be NP-hard, even for k = 2 [2]. Therefore the goal is to design approximation algorithms.
Definition 1. A b-approximate k-means clustering algorithm, for a fixed constant b, on any input
data set X, returns a clustering C such that: $\Phi_X(C) \le b \cdot OPT_X$, where $OPT_X$ is the optimum of
the k-means objective on data set X.
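The k-means cost and the condition in Definition 1 translate directly into code. The sketch below assumes the points and centers are rows of NumPy arrays; the function names are hypothetical, and the optimal cost is taken as a given input because computing it is NP-hard in general.

```python
import numpy as np

def kmeans_cost(X, C):
    """Phi_X(C): sum over points of the squared distance to the nearest center."""
    sq_dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
    return sq_dists.min(axis=1).sum()

def is_b_approximate(X, C, opt_cost, b):
    """Check Phi_X(C) <= b * OPT_X for a known (or separately bounded) optimum."""
    return kmeans_cost(X, C) <= b * opt_cost

# Usage on three points and two centers.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
C = np.array([[0.0, 1.0], [10.0, 0.0]])
cost = kmeans_cost(X, C)
```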
Surprisingly few algorithms have approximation guarantees with respect to k-means, even in the
batch setting. Even the algorithm known as “k-means” does not have an approximation guarantee.
Our contribution is a family of online clustering algorithms, with regret bounds, and approximation
guarantees with respect to the k-means objective, of a novel form for the online clustering setting.
We extend algorithms from [5] and [6] to the unsupervised learning setting, and introduce a flexible
framework in which our algorithms take a set of candidate clustering algorithms, as experts, and
track the performance of the “best” expert, or best sequence of experts, for the data. Our approach
lends itself to settings in which the user is unsure of which clustering algorithm to use for a given
data stream, and exploits the performance advantages of any batch clustering algorithms used as
experts. Our algorithms vary in their models of the time-varying nature of the data; we demonstrate
encouraging performance on a variety of data sets.
2 Online k-means approximation
Specifying an evaluation framework for online clustering is a challenge. One cannot use a similar
analysis setting to [3, 1] in the finite data stream case. With no assumptions on the data, one can
always design a sequence of observations that will fool a seeding algorithm (that picks a set of
centers and does not update them) into choosing seeds that are arbitrarily bad (with respect to the
k-means objective) for some future observations, or else into simply abstaining from choosing seeds.
Our analysis is inspired, in part, by an evaluation framework proposed by Dasgupta as an analog to
regret [4]. The regret framework, for the analysis of supervised online learning algorithms, evaluates
algorithms with respect to their additional prediction loss relative to a hindsight-optimal comparator
method. With the goal of analyzing online clustering algorithms, Dasgupta proposed bounding the
difference between the cumulative clustering loss since the first observation:
$$L_T(\mathrm{alg}) = \sum_{t \le T} \min_{c \in C_t} \|x_t - c\|^2 \qquad (1)$$
where the algorithm outputs a clustering, Ct , before observing the current point, xt , and the optimal
k-means cost on the points seen so far. We provide clustering variants of predictors with expert
advice from [5] and [6], and analyze them by first bounding this quantity in terms of regret with
respect to the cumulative clustering loss of the best (in hindsight) of a finite set of batch clustering
algorithms. Then, adding assumptions that the batch clustering algorithms are b-approximate with
respect to the k-means objective, and that the clustering setting is non-predictive, i.e. the clustering
algorithms observe the current point before trying to cluster it, we extend our regret bounds to
obtain bounds on a non-predictive variant of Equation 1 (where the point xt is now observed before
the clusterings are output), with respect to the optimal k-means cost on the points seen so far.1
1 We also show that no b-approximate algorithm will trivially optimize our loss function by outputting xt.
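To illustrate how the clustering loss of Equation 1 can replace prediction error when re-weighting experts, here is a deliberately simplified sketch. It is not the authors' algorithm: it uses a plain exponential-weights update with a hypothetical learning rate `eta`, treats each expert as a function from the points seen so far to a set of centers, and only tracks expert weights and cumulative losses rather than producing a combined clustering.

```python
import numpy as np

def point_loss(x, centers):
    """Clustering loss of one point: squared distance to the nearest center."""
    return np.min(((centers - x) ** 2).sum(axis=1))

def track_experts(stream, experts, eta=0.5):
    """Re-weight clustering experts by their loss on each new point.

    experts: callables mapping the list of past points to a (k, d) array of centers.
    Returns the final normalized weights and each expert's cumulative loss.
    """
    log_w = np.zeros(len(experts))       # log-weights avoid numerical underflow
    cumulative = np.zeros(len(experts))
    history = []
    for x in stream:
        if history:
            # Predictive setting: centers are chosen before the point x is seen.
            losses = np.array([point_loss(x, e(history)) for e in experts])
            cumulative += losses
            log_w -= eta * losses
        history.append(x)
    weights = np.exp(log_w - log_w.max())
    return weights / weights.sum(), cumulative

# Usage: two fixed "experts" on a stream that alternates between two clusters.
good = lambda hist: np.array([[0.0, 0.0], [10.0, 10.0]])
bad = lambda hist: np.array([[5.0, 5.0], [6.0, 6.0]])
rng = np.random.default_rng(0)
stream = [rng.normal(scale=0.1, size=2) + (10.0 if t % 2 else 0.0) for t in range(20)]
weights, cumulative = track_experts(stream, [good, bad])
```

In the setting described above the experts would instead be batch clustering algorithms run on a sliding window of the stream.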
References
[1] Nir Ailon, Ragesh Jaiswal, and Claire Monteleoni. Streaming k-means approximation. In NIPS, 2009.
[2] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn., 75:245–248, May 2009.
[3] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, 2007.
[4] Sanjoy Dasgupta. Course notes, CSE 291: Topics in unsupervised learning. Lecture 6: Clustering in an
online/streaming setting. Section 6.2.3. In http://www-cse.ucsd.edu/~dasgupta/291/lec6.pdf, University of
California, San Diego, Spring Quarter, 2008.
[5] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32:151–178, 1998.
[6] Claire Monteleoni and Tommi Jaakkola. Online learning of non-stationary sequences. In NIPS, 2003.
Efficient Learning of Word Embeddings via
Canonical Correlation Analysis
(Also appearing in NIPS ’11)
Paramveer S. Dhillon
Department of Computer and Information Science
University of Pennsylvania, Philadelphia, PA 19104 U.S.A
dhillon@cis.upenn.edu
Dean Foster
Statistics, Wharton School
University of Pennsylvania, Philadelphia, PA 19104 U.S.A
foster@wharton.upenn.edu
Lyle Ungar
Department of Computer and Information Science
University of Pennsylvania, Philadelphia, PA 19104 U.S.A
ungar@cis.upenn.edu
1 Abstract
Recently, there has been substantial interest in using large amounts of unlabeled data to learn word
representations/embeddings which can then be used as features in supervised classifiers for NLP
tasks [1]. Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary
space with only one dimension “on” at a given time. Broadly, these embedding methods fall into
two categories:
1. Clustering based word representations: Clustering methods, often hierarchical, are used to
group distributionally similar words based on their contexts e.g. Brown Clustering.
2. Dense representations: These representations are dense, low dimensional and real-valued.
Each dimension of these representations captures latent information about a combination
of syntactic and semantic word properties. They can either be induced using neural networks like C&W embeddings [3] and Hierarchical log-bilinear (HLBL) embeddings [4] or
by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI).
Unfortunately, most of these representations are slow to train, are sensitive to the scaling of the embeddings (especially ℓ2-based approaches like LSA/PCA) and learn a single embedding for a given
word type; i.e. all the occurrences of the word “bank” will have the same embedding, irrespective
of whether the context of the word suggests it means “a financial institution” or “a river bank”.
We propose a novel context-specific word embedding method called Low Rank Multi-View Learning,
LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution. Our LR-MVL embeddings are context-specific, but context-oblivious embeddings can be trivially obtained
from our model. Furthermore, building on recent advances in spectral learning for sequence models
like HMMs [4, 5], we show that LR-MVL has strong theoretical grounding. In particular, we show
that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the
information in the data if the data were generated by an HMM. Moreover, LR-MVL being linear
does not face the danger of getting stuck in local optima as is the case for an EM trained HMM.
In LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled
corpus to find the common latent structure, i.e., the hidden state associated with each token. These
induced representations of the tokens can then be used as features in a supervised classifier.
The context around a word, consisting of the h words to the right and left of it, sits in a high
dimensional space, since for a vocabulary of size v, each of the h words in the context requires an
indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word
space down to a k dimensional state space. Thus, all eigenvector computations are done in a space
that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50,000
words, and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the
size of calculations that are needed.
The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which
maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the
“eigenfeature dictionary”.
Our theorem below shows that the LR-MVL algorithm learns a reduced rank matrix A that allows a
significant data reduction while preserving the information in our data, and that the estimated state
does the best possible job of capturing any label information that can be inferred by a linear model.
Theorem 1.1. Let L be an n × hv matrix giving the words in the left context of each of the n tokens,
where the context is of length h, R be the corresponding n × hv matrix for the right context, and
W be an n × v matrix of indicator functions for the words themselves.
Define A by the following limit of the right singular vectors:
$$\mathrm{CCA}_k([L, R], W)_{\mathrm{right}} \approx A.$$
Under some rank assumptions (omitted for brevity), if $\mathrm{CCA}_k(L, R) \equiv [\Phi_L, \Phi_R]$ then
$$\mathrm{CCA}_k([L\Phi_L, R\Phi_R], W)_{\mathrm{right}} \approx A.$$
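To make the shapes in Theorem 1.1 concrete, the following Python sketch builds sparse versions of L, R and W for a toy corpus. It is an illustration only: the toy sentence, the choice to leave out-of-corpus context positions empty, and the helper name `context_indicators` are assumptions and are not taken from the LR-MVL paper. The CCA step itself is not shown.

```python
def context_indicators(tokens, vocab, h):
    """Return sparse rows (sets of active column indices) for L, R and W.

    Row t of L has one active column per left-context position p in 0..h-1,
    at index p * v + word_id, so each row lives in an h*v dimensional space.
    Context positions that fall outside the corpus are left empty (assumption).
    """
    v = len(vocab)
    ids = [vocab[w] for w in tokens]
    n = len(ids)
    L, R, W = [], [], []
    for t in range(n):
        left, right = set(), set()
        for p in range(h):
            i = t - 1 - p          # p-th word to the left of token t
            j = t + 1 + p          # p-th word to the right of token t
            if i >= 0:
                left.add(p * v + ids[i])
            if j < n:
                right.add(p * v + ids[j])
        L.append(left)
        R.append(right)
        W.append({ids[t]})         # n x v indicator of the word itself
    return L, R, W


tokens = "the bank by the river bank".split()
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
L, R, W = context_indicators(tokens, vocab, h=2)
```

Both occurrences of "bank" share the same row in W but have different rows in L and R, which is the information a context-specific embedding can exploit.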
In conclusion, LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is
theoretically elegant, and achieves state-of-the-art performance on named entity recognition (NER)
and chunking problems.
References
[1] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for
semi-supervised learning. ACL ’10, Stroudsburg, PA, USA, Association for Computational
Linguistics (2010) 384–394
[2] Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural
networks with multitask learning. ICML ’08, New York, NY, USA, ACM (2008) 160–167
[3] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. ICML
’07, New York, NY, USA, ACM (2007) 641–648
[4] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden markov models. In:
COLT. (2009)
[5] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS-2010.
(2010)
Improved Semi-Supervised Learning using Constraints and
Relaxed Entropy Regularization via Deterministic Annealing
Paramveer S. Dhillon
Department of Computer and Information Science
University of Pennsylvania, Philadelphia, PA 19104 U.S.A
dhillon@cis.upenn.edu
Sathiya Keerthi, Kedar Bellare, Olivier Chapelle, S. Sundararajan
Yahoo! Research
Santa Clara, CA 95054, U.S.A
1 Abstract
Recently there has been substantial interest in learning in the presence of background knowledge particularly in the
form of expectation constraints. To this end several related frameworks have been proposed, e.g. Generalized Expectation (GE) [1], Posterior Regularization (PR) [2] [3] [4], Learning from Measurements [5] and Constraint Driven
Learning (CODL) [6].
However, a different thread of research that has shown great empirical success for binary and multiclass semi-supervised classification problems (e.g. TSVM) exploits the cluster (low density separation) assumption, which
states that the separating hyperplane for classification should be placed in a low density region. Entropy regularization [7] represents the same assumption for probabilistic models. This assumption works well for binary and multiclass
problems, but it is unclear whether it helps in sequence labeling and other structured output tasks. The main premise of this
assumption is that it allows us to place the separating hyperplane more finely than in a supervised setting, hence
leading to better generalization performance and test accuracy.
The optimization problem solved by binary/multiclass TSVM [8] is
$$\theta^*, Y^* = \arg\min_{\theta, Y} \; R(\theta) + L(Y_L; X_L, \theta) + L(Y; X, \theta) + \frac{1}{2}\|\xi\|^2 \qquad (1)$$
$$\text{s.t.} \quad \psi(Y; X) - c \le \xi$$
where θ are the parameters of the model features, and (X_L, Y_L) and (X, Y) are the (tokens, labels) pairs for labeled and unlabeled
data respectively. ψ(Y; X) are the constraint features and c is their user-specified value, which we want our
model to match. ½‖ξ‖² and R(θ) are the regularizers for constraints and model parameters respectively, and L(Y_L; X_L, θ)
is a loss function, e.g., log-loss. ξ is a vector of slacks on the constraints.
The above combinatorial optimization can be solved efficiently for binary/multiclass problems by using label switching techniques, where the constraints are generally label fraction constraints. However, this becomes intractable for
structured prediction as we have to optimize over exponentially many (Y) labeling possibilities for a given sequence.
So, we relax this by inducing a distribution a(Y) over the unlabeled data which helps make our approach very general
(works for binary/multiclass/structured prediction data).
$$\theta^*, a(Y)^* = \arg\min_{\theta, a(Y), \xi} \; R(\theta) + L(Y_L; X_L, \theta) + \sum_Y a(Y) L(Y; X, \theta) + \frac{1}{2}\|\xi\|^2 \qquad (2)$$
$$\text{s.t.} \quad E_a[\psi(Y; X)] - c \le \xi$$
The unlabeled term in the equation above can be viewed as a relaxed version of entropy regularization (ER).
While ER sets a(Y) to the model’s distribution p_θ(Y) (in the process, complicating the primal as constraints have to
be imposed on p_θ(Y)), we relax this by using a separate distribution a(Y). Our formulation aims to find a constraint-satisfying distribution a(Y) which, when used as the label distribution, will be consistent with the model training.
If, in (2), we fix θ and ξ, the remaining inner problem is a linear programming problem in a(Y). The resulting (sparse)
solution is nothing but a distribution in the sub-polytope of constrained Viterbi solutions. Because of this property the
model resulting from (2) tends to be very good at satisfying constraints in test set inference. Note that methods such
as GE and PR enforce constraints only in an expected sense, and so they may not satisfy the constraints well in test
inference. This is a key advantage associated with our approach.
To reach a good minimum we use deterministic annealing and include the homotopy term $T \sum_Y a(Y) \log a(Y)$;
this ensures that the constraints (background knowledge) are satisfied transductively in the deterministic annealing
limit (T → 0).
$$\theta^*, a(Y)^* = \arg\min_{\theta, a(Y), \xi} \; R(\theta) + L(Y_L; X_L, \theta) + \sum_Y a(Y) L(Y; X, \theta) + T \sum_Y a(Y) \log a(Y) + \frac{1}{2}\|\xi\|^2 \qquad (3)$$
$$\text{s.t.} \quad E_a[\psi(Y; X)] - c \le \xi; \quad \sum_Y a(Y) = 1$$
For any given T the overall objective is non-convex, but it becomes convex when one set of variables is fixed. So, we
use alternating optimization.
Keeping a(Y) fixed the objective becomes
$$\theta^* = \arg\min_{\theta} \; R(\theta) + L(Y_L; X_L, \theta) + \sum_Y a(Y) L(Y; X, \theta) \qquad (4)$$
and with θ fixed the optimization problem becomes
$$a(Y)^* = \arg\min_{a(Y), \xi} \; \sum_Y a(Y) L(Y; X, \theta) + T \sum_Y a(Y) \log a(Y) + \frac{1}{2}\|\xi\|^2 \qquad (5)$$
$$\text{s.t.} \quad E_a[\psi(Y; X)] - c \le \xi; \quad \sum_Y a(Y) = 1$$
The above primal objective for optimization of a(Y) contains one variable per labeling Y, so we work in the dual
space where we will have one variable per expectation constraint. From a computational point of view, the ideas in
this solution are similar to the dual ideas used in posterior regularization [2] [3] [4].
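The role of the temperature T in (5) can be seen in a simplified setting. The Python sketch below drops the expectation constraints and the slack ξ entirely (an assumption made only for illustration), in which case minimizing Σ_Y a(Y) L(Y) + T Σ_Y a(Y) log a(Y) over the simplex has the closed form a(Y) ∝ exp(−L(Y)/T). The three loss values are hypothetical; in the constrained problem the dual variables would add further terms to the exponent.

```python
import math


def annealed_distribution(losses, T):
    """Minimizer of sum_Y a(Y) L(Y) + T sum_Y a(Y) log a(Y) with sum_Y a(Y) = 1.

    Without expectation constraints the solution is a(Y) proportional to
    exp(-L(Y) / T). The minimum loss is subtracted for numerical stability.
    """
    m = min(losses)
    weights = [math.exp(-(l - m) / T) for l in losses]
    z = sum(weights)
    return [w / z for w in weights]


losses = [1.0, 0.5, 2.0]  # hypothetical losses L(Y; X, theta) for three labelings
for T in (10.0, 1.0, 0.01):
    a = annealed_distribution(losses, T)
    print(T, [round(p, 3) for p in a])
```

At high T the distribution is close to uniform; as T → 0 it concentrates on the lowest-loss labeling, which matches the annealing limit described above.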
Our formulation can handle two types of constraints:
1. Instance level constraints: e.g. “Journal (or any field in general) field appears at most once in a given sequence”.
2. Corpus level constraints: e.g. “The average number of authors in a citation should be 3” or “35% of all the
tokens should be labeled Author”.
We achieve state-of-the-art accuracy on several information extraction tasks.
References
[1] Mann, G.S., McCallum, A.: Generalized expectation criteria for semi-supervised learning with weakly labeled
data. Journal of Machine Learning Research 11 (2010) 955–984
[2] Gärtner, T., Le, Q.V., Burton, S., Smola, A.J., Vishwanathan, S.V.N.: Large-scale multiclass transduction. In:
NIPS. (2005)
[3] Ganchev, K., Graca, J., Gillenwater, J., Taskar, B.: Posterior regularization for structured latent variable models.
Technical Report MS-CIS-09-16, University of Pennsylvania Department of Computer and Information Science
(2009)
[4] Bellare, K., Druck, G., McCallum, A.: Alternating projections for learning with expectation constraints. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. UAI ’09, Arlington, Virginia,
United States, AUAI Press (2009) 43–50
[5] Liang, P., Jordan, M.I., Klein, D.: Learning from measurements in exponential families. In: Proceedings of
the 26th Annual International Conference on Machine Learning. ICML ’09, New York, NY, USA, ACM (2009)
641–648
[6] Chang, M.W., Ratinov, L.A., Roth, D.: Guiding semi-supervision with constraint-driven learning. In: ACL.
(2007)
[7] Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In: NIPS. (2004)
[8] Joachims, T.: Transductive inference for text classification using support vector machines. In: ICML. (1999)
200–209
A Reliable Effective Terascale Linear Learning System
Alekh Agarwal
UC Berkeley
Olivier Chapelle
Yahoo! Research
Miroslav Dudík
Yahoo! Research
John Langford
Yahoo! Research
Abstract
We present a system and a set of techniques
for learning linear predictors with convex
losses on terascale datasets, with billions of
training examples and millions of parameters, in an hour or less using a cluster of 1000
machines. This result is highly effective,
producing around 500-fold speed-up over an
optimized online gradient descent approach.
One of the core techniques used is a new
communication infrastructure implemented
in a Hadoop cluster, which allows rapid aggregation of information across nodes (this
operation is often referred to as AllReduce).
The communication infrastructure appears
broadly reusable for many other tasks.
Improving computational complexity of existing algorithms is notoriously difficult, because the space of solutions involves the space of algorithms, which is difficult to comprehend. For example, in machine learning, many algorithms can be stated as statistical query
algorithms [2], which are easily phrased in the MapReduce framework [1, 3]. The parallelization here might
appear quite compelling, as the MapReduce implementation can yield good speed-up curves. However,
these algorithms are uncompelling when a basic online gradient descent approach can, on a single node,
learn a good linear predictor 100 or 1000 times faster
than the slow statistical query algorithm [4, 5]. There
are many other crude, but extremely effective tricks,
such as learning on a subsampled dataset.
Given this observation, we should skeptically evaluate
a parallel learning system. When can we be sure that
it is doing something nontrivial? We focus on learning systems fitting generalized linear models such as
linear regression or logistic regression. To evaluate
efficacy of their parallel implementation, we propose
the following two conditions:
1. Is the parallel algorithm’s throughput faster than
the I/O interface of a single machine? Throughput is measured as the input size divided by the
running time.
2. Is subsampling ineffective on our prediction problem? If we have a trillion examples, but only
want to learn one parameter with a convex loss,
subsampling is an extremely effective strategy.
If the answer is “yes” to the first question, the only
way a single machine could match the
performance of the parallel algorithm is by subsampling. If the answer to the second question is also “yes”, this possibility is excluded and hence
no single-machine algorithm is a viable alternative to
the parallel learning algorithm. For our learning algorithms, the answers are “yes” and “yes”, as verified
experimentally. Furthermore, ours appear to be the
first parallel learning algorithms for which both answers are “yes”.
There is no single trick which allows us to achieve
this—instead it is a combination of techniques, most
of which have appeared elsewhere, and a few of which
are new. One of the most important new tricks is
a communication infrastructure that implements the
most expensive operation in the fitting of generalized
linear models, which is the accumulation of the gradient across examples. It is functionally similar to
an MPI-style AllReduce operation (hence we use the
name), but it is made compatible with Hadoop in important ways. AllReduce is an operation which starts
with a scalar or a vector in each node, and ends with
every node containing the aggregate (typically the sum
or average, but there are other possibilities). This is
commonly implemented as a sequence of “reduce”
(pairwise aggregation) and broadcast operations on a
spanning tree over the individual compute nodes. The
implementation can be deeply pipelined, implying that
latency is unimportant, and the bandwidth requirements
at any node in the spanning tree are within a constant
factor of the minimum.
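The reduce-then-broadcast pattern described above can be simulated in a single process. The Python sketch below is not the authors' Hadoop-compatible implementation: the implicit binary-tree layout (node i has children 2i+1 and 2i+2), the function name `allreduce_sum`, and the example gradient vectors are all assumptions, and no networking or pipelining is modeled.

```python
def allreduce_sum(values):
    """Simulate AllReduce (sum) over nodes arranged in a binary spanning tree.

    Phase 1 reduces partial sums toward the root (node 0); phase 2 broadcasts
    the total back down so that every node ends with the same aggregate.
    """
    n = len(values)
    partial = [list(v) for v in values]
    # Reduce: visit nodes from the deepest up, adding each node into its parent.
    # Children have larger indices than their parent, so a node's subtree is
    # fully accumulated before the node is added to its own parent.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] = [a + b for a, b in zip(partial[parent], partial[i])]
    # Broadcast: each node copies the aggregate from its parent.
    result = [None] * n
    result[0] = partial[0]
    for i in range(1, n):
        result[i] = list(result[(i - 1) // 2])
    return result


# Hypothetical per-node gradient vectors for a 2-parameter model.
gradients = [[1.0, 2.0], [0.5, 0.5], [2.0, -1.0], [0.0, 1.0], [1.5, 0.0]]
aggregated = allreduce_sum(gradients)
```

After the call every node holds the same summed gradient, so each node can apply an identical update without a separate coordinator.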
Implementation of AllReduce using a single spanning
tree is less desirable than MapReduce in terms of reliability, because if any individual node fails, the entire
computation fails. Instead of implementing a general-purpose reliability protocol (such as network error-correcting codes), which would impose a significant
slow-down, we implemented two light-weight tricks
which make AllReduce reliable enough to use in practice on computations up to 10K node hours.
1. The primary source of node failure comes from
disk read failures. In this case, Hadoop automatically restarts on a different machine with identical data. We delay the initialization of the communication infrastructure until every node finishes
reading its data, which avoids disk read failures.
2. It is common for large clusters of machines to be
busy with many jobs which use the cluster in an
uneven way, which commonly results in one of
a thousand nodes being very slow. To get around
this, Hadoop can speculatively execute a job on
identical data, using the first job to complete and
killing the others. By using a slightly more sophisticated spanning tree construction protocol,
AllReduce can benefit from the same system. The
net effect is perhaps another order of magnitude
of scalability in practice.
With the two tricks above, we found the system reliable enough for up to 1000 nodes. In Fig. 1, we plot
the speed-up curve for up to 100 nodes.
Programming machine learning algorithms with AllReduce is typically easier than programming with MapReduce, because code does not need to be rewritten—it
suffices to add a few lines to an existing single thread
program to synchronize the relevant state.
Figure 1: Speed-up relative to the run with 10 nodes,
as a function of the number of nodes. The results are
for a logistic regression problem (click-through-rate
prediction) with 2.3B examples, 16M parameters, and
about 100 non-zero covariates per example.
Other new tricks that we use are hybrid online+batch
algorithms and more sophisticated weight averaging.
Both of these tricks are straightforwardly useful. The
net effect is that we can train logistic regression for
click-through-rate prediction using 17B examples, 16M
parameters (around 100 non-zero covariates per example) in about 70 minutes using 1000 nodes. This
means a net throughput of 500M non-zero covariates/s,
exceeding the input bandwidth (1Gb/s) of any individual node involved in the computation.
References
[1] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters.
Oper. Syst. Design and Impl., 2004.
[2] M. Kearns. Efficient noise-tolerant learning from
statistical queries. J. ACM, 1993.
[3] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A.
Ng, and K. Olukotun. Map-Reduce for Machine
Learning on Multicore. NIPS, 2007.
[4] L. Bottou. Stochastic Gradient Descent Examples on Toy Problems, http://leon.bottou.org/projects/sgd, 2007.
[5] Vowpal Wabbit open source project, http://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.
Density Estimation Based Ranking from Decision Trees
Haimonti Dutta
The Center for Computational Learning Systems, Columbia University, NY 10115.
Decision tree algorithms (such as C4.5, CART, ID3) are used extensively in machine learning
applications for classification problems. However, many applications including recommendation
systems for movies, books and music, information retrieval (IR) systems to access documents,
software to estimate susceptibility to failure of devices, personalized email filters giving priority
to unread or urgent email focus on learning to order instances rather than classifying them.
Probability Estimation Trees (PETs) are decision trees that can order instances based on the
probability of class membership. They estimate the posterior probabilities P (y|x) instead of the
predicted class labels. A simple approach to computing probability from a leaf-node of a decision
tree is to divide the number of instances belonging to a class by the total number of instances in
the leaf node. Other sophisticated techniques recommend refraining from pruning the tree and
using smoothing techniques (such as Laplace smoothing and M-estimation [1]). The popularity of
the Probability Estimation Trees for generating ranking models stems from the fact that they are
easily understandable by following a path in the tree.
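The leaf-level estimates mentioned above can be written out in a few lines. The Python sketch below uses the standard textbook forms of the Laplace estimate, (n_c + 1) / (n + C), and the m-estimate, (n_c + m·p) / (n + m); the function name, the default values of `m` and `prior`, and the example counts are assumptions for illustration.

```python
def leaf_probability(class_count, leaf_total, n_classes=2,
                     method="raw", m=2.0, prior=0.5):
    """Estimate P(class | leaf) from the counts observed in one leaf.

    raw:     class_count / leaf_total
    laplace: (class_count + 1) / (leaf_total + n_classes)
    m:       (class_count + m * prior) / (leaf_total + m)
    """
    if method == "raw":
        return class_count / leaf_total
    if method == "laplace":
        return (class_count + 1) / (leaf_total + n_classes)
    if method == "m":
        return (class_count + m * prior) / (leaf_total + m)
    raise ValueError("unknown method: " + method)


# Two pure leaves of very different size receive the same raw estimate,
# but smoothing pulls the small leaf toward the prior.
small_raw = leaf_probability(3, 3)
large_raw = leaf_probability(30, 30)
small_laplace = leaf_probability(3, 3, method="laplace")
large_laplace = leaf_probability(30, 30, method="laplace")
```

Smoothing separates leaves of different sizes, but all instances inside a single leaf still share one estimate.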
PETs, however, suffer from several known problems: (1) All instances falling in the same leaf
will share the same probability estimate if the leaf is pure which leads to ties. (2) The number of
instances in a leaf is usually too small to provide a reliable probability estimate.
Ties result in ambiguity for the task of ranking. Kernel density based estimation techniques can
be used to resolve them. The modes of the underlying density are used to define cluster centers
and valleys in density define boundaries separating clusters. The value of the kernel function
provides an estimate of whether a point is close to the center of the class. Instances which are
closer to the center are ranked higher than others in the same class. For leaves with small number
of instances an adaptive kernel smoothing technique is used to ensure greater reliability of
probability estimates.
Non-parametric Kernel Density Estimation (KDE) techniques have been studied in statistics since
the 1950’s ([2]). The formulation of KDE presented here is based on these standard techniques.
Consider the univariate case of estimating the density f(x) given samples {x_i}, 1 ≤ i ≤ N, where
$p(X < t) = \int_{-\infty}^{t} f(x)\,dx$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$. The estimate $\hat{f}(x)$, called Parzen’s estimate, is
obtained by summing the contributions of the kernel $K(x - x_i)$ over all the samples and
normalizing such that the estimate itself is a density. Thus,
$$\hat{f}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right)$$
where h is the bandwidth of the estimator. Common choices of kernel include the Gaussian,
Cauchy and Epanechnikov kernels. A point x which lies in a dense region and has many data points close
to it is expected to have a high density estimate f(x) while a point which is far away will only
receive contributions from the tails of the kernels and the estimate is relatively small. The kernel
function obeys smoothness properties and choice of the bandwidth h can often be difficult. If h is
too small, the estimate is “spiky” while too large a value of h smoothes out details.
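As a minimal sketch of the Parzen estimate, the Python code below evaluates the sum of kernels with a Gaussian kernel. The sample values and the bandwidth h = 0.5 are hypothetical and chosen only to show the behaviour in dense and sparse regions.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def parzen_estimate(x, samples, h):
    """Parzen estimate: 1 / (N h) * sum_i K((x - x_i) / h)."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)


samples = [0.0, 0.1, -0.2, 0.3, 5.0]   # four clustered points and one outlier
near_cluster = parzen_estimate(0.0, samples, h=0.5)
between = parzen_estimate(2.5, samples, h=0.5)
```

The estimate is high near the cluster around zero and small at x = 2.5, which only receives contributions from the kernel tails.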
The Parzen method of density estimation (described above) is insensitive to variations in
magnitude of f(x). Thus if there is only one sample point, the estimate will have its peak at x = xk
and is likely to be too low at other regions; in contrast if f(x) has many sample points clustered
together accounting for a large value of f(x), the Parzen estimate is likely to be spread out. Since
the leaf-nodes in an unpruned decision tree can have even one example, the insensitivity to
peakedness of the kernel poses problems for ranking.
In the method of variable kernel estimation, the estimates are obtained as follows:
$$\hat{f}(x) = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{(\alpha_k D_K(x))^d} K\!\left(\frac{x - x_j}{\alpha_k D_K(x)}\right)$$
where $\alpha_k$ is a constant multiplicative factor. This method of estimation enables the kernel to be
spread out in low density areas while in high density areas it will be more peaked. Furthermore,
the variable kernel estimator combines the desirable smoothness properties of the Parzen-type
estimators with the data-adaptive nature of the K-Nearest Neighbor estimators. It also does not
incur extensive computation cost as the distance from a point to its K-Nearest Neighbors can be
computed once and stored as a pre-processing step in the algorithm.
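A univariate sketch of the variable kernel estimate follows. The text does not define D_K(x); the code assumes it is the distance from x to its K-th nearest sample, and uses a Gaussian kernel, α_k = 1 and d = 1. All of these choices, and the sample values, are assumptions for illustration rather than the author's settings.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def knn_distance(x, samples, k):
    """Distance from x to its k-th nearest sample (assumed meaning of D_K(x))."""
    return sorted(abs(x - xj) for xj in samples)[k - 1]


def variable_kernel_estimate(x, samples, k, alpha=1.0, d=1):
    """(1/N) * sum_j K((x - x_j) / w) / w**d with width w = alpha * D_K(x)."""
    width = alpha * knn_distance(x, samples, k)
    if width <= 0.0:
        raise ValueError("kernel width is zero; use a larger k")
    n = len(samples)
    return sum(gaussian_kernel((x - xj) / width) for xj in samples) / (n * width ** d)


samples = [0.0, 0.1, -0.2, 0.3, 5.0]
dense = variable_kernel_estimate(0.0, samples, k=2)    # narrow, peaked kernel
sparse = variable_kernel_estimate(4.0, samples, k=2)   # wide, spread-out kernel
```

The kernel width adapts to the local sample density: it is 0.1 at x = 0 and about 3.7 at x = 4 for these samples.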
The density estimation-based tree construction and ranking thus proceeds as described below:
Algorithm 1. Density Estimation based Ranking from Decision Trees
Input: The training data set.
Output: A ranking of instances in the training data.
1. Construct an unpruned classification tree using a standard decision tree construction
algorithm like C4.5 or CART.
2. Obtain the probability estimation for each instance.
3. Resolve ties using KDE. To obtain the probability estimate for a d-dimensional instance
follow the path in the decision tree to the appropriate leaf node.
4. Assume there are m classes -- for each class generate a local density estimate.
5. Smooth the Kernel Density Estimates obtained in step 3 by Variable Kernel Density
Estimation.
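The tie-breaking idea in step 3 can be sketched as follows. This is a simplified reading, not the author's implementation: instances are described by a single numeric attribute, the tree is replaced by precomputed leaf assignments and leaf probabilities, a fixed-bandwidth Gaussian Parzen estimate stands in for the variable kernel estimate, and all names and values are hypothetical.

```python
import math


def parzen_estimate(x, samples, h):
    """Gaussian Parzen estimate: 1 / (N h) * sum_i K((x - x_i) / h)."""
    norm = math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) / norm for xi in samples)
    return total / (len(samples) * h)


def rank_instances(instances, leaf_positives, h=0.5):
    """Order instances by leaf probability, breaking ties by local density.

    instances:      list of (name, leaf_id, leaf_probability, x)
    leaf_positives: dict mapping leaf_id to the positive-class training
                    values that fell into that leaf
    """
    def key(inst):
        _, leaf, prob, x = inst
        return (prob, parzen_estimate(x, leaf_positives[leaf], h))
    return [inst[0] for inst in sorted(instances, key=key, reverse=True)]


leaf_positives = {"A": [1.0, 1.2, 0.8], "B": [0.0]}
instances = [
    ("a_far", "A", 1.0, 2.0),    # same pure leaf as a_near, far from its mode
    ("b_only", "B", 0.4, 0.0),
    ("a_near", "A", 1.0, 1.0),   # same pure leaf, close to the class mode
]
ranking = rank_instances(instances, leaf_positives)
```

The two instances in leaf A tie on probability, and the density term places the one nearer the class mode first.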
Several aspects of the algorithm require further discussion: (1) Scalability: In general, kernel
density estimation techniques are known to scale poorly for multi-dimensional data. It can be
shown theoretically that to achieve a constant approximation error as the number of dimensions
grows one needs exponentially many more examples. The hybrid technique we propose here
requires the estimation of the density at the leaf nodes only and thus it suffices to use only those
attributes that appear in the path to the leaf. (2) Multi-class extension: While we do not present
results for multi-class ranking models, the design of the algorithm is robust enough to support
more than two classes. (3) Effect of Pruning Trees: For our algorithm we do not consider
pruning the decision trees. Such a design was conceived because prior work [3] discouraged
pruning for obtaining better probability estimates.
Future work involves examination of the effect of pruning techniques on kernel smoothing,
incorporation of shrinkage into the probability estimation scheme and comparison with other
ranking algorithms that are not based on decision trees.
References
[1] B. Cestnik and I. Bratko. On estimating probabilities in tree pruning. Proceedings of the European
Working Session on Learning, volume 482, pages 138–150, 1991.
[2] D. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley &
Sons, Inc., New York, NY, 1992.
[3] F. Provost and P. Domingos. Tree induction for probability-based ranking. Machine Learning,
52(3):199–215, 2003.
Optimization Techniques for Large Scale Learning
Clement Farabet1, Marco Scoffier2, and Yann LeCun1
1 Courant Institute of Mathematical Sciences
2 Net-Scale Technologies Inc.
Sept 9, 2011
Abstract
Optimization methods for unconstrained non-convex problems such as Conjugate Gradient
Descent (CG) and Limited-memory BFGS (L-BFGS) have long been used in machine learning, but for
many real world problems, especially problems where the models are large, function evaluations expensive, and data throughput a bottleneck, simple Stochastic Gradient Descent (SGD)
has provided equivalent results. The advent of increased parallelization available in today’s
computing clusters and multi-core architectures as well as the availability of general purpose
computation on FPGA and GPU architectures has changed the landscape of computation. We
evaluate several batch and stochastic optimization techniques applied to the task of learning
weights for large (deep) models on large datasets. We also make available our source code for
a parallel machine learning library so that other researchers can build upon our results.
1 Introduction
Recently Quoc Le et al. [1] have reported good results using L-BFGS and CG to train Deep
Networks. In this work we extend their exploration beyond the MNIST dataset. We have found
that using L-BFGS as a black box optimization does not immediately improve results on more
complicated datasets such as the scene understanding Stanford Background Dataset. While being
attracted to the ability to parallelize L-BFGS and CG in a map-reduce framework, we find that there
are details which must be understood to train large systems using these optimization techniques. In
this work we explore the space of large scale optimization of large models on large datasets and
seek to provide some rules of thumb for the designer of such systems.
2 Experiments
Using several large scale scene parsing datasets (Stanford Background and MSRC) as well as the
baselines of MNIST and NORB, we evaluate the convergence properties of SGD, L-BFGS and CG,
covering details of the algorithms which impact convergence under several different metrics.
2.1 Number of Function Evaluations
The overall number of function evaluations required is often the limiting factor when models are
large. SGD makes one weight update per function evaluation and is essentially a serial operation
which is difficult to parallelize and hence to scale. CG and L-BFGS require multiple function
evaluations per parameter update and operate on mini-batches of input which further adds to the
computational complexity per update. Do the updates from these more expensive techniques truly
outperform SGD by the time each algorithm has performed the same number of function evaluations
per training sample?
2.2 Wall Clock Time
The overall time spent can include the impact of parallelization. If an algorithm is more easily
parallelized it can process much more data in less time which makes it interesting even if it requires
more function evaluations.
2.3 Batch Size vs. Time Through Full Training Set
L-BFGS and CG re-evaluate the same mini-batch several times to make one update, moving through
the full training set more slowly than SGD which updates for every function evaluation. Smaller
batches resemble SGD, whereas larger batches are a better approximation of the full training set and
can be easily parallelized, but are harder to optimize. Where is the trade-off? Can we benefit from
adaptive batch sizes, emulating SGD at the beginning of training and large batches at the end?
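One possible form of an adaptive batch size is a schedule that grows geometrically from single-sample (SGD-like) updates to large batches. The Python sketch below is purely hypothetical: the abstract poses adaptive batch sizes as an open question and does not specify a schedule, and the function name, growth rule and end points are assumptions.

```python
def batch_size_schedule(n_updates, start=1, end=1024):
    """Geometrically interpolate batch sizes from `start` to `end`.

    Early updates use small, SGD-like batches; late updates use large
    batches that better approximate the full training set.
    """
    if n_updates == 1:
        return [end]
    ratio = (end / start) ** (1.0 / (n_updates - 1))
    return [max(1, round(start * ratio ** i)) for i in range(n_updates)]


sizes = batch_size_schedule(11, start=1, end=1024)
```

With 11 updates between 1 and 1024 the batch size doubles at every step; a practical schedule would more likely be driven by measured progress than by a fixed rule.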
2.4 Final Generalization Performance
In machine learning we are interested in generalization. Here we explore a set of under-the-hood
tweaks aimed at the best overall generalization error. Reaching 100% accuracy on the training set (perfect optimization) over-fits in most cases. A better optimization algorithm may be more prone to
over-fitting and thus require a smaller model, making the direct comparison of the same model
with different optimization techniques difficult. The optimization guides the choice of model and
vice-versa. Does a perfect model/optimization combination generalize better? Are we better off optimizing less on each batch with less rigorous stopping criteria or even allowing updates on partial
batch evaluations as can happen frequently in less stable cloud computing situations?
References
[1] Quoc Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Ng. On optimization methods for deep learning. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th
International Conference on Machine Learning (ICML-11), ICML ’11, pages 265–272, New York, NY,
USA, June 2011. ACM.
Real-time Multi-class Segmentation using Depth Cues
Nathan Silberman, Clément Farabet, Rob Fergus, Yann LeCun
Courant Institute of Mathematical Sciences
New York University
September 9, 2011
This work demonstrates a real-time multi-class segmentation system. While significant
progress has been made in multi-class segmentation over the last few years, per-pixel label
prediction for a given image typically takes on the order of minutes. This renders use of
these systems impractical for real-time applications such as robotics, navigation and human-computer interaction. Concurrent with these advances, there has been a renewed interest in
the use of depth sensors following the release of the Microsoft Kinect [4] to aid various tasks
in computer vision. In this work, we describe a real-time system that provides dense label
predictions for a scene given both intensity and depth images. We evaluate this system against
a recently released, densely labeled depth dataset [7].
Our approach is motivated by the desire to provide real-time prediction while not sacrificing accuracy. Using a multi-scale neural network, we are able to provide local predictions
using scene-level context. Guided by recent success in detection using high-throughput depth
cameras [6], we provide the neural network with both intensity and depth inputs of a scene.
The end-to-end prediction pipeline is structured as follows: Using open source drivers [3],
we read RGB and depth images from the Kinect in real time. Before they can be fed to
the neural network, each pair of frames is aligned using standard techniques for recovery of
homography matrices. Following alignment, the depth maps still contain numerous artifacts.
Most notable of these is a depth shadow on the left edges of objects. These regions are
visible from the depth camera, but not reached by the infra-red laser projector pattern. Consequently their depth cannot be estimated, leaving a hole in the depth map. A similar issue
arises with specular and low albedo surfaces. The internal depth estimation algorithm also
produces numerous fleeting noise artifacts, particularly near edges. Before extracting features
for recognition, these artifacts must be removed. To do this, we filtered each image using the
cross-bilateral filter of Paris [5]. Using the RGB image intensities, it guides the diffusion of
the observed depth values into the missing shadow regions, respecting the edges in intensity.
Given D images of size M × N, we train a multi-scale convolutional network [2] to associate each input pixel I_ij to one of K classes. Intuitively, the multi-scale representation allows the network to predict the class distribution at each location ij by looking at the entire scene, with high resolution around ij and a logarithmically decreasing resolution as the distance to ij increases.
Each pixel I_ij is a 4-dimensional vector with components being the 3 classical color channels (red, green, blue), and the aforementioned filtered depth estimate. Each target (label) is a K-dimensional vector t with a single non-zero element. To achieve the multi-scale representation, a Gaussian pyramid G = {X_s | s = 1..S} is first constructed on the input image I. A convolutional network C(X_s, W) is then applied to each scale, yielding a pyramid of feature maps {Y_s | s = 1..S}. These maps are then upsampled, producing S maps of size M × N, and concatenated to produce a single map of P-dimensional vectors. A two-layer classifier (perceptron) is then applied at each location ij, producing a single K-dimensional vector, which is normalized using a softmax function to produce a distribution over classes for that location. Let f_ij(I, W) be the transform that associates each input pixel I_ij to an output class distribution. The error function we minimize is the negative log-likelihood (or multi-class cross-entropy) over the dataset:
\[
E(f(I, W), t) = -\sum_{d=1}^{D} \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{K} t_{kd} \ln(f_{ij}(I, W)).
\]
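The error function above can be illustrated with a small sketch. It assumes the classifier scores at each pixel are already available as plain Python lists and that each one-hot target is stored as a class index; the names `softmax`, `pixel_nll`, and `image_nll` are illustrative and do not come from the abstract.

```python
import math

def softmax(scores):
    """Normalize K real-valued scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pixel_nll(scores, target_class):
    """Negative log-likelihood of the labeled class at one pixel."""
    return -math.log(softmax(scores)[target_class])

def image_nll(score_map, label_map):
    """Sum the per-pixel loss over an M x N map of K-dimensional scores."""
    return sum(
        pixel_nll(score_map[i][j], label_map[i][j])
        for i in range(len(score_map))
        for j in range(len(score_map[0]))
    )
```

Summing `image_nll` over all D training images would give the dataset-level loss; with a one-hot target, only the labeled class contributes at each pixel.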
The convolutional network at each scale is the same, using a single parameter vector W. Intuitively, having a different set of parameters for each scale would result in a much larger model, which is thus more prone to over-fitting. Sharing weights can be seen as a simple form of regularization, as the ConvNet C(G, W) is trying to learn a good representation for all scales.
Once trained, the convolutional network can be efficiently computed on commodity hardware, such as GPUs or FPGAs, thanks to its homogeneous structure, which essentially involves 2D convolutions. We use the neuFlow processor [1] to process the convolutional network, reducing its computation from a few seconds in software to about 100ms.
To evaluate our system, we used the dataset from [7], which contains 2347 densely labeled images. While the original dataset contains over 1000 classes, we selected 12 of the most common classes and a generic background class covering the remaining labels. Since many of the images in the dataset are taken from the same room, we split the images into a training and test set, ensuring that images taken from the same scene were never split across training and test sets. We measured the performance of the system using the mean diagonal of the confusion matrix, computed for per-pixel classification over the 13 classes on the test set. We obtained 51% accuracy, which is on par with [7], while running at several frames a second rather than over a minute per image.
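As a rough illustration of the evaluation metric, the sketch below computes the mean diagonal of a row-normalized confusion matrix from flat lists of per-pixel labels. The row normalization (by true-class count) and the choice to skip classes that never occur in the test labels are assumptions; the abstract does not spell out these details, and the function name is hypothetical.

```python
def mean_diagonal_accuracy(true_labels, predicted_labels, num_classes):
    """Average per-class recall, i.e. the mean diagonal of the
    row-normalized confusion matrix."""
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        confusion[t][p] += 1
    per_class = []
    for k in range(num_classes):
        row_total = sum(confusion[k])
        if row_total > 0:  # assumption: ignore classes absent from the test set
            per_class.append(confusion[k][k] / row_total)
    return sum(per_class) / len(per_class)
```

Unlike plain pixel accuracy, this measure weights every class equally, so a large background class cannot dominate the score.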
References
[1] Clément Farabet, Berin Martini, Polina Akselrod, Selcuk Talay, Yann LeCun, and Eugenio Culurciello. Hardware accelerated convolutional neural networks for synthetic vision systems. In
International Symposium on Circuits and Systems (ISCAS'10), Paris, May 2010. IEEE.
[2] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications
in vision. In International Symposium on Circuits and Systems (ISCAS'10), Paris, May 2010.
IEEE.
[3] Hector Martin. Openkinect.org. Website, 2010. http://openkinect.org/.
[4] Microsoft. Microsoft kinect. Website, 2010. http://www.xbox.com/kinect/.
[5] Sylvain Paris and Fredo Durand. A fast approximation of the bilateral filter using a signal processing
approach. In Proceedings of the European Conference on Computer Vision, pages 568–580, 2006.
[6] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore,
Alex Kipman, and Andrew Blake. Real-Time Human Pose Recognition in Parts from Single Depth
Images. June 2011.
[7] N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In Proceedings of the International Conference on Computer Vision - Workshop on 3D Representation
and Recognition, 2011.
A text-based HMM model of foreign affair sentiment:
A Mechanical Turkers’ history of recent geopolitical events
by Sean Gerrish (with helpful comments from David Blei)
Historians currently have limited resources to review thousands of relevant news sources, leading to research biased by popular knowledge or even politics and culture. The goal of this work
is to create a history of the relationships between the world’s nations using the text of newspaper
articles and political commentary.
An assumption of our work is that the tension between two nations – or a warm and robust
relationship between them – is reflected by the language we use to discuss them, an idea inspired
by relational topic models [1]. An advantage of a text-based approach to history is that we can
incorporate information from all articles of a given collection with modest computational cost.
This means that historians and political scientists can then search and review past documents to
identify forgotten or overlooked “blips” in history.
A Spatial Model of Political Sentiment
To infer the sentiment between nations, we assume that each nation lies in a space of latent “foreign sentiment”. A spatial model has two benefits. First, it allows for interpretability: nations
with similar positions in this latent space tend to interact more positively, while nations further
apart tend to have more tension in their relationship. Second, a spatial model allows us to draw on
existing work in multidimensional scaling, including work in both item response theory [2] and
latent space models [3].
A temporal model of interaction. We assume that each nation c starts at a mean position x̄_{c,0} ∈ R^p and drifts over time with the Markov transition
\[
\bar{x}_{c,t} \mid \bar{x}_{c,t-1} \sim \mathcal{N}(\bar{x}_{c,t-1}, \sigma_K^2). \tag{1}
\]
At any time t, countries c1 and c2 may appear in an article d in the news, and the news may reflect their relationship with sentiment s_d ∈ R with the noisy model
\[
x_{c_1,d} \sim \mathcal{N}(\bar{x}_{c_1,t}, \sigma_D^2), \qquad
x_{c_2,d} \sim \mathcal{N}(\bar{x}_{c_2,t}, \sigma_D^2), \qquad
s_d := x_{c_1,d}^{T} x_{c_2,d}, \tag{2}
\]
where we interpret sd as the sentiment between c1 and c2 : a higher sentiment suggests warm
relations between the nations, while a lower sentiment suggests tension or conflict between the
nations. A graphical model representing these assumptions is shown in Figure 1 (a).
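The generative assumptions in Equations (1) and (2) can be simulated in a few lines. This is only a sketch for two hypothetical countries with one article per time step; the function name, the starting positions, and the use of independent noise per coordinate are assumptions made for illustration.

```python
import random

def simulate_sentiment(x1_start, x2_start, num_steps, sigma_k, sigma_d, seed=0):
    """Simulate drifting positions (Eq. 1) and per-article sentiment (Eq. 2).

    sigma_k and sigma_d are standard deviations, i.e. square roots of the
    variances written in the equations."""
    rng = random.Random(seed)
    x1, x2 = list(x1_start), list(x2_start)
    sentiments = []
    for _ in range(num_steps):
        # Markov drift of each country's mean position
        x1 = [v + rng.gauss(0.0, sigma_k) for v in x1]
        x2 = [v + rng.gauss(0.0, sigma_k) for v in x2]
        # noisy article-level positions around the current means
        a1 = [v + rng.gauss(0.0, sigma_d) for v in x1]
        a2 = [v + rng.gauss(0.0, sigma_d) for v in x2]
        # sentiment is the inner product of the two article-level positions
        sentiments.append(sum(u * w for u, w in zip(a1, a2)))
    return sentiments
```

With both noise levels set to zero the sentiment stays at the inner product of the starting positions, which makes the interpretation concrete: aligned positions give positive sentiment, opposed positions give negative sentiment.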
We assume that the sentiment between these two countries will be reflected by the language
used to describe them in each news article. We therefore use a discriminative language model to
learn sentiment based on these articles’ text with text regression [4] (see footnote 1). As alluded to above, we fit
text regression parameters with supervision using Amazon Mechanical Turk.
Inference
We fit the MAP objective of this probabilistic model. This has the benefit of both cleaner exposition
and simpler implementation.
Footnote 1: In one of the simplest language models, the nth word w_n of the article is determined by a mixture of the sentiment, p(w_n | s_d) ∝ exp(s_d a_w + b_w), where we interpret the coefficient a_w to be term w’s sentiment parameter and the intercept b_w to be its background distribution.
[Figure 1 panels: (a) graphical model with plates C and D over time steps t−1, t, t+1; (b) country positions in the latent space, axes Ip1 and Ip2; (c) “Sentiment with the United States” versus Date (1990 to 2005), one curve per Country (afghanistan, canada, china, france, iraq, israel, pakistan, russia/ussr, switzerland, united_states).]
Figure 1: (a) A time-series model of countries’ interactions. The large plate shows Markov drift for each country. Articles with words w_d appear at various intervals, expressing sentiment s between countries. Pseudo-observations at position zero are added for regularization for sparsely mentioned countries. (b) Example positions of countries in the latent space of national sentiment. (c) Mutual sentiment s = x̄_{c1,·}^T x̄_{united states,·} with the United States over time.
We optimize the MAP objective in this model using an EM-like algorithm. In the M step, the
mean x̄c of each country’s position is estimated using a modified Kalman filter; the primary difference from a standard Kalman filter is that we may have zero or multiple observations at each time
period for each country pair. In the E step, the sentiment for each news article is inferred, given
its mean and the sentiment suggested by the article’s text, combined with a sentiment suggested
by Mechanical Turkers.
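To make the "zero or multiple observations per time period" point concrete, here is a minimal one-dimensional sketch of a random-walk Kalman filter. It is a simplification, not the authors' filter: it assumes each observation measures the position directly with Gaussian noise, whereas in the model above an article only reveals an inner product of two positions. All names and default values are hypothetical.

```python
def random_walk_kalman(observations_per_step, drift_var, obs_var,
                       prior_mean=0.0, prior_var=1.0):
    """Filter a scalar random walk where each time step carries a list of
    zero, one, or several noisy observations."""
    mean, var = prior_mean, prior_var
    estimates = []
    for observations in observations_per_step:
        var += drift_var  # predict: the random-walk drift adds uncertainty
        for y in observations:  # apply one update per observation, if any
            gain = var / (var + obs_var)
            mean += gain * (y - mean)
            var *= (1.0 - gain)
        estimates.append((mean, var))
    return estimates
```

A step with no observations only widens the variance, while a step with several observations shrinks it repeatedly, which is the behavior the modified filter has to accommodate.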
Experiments and Results
We fit this model over twenty years of the New York Times’s Foreign Desk, a collection of 94,589
articles from 1987 to 2007. We performed standard stopword removal and tagged each article
with the two most-frequently mentioned countries. Figures 1 (b,c) show results based on both
mechanical turkers’ ratings and manually-curated scores for individual words.
In future work, we hope to extend the language model and to apply this method to both Foreign
Affairs, a journal covering much of the past century’s international foreign policy (from a U.S.
perspective), and the Journal of Modern History.
References
[1] Chang, J., D. M. Blei. “Relational topic models for document networks.” Proceedings of the
12th International Conference on Artificial Intelligence and Statistics (AIStats) 2009, 5, 2009.
[2] Martin, A. D., K. M. Quinn. “Dynamic ideal point estimation via markov chain monte carlo
for the u.s. supreme court, 1953-1999.” Political Analysis, 10:134–153, 2002.
[3] Hoff, P., A. E. Raftery, M. S. Handcock. “Latent space approaches to social network analysis.”
Journal of the American Statistical Association, 97:1090–1098, 2002.
[4] Kogan, S., D. Levin, B. Routledge, J. Sagi, N. Smith. “Predicting risk from financial reports with regression.” In “ACL Human Language Technologies,” 272–280. Association for
Computational Linguistics, 2009.
Title: A Probabilistic Foundation for Policy Priors
Authors: Samuel Gershman (Princeton), Carlos Diuk (Princeton), David Wingate (MIT)
Abstract
We describe a probabilistic framework for incorporating structured inductive biases into
reinforcement learning. These inductive biases arise from policy priors, probability distributions
over optimal policies. Borrowing recent ideas from computational linguistics and Bayesian
nonparametrics, we define several families of policy priors that express compositional, abstract
structure in a domain. Compositionality is expressed using probabilistic context-free grammars,
enabling a compact representation of hierarchically organized sub-tasks. Useful sequences of
sub-tasks can be cached and reused by extending the grammars nonparametrically. We present
Monte Carlo methods for performing inference, and show how structured policy priors lead to
substantially faster learning in complex domains compared to methods without inductive biases.
Contact:
Samuel Gershman
Graduate student
Department of Psychology and Princeton Neuroscience Institute
Princeton University
sjgershm@princeton.edu
Large-Scale Collection Threading using Structured k-DPPs
Jennifer Gillenwater Alex Kulesza Ben Taskar
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104
{jengi,kulesza,taskar}@cis.upenn.edu
September 9, 2011
Thanks to the increasing availability of large, interrelated document collections, we now have easy access
to vast stores of information that are orders of magnitude too big for manual examination. Tools like search
have made these collections useful for the average person; however, search tools require prior knowledge of
likely document contents in order to construct a query, and typically reveal no relationship structure among
the returned documents. Thus we can easily find needles in haystacks, but understanding the haystack itself
remains a challenge.
One approach for addressing this problem is to provide the user with a small, structured set of documents
that reflect in some way the content space of the collection (or possibly a sub-collection consisting of documents related to a query). In this work we consider structure expressed in “threads”, i.e., singly-connected
chains of documents. For example, given a corpus of academic papers, we might want to identify the most
significant lines of research, representing each by a citation chain of its most important contributing papers.
Or, in response to a search for news articles from a particular time period, we might want to show the user
the most significant stories from that period, and for each such story provide a timeline of its major events.
We formalize collection threading as the problem of finding diverse paths in a directed graph, where
the nodes correspond to items in the collection (e.g., papers), and the edges indicate relationships (e.g.,
citations). A path in this graph describes a thread of related items, and by assigning weights to nodes and
edges we can place an emphasis on high-quality paths. A diverse set of high-quality paths then forms a cover
for the most important threads in the collection.
To model sets of paths in a way that allows for repulsion (and hence diversity), we employ the structured
determinantal point process (SDPP) framework [1], incorporating k-DPP extensions to control the size of
the produced threads [2]. The SDPP framework provides a natural model over sets of structures where
diversity is preferred, and offers polynomial-time algorithms for normalizing the model and sampling sets of
structures.
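The preference for diversity can be seen in the basic (unstructured) determinantal point process, where the probability of a set Y is proportional to the determinant of the kernel submatrix indexed by Y. The sketch below illustrates only that standard property on a handful of items; it does not implement the structured factorization over paths or the k-DPP normalization, and the helper names are illustrative.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def kernel(features, quality):
    """L[i][j] = q_i * q_j * <phi_i, phi_j> for quality q and features phi."""
    n = len(features)
    return [[quality[i] * quality[j] *
             sum(u * v for u, v in zip(features[i], features[j]))
             for j in range(n)] for i in range(n)]

def unnormalized_score(L, subset):
    """det(L_Y): proportional to the DPP probability of selecting subset Y."""
    return determinant([[L[i][j] for j in subset] for i in subset])
```

Two items with identical feature vectors get a score of zero as a pair, while two orthogonal items score the product of their squared qualities, so similar items repel each other.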
However, even these polynomial-time algorithms can be too slow when dealing with many real-world
datasets, since they scale quadratically in the number of features. (In our experiments, the exact algorithms
would require over 200 terabytes of memory.) We address this problem using random feature projections,
which reduce the dimensionality to a manageable level. Furthermore, we show that this reduction yields a
close approximation to the original SDPP distribution, proving the following theorem based on a result of
Magen and Zouzias [3].
Theorem 1. Let P^k be the exact k-SDPP distribution on sets of paths, and let P̃^k be the k-SDPP distribution after projecting the similarity features to dimension d = O(max{k/ε, (log(1/δ) + log N)/ε²}), where N is the total number of possible path sets. Then with probability at least 1 − δ,
\[
\| P^k - \tilde{P}^k \|_1 \le e^{6k\epsilon} - 1 \approx 6k\epsilon. \tag{1}
\]
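A random feature projection of the kind referred to here can be sketched as multiplication by a scaled Gaussian matrix. This is a generic illustration that assumes a dense Gaussian projection; the abstract does not say which projection family the authors used, and the function names and dimensions are hypothetical.

```python
import math
import random

def random_projection_matrix(d, D, rng):
    """d x D Gaussian matrix, scaled so squared norms are preserved in expectation."""
    scale = 1.0 / math.sqrt(d)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(D)] for _ in range(d)]

def project(matrix, vector):
    """Map a D-dimensional feature vector to d dimensions."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
```

In practice the projection is useful when d is much smaller than the original feature dimension D, since the memory cost noted above grows quadratically in the number of features.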
Finally, we demonstrate our model using two real-world datasets. The first is the Cora research paper
dataset, where we extract research threads from a set of about 30,000 computer science papers. Figure 1
Thread 1 (left):
• Retrieval and Reasoning in Distributed Case Bases
• Cooperative Information Gathering: A Distributed Problem Solving Approach
• MACRON: An Architecture for Multi-agent Cooperative Information Gathering
• Research Summary of Investigations Into Optimal Design-to-time Scheduling
• Control Heuristics for Scheduling in a Parallel Blackboard System
• Partial Global Planning: A Coordination Framework for Distributed Hypothesis Formation
• 3 Distributed Problem Solving and Planning
• Designing a Family of Coordination Algorithms
• Quantitative Modeling of Complex Environments
• Introducing the Tileworld: Experimentally Evaluating Agent Architectures
Thread 2 (right):
• Auto-blocking Matrix-Multiplication or Tracking BLAS3 Performance with Source Code
• A Model and Compilation Strategy for Out-of-Core Data Parallel Programs
• Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors
• Designing Memory Consistency Models For Shared-Memory Multiprocessors
• Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors
• Integrating Message-Passing and Shared-Memory: Early Experience
• Integrated Shared-Memory and Message-Passing Communication in the Alewife Multiprocessor
• Compiling for Shared-Memory and Message-Passing
• Compiling for Distributed-Memory Systems
• Distributed Memory Compiler Design for Sparse Problems
Figure 1: Two example threads (left and right) from running a 5-SDPP on the Cora dataset.
          2005a   2005b   2006a   2006b   2007a   2007b   2008a   2008b
CLS       3.53    3.85    3.76    3.62    3.47    3.32    3.70    3.00
NMX       3.87    3.89    4.59    5.12    3.73    3.49    4.58    3.59
k-SDPP    6.91*   5.49*   5.79*   8.52*   6.83*   4.37*   4.77    3.91
Table 1: Quantitative results on news text, measured by similarity to human summaries. a = January-June;
b = July-December. Starred (*) entries are significantly higher than others in the same column at 99%
confidence.
shows some sample threads pulled from the collection by our method. The second dataset is a multi-year
corpus of news text from the New York Times, where we produce timelines of the major events over six
month periods. We compare our method against multiple baselines using human-produced news summaries
as references. Table 1 contains measurements of similarity to the human summaries for our approach (k-SDPP) versus clustering (CLS) and non-max suppression (NMX) baselines.
Future work includes applying the diverse path model to other applications, and studying the empirical
tradeoffs between speed, memory, and accuracy inherent in using random projections.
References
[1] Alex Kulesza and Ben Taskar. Structured determinantal point processes. In Proc. Neural Information
Processing Systems, 2010.
[2] A. Kulesza and B. Taskar. k-DPPs: fixed-size determinantal point processes. In Proceedings of the 28th
International Conference on Machine Learning, 2011.
[3] A. Magen and A. Zouzias. Near optimal dimensionality reductions that preserve volumes. Approximation,
Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 523–534, 2008.
Online Learning for Mixed Membership Network Models
Prem Gopalan, David Mimno, Michael J. Freedman and David M. Blei
9 September 2011
Introduction. In this era of “Big Data”, there is intense interest in analyzing large networks using statistical models. Applications range from community detection in online social networks to predicting the
functions of a protein. MMSB [1] is a powerful mixed-membership model for learning communities and
their interactions. It assigns nodes to multiple communities rather than simple clusters. Posterior inference
for MMSB is intractable, and approximate inference algorithms such as variational inference or MCMC
sampling are applied. However, these methods require multiple passes through the data, and do not easily
work with streaming data. Inspired by the recent work on online variational Bayes for LDA [2], we develop
an online variational Bayes algorithm for MMSB based on stochastic optimization.
A mixed-membership model. MMSB is a Bayesian probabilistic model of relational data that assumes
context dependent membership of nodes in K groups, and that each interaction can be explained by two
interacting groups. Given the groups, the generative process draws a K-dimensional mixed-membership
vector πa ∼ Dir(α) for each node a and per-interaction membership indicators za→b , za←b for each binary
pair ya,b . The indicators are used to index into a blockmodel matrix BK×K of Bernoulli rates, and ya,b is
drawn from it. The only observed variables in this model are ya,b .
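The generative process just described can be sketched with the standard library alone. The sketch assumes a small directed network, draws Dirichlet vectors by normalizing Gamma samples, and uses hypothetical names (`sample_mmsb`, `block_rates`); it illustrates the model, not the inference algorithm.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index k with probability probs[k]."""
    u, cumulative = rng.random(), 0.0
    for k, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return k
    return len(probs) - 1  # guard against floating-point round-off

def sample_mmsb(num_nodes, alpha, block_rates, seed=0):
    """Sample memberships pi and binary interactions y for every ordered pair."""
    rng = random.Random(seed)
    pi = [sample_dirichlet(alpha, rng) for _ in range(num_nodes)]
    y = {}
    for a in range(num_nodes):
        for b in range(num_nodes):
            if a == b:
                continue
            z_send = sample_categorical(pi[a], rng)  # indicator z_{a->b}
            z_recv = sample_categorical(pi[b], rng)  # indicator z_{a<-b}
            rate = block_rates[z_send][z_recv]       # Bernoulli rate from B
            y[(a, b)] = 1 if rng.random() < rate else 0
    return pi, y
```

Only `y` would be visible to an inference algorithm; `pi` and the indicators are the latent quantities it has to recover.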
There is a degree of non-identifiability in MMSB due to both π_p and B competing to explain reciprocated
interactions. If communities are known to be densely connected internally with sparse external interactions,
or have only reciprocated interactions, as is the case with some online social networks, then a simpler model
suffices. We replace B with K intragroup interaction rates βk ∼ Beta(ηk ) and a small, fixed interaction rate
ε between distinct groups.
Posterior inference. In variational Bayes for MMSB, the true posterior is approximated by a simpler distribution q(β, π, z→, z← | γ, φ→, φ←, λ). Following [1], we choose a fully factorized distribution q of the form q(z_{a→b} = k) = φ_{a→b,k}, q(π_p) = Dir(π_p; γ_p) and q(β_k) = Dir(β_k; λ_k). We then apply stochastic optimization to the variational objective. We subsample the interactions y_{a,b}, compute an approximate gradient, and follow the gradient with a decreasing step size. Options for subsampling include selecting a node or a pair of interactions uniformly at random, or sampling by exploration, where we first select a node uniformly at random and then explore its neighbors. We derive a first-order stochastic natural gradient algorithm for MMSB below, assuming random pair sampling.
Define f(y_{a,b}, β_k) = β_k^{y_{a,b}} (1 − β_k)^{(1 − y_{a,b})}. Initialize γ, λ.
for t = 0 to ∞ do
    E step ∀ (a, b) in mini-batch S. Initialize φ^t_{a→b}, φ^t_{a←b}.
    repeat
        Set g(φ, k) = Σ_{i≠k} φ^{t−1}_i log f(y_{a,b}, ε)
        Set φ^t_{a→b,k} ∝ exp{E_q[log π_{a,k}] + φ^{t−1}_{a←b,k} E_q[log f(y_{a,b}, β_k)] + g(φ_{a←b}, k)} ∀k
        Set φ^t_{a←b,k} ∝ exp{E_q[log π_{b,k}] + φ^{t−1}_{a→b,k} E_q[log f(y_{a,b}, β_k)] + g(φ_{a→b}, k)} ∀k
    until convergence
    M step
    Compute γ̃_{a,k} = α_k + (N(N−1)/|S|) Σ_S (φ^t_{a→·,k} + φ^t_{·←a,k}) ∀k, ∀a
    Compute λ̃_{k,i} = η_{k,i} + (N(N−1)/|S|) Σ_S (φ^t_{a→b,k} φ^t_{a←b,k} y_{a,b,i}) ∀i ∈ (0, 1), ∀k
    Set γ = (1 − ρ_t)γ + ρ_t γ̃. Set λ = (1 − ρ_t)λ + ρ_t λ̃
end for
Figure 1: Online variational Bayes for MMSB
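The final update of the algorithm blends the current global parameters with the noisy mini-batch estimate using a decreasing step size ρ_t. The sketch below shows that blending step only. The schedule ρ_t = (τ + t)^(−κ) is an assumption borrowed from the online LDA work cited as [2]; the abstract mentions a learning parameter κ but does not state the schedule, and the function names are illustrative.

```python
def step_size(t, tau=1.0, kappa=0.7):
    """Assumed schedule rho_t = (tau + t) ** (-kappa); decays for kappa > 0."""
    return (tau + t) ** (-kappa)

def rescale(minibatch_sum, num_nodes, batch_size):
    """Scale a mini-batch sum by N(N-1)/|S| so it stands in for the full sum."""
    return minibatch_sum * num_nodes * (num_nodes - 1) / batch_size

def blend(current, noisy_estimate, rho):
    """Weighted average (1 - rho) * current + rho * estimate, elementwise."""
    return [(1.0 - rho) * c + rho * n for c, n in zip(current, noisy_estimate)]
```

Early iterations (large ρ_t) move the parameters quickly toward the mini-batch estimate, while later iterations average over many mini-batches.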
Despite conceptual and notational similarities with online LDA [2], online MMSB faces unique challenges. First, the online LDA E step finds locally optimal values of parameters associated with a selected document holding the topics fixed. In MMSB a given node a’s γ_a cannot be optimized independently of other nodes. Therefore, we derive updates for both γ and λ. Second, the dimension of γ, the number of nodes, can be very large in contrast to LDA’s topics. Finally, MMSB can leverage efficient subsampling strategies, such as forest fire sampling, to exploit network structure.
[Figure 2 plots: (a) Perplexity and (b) Per-edge RSS versus number of nodes (512, 1024, 2048), with panels for K=3 and K=5 and rows for kappa=0.5, 0.7, 0.9; curves: online, batch.]
Figure 2: Perplexity (left) and accuracy (right) comparisons of batch and online MMSB on simulated networks for various numbers of nodes and values of the learning parameter κ, run for 4 hours each. Accuracy is measured as the RSS between two sets of Hellinger distances, one based on the true π and the other based on E[π]. The distances are computed only for y_{a,b} = 1. In both plots, the lower the value the better the performance of the algorithms. For both K=3 and K=5, online has lower perplexity than batch as N increases, and approximately equal perplexity for N=512. Accuracy becomes significantly poorer for batch as N increases compared to online. Batch values for N=2048 are unavailable because they took longer than 4 hours to compute.
[Figure 3 plots: (a) approx. log likelihood versus time in seconds (online, arXiv subgraph); (b) number of papers versus years (1992 to 2000) for group:1, group:2, group:3; (c) held-out likelihood versus time in seconds, with curves for held-out edges and held-out non-edges.]
Figure 3: (a) and (b) show the convergence and resulting groups (shown are histograms of publications by year) respectively on a 256 node subgraph of the arXiv citation dataset. (c) shows the convergence of held-out likelihood computed over incoming mini-batches on a 4096 node subgraph of the arXiv citation dataset. We set K = 3 in both cases.
Preliminary results on real datasets. We ran online MMSB on a complete subgraph of 256 nodes from the arXiv high-energy physics citation graph and obtained reasonable groups. The histogram of publications by year in each group is shown in Fig. 3(b). Group 3 consists of many publications in 1995. The average frequency of certain top words (Calabi-Yau, string, symmetry, etc.) in Group 3 was 150–180% of that in Groups 1 and 2. 1995 marked the beginning of the second superstring revolution, with a flurry of research activity spurred by M-theory. There is a greater frequency of singular memberships in Group 3. We are currently evaluating results on a subgraph of 4096 nodes.
References
[1] E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic blockmodels, Journal of Machine Learning
Research, 9:1981–2014, 2008.
[2] M. Hoffman, D. Blei, and F. Bach. Online learning for latent Dirichlet allocation, Neural Information Processing Systems,
2010.
Planning in Reward Rich Domains via PAC Bandits
Sergiu Goschin, Ari Weinstein, Michael Littman, Erick Chastain
{sgoschin,aweinst,mlittman,erickc}@cs.rutgers.edu
Rutgers University
1 Introduction
In some decision-making environments, successful solutions are common. If the evaluation of candidate solutions is noisy, however, the challenge is knowing when a “good enough” answer has been found. We formalize this problem as an infinite-armed bandit and provide upper and lower bounds on the number of evaluations or “pulls” needed to identify a solution whose evaluation exceeds a given threshold r0. We use the algorithms presented to identify reliable strategies for solving several screens from the video games Infinite Mario and Pitfall!.
Consider the following simple problem. A huge jar of marbles contains some fraction ρ of black (success) marbles and the rest white (failure) marbles. We want to find a black marble as quickly as possible. If the black marbles are sufficiently plentiful in the jar, the problem is simple: repeatedly draw marbles from the jar until a black one is found. The expected sample complexity is Θ(1/ρ). This kind of generate-and-test approach is simple, but can be extremely effective when solutions are common. For example, finding an unsatisfying assignment for a randomly generated CNF formula is well solved with this approach.
The corresponding noisy problem is distinctly more challenging. Imagine the marbles in our jar will be used to roll through some sort of obstacle course and (due to weight or balance or size) some marbles are more successful at completing the course than others. If we (quickly) want to find a marble that navigates the obstacle course successfully at least r0 = 25% of the time, how do we best allocate our test runs on the course? When do we run another evaluation of an existing marble and when do we grab a new one out of the jar? How do we minimize the (expected) total number of runs while still assuring (with high probability) that we end up with a good enough marble?
2 Setting
We formalize this problem as an infinite-armed bandit. We define an arm as a probability distribution over possible reward values over a bounded range [rmin, rmax]. When an arm is pulled, it returns a reward value. One arm a is preferred to another a′ if it has a higher expected reward value, E[a] > E[a′]. Arms are sampled from an arm space S, possibly infinitely large. The distribution D over the arm space defines an infinite-armed bandit problem IB(D).
We seek algorithms that take a reward level r0 as input and attempt to minimize the number of pulls needed to identify an arm with expected value of r0 or more. This sample complexity has a dependence on D, as it may be likely or unlikely to encounter an arm with high reward. Specifically, define ρ = P_{a∼D}(E[a] ≥ r0) as the probability of sampling a “good enough” arm. We assume the domain is reward rich, that is, ρ is bounded away from zero.
Formally, we define an (ε, δ, r0)-correct algorithm ALG for an IB(D) problem to be an algorithm that after a number of samples T(ε, δ, r0, D) (that is finite with probability 1) returns an arm a with expected value E[a] ≥ r0 − ε with probability at least 1 − δ.
This setting extends the PAC Bandit model [1, 3] to an infinite number of arms. The definition of the performance measure generalizes earlier work [4] by allowing the agent to aspire to any reward level, not just the optimal value.
3 Results
We can prove a lower bound on the expected sample complexity of a correct algorithm for the case of Bernoulli arms of the form T(ε, δ, r0, D) = Ω((1/ε²)(1/ρ + log(1/δ))).
On the algorithmic side, the interesting trade-off is between getting better accuracy estimates for the expectations of previously sampled arms versus sampling new arms to try to get one with a higher value. When the concentration of good rewards (ρ) is known, the problem can be reduced to the PAC-Bandit setting [1] with the upper bounds within a logarithmic factor of the lower bound.
Algorithm 1 Template (ε, δ, r0, RejectionProcedure)
    i = 1, found = FALSE
    for i = 1, 2, ... do
        Sample a new arm a_i ∼ D
        decision = RejectionProcedure(a_i, i, ε, δ, r0)
        if decision = ACCEPT then
            return a_i
        end if
    end for
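The template in Algorithm 1 can be sketched in Python with a simple uniform rejection procedure. The rejection rule here is an illustrative assumption, not the authors' exact procedure: rewards are taken to lie in [0, 1], each arm is pulled a fixed number of times chosen from Hoeffding's inequality, and the confidence budget δ/(2i²) for the i-th arm is one convenient choice whose sum over all arms stays below δ. All function names are hypothetical.

```python
import math
import random

def make_bernoulli_arm(p):
    """Return a pull function for a hypothetical arm with success probability p."""
    def pull(rng):
        return 1.0 if rng.random() < p else 0.0
    pull.expected = p  # stored only so the example can be inspected
    return pull

def pulls_needed(epsilon, delta_i):
    """Pulls so the empirical mean is within epsilon/2 of the true mean with
    probability at least 1 - delta_i (Hoeffding, rewards in [0, 1])."""
    return math.ceil((2.0 / epsilon ** 2) * math.log(2.0 / delta_i))

def uniform_rejection(arm, i, epsilon, delta, r0, rng):
    """Pull the arm a fixed number of times and accept it if it looks good enough."""
    delta_i = delta / (2 * i * i)  # assumed confidence budget for the i-th arm
    n = pulls_needed(epsilon, delta_i)
    mean = sum(arm(rng) for _ in range(n)) / n
    return mean >= r0 - epsilon / 2, n

def template(sample_arm, epsilon, delta, r0, rng):
    """Sample arms until the rejection procedure accepts one."""
    total_pulls = 0
    i = 0
    while True:
        i += 1
        arm = sample_arm(rng)
        accept, n = uniform_rejection(arm, i, epsilon, delta, r0, rng)
        total_pulls += n
        if accept:
            return arm, total_pulls
```

Other rejection procedures can be swapped in by replacing `uniform_rejection` with a rule that stops pulling an arm earlier when its empirical mean is far from r0.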
In the more general (and interesting) case (ρ unknown),
we designed algorithms that have the structure of Algorithm 1: they sample an arm, make a bounded number of
pulls for the arm, check if the arm should be accepted (and
in this case, stop and return the arm) or rejected (sample
a new arm from D and repeat). The decision rule for
acceptance / rejection and when it can be applied is what
differentiates the algorithms.
A basic strategy (that we label Iterative Uniform
Rejection - IUR) pulls an arm for a number of times that
allows it to decide with high confidence if the arm has an expected reward that is at least r0 − ε and, if so, it stops. Otherwise, the algorithm rejects the arm and samples a new one. With high probability, as soon as it sees it has a good arm, the algorithm will stop and return that particular arm. The algorithm is simple, correct, and achieves a bound close to the lower bound for the problem: O((1/(ρε²)) log(1/(ρδ))).
One problem with IUR is that it is very conservative in the sense of taking a large number of samples for each arm (with the dominant term always being 1/ε²). The algorithm does not take advantage of the fact that when the difference between r0 and the reward of an arm is larger than ε, the decision to stop pulling the arm could be made sooner. To address this issue, we designed another algorithm using ideas from the Hoeffding Races framework [2] (Iterative Hoeffding Rejection - IHR).
For an even more aggressive rejection strategy (that can sometimes throw away “good arms”, with the advantage of also quickly rejecting “bad arms”) we used ideas from random walks (Iterative Greedy Rejection - IGR).
[Figure 1 plot: “Infinite Mario (hard), r0 = 1, ε = 0.1”, Probability of completing versus Pulls, curves for Det, IGR, IHR, IUR.]
Figure 1: A screenshot of Infinite Mario and plots of the distribution of the sample complexity for four algorithms, including one that pulls each arm once (Det).
[Figure 2 plot: “Pitfall! (Crocs), r0 = 0.3, ε = 0.1”, Probability of completing versus Pulls, curves for IGR, IHR, IUR.]
Figure 2: Plot of distribution of the sample complexity (pulls needed) for IGR, IHR and IUR over a set of 5000 repetitions. The distributions are plotted for one Pitfall level (shown along with a representation of a successful policy in the left half of the figure). Average sample complexity for each algorithm is marked with a circle. δ = 0.01 for all experiments.
4 Experiments
Our first experiment used a version of Infinite Mario (a clone of the Super Mario video game, see Figure 1) that was previously modified to fit the RL-Glue framework for the Reinforcement Learning Competition [5]. The game is deterministic and gives us an opportunity to present a natural problem that illustrates the “reward richness” phenomenon motivating our work.
We modeled the starting screen in Mario, for 50 different difficulty levels, as a bandit with the arms being action sequences with a length of at most 50 actions. In the experiments, the agent’s goal was to reach a threshold on the right side of the screen (just the very beginning of the level). We restricted the action set of the agent to remove the (somewhat unhelpful) backward action, resulting in 8 total actions or an arm space of size 8^50. Action sequences were tested in the actual game, assigning rewards of −1 if the agent was destroyed, 0 if it did not reach the goal in 50 steps, and a value of 100 − t, otherwise (where t was
one was found with reward greater than r0 = 1.
The average number of pulls needed to find a strategy for completing the first screen of each level in a set of 50 levels ranged from 1 to 1000, with a median of 7.7 pulls and a mean of 55.7 (due to a few very tough levels). Thus, testing just a handful of randomly generated action sequences was sufficient to find a successful policy in this game.
Our second experiment involved Pitfall!, a game developed for the Atari 2600. The objective in the game is to guide the protagonist, Pitfall Harry, through the jungle while avoiding items that harm him. For our experiments, the goal was defined simply as arriving at the right side of the screen on the top tier (some levels can be finished on a lower tier). To introduce variability and encourage more robust solutions, noise was added by randomly changing the joystick input from that requested to that of a centered joystick with no button press 5% of the time. We chose several levels of the game of varying difficulty and defined the arms to be action sequences of up to 500 steps (once again, excluding “backward” actions).
Figure 2 illustrates the results of running the three algorithms mentioned above on one Pitfall! level. The random-walk-based IGR outperformed IHR, which outperformed the highly conservative IUR by a very large margin.
References
[1] E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In Conference on Computational Learning Theory, 2002.
[2] V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In International Conference on Machine Learning, 2009.
[3] S. Mannor, J. N. Tsitsiklis, K. Bennett, and N. Cesa-Bianchi. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 2004.
[4] Y. Wang, J.-Y. Audibert, and R. Munos. Algorithms for infinitely many-armed bandits. In Advances in Neural Information Processing Systems, 2008.
the number of steps taken before reaching the goal). As the
domain is deterministic, no arm needed to be pulled more [5] S. Whiteson, B. Tanner, and A. White. The reinforcement
than once. Thus, the agent simply sampled new arms until
learning competitions. AI Magazine, 2010.
2
44 of 104
Nonparametric Multivariate Convex Regression with Applications
to Value Function Approximation
Lauren A. Hannah, joint work with David B. Dunson
September 6, 2011
We propose two new, nonparametric methods for multivariate regression subject to convexity
or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, reinforcement learning and financial engineering. Although
this problem has been studied since the 1950’s, there is currently no multivariate method that is
computationally feasible for more than a couple of thousand observations. We introduce frequentist Convex Adaptive Partitioning (CAP) and Bayesian Multivariate Bayesian Convex Regression
(MBCR), which both create a globally convex regression model from locally linear estimates fit on
adaptively selected covariate partitions. Adaptive partitioning makes computation efficient even
on large problems. We leverage the search procedure of CAP to create a computationally efficient
reversible jump MCMC sampler for MBCR. We give strong consistency results for CAP in the univariate case and strong consistency results for MBCR in the general case. We also give convergence
rates for MBCR, which scale adaptively to the dimensionality of an underlying linear subspace.
Convexity offers a few properties that can be exploited. First, it acts as a regularizer, making
CAP and MBCR resistant to overfitting. Second, convex functions can be quickly minimized
with commercial solvers. We use CAP and MBCR to fit value function approximations for real-world
sequential decision problems with convex value-to-go functions. By allowing efficient search over
the decision space, CAP and MBCR allow us to solve much larger stochastic optimization problems than
can currently be solved with state of the art methods. We show that MBCR produces much more
robust policies than CAP through model averaging. The methods are applied to pricing large
basket options and fitted Q-iteration for a complex inventory management problem.
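As a rough illustration of how locally linear fits can yield a globally convex model, the sketch below fits a least-squares line on each interval of a fixed partition of a one-dimensional covariate and predicts with the pointwise maximum of those lines, which is convex by construction. The fixed breakpoints, the function names, and the univariate noise-free data are assumptions made for illustration; the adaptive partition search of CAP and the reversible jump sampler of MBCR are not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fit_max_affine(xs, ys, breakpoints):
    """Fit one line per interval of a fixed (hypothetical) partition."""
    edges = [float("-inf")] + sorted(breakpoints) + [float("inf")]
    planes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        cell = [(x, y) for x, y in zip(xs, ys) if lo <= x < hi]
        if len(cell) >= 2:  # need at least two points to fit a line
            cx, cy = zip(*cell)
            planes.append(fit_line(cx, cy))
    return planes


def predict(planes, x):
    """Maximum of affine functions: convex in x for any set of lines."""
    return max(a + b * x for a, b in planes)


# Noise-free samples of the convex function y = x^2 on [-2, 2].
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]
planes = fit_max_affine(xs, ys, breakpoints=[-1.0, 0.0, 1.0])
```

Taking the maximum over the fitted lines is what enforces convexity: the estimate stays convex regardless of how well each local fit matches the data.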
Contact Information: Lauren A. Hannah, postdoctoral researcher at Duke University in the
Department of Statistical Science.
Email: lauren.hannah@duke.edu
Phone: (805)748-2894
Mailing address: Box 90251, Duke University, Durham, NC 27708
45 of 104
The No-U-Turn Sampler: Adaptively Setting Path Lengths in
Hamiltonian Monte Carlo
Matthew D. Hoffman
MDHOFFMA@CS.PRINCETON.EDU
Department of Statistics
Columbia University
New York, NY 10027, USA
Andrew Gelman
GELMAN@STAT.COLUMBIA.EDU
Departments of Statistics and Political Science
Columbia University
New York, NY 10027, USA
Abstract
Hamiltonian Monte Carlo (HMC) is a Markov Chain Monte Carlo (MCMC) algorithm that avoids
the random walk behavior and sensitivity to correlations that plague many MCMC methods by
taking a series of steps informed by first-order gradient information. These features allow it to
converge to high-dimensional target distributions much more quickly than popular methods such
as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive
to two user-specified parameters: a step size ε and a desired number of steps L. In particular, if L is
too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the
algorithm wastes computation. We present the No-U-Turn Sampler (NUTS), an extension to HMC
that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set
of likely candidate points that spans a wide swath of the target distribution, stopping automatically
when it starts to double back and retrace its steps. NUTS is able to achieve similar performance
to a well tuned standard HMC method, without requiring user intervention or costly tuning runs.
NUTS can thus be used in applications such as BUGS-style automatic inference engines that require
efficient “turnkey” sampling algorithms.
1. Introduction
Hierarchical Bayesian models are a mainstay of the machine learning and statistics communities.
Exact posterior inference in such models is rarely tractable, however, and so researchers and practitioners must usually resort to approximate statistical inference methods. Deterministic approximate
inference algorithms (for example, those described in (Wainwright and Jordan, 2008)) can be efficient, but introduce bias and can be difficult to apply to some models. Rather than computing a
deterministic approximation to a target posterior (or other) distribution, Markov Chain Monte Carlo
(MCMC) methods offer schemes for drawing a series of correlated samples that will converge in distribution to the target distribution (Neal, 1993). MCMC methods tend to be less efficient than their
deterministic counterparts, but are more generally applicable and are (asymptotically) unbiased.
Not all MCMC algorithms are created equal. For complicated models with many parameters,
simple methods such as random-walk Metropolis (Metropolis et al., 1953) and Gibbs sampling (Geman and Geman, 1984) may require an unacceptably long time to converge to the target distribution.
This is in large part due to the tendency of these methods to explore parameter space via inefficient
random walks (Neal, 1993). When model parameters are continuous rather than discrete, Hamiltonian Monte Carlo (HMC), also known as hybrid Monte Carlo, is able to suppress such random walk
behavior by means of a clever auxiliary variable scheme that transforms the problem of sampling
from a target distribution into the problem of simulating Hamiltonian dynamics (Neal, 2011). The
cost of HMC per independent sample from a target distribution of dimension D is roughly O(D^{5/4}),
which stands in sharp contrast with the O(D^2) cost of random-walk Metropolis.
This increased efficiency comes at a price. First, HMC requires the gradient of the log-posterior;
computing the gradient for a complex model is at best tedious and at worst impossible. This requirement can be made less onerous by using automatic differentiation. Second, HMC requires that the
user specify at least two parameters: a step size ε and a number of steps L for which to run a simulated Hamiltonian system. A poor choice of either of these parameters will result in a dramatic
drop in HMC’s efficiency. Although good heuristics exist for choosing ε, setting L typically requires one or more costly tuning runs, as well as the expertise to interpret the results of those tuning
runs. This hurdle limits the more widespread use of HMC, and makes it challenging to incorporate into a general-purpose inference engine such as BUGS (Gilks and Spiegelhalter, 1992), JAGS
(http://mcmc-jags.sourceforge.net), or Infer.NET (Minka et al.).
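To make the roles of the step size and the number of steps concrete, the following is a minimal sketch of a single standard HMC transition using the leapfrog integrator, for a one-dimensional standard normal target. The target density, the function names, and the parameter values are assumptions chosen for illustration and are not taken from the paper; NUTS itself (the recursive doubling and its stopping rule) is not implemented here.

```python
import math
import random


def log_density(theta):
    """Log of an unnormalized standard normal target (illustrative only)."""
    return -0.5 * theta * theta


def grad_log_density(theta):
    return -theta


def leapfrog(theta, r, epsilon, num_steps):
    """Simulate Hamiltonian dynamics for num_steps leapfrog steps of size epsilon."""
    r = r + 0.5 * epsilon * grad_log_density(theta)  # initial half step for momentum
    for step in range(num_steps):
        theta = theta + epsilon * r  # full step for position
        if step < num_steps - 1:
            r = r + epsilon * grad_log_density(theta)  # full step for momentum
    r = r + 0.5 * epsilon * grad_log_density(theta)  # final half step for momentum
    return theta, r


def hmc_step(theta, epsilon, num_steps, rng):
    """One HMC transition: resample momentum, integrate, then accept or reject."""
    r0 = rng.gauss(0.0, 1.0)
    theta_new, r_new = leapfrog(theta, r0, epsilon, num_steps)
    # Log Metropolis ratio for the joint density over (theta, r).
    log_accept = (log_density(theta_new) - 0.5 * r_new * r_new) - (
        log_density(theta) - 0.5 * r0 * r0
    )
    if rng.random() < math.exp(min(0.0, log_accept)):
        return theta_new
    return theta


rng = random.Random(0)
samples = []
theta = 0.0
for _ in range(2000):
    theta = hmc_step(theta, epsilon=0.2, num_steps=10, rng=rng)
    samples.append(theta)
```

In this sketch both `epsilon` and `num_steps` are fixed by hand, which is exactly the tuning burden the text describes.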
The main contribution of this work is the No-U-Turn Sampler (NUTS), an MCMC algorithm
that closely resembles HMC, but eliminates the need to choose the problematic number-of-steps
parameter L. We also provide schemes for automatically tuning the step size parameter ε in both
HMC and NUTS, which make it possible to run NUTS with no tuning at all. We will show that the
tuning-free version of NUTS samples as efficiently as (and often more efficiently than) HMC, even
discounting the cost of finding optimal tuning parameters for HMC. NUTS is thus suitable for use
in generic inference systems and by users who are unable or disinclined to spend time tweaking an
MCMC algorithm.
References
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of
images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.
W. Gilks and D. Spiegelhalter. A language and program for complex Bayesian modelling. The
Statistician, 3:169–177, 1992.
N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
T. Minka, J. Winn, J. Guiver, and D. Knowles. Infer.NET 2.4, Microsoft Research Cambridge, 2010.
http://research.microsoft.com/infernet.
R. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.
R.M. Neal. Handbook of Markov Chain Monte Carlo, chapter 5: MCMC Using Hamiltonian Dynamics. CRC Press, 2011.
M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference.
Foundations and Trends® in Machine Learning, 1(1-2):1–305, 2008.
47 of 104
Distributed Collaborative Filtering Over Social Networks
Sibren Isaacman,
Margaret Martonosi
Princeton University
Stratis Ioannidis
Technicolor
Augustin Chaintreau
Columbia University
Content created and exchanged on social networks receives a larger and larger share of
online users' attention every day. Its sheer volume and heterogeneous quality call for collaborative
tools that let users collectively divide the burden of identifying truly relevant content. But
current collaborative filtering tools are frequently blind to the social properties of content exchange,
and they are usually centralized, raising privacy concerns. We prove that these two issues can be
addressed through a distributed algorithm that is built on top of any process of content exchange.
Our technique operates with strict restrictions on information disclosure (i.e., content ratings are
only exchanged between communicating pairs of users), and we characterize its evolution in terms
of the minimization of a weighted root mean square prediction error. We also present results
establishing that this approach is highly practical: it converges quickly to an accurate prediction
that is competitive with centralized approaches on the Netflix dataset, and it is shown to work in
practice on a prototype application deployed on Facebook among 43 users.
More precisely, our main contributions are (details can be found in [1]):
• We propose a mathematical model of a system for distributed sharing of user-generated
content streams. Our model captures a variety of different applications, and incorporates
correlations both on how producers deliver content and how consumers rate it.
• We illustrate that estimating the probability distribution of content ratings can be
naturally expressed as a Matrix Factorization (MF) problem. This is in contrast to
standard MF formulations that focus on estimating ratings directly, rather than their
distribution [5], or assume that the ratings follow a Gaussian distribution [4]. To the
best of our knowledge, our work is the first to apply an MF technique in the context of
rating prediction in user-generated content streams.
• Using the above intuition, we propose a decentralized rating prediction algorithm in
which information is exchanged only across content producer/consumer pairs. Producers
and consumers maintain their own individual profiles; a producer shares its profile only
with consumers to which it delivers content, and consumers share a rating they give to an
item, as well as their profile, only with the producer that generated it.
• In spite of the above restriction on how information is exchanged among users, our
distributed prediction algorithm optimizes a global performance objective. In particular,
we formally characterize the algorithm’s convergence properties under our model,
showing that it reduces a weighted mean square error of its rating distribution estimates.
• We validate our algorithm empirically. First, we use the Netflix data set as a benchmark
to compare the performance of our distributed approach to offline centralized algorithms.
Second, we developed a Facebook application that reproduces the main features of a
peer-to-peer content exchange environment. Using a month-long experiment with 43
users we show that our algorithm predicts ratings accurately with limited user feedback.
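The pairwise exchange described in the contributions above can be sketched as a stochastic gradient update in which only the producer and the consumer involved in a delivery see each other's profiles. This is a simplified sketch under stated assumptions: it predicts a single scalar rating by an inner product of profiles rather than a rating distribution, and the class name, learning rate, and synthetic ratings are hypothetical and not taken from [1].

```python
import random


class Node:
    """A producer or consumer that stores only its own profile vector."""

    def __init__(self, dim, rng):
        self.profile = [rng.uniform(-0.1, 0.1) for _ in range(dim)]


def predict(producer, consumer):
    return sum(p * c for p, c in zip(producer.profile, consumer.profile))


def exchange(producer, consumer, rating, lr=0.02):
    """One delivery: the pair swap profiles and the rating, then each takes
    a local gradient step on the squared prediction error."""
    err = rating - predict(producer, consumer)
    old_p = list(producer.profile)
    old_c = list(consumer.profile)
    producer.profile = [p + lr * err * c for p, c in zip(old_p, old_c)]
    consumer.profile = [c + lr * err * p for p, c in zip(old_p, old_c)]
    return err * err


rng = random.Random(1)
dim = 2
producers = [Node(dim, rng) for _ in range(5)]
consumers = [Node(dim, rng) for _ in range(8)]

# Synthetic ground-truth ratings generated from hidden low-rank factors.
true_p = [[rng.gauss(0, 1) for _ in range(dim)] for _ in producers]
true_c = [[rng.gauss(0, 1) for _ in range(dim)] for _ in consumers]


def true_rating(i, j):
    return sum(a * b for a, b in zip(true_p[i], true_c[j]))


def mean_squared_error():
    total = 0.0
    for i, prod in enumerate(producers):
        for j, cons in enumerate(consumers):
            total += (true_rating(i, j) - predict(prod, cons)) ** 2
    return total / (len(producers) * len(consumers))


error_before = mean_squared_error()
for _ in range(20000):
    i = rng.randrange(len(producers))
    j = rng.randrange(len(consumers))
    exchange(producers[i], consumers[j], true_rating(i, j))
error_after = mean_squared_error()
```

No central party ever holds all profiles or ratings in this sketch; the global error is computed here only to check that the local updates reduce it.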
The experience of browsing the web is increasingly embedded in an explicit process of
information created, rated and exchanged by users (through Facebook like buttons, re-tweets,
and digg, reddit and other aggregators). This behavioral web experience has been met by a sharp
rise in privacy concerns, even within the US Congress [3]. Enabling users to view and access
relevant, high-quality user-generated content without releasing their private information to
untrusted third parties is hence a critical challenge. Our results prove that information
exchange can be restricted to trusted pairs while still providing strong guarantees on the
rating prediction error, thanks to a fully distributed and asynchronous learning algorithm.
Our results expand the scope of recommender systems to operate on top of any social content
sharing platform while providing privacy protection, which raises important research questions
[2]. The case where trust is not reciprocal (such as the follower relationship within Twitter)
remains open. Our algorithm could leverage secure multiparty computation
to provide, at a smaller cost, the privacy guarantee offered by centralized schemes. Finally, our
model can be used to analyze how social proximity, captured through the “rate of delivery” for
producer-consumer pairs, impacts the efficiency of learning. These are interesting open
problems.
References:
[1] Isaacman, S., Ioannidis, S., Chaintreau, A., and Martonosi, M. “Distributed rating prediction in user
generated content streams”. ACM RecSys 2011 (full paper).
[2] Isaacman, S., Ioannidis, S., Chaintreau, A., and Martonosi, M. “Distributed collaborative filtering over
Social Networks”. IEEE Allerton Conference 2011 (invited paper).
[3] Angwin, J. US seeks web privacy ‘bill of rights’. Wall Street Journal (Dec. 17th 2010).
[4] Salakhutdinov, R., and Mnih, A. Probabilistic matrix factorization. Advances in Neural Information
Processing Systems 20 (2008).
[5] Takacs, G., Pilaszy, I., Nemeth, B., and Tikk, D. Scalable collaborative filtering approaches for large
recommender systems. JMLR 10 (2009), 623–656.
49 of 104
Can Public Data Help With Differentially-Private Machine
Learning?
Geetha Jagannathan
Columbia University
Claire Monteleoni
George Washington University
Krishnan Pillaipakkamnatt
Hofstra University
Abstract In the field of the design of privacy preserving algorithms, public knowledge is usually
used in an adversarial manner to gain insight into an individual’s information stored in the database.
Differential privacy is a privacy model that offers protection against such attacks. But it assumes all
data being analysed is private. In this paper, we ask a question in the other direction. Can public
data be used to “boost” the accuracy of differentially private algorithms? We answer this question in
the affirmative. Motivated by the well known semi-supervised model in machine learning literature,
we present a more realistic form of the differential privacy model in which privacy-preserving
analysis is strengthened with non-private data. Our main result is a differentially private classifier
that makes use of non-private data to increase the accuracy of a classifier constructed from a small
amount of private data. Our new model expands the range of useful applications of differential
privacy whereas most of the current results in the differentially private model require large private
data sets to obtain reasonable utility.
Differential Privacy This privacy model introduced by Dwork et al. [1] assures that the removal
or addition of a single item in a database does not have a substantial impact on the output of a
private database access mechanism. It provides protection against arbitrary amounts of auxiliary
data available to an attacker. Since the publication of [1] a large number of results have appeared
in differentially private data analysis. One major weakness of most of these results is that they
require a large quantity of data to obtain meaningful accuracy. Many of the real world datasets
are small, and hence many of the differential privacy results do not work on such data sets. In this
paper, we propose an enhanced privacy model that tries to address the above mentioned weakness.
Privacy Model Our new privacy model is motivated by the well known “semi-supervised” model
in machine learning literature. We propose a more realistic privacy model that assumes data
analyses are performed not only on private data but also on non-private data. This model is useful
in scenarios in which the data collector has both private and non-private data. For example, some
respondents in a survey may insist on privacy for their data, while others may be willing to make
their data publicly available on the basis of some inducement. The model also applies in situations
where the public data (such as voter registration data) misses a confidential attribute (such as
salary information). This is similar to the semi-supervised model in machine learning.
Problem Our goal is to construct a differentially private classifier that makes use of non-private
unlabeled data to “boost” the accuracy of the differentially private classifier constructed from a
small amount of private labeled data. We make the reasonable assumption that the private and
the non-private data are both from the same distribution. In this paper, we present the RDT#
classifier which is a non-trivial extension of the random decision tree classifier (RDT) [2].
[Figure 1: Errors on the Nursery (12960 rows) and Mushroom (8124 rows) datasets. Each panel plots the error rate against epsilon (0.5 to 1) for the “Labeled Only” and “Both” settings.]
Differentially Private Random Decision Trees An RDT is created by choosing test attributes for the decision tree nodes completely at random. The entire tree structure is created
using the list of attributes without looking at the training data. The training instances are then
incorporated into the structure to compute the distribution of class labels at all the leaves of the
tree. Adding noise to the class distribution according to the Laplace mechanism makes the classifier
differentially private. A random decision tree classifier is an ensemble of such trees. To classify a
test instance the classifier averages the predictions from all the trees in the ensemble.
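The noise-addition step at a single leaf can be sketched as follows. This is a minimal illustration under stated assumptions: adding or removing one training row is taken to change exactly one class count by one (sensitivity 1), epsilon is treated as the privacy budget available to this one tree, and the function names and example counts are hypothetical rather than taken from [2].

```python
import random


def laplace_noise(scale, rng):
    """Zero-mean Laplace sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_leaf_counts(counts, epsilon, rng):
    """Add Laplace(1/epsilon) noise to each class count stored at one leaf.

    Assumes sensitivity 1 and that epsilon is the budget for this tree.
    """
    return [c + laplace_noise(1.0 / epsilon, rng) for c in counts]


def leaf_distribution(noisy_counts):
    """Clip negative counts to zero and normalize to a class distribution."""
    clipped = [max(0.0, c) for c in noisy_counts]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(clipped)] * len(clipped)  # fall back to uniform
    return [c / total for c in clipped]


rng = random.Random(0)
true_counts = [40, 3, 7]  # hypothetical class counts at one leaf
private_counts = noisy_leaf_counts(true_counts, epsilon=1.0, rng=rng)
private_distribution = leaf_distribution(private_counts)
```

Clipping and normalizing are applied only after the noise has been added, so they operate on already-privatized values. With an ensemble of trees, the overall budget would have to be divided among the trees, which this sketch does not model.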
RDT# - A New Approach There are two issues in using RDT directly in our setting. (1)
Random decision trees were not designed to handle unlabeled data. (2) The problem for random
decision tree classifiers fundamentally lies in the “scattering” of the training instances into the
partitions of the instance space induced by the tree. If the number of partitions induced by a tree
is large, each partition would likely have a small number of rows of the training data. The noise
added to the summary values in each partition could overwhelm the true counts, thereby leading
to poor utility. On the other hand, if the number of partitions is small, there would be more rows
of the data set in each partition. However, the discriminatory power of each such partition (the
“purity” of a leaf node) would likely be low because it spans a large region of the instance space.
This also leads to poor utility. In both situations the problem is particularly acute when instances
are distributed unevenly in the instance space. This situation also occurs when a dataset suffers
from sparsity.
We extend the random decision tree idea to exploit the availability of (non-private) unlabeled
data. We use the unlabeled instances in two ways. First, we use them to control the partitioning
of the instance space so that denser regions of the space are partitioned more finely than sparse
regions. Second, we use the unlabeled examples to “propagate” the labels from the labeled instances to larger regions of the instance space. Together, these techniques boost the utility of the
differentially private classifier without lowering privacy. We experimentally demonstrate that our
private classifier produces good prediction accuracies even in the situations where the private data
is fairly limited. See Figure 1. The X-axis shows increasing values of ϵ, the privacy parameter, and
the Y -axis shows the error rate.
[1] C. Dwork, F. McSherry, K. Nissim and A. Smith, “Calibrating Noise to Sensitivity in Private Data Analysis”, in TCC 2006.
[2] G. Jagannathan, K. Pillaipakkamnatt, R. N. Wright, “A Practical Differentially Private Random Decision Tree Classifier”, in ICDMW 09.
51 of 104
52 of 104
Place Recommendation with Implicit Spatial Feedback
Berk Kapicioglu (Princeton, Sense Networks), David Rosenberg (Sense Networks),
Robert Schapire (Princeton), Tony Jebara (Columbia)
08/08/2011
1 Introduction
Since the advent of the Netflix Prize [1], there has been an influx of papers on recommender systems in the machine learning
literature. A popular framework for building such systems has been collaborative filtering (CF) [6]. On the Netflix dataset, CF
algorithms were among the few stand-alone methods shown to have superior performance.
Recently, web services such as Foursquare and Facebook Places started to allow users to share their locations over social
networks. This has led to an explosion in the number of virtual places that are available for checking in, inundating users
with many irrelevant choices. In turn, there is a pressing need for algorithms that rank nearby places according to the
user’s interest. In this paper, we tackle this problem by providing a machine learning perspective to personalized place
recommendation.
Our contributions are as follows: First, we transform and formalize the publicly available checkin information we scraped
from Twitter and Foursquare as a collaborative filtering dataset. Second, we introduce an evaluation framework to compare
algorithms. The framework takes into account the limitations of the mobile device interface (i.e. one can only display a few
ranked places) and the spatial constraints (i.e. the user is only interested in a ranking of nearby venues). Third, we introduce
a novel algorithm that exploits implicit feedback provided by users and demonstrate that it outperforms state-of-the-art CF
algorithms. Finally, we discuss and report preliminary results on extending our CF algorithm with explicitly computed user,
place, and time features.
2 Dataset
Our dataset consists of 106127 publicly available Foursquare check-ins that occurred in New York City over the span of two
months. Each datapoint represents an interaction between a user i and a place j, and includes additional information such as
the local time of check-in, user’s gender and hometown, and place’s coordinates and categories. One can view the dataset as
a bipartite graph between users and places, where the edges between users and places correspond to check-ins. We preprocess
the dataset by computing the largest bipartite subgraph where each user and each place interacts with at least a certain
number of nodes in the subgraph. In the end, we obtain 2993 users, 2498 places, and 35283 interactions. We assume that
this dataset is a partially observed subset of an unknown binary matrix M of size (m, n), where m is the number of users and n
is the number of places. Each entry M_{i,j} is 1 if user i likes place j, and −1 otherwise. Furthermore, denoting the set of all observed
indices with Ω, we assume that ∀ (i, j) ∈ Ω, M_{i,j} = 1. In other words, if a user checked into a place, we assume she likes the
place, but if she didn’t check in, we assume that we don’t know whether she likes the place or not.
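One way to compute such a subgraph is to repeatedly drop users and places that have too few partners until nothing changes. The sketch below does this for a list of (user, place) pairs; the function name, the threshold argument, and the toy check-ins are assumptions for illustration, and the text does not state which procedure or threshold was actually used.

```python
def prune_bipartite(checkins, min_degree):
    """Iteratively drop users and places with fewer than min_degree distinct partners."""
    edges = set(checkins)  # distinct (user, place) pairs
    while True:
        user_deg, place_deg = {}, {}
        for user, place in edges:
            user_deg[user] = user_deg.get(user, 0) + 1
            place_deg[place] = place_deg.get(place, 0) + 1
        kept = {
            (u, p)
            for u, p in edges
            if user_deg[u] >= min_degree and place_deg[p] >= min_degree
        }
        if kept == edges:  # fixed point: every remaining node has enough partners
            return kept
        edges = kept


# Toy check-ins: place "c" has one visitor, and removing it leaves "u3" with one place.
toy_checkins = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "a"), ("u3", "c"),
]
core = prune_bipartite(toy_checkins, min_degree=2)
```

Removing a low-degree node can push its neighbors below the threshold, which is why the pruning is repeated until a fixed point is reached.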
3 Evaluation Framework
Here, we explain the evaluation framework we used to compare algorithms. We assigned each (i, j) ∈ Ω to either a train or a
test partition. We allowed our algorithms to train on the train partition. Then, for each datapoint (i, j) in the test partition,
we only show the algorithms the user i, the place j, and places N (j), where N (j) is the set of neighboring places of j within
a given radius (e.g. 500 meters). The algorithms do not know which one of the candidate places is the actual checked-in place.
We constrain the candidates to be within a certain radius, since we expect the user to be only interested in nearby venues.
Algorithms provide a personalized ranking over the given places and we measure the 0-1 error of predicting whether the actual
checked-in place is among the top t places, where t ∈ {1, 2, 3}. Due to the limitations of the mobile device interface, we are
only interested in the accuracy of the top few rankings. We do randomized 10-fold cross-validation and report the results.
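The top-t measurement described above can be sketched as follows. The function names, the dictionary of neighbors, and the popularity-based scorer in the toy example are hypothetical; ties are broken by place identifier here, which is an assumption the text does not specify.

```python
def top_t_hit(scores, true_place, candidates, t):
    """Return True if true_place is among the t highest-scoring candidates."""
    ranked = sorted(candidates, key=lambda place: scores[place], reverse=True)
    return true_place in ranked[:t]


def top_t_accuracy(test_points, score_fn, neighbors, t):
    """Fraction of test check-ins whose place is ranked within the top t.

    test_points: list of (user, place) pairs from the test partition.
    score_fn(user, place): the algorithm's score for that pair.
    neighbors[place]: places within the chosen radius of place.
    """
    hits = 0
    for user, place in test_points:
        # Sort by identifier so the order does not reveal the true place.
        candidates = sorted({place, *neighbors[place]})
        scores = {c: score_fn(user, c) for c in candidates}
        hits += top_t_hit(scores, place, candidates, t)
    return hits / len(test_points)


# Toy example with a popularity-based scorer.
toy_neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
toy_popularity = {"a": 5, "b": 3, "c": 1}


def popularity_score(user, place):
    return toy_popularity[place]


toy_test = [("u1", "a"), ("u2", "b"), ("u3", "c")]
accuracy_top1 = top_t_accuracy(toy_test, popularity_score, toy_neighbors, t=1)
accuracy_top2 = top_t_accuracy(toy_test, popularity_score, toy_neighbors, t=2)
```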
4 Algorithms
The algorithms we compare range from basic baseline algorithms to state-of-the-art CF approaches. The baseline algorithms are a predictor that ranks randomly and a predictor that ranks venues by their popularity in the training
data. We also have matrix completion algorithms, where the objective is min ‖A‖_* s.t. A_{i,j} = M_{i,j}, ∀ (i, j) ∈ Ω, and ‖·‖_* denotes
the trace norm [5]. In this case, we interpret the entries of the approximation matrix A as confidence scores and rank venues
accordingly. This method is justified by [2], where they show that as long as certain technical assumptions are satisfied,
with high probability, M could be recovered exactly. In case M is not exactly low rank but approximately low rank, we
use Optspace [4]. The use of Optspace is motivated by the upper bound associated with RMSE(M, M̂), where RMSE
indicates root mean square error and M̂ indicates the approximate matrix constructed by the algorithm.
The problem with these matrix completion algorithms is that they try to minimize RMSE(M, M̂). However, what we are
interested in is an algorithm that simply scores the actual checked in venue higher than all the neighboring venues. In order to
construct such an algorithm, we decided to both measure the performance of vanilla Maximum Margin Matrix Factorization
(MMMF) and build upon it [7]. Analogous to how support vector machines were extended to deal with structured outputs,
we extended MMMF to rank checked in venues higher than nearby venues. Our approach was partially motivated by [3],
where users provided implicit feedback when they clicked on a webpage that was ranked low and the algorithm exploited that
feedback. Similarly, every time a user checks in at a venue during training, we assume that she preferred that venue over nearby
venues. The optimization problem associated with our algorithm is as follows:
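\[
\min_{U,V} \; \frac{\lambda}{2\sqrt{mn}} \left( \|U\|^2 + \|V\|^2 \right) + \frac{1}{|K|} \sum_{(i,j) \in \Omega} \sum_{k \in N(j)} h\left( (UV^T)_{i,j} - (UV^T)_{i,k} \right)
\]
where U ∈ R^{m×p}, V ∈ R^{n×p} are the user and place factors, K = {(i, j, k) | (i, j) ∈ Ω, k ∈ N (j)} is an extended set of indices,
and h is the smooth hinge function
\[
h(z) = \begin{cases} \frac{1}{2} - z & z \le 0 \\ \frac{1}{2}(1 - z)^2 & 0 < z < 1 \\ 0 & z \ge 1 \end{cases}
\]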
The objective is convex and smoothly differentiable in U and V separately, and similar to [9], we use alternating minimization
and BMRM [8] to minimize it. We have a fast implementation in Python where we implemented the objective and gradient
computations in C using Cython.
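For reference, the smooth hinge and the ranking objective can be written out directly in plain Python. This is a minimal, unoptimized sketch and not the authors' Cython implementation: the function names and the toy inputs are hypothetical, and the norms are assumed to be squared Frobenius norms, which the text does not state explicitly.

```python
import math


def smooth_hinge(z):
    """Smooth hinge: linear for z <= 0, quadratic on (0, 1), zero for z >= 1."""
    if z <= 0:
        return 0.5 - z
    if z < 1:
        return 0.5 * (1.0 - z) ** 2
    return 0.0


def score(U, V, i, j):
    """Entry (i, j) of U V^T."""
    return sum(u * v for u, v in zip(U[i], V[j]))


def objective(U, V, checkins, neighbors, lam):
    """Regularized ranking loss over triples (i, j, k) with k in N(j)."""
    m, n = len(U), len(V)
    # Assumption: squared Frobenius norms of the factor matrices.
    sq_norm = sum(x * x for row in U for x in row) + sum(x * x for row in V for x in row)
    triples = [(i, j, k) for i, j in checkins for k in neighbors[j]]
    loss = sum(
        smooth_hinge(score(U, V, i, j) - score(U, V, i, k)) for i, j, k in triples
    )
    return lam / (2.0 * math.sqrt(m * n)) * sq_norm + loss / len(triples)


# Toy instance: one user, two places, one check-in at place 0 with neighbor 1.
U_toy = [[1.0, 0.0]]
V_toy = [[1.0, 0.0], [0.0, 1.0]]
value = objective(U_toy, V_toy, checkins=[(0, 0)], neighbors={0: [1], 1: [0]}, lam=0.1)
```

In the toy instance the checked-in place beats its neighbor by a margin of 1, so the loss term vanishes and only the regularizer contributes.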
We will demonstrate the complete results of our comparison during the poster session. We will also show some preliminary
results on extending the algorithm to exploit explicit user, venue, and time features. However, here’s a sneak peek comparing
some of the algorithms:
Algorithm       Average Accuracy (within top 1)   Average Accuracy (within top 3)
Random          11.02%                            29.94%
Popular         30.67%                            52.38%
MMMF            33.06%                            55.27%
Spatial MMMF    35.88%                            59.11%
References
[1] James Bennett and Stan Lanning. The netflix prize. In KDD Cup and Workshop in conjunction with KDD, August 2007.
[2] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, December 2009.
[3] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international
conference on Knowledge discovery and data mining, KDD ’02, pages 133–142, New York, NY, USA, 2002. ACM.
[4] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. J. Mach. Learn. Res.,
11:2057–2078, August 2010.
[5] Zhouchen Lin, Minming Chen, and Yi Ma. The augmented lagrange multiplier method for exact recovery of corrupted Low-Rank
matrices. arXiv, March 2011.
[6] B. Marlin. Collaborative filtering: A machine learning perspective. Master’s thesis, University of Toronto, 2004.
[7] Jasson D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings
of the 22nd international conference on Machine learning, ICML ’05, pages 713–719, New York, NY, USA, 2005. ACM.
[8] Choon H. Teo, Alex Smola, Vishwanathan, and Quoc V. Le. A scalable modular convex solver for regularized risk minimization. In
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’07, pages 727–736,
New York, NY, USA, 2007. ACM.
[9] Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, and Alex J. Smola. COFI RANK - maximum margin matrix factorization
for collaborative ranking. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, NIPS. MIT Press, 2007.
54 of 104
Recovering Euclidean Distance Matrices via Landmark
MDS
Akshay Krishnamurthy
akshaykr@cs.cmu.edu
Knowledge of a network’s topology is essential to facilitating network design, improving performance, and maintaining reliability, among several other applications.
However, typical networks of interest, such as the internet, are highly decentralized
and grow organically, so that no single entity can maintain global structural information. Thus, a fundamental problem in networking is topology discovery, which
involves recovering the structure of a network from measurements taken between
end hosts or along paths.
Early solutions to the topology discovery problem were tools such as traceroute
that provide detailed path-level measurements, but that also rely on cooperation
from intermediate routers to collect this information. For security and privacy
reasons, routers are increasingly blocking these types of requests, rendering these
tools obsolete. More recent algorithms use measurements taken between cooperating
end-hosts and infer the network structure, including uncooperative intermediate
nodes.
In this direction there are two main lines of research. The first focuses on finding
a network structure consistent with the measurements, often assumed to be observed
or directly measurable. Proposed algorithms for this task typically actively probe
for measurements, injecting large amounts of traffic into the network and disturbing
regular activity. Moreover, they often require co-operation from the end-hosts to
collect measurements. These drawbacks motivate the second line of research, which
involves using few measurements to accurately infer additional ones. Recent work
in this direction focuses on using passively gathered measurements to identify the
network’s structural characteristics [1].
A recent algorithm [2] is particularly remarkable in that it can accurately estimate all of the measurements (specifically, hop counts) between pairs of end hosts
without requiring their cooperation and without injecting large amounts of traffic
into the network. This algorithm, known as landmark MDS, involves instrumenting
a few routers (landmark nodes) in the network that monitor traffic and passively
collect distance information about the end hosts. Using just the collected distances
and measurements between the landmark nodes, landmark MDS finds a distance-preserving embedding of the end hosts and landmarks in a Euclidean space and uses
this embedding to infer the unobserved distances.
[1] Brian Eriksson, Paul Barford, and Robert Nowak. Network Discovery from Passive Measurements. ACM SIGCOMM Computer Communication Review, 38(4):291–302, 2008.
[2] Brian Eriksson, Paul Barford, and Robert Nowak. Estimating Hop Distance Between Arbitrary Host Pairs. In IEEE INFOCOM 2009 - The 28th Conference on Computer Communications, pages 801–809. IEEE, April 2009.
In this paper, we study the theoretical properties of landmark MDS. Specifically,
we focus on recovering a distance matrix $D \in \mathbb{R}^{(m+n) \times (m+n)}$ on m landmarks and n
end hosts using the observations between the landmarks and only a few observations
between end hosts and landmarks. Our analysis assumes that there exists a set of
points $X_1, \ldots, X_{n+m} \in \mathbb{R}^p$ such that $D_{ij} = \|X_i - X_j\|_2^2$. We obtain the following
results:
1. In the absence of noise, we give sufficient conditions under which Landmark
MDS finds an embedding that perfectly recovers the distance matrix D. We
show that if the observed landmarks for each of the n end hosts span $\mathbb{R}^p$ then
we can exactly recover that end host’s coordinates. Applying this to all end
hosts results in exact recovery of the distance matrix.
2. Network measurements are invariably corrupted by noise, and we model this
with a symmetric perturbation matrix R. In this context, we derive bounds
on the average entrywise error between the recovered distance matrix $\hat{D}$ and
D in terms of the noise variance $\sigma^2$ of the perturbation. We use this bound to
understand conditions under which Landmark MDS is statistically consistent.
Simply stated, our result shows that if the number of end hosts n = o(m),
where m is the number of landmarks, then we can tolerate noise with variance
$\sigma^2 = O(m)$ and still obtain consistency as n, m → ∞.
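The noiseless recovery in item 1 can be illustrated with a small numerical sketch. It is not the authors' implementation: it assumes classical MDS on the landmark block followed by a per-host least-squares fit, and the names `embed_landmarks`, `locate_host`, and `demo` are hypothetical. Each host sees only a random subset of landmarks, which is enough when the observed landmarks affinely span $\mathbb{R}^p$.

```python
import numpy as np

def squared_distances(Z):
    """All pairwise squared Euclidean distances between the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def embed_landmarks(D_ll, p):
    """Classical MDS on the m x m matrix of squared landmark distances."""
    m = D_ll.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    B = -0.5 * H @ D_ll @ H                      # Gram matrix of centered landmarks
    evals, evecs = np.linalg.eigh(B)
    top = np.argsort(evals)[::-1][:p]            # p largest eigenvalues
    return evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))

def locate_host(Y, d_obs, observed):
    """Least-squares position of one host from squared distances to observed landmarks."""
    Ys = Y[observed]
    rhs = d_obs - np.sum(Ys ** 2, axis=1)        # equals |x|^2 - 2 x.y_i
    A = -2.0 * (Ys - Ys.mean(axis=0))            # subtracting the mean removes |x|^2
    b = rhs - rhs.mean()
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

def demo(m=8, n=5, p=3, n_obs=5, seed=0):
    """Return the largest entrywise error of the recovered distance matrix."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(m + n, p))              # first m rows play the landmarks
    D = squared_distances(X)
    Y = embed_landmarks(D[:m, :m], p)
    hosts = []
    for h in range(m, m + n):
        observed = rng.choice(m, size=n_obs, replace=False)
        hosts.append(locate_host(Y, D[h, observed], observed))
    D_hat = squared_distances(np.vstack([Y, np.array(hosts)]))
    return float(np.max(np.abs(D_hat - D)))

print(demo())
```

The embedding is only determined up to a rigid motion, so the check compares distance matrices rather than coordinates.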
The noisy setting we consider has strong connections to the well-studied Noisy
Low-Rank Matrix Completion problem. Our work differs from the matrix completion literature in that typical matrix completion results assume that observations
are sampled uniformly from the matrix, whereas we place structure on the observations. The former model is not suitable for network tomography settings, where we
can only collect measurements from a few hosts and are interested in limiting the
amount of traffic we inject into the network. Unfortunately, the added structure of
our model results in worse rates of convergence in comparison with matrix completion, and we believe this is caused by the inherent randomness in the observations
of the matrix completion model.
In addition to deriving our results on recovery both with and without noise, we
present several experiments validating our theoretical results. We also perform an
empirical evaluation of Landmark MDS in comparison with state-of-the-art matrix
completion algorithms. Finally, we demonstrate the performance of Landmark MDS
on both Erdős-Rényi and power-law networks, the latter being a suitable model for
communication networks. These experiments encourage the use of Landmark MDS
in practice.
Efficient evaluation of large sequence kernels
Pavel P. Kuksa¹, Vladimir Pavlovic²
¹ NEC Laboratories America, Inc. ² Department of Computer Science, Rutgers University
Classification of sequences drawn from a finite alphabet using a family of string kernels with inexact matching
(e.g., spectrum or mismatch) has shown great success in machine learning [6, 3, 9, 4]. However, selection of
optimal mismatch kernels for a particular task is severely limited by inability to compute such kernels for
long substrings with potentially many mismatches. We extend prior work on algorithms for computing (k, m)
mismatch string kernels and introduce a new method that allows us to evaluate kernels for large k, m. This
makes it possible to explore a larger set of kernels with a wide range of kernel parameters, opening the possibility
of better model selection and improved performance of the string kernels. To investigate the utility of large
(k,m) string kernels, we consider several sequence classification problems, including protein remote homology
detection, and music classification. Our results show that increased k-mer lengths with larger substitutions can
improve classification performance.
Background. A number of state-of-the-art approaches to classification of sequences over a finite alphabet Σ
rely on fixed-length representations Φ(X) of sequences as the spectra ($|\Sigma|^k$-dimensional histograms) of counts of
short substrings (k-mers) contained, possibly with up to m mismatches, in a sequence, cf. spectrum/mismatch
methods [6, 7, 3]. However, computing similarity scores, or kernels, $K(X, Y) = \Phi(X)^T \Phi(Y)$ using these representations can be challenging; e.g., the efficient $O(k^{m+1} |\Sigma|^m (|X| + |Y|))$ trie-based mismatch kernel algorithm [7]
strongly depends on the alphabet size and the number of mismatches m.
More recently, [4] introduced linear time algorithms with alphabet-independent complexity $O(c_{k,m}(|X| + |Y|))$
applicable to computation of a large class of existing string kernels. The authors show that it is possible to
compute an inexact (k, m) kernel as
$$K(X, Y \mid m, k) = \sum_{a \in X} \sum_{b \in Y} I(a, b) = \sum_{i=0}^{\min(2m,k)} M_i I_i, \qquad (1)$$
where I(a, b) is the number of common substrings in the intersection of the mutation neighborhoods of a and
b, $I_i$ is the size of the intersection of k-mer mutational neighborhoods for Hamming distance i, and $M_i$ is the
number of observed k-mer pairs in X and Y having Hamming distance i.
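As a sanity check of Eq. 1, the sketch below evaluates a small mismatch kernel in two ways: through the explicit feature map, and through the histogram $M_i$ of pairwise Hamming distances combined with intersection sizes $I_i$. It is a brute-force illustration, not the linear-time algorithm of [4]; the function names and the toy strings are hypothetical, and the intersection sizes are found by plain enumeration.

```python
from collections import Counter
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def kmers(s, k):
    """All contiguous length-k substrings of s, with repeats."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def intersection_size(d, k, m, alphabet):
    """Brute-force I_d: k-mers within distance m of both members of a pair at distance d."""
    a = alphabet[0] * k
    b = alphabet[1] * d + alphabet[0] * (k - d)   # representative pair at distance d
    return sum(1 for g in product(alphabet, repeat=k)
               if hamming(g, a) <= m and hamming(g, b) <= m)

def kernel_explicit(X, Y, k, m, alphabet):
    """Inner product of the explicit (k, m)-mismatch feature vectors."""
    total = 0
    for g in product(alphabet, repeat=k):
        fx = sum(hamming(g, a) <= m for a in kmers(X, k))
        fy = sum(hamming(g, b) <= m for b in kmers(Y, k))
        total += fx * fy
    return total

def kernel_via_counts(X, Y, k, m, alphabet):
    """Right-hand side of Eq. 1: sum over i of M_i * I_i."""
    M = Counter(hamming(a, b) for a in kmers(X, k) for b in kmers(Y, k))
    return sum(M[i] * intersection_size(i, k, m, alphabet)
               for i in range(min(2 * m, k) + 1))

X, Y = "ACGTACGGT", "TTGCACGAC"
print(kernel_explicit(X, Y, 4, 1, "ACGT") == kernel_via_counts(X, Y, 4, 1, "ACGT"))
```

The second form only needs the distance histogram of the observed k-mer pairs once the $I_i$ are known, which is why obtaining the intersection sizes is the crux of the method.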
This result, however, requires that the number of identical substrings in the (k, m)-mutational neighborhoods of k-mers a and b (the intersection size) be known in advance, for every possible pair of m and the Hamming distance
d between k-mers (k and |Σ| are free variables). Obtaining the closed form expression for the intersection size
for arbitrary k, m is challenging, with no clear systematic way of enumerating the intersection of two mutational
neighborhoods. Closed form solutions obtained in [4] were only provided for cases when m is small (m ≤ 3).
No systematic way of obtaining these intersection sizes has been proposed in [4].
In this work we introduce a systematic and efficient procedure for obtaining intersection sizes that can be
used for large k and m and arbitrary alphabet size |Σ|. This will allow us to effectively explore a much larger
class of (k, m) kernels in the process of model selection which could further improve performance of the string
kernel method as we show experimentally.
Efficient evaluation of large sequence kernels. For large values of k and m finding intersection sizes
needed for kernel computation can be problematic. This is because while for smaller values of m combinatorial
closed form solution can be found easily, for larger values of m finding it becomes more difficult due to an
increase in the number of combinatorial possibilities as the mutational neighborhood increases (exponentially)
in size. On the other hand, direct computation of the intersection by a trie traversal algorithm is computationally
difficult for large k and m as the complexity of traversal is $O(k^{m+1} |\Sigma|^k)$, i.e., exponential in both k and m.
The above-mentioned issues do not allow for efficient kernel evaluation for large k and m.
Reduction-based computation of intersection size coefficients. We will now show that it is possible to efficiently compute the intersection sizes by reducing (k, m, |Σ|) intersection size problem to a set of less complex
intersection size computations and solving linear systems of equations. We discuss this approach below.
The number of k-mers at the Hamming distance of at most m from both k-mers a and b, I(a, b), can be
found in a weighted form
$$I(a, b) = \sum_{i=0}^{m} w_i (|\Sigma| - 1)^i. \qquad (2)$$
Coefficients $w_i$ depend only on the Hamming distance d(a, b) between k-mers a and b for fixed k, m, |Σ|.
For every Hamming distance 0 ≤ d(a, b) ≤ 2m, the corresponding set of coefficients $w_i$, i = 0, 1, ..., m can be
found by solving a linear system Aw = I of m + 1 equations with each equation corresponding to a particular
alphabet size |Σ| ∈ {2, 3, ..., m + 2}. The left-hand side matrix A is an (m+1) × (m+1) matrix with elements
$a_{ij} = i^{j-1}$, i = 1, ..., m + 1, j = 1, ..., m + 1:
$$A = \begin{pmatrix}
1^0 & 1^1 & 1^2 & \cdots & 1^m \\
2^0 & 2^1 & 2^2 & \cdots & 2^m \\
\vdots & \vdots & \vdots & & \vdots \\
(m+1)^0 & (m+1)^1 & (m+1)^2 & \cdots & (m+1)^m
\end{pmatrix}$$
The right-hand side $I = (I_0, I_1, \ldots, I_m)^T$ is a vector of intersection sizes for a particular setting of k, m, d,
|Σ| = 2, 3, ..., m + 2. Here, $I_i$, i = 0, ..., m, is the intersection size for a pair of k-mers over alphabet size i + 2.
Note that $I_i$ need only be computed for small alphabet sizes, up to m + 2. Hence, this vector can feasibly be
computed using a trie traversal for a pair of k-mers at Hamming distance d even for moderately large k, as
the size of the trie is only $(m + 2)^k$ as opposed to $|\Sigma|^k$. This makes it possible to evaluate kernels for large k and
m, as the traversal is performed over much smaller tries; e.g., even in the case of a relatively small protein alphabet
with |Σ| = 20, for m = 6 and k = 13, the trie is $20^{13}/8^{13} \approx 149011$ times smaller. Coefficients w
obtained by solving Aw = I do not depend on the alphabet size |Σ|. In other words, once found for a particular
combination of values (k, m), these coefficients can be used to determine intersection sizes for any given finite
alphabet |Σ| using Eq. 2.
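The reduction can be sketched in a few lines, under simplifying assumptions: intersection sizes for the small alphabets are obtained by exhaustive enumeration rather than by the trie traversal described above, and the function names are hypothetical. The sketch fits the coefficients w on alphabet sizes 2, ..., m + 2 and then uses Eq. 2 to predict the intersection size for a larger alphabet, which is compared against enumeration.

```python
from itertools import product
import numpy as np

def brute_force_intersection(k, m, d, sigma):
    """Count k-mers within Hamming distance m of both members of a pair at distance d."""
    a = (0,) * k
    b = (1,) * d + (0,) * (k - d)                 # representative pair at distance d
    count = 0
    for g in product(range(sigma), repeat=k):
        da = sum(x != y for x, y in zip(g, a))
        db = sum(x != y for x, y in zip(g, b))
        if da <= m and db <= m:
            count += 1
    return count

def fit_coefficients(k, m, d):
    """Solve A w = I using alphabet sizes 2, ..., m + 2 only."""
    sizes = np.arange(2, m + 3)
    A = np.vander(sizes - 1, N=m + 1, increasing=True).astype(float)  # a_ij = i^(j-1)
    I = np.array([brute_force_intersection(k, m, d, s) for s in sizes], dtype=float)
    return np.linalg.solve(A, I)

def predict(w, sigma):
    """Evaluate Eq. 2 for an arbitrary alphabet size."""
    return float(sum(w_i * (sigma - 1) ** i for i, w_i in enumerate(w)))

k, m = 4, 2
for d in range(min(2 * m, k) + 1):
    w = fit_coefficients(k, m, d)
    print(d, round(predict(w, 6)), brute_force_intersection(k, m, d, 6))
```

For d = 0 the fitted coefficients reduce to binomial coefficients, since the neighborhood of a single k-mer contains $\sum_{j \le m} \binom{k}{j} (|\Sigma| - 1)^j$ elements.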
Experimental evaluation. We evaluate the utility of large (k, m) computations as a proxy for model
selection, by allowing a significantly wider range of kernel parameters to be investigated during the selection
process. Such large range evaluation is the first of its kind, made possible by our efficient kernel evaluation
algorithm. In these evaluations we follow the experimental settings considered in [5] and [4]. We use standard
benchmark datasets: the SCOP dataset (7329 sequences, 54 experiments) [9] for remote protein homology
detection, and music genre data¹ (10 classes, 1000 seqs) [8] for multi-class genre prediction.
Table 1: Remote homology. Classification performance of the mismatch kernel method.
Kernel            Mean ROC   Mean ROC50
mismatch(5,1)     87.75      41.92
mismatch(5,2)     90.67      49.09
mismatch(6,2)     90.74      49.66
mismatch(6,3)     90.98      49.36
mismatch(7,3)     91.31      52.00
mismatch(7,4)     90.84      49.29
mismatch(9,4)     91.45      53.51
mismatch(10,5)    91.60      53.78
mismatch(13,6)    90.98      50.11

Table 2: Multi-class music genre recognition. Classification performance of the mismatch method.
Kernel            Error   Top-2 Error   F1      Top-2 F1
mismatch(5,1)     34.8    18.3          65.36   81.95
mismatch(5,2)     32.6    18.0          67.51   82.21
mismatch(6,3)     31.2    17.2          68.92   83.01
mismatch(7,4)     31.1    18.0          68.96   82.16
mismatch(9,3)     31.4    18.0          68.59   82.33
mismatch(9,4)     32.2    17.8          67.83   82.36
mismatch(10,3)    32.3    18.0          67.65   82.12
mismatch(10,4)    31.7    19.1          68.29   81.04
Results of mismatch kernel classification for the remote homology detection problem are shown in Table 1.
We observe that larger values of k and m perform better compared to typically used values of k=5-6, m=1-2. For
instance, (k=10,m=5)-mismatch kernel achieves significantly higher average ROC50 score of 53.78 compared
to ROC50 of 41.92 and 49.02 for the (k=5,m=1)- and (k=5,m=2)-mismatch kernels. Prior to this study, it was not
possible to investigate the utility of such large mismatch kernels.
We also note that the results for per-family or per-superfamily based parameter selection suggest the need for
model selection and the use of multiple kernels, e.g., per-family kernel selection results in much higher ROC50
of 60.32 compared to 53.78 of the best single kernel.
For the music genre classification task (Table 2), parameter combinations with moderately long k and larger
values of m tend to perform better than kernels with small m. As can be seen from the results, larger values of m
are important for achieving good classification accuracy and outperform settings with small values of m.
Conclusions. In this work we proposed a new systematic method that allows evaluation of inexact string
family kernels for long substrings k with large number of mismatches m. The method finds the intersection
set sizes by explicitly computing them for small alphabet size |Σ| and then generalizing this to arbitrary
large alphabets. We show that this enables one to explore a larger set of kernels which as we demonstrate
experimentally can further improve performance of the string kernels.
References
[1] Chris H.Q. Ding and Inna Dubchak. Multi-class protein fold recognition using support vector machines and neural networks.
Bioinformatics, 17(4):349–358, 2001.
[2] Eugene Ie, Jason Weston, William Stafford Noble, and Christina Leslie. Multi-class protein fold recognition using adaptive
codes. In ICML ’05, pages 329–336, New York, NY, USA, 2005. ACM.
[3] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund, and Christina S. Leslie. Profile-based string kernels
for remote homology detection and motif extraction. In CSB, pages 152–160, 2004.
[4] Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Scalable algorithms for string kernels with inexact matching. In NIPS,
2008.
[5] Pavel P. Kuksa and Vladimir Pavlovic. Spatial representation for efficient sequence classification. In ICPR, 2010.
[6] Christina Leslie and Rui Kuang. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res.,
5:1435–1455, 2004.
[7] Christina S. Leslie, Eleazar Eskin, Jason Weston, and William Stafford Noble. Mismatch string kernels for SVM protein
classification. In NIPS, pages 1417–1424, 2002.
[8] Tao Li, Mitsunori Ogihara, and Qi Li. A comparative study on content-based music genre classification. In SIGIR ’03, pages
282–289, New York, NY, USA, 2003. ACM.
[9] Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff, and William Stafford Noble. Semi-supervised
protein classification using cluster kernels. Bioinformatics, 21(15):3241–3247, 2005.
1 http://opihi.cs.uvic.ca/sound/genres
Unsupervised Hashing with Graphs
Wei Liu†, Jun Wang‡, Sanjiv Kumar§, Shih-Fu Chang†
† Electrical Engineering Department, Columbia University, New York, NY, USA
{wliu,sfchang}@ee.columbia.edu
‡ IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA
wangjun@us.ibm.com
§ Google Research, New York, NY 10011, USA
sanjivk@google.com
Hashing is becoming increasingly popular for efficient nearest neighbor search in massive databases. However,
learning short codes that yield good search performance is still a challenge. Moreover, in many cases real-world
data lives on a low-dimensional manifold, which should be taken into account to capture meaningful nearest
neighbors. In this work, we propose a novel graph-based hashing method which automatically discovers the
neighborhood structure inherent in the data to learn appropriate compact codes in an unsupervised manner.
One of the most critical shortcomings of the existing unsupervised hashing methods such as Locality-Sensitive
Hashing (LSH) [1] is the need to specify a global distance measure. On the contrary, in many real-world applications data resides on an intrinsic manifold. For these cases, one can only specify local distance measures, while
the global distances are automatically determined by the underlying manifold. Our basic idea is motivated by
Spectral Hashing (SH) [2] in which the goal is to embed the data in a Hamming space such that the neighbors in
the original data space remain neighbors in the Hamming space.
Solving the above problem requires three main steps: (i) building a neighborhood graph using all n points
from the database ($O(dn^2)$), (ii) computing r eigenvectors of the graph Laplacian (O(rn)), and (iii) extending r
eigenvectors to any unseen data point (O(rn)). Unfortunately, step (i) is intractable for offline training while step
(iii) is infeasible for online hashing given very large n. To avoid these bottlenecks, [2] made a strong assumption
that data is uniformly distributed. This leads to a simple analytical eigenfunction solution of 1-D Laplacians, but
the manifold structure of the original data is almost ignored, substantially weakening the basic theme of that work.
In this work, we propose a novel unsupervised hashing approach named Anchor Graph Hashing (AGH) to
address both of the above bottlenecks. We build an approximate neighborhood graph using Anchor Graphs [3], in
which the similarity between a pair of data points is measured with respect to a small number of anchors (typically
a few hundred). The resulting graph is built in O(n) time and is sufficiently sparse, with performance approaching
that of the true kNN graph as the number of anchors increases. Because of the low-rank property of an Anchor Graph’s
adjacency matrix, our approach can solve the graph Laplacian eigenvectors in linear time. One critical requirement
to make graph-based hashing practical is the ability to generate hash codes for unseen points. This is known as out-of-sample extension in the literature. Significantly, we show that the eigenvectors of the Anchor Graph Laplacian
can be extended to the generalized eigenfunctions in constant time, thus leading to fast code generation.
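The following sketch shows one way the pipeline described above could be put together. It rests on assumptions the abstract does not spell out: a Gaussian kernel over the s nearest anchors, a low-rank adjacency of the form $Z \Lambda^{-1} Z^T$ with $\Lambda = \mathrm{diag}(Z^T \mathbf{1})$, thresholding at zero, and anchors drawn from the data instead of K-means centers. The function names are hypothetical.

```python
import numpy as np

def anchor_affinities(X, anchors, s=2, t=1.0):
    """Row-stochastic n x a matrix Z; each point keeps weights to its s nearest anchors."""
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :s]
    rows = np.arange(X.shape[0])[:, None]
    near_d2 = d2[rows, nearest]
    # Shifting by the smallest distance avoids underflow and cancels after normalization.
    w = np.exp(-(near_d2 - near_d2.min(axis=1, keepdims=True)) / t)
    Z = np.zeros_like(d2)
    Z[rows, nearest] = w / w.sum(axis=1, keepdims=True)
    return Z

def fit_projection(Z, r):
    """Spectral embedding weights from the small a x a matrix, never forming the n x n graph."""
    n = Z.shape[0]
    inv_sqrt = 1.0 / np.sqrt(Z.sum(axis=0))          # Lambda^(-1/2); assumes no empty anchor
    M = (Z * inv_sqrt).T @ (Z * inv_sqrt)            # Lambda^(-1/2) Z^T Z Lambda^(-1/2)
    evals, evecs = np.linalg.eigh(M)
    order = np.argsort(evals)[::-1][1:r + 1]         # skip the trivial eigenvalue 1
    return np.sqrt(n) * (inv_sqrt[:, None] * evecs[:, order]) / np.sqrt(evals[order])

def hash_codes(Z, W):
    """One bit per retained eigenvector, obtained by thresholding at zero."""
    return (Z @ W > 0).astype(np.uint8)

def demo(n=300, a=12, r=4, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    anchors = X[rng.choice(n, size=a, replace=False)]   # stand-in for K-means centers
    Z = anchor_affinities(X, anchors)
    W = fit_projection(Z, r)
    return X, anchors, Z, W, hash_codes(Z, W)

X, anchors, Z, W, codes = demo()
print(codes[:5])
```

An unseen point is hashed by calling `anchor_affinities` on that point alone and multiplying by `W`, so the cost depends on the number of anchors and bits but not on n.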
Finally, to deal with the problem of poor quality of hash functions associated with the higher eigenfunctions of
the graph Laplacian, we propose a hierarchical threshold learning procedure in which each eigenfunction yields
multiple bits. Thus, one avoids picking higher eigenfunctions to generate more bits, and bottom few eigenfunctions
are visited multiple times. We describe a simple method for optimizing the thresholds to obtain multiple bits.
One interesting characteristic of the proposed hashing method AGH is that it tends to capture semantic neighborhoods. In other words, data points that are close in the Hamming space, produced by AGH, tend to share
similar semantic labels (see Fig. 1). This is because for many real-world applications close-by points on a manifold tend to share similar labels, and AGH is derived using a neighborhood graph which reveals the underlying
manifold, especially at large scale. Fig. 1 indicates that the hash function of AGH can generate hash bits along
manifolds. The key characteristic, semantic hash bits, of AGH is validated by extensive experiments carried out
on two datasets, where AGH outperforms state-of-the-art hashing methods as well as exhaustive linear scan in the
input space with the commonly used ℓ2 distance. The results are shown in Table 1.

Figure 1. Hash functions. The left subfigure shows the hash function of LSH, and the right subfigure
shows the hash function of our approach AGH.

Table 1. Hamming ranking performance on MNIST and NUS-WIDE. r denotes the number of hash
bits used in hashing algorithms, and also the number of eigenfunctions used in SE ℓ2 linear scan.
The K-means execution time is 20.1 sec and 105.5 sec for training AGH on MNIST and NUS-WIDE,
respectively. All training and test time is recorded in sec.

MNIST (70k samples)
Method           MAP (r = 24)   MAP (r = 48)   Train Time (r = 48)   Test Time (r = 48)
ℓ2 Scan          0.4125 (single value)         –                     –
SE ℓ2 Scan       0.5269         0.3909         –                     –
LSH              0.1613         0.2196         1.8                   2.1×10^-5
PCAH             0.2596         0.2242         4.5                   2.2×10^-5
USPLH            0.4699         0.4930         163.2                 2.3×10^-5
SH               0.2699         0.2453         4.9                   4.9×10^-5
KLSH             0.2555         0.3049         2.9                   5.3×10^-5
SIKH             0.1947         0.1972         0.4                   1.3×10^-5
One-Layer AGH    0.4997         0.3971         22.9                  5.3×10^-5
Two-Layer AGH    0.6738         0.6410         23.2                  6.5×10^-5

NUS-WIDE (270k samples)
Method           MP (r = 24)    MP (r = 48)    Train Time (r = 48)   Test Time (r = 48)
ℓ2 Scan          0.4523 (single value)         –                     –
SE ℓ2 Scan       0.4866         0.4775         –                     –
LSH              0.3196         0.2844         8.5                   1.0×10^-5
PCAH             0.3643         0.3450         18.8                  1.3×10^-5
USPLH            0.4269         0.4322         834.7                 1.3×10^-5
SH               0.3609         0.3420         25.1                  4.1×10^-5
KLSH             0.4232         0.4157         8.7                   4.9×10^-5
SIKH             0.3270         0.3094         2.0                   1.1×10^-5
One-Layer AGH    0.4762         0.4761         115.2                 4.4×10^-5
Two-Layer AGH    0.4699         0.4779         118.1                 5.3×10^-5
References
[1] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proceedings of the 20th ACM Symposium on Computational Geometry, pp. 253-262, Brooklyn, New York,
USA, 2004.
[2] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. Advances in Neural Information Processing Systems (NIPS) 21,
MIT Press, pp. 1753-1760, 2009.
[3] W. Liu, J. He, and S.-F. Chang. Large Graph Construction for Scalable Semi-Supervised Learning. In Proceedings of
the 27th International Conference on Machine Learning (ICML), pp. 679-686, Haifa, Israel, 2010.
Unifying Non-Maximum Likelihood Learning
Objectives with Minimum KL Contraction∗
Siwei Lyu
Computer Science Department
University at Albany, State University of New York
lsw@cs.albany.edu
Practical applications of machine learning algorithms in data analysis
call for efficient estimation of probabilistic data models. However, when
used to learn high dimensional parametric probabilistic models (e.g., Markov
random fields [9] and products of experts [6]), the classical maximum likelihood (ML) learning oftentimes suffers from computational intractability
due to the normalizing partition function. Such a difficulty motivates the
active developments of non-ML learning methods. Yet, because of their
divergent motivations and forms, the objective functions of many non-ML
learning methods are seemingly unrelated – each is conceived with specific
insights on the learning problem. The multitude of different non-ML learning methods causes a significant “burden of choice” for machine learning
practitioners.
In this work, we describe a general non-ML learning principle that we
term minimum KL contraction (MKC). The MKC objective is based on
an information geometric view of parametric learning for probabilistic models on a statistical manifold. In particular, we first define a KL contraction
operator as a mapping between probability distributions on statistical manifolds, under which the Kullback-Leibler (KL) divergence of two distributions
always decreases unless the two distributions are equal. The MKC objective
then seeks optimal parameters that minimize the contraction of the KL
divergence between the two distributions after they are transformed with
a KL contraction operator. Preliminary results have shown that the MKC
principle provides a unifying framework for a wide range of important or recently developed non-ML learning methods, including contrastive divergence
[6], noise-contrastive estimation [5], partial likelihood [3], non-local contrastive
objectives [11], score matching [7], pseudo-likelihood [2], maximum conditional likelihood [8], maximum mutual information [1], maximum marginal
likelihood [4], and conditional and marginal composite likelihood [10]. Each
of these learning objective functions can be recast as an instantiation of the
general MKC objective functions with different KL contraction operators.
∗ This work is to appear at NIPS 2011.
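To make the notion of contraction concrete, the sketch below uses marginalization as an illustrative operator; this choice is an assumption made here for illustration, and the names `kl`, `marginalize`, and `contraction` are hypothetical. By the chain rule for KL divergence, the divergence between two joint distributions is never smaller than the divergence between their marginals, so the gap computed below is non-negative and vanishes when the two distributions coincide.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions given as arrays."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0                                  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def marginalize(joint):
    """Illustrative operator: sum out the second variable of a joint table."""
    return joint.sum(axis=1)

def contraction(p, q, operator):
    """Amount by which the operator shrinks the KL divergence between p and q."""
    return kl(p, q) - kl(operator(p), operator(q))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(12)).reshape(3, 4)      # random 3 x 4 joint distributions
q = rng.dirichlet(np.ones(12)).reshape(3, 4)

print(contraction(p, q, marginalize))             # non-negative gap
print(contraction(p, p, marginalize))             # zero when the distributions are equal
```

In a learning setting, q would be a parametric model and the gap would be minimized over its parameters; the sketch only evaluates the gap for fixed distributions.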
The MKC principle provides a deepened and unified understanding of
existing non-ML learning methods, which can facilitate designing new efficient non-ML learning methods to focus on essential aspects of the KL
contraction operators. To the best of our knowledge, such a unifying view of the
wide range of existing non-ML learning methods has not been explicitly
described previously in the literature of machine learning and statistics.
References
[1] L. Bahl, P. Brown, P. de Souza, and R. Mercer. Maximum mutual information
estimation of hidden markov model parameters for speech recognition. In
ICASSP, 1986.
[2] J. Besag. Statistical analysis of non-lattice data. The Statistician, 24:179–95,
1975.
[3] D. R. Cox. Partial likelihood. Biometrika, 62(2):pp. 269–276, 1975.
[4] I.J. Good. The Estimation of Probabilities: An Essay on Modern Bayesian
Methods. MIT Press, 1965.
[5] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
[6] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
[7] A. Hyvärinen. Estimation of non-normalized statistical models using score
matching. Journal of Machine Learning Research, 6:695–709, 2005.
[8] T. Jebara and A. Pentland. Maximum conditional likelihood via bound maximization and the CEM algorithm. In NIPS, 1998.
[9] R. Kindermann and J. L. Snell. Markov Random Fields and Their Applications. American Mathematical Society, 1980.
[10] B. G Lindsay. Composite likelihood methods. Contemporary Mathematics,
80(1):22–39, 1988.
[11] D. Vickrey, C. Lin, and D. Koller. Non-local contrastive objectives. In ICML,
2010.
Active prediction in graphical models
Satyaki Mahalanabis and Daniel Štefankovič ∗
1 Introduction
Given an undirected graph with an unknown labeling of its nodes (using a fixed set of labels) and the
ability to observe the labels of a subset of the nodes, consider the problem of predicting the labels of
the unobserved nodes. This is a widely studied problem in machine learning, where the focus until now
has largely been on designing the prediction algorithm. Recently however, the problem of selecting
the set of vertices to observe (i.e. active prediction) has been receiving attention [1], and this is what
our work tries to address. We are interested in designing adaptive strategies for the selection problem,
i.e., at any point in time, the next node we choose to observe can depend on the labels of nodes that
have already been selected and (hence) revealed. While previous research [2, 1] has mostly focused on
the worst case prediction error of the selection strategy (i.e., assuming adversarial labeling of all nodes),
we analyze the expected error under the assumption that the labels are random variables with a known
joint distribution. In other words, we consider graphical models (such as the Ising model) with known
parameters. We assume that the selection strategy is given a budget on the number of nodes it can
observe, and we study the expected number of mispredictions as a function of this budget. We are
particularly interested in the question of whether there exist very simple yet optimal node selection
strategies.
The example we study is the (ferromagnetic) Ising model (one of the simplest graphical models) on
a chain graph. To further simplify the problem we consider a continuous version of the Ising model, that
is, the limit as the number of vertices on the path goes to infinity. We show that some simple adaptive
selection strategies have optimal error (up to a constant factor), and have asymptotically smaller number
of mispredictions than non-adaptive strategies. We also prove the hardness of selecting the optimal set
of nodes (either adaptively or non-adaptively) for graphical models on general graphs.
1.1 Related work
In [2], the authors consider the problem of non-adaptive selection for minimizing the worst case prediction
error for the case of binary labels. They give an upper bound on the error of the min-cut predictor, for
a given set of observed nodes S, in terms of the cut size (that is, the number of edges whose endpoints
have different labels) of the unknown labeling and a graph cut function Ψ(S) (which depends on the
underlying graph). The problem of choosing the observed nodes then reduces to that of minimizing
Ψ(S) over sets S of a given (budget) size, for which they suggest a heuristic strategy (the goal of that
paper is not to provide provable guarantees for the heuristic). The authors of [1] give a non-adaptive selection strategy in a similar adversarial setting for the special case of trees such that the min-cut predictor’s
number of mispredictions is at most a constant factor worse than any (non-adaptive) prediction strategy
(analyzed in the worst case for labelings that have bounded cut size). We point out that the algorithm of
[1] does not need to know the cut size constraint, while in our case the expected cut size is known and
depends on the model parameters. Further, [1] provides a lower bound on the prediction error for general graphs as well. Unlike [2, 1], [3] considers an adaptive version of the selection problem for (certain
classes of) general graphs where the cut size induced by the labeling is known. Their goal, unlike ours,
is to recover the correct label of every node using as few observations as possible. They claim to give
a strategy whose required number of labels matches the information-theoretic lower bound (ignoring
constant factors).
∗ Department of Computer Science, University of Rochester, {smahalan,stefanko}@cs.rochester.edu
Finally, note that our goal here, namely, the active selection of nodes to observe, is different from that
of, e.g., [4] where the nodes to be labeled are chosen by an adversary, or from that of semi-supervised
settings, e.g., [5, 6] where the query nodes are chosen randomly. Also, we aim to minimize the expected
number of mispredictions, as opposed to inferring the most likely labeling (i.e., the “minimum energy
configuration”) of the unobserved nodes, which is what most applications in computer vision tend to
focus on, e.g., [7] (see also [8]).
2 Summary of Results
We state our main results briefly (see the full version [9] for details).
Proposition 1. Computing the non-adaptive (or adaptive) strategy with budget B that minimizes the
expected average error is NP-hard.
Next, we study the continuous limit of a simple chain (i.e., 1-D) Ising model, which we define
as follows. Consider a 1-D Ising model on n vertices with inverse temperature β > 0. It is much
easier to study the behavior of this model in the limit as n → ∞ and β scales appropriately so that
$L = -\frac{n-1}{2} \ln \lambda$ stays fixed, where λ = tanh(β). We now state results for predicting labels for such a
model with a label budget of B. The simplest non-adaptive strategy, querying uniformly spaced points,
is, not surprisingly, optimal (among non-adaptive strategies).
Proposition 2. The non-adaptive strategy that queries uniformly spaced points is an optimal non-adaptive strategy and
has error $\frac{1}{2}\left(1 - \frac{1 - \exp(-L/(B+1))}{L/(B+1)}\right) = \frac{L}{4(B+1)} + O(1/B^2)$.
Next we give a lower bound on the error of any adaptive strategy.
Proposition 3. The expected average error of any adaptive strategy is at least $\frac{1}{2}\left(1 - \frac{\tanh(L/(B+1))}{L/(B+1)}\right)$.
For $B \ge L - 1$ the lower bound can be simplified to $(L/(B+1))^2/9$. (For $B \le L - 1$ the lower bound
is a constant.)
Now we show that combining the non-adaptive strategies with binary search yields an adaptive
strategy with asymptotically better performance.
Proposition 4. There is an adaptive strategy with error bounded by $21(L/B)^2$, assuming L ≥ 1.
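The three bounds stated in Propositions 2 to 4 can be compared numerically. The short Python sketch below simply evaluates the closed-form expressions as written above; the function names are illustrative and are not taken from the full version [9].

```python
import math

def nonadaptive_error(L, B):
    """Proposition 2: error of the uniformly spaced non-adaptive strategy."""
    x = L / (B + 1)
    return 0.5 * (1.0 - (1.0 - math.exp(-x)) / x)

def adaptive_lower_bound(L, B):
    """Proposition 3: lower bound on the error of any adaptive strategy."""
    x = L / (B + 1)
    return 0.5 * (1.0 - math.tanh(x) / x)

def adaptive_upper_bound(L, B):
    """Proposition 4: bound for the binary-search-based strategy (assumes L >= 1)."""
    return 21.0 * (L / B) ** 2

# Tabulate the three quantities for a fixed L and a growing label budget B.
for B in (100, 1000, 10000):
    print(B, nonadaptive_error(2.0, B), adaptive_lower_bound(2.0, B),
          adaptive_upper_bound(2.0, B))
```

For large B the non-adaptive error decays like L/(4(B+1)), while both adaptive bounds decay quadratically in L/B, which is the asymptotic improvement claimed for the adaptive strategy.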
References
[1] N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella, “Active learning on trees and graphs,” in
COLT, pp. 320–332, 2010.
[2] A. Guillory and J. Bilmes, “Label selection on graphs,” in Conference on Neural Information Processing Systems, 2009.
[3] P. Afshani, E. Chiniforooshan, R. Dorrigiv, A. Farzan, M. Mirzazadeh, N. Simjour, and H. Zarrabi-Zadeh, “On the complexity of finding an unknown cut via vertex queries,” in COCOON, pp. 459–469, 2007.
[4] N. Cesa-Bianchi, C. Gentile, and F. Vitale, “Fast and optimal prediction on a labeled tree,” in COLT,
2009.
[5] A. Blum and S. Chawla, “Learning from labeled and unlabeled data using graph mincuts,” in ICML,
pp. 19–26, 2001.
[6] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and
harmonic functions,” in ICML, pp. 912–919, 2003.
[7] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222–1239, 2001.
[8] J. M. Kleinberg and É. Tardos, “Approximation algorithms for classification problems with pairwise
relationships: Metric labeling and markov random fields,” in FOCS, pp. 14–23, 1999.
[9] S. Mahalanabis and D. Štefankovič, “Active prediction in graphical models.”
http://www.cs.rochester.edu/ smahalan/ising.pdf.
An Ensemble of Linearly Combined
Reinforcement-Learning Agents
Vukosi Marivate, vukosi@cs.rutgers.edu
Michael Littman, mlittman@cs.rutgers.edu
September 9, 2011
Abstract
Reinforcement-learning (RL) algorithms are often tweaked and tuned to specific environments when
applied, calling into question whether learning can truly be considered autonomous in these cases. In
this work, we show how more robust learning across environments is possible by adopting an ensemble
approach to reinforcement learning. We find that a learned linear combination of Q-values from multiple
independent learning algorithms results in an agent that compares favorably to the best tuned algorithm
across a range of problems. We compare our proposed approach, which tries to minimize the error
between observed returns and the predicted returns from the underlying ensemble learners, to an
existing approach called policy fusion and show that our approach is considerably more effective. Our work
provides a promising basis for further study into the use of ensemble RL methods.
1 Introduction
The task of creating a single reinforcement-learning (RL) agent that can learn in many possible
environments without modification is not a simple one. It is typical for algorithm designers to modify
state representations, learning protocols, or parameter values to obtain good performance on novel
environments. However, the more problem-specific tuning needed, the less “autonomous” an RL system is,
eroding some of the value of RL systems in practice. Across a wide range of computational domains,
ensemble learning methods have proven extremely valuable for tackling complex problems reliably.
Ensemble (or sometimes modular or portfolio) methods [Rokach, 2010] harness multiple, perhaps quite
disparate, algorithms for a problem class to greatly expand the range of specific instances that can be
addressed. They have proven themselves to be state-of-the-art approaches for crossword solving,
satisfiability testing, movie recommendation and question answering. We believe the success of ensemble
methods on these problems is because they can deal with a range of instances that require different
low-level approaches. RL instances share this attribute, suggesting that an ensemble approach could
prove valuable there as well.
2 Linearly Combined Ensemble Reinforcement Learning
In this work, we present an approach to ensemble-based RL using a linear Temporal Difference (TD)
learning algorithm as a meta learner to combine the value estimates from multiple base RL algorithm
agents. Our approach goes beyond earlier efforts in ensemble RL [Wiering and van Hasselt, 2008], which
fused policies or actions, in that we develop a fusion method that is learned and adjusted given the base
agents in the ensemble instead of combining low-level agents inflexibly. We propose a flexible method to
combine Action-Value functions (Q-values) from different learners,
$Q(s, a) = \sum_{i=1}^{n} w_i Q_i(s, a)$, where $w_i$
is the weight for base learner i and $Q_i(s, a)$ is the
Q-value of state s and action a as estimated by the
ith learner. By using Q-values, we are able to solve
the credit assignment problem in terms of deciding
how to weight the individual contribution of each algorithm. The high-level learner (meta learner) uses the
linearly combined and weighted individual Q-values
from each of the n low-level (base) learners to choose
actions. Given that both the base learners and the
meta learner need to adapt, learning can be run in
either one or two stages. In the two-stage approach,
the base learners are first trained on the environment in
question by themselves and then frozen, after which the
meta learner learns the weights to combine the Q-values of the base learners. With the above description of the combination of base learners, we can view
each base learner’s Q-value as a feature for the high-level
learner that depends on a state-action pair,
(s, a). In the single-stage learning approach, both
the base and meta learners are trained at the same
time. The two-stage meta learner is a least squares
algorithm that minimizes the Bellman residual error
and adjusts weights using the TD update rule with the
individual agents’ Q-values, $Q_i(s, a)$, as features. The
two-stage learner thus converges to the weights that
result in the smallest error between the estimates
and the real returns. In the single-stage learning
approach, the simultaneous adaptation of base and
meta learner makes convergence less clear. Further
analysis of this case is still needed.
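The two-stage combination described above can be illustrated with a minimal Python sketch. It assumes the base learners are already trained and frozen and are exposed as plain functions q(s, a). It uses a simple semi-gradient TD(0) (SARSA-style) update on the weights rather than the least squares procedure mentioned in the text, and all class, method and parameter names (LinearMetaLearner, alpha, gamma, epsilon) are hypothetical.

```python
import random

class LinearMetaLearner:
    """Combines frozen base Q-functions as Q(s, a) = sum_i w_i * Q_i(s, a)."""

    def __init__(self, base_q_functions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.base = list(base_q_functions)   # frozen base learners (assumed given)
        n = len(self.base)
        self.w = [1.0 / n] * n               # start from a uniform combination
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def features(self, s, a):
        # Each base learner's Q-value is one feature of the pair (s, a).
        return [q(s, a) for q in self.base]

    def q_value(self, s, a):
        return sum(w * f for w, f in zip(self.w, self.features(s, a)))

    def act(self, s, actions, rng=random):
        # Epsilon-greedy action choice on the combined Q-values.
        if rng.random() < self.epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: self.q_value(s, a))

    def update(self, s, a, r, s_next, a_next, done):
        # TD(0) update of the weights with base Q-values as features.
        target = r if done else r + self.gamma * self.q_value(s_next, a_next)
        delta = target - self.q_value(s, a)
        phi = self.features(s, a)
        self.w = [w + self.alpha * delta * f for w, f in zip(self.w, phi)]
        return delta
```

In a one-step toy problem where one base learner predicts the reward exactly and another predicts it wrongly, repeated calls to update move the weight of the accurate learner toward 1 and the weight of the inaccurate one toward 0.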
3 Experiments
To better understand the strengths and weaknesses of our proposed algorithm, we carried out a number
of experiments on learning in multiple environments that were part of the 2009 RL Competition
Polyathlon challenge. The Polyathlon required each participant to build a general RL agent that
performed well across multiple environments without prior knowledge of those environments. For the base
learners, we used 3 different types of TD(0) learning agents: Table-Based Q-Learning, Table-Based
SARSA and Linear SARSA (L-SARSA). Within these broad categories, variations of their parameters
could have a dramatic effect on their success in a given environment. Promising results showing the
performance of the linearly combined ensemble RL are shown in Figure 3.1. Here, the two-stage meta
learner discovers the best single algorithm learner while the single-stage learner has a slightly higher
mean than all the other configurations.
Figure 3.1: Cat & Mouse Results
4 Future Work
The two-stage and single-stage learning dynamics need further investigation as they both can lead to
high performance gains in different environments. The impact of individual effectiveness on the
weighting and performance of the two-stage meta learner could lead to better understanding of the approach.
Nevertheless, these promising initial results indicate that reinforcement-learning algorithms, like
regression and classification algorithms, can also benefit substantially from the ensemble approach.
References
L. Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1):1–39, 2010.
M.A. Wiering and H. van Hasselt. Ensemble algorithms in reinforcement learning. Systems,
Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 38(4):930–936, 2008.
Autonomous RF Surveying Robot for Indoor Localization and Tracking
Ravishankar Palaniappan, Piotr Mirowski, Tin Kam Ho, Harald Steck, Philip Whiting and Michael MacDonald
Alcatel-Lucent Bell Labs, Murray Hill, NJ; ravishankar.palaniappan@alcatel-lucent.com
I. INTRODUCTION
Technologies for tracking people within buildings are important enablers for many public safety applications
and commercial services. The need has not been met by conventional navigation aids such as GPS and inertial
systems. There have been many approaches to solve this problem such as using pseudolites, highly sensitive Inertial
Measurement Units (IMUs), RF based systems using techniques like Received Signal Strength Indicator (RSSI),
Time Difference of Arrival (TDOA) or Time of Flight (TOF) on different spectral bands and standards [1], [2], [3],
[4], and hybrid methods using sensor fusion. Some solutions require special devices such as ultrasonic or RFID tags
and readers, or are constrained by requiring line of sight, as in the use of video cameras. The most desirable
systems are those that cause minimal intervention to ongoing activities, use existing devices and infrastructure, and are
cheap to deploy and maintain.
Over the years our team of researchers has worked on several components of a
WLAN-based tracking technology that are ready to be refined and integrated into an
operational solution. These include (1) a simulator for predicting radio propagation
using path-loss models and ray-tracing computation; (2) a set of statistical algorithms
for interpolating RSSI maps from manually collected, irregularly spaced raw data;
and (3) algorithms for using the refined signal maps to determine positions in
real time. But a practical solution still requires the elimination of the painstaking
process of creating a signal map manually by walking/driving through the space and
collecting signal strength measurements along with precise position information.
Furthermore, it is necessary to be able to repeat the signal map construction
periodically to adapt to potential changes in the radio environment. Motivated by
these needs, we have recently developed a robotic test-bed that can carry multiple
sensors and navigate within buildings autonomously while gathering signal strength
data at various frequencies of interest. A robotic platform that can perform such
tasks with minimal human intervention is a highly promising solution to bridge the
gap between laboratory trials and practical deployments at large scale.
Fig. 1. Indoor mapping robot.
Our robotic platform is capable of conducting repeated surveying of an indoor
environment to update the signal maps as frequently as required. Also the robot serves as a test-bed to collect data
in controlled conditions such as under uniform motion and with accurate positioning. As it moves autonomously in
the indoor environment, it uses a new method of multi-sensor integration for Simultaneous Localization & Mapping
(SLAM) to construct the spatial model and infer its own location. In this way, the construction of the spatial map
of RF signal strength can be fully automated. The robot can carry different receivers and systematically compare
the effectiveness of different localization methods. Currently we experiment with WLAN fingerprinting techniques.
We expect that the robot can support similar experiments on other RF signals of interest, including GSM, CDMA
and LTE.
II. INDOOR MAPPING ROBOT
We now describe the hardware of the Indoor Mapping Robotic (IMR) test-bed vehicle which is used to survey
and construct radio signal maps within buildings quickly and as frequently as needed. Since the robotic platform
(see Fig. 1) is primarily intended for indoor application we chose a vehicle that was easily maneuverable through
narrow corridors and doorways. Our choice was a Jazzy Jet 3 wheelchair with two motors, two 12 V rechargeable
batteries and a maximum operating time of 4 hours. The chair and chair mount were replaced by a custom-built
aluminum base that hosts all our sensors, electronics and hardware. The heart of the electronic system is a low-power 1.66 GHz Mini-ITX computer motherboard that handles all the operations of the robot including navigation
and signal strength data collection. The robot is controlled through a microcontroller that sends serial PWM signals
to the motor controllers for navigation.
Data from different on-board sensors such as the sonar, inertial sensors, Microsoft Kinect and video cameras
are streamed through an on-board 802.11g link to a control station for post processing and analysis. The robot
autonomously navigates the indoor environment using collision detection algorithms. The main sensor used for this
feature is the Microsoft Kinect® RGB-D sensor developed by PrimeSense Inc, which can simultaneously acquire
depth and color images. An open-source software driver (available at http://openkinect.org) makes it possible to grab both
depth and RGB images at 30 Hz and 640x480 pixel resolution, with 11-bit depth and 8-bit color
definitions, respectively [5].
Autonomous robot navigation is currently implemented as a collision avoidance strategy using the Kinect and
other sensors including sonar and contact sensors. The collision detector primarily relies on the output from the
Kinect RGB-D sensor. After sub-sampling the depth image to an 80x30 pixel size and converting depth information
to metric distances, a simple threshold-based algorithm makes decisions to turn if there are obstacles within 1m on
the right or left of the field of depth vision. Another threshold is used for stairs detection, by scanning the bottom
five lines of the depth image. The robot moves forward if no collisions or falls are perceived. If the Kinect does
not detect small obstacles, such as pillars or trash cans, that are outside the field of view of the Kinect camera, the
sonar sensors serve as backup sensors for obstacle avoidance.
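The threshold-based decision rule described above can be sketched in Python as follows. The 1 m obstacle threshold and the use of the bottom five rows come from the text; everything else is an assumption made for illustration: the depth image is a list of rows of metric distances with 0 marking invalid readings, the drop threshold DROP_DIST_M is a made-up value, obstacles are checked only above the bottom rows, and the command names are hypothetical.

```python
from statistics import median

OBSTACLE_DIST_M = 1.0   # obstacle threshold from the text
STAIR_ROWS = 5          # bottom rows scanned for stairs, from the text
DROP_DIST_M = 2.5       # hypothetical: floor farther than this suggests a drop

def decide(depth_m):
    """Return a motion command from a sub-sampled depth image in meters."""
    upper = depth_m[:-STAIR_ROWS]      # rows used for obstacle detection
    bottom = depth_m[-STAIR_ROWS:]     # rows used for stair detection
    half = len(depth_m[0]) // 2

    def blocked(cols):
        # True if any valid reading in these columns is closer than the threshold.
        return any(0.0 < d < OBSTACLE_DIST_M for row in upper for d in row[cols])

    floor = [d for row in bottom for d in row if d > 0.0]
    if floor and median(floor) > DROP_DIST_M:
        return "stop"                  # possible stairs or drop ahead
    left, right = blocked(slice(0, half)), blocked(slice(half, None))
    if left and right:
        return "turn_around"
    if left:
        return "turn_right"            # steer away from the obstacle
    if right:
        return "turn_left"
    return "forward"
```

A real controller would also fuse the sonar and contact sensors mentioned in the text; this sketch covers only the depth-image thresholds.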
III. SIMULTANEOUS LOCALIZATION AND MAPPING
A critical need in constructing a spatial map of RF signals is to accurately
record the location of the receiver at the time when the signal strength is
measured. In addition, the location needs to be registered to the building
layout for use in applications and services, even if no detailed blueprints
are available for the building of interest. On the robotic platform, these can
be accomplished using a SLAM (Simultaneous Localization And Mapping)
algorithm. Conventional 3D database modeling of indoor environments by
mobile robots has involved the use of laser scanners and cameras. The laser
scanner is used to build a 3D point cloud map and the textures from the camera
images are stitched to these point clouds to form the complete database. A
recently published method succeeded in building precise 3D maps directly
from the Kinect RGB-D images [7], taking advantage of the dense depth
information, without relying on other sensors. It estimates 3D point
cloud transformations only between two consecutive RGB-D frames and
resolves loop closures (going twice through the same location) using
maximum likelihood smoothing, but is too computationally intensive for real-time application. We used a
simplified implementation of [7] (available at http://nicolas.burrus.name) to perform post-processing reconstruction
of the robot trajectory. Fig. 2 shows a snapshot of the 3D map being constructed from RGB-D images.
Fig. 2. Example of 3D office reconstruction from RGB-D-based odometry.
IV. ONGOING WORK
We are currently integrating additional sensing modalities on the robotic platform including sonar to improve
the navigation and localization capabilities of the robot. Since most of the robot hardware was built from
commercial off-the-shelf hardware, we plan to replicate this design for multiple robotic platforms which can be
used to quickly cover large buildings in a coordinated manner.
REFERENCES
[1] K. Ozsoy, A. Bozkurt and I. Tekin, “2D Indoor positioning system using GPS signals”, Indoor Positioning and Indoor Navigation (IPIN),
2010 International Conference on, 15-17 Sept. 2010.
[2] M. Ciurana, D. Giustiniano, A. Neira, F. Barcelo-Arroyo and I. Martin-Escalona, “Performance stability of software ToA-based ranging
in WLAN”, Indoor Positioning and Indoor Navigation (IPIN), 2010 International Conference on, 15-17 Sept. 2010.
[3] H. Kroll and C. Steiner, “Indoor ultra-wideband location fingerprinting”, Indoor Positioning and Indoor Navigation (IPIN), 2010
International Conference on, 15-17 Sept. 2010.
[4] M. Hedley, D. Humphrey and P. Ho, “System and algorithms for accurate indoor tracking using low-cost hardware”, Position, Location
and Navigation Symposium, 2008 IEEE/ION , pp.633–640, 5-8 May 2008.
[5] R.M. Geiss, Visual Target Tracking, US Patent Application 20110058709, 2011.
[6] T. Roos, P. Myllymaki, H. Tirri, P. Misikangas and J. Sievanen, “A Probabilistic Approach to WLAN User Location Estimation”,
International Journal of Wireless Information Networks, vol.7, n.3, pp.155–163. 2002.
[7] P. Henry, M. Krainin, E. Herbst, X. Ren and D. Fox, “RGB-D Mapping: Using Depth Cameras for Dense 3D Modeling of Indoor
Environments”, Experimental Robotics, 2010 12th International Symposium on, 18-21 Dec. 2010.
Title: A Comparison of Text Analysis and Social Network Analysis using Twitter Data
Authors: John Myles White (Princeton), Drew Conway (NYU)
Abstract
Within the social sciences, interest in statistical methods for the automatic analysis of text has
grown considerably in recent years. Specifically, within the field of political science there has
been continued interest in using text analysis to extend well-known spatial models of political
ideology. Simultaneously, interest in how the structure of social networks affects political
attitudes and outcomes has been both growing and controversial. In an attempt to bridge the gap
between these two research areas we have collected a sample of Twitter data related to the U.S.
Congress. Twitter provides a rich data set containing both text and social network information
about its members. Here we compare the usefulness of text analysis and social network analysis
for predicting the political ideology of Twitter users — two methods that are, in principle,
applicable to both members of Congress (for whom roll call data and precise spatial estimates of
political ideology already exist) and to the surrounding network of Twitter users (for whom
precise estimates of political ideology do not exist). To compare text analysis methods with tools
from social network analysis, we fit a variety of L1- and L2-regularized regression models that
use word count data from individual tweets to predict the ideal points of Members of Congress.
We then compare the performance of the resulting text models with the performance of social
network models that employ techniques developed for predicting the spread of transmissible
diseases to predict the ideal points for the same Members of Congress.
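As a rough illustration of the text models mentioned in the abstract, the Python sketch below fits L1- and L2-regularized linear regressions from word counts to ideal points by (proximal) gradient descent. The word counts, vocabulary and ideal points are invented toy values, not data from the study, and the function names and hyperparameters are assumptions.

```python
def soft_threshold(v, t):
    """Proximal operator of the L1 penalty."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_linear(X, y, penalty="l2", lam=0.05, lr=0.05, iters=2000):
    """Minimize (1/2n)||Xw - y||^2 plus (lam/2)||w||^2 or lam*||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iters):
        resid = [sum(wj * xj for wj, xj in zip(w, row)) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        if penalty == "l2":
            # Plain gradient step on the ridge objective.
            w = [wj - lr * (g + lam * wj) for wj, g in zip(w, grad)]
        else:
            # Proximal gradient (ISTA) step for the lasso objective.
            w = [soft_threshold(wj - lr * g, lr * lam) for wj, g in zip(w, grad)]
    return w

# Toy data: counts of the words ("tax", "health", "the") and made-up ideal points.
X = [[2, 0, 3], [0, 2, 3], [1, 0, 2], [0, 1, 4], [3, 1, 3], [1, 3, 2]]
y = [1.0, -1.0, 0.5, -0.5, 1.0, -1.0]
w_l2 = fit_linear(X, y, penalty="l2")
w_l1 = fit_linear(X, y, penalty="l1")
```

On this toy data the L1 penalty sets the weight of the uninformative word "the" exactly to zero, which is the usual reason for comparing both penalties on high-dimensional word-count features.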
Contact:
John Myles White
Graduate Student
Department of Psychology and Princeton Neuroscience Institute
Princeton University
jmw1729@princeton.edu
Time Dependent Dirichlet Process Mixture Models
for Multiple Target Tracking
Willie Neiswanger
Frank Wood
Columbia University, New York, NY 10027, USA
wdn2101@columbia.edu, fwood@stat.columbia.edu
Abstract
Multiple Target Tracking (MTT) is a machine vision problem that involves
identifying multiple concurrently moving targets in a video and maintaining
identification of these targets over time while tracking the frame-by-frame
locations of each. Difficulties in producing a single, general-purpose algorithm
capable of successfully carrying out MTT over a wide range of videos are mainly
due to the potential for videos to be highly dissimilar; parameters such as the
targets’ physical characteristics (including size, shape, color, and behavioral
patterns), backgrounds over which the targets move, and filming conditions may
all differ.
The technique developed in this project makes use of a known time dependent Dirichlet process mixture model, comprised of a sequence of interdependent
infinite Gaussian mixture models constructed to accommodate behavior that
changes over time. This work describes how pixel location values can be
extracted from a wide variety of videos containing multiple moving targets and
clustered using the aforementioned model; this process is intended to reduce the
need for explicit target identification and serve as a general MTT method applicable to targets with arbitrary characteristics moving over diverse backgrounds. The
technique is demonstrated on video segments showing multiple ant targets, of
distinct species and with diverse physical and behavioral characteristics, moving
over non-uniform backgrounds.
7 Conclusion
This paper has exhibited successful MTT over short video sequences with the use of a time dependent GPUDPM model. The technique was illustrated on two multiple target tracking scenarios
involving disparate targets and backgrounds. A general data extraction algorithm was outlined as a
simple way to collect data from a wide variety of videos containing multiple moving targets. We
hope that this research might help in the development of more general MTT algorithms, with the ability
to track a large variety of targets with arbitrary characteristics moving over diverse backgrounds.
References
[1] F. Caron, M. Davy, and A. Doucet. Generalized polya urn for time-varying Dirichlet process mixtures.
In 23rd Conference on Uncertainty in Artificial Intelligence (UAI’2007), Vancouver, Canada, July 2007.
[2] J. Gasthaus, F. Wood, D. Görür, and Y. W. Teh. Dependent Dirichlet process spike sorting. In Advances in
Neural Information Processing Systems 22, 2008.
[3] Jan Gasthaus. Spike sorting using time-varying Dirichlet process mixture models, 2008.
[4] J. E. Griffin and M. F. J. Steel. Order-based dependent Dirichlet processes. Journal of the American
Statistical Association, 101(473):179–194, 2006.
[5] W. Ng, J. Li, S. Godsill, and J. Vermaak. A review of recent results in multiple target tracking. In Image
and Signal Processing and Analysis, 2005. ISPA 2005. Proceedings of the 4th International Symposium on,
pages 40–45, Sept. 2005. doi: 10.1109/ISPA.2005.195381.
[6] Simo Sarkka, Toni Tamminen, Aki Vehtari, and Jouko Lampinen. Probabilistic methods in
multiple target tracking, review and bibliography, 2004.
Manjot Pahwa,
Undergraduate Student Researcher
Netaji Subhas Institute of Technology, University of Delhi
Recipient of Google Women in Engineering Award, 2011
manjotpahwa@gmail.com
+91-(9953)310-807
An Efficient and Optimal Scrabble Playing Algorithm based on Modified DAWG and Temporal
Difference Lambda Learning Algorithm
Abstract:
The aim of this poster is to present the development and analysis of an algorithm for playing the popular
game Scrabble. The poster will first describe and analyze the existing algorithms, and then describe the
improvements that are possible. The resulting algorithm is intended not only to be
one of the most efficient algorithms for Scrabble but also to choose the highest scoring move, or
in other words, to play as optimally as possible. The first goal is achieved with the help of a modified
version of a popular data structure known as a DAWG (directed acyclic word graph). The data
structure is modified to allow movement in any direction from a node at the top, at the bottom, or anywhere in the
middle. This makes it possible to efficiently generate prefixes, suffixes, or words that hook
between existing tiles on the board in either direction. Although such a structure would inevitably
occupy much more space than a DAWG, the processing time is reduced by an appreciable factor.
Processing is sped up further by a backtracking algorithm that searches for the highest scoring word
using the tiles already on the board.
Although a primitive form of this algorithm has been described by Appel and Jacobson in their paper
“The World’s Fastest Scrabble Program”, this poster will present an algorithm that augments their
work by not only finding a solution efficiently but also giving the highest score. This is
possible using an algorithm that searches for words in descending order of
the highest possible score. This involves consideration of not only the length of the word but also the
individual tile scores allotted to each letter. Furthermore, a greater score is expected with
the help of an approximate dynamic programming solution based on Temporal Difference (TD)
learning. This algorithm is based on the formulation of Richard S.
Sutton known as the TD-Lambda learning algorithm. Future scores can be predicted on the basis
of the tiles remaining in the bag and possible patterns. Moreover, as the game progresses, the predictions
can be fine-tuned to be more accurate. Such a methodology can vastly improve the game play of a
computer in comparison to a human. A discussion of the time and space efficiency of this solution
will also be presented. Lastly, this poster will describe some further changes that can improve
the score achieved by the computer.
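The idea of backtracking over a word graph and ranking candidates by tile score can be sketched in Python. This is only an illustration under simplifying assumptions: it uses a plain trie instead of the modified DAWG described above, ignores the board, premium squares and blanks, and uses a tiny made-up word list. The tile values are the standard English Scrabble letter values; the function names are hypothetical.

```python
from collections import Counter

# Standard English Scrabble tile values.
TILE_SCORES = {}
for value, letters in [(1, "aeilnorstu"), (2, "dg"), (3, "bcmp"),
                       (4, "fhvwy"), (5, "k"), (8, "jx"), (10, "qz")]:
    for ch in letters:
        TILE_SCORES[ch] = value

END = "$"  # marker for "a word ends at this node"

def build_trie(words):
    """Build a nested-dict trie (a DAWG would additionally merge shared suffixes)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def word_score(word):
    return sum(TILE_SCORES[ch] for ch in word)

def words_from_rack(trie, rack):
    """Backtrack through the trie using rack tiles; return words by descending score."""
    counts = Counter(rack)
    found = set()

    def backtrack(node, prefix):
        if END in node:
            found.add(prefix)
        for ch, child in node.items():
            if ch != END and counts[ch] > 0:
                counts[ch] -= 1          # use one tile
                backtrack(child, prefix + ch)
                counts[ch] += 1          # put it back

    backtrack(trie, "")
    return sorted(found, key=lambda w: (-word_score(w), w))

trie = build_trie(["cat", "act", "at", "tax", "ax", "quiz"])
print(words_from_rack(trie, "catx"))
```

Sorting after the search is the simplest way to obtain the highest scoring candidate first; the poster's approach of visiting words in descending score order would avoid enumerating low-scoring candidates at all.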
Hierarchically Supervised Latent Dirichlet Allocation
Adler Perotte
Nicholas Bartlett
Noémie Elhadad
Frank Wood
Columbia University, New York, NY 10027, USA
{ajp9009@dbmi,bartlett@stat,noemie@dbmi,fwood@stat}.columbia.edu
We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product
descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional
representations of the bag-of-word data are also of interest.
Our work operates within the framework of topic modeling. Our approach learns topic models of the underlying
data and labeling strategies in a joint model, while leveraging the hierarchical structure of the labels. For the sake
of simplicity, we focus on is-a hierarchies, but the model can be applied to other structured label spaces. Our work
extends supervised latent Dirichlet allocation (sLDA) [2] to take advantage of hierarchical supervision and proposes
an efficient way to incorporate such information into the model. We hypothesize that the context of labels within the
hierarchy provides valuable information about labeling. Other models, such as LabeledLDA [5], incorporate LDA and
supervision; however, none of these models leverage dependency structure in the label space.
We demonstrate our model on large, real-world datasets in the clinical and web retail domains. We observe that
hierarchical information is valuable when incorporated into the learning and improves our primary goal of multi-label
classification. Our results show that a joint, hierarchical model outperforms a classification with unstructured labels as
well as a disjoint model, where the topic model and the hierarchical classification are inferred independently of each
other.
HSLDA is a model for hierarchically, multiply-labeled, bag-of-word data. We will refer to individual groups of bag-of-word data as documents. Let $w_{n,d} \in \Sigma$ be the nth observation in the dth document. Let $w_d = \{w_{1,d}, \ldots, w_{N_d,d}\}$ be
the set of $N_d$ observations in document d. Let there be D such documents and let the size of the vocabulary be $V = |\Sigma|$.
Let the set of labels be $L = \{l_1, l_2, \ldots, l_{|L|}\}$. Each label l ∈ L, except the root, has a parent pa(l) ∈ L also in the set
of labels. We will for exposition purposes assume that this label set has hard “is-a” parent-child constraints, although
this assumption can be relaxed at the cost of more computationally complex inference. Such a label hierarchy forms a
multiply rooted tree. Without loss of generality we will consider a tree with a single root r ∈ L. Each document has a
variable $y_{l,d} \in \{-1, 1\}$ for every label, which indicates whether the label is applied to document d or not. In most cases
$y_{l,d}$ will be unobserved; in some cases we will be able to fix its value because of constraints on the label hierarchy;
and in the relatively minor remainder its value will be observed. In the applications we consider, only positive label
applications are observed.
In HSLDA, documents are modeled using the LDA mixed-membership mixture model with global topic estimation.
Label responses are generated using a conditional hierarchy of probit regressors [3]. The HSLDA graphical model is
given in Figure 1. In the model, K is the number of LDA “topics” (distributions over the elements of Σ), φk is a
distribution over “words,” θ d is a document-specific distribution over topics, and β is a global distribution over topics.
Posterior inference in HSLDA was performed using Gibbs sampling and Markov chain Monte Carlo. Note that, like in
collapsed Gibbs samplers for LDA [4], we have analytically marginalized out the parameters φ1:K and θ 1:D . HSLDA
also employs a hierarchical Dirichlet prior over topic assignments (i.e., β is estimated from data rather than assumed
to be symmetric). This has been shown to improve the quality and stability of inferred topics [6]. The hyperparameters
α, α0 , and γ are sampled using Metropolis-Hastings.
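To make the "conditional hierarchy of probit regressors" under hard is-a constraints more concrete, the following Python sketch samples labels top-down through a label tree. It is an illustration under stated assumptions rather than the model's actual specification: each label is given a hypothetical weight vector eta[l], the regression input is taken to be the document's topic proportions, the root is assumed to apply to every document, and a child can be positive only if its parent is positive.

```python
import math
import random

def probit(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sample_labels(parent, root, topic_props, eta, rng):
    """Sample y_l in {-1, +1} for every label, respecting is-a constraints.

    parent: dict mapping each non-root label to its parent label
    eta:    dict mapping each non-root label to a weight vector (hypothetical)
    """
    children = {}
    for label, par in parent.items():
        children.setdefault(par, []).append(label)

    y = {root: 1}            # assumption: the root applies to every document
    stack = [root]
    while stack:
        par = stack.pop()
        for label in children.get(par, []):
            if y[par] == -1:
                y[label] = -1            # hard constraint: negative parent
            else:
                score = sum(e * t for e, t in zip(eta[label], topic_props))
                y[label] = 1 if rng.random() < probit(score) else -1
            stack.append(label)
    return y
```

Because a negative parent forces all of its descendants to be negative, many of the label variables can be fixed from the hierarchy alone, which matches the remark above that some values follow from constraints on the label hierarchy.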
We applied HSLDA to data from two domains: predicting medical diagnosis codes from hospital discharge summaries
and predicting product categories from Amazon.com product descriptions. The clinical dataset consists of 6,000
clinical notes along with associated billing codes that are used to document conditions that a particular patient was
treated for. These billing codes (7298 distinct codes in our dataset) are organized in an is-a hierarchy. The retail
dataset consists of product descriptions for DVDs from the Amazon.com product catalog. This data was partially
obtained from the Stanford Network Analysis Platform (SNAP) dataset [1]. The comparison models included sLDA
with independent regressors (hierarchical constraints on labels ignored) and HSLDA fit by first performing LDA then
fitting tree-conditional regressions. The number of topics for all models was set to 50, and the priors p(α),
p(α0), and p(γ) were gamma distributions with a shape parameter of 1 and a scale parameter of 1000.
Figure 1: HSLDA graphical model
[Figure 2: three ROC panels (a), (b), (c); x-axis: 1-Specificity, y-axis: Sensitivity, both from 0.0 to 1.0.]
Figure 2: ROC curves for out-of-sample ICD-9 code prediction from patient free-text discharge records ((a),(c)). ROC
curve for out-of-sample Amazon product category predictions from product free-text descriptions (b). Figures (a)
and (b) are a function of the prior means of the regression parameters. Figure (c) is a function of auxiliary variable
threshold. In all figures, solid is HSLDA, dashed are independent regressors + sLDA (hierarchical constraints on labels
ignored), and dotted is HSLDA fit by running LDA first then running tree-conditional regressions.
The results in Figures 2(a) and 2(b) suggest that in most cases it is better to do full joint estimation of HSLDA. An
alternative interpretation of the same results is that, if one is more sensitive to the performance gains that result from
exploiting the structure of the labels, then one can, in an engineering sense, get nearly as much gain in label prediction
performance by first fitting LDA and then fitting a hierarchical probit regression. There are applied settings in which
this could be advantageous.
References
[1] Stanford network analysis platform. http://snap.stanford.edu/, 2004.
[2] D. Blei and J. McAuliffe. Supervised topic models. Advances in Neural Information Processing Systems, 20:121–128, 2008.
[3] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2nd edition, 2004.
[4] T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(suppl. 1):5228–5235, 2004.
[5] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: a supervised topic model for credit attribution in
multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages
248–256, 2009.
[6] Hanna Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why priors matter. In Y. Bengio, D. Schuurmans,
J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1973–
1981. 2009.
Image Super-Resolution via Dictionary Learning
Gungor Polatkan, David Blei
Princeton University
Mingyuan Zhou, Ingrid Daubechies, Lawrence Carin
Duke University
Super-resolution (SR) is a popular research problem in image analysis. The main task is to recover a
high-resolution (HR) image from a low-resolution (LR) input. The need to enhance the resolution of
smartphone and surveillance cameras, and the need for HR images in medical and satellite
visualization, motivate better SR algorithms.
The SR problem is generally ill-posed, since the LR-to-HR transformation is not unique. Several
regularization methods have been proposed to overcome this issue. In this abstract, we consider
SR algorithms in two groups: interpolation-based and example-based approaches. Interpolation-based
methods, such as bicubic and bilinear interpolation, generally oversmooth the image, and details are lost.
Example-based approaches use machine learning to overcome this problem. These algorithms use
ground-truth HR and LR image pairs during training and use the learned statistical relation in the testing phase
to reconstruct an HR image from a held-out low-resolution input.
Multiscale Extension for Landmark-Dependent Dictionary Learning. A landmark-dependent hierarchical beta process was recently developed for dictionary learning problems. Incorporating
covariates into dictionary learning provides a tool in which feature usage is likely to be similar for
samples close to each other in covariate space. By extending this framework to multi-stage dictionary learning, we present a new algorithm for the image super-resolution problem.
The multiscale extension for super-resolution is based on one fundamental assumption: given the HR
and LR patches, the goal is to learn dictionaries for both resolutions such that the sparse coefficients are
the same for both resolutions. The original dictionary learning model represents each observation as
$x_i = D(s_i \odot z_i) + \epsilon_i$, where $x_i$ represents the observation, $D$ the dictionary, $z_i$ the binary factor
assignment, $s_i$ the sparse factor score, and $\epsilon_i$ the error. The prior of $z_i$ incorporates
the covariates.
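We write the multistage version of the model in a form similar to the original model:
$$x_i^{(c)} = \begin{bmatrix} x_i^{(l)} \\ x_i^{(h)} \end{bmatrix}, \qquad d_k^{(c)} = \begin{bmatrix} d_k^{(l)} \\ d_k^{(h)} \end{bmatrix}, \qquad \epsilon_i^{(c)} = \begin{bmatrix} \epsilon_i^{(l)} \\ \epsilon_i^{(h)} \end{bmatrix},$$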
where the superscript $(c)$ denotes the concatenation of both resolutions. In this setup, the dictionaries are trained in the same way as in the original model. The only difference is that we couple the patches $x_i^{(l)}, x_i^{(h)}$ and couple the dictionaries $d_k^{(l)}, d_k^{(h)}$. The covariates can be the patch positions within the image or the relative position in the observation space. In the latter case, if we define $\ell_i = x_i / \|x_i\|_2$, then $\|\ell_i - \ell_j\|_2 = \sqrt{2 - 2\,\mathrm{cosdist}(x_i, x_j)}$, since $\mathrm{cosdist}(x_i, x_j) = \frac{x_i^T x_j}{\|x_i\|_2 \|x_j\|_2}$. Since the training set includes patches
from multiple image pairs, the locations of the patches cannot be used as covariates (i.e., they come
from different images). Instead, we use the normalized observations. Moreover, to speed up
inference, we first reduce the dimensionality to 3 with PCA and use the dimension-reduced observations as
covariates.
One important problem is that the LR and HR patches differ in size, which biases the fit of the
shared factor scores in favor of the HR patches. To prevent this, we first
apply bicubic interpolation to magnify the LR image to the same size as the HR image.
In this way we also obtain a one-to-one match between LR and HR patches.
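The relation between normalized patches and the cosine term used for the covariates can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated stand-in patches; the helper names `normalize_rows` and `cosdist` are illustrative and not taken from the authors' code.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row of X to unit Euclidean norm (l_i = x_i / ||x_i||_2)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms

def cosdist(x, y):
    """Cosine term x^T y / (||x|| ||y||), named as in the text."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)
patches = rng.normal(size=(5, 64))   # hypothetical flattened 8 x 8 patches
L = normalize_rows(patches)

# Euclidean distance between normalized patches vs. the closed form
lhs = np.linalg.norm(L[0] - L[1])
rhs = np.sqrt(2.0 - 2.0 * cosdist(patches[0], patches[1]))
identity_holds = bool(np.isclose(lhs, rhs))
```

Because the two quantities agree, Euclidean closeness of the normalized observations corresponds to a large cosine term between the raw patches, which is what makes the normalized observation usable as a covariate.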
Super-Resolution of an LR Test Image. During the testing phase, our goal is first to find the
sparse factor scores using the LR image patches $x_i^{(l)}$ and the LR dictionary $d_k^{(l)}$ learned during training.
Then, we reconstruct the corresponding HR patches using the HR dictionary $d_k^{(h)}$ and the learned factor
scores $(s_i \odot z_i)$. One important difference from regular inference in
landmark-dependent dictionary learning is that, while learning the sparse factor scores, we do not sample the dictionaries $d_k^{(l)}, d_k^{(h)}$ or the
precision $\gamma_s$ of the factor score $s_i$. The former follows directly from the training-testing framework. The
latter, however, is also crucial for preventing over-fitting. If we also sample $\gamma_s$ during testing, the sparse
factors $s_i$ over-fit the LR patches $x_i^{(l)}$ and the reconstruction quality degrades. In other words, the main
assumption of sparse coefficient sharing is regularized by learning $\gamma_s$ in training and using that
estimate in testing. In this way we control the sparsity of the factor scores in testing.
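[Displayed equation garbled in extraction; the recoverable symbols are $x_i^{(c)}$, $x_i^{(l)}$, $d_k^{(c)}$, $d_k^{(l)}$, $d_k^{(h)}$, $\epsilon_i^{(c)}$, $\epsilon_i^{(l)}$.]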
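The coupling of LR and HR dictionaries through shared factor scores can be illustrated with a small numerical sketch. It assumes NumPy, random stand-in dictionaries, and factor scores that are simply given; the Bayesian inference of $s_i$ and $z_i$ described in the text is not shown, and all sizes and names here are hypothetical.

```python
import numpy as np

def reconstruct_hr(D_h, s, z):
    """HR patch estimate D^(h) (s * z), using scores shared with the LR patch."""
    return D_h @ (s * z)

rng = np.random.default_rng(0)
n_lr, n_hr, n_atoms = 64, 64, 16     # hypothetical sizes; LR patches pre-magnified
D_l = rng.normal(size=(n_lr, n_atoms))
D_h = rng.normal(size=(n_hr, n_atoms))
D_c = np.vstack([D_l, D_h])          # coupled dictionary: LR block on top of HR block

s = rng.normal(size=n_atoms)                      # factor scores
z = (rng.random(n_atoms) < 0.25).astype(float)    # binary factor assignments

x_c = D_c @ (s * z)                  # noiseless concatenated patch
x_l = D_l @ (s * z)                  # its LR part
x_h = reconstruct_hr(D_h, s, z)      # its HR part
```

The concatenated patch splits exactly into the LR and HR parts, so scores fitted on the LR block alone can be reused with the HR dictionary.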
Experiments. In the experiments we use two data sets: (1) the Berkeley natural image data set for testing, and
(2) a set of images collected from the web for training. The diversity of both data sets provides us with
a rich set of HR-LR patch pairs. We use a super-resolution ratio of 3 and a patch size of 8 × 8. We apply all the
algorithms only to the luminance channel. We use bicubic interpolation in the color layers (Cb, Cr)
for all methods.
As baselines, we use both interpolation-based and example-based super-resolution algorithms. Bicubic interpolation is the gold standard in the super-resolution literature. We also use nearest neighbor and
bilinear interpolation as interpolation-based methods. As an example-based method, we use
super-resolution via sparse representation (ScSR). The dictionary learning stage of both our method and
ScSR uses the same training set (100K patches sampled from the data set).
[Figure 1 panels: (a) High, (b) Low, (c) Bicubic, (d) NNI, (e) Bilinear, (f) ScSR, (g) BPFA, (h) OurLoc, (i) OurVal, (j) HR Dictionary, (k) LR Dictionary]
Figure 1: OurLoc: Spatial location used as covariate, OurVal: Dimension reduced observation used as
covariate, BPFA: Beta Process Factor Analysis, NNI: Nearest neighbor interpolation.
Structured Sparsity via Alternating Directions Methods
Zhiwei (Tony) Qin
Donald Goldfarb
1 Introduction
We consider a class of sparse learning problems in high dimensional feature space regularized by a structured
sparsity-inducing norm which incorporates prior knowledge of the group structure of the features. Such problems
often pose a considerable challenge to optimization algorithms due to the non-smoothness and non-separability of
the regularization term. In this paper, we focus on two commonly adopted sparsity-inducing regularization terms,
the overlapping Group Lasso penalty l1 /l2 -norm and the l1 /l∞ -norm.
$$\min_{x \in \mathbb{R}^m} F(x) \equiv l(x) + \Omega(x), \qquad (1)$$
where $l(x) = \frac{1}{2}\|Ax - b\|^2$, $A \in \mathbb{R}^{n \times m}$, and $\Omega(x)$ is either $\Omega_{l_1/l_2}(x) \equiv \lambda \sum_{g \in G} w_g \|x_g\|$ or $\Omega_{l_1/l_\infty}(x) \equiv \lambda \sum_{g \in G} w_g \|x_g\|_\infty$. $G = \{g_1, \cdots, g_{|G|}\}$ is the set of
group indices with $|G| = J$, and the elements (features) in the groups possibly overlap [2, 4]. In this model, $\lambda$, $w_g$, and $G$
are all pre-defined. $\|\cdot\|$ without a subscript denotes the $l_2$-norm. Whenever the $l_1/l_2$- and $l_1/l_\infty$-regularization
terms are mentioned, we assume that the groups overlap.
2 Algorithms
We reformulate problem (1) as the constrained problem
$$\min_{x, y} \ \frac{1}{2}\|Ax - b\|^2 + \tilde{\Omega}(y) \quad \text{s.t.} \quad Cx = y, \qquad (2)$$
where $\tilde{\Omega}(y)$ is the non-overlapping group-structured penalty term corresponding to $\Omega(y)$ defined above. The
augmented Lagrangian of (2) is $L(x, y, v) = \frac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \frac{1}{2\mu}\|Cx - y\|^2 + \tilde{\Omega}(y)$. We build a unified
framework based on the augmented Lagrangian method, under which problems with both types of regularization
and their variants can be efficiently solved. The core building-block of this framework is to compute an approximate
minimizer $(x, y)$ of $L(x, y, v)$ given $v$. For this specific task, we propose FISTA-p (Algorithm 2.1), which is a partial
linearization variant of FISTA [1], and we prove that this algorithm requires $O(1/\sqrt{\epsilon})$ iterations to obtain an $\epsilon$-optimal
solution.
Algorithm 2.1 FISTA-p (partial linearization)
1: Given $x^0$, $y^0$, $v$. Choose $\rho$, and $z^1 = y^0$.
2: for $k = 1, 2, \cdots, K$ do
3:   $x^k \leftarrow \arg\min_x \frac{1}{2}\|Ax - b\|^2 - v^T(Cx - z^k) + \frac{1}{2\mu}\|Cx - z^k\|^2$
4:   $y^k \leftarrow \arg\min_y f(x^k, z^k) + \nabla_y f(x^k, z^k)^T (y - z^k) + \frac{1}{2\rho}\|y - z^k\|^2 + g(y)$
5:   $t_{k+1} \leftarrow \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$
6:   $z^{k+1} \leftarrow y^k + \frac{t_k - 1}{t_{k+1}} (y^k - y^{k-1})$
7: end for
8: return $(x^{K+1}, y^{K+1})$
In addition, we propose a partial-split version of ALM-S [3] as well as direct application of ALM-S and FISTA
to solve the core subproblem.
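The inner iteration of Algorithm 2.1 can be sketched in NumPy for the $l_1/l_2$ case. This is an illustrative sketch under several assumptions that the abstract does not state: $f$ is taken to be the smooth part of the augmented Lagrangian, $g(y) = \lambda \sum_g \|y_g\|$ with all $w_g = 1$, $t_1 = 1$ as in standard FISTA, $y^0 = 0$, and $\rho = \mu$ as the step size. The function names and the toy data are hypothetical.

```python
import numpy as np

def block_soft_threshold(u, groups, thresh):
    """Prox of thresh * sum_g ||y_g||_2 for non-overlapping index groups."""
    y = np.zeros_like(u)
    for g in groups:
        norm_g = np.linalg.norm(u[g])
        if norm_g > thresh:
            y[g] = (1.0 - thresh / norm_g) * u[g]
    return y

def fista_p(A, b, C, groups, v, lam, mu, rho, K=200):
    """Sketch of the FISTA-p loop for g(y) = lam * sum_g ||y_g||_2."""
    y_prev = np.zeros(C.shape[0])      # y^0 (assumed zero start)
    z = y_prev.copy()                  # z^1 = y^0
    t = 1.0                            # t_1 = 1 (assumption, standard FISTA)
    H = A.T @ A + (C.T @ C) / mu       # Hessian of the x-subproblem
    for _ in range(K):
        # x-update: exact minimization, a linear system
        x = np.linalg.solve(H, A.T @ b + C.T @ v + C.T @ z / mu)
        # y-update: linearize f in y around z, then apply the prox of g
        grad_y = v + (z - C @ x) / mu
        y = block_soft_threshold(z - rho * grad_y, groups, rho * lam)
        # momentum update
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = y + ((t - 1.0) / t_next) * (y - y_prev)
        y_prev, t = y, t_next
    return x, y

# Hypothetical toy problem: 4 features, overlapping groups {0,1} and {1,2,3}
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 4))
b = rng.normal(size=20)
C = np.zeros((5, 4))
for row, col in enumerate([0, 1, 1, 2, 3]):   # y stacks per-group copies of x
    C[row, col] = 1.0
groups = [np.array([0, 1]), np.array([2, 3, 4])]
x_hat, y_hat = fista_p(A, b, C, groups, v=np.zeros(5), lam=0.5, mu=1.0, rho=1.0)
```

Duplicating the shared feature through $C$ is what turns the overlapping penalty on $x$ into a non-overlapping one on $y$, so the y-update reduces to a closed-form block soft-thresholding step.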
Figure 1: Left two: Scalability test results of the algorithms on the synthetic overlapping Group Lasso data sets
from [2]. The y-axis is in logarithmic scale. ALM-S is not included in the left plot because we did not run it on
the last two data sets due to the computational burden (expected CPU time exceeding $10^4$ seconds). Right two:
Scalability test results on DCT set with l1 /l2 -regularization (left column) and l1 /l∞ -regularization (right column).
The y-axis is in logarithmic scale.
Figure 2: Separation results for the video sequence background subtraction example [4]. Each training image had
120 × 160 RGB pixels. The training data contained 200 images in sequence. The accuracy indicated for each of
the different models is the percentage of pixels that matched the ground truth. The CPU time reported is for an
average run on the regularization path.
3 Experimental Results
We compare the algorithms on a collection of data sets and apply them to a video sequence background subtraction
task. The results are presented in Figures 1 and 2.
References
[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal
on Imaging Sciences, 2(1):183–202, 2009.
[2] X. Chen, Q. Lin, S. Kim, J. Peña, J. Carbonell, and E. Xing. An Efficient Proximal-Gradient Method for Single and
Multi-task Regression with Structured Sparsity. Arxiv preprint arXiv:1005.4717, 2010.
[3] D. Goldfarb and S. Ma. Fast alternating linearization methods for minimizing the sum of two convex functions. Arxiv
preprint arXiv:0912.4571, 2009.
[4] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In J. Lafferty, C. K. I.
Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23,
pages 1558–1566. 2010.
Ranking annotators for crowdsourced labeling tasks
Vikas C. Raykar
Shipeng Yu
Siemens Healthcare, Malvern, PA 19355 USA
vikas.raykar@siemens.com
shpieng.yu@siemens.com
Annotating data is one of the major bottlenecks in using supervised learning to build good
predictive models. Getting a dataset labeled by experts can be expensive and time consuming.
With the advent of crowdsourcing services (Amazon's Mechanical Turk (AMT) being a prime example)
it has become quite easy and inexpensive to acquire labels from a large number of annotators in a
short amount of time. However, one drawback of most crowdsourcing services is that we do not have
control over the quality of the annotators. The annotators can come from a diverse pool including genuine
experts, novices, biased annotators, malicious annotators, and spammers. Hence, in order to
get good quality labels, requestors typically get each instance labeled by multiple annotators,
and these multiple annotations are then consolidated either using a simple majority voting
or more sophisticated methods that model and correct for the annotator biases (Raykar et al.,
2010). While majority voting assumes all annotators are equally good, the more sophisticated
methods model the annotator performance and then appropriately give different weights to the
annotators to reach the consensus. Here we are interested in ranking annotators based on how spammer-like
each annotator is. In our context a spammer is an annotator who assigns random labels (maybe
because the annotator does not understand the labeling criteria, or does not look at the instances
when labeling). Spammers can significantly increase the cost of acquiring annotations and at the same
time decrease the accuracy of the final consensus labels. A mechanism to detect/eliminate spammers
is a desirable feature for any crowdsourcing marketplace. One can give monetary bonuses to good
annotators and deny payments to spammers. We formalize the notion of a spammer and specifically
we define a scalar metric which can be used to rank the annotators, with the spammers having a score
close to 0 and the good annotators having a score close to 1.
[Figure 1 plot: "sentiment | 1660 instances | 33 annotators"; horizontal axis: Annotator; vertical axis: Spammer Score, from 0 to 1.]
Figure 1. The ranking of annotators obtained using the proposed spammer score for the Irish economic sentiment analysis
data (Brew et al., 2010) annotated by 33 annotators. The spammer score ranges from 0 to 1; the lower the score, the more
spammy the annotator. The mean spammer score and the 95% confidence intervals (CI), obtained from 100
bootstrap replications, are shown. The annotators are ranked based on the lower limit of the 95% CI. The number at the top of the
95% CI bar shows the number of instances annotated by that annotator. Note that the CIs are wider when the annotator
labels only a few instances.
Annotator model for categorical labels. Suppose there are $K \geq 2$ classes. Let $y_i^j \in \{1, \ldots, K\}$
be the label assigned to the $i$th instance by the $j$th annotator, and let $y_i \in \{1, \ldots, K\}$ be the actual
(unobserved) label. We model each annotator by the multinomial parameters $\alpha_c^j = (\alpha_{c1}^j, \ldots, \alpha_{cK}^j)$,
where $\alpha_{ck}^j := \Pr[y_i^j = k \mid y_i = c]$ and $\sum_{k=1}^{K} \alpha_{ck}^j = 1$. The term $\alpha_{ck}^j$ denotes the probability that
annotator $j$ assigns class $k$ to an instance given the true class is $c$. We do not dwell on the
estimation of the annotator model parameters. These parameters can either be estimated directly using
a known gold standard or via iterative EM algorithms that estimate the annotator model parameters
without actually knowing the gold standard (Raykar et al., 2010).
Who is a spammer? Intuitively, a spammer assigns labels randomly, maybe because the annotator
does not understand the labeling criteria, does not look at the instances when labeling, or is a bot
pretending to be a human annotator. More precisely, an annotator is a spammer if the probability of the
observed label $y_i^j$ being $k$ given the true label $y_i$ is independent of the true label, i.e., $\Pr[y_i^j = k \mid y_i] = \Pr[y_i^j = k], \forall k$.
This is equivalent to $\Pr[y_i^j = k \mid y_i = c] = \Pr[y_i^j = k \mid y_i = c'], \forall c, c', k = 1, \ldots, K$, which
means that knowing whether the true class label is $c$ or $c'$ does not change the probability of the annotator's
label. This indicates that annotator $j$ is a spammer if $\alpha_{ck}^j = \alpha_{c'k}^j, \forall c, c', k = 1, \ldots, K$.
Spammer metric. Let $A^j$ be the $K \times K$ confusion rate matrix with entries $[A^j]_{ck} = \alpha_{ck}^j$; a spammer
would have all the rows of $A^j$ equal. Essentially, for a spammer $A^j$ is a rank one matrix of the form $A^j = e v_j^\top$,
for some vector $v_j \in \mathbb{R}^K$ that satisfies $v_j^\top e = 1$, where $e$ is a vector of ones. One natural way to
summarize this would be in terms of the distance (Frobenius norm) of the confusion matrix to the
closest rank one approximation, i.e., $S^j := \|A^j - e \hat{v}_j^\top\|_F^2$, where $\hat{v}_j = \arg\min_{v_j} \|A^j - e v_j^\top\|_F^2$ such that
$v_j^\top e = 1$. This yields $\hat{v}_j = (1/K) {A^j}^\top e$, which is the mean of the rows of the matrix $A^j$. Hence we
have $S^j = \left\| \left( I - \frac{1}{K} e e^\top \right) A^j \right\|_F^2 = \frac{1}{K} \sum_{c < c'} \sum_k (\alpha_{ck}^j - \alpha_{c'k}^j)^2$. So a spammer is an annotator for whom
$S^j$ is close to zero. A perfect annotator has $S^j = K - 1$. We can normalize this to lie between 0 and 1:
$$S^j = \frac{1}{K(K-1)} \sum_{c < c'} \sum_k (\alpha_{ck}^j - \alpha_{c'k}^j)^2$$
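The normalized score can be computed directly from a confusion rate matrix. The sketch below, assuming NumPy and hypothetical function names, evaluates it in both forms given in the text (centered Frobenius norm and pairwise row differences) so that one can be checked against the other.

```python
import numpy as np

def spammer_score(A):
    """Normalized spammer score of a K x K row-stochastic confusion matrix A."""
    A = np.asarray(A, dtype=float)
    K = A.shape[0]
    # Subtract the mean row from every row: (I - e e^T / K) A
    centered = A - A.mean(axis=0, keepdims=True)
    unnormalized = np.sum(centered ** 2)   # squared Frobenius norm
    return unnormalized / (K - 1)          # a perfect annotator scores 1

def spammer_score_pairwise(A):
    """Same score computed from pairwise row differences, as a cross-check."""
    A = np.asarray(A, dtype=float)
    K = A.shape[0]
    total = 0.0
    for c in range(K):
        for c2 in range(c + 1, K):
            total += np.sum((A[c] - A[c2]) ** 2)
    return total / (K * (K - 1))

perfect = np.eye(3)                                  # always reports the true class
spammer = np.tile(np.array([0.5, 0.3, 0.2]), (3, 1)) # same label distribution for every class
score_perfect = spammer_score(perfect)
score_spammer = spammer_score(spammer)
```

The score depends only on how much the rows differ from each other, not on the diagonal entries, which is why it differs from the accuracy read off the confusion matrix.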
While this spammer score was implicit for binary labels in earlier works, the extension to categorical
labels is novel and is quite different from the accuracy computed from the confusion rate matrix. A
recent attempt to quantify the quality of the workers was made by Ipeirotis et al. (2010), who
transformed the observed labels into posterior soft labels.
We use the proposed metric to rank annotators for the data collected for Irish economic sentiment
analysis (Brew et al., 2010). A total of 1660 articles from three Irish online news sources were annotated
by 33 volunteer users as positive, negative, or irrelevant. Each article was annotated by 6 annotators.
Figure 1 plots the ranking of annotators obtained using the spammer score with the annotator model
parameters estimated via the EM algorithm. The mean and the 95% confidence intervals obtained via
bootstrapping are also shown. The number at the top of the CI bar shows the number of instances
annotated by that annotator. Note that the CIs are wider when the annotator labels only a few
instances. Some annotators who label only a few instances may have a high mean spammer score but
the CI will be wide. We need to factor the number of instances labeled by the annotator into the
ranking. Ideally we would like annotators who have a high score and also label a lot
of instances. Hence we rank them based on the lower limit of the CI.
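Ranking by the lower CI limit can be sketched as follows. This assumes NumPy, that bootstrap replicates of each annotator's spammer score are already available, and a percentile interval; the text does not say which bootstrap interval was used, and the annotator ids and scores below are made up for illustration.

```python
import numpy as np

def rank_by_lower_ci(boot_scores, level=0.95):
    """Rank annotators by the lower limit of a percentile bootstrap CI.

    boot_scores: dict mapping annotator id -> 1-D array of bootstrap scores.
    Returns (ids sorted from best to worst, dict of lower limits).
    """
    alpha = (1.0 - level) / 2.0
    lower = {a: float(np.quantile(s, alpha)) for a, s in boot_scores.items()}
    ranking = sorted(lower, key=lower.get, reverse=True)
    return ranking, lower

# Hypothetical replicates: "a" has the higher mean but a wide spread,
# "b" has a slightly lower mean but a tight spread.
boot = {
    "a": np.linspace(0.5, 1.0, 101),
    "b": np.full(101, 0.7),
}
ranking, lower = rank_by_lower_ci(boot)
```

An annotator with few labeled instances tends to have widely spread replicates, so the lower limit penalizes them even when the mean score is high.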
References
Brew, A., Greene, D., & Cunningham, P. (2010). Using crowdsourcing and active learning to track sentiment in online
media. Proceedings of the 6th Conference on Prestigious Applications of Intelligent Systems(PAIS’10).
Ipeirotis, P. G., Provost, F., & Wang, J. (2010). Quality management on amazon mechanical turk. Proceedings of the
ACM SIGKDD Workshop on Human Computation (pp. 64–67).
Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., & Moy, L. (2010). Learning from crowds.
Journal of Machine Learning Research, 11, 1297–1322.
Extracting Latent Economic Signal from Online
Activity Streams
Joseph Reisinger
Metamarkets
670 Broadway, New York, NY
joe@metamarketsgroup.com
Abstract
Online activity generates a myriad of weak economic signals, both as a result of
direct interaction, e.g. purchases, location-based checkins, etc, and indirect interaction, e.g. realtime display ad auctions, audience targeting and profile generation.
In this work we develop an approach to constructing latent economic indicators
from large collections of such realtime activity streams. In particular we investigate the relative impact of realtime topical discourse and ad pricing on a basket
of financial indicators. In order to effectively overcome noise inherent in the extracted weak signals, we develop a novel Bayesian vector autoregression (VAR)
formulation with sparse underlying factors. Our preliminary results indicate that
sparse subsets of online activity do indeed correlate strongly with economic activity, and can be used to improve the robustness of economic forecasts.
1 Background
Online textual activity streams have been used to predict a wealth of economic and political indicators. For example, O’Connor et al. use Twitter sentiment to predict public opinion around elections
[12] and Joshi et al. use movie reviews to predict revenue performance [11]. Online sentiment has
also been shown to potentially inform equity markets, as financial decisions are often based on mood
[7]. Google uses flu-related search queries to predict outbreaks, often faster than the CDC [8], and
also maintains a domestic trends index,1 tracking economically relevant search traffic across a diverse set of sectors [5]. Finally the Billion Prices Project seeks to define a bottom-up measure of
CPI based on extracting and classifying prices from thousands of online retailers [4].
Our work attempts to bridge the gap between large-scale extraction of such “nonstandard economic
indicators” and traditional econometric analysis, introducing a novel structural vector autoregression
(VAR) model based on group sparsity. Our model assumes low-rank autoregressive factor structure
for time-series within the same group (e.g. Twitter popularity or ad prices) linked by sparse between-group parameter vectors. This formulation is efficient and allows us to robustly scale our analyses
to upwards of 10k base time-series.
We collect eight months of daily time-series data (2011-01-01 to 2011-09-01) from two sources:
(Ad Exchanges) Ad impressions can be sold in realtime auctions on ad exchanges such as Google’s
AdX platform and Yahoo’s RightMedia exchange. These exchanges match advertisers (buyers)
with publishers (supply providers) at the granularity of individual impressions, and take a small
flat fee from each successful transaction. We obtained rich pricing data from three such exchanges
consisting of 10B impressions across a large number of publishers.
1 http://www.google.com/finance/domestic_trends
(Twitter) We extract daily time-series of news content and public mood from Twitter, a popular microblogging service.2 Raw unigram counts are smoothed using Latent Dirichlet Allocation
(LDA) with $K = 100$ topics, yielding coarse-grained topical content [3].
Using this data, we uncover components of both Twitter topics and ad exchange pricing dynamics
that are predictive of sectoral shifts in broader equities markets.
2 Sparse Pooling Vector Autoregression
We propose an approach to structural modeling based on the Bayesian vector autoregressive (VAR)
process [9] that is capable of scaling to a large number of noisy time-series. In particular we impose
restrictions on the covariance structure so that it is low-rank and sparse, limiting the number of
parameters to estimate.
Our sparse pooling VAR model (spVAR) consists of a set of $K$ endogenous variables partitioned into
$G$ groups: $y_t^g = (y_{1t}^g, \ldots, y_{kt}^g, \ldots, y_{Kt}^g)$, $k \in [1, K]$, $g \in [1, G]$. The spVAR($p$) process is defined
as
$$\begin{bmatrix} y_t^1 \\ \vdots \\ y_t^G \end{bmatrix} = Z_1 \begin{bmatrix} A_1^1 y_{t-1}^1 \\ \vdots \\ A_1^G y_{t-1}^G \end{bmatrix} + \ldots + Z_p \begin{bmatrix} A_p^1 y_{t-p}^1 \\ \vdots \\ A_p^G y_{t-p}^G \end{bmatrix} + u_t$$
where $A_i^g$ is an $m \times K$ coefficient matrix, $Z_i$ is a sparse $K \times (G \cdot m)$ matrix of random factor loadings,
and $u_t$ is a $K$-dimensional zero-mean, covariance-stationary innovation process, i.e. $E[u_t] = 0$ and
$E[u_t u_t^\top] = \Sigma_u$. That is, each covariate vector $y_{t-i}^g$ for each signal group $g$ is projected onto a
separate linear subspace of rank $m$, and the projections are combined via the shared sparse factor loading matrix
$Z_i$.3 The constraints introduced by this procedure are similar to those found in structural VAR [9].
Note that unlike factor-augmented VAR models [2], our model does not assume time-varying factor
structure, making it less suitable for modeling, e.g., the effects of exogenous shocks.
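One step of the spVAR recursion can be written out as a short NumPy sketch. To make the dimensions line up, it assumes that group $g$ holds $K_g$ variables with $\sum_g K_g = K$, so that each $A_i^g$ is $m \times K_g$; this reading, the function name, and the random toy parameters are assumptions for illustration and are not taken from the abstract.

```python
import numpy as np

def spvar_step(lags, A, Z, u):
    """One step of an spVAR(p) recursion.

    lags: list of p lists; lags[i][g] is y^g at time t-(i+1), shape (K_g,)
    A:    A[i][g] has shape (m, K_g), the per-group projection at lag i+1
    Z:    Z[i] has shape (K, G*m), the shared factor loadings at lag i+1
    u:    innovation vector, shape (K,)
    """
    y = u.copy()
    for lag_groups, A_i, Z_i in zip(lags, A, Z):
        # Project each group onto its rank-m subspace, then stack the factors
        factors = np.concatenate([A_ig @ y_g for A_ig, y_g in zip(A_i, lag_groups)])
        y = y + Z_i @ factors
    return y

# Hypothetical toy system: G = 2 groups of sizes 3 and 2 (K = 5), m = 2, p = 1
rng = np.random.default_rng(0)
sizes, m = [3, 2], 2
K, G = sum(sizes), len(sizes)
A = [[rng.normal(size=(m, k_g)) for k_g in sizes]]
mask = rng.random((K, G * m)) < 0.3            # sparsity pattern of Z_1
Z = [rng.normal(size=(K, G * m)) * mask]
y_prev = [rng.normal(size=k_g) for k_g in sizes]
y_next = spvar_step([y_prev], A, Z, u=np.zeros(K))
```

Each lag contributes only $G \cdot m$ pooled factors, so the number of free parameters grows with $m$ and the sparsity of $Z_i$ rather than with $K^2$.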
3 Structural Analysis and Future Work
Figure 1 shows an example structural decomposition of the ad exchange-Twitter-Equities three-group system. Lagged Twitter factors 5 and 7 are found to have a statistically significant (although
small) impact on a broad range of equities, while ad exchange factor 2 has a narrower impact across
commodities and utilities sectors and ad exchange factor 4 is significant for Yahoo stock returns.
(Exogenous Shocks) Both data sources exhibit large exogenous shocks; in the exchange case due to
new bidders entering the market or the availability of new targeting information, and in the Twitter
case due to spikes in topical content (e.g. “hurricapocalypse”). Switching VAR processes would be
better able to account for these regime changes [6].
(Cointegration) Residual correlation indicates that there may be significant cointegration between
groups. Extending spVAR to model $I(1)$ series can be achieved by replacing the underlying VAR
model with a VECM [9].
(Price Discovery) Microstructure models of such price discovery can be developed leveraging spVAR as a base model in order to understand how multiple ad exchanges interact to form market
pricing [10].
(Forecasting) VAR models are commonly used for macroeconomic forecasting. We can use the
spVAR framework to identify subsets of underlying activity streams that are most predictive of
economic variables of interest.
2 http://twitter.com
3 $Z_i$ and $A_i^g$ can be estimated efficiently as multiple joint sparse-PCA problems [13, 1].
References
[1] F. Bach, J. Mairal, and J. Ponce. Convex sparse matrix factorizations. CoRR, abs/0812.1869,
2008.
[Figure 1 plot: estimated coefficients (legend "Estimate", scale from −2 to 2) with significance markers (*, **, ***). Axis labels list the series AAPL, GSP, SPY, VXX, XLE, XLF, XLI, XLK, XLU, XLY, YHOO, ad_ex_1 to ad_ex_10, and twitter_1 to twitter_10, together with their first lags (suffix .l1), const, and trend.]
Figure 1: spVAR model parameters fit to daily returns data drawn from latent Twitter topical factors
(twitter), latent ad exchange factors (ad ex) and several equity sectors. All data has been differenced
and smoothed using SMA(3).
[2] B. Bernanke, J. Boivin, and P. S. Eliasz. Measuring the effects of monetary policy: A factor-augmented vector autoregressive (FAVAR) approach. The Quarterly Journal of Economics,
120(1):387–422, 2005.
[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning
Research, 3:993–1022, 2003.
[4] A. Cavallo and R. Rigobon. Billion prices project. http://bpp.mit.edu/.
[5] H. Choi and H. Varian. Predicting the present with Google trends. Technical report, Google,
2009.
[6] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. Sharing features among dynamical
systems with beta processes. In Neural Information Processing Systems 22. MIT Press, 2010.
[7] E. Gilbert and K. Karahalios. Widespread worry and the stock market. In ICWSM. The AAAI
Press, 2010.
[8] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant.
Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012–1014,
Nov. 2008.
[9] J. D. Hamilton. Time Series Analysis. Princeton University Press, 1994.
[10] J. Hasbrouck. Empirical Market Microstructure. Oxford University Press, 2006.
[11] M. Joshi, D. Das, K. Gimpel, and N. A. Smith. Movie reviews and revenues: an experiment in
text regression. In Proc. of NAACL-HLT 2010, pages 293–296. Association for Computational
Linguistics, 2010.
[12] B. O’Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith. From tweets to polls:
Linking text sentiment to public opinion time series. In Proceedings of the International AAAI
Conference on Weblogs and Social Media, 2010.
[13] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics,
10(3):515–534, 2009.
Intrinsic Gradient Networks
Jason Rolfe, Matthew Cook, and Yann LeCun
September 9, 2011
Artificial neural networks are computationally powerful and exhibit brain-like dynamics. However, before an artificial neural network can be used to perform a computational task, its parameters must be trained
to minimize an error function which measures the desirability of the network’s outputs for each set of inputs.
It is generally believed that the backpropagation algorithm, conventionally used to train neural networks, is
not biologically plausible (Crick, 1989). As suggested by its name, the backpropagation algorithm requires
feedback messages to be propagated backwards through the network. These feedback messages, used to
calculate the gradient of the error function, are directly dependent on the feedforward messages, and so
must evolve on the same time scale as the network’s inputs. Nevertheless, the feedback messages must not
directly affect the original feedforward messages of the network; any influence of the feedback messages
on the feedforward messages will disrupt the calculation of the gradient. Direct biological implementations
of the backpropagation algorithm are not consistent with experimental evidence from the cortex, since the
cortex neither shows signs of a sufficiently fast reverse signaling mechanism (Harris, 2008), nor segregates
feedforward and feedback signals (Douglas & Martin, 2004). We address this biological implausibility by
constructing a novel class of recurrent neural networks, intrinsic gradient networks, for which the gradient
of an error function with respect to the parameters is a simple function of the network state. Intrinsic gradient networks do not generally have segregated feedforward computation and feedback training signals, and
so are potentially consistent with the recurrence observed in the cortex.
The highly recurrent connection topology (Felleman & Van Essen, 1991; Douglas & Martin, 2004) and
locality of training signals (Malenka & Bear, 2004) observed in the brain imply that the cortex is able to
learn efficiently using only local signals within a recurrently interconnected network. That is, the cortex
appears to use a single interdependent set of messages for both computation and learning. Since recurrence
and local training alone do not uniquely specify a neural network architecture, we make the biologically and
computationally motivated assumption that learning requires the approximate calculation of the gradient of
an error function. We restrict our attention to networks in which the gradient can be calculated completely
at a single network state, rather than stochastic networks in which learning is a function of some statistic
of the network activity over time, since it can take a long time to accurately estimate the statistics of large
networks. In light of these experimental observations and computational assumptions, we wish to construct
highly recurrent neural networks for which the gradient of an error function defined in terms of the network’s
activity can be calculated via a simple, local function of the intrinsic network activity.
To reach this goal, we first generalize slightly and characterize the class of (not necessarily highly recurrent) neural networks for which the gradient of an error function defined in terms of the network’s activity
is a simple (but not necessarily local) function of the intrinsic network activity; we call this novel class of
networks intrinsic gradient networks. In contrast to traditional artificial neural networks, intrinsic gradient
networks do not require a separate, implicit set of backpropagation dynamics to train their parameters. Intrinsic gradient networks consist of a time-varying vector of real-valued units $\vec{x}(t) \in \mathbb{R}^n$. We write $\vec{x}$ to denote
$\vec{x}(t)$ at some arbitrary point in time $t$. The point at which the network has finished computing and produced
an output is defined by the output functions $\vec{F}(\vec{x},\vec{w})$ according to the fixed-point equation $\vec{F}(\vec{x},\vec{w}) = \vec{x}$, where
$\vec{w}$ is a vector of parameters. At these output states $\vec{x}$ where $\vec{F}(\vec{x},\vec{w}) = \vec{x}$, the desirability of the outputs is
defined by the error function $E(\vec{x})$. We find that intrinsic gradient networks are characterized by the equation
$$\vec{T}(\vec{x}) = \vec{S}(\vec{x}, \vec{F}(\vec{x},\vec{w})) + \nabla E(\vec{x}) + \nabla \vec{F}^{\top}(\vec{x}) \cdot \vec{T}(\vec{x}), \qquad (1)$$
where $\vec{S}(\vec{x}, \vec{F}(\vec{x},\vec{w})) = 0$ when $\vec{x} = \vec{F}(\vec{x},\vec{w})$. When equation 1 is satisfied,
$$\frac{dE(\vec{x}^*(\vec{w}))}{dw_0} = \vec{T}^{\top}(\vec{x}^*(\vec{w}),\vec{w}) \cdot \frac{\partial \vec{F}(\vec{x},\vec{w})}{\partial w_0}$$
at output states $\vec{x}^*(\vec{w})$ defined implicitly as a function of the parameters $\vec{w}$ by $\vec{F}(\vec{x}^*(\vec{w}),\vec{w}) = \vec{x}^*(\vec{w})$. We
show that we can choose $\vec{F}(\vec{x},\vec{w})$ so that the gradient-calculating function $\vec{T}(\vec{x},\vec{w})$ is a simple, local function
of $\vec{x}$. Any network dynamics that converge to a fixed point of $\vec{F}(\vec{x},\vec{w})$ may be used, such as
$\frac{d\vec{x}(t)}{dt} = \frac{1}{\tau} \cdot \left( \vec{F}(\vec{x}(t),\vec{w}) - \vec{x}(t) \right)$. So long as $\vec{F}(\vec{x},\vec{w})$ is highly recurrent and $T_i(\vec{x})$ is simple and local to unit $x_i$, intrinsic
gradient networks satisfy our desiderata.
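The relaxation dynamics mentioned above can be illustrated with a minimal sketch. It Euler-integrates dx/dt = (F(x) - x)/τ until the fixed-point residual is small. The map `example_F`, the step size `dt`, and the tolerance are hypothetical choices made only for illustration (the map was picked because it is a contraction); none of them comes from the abstract.

```python
import math

def relax_to_fixed_point(F, x0, tau=5.0, dt=1.0, tol=1e-10, max_steps=10000):
    """Euler-integrate dx/dt = (F(x) - x) / tau until max|F(x) - x| < tol."""
    x = list(x0)
    for _ in range(max_steps):
        fx = F(x)
        residual = max(abs(f - xi) for f, xi in zip(fx, x))
        if residual < tol:
            break
        # One Euler step of the relaxation dynamics.
        x = [xi + (dt / tau) * (f - xi) for f, xi in zip(fx, x)]
    return x

def example_F(x):
    """Hypothetical two-unit recurrent map (a contraction, so a fixed point exists)."""
    return [0.5 * math.tanh(x[1]) + 0.3, 0.5 * math.tanh(x[0]) - 0.1]

x_star = relax_to_fixed_point(example_F, [0.0, 0.0])
```

At the returned state, `example_F(x_star)` agrees with `x_star` to within the tolerance, which is the output condition F(x, w) = x used in the text.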
We find that equation 1 is satisfied when $\vec{F}(\vec{x},\vec{w})$ is the sum of a component determined by $E(\vec{x})$, corresponding to the inputs to the network, and
$$\vec{F}(\vec{x},\vec{w}) = \mathbf{T}^{-1} \cdot (\mathbf{D} + \mathbf{I})^{-1} \cdot \nabla \left[ c + \sum_k x_{\psi_k}^{\frac{D_{\psi_k}+1}{D_{\psi_k}}} \cdot \prod_{j \neq \psi_k} h_j^k\!\left( \frac{x_j^{\frac{D_j+1}{D_j}}}{x_{\psi_k}^{\frac{D_{\psi_k}+1}{D_{\psi_k}}}} \right) \right] \qquad (2)$$
for any set of indices $\psi_k$ indexed by $k$, and any set of differentiable scalar functions $h_j^k(x)$ indexed by both
$j$ and $k$, assuming that $\vec{T}(\vec{x}) = \mathbf{T} \cdot \vec{x}$ and $\mathbf{T}^{-1} \cdot \mathbf{T}^{\top} = \mathbf{D}$ is diagonal. We show that it is easy to choose
parameterizations of the matrix $\mathbf{T}$ and the functions $h_j^k(x)$ so that the network is highly recurrent, with a
local gradient-calculating function $\vec{T}(\vec{x})$. For instance, if we choose $\mathbf{T}$ to be a pairwise permutation of
an appropriate form, then using $h_j^k(x) \in w \cdot \{x^{1/2}, x^{2/3}, 1\}$, we can construct networks consisting of linear
nodes with output functions $F_i(\vec{x}) = \sum_j (W_{ij} + W_{ji}) \cdot x_j$ for some matrix $\mathbf{W}$, and nonlinear nodes with output
functions $F_\alpha(\vec{x}) = \frac{x_\beta^{2/3} \cdot x_\gamma^{2/3}}{x_\alpha^{1/3}}$ for all mappings of $\{i, j, k\}$ to $\{\alpha, \beta, \gamma\}$. As a proof-of-concept, we construct
some example networks satisfying equation 2, and show that they can be trained via stochastic gradient
descent to recognize handwritten digits.
References
Crick, F. (1989). The recent excitement about neural networks. Nature, 337, 129–132.
Douglas, R. J., & Martin, K. A. C. (2004). Neuronal circuits of the neocortex. Annual Review of Neuroscience, 27, 419–451.
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral
cortex. Cerebral Cortex, 1, 1–47.
Harris, K. D. (2008). Stability of the fittest: Organizing learning through retroaxonal signals. Trends in
Neurosciences, 31(3), 130–136.
Malenka, R. C., & Bear, M. F. (2004). LTP and LTD: An embarrassment of riches. Neuron, 44, 5–21.
MirroRank: Convex aggregation and online ranking
with the Mirror Descent
Benoit Rostykus1 , Nicolas Vayatis2
1 Abstract
We study the learning problem of pairwise ranking of objects, from a set of scoring functions as
base learners. Restricting to convex aggregation of these learners with a pairwise misranking
loss, we show how the Mirror Descent algorithm can be adapted to this framework to derive
an efficient, sequential solution to the problem, which we call MirroRank. Along with a theoretical analysis, our empirical results on recommendation datasets show that the proposed method
outperforms several other ranking algorithms.
2 Convex surrogates and convex aggregation for ranking
Let $(X,Y) \in \mathcal{X} \times [0,1]$ be a pair of random variables. In a ranking setup, $X$ typically represents
the data, and $Y$ is a score for point $X$. If $Y > Y'$, then the pair $(X,Y)$ is said to be preferred to the
pair $(X',Y')$. In the pairwise ranking problem, one only observes $X$ and $X'$, and wants to predict
the relation $Y > Y'$. We define a scoring function $s : \mathcal{X} \to [0,1]$ such that if $s(X) > s(X')$, then
$X$ is preferred to $X'$. Given a convex function $\varphi : \mathbb{R} \to \mathbb{R}_+$, we are interested in finding a scoring
function $s$ that minimizes the convex risk:
$$A(s) = \mathbb{E}\,\varphi\big((s(X) - s(X'))(Y - Y')\big) \qquad (1)$$
Note that such a choice of risk emphasizes the magnitude-preserving properties of $s$ (see [3]).
We will focus on procedures optimizing $A$ (or its empirical counterpart) from a convex set of
base learners. The motivation is driven by applications: in a recommendation setting, it seems
indeed intuitive to restrict the search space to convex combinations of similar scoring functions
(the base learners). Let $M \in \mathbb{N}^*$ and $s_1, ..., s_M$ be a set of base scoring functions. If $\Theta$ is the
simplex in $\mathbb{R}^M$, the family of convex combinations of base scoring functions $s_k$ is denoted:
$$\mathcal{S} = \Big\{ s_\theta = \sum_{k=1}^{M} \theta^{(k)} s_k ,\ \theta \in \Theta \Big\} \qquad (2)$$
3 Ranking with the Mirror Descent
Mirror Descent is an online procedure for convex optimization ([6],[5]). It can be interpreted
as a stochastic gradient descent in a dual space. We now show how to adapt it to the ranking
problem. Let $Q(\theta, i, j) = \varphi\big((s_\theta(x_i) - s_\theta(x_j))(y_i - y_j)\big)$ be the stochastic loss associated with
scoring function $s_\theta$ and evaluated on pairs $(x_i, y_i)$ and $(x_j, y_j)$. Noticing that $(\nabla_\theta s_\theta)^{(k)}(x) = s_k(x)$,
and denoting by $\varphi'$ the derivative of $\varphi$, one can easily show that:
$$(\nabla_\theta Q(\theta, i, j))^{(k)} = (y_i - y_j)(s_k(x_i) - s_k(x_j)) \times \varphi'\big((s_\theta(x_i) - s_\theta(x_j))(y_i - y_j)\big) \qquad (3)$$
Mirror Descent for Ranking is then described by the iterative scheme presented in the following
pseudo-code, where $V$ is called a proxy function and $(\gamma_p)$ and $(\beta_p)$ are predefined sequences.

MirroRank
for $p = 1, ..., t$ s.t. $i_p < j_p$ do
    $\zeta_p = \zeta_{p-1} + \gamma_p \nabla_\theta Q(\theta_{p-1}, i_p, j_p)$
    $\theta_p = \arg\min_{\theta \in \Theta} \; \theta^T \zeta_p + \beta_p V(\theta)$
end for
return $s_{\hat{\theta}_t}$ with $\hat{\theta}_t = \frac{\sum_{p=1}^{t} \gamma_p \theta_p}{\sum_{p=1}^{t} \gamma_p}$

At each round, a new pair of couples $(x_i, y_i)$, $(x_j, y_j)$ is evaluated. The first step in the loop is a
classical stochastic gradient update ([1]) in a dual space. The second one updates the weights in
the primal. This key step minimizes a local approximation of $A$, penalized by the "information"
criterion $V$ ([5]). In fact, choosing an entropic loss for $V$ leads to a closed form solution for $\hat{\theta}_t$,
making the algorithm very efficient. Taken as a sequential version of the MRE strategy described
in [2], we show that MirroRank has a generalization bound in $O(\sqrt{\log M / n})$, which can be seen
as optimal in some sense.

1 Ecole Centrale Paris, ENS Cachan - benoit.rostykus@centraliens.net
2 ENS Cachan - nicolas.vayatis@cmla.ens-cachan.fr
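The MirroRank loop can be sketched in a few lines under stated assumptions. The sketch assumes an entropic proxy $V(\theta) = \log M + \sum_k \theta^{(k)} \log \theta^{(k)}$, for which the primal step has the closed form $\theta_p^{(k)} \propto \exp(-\zeta_p^{(k)} / \beta_p)$. It also assumes the hinge loss for $\varphi$, uniformly sampled pairs, and the sequences $\gamma_p = 1$ and $\beta_p = \beta_0 \sqrt{p}$. The function name `mirrorank` and all parameter names are hypothetical, and the abstract does not state which sequences the authors used.

```python
import math
import random

def hinge_derivative(u):
    """Subgradient of the hinge loss phi(u) = max(0, 1 - u)."""
    return -1.0 if u < 1.0 else 0.0

def mirrorank(base_scores, y, n_rounds, beta0=1.0, seed=0):
    """base_scores[k][i] = s_k(x_i); y[i] = score of x_i. Returns averaged weights."""
    rng = random.Random(seed)
    M, n = len(base_scores), len(y)
    zeta = [0.0] * M                      # dual variable zeta_p
    theta = [1.0 / M] * M                 # theta_0: uniform point of the simplex
    theta_sum, gamma_sum = [0.0] * M, 0.0
    for p in range(1, n_rounds + 1):
        i, j = sorted(rng.sample(range(n), 2))       # a random pair with i < j
        gamma, beta = 1.0, beta0 * math.sqrt(p)      # assumed sequences
        diffs = [s[i] - s[j] for s in base_scores]   # s_k(x_i) - s_k(x_j)
        margin = sum(t * d for t, d in zip(theta, diffs)) * (y[i] - y[j])
        g = hinge_derivative(margin) * (y[i] - y[j])
        # Dual step: zeta_p = zeta_{p-1} + gamma_p * gradient from equation (3).
        zeta = [z + gamma * g * d for z, d in zip(zeta, diffs)]
        # Primal step: entropic V gives theta proportional to exp(-zeta / beta).
        shift = min(zeta)                             # shift for numerical stability
        w = [math.exp(-(z - shift) / beta) for z in zeta]
        total = sum(w)
        theta = [wk / total for wk in w]
        theta_sum = [a + gamma * t for a, t in zip(theta_sum, theta)]
        gamma_sum += gamma
    return [a / gamma_sum for a in theta_sum]
```

With one base learner that agrees with the scores and one that reverses them, the averaged weights should favour the first learner, since its dual coordinate only ever decreases.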
4 Experimental Results
Choosing ϕ to be the hinge-loss, experiments conducted on different recommendation datasets
show that MirroRank outperforms other ranking algorithms, on different classical metrics. This
is especially true when the range of possible scores is large. As opposed to RankBoost ([4]),
MirroRank weak learners are taken to be normalized scoring functions of neighbouring users.
The following table is an example of comparison between MirroRank and RankBoost that we
obtained on different datasets. The procedure employed for these experiments is very similar to
the one described in [3].
                 Mean Disagreement              NDCG@3
Dataset          RankBoost      MirroRank       RankBoost      MirroRank
MovieLens1       44.7% ± 0.8    40.3% ± 1.1     81.4% ± 1.2    82.7% ± 1.2
MovieLens2       42.8% ± 0.7    39.1% ± 0.7     83.7% ± 0.8    84.7% ± 1.0
HTREC Last.fm    47.2% ± 0.8    41.5% ± 1.0     39.9% ± 2.7    49.1% ± 1.7
Experiments also exhibit the excellent time complexity of our algorithm. Finally, we provide
a natural interpretation of the weights θ(k) , and experimentally show that they can be related
to a notion of similarity between scoring functions, enabling MirroRank to jointly recommend
similar objects and similar users.
References
[1] L. Bottou and Y. LeCun. Large scale online learning. Advances in neural information
processing systems, 16:217–224, 2004.
[2] S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2):844–874, 2008.
[3] C. Cortes, M. Mohri, and A. Rastogi. Magnitude-preserving ranking algorithms. In Proceedings of the 24th international conference on Machine learning, pages 169–176. ACM,
2007.
[4] Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4:933–969, 2003.
[5] A.B. Juditsky, A.V. Nazin, A.B. Tsybakov, and N. Vayatis. Recursive aggregation of estimators by the mirror descent algorithm with averaging. Problems of Information Transmission,
41(4):368–384, 2005.
[6] A.S. Nemirovsky and D.B. Yudin. Problem complexity and method efficiency in optimization. 1983.
Preserving Proximity Relations and Minimizing Edge-crossings in Graph
Embeddings
Amina Shabbeer∗
Cagri Ozcaglar†
Bülent Yener‡
Kristin P. Bennett§
Rensselaer Polytechnic Institute
We propose a novel approach to embedding heterogeneous data
in high-dimensional space characterized by a graph. Targeted towards data visualization, the objectives of the embedding are twofold: (i) preserve proximity relations as measured by some embedding objective, and (ii) simultaneously optimize an aesthetic criterion, no edge-crossings in the embedding, to create a clear representation of the underlying graph structure. The data-points are the
nodes of the graph and have distances or similarity measures defined. It is often desirable that drawings of such graphs map nodes
from high-dimensional feature space to low-dimensional vectors
that preserve these pairwise distances. This desired quality is frequently expressed as a function of the embedding and then optimized; e.g., in Multidimensional Scaling (MDS), the goal is to minimize the difference between the actual distances and Euclidean distances between all nodes in the embedding.
However, layouts that preserve proximity relations can have a
large number of edge-crossings obfuscating the relationships between nodes. The quality of a graph visualization can be gauged on
the basis of how easily it can be understood and interpreted. Therefore for graphs, it is desirable to minimize edge crossings. This
is a challenging problem in itself; determining the minimum number of crossings for a graph is NP-complete [4]. Thus, a natural
question for heterogeneous data that comprises data points characterized by features and by an underlying graph structure is how to
optimize the embedding criteria while minimizing the large number
of edge crossings in the embedded graph. The principal contributions of this paper are (i) expressing edge-crossing minimization
as a continuous optimization problem so that both the embedding
criteria and aesthetic graph visualization criteria can be simultaneously optimized, and (ii) an iterative penalty algorithm that efficiently
optimizes the layout with respect to a large number of non-convex
quadratic constraints.
This paper provides a new approach for addressing the edge-crossing minimization criterion that exploits Farkas’ Lemma to re-express the condition for no edge-crossings as a system of nonlinear
inequality constraints. The approach has an intuitive geometric interpretation closely related to support vector machine classification.
While edge crossing minimization can be utilized in conjunction
with any optimization-based embedding objective, here we demonstrate the approach on multi-dimensional scaling by modifying the
stress majorization algorithm to include penalties for edge crossings. An alternating iterative penalty algorithm is developed that is
capable of elegantly minimizing the stress subject to a large number
of non-convex quadratic constraints. The algorithm is applied to a
problem in tuberculosis molecular epidemiology, creating ‘spoligoforests’ for visualizing genetic relatedness between strains characterized by fifty-five biomarkers with associated non-Euclidean genetic
distances of the Mycobacterium tuberculosis complex. Evolutionary relationships between strains are defined by a phylogenetic
forest as in Fig. 1. The method is also demonstrated on a suite
of randomly generated graphs with corresponding Euclidean distances that have planar embeddings with high stress as observed in
Fig. 2 (a) and (b). The proposed edge-crossing constraints and iterative penalty algorithm can be readily adapted to other supervised
and unsupervised optimization-based embedding or dimensionality
reduction methods. The constraints can be generalized to remove
intersections of general convex polygons including node-edge and
node-node intersections.
The condition that two edges do not cross is equivalent to the
feasibility of a system of nonlinear inequalities (see footnote 1). Two edges do not
intersect iff the following system of equations has no solution:
$$\nexists\ \delta_A, \delta_B \ \text{s.t.}\ A'\delta_A = B'\delta_B,\quad e'\delta_A = 1,\quad e'\delta_B = 1,\quad \delta_A \ge 0,\quad \delta_B \ge 0 \qquad (1)$$
where $e$ is a vector of ones of appropriate dimension and
$A = \begin{bmatrix} a_x & a_y \\ c_x & c_y \end{bmatrix}$ and $B = \begin{bmatrix} b_x & b_y \\ d_x & d_y \end{bmatrix}$. We prove this using a theorem of
the alternative: Farkas’ lemma [6]. Therefore, two edges do not
intersect iff there exist $u$ and $\gamma$ such that $\|(-Au + (1 + \gamma)e)_+\| + \|(Bu + (1 - \gamma)e)_+\| = 0$, where
$(z)_+ = \max(0, z)$. Geometrically the theorem states that two edges
(or more generally two polyhedra) do not intersect if and only if
there exists a hyperplane that strictly separates the extreme points
of A and B. Figure 3 illustrates that when this system is satisfied, any plane that lies between $xu - \gamma = 1$ and $xu - \gamma = -1$ strictly
separates the two edges, and the edges do not intersect. This formulation bears resemblance to the parallel hyperplanes used to find
maximum margin hyperplanes in SVM [8]. The no-edge-crossing
constraint corresponds to introducing a hyperplane and requiring
the two edges to lie in opposite half-spaces. We develop a method to determine the separating plane based on the alternating direction method
of multipliers [2] that achieves a 10-fold improvement in computation time as compared to MATLAB’s LP solver linprog.
This work demonstrates how edge-crossing constraints can be
formulated as a system of nonconvex constraints. Edges do not
cross if and only if they can be strictly separated by a hyperplane. If
the edges cross, then the hyperplane defines the desired half-spaces
that the edges should lie within. The edge-crossing constraints can
be transformed into a continuous edge-crossing penalty function in
either 1-norm or least-squares form. This general approach is applicable to the intersection of any convex polyhedra including nodes
represented as boxes and edges represented as bars. The number
of node-node and node-edge overlaps increases as the graph size
grows. Greater clarity can be achieved in visualizations of large
graphs with the incorporation of constraints minimizing such overlaps.
Proximity relations are preserved by minimizing the stress,
a measure of the disagreement between distances in the high-dimensional space and the new reduced space. The stress function
is given by $\mathrm{stress}(X) = \sum_{i<j} w_{ij} (\|X_i - X_j\| - d_{ij})^2$ where $X_i$ is the
position of node $i$ in the embedding and $d_{ij}$ represents the distance between nodes $i$ and $j$. Majorization algorithms, such as used
in this work, optimize this stress function by minimizing a quadratic
function that bounds the stress function from above [3].
We explore how edge-crossing constraints can be added to stress
majorization algorithms. We develop an algorithm which simultaneously minimizes stress while eliminating or reducing edge crossings using penalized stress majorization:
$$\min_{X,u,\gamma}\ \mathrm{stress}(X) + \sum_{i=1}^{m} \frac{\rho_i}{2} \Big[ \|(-A^i(X)u^i + (1+\gamma^i)e)_+\|_1 + \|(B^i(X)u^i + (1-\gamma^i)e)_+\|_1 \Big] \qquad (2)$$
We chose to use a penalty approach because it provides an efficient mechanism for dealing with the large number of potential edge
crossings. For $\ell$ edge objects there are $\ell(\ell-1)/2$ possible intersections.
We investigate the use of the alternating direction method of multipliers to solve this program efficiently. Animations of the algorithm
illustrating how the edge-crossing penalty progressively transforms
the graphs are provided at http://www.cs.rpi.edu/~shabba/Final pictures/. Computational results demonstrate that this approach is
practical and tractable. Results on spoligoforests and high-dimensional random graphs with planar embeddings show that the method
can find much more desirable solutions from a visualization point of
view with only relatively small changes in stress as compared with
other proximity-preserving dimensionality reduction techniques.

Figure 1: Embeddings of spoligoforests of LAM (Latin-American-Mediterranean) sublineages. Graph (c) is a planar embedding generated using Graphviz Twopi; the radial layout is visually appealing, but genetic distances between strains are not faithfully reflected. Graph (b), which optimizes
the MDS objective (generated using Neato), preserves proximity relations but
has edge-crossings. In graph (d), Laplacian eigenmaps [1, 5, 7] use the eigenvectors
of the weighted Laplacian for dimensionality reduction such that locality properties are preserved, making genetically similar groups cluster together. These
methods are fast but generate graphs that have edge-crossings, and the genetic relatedness between all pairs of strains is less evident. In graph (a), the
proposed approach eliminates all edge crossings with little change in the overall stress. Note how in graph (a), the radial structure emerges naturally when
both distances and the graph structure are considered.

Figure 2: Embeddings for a randomly generated graph in $\mathbb{R}^7$ with 50
nodes and 80 edges using (a) Stress majorization (stress=131.8,
number of crossings=369) and (b) EdgeCrossMin (stress=272.1,
number of crossings=0). The original planar embedding had stress=352.5.

Figure 3: In (a), edge A from a to c and edge B from b to d do not cross. Any
line between $xu - \gamma = 1$ and $xu - \gamma = -1$ strictly separates the edges. Using a
soft margin, the plane in (b), $xu - \gamma = 0$, separates the plane into half-spaces that
should contain each edge.

∗ e-mail: shabba@cs.rpi.edu
† e-mail: ozcagc2@cs.rpi.edu
‡ e-mail: yener@cs.rpi.edu
§ e-mail: bennek@rpi.edu
1 Details available at http://www.cs.rpi.edu/research/pdf/11-03.pdf
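As an illustration of the 1-norm edge-crossing penalty, the sketch below evaluates it for a single pair of edges and a given candidate hyperplane $(u, \gamma)$. The function name `crossing_penalty` and the example coordinates are hypothetical. The sketch only evaluates the penalty; it does not search for the separating plane, which the abstract does with the alternating direction method of multipliers.

```python
def plus(z):
    """Elementwise plus function (z)_+ = max(0, z)."""
    return max(0.0, z)

def crossing_penalty(A, B, u, gamma):
    """||(-A u + (1 + gamma) e)_+||_1 + ||(B u + (1 - gamma) e)_+||_1.

    A and B each hold the two endpoints (x, y) of one edge, u is the normal
    of the candidate separating line, and gamma is its offset.
    """
    penalty = 0.0
    for ax, ay in A:   # endpoints of edge A should satisfy x.u - gamma >= 1
        penalty += plus(-(ax * u[0] + ay * u[1]) + (1.0 + gamma))
    for bx, by in B:   # endpoints of edge B should satisfy x.u - gamma <= -1
        penalty += plus((bx * u[0] + by * u[1]) + (1.0 - gamma))
    return penalty

# Hypothetical example: two parallel vertical edges, then two crossing diagonals.
separated = crossing_penalty([(2, 0), (2, 1)], [(0, 0), (0, 1)], u=(2.0, 0.0), gamma=2.0)
crossing = crossing_penalty([(0, 0), (2, 2)], [(0, 2), (2, 0)], u=(2.0, 0.0), gamma=2.0)
```

For the separated pair the penalty vanishes at this $(u, \gamma)$, while for the crossing pair it stays positive for this candidate plane, as it would for any other.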
REFERENCES
[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality
reduction and data representation. Neural computation, 15(6):1373–
1396, 2003.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method
of multipliers. Technical report, working paper on line, Stanford, Univ,
2010.
[3] E Gansner, Y Koren, and S North. Graph drawing by stress majorization. Graph Drawing, 3383:239–250, 2004.
[4] M.R. Garey and D.S. Johnson. Crossing number is NP-complete. SIAM
Journal on Algebraic and Discrete Methods, 4:312, 1983.
[5] Y. Koren. On spectral graph drawing. Computing and Combinatorics,
pages 496–508, 2003.
[6] O.L. Mangasarian. Nonlinear programming. Society for Industrial
Mathematics, 1994.
[7] Thomas Puppe. Spectral Graph Drawing: A Survey. VDM Verlag,
2008.
[8] V.N. Vapnik. The nature of statistical learning theory. Springer Verlag,
2000.
Title: A Learning-based Approach to Enhancing Checkout Non-Compliance Detection
Authors: Hoang Trinh, Sharath Pankanti, Quanfu Fan.
Abstract
In retail stores, cashier non-compliant activities at the Point of Sale (POS) are one of the
prevalent sources of retail loss, amounting to billions every year. A checkout non-compliance
occurs when an item passes the POS without being registered, e.g., failing to trigger the barcode
scanner. This non-compliance event, either done intentionally or unintentionally by the cashier,
leads to the same outcome: the item is not charged to the customer, causing loss to the store.
Surveillance cameras have long been used to monitor grocery stores, but their effectiveness is
severely limited by the need for constant human monitoring. Recently many automated video
surveillance systems [1, 2, 3, 4, 5] have been introduced to detect such activities, showing
advantages over human surveillance in effectiveness, efficiency and scalability. The ultimate goal
of these systems is to detect non-compliance activities and create a real-time alert for each of them
for human verification.
These systems accomplish this task by detecting checkout activities of cashiers during
transactions and identifying unmatched evidence through joint analysis of cashier activities and
logged transaction data (TLog). The predominant cashier activity at the checkout is a repetitive
activity called a "visual scan", which consists of three distinctive primitive checkout activities:
pick-up, scan and drop-off, corresponding to the process of registering one item by the cashier in
a transaction. By aligning these detected visual scans with the barcode signals from the POS
device, non-compliance activities can be isolated and detected.
However when it comes to the real-world deployment of such systems, there are still vast
technical challenges to overcome. Changing viewpoints, occlusions, and cluttered background are
just a few of those challenges from a realistic environment that any automatic video surveillance
system has to handle. In addition, in the low-margin retail business (checkout non-compliance
occurs with much lower frequency than regular events), it is crucial that such a system be
designed with careful control of alarms rate (AR) while still being scalable. The alarms rate is
defined as the total number of detected non-compliance activities divided by the total number of
scanned items. A high alarms rate will make it almost impossible for a human verifier to scan
through all the alarms, which means that the probability of finding true non-compliance would
decrease.
In this work, we propose a novel approach to reliably rank the list of detected non-compliance
activities of a given retail surveillance system, thereby providing a means of significantly reducing
the false alarms and improving the precision in non-compliance detection.
A checkout non-compliance activity is defined as a visual scan activity that cannot be associated
to any barcode signal. Therefore it is a complex human activity that can only be defined in the
joint domain of multiple modalities, in this case, by both the video data and the TLog data, or
more precisely, by the misalignment between these two data streams. This is the key difference
between non-compliance activity recognition and other human activity recognition tasks.
Our approach represents each detected non-compliance activity using multimodal features
coming from video data, transaction logs (TLog) data and intermediate results of the video
analytics. We then train a binary classifier that separates true positives from false
positives in a labeled training set. A confidence score for each detected non-compliance activity
can then be computed using the decision value of the trained classifier, and a ranked list of
detections can be formed based on this score.
The benefit from having this ranked list is two-fold. First, a large number of false alarms can be
avoided by simply keeping the top part of the list and discarding the rest. Second, a trade off
between precision and recall can easily be performed by sliding the discarding threshold along
this ranked list.
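The effect of sliding the discarding threshold can be sketched as follows. The code ranks detections by a confidence score and reports the precision and recall obtained by keeping only the top part of the list at every possible cutoff. The function name and the example scores and labels are hypothetical; in the approach described above, the scores would be the decision values of the trained classifier.

```python
def precision_recall_by_cutoff(scores, labels):
    """Rank detections by descending score; return (precision, recall) per cutoff.

    labels[i] is 1 for a true non-compliance and 0 for a false alarm.
    Assumes at least one label equals 1.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_true = sum(labels)
    curve, true_kept = [], 0
    for kept, (_, is_true) in enumerate(ranked, start=1):
        true_kept += is_true
        curve.append((true_kept / kept, true_kept / total_true))
    return curve

# Hypothetical decision values and ground-truth labels for five detections.
curve = precision_recall_by_cutoff([2.1, 1.4, 0.3, -0.2, -1.0], [1, 1, 0, 1, 0])
```

Keeping only the top two detections in this toy list gives perfect precision at two-thirds recall, while keeping all five recovers every true event at lower precision.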
Experimental results on a large scale dataset captured from real stores demonstrate that our
approach achieves better precision than a state-of-the-art system at the same recall. Our approach
can also reach an ROC point that exceeds the retailers' expectation in terms of precision, while
retaining an acceptable recall of more than 60%.
Although in this work, we applied our approach to the particular problem of retail surveillance,
we believe our approach can be generalized to other non-compliance detection applications in
other contexts.
References
[1] Agilence. http://www.agilenceinc.com/
[2] A. Dynamics. http://www.americandynamics.net/
[3] Q. Fan, R. Bobbit, Y. Zhai, A. Yanagawa, S. Pankanti, and A. Hampapur. Recognition of
repetitive sequential human activity. In CVPR, 2009
[4] StopLift. http://www.stoplift.com
[5] H. Trinh, Q. Fan, S. Pankanti, P. Gabbur, J. Pan, and S. Miyazawa. Detecting human activities
in retail surveillance using hierarchical finite state machine. In ICASSP, 2011.
…
Ensemble Inference on Single-Molecule Time Series Measurements
Single-molecule fluorescence resonance energy transfer (smFRET) experiments allow
observation of individual molecules in real-time, enabling detailed studies of the mechanics of cellular components such as nucleic acids, proteins, and other macromolecular complexes. In FRET experiments, the intensity of a fluorescence signal is a proxy
for the spatial separation of two labeled locations in a molecule. This signal hereby acts
as a molecular ruler, allowing detection of transitions between conformational states
in a molecule. This gives an experimentalist access to two quantities of interest: the
number of conformational steps involved in a biochemical process, and the transition
rates associated with these steps.
A sequence of conformational transitions is generally well-approximated as a Markov
process. From the perspective of a data scientist, these types of experiments are therefore obvious candidates for Hidden Markov Model (HMM) based approaches. Maximum-likelihood inference can be performed with the well-known Baum-Welch algorithm.
However, a common limitation of this technique is that it is prone to overfitting, since
the log likelihood tends to increase with the number of degrees of freedom in the model
parameters. From an experimental point of view, this means that it is difficult to differentiate between a fitting artifact and a short-lived state or intermediate.
Recently, our group has shown that Variational Bayes Expectation Maximization (VBEM)
is an effective tool for both model selection and parameter estimation for these types of
datasets. Maximum likelihood techniques optimize log[p(x ∣ θ)], the log-likelihood
of the data x given a set of model parameters θ. In VBEM, the optimization criterion
is the evidence
$$p(x \mid u) = \int d\theta \, p(x \mid \theta)\, p(\theta \mid u),$$
which is approximated by a lower bound
$$L = \sum_z \int d\theta \, q(z)\, q(\theta \mid w) \log\left[ \frac{p(x, z \mid \theta)\, p(\theta \mid u)}{q(z)\, q(\theta \mid w)} \right].$$
Here z is the sequence of conformational states, and q(z) and q(θ ∣ w) are used to approximate the posterior p(z, θ ∣ x). The lower bound L can be optimized by iteratively
solving the variational equations δL/δq(z) = 0 and δL/δq(θ ∣ w) = 0. Rather than a
point estimate for the parameters θ, this method yields an approximation for the posterior distribution over the latent states and model parameters in light of the data. While
this method is approximate, in that it relies on a factorization p(z, θ ∣ x) ≃ q(z)q(θ ∣w),
validation on synthetic data shows that VBEM provides accurate estimates under common experimental parameter ranges, while also providing a robust criterion for the
selection of the number of states.
In the majority of experiments, data reporting on the dynamics of several hundred
individual molecules are recorded, giving rise to an ensemble of traces which report
on the same process but possess slightly different photo-physical parameters (e.g., smFRET state means and variances). Consequently, beyond learning from the individual
traces, learning a consensus model from the ensemble of traces presents a key challenge
for the experimentalist.
Here we present a method to model the entire data set at once. This method performs
“ensemble” inference: inference of the parameters of each individual trace and inference
of the entire distribution of traces. VBEM is performed on each of N traces,
yielding a set of variational parameters w_n that determine approximate posterior distributions q(θ_n ∣ w_n). Subsequently, the total approximate evidence L = ∑_n L_n is maximized w.r.t. the prior parameters u. The update scheme for this ensemble method can
be summarized as:
Iterate until L = ∑_n L_n converges:
1. For each trace n, iterate over VBEM steps:
a. Update q(z_n) by solving δL_n/δq(z_n) = 0
b. Update variational parameters w_n by solving δL_n/δq(θ_n ∣ w_n) = 0
2. Update hyperparameters u by solving ∂L/∂u = 0
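The structure of this update scheme can be illustrated on a much simpler model than the HMM used in the abstract. The sketch below is a toy analogue under loudly stated assumptions: each trace is a set of Gaussian observations with a trace-specific mean θ_n and known noise variance, and the prior over θ_n is Gaussian with hyperparameters u = (mu, tau2). In this conjugate setting the per-trace posterior is exact, so there are no latent state sequences z_n, and step 2 has a closed form. All names are hypothetical and this is not the authors' implementation.

```python
def ensemble_empirical_bayes(traces, sigma2=1.0, n_iter=200):
    """Toy analogue: x_{n,t} ~ N(theta_n, sigma2), theta_n ~ N(mu, tau2)."""
    trace_means = [sum(x) / len(x) for x in traces]
    mu = sum(trace_means) / len(trace_means)   # initial hyperparameters u
    tau2 = 1.0
    for _ in range(n_iter):
        post_mean, post_var = [], []
        for x in traces:
            # Step 1: per-trace Gaussian posterior q(theta_n) given current u.
            precision = 1.0 / tau2 + len(x) / sigma2
            post_var.append(1.0 / precision)
            post_mean.append((mu / tau2 + sum(x) / sigma2) / precision)
        # Step 2: hyperparameters that maximize the summed evidence bound.
        mu = sum(post_mean) / len(traces)
        tau2 = sum(v + (m - mu) ** 2
                   for m, v in zip(post_mean, post_var)) / len(traces)
    return mu, tau2, post_mean
```

The returned `post_mean` values are shrunk toward the shared `mu`, which is the sense in which each trace borrows strength from the ensemble.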
This method differs from what is often called a “hierarchical” approach, e.g. a method
where either the prior parameters u are themselves drawn from a prior, or a mixture
model of prior parameters u m is assumed. Rather we aim to construct the simplest possible scheme that will present an experimentalist with an ensemble method for model
selection and parameter estimation.
Using synthetic data, we show the statistical superiority of HMI for data inference over
methods which only learn from individual traces. The method is more robust under
high levels of emission noise, and gracefully handles inter-trace variation of the FRET
levels. Moreover, rate constants extracted directly from an ensemble-inferred transition
matrix are more accurate than rate constants learned from dwell-time analysis.
Finally we show that our ensemble method can be used to detect subpopulations of
states with degenerate emission levels but different transition rates. We present results
using both synthetic data and experimental smFRET data taken from the ribosome.
We also apply the method to measurements utilizing carbon nanotube transistors that
show a similar switching behavior between transition rates.
Bisection Search in the Presence of Noise
September 8, 2011
Abstract
Stochastic gradient algorithms are popular methods for finding the root of a function that
is only observed with noise. Finite time performance and asymptotic performance of these
methods heavily depend on a chosen tuning sequence. As an alternative to stochastic gradient algorithms, which are conceptually similar to the Newton-Raphson search algorithm,
we analyze a stochastic root finding algorithm that is motivated by the bisection algorithm.
In each step the algorithm queries an oracle as to whether the root lies to the left or right
of a prescribed point x. The oracle answers this question, but the received answer is incorrect with probability 1 − p(x). A Bayes-motivated algorithm for this problem that assumes
knowledge of p(·) repeatedly updates a density giving, in some sense, one's belief about the
location of the root. In contrast to stochastic gradient algorithms, the described algorithm
does not require the specification of a tuning sequence, but requires knowledge of p(·).
This probabilistic bisection algorithm has previously been introduced in Horstein (1963)
for the setting where p(·) is constant. However, very little is known about its theoretical
properties and how it can be extended when p(·) varies with x. We demonstrate how the
algorithm works, and provide new results that shed light on its performance, both when p(·)
is constant and when p(·) varies with x. When p(·) is constant, for example, we show that
the probabilistic bisection algorithm is optimal for minimizing expected posterior entropy.
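A minimal sketch of the probabilistic bisection update for constant p is given below. It makes several assumptions that are not stated in the abstract: the root lies in [0, 1], the belief is a piecewise-constant density on a uniform grid, and each query is placed at the median of the current belief. The names `probabilistic_bisection` and `make_noisy_oracle` are hypothetical.

```python
import random

def make_noisy_oracle(root, p, rng):
    """Oracle that reports whether the root is to the right of x, correct w.p. p."""
    def oracle(x):
        truth = root > x
        return truth if rng.random() < p else not truth
    return oracle

def probabilistic_bisection(oracle, p, n_steps, n_bins=1000):
    """Bayesian bisection on [0, 1] with a piecewise-constant belief."""
    mass = [1.0 / n_bins] * n_bins          # uniform prior over the bins
    for _ in range(n_steps):
        cum = 0.0
        for k, m in enumerate(mass):        # locate the median bin
            cum += m
            if cum >= 0.5:
                break
        x = (k + 1) / n_bins                # query at the right edge of that bin
        says_right = oracle(x)
        # Bayes update: the side the oracle points to is weighted by p,
        # the other side by 1 - p, followed by renormalization.
        mass = [m * (p if (i > k) == says_right else 1.0 - p)
                for i, m in enumerate(mass)]
        total = sum(mass)
        mass = [m / total for m in mass]
    cum = 0.0
    for k, m in enumerate(mass):            # report the posterior median
        cum += m
        if cum >= 0.5:
            return (k + 0.5) / n_bins
```

Unlike a stochastic gradient scheme, no tuning sequence appears in this sketch; the only quantity required besides the oracle is p.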
PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm
for the MapReduce Framework
Shen Wang (sw2613@columbia.edu), Haimonti Dutta (haimonti@ccls.columbia.edu)
The Center for Computational Learning Systems, Columbia University, NY 10115.
Large datasets, of the order of petabytes and terabytes, are becoming prevalent in many scientific
domains including astronomy, physical sciences, bioinformatics and medicine. To effectively
store, query and analyze these gigantic repositories, parallel and distributed architectures have
become popular. Apache Hadoop is a distributed file system that provides support for data-intensive applications. It provides an open source implementation of the MapReduce
programming paradigm, which can be used to build scalable algorithms for pattern analysis and
data mining. MapReduce has two computation phases – map and reduce. In the map phase, a
dataset is partitioned into disjoint parts and distributed to workers called mappers. The mappers
implement local data processing and the output of the map phase is of the form < key, value >
pairs. These are passed to the second phase of MapReduce called the reduce phase. The workers
in reduce phase (called reducers) take all the instances that have the same key and do computeintensive processing and produce the final result. For a complex computation task, several
MapReduce pairs may be involved.
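The two phases can be mimicked in a single process to make the data flow concrete. The sketch below is only an illustration of the paradigm, not the Hadoop API; `map_reduce`, `word_mapper` and `count_reducer` are hypothetical names chosen for the example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Single-process imitation of one MapReduce pair.

    mapper(record) yields (key, value) pairs; reducer(key, values) returns
    the result for all values that share a key.
    """
    groups = defaultdict(list)
    # Map phase: every record is processed independently.
    for record in records:
        for key, value in mapper(record):
            # Shuffle: values with the same key end up at the same reducer.
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit a (word, 1) pair for every word in a line of text."""
    return [(word, 1) for word in line.split()]


def count_reducer(key, values):
    """Add up the counts collected for one word."""
    return sum(values)
```

For instance, `map_reduce(["a b a", "b c"], word_mapper, count_reducer)` groups the emitted pairs by word before summing them.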
When processing large datasets, clustering algorithms are frequently used to compress or extract
patterns from the original data. Agglomerative hierarchical clustering is a popular clustering
algorithm. It proceeds in four steps: (1) At the start of the algorithm, each object is assigned to a
separate cluster. Then, all pairwise distances between clusters are evaluated using a distance
metric of one’s choice. (2) The two clusters with the shortest distance are merged into a single
cluster. (3) The distances between the newly merged cluster and the other clusters are calculated
and the distance matrix is updated accordingly. (4) If more than one cluster still exists, go to step 2.
The hierarchical clustering algorithm is known to have several advantages: it does not require
a priori knowledge of the number of clusters in the dataset. Furthermore, the distance metric and
cluster cutting criteria can be adjusted easily. However, the time complexity of the hierarchical
clustering algorithm is relatively high. More importantly, to find the two clusters that are closest
to each other, the algorithm needs to know the distances between all the cluster pairs. This
characteristic makes the hierarchical clustering algorithm very hard to scale in a distributed
computing framework.
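The four steps of sequential agglomerative clustering can be sketched as follows. This is a deliberately naive version written for illustration: it assumes single linkage (the abstract leaves the linkage open) and recomputes cluster distances on every pass instead of maintaining a distance matrix, which makes the all-pairs requirement discussed above easy to see.

```python
def agglomerative(points, distance, num_clusters=1):
    """Naive agglomerative clustering; returns (clusters, merges).

    clusters is a list of lists of point indices, merges records each merge
    as (cluster_a, cluster_b, linkage_distance) in the order performed.
    """
    # Step 1: every object starts in its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage (an assumption): distance of the closest pair.
        return min(distance(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Step 2: find the closest pair, which needs every pairwise distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Step 3: replace the two clusters by their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        # Step 4: loop until the requested number of clusters remains.
    return clusters, merges
```

The recorded merge order and linkage distances are the information a dendrogram encodes.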
In this abstract, we present a parallel, random-partition based hierarchical clustering algorithm for
the MapReduce framework. The algorithm contains two main components - a divide-and-conquer
phase and a global integration phase. In the divide-and-conquer phase, the data is randomly split
into several smaller partitions by the mapper by assigning each instance a random number as key.
The instances with the same key are forwarded to the same reducer. On the reducers, the
sequential hierarchical clustering algorithm is run and a dendrogram is generated. The
dendrogram, a binary tree organized by linkage length, is built on the local subset of data. To
obtain a global cluster assignment across all mappers and reducers, dendrogram integration needs
to be implemented. Such integration is non-trivial because insertion and deletion of a single
instance changes the structure of the dendrogram. Our approach is to align them by “stacking”
them one on top of another by using a recursive algorithm described in Algorithm 1.
Algorithm 1: Recursive dendrogram aligning algorithm
function align(Dendrogram D1, Dendrogram D2)
    if similarity(D1.leftChild, D2.leftChild) + similarity(D1.rightChild, D2.rightChild) <
       similarity(D1.rightChild, D2.leftChild) + similarity(D1.leftChild, D2.rightChild) then
        Switch D2’s two children
    end if
    align(D1.leftChild, D2.leftChild)
    align(D1.rightChild, D2.rightChild)
The following example provides an illustration of the technique:
Example 1: Consider two dendrograms a and b that need to be aligned. Assume a is the template
dendrogram -- this means dendrogram b is aligned to a and all structure changes will happen on b
only. First, the roots of these two dendrograms (nodes of depth 1) are aligned to each other. Then,
nodes at depth 2 need to be aligned. There are two choices -- the first is to align them as they are
seen in the figure and thus - align a2 with b2 and a3 with b3. Another choice is the opposite, which
aligns a2 with b3 and a3 with b2. The decision is made by comparing similarity (a2, b2) + similarity
(a3, b3) with similarity (a2,b3)+similarity(a3,b2) and taking the one with higher similarity value. In
this case, we will find it more reasonable to align a2 with b3 and a3 with b2. Therefore, b2 and b3
are switched and dendrogram b is transformed to dendrogram c. Then, for each pair of nodes that
have been aligned in two dendrograms, say a2 and c2, we repeat the same procedure to align c2’s
two children with a2’s two children. This procedure is repeated recursively until it reaches a depth
deep enough for labeling.
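A runnable sketch of the alignment procedure is given below. The abstract does not define the similarity function or the stopping rule, so two assumptions are made here: similarity is the negative distance between subtree centroids of one-dimensional data, and the recursion stops as soon as either node is a leaf. The `Node` class is likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    points: List[float]              # instances under this subtree (1-D for brevity)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None or self.right is None


def centroid(node: Node) -> float:
    return sum(node.points) / len(node.points)


def similarity(a: Node, b: Node) -> float:
    # Assumed similarity: subtrees with closer centroids are more similar.
    return -abs(centroid(a) - centroid(b))


def align(template: Node, other: Node) -> None:
    """Reorder the children of `other` in place so that they match `template`."""
    if template.is_leaf() or other.is_leaf():
        return
    straight = (similarity(template.left, other.left)
                + similarity(template.right, other.right))
    crossed = (similarity(template.right, other.left)
               + similarity(template.left, other.right))
    if straight < crossed:
        # Only the non-template dendrogram is ever modified.
        other.left, other.right = other.right, other.left
    align(template.left, other.left)
    align(template.right, other.right)
```

As in Example 1, only the second dendrogram changes; the template keeps its structure.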
The cluster labeling comprises two steps – the first step involves cutting the template
dendrogram into subtrees, similar to what a sequential hierarchical clustering algorithm would do
during labeling. Each subtree is given a cluster label. Then for the root ai of each subtree i in the
template dendrogram, the algorithm finds out all the nodes in other dendrograms that were
aligned with it in the alignment step. Each of these nodes is also a root of a subtree in its own
dendrogram and the algorithm will label all the instances belonging to this subtree with the same
cluster label as ai. Intuitively, this is like “stacking” all the aligned dendrograms together with the
template dendrogram being put on top. Then a knife is used to cut the template dendrogram into
pieces. After the template dendrogram is cut, the knife does not stop but cuts all the way to the
bottom of the stack. By doing this, the entire stack is cut into several smaller stacks, and instances
in each small stack are given the same cluster label.
The algorithm is implemented on an Apache Hadoop framework using the MapReduce
programming paradigm. Empirical results on two large data sets from the ACM KDD Cup
competition suggest that the PArallel RAndom-partition Based hierarchicaL clustEring
algorithm (PARABLE) has significantly better scalability than centralized solutions. Future work
involves theoretical analysis of convergence properties and performance benefits obtained from
this randomized algorithm and implementation of multiple levels of local clustering.
Acknowledgements: Funding for this work is provided by National Science Foundation award
IIS-0916186.
A Reinforcement Learning Approach to Variational Inference
David Wingate
wingated@mit.edu
Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139 USA
Theophane Weber
theo@lyricsemiconductor.com
Lyric Semiconductor, One Broadway 14th Floor, Cambridge, MA 02142 USA
1. Introduction
We adopt a dynamical systems perspective on variational inference in deep generative models. This connects variational inference to a temporal credit assignment problem that can be solved using reinforcement
learning: policy search methods (such as policy gradients) become a direct search through variational parameters; state-space estimation becomes structured
variational inference, and temporal-difference methods
suggest novel inference algorithms.
Let p(x1, ..., xT) be a directed graphical model with
graph structure G. We consider the directed edges to
be an explicit representation of a temporal process:
we begin at the root of G, and sequentially sample
variables conditioned on the values of their parents. G
imposes only a partial order on this “timeseries,” so
we impose a total order by selecting a fixed sample
path through all of the variables that respects G.
This prior can now be viewed as an uncontrolled
dynamical system that stochastically evolves until
data is generated. We decompose this as
$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$; to simplify notation, we will
let $h_t = x_1, \dots, x_{t-1}$ be the history of the generative
process until time t. Given data y, inference is the
process of propagating information backwards through
this timeseries, creating the posterior conditional
distribution $p(x \mid y) \propto p(y \mid x)\, p(x) = p(y \mid x) \prod_{t=1}^{T} p(x_t \mid h_t)$.
To foreshadow things, we note that variational inference can be viewed as a controlled analogue of this.
The goal of variational inference is to adjust the parameters θ of a distribution pθ(x) to minimize a cost
function C(θ), which is typically the KL divergence
\[ C(\theta) = \mathrm{KL}\big(p_\theta(x) \,\|\, p(x \mid y)\big) = \int_x p_\theta(x) \log \frac{p_\theta(x)}{p(x \mid y)}. \tag{1} \]
Like p(x), we assume that pθ(x) decomposes as
$p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid h_t, \theta)$. If we further make a mean-field assumption, this would decompose into a product
of marginals as $p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid \theta_t)$.
Solving Eq. 1 is typically done by computing derivatives analytically, setting them equal to zero, solving
for a coupled set of equations, and deriving an iterative optimization algorithm. However, this general
approach fails for highly structured distributions (such
as those represented by, for example, probabilistic programs), because there is unlikely to be a tractable set
of variational equations.
We instead consider generic approaches based on direct optimization of Eq. 1. This allows us to borrow
tools from other disciplines, including RL.
2. An RL Perspective
We begin by noting that C(θ) can be written as:
\[
\begin{aligned}
C(\theta) &= \int_x p_\theta(x) \log \frac{p_\theta(x)}{p(x \mid y)} \\
&= E_{p_\theta(x)} \left[ \log \left( \frac{\prod_{t=1}^{T} p_\theta(x_t \mid h_t, \theta)}{p(y \mid x) \prod_{t=1}^{T} p(x_t \mid h_t)} \right) \right] \\
&= E_{p_\theta(x)} \left[ \sum_{t=1}^{T} R_t(x_t \mid h_t, \theta) - \log p(y \mid x) \right]
\end{aligned}
\]
where
$R_t(x_t \mid h_t, \theta) = \log p_\theta(x_t \mid h_t, \theta) - \log p(x_t \mid h_t)$.
Because the posterior is used only up to proportionality, the last two lines omit the additive constant log p(y), which does not depend on θ.
Our key insight is that this equation has exactly the
same form as the definition of the expected reward
of a trajectory in a Markov Decision Process (MDP)
when we adopt the explicitly temporal decompositions
of both the target and variational distributions.
Specifically:
• The generative process p(x) is a dynamical system (as we have discussed).
• The parameters θ are the policy of an RL agent.
• The divergence Rt (xt |ht , θ) is a history-dependent reward; the agent incurs a cost whenever it sets θ such that the variational distribution diverges from the prior.
• The term log p(y|x) is a terminal reward obtained at the end of an episode.
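The correspondence between the KL cost and an expected episodic cost can be checked numerically on a tiny model. The sketch below is illustrative only: the two binary variables, the probability tables and the likelihood values are made up for the example. It also makes explicit that the two quantities differ by the constant log p(y), which does not depend on θ.

```python
import math
from itertools import product

# Hypothetical prior p(x1) p(x2 | x1) over two binary variables.
P1 = {0: 0.6, 1: 0.4}
P2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
# Hypothetical likelihood p(y | x) of the fixed observation y.
LIK = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.9}
# Hypothetical variational distribution with the same temporal factorization.
Q1 = {0: 0.5, 1: 0.5}
Q2 = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.1, 1: 0.9}}

STATES = list(product((0, 1), repeat=2))


def prior(x):
    return P1[x[0]] * P2[x[0]][x[1]]


def variational(x):
    return Q1[x[0]] * Q2[x[0]][x[1]]


def episode_cost(x):
    """Sum of per-step divergences R_t minus the terminal term log p(y | x)."""
    r1 = math.log(Q1[x[0]]) - math.log(P1[x[0]])
    r2 = math.log(Q2[x[0]][x[1]]) - math.log(P2[x[0]][x[1]])
    return r1 + r2 - math.log(LIK[x])


# Evidence p(y) and the exact KL divergence to the normalized posterior.
evidence = sum(LIK[x] * prior(x) for x in STATES)
kl = sum(
    variational(x) * math.log(variational(x) * evidence / (LIK[x] * prior(x)))
    for x in STATES
)
# Expected cost of a trajectory sampled from the variational distribution.
expected_cost = sum(variational(x) * episode_cost(x) for x in STATES)
```

Here `kl` equals `expected_cost + math.log(evidence)` up to rounding, so minimizing the expected episodic cost over the variational parameters is the same as minimizing the KL divergence.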
Thus we see the temporal credit assignment problem
in generative models: information must be propagated
from observed data p(y|x) backwards through the generative process p(x), balancing the “costs” incurred by
deviating from the prior against the “reward” obtained
by generating the data with high likelihood.
Different assumptions about pθ (xt |θt ) map to different kinds of RL.
If we make the mean-field assumption $p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid \theta_t)$ (i.e., ignoring the generative history ht ), the dynamical system becomes partially observable, and optimizing θ is now the problem of finding a first-order (or reactive) policy in a
POMDP. Similarly, state-space estimation is the process of summarizing ht such that the process becomes
Markov again. Finally, note that since pθ (x) is unconditional, it is easy to sample from; in terms of our
dynamical system perspective, this means that it is
easy to roll-out a trajectory.
By considering θ to be an agent’s policy, we can borrow techniques from RL to solve this problem. One
possible approach is (stochastic) gradient descent on
C(θ):
\[ \nabla_\theta C(\theta) \approx \frac{1}{N} \sum_{x_j} \nabla_\theta \log p_\theta(x_j) \left[ \log \frac{p_\theta(x_j)}{p(x_j \mid y)} + 1 \right] \tag{2} \]
with xj ∼ pθ (x) being a trajectory roll-out. We note
that this has exactly the same form as a policy gradient
equation, such as Williams’ REINFORCE algorithm.
We could also use model-free dynamic programming
to help optimize C(θ) by defining a value function and
using temporal difference methods to find θ.
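The score-function estimator of Eq. 2 can be tried on the smallest possible case. The sketch below assumes a single binary variable with pθ(x = 1) = sigmoid(θ) and a known target posterior; both are assumptions made so that the Monte Carlo estimate can be compared with the exact gradient, and none of this reproduces the authors' experiments.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def kl_gradient_estimate(theta, posterior, n_samples, rng):
    """Monte Carlo estimate of dC/dtheta from roll-outs x_j ~ p_theta."""
    q1 = sigmoid(theta)
    total = 0.0
    for _ in range(n_samples):
        x = 1 if rng.random() < q1 else 0
        q_x = q1 if x == 1 else 1.0 - q1
        score = x - q1  # d/dtheta of log p_theta(x) for a sigmoid-Bernoulli
        total += score * (math.log(q_x / posterior[x]) + 1.0)
    return total / n_samples


def kl_gradient_exact(theta, posterior):
    """Closed-form dC/dtheta for the same one-variable model."""
    q1 = sigmoid(theta)
    log_ratio = math.log(q1 / posterior[1]) - math.log((1.0 - q1) / posterior[0])
    return q1 * (1.0 - q1) * log_ratio
```

A call such as `kl_gradient_estimate(0.0, {0: 0.2, 1: 0.8}, 100000, random.Random(0))` should land close to `kl_gradient_exact(0.0, {0: 0.2, 1: 0.8})`. The "+ 1" term has zero expectation under pθ and acts as a constant baseline.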
3. Experiments and Results
To illustrate the utility of this perspective, we present
one experiment on a complex generative model from
geophysics. The process creates a 3D volume of rock
by layering new sedimentary deposits on top of each
other, with each layer distributed in a complex way
based on the surface created by previous layers. The
position and “shape” of each layer are the variables
x1 , · · · , xT of interest. At the end of the process, wells
are drilled through the rock, and a well log y is computed by measuring the rock porosity as a function
of depth. The task is to reason about the posterior
distribution over layers x given a well log y.
Figure 1. Results of vanilla policy gradients vs. episodic
natural actor critic on the sedimentary model.
Because of the complex dependencies between layers,
there are no analytically tractable variational equations. Some dependencies can be broken under a
mean-field assumption: each layer can be deposited
independently of all others. However, there are still
dependencies in how the layers generate the final rock
volume that cannot be broken.
To solve this, we tested two algorithms. The first is
vanilla stochastic gradient descent on Eq. 1, with the
gradients given by Eq. 2. We also tested the Episodic
Natural Actor Critic algorithm (Peters et al., 2005), an
algorithm for estimating a natural policy gradient that
combines ordinary policy gradients with value function
estimation to reduce variance in the gradient estimate.
Figure 1 shows the results, which demonstrate much
faster convergence for ENAC. We believe this is because ENAC
does a better job of propagating information from the
end of the process backward to the beginning, and
because of its use of natural gradients.
4. Conclusions
We have outlined how variational inference can be
viewed as an RL problem, and illustrated how RL algorithms bring new tools to inference problems. This
is especially appropriate in the context of deep generative models with complex structure, where RL can help
propagate information backwards through the process.
Future work will investigate more RL algorithms and
their properties when applied to variational inference.
References
J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In European Conference on Machine Learning
(ECML), pages 280–291, 2005.
Using Support Vector Machine to Forecast Energy Usage of a
Manhattan Skyscraper
Rebecca Winter (1), David Solomon (2), Albert Boulanger (3), Leon Wu (3), Roger Anderson (3)
(1) Department of Earth and Environmental Engineering, Columbia University Fu Foundation School of Engineering and Applied Sciences
(2) Department of Environmental Sciences, Columbia College
(3) Columbia University Center for Computational Learning Systems
Introduction
As our society gains a better understanding of our ability to negatively impact the environment,
reducing carbon emissions and our overall energy consumption have become important areas of
research. One of the simplest ways to reduce energy usage is by making current buildings less
wasteful. By improving energy efficiency, this method of lowering our carbon footprint is
particularly worthwhile because it actually reduces energy costs to the building, unlike many
environmental initiatives that require large monetary investments. In order to improve the
efficiency of the heating, ventilation, and air conditioning (HVAC) system of a Manhattan skyscraper, 345
Park Avenue, a predictive computer model was designed to forecast the amount of energy the
building will consume. This model uses a support vector machine (SVM), a method that builds a
regression based purely on historical data of the building, requiring no knowledge of its size,
heating and cooling methods, or any other physical properties. This pure dependence on historical
data makes the model very easily applicable to different types of buildings with few model
adjustments. The SVM model was built to predict a week of future energy usage based on past
energy, temperature, and dew point temperature data.
Modeling the energy usage of 345 Park is an important step to improving the efficiency of the
system. An accurate model of future energy usage can be compared to actual energy usage to
look for anomalies in the actual data that may represent wasteful usage of energy. The short-term
predicted energy usage, if accurate enough, could also be used to determine how much energy
should be used now. For example, if the model predicts a large increase in energy usage in two
hours, a moderate energy increase could be forced now to help combat the high future demand.
Alternatively, if a low energy requirement is predicted for the day, pre-heating or pre-cooling
times can be pushed later to decrease total consumption. Fundamentally, in order to control a
pattern, one must first be able to model its behavior. Modeling the energy usage of 345 Park will
allow the management to better understand their building’s energy requirements, which inevitably
leads to new and better ways of optimizing the system.
Data
The rate of energy consumption, called demand, is dependent on a number of factors. The time
and day of the week affect the energy demand most dramatically because the building is primarily
commercial, and uses relatively little energy when it is closed at night and on weekends. By
including past energy data in the prediction, this weekly cycle of high usage during regular
business hours is learned by the model and is used to produce future graphs of energy demand.
Another important factor in predicting energy usage of the heating and air conditioning is the
weather. On particularly hot days, far more energy is required to cool the building, and more
energy goes into heating the building on very cold days. Similarly, humidity and the presence of
precipitation can change the perceived temperature of the building, which affects the amount of
energy required to regulate temperature. Energy demand, temperature, and dew point temperature
were all used in the creation of the computer model.
Methods
There are various kinds of models that can be used to predict and analyze the energy demand of a
building. The DOE-2 model, created by the U.S. Department of Energy, takes inputs that
characterize the physical aspects of the building in order to predict its energy needs (“Overview
of DOE-2.2”, Simulation Research Group, Lawrence Berkeley National Laboratory, University of
California, 1998). SimaPro is a tool that evaluates the embedded energy in the building’s
materials and construction history and also predicts operational energy. For this study, a purely
operational approach to energy demand forecasting was taken. The Support Vector Machine
algorithm developed by V. Vapnik in 1995 to perform Support Vector Machine Regression was
used to predict energy demand for 345 Park Avenue. This method requires no inputs pertaining to
the physical characteristics of the building. Rather, the method employs past hourly energy data
and corresponding hourly weather data to create a model that predicts energy demand into the
future. The strength of this model is that it can work for any building that has historical energy
demand data.
A support vector machine is a machine learning tool that represents past data as points in a multi-dimensional
space in order to output a regression. Each data point of past energy usage is paired with 92
hours of past energy data and with 92 hours of corresponding temperature, dew point
temperature, and energy consumption one year before. Each of these supporting energy,
temperature, and dew point temperature values in a data point, called time delay coordinates, adds
a new dimension to the model. When a long enough set of data points is given, each with its
corresponding time delay coordinates, the computer can “learn” correlations between energy
usage and the rest of the corresponding data. This learned model, along with the most recent data,
is used to project energy demand into the future. A five-month-long hourly data set was used with a one-week
test set, an RBF kernel, a gamma value of 0.1, and a C value of 200. The parameter values and the
data included in the model were chosen by optimizing the R² and RMSE values of the regressions.
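The construction of time delay coordinates can be sketched as below. This is an assumed, simplified layout rather than the authors' exact feature set: each example concatenates the previous `n_lags` hourly values of energy, temperature and dew point, and the year-before consumption feature is left out. The function name `build_examples` is hypothetical.

```python
def build_examples(energy, temperature, dew_point, n_lags):
    """Turn three aligned hourly series into (features, targets).

    features[i] holds n_lags past values of each series (oldest first);
    targets[i] is the energy value that immediately follows them.
    """
    if not (len(energy) == len(temperature) == len(dew_point)):
        raise ValueError("all series must have the same length")
    features, targets = [], []
    for t in range(n_lags, len(energy)):
        row = (list(energy[t - n_lags:t])
               + list(temperature[t - n_lags:t])
               + list(dew_point[t - n_lags:t]))
        features.append(row)      # each lagged value is one dimension
        targets.append(energy[t])
    return features, targets
```

With `n_lags=92` every example has 276 dimensions. The resulting rows could then be passed to any support vector regression implementation, for example scikit-learn's `SVR(kernel="rbf", gamma=0.1, C=200)`; that library choice is an assumption, as the abstract does not name the software used.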
Results and Conclusions
The computer-generated models were reasonable estimates of future energy demand, with R²
values from 0.71 to 0.95. Additionally, by predicting an entire week into the future, these models
can be quite useful for buildings looking to predict and streamline their energy usage. Even more
accurate results could likely be achieved using multiple years of data because seasonal cycles
could be more easily learned by the computer model.
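The two quality measures mentioned above can be computed as follows. The sketch assumes the common definitions (R² as one minus the ratio of residual to total sum of squares, RMSE as the root of the mean squared error); the abstract does not state which variant of R² was used.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

Both functions expect the held-out test week and the corresponding predictions as plain sequences of hourly demand values.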
References
Burges, C. “A Tutorial on Support Vector Machines for Pattern Recognition”, Data Mining and
Knowledge Discovery, vol. 2 (1998): 121-167.
Farmer, J.D. and Sidorowich, J.J. “Predicting Chaotic Time Series”, Physical Review Letters, vol.
59 (1987).
Simulation Research Group, Lawrence Berkeley National Laboratory, University of California,
“Overview of DOE-2.2”, June 1998.
Xi, X-C., Poo, A-N., Chou, S-K. “Support Vector Regression Model Predictive Control on a
HVAC Plant”, Control Engineering Practice, vol. 15 (2007).
Xuemei, L., Jin-hu, L., Lixing, D., Gang, X., Jibin, L. “Building Cooling Load Forecasting Model
Based on LS-SVM”, Asia Pacific Conference on Information Processing (2009).
Learning Sparse Representations of High Dimensional Data on Large
Scale Dictionaries
Zhen James Xiang, Hao Xu, Peter J. Ramadge
Department of Electrical Engineering, Princeton University
Abstract for the 6th Annual Machine Learning Symposium
Finding an effective representation of data is an important process in many machine learning applications. Representing a p-dimensional data point x as a sparse linear combination of the elements
of a dictionary of m (possibly linearly dependent) codewords: $B = [b_1, b_2, \dots, b_m] \in \mathbb{R}^{p \times m}$ is an approach that has recently attracted much attention. As a further refinement, when given a training set
$X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{p \times n}$ of data points, the dictionary B can be optimized to make the representation
weights as sparse as possible. This leads to the following problem:
\[ \min_{B, W} \; \frac{1}{2} \| X - BW \|_F^2 + \lambda \| W \|_1 \quad \text{s.t.} \quad \| b_i \|_2^2 \le 1, \; \forall i = 1, 2, \dots, m. \tag{1} \]
Here $\|\cdot\|_F$ and $\|\cdot\|_1$ denote the Frobenius norm and element-wise $\ell_1$-norm of a matrix, respectively.
There are two major advantages to this adaptive, sparse representation method. First, in the spirit
of many modern approaches (e.g. PCA, SMT [1], tree-induced bases [2]), rather than fixing B a priori
(e.g. Fourier, wavelet, DCT), problem (1) assumes minimal prior knowledge and uses sparsity as a cue
to learn a dictionary adapted to the data. Second, the new representation w is obtained by a nonlinear
mapping of x. In many other approaches (including [1, 2]), although the codebook B is cleverly chosen,
the new representation w is simply a linear mapping of x, e.g. w = B† x. As a final point, we note that
the human visual cortex uses similar mechanisms to encode visual scenes [3] and sparse representation has
exhibited superior performance on many difficult computer vision problems [4].
The challenge, however, is that solving (1) is computationally expensive for large scale dictionaries. Most state-of-the-art
algorithms solve this non-convex optimization problem by
iteratively optimizing W and B, which result in large scale
lasso and large scale constrained least square problems. This
has limited the use of the sparse representation to applications
of moderate scales. In this work, we develop efficient methods
for learning sparse representations on large scale dictionaries
by controlling the dictionary size m and data dimension p.
In each iterative updating of W with fixed B, we have to
solve n lasso problems, each with m codewords (or regressors
in the standard lasso literature). To speed up this process
when m is large, we first investigate “dictionary screening” to
identify codewords that are guaranteed to have zero coefficient
in the solution. The standard result in this area is the SAFE
rule [5]. We provide a new, general and intuitive geometrical framework for deriving screening tests and
using this framework we derive two new screening tests that are significantly better than all existing
tests. The new tests perform particularly well when the data points and codewords are highly correlated,
a typical scenario in sparse representation applications [4]. See Figure 1 for an illustrative example of
average rejection fractions obtained by our new tests (ST2, ST3) on a real data set.
[Figure 1 plot, titled “Learning non-hierarchical sparse representation”. Horizontal axis: λ (0 to 1); vertical axis: average % of codewords discarded. Curves: ST3, ST2 and ST1/SAFE, each on original data and on projected data.]
Figure 1: The average rejection percentage of different screening tests. Examples are from
the COIL rotational image dataset.
[Figure 2 plots. Left (MNIST): classification accuracy (%) on testing set against average encoding time for a testing image (ms). Curves: traditional sparse representation with m = 64, 128, 192, 256 and 512, each with 6 λ settings; our hierarchical framework with (m1=32, m2=512), (m1=64, m2=2048) and (m1=16, m2=256, m3=4096), each with 6 λ settings; baseline: the same linear classifier using 250 principal components or using original pixel values. Right (face recognition): recognition rate (%) on testing set (top) and average encoding time (ms) (bottom) against the number of random projections used, from 32 (0.1% of image size) to 256 (0.8%). Curves: traditional sparse representation; our hierarchical framework; our framework with PCA projections; linear classifier; SRC (Wright et al., 2008).]
Figure 2: Left: MNIST: The tradeoff between classification accuracy and average encoding time for various sparse
representation methods. Each curve is for one value of m as λ varies. Right: Face Recognition: The recognition
rate (top) and average encoding time (bottom) for various methods.
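For orientation, the baseline that the new tests are compared against can be sketched in a few lines. The code below implements the basic SAFE test as we understand it from [5] for a single lasso problem min_w 1/2 ‖x − Bw‖² + λ‖w‖₁; it does not reproduce the new tests ST2 and ST3, whose formulas are not given in this abstract, and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def safe_screen(codewords, x, lam):
    """Return one flag per codeword; True means it can be discarded.

    A discarded codeword is guaranteed a zero coefficient in the lasso
    solution for data point x at regularization level lam.
    """
    correlations = [abs(dot(b, x)) for b in codewords]
    lam_max = max(correlations)
    if lam >= lam_max:
        # At or above lam_max the lasso solution is identically zero.
        return [True] * len(codewords)
    x_norm = norm(x)
    return [
        c < lam - norm(b) * x_norm * (lam_max - lam) / lam_max
        for b, c in zip(codewords, correlations)
    ]
```

In a dictionary-learning loop, a test of this kind would run once per data point before the lasso solver, so that the solver only sees the surviving codewords.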
We also examine projecting the data onto a lower dimensional space so that we can solve the sparse
representations for smaller p. We prove some new theoretical properties on the simplest projection method:
random projection. It is known that under mild assumptions random projection preserves the pairwise
distance of data points. Projection thus provides an opportunity to learn informative sparse representations
with fewer dimensions.
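The distance-preservation property can be observed with a small experiment. The sketch below assumes a dense Gaussian projection matrix with entries of variance 1/k, one standard construction; the abstract does not say which random projection the authors use.

```python
import math
import random


def random_projection_matrix(p, k, rng):
    """k x p matrix with i.i.d. N(0, 1/k) entries (assumed construction)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(p)] for _ in range(k)]


def project(matrix, x):
    """Map a p-dimensional point to k dimensions."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Projecting two points from p = 1000 to k = 200 dimensions and comparing `euclidean` before and after typically gives a ratio close to one, with fluctuations that shrink as k grows.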
Finally, we combine the ideas of screening and projection into a hierarchical framework. This deals
with large m and p simultaneously and in an integrated way. This framework uses random projections to
incrementally extract information from the training data and uses this to refine the sparse representation
in stages. At each stage the prescreening process utilizes our new screening tests together with an imposed
zero-tree constraint that leads to a hierarchically structured dictionary.
In Figure 2 we show the results of testing our framework on the MNIST digits data set [70,000 28x28
handwritten digit images, 60,000 training, 10,000 testing] and a face data set [38 subjects in the extended
YaleB data set, each with 64 cropped frontal face views under differing lighting conditions, randomly
divided into 32 training and 32 testing]. After learning the dictionary, we use the sparse representation
vectors wj as features to train a linear SVM classifier. For the digits data set, we tested the traditional
sparse representation algorithm for different values of m and λ, and we tested our new framework with
several settings of hierarchical levels, dictionary sizes and number of random projections. Compared to the
traditional sparse representation, our hierarchical framework achieves roughly a 1% accuracy improvement
given the same encoding time and a roughly 2X speedup given the same accuracy. For the face data set, we
tested algorithms as we varied the number of random data projections. Our stage-wise random projection
framework strikes a good balance between speed and accuracy.
References
[1] G. Cao and C.A. Bouman. Covariance estimation for high dimensional data vectors using the sparse matrix transform. In Advances
in Neural Information Processing Systems, 2008.
[2] M. Gavish, B. Nadler, and R.R. Coifman. Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications
to semi supervised learning. In International Conference on Machine Learning, 2010.
[3] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research,
37(23):3311–3325, 1997.
[4] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan. Sparse representation for computer vision and pattern recognition.
Proceedings of the IEEE, 98(6):1031–1044, 2010.
[5] L.E. Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. arXiv preprint arXiv:1009.3515,
2010.
Using Iterated Reasoning to Predict Opponent Strategies
Michael Wunder mwunder@cs.rutgers.edu
John Robert Yaros yaros@cs.rutgers.edu
September 9, 2011
in a multi-round location game called the LemonadeStand Game (LSG) [3]. In this recently introduced
game, three lemonade-selling players arrange on a circular beach with 12 locations to gain the most business from customers on the circle, where the rule is
that customers simply buy from the closest vendor. No
communication is allowed, so sellers must use only the
move history to establish a strategy.
The LSG has a number of properties that benefit
players who have a prior model of others, as opposed to
those expecting equilibrium or who attempt to gather
large amounts of data first. Iterated reasoning methods can be successful in these situations because they
provide a range of feasible strategies that contain various degrees of sophistication. Interactive POMDPs
provide sufficient generality for finding optimal solutions in these settings [1], but we have modified the
process for repeated games to take advantage of inherent structure in games like LSG. Solving the iterated
reasoning represented as an I-POMDP will yield the
hierarchy of actions so that strategy selection becomes
a problem of estimating the sophistication of the population. Our updated framework provides a hierarchy of
likely strategies and a way to estimate the amount of
reasoning in observed strategies [2], which were key
components that earned our team victory in the most
recent LSG competition.
Abstract
The field of multiagent decision making
is extending its tools beyond classical game
theory by embracing reinforcement learning, statistical analysis, and opponent modeling. For example, behavioral economists conclude from experimental results that people
act according to levels of reasoning that form
a “cognitive hierarchy” of strategies, rather
than merely following the hyper-rational Nash
equilibrium solution concept. This paper expands this model of the iterative reasoning
process by widening the notion of a level
within the hierarchy from one single strategy
to a distribution over strategies, leading to a
more general framework of multi-agent decision making. It provides a measure of sophistication for strategies and can serve as a
guide for designing good strategies for multiagent games, drawing its main strength from
predicting opponent strategies.
We apply these lessons to the recently
introduced Lemonade-Stand Game, a simple
setting that includes both collaborative and
competitive elements, where an agent’s score
is critically dependent on its responsiveness
to opponent behavior. The opening moves
are significant to the end result and simple
heuristics have outperformed intricate learning schemes. Using results from three open
tournaments, we show how the submitted entries fit naturally into our model and explain
why the top agents were successful.
1 Introduction
We have adapted level-based models of strategic thinking to operate in a repeated domain with multiple
agents. Specifically, we detail a process for discovering the solutions to an Interactive POMDP in the
context of a tournament between computer competitors.
2 LSG Level Computation
In the simple case used in the first two competitions,
customers are equally distributed, so scores are equivalent to the distance between competitors. Our level-based model computes that the optimal action is to find
a partner to set up directly across from, to the disadvantage of the third player. In addition, the model implies that stability is a prized quality of a good coalition member. Therefore, it is most beneficial to wait
longer for a partner to join us, as this period increases
the likelihood that we will eventually have two partners
(or suckers) instead of just one. The length of waiting
time becomes a proxy for the amount of reasoning that
each agent implements, allowing us to analyze overall
levels of sophistication in the population and how they
change over time. As we see in Figure 1, the submitted
agents in the second year by and large achieved this
strategy. Additionally, the average level of reasoning
increased between the first and second tournaments.
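The claim that two partners sitting directly across from each other disadvantage the third player can be checked with a small sketch. It rests on assumptions not spelled out in the text: 12 spots on a ring, one unit of demand per spot, each spot buying from its nearest stand, and ties split equally. The names `ring_distance` and `lsg_scores` are hypothetical.

```python
def ring_distance(a, b, n_spots=12):
    """Shortest distance between two spots on a ring of n_spots positions."""
    gap = abs(a - b) % n_spots
    return min(gap, n_spots - gap)


def lsg_scores(positions, demand=None, n_spots=12):
    """Split each spot's demand equally among the stands nearest to it."""
    if demand is None:
        demand = [1.0] * n_spots  # assumed: one unit of demand per spot
    totals = [0.0] * len(positions)
    for spot in range(n_spots):
        dists = [ring_distance(spot, p, n_spots) for p in positions]
        nearest = min(dists)
        winners = [i for i, d in enumerate(dists) if d == nearest]
        for i in winners:
            totals[i] += demand[spot] / len(winners)
    return totals


# Two partners sit directly across from each other (spots 0 and 6);
# the third player tries every spot on the ring.
third_player_scores = [lsg_scores([0, 6, spot])[2] for spot in range(12)]
example = lsg_scores([0, 6, 3])  # [4.5, 4.5, 3.0]
```

Under these assumed rules the third player never reaches the equal share of 4 out of 12, which is consistent with the opposite-partner arrangement described above. Passing a non-uniform `demand` list gives a rough analogue of the unequal-demand variant.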
In the latest iteration of the competition, the demand is unequally distributed on each of the 12 spots,
and known beforehand. Our model breaks the strategy
into two parts: initial action selection and subsequent
action if our initial action was suboptimal given others’
locations. In fact, the solution follows the same logic as
in the previous game: choose a location such that the
other players will end up fighting over one half of the
customers. Because our agent chooses the second-best
initial action, the most likely scenario is that the worst-off player will pick a position nearer the third player,
to our benefit. If that initial choice fails, the action is
changed once we find a location that creates the most
rewarding coalition. This combination of strategies resulting from our model led us to victory in the latest
competition (see Table 1).
[Figure 1 plot: Competition Score (vertical axis) against Approximate Level (horizontal axis), with separate series for 2009 and 2010.]
Figure 1: Estimated levels of competitors in two
Lemonade-Stand Game tournaments. Both sets of
agents show positive correlation between reasoning
and performance. R² values are 0.77 for 2009 and 0.34
for 2010. The more recent agents show a shift to higher
reasoning levels, as well as a compression of scores.
3 Conclusion
In more general games, we hope that our solving process will lead to insights for incorporating game structure into these powerful models. The ability to find
higher order strategies is crucial to limiting computation costs and optimizing against unknown populations.
Rank  Team     Score   Bound
1.    Rutgers  50.397  ±0.022
2.    Harvard  48.995  ±0.020
3.    Alberta  48.815  ±0.022
4.    Brown    48.760  ±0.023
5.    Pujara   47.883  ±0.020
6.    BMJoe    47.242  ±0.021
7.    Chapman  45.943  ±0.019
8.    GATech   45.271  ±0.021
Table 1: The 2011 Rankings.