Princeton Machine Learning Lecture Series

The Machine Learning series takes place in the large conference room on the fourth floor of the Computer Science building on Wednesdays at 12:30.

Fall 2012

Wed Oct 10, 2012 12:30 PM CS 402 Miroslav Dudik (Microsoft Research - NYC)
Tractable market making in combinatorial prediction markets [Abstract]

Prediction markets are emerging as a powerful and accurate method of aggregating information from populations of experts (and non-experts). Traders in prediction markets are incentivized to reveal their information through buying and selling "securities" for events such as "Romney to win Florida". The prices of securities reflect the aggregate belief about the events and the key challenge is to correctly price the securities.

We present a new automated market maker for providing liquidity across multiple logically interrelated securities. Our approach lies somewhere between the industry standard---treating related securities as independent and thus not transmitting any information from one security to another---and a full combinatorial market maker for which pricing is computationally intractable. Our market maker, based on convex optimization and constraint generation, is tractable like independent securities yet propagates some information among related securities like a combinatorial market maker, resulting in more complete information aggregation. Our techniques borrow heavily from variational inference in exponential families. We prove several favorable properties of our scheme and evaluate its information aggregation performance on survey data involving hundreds of thousands of complex predictions about the 2008 U.S. presidential election.

Joint work with Sebastien Lahaie and David Pennock.

Bio: Miroslav Dudik joined MSR-NYC in May 2012. His interests are in combining theoretical and applied aspects of machine learning, statistics, convex optimization and algorithms. He received his PhD from Princeton in 2007.

Wed Oct 17, 2012 12:30 PM CS 402 Tamir Hazan (TTI-Chicago)
Inference and Learning with Random Maximum A-Posteriori Perturbations [Abstract]

Learning and inference in complex models drives much of the research in machine learning applications, from computer vision, natural language processing, to computational biology. The inference problem in such cases involves assessing the weights of possible structures, whether objects, parsers, or molecular structures. Although it is often feasible to only find the most likely or maximum a-posteriori (MAP) assignment rather than considering all possible assignment, MAP inference is limited when there are other likely assignments. In a fully probabilistic treatment, all possible alternative assignments are considered thus requiring summing over the assignments with their respective weights which is considerably harder (#P hard vs NP hard). The main surprising result of our work is that MAP inference (maximization) can be used to approximate and bound the weighted counting. This leads us to a new approximate inference framework that is based on MAP-statistics, thus does not depend on pseudo-probabilities, contrasting the current framework of Bethe approximations which lacks statistical meaning. This approach excels in regimes where there are several but not exponentially many prominent assignments. For example, this happens in cases where observations carry strong signals (local evidence) but are also guided by strong consistency constraints (couplings).

Joint work with Tommi Jaakkola

Wed Oct 24, 2012 12:30 PM CS 402 Shay Cohen (Columbia)
Consistent and Efficient Algorithms for Latent-Variable PCFGs [Abstract]

In the past few years, there has been an increased interest in the machine learning community in spectral algorithms for estimating models with latent variables. Examples include algorithms for estimating mixture of Gaussians or for estimating the parameters of a hidden Markov model.

Until the introduction of spectral algorithms, the EM algorithm has been the mainstay for estimation with latent variables. Still, with EM there is no guarantee of convergence to the global maximum of the likelihood function, and therefore EM generally does not provide consistent estimates for the model parameters. Spectral algorithms, on the other hand, are often shown to be consistent.

In this talk, I am interested in presenting a spectral algorithm for latent-variable PCFGs, a model widely used for parsing in the NLP community. This model augments the nonterminals in a PCFG grammar with latent states. These latent states re-fine the nonterminal category in order to capture subtle syntactic nuances in the data. This model has been successfully implemented in state-of-the-art parsers such as the Berkeley parser.

The algorithm developed is considerably faster than EM, as it makes only one pass over the data. Statistics are collected from the data in this pass, and singular value decomposition is performed on matrices containing these statistics. Our algorithm is also provably consistent in the sense that, given enough samples, it will estimate probabilities for test trees close to their true probabilities under the latent-variable PCFG model.

If time permits, I will also present a method aimed at improving the efficiency of parsing with latent-variable PCFGs. This method relies on tensor decomposition of the latent-variable PCFG. This tensor decomposition is approximate, and therefore the new parser is an approximate parser as well. Still, the quality of approximation can be theoretically guaranteed by inspecting how errors from the approximation propagate in the parse trees.

Joint work with Michael Collins, Dean Foster, Karl Stratos and Lyle Ungar.

Wed Nov 14, 2012 12:30 PM CS 402 Xin Luna Dong
Truth Finding on the Deep Web [Abstract]

The Web has been changing our lives enormously and people rely more and more on the Web to fulfill their information needs. Compared with traditional media, information on the Web can be published fast, but with fewer guarantees on quality and credibility. Indeed, Web sources are of different qualities, sometimes providing conflicting, out-of-date and incomplete data. The sources can also easily copy, reformat and modify data from other sources, propagating erroneous data.

In this talk we present a recent study for truthfulness of Deep Web data in two domains where we believed data quality is important to people's lives: Stock and Flight. We then describe how we can resolve conflicts from different sources by leveraging accuracy of the sources and the copying relationships between the sources using statistical models. We demo our SOLOMON system, which can effectively detect copying between data sources, leverage the results in truth discovery, and provide a user-friendly interface to facilitate users in understanding the results.

Dr. Xin Luna Dong is a researcher at AT&T Labs-Research. She received a Ph.D. in Computer Science and Engineering from University of Washington in 2007, received a Master's Degree in Computer Science from Peking University in China in 2001, and received a Bachelor's Degree in Computer Science from Nankai University in China in 1998. Her research interests include databases, information retrieval and machine learning, with an emphasis on data integration, data cleaning, personal information management, and web search. She has led the Solomon project, whose goal is to detect copying between structured sources and to leverage the results in various aspects of data integration, and the Semex personal information management system, which got the Best Demo award (one of top-3) in Sigmod'05. She has co-chaired Sigmod/PODS PhD Symposium'12, Sigmod New Researcher Symposium'12, QDB'12, WebDB'10, has served as a track chair for the program committee of ICDE'13, CIKM'11, and has served in the program committee of VLDB'13, Sigmod'12, VDLB'12, Sigmod'11, VLDB'11, PVLDB'10, WWW'10, ICDE'10, VLDB'09, etc.

Wed Nov 7, 2012 12:30 PM CS 402 Jean Honorio
Convergence Rates of Biased Stochastic Optimization for Learning Sparse Ising Models [Abstract]

We study the convergence rate of stochastic optimization of exact (NP-hard) objectives, for which only biased estimates of the gradient are available. We motivate this problem in the context of learning the structure and parameters of Ising models. We first provide a convergence-rate analysis of deterministic errors for forward-backward splitting (FBS). We then extend our analysis to biased stochastic errors, by first characterizing a family of samplers and providing a high probability bound that allows understanding not only FBS, but also proximal gradient (PG) methods. We derive some interesting conclusions: FBS requires only a logarithmically increasing number of random samples in order to converge (although at a very low rate); the required number of random samples is the same for the deterministic and the biased stochastic setting for FBS and basic PG; accelerated PG is not guaranteed to converge in the biased stochastic setting.

Jean Honorio is a PhD in Computer Science from Stony Brook University. He received his MSc in Computer Science from The George Washington University in 2006, and BSc in Systems Engineering from Universidad de Lima, Peru in 1997. His research focuses on tractable learning of graphical model structures from data. He will soon join CSAIL at MIT as a post-doctoral researcher. He was member of the Image Analysis Lab at Stony Brook University and the Medical Department at Brookhaven National Lab in 2007-2012, the Institute for Computer Graphics at The George Washington University in 2005-2006, and the Institute for System Engineering and Operations Research at Universidad de Lima in 1995-1996.

Wed Dec 12, 2012 12:30 PM CS 402 Lauren Hannah (Columbia)
Multivariate convex regression [Abstract]
Shape constraints, like convexity, occur in a variety of inference problems, including geometric programming based circuit design, options pricing and value function approximation. Current methods for multivariate regression subject to convexity constraints do not scale to more than a couple of thousand observations. In this talk, we present a new computationally efficient, consistent method for multivariate convex regression using adaptive partitioning with linear partition models. We apply this method to large scale inference problems and discuss extensions to objective function and constraint approximation in a convex optimization setting.

Spring 2012

Wed Mar 7, 2012 12:30 PM CS 402 David Sontag (NYU)
Complexity of Inference in Latent Dirichlet Allocation [Slides] [Abstract]
In this talk, I will discuss recent work on understanding the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). Time permitting, I will also discuss my recent work on using probabilistic models such as LDA for personalizing web search and for real-time surveillance of patients' medical records.
Wed Mar 14, 2012 12:30 PM CS 402 Marc Ratkovic (Politics)
Achieving Optimal Covariate Balance Under General Treatment Regimes [Slides] [Abstract]
Balancing covariates across treatment levels provides an effective and increasingly popular strategy for reducing bias in observational studies. Matching procedures, as a means of achieving balance, pre-process the data through identifying a subset of control observations whose background characteristics are similar to the treated observations. The proposed method adapts the support vector machine classifier to the matching problem. The method provides a fully automated, nonparametric procedure for identifying a balanced subset that accommodates both binary and continuous treatment regimes. I show that, for the identified subset, the expected treatment level is jointly independent of the pre-treatment covariates. Unlike existing methods, the proposed method maximizes balance across all covariates simultaneously, rather than along a summary measure of balance. Two applications, a benchmark dataset measuring the impact of a job training program and survey data with information on medical expenditures and smoking habits, illustrate the method's use and efficacy.
Wed Mar 28, 2012 12:30 PM CS 402
TBA [Abstract]
Wed Apr 4, 2012 12:30 PM CS 402 Ryan Adams (Harvard)
Tree-Structured Stick Breaking Processes for Hierarchical Data [Abstract]

The Dirichlet process and related distributions over infinite-dimensional random measures have had a large impact on Bayesian nonparametric modeling, but they can be limited by the lack of structure they place over their random partitions. In this work, we use a recursive stick breaking process to construct a Bayesian nonparametric prior for inferring hierarchies in data, yielding an exchangeable urn scheme on trees of unbounded width and depth. We use a novel slice sampling approach based on an auxiliary variable model to update the structure of the tree, as well as Gibbs moves that take advantage of invariance under size-biased permutation. We use this MCMC sampling scheme to perform hierarchical clustering on a set of 50,000 color images, and also to discover structure from natural language in a corpus of scientific documents.

This is joint work with Zoubin Ghahramani (Cambridge) and Michael Jordan (Berkeley).

Wed Apr 11, 2012 12:30 PM CS 402 Bob Wilson (Psychology)
TBA [Abstract]
Wed Apr 18, 2012 12:30 PM CS 402 Yael Niv (Neuroscience)
TBA [Abstract]
Wed May 9, 2012 12:30 PM CS 402 Kuzman Ganchev (Google)
Rich Prior Knowledge in Learning for Natural Language Processing [Abstract]
We possess a wealth of prior knowledge about most prediction problems, and particularly so for many of the fundamental tasks in natural language processing. Unfortunately, it is often difficult to make use of this type of information during learning, as it typically does not come in the form of labeled examples, may be difficult to encode as a prior on parameters in a Bayesian setting, and may be impossible to incorporate into a tractable model. Instead, we usually have prior knowledge about the values of output variables. For example, linguistic knowledge or an out-of-domain parser may provide the locations of likely syntactic dependencies for grammar induction. Motivated by the prospect of being able to naturally leverage such knowledge, no less than five different groups have recently developed similar, general frameworks for expressing and learning with side information about output variables. This talk will provide a brief survey of these frameworks as well as some examples of how they have been applied.
Wed May 16, 2012 12:30 PM CS 402 Matthew Hoffman (Columbia)
The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo [Abstract]
Hamiltonian Monte Carlo (HMC) is a Markov Chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlations that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than popular methods such as random walk Metropolis or Gibbs sampling. However, HMC's performance is highly sensitive to two user-specified parameters: a step size epsilon and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We present the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. NUTS is able to achieve similar performance to a well tuned standard HMC method, without requiring user intervention or costly tuning runs. NUTS can thus be used in applications such as BUGS-style automatic inference engines that require efficient "turnkey" sampling algorithms.