Statistics
Statistics is a key technology at the core of systems such as advertising, content optimization, search and recommendation. The massive scale and interdisciplinary nature of the research provide a unique opportunity to work on a new class of statistical problems. Statisticians work closely with experts in machine learning, databases, computer science and economics.
Challenges
Learning interactions to predict click-through rates from massive amounts of data
Extreme data sparseness, a weak signal-to-noise ratio, millions of potential predictors and the need to adjust for several confounding factors make the problem difficult and provide ample opportunities to conduct new methodological research. Applications include online advertising, web search and recommender systems.
Sequential methods
Constructing optimal sequential decision rules that converge rapidly to high-yield regions in a multi-dimensional space is one of our main focus areas. Unlike clinical trials, it is relatively cheap for us to run a large number of experiments on a continuous basis.
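As an illustration only, one widely used sequential decision rule is Thompson sampling for a multi-armed bandit. The sketch below is not necessarily the method used at Yahoo!; it assumes a small fixed set of options ("arms") with binary rewards, and the name `thompson_sampling`, the simulated reward rates and the seed are all hypothetical.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Beta-Bernoulli Thompson sampling over a fixed set of arms.

    true_rates are hypothetical success probabilities, used here only to
    simulate feedback. Returns the number of times each arm was chosen.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(rounds):
        # Draw one plausible rate per arm from its Beta(1 + s, 1 + f) posterior.
        draws = [
            rng.betavariate(1 + successes[a], 1 + failures[a])
            for a in range(n_arms)
        ]
        # Play the arm whose sampled rate is highest.
        arm = max(range(n_arms), key=draws.__getitem__)
        # Simulated binary feedback (e.g. click or no-click).
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


pulls = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
print(pulls)
```

In this toy setting the rule quickly concentrates its experiments on the highest-yield arm while still occasionally exploring the others.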
Time series, point process and survival analysis methods are important to learn the dynamic nature of many of our processes
Examples include inventory forecasting and predicting user visit patterns on different content properties. The scale of our problem (several million users) and the extreme heterogeneity in our data make it a challenging modeling problem.
Scalable statistical methods
Existing statistical methods are often fitted using computationally intensive procedures and work well on moderately sized data sets. For most of our applications, where one has the opportunity to learn statistical models from hundreds of millions of observations and several million potential covariates, traditional statistical methods do not scale well. Building scalable methods, possibly by exploiting parallel computing infrastructure, is a challenging research topic.
Modeling massive social networks
Social networks arise in several Yahoo! applications (e.g. IM, Yahoo! Address Book). Leveraging these social connections to improve our recommendation systems is an important problem that requires sophisticated statistical methods.
Anomaly detection
Detecting spam and low-quality clicks is extremely important to keep our systems healthy, and it provides opportunities to expand on the rich literature that exists in areas like multiple testing, change point detection and hotspot detection.
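To make the change point idea concrete, here is a minimal sketch of a one-sided CUSUM detector, a classical technique from that literature. It is an illustration under stated assumptions, not a description of any production system: the function name `cusum_first_alarm`, the parameters `target`, `slack` and `threshold`, and the toy series are all hypothetical.

```python
def cusum_first_alarm(values, target, slack, threshold):
    """One-sided CUSUM: return the first index at which an upward shift
    away from `target` is flagged, or None if no alarm is raised.

    slack is the allowance subtracted at each step, so small deviations
    do not accumulate; threshold is the alarm level.
    """
    stat = 0.0
    for t, x in enumerate(values):
        # Accumulate evidence of an upward shift, never dropping below zero.
        stat = max(0.0, stat + (x - target - slack))
        if stat > threshold:
            return t
    return None


# Toy series: a level of 0.0 for ten steps, then a jump to 2.0.
series = [0.0] * 10 + [2.0] * 10
alarm = cusum_first_alarm(series, target=0.0, slack=0.5, threshold=4.0)
print(alarm)  # 12: the statistic reaches 1.5, 3.0, then 4.5 > 4.0
```

A larger `threshold` lowers the false alarm rate at the cost of a longer detection delay; choosing it across many monitored streams at once is where multiple testing considerations come in.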
Collaborative filtering with cold-start
Several applications at Yahoo! involve predicting a response variable y that is obtained when there is an interaction between a pair (dyad) (i,j). For instance, in a content optimization problem, the response is a binary variable (click or no-click) when an article j is shown to a user i. The goal is to predict the response variable (or some function of the response variable) for future interactions. The i's and j's have rich meta-data associated with them. For instance, i may be described by data we know about a user and j by the content of the article to be shown. Non-stationarity in the process over time is also typical.
New elements i and j appear at test time (also referred to as the cold-start problem). For a small fraction of dyads we have lots of data during training (heavy users interacting with popular articles); the estimation method should fall back on the maximum likelihood estimate (or similar) for such cases. More specifically, the cold-start problem comes in two guises:
- We know absolutely nothing about the new item, e.g. no metadata, no past user behavior.
- We have no interaction data but may at least rely on access to some metadata.
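One simple way to get both behaviors (a sensible guess for cold-start items, and the maximum likelihood estimate when data is plentiful) is a shrinkage estimator. The sketch below uses a Beta-Binomial posterior mean and is only an illustration, not necessarily the approach taken here; `prior_rate`, `shrunk_ctr`, the `strength` parameter and all numbers are hypothetical.

```python
def prior_rate(category_rates, category, global_rate):
    """Pick a prior click rate for an item.

    With metadata (a known category), use that category's historical rate;
    with nothing known at all, fall back on the global rate.
    """
    return category_rates.get(category, global_rate)


def shrunk_ctr(clicks, views, prior, strength=10.0):
    """Posterior mean of a click rate under a Beta prior.

    strength is the number of pseudo-observations the prior is worth.
    With views == 0 this returns the prior; as views grows it approaches
    the maximum likelihood estimate clicks / views.
    """
    return (clicks + strength * prior) / (views + strength)


category_rates = {"sports": 0.08, "finance": 0.03}  # hypothetical metadata
global_rate = 0.05

# Guise 1: nothing is known about the new item.
print(shrunk_ctr(0, 0, prior_rate(category_rates, None, global_rate)))
# Guise 2: no interactions yet, but the category is known.
print(shrunk_ctr(0, 0, prior_rate(category_rates, "sports", global_rate)))
# Heavy-data case: the estimate is close to clicks / views.
print(shrunk_ctr(5000, 100000, prior_rate(category_rates, "finance", global_rate)))
```

The `strength` parameter controls how quickly observed data overrides the prior; in practice it would be estimated from data rather than fixed by hand.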
