Machine Learning
The Machine Learning group is a team of experts in computer science, statistics, mathematical optimization, and automatic control. We focus on making computers learn abstractions, patterns, conditional probability distributions, and policies from web-scale data, with the goal of improving the online experience for Yahoo! users, partner publishers, and advertisers.
Challenges
Computational limits of ML
Yahoo! has extremely large datasets that are constantly growing. How do you learn from very large datasets? Although there are many known methods for speeding up slow learning algorithms, speeding up fast online learning algorithms is a much harder problem. Another problem is test-time computational cost, because learned predictors should be deployable in real-time systems. How do you build effective learning algorithms with very low average or maximum test-time computational cost?
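To make "fast online learning" concrete, the following is a minimal sketch of online logistic regression trained by stochastic gradient descent over sparse features. It is an illustration under our own assumptions, not a description of any system in production: the feature names, the toy stream, and the learning rate are all hypothetical. Each example is processed once, and both the update and the prediction cost time proportional to the number of non-zero features in that example, which is one simple way to keep test-time cost low.

```python
import math

def predict(weights, features):
    """Return P(label = 1) for a sparse example given as {feature_name: value}."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    score = max(-30.0, min(30.0, score))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-score))

def update(weights, features, label, learning_rate=0.1):
    """Take one stochastic gradient step on the logistic loss for one example."""
    error = predict(weights, features) - label
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) - learning_rate * error * value

# Hypothetical toy stream: each example is seen once, in arrival order.
stream = [
    ({"query:shoes": 1.0, "ad:shoes": 1.0}, 1),
    ({"query:shoes": 1.0, "ad:loans": 1.0}, 0),
] * 200

weights = {}
for features, label in stream:
    update(weights, features, label)

p_match = predict(weights, {"query:shoes": 1.0, "ad:shoes": 1.0})
p_mismatch = predict(weights, {"query:shoes": 1.0, "ad:loans": 1.0})
```

After the pass, `p_match` should be above 0.5 and `p_mismatch` below it. The model never stores the stream, so memory grows only with the number of distinct features seen.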
Label complexity reduction methods
In many situations only very small amounts of labeled data are available, because some labels cost money or because the prediction task is very specific. Despite having very few labels, these tasks might nevertheless be solvable because there is plentiful extra data from other tasks, implying that semisupervised, multitask, or other ancillary data incorporation methods can work. What are the limits of multitask learning? Of active learning? How do you reuse knowledge bases and ontologies in machine learning?
Nonstationary data
The standard assumption in machine learning is that data sources do not change, but that is often severely untrue at Yahoo!. How do you deal with changing data sources? Adversarial data sources? (How do you develop a classifier to combat spam that is robust against spammers affecting features or labels?)
Learning to rank
Ranking problems are pervasive at Yahoo!. How do you create a policy for ranking various types of objects such as webpages, ads, images, or news articles? Clicks by users provide valuable relevance feedback. How can such implicit feedback best be leveraged to improve the ranking function? In order to compare learning to rank algorithms, we've launched the Learning to Rank Challenge!
Exploration
Many learning problems at Yahoo! involve user interaction. You can't rewind a user and try a different action, so you only get feedback for the chosen action, violating the idea that you have a complete label. How do you explore and use exploration data in learning with partial feedback? How do you do this in a way that allows use of logged data?
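One well-known way to reuse logged data under partial feedback is inverse propensity scoring (IPS). The sketch below assumes, hypothetically, that each log record stores the context, the action that was shown, the observed reward, and the probability with which the logging policy chose that action. The record layout, policy names, and toy log are our own illustration. The estimate is unbiased only if the recorded probabilities are correct and non-zero for every action the evaluated policy might choose.

```python
def ips_value(logs, target_policy):
    """Estimate the average reward target_policy would have earned on the logged traffic.

    Each log record is (context, action, reward, propensity), where propensity is the
    probability with which the logging policy chose the logged action.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        # Only records where the target policy agrees with the logged action
        # contribute, reweighted by how unlikely that action was to be logged.
        if target_policy(context) == action:
            total += reward / propensity
    return total / len(logs)

# Hypothetical log collected by showing one of two stories uniformly at random.
logs = [
    ("sports_fan", "sports_story", 1.0, 0.5),
    ("sports_fan", "finance_story", 0.0, 0.5),
    ("investor", "sports_story", 0.0, 0.5),
    ("investor", "finance_story", 1.0, 0.5),
]

def personalized(context):
    return "sports_story" if context == "sports_fan" else "finance_story"

def always_sports(context):
    return "sports_story"

personalized_value = ips_value(logs, personalized)
always_sports_value = ips_value(logs, always_sports)
```

On this toy log the personalized policy is estimated at 1.0 reward per visit and the always-sports policy at 0.5, without showing either policy to a live user.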
Structured prediction models for information extraction
Extracting structured data from semi-structured and unstructured web pages is extremely valuable for a number of web applications. Typically, this is accomplished using a judicious combination of structured prediction techniques such as conditional random fields or structural support vector machines. However, practical deployment of these techniques requires extending them to work with various forms of highly limited and noisy supervision (e.g., noisy databases, dictionaries, domain-specific constraints, etc.) and also to make effective use of structural commonalities within websites/webpages. It is also essential to incorporate intelligent data collection/filtering mechanisms into the prediction framework itself to achieve the desired scalability. More details at http://www.labs.yahoo.com/ksc/Information_Extraction.
Learning from sparse data
At Yahoo! we face the challenge of learning from very large yet very sparse data sets. The data sparseness manifests itself along both the feature and target dimensions. We deal with potentially millions of user features, including searches, page (URL) views, and ad interactions (both views and clicks). At the level of an individual user, the feature space is extremely sparse. Similarly, on the target dimension, our problems tend to have very few positive examples. For instance, ad click rates of the order of 0.1% and advertiser-side conversion rates of the order of 0.001% are not uncommon. Building robust and accurate models from such sparse data is a big challenge.
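A quick back-of-the-envelope calculation shows how sparse the target dimension is at these rates. The sketch below uses the two rates quoted above; the sample size of one million impressions and the function names are hypothetical. Exact fractions are used so that the arithmetic is not blurred by floating-point rounding.

```python
from fractions import Fraction

def expected_positives(num_examples, positive_rate):
    """Expected number of positive examples in a sample of the given size."""
    return num_examples * positive_rate

def always_negative_accuracy(positive_rate):
    """Accuracy of a trivial model that predicts 'negative' for every example."""
    return 1 - positive_rate

click_rate = Fraction(1, 1000)         # 0.1%, as quoted in the text
conversion_rate = Fraction(1, 100000)  # 0.001%, as quoted in the text
impressions = 1_000_000                # hypothetical sample size

expected_clicks = expected_positives(impressions, click_rate)
expected_conversions = expected_positives(impressions, conversion_rate)
baseline_accuracy = always_negative_accuracy(conversion_rate)
```

A million impressions yield about 1000 clicks but only about 10 conversions, and a model that never predicts a conversion is already 99.999% accurate. Plain accuracy is therefore a poor yardstick for these problems.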
1-class learning
The need to learn models when the available labeled examples belong to a single class is prevalent at Yahoo!. For instance, advertisers provide lists of users (cookies) who have ‘converted’ (purchased a product or service) at their sites (positive examples), but no explicitly specified list of non-converters (negative examples). Similarly, user clicks or dwell times on stories indicate user interest, but the absence of clicks does not necessarily convey user disinterest. The ability to learn models from a single class of labeled examples is thus very important. Effective modeling is desired even in the absence of hints such as the relative proportion of positive and negative examples or a well-defined distance function to measure the proximity of unlabeled examples to known positive examples.
Learning from uncertain (or noisy) target labels
Application domains such as categorization of objects (including queries, ads, and documents) rely on editorially labeled training data. Editorially labeled data can be noisy (due to human error) or even uncertain (due to differences of opinion among editors). This situation calls for the design of suitable machine learning methods that are robust to the uncertainty (or noise) in the target labels.
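One simple way to expose label uncertainty to a learner, assuming several editors judge the same item, is to keep the distribution of their votes rather than a single hard label. The sketch below is only an illustration: the function names and the example judgments are hypothetical, and it is one of many possible treatments of noisy labels.

```python
from collections import Counter

def soft_label(judgments):
    """Turn a list of editor judgments into a distribution over categories."""
    counts = Counter(judgments)
    total = len(judgments)
    return {category: count / total for category, count in counts.items()}

def majority_label(judgments):
    """Return the most common judgment and the fraction of editors who chose it.

    Ties are broken in favor of the category that appears first in the list.
    """
    category, count = Counter(judgments).most_common(1)[0]
    return category, count / len(judgments)

# Hypothetical judgments from four editors for one document.
judgments = ["sports", "sports", "news", "sports"]

distribution = soft_label(judgments)
label, agreement = majority_label(judgments)
```

Here three of four editors agree, so the document gets the label "sports" with agreement 0.75. A learner could train on the full distribution, or use the agreement as an example weight so that contested items count for less.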
Data valuation
Several properties on the Yahoo! network and off-network Yahoo! partners offer rich data that can be used to augment the user data currently being leveraged in applications such as ad targeting and content personalization. How do we effectively value each new data source? An obvious answer would be to integrate the new data into the modeling pipeline and run controlled experiments to compare the effectiveness of the models with and without the new data. However, running a full-fledged test is extremely expensive. We need a comprehensive set of metrics and a systematic procedure for evaluating the value of new data without having to rely completely on controlled experiments.
