Security

As a popular destination on the Internet, Yahoo! is the target of abusive activity in its various forms, including spam, phishing, harassment, and overuse. Yahoo! is at the forefront of investigating, developing, and deploying measures to detect and stamp out abusive activities. Abuse prevention is a company priority at Yahoo!, and proper mitigation strategies and tools are essential to the long-term quality of our services. We are looking for advances that evolve existing defenses and offer new ways to thwart emerging patterns of abuse.

Challenges

Distinguishing/isolating patterns of abusive activity from normal activity

Activity data is available in the form of logs; user-identifiable data in those logs is subject to restricted access. Data analysis, cross-domain activity correlation, and machine-learning techniques are all relevant here.
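As one illustration of separating abusive from normal activity in log data, the sketch below flags clients whose peak per-minute request count sits far above the population median, using the median absolute deviation as a robust scale. The function name, the (client, minute) event format, and the threshold k are assumptions made for this example, not a description of any Yahoo! system; the client identifiers are assumed to be pseudonymous tokens, since user-identifiable log data is access-restricted.

```python
from collections import Counter
from statistics import median


def flag_heavy_clients(events, k=5.0):
    """Return the clients whose peak per-minute request count is an outlier.

    events: iterable of (client_token, minute_bucket) pairs, one per request.
    k: hypothetical threshold, in units of median absolute deviation (MAD).
    """
    per_bucket = Counter(events)  # (client, minute) -> number of requests

    # Peak request count seen in any single minute, per client.
    peak = {}
    for (client, _minute), n in per_bucket.items():
        peak[client] = max(peak.get(client, 0), n)
    if not peak:
        return set()

    counts = list(peak.values())
    med = median(counts)
    # Fall back to 1.0 when the MAD is zero (most clients behave identically).
    mad = median(abs(c - med) for c in counts) or 1.0

    return {client for client, n in peak.items() if (n - med) / mad > k}
```

A single-feature outlier test like this is only a starting point; correlating activity across domains or training a classifier would need richer features than request volume.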

Development of CAPTCHA and CAPTCHA-like techniques for use as overt challenges or covert tests to distinguish fully or partly automated bots from genuine human users of a Yahoo! service

The objective is to detect and exclude bots, whether fully automated or operated by dedicated human labor with partial automation, and thereby help recognize and mitigate downstream abuse. To defeat dedicated human labor engaged in promoting abusive access, the techniques should demand time and effort that a genuine user can tolerate but that make repetitive solving by paid labor tedious and unprofitable.

Scalable and integrated access control for users

Users share data with a variety of applications within and outside Yahoo!. Each of these applications has its own Terms of Service, forcing users to specify separate access control rules for each application. This frustrates users and leaves them feeling that they have relinquished all control over where their data ends up. The challenge here is to design an integrated access control language and mechanism that can be used across applications from different organizations. At the very least, this would allow users to identify which information they have disclosed, and to whom, across different applications. Another challenge is to design a scalable "access control broker" that mediates applications' access to user information in a way that satisfies user-defined permissions.
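To make the broker idea concrete, here is a minimal in-memory sketch. The class AccessBroker, its method names, and the (user, application, attribute) rule format are all hypothetical, chosen only for illustration; a real design would also need a policy language, authentication, and distribution across organizations, none of which are shown.

```python
from dataclasses import dataclass, field


@dataclass
class AccessBroker:
    """Hypothetical broker: releases user data only under explicit grants."""

    profiles: dict = field(default_factory=dict)  # user -> {attribute: value}
    grants: set = field(default_factory=set)      # (user, app, attribute) rules
    disclosed: set = field(default_factory=set)   # what was actually released

    def grant(self, user, app, attribute):
        """Record that `user` allows `app` to read `attribute`."""
        self.grants.add((user, app, attribute))

    def revoke(self, user, app, attribute):
        """Withdraw a previously granted permission, if any."""
        self.grants.discard((user, app, attribute))

    def read(self, app, user, attribute):
        """Return the attribute value if permitted, and log the disclosure."""
        if (user, app, attribute) not in self.grants:
            raise PermissionError(f"{app} may not read {attribute} of {user}")
        value = self.profiles[user][attribute]
        self.disclosed.add((user, app, attribute))
        return value

    def disclosures(self, user):
        """List (app, attribute) pairs already disclosed for `user`."""
        return sorted((app, attr) for u, app, attr in self.disclosed if u == user)
```

The disclosures method corresponds to the minimum requirement stated above: letting a user see which information has been released, and to which application.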

Template detection

At their essence, most spammers are just e-commerce merchants with aggressive marketing tactics, so every message they send contains a URL that ultimately leads to a shopping site. These shopping sites are heavily templated, but they carry subtle differences to hide from spam filters. Moreover, the URLs go through multiple layers of obfuscated redirection before arriving at the ultimate site, and the JavaScript used for this obfuscation may even be booby-trapped with infinite loops and the like. The challenge is to build a high-performance crawler/classifier that can identify whether a given URL would take users to a site built from a known template.
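One simple way to compare a fetched page against a known template is to ignore text and attributes and compare only the sequence of HTML tags. The sketch below builds k-tag shingles and scores pages by Jaccard similarity. This technique is an assumption offered for illustration, not a method named in this challenge; the function names and the 0.8 threshold are hypothetical, and the hard parts described above (following obfuscated redirects and surviving hostile JavaScript) are out of scope here.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_shingles(html, k=3):
    """Return the set of k-long runs of consecutive opening tags."""
    parser = _TagCollector()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def matches_template(html, template_html, threshold=0.8, k=3):
    """True if the page's tag structure is close to the template's."""
    score = jaccard(tag_shingles(html, k), tag_shingles(template_html, k))
    return score >= threshold
```

Because the comparison discards text and attribute values, two storefronts that differ only in product names, prices, or link targets produce identical shingle sets.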

Automated phishing detector

Because of the financial consequences to their victims, phishing scams are among the most pernicious of all cyber crimes. Yet these bogus login pages may share many elements in common, such as images and logos that do not “belong” to them or login forms that post to suspicious destinations. Could web pages that are likely to be spoofs be identified algorithmically? Could such a system operate fast enough to be usable in a large-scale production environment?
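The two signals mentioned above, borrowed images and forms that post elsewhere, can be extracted with a small parser. The following sketch is illustrative only: the function name and the returned field names are hypothetical, and comparing hostnames is a crude stand-in for deciding whether a resource "belongs" to a page, since legitimate sites often load images from other hosts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _PageScan(HTMLParser):
    """Records form targets, image sources, and password inputs."""

    def __init__(self):
        super().__init__()
        self.form_actions = []
        self.image_srcs = []
        self.has_password = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            # A missing action means the form posts back to the page itself.
            self.form_actions.append(a.get("action") or "")
        elif tag == "img" and a.get("src"):
            self.image_srcs.append(a["src"])
        elif tag == "input" and (a.get("type") or "").lower() == "password":
            self.has_password = True


def phishing_signals(page_url, html):
    """Return simple per-page signals; a classifier would consume these."""
    scan = _PageScan()
    scan.feed(html)
    scan.close()
    page_host = urlparse(page_url).hostname

    def foreign(ref):
        # Resolve relative references against the page before comparing hosts.
        return urlparse(urljoin(page_url, ref)).hostname != page_host

    return {
        "has_password_field": scan.has_password,
        "foreign_form_post": any(foreign(a) for a in scan.form_actions),
        "foreign_images": sum(foreign(s) for s in scan.image_srcs),
    }
```

These signals are inexpensive to compute per page, which matters for the production-scale question raised above, but on their own they would not be reliable enough to label a page as a spoof.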

Passive botnet identification

Could a system be constructed that reliably identifies whether a Web GET or POST request came from a human or a bot, and that remains resistant to attackers even once they realize something is there? What characteristics -- found in the data and/or the metadata of a request -- could be fed to a classifier to distinguish good from bad? What feedback and unsupervised training mechanisms would prove resistant to automated attempts by scammers to manipulate their classification?
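As a starting point for the question of which request characteristics to feed a classifier, the sketch below turns one client's headers and request timestamps into a small feature dictionary. Every feature here is a hypothetical candidate chosen for illustration; none is claimed to separate humans from bots reliably, and an attacker who learns of a feature can usually forge it.

```python
from statistics import pstdev


def request_features(headers, arrival_times):
    """Build candidate classifier features for one client.

    headers: dict of HTTP header name -> value from a single request.
    arrival_times: ascending timestamps (seconds) of that client's requests.
    """
    names = {name.lower() for name in headers}
    # Gaps between consecutive requests; very regular gaps may suggest a script.
    gaps = [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]
    return {
        "header_count": len(names),
        "missing_user_agent": "user-agent" not in names,
        "missing_accept_language": "accept-language" not in names,
        "has_cookie": "cookie" in names,
        # None when fewer than two requests were observed.
        "gap_stdev": pstdev(gaps) if gaps else None,
    }
```

Features like these would be the input to a trained model rather than rules in themselves; the open research questions above concern which features stay informative once attackers adapt.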