Project: Data Quality and Integrity


Data Quality

Data quality is critical at Yahoo!: everyone from business analysts to engineers and researchers relies heavily on our data. But robots and malicious users can introduce substantial noise into Yahoo! data and pollute the traffic of specific properties.

In collaboration with the Abuse, Data Quality and Mail teams, the Yahoo! Labs Audience Sciences team is actively working to improve Yahoo!'s data quality. Examples of our work include removing robot-generated events, identifying spammers and impression fraud, developing novel authentication techniques (such as improved CAPTCHA schemes) to reduce spurious user registrations, and devising methods to distinguish trusted from untrusted users. The underlying problems are rich and varied, and we draw upon a wide range of algorithms and ideas from mathematics and machine learning.