Web Information Management - Information extraction

By Brian Cooper and Philip Bohannon


The Web Information Management group aims to develop the infrastructure to build, maintain, and analyze the next generation of online communities and software services. This requires information management on an unprecedented scale, both to support data-backed applications and to support information discovery over web content that is intelligently interpreted, leveraging both algorithmic techniques and social interactions.

Challenges

Mashing Up Large Structured Databases with the Web to Satisfy User Content Needs

To exceed users' expectations for content in vertical domains, it is no longer sufficient to build web applications solely on feeds of structured data. Instead, that data must be deeply and accurately linked to a host of web resources.

Creating such linkages at high quality and in a scalable manner poses a number of key scientific challenges for Yahoo! and for the Web Information Management Dept., including but not limited to:

  • rapid generation of a) heuristic, rule-based and b) machine-learned extraction modules
  • extraction that uses large databases of examples in place of training data (e.g., as highly noisy training data)
  • matching within structured data, within web data, and between structured data and web data to find similar entities
  • fine-grained association of web data and structured records
  • joint data integration and extraction (e.g., integrating data from three tables and fifty web sites)
  • ranking combinations of structured and unstructured data
  • feedback, debugging and quality management in extraction and integration pipelines
  • priority-based extraction leveraging search logs and other noisy indicators of user intent
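
To make the matching challenge concrete, the following is a minimal sketch of linking structured records to web text by token overlap (Jaccard similarity). It is an illustration only, not the group's method: the function names, the sample records and snippets, and the 0.3 threshold are all assumptions chosen for the example.

```python
import re

def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_records(records, snippets, threshold=0.3):
    """Map each record id to the index of its best-scoring snippet.

    A record maps to None when no snippet scores at or above the
    (hypothetical) threshold.
    """
    matches = {}
    for rec_id, fields in records.items():
        rec_tokens = tokens(" ".join(fields.values()))
        best_idx, best_score = None, threshold
        for idx, snippet in enumerate(snippets):
            score = jaccard(rec_tokens, tokens(snippet))
            if score >= best_score:
                best_idx, best_score = idx, score
        matches[rec_id] = best_idx
    return matches

# Hypothetical structured records and web snippets.
records = {
    "r1": {"name": "Blue Bottle Cafe", "city": "Oakland"},
    "r2": {"name": "Golden Gate Bakery", "city": "San Francisco"},
}
snippets = [
    "Golden Gate Bakery in San Francisco is known for egg tarts",
    "Visit the Blue Bottle Cafe, Oakland",
    "Weather forecast for Seattle",
]
result = match_records(records, snippets)
print(result)
```

Pairwise comparison like this costs time proportional to the number of records times the number of snippets, which is one reason scalable matching appears in the list above as an open challenge rather than a solved step.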