Web Information Management - Data management
By Brian Cooper and Philip Bohannon
The Community Systems group aims to develop the infrastructure to build, maintain, and analyze the next generation of online communities and software services. This calls for information management on an unprecedented scale, both to support data-backed applications and to support information discovery over web content that is intelligently interpreted, leveraging both algorithmic techniques and social interactions.
Challenges
How do we build databases that can operate at Internet scale?
Internet applications generate huge amounts of data, and huge numbers of requests against that data. Traditional databases were designed to deal with comparatively small, well-structured datasets and complex queries against those datasets.
But at Internet scale, traditional techniques break down. Many of the strengths of traditional databases become weaknesses (for example, strong transaction models severely inhibit scalability), and their weaknesses become magnified (for example, weak support for uncertain data, whose handling is not a "nice-to-have" but a "must-have" at Internet scale). Building databases at Internet scale requires rethinking many of the settled solutions of traditional databases, developing new models to cope with data that is uncertain or not well structured, and trimming functionality wherever possible to improve scalability. It may even require completely rethinking data models, query processing models, and even the notion of what a database is supposed to do for you. Key questions in this area are:
- What is the right architecture for an Internet-scale database?
- What is the right data model? How does that data model incorporate uncertainty in the structure and meaning of data?
- What is the right query model? How does that query model cope with the uncertainty allowed by the data model?
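To make the last two questions concrete, here is a minimal sketch of one possible answer, not the group's design. It assumes each field may hold several candidate values, each with a confidence score, and that a query accepts a confidence threshold and returns possible matches ranked by confidence. The names `UncertainValue`, `Record`, `select`, and `min_confidence` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UncertainValue:
    value: str
    confidence: float  # assumption: probability in [0, 1] that the value is correct


@dataclass
class Record:
    key: str
    # field name -> list of candidate UncertainValue objects; fields may be missing
    fields: dict = field(default_factory=dict)


def select(records, field_name, wanted, min_confidence=0.5):
    """Return (key, confidence) pairs for records that may match, best first."""
    hits = []
    for rec in records:
        # A record with no such field simply contributes no candidates.
        for cand in rec.fields.get(field_name, []):
            if cand.value == wanted and cand.confidence >= min_confidence:
                hits.append((rec.key, cand.confidence))
    return sorted(hits, key=lambda h: h[1], reverse=True)


records = [
    Record("r1", {"city": [UncertainValue("Paris", 0.9),
                           UncertainValue("Paris, TX", 0.1)]}),
    Record("r2", {"city": [UncertainValue("Paris", 0.4)]}),
    Record("r3", {}),  # no "city" field at all: structure is not guaranteed
]

print(select(records, "city", "Paris"))        # default threshold of 0.5
print(select(records, "city", "Paris", 0.3))   # looser threshold admits r2
```

Here the query result is itself uncertain: instead of a yes/no answer per record, the caller gets a ranked list and decides how much uncertainty to tolerate.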
How can we build cloud data services?
Internet companies need to scale to huge datasets and request rates. They need their infrastructure to never go down. And they don't want to have an army of operators just to keep the system running.
"Cloud" computing is a buzzword intended to capture the notion of a computing utility run as a service. Beyond the buzzword, however, Internet companies are already building clouds to manage their web crawls, analyze click streams, store user data, detect fraud, and perform a variety of other tasks that require huge amounts of computing to tackle huge amounts of data. These same companies are running into a variety of tough challenges:
- What is the right set of abstractions that the cloud can offer? Should it offer hosted databases? Hosted processing? Hosted messaging buses? Interoperation between these pieces? And how much complexity must live in the cloud in order to provide these abstractions to customers?
- How can multiple customers live on the same cloud, without interfering with each other? Interference might include performance, stability, availability, privacy, and a host of other issues.
- How can applications which are written for customized data layers be modified to use the cloud instead? What do those applications lose? Should the application weaken its requirements or should the cloud strengthen its functionality?
- How do we know the cloud is working properly? How do we deal with inevitable machine and component failures without bringing down the cloud? If the cloud is not "broken" but merely "sick," how do we detect and fix the root cause of the problem?
How do we manage social data?
The increasing popularity of social networking sites like Facebook and MySpace has led to growing integration between content sites and social sites. In one direction, social sites are adding more and more content (e.g., photos, videos, and news articles) to provide more practical utility to their users. In the other direction, content sites like Amazon and Yahoo! Travel are incorporating social activities and connections to engage their users more deeply. With the establishment of the OpenSocial Foundation (opensocial.org), the integration of social sites and content sites will likely be one of the major trends in the next few years.
Helping users search and explore information on social content sites can be fundamentally different from the traditional tasks of web search and content recommendation. Some key challenges are:
- How to efficiently analyze the extremely large social graph for the purposes of information retrieval/recommendation and social trend detection
- How to incorporate social information into current retrieval models, which are based on relevance and authority ranking
- How to effectively present to the user the underlying complex social graph that is relevant to her needs
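As an illustration of the second challenge, the sketch below blends a conventional relevance score with a simple social signal. It is only one possible approach under stated assumptions: the social graph is a small in-memory adjacency list, closeness is taken to be 1 / (1 + hops) between the searcher and a document's author, and the two scores are mixed linearly. The function names, the `weight` parameter, and the sample data are all hypothetical.

```python
from collections import deque


def hop_distances(graph, start, max_hops=2):
    """Breadth-first search: hops from start to each user within max_hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if dist[user] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(user, ()):
            if friend not in dist:
                dist[friend] = dist[user] + 1
                queue.append(friend)
    return dist


def social_rank(results, graph, searcher, weight=0.5):
    """results: list of (doc_id, relevance, author). Returns doc ids, best first."""
    dist = hop_distances(graph, searcher)

    def score(item):
        _, relevance, author = item
        hops = dist.get(author)
        # Authors outside the searcher's neighborhood get no social boost.
        closeness = 0.0 if hops is None else 1.0 / (1 + hops)
        return (1 - weight) * relevance + weight * closeness

    return [doc for doc, _, _ in sorted(results, key=score, reverse=True)]


graph = {"ann": ["bob"], "bob": ["ann", "cat"], "cat": ["bob"], "dan": []}
results = [("d1", 0.8, "dan"), ("d2", 0.6, "bob"), ("d3", 0.7, "cat")]

print(social_rank(results, graph, "ann"))              # social signal included
print(social_rank(results, graph, "ann", weight=0.0))  # relevance only
```

Limiting the search to a few hops keeps the per-query cost bounded, which matters for the first challenge: a full traversal of an extremely large social graph at query time would not be practical.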
