Publication

Bulk-Synchronous On-Line Crawling on Clusters of Computers

Source:

16th Euromicro International Conference on Parallel, Distributed and Network-based Processing (EuroPDP 2008), IEEE-CS (2008)

Abstract:

This paper describes the design of a software module that periodically retrieves Web documents for a search engine able to accept on-line updates concurrently. On-line updates come in the form of insertions of new documents or updates to existing ones, all of them mixed with the usual user queries. The search engine is bulk-synchronous, which allows it to deal efficiently with the concurrency control problem. The crawler is also bulk-synchronous so that it can be integrated into the same $P$-processor cluster that executes the search engine. This paper describes such a crawler and evaluates its practical feasibility. Document URLs are distributed onto processors by Web site: each processor is in charge of retrieving the documents belonging to a subset of all the Web sites. We present an evaluation of the performance of the proposed scheme using a Web sample of 2.5 million documents.
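
The site-based distribution of URLs mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how sites are assigned to processors, so hashing the host name modulo $P$ is an assumption made here for illustration only, and the names `site_of`, `processor_for`, and `partition` are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Return the host name of a URL, used here as the Web-site key."""
    host = urlsplit(url).hostname  # hostname is lowercased by urlsplit
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    return host


def processor_for(url: str, num_processors: int) -> int:
    """Map a URL to a processor index in [0, num_processors).

    Assumption: sites are assigned by hashing the host name. A stable hash
    (SHA-1) is used so that every node computes the same assignment.
    """
    digest = hashlib.sha1(site_of(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_processors


def partition(urls, num_processors):
    """Group URLs into one list per processor, keeping each site together."""
    buckets = [[] for _ in range(num_processors)]
    for url in urls:
        buckets[processor_for(url, num_processors)].append(url)
    return buckets
```

With this kind of assignment, all documents of a given site land on the same processor, so any processor can decide locally who is responsible for a newly discovered URL. A plain hash ignores site size, so a real deployment would likely need some form of load balancing.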