Distributed Confidence-Weighted Classification on MapReduce

Publication
Oct 7, 2013
Abstract

In recent years, explosive growth in data size, data complexity, and data rates, triggered by the emergence of high-throughput technologies such as remote sensing, crowd-sourcing, social networks, and computational advertising, has led to an increasing availability of data sets of unprecedented scale, with billions of high-dimensional data examples stored on hundreds of terabytes of memory. To make use of this large-scale data and extract useful knowledge, researchers in the machine learning and data mining communities face numerous challenges, since classification algorithms designed for standard desktop computers cannot address these problems due to memory and time constraints. As a result, there is an evident need for novel, more scalable algorithms that can handle large data sets. In this paper we propose such a method, called AROW-MR, a linear SVM solver for efficient training of recently proposed confidence-weighted (CW) classifiers. Linear CW models maintain a Gaussian distribution over parameter vectors, thus allowing a user to estimate parameter confidence in addition to the separating hyperplane between two classes. The proposed method employs the MapReduce framework to train CW classifiers in a distributed way, obtaining significant improvements in both training time and accuracy. This is achieved by training local CW classifiers on each mapper, and then optimally combining the local classifiers on the reducer to obtain an aggregated, more accurate CW linear model. We validate the proposed algorithm on synthetic data, and further show that the AROW-MR algorithm outperforms the baseline classifiers on the industrial, large-scale task of Ad Latency prediction, with nearly one billion examples.
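The mapper/reducer split described in the abstract can be illustrated with a minimal, single-process sketch. Several things here are assumptions and not taken from the paper: the covariance is restricted to a diagonal, the per-example step is the standard AROW update with a hypothetical regularization parameter `r`, and the reducer uses a simple precision-weighted average of the local Gaussians, which may differ from the combination rule that AROW-MR actually uses. The function names `arow_update`, `map_train`, `reduce_combine`, and `predict` are illustrative only.

```python
def arow_update(mu, sigma, x, y, r=1.0):
    """One AROW step, assuming a diagonal covariance.

    mu    -- mean weight vector (list of floats), updated in place
    sigma -- per-feature variances (list of floats), updated in place
    x, y  -- feature vector and label in {-1, +1}
    r     -- regularization parameter (hypothetical default)
    """
    margin = y * sum(m * xi for m, xi in zip(mu, x))
    if margin >= 1.0:
        return  # no hinge loss, so the model is left unchanged
    v = sum(s * xi * xi for s, xi in zip(sigma, x))  # confidence x^T Sigma x
    beta = 1.0 / (v + r)
    alpha = (1.0 - margin) * beta
    for i, xi in enumerate(x):
        mu[i] += alpha * y * sigma[i] * xi          # uses the old variance
        sigma[i] -= beta * sigma[i] ** 2 * xi * xi  # shrink the variance


def map_train(shard, dim, r=1.0):
    """Mapper side: train one local CW model on one shard of (x, y) pairs."""
    mu, sigma = [0.0] * dim, [1.0] * dim
    for x, y in shard:
        arow_update(mu, sigma, x, y, r)
    return mu, sigma


def reduce_combine(models):
    """Reducer side: merge local (mu, sigma) pairs.

    Assumption: precision-weighted averaging per feature, so more confident
    (lower-variance) local estimates get more weight. This is a stand-in,
    not necessarily the combination used by AROW-MR.
    """
    dim = len(models[0][0])
    precision = [sum(1.0 / s[i] for _, s in models) for i in range(dim)]
    mu = [sum(m[i] / s[i] for m, s in models) / precision[i] for i in range(dim)]
    sigma = [1.0 / p for p in precision]
    return mu, sigma


def predict(mu, x):
    """Classify with the mean weight vector."""
    return 1 if sum(m * xi for m, xi in zip(mu, x)) >= 0 else -1
```

In an actual MapReduce job, `map_train` would run once per input split and emit its `(mu, sigma)` pair under a single key, so that one reducer receives all local models and calls `reduce_combine`.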

  • IEEE International Conference on Big Data
  • Conference/Workshop Paper
