Book contents
- Frontmatter
- Contents
- Preface
- List of abbreviations
- Part I Computing platforms
- Part II Cloud platforms
- Part III Cloud technologies
- Part IV Cloud development
- Chapter 10 Data in the cloud
- Chapter 11 MapReduce and extensions
- Chapter 12 Dev 2.0 platforms
- Part V Software architecture
- Part VI Enterprise cloud computing
- References
- Index
Chapter 11 - MapReduce and extensions
Published online by Cambridge University Press: 06 December 2010
Summary
The MapReduce programming model was developed at Google in the process of implementing large-scale search and text processing tasks on massive collections of web data stored using BigTable and the GFS distributed file system. The MapReduce programming model is designed for processing and generating large volumes of data via massively parallel computations utilizing tens of thousands of processors at a time. The underlying infrastructure supporting this model must assume that processors and networks can fail, even during a particular computation, and must build in support for handling such failures while ensuring progress of the computations being performed. Hadoop is an open source implementation of the MapReduce model developed at Yahoo, and presumably also used internally. Hadoop is also available on pre-packaged Amazon Machine Images (AMIs) in the Amazon EC2 cloud platform; this availability has sparked interest in applying the MapReduce model for large-scale, fault-tolerant computations in other domains, including applications in the enterprise context.
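The data flow of the model summarized above can be sketched in a few lines. This is a minimal single-process illustration, not Hadoop's actual API: the function names (`map_fn`, `reduce_fn`, `map_reduce`) and the word-count task are assumptions chosen for clarity, and the shuffle that a real framework performs across the network is simulated here by grouping intermediate pairs by key in memory.

```python
from collections import defaultdict

def map_fn(document):
    # Classic word-count mapper: emit a (word, 1) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reducer receives all values emitted for one key and combines them.
    return (word, sum(counts))

def map_reduce(documents, map_fn, reduce_fn):
    groups = defaultdict(list)
    for doc in documents:                  # "map" phase (parallel in a real framework)
        for key, value in map_fn(doc):
            groups[key].append(value)      # "shuffle": group intermediate pairs by key
    # "reduce" phase: one reduce call per distinct key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

result = map_reduce(["the cloud", "the web the cloud"], map_fn, reduce_fn)
# result == {"the": 3, "cloud": 2, "web": 1}
```

In a real deployment the map and reduce calls run as independent tasks on many machines, which is what lets the runtime re-execute a failed task elsewhere without restarting the whole computation.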
PARALLEL COMPUTING
Parallel computing has a long history, with its origins in scientific computing in the late 1960s and early 1970s. Different models of parallel computing have been used, based on the nature and evolution of multiprocessor computer architectures. The shared-memory model assumes that any processor can access any memory location, albeit not equally fast. In the distributed-memory model, each processor can address only its own memory and communicates with other processors using message passing over the network.
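The contrast between the two models can be sketched as follows. This is an illustrative assumption, not code from the chapter: threads stand in for processors, a shared counter for shared memory, and a thread-safe queue stands in for the network carrying messages between private memories.

```python
import threading
import queue

# Shared-memory model: all workers update the same memory location,
# so access must be synchronized (here, with a lock).
counter = 0
lock = threading.Lock()

def shared_worker(increments):
    global counter
    for _ in range(increments):
        with lock:               # serialize access to the shared location
            counter += 1

threads = [threading.Thread(target=shared_worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 4000: four workers each added 1000 to shared memory

# Distributed-memory model: each worker computes in its own private
# state and communicates only by sending messages (the queue plays
# the role of the network).
inbox = queue.Queue()

def message_worker(worker_id, n):
    local = sum(range(n))        # computed entirely in private memory
    inbox.put((worker_id, local))  # result shipped back as a message

workers = [threading.Thread(target=message_worker, args=(i, 10)) for i in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()

total = sum(value for _, value in (inbox.get() for _ in range(4)))
# total == 180: each of the 4 workers contributes sum(range(10)) == 45
```

The design difference matters for fault tolerance: in the message-passing style, a worker's state is reconstructible by re-running it, which is the property MapReduce exploits when it re-executes failed tasks.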
- Type: Chapter
- Information: Enterprise Cloud Computing: Technology, Architecture, Applications, pp. 131-143
- Publisher: Cambridge University Press
- Print publication year: 2010