- Is Microsoft preparing us to move beyond Vista?
- Why Google wanted to lose wireless spectrum auction
- iPhone shortage fuels rumors of imminent 3G phone
- XP for cheap PCs: a second crack in the wall
- Darts into data: Leveraging random action to competitive advantage
- Most iPhone buyers are existing Apple customers
- AT&T's so-called open network principles
- Mono dev tool offered
- ActiveState upgrades IDE
- Serena plans SaaS products
January 09, 2008 | Comments: (0)
Google Fellows discuss parallel processing model
Google Fellows Jeff Dean and Sanjay Ghemawat have published a paper in this month's Communications of the ACM, a publication of the Association for Computing Machinery, detailing the programming model Google leverages to process more than 20 petabytes of data per day on commodity-based clusters.
The paper is an update of a prior article on the process and includes deeper insights into the effect the model has had on operations at Google in the time since first publication.
The methodology, known as MapReduce, allows users to break computations into a map and a reduce function, which the runtime system automatically parallelizes across large clusters, navigating machine failures and honing the efficiency of network and disk use in the process.
Inspired by similar primitives in Lisp, the methodology abstracts parallelization, fault tolerance, data distribution, and load balancing into a library. More than 10 thousand programs have been implemented at Google using MapReduce, which can also be used to parallelize computations for multicore processing on a single machine.
The model has been used for large-scale graph processing, text processing, data mining, machine learner, and statistical machine translation, among other algorithms, Dean and Ghemawat write.
The clusters on which MapReduce jobs run consist of thousands of commodity PCs connected by Gigabit Ethernet. The Linux-based dual-processor x86 machines have between 4GB and 8GB of memory per machine, with two 160GB IDE disks directly attached. Google's homegrown GFS (Google File System) manages the data stored to the disks.
Computations are submitted to a scheduler, which maps tasks to available machines. The MapReduce library splits input files into pieces typically between 16MB and 64MB and implements a master/worker model to perform tasks across the cluster.
MapReduce use has scaled significantly in its first four years of use at Google, with map input data topping 403 petabytes in September 2007. More than 11,000 machines were used that month to process 2.2 million jobs, with an average of 394 machines taking 395 seconds on average to complete each job.
As Dean and Ghemawat note in the paper, the most significant use of MapReduce has been a rewrite of the indexing system used for Google search. The MapReduce system has reduced computations from approximately 3,800 lines of C++ to 700 lines.
Posted by Jason Snyder on January 9, 2008 05:59 PM
RATE THIS ARTICLE:
-

- COMMENTS
TOP STORIES
ADDITIONAL RESOURCES

- Do you have the power to resolve technical issues with one call?
- Take control of your content- leverage Microsoft SharePoint
- Keeping the E-Mail Flowing

- How Does Your IT Help Desk Measure Up?
- Best Practices for the Service Desk
- Discover How to Provide Anytime, Anywhere IT Support





