Back to the Miscellaneous project page.
February 5, 2015
Word Frequency in the King James Bible
Somehow in undergrad I never got around to learning about the MapReduce algorithm. To fix this, I learned about it from the first lecture of MIT’s 6.824: Distributed Systems Engineering, and will attempt to explain MapReduce here, then implement it in Python on a real live example.
In short, MapReduce is an algorithm for parallelizing computation, based on a two-step process of “mapping” and “reducing.” If you are interested in reading the full Google Research publication about MapReduce, it is online here, and is a fairly accessible paper.
In the map
step, some function is “mapped” into a list of elements, and the output is a list of (key, value)
pairs, where each (key, value)
pair is the key
(or the elements in the list, before the function was applied to it) and the value
, which is f(key)
, or the value returned after the function is mapped to that element.
At the end of the map
phase, the output can be thought effectively of two lists: one list of all the original key
s, and one list of the value
s mapped to new values according to some function.
Then in the reduce
phase, this list of value
s is aggregated (or reduced) into one element that is the solution to whatever algorithm is desired.
The reason MapReduce is so powerful and parallelizable is that the map
and reduce
functions can be done in parallel to multiple lists (in fact, the mapping of function to elements of a list is inherently parallelizable, since the same function is applied to all elements of a list independently).
An example
It helps to think of an example. Let’s say we want to take the King James Bible and count the occurrence of each word (this allows us to find the most common word, etc.). This can be done in a parallel way with MapReduce, as follows:
-
The
map
phase consists of taking each line of the text file as a list of space-separated words and mapping it to a list of word-value pairs(word, 1)
. For example, the sentence"I cried unto thee"
becomes
("I", 1), ("cried", 1), ("unto", 1), ("thee", 1)
Here, the map
function is taking a list of words and outputting 1 every time it sees a word. This means that each time a word is seen, the 1 is like a counter.
- The
reduce
phase just takes the list (or lists) of(word, 1)
values given above and sums the numbers for each word. This means that in the end, each word will be a(word, int)
value, whereint
is the number of times that word was seen in themap
phase.
In the end, correctly implementing MapReduce on the King James Bible will yield the following results as the top 5 words that appear:
('the', 63937)
('and', 51699)
('of', 34624)
('to', 13569)
('that', 12913)
Actually implementing a parallelized version of MapReduce is challenging, since the communication between the threads that do the computation and the threads that do the dispatching gets rather hairy. For 6.824, the assignment is all implemented in Go and is in general much faster than Python, but it would be in poor taste to post the solutions to the homework on line. Instead, one can write a single-threaded version of MapReduce for the same task in Python, as shown here.
For efficiency’s sake, instead of using tuples to represent values, my Python implementation uses dictionaries. Additionally, when the map
step happens, there is part of the reduce
step that happens at the same time. Namely, when words are constructed, they are automatically added to a dictionary that keeps track of their frequency in that line, but only because the dictionary syntax used there is cleaner than using nested if
statements to check whether a key exists in the dictionary already.
All Posts about Miscellaneous
2021 October 18 -- Favorite Podcasts 2019-2021
2020 February 9 -- Favorite Podcasts 2018-2019
2019 April 18 -- Lasercutter Memes
2018 October 26 -- Protecting Yourself Against rm
2018 September 3 -- Why I Left Amazon
2018 August 25 -- Blogging with Jupyter for Learning Julia Part 2
2018 August 17 -- Favorite Podcasts 2017-2018
2017 March 18 -- Blogging with Jupyter for Learning Julia
2017 February 15 -- Favorite Podcasts 2016-2017
2015 September 8 -- Installing Zim Wiki on OSX
2015 February 5 -- Word Frequency in the King James Bible
2014 February 13 -- A Fictional Middle East
2013 June 6 -- MIT Graduation Cap
2011 April 28 -- Playing Zork on Linux