Back to the Miscellaneous project page.
February 5, 2015
Word Frequency in the King James Bible
Somehow in undergrad I never got around to learning about the MapReduce algorithm. To fix this, I learned about it from the first lecture of MIT’s 6.824: Distributed Systems Engineering, and will attempt to explain MapReduce here, then implement it in Python on a real live example.
In short, MapReduce is an algorithm for parallelizing computation, based on a two-step process of “mapping” and “reducing.” If you are interested in reading the full Google Research publication about MapReduce, it is online here, and is a fairly accessible paper.
map step, some function is “mapped” into a list of elements, and the output is a list of
(key, value) pairs, where each
(key, value) pair is the
key (or the elements in the list, before the function was applied to it) and the
value, which is
f(key), or the value returned after the function is mapped to that element.
At the end of the
map phase, the output can be thought effectively of two lists: one list of all the original
keys, and one list of the
values mapped to new values according to some function.
Then in the
reduce phase, this list of
values is aggregated (or reduced) into one element that is the solution to whatever algorithm is desired.
The reason MapReduce is so powerful and parallelizable is that the
reduce functions can be done in parallel to multiple lists (in fact, the mapping of function to elements of a list is inherently parallelizable, since the same function is applied to all elements of a list independently).
It helps to think of an example. Let’s say we want to take the King James Bible and count the occurrence of each word (this allows us to find the most common word, etc.). This can be done in a parallel way with MapReduce, as follows:
mapphase consists of taking each line of the text file as a list of space-separated words and mapping it to a list of word-value pairs
(word, 1). For example, the sentence
"I cried unto thee"
("I", 1), ("cried", 1), ("unto", 1), ("thee", 1)
map function is taking a list of words and outputting 1 every time it sees a word. This means that each time a word is seen, the 1 is like a counter.
reducephase just takes the list (or lists) of
(word, 1)values given above and sums the numbers for each word. This means that in the end, each word will be a
(word, int)value, where
intis the number of times that word was seen in the
In the end, correctly implementing MapReduce on the King James Bible will yield the following results as the top 5 words that appear:
('the', 63937) ('and', 51699) ('of', 34624) ('to', 13569) ('that', 12913)
Actually implementing a parallelized version of MapReduce is challenging, since the communication between the threads that do the computation and the threads that do the dispatching gets rather hairy. For 6.824, the assignment is all implemented in Go and is in general much faster than Python, but it would be in poor taste to post the solutions to the homework on line. Instead, one can write a single-threaded version of MapReduce for the same task in Python, as shown here.
For efficiency’s sake, instead of using tuples to represent values, my Python implementation uses dictionaries. Additionally, when the
map step happens, there is part of the
reduce step that happens at the same time. Namely, when words are constructed, they are automatically added to a dictionary that keeps track of their frequency in that line, but only because the dictionary syntax used there is cleaner than using nested
if statements to check whether a key exists in the dictionary already.
All Posts about Miscellaneous
2018 September 3 -- Why I Left Amazon
2018 August 25 -- Blogging with Jupyter for Learning Julia Part 2
2018 August 17 -- Favorite Podcasts 2017-2018
2017 March 18 -- Blogging with Jupyter for Learning Julia
2017 February 15 -- Favorite Podcasts 2016-2017
2015 September 8 -- Installing Zim Wiki on OSX
2015 February 5 -- Word Frequency in the King James Bible
2014 February 13 -- A Fictional Middle East
2013 June 6 -- MIT Graduation Cap
2011 April 28 -- Playing Zork on Linux